In this two-part technical tutorial we'll explore the deployment of a web proxy cache, sometimes referred to as a Web cache or a proxy server, for a small to medium sized corporate enterprise. A web proxy cache is surprisingly easy to implement and maintain, and when built using open-source software it can be quite economical as well.
Bandwidth on a corporate Internet connection is a valuable and often critical business resource, and in most cases is also fairly expensive. Unfortunately, even for small and medium sized companies that precious lifeline can become consumed by Web traffic from the company's own internal systems, leaving the connection slow and unresponsive during peak work hours. An analysis of the Web surfing habits of a company's user population will often show a number of "hot" Web sites, such as competitors, stock tracking sites, and items of personal interest to employees. Visits by multiple individuals to hot sites lead to inefficiencies, because each client browser must use the relatively slow corporate Internet connection to fetch the same data.
Popular browsers help to reduce inefficiencies by locally caching Web objects. This reduces demand and improves performance for each individual user, but browser caches aren't shared across an enterprise. The implementation of a web proxy cache can save additional bandwidth. Just what is a Web cache? Simply put, it's an intermediary (or proxy) computer system between Web browsers and Internet Web servers. Instead of sending requests for Web pages directly to origin servers on the Internet, browsers contact a web proxy cache server on the local high-speed network, which in turn contacts the origin server on behalf of the browser. The proxy fetches the object from the Internet and forwards it back to the browser, but also keeps a copy for itself. Subsequent requests for the same object from any browser in the enterprise won't require a visit to the origin server; they can be fulfilled locally from the Web cache. This has the effect of speeding response time for everyone and reducing bandwidth demands on the Internet connection. It is not unusual for a cache system to reduce demand by 20-30%.
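The request flow of a shared cache can be illustrated with a toy model. This is only a sketch in shell and says nothing about how Squid is implemented: the fetch_from_origin stub, the proxy_get function, and the temporary cache directory are all hypothetical names invented for the illustration.

```shell
#!/bin/sh
# Toy model of a shared proxy cache (illustration only, not Squid's design).

cache_dir=$(mktemp -d)     # stands in for the proxy's cache disk
last_status=""             # HIT or MISS for the most recent request

# Hypothetical stand-in for a slow fetch over the Internet connection.
fetch_from_origin() {
    printf 'content of %s\n' "$1"
}

# Serve a URL: from the cache if a copy exists, otherwise fetch and keep a copy.
proxy_get() {
    url=$1
    # Derive a file name from the URL (cksum is good enough for a toy).
    key=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
    if [ -f "$cache_dir/$key" ]; then
        last_status=HIT
    else
        fetch_from_origin "$url" > "$cache_dir/$key"
        last_status=MISS
    fi
    cat "$cache_dir/$key"
}

# The first browser's request has to go to the origin server...
proxy_get http://www.example.com/ > /dev/null
echo "first request:  $last_status"
# ...while a second browser asking for the same object is served locally.
proxy_get http://www.example.com/ > /dev/null
echo "second request: $last_status"
```

The first request reports MISS and the second HIT. Only the miss uses the Internet connection, which is where the bandwidth savings come from.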
A variety of web proxy cache products are available. Some are free while others are very expensive, particularly for large corporate environments. When per-user licensing fees are evaluated for some of the products, the costs can become prohibitive. Fortunately, at least one mature, reliable, and popular Open Source alternative exists: the Squid web proxy cache. Squid is funded by the US National Science Foundation and is developed through the unpaid contributions of many volunteers. Squid is free, licensed under the GNU General Public License. Squid runs on nearly all flavors of Unix, including Linux and FreeBSD.
A web proxy cache requires a generous amount of memory and a fast disk I/O subsystem. Memory is needed to maintain lists of cached objects, and disks must be capable of keeping up with a steady flood of random reads and writes. Typically processor speed is not a limiting factor, and a modest processor can make a satisfactory proxy server given the appropriate I/O and memory configuration.
In this tutorial, we'll be configuring Squid for a pair of Intel systems running Linux and intended to serve up to 2000 client browsers. Since Internet demand and usage patterns are site-specific, your site may need more or less hardware as your needs dictate. For the purposes of this example, the following specifications are adequate:
In our example configuration we'll begin with a working Red Hat Linux 6.0
system (including the gcc C compiler) on ultra-wide SCSI disk
/dev/sda. This disk will also hold Squid and its log files.
Two more disks,
/dev/sdb and /dev/sdc, will contain
the cached Web objects. To start, the cache disks are assumed to contain unused
ext2 (Linux native) partitions
/dev/sdb1 and /dev/sdc1. By placing the cache on multiple disks, we increase
cache performance. This distributes I/O and takes advantage of Squid's ability
to manage multiple cache disks simultaneously. (If you are configuring Squid for
a small installation, you may choose to cache to your system disk instead.) For
even better performance, we could place the disks on separate SCSI channels.
Note that IDE disk interfaces are not recommended for heavily loaded proxy
servers because of the inherent random nature of the cache I/O.
We'll be installing Squid into its default location,
/usr/local/squid. It is recommended to make the partition that holds it
large enough to handle Squid's log files, which can grow very big on a production
server. We will also run Squid under a special user created for the purpose,
appropriately called "squid" with a special group also called "squid."
A web proxy cache will write a large number of small files in its cache directories. Therefore, you should create the filesystems for the two cache disks with a relatively large number of inodes. If the inode configuration is new to you, don't worry about it at this point - it's easy to reconfigure the cache disks later if necessary.
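A rough sizing sketch may help here. It rests on stated assumptions rather than measurements: the 3500 MB per-disk cache size comes from Listing 1, the 13 KB average object size is inferred from the estimate Squid prints in Listing 2 (1024000 KB for 78769 objects), and the 4096 bytes-per-inode figure is just an example value for the -i option of mke2fs. The script only prints the filesystem command and does not run it.

```shell
#!/bin/sh
# Rough inode estimate for one cache disk (a sketch under stated assumptions).

cache_mb=3500        # cache size per disk, as in Listing 1
avg_object_kb=13     # ASSUMPTION: average object size, inferred from Listing 2

# Each cached object is one file, so it needs at least one inode.
objects=$(( cache_mb * 1024 / avg_object_kb ))
echo "estimated objects per disk: $objects"

# ASSUMPTION: one inode per 4 KB of cache space, which leaves headroom
# because the average object is assumed to be about three times that size.
bytes_per_inode=4096
inodes=$(( cache_mb * 1024 / (bytes_per_inode / 1024) ))
echo "inodes for $cache_mb MB at $bytes_per_inode bytes per inode: $inodes"

# Print (do not run) the corresponding filesystem creation command.
# Running it would destroy any data on the partition.
echo "mke2fs -i $bytes_per_inode /dev/sdb1"
```

If your own traffic consists of smaller objects than assumed here, lower the bytes-per-inode value accordingly.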
While you may find a current precompiled binary package for your system, we'll compile Squid from source code for this tutorial. Squid compiles easily and offers complete control over where it is installed.
First, create directories for Squid:
# mkdir -p /usr/local/squid/src
Create the squid user (under Red Hat Linux, this also creates the squid group):
# useradd -d /usr/local/squid squid
Next, set ownership and the SGID permission on the top level and source directories. This ensures that all new files have the squid group owner, allowing multiple sysadmins to manage Squid without using root privilege:
# chown -R squid:squid /usr/local/squid
# chmod g+s /usr/local/squid /usr/local/squid/src
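If the SGID bit on directories is unfamiliar, you can watch it work on a scratch directory without root privilege. This is a small sketch using a temporary directory of its own, so it doesn't touch /usr/local/squid. The file names inside it are arbitrary.

```shell
#!/bin/sh
# Demonstrate the SGID bit on a scratch directory (no root needed).

demo_dir=$(mktemp -d)
chmod g+s "$demo_dir"

# "test -g" (written here as [ -g ]) is true when the set-group-ID bit is set.
if [ -g "$demo_dir" ]; then
    echo "SGID set on $demo_dir"
fi

# Files and subdirectories created inside take the directory's group.
# On Linux, new subdirectories also inherit the SGID bit itself.
mkdir "$demo_dir/src"
touch "$demo_dir/notes.txt"
ls -ld "$demo_dir" "$demo_dir/src" "$demo_dir/notes.txt"
```

In the ls output, an "s" in the group execute position of a directory's mode marks the SGID bit.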
Use your browser or FTP client to transfer the Squid source distribution from the Squid web proxy cache download page. As of February 1, 2000, the latest version of Squid is known as "2.3.STABLE1," the version we'll use in this example (you should be able to implement any recent stable release without difficulty).
The squid source is stored in a compressed tar file, which should be placed
in the new
src directory. Unpack the compressed tar file:
# cd /usr/local/squid/src
# tar zxvf squid-2.3.STABLE1-src.tar.gz
This will leave you with the entire source directory tree under
squid-2.3.STABLE1. There are helpful documents in the
doc directory, including a quick-start guide and installation
instructions. It's worth poking around at this point to familiarize yourself
with the version of Squid you're using. Next, build the software:
# cd /usr/local/squid/src/squid-2.3.STABLE1
# ./configure
The automatic configuration process will profile your system to determine exactly what capabilities exist. You shouldn't have difficulty with this process, but if issues do arise the error messages from configure should help you find quick resolutions.
Next, we compile Squid using the supplied Makefile:
# make
The compilation should take between a few minutes and an hour depending on your system's performance. When the compilation has completed without reporting errors, install it:
# make install
The last command will create a directory hierarchy under
/usr/local/squid, including
bin (executables like
squid itself and its utilities),
etc (configuration), and
logs (Squid log files). Note that there are no cache directories
set up at this point. To create them, we'll need to mount the two disks we set
aside for the task:
# mkdir /usr/local/squid/cache0 /usr/local/squid/cache1
# mount -t ext2 /dev/sdb1 /usr/local/squid/cache0
# mount -t ext2 /dev/sdc1 /usr/local/squid/cache1
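Before going further it's worth confirming that both cache directories really are mount points, since otherwise the cache would quietly fill the system disk. The following is a minimal sketch that assumes a Linux system where /proc/mounts lists the active mounts. The is_mounted helper is a hypothetical name, not a standard utility.

```shell
#!/bin/sh
# Check that a directory is a mount point by looking it up in a mount table.

# is_mounted DIR [TABLE]
# TABLE defaults to /proc/mounts (Linux); its second field is the mount point.
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

for dir in /usr/local/squid/cache0 /usr/local/squid/cache1; do
    if is_mounted "$dir"; then
        echo "$dir is mounted"
    else
        echo "$dir is NOT mounted"
    fi
done
```

The optional second argument lets you point the check at a different table file, which is handy for trying it out on a machine without the cache disks.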
We now need to create a configuration file for Squid, stored in
/usr/local/squid/etc/squid.conf. Listing 1 contains a basic file
that you can use to get started. Later, you'll want to customize your configuration.
After creating squid.conf, you're ready to build your cache directories. The cached objects are stored in a large hierarchy. Its framework must be created before launching Squid for the first time. To initiate the cache build use the -z option to squid:
# /usr/local/squid/bin/squid -z
This will exercise your disks for a while as the hierarchy is created. When it completes, you're ready to start Squid for the first time:
# /usr/local/squid/bin/squid -Ns &
To verify that Squid is running, take a look at
/usr/local/squid/logs/cache.log. You should see something like
Listing 2, ending in "Ready to serve requests." Squid should now be ready to
accept requests from browsers.
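The same check can be scripted. This is a sketch, not a Squid utility: the squid_ready function is a hypothetical helper, the log path is the cache_log setting from Listing 1, and looking at only the last 50 lines is an arbitrary choice meant to skip "Ready" messages left over from earlier runs.

```shell
#!/bin/sh
# squid_ready LOGFILE
# Succeeds if the tail of the log contains Squid's "Ready to serve requests." line.
squid_ready() {
    # ASSUMPTION: 50 lines is enough to cover one startup sequence.
    tail -n 50 "$1" 2>/dev/null | grep -q 'Ready to serve requests'
}

log=/usr/local/squid/logs/cache.log   # cache_log path from Listing 1
if squid_ready "$log"; then
    echo "Squid is ready"
else
    echo "Squid is not ready yet (or $log is missing)"
fi
```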
Before moving on to the browser side of things, let's stop to consider some basic security issues involved with using a cache (my thanks to Michael Alan Dorman for raising this important issue). Your intended purpose for deploying a cache will imply an intended user base. In the case of a small to medium sized enterprise, for whom this tutorial is intended, the users are usually the employees of the company, who access the Internet from their internal private LAN. A web cache becomes part of the larger security infrastructure, including firewalls, mail servers, and other technologies. In many such cases the web cache can be deployed behind the firewall because it is intended for access only by users on the LAN. In this configuration, security for the cache server isn't a significant concern because only trusted users have access to it.
However, your situation may dictate that you deploy your Squid system outside your firewall so that it is publicly available on the Internet. In this scenario, security rises to the top of the priority list. As Mike Dorman points out, an unsecured web proxy can be unexpectedly abused by unauthorized outsiders.
To prevent such abuse, you can create an access control methodology to selectively offer caching services only to users you trust. Squid offers this capability through administrator-defined Access Control Lists (ACLs), which can be used to create finely detailed access control schemes. Limitations can be placed on client addresses, destination domains, time of day, port numbers, access methods, browsers, and even users. While a complete treatment of Squid ACLs is far beyond the scope of this tutorial, a simple client-address ACL scheme has been included in the Squid configuration shown in Listing 1. The first part of the ACL setup involves the definition of access groups:
acl all src 0.0.0.0/0.0.0.0
acl mynet src 192.168.1.0/255.255.255.0
The first line defines the group all that includes all possible IP addresses. The second defines a small subgroup of addresses called mynet on the private network 192.168.1.0 (this is just an example - your address configuration will be different). It is only users from mynet that we wish to allow access to the cache, which leads us to the second part of the ACL setup:
http_access allow mynet
http_access deny all
Squid checks http_access rules in order and applies the first one that matches, so we first grant access to mynet and then explicitly deny http access to every other address as defined in group all. The effect is that systems coming from addresses outside of mynet will not be able to access Squid while those inside have full access.
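Because Squid evaluates http_access rules from top to bottom and stops at the first match, the order of the two rules matters. The toy model below mimics that first-match behavior for the two groups used in this example. It is a sketch for illustration, not Squid's code: in_group and decide are hypothetical helpers, and the fallback when no rule matches is a simplification.

```shell
#!/bin/sh
# Toy model of first-match rule evaluation (not Squid's actual code).

# in_group GROUP ADDRESS: membership test for the two groups in the text.
in_group() {
    case "$1:$2" in
        all:*)             return 0 ;;   # every address is in "all"
        mynet:192.168.1.*) return 0 ;;   # the example private network
        *)                 return 1 ;;
    esac
}

# decide ADDRESS RULE...: each RULE is "allow:GROUP" or "deny:GROUP".
# The first rule whose group contains the address decides the outcome.
decide() {
    addr=$1
    shift
    for rule in "$@"; do
        action=${rule%%:*}
        group=${rule#*:}
        if in_group "$group" "$addr"; then
            echo "$action"
            return 0
        fi
    done
    echo "deny"   # simplification: no rule matched
}

echo "allow mynet, then deny all:"
echo "  192.168.1.20 -> $(decide 192.168.1.20 allow:mynet deny:all)"
echo "  10.9.8.7     -> $(decide 10.9.8.7 allow:mynet deny:all)"

echo "deny all, then allow mynet:"
echo "  192.168.1.20 -> $(decide 192.168.1.20 deny:all allow:mynet)"
```

With the deny rule listed first, even an address inside mynet is refused, because every address matches the group all before the allow rule is reached.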
While effective, this ACL configuration only scratches the surface of Squid's capability. A thorough review of ACL usage is essential prior to deployment of a publicly available cache.
To test Squid, we'll manually configure a browser to use Squid instead of origin servers. In Netscape Communicator, this is done using the Edit -> Preferences -> Advanced -> Proxies dialog. Select "Manual Proxy Configuration" and click on "View". For each protocol, enter the IP address of your Squid machine and port number 3128, the default port on which Squid listens for inbound requests. Save your changes and try browsing a site you're familiar with. If everything is working correctly, you should be able to browse as before. The difference is that Squid is now acting as an intermediary, keeping copies of the pages you view in its cache. To see Squid's activity, watch its access log:
# tail -f /usr/local/squid/logs/access.log
You should see a line in that file for each request from browsers. An example
is given in Listing 3, showing the time (since the Unix epoch), requesting IP
address, URL, etc. Each line also will indicate a status of the request with
respect to the cache, such as
TCP_MEM_HIT, among others. Those status messages including the word
HIT indicate that the request was served from the cache.
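Counting those HIT entries gives a quick feel for how well the cache is working. The sketch below assumes Squid's native access log format, in which the fourth field holds the cache status and HTTP code (for example TCP_MISS/200). The hit_ratio function is a hypothetical helper, and the sample log lines are made up for the illustration rather than taken from a real server.

```shell
#!/bin/sh
# hit_ratio LOGFILE
# Prints the number of requests, the number whose status contains "HIT",
# and the percentage of hits. Assumes the native log format, where the
# fourth field looks like TCP_MISS/200.
hit_ratio() {
    awk '
        { total++ }
        $4 ~ /HIT/ { hits++ }
        END {
            if (total > 0)
                printf "%d requests, %d hits (%.1f%%)\n", total, hits, 100 * hits / total
            else
                print "no requests logged"
        }
    ' "$1"
}

# Made-up sample data for illustration; not taken from a real log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
949401130.123    250 192.168.1.20 TCP_MISS/200 4512 GET http://www.example.com/ - DIRECT/www.example.com text/html
949401135.456     12 192.168.1.21 TCP_HIT/200 4512 GET http://www.example.com/ - NONE/- text/html
949401140.789      3 192.168.1.22 TCP_MEM_HIT/200 4512 GET http://www.example.com/ - NONE/- text/html
949401150.012    310 192.168.1.20 TCP_MISS/200 1830 GET http://www.example.org/ - DIRECT/www.example.org text/html
EOF

hit_ratio "$sample"
```

To try it on a live server, pass /usr/local/squid/logs/access.log instead of the sample file. Note that this counts requests, not bytes, so it only approximates the bandwidth saved.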
If everything has gone well up to this point, you should have a functional Squid configuration that serves requests from multiple browsers.
In the second part of this article, we'll complete our enterprise installation of Squid, including:
========= Listing 1 =========
# squid.conf
#
# a basic configuration file for the Squid Proxy Web Cache

# set logging to the lowest level
debug_options ALL,1

# define group "all" that encompasses all possible IP addresses
# and group "mynet" that represents my class-C network:
acl all src 0.0.0.0/0.0.0.0
acl mynet src 192.168.1.0/255.255.255.0

# define an access control for group "mynet" to allow http access,
# and another for group "all" to deny http access.
#
# Rules are checked in order and the first match wins, so the
# effect of using both is to prohibit access to the cache by
# any address that doesn't satisfy the criteria established
# in group "mynet".
http_access allow mynet
http_access deny all

# set Squid's user and group
cache_effective_user squid squid

# set log directories
cache_access_log /usr/local/squid/logs/access.log
cache_log /usr/local/squid/logs/cache.log

# set cache directories of 3.5GB each
cache_dir ufs /usr/local/squid/cache0 3500 16 256
cache_dir ufs /usr/local/squid/cache1 3500 16 256

# set the cache memory target for the Squid process
cache_mem 80 MB

# the mailbox of the sysadmin
cache_mgr root@localhost
========= Listing 2 =========
2000/02/01 03:12:10| Starting Squid Cache version 2.3.STABLE1 for i686-pc-linux-gnu...
2000/02/01 03:12:10| Process ID 1188
2000/02/01 03:12:10| With 1024 file descriptors available
2000/02/01 03:12:10| Performing DNS Tests...
2000/02/01 03:12:10| Successful DNS name lookup tests...
2000/02/01 03:12:10| DNS Socket created on FD 5
2000/02/01 03:12:10| idnsParseResolvConf: nameserver 22.214.171.124
2000/02/01 03:12:10| idnsAddNameserver: Added nameserver #0: 126.96.36.199
2000/02/01 03:12:10| idnsParseResolvConf: nameserver 188.8.131.52
2000/02/01 03:12:10| idnsAddNameserver: Added nameserver #1: 184.108.40.206
2000/02/01 03:12:10| Unlinkd pipe opened on FD 10
2000/02/01 03:12:10| Swap maxSize 1024000 KB, estimated 78769 objects
2000/02/01 03:12:10| Target number of buckets: 1575
2000/02/01 03:12:10| Using 8192 Store buckets
2000/02/01 03:12:10| Max Mem size: 40960 KB
2000/02/01 03:12:10| Max Swap size: 1024000 KB
2000/02/01 03:12:10| Rebuilding storage in /usr/local/squid/cache0 (CLEAN)
2000/02/01 03:12:10| Rebuilding storage in /usr/local/squid/cache1 (CLEAN)
2000/02/01 03:12:10| Set Current Directory to /usr/local/squid/cache0
2000/02/01 03:12:10| Loaded Icons.
2000/02/01 03:12:10| Accepting HTTP connections at 0.0.0.0, port 3128, FD 14.
2000/02/01 03:12:10| Accepting ICP messages at 0.0.0.0, port 3130, FD 15.
2000/02/01 03:12:10| WCCP Disabled.
2000/02/01 03:12:10| Ready to serve requests.
Copyright © 2009 O'Reilly Media, Inc.