If you've ever accessed the Internet from an office environment, chances are your communications passed through a proxy. In the next few articles, I'll discuss the advantages of using a proxy and demonstrate the configuration of several proxies available from FreeBSD's ports collection.
You may not already know what a proxy does. Take a moment and go to http://www.freebsd.org/ports/, and in the "Search for:" box type in the word "proxy". You may be surprised at the number of proxies available, perhaps even a bit dismayed by all of the terms found in the descriptions: reverse proxy, arp proxy, transparent proxy, etc. Bear with me while I go through the most common proxy terms. Then we can start making sense of the terminology by looking at concrete examples.
In its simplest form, a proxy is a piece of software that "acts on behalf of" a network client. Keep in mind that in a network, a client is an entity that makes a network request and a server is an entity that responds to the request. For example, your web browser is a client which requests web content from a web server.
Depending upon the proxy, there are several ways it can "act on behalf of" the client. The first is to take the place of the client, meaning the client never communicates directly with the server. Instead, the client makes a connection to the proxy and the proxy makes the connection to the server, receives any responses from the server, and relays them back to the client. This is often done with web browsers and looks like this:
web browser ----------------> proxy ---------------> web server
            <---------------        <--------------
The next time you go to a web site, look at the bottom of your GUI web browser. If it says "Waiting for www.google.ca", your web browser is connecting directly to the specified web server. However, if it says something like "Connecting to 192.168.1.1", your request is going through a proxy located at that address.
Using a proxy offers several advantages to a network. First, the only computer in the network that requires a public IP address is the one hosting the proxy software. This means that an entire network can have access to the Internet, even if you're only able to get one IP address from your Internet Service Provider. Besides saving on cost, this also adds a bit of a security benefit as it hides your network from the Internet. The only IP address an Internet host is aware of is the IP address of the proxy.
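To make the "hiding" effect concrete, here is a toy model in Python. It uses no real sockets; it only tracks which address the server gets to see. The function names and both addresses are invented for this illustration.

```python
# Toy in-memory model: no real networking, just who sees which address.
PROXY_ADDR = "203.0.113.10"   # hypothetical public address of the proxy


def web_server(source_addr, path):
    """Pretend web server: records the address the request came from."""
    return {"seen_from": source_addr, "body": f"content of {path}"}


def proxy(client_addr, path):
    """Act on behalf of the client: make the request, relay the reply."""
    # The proxy opens its own connection, so the server sees PROXY_ADDR.
    # client_addr only tells the proxy where to send the reply.
    return web_server(PROXY_ADDR, path)


reply = proxy("192.168.1.20", "/index.html")
print(reply["seen_from"])   # the proxy's address, never the client's
```

The private client address never reaches the server, which is why a single public address is enough for the whole network.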
There are further security advantages to using a proxy. Since all Internet requests pass through the proxy, most proxies allow you to configure which requests are allowed and which are banned. In fact, the amount and ease of configurability is usually what you are looking for when you are evaluating which particular proxy application is most suited for your network.
A proxy will also typically have a cache of previous requests which can save bandwidth. This is similar to your web browser's cache, except that an entire network can take advantage of the cached content. If one user has already requested a URL, the proxy will copy the content to its cache. When the next request for that URL arrives at the proxy, it will return the cached content rather than going back out on the Internet to retrieve the requested web page. Keep in mind that secure content won't be cached. For example, if you give your credit card information at a site whose URL starts with "https://", that information is not cached by the proxy.
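The caching behaviour just described can be sketched in a few lines of Python. This is only an illustration: the class name, the `fake_fetch` helper, and the example URLs are made up, and a real proxy does a great deal more.

```python
class CachingProxy:
    """Toy caching proxy: one shared cache for every client on the network."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable that retrieves a URL from the Internet
        self.cache = {}         # URL -> previously retrieved content

    def get(self, url):
        # Secure content is passed through and never stored.
        if url.startswith("https://"):
            return self.fetch(url)
        if url not in self.cache:
            self.cache[url] = self.fetch(url)   # first request: go out and copy
        return self.cache[url]                  # later requests: served locally


fetched = []    # records every trip out to the "Internet"


def fake_fetch(url):
    """Stand-in for a real retrieval; hypothetical helper for this sketch."""
    fetched.append(url)
    return f"<html>{url}</html>"


cache_proxy = CachingProxy(fake_fetch)
cache_proxy.get("http://www.freebsd.org/")    # first user: retrieved from the origin
cache_proxy.get("http://www.freebsd.org/")    # second user: answered from the cache
cache_proxy.get("https://shop.example/pay")   # secure: fetched, not cached
cache_proxy.get("https://shop.example/pay")   # secure: fetched again
print(len(fetched))                           # 3 trips for 4 requests
```

The second request for the plain HTTP page never leaves the network, while both requests for the "https://" page do.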
A proxy that does caching will use an algorithm to determine how often to "refresh" the contents of its cache. A cache is great for saving Internet bandwidth, but users don't want to receive a page which has been stored in cache for over a month, especially if the original page has changed since then. Also, some pages are more dynamic than others. For example, Slashdot changes its contents often during the day whereas IANA rarely changes. The algorithm contains criteria to help the proxy determine when to refresh its cache and which pages to refresh first.
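As a rough sketch of the refresh decision, the snippet below gives each cached page a maximum age and treats anything older as stale. The class name and the age values are assumptions chosen for the example; real proxies use richer criteria than a single number.

```python
import time


class CacheEntry:
    """A cached page plus the information needed to decide when to refresh it."""

    def __init__(self, content, max_age, now=None):
        self.content = content
        self.max_age = max_age      # seconds this page may be served from cache
        self.stored_at = time.time() if now is None else now

    def is_fresh(self, now=None):
        """True while the entry is younger than its maximum age."""
        now = time.time() if now is None else now
        return (now - self.stored_at) < self.max_age


# Hypothetical policy: a busy news site is refreshed far more often
# than a page that rarely changes.
news = CacheEntry("headlines", max_age=5 * 60, now=0)
reference = CacheEntry("registry", max_age=24 * 60 * 60, now=0)

one_hour_later = 60 * 60
print(news.is_fresh(now=one_hour_later))        # False: time to refresh
print(reference.is_fresh(now=one_hour_later))   # True: still served from cache
```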
The most commonly used algorithms are ICP (Internet Cache Protocol), CARP (Cache Array Routing Protocol), and HTCP (HyperText Caching Protocol). You can read about all three protocols at the ICP site. These protocols have an additional advantage in that they allow multiple proxies to share their cache information. This allows a larger network to have a distributed cache and Internet requests can be load balanced.
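One way to picture a distributed cache is a rule that always sends the same URL to the same proxy, so each page is cached in exactly one place. The sketch below scores every proxy against the URL and picks the highest score. It illustrates the general idea only: the hash used here is not the one any of these protocols specifies, and the proxy names are invented.

```python
import hashlib


def pick_proxy(url, proxies):
    """Choose the proxy with the highest hash score for this URL."""
    def score(proxy):
        digest = hashlib.sha256(f"{proxy}|{url}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(proxies, key=score)


proxies = ["cache1", "cache2", "cache3"]    # hypothetical proxy names
url = "http://www.freebsd.org/"
print(pick_proxy(url, proxies) == pick_proxy(url, proxies))   # True: same URL, same proxy
```

A handy property of this style of selection is that removing a proxy only moves the URLs that proxy was responsible for; everything else stays where it was.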
There is a disadvantage to using a proxy: the client must be preconfigured to use it. This process is known as "client modification". In the example of a web browser, it requires the user to go into the "Preferences" or "Options" portion of their web browser, locate the "Proxy" section, and input the IP address of the proxy and the port number the proxy application is listening on. Other applications may require that special proxy client software be installed and configured on each machine that needs access to the proxy.
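Programs need the same two pieces of information a browser does: the proxy's address and its port. As a small illustration, here is how a Python script could be pointed at a proxy using the standard library. The port number 3128 is an assumption; substitute whatever port your proxy actually listens on.

```python
import urllib.request

PROXY_HOST = "192.168.1.1"   # the example proxy address used earlier
PROXY_PORT = 3128            # assumed port; use your proxy's real one


def build_opener_for(host, port):
    """Return an opener that sends HTTP requests through the given proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler), proxy_url


opener, proxy_url = build_opener_for(PROXY_HOST, PROXY_PORT)
print(proxy_url)
# opener.open("http://www.freebsd.org/") would now connect to the proxy
# rather than directly to the web server.
```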
This brings us to the other way a proxy may "act on behalf of" a client: as a "transparent" proxy. Transparent means that nothing is preconfigured on the client; in fact, the user may have no idea that their request is going through a proxy. A transparent proxy will intercept the client request, ensure that it is allowed, and then forward it on to the server. This type of proxy is often integrated into a firewall which allows you to configure the proxy as part of the network's security policy.
Now is a good time to mention that most proxies are considered to be "application-specific". Take a closer look at your search results from the FreeBSD ports list. Notice that there are RealAudio proxies, IRC proxies, HTTP proxies, FTP proxies, SMTP proxies, and so on. For every Internet application, there is a separate software proxy. This is an important point; it suggests the true power and configurability of proxy software.
Imagine for a moment a typical network protected by a firewall. Users behind the firewall wish to surf the Internet and send and receive email. The firewall has been configured to allow outbound ports 25 (SMTP), 80 (HTTP), and 110 (POP3) and to allow the responses to those packets back into the network. As a packet arrives at the firewall, its headers will be compared to the firewall's rules to ensure the port number and source/destination IP addresses are allowed.
That sounds pretty secure, doesn't it? As long as the data contained in the packet is what it claims to be, it is secure. One of the limitations of a firewall's rulebase is that it is restricted to the information contained in the packet's headers. (See TCP Protocol Layers Explained.) In order for a firewall to inspect the data portion of a packet, it must understand that data. This is known as "content inspection" and requires additional software that understands the content being inspected. As you may have guessed, that additional software will be an application proxy.
Let's use an HTTP packet as an example. One of the network's users sends out a packet destined for port 80. It reaches the firewall which allows the packet, since the rulebase allows port 80 outbound. However, that packet didn't contain HTTP data. Instead, the user had configured his p2p application to use port 80, knowing that that port was open on the firewall. File sharing was probably the last thing the network administrator wanted to allow in her security policy, yet the firewall rulebase was unable to stop the unwanted packet.
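The gap between the two kinds of checks is easy to see in code. In this sketch the firewall rule looks only at the destination port, while the content check looks at the first bytes of the data. The function names and the pretend p2p payload are invented for the example, and a real HTTP proxy inspects far more than the first word.

```python
ALLOWED_OUTBOUND_PORTS = {25, 80, 110}     # the rulebase from the example network

HTTP_METHODS = (b"GET ", b"HEAD ", b"POST ", b"PUT ",
                b"DELETE ", b"OPTIONS ", b"TRACE ", b"CONNECT ")


def firewall_allows(dst_port):
    """Header check: the decision is based on the port number alone."""
    return dst_port in ALLOWED_OUTBOUND_PORTS


def looks_like_http(payload):
    """Content check: does the data begin with an HTTP method?"""
    return payload.startswith(HTTP_METHODS)


web_request = b"GET /index.html HTTP/1.1\r\nHost: www.freebsd.org\r\n\r\n"
p2p_traffic = b"P2P-HELLO peer=42"         # made-up payload sent to port 80

print(firewall_allows(80))                 # True for both packets
print(looks_like_http(web_request))        # True
print(looks_like_http(p2p_traffic))        # False: only the proxy notices
```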
Finally, there is the authentication issue. A firewall can only make authentication decisions based on IP addresses, but IP addresses can be spoofed and more than one user can sit at the same computer. You can't write a firewall rule that says "Gwendolyn can surf from 10.0.0.1 but Martin can't". However, a proxy can be configured to force a user to authenticate before they are allowed Internet access as well as to keep a list of allowed users and their permitted locations.
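Here is a rough sketch of the kind of rule a proxy can enforce and a packet filter can't. The user table, passwords, and function names are all invented, and the unsalted hash is a shortcut for the example; a real proxy would hand authentication to a proper password store.

```python
import hashlib
import hmac


def _digest(password):
    """Toy password hash for the example only; not suitable for real use."""
    return hashlib.sha256(password.encode()).hexdigest()


# Hypothetical user list: who may surf, and from which addresses.
USERS = {
    "gwendolyn": {"digest": _digest("correct horse"), "locations": {"10.0.0.1"}},
    "martin":    {"digest": _digest("battery staple"), "locations": set()},
}


def may_surf(user, password, client_ip):
    """Allow access only to an authenticated user at a permitted location."""
    entry = USERS.get(user)
    if entry is None:
        return False
    if not hmac.compare_digest(entry["digest"], _digest(password)):
        return False
    return client_ip in entry["locations"]


print(may_surf("gwendolyn", "correct horse", "10.0.0.1"))   # True
print(may_surf("martin", "battery staple", "10.0.0.1"))     # False: same computer, different user
```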
A proxy sounds great, but why do you need a separate proxy for every application? For the simple reason that every application uses different commands. You may remember using SMTP and POP3 commands. If you're connected to an SMTP server and try to use the LIST command, you'll receive an error. That's because LIST is a POP3 command, not an SMTP command. Similarly, a packet containing SMTP data will contain an SMTP command. An SMTP proxy can look for valid SMTP commands in the data portion of the packet. If it doesn't find any, the packet probably doesn't contain SMTP data.
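A minimal sketch of that check might look like the following. The command set is the basic SMTP vocabulary; the function names are invented, and a real SMTP proxy would also validate arguments and the order in which commands arrive.

```python
SMTP_COMMANDS = {"HELO", "EHLO", "MAIL", "RCPT", "DATA", "RSET",
                 "VRFY", "EXPN", "HELP", "NOOP", "QUIT"}


def first_word(line):
    """Return the command verb at the start of a line, upper-cased."""
    parts = line.strip().split(None, 1)
    return parts[0].upper() if parts else ""


def is_smtp_command(line):
    """True if the line starts with a recognised SMTP command."""
    return first_word(line) in SMTP_COMMANDS


print(is_smtp_command("MAIL FROM:<user@example.org>"))   # True
print(is_smtp_command("LIST"))                           # False: a POP3 command
```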
If you're ever tasked with configuring an application proxy, it's handy to know where to find the commands used by each application, along with an explanation of what each command does. The best source of information is the RFC for that particular protocol. Here I've included the references for the most commonly used applications; as you skim through each RFC, look for the "Commands" section. You'll note that HTTP commands are instead referred to as "Methods".
We'll be revisiting application command sets as we build and configure some of the application proxies available in the ports collection.
There are a few other proxy terms I'd like to discuss before wrapping up this article. The first is a "reverse proxy". Think back to the definition of a proxy: a software application that acts on behalf of a network client. A reverse proxy is the reverse of that: it acts on behalf of a network server. The most common usage of a reverse proxy is to protect a web server. When a user on the Internet requests data from a web server protected by a reverse proxy, the reverse proxy intercepts the request and ensures that the data contained in the request is acceptable: for example, that it doesn't contain any non-HTTP data or any malicious HTTP commands. If the data is acceptable, the reverse proxy will retrieve the requested content from the web server and forward it on to the original user. In this way, users on the Internet never directly access your web server.
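In code, the gatekeeping role of a reverse proxy could be sketched like this. The list of allowed methods is a hypothetical policy, `backend` is a stand-in for the protected web server, and real reverse proxies check much more than the request line.

```python
ALLOWED_METHODS = {"GET", "HEAD", "POST"}   # hypothetical site policy


def backend(path):
    """Stand-in for the protected web server."""
    return f"200 OK {path}"


def reverse_proxy(request_line, server=backend):
    """Check a request line before letting it anywhere near the server."""
    parts = request_line.split()
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return "400 Bad Request"            # not HTTP at all
    method, path, _version = parts
    if method not in ALLOWED_METHODS:
        return "405 Method Not Allowed"     # HTTP, but not permitted here
    return server(path)                     # acceptable: fetch and relay


print(reverse_proxy("GET /index.html HTTP/1.1"))      # 200 OK /index.html
print(reverse_proxy("DELETE /index.html HTTP/1.1"))   # 405 Method Not Allowed
print(reverse_proxy("P2P-HELLO"))                     # 400 Bad Request
```

Only the first request ever reaches `backend`; the other two are answered by the proxy itself.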
Another type of proxy is an "ARP proxy". ARP (Address Resolution Protocol) is used whenever a TCP/IP host needs to send a packet. Before the host's interface can create a frame which will be sent to the network, it needs to know the hardware address of the host that will receive the frame. Since the packet itself only contains the IP address, ARP is used to determine which hardware address is associated with that IP address.
There are times, though, when ARP won't be able to find the hardware address. Let's take a look at a simple example:
web server ----- firewall/NAT device ------ Internet router ----- web browser
|-----DMZ-------|                    |------------Internet------------------|
|--- 10.0.0.0 --|                    |-- 126.96.36.199 --|
Here a web server is located in a DMZ which is protected by a firewall. The web server has been assigned the private address of 10.0.0.1. The NAT device has also statically associated that private address with the real address 188.8.131.52. The DNS server has a record pointing to 184.108.40.206 so the world can find the real IP associated with the web server. The firewall/NAT device also has a public IP of 220.127.116.11.
What happens when a web browser wants to access the content on that web server? The web browser will query DNS to resolve the web server's address to 18.104.22.168. It will then send the web request out onto the Internet where routers will route the packet to the 22.214.171.124 network. The Internet router attached to network 126.96.36.199 will send out an ARP request looking for the hardware address of 188.8.131.52.
But there really isn't a physical interface associated with 184.108.40.206. Instead that address is just a logical association that tells the firewall that any packets destined for that address are really to be sent to the webserver located at 10.0.0.1. Because there isn't a physical interface, there is no physical address and no host will respond to the router's ARP request. Without a response, the router will be unable to transmit the packet onto the network.
Enter an ARP proxy. This is where a host (in this case, the firewall) answers an ARP request with its own hardware address. The assumption is that once it receives the frame, it knows what to do with it. Your FreeBSD system has a built-in ARP proxy (the arp command). Let's pretend for a moment that the hardware address of the firewall is AA:BB:CC:11:22:33. To configure that firewall to receive frames for both its own IP address and for the web server's IP address, use this command as the superuser:
# arp -s 220.127.116.11 AA:BB:CC:11:22:33 pub
To verify it worked:
% arp -a
(18.104.22.168) at aa:bb:cc:11:22:33 on ed0 [ethernet]
(22.214.171.124.) at aa:bb:cc:11:22:33 on ed0 permanent published [ethernet]
The pub or "published" switch is what invokes the ARP proxy.
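If it helps to see the idea outside of the arp command, here is a toy simulation of a network segment. A host answers an ARP request for its own address and for any address it has "published". The class, the function, and the 192.0.2.x addresses are stand-ins invented for this sketch; only the hardware address comes from the example above.

```python
class Host:
    """A machine on the segment that may answer ARP requests."""

    def __init__(self, ip, mac, published=()):
        self.ip = ip
        self.mac = mac
        self.published = set(published)   # extra addresses it answers for

    def answer_arp(self, target_ip):
        """Return this host's hardware address if it answers for target_ip."""
        if target_ip == self.ip or target_ip in self.published:
            return self.mac
        return None


def arp_request(segment, target_ip):
    """Broadcast a request; return the first hardware address offered."""
    for host in segment:
        mac = host.answer_arp(target_ip)
        if mac is not None:
            return mac
    return None                           # nobody answered


FIREWALL_IP = "192.0.2.1"     # stand-in for the firewall's public address
WEB_NAT_IP = "192.0.2.2"      # stand-in for the web server's NAT address

plain = [Host(FIREWALL_IP, "aa:bb:cc:11:22:33")]
print(arp_request(plain, WEB_NAT_IP))       # None: no interface owns the address

proxied = [Host(FIREWALL_IP, "aa:bb:cc:11:22:33", published=[WEB_NAT_IP])]
print(arp_request(proxied, WEB_NAT_IP))     # aa:bb:cc:11:22:33
```

Once the firewall publishes the second address, the router gets an answer and can build its frame; what happens to the packet after that is up to the firewall's NAT rules.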
This article covered the most common proxy terms. In the next article, we'll see some of these terms in action as we install and configure one of the proxies contained in the FreeBSD ports collection.
Dru Lavigne is a network and systems administrator, IT instructor, author and international speaker. She has over a decade of experience administering and teaching Netware, Microsoft, Cisco, Checkpoint, SCO, Solaris, Linux, and BSD systems. A prolific author, she pens the popular FreeBSD Basics column for O'Reilly and is author of BSD Hacks and The Best of FreeBSD Basics.
Copyright © 2009 O'Reilly Media, Inc.