ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)
 See this if you're having trouble printing code examples


O'Reilly Book Excerpts: Java Network Programming, 3rd Edition

URLs and URIs, Proxies and Passwords

by Elliotte Rusty Harold

The URLEncoder and URLDecoder Classes

One of the challenges faced by the designers of the Web was dealing with the differences between operating systems. These differences can cause problems with URLs: for example, some operating systems allow spaces in filenames; some don't. Most operating systems won't complain about a # sign in a filename; but in a URL, a # sign indicates that the filename has ended, and a fragment identifier follows. Other special characters, nonalphanumeric characters, and so on, all of which may have a special meaning inside a URL or on another operating system, present similar problems. To solve these problems, characters used in URLs must come from a fixed subset of ASCII, specifically:

Related Reading

Java Network Programming
By Elliotte Rusty Harold

The characters : / & ? @ # ; $ + = and % may also be used, but only for their specified purposes. If these characters occur as part of a filename, they and all other characters should be encoded.

The encoding is very simple. Any characters that are not ASCII numerals, letters, or the punctuation marks specified earlier are converted into bytes and each byte is written as a percent sign followed by two hexadecimal digits. Spaces are a special case because they're so common. Besides being encoded as %20, they can be encoded as a plus sign (+). The plus sign itself is encoded as %2B. The / # = & and ? characters should be encoded when they are used as part of a name, and not as a separator between parts of the URL.

WARNING This scheme doesn't work well in heterogeneous environments with multiple character sets. For example, on a U.S. Windows system, é is encoded as %E9. On a U.S. Mac, it's encoded as %8E. The existence of variations is a distinct shortcoming of the current URI specification that should be addressed in the future through Internationalized Resource Identifiers (IRIs).

The URL class does not perform encoding or decoding automatically. You can construct URL objects that use illegal ASCII and non-ASCII characters and/or percent escapes. Such characters and escapes are not automatically encoded or decoded when output by methods such as getPath() and toExternalForm( ). You are responsible for making sure all such characters are properly encoded in the strings used to construct a URL object.

Luckily, Java provides a URLEncoder class to encode strings in this format. Java 1.2 adds a URLDecoder class that can decode strings in this format. Neither of these classes will be instantiated.

public class URLDecoder extends Object
public class URLEncoder extends Object

URLEncoder

In Java 1.3 and earlier, the java.net.URLEncoder class contains a single static method called encode( ) that encodes a String according to these rules:

public static String encode(String s)

This method always uses the default encoding of the platform on which it runs, so it will produce different results on different systems. As a result, Java 1.4 deprecates this method and replaces it with a method that requires you to specify the encoding:

public static String encode(String s, String encoding) 
  throws UnsupportedEncodingException

Both variants change any nonalphanumeric characters into % sequences (except the space, underscore, hyphen, period, and asterisk characters). Both also encode all non-ASCII characters. The space is converted into a plus sign. These methods are a little over-aggressive; they also convert tildes, single quotes, exclamation points, and parentheses to percent escapes, even though they don't absolutely have to. However, this change isn't forbidden by the URL specification, so web browsers deal reasonably with these excessively encoded URLs.

Both variants return a new String, suitably encoded. The Java 1.3 encode( ) method uses the platform's default encoding to calculate percent escapes. This encoding is typically ISO-8859-1 on U.S. Unix systems, Cp1252 on U.S. Windows systems, MacRoman on U.S. Macs, and so on in other locales. Because both encoding and decoding are platform- and locale-specific, this method is annoyingly non-interoperable, which is precisely why it has been deprecated in Java 1.4 in favor of the variant that requires you to specify an encoding. However, if you just pick the platform default encoding, your program will be as platform- and locale-locked as the Java 1.3 version. Instead, you should always pick UTF-8, never anything else. UTF-8 is compatible with the new IRI specification, the URI class, modern web browsers, and more other software than any other encoding you could choose.

Example 7-8 is a program that uses URLEncoder.encode( ) to print various encoded strings. Java 1.4 or later is required to compile and run it.

Example 7-8. x-www-form-urlencoded strings
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;


public class EncoderTest {

  public static void main(String[] args) {

    try {
      System.out.println(URLEncoder.encode("This string has spaces",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This*string*has*asterisks",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This%string%has%percent%signs", 
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This+string+has+pluses",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This/string/has/slashes",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks", 
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This:string:has:colons",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This~string~has~tildes",  
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This(string)has(parentheses)",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This.string.has.periods", 
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This=string=has=equals=signs", 
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This&string&has&ampersands", 
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("Thiséstringéhasé
                                              non-ASCII characters", "UTF-8"));
    }
    catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }

  }

}

Here is the output. Note that the code needs to be saved in something other than ASCII, and the encoding chosen should be passed as an argument to the compiler to account for the non-ASCII characters in the source code.

% javac -encoding UTF8 EncoderTest
% java EncoderTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters

Notice in particular that this method encodes the forward slash, the ampersand, the equals sign, and the colon. It does not attempt to determine how these characters are being used in a URL. Consequently, you have to encode URLs piece by piece rather than encoding an entire URL in one method call. This is an important point, because the most common use of URLEncoder is in preparing query strings for communicating with server-side programs that use GET. For example, suppose you want to encode this query string used for an AltaVista search:

pg=q&kl=XX&stype=stext&q=+"Java+I/O"&search.x=38&search.y=3

This code fragment encodes it:

String query = URLEncoder.encode(
 "pg=q&kl=XX&stype=stext&q=+\"Java+I/O\"&search.x=38&search.y=3");
System.out.println(query);

Unfortunately, the output is:

pg%3Dq%26kl%3DXX%26stype%3Dstext%26q%3D%2B%22Java%2BI%2FO%22%26search
.x%3D38%26search.y%3D3

The problem is that URLEncoder.encode( ) encodes blindly. It can't distinguish between special characters used as part of the URL or query string, like & and = in the previous string, and characters that need to be encoded. Consequently, URLs need to be encoded a piece at a time like this:

String query = URLEncoder.encode("pg");
query += "=";
query += URLEncoder.encode("q");
query += "&";
query += URLEncoder.encode("kl");
query += "=";
query += URLEncoder.encode("XX");
query += "&";
query += URLEncoder.encode("stype");
query += "=";
query += URLEncoder.encode("stext");
query += "&";
query += URLEncoder.encode("q");
query += "=";
query += URLEncoder.encode("\"Java I/O\"");
query += "&";
query += URLEncoder.encode("search.x");
query += "=";
query += URLEncoder.encode("38");
query += "&";
query += URLEncoder.encode("search.y");
query += "=";
query += URLEncoder.encode("3");
System.out.println(query);

The output of this is what you actually want:

pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3

Example 7-9 is a QueryString class that uses the URLEncoder to encode successive name and value pairs in a Java object, which will be used for sending data to server-side programs. When you create a QueryString, you can supply the first name-value pair to the constructor as individual strings. To add further pairs, call the add( ) method, which also takes two strings as arguments and encodes them. The getQuery( ) method returns the accumulated list of encoded name-value pairs.

Example 7-9. -The QueryString class
package com.macfaq.net;

import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class QueryString {

  private StringBuffer query = new StringBuffer( );

  public QueryString(String name, String value) {
    encode(name, value);
  }
  
  public synchronized void add(String name, String value) { 
    query.append('&');
    encode(name, value);
  }
  
  private synchronized void encode(String name, String value) {
    try { 
      query.append(URLEncoder.encode(name, "UTF-8"));
      query.append('='); 
      query.append(URLEncoder.encode(value, "UTF-8"));
    }
    catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }
  
  public String getQuery( ) {
    return query.toString( );
  }
  
  public String toString( ) {
    return getQuery( );
  }

}

Using this class, we can now encode the previous example:

QueryString qs = new QueryString("pg", "q");
qs.add("kl", "XX");
qs.add("stype", "stext");
qs.add("q", "+\"Java I/O\"");
qs.add("search.x", "38");
qs.add("search.y", "3");
String url = "http://www.altavista.com/cgi-bin/query?" + qs;
System.out.println(url);

URLDecoder

The corresponding URLDecoder class has two static methods that decode strings encoded in x-www-form-url-encoded format. That is, they convert all plus signs to spaces and all percent escapes to their corresponding character:

public static String decode(String s) throws Exception
public static String decode(String s, String encoding)  // Java 1.4
  throws UnsupportedEncodingException

The first variant is used in Java 1.3 and 1.2. The second variant is used in Java 1.4 and later. If you have any doubt about which encoding to use, pick UTF-8. It's more likely to be correct than anything else.

An IllegalArgumentException may be thrown if the string contains a percent sign that isn't followed by two hexadecimal digits or decodes into an illegal sequence. Then again it may not be. This is implementation-dependent, and what happens when an illegal sequence is detected and does not throw an IllegalArgumentException is undefined. In Sun's JDK 1.4, no exception is thrown and extra bytes with no apparent meaning are added to the undecodable string. This is truly brain-damaged, and possibly a security hole.

Since this method does not touch non-escaped characters, you can pass an entire URL to it rather than splitting it into pieces first. For example:


String input = "http://www.altavista.com/cgi-bin/" + 
"query?pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3";
 try {
  String output = URLDecoder.decode(input, "UTF-8");
  System.out.println(output);
 }

The URI Class

A URI is an abstraction of a URL that includes not only Uniform Resource Locators but also Uniform Resource Names (URNs). Most URIs used in practice are URLs, but most specifications and standards such as XML are defined in terms of URIs. In Java 1.4 and later, URIs are represented by the java.net.URI class. This class differs from the java.net.URL class in three important ways:

In brief, a URL object is a representation of an application layer protocol for network retrieval, whereas a URI object is purely for string parsing and manipulation. The URI class has no network retrieval capabilities. The URL class has some string parsing methods, such as getFile( ) and getRef( ), but many of these are broken and don't always behave exactly as the relevant specifications say they should. Assuming you're using Java 1.4 or later and therefore have a choice, you should use the URL class when you want to download the content of a URL and the URI class when you want to use the URI for identification rather than retrieval, for instance, to represent an XML namespace URI. In some cases when you need to do both, you may convert from a URI to a URL with the toURL( ) method, and in Java 1.5 you can also convert from a URL to a URI using the toURI( ) method of the URL class.

Constructing a URI

URIs are built from strings. Unlike the URL class, the URI class does not depend on an underlying protocol handler. As long as the URI is syntactically correct, Java does not need to understand its protocol in order to create a representative URI object. Thus, unlike the URL class, the URI class can be used for new and experimental URI schemes.

public URI(String uri) throws URISyntaxException

This is the basic constructor that creates a new URI object from any convenient string. For example,

URI voice = new URI("tel:+1-800-9988-9938");
URI web   = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");
URI book  = new URI("urn:isbn:1-565-92870-9");

If the string argument does not follow URI syntax rules—for example, if the URI begins with a colon—this constructor throws a URISyntaxException. This is a checked exception, so you need to either catch it or declare that the method where the constructor is invoked can throw it. However, one syntactic rule is not checked. In contradiction to the URI specification, the characters used in the URI are not limited to ASCII. They can include other Unicode characters, such as ø and é. Syntactically, there are very few restrictions on URIs, especially once the need to encode non-ASCII characters is removed and relative URIs are allowed. Almost any string can be interpreted as a URI.

public URI(String scheme, String schemeSpecificPart, String fragment) throws URISyntaxException

This constructor is mostly used for nonhierarchical URIs. The scheme is the URI's protocol, such as http, urn, tel, and so forth. It must be composed exclusively of ASCII letters and digits and the three punctuation characters +, -, and .. It must begin with a letter. Passing null for this argument omits the scheme, thus creating a relative URI. For example:

URI absolute = new URI("http", "//www.ibiblio.org" , null);
URI relative = new URI(null, "/javafaq/index.shtml", "today");

The scheme-specific part depends on the syntax of the URI scheme; it's one thing for an http URL, another for a mailto URL, and something else again for a tel URI. Because the URI class encodes illegal characters with percent escapes, there's effectively no syntax error you can make in this part.

Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.

public URI(String scheme, String host, String path, String fragment) throws URISyntaxException

This constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together (separated by a /) form the scheme-specific part for this URI. For example:

URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");

produces the URI http://www.ibiblio.org/javafaq/index.html#today.

If the constructor cannot form a legal hierarchical URI from the supplied pieces—for instance, if there is a scheme so the URI has to be absolute but the path doesn't start with /—then it throws a URISyntaxException.

public URI(String scheme, String authority, String path, String query, String fragment) throws URISyntaxException

This constructor is basically the same as the previous one, with the addition of a query string component. For example:

URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html", 
                   "referrer=cnet&date=2004-08-23",  "today");

As usual, any unescapable syntax errors cause a URISyntaxException to be thrown and null can be passed to omit any of the arguments.

public URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment) throws URISyntaxException

This is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example:

URI styles = new URI("ftp", "anonymous:elharo@metalab.unc.edu", 
  "ftp.oreilly.com",  21, "/pub/stylesheet", null, null);

However, the resulting URI still has to follow all the usual rules for URIs and again, null can be passed for any argument to omit it from the result.

public static URI create(String uri)

This is not a constructor, but rather a static factory method. Unlike the constructors, it does not throw a URISyntaxException. If you're sure your URIs are legal and do not violate any of the rules, you can use this method. For example, this invocation creates a URI for anonymous FTP access using an email address as password:

URI styles = URI.create(
  "ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:
                                         21/pub/stylesheet");

If the URI does prove to be malformed, this method throws an IllegalArgumentException. This is a runtime exception, so you don't have to explicitly declare it or catch it.

The Parts of the URI

A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. The general format is:

scheme:scheme-specific-part:fragment

If the scheme is omitted, the URI reference is relative. If the fragment identifier is omitted, the URI reference is a pure URI. The URI class has getter methods that return these three parts of each URI object. The getRawFoo( ) methods return the encoded forms of the parts of the URI, while the equivalent getFoo() methods first decode any percent-escaped characters and then return the decoded part:

public String getScheme( )
public String getSchemeSpecificPart( )
public String getRawSchemeSpecificPart( )
public String getFragment( )
public String getRawFragment( )

TIP: There's no getRawScheme( ) method because the URI specification requires that all scheme names be composed exclusively of URI-legal ASCII characters and does not allow percent escapes in scheme names.

These methods all return null if the particular URI object does not have the relevant component: for example, a relative URI without a scheme or an http URI without a fragment identifier.

A URI that has a scheme is an absolute URI. A URI without a scheme is relative. The isAbsolute() method returns true if the URI is absolute, false if it's relative:

public boolean isAbsolute( )

The details of the scheme-specific part vary depending on the type of the scheme. For example, in a tel URL, the scheme-specific part has the syntax of a telephone number. However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a particular hierarchical format divided into an authority, a path, and a query string. The authority is further divided into user info, host, and port. The isOpaque() method returns false if the URI is hierarchical, true if it's not hierarchical—that is, if it's opaque:

public boolean isOpaque( )

If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:

public String getAuthority( )
public String getFragment( )
public String getHost( )
public String getPath( )
public String getPort( )
public String getQuery( )
public String getUserInfo( )

These methods all return the decoded parts; in other words, percent escapes, such as %3C, are changed into the characters they represent, such as <. If you want the raw, encoded parts of the URI, there are five parallel getRawFoo() methods:

public String getRawAuthority( )
public String getRawFragment( )
public String getRawPath( )
public String getRawQuery( )
public String getRawUserInfo( )

Remember the URI class differs from the URI specification in that non-ASCII characters such as é and ü are never percent-escaped in the first place, and thus will still be present in the strings returned by the getRawFoo() methods unless the strings originally used to construct the URI object were encoded.

TIP: There are no getRawPort( ) and getRawHost( ) methods because these components are always guaranteed to be made up of ASCII characters, at least for now. Internationalized domain names are coming, and may require this decision to be rethought in future versions of Java.

In the event that the specific URI does not contain this information—for instance, the URI http://www.example.com has no user info, path, port, or query string—the relevant methods return null. getPort( ) is the single exception. Since it's declared to return an int, it can't return null. Instead, it returns -1 to indicate an omitted port.

For various technical reasons that don't have a lot of practical impact, Java can't always initially detect syntax errors in the authority component. The immediate symptom of this failing is normally an inability to return the individual parts of the authority: port, host, and user info. In this event, you can call parseServerAuthority() to force the authority to be reparsed:

public URI parseServerAuthority( )  throws URISyntaxException

The original URI does not change (URI objects are immutable), but the URI returned will have separate authority parts for user info, host, and port. If the authority cannot be parsed, a URISyntaxException is thrown.

Example 7-10 uses these methods to split URIs entered on the command line into their component parts. It's similar to Example 7-4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.

Example 7-10. The parts of a URI
import java.net.*;

public class URISplitter {

  public static void main(String args[]) {

    for (int i = 0; i < args.length; i++) {
      try {
        URI u = new URI(args[i]);
        System.out.println("The URI is " + u);
        if (u.isOpaque( )) {
          System.out.println("This is an opaque URI."); 
          System.out.println("The scheme is " + u.getScheme( ));        
          System.out.println("The scheme specific part is " 
           + u.getSchemeSpecificPart( ));        
          System.out.println("The fragment ID is " + u.getFragment( ));        
        }
        else {
          System.out.println("This is a hierarchical URI."); 
          System.out.println("The scheme is " + u.getScheme( ));        
          try {       
            u = u.parseServerAuthority( );
            System.out.println("The host is " + u.getUserInfo( ));        
            System.out.println("The user info is " + u.getUserInfo( ));        
            System.out.println("The port is " + u.getPort( ));        
          }
          catch (URISyntaxException ex) {
            // Must be a registry based authority
            System.out.println("The authority is " + u.getAuthority( ));        
          }
          System.out.println("The path is " + u.getPath( ));        
          System.out.println("The query string is " + u.getQuery( ));        
          System.out.println("The fragment ID is " + u.getFragment( )); 
        } // end else       
      }  // end try
      catch (URISyntaxException ex) {
        System.err.println(args[i] + " does not seem to be a URI.");
      }
      System.out.println( );
    }  // end for

  }  // end main

}  // end URISplitter

Here's the result of running this against three of the URI examples in this section:

% java URISplitter tel:+1-800-9988-9938 
\http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \urn:isbn:1-565-92870-9
The URI is tel:+1-800-9988-9938
This is an opaque URI.
The scheme is tel
The scheme specific part is +1-800-9988-9938
The fragment ID is null

The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc
This is a hierarchical URI.
The scheme is http
The host is null
The user info is null
The port is -1
The path is /pub/a/2003/09/17/stax.html
The query string is null
The fragment ID is id=_hbc

The URI is urn:isbn:1-565-92870-9
This is an opaque URI.
The scheme is urn
The scheme specific part is isbn:1-565-92870-9
The fragment ID is null

Resolving Relative URIs

The URI class has three methods for converting back and forth between relative and absolute URIs.

public URI resolve(URI uri)

This method compares the uri argument to this URI and uses it to construct a new URI object that wraps an absolute URI. For example, consider these three lines of code:

URI absolute = new URI("http://www.example.com/");
URI relative = new URI("images/logo.png");
URI resolved = absolute.resolve(relative);

After they've executed, resolved contains the absolute URI http://www.example.com/images/logo.png.

If the invoking URI does not contain an absolute URI itself, the resolve( ) method resolves as much of the URI as it can and returns a new relative URI object as a result. For example, take these three statements:

URI top = new URI("javafaq/books/");
URI relative = new URI("jnp3/examples/07/index.html");
URI resolved = top.resolve(relative);

After they've executed, resolved now contains the relative URI javafaq/books/jnp3/examples/07/index.html with no scheme or authority.

public URI resolve(String uri)

This is a convenience method that simply converts the string argument to a URI and then resolves it against the invoking URI, returning a new URI object as the result. That is, it's equivalent to resolve(newURI(str)). Using this method, the previous two samples can be rewritten as:

URI absolute = new URI("http://www.example.com/");
URI resolved = absolute.resolve("images/logo.png");
URI top = new URI("javafaq/books/");
resolved = top.resolve("jnp3/examples/07/index.html");

public URI relativize(URI uri)

It's also possible to reverse this procedure; that is, to go from an absolute URI to a relative one. The relativize( ) method creates a new URI object from the uri argument that is relative to the invoking URI. The argument is not changed. For example:

URI absolute = new URI("http://www.example.com/images/logo.png");
URI top = new URI("http://www.example.com/");
URI relative = top.relativize(absolute);

The URI object relative now contains the relative URI images/logo.png.

Utility Methods

The URI class has the usual batch of utility methods: equals(), hashCode( ), toString( ), and compareTo( ).

public boolean equals(Object o)

URIs are tested for equality pretty much as you'd expect. It's not a direct string comparison. Equal URIs must both either be hierarchical or opaque. The scheme and authority parts are compared without considering case. That is, http and HTTP are the same scheme, and www.example.com is the same authority as www.EXAMPLE.com. The rest of the URI is case-sensitive, except for hexadecimal digits used to escape illegal characters. Escapes are not decoded before comparing. http://www.example.com/A and http://www.example.com/%41 are unequal URIs.

public int hashCode( )

The hashCode( ) method is a usual hashCode( ) method, nothing special. Equal URIs do have the same hash code and unequal URIs are fairly unlikely to share the same hash code.

public int compareTo(Object o)

URIs can be ordered. The ordering is based on string comparison of the individual parts, in this sequence:

URIs are not comparable to any type except themselves. Comparing a URI to anything except another URI causes a ClassCastException.

public String toString( )

The toString( ) method returns an unencoded string form of the URI. That is, characters like é and \ are not percent-escaped unless they were percent-escaped in the strings used to construct this URI. Therefore, the result of calling this method is not guaranteed to be a syntactically correct URI. This form is sometimes useful for display to human beings, but not for retrieval.

public String toASCIIString( )

The toASCIIString( ) method returns an encoded string form of the URI. Characters like é and \ are always percent-escaped whether or not they were originally escaped. This is the string form of the URI you should use most of the time. Even if the form returned by toString( ) is more legible for humans, they may still copy and paste it into areas that are not expecting an illegal URI. toASCIIString( ) always returns a syntactically correct URI.

Proxies

Many systems access the Web and sometimes other non-HTTP parts of the Internet through proxy servers. A proxy server receives a request for a remote server from a local client. The proxy server makes the request to the remote server and forwards the result back to the local client. Sometimes this is done for security reasons, such as to prevent remote hosts from learning private details about the local network configuration. Other times it's done to prevent users from accessing forbidden sites by filtering outgoing requests and limiting which sites can be viewed. For instance, an elementary school might want to block access to http://www.playboy.com. And still other times it's done purely for performance, to allow multiple users to retrieve the same popular documents from a local cache rather than making repeated downloads from the remote server.

Java programs based on the URL class can work through most common proxy servers and protocols. Indeed, this is one reason you might want to choose to use the URL class rather than rolling your own HTTP or other client on top of raw sockets.

System Properties

For basic operations, all you have to do is set a few system properties to point to the addresses of your local proxy servers. If you are using a pure HTTP proxy, set http.proxyHost to the domain name or the IP address of your proxy server and http.proxyPort to the port of the proxy server (the default is 80). There are several ways to do this, including calling System.setProperty() from within your Java code or using the -D options when launching the program. This example sets the proxy server to 192.168.254.254 and the port to 9000:

% java -Dhttp.proxyHost=192.168.254.254  -Dhttp.proxyPort=9000  
com.domain.Program

If you want to exclude a host from being proxied and connect directly instead, set the http.nonProxyHosts system property to its hostname or IP address. To exclude multiple hosts, separate their names by vertical bars. For example, this code fragment proxies everything except java.oreilly.com and xml.oreilly.com:

System.setProperty("http.proxyHost", "192.168.254.254");
System.setProperty("http.proxyPort", "9000");
System.setProperty("http.nonProxyHosts", "java.oreilly.com|xml.oreilly.com");

You can also use an asterisk as a wildcard to indicate that all the hosts within a particular domain or subdomain should not be proxied. For example, to proxy everything except hosts in the oreilly.com domain:

% java -Dhttp.proxyHost=192.168.254.254  -Dhttp.nonProxyHosts=*.oreilly.com  
com.domain.Program

If you are using an FTP proxy server, set the ftp.proxyHost, ftp.proxyPort, and ftp.nonProxyHosts properties in the same way.

Java does not support any other application layer proxies, but if you're using a transport layer SOCKS proxy for all TCP connections, you can identify it with the socksProxyHost and socksProxyPort system properties. Java does not provide an option for nonproxying with SOCKS. It's an all-or-nothing decision.

The Proxy Class

Java 1.5 allows more fine-grained control of proxy servers from within a Java program. Specifically, this allows you to choose different proxy servers for different remote hosts. The proxies themselves are represented by instances of the java.net.Proxy class. There are still only three kinds of proxies, HTTP, SOCKS, and direct connections (no proxy at all), represented by three constants in the Proxy.Type enum:

Besides its type, the other important piece of information about a proxy is its address and port, given as a SocketAddress object. For example, this code fragment creates a Proxy object representing an HTTP proxy server on port 80 of proxy.example.com:

SocketAddress address = new InetSocketAddress("proxy.example.com", 80);
Proxy proxy = new Proxy(Proxy.Type.HTTP, address);

Although there are only three kinds of proxy objects, there can be many proxies of the same type for different proxy servers on different hosts.

The ProxySelector Class

Each running Java 1.5 virtual machine has a single java.net.ProxySelector object it uses to locate the proxy server for different connections. The default ProxySelector merely inspects the various system properties and the URL's protocol to decide how to connect to different hosts. However, you can install your own subclass of ProxySelector in place of the default selector and use it to choose different proxies based on protocol, host, path, time of day, or other criteria.

The key to this class is the abstract select( ) method:

public abstract List<Proxy> select(URI uri)

Java passes this method a URI object (not a URL object) representing the host to which a connection is needed. For a connection made with the URL class, this object typically has the form http://www.example.com/ or ftp://ftp.example.com/pub/files/, or some such. For a pure TCP connection made with the Socket class, this URI will have the form socket://host:port:, for instance, socket://www.example.com:80. The ProxySelector object then chooses the right proxies for this type of object and returns them in a List<Proxy>.

The second abstract method in this class you must implement is connectFailed( ):

public void connectFailed(URI uri, SocketAddress address, IOException ex)

This is a callback method used to warn a program that the proxy server isn't actually making the connection. Example 7-11 demonstrates with a ProxySelector that attempts to use the proxy server at proxy.example.com for all HTTP connections unless the proxy server has previously failed to resolve a connection to a particular URL. In that case, it suggests a direct connection instead.

Example 7-11. A ProxySelector that remembers what it can connect to
import java.net.*;
import java.util.*;
import java.io.*;

public class LocalProxySelector extends ProxySelector {
  
  private List failed = new ArrayList( );
  
  public List<Proxy> select(URI uri) {
    
    List<Proxy> result = new ArrayList<Proxy>( );
    if (failed.contains(uri) 
      || "http".equalsIgnoreCase(uri.getScheme( ))) {
        result.add(Proxy.NO_PROXY);
    }
    else {
        SocketAddress proxyAddress 
          = new InetSocketAddress( "proxy.example.com", 8000);
        Proxy proxy = new Proxy(Proxy.Type.HTTP, proxyAddress);
        result.add(proxy);
    }
    
    return result;
    
  }
  
  public void connectFailed(URI uri, SocketAddress address, IOException ex) {
    failed.add(uri);
  }
  
}

As I already said, each running virtual machine has exactly one ProxySelector. To change the ProxySelector, pass the new selector to the static ProxySelector.setDefault( ) method, like so:

ProxySelector selector = new LocalProxySelector( ):
ProxySelector.setDefault(selector);

From this point forward, all connections opened by that virtual machine will ask the ProxySelector for the right proxy to use. You normally shouldn't use this in code running in a shared environment. For instance, you wouldn't change the ProxySelector in a servlet because that would change the ProxySelector for all servlets running in the same container.

Communicating with Server-Side Programs Through GET

The URL class makes it easy for Java applets and applications to communicate with server-side programs such as CGIs, servlets, PHP pages, and others that use the GET method. (Server-side programs that use the POST method require the URLConnection class and are discussed in Chapter 15.) All you need to know is what combination of names and values the program expects to receive, and cook up a URL with a query string that provides the requisite names and values. All names and values must be x-www-form-url-encoded—as by the URLEncoder.encode() method, discussed earlier in this chapter.

There are a number of ways to determine the exact syntax for a query string that talks to a particular program. If you've written the server-side program yourself, you already know the name-value pairs it expects. If you've installed a third-party program on your own server, the documentation for that program should tell you what it expects.

On the other hand, if you're talking to a program on a third-party server, matters are a little trickier. You can always ask people at the remote server to provide you with the specifications for talking to their site. However, even if they don't mind doing this, there's probably no single person whose job description includes "telling third-party hackers with whom we have no business relationship exactly how to access our servers." Thus, unless you happen upon a particularly friendly or bored individual who has nothing better to do with their time except write long emails detailing exactly how to access their server, you're going to have to do a little reverse engineering.

TIP: This is beginning to change. A number of web sites have realized the value of opening up their systems to third party developers and have begin publishing developers' kits that provide detailed information on how to construct URLs to access their services. Sites like Safari and Amazon that offer RESTful, URL-based interfaces are easily accessed through the URL class. SOAP-based services like eBay's and Google's are much more difficult to work with.

Many programs are designed to process form input. If this is the case, it's straightforward to figure out what input the program expects. The method the form uses should be the value of the METHOD attribute of the FORM element. This value should be either GET, in which case you use the process described here, or POST, in which case you use the process described in Chapter 15. The part of the URL that precedes the query string is given by the value of the ACTION attribute of the FORM element. Note that this may be a relative URL, in which case you'll need to determine the corresponding absolute URL. Finally, the name-value pairs are simply the NAME attributes of the INPUT elements, except for any INPUT elements whose TYPE attribute has the value submit.

For example, consider this HTML form for the local search engine on my Cafe con Leche site. You can see that it uses the GET method. The program that processes the form is accessed via the URL http://www.google.com/search. It has four separate name-value pairs, three of which have default values:

<form name="search" action="http://www.google.com/search" method="get">
  <input name="q" />
  <input type="hidden" value="cafeconleche.org" name="domains" />
  <input type="hidden" name="sitesearch" value="cafeconleche.org" />
  <input type="hidden" name="sitesearch2" value="cafeconleche.org" />
   <br />
   <input type="image" height="22" width="55" 
      src="images/search_blue.gif" alt="search" border="0" 
      name="search-image" />
</form>

The type of the INPUT field doesn't matter—for instance, it doesn't matter if it's a set of checkboxes, a pop-up list, or a text field—only the name of each INPUT field and the value you give it is significant. The single exception is a submit input that tells the web browser when to send the data but does not give the server any extra information. In some cases, you may find hidden INPUT fields that must have particular required default values. This form has three hidden INPUT fields.

In some cases, the program you're talking to may not be able to handle arbitrary text strings for values of particular inputs. However, since the form is meant to be read and filled in by human beings, it should provide sufficient clues to figure out what input is expected; for instance, that a particular field is supposed to be a two-letter state abbreviation or a phone number.

A program that doesn't respond to a form is much harder to reverse engineer. For example, at http://www.ibiblio.org/nywc/bios.phtml, you'll find a lot of links to PHP pages that talk to a database to retrieve a list of musical works by a particular composer. However, there's no form anywhere that corresponds to this program. It's all done by hardcoded URLs. In this case, the best you can do is look at as many of those URLs as possible and see whether you can guess what the server expects. If the designer hasn't tried to be too devious, this information isn't hard to figure out. For example, these URLs are all found on that page:

http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Anderson  
    &first=Beth&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Austin 
    &first=Dorothea&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Bliss 
    &first=Marilyn&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Hart 
    &first=Jane&middle=Smith

Looking at these, you can guess that this particular program expects three inputs named first, middle, and last, with values that consist of the first, middle, and last names of a composer, respectively. Sometimes the inputs may not have such obvious names. In this case, you have to do some experimenting, first copying some existing values and then tweaking them to see what values are and aren't accepted. You don't need to do this in a Java program. You can simply edit the URL in the Address or Location bar of your web browser window.

TIP: The likelihood that other hackers may experiment with your own server-side programs in such a fashion is a good reason to make them extremely robust against unexpected input.

Regardless of how you determine the set of name-value pairs the server expects, communicating with it once you know them is simple. All you have to do is create a query string that includes the necessary name-value pairs, then form a URL that includes that query string. Send the query string to the server and read its response using the same methods you use to connect to a server and retrieve a static HTML page. There's no special protocol to follow once the URL is constructed. (There is a special protocol to follow for the POST method, however, which is why discussion of that method will have to wait until Chapter 15.)

To demonstrate this procedure, let's write a very simple command-line program to look up topics in the Netscape Open Directory (http://dmoz.org/). This site is shown in Figure 7-3 and it has the advantage of being really simple.


Figure 7-3. The basic user interface for the Open Directory

The basic Open Directory interface is a simple form with one input field named search; input typed in this field is sent to a CGI program at http://search.dmoz.org/cgi-bin/search, which does the actual search. The HTML for the form looks like this:

<form accept-charset="UTF-8"
      action="http://search.dmoz.org/cgi-bin/search" method="GET">
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <input size=30 name=search>

<input type=submit value="Search">
<a href="http://search.dmoz.org/cgi-bin/search?a.x=0">
<small><i>advanced</i></small></a>
</form>

There are only two input fields in this form: the Submit button and a text field named Search. Thus, to submit a search request to the Open Directory, you just need to collect the search string, encode it in a query string, and send it to http://search.dmoz.org/cgi-bin/search. For example, to search for "java", you would open a connection to the URL http://search.dmoz.org/cgi-bin/search?search=java and read the resulting input stream. Example 7-12 does exactly this.

Example 7-12. Do an Open Directory search
import com.macfaq.net.*;

import java.net.*;
import java.io.*;

public class DMoz {

  public static void main(String[] args) {
  
    String target = "";
    
    for (int i = 0; i < args.length; i++) {
      target += args[i] + " ";
    }
    target = target.trim( );
    QueryString query = new QueryString("search", target);
    try {
      URL u = new URL("http://search.dmoz.org/cgi-bin/search?" + query);
      InputStream in = new BufferedInputStream(u.openStream( ));
      InputStreamReader theHTML = new InputStreamReader(in);
      int c;
      while ((c = theHTML.read( )) != -1) {
        System.out.print((char) c);
      } 
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
    
  }

}

Of course, a lot more effort could be expended on parsing and displaying the results. But notice how simple the code was to talk to this server. Aside from the funky-looking URL and the slightly greater likelihood that some pieces of it need to be x-www-form-url-encoded, talking to a server-side program that uses GET is no harder than retrieving any other HTML page.

Accessing Password-Protected Sites

Many popular sites, such as TheWall Street Journal, require a username and password for access. Some sites, such as the W3C member pages, implement this correctly through HTTP authentication. Others, such as the Java Developer Connection, implement it incorrectly through cookies and HTML forms. Java's URL class can access sites that use HTTP authentication, although you'll of course need to tell it what username and password to use. Java does not provide support for sites that use nonstandard, cookie-based authentication, in part because Java doesn't really support cookies in Java 1.4 and earlier, in part because this requires parsing and submitting HTML forms, and, lastly, because cookies are completely contrary to the architecture of the Web. (Java 1.5 does add some cookie support, which we'll discuss in the next chapter. However, it does not treat authentication cookies differently than any other cookies.) You can provide this support yourself using the URLConnection class to read and write the HTTP headers where cookies are set and returned. However, doing so is decidedly nontrivial and often requires custom code for each site you want to connect to. It's really hard to do short of implementing a complete web browser with full HTML forms and cookie support. Accessing sites protected by standard, HTTP authentication is much easier.

The Authenticator Class

The java.net package includes an Authenticator class you can use to provide a username and password for sites that protect themselves using HTTP authentication:

public abstract class Authenticator extends Object // Java 1.2

Since Authenticator is an abstract class, you must subclass it. Different subclasses may retrieve the information in different ways. For example, a character mode program might just ask the user to type the username and password on System.in. A GUI program would likely put up a dialog box like the one shown in Figure 7-4. An automated robot might read the username out of an encrypted file.


Figure 7-4. An authentication dialog

To make the URL class use the subclass, install it as the default authenticator by passing it to the static Authenticator.setDefault() method:

public static void setDefault(Authenticator a)

For example, if you've written an Authenticator subclass named DialogAuthenticator, you'd install it like this:

Authenticator.setDefault(new DialogAuthenticator( ));

You only need to do this once. From this point forward, when the URL class needs a username and password, it will ask the DialogAuthenticator using the static Authenticator.requestPasswordAuthentication() method:

public static PasswordAuthentication requestPasswordAuthentication(
  InetAddress address, int port, String protocol, String prompt, String scheme) 
  throws SecurityException

The address argument is the host for which authentication is required. The port argument is the port on that host, and the protocol argument is the application layer protocol by which the site is being accessed. The HTTP server provides the prompt. It's typically the name of the realm for which authentication is required. (Some large web servers such as www.ibiblio.org have multiple realms, each of which requires different usernames and passwords.) The scheme is the authentication scheme being used. (Here the word scheme is not being used as a synonym for protocol. Rather it is an HTTP authentication scheme, typically basic.)

Untrusted applets are not allowed to ask the user for a name and password. Trusted applets can do so, but only if they possess the requestPasswordAuthenticationNetPermission. Otherwise, Authenticator.requestPasswordAuthentication( ) throws a SecurityException.

The Authenticator subclass must override the getPasswordAuthentication( ) method. Inside this method, you collect the username and password from the user or some other source and return it as an instance of the java.net.PasswordAuthentication class:

protected PasswordAuthentication getPasswordAuthentication( )

If you don't want to authenticate this request, return null, and Java will tell the server it doesn't know how to authenticate the connection. If you submit an incorrect username or password, Java will call getPasswordAuthentication( ) again to give you another chance to provide the right data. You normally have five tries to get the username and password correct; after that, openStream( ) throws a ProtocolException.

Usernames and passwords are cached within the same virtual machine session. Once you set the correct password for a realm, you shouldn't be asked for it again unless you've explicitly deleted the password by zeroing out the char array that contains it.

You can get more details about the request by invoking any of these methods inherited from the Authenticator superclass:

protected final InetAddress getRequestingSite( )
protected final int         getRequestingPort( )
protected final String      getRequestingProtocol( )
protected final String      getRequestingPrompt( )
protected final String      getRequestingScheme( )
protected final String      getRequestingHost( )  // Java 1.4

These methods either return the information as given in the last call to requestPasswordAuthentication( ) or return null if that information is not available. (getRequestingPort( ) returns -1 if the port isn't available.) The last method, getRequestingHost( ), is only available in Java 1.4 and later; in earlier releases you can call getRequestingSite( ).getHostName( ) instead.

Java 1.5 adds two more methods to this class:

protected final String getRequestingURL( )  // Java 1.5
protected Authenticator.RequestorType getRequestorType( )

The getRequestingURL( ) method returns the complete URL for which authentication has been requested—an important detail if a site uses different names and passwords for different files. The getRequestorType( ) method returns one of the two named constants Authenticator.RequestorType.PROXY or Authenticator.RequestorType.SERVER to indicate whether the server or the proxy server is requesting the authentication.

The PasswordAuthentication Class

PasswordAuthentication is a very simple final class that supports two read-only properties: username and password. The username is a String. The password is a char array so that the password can be erased when it's no longer needed. A String would have to wait to be garbage collected before it could be erased, and even then it might still exist somewhere in memory on the local system, possibly even on disk if the block of memory that contained it had been swapped out to virtual memory at one point. Both username and password are set in the constructor:

public PasswordAuthentication(String userName, char[] password)

Each is accessed via a getter method:

public String getUserName( )
public char[] getPassword( )

The JPasswordField Class

One useful tool for asking users for their passwords in a more or less secure fashion is the JPasswordField component from Swing:

public class JPasswordField extends JTextField

This lightweight component behaves almost exactly like a text field. However, anything the user types into it is echoed as an asterisk. This way, the password is safe from anyone looking over the user's shoulder at what's being typed on the screen.

JPasswordField also stores the passwords as a char array so that when you're done with the password you can overwrite it with zeros. It provides the getPassword( ) method to return this:

public char[] getPassword( )

Otherwise, you mostly use the methods it inherits from the JTextField superclass. Example 7-13 demonstrates a Swing-based Authenticator subclass that brings up a dialog to ask the user for his username and password. Most of this code handles the GUI. A JPasswordField collects the password and a simple JTextField retrieves the username. Figure 7-4 showed the rather simple dialog box this produces.

Example 7-13. A GUI authenticator
package com.macfaq.net;

import java.net.*;
import javax.swing.*;
import java.awt.*;
import java.awt.event.*;

public class DialogAuthenticator extends Authenticator {

  private JDialog passwordDialog;  
  private JLabel mainLabel 
   = new JLabel("Please enter username and password: ");
  private JLabel userLabel = new JLabel("Username: ");
  private JLabel passwordLabel = new JLabel("Password: ");
  private JTextField usernameField = new JTextField(20);
  private JPasswordField passwordField = new JPasswordField(20);
  private JButton okButton = new JButton("OK");
  private JButton cancelButton = new JButton("Cancel");
  
  public DialogAuthenticator( ) {
    this("", new JFrame( ));
  }
    
  public DialogAuthenticator(String username) {
    this(username, new JFrame( ));
  }
    
  public DialogAuthenticator(JFrame parent) {
    this("", parent);
  }
    
  public DialogAuthenticator(String username, JFrame parent) {
    
    this.passwordDialog = new JDialog(parent, true);  
    Container pane = passwordDialog.getContentPane( );
    pane.setLayout(new GridLayout(4, 1));
    pane.add(mainLabel);
    JPanel p2 = new JPanel( );
    p2.add(userLabel);
    p2.add(usernameField);
    usernameField.setText(username);
    pane.add(p2);
    JPanel p3 = new JPanel( );
    p3.add(passwordLabel);
    p3.add(passwordField);
    pane.add(p3);
    JPanel p4 = new JPanel( );
    p4.add(okButton);
    p4.add(cancelButton);
    pane.add(p4);   
    passwordDialog.pack( );
    
    ActionListener al = new OKResponse( );
    okButton.addActionListener(al);
    usernameField.addActionListener(al);
    passwordField.addActionListener(al);
    cancelButton.addActionListener(new CancelResponse( ));
    
  }
  
  private void show( ) {
    
    String prompt = this.getRequestingPrompt( );
    if (prompt == null) {
      String site     = this.getRequestingSite( ).getHostName( );
      String protocol = this.getRequestingProtocol( );
      int    port     = this.getRequestingPort( );
      if (site != null & protocol != null) {
        prompt = protocol + "://" + site;
        if (port > 0) prompt += ":" + port;
      }
      else {
        prompt = ""; 
      }
      
    }

    mainLabel.setText("Please enter username and password for "
     + prompt + ": ");
    passwordDialog.pack( );
    passwordDialog.show( );
    
  }
  
  PasswordAuthentication response = null;

  class OKResponse implements ActionListener {
  
    public void actionPerformed(ActionEvent e) {
      
      passwordDialog.hide( );
      // The password is returned as an array of 
      // chars for security reasons.
      char[] password = passwordField.getPassword( );
      String username = usernameField.getText( );
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = new PasswordAuthentication(username, password);
      
    } 
    
  }

  class CancelResponse implements ActionListener {
  
    public void actionPerformed(ActionEvent e) {
      
      passwordDialog.hide( );
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = null;
      
    } 
    
  }

  public PasswordAuthentication getPasswordAuthentication( ) {
    
    this.show( );    
    return this.response;
    
  }

}

Example 7-14 is a revised SourceViewer program that asks the user for a name and password using the DialogAuthenticator class.

Example 7-14. A program to download password-protected web pages
import java.net.*;
import java.io.*;
import com.macfaq.net.DialogAuthenticator;

public class SecureSourceViewer {

  public static void main (String args[]) {

    Authenticator.setDefault(new DialogAuthenticator( ));

    for (int i = 0; i < args.length; i++) {
      
      try {
        //Open the URL for reading
        URL u = new URL(args[i]);
        InputStream in = u.openStream( );
        // buffer the input to increase performance 
        in = new BufferedInputStream(in);       
        // chain the InputStream to a Reader
        Reader r = new InputStreamReader(in);
        int c;
        while ((c = r.read( )) != -1) {
          System.out.print((char) c);
        } 
      }
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      
      // print a blank line to separate pages
      System.out.println( );
      
    } //  end for
  
    // Since we used the AWT, we have to explicitly exit.
    System.exit(0);

  } // end main

}  // end SecureSourceViewer

Elliotte Rusty Harold is a noted writer and programmer, both on and off the Internet. His previous books include "Java Network Programming", Third Edition, "XML in a Nutshell", Third Edition, and "Java I/O", all from O'Reilly.


View catalog information for Java Network Programming, 3rd Edition

Return to ONJava.com.

Copyright © 2009 O'Reilly Media, Inc.