One of the challenges faced by the designers of the Web was dealing with the differences between operating systems. These differences can cause problems with URLs: for example, some operating systems allow spaces in filenames; some don't. Most operating systems won't complain about a # sign in a filename; but in a URL, a # sign indicates that the filename has ended, and a fragment identifier follows. Other special characters, nonalphanumeric characters, and so on, all of which may have a special meaning inside a URL or on another operating system, present similar problems. To solve these problems, characters used in URLs must come from a fixed subset of ASCII, specifically:
The capital letters A-Z
The lowercase letters a-z
The digits 0-9
The punctuation characters - _ . ! ~ * ' ( ) and ,
The characters : / & ? @ # ; $ + = and % may also be used, but only for their specified purposes. If these characters occur as part of a filename, they and all other characters should be encoded.
The encoding is very simple. Any characters that are not ASCII numerals, letters, or the punctuation marks specified earlier are converted into bytes and each byte is written as a percent sign followed by two hexadecimal digits. Spaces are a special case because they're so common. Besides being encoded as %20, they can be encoded as a plus sign (+). The plus sign itself is encoded as %2B. The / # = & and ? characters should be encoded when they are used as part of a name, and not as a separator between parts of the URL.
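The rules above can be sketched by hand. The following is only an illustration, not how any Java library class is implemented: the class name PercentEncoderSketch is hypothetical, it assumes UTF-8 as the byte encoding and Java 7 or later, and it always writes a space as %20 rather than as a plus sign.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncoderSketch {

    // Punctuation that the list above allows to appear unencoded.
    private static final String SAFE_PUNCTUATION = "-_.!~*'(),";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    public static String encode(String s) {
        StringBuilder out = new StringBuilder();
        // Assumption: characters are converted to bytes using UTF-8.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF; // treat the byte as unsigned
            char c = (char) value;
            boolean safe = value < 0x80
                && ((c >= 'A' && c <= 'Z')
                 || (c >= 'a' && c <= 'z')
                 || (c >= '0' && c <= '9')
                 || SAFE_PUNCTUATION.indexOf(c) >= 0);
            if (safe) {
                out.append(c);
            } else {
                // A percent sign followed by two hexadecimal digits
                out.append('%').append(HEX[value >> 4]).append(HEX[value & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a b#c"));   // a%20b%23c
        System.out.println(encode("\u00e9"));  // %C3%A9
    }
}
```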
WARNING This scheme doesn't work well in heterogeneous environments with multiple character sets. For example, on a U.S. Windows system, é is encoded as %E9. On a U.S. Mac, it's encoded as %8E. The existence of variations is a distinct shortcoming of the current URI specification that should be addressed in the future through Internationalized Resource Identifiers (IRIs).
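The dependence on character set is easy to reproduce with the two-argument URLEncoder.encode( ) method. This small sketch (the class name CharsetDependenceDemo is made up for illustration) encodes the same character under two encodings; MacRoman is left out because not every VM is guaranteed to ship it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CharsetDependenceDemo {

    // Encodes s using the named character set, wrapping the checked exception.
    public static String encodeWith(String s, String charsetName) {
        try {
            return URLEncoder.encode(s, charsetName);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("Unsupported encoding: " + charsetName, ex);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the character é
        System.out.println(encodeWith(eAcute, "ISO-8859-1")); // %E9
        System.out.println(encodeWith(eAcute, "UTF-8"));      // %C3%A9
    }
}
```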
The URL class does not perform encoding or
decoding automatically. You can construct URL
objects that use illegal ASCII and non-ASCII characters and/or
percent escapes. Such characters and escapes are not automatically
encoded or decoded when output by methods such as getPath() and toExternalForm( ). You are
responsible for making sure all such characters are properly encoded
in the strings used to construct a URL object.
Luckily, Java provides a URLEncoder class to
encode strings in this format. Java 1.2 adds a
URLDecoder class that can decode strings in this
format. Neither class is meant to be instantiated; their methods are static.
public class URLDecoder extends Object
public class URLEncoder extends Object
In Java 1.3 and earlier, the
java.net.URLEncoder class contains a single static
method called encode( ) that encodes a String
according to these rules:
public static String encode(String s)
This method always uses the default encoding of the platform on which it runs, so it will produce different results on different systems. As a result, Java 1.4 deprecates this method and replaces it with a method that requires you to specify the encoding:
public static String encode(String s, String encoding)
throws UnsupportedEncodingException
Both variants change any nonalphanumeric
characters into % sequences (except the space, underscore, hyphen,
period, and asterisk characters). Both also encode all non-ASCII
characters. The space is converted into a plus sign. These methods
are a little over-aggressive; they also convert tildes, single
quotes, exclamation points, and parentheses to percent escapes, even
though they don't absolutely have to. However, this
change isn't forbidden by the URL specification, so
web browsers deal reasonably with these excessively encoded URLs.
Both variants return a new String, suitably
encoded. The Java 1.3 encode( ) method uses the
platform's default encoding to calculate percent
escapes. This encoding is typically ISO-8859-1 on U.S. Unix systems,
Cp1252 on U.S. Windows systems, MacRoman on U.S. Macs, and so on in
other locales. Because both encoding and decoding are platform- and
locale-specific, this method is annoyingly non-interoperable, which
is precisely why it has been deprecated in Java 1.4 in favor of the
variant that requires you to specify an encoding. However, if you
just pick the platform default encoding, your program will be as
platform- and locale-locked as the Java 1.3 version. Instead, you
should always pick UTF-8, never anything else. UTF-8 is compatible
with the new IRI specification, the URI class,
modern web browsers, and more other software than any other encoding
you could choose.
Example 7-8 is a program that uses
URLEncoder.encode( ) to print various encoded
strings. Java 1.4 or later is required to compile and run it.
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;
public class EncoderTest {
public static void main(String[] args) {
try {
System.out.println(URLEncoder.encode("This string has spaces",
"UTF-8"));
System.out.println(URLEncoder.encode("This*string*has*asterisks",
"UTF-8"));
System.out.println(URLEncoder.encode("This%string%has%percent%signs",
"UTF-8"));
System.out.println(URLEncoder.encode("This+string+has+pluses",
"UTF-8"));
System.out.println(URLEncoder.encode("This/string/has/slashes",
"UTF-8"));
System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks",
"UTF-8"));
System.out.println(URLEncoder.encode("This:string:has:colons",
"UTF-8"));
System.out.println(URLEncoder.encode("This~string~has~tildes",
"UTF-8"));
System.out.println(URLEncoder.encode("This(string)has(parentheses)",
"UTF-8"));
System.out.println(URLEncoder.encode("This.string.has.periods",
"UTF-8"));
System.out.println(URLEncoder.encode("This=string=has=equals=signs",
"UTF-8"));
System.out.println(URLEncoder.encode("This&string&has&ampersands",
"UTF-8"));
System.out.println(URLEncoder.encode("Thiséstringéhasénon-ASCII characters",
"UTF-8"));
}
catch (UnsupportedEncodingException ex) {
throw new RuntimeException("Broken VM does not support UTF-8");
}
}
}
Here is the output. Note that the code needs to be saved in something other than ASCII, and the encoding chosen should be passed as an argument to the compiler to account for the non-ASCII characters in the source code.
% javac -encoding UTF8 EncoderTest.java
% java EncoderTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters
Notice in particular that this method encodes the forward slash, the
ampersand, the equals sign, and the colon. It does not attempt to
determine how these characters are being used in a URL. Consequently,
you have to encode URLs piece by piece rather than
encoding an entire URL in one method call. This is an important
point, because the most common use of URLEncoder
is in preparing query strings for communicating with
server-side programs that use GET. For example,
suppose you want to encode this query string used for an AltaVista
search:
pg=q&kl=XX&stype=stext&q=+"Java+I/O"&search.x=38&search.y=3
This code fragment encodes it:
String query = URLEncoder.encode(
"pg=q&kl=XX&stype=stext&q=+\"Java+I/O\"&search.x=38&search.y=3");
System.out.println(query);
Unfortunately, the output is:
pg%3Dq%26kl%3DXX%26stype%3Dstext%26q%3D%2B%22Java%2BI%2FO%22%26search.x%3D38%26search.y%3D3
The problem is that URLEncoder.encode( ) encodes
blindly. It can't distinguish between special
characters used as part of the URL or query string, like
& and = in the previous
string, and characters that need to be encoded. Consequently, URLs
need to be encoded a piece at a time like this:
String query = URLEncoder.encode("pg");
query += "=";
query += URLEncoder.encode("q");
query += "&";
query += URLEncoder.encode("kl");
query += "=";
query += URLEncoder.encode("XX");
query += "&";
query += URLEncoder.encode("stype");
query += "=";
query += URLEncoder.encode("stext");
query += "&";
query += URLEncoder.encode("q");
query += "=";
query += URLEncoder.encode("+\"Java I/O\"");
query += "&";
query += URLEncoder.encode("search.x");
query += "=";
query += URLEncoder.encode("38");
query += "&";
query += URLEncoder.encode("search.y");
query += "=";
query += URLEncoder.encode("3");
System.out.println(query);
The output of this is what you actually want:
pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3
Example 7-9 is a QueryString class that
uses the URLEncoder to encode successive name and
value pairs in a Java object, which will be used for sending data to
server-side programs. When you create a
QueryString, you can supply the first name-value
pair to the constructor as individual strings. To add further pairs,
call the add( ) method, which also takes two
strings as arguments and encodes them. The getQuery( )
method returns the accumulated list of encoded name-value pairs.
package com.macfaq.net;
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;
public class QueryString {
private StringBuffer query = new StringBuffer( );
public QueryString(String name, String value) {
encode(name, value);
}
public synchronized void add(String name, String value) {
query.append('&');
encode(name, value);
}
private synchronized void encode(String name, String value) {
try {
query.append(URLEncoder.encode(name, "UTF-8"));
query.append('=');
query.append(URLEncoder.encode(value, "UTF-8"));
}
catch (UnsupportedEncodingException ex) {
throw new RuntimeException("Broken VM does not support UTF-8");
}
}
public String getQuery( ) {
return query.toString( );
}
public String toString( ) {
return getQuery( );
}
}
Using this class, we can now encode the previous example:
QueryString qs = new QueryString("pg", "q");
qs.add("kl", "XX");
qs.add("stype", "stext");
qs.add("q", "+\"Java I/O\"");
qs.add("search.x", "38");
qs.add("search.y", "3");
String url = "http://www.altavista.com/cgi-bin/query?" + qs;
System.out.println(url);
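The same piece-by-piece idea can also be sketched with a map of parameters. This is an alternative illustration, not part of Example 7-9: the class QueryBuilderSketch and its method names are hypothetical, and it assumes Java 5 or later for generics and the enhanced for loop.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryBuilderSketch {

    // Encodes each name and value separately, then joins them with = and &.
    public static String build(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) query.append('&');
            query.append(enc(entry.getKey())).append('=').append(enc(entry.getValue()));
        }
        return query.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("Broken VM does not support UTF-8", ex);
        }
    }

    // Rebuilds the AltaVista query used in the text.
    public static String altaVistaExample() {
        Map<String, String> params = new LinkedHashMap<String, String>(); // keeps insertion order
        params.put("pg", "q");
        params.put("kl", "XX");
        params.put("stype", "stext");
        params.put("q", "+\"Java I/O\"");
        params.put("search.x", "38");
        params.put("search.y", "3");
        return build(params);
    }

    public static void main(String[] args) {
        System.out.println("http://www.altavista.com/cgi-bin/query?" + altaVistaExample());
    }
}
```

A LinkedHashMap is used here only so that the pairs come out in the order they were added; a plain HashMap would still produce a valid query string, just in an unpredictable order.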
The corresponding URLDecoder class has two static
methods that decode strings encoded in x-www-form-urlencoded format.
That is, they convert all plus signs to spaces and all percent
escapes to their corresponding character:
public static String decode(String s) throws Exception
public static String decode(String s, String encoding) // Java 1.4
throws UnsupportedEncodingException
The first variant is used in Java 1.3 and 1.2. The second variant is used in Java 1.4 and later. If you have any doubt about which encoding to use, pick UTF-8. It's more likely to be correct than anything else.
An IllegalArgumentException may be thrown if the
string contains a percent sign that isn't followed
by two hexadecimal digits or decodes into an illegal sequence. Then
again it may not be. This is implementation-dependent, and what
happens when an illegal sequence is detected and does not throw an
IllegalArgumentException is undefined. In
Sun's JDK 1.4, no exception is thrown and extra
bytes with no apparent meaning are added to the undecodable string.
This is truly brain-damaged, and possibly a security hole.
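Because the behavior on malformed input is not guaranteed, defensive code may want to treat an IllegalArgumentException as "could not decode" rather than let it propagate. The following is a minimal sketch under that assumption; the class UrlDecodeGuard and the choice to return null are illustrative only. Recent JDKs appear to throw for a percent sign followed by non-hexadecimal characters, but as noted above that should not be relied on across implementations.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecodeGuard {

    // Returns the decoded string, or null if the VM rejects the input as malformed.
    public static String safeDecode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (IllegalArgumentException ex) {
            // Malformed percent escape (on VMs that detect it)
            return null;
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("Broken VM does not support UTF-8", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(safeDecode("a+b%21")); // a b!
        System.out.println(safeDecode("%zz"));
    }
}
```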
Since this method does not touch non-escaped characters, you can pass an entire URL to it rather than splitting it into pieces first. For example:
String input = "http://www.altavista.com/cgi-bin/" +
"query?pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3";
try {
String output = URLDecoder.decode(input, "UTF-8");
System.out.println(output);
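}
catch (UnsupportedEncodingException ex) {
// Every Java VM is required to support UTF-8, so this should not happen
throw new RuntimeException("Broken VM does not support UTF-8");
}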
A URI is a generalization of a URL that
includes not only Uniform Resource Locators but also Uniform Resource
Names (URNs). Most URIs used in practice are URLs, but most
specifications and standards such as XML are defined in terms of
URIs. In Java 1.4 and later, URIs are represented by the
java.net.URI class. This class differs from the
java.net.URL class in three important ways:
The URI class is purely about identification of
resources and parsing of URIs. It provides no methods to retrieve a
representation of the resource identified by its URI.
The URI class is more conformant to the relevant
specifications than the URL class.
A URI object can represent a relative URI. The
URL class absolutizes all URIs before storing
them.
In brief, a URL object is a representation of an
application layer protocol for network retrieval, whereas a
URI object is purely for string parsing and
manipulation. The URI class has no network
retrieval capabilities. The URL class has some
string parsing methods, such as getFile( ) and
getRef( ), but many of these are broken and
don't always behave exactly as the relevant
specifications say they should. Assuming you're
using Java 1.4 or later and therefore have a choice, you should use
the URL class when you want to download the
content of a URL and the URI class when you want
to use the URI for identification rather than retrieval, for
instance, to represent an XML namespace URI. In some cases when you
need to do both, you may convert from a URI to a
URL with the toURL( ) method,
and in Java 1.5 you can also convert from a URL to
a URI using the toURI( ) method
of the URL class.
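As a rough sketch of converting in both directions, the following hypothetical helper (the class and method names are not from the book) parses a string as a URI, converts it to a URL, and converts it back. It assumes Java 7 or later for the multi-catch syntax, and it assumes the URI is absolute and uses a protocol the VM has a handler for, such as http. No network connection is made.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UriUrlConversion {

    // URI -> URL -> URI, wrapping the checked exceptions for brevity.
    public static URI roundTrip(String text) {
        try {
            URI uri = new URI(text);
            URL url = uri.toURL(); // needs an absolute URI with a supported protocol
            return url.toURI();    // available in Java 1.5 and later
        } catch (URISyntaxException | MalformedURLException ex) {
            throw new IllegalArgumentException("Cannot convert: " + text, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("http://www.example.com/index.html"));
    }
}
```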
URIs are built from strings. Unlike the
URL class, the URI class does
not depend on an underlying protocol handler. As long as the URI is
syntactically correct, Java does not need to understand its protocol
in order to create a representative URI object. Thus, unlike the
URL class, the URI class can be
used for new and experimental URI schemes.
public URI(String uri) throws URISyntaxException
This is the basic constructor that creates a new
URI object from any convenient string. For
example,
URI voice = new URI("tel:+1-800-9988-9938");
URI web = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");
URI book = new URI("urn:isbn:1-565-92870-9");
If the string argument does not follow URI syntax rules—for
example, if the URI begins with a colon—this constructor throws
a URISyntaxException. This is a checked exception,
so you need to either catch it or declare that the method where the
constructor is invoked can throw it. However, one syntactic rule is
not checked. In contradiction to the URI specification, the
characters used in the URI are not limited to ASCII. They can include
other Unicode characters, such as ø and é.
Syntactically, there are very few restrictions on URIs, especially
once the need to encode non-ASCII characters is removed and relative
URIs are allowed. Almost any string can be interpreted as a URI.
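One way to see how permissive this constructor is would be a small syntax check such as the following. The class UriSyntaxCheck is a hypothetical helper written for this illustration; it simply reports whether the single-string constructor accepts its argument.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSyntaxCheck {

    // True if new URI(text) succeeds, false if it throws URISyntaxException.
    public static boolean isSyntacticallyValid(String text) {
        try {
            new URI(text);
            return true;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSyntacticallyValid("tel:+1-800-9988-9938"));             // true
        System.out.println(isSyntacticallyValid("images/logo.png"));                  // true (relative)
        System.out.println(isSyntacticallyValid("http://www.example.com/caf\u00e9")); // true (non-ASCII allowed)
        System.out.println(isSyntacticallyValid(":broken"));                          // false (begins with a colon)
    }
}
```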
public URI(String scheme, String schemeSpecificPart, String fragment) throws URISyntaxException
This constructor is mostly used for nonhierarchical URIs. The scheme
is the URI's protocol, such as http, urn, tel, and
so forth. It must be composed exclusively of ASCII letters and digits
and the three punctuation characters +,
-, and .. It must begin with a
letter. Passing null for this argument omits the scheme, thus
creating a relative URI. For example:
URI absolute = new URI("http", "//www.ibiblio.org" , null);
URI relative = new URI(null, "/javafaq/index.shtml", "today");
The scheme-specific part depends on the syntax of the URI scheme;
it's one thing for an http URL, another for a mailto
URL, and something else again for a tel URI. Because the
URI class encodes illegal characters with percent
escapes, there's effectively no syntax error you can
make in this part.
Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.
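The automatic escaping can be observed with a short experiment. In this sketch the class ThreePartUriDemo and the example.com URI are invented for illustration; the spaces in the scheme-specific part and the fragment are illegal in a URI, so the constructor is expected to percent-escape them.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ThreePartUriDemo {

    // Thin wrapper around the three-argument constructor.
    public static URI build(String scheme, String schemeSpecificPart, String fragment) {
        try {
            return new URI(scheme, schemeSpecificPart, fragment);
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        URI u = build("http", "//www.example.com/a b", "frag ment");
        System.out.println(u);                  // http://www.example.com/a%20b#frag%20ment
        System.out.println(u.getRawFragment()); // frag%20ment
        System.out.println(u.getFragment());    // frag ment
    }
}
```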
public URI(String scheme, String host, String path, String fragment) throws URISyntaxException
This constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together (separated by a /) form the scheme-specific part for this URI. For example:
URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");
produces the URI http://www.ibiblio.org/javafaq/index.html#today.
If the constructor cannot form a legal hierarchical URI from the
supplied pieces—for instance, if there is a scheme so the URI
has to be absolute but the path doesn't start with
/—then it throws a URISyntaxException.
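public URI(String scheme, String authority, String path, String query, String fragment) throws URISyntaxException
This constructor is basically the same as the previous one, with the addition of a query string component. For example: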
URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html",
"referrer=cnet&date=2004-08-23", "today");
As usual, any unescapable syntax errors cause a
URISyntaxException to be thrown and null can be
passed to omit any of the arguments.
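public URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment) throws URISyntaxException
This is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example: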
URI styles = new URI("ftp", "anonymous:elharo@metalab.unc.edu",
"ftp.oreilly.com", 21, "/pub/stylesheet", null, null);
However, the resulting URI still has to follow all the usual rules for URIs and again, null can be passed for any argument to omit it from the result.
public static URI create(String uri)
This is not a constructor, but rather a static factory method. Unlike
the constructors, it does not throw a
URISyntaxException. If you're
sure your URIs are legal and do not violate any of the rules, you can
use this method. For example, this invocation creates a
URI for anonymous FTP access using an email
address as password:
URI styles = URI.create(
"ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet");
If the URI does prove to be malformed, this method throws an
IllegalArgumentException. This is a runtime
exception, so you don't have to explicitly declare
it or catch it.
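If the input is not fully trusted, the runtime exception can still be caught. The sketch below uses a hypothetical helper class, CreateFactoryDemo, that returns null for malformed input; returning null is just one possible policy chosen for the illustration.

```java
import java.net.URI;

public class CreateFactoryDemo {

    // Returns the parsed URI, or null if URI.create( ) rejects the string.
    public static URI parseOrNull(String text) {
        try {
            return URI.create(text);
        } catch (IllegalArgumentException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        URI styles = parseOrNull(
            "ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet");
        System.out.println(styles.getUserInfo()); // anonymous:elharo@metalab.unc.edu
        System.out.println(styles.getHost());     // ftp.oreilly.com
        System.out.println(styles.getPort());     // 21
        System.out.println(parseOrNull(":broken")); // null
    }
}
```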
A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. The general format is:
scheme:scheme-specific-part#fragment
If the scheme is omitted, the URI reference is relative. If the fragment
identifier is omitted, the URI reference is a pure URI. The URI class
has getter methods that return these three parts of each
URI object. The getRawFoo( ) methods return the encoded forms of the
parts of the URI, while the equivalent getFoo( ) methods first decode
any percent-escaped characters and then return the decoded part:
public String getScheme( )
public String getSchemeSpecificPart( )
public String getRawSchemeSpecificPart( )
public String getFragment( )
public String getRawFragment( )
TIP: There's no getRawScheme( ) method because the URI specification requires that all scheme names be composed exclusively of URI-legal ASCII characters and does not allow percent escapes in scheme names.
These methods all return null if the particular
URI object does not have the relevant component:
for example, a relative URI without a scheme or an http URI without a
fragment identifier.
A URI that has a scheme is an
absolute URI. A URI without a scheme is
relative. The isAbsolute() method returns true if the
URI is absolute, false if it's relative:
public boolean isAbsolute( )
The details of the scheme-specific part vary depending on the type of
the scheme. For example, in a tel
URL, the scheme-specific part has the syntax of a telephone number.
However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a
particular hierarchical format divided into an
authority, a path, and a query string. The authority is further
divided into user info, host, and port. The isOpaque() method returns false if the URI is
hierarchical, true if it's not
hierarchical—that is, if it's opaque:
public boolean isOpaque( )
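The two tests are independent of each other, which a quick classification helper makes visible. The class name UriKinds and the wording of the descriptions are illustrative assumptions.

```java
import java.net.URI;

public class UriKinds {

    // Describes a URI as absolute or relative, and as opaque or hierarchical.
    public static String describe(String text) {
        URI u = URI.create(text);
        String absolute = u.isAbsolute() ? "absolute" : "relative";
        String shape = u.isOpaque() ? "opaque" : "hierarchical";
        return absolute + ", " + shape;
    }

    public static void main(String[] args) {
        System.out.println(describe("urn:isbn:1-565-92870-9"));   // absolute, opaque
        System.out.println(describe("http://www.example.com/"));  // absolute, hierarchical
        System.out.println(describe("images/logo.png"));          // relative, hierarchical
    }
}
```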
If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:
public String getAuthority( )
public String getFragment( )
public String getHost( )
public String getPath( )
public int getPort( )
public String getQuery( )
public String getUserInfo( )
These methods all return the decoded parts; in other words, percent
escapes, such as %3C, are changed into the characters they represent,
such as <. If you want the raw, encoded parts of the URI, there
are five parallel
getRawFoo() methods:
public String getRawAuthority( )
public String getRawFragment( )
public String getRawPath( )
public String getRawQuery( )
public String getRawUserInfo( )
Remember the URI class differs from the URI specification
in that non-ASCII characters such as é and ü
are never percent-escaped in the first place, and thus will still be
present in the strings returned by the
getRawFoo() methods unless the strings originally used to construct
the URI object were encoded.
TIP: There are no getRawPort( ) and getRawHost( ) methods because these components are always guaranteed to be made up of ASCII characters, at least for now. Internationalized domain names are coming, and may require this decision to be rethought in future versions of Java.
In the event that the specific URI does not contain this
information, the relevant methods return null. For instance, the URI
http://www.example.com has no user info, port, or query string.
(Its path is empty rather than absent, so getPath( ) returns an
empty string, not null.)
getPort( ) is the
single exception to the null rule. Since it's declared to return an
int, it can't return
null. Instead, it returns -1 to indicate an
omitted port.
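The difference between the decoded and raw getters, and the special handling of a missing port, can be checked with a sample URI. The URI used here and the class name RawVersusDecoded are invented for this sketch.

```java
import java.net.URI;

public class RawVersusDecoded {

    // A sample URI with escapes in the path and query, and no port or user info.
    public static final URI SAMPLE = URI.create("http://www.example.com/a%20b?q=%3C#x");

    public static void main(String[] args) {
        System.out.println(SAMPLE.getPath());     // /a b
        System.out.println(SAMPLE.getRawPath());  // /a%20b
        System.out.println(SAMPLE.getQuery());    // q=<
        System.out.println(SAMPLE.getRawQuery()); // q=%3C
        System.out.println(SAMPLE.getUserInfo()); // null
        System.out.println(SAMPLE.getPort());     // -1
    }
}
```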
For various technical reasons that don't have a lot
of practical impact, Java can't always initially
detect syntax errors in the authority component. The immediate
symptom of this failing is normally an inability to return the
individual parts of the authority: port, host, and user info. In this
event, you can call parseServerAuthority() to force the authority to
be reparsed:
public URI parseServerAuthority( ) throws URISyntaxException
The original URI does not change
(URI objects are immutable), but the
URI returned will have separate authority parts
for user info, host, and port. If the authority cannot be parsed, a
URISyntaxException is thrown.
Example 7-10 uses these methods to split URIs entered on the command line into their component parts. It's similar to Example 7-4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.
import java.net.*;
public class URISplitter {
public static void main(String args[]) {
for (int i = 0; i < args.length; i++) {
try {
URI u = new URI(args[i]);
System.out.println("The URI is " + u);
if (u.isOpaque( )) {
System.out.println("This is an opaque URI.");
System.out.println("The scheme is " + u.getScheme( ));
System.out.println("The scheme specific part is "
+ u.getSchemeSpecificPart( ));
System.out.println("The fragment ID is " + u.getFragment( ));
}
else {
System.out.println("This is a hierarchical URI.");
System.out.println("The scheme is " + u.getScheme( ));
try {
u = u.parseServerAuthority( );
System.out.println("The host is " + u.getHost( ));
System.out.println("The user info is " + u.getUserInfo( ));
System.out.println("The port is " + u.getPort( ));
}
catch (URISyntaxException ex) {
// Must be a registry based authority
System.out.println("The authority is " + u.getAuthority( ));
}
System.out.println("The path is " + u.getPath( ));
System.out.println("The query string is " + u.getQuery( ));
System.out.println("The fragment ID is " + u.getFragment( ));
} // end else
} // end try
catch (URISyntaxException ex) {
System.err.println(args[i] + " does not seem to be a URI.");
}
System.out.println( );
} // end for
} // end main
} // end URISplitter
Here's the result of running this against three of the URI examples in this section:
% java URISplitter tel:+1-800-9988-9938 \
http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \
urn:isbn:1-565-92870-9
The URI is tel:+1-800-9988-9938
This is an opaque URI.
The scheme is tel
The scheme specific part is +1-800-9988-9938
The fragment ID is null
The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc
This is a hierarchical URI.
The scheme is http
The host is www.xml.com
The user info is null
The port is -1
The path is /pub/a/2003/09/17/stax.html
The query string is null
The fragment ID is id=_hbc
The URI is urn:isbn:1-565-92870-9
This is an opaque URI.
The scheme is urn
The scheme specific part is isbn:1-565-92870-9
The fragment ID is null
The URI class has three methods for converting
back and forth between relative and absolute URIs.
public URI resolve(URI uri)
This method compares the
uri argument to this URI and
uses it to construct a new URI object that wraps
an absolute URI. For example, consider these three lines of code:
URI absolute = new URI("http://www.example.com/");
URI relative = new URI("images/logo.png");
URI resolved = absolute.resolve(relative);
After they've executed, resolved
contains the absolute URI
http://www.example.com/images/logo.png.
If the invoking URI does not contain an absolute
URI itself, the resolve( ) method resolves as much
of the URI as it can and returns a new relative URI object as a
result. For example, take these three statements:
URI top = new URI("javafaq/books/");
URI relative = new URI("jnp3/examples/07/index.html");
URI resolved = top.resolve(relative);
After they've executed, resolved
now contains the relative URI
javafaq/books/jnp3/examples/07/index.html with
no scheme or authority.
public URI resolve(String str)
This is a convenience method that
simply converts the string argument to a URI and then resolves it
against the invoking URI, returning a new URI object as the result.
That is, it's equivalent to
resolve(URI.create(str)). Using
this method, the previous two samples can be rewritten as:
URI absolute = new URI("http://www.example.com/");
URI resolved = absolute.resolve("images/logo.png");
URI top = new URI("javafaq/books/");
resolved = top.resolve("jnp3/examples/07/index.html");
public URI relativize(URI uri)
It's also possible
to reverse this procedure; that is, to go from an absolute URI to a
relative one. The relativize( ) method creates a
new URI object from the uri
argument that is relative to the invoking URI. The
argument is not changed. For example:
URI absolute = new URI("http://www.example.com/images/logo.png");
URI top = new URI("http://www.example.com/");
URI relative = top.relativize(absolute);
The URI object relative now
contains the relative URI images/logo.png.
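The resolve( ) and relativize( ) examples in this section can be collected into one small program to confirm the results. The class ResolveRelativizeDemo and its string-based helper methods are hypothetical conveniences for this sketch.

```java
import java.net.URI;

public class ResolveRelativizeDemo {

    // Resolves a reference against a base and returns the result as a string.
    public static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    // Expresses target relative to base and returns the result as a string.
    public static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        // Absolute base: the result is absolute
        System.out.println(resolve("http://www.example.com/", "images/logo.png"));
        // Relative base: the result is still relative
        System.out.println(resolve("javafaq/books/", "jnp3/examples/07/index.html"));
        // The reverse operation
        System.out.println(relativize("http://www.example.com/",
                                      "http://www.example.com/images/logo.png"));
    }
}
```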
The URI class has the usual batch of utility methods: equals( ),
hashCode( ), toString( ), and compareTo( ).
URIs are tested for equality pretty much as you'd expect. It's not a direct string comparison. Equal URIs must both either be hierarchical or opaque. The scheme and authority parts are compared without considering case. That is, http and HTTP are the same scheme, and www.example.com is the same authority as www.EXAMPLE.com. The rest of the URI is case-sensitive, except for hexadecimal digits used to escape illegal characters. Escapes are not decoded before comparing. http://www.example.com/A and http://www.example.com/%41 are unequal URIs.
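These equality rules can be spot-checked with a few comparisons. The helper class UriEqualityDemo is made up for this illustration.

```java
import java.net.URI;

public class UriEqualityDemo {

    // Parses both strings and compares the resulting URI objects.
    public static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Scheme and host are compared without regard to case
        System.out.println(sameUri("http://www.example.com/A", "HTTP://www.EXAMPLE.com/A")); // true
        // Escapes are not decoded before comparing
        System.out.println(sameUri("http://www.example.com/A", "http://www.example.com/%41")); // false
        // Hexadecimal digits inside an escape are case-insensitive
        System.out.println(sameUri("http://www.example.com/%3c", "http://www.example.com/%3C")); // true
    }
}
```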
The hashCode( ) method
is a usual hashCode( ) method, nothing special.
Equal URIs do have the same hash code and unequal URIs are fairly
unlikely to share the same hash code.
URIs can be ordered. The ordering is based on string comparison of the individual parts, in this sequence:
If the schemes are different, the schemes are compared, without considering case.
Otherwise, if the schemes are the same, a hierarchical URI is considered to be less than an opaque URI with the same scheme.
If both URIs are opaque URIs, they're ordered according to their scheme-specific parts.
If both the scheme and the opaque scheme-specific parts are equal, the URIs are compared by their fragments.
If both URIs are hierarchical, they're ordered according to their authority components, which are themselves ordered according to user info, host, and port, in that order.
If the schemes and the authorities are equal, the path is used to distinguish them.
If the paths are also equal, the query strings are compared.
If the query strings are equal, the fragments are compared.
URIs are not comparable to any type except themselves. Comparing a
URI to anything except another
URI causes a
ClassCastException.
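A few of the ordering rules can be illustrated directly with compareTo( ). In this sketch the class UriOrderingDemo and the foo scheme are invented; only the sign of each result is meaningful.

```java
import java.net.URI;

public class UriOrderingDemo {

    // Negative if a sorts before b, zero if they compare equal, positive otherwise.
    public static int compare(String a, String b) {
        return URI.create(a).compareTo(URI.create(b));
    }

    public static void main(String[] args) {
        // Different schemes: ordered by scheme
        System.out.println(compare("ftp://x/", "http://x/") < 0);   // true
        // Schemes are compared without considering case
        System.out.println(compare("HTTP://a/", "http://a/") == 0); // true
        // Same scheme: hierarchical sorts before opaque
        System.out.println(compare("foo:/bar", "foo:bar") < 0);     // true
    }
}
```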
The toString( ) method
returns an unencoded string form of the
URI. That is, characters like é and \
are not percent-escaped unless they were percent-escaped in the
strings used to construct this URI. Therefore, the
result of calling this method is not guaranteed to be a syntactically
correct URI. This form is sometimes useful for display to human
beings, but not for retrieval.
The toASCIIString( ) method returns an
encoded string form of the
URI. Characters like é and \ are always
percent-escaped whether or not they were originally escaped. This is
the string form of the URI you should use most of the time. Even if
the form returned by toString( ) is more legible
for humans, they may still copy and paste it into areas that are not
expecting an illegal URI. toASCIIString( ) always
returns a syntactically correct URI.
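The contrast between the two string forms shows up as soon as a URI contains a non-ASCII character. This sketch uses an invented example.com URI and a hypothetical class name, StringFormsDemo.

```java
import java.net.URI;

public class StringFormsDemo {

    // The form intended for display: non-ASCII characters are left alone.
    public static String display(String text) {
        return URI.create(text).toString();
    }

    // The form intended for retrieval: non-ASCII characters are percent-escaped.
    public static String ascii(String text) {
        return URI.create(text).toASCIIString();
    }

    public static void main(String[] args) {
        String cafe = "http://www.example.com/caf\u00e9"; // ends in é
        System.out.println(display(cafe));
        System.out.println(ascii(cafe)); // http://www.example.com/caf%C3%A9
    }
}
```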
Many systems access the Web and sometimes other non-HTTP parts of the Internet through proxy servers. A proxy server receives a request for a remote server from a local client. The proxy server makes the request to the remote server and forwards the result back to the local client. Sometimes this is done for security reasons, such as to prevent remote hosts from learning private details about the local network configuration. Other times it's done to prevent users from accessing forbidden sites by filtering outgoing requests and limiting which sites can be viewed. For instance, an elementary school might want to block access to http://www.playboy.com. And still other times it's done purely for performance, to allow multiple users to retrieve the same popular documents from a local cache rather than making repeated downloads from the remote server.
Java programs based on the URL class can work
through most common proxy servers and protocols. Indeed, this is one
reason you might want to choose to use the URL
class rather than rolling your own HTTP or other client on top of raw
sockets.
For basic operations, all you have
to do is set a few system properties to point to the addresses of
your local proxy servers. If you are using a pure HTTP proxy, set
http.proxyHost to the domain name or the IP
address of your proxy server and http.proxyPort to
the port of the proxy server (the default is 80). There are several
ways to do this, including calling System.setProperty() from within your Java code or using the -D options when
launching the program. This example sets the proxy server to
192.168.254.254 and the port to 9000:
% java -Dhttp.proxyHost=192.168.254.254 -Dhttp.proxyPort=9000 com.domain.Program
If you want to exclude a host from being proxied and connect directly
instead, set the http.nonProxyHosts system
property to its hostname or IP address. To exclude multiple hosts,
separate their names by vertical bars. For example, this code
fragment proxies everything except
java.oreilly.com and
xml.oreilly.com:
System.setProperty("http.proxyHost", "192.168.254.254");
System.setProperty("http.proxyPort", "9000");
System.setProperty("http.nonProxyHosts", "java.oreilly.com|xml.oreilly.com");
You can also use an asterisk as a wildcard to indicate that all the hosts within a particular domain or subdomain should not be proxied. For example, to proxy everything except hosts in the oreilly.com domain:
% java -Dhttp.proxyHost=192.168.254.254 -Dhttp.nonProxyHosts=*.oreilly.com com.domain.Program
If you are using an FTP proxy server, set the
ftp.proxyHost, ftp.proxyPort,
and ftp.nonProxyHosts properties in the same way.
Java does not support any other application layer proxies, but if
you're using a transport layer SOCKS proxy for all
TCP connections, you can identify it with the
socksProxyHost and
socksProxyPort system properties. Java does not
provide an option for nonproxying with SOCKS. It's
an all-or-nothing decision.
Java 1.5 allows more fine-grained
control of proxy servers from within a Java program. Specifically,
this allows you to choose different proxy servers for different
remote hosts. The proxies themselves are represented by instances of
the java.net.Proxy class. There are still only
three kinds of proxies, HTTP, SOCKS, and direct
connections (no proxy at all), represented by three constants in the
Proxy.Type enum:
Proxy.Type.DIRECT
Proxy.Type.HTTP
Proxy.Type.SOCKS
Besides its type, the other important piece of information about a
proxy is its address and port, given as a
SocketAddress object. For example, this code
fragment creates a Proxy object representing an
HTTP proxy server on port 80 of
proxy.example.com:
SocketAddress address = new InetSocketAddress("proxy.example.com", 80);
Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
Although there are only three kinds of proxy objects, there can be many proxies of the same type for different proxy servers on different hosts.
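The following sketch builds one proxy of each kind so the three types can be compared side by side. The class name ProxyKinds, the host socks.example.com, and port 1080 are assumptions made for illustration. It uses InetSocketAddress.createUnresolved() so that no DNS lookup happens when the address is created, and it uses the predefined Proxy.NO_PROXY constant for the direct case, since the Proxy constructor rejects Proxy.Type.DIRECT:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.SocketAddress;

public class ProxyKinds {

    /** Builds an HTTP or SOCKS proxy without performing a DNS lookup. */
    public static Proxy proxyFor(Proxy.Type type, String host, int port) {
        SocketAddress address = InetSocketAddress.createUnresolved(host, port);
        return new Proxy(type, address);
    }

    public static void main(String[] args) {
        Proxy web = proxyFor(Proxy.Type.HTTP, "proxy.example.com", 80);
        Proxy socks = proxyFor(Proxy.Type.SOCKS, "socks.example.com", 1080);
        Proxy direct = Proxy.NO_PROXY; // a direct connection has no address

        for (Proxy p : new Proxy[] {web, socks, direct}) {
            System.out.println(p.type() + " -> " + p.address());
        }
    }
}
```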
Each running Java 1.5 virtual machine
has a single java.net.ProxySelector object it uses
to locate the proxy server for different connections. The default
ProxySelector merely inspects the various system
properties and the URL's protocol to decide how to
connect to different hosts. However, you can install your own
subclass of ProxySelector in place of the default
selector and use it to choose different proxies based on protocol,
host, path, time of day, or other criteria.
The key to this class is the abstract select( )
method:
public abstract List<Proxy> select(URI uri)
Java passes this method a URI object (not a
URL object) representing the host to which a
connection is needed. For a connection made with the URL class, this
object typically has the form
http://www.example.com/ or
ftp://ftp.example.com/pub/files/, or some such.
For a pure TCP connection made with the Socket class, this URI will
have the form socket://host:port, for instance,
socket://www.example.com:80. The
ProxySelector object then chooses the right
proxies for this type of object and returns them in a
List<Proxy>.
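To see what the currently installed selector returns for each of these URI forms, you can call select() yourself. This is only a sketch: the class name ShowProxies is hypothetical, and the output depends on which proxy system properties are set in the virtual machine where it runs:

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ShowProxies {

    /** Asks the current default ProxySelector which proxies it would use. */
    public static List<Proxy> proxiesFor(String uri) {
        return ProxySelector.getDefault().select(URI.create(uri));
    }

    public static void main(String[] args) {
        String[] targets = {
            "http://www.example.com/",
            "ftp://ftp.example.com/pub/files/",
            "socket://www.example.com:80"
        };
        for (String target : targets) {
            System.out.println(target + " -> " + proxiesFor(target));
        }
    }
}
```

Running it once with no options and once with -Dhttp.proxyHost set is a quick way to check how the properties described earlier affect the result.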
The second abstract method in this class you must implement is
connectFailed( ):
public abstract void connectFailed(URI uri, SocketAddress address, IOException ex)
This is a callback method used to warn a program that the proxy
server isn't actually making the connection. Example 7-11 demonstrates with a
ProxySelector that attempts to use the proxy
server at proxy.example.com for all HTTP
connections unless the proxy server has previously failed to resolve
a connection to a particular URL. In that case, it suggests a direct
connection instead.
import java.net.*;
import java.util.*;
import java.io.*;
public class LocalProxySelector extends ProxySelector {
  // URIs the proxy has previously failed to reach
  private List<URI> failed = new ArrayList<URI>( );
  public List<Proxy> select(URI uri) {
    List<Proxy> result = new ArrayList<Proxy>( );
    // Connect directly for non-HTTP URIs and for URIs the proxy could not reach
    if (failed.contains(uri)
        || !"http".equalsIgnoreCase(uri.getScheme( ))) {
      result.add(Proxy.NO_PROXY);
    }
    else {
      SocketAddress proxyAddress
          = new InetSocketAddress("proxy.example.com", 8000);
      Proxy proxy = new Proxy(Proxy.Type.HTTP, proxyAddress);
      result.add(proxy);
    }
    return result;
  }
  // Remember the failure so the next request for this URI goes direct
  public void connectFailed(URI uri, SocketAddress address, IOException ex) {
    failed.add(uri);
  }
}
As I already said, each running virtual machine has exactly one
ProxySelector. To change the
ProxySelector, pass the new selector to the static
ProxySelector.setDefault( ) method, like so:
ProxySelector selector = new LocalProxySelector( );
ProxySelector.setDefault(selector);
From this point forward, all connections opened by that virtual
machine will ask the ProxySelector for the right
proxy to use. You normally shouldn't use this in
code running in a shared environment. For instance, you
wouldn't change the ProxySelector
in a servlet because that would change the
ProxySelector for all servlets running in the same
container.
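If you do need a custom selector for a limited piece of work, one defensive pattern is to remember the previous selector and put it back when you are done. The sketch below assumes hypothetical names (TemporarySelector, DirectSelector, runWith). The setting is still global while the task runs, so other threads in the same virtual machine see the temporary selector too; this narrows the window but does not make the change safe for a shared container:

```java
import java.io.IOException;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.Collections;
import java.util.List;

public class TemporarySelector {

    /** A trivial selector that always asks for a direct connection. */
    public static class DirectSelector extends ProxySelector {
        @Override
        public List<Proxy> select(URI uri) {
            return Collections.singletonList(Proxy.NO_PROXY);
        }

        @Override
        public void connectFailed(URI uri, SocketAddress address, IOException ex) {
            // Nothing to record: no proxy was involved.
        }
    }

    /** Runs the task with the given selector installed, then restores the old one. */
    public static void runWith(ProxySelector temporary, Runnable task) {
        ProxySelector previous = ProxySelector.getDefault();
        ProxySelector.setDefault(temporary);
        try {
            task.run();
        } finally {
            ProxySelector.setDefault(previous); // restore even if the task throws
        }
    }

    public static void main(String[] args) {
        runWith(new DirectSelector(), () ->
            System.out.println(ProxySelector.getDefault()
                .select(URI.create("http://www.example.com/"))));
    }
}
```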
The URL class makes it easy for Java applets and
applications to communicate with server-side programs such as CGIs,
servlets, PHP pages, and others that use the GET
method. (Server-side programs that use the POST
method require the URLConnection class and are
discussed in Chapter 15.) All you need to know
is what combination of names and values the program expects to
receive, and cook up a URL with a query string that provides the
requisite names and values. All names and values must be
x-www-form-urlencoded, as by the URLEncoder.encode() method discussed earlier in this chapter.
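As a rough sketch of that step, the following helper encodes each name and value with URLEncoder.encode() and joins the pairs with ampersands. The class name QueryBuilder and the field names search and note are made up for illustration, and the choice of UTF-8 is an assumption; a given server may expect a different encoding:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryBuilder {

    /** Encodes each name and value, then joins the pairs as name=value&name=value. */
    public static String build(Map<String, String> pairs) {
        StringBuilder query = new StringBuilder();
        try {
            for (Map.Entry<String, String> pair : pairs.entrySet()) {
                if (query.length() > 0) query.append('&');
                query.append(URLEncoder.encode(pair.getKey(), "UTF-8"));
                query.append('=');
                query.append(URLEncoder.encode(pair.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 support is required of every Java implementation.
            throw new IllegalStateException(ex);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        pairs.put("search", "java network programming");
        pairs.put("note", "R&D = fun?");
        System.out.println(build(pairs));
        // prints search=java+network+programming&note=R%26D+%3D+fun%3F
    }
}
```

Encoding the names and values separately, before adding the = and & separators, is what keeps a literal ampersand or equals sign inside a value from being mistaken for a delimiter.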
There are a number of ways to determine the exact syntax for a query string that talks to a particular program. If you've written the server-side program yourself, you already know the name-value pairs it expects. If you've installed a third-party program on your own server, the documentation for that program should tell you what it expects.
On the other hand, if you're talking to a program on a third-party server, matters are a little trickier. You can always ask people at the remote server to provide you with the specifications for talking to their site. However, even if they don't mind doing this, there's probably no single person whose job description includes "telling third-party hackers with whom we have no business relationship exactly how to access our servers." Thus, unless you happen upon a particularly friendly or bored individual who has nothing better to do with their time except write long emails detailing exactly how to access their server, you're going to have to do a little reverse engineering.
TIP: This is beginning to change. A number of web sites have realized the value of opening up their systems to third-party developers and have begun publishing developers' kits that provide detailed information on how to construct URLs to access their services. Sites like Safari and Amazon that offer RESTful, URL-based interfaces are easily accessed through the URL class. SOAP-based services like eBay's and Google's are much more difficult to work with.
Many programs are designed to process
form input. If this is the case, it's
straightforward to figure out what input the program expects. The
method the form uses should be the value of the
METHOD attribute of the FORM
element. This value should be either GET, in which
case you use the process described here, or POST,
in which case you use the process described in Chapter 15. The part of the URL that precedes the
query string is given by the value of the ACTION
attribute of the FORM element. Note that this may
be a relative URL, in which case you'll need to
determine the corresponding absolute URL. Finally, the name-value
pairs are simply the NAME attributes of the
INPUT elements, except for any
INPUT elements whose TYPE
attribute has the value submit.
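Resolving a relative ACTION against the address of the page that contains the form can be sketched with URI.resolve(). The class name FormAction and the example.com addresses are hypothetical. The sketch also assumes the page has no BASE element, which would change the base URL:

```java
import java.net.URI;

public class FormAction {

    /** Resolves a form's ACTION value against the URL of the page containing it. */
    public static URI resolveAction(String pageUrl, String action) {
        return URI.create(pageUrl).resolve(action);
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/forms/index.html";
        System.out.println(resolveAction(page, "../cgi-bin/search"));
        // prints http://www.example.com/cgi-bin/search
        System.out.println(resolveAction(page, "lookup"));
        // prints http://www.example.com/forms/lookup
        System.out.println(resolveAction(page, "http://www.google.com/search"));
        // prints http://www.google.com/search (already absolute, so unchanged)
    }
}
```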
For example, consider this HTML form for the local search engine on
my Cafe con Leche site. You can see that it uses the
GET method. The program that processes the form is
accessed via the URL http://www.google.com/search. It has four
separate name-value pairs, three of which have default values:
<form name="search" action="http://www.google.com/search" method="get">
<input name="q" />
<input type="hidden" value="cafeconleche.org" name="domains" />
<input type="hidden" name="sitesearch" value="cafeconleche.org" />
<input type="hidden" name="sitesearch2" value="cafeconleche.org" />
<br />
<input type="image" height="22" width="55"
src="images/search_blue.gif" alt="search" border="0"
name="search-image" />
</form>
The type of the INPUT field
doesn't matter—for instance, it
doesn't matter if it's a set of
checkboxes, a pop-up list, or a text field—only the name of
each INPUT field and the value you give it is
significant. The single exception is a submit input that tells the
web browser when to send the data but does not give the server any
extra information. In some cases, you may find hidden
INPUT fields that must have particular required
default values. This form has three hidden INPUT
fields.
In some cases, the program you're talking to may not be able to handle arbitrary text strings for values of particular inputs. However, since the form is meant to be read and filled in by human beings, it should provide sufficient clues to figure out what input is expected; for instance, that a particular field is supposed to be a two-letter state abbreviation or a phone number.
A program that doesn't respond to a form is much harder to reverse engineer. For example, at http://www.ibiblio.org/nywc/bios.phtml, you'll find a lot of links to PHP pages that talk to a database to retrieve a list of musical works by a particular composer. However, there's no form anywhere that corresponds to this program. It's all done by hardcoded URLs. In this case, the best you can do is look at as many of those URLs as possible and see whether you can guess what the server expects. If the designer hasn't tried to be too devious, this information isn't hard to figure out. For example, these URLs are all found on that page:
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Anderson&first=Beth&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Austin&first=Dorothea&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Bliss&first=Marilyn&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Hart&first=Jane&middle=Smith
Looking at these, you can guess that this particular program expects three inputs named first, middle, and last, with values that consist of the first, middle, and last names of a composer, respectively. Sometimes the inputs may not have such obvious names. In this case, you have to do some experimenting, first copying some existing values and then tweaking them to see what values are and aren't accepted. You don't need to do this in a Java program. You can simply edit the URL in the Address or Location bar of your web browser window.
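When there are many such URLs to compare, a few lines of Java can list the names and values for you. This is a sketch under stated assumptions: the class name QueryInspector is hypothetical, values are decoded as UTF-8, a repeated name keeps only its last value, and URI.create() will reject a URL that contains illegal characters such as unencoded spaces:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryInspector {

    /** Splits a URL's query string into decoded name-value pairs, in order. */
    public static Map<String, String> parse(String url) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) return pairs;
        try {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                // A pair with no = sign is treated as a name with an empty value.
                String name = (eq < 0) ? pair : pair.substring(0, eq);
                String value = (eq < 0) ? "" : pair.substring(eq + 1);
                pairs.put(URLDecoder.decode(name, "UTF-8"),
                          URLDecoder.decode(value, "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 support is required of every Java implementation.
            throw new IllegalStateException(ex);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("http://www.ibiblio.org/nywc/"
            + "compositionsbycomposer.phtml?last=Hart&first=Jane&middle=Smith"));
        // prints {last=Hart, first=Jane, middle=Smith}
    }
}
```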
TIP: The likelihood that other hackers may experiment with your own server-side programs in such a fashion is a good reason to make them extremely robust against unexpected input.
Regardless of how you determine the set of name-value pairs the
server expects, communicating with it once you know them is simple.
All you have to do is create a query string that includes the
necessary name-value pairs, then form a URL that includes that query
string. Send the query string to the server and read its response
using the same methods you use to connect to a server and retrieve a
static HTML page. There's no special protocol to
follow once the URL is constructed. (There is a special protocol to
follow for the POST method, however, which is why
discussion of that method will have to wait until Chapter 15.)
To demonstrate this procedure, let's write a very simple command-line program to look up topics in the Netscape Open Directory (http://dmoz.org/). This site is shown in Figure 7-3 and it has the advantage of being really simple.

Figure 7-3. The basic user interface for the Open Directory
The basic Open Directory interface is a simple
form with one input field named search; input
typed in this field is sent to a CGI program at http://search.dmoz.org/cgi-bin/search, which
does the actual search. The HTML for the form looks like this:
<form accept-charset="UTF-8"
action="http://search.dmoz.org/cgi-bin/search" method="GET">
<input size=30 name=search>
<input type=submit value="Search">
<a href="http://search.dmoz.org/cgi-bin/search?a.x=0">
<small><i>advanced</i></small></a>
</form>
There are only two input fields in this form: the Submit button and a text field named search. Thus, to submit a search request to the Open Directory, you just need to collect the search string, encode it in a query string, and send it to http://search.dmoz.org/cgi-bin/search. For example, to search for "java", you would open a connection to the URL http://search.dmoz.org/cgi-bin/search?search=java and read the resulting input stream. Example 7-12 does exactly this.
import com.macfaq.net.*;
import java.net.*;
import java.io.*;
public class DMoz {
  public static void main(String[] args) {
    // Join the command-line arguments into a single search string
    String target = "";
    for (int i = 0; i < args.length; i++) {
      target += args[i] + " ";
    }
    target = target.trim( );
    // QueryString encodes the name and the value
    QueryString query = new QueryString("search", target);
    try {
      URL u = new URL("http://search.dmoz.org/cgi-bin/search?" + query);
      InputStream in = new BufferedInputStream(u.openStream( ));
      InputStreamReader theHTML = new InputStreamReader(in);
      int c;
      while ((c = theHTML.read( )) != -1) {
        System.out.print((char) c);
      }
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }
}
Of course, a lot more effort could be expended on parsing and
displaying the results. But notice how simple the code was to talk to
this server. Aside from the funky-looking URL and the slightly
greater likelihood that some pieces of it need to be
x-www-form-urlencoded, talking to a server-side program that uses
GET is no harder than retrieving any other HTML
page.
Many popular sites, such as
The Wall Street Journal,
require a username and password for access. Some sites, such as the
W3C member pages, implement this correctly through HTTP
authentication. Others, such as the Java Developer Connection,
implement it incorrectly through cookies and HTML forms.
Java's
URL
class can access sites that use HTTP authentication, although
you'll of course need to tell it what username and
password to use. Java does not provide support
for sites that use nonstandard, cookie-based authentication, in part
because Java doesn't really support cookies in Java
1.4 and earlier, in part because this requires parsing and submitting
HTML forms, and, lastly, because cookies are completely contrary to
the architecture of the Web. (Java 1.5 does add some cookie support,
which we'll discuss in the next chapter. However, it
does not treat authentication cookies differently than any other
cookies.) You can provide this support yourself using the
URLConnection class to read and write the HTTP
headers where cookies are set and returned. However, doing so is
decidedly nontrivial and often requires custom code for each site you
want to connect to. It's really hard to do short of
implementing a complete web browser with full HTML forms and cookie
support. Accessing sites protected by standard HTTP authentication
is much easier.
The
java.net package includes an
Authenticator class you can use to provide a
username and password for sites that protect themselves using HTTP
authentication:
public abstract class Authenticator extends Object // Java 1.2
Since Authenticator is an abstract class, you must
subclass it. Different subclasses may retrieve the information in
different ways. For example, a character mode program might just ask
the user to type the username and password on
System.in. A GUI program would likely put up a
dialog box like the one shown in Figure 7-4. An automated robot might
read the username out of an encrypted file.

Figure 7-4. An authentication dialog
To make the URL class use the subclass, install it
as the default authenticator by passing it to the static
Authenticator.setDefault() method:
public static void setDefault(Authenticator a)
For example, if you've written an
Authenticator subclass named
DialogAuthenticator, you'd
install it like this:
Authenticator.setDefault(new DialogAuthenticator( ));
You only need to do this once. From this point forward, when the
URL class needs a username and password, it will
ask the DialogAuthenticator using the static
Authenticator.requestPasswordAuthentication() method:
public static PasswordAuthentication requestPasswordAuthentication(
InetAddress address, int port, String protocol, String prompt, String scheme)
throws SecurityException
The address argument is the host for which
authentication is required. The port argument is
the port on that host, and the protocol argument
is the application layer protocol by which the site is being
accessed. The HTTP server provides the prompt.
It's typically the name of the realm for which
authentication is required. (Some large web servers such as
www.ibiblio.org have multiple realms, each of
which requires different usernames and passwords.) The
scheme is the authentication scheme
being used. (Here the word scheme is not being
used as a synonym for protocol. Rather it is an
HTTP authentication scheme, typically basic.)
Untrusted applets are not allowed to ask the user for a name and
password. Trusted applets can do so, but only if they possess the
NetPermission named requestPasswordAuthentication. Otherwise,
Authenticator.requestPasswordAuthentication( )
throws a SecurityException.
The Authenticator subclass must override the
getPasswordAuthentication( ) method. Inside this
method, you collect the username and password from the user or some
other source and return it as an instance of the
java.net.PasswordAuthentication class:
protected PasswordAuthentication getPasswordAuthentication( )
If you don't want to authenticate this request,
return null, and Java will tell the server it
doesn't know how to authenticate the connection. If
you submit an incorrect username or password, Java will call
getPasswordAuthentication( ) again to give you
another chance to provide the right data. You normally have five
tries to get the username and password correct; after that,
openStream( ) throws a
ProtocolException.
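For a character mode program, a subclass along the following lines might be enough. It is a sketch, not a recommended implementation: the class name ConsoleAuthenticator is hypothetical, the reader and output stream are passed in so the class can be exercised without a real console, and the password is echoed as it is typed. Where a console is available, System.console().readPassword() in Java 6 and later can avoid the echo. Returning null on end of input or an I/O error declines to authenticate, as described above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class ConsoleAuthenticator extends Authenticator {

    private final BufferedReader in;
    private final PrintStream out;

    /** Reads from System.in and prompts on System.out. */
    public ConsoleAuthenticator() {
        this(new BufferedReader(new InputStreamReader(System.in)), System.out);
    }

    public ConsoleAuthenticator(BufferedReader in, PrintStream out) {
        this.in = in;
        this.out = out;
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        try {
            out.print("Username: ");
            out.flush();
            String username = in.readLine();
            if (username == null) return null; // end of input: decline

            out.print("Password: ");
            out.flush();
            String password = in.readLine();
            if (password == null) return null;

            return new PasswordAuthentication(username, password.toCharArray());
        } catch (IOException ex) {
            return null; // could not read credentials: decline
        }
    }

    public static void main(String[] args) {
        Authenticator.setDefault(new ConsoleAuthenticator());
        // Later URL connections that need HTTP authentication will prompt here.
    }
}
```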
Usernames and passwords are cached within the same virtual machine
session. Once you set the correct password for a realm, you
shouldn't be asked for it again unless
you've explicitly deleted the password by zeroing
out the char array that contains it.
You can get more details about the request by invoking any of these
methods inherited from the
Authenticator
superclass:
protected final InetAddress getRequestingSite( )
protected final int getRequestingPort( )
protected final String getRequestingProtocol( )
protected final String getRequestingPrompt( )
protected final String getRequestingScheme( )
protected final String getRequestingHost( ) // Java 1.4
These methods either return the information as given in the last call
to requestPasswordAuthentication( ) or return
null if that information is not available.
(getRequestingPort( ) returns -1 if the port
isn't available.) The last method,
getRequestingHost( ), is only available in Java
1.4 and later; in earlier releases you can call
getRequestingSite( ).getHostName( ) instead.
Java 1.5 adds two more methods to this class:
protected URL getRequestingURL( ) // Java 1.5
protected Authenticator.RequestorType getRequestorType( )
The getRequestingURL( ) method returns the
complete URL for which authentication has been requested—an
important detail if a site uses different names and passwords for
different files. The getRequestorType( ) method
returns one of the two named constants
Authenticator.RequestorType.PROXY or
Authenticator.RequestorType.SERVER to indicate
whether the server or the proxy server is requesting the
authentication.
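These accessors let an Authenticator decide whether a request is one it should answer at all. The sketch below hands out one fixed set of credentials, and only when an origin server on one particular host asks for them; every other request gets null. The class name HostScopedAuthenticator and the host, username, and password values are hypothetical. The main() method calls the eight-argument requestPasswordAuthentication() overload directly so the behavior can be seen without a network connection:

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class HostScopedAuthenticator extends Authenticator {

    private final String host;
    private final String username;
    private final char[] password;

    public HostScopedAuthenticator(String host, String username, char[] password) {
        this.host = host;
        this.username = username;
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Never send these credentials to a proxy server.
        if (getRequestorType() != Authenticator.RequestorType.SERVER) return null;
        // Only answer for the one host the credentials belong to.
        if (!host.equalsIgnoreCase(getRequestingHost())) return null;
        return new PasswordAuthentication(username, password);
    }

    public static void main(String[] args) {
        Authenticator.setDefault(new HostScopedAuthenticator(
            "www.example.com", "elharo", "secret".toCharArray()));

        PasswordAuthentication forServer = Authenticator.requestPasswordAuthentication(
            "www.example.com", null, 80, "http", "realm", "basic",
            null, Authenticator.RequestorType.SERVER);
        PasswordAuthentication forProxy = Authenticator.requestPasswordAuthentication(
            "www.example.com", null, 80, "http", "realm", "basic",
            null, Authenticator.RequestorType.PROXY);

        System.out.println("server request answered: " + (forServer != null));
        System.out.println("proxy request answered: " + (forProxy != null));
    }
}
```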
PasswordAuthentication is a very simple final class that
supports two read-only properties: username and password. The
username is a String. The password is a
char array so that the password can be erased when
it's no longer needed. A String
would have to wait to be garbage collected before it could be erased,
and even then it might still exist somewhere in memory on the local
system, possibly even on disk if the block of memory that contained
it had been swapped out to virtual memory at one point. Both username
and password are set in the constructor:
public PasswordAuthentication(String userName, char[] password)
Each is accessed via a getter method:
public String getUserName( )
public char[] getPassword( )
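Erasing the password might look like the sketch below. It relies on getPassword() returning a reference to the stored array rather than a copy, which is how the API documentation describes it. Because the object may hold its own copy of the array you passed to the constructor, the sketch clears both. The class name PasswordWiper and the sample credentials are hypothetical:

```java
import java.net.PasswordAuthentication;
import java.util.Arrays;

public class PasswordWiper {

    /** Overwrites the password held by the given object with zeros. */
    public static void wipe(PasswordAuthentication auth) {
        Arrays.fill(auth.getPassword(), '\0');
    }

    public static void main(String[] args) {
        char[] typed = {'s', 'e', 'c', 'r', 'e', 't'};
        PasswordAuthentication auth = new PasswordAuthentication("elharo", typed);

        // Clear the caller's array as soon as it has been handed over.
        Arrays.fill(typed, '\0');

        // ... use auth to authenticate ...

        // Clear the array held by the PasswordAuthentication object.
        wipe(auth);
        System.out.println("first char code after wipe: " + (int) auth.getPassword()[0]);
    }
}
```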
One useful tool for asking users for their passwords in a more or
less secure fashion is the
JPasswordField component from Swing:
public class JPasswordField extends JTextField
This lightweight component behaves almost exactly like a text field. However, anything the user types into it is echoed as an asterisk. This way, the password is safe from anyone looking over the user's shoulder at what's being typed on the screen.
JPasswordField also stores the passwords as a
char array so that when you're
done with the password you can overwrite it with zeros. It provides
the getPassword( ) method to
return this:
public char[] getPassword( )
Otherwise, you mostly use the methods it inherits from the
JTextField superclass. Example 7-13 demonstrates a Swing-based
Authenticator subclass that brings up a dialog to
ask the user for his username and password. Most of this code handles
the GUI. A JPasswordField collects the password
and a simple JTextField retrieves the username. Figure 7-4 showed the rather simple dialog box this produces.
package com.macfaq.net;
import java.net.*;
import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
public class DialogAuthenticator extends Authenticator {
  private JDialog passwordDialog;
  private JLabel mainLabel
      = new JLabel("Please enter username and password: ");
  private JLabel userLabel = new JLabel("Username: ");
  private JLabel passwordLabel = new JLabel("Password: ");
  private JTextField usernameField = new JTextField(20);
  private JPasswordField passwordField = new JPasswordField(20);
  private JButton okButton = new JButton("OK");
  private JButton cancelButton = new JButton("Cancel");
  public DialogAuthenticator( ) {
    this("", new JFrame( ));
  }
  public DialogAuthenticator(String username) {
    this(username, new JFrame( ));
  }
  public DialogAuthenticator(JFrame parent) {
    this("", parent);
  }
  public DialogAuthenticator(String username, JFrame parent) {
    // A modal dialog blocks the caller until the user responds
    this.passwordDialog = new JDialog(parent, true);
    Container pane = passwordDialog.getContentPane( );
    pane.setLayout(new GridLayout(4, 1));
    pane.add(mainLabel);
    JPanel p2 = new JPanel( );
    p2.add(userLabel);
    p2.add(usernameField);
    usernameField.setText(username);
    pane.add(p2);
    JPanel p3 = new JPanel( );
    p3.add(passwordLabel);
    p3.add(passwordField);
    pane.add(p3);
    JPanel p4 = new JPanel( );
    p4.add(okButton);
    p4.add(cancelButton);
    pane.add(p4);
    passwordDialog.pack( );
    // Pressing Return in either field is the same as clicking OK
    ActionListener al = new OKResponse( );
    okButton.addActionListener(al);
    usernameField.addActionListener(al);
    passwordField.addActionListener(al);
    cancelButton.addActionListener(new CancelResponse( ));
  }
  private void show( ) {
    String prompt = this.getRequestingPrompt( );
    if (prompt == null) {
      // getRequestingSite( ) may return null, so check before asking for the name
      InetAddress address = this.getRequestingSite( );
      String site = (address == null) ? null : address.getHostName( );
      String protocol = this.getRequestingProtocol( );
      int port = this.getRequestingPort( );
      if (site != null && protocol != null) {
        prompt = protocol + "://" + site;
        if (port > 0) prompt += ":" + port;
      }
      else {
        prompt = "";
      }
    }
    mainLabel.setText("Please enter username and password for "
        + prompt + ": ");
    passwordDialog.pack( );
    // setVisible( ) replaces the deprecated show( ) and hide( ) methods
    passwordDialog.setVisible(true);
  }
  PasswordAuthentication response = null;
  class OKResponse implements ActionListener {
    public void actionPerformed(ActionEvent e) {
      passwordDialog.setVisible(false);
      // The password is returned as an array of
      // chars for security reasons.
      char[] password = passwordField.getPassword( );
      String username = usernameField.getText( );
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = new PasswordAuthentication(username, password);
    }
  }
  class CancelResponse implements ActionListener {
    public void actionPerformed(ActionEvent e) {
      passwordDialog.setVisible(false);
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = null;
    }
  }
  public PasswordAuthentication getPasswordAuthentication( ) {
    this.show( );
    return this.response;
  }
}
Example 7-14 is a revised
SourceViewer program that asks the user for a name
and password using the DialogAuthenticator class.
import java.net.*;
import java.io.*;
import com.macfaq.net.DialogAuthenticator;
public class SecureSourceViewer {
  public static void main (String args[]) {
    Authenticator.setDefault(new DialogAuthenticator( ));
    for (int i = 0; i < args.length; i++) {
      try {
        // Open the URL for reading
        URL u = new URL(args[i]);
        InputStream in = u.openStream( );
        // buffer the input to increase performance
        in = new BufferedInputStream(in);
        // chain the InputStream to a Reader
        Reader r = new InputStreamReader(in);
        int c;
        while ((c = r.read( )) != -1) {
          System.out.print((char) c);
        }
      }
      catch (MalformedURLException ex) {
        // Report the argument that actually failed to parse
        System.err.println(args[i] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      // print a blank line to separate pages
      System.out.println( );
    } // end for
    // Since we used the AWT, we have to explicitly exit.
    System.exit(0);
  } // end main
} // end SecureSourceViewer
Elliotte Rusty Harold is a noted writer and programmer, both on and off the Internet. His previous books include "Java Network Programming", Third Edition, "XML in a Nutshell", Third Edition, and "Java I/O", all from O'Reilly.
Copyright © 2009 O'Reilly Media, Inc.