URLs and URIs, Proxies and Passwords
The URI Class
A URI is a generalization of a URL that
includes not only Uniform Resource Locators but also Uniform Resource
Names (URNs). Most URIs used in practice are URLs, but most
specifications and standards such as XML are defined in terms of
URIs. In Java 1.4 and later, URIs are represented by the
java.net.URI class. This class differs from the
java.net.URL class in three important ways:
- The URI class is purely about identification of resources and parsing of URIs. It provides no methods to retrieve a representation of the resource identified by its URI.
- The URI class is more conformant to the relevant specifications than the URL class.
- A URI object can represent a relative URI. The URL class absolutizes all URIs before storing them.
In brief, a URL object is a representation of an
application layer protocol for network retrieval, whereas a
URI object is purely for string parsing and
manipulation. The URI class has no network
retrieval capabilities. The URL class has some
string parsing methods, such as getFile( ) and
getRef( ), but many of these are broken and
don't always behave exactly as the relevant
specifications say they should. Assuming you're
using Java 1.4 or later and therefore have a choice, you should use
the URL class when you want to download the
content of a URL and the URI class when you want
to use the URI for identification rather than retrieval, for
instance, to represent an XML namespace URI. In some cases when you
need to do both, you may convert from a URI to a
URL with the toURL( ) method,
and in Java 1.5 you can also convert from a URL to
a URI using the toURI( ) method
of the URL class.
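The conversions described above can be sketched as follows. This is a minimal illustration, not code from the original text: the class name UriUrlConversion, its helper methods, and the relative URI images/logo.png are hypothetical. It assumes a standard JDK in which an http protocol handler is installed and no handler exists for urn.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class used only for illustration.
class UriUrlConversion {

    // Returns the URL for an absolute URI whose scheme has a protocol
    // handler, or null if the conversion is not possible.
    static URL toUrlOrNull(URI uri) {
        try {
            return uri.toURL();
        } catch (IllegalArgumentException ex) {
            return null; // the URI is relative
        } catch (MalformedURLException ex) {
            return null; // no protocol handler for this scheme
        }
    }

    // Converts URI -> URL -> URI and reports whether the result
    // equals the original URI.
    static boolean roundTrips(URI uri) {
        URL url = toUrlOrNull(uri);
        if (url == null) return false;
        try {
            return url.toURI().equals(uri);
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        URI web = URI.create("http://www.xml.com/pub/a/2003/09/17/stax.html");
        URI book = URI.create("urn:isbn:1-565-92870-9");
        URI relative = URI.create("images/logo.png");
        System.out.println(web + " -> " + toUrlOrNull(web));
        System.out.println(book + " -> " + toUrlOrNull(book));
        System.out.println(relative + " -> " + toUrlOrNull(relative));
        System.out.println("Round trip preserved: " + roundTrips(web));
    }
}
```

All three strings are acceptable URI objects, but only the first should convert to a URL, which reflects the split between identification and retrieval.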
Constructing a URI
URIs are built from strings. Unlike the
URL class, the URI class does
not depend on an underlying protocol handler. As long as the URI is
syntactically correct, Java does not need to understand its protocol
in order to create a representative URI object. Thus, unlike the
URL class, the URI class can be
used for new and experimental URI schemes.
public URI(String uri) throws URISyntaxException
This is the basic constructor that creates a new
URI object from any convenient string. For
example,
URI voice = new URI("tel:+1-800-9988-9938");
URI web = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");
URI book = new URI("urn:isbn:1-565-92870-9");
If the string argument does not follow URI syntax rules—for
example, if the URI begins with a colon—this constructor throws
a URISyntaxException. This is a checked exception,
so you need to either catch it or declare that the method where the
constructor is invoked can throw it. However, one syntactic rule is
not checked. In contradiction to the URI specification, the
characters used in the URI are not limited to ASCII. They can include
other Unicode characters, such as ø and é.
Syntactically, there are very few restrictions on URIs, especially
once the need to encode non-ASCII characters is removed and relative
URIs are allowed. Almost any string can be interpreted as a URI.
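A small sketch of how permissive this constructor is. The class UriSyntaxCheck, the method isParsable, and the example.com strings are assumptions made up for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class UriSyntaxCheck {

    // Returns true if the URI(String) constructor accepts the string.
    static boolean isParsable(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException ex) {
            // getReason() and getIndex() say what went wrong and where
            System.err.println(ex.getReason() + " at index " + ex.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "tel:+1-800-9988-9938",             // no protocol handler needed
            "javafaq/index.shtml",              // relative URI
            "http://www.example.com/caf\u00e9", // non-ASCII letter is accepted
            ":oops",                            // begins with a colon
            "http://www.example.com/a b"        // unescaped space
        };
        for (String s : candidates) {
            System.out.println(s + " -> " + isParsable(s));
        }
    }
}
```

Only the last two candidates should be rejected: one has an empty scheme name, the other contains a character that is illegal in a URI and that this constructor does not escape.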
public URI(String scheme, String schemeSpecificPart, String fragment) throws URISyntaxException
This constructor is mostly used for nonhierarchical URIs. The scheme
is the URI's protocol, such as http, urn, tel, and
so forth. It must be composed exclusively of ASCII letters and digits
and the three punctuation characters +,
-, and .. It must begin with a
letter. Passing null for this argument omits the scheme, thus
creating a relative URI. For example:
URI absolute = new URI("http", "//www.ibiblio.org" , null);
URI relative = new URI(null, "/javafaq/index.shtml", "today");
The scheme-specific part depends on the syntax of the URI scheme;
it's one thing for an http URL, another for a mailto
URL, and something else again for a tel URI. Because the
URI class encodes illegal characters with percent
escapes, there's effectively no syntax error you can
make in this part.
Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.
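The automatic escaping can be seen in a short sketch. The class ThreeArgUriDemo, the build helper, and the mailto address are hypothetical, chosen only to put a space in each part.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class ThreeArgUriDemo {

    // Builds a URI with the three-argument constructor and returns
    // its string form.
    static String build(String scheme, String ssp, String fragment) {
        try {
            return new URI(scheme, ssp, fragment).toString();
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        // The space in the scheme-specific part is escaped as %20
        System.out.println(build("mailto", "user name@example.com", null));
        // A null scheme gives a relative URI; the fragment is escaped too
        System.out.println(build(null, "/javafaq/index.shtml", "last week"));
    }
}
```

Under these assumptions the two lines printed should be mailto:user%20name@example.com and /javafaq/index.shtml#last%20week.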
public URI(String scheme, String host, String path, String fragment) throws URISyntaxException
This constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together form the scheme-specific part for this URI: the host is prefixed with //, and the path, which must begin with /, follows it. For example:
URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");
produces the URI http://www.ibiblio.org/javafaq/index.html#today.
If the constructor cannot form a legal hierarchical URI from the
supplied pieces—for instance, if there is a scheme so the URI
has to be absolute but the path doesn't start with
/—then it throws a URISyntaxException.
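The following sketch contrasts a path that begins with / with one that does not. The class FourArgUriDemo and its build helper are hypothetical; the exact wording of the exception reason is left to the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class FourArgUriDemo {

    // Returns the string form of the URI, or a message starting with
    // "rejected" if the pieces do not form a legal hierarchical URI.
    static String build(String scheme, String host, String path, String fragment) {
        try {
            return new URI(scheme, host, path, fragment).toString();
        } catch (URISyntaxException ex) {
            return "rejected: " + ex.getReason();
        }
    }

    public static void main(String[] args) {
        // Legal: the path begins with a slash
        System.out.println(
            build("http", "www.ibiblio.org", "/javafaq/index.html", "today"));
        // Illegal: a scheme is present but the path is relative
        System.out.println(
            build("http", "www.ibiblio.org", "javafaq/index.html", "today"));
        // Characters that are illegal in a path are escaped, not rejected
        System.out.println(
            build("http", "www.ibiblio.org", "/java faq/index.html", null));
    }
}
```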
public URI(String scheme, String authority, String path, String query, String fragment) throws URISyntaxException
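This constructor is basically the same as the previous one, with the addition of a query string component. Note that its second argument is the full authority, which may include user info and a port, rather than just a host. For example: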
URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html",
"referrer=cnet&date=2004-08-23", "today");
As usual, any unescapable syntax errors cause a
URISyntaxException to be thrown and null can be
passed to omit any of the arguments.
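As a rough sketch, the five-argument constructor can be exercised like this. The class FiveArgUriDemo, the build helper, and the port 8080 in the second example are assumptions added for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class FiveArgUriDemo {

    // Wraps the checked exception so callers can stay brief.
    static URI build(String scheme, String authority, String path,
                     String query, String fragment) {
        try {
            return new URI(scheme, authority, path, query, fragment);
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        URI today = build("http", "www.ibiblio.org", "/javafaq/index.html",
                          "referrer=cnet&date=2004-08-23", "today");
        System.out.println(today);
        System.out.println("Query: " + today.getQuery());

        // null omits the query and fragment; the authority carries a port
        URI withPort = build("http", "www.ibiblio.org:8080",
                             "/javafaq/index.html", null, null);
        System.out.println(withPort + " uses port " + withPort.getPort());
    }
}
```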
public URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment) throws URISyntaxException
This is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example:
URI styles = new URI("ftp", "anonymous:elharo@metalab.unc.edu",
"ftp.oreilly.com", 21, "/pub/stylesheet", null, null);
However, the resulting URI still has to follow all the usual rules for URIs and again, null can be passed for any argument to omit it from the result.
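To see what the seven-argument constructor does with the @ sign inside the user info, consider this sketch. The class SevenArgUriDemo and its ftpStyles method are hypothetical wrappers around the example above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class SevenArgUriDemo {

    // Builds the anonymous FTP URI from its separate components.
    static URI ftpStyles() {
        try {
            return new URI("ftp", "anonymous:elharo@metalab.unc.edu",
                           "ftp.oreilly.com", 21, "/pub/stylesheet", null, null);
        } catch (URISyntaxException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        URI styles = ftpStyles();
        System.out.println(styles);
        // The @ inside the user info is escaped in the raw form only
        System.out.println("Raw user info: " + styles.getRawUserInfo());
        System.out.println("User info:     " + styles.getUserInfo());
        System.out.println("Host and port: " + styles.getHost()
                           + ":" + styles.getPort());
    }
}
```

Because @ separates the user info from the host, the constructor writes the one in the email address as %40, so the authority stays unambiguous.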
public static URI create(String uri)
This is not a constructor, but rather a static factory method. Unlike
the constructors, it does not throw a
URISyntaxException. If you're
sure your URIs are legal and do not violate any of the rules, you can
use this method. For example, this invocation creates a
URI for anonymous FTP access using an email
address as password:
URI styles = URI.create(
"ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet");
If the URI does prove to be malformed, this method throws an
IllegalArgumentException. This is a runtime
exception, so you don't have to explicitly declare
it or catch it.
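A brief sketch of the difference in exception handling. The class UriCreateDemo and the method parseOrNull are hypothetical names introduced here.

```java
import java.net.URI;

// Hypothetical helper class used only for illustration.
class UriCreateDemo {

    // Returns the parsed URI, or null if the string is malformed.
    // No throws clause is needed because URI.create() reports errors
    // with an unchecked IllegalArgumentException.
    static URI parseOrNull(String s) {
        try {
            return URI.create(s);
        } catch (IllegalArgumentException ex) {
            // The underlying URISyntaxException is available as the cause
            System.err.println("Malformed URI: " + ex.getCause());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull(
            "ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet"));
        System.out.println(parseOrNull(":oops"));
    }
}
```

This form is convenient for URI literals in source code, where a syntax error is a programming mistake rather than bad user input.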
The Parts of the URI
A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. The general format is:
scheme:scheme-specific-part#fragment
If the scheme is omitted, the URI reference is relative. If the fragment
identifier is omitted, the URI reference is a pure URI. The URI class
has getter methods that return these three parts of each
URI object. The getRawFoo( ) methods return the encoded forms of the
parts of the URI, while the equivalent getFoo( ) methods first decode
any percent-escaped characters and then return the decoded part:
public String getScheme( )
public String getSchemeSpecificPart( )
public String getRawSchemeSpecificPart( )
public String getFragment( )
public String getRawFragment( )
TIP: There's no getRawScheme( ) method because the URI specification requires that all scheme names be composed exclusively of URI-legal ASCII characters and does not allow percent escapes in scheme names.
These methods all return null if the particular
URI object does not have the relevant component:
for example, a relative URI without a scheme or an http URI without a
fragment identifier.
A URI that has a scheme is an
absolute URI. A URI without a scheme is
relative. The isAbsolute() method returns true if the
URI is absolute, false if it's relative:
public boolean isAbsolute( )
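The three-part view of a URI, and the difference between the raw and decoded getters, might be demonstrated like this. The class UriPartsDemo, the summarize helper, and the example.com URI are assumptions for illustration only.

```java
import java.net.URI;

// Hypothetical helper class used only for illustration.
class UriPartsDemo {

    // Collects the three top-level parts in decoded and raw form.
    static String summarize(URI u) {
        return "scheme=" + u.getScheme()
             + ", ssp=" + u.getSchemeSpecificPart()
             + ", rawSsp=" + u.getRawSchemeSpecificPart()
             + ", fragment=" + u.getFragment()
             + ", rawFragment=" + u.getRawFragment()
             + ", absolute=" + u.isAbsolute();
    }

    public static void main(String[] args) {
        // %20 is an escaped space in both the path and the fragment
        System.out.println(summarize(
            URI.create("http://www.example.com/a%20b#sec%201")));
        // No scheme, so getScheme() returns null and the URI is relative
        System.out.println(summarize(
            URI.create("/javafaq/index.shtml#today")));
    }
}
```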
The details of the scheme-specific part vary depending on the type of
the scheme. For example, in a tel
URL, the scheme-specific part has the syntax of a telephone number.
However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a
particular hierarchical format divided into an
authority, a path, and a query string. The authority is further
divided into user info, host, and port. The isOpaque() method returns false if the URI is
hierarchical, true if it's not
hierarchical—that is, if it's opaque:
public boolean isOpaque( )
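A quick sketch of how isOpaque( ) classifies a few of the URIs used in this section. The class OpaqueCheckDemo and its describe method are hypothetical.

```java
import java.net.URI;

// Hypothetical helper class used only for illustration.
class OpaqueCheckDemo {

    // Classifies a URI as opaque or hierarchical.
    static String describe(URI u) {
        return u.isOpaque() ? "opaque" : "hierarchical";
    }

    public static void main(String[] args) {
        String[] samples = {
            "tel:+1-800-9988-9938",
            "urn:isbn:1-565-92870-9",
            "http://www.xml.com/pub/a/2003/09/17/stax.html",
            "/javafaq/index.shtml"   // relative URIs are hierarchical
        };
        for (String s : samples) {
            URI u = URI.create(s);
            // getPath() is null for an opaque URI
            System.out.println(s + " is " + describe(u)
                               + ", path = " + u.getPath());
        }
    }
}
```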
If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:
public String getAuthority( )
public String getFragment( )
public String getHost( )
public String getPath( )
public int getPort( )
public String getQuery( )
public String getUserInfo( )
These methods all return the decoded parts; in other words, percent
escapes, such as %3C, are changed into the characters they represent,
such as <. If you want the raw, encoded parts of the URI, there
are five parallel
getRawFoo() methods:
public String getRawAuthority( )
public String getRawFragment( )
public String getRawPath( )
public String getRawQuery( )
public String getRawUserInfo( )
Remember the URI class differs from the URI specification
in that non-ASCII characters such as é and ü
are never percent-escaped in the first place, and thus will still be
present in the strings returned by the
getRawFoo() methods unless the strings originally used to construct
the URI object were encoded.
TIP: There are no getRawPort( ) and getRawHost( ) methods because these components are always guaranteed to be made up of ASCII characters, at least for now. Internationalized domain names are coming, and may require this decision to be rethought in future versions of Java.
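The decoded and raw getters can be compared side by side in a short sketch. The class RawVsDecodedDemo and the example.com URIs are hypothetical. The sketch also calls toASCIIString( ), a URI method not discussed above, which returns a form in which non-ASCII characters are percent-escaped as UTF-8.

```java
import java.net.URI;

// Hypothetical helper class used only for illustration.
class RawVsDecodedDemo {

    public static void main(String[] args) {
        // %3C is an escaped < and %20 is an escaped space
        URI escaped = URI.create("http://www.example.com/a%3Cb?q=x%20y");
        System.out.println("Path:      " + escaped.getPath());
        System.out.println("Raw path:  " + escaped.getRawPath());
        System.out.println("Query:     " + escaped.getQuery());
        System.out.println("Raw query: " + escaped.getRawQuery());

        // A non-ASCII letter supplied unescaped stays unescaped,
        // even in the raw form
        URI accented = URI.create("http://www.example.com/caf\u00e9");
        System.out.println("Raw path:  " + accented.getRawPath());
        System.out.println("ASCII:     " + accented.toASCIIString());
    }
}
```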
In the event that the specific URI does not contain this
information (for instance, the URI
http://www.example.com has no user info,
port, or query string), the relevant methods return null.
getPort( ) is the
single exception. Since it's declared to return an
int, it can't return
null. Instead, it returns -1 to indicate an
omitted port. Note also that a hierarchical URI always has a path,
though it may be empty, so getPath( ) returns the empty string rather
than null for http://www.example.com.
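A minimal sketch of what the getters return for missing components. The class MissingPartsDemo and its show helper are hypothetical. One detail to watch: the path of a hierarchical URI is always defined, so for http://www.example.com it is the empty string rather than null.

```java
import java.net.URI;

// Hypothetical helper class used only for illustration.
class MissingPartsDemo {

    // Makes null and the empty string easy to tell apart when printed.
    static String show(String s) {
        return s == null ? "null" : "\"" + s + "\"";
    }

    public static void main(String[] args) {
        URI u = URI.create("http://www.example.com");
        System.out.println("Host:      " + show(u.getHost()));
        System.out.println("User info: " + show(u.getUserInfo()));
        System.out.println("Query:     " + show(u.getQuery()));
        System.out.println("Fragment:  " + show(u.getFragment()));
        System.out.println("Path:      " + show(u.getPath()));  // empty, not null
        System.out.println("Port:      " + u.getPort());        // -1 when omitted
    }
}
```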
For various technical reasons that don't have a lot
of practical impact, Java can't always initially
detect syntax errors in the authority component. The immediate
symptom of this failing is normally an inability to return the
individual parts of the authority: port, host, and user info. In this
event, you can call parseServerAuthority() to force the authority to
be reparsed:
public URI parseServerAuthority( ) throws URISyntaxException
The original URI does not change
(URI objects are immutable), but the
URI returned will have separate authority parts
for user info, host, and port. If the authority cannot be parsed, a
URISyntaxException is thrown.
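One way to see this in practice is a host name that is not legal for a server-based authority. The sketch below uses my_host, a made-up name containing an underscore, on the assumption that the parser falls back to treating my_host:8080 as a registry-based authority. The class ServerAuthorityDemo and its methods are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class used only for illustration.
class ServerAuthorityDemo {

    // Returns true if the authority can be split into user info,
    // host, and port.
    static boolean isServerBased(URI u) {
        try {
            u.parseServerAuthority();
            return true;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    // Returns the host if one can be parsed, otherwise the whole authority.
    static String hostOrAuthority(URI u) {
        try {
            return u.parseServerAuthority().getHost();
        } catch (URISyntaxException ex) {
            return u.getAuthority();
        }
    }

    public static void main(String[] args) {
        URI ok = URI.create("http://www.example.com:8080/x");
        URI odd = URI.create("http://my_host:8080/x");
        System.out.println(hostOrAuthority(ok) + ", port " + ok.getPort());
        // The URI is created without complaint, but host and port are missing
        System.out.println(odd.getHost() + ", port " + odd.getPort());
        System.out.println("Falls back to: " + hostOrAuthority(odd));
    }
}
```

The second URI is constructed without an exception; the problem only surfaces as a null host and a port of -1, or as a URISyntaxException once parseServerAuthority( ) is called.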
Example 7-10 uses these methods to split URIs entered on the command line into their component parts. It's similar to Example 7-4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.
import java.net.*;

public class URISplitter {

  public static void main(String args[]) {

    for (int i = 0; i < args.length; i++) {
      try {
        URI u = new URI(args[i]);
        System.out.println("The URI is " + u);
        if (u.isOpaque( )) {
          // Opaque URIs have only a scheme, a scheme-specific part,
          // and possibly a fragment
          System.out.println("This is an opaque URI.");
          System.out.println("The scheme is " + u.getScheme( ));
          System.out.println("The scheme specific part is "
           + u.getSchemeSpecificPart( ));
          System.out.println("The fragment ID is " + u.getFragment( ));
        }
        else {
          System.out.println("This is a hierarchical URI.");
          System.out.println("The scheme is " + u.getScheme( ));
          try {
            // Force the authority to be split into user info, host, and port
            u = u.parseServerAuthority( );
            System.out.println("The host is " + u.getHost( ));
            System.out.println("The user info is " + u.getUserInfo( ));
            System.out.println("The port is " + u.getPort( ));
          }
          catch (URISyntaxException ex) {
            // Must be a registry based authority
            System.out.println("The authority is " + u.getAuthority( ));
          }
          System.out.println("The path is " + u.getPath( ));
          System.out.println("The query string is " + u.getQuery( ));
          System.out.println("The fragment ID is " + u.getFragment( ));
        } // end else
      } // end try
      catch (URISyntaxException ex) {
        System.err.println(args[i] + " does not seem to be a URI.");
      }
      System.out.println( );
    } // end for

  } // end main

} // end URISplitter

Here's the result of running this against three of the URI examples in this section:

% java URISplitter tel:+1-800-9988-9938 \
    http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \
    urn:isbn:1-565-92870-9
The URI is tel:+1-800-9988-9938
This is an opaque URI.
The scheme is tel
The scheme specific part is +1-800-9988-9938
The fragment ID is null

The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc
This is a hierarchical URI.
The scheme is http
The host is www.xml.com
The user info is null
The port is -1
The path is /pub/a/2003/09/17/stax.html
The query string is null
The fragment ID is id=_hbc

The URI is urn:isbn:1-565-92870-9
This is an opaque URI.
The scheme is urn
The scheme specific part is isbn:1-565-92870-9
The fragment ID is null