Organizing XML with Entities

by Erik T. Ray
01/23/2001

The Basics

First, a review of the basics. According to the XML Recommendation [1], "an XML document may consist of one or many storage units ... called entities." So, an entity is a piece of text and markup (called mixed content), or basically any subset of an XML document. The whole document is referred to as the document entity. (The document type definition [DTD] is another entity.) More interesting to you, the XML author, however, are the smaller bits and pieces of a document that can be contained inside entities.

Any entity (except for the document and its DTD) can be named, which gives you the power to call upon a segment of mixed-content text when you need it. As you will see, there are many applications for named entities, from inserting boilerplate text to spreading a document over multiple pages.

General Entities

To use a named (or general) entity, you first have to state your intention to use it in a piece of syntax called an entity declaration. Most often, entities are declared in the internal subset of the DTD. That's a place at the top of the document, inside the <!DOCTYPE> declaration. Here's an example of a document that declares a general entity called "friend":

<?xml version="1.0"?>
<!DOCTYPE memo
[
  <!ENTITY friend "Samuel Jeremiah Bagpipe-Grubbins">
]>
<memo priority="normal">
  <from>Julie</from>
  <to>roommates</to>
  <message>
My good friend, &friend;, will be stopping by to feed 
and talk to my goldfish while I'm away. I've given &friend; 
a set of keys and told him how to work the alarm. So please 
make &friend; feel at home when he's here. Thanks. 
  </message>
</memo>

In this document, we declared an entity called "friend" for the text "Samuel Jeremiah Bagpipe-Grubbins." Later, there are references to the entity of the form &friend;. The ampersand (&) and semicolon (;) are delimiters that tell the XML parser to treat the word as an entity reference. (Because the symbol & has special meaning for entity references, you actually have to use an entity &amp; when you just want to use the character (&) by itself.) When the XML parser reads this document, it automatically replaces all entity references with the entity's defined text.
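
To make the substitution concrete, here is roughly what an application receives once the parser has expanded the three references in the memo above. The second fragment illustrates the &amp; rule; the <signature> element in it is hypothetical and is not part of the memo.

```xml
<!-- The message as the application sees it after entity expansion -->
<message>
My good friend, Samuel Jeremiah Bagpipe-Grubbins, will be stopping by to feed 
and talk to my goldfish while I'm away. I've given Samuel Jeremiah Bagpipe-Grubbins 
a set of keys and told him how to work the alarm. So please 
make Samuel Jeremiah Bagpipe-Grubbins feel at home when he's here. Thanks. 
</message>

<!-- A literal ampersand has to be written as the entity reference &amp; -->
<signature>Julie &amp; the goldfish</signature>
```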

Related Reading

Learning XML
Guide to Creating Self-Describing Data
By Erik T. Ray

This is a powerful way to store and retrieve bits of text. Some reasons you'd do this are:

  • To ensure consistency. Using an entity to represent repeating text guarantees that every instance will be exactly the same. This is useful for items that are complex and difficult to spell, such as Sam's long name.

  • To hold frequently changing information. If, at the last minute, Julie's friend Sam backed out of goldfish-sitting, she could find someone else to do it, and change the name in the memo in a flash. Instead of editing the name in three places, she only has to edit the entity declaration. Okay, maybe it's not such a hardship to change three names, but you can imagine that in a really large document, where a name can appear in hundreds of places, an entity would be a real labor-saver.

  • To limit repetitive typing. It's easier to type in "friend" than it is to type in Sam's full name. If you're typing in the same piece of text over and over, it may make sense to use an entity reference instead.
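
As a small sketch of the second point, a last-minute change of goldfish-sitter would touch only the declaration. The replacement name below is made up for illustration.

```xml
<!DOCTYPE memo
[
  <!-- Hypothetical new name; the three &friend; references in the memo stay untouched -->
  <!ENTITY friend "Wilhelmina Octavia Fenwick">
]>
```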

External Entities

Perhaps the most important role of an entity is to include text from another file in your document. An external entity is declared slightly differently than the entity in our previous example, whose replacement text sat inside the declaration itself, because its replacement text is located in another file. The declaration needs to tell the XML parser how to find that text, whether it's in another place on the same computer system, or perhaps on another system somewhere on the Internet.


If this sounds like linking (e.g., hypertext links in HTML), you're partially right. It isn't a way to link one document to another; rather, it's a way to link segments of a document together. Once the XML parser finds the replacement text and pops it into the places where a reference to an external entity is found, it treats that text as if it had been there all along. The user has no idea (and shouldn't care) that an external entity pulled in content from another file. This is quite different from XLink [2], XML's linking paradigm. XLink defines the ways a document can link separate documents together, which often involves some user interaction (like clicking on a highlighted word in HTML).

The following example, a fictional lab report written by one Dr. Wyse, is a document consisting of several parts spread over multiple files:

<?xml version="1.0"?>
<!DOCTYPE report
[
  <!ENTITY abstract     SYSTEM "abs.xml">
  <!ENTITY data         SYSTEM "dat.xml">
  <!ENTITY analysis     SYSTEM "ana.xml">
  <!ENTITY conclusion   SYSTEM "con.xml">
  <!ENTITY bibliography SYSTEM "bib.xml">
  <!ENTITY appendix     SYSTEM
      "http://www.scistuff.org/pub/info/tables/tbl34.xml">
  <!ENTITY bio          SYSTEM "/home/penny/bio.txt">
  <!ENTITY header       PUBLIC 
      "-//SCILAB//XML Corp Banner v1.3//EN" 
      "/company/boilerplate/banner3.xml">
  <!ENTITY % equations  SYSTEM "/company/sci/eqs.ent">
  %equations;
]>
<report>
  <date>2001.04.13</date>
  <author>Dr. Penny Wyse</author>

  <!-- Main Document Parts -->

  &header;         <!-- company logo, legal info, etc. -->
  &abstract;       <!-- overview of the experiment -->
  &data;           <!-- experimental results in a big table -->
  &analysis;       <!-- graphs, diagrams, equations, gab -->
  &conclusion;     <!-- what we think happened -->
  &bibliography;   <!-- citations and research sources -->
  <appendix>
    <title>Isotopic Measurements for Einsteinium</title>
    &appendix;     <!-- a useful table I found somewhere -->
  </appendix>
  <colophon>       <!-- brief career history -->
    <authorbio>
      &bio;
    </authorbio>
  </colophon>
</report>

The first thing you'll notice about this example is how sparse it is. Where is all the content? With clever use of entities, Dr. Wyse has spread all the content out among a bunch of files. The header of the report is a piece of boilerplate living in a file somewhere else on the system. The critical components, from abstract to bibliography, are files in the same location as the file printed here. Dr. Wyse also includes a table for her appendix, which is sourced in from a location on the Internet. Finally, she includes a bio from her home directory, where she can maintain her personal information.
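
The article doesn't show the component files themselves, so the following is only a guess at what abs.xml might look like. The <abstract> and <para> element names and the text are assumptions. The general point it illustrates is that an external parsed entity holds a fragment of content: it may open with a text declaration, but it carries no <!DOCTYPE> declaration of its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- abs.xml: hypothetical contents, for illustration only -->
<abstract>
  <para>This report summarizes the experiment and its main findings.</para>
</abstract>
```

When the parser meets &abstract; in the report, it behaves as if this markup had been typed at that spot.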

The second thing you'll see is that Dr. Wyse uses different kinds of external entity declarations. The declaration you use depends on how you want to access the resource. The first kind uses a system identifier (the keyword SYSTEM), which is a URL or a filesystem path that gives the location of the resource. The second kind uses a public identifier (the keyword PUBLIC), which is a name for a resource that is universally recognized and doesn't require that you know precisely where the resource is; in the example, it is followed by a system identifier as well. We won't go into the details of public identifiers, but system identifiers are quite useful on their own.
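
Placed side by side, the two forms from the report look like this. The declarations are copied from the example; only the comments are added.

```xml
<!-- System identifier: SYSTEM followed by a URL or path, here relative to the report -->
<!ENTITY data   SYSTEM "dat.xml">

<!-- Public identifier: PUBLIC followed by the public name, and then a
     system identifier the parser can use to locate the resource -->
<!ENTITY header PUBLIC
    "-//SCILAB//XML Corp Banner v1.3//EN"
    "/company/boilerplate/banner3.xml">
```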

Dividing an XML Document into Components

So, why did Dr. Wyse butcher her document into so many files when it would be simpler just to keep all that stuff in one place? Dividing an XML document into components has several advantages:

  • It organizes document parts. In a complex document, such as a lab report, finding your way around can be tedious and difficult. Do you really want to scroll down a long file filled with dense markup to count how many sections there are? A scheme like the report example above saves time and protects your eyes by representing the structure of the document in a brief list of entity references.

  • It makes editing a document easier. Documents can grow extremely large. A book with a thousand pages can run to many megabytes of text and markup, which is more than some text editors or XML editors handle comfortably. Even medium-size documents can be made easier to edit if you split them up into several files. For one thing, you'll have less to download if you want to work at home on just a section. For another, if you work with collaborators, each of you can edit your own pieces simultaneously, without the hassle of merging content later.

  • Documents can share common material. This is one of the best features of external entities. No longer do you need to copy a piece of a document that is used over and over into every document that needs it. Now, you can have all your documents import the text from one place. To update a piece of boilerplate, such as a legal notice or a standard copyright page, you can edit the file that contains the boilerplate, and instantly all documents importing it will be updated. One caveat: unless you're using a location-independent scheme like public identifiers, you should be careful not to move the imported file without fixing the external entity declarations for it in all the documents that use it.

  • It lets you access a public data resource. There's a vast amount of data on the Internet, all of it available to your document for importing. It makes sense to distribute information this way since it's always changing and someone will be there to maintain it when you can't. Note that in the lab report example there is a declaration for an external entity that imports the file eqs.ent. This file contains entities that define equations in a language like MathML [3]. It's easier to store commonly used content in a central repository than it is to type it all out again and again. And if an equation definition is found to contain an error, it can be fixed in one place and replicated everywhere else.
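
The contents of eqs.ent aren't shown in the article, so here is a sketch of how such a file might be built and used. The entity name massenergy, the <equation> element, and the particular equation are assumptions. Pulling the file in through a parameter entity is one common way to make its declarations available to the report.

```xml
<!-- eqs.ent: hypothetical contents; each declaration names one equation -->
<!ENTITY massenergy
  "<math xmlns='http://www.w3.org/1998/Math/MathML'>
     <mrow>
       <mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup>
     </mrow>
   </math>">

<!-- In the report's internal subset: declare a parameter entity for the
     file, then reference it so the declarations above are read in -->
<!ENTITY % equations SYSTEM "/company/sci/eqs.ent">
%equations;

<!-- In the body of the report -->
<equation>&massenergy;</equation>
```

Correcting the equation inside eqs.ent would then change every document that references &massenergy;. Bear in mind that non-validating parsers are not required to read external entities, so this arrangement depends on the tools you use.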

Summary

Entities were introduced with XML's predecessor, the Standard Generalized Markup Language (SGML). But they've proved so valuable to XML authors that they were included in the slimmer XML specification while other features were pared away. Master the use of entities and you'll find that writing documents in XML is an easier and more manageable process. And you can impress your friends at parties with XML tricks. (Well, maybe.)


Notes:

  1. The Extensible Markup Language (XML) Recommendation is written and maintained by the XML Working Group of the World Wide Web Consortium (W3C). (See the W3C's page on XML resources and information.) Version 1.0, Second Edition, of this document is available online.

  2. The XML Linking Language, also called XLink, is another recommendation by the W3C, available online.

  3. MathML is an XML markup language proposed by the W3C for encoding mathematical expressions and functions.

 

Erik Ray is an XML software specialist and developer at O'Reilly & Associates. He lives with his wife Jeannine and five parrots in Saugus, Massachusetts. Besides writing, he practices kendo, plays go, binds books, and stalks bookstores for rare and antiquarian books.
