Creating a Document Type Definition (DTD)
After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid, and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.
Basic DTD Definitions
When you were parsing the slide show, for example, you saw that the characters method was invoked multiple times before and after comments and slide elements. In those cases, the whitespace consisted of the line endings and indentation surrounding the markup. The goal was to make the XML document readable--the whitespace was not in any way part of the document contents. To begin learning about DTD definitions, let's start by telling the parser where whitespace is ignorable.
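As a rough sketch of where this section is heading, the following program parses the same small document twice: once with no DTD, and once with a DTD that says the slideshow and slide elements contain only other elements. The class name WhitespaceDemo, the sample document, and the inline DTD are assumptions made for illustration, not files from this tutorial. Validation is switched on for the DTD case so that the behavior does not depend on how a particular nonvalidating parser treats the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceDemo {

    // Hypothetical sample: a cut-down slide show, indented for readability.
    static final String BODY =
        "<slideshow>\n  <slide>\n    <title>Overview</title>\n  </slide>\n</slideshow>";

    // Assumed minimal DTD, supplied inline so the sketch needs no files.
    static final String DOCTYPE =
        "<!DOCTYPE slideshow [\n"
        + "  <!ELEMENT slideshow (slide+)>\n"
        + "  <!ELEMENT slide (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n";

    /** Records how whitespace-only runs and real text reach the handler. */
    static class Counter extends DefaultHandler {
        int whitespaceAsCharacters = 0;
        int ignorable = 0;
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            String s = new String(ch, start, length);
            if (s.trim().isEmpty()) {
                whitespaceAsCharacters += length; // indentation and line endings
            } else {
                text.append(s);                   // real document content
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            ignorable += length;
        }

        @Override
        public void error(SAXParseException e) throws SAXParseException {
            throw e; // surface validity errors instead of silently ignoring them
        }
    }

    /** Parses the XML string and returns the populated counter. */
    static Counter parse(String xml, boolean validating) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating);
            Counter counter = new Counter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), counter);
            return counter;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Counter plain = parse(BODY, false);
        Counter withDtd = parse(DOCTYPE + BODY, true);
        System.out.println("no DTD:   characters=" + plain.whitespaceAsCharacters
            + " ignorable=" + plain.ignorable);
        System.out.println("with DTD: characters=" + withDtd.whitespaceAsCharacters
            + " ignorable=" + withDtd.ignorable);
    }
}
```

Without the DTD, the indentation should arrive through characters; with it, the same whitespace should arrive through ignorableWhitespace, which an application is free to skip.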
Note: The DTD defined in this section is contained in slideshow1a.dtd. (The browsable version is slideshow1a-dtd.html.)
Start by creating a file named slideshow.dtd. Enter an XML declaration and a comment to identify the file, as shown below:

    <?xml version='1.0' encoding='utf-8'?>
    <!-- DTD for a simple "slide show". -->

Next, add the ELEMENT declaration shown below to specify that a slideshow element contains slide elements and nothing else:

    <!-- DTD for a simple "slide show". -->
    <!ELEMENT slideshow (slide+)>

As you can see, the DTD tag starts with <! followed by the tag name (ELEMENT). After the tag name comes the name of the element that is being defined (slideshow) and, in parentheses, one or more items that indicate the valid contents for that element. In this case, the notation says that a slideshow consists of one or more slide elements.

Without the plus sign, the definition would be saying that a slideshow consists of a single slide element. Here are the qualifiers you can add to an element definition:
Table 3 DTD Element Qualifiers

| Qualifier | Name | Meaning |
|---|---|---|
| ? | Question Mark | Optional (zero or one) |
| * | Asterisk | Zero or more |
| + | Plus Sign | One or more |
You can include multiple elements inside the parentheses in a comma-separated list, and use a qualifier on each element to indicate how many instances of that element may occur. The comma-separated list tells which elements are valid and the order in which they can occur.
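To see qualifiers and ordering enforced, the sketch below validates a few small documents against an inline DTD. The class name QualifierDemo and the DTD itself are assumptions for illustration: the slide, title, and item declarations are simplified stand-ins, with title and item holding plain text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class QualifierDemo {

    // Assumed DTD for illustration; title and item are plain text here.
    static final String DOCTYPE =
        "<!DOCTYPE slideshow [\n"
        + "  <!ELEMENT slideshow (slide+)>\n"
        + "  <!ELEMENT slide (title, item*)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n";

    /** Returns true if the given root element is valid against DOCTYPE. */
    static boolean isValid(String rootElement) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DOCTYPE + rootElement)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) throws SAXException {
                        throw e; // treat validity errors as failures
                    }
                });
            return true;
        } catch (SAXException e) {
            return false; // validity or well-formedness problem
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<slideshow><slide><title>A</title></slide></slideshow>",
            "<slideshow><slide><title>A</title><item>x</item><item>y</item></slide></slideshow>",
            "<slideshow></slideshow>",                                           // no slide: violates slide+
            "<slideshow><slide><item>x</item><title>A</title></slide></slideshow>" // wrong order
        };
        for (String sample : samples) {
            System.out.println(isValid(sample) + "  " + sample);
        }
    }
}
```

The anonymous handler rethrows validity errors because the default handler ignores them, which would make every well-formed document look valid.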
You can also nest parentheses to group multiple items. For example, after defining an image element (coming up shortly), you could declare that every image element must be paired with a title element in a slide by specifying ((image, title)+). Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.

Defining Text and Nested Elements
Now that you have told the parser something about where not to expect text, let's see how to tell it where text can occur. Add the last three declarations shown below to define the slide, title, and item elements:

    <!ELEMENT slideshow (slide+)>
    <!ELEMENT slide (title, item*)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT item (#PCDATA | item)* >

The first line you added says that a slide consists of a title followed by zero or more item elements. Nothing new there. The next line says that a title consists entirely of parsed character data (PCDATA). That's known as "text" in most parts of the country, but in XML-speak it's called "parsed character data". (That distinguishes it from CDATA sections, which contain character data that is not parsed.) The "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name.

The last line introduces the vertical bar (|), which indicates an or condition. In this case, either PCDATA or an item can occur. The asterisk at the end says that either one can occur zero or more times in succession. The result of this specification is known as a mixed-content model, because any number of item elements can be interspersed with the text. Such models must always be defined with #PCDATA specified first, some number of alternate items divided by vertical bars (|), and an asterisk (*) at the end.

Limitations of DTDs
It would be nice if we could specify that an item contains either text, or text followed by one or more list items. But that kind of specification turns out to be hard to achieve in a DTD. For example, you might be tempted to define an item like this:

    <!ELEMENT item (#PCDATA | (#PCDATA, item+)) >

That would certainly be accurate, but as soon as the parser sees #PCDATA and the vertical bar, it requires the remaining definition to conform to the mixed-content model. This specification doesn't, so you get an error that says:

    Illegal mixed content model for 'item'. Found ( ...,

where the character reported (hex 28) is the opening parenthesis of the nested group, which the mixed-content model does not allow at that point.

Trying to double-define the item element doesn't work, either. A specification like this:

    <!ELEMENT item (#PCDATA) >
    <!ELEMENT item (#PCDATA, item+) >

produces a "duplicate definition" warning when the validating parser runs. The second definition is, in fact, ignored. So it seems that defining a mixed-content model (which allows item elements to be interspersed in text) is about as good as we can do.

In addition to the limitations of the mixed-content model mentioned above, there is no way to further qualify the kind of text that can occur where PCDATA has been specified. Should it contain only numbers? Should it be in a date format, or possibly a monetary format? There is no way to say in the context of a DTD.

Finally, note that the DTD offers no sense of hierarchy. The definition for the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to restrict the size of an item title compared to a slide title, for example. But the only way to do that would be to give one of them a different name, such as "item-title". The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All of these limitations are fundamental motivations behind the development of schema-specification standards.

Special Element Values in the DTD
Rather than specifying a parenthesized list of elements, the element definition could use one of two special values: ANY or EMPTY. The ANY specification says that the element may contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements could occur in any order in such a document, so specifying ANY makes sense.

The EMPTY specification says that the element has no content. So the DTD for e-mail messages that let you "flag" the message with <flag/> might have a line like this:

    <!ELEMENT flag EMPTY>

Referencing the DTD
In this case, the DTD definition is in a separate file from the XML document. That means you have to reference it from the XML document, which makes the DTD file part of the external subset of the full Document Type Definition (DTD) for the XML file. As you'll see later on, you can also include parts of the DTD within the document. Such definitions constitute the local subset of the DTD (the XML specification calls it the internal subset).
Note: The XML written in this section is contained in slideSample05.xml. (The browsable version is slideSample05-xml.html.)
To reference the DTD file you just created, add the DOCTYPE line shown below to your slideSample.xml file:

    <!-- A SAMPLE set of slides -->
    <!DOCTYPE slideshow SYSTEM "slideshow.dtd">
    <slideshow

Again, the DTD tag starts with "<!". In this case, the tag name, DOCTYPE, says that the document is a slideshow, which means that the document consists of the slideshow element and everything within it:

    <slideshow> ... </slideshow>

This tag defines the slideshow element as the root element for the document. An XML document must have exactly one root element. This is where that element is specified. In other words, this tag identifies the document content as a slideshow.

The DOCTYPE tag occurs after the XML declaration and before the root element. The SYSTEM identifier specifies the location of the DTD file. Since it does not start with a prefix like http:/ or file:/, the path is relative to the location of the XML document. Remember the setDocumentLocator method? The parser is using that information to find the DTD file, just as your application would to find a file relative to the XML document. A PUBLIC identifier could also be used to specify the DTD file using a unique name--but the parser would have to be able to resolve it.

The DOCTYPE specification could also contain DTD definitions within the XML document, rather than referring to an external DTD file. Such definitions would be contained in square brackets, like this:

    <!DOCTYPE slideshow SYSTEM "slideshow1.dtd" [
      ...local subset definitions here...
    ]>
You'll take advantage of that facility later on to define some entities that can be used in the document.
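The square-bracket form is also a convenient way to experiment with the declarations from this section without creating a separate file. The sketch below places the whole DTD in the local subset and swaps in different content models for item. The class name MixedContentDemo, the accepts helper, and the sample documents are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class MixedContentDemo {

    /**
     * Builds a document whose DTD lives entirely in the local subset,
     * using itemModel as the content model for item, and reports whether
     * a validating parser accepts both the DTD and the document.
     */
    static boolean accepts(String itemModel, String items) {
        String xml =
            "<!DOCTYPE slideshow [\n"
            + "  <!ELEMENT slideshow (slide+)>\n"
            + "  <!ELEMENT slide (title, item*)>\n"
            + "  <!ELEMENT title (#PCDATA)>\n"
            + "  <!ELEMENT item " + itemModel + " >\n"
            + "]>\n"
            + "<slideshow><slide><title>T</title>" + items + "</slide></slideshow>";
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) throws SAXException {
                        throw e; // validity errors count as rejection
                    }
                });
            return true;
        } catch (SAXException e) {
            return false; // invalid document, or a DTD the parser cannot accept
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String mixed = "(#PCDATA | item)*";
        // Text with a nested item interspersed: allowed by the mixed-content model.
        System.out.println(accepts(mixed, "<item>Why <item>nested</item> items</item>"));
        // A title inside an item is not one of the listed alternatives.
        System.out.println(accepts(mixed, "<item><title>x</title></item>"));
        // The tempting model from "Limitations of DTDs" is rejected outright.
        System.out.println(accepts("(#PCDATA | (#PCDATA, item+))", "<item>plain</item>"));
    }
}
```

In the last case the document content is never the problem: the parser stops at the declaration itself, so no document can be validated against that model.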