Introducing XSLT and XPath
The XML Stylesheet Language (XSL) has three major subcomponents:
- XSL-FO: The "flow object" standard. By far the largest subcomponent, this standard gives mechanisms for describing font sizes, page layouts, and how information "flows" from one page to another. This subcomponent is not covered by JAXP, nor is it included in this tutorial.
- XSLT: This is the transformation language, which lets you transform XML into some other format. For example, you might use XSLT to produce HTML, or a different XML structure. You could even use it to produce plain text or to put the information in some other document format. (And as you'll see in Generating XML from an Arbitrary Data Structure, a clever application can press it into service to manipulate non-XML data, as well.)
- XPath: At bottom, XSLT is a language that lets you specify what sorts of things to do when a particular element is encountered. But to write a program for different parts of an XML data structure, you need to be able to specify the part of the structure you are talking about at any given time. XPath is that specification language. It is an addressing mechanism that lets you specify a path to an element so that, for example, `<article><title>` can be distinguished from `<person><title>`. That way, you can describe different kinds of translations for the different `<title>` elements.
The remainder of this section describes the XSLT package structure, and discusses the XPath addressing mechanism in a bit more depth.
The XSLT Packages
The XSLT packages break down as follows:
- `javax.xml.transform`: This package defines the factory class you use to get a `Transformer` object. You then configure the transformer with input (Source) and output (Result) objects, and invoke its `transform()` method to make the transformation happen. The source and result objects are created using classes from one of the other three packages.
- `javax.xml.transform.dom`: Defines the `DOMSource` and `DOMResult` classes that let you use a DOM as an input to or output from a transformation.
- `javax.xml.transform.sax`: Defines the `SAXSource` and `SAXResult` classes that let you use a SAX event generator as input to a transformation, or deliver SAX events as output to a SAX event processor.
- `javax.xml.transform.stream`: Defines the `StreamSource` and `StreamResult` classes that let you use an I/O stream as an input to or output from a transformation.
How XPath Works
The XPath specification is the foundation for a variety of specifications, including XSLT and linking/addressing specifications like XPointer. So an understanding of XPath is fundamental to a lot of advanced XML usage. This section provides a thorough introduction to XPath, so you can refer to it as needed later on.
Note: In this tutorial, you won't actually use XPath until you get to the last page of this section, Transforming XML Data with XSLT. So, if you like, you can skip this section and go on ahead to the next page, Writing Out a DOM as an XML File. (When you get to the last page, there will be a note that refers you back here, so you don't forget!)
In general, an XPath expression specifies a pattern that selects a set of XML nodes. XSLT templates then use those patterns when applying transformations. (XPointer, on the other hand, adds mechanisms for defining a point or a range, so that XPath expressions can be used for addressing.)
The nodes in an XPath expression refer to more than just elements. They also refer to text and attributes, among other things. In fact, the XPath specification defines an abstract document model that defines seven different kinds of nodes: root, element, text, attribute, comment, processing instruction, and namespace.
Note: The root element of the XML data is modeled by an element node. The XPath root node contains the document's root element, as well as other information relating to the document.
The data model is described in the last section of the XPath Specification, Section 5. (As with many specifications, it is often helpful to start reading near the end, because many of the important terms and underlying assumptions are documented there. That sequence has often been the "magic key" that unlocks the contents of a W3C specification.)
In this abstract model, syntactic distinctions disappear, and you are left with a normalized view of the data. In a text node, for example, it makes no difference whether the text was defined in a CDATA section, or if it included entity references. The text node will consist of normalized data, as it exists after all parsing is complete. So the text will contain a `<` character, regardless of whether an entity reference like `&lt;` or a CDATA section was used to include it. (Similarly for the `&` character.)
In this section of the tutorial, we'll deal mostly with element nodes and text nodes. For the other addressing mechanisms, see the XPath Specification.
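The normalization described above can be observed from Java with the standard `javax.xml.xpath` API. The following is a minimal sketch rather than part of the original tutorial: the class name `NormalizedTextSketch`, the `eval` helper, and the two sample `PARA` documents are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; names and sample data are illustrative assumptions.
final class NormalizedTextSketch {

    // Parses the XML string and returns the XPath result converted to a string.
    static String eval(String expr, String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same text, written once with entity references and once in a CDATA section.
        String viaEntities = "<PARA>a &lt; b &amp; c</PARA>";
        String viaCdata = "<PARA><![CDATA[a < b & c]]></PARA>";

        // In the XPath data model both forms carry the same characters.
        System.out.println(eval("string(/PARA)", viaEntities));
        System.out.println(eval("string(/PARA)", viaCdata));
    }
}
```

Both calls are expected to print the same text, because the XPath view of the document no longer records how the `<` and `&` characters were written in the source.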
Basic XPath Addressing
An XML document is a tree-structured (hierarchical) collection of nodes. As with a hierarchical directory structure, it is useful to specify a path that points to a particular node in the hierarchy (hence the name of the specification: XPath). In fact, much of the notation of directory paths is carried over intact:
- The forward slash `/` is used as a path separator.
- An absolute path from the root of the document starts with a `/`.
- A relative path from a given location starts with anything else.
- A double period `..` indicates the parent of the current node.
- A single period `.` indicates the current node.
In an XHTML document, for example, the path `/h1/h2` would indicate an h2 element under an h1. (Recall that in XML, element names are case sensitive, so this kind of specification works much better in XHTML than it would in HTML.)
In a pattern-matching specification like XSLT, the specification `/h1/h2` selects all h2 elements that lie under an h1 element. To select a specific h2 element, square brackets `[]` are used for indexing (like those used for arrays, except that XPath positions start at 1). The path `/h1[4]/h2[5]` would therefore select the fifth h2 element under the fourth h1 element.
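The path and indexing notation above can be tried out with the `javax.xml.xpath` API. This is a small sketch under stated assumptions: the class name `PathIndexSketch`, the `eval` helper, and the sample document (including the enclosing `body` element, added so that several `h1` elements can coexist in one well-formed document) are hypothetical and not taken from the tutorial.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the document structure is an assumption for illustration.
final class PathIndexSketch {

    static final String XML =
            "<body>"
          + "<h1><h2>one-a</h2><h2>one-b</h2></h1>"
          + "<h1><h2>two-a</h2><h2>two-b</h2><h2>two-c</h2></h1>"
          + "</body>";

    // Evaluates an expression against the sample document; the result is returned as a string.
    static String eval(String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without an index, the path matches every h2 under every h1.
        System.out.println(eval("count(/body/h1/h2)"));
        // With indexes, it picks the third h2 under the second h1.
        System.out.println(eval("/body/h1[2]/h2[3]"));
        // ".." steps back up to the parent h1 before selecting its first h2.
        System.out.println(eval("/body/h1[2]/h2[3]/../h2[1]"));
    }
}
```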
Note: In XHTML, all element names are in lowercase. But as a matter of style, uppercase names are easier to read and easier to write about. (Although they are admittedly harder to write.) For the remainder of the XPath tutorial, then, and for the section on using XSLT transforms, all XML element names will be in uppercase. (Attribute names, on the other hand, will remain in lowercase.)
As you've seen, a name in an XPath specification refers to an element. To refer to an attribute, you prefix its name with an `@` sign. For example, `@type` refers to the `type` attribute of an element. Assuming you have an XML document with `list` elements, for example, the expression `list/@type` selects the `type` attribute of the `list` element.
Note: Because the expression does not begin with `/`, the reference specifies a list node relative to the current context, whatever position in the document that happens to be.
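To see how a relative expression such as `list/@type` depends on the current context, the sketch below evaluates it against two different context nodes. It assumes a hypothetical document with two `section` elements; the class name `RelativeAttributeSketch` and the method `listTypeInSection` are likewise illustrative, not part of the tutorial.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the section/list document is an assumption for illustration.
final class RelativeAttributeSketch {

    static final String XML =
            "<doc>"
          + "<section><list type='ordered'/></section>"
          + "<section><list type='unordered'/></section>"
          + "</doc>";

    // Evaluates the relative path list/@type with the n-th section (1-based) as the context node.
    static String listTypeInSection(int n) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The absolute path picks the context node.
            Node context = (Node) xpath.evaluate(
                    "/doc/section[" + n + "]", doc, XPathConstants.NODE);
            // The relative path is then resolved from that node.
            return xpath.evaluate("list/@type", context);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listTypeInSection(1));
        System.out.println(listTypeInSection(2));
    }
}
```

The same relative expression yields a different attribute value each time, because only the context node changes.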
Basic XPath Expressions
The full range of XPath expressions takes advantage of the wildcards, operators, and functions that XPath defines. You'll be learning more about those shortly. Here, we'll take a look at a couple of the most common XPath expressions, simply to introduce the concept.
The expression `@type="unordered"` specifies an attribute named type whose value is "unordered". And, as you've seen, an expression like `LIST/@type` specifies the `type` attribute of a `LIST` element.
But now for something a little different! In XPath, the square-bracket notation (`[]`) normally associated with indexing is extended to specify selection criteria. For example, the expression `LIST[@type="unordered"]` selects all `LIST` elements whose `type` value is "unordered".
Similar expressions exist for elements, where each element has an associated string-value. (You'll see how the string-value is determined for a complicated element in a little while. For now, we'll stick with super-simple elements that have a single text string.)
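The difference between selecting an attribute and using it as a selection criterion can be sketched as follows. The class name `AttributePredicateSketch`, the `count` helper, and the sample `DOC` document are assumptions for illustration. Inside the Java strings, the XPath literals use single quotes, which XPath treats the same as double quotes.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample document is an assumption for illustration.
final class AttributePredicateSketch {

    static final String XML =
            "<DOC>"
          + "<LIST type='ordered'/>"
          + "<LIST type='unordered'/>"
          + "<LIST type='unordered'/>"
          + "<LIST/>"
          + "</DOC>";

    // Returns how many nodes the given node-set expression selects in the sample document.
    static int count(String nodeSetExpr) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                    "count(" + nodeSetExpr + ")",
                    new InputSource(new StringReader(XML)),
                    XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/DOC/LIST"));                      // every LIST element
        System.out.println(count("/DOC/LIST/@type"));                // the type attributes themselves
        System.out.println(count("/DOC/LIST[@type='unordered']"));   // LIST elements filtered by type
    }
}
```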
Suppose you model what's going on in your organization with an XML structure that consists of `PROJECT` elements and `ACTIVITY` elements that have a text string with the project name, multiple `PERSON` elements to list the people involved and, optionally, a `STATUS` element that records the project's status. Here are some more examples that use the extended square-bracket notation:
- `/PROJECT[.="MyProject"]`--selects a `PROJECT` named "MyProject".
- `/PROJECT[STATUS]`--selects all projects that have a `STATUS` child element.
- `/PROJECT[STATUS="Critical"]`--selects all projects that have a `STATUS` child element with the string-value "Critical".
Combining Index Addresses
The XPath specification defines quite a few addressing mechanisms, and they can be combined in many different ways. As a result, XPath delivers a lot of expressive power for a relatively simple specification. This section illustrates two more interesting combinations:
- `LIST[@type="ordered"][3]`--selects all LIST elements of type "ordered", and returns the third.
- `LIST[3][@type="ordered"]`--selects the third LIST element, but only if it is of "ordered" type.
Note: Many more combinations of address operators are listed in section 2.5 of the XPath Specification. This is arguably the most useful section of the spec for defining an XSLT transform.
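The order of the two bracketed tests matters, as the following sketch suggests. The sample document, its `id` attributes (added only so the selected element can be identified), the class name `PredicateOrderSketch`, and the `eval` helper are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the id attributes exist only to label the LIST elements.
final class PredicateOrderSketch {

    static final String XML =
            "<DOC>"
          + "<LIST id='a' type='ordered'/>"
          + "<LIST id='b' type='unordered'/>"
          + "<LIST id='c' type='ordered'/>"
          + "<LIST id='d' type='ordered'/>"
          + "</DOC>";

    // Evaluates an expression against the sample document; an empty node-set yields "".
    static String eval(String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Filter by type first, then take the third of the remaining elements.
        System.out.println(eval("/DOC/LIST[@type='ordered'][3]/@id"));
        // Take the third LIST first, then keep it only if it is ordered.
        System.out.println(eval("/DOC/LIST[3][@type='ordered']/@id"));
        // The second LIST is unordered, so nothing is selected here.
        System.out.println("[" + eval("/DOC/LIST[2][@type='ordered']/@id") + "]");
    }
}
```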
Wildcards
By definition, an unqualified XPath expression selects a set of XML nodes that matches the specified pattern. For example, `/HEAD` matches all top-level `HEAD` entries, while `/HEAD[1]` matches only the first. Table 1 lists the wildcards that can be used in XPath expressions to broaden the scope of the pattern matching.
In the project database example, for instance, `/*/PERSON[.="Fred"]` matches the `PERSON` element for Fred in any `PROJECT` or `ACTIVITY` element.
Extended-Path Addressing
So far, all of the patterns we've seen have specified an exact number of levels in the hierarchy. For example, `/HEAD` specifies any `HEAD` element at the first level in the hierarchy, while `/*/*` specifies any element at the second level in the hierarchy. To specify an indeterminate level in the hierarchy, use a double forward slash (`//`). For example, the XPath expression `//PARA` selects all paragraph (`PARA`) elements in a document, wherever they may be found.
The `//` pattern can also be used within a path. So the expression `/HEAD/LIST//PARA` indicates all paragraph elements in a subtree that begins from `/HEAD/LIST`.
XPath Data Types and Operators
XPath expressions yield either a set of nodes, a string, a boolean (true/false value), or a number. Table 2 lists the operations that can be applied in an XPath expression.
Finally, expressions can be grouped in parentheses, so you don't have to worry about operator precedence. (Which, for those of you who are good at such things, is roughly the same as that shown in the table.)
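The four result types map onto the `XPathConstants` return types of the `javax.xml.xpath` API. The sketch below is an illustration under stated assumptions: the enclosing `ORG` element, the `NAME` child elements, the class name `ResultTypesSketch`, and the helper methods are hypothetical. Because of the extra `ORG` level, the wildcard path here needs one more `*` step than a path written against the bare project model.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the ORG wrapper and NAME elements are assumptions for illustration.
final class ResultTypesSketch {

    static final String XML =
            "<ORG>"
          + "<PROJECT><NAME>Alpha</NAME><PERSON>Fred</PERSON><STATUS>Critical</STATUS></PROJECT>"
          + "<ACTIVITY><NAME>Review</NAME><PERSON>Fred</PERSON><PERSON>Ann</PERSON></ACTIVITY>"
          + "</ORG>";

    private static Object eval(String expr, QName type) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // One helper per XPath result type.
    static int nodeCount(String expr) {
        return ((NodeList) eval(expr, XPathConstants.NODESET)).getLength();
    }

    static String string(String expr) {
        return (String) eval(expr, XPathConstants.STRING);
    }

    static boolean bool(String expr) {
        return (Boolean) eval(expr, XPathConstants.BOOLEAN);
    }

    static double number(String expr) {
        return (Double) eval(expr, XPathConstants.NUMBER);
    }

    public static void main(String[] args) {
        System.out.println(nodeCount("/*/*/PERSON[.='Fred']"));   // node-set, selected with wildcards
        System.out.println(nodeCount("//PERSON | //STATUS"));     // node-set union with |
        System.out.println(string("/ORG/PROJECT/NAME"));          // string-value of the first match
        System.out.println(bool("//STATUS = 'Critical' and count(//PERSON) > 2"));
        System.out.println(number("count(//PERSON) * 2 + 10 div 4"));
    }
}
```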
String-Value of an Element
Before going on, it's worthwhile to understand how the string-value of a more complex element is determined. We'll do that now.
The string-value of an element is the concatenation of all descendant text nodes, no matter how deep. So, for a "mixed-model" XML data element like this:
`<PARA>This paragraph contains a <B>bold</B> word</PARA>`
The string-value of `<PARA>` is "This paragraph contains a bold word". In particular, note that `<B>` is a child of `<PARA>` and that the text contained in all children is concatenated to form the string-value.
Also, it is worth understanding that the text in the abstract data model defined by XPath is fully normalized. So whether the XML structure contains the entity reference `&lt;` or "<" in a CDATA section, the element's string-value will contain the "<" character. Therefore, when generating HTML or XML with an XSLT stylesheet, occurrences of "<" will have to be converted to `&lt;` or enclosed in a CDATA section. Similarly, occurrences of "&" will need to be converted to `&amp;`.
XPath Functions
This section ends with an overview of the XPath functions. You can use XPath functions to select a collection of nodes in the same way that you would use an element-specification. Other functions return a string, a number, or a boolean value. For example, the expression `/PROJECT/text()` selects the text-node children of `PROJECT` nodes.
Many functions depend on the current context. In the example above, the context for each invocation of the `text()` function is the `PROJECT` node that is currently selected.
There are many XPath functions--too many to describe in detail here. This section provides a quick listing that shows the available XPath functions, along with a summary of what they do.
Note: Skim the list of functions to get an idea of what's there. For more information, see Section 4 of the XPath Specification.
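Before skimming the listing, it may help to see how `text()` differs from the string-value of an element. The sketch below reuses the mixed-content `PARA` element from the string-value discussion; the class name `TextVersusStringSketch` and the `eval` helper are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class built around the tutorial's mixed-content PARA element.
final class TextVersusStringSketch {

    static final String XML = "<PARA>This paragraph contains a <B>bold</B> word</PARA>";

    // Evaluates an expression against the sample document; the result is returned as a string.
    static String eval(String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // string() concatenates every descendant text node, including the one inside <B>.
        System.out.println("[" + eval("string(/PARA)") + "]");
        // text() selects only the text nodes directly under PARA; there are two of them.
        System.out.println(eval("count(/PARA/text())"));
        // Converting that node-set to a string keeps only its first node.
        System.out.println("[" + eval("/PARA/text()") + "]");
    }
}
```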
Node-set functions
Many XPath expressions select a set of nodes. In essence, they return a node-set. One function does that, too.
- `id(...)`--returns the node with the specified ID.
(Elements only have an ID when the document has a DTD, which specifies which attribute has the `ID` type.)
Positional functions
These functions return positionally-based numeric values.
- `last()`--returns the index of the last element. Ex: `/HEAD[last()]` selects the last `HEAD` element.
- `position()`--returns the index position. Ex: `/HEAD[position() <= 5]` selects the first five `HEAD` elements.
- `count(...)`--returns the count of elements. Ex: `/HEAD[count(HEAD)=0]` selects all `HEAD` elements that have no subheads.
String functions
These functions operate on or return strings.
- `concat(string, string, ...)`--concatenates the string values
- `starts-with(string1, string2)`--returns true if string1 starts with string2
- `contains(string1, string2)`--returns true if string1 contains string2
- `substring-before(string1, string2)`--returns the start of string1 before string2 occurs in it
- `substring-after(string1, string2)`--returns the remainder of string1 after string2 occurs in it
- `substring(string, idx)`--returns the substring from the index position to the end, where the index of the first char = 1
- `substring(string, idx, len)`--returns the substring from the index position, of the specified length
- `string-length()`--returns the size of the context-node's string-value
- `string-length(string)`--returns the size of the specified string
- `normalize-space()`--returns the normalized string-value of the current node (no leading or trailing whitespace, and sequences of whitespace characters converted to a single space)
- `normalize-space(string)`--returns the normalized string-value of the specified string
- `translate(string1, string2, string3)`--converts string1, replacing occurrences of characters in string2 with the corresponding character from string3
Note: XPath defines three ways to get the text of an element: `text()`, `string(object)`, and the string-value implied by an element name in an expression like this: `/PROJECT[PERSON="Fred"]`.
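A few of the string functions listed above can be exercised with the sketch below. The class name `StringFunctionsSketch`, the `eval` helper, and the sample `PERSON` element with its irregular whitespace are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample element is an assumption for illustration.
final class StringFunctionsSketch {

    static final String XML = "<PERSON>  Fred   Flintstone </PERSON>";

    // Evaluates an expression against the sample document; the result is returned as a string.
    static String eval(String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "normalize-space(/PERSON)",
            "concat('XP', 'ath')",
            "substring-before('key=value', '=')",
            "substring-after('key=value', '=')",
            "substring('abcdef', 3)",
            "substring('abcdef', 2, 3)",
            "translate('bar', 'abc', 'ABC')",
            "string-length('hello')",
            "starts-with(normalize-space(/PERSON), 'Fred')"
        };
        // Print each expression next to its result, with brackets to make whitespace visible.
        for (String expr : expressions) {
            System.out.println(expr + " -> [" + eval(expr) + "]");
        }
    }
}
```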
Boolean functions
These functions operate on or return boolean values:
- `not(...)`--negates the specified boolean value
- `true()`--returns true
- `false()`--returns false
- `lang(string)`--returns true if the language of the context node (specified by `xml:lang` attributes) is the same as (or a sublanguage of) the specified language. Ex: `lang("en")` is true for `<PARA xml:lang="en">...</PARA>`
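The sketch below tries out `lang()` and `not()` on a small hypothetical document. The class name `BooleanFunctionsSketch`, the `count` helper, and the sample `PARA` elements are assumptions; the parser is made namespace-aware on the assumption that this is needed for the `xml:lang` attribute to be recognized.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample document is an assumption for illustration.
final class BooleanFunctionsSketch {

    static final String XML =
            "<DOC>"
          + "<PARA xml:lang='en'>Hello</PARA>"
          + "<PARA xml:lang='en-GB'>Hullo</PARA>"
          + "<PARA xml:lang='fr'>Bonjour</PARA>"
          + "</DOC>";

    // Returns how many nodes the given node-set expression selects in the sample document.
    static int count(String nodeSetExpr) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Assumption: namespace awareness lets the processor see xml:lang as a namespaced attribute.
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + nodeSetExpr + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "en-GB" is a sublanguage of "en", so it is matched by lang('en') as well.
        System.out.println(count("//PARA[lang('en')]"));
        System.out.println(count("//PARA[not(lang('en'))]"));
        // A predicate that is always true keeps every PARA.
        System.out.println(count("//PARA[true()]"));
    }
}
```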
Numeric functions
These functions operate on or return numeric values.
- `sum(...)`--returns the sum of the numeric value of each node in the specified node-set
- `floor(N)`--returns the largest integer that is not greater than N
- `ceiling(N)`--returns the smallest integer that is not less than N
- `round(N)`--returns the integer that is closest to N
Conversion functions
These functions convert one data type to another.
- `string(...)`--returns the string value of a number, boolean, or node-set
- `boolean(...)`--returns the boolean-equivalent for a number, string, or node-set (a non-zero number, a non-empty node-set, and a non-empty string are all true)
- `number(...)`--returns the numeric equivalent of a boolean, string, or node-set (true is 1, false is 0, a string containing a number becomes that number, the string-value of a node-set is converted to a number)
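The numeric and conversion functions can be sketched in the same way. The class name `NumericConversionSketch`, its helpers, and the sample `ITEMS` document are assumptions for illustration; the prices were chosen so that their sum is exact in floating point.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample document is an assumption for illustration.
final class NumericConversionSketch {

    static final String XML = "<ITEMS><PRICE>1.5</PRICE><PRICE>2.25</PRICE></ITEMS>";

    private static Object eval(String expr, QName type) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(XML)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates the expression and returns the result as an XPath number.
    static double number(String expr) {
        return (Double) eval(expr, XPathConstants.NUMBER);
    }

    // Evaluates the expression and returns the result as an XPath string.
    static String string(String expr) {
        return (String) eval(expr, XPathConstants.STRING);
    }

    public static void main(String[] args) {
        System.out.println(number("sum(//PRICE)"));        // adds the numeric value of each PRICE
        System.out.println(number("floor(2.5)"));
        System.out.println(number("ceiling(2.5)"));
        System.out.println(number("ceiling(3)"));          // already an integer, so unchanged
        System.out.println(number("round(2.5)"));
        System.out.println(number("number('42') + 1"));    // string converted to a number
        System.out.println(number("number(true())"));      // true converts to 1
        System.out.println(string("string(boolean(''))")); // the empty string converts to false
        System.out.println(string("string(boolean(//PRICE))")); // a non-empty node-set is true
    }
}
```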
Namespace functions
These functions let you determine the namespace-characteristics of a node.
- `local-name()`--returns the name of the current node, minus the namespace prefix
- `local-name(...)`--returns the name of the first node in the specified node set, minus the namespace prefix
- `namespace-uri()`--returns the namespace URI from the current node
- `namespace-uri(...)`--returns the namespace URI from the first node in the specified node set
- `name()`--returns the qualified name (namespace prefix plus local name) of the current node
- `name(...)`--returns the qualified name (namespace prefix plus local name) of the first node in the specified node set
Summary
XPath operators, functions, wildcards, and node-addressing mechanisms can be combined in a wide variety of ways. The introduction you've had so far should give you a good head start at specifying the pattern you need for any particular purpose.