Paradigm | Query language |
---|---|
Developer | W3C |
First appeared | 1998 |
Stable release | 3.1 / March 21, 2017 |
Influenced by | XSLT, XPointer |
Influenced | XML Schema, XForms, JSONPath |
XPath (XML Path Language) is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium (W3C) in 1999, [1] and can be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. Support for XPath exists in applications that support XML, such as web browsers, and many programming languages.
The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria. [2] [3] In popular use (though not in the official specification), an XPath expression is often referred to simply as "an XPath".
Originally motivated by a desire to provide a common syntax and behavior model between XPointer and XSLT, subsets of the XPath query language are used in other W3C specifications such as XML Schema, XForms and the Internationalization Tag Set (ITS).
XPath has been adopted by a number of XML processing libraries and tools, many of which also offer CSS Selectors, another W3C standard, as a simpler alternative to XPath.
There are several versions of XPath in use. XPath 1.0 was published in 1999, XPath 2.0 in 2007 (with a second edition in 2010), XPath 3.0 in 2014, and XPath 3.1 in 2017. However, XPath 1.0 is still the version that is most widely available. [1]
XPath 2.0 and later include a for expression that is a cut-down version of the "FLWOR" expressions in XQuery. It is possible to describe the language by listing the parts of XQuery that it leaves out: the main examples are the query prolog, element and attribute constructors, the remainder of the "FLWOR" syntax, and the typeswitch expression.
The most important kind of expression in XPath is a location path. A location path consists of a sequence of location steps. Each location step has three components: an axis, a node test, and zero or more predicates.
An XPath expression is evaluated with respect to a context node. An Axis Specifier such as 'child' or 'descendant' specifies the direction to navigate from the context node. The node test and the predicate are used to filter the nodes specified by the axis specifier: For example, the node test 'A' requires that all nodes navigated to must have label 'A'. A predicate can be used to specify that the selected nodes have certain properties, which are specified by XPath expressions themselves.
The XPath syntax comes in two flavors. The abbreviated syntax is more compact and allows XPaths to be written and read easily using intuitive and, in many cases, familiar characters and constructs. The full syntax is more verbose, but allows for more options to be specified, and is more descriptive if read carefully.
The compact notation allows many defaults and abbreviations for common cases. Given source XML containing at least
<A><B><C/></B></A>
the simplest XPath takes a form such as
/A/B/C
that selects C elements that are children of B elements that are children of the A element that forms the outermost element of the XML document. The XPath syntax is designed to mimic URI (Uniform Resource Identifier) and Unix-style file path syntax.
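As a minimal sketch, the simple path can be tried with Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The choice of library is an assumption made for illustration. ElementTree evaluates paths relative to an element, so the outermost A element serves as the context node here.

```python
import xml.etree.ElementTree as ET

# The sample document from the text; the parsed root is the outermost A element.
root = ET.fromstring("<A><B><C/></B></A>")

# With A as the context node, the steps B/C correspond to the
# absolute path /A/B/C described above.
matches = root.findall("./B/C")
tags = [elem.tag for elem in matches]
print(tags)  # ['C']
```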
More complex expressions can be constructed by specifying an axis other than the default 'child' axis, a node test other than a simple name, or predicates, which can be written in square brackets after any step. For example, the expression
A//B/*[1]
selects the first child ('*[1]'), whatever its name, of every B element that itself is a child or other, deeper descendant ('//') of an A element that is a child of the current context node (the expression does not begin with a '/'). The predicate [1] binds more tightly than the / operator. To select the first node selected by the expression A//B/*, write (A//B/*)[1]. Note also, index values in XPath predicates (technically, 'proximity positions' of XPath node sets) start from 1, not 0 as common in languages like C and Java.
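The difference between A//B/*[1] and (A//B/*)[1] can be sketched in Python. This is an illustration only: the document is hypothetical, the paths A//B and A//B/* are evaluated with the standard xml.etree.ElementTree module, and the positional step is spelled out with ordinary list indexing so that the binding of the predicate is explicit.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; <root> plays the role of the context node.
ctx = ET.fromstring(
    "<root><A><B><x/><y/></B><D><B><z/></B></D></A></root>"
)

# A//B/*[1]: the predicate applies within the last step, so take the
# first element child of each B found below an A child of the context node.
# (Python index 0 corresponds to XPath position 1.)
first_per_b = [b[0].tag for b in ctx.findall("A//B") if len(b)]

# (A//B/*)[1]: gather every child of every such B in document order,
# then keep only the first node of the combined result.
all_children = ctx.findall("A//B/*")
first_overall = [all_children[0].tag] if all_children else []

print(first_per_b)    # ['x', 'z']
print(first_overall)  # ['x']
```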
In the full, unabbreviated syntax, the two examples above would be written
/child::A/child::B/child::C
child::A/descendant-or-self::node()/child::B/child::*[position()=1]
Here, in each step of the XPath, the axis (e.g. child or descendant-or-self) is explicitly specified, followed by :: and then the node test, such as A or node() in the examples above.
The second expression can also be written in a shorter, partly abbreviated form: A//B/*[position()=1]
Axis specifiers indicate navigation direction within the tree representation of the XML document. The axes available are: [b]
Full syntax | Abbreviated syntax | Notes |
---|---|---|
ancestor | | |
ancestor-or-self | | |
attribute | @ | @abc is short for attribute::abc |
child | | xyz is short for child::xyz |
descendant | | |
descendant-or-self | // | // is short for /descendant-or-self::node()/ |
following | | |
following-sibling | | |
namespace | | |
parent | .. | .. is short for parent::node() |
preceding | | |
preceding-sibling | | |
self | . | . is short for self::node() |
As an example of using the attribute axis in abbreviated syntax, //a/@href selects the attribute called href in a elements anywhere in the document tree. The expression . (an abbreviation for self::node()) is most commonly used within a predicate to refer to the currently selected node. For example, h3[.='See also'] selects an element called h3 in the current context, whose text content is See also.
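Both expressions can be approximated with Python's standard xml.etree.ElementTree module, as a sketch under stated assumptions. The fragment below is hypothetical. ElementTree returns elements rather than attribute nodes, so the href values are read from the matched elements, and the [.='text'] predicate requires Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML-like fragment used only for illustration.
body = ET.fromstring(
    "<body>"
    "<h3>Overview</h3><p><a href='intro.html'>Intro</a></p>"
    "<h3>See also</h3><p><a href='help.php'>Help</a><a>No link</a></p>"
    "</body>"
)

# //a/@href: select the a elements that carry an href attribute
# anywhere below the context node, then read the attribute value.
hrefs = [a.get("href") for a in body.findall(".//a[@href]")]

# h3[.='See also']: h3 children of the context node whose text
# content equals the given string.
see_also = body.findall("h3[.='See also']")

print(hrefs)          # ['intro.html', 'help.php']
print(len(see_also))  # 1
```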
Node tests may consist of specific node names or more general expressions. In the case of an XML document in which the namespace prefix gs has been defined, //gs:enquiry will find all the enquiry elements in that namespace, and //gs:* will find all elements, regardless of local name, in that namespace.
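A short sketch of namespace-qualified node tests, using Python's standard xml.etree.ElementTree module. The namespace URIs and the document are hypothetical; only the prefix gs comes from the text. The prefix used in the expression is resolved through a mapping passed to the engine, and the wildcard form is assumed to need Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and document.
GS = "http://example.com/gs"
doc = ET.fromstring(
    f"<root xmlns:gs='{GS}' xmlns:other='http://example.com/other'>"
    "<gs:enquiry/><gs:reply/><other:enquiry/><enquiry/>"
    "</root>"
)
prefixes = {"gs": GS}  # prefix-to-URI mapping handed to the XPath engine

# //gs:enquiry: enquiry elements in the gs namespace only.
enquiries = doc.findall(".//gs:enquiry", prefixes)

# //gs:*: every element in the gs namespace, whatever its local name.
in_namespace = doc.findall(".//gs:*", prefixes)

print(len(enquiries))     # 1
print(len(in_namespace))  # 2
```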
Other node test formats are:
- comment() finds an XML comment node, e.g. <!-- Comment -->
- text() finds a node of type text, e.g. the hello in <k>hello<m> world</m></k>
- processing-instruction() finds XML processing instructions such as <?php echo $a; ?>. In this case, processing-instruction('php') would match.
Predicates, written as expressions in square brackets, can be used to filter a node-set according to some condition. For example, a returns a node-set (all the a elements which are children of the context node), and a[@href='help.php'] keeps only those elements having an href attribute with the value help.php.
There is no limit to the number of predicates in a step, and they need not be confined to the last step in an XPath. They can also be nested to any depth. Paths specified in predicates begin at the context of the current step (i.e. that of the immediately preceding node test) and do not alter that context. All predicates must be satisfied for a match to occur.
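A minimal sketch of attribute predicates, assuming Python's standard xml.etree.ElementTree module and a hypothetical nav fragment. It shows a single predicate and then two predicates on the same step, both of which must hold for an element to match.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; <nav> is the context node.
nav = ET.fromstring(
    "<nav>"
    "<a href='help.php' target='_blank'>Help</a>"
    "<a href='index.php'>Home</a>"
    "<a href='help.php'>Help again</a>"
    "</nav>"
)

# a[@href='help.php']: a children whose href attribute equals help.php.
help_links = nav.findall("a[@href='help.php']")

# Two predicates on one step: an element matches only if both hold.
help_with_target = nav.findall("a[@href='help.php'][@target]")

print([a.text for a in help_links])        # ['Help', 'Help again']
print([a.text for a in help_with_target])  # ['Help']
```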
When the value of the predicate is numeric, it is syntactic sugar for comparing against the node's position in the node-set (as given by the function position()). So p[1] is shorthand for p[position()=1] and selects the first p element child, while p[last()] is shorthand for p[position()=last()] and selects the last p child of the context node.
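The numeric shorthand can be illustrated with Python's standard xml.etree.ElementTree module, which accepts p[1] and p[last()] but not the long position()= form. The section fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; <section> is the context node.
section = ET.fromstring(
    "<section><h2>Title</h2><p>one</p><p>two</p><p>three</p></section>"
)

first = section.findall("p[1]")      # XPath meaning: p[position()=1]
last = section.findall("p[last()]")  # XPath meaning: p[position()=last()]

print([p.text for p in first])  # ['one']
print([p.text for p in last])   # ['three']
```

The h2 element does not affect the count, because positions are taken among the nodes selected by the step (the p children), not among all children.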
In other cases, the value of the predicate is automatically converted to a Boolean. When the predicate evaluates to a node-set, the result is true when the node-set is non-empty. Thus p[@x] selects those p elements that have an attribute named x.
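A small sketch of the existence test, again assuming Python's standard xml.etree.ElementTree module and a hypothetical document. An attribute with an empty value still exists, so it still yields a non-empty node-set and the predicate is true.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the third p has an x attribute with an empty value.
doc = ET.fromstring("<d><p x='1'>a</p><p>b</p><p x=''>c</p></d>")

# p[@x]: p children for which the attribute node-set @x is non-empty.
with_x = [p.text for p in doc.findall("p[@x]")]
print(with_x)  # ['a', 'c']
```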
A more complex example: the expression a[/html/@lang='en'][@href='help.php'][1]/@target selects the value of the target attribute of the first a element among the children of the context node that has its href attribute set to help.php, provided the document's html top-level element also has a lang attribute set to en. The reference to an attribute of the top-level element in the first predicate affects neither the context of other predicates nor that of the location step itself.
Predicate order is significant if predicates test the position of a node. Each predicate takes a node-set and returns a (potentially) smaller node-set. So a[1][@href='help.php'] will find a match only if the first a child of the context node satisfies the condition @href='help.php', while a[@href='help.php'][1] will find the first a child that satisfies this condition.
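The effect of predicate order can be made explicit by emulating each predicate as a function from a list to a smaller list. This is a plain-Python illustration with hypothetical data and helper names, not a call into an XPath engine.

```python
# Each dict stands in for an a element child of the context node (hypothetical data).
links = [
    {"href": "index.php", "text": "Home"},
    {"href": "help.php", "text": "Help"},
    {"href": "help.php", "text": "More help"},
]

def first(nodes):
    """Emulate the predicate [1]: keep the node at proximity position 1."""
    return nodes[:1]

def href_is_help(nodes):
    """Emulate the predicate [@href='help.php']."""
    return [n for n in nodes if n["href"] == "help.php"]

# a[1][@href='help.php']: take the first a, then test its href.
position_then_filter = href_is_help(first(links))

# a[@href='help.php'][1]: filter by href, then take the first survivor.
filter_then_position = first(href_is_help(links))

print([n["text"] for n in position_then_filter])  # []
print([n["text"] for n in filter_then_position])  # ['Help']
```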
XPath 1.0 defines four data types: node-sets (sets of nodes with no intrinsic order), strings, numbers and Booleans.
The available operators are:
- The /, // and [...] operators, used in path expressions, as described above.
- A union operator, |, which forms the union of two node-sets.
- Boolean operators and and or, and a function not()
- Arithmetic operators +, -, *, div (divide), and mod
- Comparison operators =, !=, <, >, <=, >=
The function library includes node-set, string, Boolean and number functions.
Some of the more commonly useful functions are detailed below. [c]
- starts-with(s1, s2) returns true if s1 starts with s2
- contains(s1, s2) returns true if s1 contains s2
- substring("ABCDEF",2,3) returns BCD.
- substring-before("1999/04/01","/") returns 1999
- substring-after("1999/04/01","/") returns 04/01
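For readers more familiar with Python, the string functions above can be approximated as follows. This is a rough sketch: the helper names are hypothetical, the arguments are assumed to be integers and non-empty separators, and XPath's rounding rules for non-integer arguments are not reproduced.

```python
def substring(s, start, length):
    """Approximate XPath substring() for integer arguments (start is 1-based)."""
    begin = max(start, 1)      # positions before 1 do not exist
    end = start + length       # first position that is no longer included
    return s[begin - 1:max(end, begin) - 1]

def substring_before(s, sep):
    """Text before the first occurrence of sep, or '' if sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s, sep):
    """Text after the first occurrence of sep, or '' if sep does not occur."""
    _, found, tail = s.partition(sep)
    return tail if found else ""

print("1999/04/01".startswith("1999"))    # starts-with(s1, s2): True
print("/" in "1999/04/01")                # contains(s1, s2): True
print(substring("ABCDEF", 2, 3))          # BCD
print(substring_before("1999/04/01", "/"))  # 1999
print(substring_after("1999/04/01", "/"))   # 04/01
```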
Expressions can be created inside predicates using the operators: =, !=, <=, <, >= and >. Boolean expressions may be combined with brackets () and the Boolean operators and and or as well as the not() function described above. Numeric calculations can use *, +, -, div and mod. Strings can consist of any Unicode characters.
//item[@price>2*@discount]
selects items whose price attribute is greater than twice the numeric value of their discount attribute.
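The comparison in this predicate can be emulated in plain Python for engines that lack arithmetic in predicates, such as the standard xml.etree.ElementTree module. The catalog and the helper number() are hypothetical. The helper is only a rough stand-in for XPath's conversion to a number, in which missing or non-numeric values become NaN and every comparison with NaN is false.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; the last item has no discount attribute.
catalog = ET.fromstring(
    "<catalog>"
    "<item name='pen' price='5' discount='1'/>"
    "<item name='ink' price='4' discount='3'/>"
    "<item name='pad' price='9'/>"
    "</catalog>"
)

def number(value):
    """Rough stand-in for XPath number(): missing or non-numeric becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")

# Emulates //item[@price>2*@discount].
selected = [
    item.get("name")
    for item in catalog.iter("item")
    if number(item.get("price")) > 2 * number(item.get("discount"))
]
print(selected)  # ['pen']
```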
Entire node-sets can be combined ('unioned') using the vertical bar character |. Node sets that meet one or more of several conditions can be found by combining the conditions inside a predicate with 'or'.
v[x or y] | w[z]
will return a single node-set consisting of all the v elements that have x or y child-elements, as well as all the w elements that have z child-elements, that were found in the current context.
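The union expression can be emulated with Python's standard xml.etree.ElementTree module, whose XPath subset has neither the or operator nor |, so both are written out in Python here. The document and its id attributes are hypothetical, and the result is listed in document order for readability.

```python
import xml.etree.ElementTree as ET

# Hypothetical context node with v and w children.
ctx = ET.fromstring(
    "<ctx>"
    "<v id='1'><x/></v><w id='2'/><v id='3'/>"
    "<w id='4'><z/></w><v id='5'><y/></v>"
    "</ctx>"
)

# v[x or y]: v children with an x child or a y child.
v_hits = [
    v for v in ctx.findall("v")
    if v.find("x") is not None or v.find("y") is not None
]

# w[z]: w children with a z child.
w_hits = ctx.findall("w[z]")

# Union: each node appears once, listed here in document order.
chosen = {id(elem) for elem in v_hits + w_hits}
union = [child for child in ctx if id(child) in chosen]

print([elem.get("id") for elem in union])  # ['1', '4', '5']
```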
Given a sample XML document
<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
The XPath expression
/Wikimedia/projects/project/@name
selects name attributes for all projects, and
/Wikimedia//editions
selects all editions of all projects, and
/Wikimedia/projects/project/editions/edition[@language='English']/text()
selects addresses of all English Wikimedia projects (text of all edition elements where language attribute is equal to English). And the following
/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()
selects addresses of all Wikipedias (text of all edition elements that exist under project element with a name attribute of Wikipedia).
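The four expressions can be approximated with Python's standard xml.etree.ElementTree module, as a sketch. The sample is shortened to two editions per project to keep the listing brief. ElementTree evaluates paths relative to the root element (Wikimedia) and has no @name or text() steps, so attribute values and text are read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document (two editions per project).
SAMPLE = """
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
"""
root = ET.fromstring(SAMPLE)  # context node: the Wikimedia element

# /Wikimedia/projects/project/@name
names = [p.get("name") for p in root.findall("./projects/project")]

# /Wikimedia//editions
editions = root.findall(".//editions")

# /Wikimedia/projects/project/editions/edition[@language='English']/text()
english = [
    e.text
    for e in root.findall(
        "./projects/project/editions/edition[@language='English']"
    )
]

# /Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()
wikipedias = [
    e.text
    for e in root.findall(
        "./projects/project[@name='Wikipedia']/editions/edition"
    )
]

print(names)          # ['Wikipedia', 'Wiktionary']
print(len(editions))  # 2
print(english)        # ['en.wikipedia.org', 'en.wiktionary.org']
print(wikipedias)     # ['en.wikipedia.org', 'de.wikipedia.org']
```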
The Java package javax.xml.xpath has been part of Java standard edition since Java 5 [8] via the Java API for XML Processing. Technically this is an XPath API rather than an XPath implementation, and it lets the programmer select a specific implementation that conforms to the interface.
XPath is increasingly used to express constraints in schema languages for XML.