XML provides several language features for use in defining custom markup languages: XML declaration, elements and attributes, character references and CDATA sections, namespaces, and comments and processing instructions. You will learn about these language features in this section.
XML Declaration
An XML document usually begins with the XML declaration, which is special markup telling an XML parser that the document is XML. The absence of the XML declaration in the earlier listing shows that this special markup isn't mandatory. When the XML declaration is present, nothing can appear before it.
The XML declaration minimally looks like <?xml version="1.0"?>, in which the nonoptional version attribute identifies the version of the XML specification to which the document conforms. The initial version of this specification (1.0) was introduced in 1998 and is widely implemented.
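The rules above (the declaration is optional, but nothing may precede it) can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree parser; the recipe element name is hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True when the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# 'recipe' is a hypothetical element name used only for this sketch.
WITH_DECLARATION = '<?xml version="1.0"?><recipe/>'
WITHOUT_DECLARATION = '<recipe/>'
LEADING_SPACE = ' <?xml version="1.0"?><recipe/>'  # a space precedes the declaration

for label, doc in [
    ("with declaration", WITH_DECLARATION),
    ("without declaration", WITHOUT_DECLARATION),
    ("space before declaration", LEADING_SPACE),
]:
    print(label, "->", "accepted" if parses(doc) else "rejected")
```

Under these assumptions, the first two documents are accepted and the third is rejected, because even a single space before the declaration violates the rule that the declaration must come first.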
Note
The World Wide Web Consortium (W3C), which maintains XML, released version 1.1 in 2004. This version mainly supports the use of line-ending characters used on EBCDIC platforms (see http://en.wikipedia.org/wiki/EBCDIC ) and the use of scripts and characters that are absent from Unicode 3.2 (see http://en.wikipedia.org/wiki/Unicode ). Unlike XML 1.0, XML 1.1 isn't widely implemented and should be used only by those needing its unique features.
XML supports Unicode, which means that XML documents consist entirely of characters taken from the Unicode character set. The document's characters are encoded into bytes for storage or transmission, and the encoding is specified via the XML declaration's optional encoding attribute. One common encoding is UTF-8 (see http://en.wikipedia.org/wiki/UTF-8 ), which is a variable-length encoding of the Unicode character set. UTF-8 is a strict superset of ASCII (see http://en.wikipedia.org/wiki/ASCII ), which means that pure ASCII text files are also UTF-8 documents.
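The two properties of UTF-8 mentioned above (variable length and ASCII compatibility) can be observed directly. The following is a minimal sketch; the helper name utf8_length and the sample characters are illustrative choices rather than anything taken from the text.

```python
def utf8_length(character: str) -> int:
    """Return the number of bytes UTF-8 uses to encode a single character."""
    return len(character.encode("utf-8"))


# Sample characters chosen for illustration: ASCII, Latin-1, BMP, and beyond the BMP.
for character in ["A", "ç", "€", "😀"]:
    print(repr(character), utf8_length(character), "byte(s)")

# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "pure ASCII text"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
```

The byte counts grow from one to four as the characters move further from the ASCII range, while the ASCII-only string encodes to the same bytes either way.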
Note
In the absence of the XML declaration or when the XML declaration's encoding attribute isn't present, an XML parser typically looks for a special character sequence at the start of a document to determine the document's encoding. This character sequence is known as the byte order mark (BOM) and is created by an editor program (such as Microsoft Windows Notepad) when it saves the document according to UTF-8 or some other encoding. For example, the hexadecimal sequence EF BB BF signifies UTF-8 as the encoding. Similarly, FE FF signifies UTF-16 big endian (see https://en.wikipedia.org/wiki/UTF-16 ), FF FE signifies UTF-16 little endian, 00 00 FE FF signifies UTF-32 big endian (see https://en.wikipedia.org/wiki/UTF-32 ), and FF FE 00 00 signifies UTF-32 little endian. UTF-8 is assumed when no BOM is present.
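The BOM lookup described in this note can be sketched as a simple prefix check. This is an illustration under stated assumptions, not how any particular parser implements detection; the function name sniff_encoding and the table name BOMS are hypothetical. The four-byte UTF-32 marks are tested before the two-byte UTF-16 marks because FF FE 00 00 also begins with FF FE.

```python
import codecs

# (BOM bytes, encoding name) pairs from the note, longest marks first so that
# UTF-32 little endian (FF FE 00 00) is not mistaken for UTF-16 little endian (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 big endian"),
    (b"\xff\xfe\x00\x00", "UTF-32 little endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 big endian"),
    (b"\xff\xfe", "UTF-16 little endian"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess a document's encoding from its leading bytes, defaulting to UTF-8."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "UTF-8"  # no BOM present


# Build sample byte streams with explicit BOMs from the standard codecs module.
utf16_le_doc = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")
utf32_le_doc = codecs.BOM_UTF32_LE + "<doc/>".encode("utf-32-le")
print(sniff_encoding(utf16_le_doc))
print(sniff_encoding(utf32_le_doc))
print(sniff_encoding(b"<doc/>"))
```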
If you'll never use characters apart from the ASCII character set, you can probably forget about the encoding attribute. However, when your native language isn't English or when you're called on to create XML documents that include non-ASCII characters, you need to properly specify encoding. For example, when your document contains ASCII plus characters from a non-English Western European language (such as ç, the c with a cedilla used in French, Portuguese, and other languages), you might want to choose ISO-8859-1 as the encoding attribute's value; the document will probably have a smaller size when encoded in this manner than when encoded with UTF-8. Listing 1-2 shows you the resulting XML declaration.
Le Fabuleux Destin d'Amélie Poulain
français
Listing 1-2. An Encoded Document Containing Non-ASCII Characters
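To make the size argument concrete, the following sketch encodes a small document containing the two text values from the listing. The element names movie, name, and language are hypothetical stand-ins, since the listing's markup is not reproduced here. Only the document body is measured, so that the differing lengths of the encoding names do not affect the comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; only the two text values come from the listing.
BODY = (
    "<movie>"
    "<name>Le Fabuleux Destin d'Amélie Poulain</name>"
    "<language>français</language>"
    "</movie>"
)
DECLARATION = '<?xml version="1.0" encoding="ISO-8859-1"?>'

# Store the document as ISO-8859-1 bytes and let the parser decode it
# according to the declaration's encoding attribute.
latin1_document = (DECLARATION + BODY).encode("iso-8859-1")
root = ET.fromstring(latin1_document)
print(root.find("name").text)
print(root.find("language").text)

# Compare the size of the body under each encoding.
body_latin1_size = len(BODY.encode("iso-8859-1"))
body_utf8_size = len(BODY.encode("utf-8"))
print(body_latin1_size, body_utf8_size)
```

Each of the two non-ASCII characters (é and ç) occupies one byte in ISO-8859-1 and two bytes in UTF-8, so the saving in a document this small is only two bytes; it grows with the number of accented characters.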
The final attribute that can appear in the XML declaration is standalone. This optional attribute, which is relevant only with DTDs (discussed later), indicates whether external markup declarations affect the information passed from an XML processor (a parser) to the application. Its value defaults to no, implying that there are, or may be, such declarations. A yes value indicates that there are no such declarations. For more information, see the article "The standalone pseudo-attribute is only relevant if a DTD is used" ( www.xmlplease.com/xml/xmlquotations/standalone ).
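The three declaration attributes (version, encoding, and standalone) can be inspected with a short sketch. It assumes Python's xml.parsers.expat module, whose XmlDeclHandler callback reports standalone as 1 for yes, 0 for no, and -1 when the attribute is omitted; the helper name read_declaration and the doc element are hypothetical.

```python
import xml.parsers.expat


def read_declaration(document: bytes):
    """Return (version, encoding, standalone) as reported by expat,
    or None when the document has no XML declaration."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once if and when the parser encounters an XML declaration.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: seen.append((version, encoding, standalone))
    )
    parser.Parse(document, True)
    return seen[0] if seen else None


print(read_declaration(b'<?xml version="1.0"?><doc/>'))
print(read_declaration(b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><doc/>'))
print(read_declaration(b"<doc/>"))
```

Note that expat reports only whether standalone was written; applying the default of no when the attribute is absent is left to the application.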
Elements and Attributes
Following the XML declaration is a hierarchical (tree) structure of elements, where an element is a portion of the document delimited by a start tag (the element's name enclosed in angle brackets) and an end tag (the same name preceded by a forward slash inside the angle brackets), or is an empty-element tag (a standalone tag whose name ends with a forward slash (/)). Start tags and end tags surround content and possibly other markup, whereas empty-element tags don't surround anything. The accompanying figure reveals the XML document's tree structure.
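The tree structure of elements can be explored with a brief sketch, assuming Python's xml.etree.ElementTree. The document and every element name in it (recipe, title, ingredients, ingredient, break) are hypothetical and exist only to show start tags, end tags, and one empty-element tag.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: nested start/end tags plus one empty-element tag.
DOCUMENT = """<?xml version="1.0"?>
<recipe>
  <title>Grilled Cheese Sandwich</title>
  <ingredients>
    <ingredient>bread slice</ingredient>
    <ingredient>cheese slice</ingredient>
  </ingredients>
  <break/>
</recipe>"""


def outline(element, depth=0):
    """Return the element names in document order, indented two spaces per tree level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print("\n".join(outline(root)))

# An empty-element tag surrounds nothing: no text and no child elements.
empty = root.find("break")
print(empty.text, len(empty))
```

The indentation of the printed outline mirrors the nesting of the tags, which is the hierarchy that a tree diagram of the document would depict.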