Glossary Term
XML
Overview and Applications of XML
- XML is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.
- It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
- XML emphasizes simplicity, generality, and usability across the Internet.
- XML is widely used for representing arbitrary data structures, such as those used in web services.
- XML is commonly used for data interchange over the Internet.
- Many document formats, including RSS, Atom, Office Open XML, and XHTML, use XML syntax.
- XML is used as the base language for communication protocols like SOAP and XMPP.
- Industry data standards like Health Level 7 and OpenTravel Alliance are based on XML.
- XML underpins various publishing formats and is used extensively in publishing.
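As a rough illustration of "storing, transmitting, and reconstructing" data, the sketch below round-trips a small record through XML using Python's standard xml.etree.ElementTree module. The record and the element names (article, title, tags, tag) are invented for this example and do not come from any real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; all field and element names are made up for illustration.
record = {"id": "42", "title": "XML basics", "tags": ["markup", "data"]}


def to_xml(rec):
    """Serialize the record into an XML string."""
    root = ET.Element("article", attrib={"id": rec["id"]})
    ET.SubElement(root, "title").text = rec["title"]
    tags = ET.SubElement(root, "tags")
    for name in rec["tags"]:
        ET.SubElement(tags, "tag").text = name
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the record from its XML form."""
    root = ET.fromstring(text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "tags": [tag.text for tag in root.iter("tag")],
    }


xml_text = to_xml(record)
restored = from_xml(xml_text)
print(xml_text)
print(restored == record)
```

Note that XML itself carries only text: the id survives as the string "42", and any typing has to be agreed on separately, for instance through a schema.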
Key Terminology and Characters in XML
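- XML documents consist entirely of Unicode characters; apart from a small number of excluded code points (most control characters, the surrogates, U+FFFE, and U+FFFF), any Unicode character can appear in an XML document.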
- XML tags categorize and structurally organize information.
- XML schema (XSD) provides necessary metadata for interpreting and validating XML.
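- An XML attribute is a name-value pair inside a start tag or empty-element tag; it has a single value, and a given attribute name can appear at most once on each element.
- The XML declaration, when present, opens the document and states the XML version and, optionally, the character encoding, for example <?xml version="1.0" encoding="UTF-8"?>.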
- XML documents consist of characters from the Unicode repertoire.
- XML includes facilities for identifying the encoding of Unicode characters and for expressing characters that cannot be used directly.
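- In XML 1.0, the valid code points are U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF.
- XML 1.1 extends the allowed set to include the remaining characters in the range U+0001 to U+001F, but requires most C0 and C1 control characters to be written in escaped form.
- The code point U+0000 (Null) is not permitted in any XML 1.0 or XML 1.1 document.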
- The Unicode character set can be encoded into bytes using different encodings.
- XML allows the use of any of the Unicode-defined encodings, and of any other encoding whose characters also appear in Unicode.
- Well-known encodings include UTF-8 and UTF-16.
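- Every XML processor must support UTF-8 and UTF-16; UTF-8 is assumed when a document has neither an encoding declaration nor a BOM (Byte Order Mark), and a BOM is optional in UTF-8 documents.
- The character repertoires of the various ISO/IEC 8859 encodings are subsets of the Unicode character set, so these encodings can also be used for XML.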
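The character and encoding rules in this section can be sketched in Python. The helper names is_xml10_char and encode_document are hypothetical; the code point ranges are those of the XML 1.0 Char production, and serialization relies on the standard xml.etree.ElementTree module (the xml_declaration argument assumes Python 3.8 or later).

```python
import xml.etree.ElementTree as ET

# Code point ranges allowed by the XML 1.0 "Char" production.
XML10_RANGES = [
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml10_char(ch):
    """Return True if the single character ch may appear in an XML 1.0 document."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML10_RANGES)


def encode_document(root, encoding):
    """Serialize to bytes with an XML declaration that names the encoding."""
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


greeting = ET.Element("greeting")
greeting.text = "Привет"  # Cyrillic text, outside the ASCII repertoire

utf8_bytes = encode_document(greeting, "utf-8")
utf16_bytes = encode_document(greeting, "utf-16")
# ASCII cannot hold Cyrillic, so the serializer falls back to
# numeric character references such as &#1055;.
ascii_bytes = encode_document(greeting, "us-ascii")

for data in (utf8_bytes, utf16_bytes, ascii_bytes):
    # The parser detects the encoding from the BOM or the declaration.
    print(len(data), ET.fromstring(data).text)
```

All three byte sequences differ, yet each parses back to the same text, which is the point of separating the Unicode character set from its byte encodings.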
Escaping, Comments, and International Use in XML
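- XML provides escape facilities (predefined entity references, numeric character references, and CDATA sections) for characters that are problematic to include directly.
- The characters < and & are syntax markers and must not appear literally in content outside a CDATA section; they are written as &lt; and &amp; instead.
- Some character encodings support only a subset of Unicode, so characters outside that subset must be written as numeric character references.
- An author may be unable to type a given character on the keyboard at hand; a numeric reference such as &#x4E2D; works from any keyboard.
- Some characters have glyphs that are visually indistinguishable from those of other characters; writing them as references makes the intended code point explicit.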
- Comments can appear anywhere in a document outside other markup.
- Comments cannot be nested and cannot contain the string '--'.
- Entity and character references are not recognized within comments.
- Characters outside the document encoding's character set cannot be represented in comments.
- XML supports the direct use of almost any Unicode character.
- Chinese, Armenian, and Cyrillic characters can be included in XML documents.
- Whether such characters display correctly depends on the font and rendering support of the viewing software, not on XML itself.
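The escaping and comment rules above can be tried out with Python's standard library. This is a minimal sketch: the sample document, its element names, and the is_well_formed helper are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing the two syntax markers that must be escaped in content.
raw = 'if a < b && name == "O\'Brien"'
escaped = escape(raw)  # replaces &, < and > with entity references

doc = """<note>
  <!-- a comment: the reference &lt; is not expanded here -->
  <escaped>a &lt; b &amp; c</escaped>
  <numeric>&#x4E2D;&#25991;</numeric>
  <cdata><![CDATA[a < b & c]]></cdata>
</note>"""

root = ET.fromstring(doc)
# Entity references, numeric character references, and CDATA sections
# all resolve to plain text once parsed.
texts = {child.tag: child.text for child in root}


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A double hyphen inside a comment makes the document ill-formed.
bad_comment_ok = is_well_formed("<a><!-- bad -- comment --></a>")

print(escaped)
print(texts)
print(bad_comment_ok)
```

The escaped and cdata elements yield the same parsed text, which shows that CDATA is only an alternative spelling for content, not a different kind of data.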
Syntactical Correctness, Schemas, and Validation in XML
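- An XML document must be well-formed, meaning that it satisfies the syntax rules of the XML specification; a conforming parser must not continue normal processing of a document that violates them.
- The document contains only properly encoded, legal Unicode characters.
- The special syntax characters < and & appear only in their markup roles.
- Start tags, end tags, and empty-element tags are correctly nested, with none missing or overlapping, and tag names are case-sensitive, so a start tag and its end tag must match exactly.
- Tag names cannot contain a space or any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~ and cannot begin with "-", ".", or a digit.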
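- A well-formed XML document is additionally valid if it references a Document Type Definition (DTD) and its elements and attributes follow the declarations and grammatical rules of that DTD.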
- XML processors can be validating or non-validating.
- A processor that discovers a validity error must be able to report it, but may continue normal processing.
- Schema languages like DTDs and XML Schema constrain the elements and attributes in a document.
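- XML Schema (XSD) is more expressive than DTDs: it uses an XML-based syntax, supports namespaces, and offers a richer datatyping system for constraining element and attribute content.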
- RELAX NG is a standard for validating XML documents.
- It has a simpler definition and validation framework than XML Schema.
- RELAX NG schemas can be written in XML or a more compact non-XML syntax.
- Schematron is a language for making assertions about patterns in XML documents.
- It is a standard for rule-based validation.
- Schematron typically uses XPath expressions.
- DSDL is a multi-part ISO/IEC standard that includes different schema languages.
- It includes RELAX NG, Schematron, and languages for defining datatypes and character repertoire constraints.
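The well-formedness rules listed in this section can be checked with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree; the check_well_formed helper and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)


# Hypothetical samples, one per rule.
samples = {
    "ok": "<a><b>text</b></a>",
    "bad nesting": "<a><b>text</a></b>",
    "case mismatch": "<a><B>text</b></a>",
    "bare ampersand": "<a>fish & chips</a>",
    "name starts with digit": "<1a>text</1a>",
}

results = {label: check_well_formed(text)[0] for label, text in samples.items()}

for label, text in samples.items():
    ok, message = check_well_formed(text)
    print(f"{label}: {'well-formed' if ok else message}")
```

This covers well-formedness only. The standard library parser used here is non-validating, so checking a document against a DTD, XML Schema, RELAX NG, or Schematron schema would need a validating processor, typically a third-party library.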
Related Specifications, Programming Interfaces, and XML History
- XML namespaces enable the use of different vocabularies in a single document without naming collisions.
- XML Base defines the xml:base attribute for resolving relative URI references.
- XML Information Set (Infoset) is an abstract data model for describing XML documents.
- XSL is a family of languages for transforming and rendering XML documents.
- XPath is a non-XML language for addressing components of an XML document.
- APIs for XML processing fall into different categories, including stream-oriented, tree-traversal, data binding, and declarative transformation.
- Stream-oriented APIs like SAX and StAX are fast and memory-efficient.
- Tree-traversal APIs like DOM provide convenience for programmers but require more memory.
- XML data binding automates the translation between XML and programming-language objects.
- Declarative transformation languages like XSLT and XQuery are used for transforming and querying XML data.
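- XML has appeared as a first-class data type in some programming languages, for example through the ECMAScript for XML (E4X) extension to JavaScript and XML literals in Scala.
- XML originated as an application profile of SGML (ISO 8879).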
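- XML 1.0 was first defined in 1998 and is currently in its fifth edition.
- XML 1.1, first published on February 4, 2004, contains features intended to make XML easier to use in certain cases.
- Both XML 1.0 and XML 1.1 have undergone minor revisions since first publication, issued as new editions rather than new version numbers.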
- XML 1.0 is widely implemented and recommended for general use.
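The difference between the stream-oriented and tree-traversal API styles described earlier in this section can be sketched with Python's standard xml.sax and xml.dom.minidom modules. The sample document and the TitleCollector, titles_via_sax, and titles_via_dom names are hypothetical and serve only to contrast the two styles.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document.
DOC = (
    "<library>"
    "<book id='1'><title>A</title></book>"
    "<book id='2'><title>B</title></book>"
    "</library>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Stream-oriented: react to parse events and keep only what is needed."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_via_sax(text):
    """Collect titles without building a tree in memory."""
    handler = TitleCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


def titles_via_dom(text):
    """Tree-traversal: load the whole document, then query it."""
    dom = minidom.parseString(text)
    return [node.firstChild.data for node in dom.getElementsByTagName("title")]


print(titles_via_sax(DOC))
print(titles_via_dom(DOC))
```

The SAX version needs more code and explicit state, but never holds more than one title in memory; the DOM version is shorter because the full tree is available for random access.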