Glossary Term

XML

Overview and Applications of XML

- XML is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.
- It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
- XML emphasizes simplicity, generality, and usability across the Internet.
- XML is widely used for representing arbitrary data structures, such as those used in web services, and for data interchange over the Internet.
- Many document formats, including RSS, Atom, Office Open XML, and XHTML, use XML syntax.
- XML is used as the base language for communication protocols like SOAP and XMPP.
- Industry data standards like Health Level 7 and OpenTravel Alliance are based on XML.
- XML underpins various publishing formats and is used extensively in publishing.

Key Terminology and Characters in XML

- XML documents consist of characters from the Unicode repertoire, and nearly every legal Unicode character (except Null) can appear in an XML document.
- XML tags categorize and structurally organize information.
- An XML schema (XSD) provides the metadata needed to interpret and validate XML.
- An XML attribute has a single value and can appear at most once on each element.
- The XML declaration, which may open a document, states information about it such as the XML version and character encoding.
- XML includes facilities for identifying the encoding of Unicode characters and for expressing characters that cannot be used directly.
- In XML 1.0, the valid code points are U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF.
- XML 1.1 extends the set of allowed characters to include most control characters, but requires many of them (U+0001 to U+001F other than tab, line feed, and carriage return, and U+007F to U+009F other than U+0085) to be written as character references.
- The code point U+0000 (Null) is not permitted in any XML 1.0 or 1.1 document.
- The Unicode character set can be encoded into bytes using different encodings.
- XML allows the use of any Unicode-defined encoding and of any other encoding whose characters also appear in Unicode.
- Well-known encodings include UTF-8 and UTF-16.
- Every XML processor must accept UTF-8 and UTF-16. UTF-8 is the default when no encoding is declared, and a UTF-8 document may, but need not, begin with a BOM (Byte Order Mark).
- The character repertoires of the various ISO/IEC 8859 encodings are subsets of the Unicode character set.

Escaping, Comments, and International Use in XML

- XML provides escape facilities for including characters that are problematic to include directly.
- Characters like < and & are syntax markers and may not appear literally in content outside a CDATA section; they are written as &lt; and &amp; instead.
- Escaping is also useful because some character encodings support only a subset of Unicode, limiting which characters can be represented directly.
- It may not be possible to type certain characters on the author's keyboard.
- Some characters have visually indistinguishable glyphs, which can cause confusion when they are entered directly.
- Comments can appear anywhere in a document outside other markup.
- Comments cannot be nested and cannot contain the string '--'.
- Entity and character references are not recognized within comments.
- As a result, characters outside the document encoding's character set cannot be represented in comments.
- XML supports the direct use of almost any Unicode character.
- Chinese, Armenian, and Cyrillic characters, for example, can be included in XML documents.
- Displaying such characters correctly depends on font and rendering support in the viewing software.

Syntactical Correctness, Schemas, and Validation in XML

- An XML document must be well-formed, meaning that it satisfies the syntax rules of the specification.
- Only legal Unicode characters may be used in the document.
- Special syntax characters like < and & may appear only when performing their markup roles.
- Start-tags and end-tags must be correctly nested and must match exactly; tag names are case-sensitive.
- Tag names cannot contain characters such as !"#$%&'()*+,/;<=>?@[\]^`{|}~ or a space, and cannot begin with "-", ".", or a digit.
- A well-formed document is also valid if it references a Document Type Definition (DTD) and its elements and attributes are declared in that DTD and follow its grammatical rules.
- XML processors can be validating or non-validating.
- A processor that discovers a validity error must be able to report it, but may continue normal processing.
- Schema languages like DTDs and XML Schema constrain the elements and attributes that may appear in a document.
- XML Schema (XSD) is more powerful than DTDs and allows for detailed constraints.
- RELAX NG is a schema language for validating XML documents.
- RELAX NG has a simpler definition and validation framework than XML Schema.
- RELAX NG schemas can be written in XML or in a more compact non-XML syntax.
- Schematron is a language for making assertions about patterns in XML documents, and is a standard for rule-based validation.
- Schematron typically uses XPath expressions.
- DSDL (Document Schema Definition Languages) is a multi-part ISO/IEC standard that includes different schema languages.
- DSDL includes RELAX NG, Schematron, and languages for defining datatypes and character repertoire constraints.

Related Specifications, Programming Interfaces, and XML History

- XML namespaces enable the use of different vocabularies in a single document without naming collisions.
- XML Base defines the xml:base attribute for resolving relative URI references.
- The XML Information Set (Infoset) is an abstract data model for describing XML documents.
- XSL is a family of languages for transforming and rendering XML documents.
- XPath is a non-XML language for addressing components of an XML document.
- APIs for XML processing fall into different categories, including stream-oriented, tree-traversal, data binding, and declarative transformation.
- Stream-oriented APIs like SAX and StAX are fast and memory-efficient.
- Tree-traversal APIs like DOM are convenient for programmers but require more memory.
- XML data binding automates the translation between XML and programming-language objects.
- Declarative transformation languages like XSLT and XQuery are used for transforming and querying XML data.
- XML has appeared as a first-class data type in some programming languages.
- XML originated as an application profile of SGML.
- XML 1.0 was first defined in 1998 and is currently in its fifth edition.
- XML 1.1, first published on February 4, 2004, contains features intended to make XML easier to use in certain cases.
- Both XML 1.0 and XML 1.1 have undergone minor revisions.
- XML 1.0 is widely implemented and recommended for general use.
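
Several of the points above (escaping < and &, well-formedness, and the difference between tree-traversal and stream-oriented APIs) can be sketched with Python's standard library. This is only an illustration under stated assumptions: the sample document and the function names are invented for the example, and `iterparse` is used as a stand-in for the stream-oriented style rather than as SAX or StAX itself.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<library>"
    b'<book id="b1"><title>XML Basics</title></book>'
    b'<book id="b2"><title>Schemas &amp; Validation</title></book>'
    b"</library>"
)


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def titles_via_tree(data):
    """Tree-traversal style: build the whole tree in memory, then walk it."""
    root = ET.fromstring(data)
    return [title.text for title in root.iter("title")]


def titles_via_stream(data):
    """Stream-oriented style: handle each element as soon as it is complete."""
    titles = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "title":
            titles.append(elem.text)
        elem.clear()  # drop content that has already been processed
    return titles


if __name__ == "__main__":
    # A raw "<" or "&" in content breaks well-formedness; escaping repairs it.
    raw = "1 < 2 & 3"
    print(is_well_formed("<a>" + raw + "</a>"))
    print(is_well_formed("<a>" + escape(raw) + "</a>"))
    # Incorrectly nested tags are rejected as well.
    print(is_well_formed("<a><b></a></b>"))
    # Both API styles recover the same data.
    print(titles_via_tree(SAMPLE))
    print(titles_via_stream(SAMPLE))
```

Both functions return the same titles. The practical difference is that the tree version holds the entire document in memory, while the streaming version can discard each element once it has been handled, which matters mainly for large inputs.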