Overview and Applications of XML
– XML is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.
– It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
– XML emphasizes simplicity, generality, and usability across the Internet.
– XML is widely used for representing arbitrary data structures, such as those used in web services.
– XML is commonly used for data interchange over the Internet.
– Many document formats, including RSS, Atom, Office Open XML, and XHTML, use XML syntax.
– XML is used as the base language for communication protocols like SOAP and XMPP.
– Industry data standards like Health Level 7 and OpenTravel Alliance are based on XML.
– XML underpins various publishing formats and is used extensively in publishing.
Key Terminology and Characters in XML
– An XML document is a string of characters. Almost every legal Unicode character can appear in one; Null (U+0000) is never allowed.
– XML tags categorize and structurally organize information.
– XML schema (XSD) provides necessary metadata for interpreting and validating XML.
– XML attributes have a single value and can appear at most once on each element.
– XML declaration describes information about the XML document.
– XML documents consist of characters from the Unicode repertoire.
– XML includes facilities for identifying the encoding of Unicode characters and for expressing characters that cannot be used directly.
– Unicode code points within specific ranges are valid in XML 1.0 documents.
– XML 1.1 extends the set of allowed characters and restricts the use of certain control characters.
– The code point U+0000 (Null) is not permitted in any XML 1.0 or XML 1.1 document.
– The Unicode character set can be encoded into bytes using different encodings.
– XML allows the use of any Unicode-defined encoding, and of any other encoding whose characters also appear in Unicode.
– Well-known encodings include UTF-8 and UTF-16.
– Every XML processor must accept UTF-8 and UTF-16; a document with no encoding declaration and no BOM (Byte Order Mark) is read as UTF-8.
– Various ISO/IEC 8859 encodings are subsets of the Unicode character set.
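The "specific ranges" of code points mentioned above can be made concrete. The following is a minimal sketch, assuming the character ranges of the XML 1.0 `Char` production; the names `XML10_CHAR_RANGES`, `is_xml10_char`, and `first_illegal_char` are illustrative and not part of any library.

```python
# Code point ranges allowed by the XML 1.0 "Char" production.
XML10_CHAR_RANGES = (
    (0x9, 0xA),            # tab and line feed
    (0xD, 0xD),            # carriage return
    (0x20, 0xD7FF),        # most of the Basic Multilingual Plane
    (0xE000, 0xFFFD),      # skips the surrogate block and U+FFFE/U+FFFF
    (0x10000, 0x10FFFF),   # supplementary planes
)


def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch may appear in an XML 1.0 document."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


def first_illegal_char(text: str) -> int:
    """Return the index of the first character not allowed in XML 1.0, or -1."""
    for i, ch in enumerate(text):
        if not is_xml10_char(ch):
            return i
    return -1
```

A check like this is useful before serializing untrusted strings, since control characters such as U+0001 would otherwise make the resulting XML 1.0 document ill-formed.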
Escaping, Comments, and International Use in XML
– XML provides escape facilities for including problematic characters.
– The characters < and & are syntax markers and may not appear literally in content outside a CDATA section; they are written as the escapes &lt; and &amp; instead.
- Some character encodings only support a subset of Unicode, limiting the representation of certain characters.
- It may not be possible to type certain characters on a keyboard.
- Some characters have visually indistinguishable glyphs, causing confusion.
- Comments can appear anywhere in a document outside other markup.
- Comments cannot be nested and cannot contain the string '--'.
- Entity and character references are not recognized within comments.
- Characters outside the document encoding's character set cannot be represented in comments.
- XML supports the direct use of almost any Unicode character.
- Chinese, Armenian, and Cyrillic characters can be included in XML documents.
- Displaying such characters correctly still depends on the rendering software and the fonts available to it.
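The escaping rules above can be sketched with Python's standard library. This is only an illustration: the sample strings and variable names are made up for the example, and `escape` from `xml.sax.saxutils` handles `&`, `<`, and `>` by default.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom
from xml.sax.saxutils import escape

# Escape markup characters before placing text inside an element.
raw = 'if a < b && name == "x"'
escaped = escape(raw)
round_trip = ET.fromstring(f"<t>{escaped}</t>").text

# Numeric character references (decimal and hexadecimal) name a code point
# directly, which helps when the encoding or keyboard cannot express it.
refs = ET.fromstring("<t>&#20013;&#x4E2D;</t>").text

# Inside a CDATA section, < and & are taken literally.
cdata = ET.fromstring("<t><![CDATA[a < b && c]]></t>").text

# References are not recognized inside comments: the text stays as written.
doc = minidom.parseString("<t><!-- &lt; stays literal --></t>")
comment_text = doc.documentElement.firstChild.data
```

Here `round_trip` equals `raw` again after parsing, while `comment_text` still contains the four characters `&lt;` rather than `<`.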
Syntactical Correctness, Schemas, and Validation in XML
- An XML document must be well-formed, satisfying syntax rules.
- Only legal Unicode characters should be used in the document.
- Special syntax characters like < and & should only appear when performing markup roles.
- Tags must be correctly nested, and tag names are case-sensitive: a start tag and its end tag must match exactly.
- Tag names have certain restrictions and cannot contain certain characters.
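- Beyond being well-formed, an XML document may also be valid: it references a Document Type Definition (DTD), and its elements and attributes are declared in that DTD and follow the DTD's grammar.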
- XML processors can be validating or non-validating.
- A processor that discovers a validity error must be able to report it, but may continue normal processing.
- Schema languages like DTDs and XML Schema constrain the elements and attributes in a document.
- XML Schema (XSD) is more powerful than DTDs and allows for detailed constraints.
- RELAX NG is a standard for validating XML documents.
- It has a simpler definition and validation framework than XML Schema.
- RELAX NG schemas can be written in XML or a more compact non-XML syntax.
- Schematron is a language for making assertions about patterns in XML documents.
- It is a standard for rule-based validation.
- Schematron typically uses XPath expressions.
- DSDL is a multi-part ISO/IEC standard that includes different schema languages.
- It includes RELAX NG, Schematron, and languages for defining datatypes and character repertoire constraints.
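The well-formedness rules listed earlier in this section can be exercised with a non-validating parser. The sketch below assumes Python's standard `xml.etree.ElementTree`, which checks well-formedness only; it does not validate against a DTD, XML Schema, RELAX NG, or Schematron, for which a third-party library would be needed. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "nested correctly": "<a><b/></a>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<Item></item>",
    "bare less-than": "<a>1 < 2</a>",
    "bare ampersand": "<a>fish & chips</a>",
    "two root elements": "<a/><b/>",
}

# Map each sample's label to the parser's verdict.
results = {label: is_well_formed(text) for label, text in samples.items()}
```

Only the first sample is accepted; each of the others violates one of the syntax rules above, so the parser rejects the whole document rather than attempting to recover.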
Related Specifications, Programming Interfaces, and XML History
- XML namespaces enable the use of different vocabularies in a single document without naming collisions.
- XML Base defines the xml:base attribute for resolving relative URI references.
- XML Information Set (Infoset) is an abstract data model for describing XML documents.
- XSL is a family of languages for transforming and rendering XML documents.
- XPath is a non-XML language for addressing components of an XML document.
- APIs for XML processing fall into different categories, including stream-oriented, tree-traversal, data binding, and declarative transformation.
- Stream-oriented APIs like SAX and StAX are fast and memory-efficient.
- Tree-traversal APIs like DOM provide convenience for programmers but require more memory.
- XML data binding automates the translation between XML and programming-language objects.
- Declarative transformation languages like XSLT and XQuery are used for transforming and querying XML data.
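- XML has appeared as a first-class data type in some programming languages, for example through the ECMAScript for XML (E4X) extension to ECMAScript/JavaScript.
- XML is an application profile of SGML (ISO 8879).
- XML 1.0 was first defined in 1998 and is currently in its fifth edition.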
- XML 1.1 published on February 4, 2004, contains features to make XML easier to use.
- XML 1.0 and XML 1.1 have undergone minor revisions.
- XML 1.0 is widely implemented and recommended for general use.
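The difference between stream-oriented and tree-traversal APIs, and the role of namespaces, can be sketched with Python's standard library. The document `DOC`, its namespace URIs, and the class name `TitleCollector` are invented for this example.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<catalog xmlns="http://example.com/catalog" xmlns:m="http://example.com/meta">
  <book m:id="b1"><title>First</title></book>
  <book m:id="b2"><title>Second</title></book>
</catalog>"""


class TitleCollector(xml.sax.ContentHandler):
    """Stream-oriented: reacts to events and keeps only what it needs."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles

# Tree-traversal: the whole document is held in memory and can be queried
# with a limited XPath subset; prefixes are mapped to namespace URIs.
NS = {"c": "http://example.com/catalog", "m": "http://example.com/meta"}
root = ET.fromstring(DOC)
tree_titles = [t.text for t in root.findall("c:book/c:title", NS)]
ids = [b.get("{http://example.com/meta}id") for b in root.findall("c:book", NS)]
```

Both approaches yield the same titles. The SAX handler never holds more than one title in memory at a time, whereas the tree version trades memory for the convenience of path queries and random access.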
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
Extensible Markup Language | |
---|---|
Abbreviation | XML |
Status | Published, W3C recommendation |
Year started | 1996 |
First published | February 10, 1998 |
Latest version | 1.1 (2nd ed.), September 29, 2006 |
Organization | World Wide Web Consortium (W3C) |
Editors | Tim Bray, Jean Paoli, Michael Sperberg-McQueen, Eve Maler, François Yergeau, John W. Cowan |
Base standards | SGML |
Related standards | W3C XML Schema |
Domain | Serialization |
Website | www |

Filename extension | .xml |
---|---|
Internet media type | application/xml, text/xml |
Uniform Type Identifier (UTI) | public.xml |
UTI conformation | public.text |
Magic number | <?xml |
Developed by | World Wide Web Consortium |
Type of format | Markup language |
Extended from | SGML |
Extended to | Numerous languages, including XHTML, RSS, Atom, and KML |
Open format? | Yes |
Free format? | Yes |
The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.
Several schema systems exist to aid in the definition of XML-based languages, while programmers have developed many application programming interfaces (APIs) to aid the processing of XML data.