Glossary Term
Character encoding
Introduction and History of Character Encoding
- Character encoding is the process of assigning numbers to graphical characters.
- It allows characters to be stored, transmitted, and transformed using digital computers.
- A code point is the numerical value assigned to a character within an encoding.
- Code space, code page, and character map are terms for the full collection of code points in an encoding.
- Modern computer systems allow more elaborate character codes like Unicode.
- Early character codes were limited to a subset of characters used in written languages.
- Morse code, introduced in the 1840s, was the earliest well-known electrically transmitted character code.
- Various character encoding systems were developed, including Morse code, Baudot code, ASCII, and Unicode.
- The Baudot code, a five-bit encoding created by Émile Baudot in 1870, was later modified and standardized as International Telegraph Alphabet No. 2 (ITA2) in 1930.
- ASCII, released in 1963, addressed the shortcomings of previous codes and was widely adopted.
- Punch card data encoding was invented by Herman Hollerith in the late 19th century.
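- IBM's Binary Coded Decimal (BCD) was a six-bit encoding scheme that extended an earlier four-bit numeric encoding to include alphabetic and special characters.
- BCD was the precursor of EBCDIC (Extended Binary Coded Decimal Interchange Code), an eight-bit encoding scheme.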
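- Researchers in the 1980s faced a dilemma: accommodating additional characters seemed to require more bits per character, but for users of the small Latin alphabet those extra bits wasted then-scarce computing resources.
- The compromise that developed into Unicode was to stop assuming that each character corresponds directly to one fixed bit sequence. Characters are first mapped to abstract numbers called code points, which can then be represented in several ways, including variable-length encodings.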
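
The idea that a code point is an abstract number, distinct from the bytes used to store it, can be sketched in Python. This is a minimal illustration that assumes Python 3 and its built-in codecs; the helper name `describe` is hypothetical and not part of any standard.

```python
def describe(ch):
    """Return the code point of one character and its bytes in three encodings."""
    return {
        "code_point": "U+{:04X}".format(ord(ch)),
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "utf-32-be": ch.encode("utf-32-be").hex(),
    }

# ord() and chr() convert between a character and its code point.
assert chr(ord("A")) == "A"

# One code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE), three byte forms.
info = describe("\u00e9")
print(info)
# {'code_point': 'U+00E9', 'utf-8': 'c3a9', 'utf-16-be': '00e9', 'utf-32-be': '000000e9'}
```

The code point stays the same in every row; only the byte representation changes with the chosen encoding.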
Terminology
- Informally, character encoding, character map, character set, and code page are often used interchangeably.
- A character is a minimal unit of text with semantic value.
- A character set is a collection of elements used to represent text, such as the Latin alphabet or Greek alphabet.
- A coded character set is a character set in which each character is mapped to a unique number.
- The distinction between these terms has become important with the emergence of more sophisticated character encodings.
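
To make the difference between a character set and a coded character set concrete, here is a toy sketch in Python. The three-letter set and the variable names are illustrative assumptions; the numbers are borrowed from Unicode via `ord()`, but any unique assignment would fit the definition.

```python
# A character set: a collection of characters with no numbers attached.
character_set = {"A", "B", "C"}

# A coded character set: each character mapped to a unique number.
# The numbers here come from Unicode via ord(); this choice is only
# for illustration.
coded_character_set = {ch: ord(ch) for ch in sorted(character_set)}

# Because the numbers are unique, the mapping can be inverted.
decode_table = {number: ch for ch, number in coded_character_set.items()}

print(coded_character_set)  # {'A': 65, 'B': 66, 'C': 67}
print(decode_table[66])     # B
```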
Importance of Character Encoding
- Character encoding enables worldwide interchange of text in electronic form.
- It allows for the representation of most characters used in many written languages.
- Unicode has become the most widely adopted character encoding standard, largely replacing earlier character encodings.
- The history of character codes reflects an evolving need to convey character-based symbolic information by machine, over a distance.
- The evolution of character codes has been influenced by the capabilities and limitations of early machines.
Code Pages and Code Units
- Code page is a historical name for a coded character set.
- Originally, a code page referred to a specific page number in the IBM standard character set manual, where a particular character encoding was defined.
- Other vendors, including Microsoft, SAP, and Oracle Corporation, have their own sets of code pages.
- Well-known code page suites are Windows (based on Windows-1252) and IBM/DOS (based on code page 437).
- The term code page is often used to refer to character encodings in general.
- A code unit is the fixed-size unit of bits an encoding uses as its building block; its size varies depending on the encoding.
- US-ASCII consists of 7-bit code units.
- UTF-8, EBCDIC, and GB 18030 consist of 8-bit code units.
- UTF-16 consists of 16-bit code units.
- UTF-32 consists of 32-bit code units.
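
As a rough sketch of both ideas in this section, the Python snippet below decodes the same byte under two code pages and counts the code units a short string needs in three Unicode encodings. It assumes Python 3's built-in `cp1252`, `cp437`, and UTF codecs; the helper `code_unit_count` and the sample string are hypothetical.

```python
def code_unit_count(text, encoding, unit_bytes):
    """Number of code units used to encode text, given the unit width in bytes."""
    return len(text.encode(encoding)) // unit_bytes

# The same byte value means different characters under different code pages.
raw = bytes([0xE9])
as_windows = raw.decode("cp1252")  # Windows-1252 maps 0xE9 to U+00E9
as_dos = raw.decode("cp437")       # code page 437 maps 0xE9 to another character

# Four characters: 'a', U+00E9, U+20AC (euro sign), U+1F600 (an emoji).
sample = "a\u00e9\u20ac\U0001F600"
counts = {
    "UTF-8": code_unit_count(sample, "utf-8", 1),       # 8-bit code units
    "UTF-16": code_unit_count(sample, "utf-16-le", 2),  # 16-bit code units
    "UTF-32": code_unit_count(sample, "utf-32-le", 4),  # 32-bit code units
}
print(counts)  # {'UTF-8': 10, 'UTF-16': 5, 'UTF-32': 4}
```

The explicit `-le` codec names are used so that Python does not prepend a byte order mark, which would otherwise be counted as an extra code unit.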
Code Points, Characters, and Unicode Encoding Model
- A code point is represented by a sequence of code units.
- UTF-8 maps code points to a sequence of one, two, three, or four code units.
- UTF-16 uses a single code unit for code points below U+10000 and a surrogate pair (two code units) for code points with a value of U+10000 or higher.
- UTF-32 represents every code point as a single code unit.
- GB 18030 commonly uses multiple code units per code point, because its code units are small (8 bits).
- What constitutes a character varies between character encodings.
- Letters with diacritics can be encoded as a single unified character or as separate characters that combine into a single glyph.
- Handling glyph variants is a choice made when constructing a character encoding.
- Some writing systems, like Arabic and Hebrew, must accommodate graphemes that are joined in different ways in different contexts but still represent the same semantic character.
- Unicode and ISO/IEC 10646 constitute a unified standard for character encoding.
- The Unicode encoding model starts from an abstract character repertoire (ACR): the full set of abstract characters that a system supports.
- A coded character set (CCS) maps characters to code points.
- A character encoding form (CEF) maps code points to code units.
- A character encoding scheme (CES) maps code units to octets for storage or transmission.
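
The layers above can be sketched in Python for a single character. This is a simplified illustration, not a conforming encoder: it assumes valid input and skips checks such as rejecting surrogate code points. The function names are hypothetical, and the results are compared against Python's built-in codecs. The last lines show the diacritic case, using the standard `unicodedata` module.

```python
import unicodedata

def utf8_code_units(cp):
    """CEF: encode one code point as one to four 8-bit UTF-8 code units."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]

def utf16_code_units(cp):
    """CEF: encode one code point as one 16-bit code unit or a surrogate pair."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def serialize_utf16(units, byteorder):
    """CES: turn 16-bit code units into octets in the chosen byte order."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)

cp = ord("\U0001F600")                  # CCS: character -> code point U+1F600
units = utf16_code_units(cp)            # CEF: code point -> [0xD83D, 0xDE00]
octets = serialize_utf16(units, "big")  # CES: code units -> octets

assert octets == "\U0001F600".encode("utf-16-be")
assert bytes(utf8_code_units(cp)) == "\U0001F600".encode("utf-8")

# A letter with a diacritic: one precomposed code point, or a base letter
# followed by a combining mark. Both render as the same glyph.
composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301
print(len(composed), len(decomposed))  # 1 2
```

Keeping the code-unit step separate from the byte-order step mirrors the CEF/CES split: the surrogate pair is the same in both byte orders, and only the octet sequence differs.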