Introduction and History of Character Encoding
– Character encoding is the process of assigning numbers to graphical characters.
– It allows characters to be stored, transmitted, and transformed using digital computers.
– Code points are the numerical values that make up a character encoding.
– Code space, code page, and character map are terms used to describe the collective code points.
– Modern computer systems allow more elaborate character codes like Unicode.
– Early character codes were limited to a subset of characters used in written languages.
– Morse code, introduced in the 1840s, was the earliest well-known electrically transmitted character code.
– Various character encoding systems were developed, including Morse code, Baudot code, ASCII, and Unicode.
– The Baudot code, created by Émile Baudot in 1870, was later standardized as ITA2.
– ASCII, released in 1963, addressed the shortcomings of previous codes and was widely adopted.
– Punch card data encoding was invented by Herman Hollerith in the late 19th century.
– IBM developed Binary Coded Decimal (BCD) as a six-bit encoding scheme.
– BCD extended earlier four-bit numeric encodings to include alphabetic and special characters, and was the precursor of IBM's eight-bit EBCDIC.
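– Researchers in the 1980s faced a dilemma: accommodating additional characters required more bits per character, but those extra bits wasted then-scarce computing resources for users of the small Latin character set.
– The compromise that became Unicode was to map each character first to an abstract number called a code point, which can then be represented in different ways, including variable-length encodings.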
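The idea of a code point can be illustrated with a minimal Python sketch. It relies only on the built-ins `ord` and `chr`; the helper name `code_points` is an assumption made for this example, not a standard function.

```python
def code_points(text: str) -> list[str]:
    """Return the code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Latin capital A, e with acute accent, and the euro sign.
sample = "A\u00e9\u20ac"
print(code_points(sample))  # ['U+0041', 'U+00E9', 'U+20AC']

# chr() goes the other way: from a code point back to a character.
print(chr(0x41))  # A
```

The code point is independent of how the character is later stored; the number of bits or bytes used depends on the encoding chosen.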
Terminology
– Informally, character encoding, character map, character set, and code page are often used interchangeably.
– A character is a minimal unit of text with semantic value.
– A character set is a collection of elements used to represent text, such as the Latin alphabet or Greek alphabet.
– A coded character set is a character set in which each character is mapped to a unique number.
– The distinction between these terms has become important with the emergence of more sophisticated character encodings.
Importance of Character Encoding
– Character encoding enables worldwide interchange of text in electronic form.
– It allows for the representation of most characters used in many written languages.
– Unicode has become the widely adopted encoding system, replacing earlier character encodings.
– The development of character encodings has been driven by the evolving need to convey character-based symbolic information by machine over a distance.
– The evolution of character codes has been influenced by the capabilities and limitations of early machines.
Code Pages and Code Units
– Code page is a historical name for a coded character set.
– Originally, a code page referred to a specific page number in the IBM standard character set manual.
– Other vendors, including Microsoft, SAP, and Oracle Corporation, have their own sets of code pages.
– Well-known code page suites are Windows (based on Windows-1252) and IBM/DOS (based on code page 437).
– The term code page is often used to refer to character encodings in general.
– Code unit size varies depending on the encoding.
– US-ASCII consists of 7-bit code units.
– UTF-8, EBCDIC, and GB 18030 consist of 8-bit code units.
– UTF-16 consists of 16-bit code units.
– UTF-32 consists of 32-bit code units.
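As a rough illustration of the points above, the following Python sketch decodes one byte under two code pages and counts the code units a short string needs in each UTF form. The helper name `count_code_units` and the sample string are assumptions chosen for this example; the codec names (`cp1252`, `cp437`, `utf-16-le`, `utf-32-le`) are those of Python's standard library.

```python
def count_code_units(text: str) -> dict[str, int]:
    """Count the code units needed to encode `text` in each UTF form."""
    return {
        # The "-le" codecs are used so that no byte order mark is added.
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# The same byte means different characters under different code pages.
raw = b"\xe9"
print(raw.decode("cp1252"))  # e with acute accent (U+00E9)
print(raw.decode("cp437"))   # Greek capital theta (U+0398)

# A, e with acute accent, euro sign, and an emoji outside the 16-bit range.
sample = "A\u00e9\u20ac\U0001F600"
print(count_code_units(sample))  # {'utf-8': 10, 'utf-16': 5, 'utf-32': 4}
```

The four characters need 1, 2, 3, and 4 code units in UTF-8, while UTF-32 always needs exactly one per code point.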
Code Points, Characters, and Unicode Encoding Model
– A code point is represented by a sequence of code units.
– UTF-8 maps code points to a sequence of one, two, three, or four code units.
– UTF-16 represents code points of U+10000 or higher as a surrogate pair of two 16-bit code units.
– UTF-32 represents every code point as a single code unit.
– GB 18030 commonly uses multiple code units per code point.
– What constitutes a character varies between character encodings.
– Letters with diacritics can be encoded as a single unified character or as separate characters that combine into a single glyph.
– Handling glyph variants is a choice made when constructing a character encoding.
– Some writing systems, such as Arabic and Hebrew, must accommodate graphemes that are joined in different ways in different contexts yet represent the same semantic character.
– Unicode and ISO/IEC 10646 constitute a unified standard for character encoding.
– In the Unicode encoding model, an abstract character repertoire (ACR) is the full set of abstract characters that a system supports.
– A coded character set (CCS) maps characters to code points.
– A character encoding form (CEF) maps code points to code units.
– A character encoding scheme (CES) maps code units to octets for storage or transmission.
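The layers listed above can be traced by hand for UTF-16. The sketch below is illustrative only: the helper names `to_surrogate_pair` and `utf16_bytes` are made up for this example, and it assumes the input is a valid code point that is not itself in the surrogate range. It also shows the two ways of encoding a letter with a diacritic, using `unicodedata.normalize` from Python's standard library.

```python
import struct
import unicodedata


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """CEF step for UTF-16: split a code point >= U+10000 into two code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogate pairs only cover U+10000 to U+10FFFF")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def utf16_bytes(code_point: int, big_endian: bool = True) -> bytes:
    """CES step: serialize the 16-bit code units to octets in a byte order."""
    if code_point < 0x10000:
        units = [code_point]
    else:
        units = list(to_surrogate_pair(code_point))
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# The hand-built bytes match Python's own UTF-16 encoders.
print(utf16_bytes(0x1F600) == "\U0001F600".encode("utf-16-be"))  # True
print(utf16_bytes(0x20AC, big_endian=False) == "\u20ac".encode("utf-16-le"))  # True

# A letter with a diacritic: one precomposed code point or base + combining mark.
print(unicodedata.normalize("NFD", "\u00e9") == "e\u0301")  # True
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```

Separating the code unit step from the byte order step mirrors the CEF and CES split: the same pair of code units yields different octet sequences in big-endian and little-endian order.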
Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".
Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper case letters, numerals and some punctuation only. The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form.