Glossary Term

Character encoding

Introduction and History of Character Encoding
- Character encoding is the process of assigning numbers to graphical characters.
- It allows characters to be stored, transmitted, and transformed using digital computers.
- Code points are the numerical values that make up a character encoding.
- Code space, code page, and character map are terms used to describe the code points collectively.
- Early character codes were limited to a subset of the characters used in written languages; modern computer systems allow more elaborate character codes such as Unicode.
- Morse code, introduced in the 1840s, was the earliest well-known electrically transmitted character code.
- Later character encoding systems include the Baudot code, ASCII, and Unicode.
- The Baudot code, a five-bit encoding created by Émile Baudot in 1870, was later standardized as ITA2.
- ASCII, released in 1963, addressed the shortcomings of previous codes and was widely adopted.
- Punch card data encoding was invented by Herman Hollerith in the late 19th century.
- IBM developed Binary Coded Decimal (BCD) as a six-bit encoding scheme.
- IBM's BCD encodings were the precursors of EBCDIC, an eight-bit encoding scheme.
- Researchers in the 1980s faced a dilemma: more bits were needed to accommodate additional characters, but for users of small character sets such as the Latin alphabet those extra bits wasted then-scarce computing resources.
- The compromise solution became Unicode, which maps characters to abstract code points and lets those code points be represented in several ways, including variable-length encodings.

Terminology
- Informally, character encoding, character map, character set, and code page are often used interchangeably.
- A character is a minimal unit of text with semantic value.
- A character set is a collection of elements used to represent text, such as the Latin alphabet or Greek alphabet.
- A coded character set is a character set in which each character is mapped to a unique number.
- The distinction between these terms has become important with the emergence of more sophisticated character encodings.
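To make the idea of a code point concrete, here is a minimal Python sketch. It assumes the Unicode mapping exposed by Python's built-in `ord()` and `chr()`; the helper name `code_point_label` and the sample characters are illustrative choices, not part of the glossary entry.

```python
# A coded character set assigns each character a unique number (its code point).
# Python's built-in ord() and chr() expose the Unicode mapping.

def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X}"

# Sample characters written as escapes so the source file's own encoding does not matter.
samples = ["A", "\u00e9", "\u03a9", "\u20ac", "\U0001F600"]
labels = {ch: code_point_label(ch) for ch in samples}

for ch, label in labels.items():
    print(label, ch)

# chr() is the inverse mapping: code point -> character.
round_trip = [chr(ord(ch)) for ch in samples]
```

The labels are plain numbers written in hexadecimal; how those numbers are turned into bits for storage is a separate step, decided by the encoding.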
Importance of Character Encoding
- Character encoding enables worldwide interchange of text in electronic form.
- It allows for the representation of most characters used in many written languages.
- Unicode has become the dominant character encoding standard, largely replacing earlier character encodings.
- The development of character encodings has been driven by the need to transmit and process character-based information by machine.
- The evolution of character codes has been influenced by the capabilities and limitations of early machines.

Code Pages and Code Units
- Code page is a historical name for a coded character set.
- Originally, a code page referred to a specific page number in the IBM standard character set manual.
- Other vendors, including Microsoft, SAP, and Oracle Corporation, have their own sets of code pages.
- Well-known code page suites are Windows (based on Windows-1252) and IBM/DOS (based on code page 437).
- The term code page is often used to refer to character encodings in general.
- A code unit is the fixed-size building block of an encoding; its size varies depending on the encoding.
- US-ASCII consists of 7-bit code units.
- UTF-8, EBCDIC, and GB 18030 consist of 8-bit code units.
- UTF-16 consists of 16-bit code units.
- UTF-32 consists of 32-bit code units.

Code Points, Characters, and Unicode Encoding Model
- A code point is represented by a sequence of one or more code units.
- UTF-8 maps code points to a sequence of one, two, three, or four code units.
- UTF-16 uses a single code unit for code points below U+10000 and a surrogate pair (two code units) for code points with a value of U+10000 or higher.
- UTF-32 represents every code point as a single code unit.
- GB 18030 commonly uses multiple code units per code point.
- What constitutes a character varies between character encodings.
- Letters with diacritics can be encoded as a single unified character or as separate characters that combine into a single glyph.
- Handling glyph variants is a choice made when constructing a character encoding.
- Some writing systems, like Arabic and Hebrew, must accommodate graphemes that are joined in different ways in different contexts but represent the same semantic character.
- Unicode and ISO/IEC 10646 constitute a unified standard for character encoding.
- An abstract character repertoire (ACR) is the full set of abstract characters that a system supports.
- A coded character set (CCS) maps characters to code points.
- A character encoding form (CEF) maps code points to code units.
- A character encoding scheme (CES) maps code units to octets for storage or transmission.
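The relationship between code points, code units, and octets can be sketched in Python. This is an illustrative example under stated assumptions, not a reference implementation: the helpers `code_units` and `unit_counts` are hypothetical names, the sample characters are arbitrary, and the big-endian codecs (`utf-16-be`, `utf-32-be`) are used only so that no byte order mark is emitted.

```python
import unicodedata

def code_units(text: str, encoding: str, unit_bytes: int) -> list:
    """Encode text, then split the bytes into big-endian code units (hypothetical helper)."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + unit_bytes], "big")
            for i in range(0, len(data), unit_bytes)]

def unit_counts(ch: str) -> dict:
    """Number of code units one code point needs in each encoding form."""
    return {
        "UTF-8": len(code_units(ch, "utf-8", 1)),       # 8-bit code units
        "UTF-16": len(code_units(ch, "utf-16-be", 2)),  # 16-bit code units
        "UTF-32": len(code_units(ch, "utf-32-be", 4)),  # 32-bit code units
    }

# Character encoding form (CEF): code point -> code units.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))

# U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair.
surrogates = code_units("\U0001F600", "utf-16-be", 2)
print([hex(u) for u in surrogates])

# Character encoding scheme (CES): the same 16-bit code unit 0x20AC
# becomes different octet sequences depending on byte order.
big_endian = "\u20ac".encode("utf-16-be")
little_endian = "\u20ac".encode("utf-16-le")

# A letter with a diacritic: one precomposed code point, or a base letter
# plus a combining mark. NFC normalization composes the pair.
precomposed = "\u00e9"
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```

Comparing `precomposed` and `decomposed` directly yields inequality even though both render as the same glyph, which is why text is usually normalized before comparison.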