/ˈkærɪktər ɛnˈkoʊdɪŋ/
noun — "the method of representing characters as digital data."
Character Encoding is a system that maps characters, symbols, or text elements to specific numeric values, allowing computers and digital devices to store, transmit, and interpret textual information. Each character in a text — whether a letter, digit, punctuation mark, or special symbol — is assigned a unique numeric code, which is then represented in binary form for processing and storage. Character encoding ensures that text can be consistently read, displayed, and exchanged between different systems.
Technically, a character encoding consists of three components:
- Code points — unique numbers assigned to each character, such as U+0041 for 'A' in Unicode.
- Encoding forms — methods of converting code points into sequences of bytes, such as UTF-8, UTF-16, or UTF-32.
- Interpretation rules — the algorithm or system that reconstructs characters from byte sequences during reading or processing.
# example: character encoding in UTF-8
char = "A"
code_point = ord(char)             # Unicode code point: 65 (U+0041)
utf8_bytes = char.encode('utf-8')  # b'A', the single byte 0x41
# stored as binary 01000001
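The three components can be seen working together in a short round trip. The sketch below uses '€' (U+20AC) purely as an illustration: one code point, three encoding forms, and decoding applying the interpretation rules in reverse.
# example: one code point, three encoding forms
char = "€"                        # code point U+20AC (decimal 8364)
utf8 = char.encode('utf-8')       # 3 bytes: E2 82 AC
utf16 = char.encode('utf-16-be')  # 2 bytes: 20 AC
utf32 = char.encode('utf-32-be')  # 4 bytes: 00 00 20 AC
# interpretation rules: decoding reconstructs the same character
assert utf8.decode('utf-8') == char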
Early character encodings, such as ASCII (American Standard Code for Information Interchange), represented characters using 7 bits, allowing 128 symbols. As computing became global, extended encodings and Unicode were developed to represent thousands of characters from multiple languages, symbols, and scripts. Modern systems often use Unicode with UTF-8 or UTF-16 encoding to support internationalization and interoperability.
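The jump from ASCII's 128 symbols to Unicode's full range shows up directly in the byte lengths UTF-8 produces. A small illustration (the three sample characters are arbitrary choices):
# example: ASCII range vs. multi-byte UTF-8
for ch in "Aé中":
    print(ch, hex(ord(ch)), ch.encode('utf-8'))
# A  0x41    b'A'             1 byte, inside ASCII's 7-bit range
# é  0xe9    b'\xc3\xa9'      2 bytes
# 中 0x4e2d  b'\xe4\xb8\xad'  3 bytes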
Conceptually, character encoding is a translation dictionary between human-readable symbols and machine-readable data. Without a consistent encoding, text exchanged between systems could become garbled, as the numeric representation of one character might be interpreted as a different character by another system. This is why standards like Unicode and UTF-8 are critical in web development, operating systems, and data storage.
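That garbling is easy to reproduce by encoding text under one standard and decoding it under another; the sketch below picks Latin-1 as the mismatched reader's assumption.
# example: an encoding mismatch producing garbled text
original = "café"
wire_bytes = original.encode('utf-8')   # b'caf\xc3\xa9'
garbled = wire_bytes.decode('latin-1')  # receiver assumes the wrong encoding
print(garbled)  # 'cafÃ©': the two UTF-8 bytes of 'é' read as two Latin-1 characters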
In practice, character encoding affects text input, display, searching, sorting, encryption, and communication. Every programming language, file format, database, and network protocol must agree on an encoding to correctly interpret text. Mismatched encodings often produce unreadable characters (mojibake), decoding errors, or outright data corruption.
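When a mismatch cannot even be decoded, most systems raise an error rather than guess. Python's behavior is shown below as one illustration; the errors='replace' fallback substitutes U+FFFD, the Unicode replacement character.
# example: a hard decode failure and a lossy fallback
data = "é".encode('utf-8')  # b'\xc3\xa9'
try:
    data.decode('ascii')    # 0xC3 falls outside ASCII's 7-bit range
except UnicodeDecodeError as err:
    print(err)              # "'ascii' codec can't decode byte 0xc3 ..."
print(data.decode('ascii', errors='replace'))  # '??' shown as two U+FFFD replacement characters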