UTF-8

/juː-ti-ɛf eɪt/

noun — "a variable-length encoding for Unicode characters."

UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding that represents every Unicode code point as a sequence of 1 to 4 bytes. It is backward-compatible with ASCII, compact for text that is mostly ASCII, and able to represent every character defined in the Unicode standard. This combination of compatibility, compactness, and universality has made UTF-8 the dominant encoding for web content, software, and data interchange.

Technically, UTF-8 encodes Unicode code points as follows:

Code points U+0000 to U+007F (basic ASCII) use 1 byte.
Code points U+0080 to U+07FF use 2 bytes.
Code points U+0800 to U+FFFF use 3 bytes (except the surrogate range U+D800 to U+DFFF, which is not valid in UTF-8).
Code points U+10000 to U+10FFFF use 4 bytes.

This variable-length structure keeps ASCII text at one byte per character while still covering every code point in Unicode.
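
The ranges above correspond to fixed bit layouts: a lead byte whose high bits announce the sequence length, followed by continuation bytes of the form 10xxxxxx. The following is a minimal sketch of that layout, not a replacement for Python's built-in codec; the function name `utf8_encode` is hypothetical and chosen only for illustration.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point by hand (illustrative sketch only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")

    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Compare the hand-rolled result with the built-in codec for one
# character from each length class.
for ch in ("A", "é", "€", "😀"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

In real code, `str.encode("utf-8")` is the right tool; the sketch only makes the bit packing visible.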


# Example: encoding "€" (euro sign) in UTF-8
char = "€"
code_point = ord(char)             # 8364, i.e. U+20AC
utf8_bytes = char.encode("utf-8")  # b'\xe2\x82\xac'
# U+20AC falls in the U+0800 to U+FFFF range, so it takes 3 bytes
print(f"U+{code_point:04X}", utf8_bytes, len(utf8_bytes))

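UTF-8 is self-synchronizing: lead bytes and continuation bytes use disjoint bit patterns (continuation bytes always have the form 10xxxxxx), so a decoder that lands in the middle of a stream can find the next character boundary by skipping at most three bytes. This makes it robust for streaming, file storage, and network transmission. It also preserves ASCII exactly: bytes 0x00 to 0x7F always stand for the corresponding ASCII characters and never occur inside a multi-byte sequence, so any ASCII text is already valid UTF-8.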
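
As a rough illustration of self-synchronization, the sketch below starts at an arbitrary byte offset and skips forward to the next character boundary. The helper names `is_continuation` and `next_boundary` are hypothetical, and the code assumes the input is otherwise valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes always match the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, offset: int) -> int:
    """Return the first index >= offset at which a character starts."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


data = "a€b".encode("utf-8")      # b'a\xe2\x82\xacb'
start = next_boundary(data, 2)    # offset 2 is inside the euro sign
resumed = data[start:].decode("utf-8")
print(start, resumed)

# ASCII text produces identical bytes under both encodings.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

Because a character has at most three continuation bytes, the loop in `next_boundary` advances at most three positions on valid input.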

Conceptually, UTF-8 is a bridge between human-readable characters and machine-level binary storage. It allows multilingual text, emojis, symbols, and special formatting characters to coexist in a single, standardized encoding, ensuring interoperability across systems and applications worldwide.

In practice, UTF-8 underpins most web technologies, operating systems, programming languages, and file formats. Its adoption simplifies internationalization, avoids encoding conflicts, and reduces storage overhead for texts that primarily use ASCII characters.
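
The storage claim can be checked directly by comparing encoded sizes. The sketch below uses two made-up sample strings and the hypothetical helper `encoded_sizes`; it suggests that the saving applies to ASCII-heavy text, while text in other scripts may be smaller in UTF-16.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes under three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # "-le" variants add no BOM
    return {name: len(text.encode(name)) for name in encodings}


ascii_sample = "Hello, world!"     # 13 ASCII characters
cjk_sample = "日本語のテキスト"      # 8 characters, each in U+0800 to U+FFFF

for label, sample in (("ascii", ascii_sample), ("cjk", cjk_sample)):
    print(label, encoded_sizes(sample))
```

The little-endian codec names are used so that no byte order mark is counted in the totals.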

See Unicode, UTF-16, ASCII, Character Encoding.
