/ˌjuː.tiːˈɛf/
noun — "a family of Unicode Transformation Format encodings."
UTF (Unicode Transformation Format) refers collectively to a set of character encoding schemes designed to represent Unicode code points as sequences of bytes or code units. Each UTF variant defines a method to convert the abstract numeric code points of Unicode into a binary format suitable for storage, transmission, and processing in digital systems. The most common UTFs are UTF-8, UTF-16, and UTF-32, each with different characteristics optimized for efficiency, compatibility, or simplicity.
Technically, a UTF defines:
- How each Unicode code point (U+0000 to U+10FFFF, excluding the surrogate range U+D800–U+DFFF) maps to one or more code units (8-bit bytes for UTF-8, 16-bit units for UTF-16, 32-bit units for UTF-32).
- Rules for encoding and decoding sequences of code units, including handling multi-unit characters and surrogate pairs (for UTF-16).
- Optional markers such as a Byte Order Mark (BOM) to indicate endianness in UTF-16 and UTF-32 (UTF-8 permits a BOM as well, but only as a signature, since single bytes have no byte-order ambiguity).
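The BOM's role is easy to observe with Python's built-in codecs. A minimal sketch (the byte values shown assume a little-endian platform):
# example: byte order and the BOM in UTF-16
data_le = "hi".encode('utf-16-le') # b'h\x00i\x00', no BOM; the codec name fixes the order
data_be = "hi".encode('utf-16-be') # b'\x00h\x00i', same text in the opposite byte order
data = "hi".encode('utf-16') # BOM first (b'\xff\xfe' on little-endian platforms)
assert data.decode('utf-16') == "hi" # the decoder reads the BOM, picks the order, discards it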
Examples of UTF variants include:
- UTF-8 — variable-length encoding using 1 to 4 bytes per code point, backward-compatible with ASCII (see the sketch after this list).
- UTF-16 — variable-length encoding using one or two 16-bit units, with surrogate pairs for code points above U+FFFF.
- UTF-32 — fixed-length encoding using 32 bits per code point, simple but space-inefficient.
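UTF-8's variable width is easy to verify; in this sketch, the characters are chosen only to hit each possible byte length:
# example: UTF-8 byte counts grow with the code point
for ch in ("A", "é", "€", "🐍"): # U+0041, U+00E9, U+20AC, U+1F40D
    print(ch, len(ch.encode('utf-8'))) # prints 1, 2, 3, and 4 in turn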
# example: representing the character "A" in different UTFs
char = "A" # code point U+0041
utf8_bytes = char.encode('utf-8') # b'\x41' (one byte, identical to ASCII)
utf16_bytes = char.encode('utf-16') # b'\xff\xfe\x41\x00' (BOM FF FE, then little-endian 0x0041)
utf32_bytes = char.encode('utf-32') # b'\xff\xfe\x00\x00\x41\x00\x00\x00' (BOM, then 32-bit 0x00000041)
# note: Python's 'utf-16'/'utf-32' codecs prepend a BOM and use native byte
# order (little-endian on most machines); 'utf-16-le', 'utf-16-be', etc.
# fix the order and omit the BOM
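The differences between the variants only become visible above U+FFFF. The same exercise with U+1F40D (🐍), using the BOM-free big-endian codecs so the raw code units are easy to read:
# example: a supplementary character, U+1F40D, in different UTFs
snake = "\U0001F40D"
snake.encode('utf-8') # b'\xf0\x9f\x90\x8d' (four bytes)
snake.encode('utf-16-be') # b'\xd8\x3d\xdc\x0d' (surrogate pair D83D DC0D, two 16-bit units)
snake.encode('utf-32-be') # b'\x00\x01\xf4\x0d' (one 32-bit unit holds the code point directly)
len(snake) # 1, one code point regardless of encoding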
Conceptually, UTF is the bridge between Unicode’s abstract set of characters and practical binary storage or communication formats. Each variant balances trade-offs between storage efficiency, processing complexity, and compatibility with existing systems. UTFs ensure that text in any language or script can be represented reliably across platforms, devices, and applications.
In practice, choosing the appropriate UTF depends on the system’s needs. UTF-8 is most common for web and file formats, UTF-16 is used in environments optimized for 16-bit units (e.g., Windows, Java), and UTF-32 provides straightforward indexing at the cost of storage space.
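In code, this choice typically surfaces as an explicit encoding argument rather than a reliance on platform defaults. A minimal sketch (the filename is purely illustrative):
# example: pinning a file's encoding explicitly; 'notes.txt' is a hypothetical name
with open('notes.txt', 'w', encoding='utf-8') as f:
    f.write("héllo, wörld")
with open('notes.txt', encoding='utf-8') as f:
    text = f.read() # decoded back to str, independent of the platform's default encoding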
See Unicode, UTF-8, UTF-16, UTF-32, Character Encoding.