/juː-ti-ɛf sɪksˈtiːn/
noun — "a variable-length encoding for Unicode using 16-bit code units."
UTF-16 (Unicode Transformation Format, 16-bit) is a character encoding standard that represents Unicode code points using sequences of 16-bit units, called code units. It can encode every Unicode character, including those outside the Basic Multilingual Plane (BMP), using either a single 16-bit unit or a pair of units known as a surrogate pair. UTF-16 is widely used in operating systems, programming environments, and file formats where 16-bit alignment provides efficiency and compatibility.
Technically, UTF-16 encodes Unicode characters as follows:
- Code points U+0000 to U+FFFF (the BMP), excluding the reserved surrogate range U+D800 to U+DFFF, are represented directly as a single 16-bit code unit.
- Code points U+10000 to U+10FFFF are represented using two 16-bit units, called a surrogate pair. The first unit is a high surrogate (U+D800–U+DBFF), and the second is a low surrogate (U+DC00–U+DFFF); together they encode the full code point.
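The surrogate-pair construction above can be sketched in a few lines of Python; the helper name `to_surrogate_pair` is illustrative, not a standard library function:

```python
# Sketch: derive the UTF-16 surrogate pair for a supplementary code point.
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```

Reversing the arithmetic (subtract the surrogate bases, recombine the two 10-bit halves, add 0x10000) recovers the original code point.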
This system allows UTF-16 to efficiently cover all Unicode characters while maintaining backward compatibility with systems expecting 16-bit aligned text.
# example: encoding "𝄞" (MUSICAL SYMBOL G CLEF, U+1D11E) in UTF-16
char = "𝄞"
code_point = ord(char) # U+1D11E
utf16_bytes = char.encode('utf-16')  # Python's 'utf-16' codec prepends a BOM
# utf16_bytes = b'\xff\xfe\x34\xd8\x1e\xdd'
# \xff\xfe is the little-endian BOM; the remaining bytes are the
# surrogate pair (high: 0xD834, low: 0xDD1E) in little-endian order
UTF-16 may be stored in little-endian or big-endian order. To indicate byte order, a Byte Order Mark (BOM) — code point U+FEFF — is often included at the start of a text stream. This allows systems to correctly interpret multi-byte sequences regardless of platform.
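Python's standard codecs illustrate the interaction between byte order and the BOM; `'utf-16-le'` and `'utf-16-be'` fix the byte order and emit no BOM, while `'utf-16'` writes a BOM and, on decoding, uses it to detect the order:

```python
# Sketch: byte order and the BOM with Python's built-in UTF-16 codecs.
s = "A"
print(s.encode('utf-16-le'))   # b'A\x00'  -> little-endian, no BOM
print(s.encode('utf-16-be'))   # b'\x00A'  -> big-endian, no BOM

# Decoding with 'utf-16' consumes a leading BOM and picks the byte order:
print(b'\xff\xfeA\x00'.decode('utf-16'))   # 'A' (BOM says little-endian)
print(b'\xfe\xff\x00A'.decode('utf-16'))   # 'A' (BOM says big-endian)
```

Note that the plain `'utf-16'` encoder uses the platform's native byte order, so the exact bytes it produces can differ between machines.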
Conceptually, UTF-16 acts as a bridge between Unicode’s abstract code points and machine-level storage using 16-bit units. It is especially effective for languages and scripts where most characters fall within the BMP, as many characters can be represented with a single 16-bit unit. However, for texts containing a significant number of supplementary characters, UTF-16 requires surrogate pairs, making some operations more complex than in UTF-8.
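One concrete consequence of surrogate pairs: languages whose string type is built on UTF-16 code units (Java, JavaScript, the Win32 API) report a length of 2 for a supplementary character. Python strings index by code point, but the code-unit count can be computed from an explicit-endian encoding; the helper `utf16_units` below is illustrative, not a standard function:

```python
# Sketch: code-point count vs. UTF-16 code-unit count.
def utf16_units(s: str) -> int:
    # 'utf-16-le' emits no BOM, so every 2 bytes is one code unit
    return len(s.encode('utf-16-le')) // 2

print(len("G𝄞"))          # 2 code points
print(utf16_units("G𝄞"))  # 3 code units ("𝄞" needs a surrogate pair)
```

This mismatch is why indexing, slicing, and length checks need care in UTF-16-based APIs when supplementary characters may be present.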
In practice, UTF-16 is widely used in Windows operating systems, Java, .NET, and many internal text-processing systems. It provides efficient access to characters for algorithms optimized for 16-bit units and supports full Unicode interoperability when correctly implemented.
See Unicode, UTF-8, ASCII, Character Encoding.