/juː-ti-ɛf θɜːrtiː-tuː/

noun — "a fixed-length Unicode encoding using 32-bit units."

UTF-32 (Unicode Transformation Format, 32-bit) is a character encoding standard that represents every Unicode code point as a single, fixed 32-bit code unit. Unlike variable-length encodings such as UTF-8 or UTF-16, UTF-32 stores each code point in exactly 4 bytes, giving simple, direct access to any character without decoding multi-byte sequences or surrogate pairs.

Technically, UTF-32 works as follows:

  • Every Unicode code point in the range U+0000 to U+10FFFF (excluding the surrogate range U+D800 to U+DFFF) is encoded as a single 32-bit unit.
  • An optional byte order mark (BOM) can indicate little-endian or big-endian ordering for storage and transmission; the explicit UTF-32LE and UTF-32BE schemes omit it (see the sketch after this list).
  • No surrogate pairs or variable-length sequences are required, simplifying indexing, slicing, and random access in text processing.
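
The two byte orders are easy to observe with Python's standard 'utf-32-le' and 'utf-32-be' codecs; a minimal sketch (byte values as they would print on any host, since the order is explicit):

# example: explicit byte order in UTF-32
text = "A"
print(text.encode('utf-32-le'))  # b'A\x00\x00\x00'  (little-endian, no BOM)
print(text.encode('utf-32-be'))  # b'\x00\x00\x00A'  (big-endian, no BOM)
# the plain 'utf-32' codec instead prepends a BOM in the platform's own
# byte order, as in the next example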

# example: encoding the character "A" (U+0041) in UTF-32
char = "A"
utf32_bytes = char.encode('utf-32')  # prepends a BOM, native byte order
# on a little-endian platform:
#   utf32_bytes == b'\xff\xfe\x00\x00A\x00\x00\x00'  (4-byte BOM + 4-byte "A")
print(len(utf32_bytes))  # 8 -- a fixed 4-byte representation per character

Conceptually, UTF-32 is the most straightforward mapping of Unicode characters to machine memory: each code point corresponds to exactly one fixed-size unit, making indexing and slicing by code point trivial (though user-perceived characters built from combining marks may still span several code points). This simplicity comes at the cost of storage efficiency: ASCII text occupies four times the space it would in UTF-8, and even characters in the Basic Multilingual Plane take twice the space they would in UTF-16.
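
The trade-off is easy to quantify with the standard codecs; a minimal sketch for plain ASCII text:

# example: storage cost of the three encodings for ASCII text
text = "hello"
print(len(text.encode('utf-8')))      # 5  (1 byte per ASCII character)
print(len(text.encode('utf-16-le')))  # 10 (2 bytes per BMP code point)
print(len(text.encode('utf-32-le')))  # 20 (4 bytes per code point, always)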

UTF-32 is used primarily in internal data structures and programming-language APIs, where simplicity and predictable indexing outweigh memory concerns. Its size makes it rare in network transmission and file storage.
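
As an illustration of such an internal structure, here is a minimal sketch of a fixed-width code point buffer built with Python's array module; it assumes the 'I' typecode is 32 bits wide, which holds on most common platforms:

# example: a UTF-32-style internal buffer with O(1) code point access
from array import array

buf = array('I', (ord(c) for c in "héllo 🌍"))
assert buf.itemsize == 4  # assumption: 'I' is a 32-bit unit on this platform
print(hex(buf[6]))        # 0x1f30d -- the emoji, fetched by direct indexing

Indexing buf[6] is a constant-time array lookup; finding the seventh code point of the equivalent UTF-8 byte string would require scanning from the start.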

See Unicode, UTF-8, UTF-16, Character Encoding.