Byte Order Mark

A way of distinguishing between different UniCode encodings, and of determining endianness for those encodings (UTF-16 and UTF-32) for which it matters.

The world has been slowly migrating to UniCode, but one issue is how to store UniCode text files. Several different representations are in use; the two most common are UTF-8 (UtfEight) and UTF-16 (UtfSixteen). The Unix world has largely settled on UTF-8, while the Windows world seems to prefer UTF-16.

UTF-8 is arguably more backwards-compatible (all AsciiCode files are valid UTF-8 files without modification), and it doesn't suffer from the endian problems of UTF-16, because it is defined as a sequence of single bytes. In UTF-16, on the other hand, endianness is a big problem: do you transmit the least significant or the most significant byte of each 16-bit word first? One solution to this problem, found widely in the Windows universe, is the "FFFE cookie".

A UTF-16 file that uses this convention begins with the "zero-width no-break space", which is at code point U+FEFF. The byte-swapped form of this code is U+FFFE, which is not a valid UniCode character (it is reserved as a noncharacter). Thus, if one opens a file and sees FF FE as the first two bytes, it is a LittleEndian file; if one sees FE FF, it is a BigEndian file.
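
The check described above can be sketched in Python. This is only an illustration: the helper name utf16_byte_order is made up for this page, and the BOM constants come from the standard codecs module.

```python
import codecs
from typing import Optional

BOM = "\ufeff"  # the zero-width no-break space used as the mark

# The same code point serialised in each byte order.
assert BOM.encode("utf-16-be") == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert BOM.encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return "big", "little", or None if no UTF-16 mark is present.

    Hypothetical helper: it only inspects the first two bytes and
    assumes the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None
```

In everyday use the helper is rarely needed: Python's generic "utf-16" codec reads the mark itself and removes it from the decoded text.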

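UTF-32 is seldom used; however, its ByteOrderMark is either 00 00 FE FF (BigEndian) or FF FE 00 00 (LittleEndian). Note that the LittleEndian form starts with the same two bytes as the UTF-16 LittleEndian mark, so a detector should check for UTF-32 first.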

A ByteOrderMark also exists for UTF-8 (EF BB BF, the UTF-8 encoding of U+FEFF); however, UTF-8 has no byte-order ambiguity, so the mark serves only to identify the file as UTF-8.
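
Putting the marks for all three encodings together, a detector might look like the sketch below. The names sniff_bom and decode_with_bom are hypothetical, and falling back to UTF-8 when no mark is found is an assumption of this example, not a rule. The longer UTF-32 marks are tested first because FF FE 00 00 begins with the UTF-16 mark FF FE. Those four bytes could in principle also be a LittleEndian UTF-16 file whose first character is U+0000, and this sketch simply guesses UTF-32.

```python
import codecs

# Longest signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding name, mark length), or (None, 0) if no mark."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip any leading mark and decode; `default` is an assumed fallback."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

For example, decode_with_bom(b"\xfe\xff\x00h\x00i") yields "hi", and plain ASCII input without a mark is decoded with the fallback.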


Last edited July 5, 2005