Utf Sixteen

Unicode Transformation Format, 16-bit representation. A way of representing a stream of UniCode characters as a stream of 16-bit words. Commonly found in MicrosoftWindows (and JavaLanguage?) environments; where it is the native encoding used by the system.

Originally, UniCode itself was intended as a 16-bit format; but now it has gone well beyond that. UtfSixteen is designed to be compatible with most legacy 16-bit Unicode applications, with some escape sequences defined to allow those rare characters not found in the BasicMultilingualPlane? to be accessed. Unfortunately, this misleads programmers into writing buggy, insecure software based on the almost-true assumption that UTF-16 is a fixed width format. Since this assumption works nearly all of the time, broken code is unlikely to be picked up during testing and use, leaving untested edge case code paths to be exploited with malicious or even random data.

UtfSixteen is not compatible with AsciiCode text.

As most modern machines assume 8-bit words; endianality is an issue with UTF-16; a UTF-16 document can be transmitted either LSB first (LittleEndian) or MSB first (BigEndian). To distinguish between the two, especially when documents might be shared between different environments, the ByteOrderMark may be used. Where external context (such as a datatype tag) provides information on endianality, the ByteOrderMark may be omitted.


EditText of this page (last edited November 7, 2012) or FindPage with title or text search