Operations Guide


Character Sets

A character set is used to represent all characters (or code points) in a language or script. The first character sets were single byte, meaning that they could define a maximum of 256 characters.

A code point is simply a binary value that represents a character in a character set. ASCII and EBCDIC are two examples of single-byte character sets that use different code points to represent the same set of characters. For example, the code point 0x41 represents the letter ’A’ in ASCII, but in EBCDIC the same letter is represented by 0xC1.
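
The ASCII and EBCDIC comparison can be checked with a short Python sketch. It assumes Python's built-in cp037 codec as a stand-in for EBCDIC; cp037 is only one of several EBCDIC variants and was picked here for illustration.

```python
# Encode the same letter with an ASCII codec and an EBCDIC codec.
# Assumption: cp037 (EBCDIC US/Canada) stands in for "EBCDIC" in general.
letter = "A"

ascii_byte = letter.encode("ascii")[0]   # code point of 'A' in ASCII
ebcdic_byte = letter.encode("cp037")[0]  # code point of 'A' in EBCDIC (cp037)

print(f"ASCII : 0x{ascii_byte:02X}")
print(f"EBCDIC: 0x{ebcdic_byte:02X}")
```

The two printed values differ even though both bytes stand for the same letter, which is why data must be decoded with the character set it was written in.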

Some complex scripts contain more than 256 characters, so they need to use multiple bytes to represent a single character. The most common multi-byte character set is UNICODE.

The characters in a character set may be encoded in many ways. For example, a single byte character set could use a 7-bit or 8-bit encoding. A multi-byte character set could use a fixed width, variable width, or shift-sensitive variable-width encoding.

UNICODE Encoding

UNICODE supports three main encodings:

UCS-2

a 2-byte fixed-width encoding.

UTF-16

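a variable-width encoding that uses 2 or 4 bytes per character. Most characters are encoded in a single 2-byte unit. In order to increase the range of characters that can be represented, the remaining (supplementary) characters are encoded as a pair of 2-byte units, called a surrogate pair, increasing the length to 4 bytes.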

UTF-8

a variable-width encoding ranging from 1 to 4 bytes in length. 7-bit ASCII characters are represented by a single byte in UTF-8 and use the same code points. Therefore ASCII characters are indistinguishable from their UTF-8 encoded Unicode counterparts.
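
The byte lengths described for UTF-8 and UTF-16 can be observed with a minimal Python sketch. The sample characters and the helper name encoded_lengths are illustrative choices, not part of the product. UCS-2 is left out because Python does not ship a separate codec for it.

```python
# Sample characters, written as escapes so the source stays pure ASCII:
# U+0041 'A', U+00E9 e-acute, U+20AC euro sign, U+1F600 an emoji
# outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} byte(s), UTF-16 = {utf16_len} byte(s)")

# A 7-bit ASCII character has the same single byte in ASCII and UTF-8.
ascii_same = "A".encode("ascii") == "A".encode("utf-8")
print("ASCII and UTF-8 bytes identical for 'A':", ascii_same)
```

Only the last sample needs 4 bytes in UTF-16, because it lies outside the range that a single 2-byte unit can represent.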

Operating System Character Set

The operating system must have the appropriate character sets installed to be able to render the characters properly. Install a native language version of the operating system, or on Win32 install the English version with additional character sets.

Microsoft Windows

On Windows operating systems your Locale determines the ANSI character set used for rendering text in GUI applications. The corresponding OEM character set is used by console applications (those that run in a DOS Box). For example, U.S. English uses ANSI code page 1252 and OEM code page 437.
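
The practical effect of having separate ANSI and OEM code pages can be sketched in Python, which ships codecs named cp1252 and cp437. The byte value 0xE9 is an arbitrary example chosen for illustration.

```python
import unicodedata

# One byte, interpreted under the two U.S. English code pages.
raw = bytes([0xE9])

as_ansi = raw.decode("cp1252")  # how a GUI (ANSI) application reads it
as_oem = raw.decode("cp437")    # how a console (OEM) application reads it

# Print code points and names rather than the glyphs themselves, so the
# output does not depend on the code page of the terminal running this.
for label, ch in (("ANSI 1252", as_ansi), ("OEM 437", as_oem)):
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# The reverse direction: the same character needs a different byte
# in each code page.
e_acute = "\u00e9"
print("cp1252:", e_acute.encode("cp1252").hex())
print("cp437 :", e_acute.encode("cp437").hex())
```

This is why text written by a GUI application can appear garbled when it is displayed in a console window, and the other way round.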

The Locale also determines the way numbers, currency, time and dates are displayed. The Locale is set using the Regional Options/Setting dialog, which is accessible from the Control Panel.

The Input Locale (as distinct from the Locale) determines how the keys on your keyboard are mapped to characters.

MS-DOS Box

In order to render characters using different Locales from within an MS-DOS Box, select a True Type font. Raster Fonts cannot be used.

OEM code pages can be set explicitly with the chcp utility from within a DOS Box. For example:


C:\>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

		nnn Specifies a code page number.

Type CHCP without a parameter to display the active code page number.

C:\>chcp
Active code page: 437

Rendering CJK with English Locales

A useful tool for displaying CJK characters on an English/Western version of Windows is NJWIN’s CJK Viewer.
