Character Sets

A character set defines the characters of a language or script and assigns each one a code point. The first character sets were single-byte, so they could define at most 256 characters.
A code point is the numeric value that represents a character in a character set. ASCII and EBCDIC are two single-byte character sets that use different code points for the same characters. For example, the code point 0x41 represents the letter 'A' in ASCII, but in EBCDIC the same letter is represented by 0xC1.
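The ASCII and EBCDIC example can be checked with a short Python sketch. EBCDIC exists in several code pages; the choice of Python's "cp037" codec (EBCDIC US/Canada) here is an assumption made for illustration only.

```python
# Encode the same letter with an ASCII codec and an EBCDIC codec.
# Assumption: "cp037" stands in for EBCDIC; other EBCDIC code pages exist.
letter = "A"

ascii_value = letter.encode("ascii")[0]   # integer value of the single byte
ebcdic_value = letter.encode("cp037")[0]

print(f"ASCII  code point for {letter!r}: 0x{ascii_value:02X}")   # 0x41
print(f"EBCDIC code point for {letter!r}: 0x{ebcdic_value:02X}")  # 0xC1
```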
Some complex scripts contain more than 256 characters, so they need to use multiple bytes to represent a single character. The most common multi-byte character set is UNICODE.
The characters in a character set may be encoded in many ways. For example, a single byte character set could use a 7-bit or 8-bit encoding. A multi-byte character set could use a fixed width, variable width, or shift-sensitive variable-width encoding.
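The encoding styles mentioned above can be compared with a minimal Python sketch. The specific codecs (Latin-1, UTF-32, Shift JIS, ISO-2022-JP) are examples chosen for illustration and are not named in this guide.

```python
# 7-bit versus 8-bit: "é" (U+00E9) does not fit in 7-bit ASCII,
# but it does fit in the 8-bit Latin-1 encoding.
e_acute = "\u00e9"
try:
    e_acute.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
latin1_bytes = e_acute.encode("latin-1")

japanese = "\u65e5\u672c"  # the two characters of "Nihon" (Japan)

# Fixed width: UTF-32 uses 4 bytes for every character.
fixed = japanese.encode("utf-32-be")

# Variable width: Shift JIS uses 1 byte for "A" and 2 bytes for a kanji.
variable = ("A" + japanese[0]).encode("shift_jis")

# Shift-sensitive variable width: ISO-2022-JP switches character sets
# with escape sequences (each one starts with the ESC byte, 0x1B).
shifted = japanese.encode("iso2022_jp")

print("fits in ASCII:", fits_in_ascii)
print("Latin-1:", latin1_bytes.hex())
print("UTF-32 length:", len(fixed))
print("Shift JIS length:", len(variable))
print("ISO-2022-JP bytes:", shifted)
```

In the shift-sensitive case the meaning of a byte depends on the escape sequence that precedes it, so the data cannot be decoded correctly from an arbitrary offset.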

UNICODE Encoding

UNICODE supports three main encodings:
UCS-2: a 2-byte fixed-width encoding.
UTF-16: a variable-width encoding that uses 2 bytes for most characters. To increase the range of characters that can be represented, a supplementary character is encoded as a pair of 2-byte units (a surrogate pair), which increases its length to 4 bytes.
UTF-8: a variable-width encoding that uses 1 to 4 bytes per character. 7-bit ASCII characters are represented by a single byte in UTF-8 and use the same code points. Therefore ASCII characters are indistinguishable from their UTF-8 encoded Unicode counterparts.
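The byte lengths of these encodings can be inspected with a small Python sketch. The sample characters are arbitrary choices for illustration. Python has no separate UCS-2 codec, so only UTF-8 and UTF-16 are shown.

```python
# Sample characters: ASCII letter, accented Latin letter,
# CJK ideograph, and a supplementary character (outside the first 65,536 code points).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

widths = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    widths[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}  UTF-8: {len(utf8)} byte(s)  UTF-16: {len(utf16)} byte(s)")

# ASCII text produces the same bytes whether encoded as ASCII or UTF-8.
same_bytes = "ASCII text".encode("ascii") == "ASCII text".encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```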

Operating System Character Set

The operating system must have the appropriate character sets installed to be able to render the characters properly. Install a native language version of the operating system, or on Win32 install the English version with additional character sets.

Microsoft Windows

On Windows operating systems your Locale determines the ANSI character set used for rendering text in GUI applications. The corresponding OEM character set is used by console applications (those that run in a DOS Box). For example, U.S. English uses ANSI code page 1252 and OEM code page 437.
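The difference between the ANSI and OEM code pages can be illustrated with Python's "cp1252" and "cp437" codecs. This is a sketch of the byte-level effect only; it does not query Windows for the active code pages, and the sample character is an arbitrary choice.

```python
ch = "\u00e9"  # é

ansi_bytes = ch.encode("cp1252")  # ANSI code page for U.S. English
oem_bytes = ch.encode("cp437")    # OEM code page for U.S. English

print("ANSI (1252):", ansi_bytes.hex())
print("OEM  (437): ", oem_bytes.hex())

# Bytes written by a GUI application and read as OEM text by a console
# application decode to a different character.
misread = ansi_bytes.decode("cp437")
print("ANSI byte read as OEM:", f"U+{ord(misread):04X}")
```

Because the same character has a different byte value in each code page, text exchanged between GUI and console applications needs an explicit conversion.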
The Locale also determines the way numbers, currency, time and dates are displayed. The Locale is set using the Regional Options/Setting dialog, which is accessible from the Control Panel.
The Input Locale (as distinct from the Locale) determines how your keyboard input is mapped to characters.

MS-DOS Box

To render characters using different Locales from within an MS-DOS Box, select a TrueType font. Raster fonts cannot be used.
OEM code pages can be set explicitly with the chcp utility from within a DOS Box. For example:

    C:\>chcp /?
    Displays or sets the active code page number.

    CHCP [nnn]

      nnn   Specifies a code page number.

    Type CHCP without a parameter to display the active code page number.

    C:\>chcp
    Active code page: 437
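A script that decodes console output may need to turn the code page number reported by chcp into a codec. The following Python sketch shows one way to do that. The helper name codec_for_chcp_output is hypothetical, and the sketch assumes that Python ships a codec named "cp" followed by the code page number, which is true for 437 and 1252 but not for every code page.

```python
import codecs
import re


def codec_for_chcp_output(chcp_output: str) -> str:
    """Return the Python codec name for the number in chcp's output.

    Hypothetical helper: assumes the codec is named "cp" + number.
    Raises ValueError if no number is found, LookupError if Python
    has no codec for that code page.
    """
    match = re.search(r"(\d+)\D*$", chcp_output.strip())
    if match is None:
        raise ValueError(f"no code page number in {chcp_output!r}")
    return codecs.lookup(f"cp{match.group(1)}").name


codec = codec_for_chcp_output("Active code page: 437")
print(codec)
```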

Rendering CJK with English Locales

A useful tool for displaying CJK characters on an English/Western version of Windows is the NJWIN CJK Viewer.
