Willow Pond PC Tips: Code pages

     

There are a number of utilities that search and display the contents of a file, similar to Unix grep, and that run in the command console (or DOS prompt). These are very fast and return results from a 25,000-line file almost immediately after hitting "Enter". Unfortunately, the "DOS prompt" uses the DOS character mapping for character codes above decimal 127 (the "extended ASCII" range), and these differ from the Windows character maps.

Files edited with Windows programs and using the Windows characters with decimal values >127 will display incorrectly in the DOS window. For example, character 235 is the small "e" with a diaeresis (or umlaut) in Windows, but it is the lower case Greek "delta" in the IBM PC character set, and this is the way it displays in the DOS window. Here's an example using the following text from a German web page discussing Draeseke's Mass in A minor:

"Draesekes Weg als Künstler war ein fortwährender Kampf um Anerkennung. Der von allen seinen Schülern am Dresdener Konservatorium hochgeachtete Professor für Komposition und Musikgeschichte fand mit seinen Werken nicht das Echo in der musikalischen Öffentlichkeit, wie es dem in dieser Hinsicht glücklicheren Zeitgenossen Brahms beschieden war."

[Screen shot of German text with codepage=437, the Windows default.]

Notice that none of the characters with a diacritic mark displays correctly (see the capital Greek sigma instead of the lower case "a" with an umlaut in the first line). This can be a real nuisance for users of console applications.
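The mismatch can also be reproduced outside the console. The Python sketch below is an illustration only and is not part of the original tip: it assumes that Python's built-in "cp1252" and "cp437" codecs stand in for the Windows and DOS (IBM PC) character sets, and the helper name as_seen_in_console is made up for this example. It encodes text the way a Windows editor would save it, then decodes the same bytes the way a code page 437 console would show them.

```python
# Illustration only: Python's "cp1252" and "cp437" codecs are assumed to
# stand in for the Windows and DOS (IBM PC) character sets.

def as_seen_in_console(text, file_codec="cp1252", console_codec="cp437"):
    """Encode text as a Windows editor would, then decode it as the console would."""
    return text.encode(file_codec).decode(console_codec)

# The single byte value 235 is a different character in each set.
raw = bytes([235])
windows_char = raw.decode("cp1252")  # small "e" with a diaeresis
dos_char = raw.decode("cp437")       # lower case Greek delta

# A fragment of the German sample text, as a code page 437 console shows it.
sample = "fortwährender Kampf"
garbled = as_seen_in_console(sample)

if __name__ == "__main__":
    print(windows_char, dos_char)
    print(garbled)
```

The lower case "a" with an umlaut comes out as a capital Greek sigma, which matches the garbling described above.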

The solution is to set the console code page. The default console in Windows (at least in versions with American English set as the default) uses code page 437, the old DOS/IBM PC character set. The command "mode con" will display the code page in use. This can be changed using the command:
                        "mode con cp select=nnnn"
where nnnn is the code page to be selected. nnnn=850 will give the IBM "international set" (also pretty lame), nnnn=1252 will set it to the Windows (Western European/US) set, nnnn=28591 will set it to ISO 8859-1 Latin I, and nnnn=28592 will set it to ISO 8859-2 Eastern Europe. Both ISO sets place the accented letters used in the example, such as the German umlauts, at the same codes as the Windows set. Here are some code pages of interest:

Code page   Description

437         MS-DOS United States
708         Arabic (ASMO 708)
709         Arabic (ASMO 449+, BCON V4)
710         Arabic (Transparent Arabic)
720         Arabic (Transparent ASMO)
737         Greek (formerly 437G)
775         Baltic
850         MS-DOS Multilingual (Latin I)
852         MS-DOS Slavic (Latin II)
855         IBM Cyrillic (primarily Russian)
857         IBM Turkish
860         MS-DOS Portuguese
861         MS-DOS Icelandic
862         Hebrew
863         MS-DOS Canadian-French
864         Arabic
865         MS-DOS Nordic
866         MS-DOS Russian (former USSR)
869         IBM Modern Greek
874         Thai
932         Japan
936         Chinese (PRC, Singapore)
949         Korean
950         Chinese (Taiwan; Hong Kong SAR, PRC)
1200        Unicode (BMP of ISO 10646)
1250        Windows 3.1 Eastern European
1251        Windows 3.1 Cyrillic
1252        Windows 3.1 Latin 1 (US, Western Europe)
1253        Windows 3.1 Greek
1254        Windows 3.1 Turkish
1255        Hebrew
1256        Arabic
1257        Baltic
1258        Vietnamese
1361        Korean (Johab)
20000       CNS - Taiwan
20001       TCA - Taiwan
20002       Eten - Taiwan
20003       IBM5550 - Taiwan
20004       TeleText - Taiwan
20005       Wang - Taiwan
20127       US ASCII
20261       T.61
20269       ISO-6937
20866       Russian - KOI8-R
21027       Ext Alpha Lowercase
21866       Ukrainian - KOI8-U
28591       ISO 8859-1 Latin I
28592       ISO 8859-2 Eastern Europe
28593       ISO 8859-3 Turkish
28594       ISO 8859-4 Baltic
28595       ISO 8859-5 Cyrillic
28596       ISO 8859-6 Arabic
28597       ISO 8859-7 Greek
28598       ISO 8859-8 Hebrew
28599       ISO 8859-9 Latin Alphabet No.5
29001       Europa 3
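To see which of the code pages suggested for the console would show Windows-edited German text correctly, the choices can be compared in a few lines of Python. This is a rough sketch under stated assumptions: the CODECS dictionary maps Windows code page numbers to the Python codec names assumed to correspond to them, and the function name displays_correctly is invented for the example.

```python
# Assumed mapping from Windows code page numbers to Python codec names.
CODECS = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    28591: "iso8859_1",
    28592: "iso8859_2",
}

def displays_correctly(text, file_page, console_page):
    """True if bytes written under file_page read back unchanged under console_page."""
    raw = text.encode(CODECS[file_page])
    return raw.decode(CODECS[console_page], errors="replace") == text

# German letters saved by a Windows (1252) editor, read under each console page.
umlauts = "äöüÄÖÜß"
results = {page: displays_correctly(umlauts, 1252, page) for page in CODECS}

if __name__ == "__main__":
    for page, ok in results.items():
        print(page, "ok" if ok else "garbled")
```

Under these assumptions the umlauts survive with 1252, 28591 and 28592 but not with 437 or 850. The ISO sets are not full substitutes for 1252, though: a call such as displays_correctly("€", 1252, 28591) returns False, because the Windows set puts printable characters in the 128-159 range and ISO 8859-1 does not.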

There are registry entries under the category NLS (National Language Support) which list the code pages and defaults, but simply adding the line "mode con cp select=1252" or "mode con cp select=28592" to the batch file that launches the grep-like applications will change the console display properties. Here's the same German text displayed after manually setting the code page to 28592. Umlauts are now displayed in their full glory.
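As a rough sketch of the batch-file approach, the Python snippet below generates the text of such a launcher. Everything named here is hypothetical: make_launcher is a helper invented for this example, "mygrep.exe" is a placeholder for whichever grep-like utility is used, and "search.bat" is an arbitrary file name. Only the "mode con cp select=" line comes from the tip itself.

```python
def make_launcher(code_page, command):
    """Return batch-file text that selects a console code page, then runs command."""
    lines = [
        "@echo off",
        f"mode con cp select={code_page}",  # the line recommended in the tip
        f"{command} %*",                    # %* forwards the batch file's arguments
    ]
    # Batch files conventionally use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"

# "mygrep.exe" is a placeholder name, not a real utility.
launcher = make_launcher(1252, "mygrep.exe")

if __name__ == "__main__":
    # newline="" keeps the CRLF endings exactly as generated.
    with open("search.bat", "w", newline="") as f:
        f.write(launcher)
```

Running the generated batch file from the console would then set the code page before every search, so the setting does not have to be typed by hand each time.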

In addition, the console should be set to use a Unicode (TrueType) font rather than the raster fonts. This can be set as the default by right-clicking the title bar of the command console, clicking "Properties", and then clicking the "Font" tab. "Lucida Console" is used here. Then apply the change using the "Save properties for future windows with same title" option. Other console parameters can be changed here or with the "mode con" command; type "mode con /?" at the DOS prompt for a list.

Useful sites:
Character Sets
Microsoft on "SetConsoleOutputCP"
ASCII Diacritics: ( ISO-8859-1 Latin-1)
Command line MODE command


       © All contents copyright by WillPondCo