Posted by: joachimvandenbogaert | April 24, 2006

Unicode conversion issues

I needed a small application to convert non-Unicode MS Word documents to Word for Macintosh 6.0 documents. I could have hard-coded the conversion matrices, since I only needed a few conversions from CP1252 and CP1250 languages to some other European languages. But I wanted the application to be generic, because conversions for other languages might have to be carried out later.

The conversion tables I used can be found on unicode.org:
http://www.unicode.org/Public/MAPPINGS/VENDORS/
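For reference, here is a minimal Python sketch of loading one of these tables. It assumes the layout the vendor files generally use: one mapping per line, with the byte value and the Unicode code point both written in hexadecimal, followed by a `#` comment. The function name `parse_mapping` and the `SAMPLE` text are my own illustrations, not part of the application described in this post.

```python
def parse_mapping(text):
    """Return {byte_value: code_point} parsed from a mapping table."""
    table = {}
    for line in text.splitlines():
        # Everything after '#' is a comment (usually the character name).
        data = line.split("#", 1)[0]
        fields = data.split()
        if len(fields) < 2:
            # Blank line, comment-only line, or a byte with no mapping.
            continue
        # int(..., 16) accepts the leading "0x" prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical excerpt in the assumed layout (tab-separated columns).
SAMPLE = (
    "# byte\tUnicode\tname\n"
    "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS\n"
    "0x81\t0x0100\t# LATIN CAPITAL LETTER A WITH MACRON\n"
    "0x82\t0x0101\t# LATIN SMALL LETTER A WITH MACRON\n"
)

if __name__ == "__main__":
    for byte, code_point in sorted(parse_mapping(SAMPLE).items()):
        print(f"0x{byte:02X} -> U+{code_point:04X}")
```

Keeping the table as a plain dictionary means the same code works for any code page file in that layout, which is what makes the converter generic.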

In this public directory you can find code page conversion tables that map characters in ASCII or extended ASCII onto their Unicode equivalents. When I started developing the application, all went well, except for a persistent error when converting to Macintosh Central European languages. At first I thought I had made a mistake; then I took a look at the fonts the Macintosh (OS 8.5) used. Some of them seemed corrupted. To make sure all glyphs were in the right place, I created an RTF document in Macintosh Word 6.0 in a CE font. I noticed something strange:

All glyphs were in the right place (at least for Helvetica CE), but the Unicode references in the RTF file did not correspond to the mappings in the Unicode vendor mappings file:

This is what I found:

\u196\'80
\u197\'81
\u199\'82
\u201\'83
\u209\'84
\u214\'85
\u220\'86
\u225\'87
\u224\'88
\u226\'89
\u228\'8a
\u227\'8b
\u229\'8c
\u231\'8d
\u233\'8e
\u232\'8f
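Pairs like these can be pulled out of an RTF file mechanically. The sketch below is one way to do it in Python, assuming each `\uN` control word (a decimal code point) is followed directly by its fallback byte written as a backslash, a plain apostrophe and two hex digits. The name `rtf_pairs` is hypothetical, and the regular expression only covers this simple case, not RTF in general.

```python
import re

# \uN followed by \'hh: N is a decimal code point, hh the fallback byte in hex.
PAIR = re.compile(r"\\u(-?\d+) ?\\'([0-9a-fA-F]{2})")


def rtf_pairs(rtf):
    """Return {fallback_byte: code_point} for every \\uN\\'hh pair found."""
    pairs = {}
    for match in PAIR.finditer(rtf):
        code_point = int(match.group(1))
        if code_point < 0:
            # RTF stores N as a signed 16-bit number; fold negatives back.
            code_point += 0x10000
        pairs[int(match.group(2), 16)] = code_point
    return pairs


if __name__ == "__main__":
    sample = r"\u196\'80\u197\'81\u199\'82"
    for byte, code_point in sorted(rtf_pairs(sample).items()):
        print(f"0x{byte:02X} -> {code_point}")
```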

This is what the file on Unicode.org says:

0x80 0x00C4 [ = 196]
0x81 0x0100 [ = 256]
0x82 0x0101 [ = 257]
0x83 0x00C9 [ = 201]
0x84 0x0104 [ = 260]
0x85 0x00D6 [ = 214]
0x86 0x00DC [ = 220]
0x87 0x00E1 [ = 225]
0x88 0x0105 [ = 261]
0x89 0x010C [ = 268]
0x8A 0x00E4 [ = 228]
0x8B 0x010D [ = 269]
0x8C 0x0106 [ = 262]
0x8D 0x0107 [ = 263]
0x8E 0x00E9 [ = 233]
0x8F 0x0179 [ = 377]
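To see at a glance where the two listings disagree, they can be compared byte by byte. The following Python sketch hard-codes both listings exactly as given above; the names `RTF_FOUND`, `VENDOR_TABLE` and `differing_bytes` are only illustrative.

```python
# Byte -> code point as written in the RTF file (first listing).
RTF_FOUND = {
    0x80: 196, 0x81: 197, 0x82: 199, 0x83: 201,
    0x84: 209, 0x85: 214, 0x86: 220, 0x87: 225,
    0x88: 224, 0x89: 226, 0x8A: 228, 0x8B: 227,
    0x8C: 229, 0x8D: 231, 0x8E: 233, 0x8F: 232,
}

# Byte -> code point according to the mapping file (second listing).
VENDOR_TABLE = {
    0x80: 0x00C4, 0x81: 0x0100, 0x82: 0x0101, 0x83: 0x00C9,
    0x84: 0x0104, 0x85: 0x00D6, 0x86: 0x00DC, 0x87: 0x00E1,
    0x88: 0x0105, 0x89: 0x010C, 0x8A: 0x00E4, 0x8B: 0x010D,
    0x8C: 0x0106, 0x8D: 0x0107, 0x8E: 0x00E9, 0x8F: 0x0179,
}


def differing_bytes(first, second):
    """Return the sorted bytes present in both tables whose code points differ."""
    shared = first.keys() & second.keys()
    return [byte for byte in sorted(shared) if first[byte] != second[byte]]


if __name__ == "__main__":
    for byte in differing_bytes(RTF_FOUND, VENDOR_TABLE):
        print(
            f"0x{byte:02X}: RTF U+{RTF_FOUND[byte]:04X}, "
            f"table U+{VENDOR_TABLE[byte]:04X}"
        )
```

For the sixteen bytes listed here, nine differ (0x81, 0x82, 0x84, 0x88, 0x89, 0x8B, 0x8C, 0x8D and 0x8F) and seven agree.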

Conclusion: the mappings for Central European Macintosh on Unicode.org are wrong.
The correct mappings can be found at:

http://en.wikipedia.org/wiki/Macintosh_Central_European_encoding
