Posted by: joachimvandenbogaert | April 24, 2006

Unicode Conversion Issues

I needed a small application to convert non-Unicode MS Word documents to Word 6.0 for Macintosh documents. I could have hard-coded the conversion matrices, since I only needed a few conversions from CP1252 and CP1250 languages to some other European languages. But I wanted to make the application generic, because conversions for other languages might have to be carried out later.

The conversion tables I used come from a public directory of codepage conversion tables, which map characters in ASCII or extended ASCII onto their Unicode equivalents.

When I started developing the application, all went well, but one error persisted when converting to Macintosh Central European languages. At first I thought I had made a mistake, then I took a look at the fonts the Macintosh (OS 8.5) used. Some of them seemed corrupted. To make sure all glyphs were in the right place, I created an RTF document in Macintosh Word 6.0 using a CE font. I noticed something strange: all glyphs were in the right place (at least for Helvetica CE), but the Unicode references in the RTF file did not correspond to the Unicode mappings in the Unicode vendor mappings file.

This is what I found in the RTF file:

\u196\'80 \u197\'81 \u199\'82 \u201\'83
\u209\'84 \u214\'85 \u220\'86 \u225\'87
\u224\'88 \u226\'89 \u228\'8a \u227\'8b
\u229\'8c \u231\'8d \u233\'8e \u232\'8f

This is what the vendor mappings file says:

0x80 0x00C4 [= 196]
0x81 0x0100 [= 256]
0x82 0x0101 [= 257]
0x83 0x00C9 [= 201]
0x84 0x0104 [= 260]
0x85 0x00D6 [= 214]
0x86 0x00DC [= 220]
0x87 0x00E1 [= 225]
0x88 0x0105 [= 261]
0x89 0x010C [= 268]
0x8A 0x00E4 [= 228]
0x8B 0x010D [= 269]
0x8C 0x0106 [= 262]
0x8D 0x0107 [= 263]
0x8E 0x00E9 [= 233]
0x8F 0x0179 [= 377]

Conclusion: the mappings for Central European Macintosh in that file are wrong. The correct mappings can be found on:
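For readers who want to repeat the comparison, here is a minimal Python sketch of the check described in this post. It is not the application I wrote: the function names (parse_vendor_table, parse_rtf_pairs, find_mismatches) are made up for illustration, and it assumes the table rows look like "0xBYTE 0xCODEPOINT" and the RTF pairs look like \uN\'hh, as in the data quoted in this post.

```python
import re

# Rows as quoted in this post from the vendor mappings file: byte, code point.
VENDOR_TABLE = """\
0x80 0x00C4
0x81 0x0100
0x82 0x0101
0x83 0x00C9
0x84 0x0104
0x85 0x00D6
0x86 0x00DC
0x87 0x00E1
0x88 0x0105
0x89 0x010C
0x8A 0x00E4
0x8B 0x010D
0x8C 0x0106
0x8D 0x0107
0x8E 0x00E9
0x8F 0x0179
"""

# Pairs as quoted in this post from the RTF file: \uN is the decimal
# Unicode code point, \'hh is the byte in the document's code page.
RTF_FRAGMENT = (
    r"\u196\'80 \u197\'81 \u199\'82 \u201\'83 "
    r"\u209\'84 \u214\'85 \u220\'86 \u225\'87 "
    r"\u224\'88 \u226\'89 \u228\'8a \u227\'8b "
    r"\u229\'8c \u231\'8d \u233\'8e \u232\'8f"
)


def parse_vendor_table(text):
    """Return {byte: code point} from 'byte codepoint' rows (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def parse_rtf_pairs(rtf):
    """Return {byte: code point} from \\uN\\'hh pairs (hypothetical helper)."""
    pairs = {}
    for code, byte_hex in re.findall(r"\\u(-?\d+)\\'([0-9a-fA-F]{2})", rtf):
        code_point = int(code)
        if code_point < 0:
            code_point += 65536  # \u takes a signed 16-bit value
        pairs[int(byte_hex, 16)] = code_point
    return pairs


def find_mismatches(vendor, rtf):
    """List (byte, vendor code point, RTF code point) where the two disagree."""
    return [
        (byte, vendor[byte], rtf[byte])
        for byte in sorted(rtf)
        if byte in vendor and vendor[byte] != rtf[byte]
    ]


if __name__ == "__main__":
    vendor = parse_vendor_table(VENDOR_TABLE)
    rtf = parse_rtf_pairs(RTF_FRAGMENT)
    for byte, expected, found in find_mismatches(vendor, rtf):
        print(f"0x{byte:02X}: table says U+{expected:04X}, RTF says U+{found:04X}")
```

On the sixteen bytes quoted above, this flags nine that disagree (0x81, 0x82, 0x84, 0x88, 0x89, 0x8B, 0x8C, 0x8D and 0x8F) and leaves the seven matching ones alone.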


