Posted by: joachimvandenbogaert | May 8, 2008

Integrating HunSpell into a C# application

INTRODUCTION

I was asked to add spell checking to an application I am writing. After some googling I found two candidate spell checkers. These were the features I was looking for:

  • Free to use/OpenSource
  • Active development community
  • Support for many languages
  • The ability to create custom dictionaries

NetSpell and HunSpell both fit the profile, but since HunSpell is OpenOffice’s default spell checker, I decided to go for HunSpell.

There was one problem though: there was no support for C#. The rest of this blog entry is dedicated to making the native win32 library usable from managed .NET code.

MANAGED vs. UNMANAGED

The main task consisted of making a C++ dll accessible via the .NET framework. It took me quite some time before I had all the information I needed. I even had to dig into some C++ in order to understand all the peculiarities involved.

The main problem with accessing unmanaged code is defining the correct data structures to pass between the methods in the managed code and those in the unmanaged code. The conversion between these data types is called “Marshaling”. The .NET framework takes care of the conversion itself, but you still need to define all required data structures yourself. For example, a managed C# string has no direct counterpart in unmanaged C++, so you have to find a way to pass and return strings.
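To make the string problem concrete, here is a minimal sketch (not taken from the original project) that does by hand what the marshaler does for a string parameter: it copies a managed string into unmanaged memory as a zero-terminated byte sequence and reads it back. It only uses the Marshal class, so no native dll is needed to try it. The names MarshalingSketch and RoundTrip are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MarshalingSketch
{
    // Copies a managed string to unmanaged memory and back again.
    public static string RoundTrip(string managed)
    {
        // Allocates unmanaged memory and writes the string there as
        // zero-terminated bytes in the platform's ANSI encoding.
        IntPtr unmanaged = Marshal.StringToHGlobalAnsi(managed);
        try
        {
            // Reads bytes up to the terminating zero and builds a new managed string.
            return Marshal.PtrToStringAnsi(unmanaged);
        }
        finally
        {
            // Unmanaged memory is not garbage collected, so it is freed explicitly.
            Marshal.FreeHGlobal(unmanaged);
        }
    }
}
```

The try/finally is there because nothing else will release the unmanaged copy if the read fails.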

For frequently used dlls inside the win32 api, a lot of work has been done by the community. http://www.pinvoke.net/ contains a large collection of what I call “mapped” function signatures. Indeed, the larger part of the programming consists of merely mapping unmanaged C/C++ method signatures onto managed C#.

ENTER SWIG

The Pinvoke website is useful if you are working with win32 libraries. If you use other libraries, you have to find or derive the mappings yourself, which is a difficult and labour-intensive job.

Luckily, there is http://www.swig.org/. Swig is a small command-line application that generates C++ and C# wrapper code for unmanaged code. It creates the mappings for you and works more or less in the same way an O/R mapper does. However, reading all the links above is still useful in case the mapping is not perfect.

x86 vs x64 part 1

There was another problem to cope with: after I copied the project to my second development machine, which happens to run Vista 64-bit, my Unit Tests would not run. A System.BadImageFormatException was thrown for each method that involved the unmanaged C++ library.

The cause was of course the LibHunSpell dll, which had been compiled for the x86 architecture. One solution would be to compile two versions of my library, each with a different dll (there is no such thing as conditional compile for different architectures in .NET), but since I wanted exactly the same unit tests to run on x86 and x64, regardless of which machine the code was running on, this was not an option.

To solve this problem, I implemented a Factory Pattern (http://en.wikipedia.org/wiki/Factory_method_pattern) to give the calling code the correct dll. If the code detects that it is running inside a 32-bit environment, it loads the 32-bit dll behind the scenes, otherwise it loads the 64-bit dll. To determine which environment the application is running in, I used the IntPtr.Size property as discussed in the following newsgroup:

http://groups.google.be/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/96d020ac10f93c95/b864357a5f35f3aa?hl=nl&lnk=st&q=detect+64bit+C%23

The reason to go for the IntPtr method was that it always reports the environment of the running process, even if the application was loaded in 32-bit mode on a 64-bit OS.
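The factory idea can be sketched as follows. This is not the code from the project: ISpellChecker, SpellChecker32 and SpellChecker64 are hypothetical stand-ins for the classes that would wrap the 32-bit and 64-bit dll, and the overload that takes the pointer size only exists to make the choice easy to test.

```csharp
using System;

// Hypothetical common interface for the two native bindings.
public interface ISpellChecker
{
    string Architecture { get; }
}

// Stand-in for the class that would wrap the 32-bit dll.
public sealed class SpellChecker32 : ISpellChecker
{
    public string Architecture { get { return "x86"; } }
}

// Stand-in for the class that would wrap the 64-bit dll.
public sealed class SpellChecker64 : ISpellChecker
{
    public string Architecture { get { return "x64"; } }
}

public static class SpellCheckerFactory
{
    // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process,
    // also when a 32-bit process runs on a 64-bit OS.
    public static ISpellChecker Create()
    {
        return Create(IntPtr.Size);
    }

    // Takes the pointer size explicitly so both branches can be exercised on one machine.
    public static ISpellChecker Create(int pointerSize)
    {
        if (pointerSize == 4) return new SpellChecker32();
        if (pointerSize == 8) return new SpellChecker64();
        throw new NotSupportedException("Unexpected pointer size: " + pointerSize);
    }
}
```

Calling code only ever sees ISpellChecker, so the same unit tests can run unchanged on both architectures.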

Code edits

Some unit tests showed that the code generated by SWIG was not perfect. The C# bindings only seemed to work for the Windows-1252 code page. It was possible to spell check Dutch and English for example, but Russian in KOI8-R did not work. It took some time to figure out what was wrong. The following articles helped me a lot:

And of course also the Java source code here: http://dion.swamp.dk/hunspell.html

They demonstrated the difference between a C++ string and a C# string. A C-style string (char*) is terminated in memory by a zero byte (“\u0000”). So HunSpell actually expects a sequence of bytes, in whatever encoding you configure, terminated by “\u0000”. I ended up changing some signatures, replacing the input string by a byte array (byte[]). In the code I then replaced the string by a zero-terminated string converted to a byte array with m_Encoding.GetBytes(). This seemed to do the trick.
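A minimal sketch of that conversion could look like this. The names NativeStrings and ToZeroTerminatedBytes are made up for illustration. Unlike the approach described above, it appends the zero byte after encoding instead of encoding a string that ends in “\u0000”. For single-byte encodings and UTF-8 both give the same bytes, and I assume one of those is in use.

```csharp
using System;
using System.Text;

public static class NativeStrings
{
    // Encodes a word in the given encoding and appends a single zero byte.
    public static byte[] ToZeroTerminatedBytes(string word, Encoding encoding)
    {
        byte[] encoded = encoding.GetBytes(word);
        byte[] result = new byte[encoded.Length + 1];
        Array.Copy(encoded, result, encoded.Length);
        // result[encoded.Length] is already 0, because new arrays are zero-initialized.
        return result;
    }
}
```

The resulting array is what would be handed to a byte[] parameter such as the one in SpellCheck(byte[] word).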

Codepage mappings

Some more unit tests now revealed that a lot of HunSpell affix files use old encodings. Their names had to be mapped onto the Windows implementations of the code pages. This required some research. The following links contain pages or papers that explicitly state that HunSpell code pages and Windows code pages are equivalent:
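Such a mapping can be kept in a simple dictionary. The sketch below is an illustration only: the entries are a handful of encoding names as they appear on the SET line of affix files, paired with the .NET names I would expect to match them. It is not the table from the project, and EncodingNames and ToDotNetName are made-up names.

```csharp
using System;
using System.Collections.Generic;

public static class EncodingNames
{
    // Illustrative entries only; not the mapping table from the original project.
    private static readonly Dictionary<string, string> s_HunSpellToDotNet =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "UTF-8", "utf-8" },
            { "ISO8859-1", "iso-8859-1" },
            { "ISO8859-15", "iso-8859-15" },
            { "KOI8-R", "koi8-r" },
            { "microsoft-cp1251", "windows-1251" },
        };

    // Returns the .NET encoding name for the name found in an affix file.
    public static string ToDotNetName(string hunSpellName)
    {
        string dotNetName;
        if (s_HunSpellToDotNet.TryGetValue(hunSpellName.Trim(), out dotNetName))
        {
            return dotNetName;
        }
        throw new NotSupportedException("No mapping for encoding: " + hunSpellName);
    }
}
```

The returned name can then be passed to Encoding.GetEncoding. On .NET Core and later, the legacy code pages are only available after registering CodePagesEncodingProvider.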

x86 vs x64 part 2

After I got the most important method (SpellCheck(byte[] word)) working properly, I ventured into getting the other methods to work. This turned out to be a frustrating chore for x64.

In x86 there is a suggest function that requires a char*** as an argument, in which it stores the suggestions for a given word. Using IntPtr, it was easy enough to convert this char*** to a C# string[]. The char*** is actually a pointer to an array of strings, so the address where the array is stored is retrieved with Marshal.ReadIntPtr(pointerToAddressStringArray). The int returned by the suggest method tells you how many suggestions there are. The addresses of the char arrays that hold the words are then retrieved by calling Marshal.ReadIntPtr on the address of the string array, with an offset that shifts 4 bytes (the size of a pointer on x86) each time. Finally, you call Marshal.ReadByte until you encounter a byte with value 0, to retrieve all characters of a word.
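The procedure above can be tried without HunSpell by building a small array of C strings in unmanaged memory and reading it back. This is a sketch under my own assumptions, not the project's code: StringArrayReader, Read and Demo are made-up names, and the offset uses IntPtr.Size instead of a hard-coded 4 so that it matches the pointer size of the running process. It does not address the x64 behaviour of the native library itself.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

public static class StringArrayReader
{
    // Reads 'count' zero-terminated strings from a native array of char pointers.
    public static List<string> Read(IntPtr stringArray, int count, Encoding encoding)
    {
        List<string> results = new List<string>();
        for (int i = 0; i < count; i++)
        {
            // Each array slot holds one pointer: 4 bytes on x86, 8 bytes on x64.
            IntPtr charArray = Marshal.ReadIntPtr(stringArray, i * IntPtr.Size);
            List<byte> bytes = new List<byte>();
            int offset = 0;
            byte current;
            while ((current = Marshal.ReadByte(charArray, offset++)) != 0)
            {
                bytes.Add(current);
            }
            results.Add(encoding.GetString(bytes.ToArray()));
        }
        return results;
    }

    // Builds a fake native string array so the reader can be exercised on its own.
    public static List<string> Demo()
    {
        string[] words = { "spell", "check" };
        IntPtr array = Marshal.AllocHGlobal(IntPtr.Size * words.Length);
        IntPtr[] items = new IntPtr[words.Length];
        try
        {
            for (int i = 0; i < words.Length; i++)
            {
                items[i] = Marshal.StringToHGlobalAnsi(words[i]);
                Marshal.WriteIntPtr(array, i * IntPtr.Size, items[i]);
            }
            return Read(array, words.Length, Encoding.ASCII);
        }
        finally
        {
            // Release every string and then the pointer array itself.
            foreach (IntPtr item in items) Marshal.FreeHGlobal(item);
            Marshal.FreeHGlobal(array);
        }
    }
}
```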

I thought I could follow the same procedure for x64 but HunSpell seems to mess up the memory profoundly. Moreover there seems to be something totally wrong with how the characters are stored. If you have a look at the memory in a 64-bit environment you can hardly recognise any character.

For the moment, I stopped working on the 64-bit version. To fix it, I’m afraid I will have to dig into the C++ code.

Responses

  1. Hi, how did you deal with char***? I’m trying to create a C++ CLR wrapper. I’m getting a null ref exception when I call suggest. Here is some sample code.

    array<String^>^ Suggest(String^ word)
    {
        char **slst = NULL;
        int len = pMS->suggest((char***)slst,
            (char *) Marshal::StringToHGlobalAnsi(word).ToPointer());

        array<String^>^ sl = gcnew array<String^>(len);

        for (int i = 0; i < len; i++)
        {
            // (loop body missing from the posted comment)
        }
        pMS->free_list((char***)slst, len);
        return sl;
    }

  2. Hi Paul,

    With the suggest method you get a pointer to an array of strings. To get the strings out of it, you take the number of suggestions and with this count you can marshal the individual strings (which are represented in memory as char arrays). Then you read out the bytes with the correct .NET encoding. Note that I implemented a Dictionary that maps hunspell encodings onto .NET encodings to achieve this. Also note that you reach the end of an ANSI string when a 0-byte is encountered.

    For the moment this only works for the 32-bit version. In theory it should work with x64 also, by shifting by 8 instead of 4 bytes where applicable, but when I look at the memory in x64, it seems that it is completely messed up.

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;
    using System.Text;

    public List<string> Suggest(byte[] word, Encoding encoding)
    {
        List<string> results = new List<string>();

        // Pointer to string array
        IntPtr pointerToAddressStringArray = Marshal.AllocHGlobal(IntPtr.Size);
        int resultCount = LibHunSpell32.Hunspell_suggest(m_HunSpellHandle, word, pointerToAddressStringArray);

        // StringArray
        IntPtr addressStringArray = Marshal.ReadIntPtr(pointerToAddressStringArray);

        for (int i = 0; i < resultCount; i++)
        {
            // String (CharacterArray); each pointer in the array takes 4 bytes on x86
            IntPtr addressCharArray = Marshal.ReadIntPtr(addressStringArray, i * 4);
            int offset = 0;
            List<byte> bytesList = new List<byte>();
            byte newByte = Marshal.ReadByte(addressCharArray, offset++);
            while (newByte != 0)
            {
                bytesList.Add(newByte);
                newByte = Marshal.ReadByte(addressCharArray, offset++);
            }
            // Convert only the bytes that were read; the terminating 0 is not part of the word
            byte[] bytesArray = bytesList.ToArray();
            string suggestion = encoding.GetString(bytesArray);
            results.Add(suggestion);
        }

        // Only the pointer allocated above is freed; the native suggestion list is not released here
        Marshal.FreeHGlobal(pointerToAddressStringArray);
        return results;
    }

  3. I work on an x64 machine, so that doesn’t help me much. I’ve decided to port Hunspell and Hyphen with managed C++, and it works well on x64. Please take a look at the project on:

    nhunspell.sourceforge.net

    and tell me what I can do better or what you like. I know I don’t deal with utf8 right at the moment. You can make some suggestions if you like.

  4. I’ve released the first beta on
    http://nhunspell.sourceforge.net
    it deals with utf8, you can take a look at it.

