Speeling problems? Trie F#.

March 9th, 2009

Maybe you just have large fingers.  Maybe you serve some traders, through a trade-entry language, and they have large fingers.  Whatever the reason, humans misspell words.

If you write software for humans, or if you yourself are a human, read on for a simple, classical solution to this ancient problem, expressed concisely in F#.

The first problem is recognizing that there is a problem

Yes, your parents were right.  You can’t be helped if you don’t know that you need help.  That’s why it’s important, right off the bat, to be able to determine that a word is actually misspelled before you try to correct it.  We’ll see, shortly, that correcting a misspelled word is only a slightly different problem than recognizing that it is misspelled.

To determine whether or not a word is spelled correctly, we need to consult a dictionary.  Paper dictionaries typically sort all known words in alphabetical order, forcing humans to validate spelling by a form of binary search (better at least than a linear search!), which incurs a log(N) cost for each lookup (with N being the number of words in the dictionary).  That might be acceptable just for validation, but when we search for spelling correction options that will translate to a series of lookups — enough to make it worthwhile to choose a different structure to represent our dictionary.

Enter the prefix-tree, or trie.  Tries store dictionaries efficiently by grouping all strings with a common prefix.  This defines a recursive structure, so that  to verify that a word is in the dictionary represented by a trie, we find the sub-trie associated with the first character of the word and then we verify that the remaining suffix of the word is in the sub-trie (recursively).

So, for example, the dictionary consisting of the following words:

Hello
Help
world
work
Jimmy
Jim

Would produce the following trie structure:

 

Here, edges N steps from the root represent the Nth character of some test string, and large circles represent valid words in the dictionary (of course, not all prefixes of valid words should be valid words themselves!).  And determining whether or not a word is in the dictionary just takes an amount of time proportional to the length of the word itself, rather than the size of the dictionary.

Just a short distance to the solution

Now we have an efficient method to determine whether or not a particular string is in our dictionary, but how can we use this structure to search for “similar” words once we determine that a word isn’t valid?  As my friend Doug Finke has pointed out, the critical definition of “similarity” is the edit-distance (or Levenshtein distance).

The edit-distance between two strings is defined as the minimum number of edits necessary to translate one into the other.  Within this definition, an “edit” means inserting a character somewhere within the string, modifying a character, or deleting a character.  For our purposes (though this is easily generalized) when we determine that a word is misspelled, we’ll search for all words in the dictionary that are at most an edit-distance of 1 from the erroneous word.

So, suppose that we had the same dictionary as above, and we were presented with the word “Jimky.”  Put aside for the moment that we know the obvious correction.  If we want to find all strings in the dictionary that are an edit-distance of 1 from “Jimky,” we can derive those from the definition of edit-distance itself:

Inserts
?Jimky
J?imky
Ji?mky
Jim?ky
Jimk?y
Jimky?

Updates
?imky
J?mky
Ji?ky
Jim?y
Jimk?

Deletes
imky
Jmky
Jiky
Jimy
Jimk

Here, “?” means “any valid character” (in other words, each string with a “?” above actually represents a set of strings — one string for each valid replacement of “?”).

This might seem to present an expensive proposition, possibly testing (and constructing!) a large number of strings for even moderately-sized words.  However, we can rest a bit easier in recognizing that several of these generated strings share a common prefix (e.g.: consider “Jim?ky” and “Jim?y” sharing the prefix “Jim”).  Luckily for us, we already have a data structure ideally suited for finding multiple words with a common prefix — the trie — which we can use to group all of these alternatives together and accept or reject them in bulk.

There is very little left now to bringing our solution together.  We can simply walk the dictionary-trie with our erroneous word.  Prior to taking each step, we consider what words we could find if we were to insert a character at the current level, replace the first character and search for the remaining suffix at the current level, or delete the first character and search for the remaining suffix from the current level — then we follow the current character in this word to the next level of the trie, and continue this process for the remainder of the string.

This method saves us some potential pain.  We don’t need to explicitly construct all strings that could possibly be an edit-distance of 1 from our test string.  Also, we have the exact information we need at each point that we decide to substitute valid characters for “?” (i.e.: at a given level in a trie, we always know what outgoing edges there are and what their labels are).

All that’s left now is to implement it.

A little F#

The first place to start is in representing a trie as a data-type.  A reasonable definition could be:

type (‘k,’a) trie = TNode of (‘a option * Map<’k,(‘k,’a) trie>)

Here, the type-variables ‘k and ‘a represent the base key type and element type (respectively).  A trie is just an associative map, and as with other associative maps we’re just defining the same kind of key and element types (except here, our actual key will be a sequence of the base key type).

We won’t really care much about ‘a (the element type) for this problem, and we’ll set ‘k to char.  It’s worth noting, though, that the value stored in a trie node is actually an ‘a option because we won’t store values in proper prefixes of valid strings (those will have just the option value None).

The edge/sub-trie relation is saved in a Map — and any type of associative map would do here.  In fact, in principle this edge/sub-trie map could itself be another trie (useful if our key sequence type is a list of strings, or a list of lists of strings, etc…).  This method of defining a type in terms of a (usually weaker) form of itself is termed “data-structural bootstrapping” (by at least one influential book on the subject of purely functional data structures).

Next, to be sufficiently general in defining a trie, we will want to abstract away the actual method of iterating over the key sequence and extracting keys.  An easy method of packaging that abstraction in F# is a record of functions, like:

type (‘ks, ‘k) iter_module =
   {
      eos  : (‘ks -> bool);
      head : (‘ks -> ‘k);
      tail : (‘ks -> ‘ks)
   };;

Here ‘ks represents a type that stores a sequence of our key elements, and ‘k represents the type of our key elements.  The functions in this iterator module should be familiar — eos determines whether or not a key-sequence is at its end, head gets the current key-value in a key-sequence, and tail increments the key-sequence iterator.

But for this spelling-correction problem we don’t need to be so general.  We already know that our base key type is char, and so we can define an iterator over a sequence of chars straightforwardly as a pair of a string (the sequence of chars) and an int (the current position in that sequence).  All that’s left then is to fill out a module definition that assumes this representation:

let siterm : (string * int, char) Trie.iter_module =
{
  eos  = fun (s, i) -> i >= String.length s;
  head = fun (s, i) -> s.[i];
  tail = fun (s, i) -> (s, i+1)
}
;;

We’ll assume the generic iterator module when implementing tries, and we’ll come back to this instance when we get to implementing spelling correction.  For now it’s useful to keep in mind as a justification for the key sequence iterator abstraction.

Now, we can define some simple functions over the nodes of a trie.  First, to extract the option value associated with a node:

let node_value = function
| TNode (ov, _) -> ov
;;

Similarly, to extract the map representing connections from the current trie node to sub-tries:

let node_map = function
| TNode (_, m) -> m
;;

To determine whether a trie is empty, and the standard empty trie:

let is_empty tn = Map.is_empty (node_map tn);;
let empty_trie  = TNode (None, Map.empty);;

And to find the sub-trie (if it exists) from a given node, along a specific edge:

let find_subtrie tn k =
 try
  Map.find k (node_map tn)
 with
  Not_found -> empty_trie
;;

Now, the first critical function defined on a trie is to update it (here using the key sequence iterator module and maintaining the assumption that common string prefixes will be shared):

let add m tn ks v =
 let rec upd tn’ ks’ =
  if (m.eos ks’) then
   TNode (Some v, (node_map tn’))
  else
   let k    = (m.head ks’) in
   let ov   = node_value tn’ in
   let tn” = upd (find_subtrie tn’ k) (m.tail ks’) in

    TNode (ov, Map.add k tn” (node_map tn’))
 in
  upd tn ks
;;

Similarly, to lookup the option value associated with a string:

let lookup m tn ks =
 let rec lv tn’ ks’ =
  if (m.eos ks’) then
   node_value tn’
  else
   let k = (m.head ks’) in
    lv (find_subtrie tn’ k) (m.tail ks’)
 in
  lv tn ks
;;

Finally, we can determine whether or not a specific word is valid in the trie by a simple use of the lookup function:

let mem m tn ks =
 match (lookup m tn ks) with
| Some _ -> true
| None   -> false
;;

Now, stepping outside of this trie definition, we can concern ourselves with implementing the spelling correction method that we worked out earlier.  First, we’ll need a couple of simple utilities.  One to determine the suffix of a string (represented by our string*int iterator):

let suffix (s, i) = String.sub s i (String.length s – i);;

And another, which computes the set of words defined by a trie node if we follow any valid character from that node, and then try to follow a specific char sequence:

let strings_with_suffix pfx trie wi words =
 let accumulate_paths c trie’ ws =
  if (mem siterm trie’ wi) then
   Set.add (pfx ^ (string c) ^ (suffix wi)) ws
  else
   ws
 in
  Map.fold accumulate_paths (node_map trie) words
;;

In other words, for a prefix P and a suffix S, this function finds P?S (substituting for “?” all valid characters that can follow P).

Finally we can write out our real spelling correction function, taking into account the insertions, replacements, and deletions that yield suggestions a maximum edit-distance of 1 from our test word:

let suggestions trie word =
  let rec paths_around_char pfx trie wi words =
   if (siterm.eos wi) then
    strings_with_suffix pfx trie (“”, 0) words
   else if (is_empty trie) then
    words
   else
    let c      = siterm.head wi in
    let wi’    = siterm.tail wi in
    let trie’  = find_subtrie trie c in
    let ins_ws = strings_with_suffix pfx trie wi words in
    let rep_ws = strings_with_suffix pfx trie wi’ ins_ws in
    let del_ws = if (mem siterm trie wi’) then
                  Set.add (pfx ^ (suffix wi’)) rep_ws
                 else
                  rep_ws
    in
     paths_around_char (pfx ^ (string c)) trie’ wi’ del_ws
  in
   paths_around_char “” trie (word, 0) Set.empty
;;

Finishing touches

To make this spelling corrector useful, some additional supporting code should be written to package it as a command, taking arguments for a dictionary and plaintext file.  We’ll also need some basic utilities for reading files and splitting them into words.  Additionally, we should provide the option to produce nice graphviz diagrams for arbitrary tries (as was used to produce the initial trie diagram in this article):

let graphviz ovstr kstr tn =
 let rec nodes p tn’ =
  let scs k = nodes (p ^ “_” ^ (kstr k)) (find_subtrie tn’ k) in
   let nshape = if is_terminal tn’ then “circle” else “point” in
   (” ” ^ p ^ ” [shape=\"" ^ nshape ^ "\" label=\"" ^ (ovstr (node_value tn')) ^ "\"];\n”) ^
   (String.concat “” (List.map scs (path_heads tn’)))
 in
 let rec connections p tn’ =
  let
   scs k =
    let sn = p ^ “_” ^ (kstr k) in
    (” ” ^ p ^ ” -> ” ^ sn ^ ” [ label = \"" ^ (kstr k) ^ "\"];\n”) ^ (connections sn (find_subtrie tn’ k))
  in
   String.concat “” (List.map scs (path_heads tn’))
 in
“digraph g {\n” ^ (nodes “root” tn) ^ (connections “root” tn) ^ “}\n”
;;

You can download the complete source as a Visual Studio 2008, F# console project here:

scheck.zip (5 KB)

In the attached project, edit the “dict” and “text” files to taste, or change the project’s default arguments to generate trie-graphs from the dictionary file.

One Response to “Speeling problems? Trie F#.”

  1. Kalani Thielen Says:

    Also, just to provide a little more detail on how efficient this method is…

    On my (underpowered) laptop, this program initializes the dictionary-trie from my ~0.5MB unix dictionary file in 0.18 seconds and it checks/corrects the entire text of Huckleberry Finn (~0.5MB as well — and a lot of words that need correction) in a little less than a second.

    This basic efficiency should also extend to most changes made to this method to make it more sophisticated (e.g.: sorting on likelihood of a word in context).