    Quickly identifying phrases in text using Go

    2012-07-01

    One problem I recently came across is this: given a list of phrases, how do I quickly and efficiently identify which of them appear in a given text? I don't want to include any overlapping phrases, and when multiple phrases match the same chunk of text, I only want to use the longest one.

    For example, let's take this paragraph of text, shamelessly stolen from the H. G. Wells classic, “The War of the Worlds”:

    The Thing itself lay almost entirely buried in sand, amidst the scattered splinters of a fir tree it had shivered to fragments in its descent.  The uncovered part had the appearance of a huge cylinder, caked over and its outline softened by a thick scaly dun-coloured incrustation.  It had a diameter of about thirty yards.  He approached the mass, surprised at the size and more so at the shape, since most meteorites are rounded more or less completely.  It was, however, still so hot from its flight through the air as to forbid his near approach.  A stirring noise within its cylinder he ascribed to the unequal cooling of its surface; for at that time it had not occurred to him that it might be hollow.
    

    And we have this list of potential phrases:

    • meteorites are rounded
    • meteorites were
    • hollow
    • the thing
    • appearance of a huge cylinder
    • appearances of the markings
    • its strange appearance
    • wimbledon particularly had suffered
    • … possibly thousands of others

    We want to quickly identify which phrases are in the text, namely:

    • The thing
    • appearance of a huge cylinder
    • meteorites are rounded
    • hollow

    Initial Solution:

    The most obvious and simplest approach would be to use a hash-map of the phrases, then go through the document, building phrases from adjacent words and checking them against the hash-map. For example:

    For each word:
        let found-phrase be nil
        let phrase be the word
        For x is 0 to maximum-phrase-length:
            if the phrase is in our hash-map:
                set found-phrase to be the phrase
            append the next word to phrase
        if found-phrase is not nil:
            add found-phrase to our output
            skip past the words of found-phrase (so matches don't overlap)
    
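    To make this concrete, here is a minimal Go sketch of the brute-force approach (my own illustration, not the post's source): it assumes a simple whitespace tokeniser, ignores punctuation, and stores the phrase set uppercased.

    package main

    import (
        "fmt"
        "strings"
    )

    const maxPhraseLen = 5 // maximum phrase length, in words

    // findPhrases returns the longest phrase starting at each word
    // position, then skips the words it consumed so that matches
    // never overlap.
    func findPhrases(text string, phrases map[string]bool) []string {
        words := strings.Fields(strings.ToUpper(text))
        var found []string
        for i := 0; i < len(words); {
            longest := 0
            for n := 1; n <= maxPhraseLen && i+n <= len(words); n++ {
                if phrases[strings.Join(words[i:i+n], " ")] {
                    longest = n
                }
            }
            if longest > 0 {
                found = append(found, strings.Join(words[i:i+longest], " "))
                i += longest // jump past the match: no overlapping phrases
            } else {
                i++
            }
        }
        return found
    }

    func main() {
        phrases := map[string]bool{
            "THE THING":              true,
            "METEORITES ARE ROUNDED": true,
            "HOLLOW":                 true,
        }
        fmt.Println(findPhrases("since most meteorites are rounded more or less completely", phrases))
    }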

    Now, this will do the job, but it is a little brute-force and inelegant. The speed of the algorithm depends heavily on the maximum phrase length, and it will quickly slow down if we want to start supporting longer phrases.

    A much better solution is to know not only whether the first word is itself a phrase, but also whether any phrases start with that word. If not, we don't need to waste any time looking for longer phrases in the hash-map; we can simply move on to the next word and start again from there. This essentially reduces the complexity from O(c * n * m), where c is the maximum phrase length in words, n is the number of words in the document and m is the cost of a single lookup, to something closer to O(n * m) in most cases.

    So how can we do this? The answer lies in a custom implementation of a Trie (or prefix tree).

    A Quick Look at Tries:

    A trie works by using each letter of the key as an index into an array, where each entry points to a child array representing the next letter in the word. Something like the following:

    Key: HELLO
    
    First letter → Second letter → Third letter → Fourth letter → Fifth letter
    A
    B
    C
    …
    H →
        A
        …
        E →
            A
            …
            L → 
                A
                …
                L →
                    A
                    …
                    O → Value for HELLO
                    …
                    Z
                …
                Z
            …
            Z
        …
        Z
    …
    Z
    
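    Expressed as a minimal Go sketch (my own illustration, restricted to the letters A–Z for brevity, not the implementation discussed later), the structure looks like this:

    package main

    import "fmt"

    // node is one level of the trie: a 26-entry array with one slot
    // per letter, each pointing at the array for the next letter.
    type node struct {
        children [26]*node
        value    string // set when a complete key ends at this node
        hasValue bool
    }

    // insert walks the trie one letter at a time, creating child
    // arrays as needed, and stores the value at the final node.
    func (n *node) insert(key, value string) {
        for i := 0; i < len(key); i++ {
            c := key[i] - 'A'
            if n.children[c] == nil {
                n.children[c] = &node{}
            }
            n = n.children[c]
        }
        n.value, n.hasValue = value, true
    }

    // lookup returns the value stored against key, if any.
    func (n *node) lookup(key string) (string, bool) {
        for i := 0; i < len(key); i++ {
            c := key[i] - 'A'
            if n.children[c] == nil {
                return "", false
            }
            n = n.children[c]
        }
        return n.value, n.hasValue
    }

    func main() {
        root := &node{}
        root.insert("HELLO", "Value for HELLO")
        fmt.Println(root.lookup("HELLO")) // Value for HELLO true
        fmt.Println(root.lookup("HEL"))   // "" false: the path exists, but no value is stored
    }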

    If we were to add another key to this, say “HELP”, the tree would then look something like:

    A
    B
    C
    …
    H →
        A
        …
        E →
            A
            …
            L → 
                A
                …
                L →
                    A
                    …
                    O → Value for HELLO
                    …
                    Z
                M
                N
                O
                P → Value for HELP
                …
                Z
            …
            Z
        …
        Z
    …
    Z
    

    Now, initially this is rather memory-inefficient, because we are creating an entire array for each letter even if we only use one entry in it. So a basic memory optimisation is to collapse any run of nodes that have only a single child into a "shortcut" string, and only use a full array at the point where the keys start to diverge. With this optimisation in place, the above Trie would look something like this:

    A
    B
    C
    …
    H → Shortcut “EL”
        A
        …
        L → Shortcut “O” → Value for HELLO
        M
        N
        O
        P → Value for HELP
        …
        Z
    …
    Z
    

    In this case the common "EL" part of the string is used to shortcut a couple of otherwise single-entry arrays. This also demonstrates one of the nice things about a Trie: it stores repeated prefixes very efficiently.
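    A Go sketch of what such a compressed node might look like (my own illustration of the idea, using a map for the children here where the diagrams above show full arrays):

    package main

    import (
        "fmt"
        "strings"
    )

    // cnode is a compressed trie node: shortcut holds the run of
    // letters every key passing through this node shares, so chains
    // of single-child arrays collapse into one string comparison.
    type cnode struct {
        shortcut string
        children map[byte]*cnode
        value    string
        hasValue bool
    }

    // lookup consumes the shortcut at each node before branching on
    // the next letter of the key.
    func (n *cnode) lookup(key string) (string, bool) {
        for {
            if !strings.HasPrefix(key, n.shortcut) {
                return "", false
            }
            key = key[len(n.shortcut):]
            if key == "" {
                return n.value, n.hasValue
            }
            child := n.children[key[0]]
            if child == nil {
                return "", false
            }
            key = key[1:]
            n = child
        }
    }

    func main() {
        // Hand-built compressed trie for the diagram above:
        // H → "EL" → {L → "O" → HELLO, P → HELP}.
        root := &cnode{children: map[byte]*cnode{
            'H': {shortcut: "EL", children: map[byte]*cnode{
                'L': {shortcut: "O", value: "Value for HELLO", hasValue: true},
                'P': {value: "Value for HELP", hasValue: true},
            }},
        }}
        fmt.Println(root.lookup("HELLO")) // Value for HELLO true
        fmt.Println(root.lookup("HELP"))  // Value for HELP true
    }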

    NB: There are a number of papers published about additional optimisations that can be made to make a Trie even more efficient for both speed and memory (such as the use of half-sized pointers). However most of these assume an implementation in a lower-level language such as C.

    A Better Solution:

    The real benefit for us, however, is that we can quickly see whether a particular path is valid and may, if we keep adding words to our phrase, lead to a match. For example, there is no entry in the above Trie for the key "HEL", but we do know that it is a valid prefix, so it is worth trying the next letter. Conversely, a search for "WOR" will fail at the first letter, and since nothing in the Trie starts with "W", we know there is no point attempting another search for "WORLD".

    Therefore when searching, our implementation would now look something like this:

    For each word:
        let found-phrase be nil
        let phrase be the word
        For x is 0 to maximum-phrase-length:
            if the phrase is in our trie:
                set found-phrase to be the phrase
            if the phrase is a valid prefix in our trie:
                append the next word to phrase
            otherwise:
                break
        if found-phrase is not nil:
            add found-phrase to our output
            skip past the words of found-phrase (so matches don't overlap)
    
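    Here is a minimal Go sketch of that search (my own illustration, with a byte-at-a-time phrase trie whose find method returns both whether the key is a complete phrase and whether it is still a valid prefix):

    package main

    import (
        "fmt"
        "strings"
    )

    // pnode is one node of a phrase trie, keyed byte by byte.
    type pnode struct {
        children map[byte]*pnode
        isPhrase bool
    }

    func newPnode() *pnode { return &pnode{children: map[byte]*pnode{}} }

    // add inserts a complete phrase into the trie.
    func (n *pnode) add(phrase string) {
        for i := 0; i < len(phrase); i++ {
            child := n.children[phrase[i]]
            if child == nil {
                child = newPnode()
                n.children[phrase[i]] = child
            }
            n = child
        }
        n.isPhrase = true
    }

    // find reports whether key is a complete phrase, and whether any
    // longer phrase starts with key (so extending it is worthwhile).
    func (n *pnode) find(key string) (match, validPrefix bool) {
        for i := 0; i < len(key); i++ {
            n = n.children[key[i]]
            if n == nil {
                return false, false
            }
        }
        return n.isPhrase, len(n.children) > 0
    }

    // findWithTrie builds candidate phrases word by word, breaking
    // out as soon as the trie says no phrase can start this way.
    func findWithTrie(text string, trie *pnode) []string {
        words := strings.Fields(strings.ToUpper(text))
        var found []string
        for i := 0; i < len(words); {
            longest, phrase := 0, ""
            for n := 1; i+n <= len(words); n++ {
                if n > 1 {
                    phrase += " "
                }
                phrase += words[i+n-1]
                match, validPrefix := trie.find(phrase)
                if match {
                    longest = n
                }
                if !validPrefix {
                    break // nothing in the trie starts this way
                }
            }
            if longest > 0 {
                found = append(found, strings.Join(words[i:i+longest], " "))
                i += longest // skip past the match: no overlaps
            } else {
                i++
            }
        }
        return found
    }

    func main() {
        trie := newPnode()
        for _, p := range []string{"THE THING", "METEORITES ARE ROUNDED", "HOLLOW"} {
            trie.add(p)
        }
        fmt.Println(findWithTrie("since most meteorites are rounded more or less completely", trie))
    }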

    Now that we can break out of the discovery process early, we can afford a much longer maximum phrase length, as it will generally not increase the number of lookups we perform and therefore will not impact our speed.

    How well does it work? To find out, I did a quick implementation in Go (golang).

    This basic implementation:

    • Is case-insensitive (strings are converted to uppercase when comparing)
    • Includes fast support for alphanumeric (caps) runes and a couple of basic bits of punctuation
    • Includes unicode support by using a hash-backed lookup table for those bytes outside of the normally supported range (the mapping is sketched after this list)
    • Is designed for load-once, read-many use (no delete support)
    • Contains the basic memory-saving technique outlined above
    • Contains the crucial valid-path return value when looking up a key
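
    The rune-to-index mapping from the second and third points might look something like this sketch (the function name and slot layout are my guesses at the general shape, not the actual source):

    package main

    import "fmt"

    // indexOf maps a rune to a child-array slot: fixed offsets for
    // the fast cases (A-Z, 0-9, a little punctuation) and a
    // hash-backed table for any other rune.
    func indexOf(r rune, extra map[rune]int) (int, bool) {
        switch {
        case r >= 'A' && r <= 'Z':
            return int(r - 'A'), true
        case r >= '0' && r <= '9':
            return 26 + int(r-'0'), true
        case r == ' ':
            return 36, true
        case r == '-':
            return 37, true
        default:
            i, ok := extra[r] // unicode fallback via the hash-backed table
            return i, ok
        }
    }

    func main() {
        extra := map[rune]int{'É': 38}
        fmt.Println(indexOf('A', extra)) // 0 true
        fmt.Println(indexOf('É', extra)) // 38 true
        fmt.Println(indexOf('Ω', extra)) // 0 false: rune not in the table
    }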

    Some basic benchmarks show this data structure running at roughly two-thirds the speed of a Go hash-map, which I don't think is too bad considering the hash-map is a heavily optimised, native data structure for Go:

    [richard@richard trie]$ go  test -test.bench Fetch
    PASS
    BenchmarkFetchTrie   5000000           570 ns/op
    BenchmarkFetchHashmap    5000000           383 ns/op
    

    With this in mind, we should expect at least a 2x speed increase when using our trie-based algorithm instead of the brute-force hash-backed one with a 5-word phrase-size limit (and almost 3x when searching for phrases 6 words in length), and so it goes:

    2012-06-17 17:11:42.197904 +0100 BST
    Test Hashmap:
    Found: [2 1 3 4]
    2012-06-17 17:11:42.363498 +0100 BST  (duration: 0.17 seconds)
    Test Trie:
    Found: [2 1 3 4]
    2012-06-17 17:11:42.412292 +0100 BST (duration: 0.05 seconds)
    
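    (As a rough sanity check, using my own arithmetic on the figures above: the brute-force version performs five hash lookups per starting word, costing around 5 × 383 ≈ 1,900 ns, while the trie version usually stops after a single ~570 ns lookup because most words don't begin any phrase. That is consistent with the measured 0.17 s versus 0.05 s.)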

    So there we have it: an impressive speed increase and a much more scalable algorithm. If you want to have a play, I've made the source available on GitHub.
