Quickly identifying phrases in text using Go
One problem I recently came across: given a list of phrases, how do I quickly and efficiently identify which ones appear in a given text? I don't want to include any overlapping phrases, and when multiple phrases match the same chunk of text, I only want to use the longest matching phrase.
For example, let's take this paragraph of text, shamelessly stolen from the H. G. Wells classic, “The War of the Worlds”:
The Thing itself lay almost entirely buried in sand, amidst the scattered splinters of a fir tree it had shivered to fragments in its descent. The uncovered part had the appearance of a huge cylinder, caked over and its outline softened by a thick scaly dun-coloured incrustation. It had a diameter of about thirty yards. He approached the mass, surprised at the size and more so at the shape, since most meteorites are rounded more or less completely. It was, however, still so hot from its flight through the air as to forbid his near approach. A stirring noise within its cylinder he ascribed to the unequal cooling of its surface; for at that time it had not occurred to him that it might be hollow.
And we have this list of potential phrases:
- meteorites are rounded
- meteorites were
- the thing
- appearance of a huge cylinder
- appearances of the markings
- its strange appearance
- wimbledon particularly had suffered
- … possibly thousands of others
We want to quickly identify which phrases are in the text, namely:
- The thing
- a huge cylinder
- meteorites are rounded
The most obvious and simplest approach would be to use a hash-map of the phrases, then go through the document, building phrases from adjacent words and checking them against the hash-map. For example:
    For each word:
        let found-phrase be nil
        let phrase be the word
        For x is 0 to maximum phrase length:
            if the phrase is in our hash-map:
                set found-phrase to be the phrase
            append the next word to phrase
        if found-phrase is not nil:
            add found-phrase to our output
Now this will do the job; however, it is a little brute-force and inelegant. The speed of the algorithm is quite dependent on the maximum phrase length, and it will quickly slow down if we want to start supporting longer phrases.
A much better solution is to know not only whether the first word is in the phrase list, but also whether any phrases start with that word. If not, we don't need to waste any time looking for longer phrases in the hash map; we can simply move on to the next word and start again from there. This essentially reduces the complexity from O(c * n * m), where c is the maximum phrase length, n is the number of words in the text and m is the cost of a lookup, to something closer to O(n * m) in most cases.
So how can we do this? The answer is in using a custom implementation of a Trie (or prefix-tree).
A Quick Look at Tries
A trie works by using each letter as the index of an array, which points to a child array representing the next letter in the word. Something like the following.
Key: HELLO (each step below is one full A–Z array, for the first through fifth letters; only the entry in use is shown)

    H → E → L → L → O → Value for HELLO
If we were to add another key to this, say “HELP”, the tree would then look something like:
    H → E → L ─┬→ L → O → Value for HELLO
               └→ P → Value for HELP

The fourth-letter array now holds two live entries, L and P, where the keys diverge.
Now, initially this is rather memory inefficient because we are creating an entire array for each letter, even if we are using one entry in it. So a basic memory optimisation is to remove all entries that only have a single child entry and then at the point things start to diverge, use an array there. With this optimisation in place, the above Trie would look something like this:
    H → Shortcut “EL” ─┬→ L → Shortcut “O” → Value for HELLO
                       └→ P → Value for HELP
In this case the common “EL” part of the string is used to shortcut a couple of unused arrays. This also demonstrates one of the nice things about a Trie, which is that it is quite efficient to store repeated prefixes.
NB: There are a number of papers published about additional optimisations that can be made to make a Trie even more efficient for both speed and memory (such as the use of half-sized pointers). However most of these assume an implementation in a lower-level language such as C.
A Better Solution
The real benefit for us however is that we can see quickly if a particular path is valid and may, if we keep adding words to our phrase, lead to a match. For example, there is no entry in the above Trie for the key “HEL”, however we do know that it is a valid prefix, so it is worth trying the next letter. Conversely, a search for “WOR” will fail at the first letter, and since there are no children we know there is no point attempting another search for “WORLD”.
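To make this concrete, here is a minimal trie sketch in Go whose lookup also reports prefix validity. For brevity it uses a map of child nodes per letter rather than the character-indexed arrays described above; node, Insert and Lookup are illustrative names of my own, not the final implementation's API:

```go
package main

import "fmt"

// node is one trie level; children is keyed by rune for brevity
// (the array-based layout above would index by letter instead).
type node struct {
	children map[rune]*node
	value    string
	isKey    bool
}

func newNode() *node { return &node{children: make(map[rune]*node)} }

// Insert stores value under key, one rune per trie level.
func (n *node) Insert(key, value string) {
	cur := n
	for _, r := range key {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.isKey = true
	cur.value = value
}

// Lookup reports whether key is stored, and crucially also whether
// key is a valid prefix of some longer stored key.
func (n *node) Lookup(key string) (found bool, validPrefix bool) {
	cur := n
	for _, r := range key {
		next, ok := cur.children[r]
		if !ok {
			return false, false
		}
		cur = next
	}
	return cur.isKey, len(cur.children) > 0
}

func main() {
	root := newNode()
	root.Insert("HELLO", "greeting")
	root.Insert("HELP", "assistance")
	fmt.Println(root.Lookup("HEL"))   // false true: not a key, but worth continuing
	fmt.Println(root.Lookup("HELLO")) // true false
	fmt.Println(root.Lookup("WOR"))   // false false: no point trying "WORLD"
}
```

The second return value is what lets the search bail out early.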
Therefore when searching, our implementation would now look something like this:
    For each word:
        let found-phrase be nil
        let phrase be the word
        For x is 0 to maximum-phrase-length:
            if the phrase is in our trie:
                set found-phrase to be the phrase
            if the phrase is a valid prefix in our trie:
                append the next word to phrase
            otherwise:
                break
        if found-phrase is not nil:
            add found-phrase to our output
Now that we can break quite early in the discovery process, we can afford to have a much longer maximum phrase length, as it will generally not increase the number of lookups we perform and therefore not impact on our speed.
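Putting the pieces together, here is a self-contained sketch of the trie-based search in Go. To keep it short it builds the trie over whole words rather than characters (the article's actual implementation is character-based), but the early-exit logic is the same; wordNode and FindPhrases are my own illustrative names:

```go
package main

import (
	"fmt"
	"strings"
)

// wordNode is a trie over whole words; each level is one word of a phrase.
type wordNode struct {
	children map[string]*wordNode
	isPhrase bool
}

func newWordNode() *wordNode {
	return &wordNode{children: make(map[string]*wordNode)}
}

// Insert adds a phrase to the trie, uppercased for case-insensitive matching.
func (n *wordNode) Insert(phrase string) {
	cur := n
	for _, w := range strings.Fields(strings.ToUpper(phrase)) {
		next, ok := cur.children[w]
		if !ok {
			next = newWordNode()
			cur.children[w] = next
		}
		cur = next
	}
	cur.isPhrase = true
}

// FindPhrases walks the text once; at each word it follows the trie only
// as long as the sequence remains a valid prefix, remembering the
// longest complete phrase seen so far.
func (n *wordNode) FindPhrases(text string) []string {
	words := strings.Fields(strings.ToUpper(text))
	var out []string
	for i := 0; i < len(words); i++ {
		cur := n
		foundEnd := -1
		for j := i; j < len(words); j++ {
			next, ok := cur.children[words[j]]
			if !ok {
				break // no phrase continues this way: stop immediately
			}
			cur = next
			if cur.isPhrase {
				foundEnd = j // longest match so far
			}
		}
		if foundEnd >= 0 {
			out = append(out, strings.Join(words[i:foundEnd+1], " "))
			i = foundEnd // skip past the match to avoid overlaps
		}
	}
	return out
}

func main() {
	root := newWordNode()
	for _, p := range []string{"the thing", "meteorites are rounded", "appearance of a huge cylinder"} {
		root.Insert(p)
	}
	fmt.Println(root.FindPhrases("The Thing itself since most meteorites are rounded"))
}
```

Because the inner loop breaks as soon as the prefix becomes invalid, a word that starts no phrase costs a single lookup, regardless of the maximum phrase length.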
How well does it work? To find out, I did a quick implementation in Go (golang).
This basic implementation:
- Is case-insensitive (strings are converted to uppercase when comparing)
- Includes fast support for alpha-numeric (caps) runes and a couple of basic bits of punctuation
- Includes unicode support by using a hash-backed lookup table for those bytes outside of the normally supported range
- Designed for load-once and read many (no delete support)
- Contains a basic memory saving technique outlined above
- Contains the crucial valid path return value when looking up a key.
Some basic benchmarks show this data structure is a little less than half the speed of a Go hash-map, which I don't think is too bad considering the hash map is a heavily optimised, native data-structure for Go:
    [richard@richard trie]$ go test -test.bench Fetch
    PASS
    BenchmarkFetchTrie       5000000       570 ns/op
    BenchmarkFetchHashmap    5000000       383 ns/op
With this in mind, we should expect at least a 2x speed increase when using our trie based algorithm instead of the brute-force hash-backed one with a 5 word phrase size limit (and almost 3x when searching for phrases 6 words in length), and so it goes:
    2012-06-17 17:11:42.197904 +0100 BST
    Test Hashmap: Found: [2 1 3 4]
    2012-06-17 17:11:42.363498 +0100 BST (duration: 0.17 seconds)
    Test Trie: Found: [2 1 3 4]
    2012-06-17 17:11:42.412292 +0100 BST (duration: 0.05 seconds)
So there we have it, an impressive speed increase and a much more scalable algorithm. If you want to have a play, I've made the source available on GitHub.