A fast, fuzzy, full-text index using Redis
May 5th, 2010 by adamNo matter how much we try to simplify PlayNice.ly, there will always be some features that we must include. One of these is search.
Say "search" to a developer and many will think of MySQL/MyISAM or Lucene. Both of these are pretty rockin' in their own way. MySQL's MyISAM storage engine can work as a pretty straight-forward search system, and Lucene is very powerful, and great for larger datasets.
However, we have found that MySQL's full-text search can get pretty sluggish when you are dealing with gigabytes of data. Lucene was a very strong candidate, but we felt there could be a third option…
PlayNice.ly is entirely based on a data-structure server called Redis. Redis is one of several new key-value databases which break away from traditional relational data architecture. It is simple, flexible, and blazingly fast. So why not use the tools we have already?
On a winter afternoon, we came up with the following…
Step 1: Indexing
Indexing your content into an easily searchable format is the first step with most search systems. But what format should we use?
At the most basic level, we would need to create a lookup to provide a list of item IDs for a given search keyword. In PlayNice.ly terms, an 'item' is a bug or task. Really though, it could be any piece of content, it just depends on your application. At PlayNice.ly we have items, so that's what we use here. ![]()
So, in Redis, we would create a set of item IDs for each word found in every item. Each set would be identified by a key containing the word. For example:
word:python = SET(9, 12, 23) word:exception = SET(2, 11) ...
We create these sets by extracting words from item descriptions, title, tags, comments etc. So if a user searched for 'exception', we could instantly find out which items contained the word 'exception'.
Yes, this system could result in a lot of keys. However, there are only a few hundred thousand words in the English language, and even that many keys is not likely to give Redis any problems!
Step 1.5: Fuzzy indexing
That is all well and good, but what if someone searched for 'exceptions'? Or what if, like me, their spelling was somewhat sub-par and they typed 'exceptins'? Well, they would be out of luck. So the forgiving world of PlayNice.ly needed to do better.
Enter phonetic algorithms, a blessing to the linguistically challenged.
Phonetic algorithms are actually pretty simple. They are just a way of taking a word, dropping out some letters (according to a set of rules), and providing you with a rough representation of how that word sounds. Common phonetic algorithms are Soundex and Metaphone. The former is pretty dated now, but we understand the latter to be more up-to-scratch (any to play nicely with other Western languages).
To give you an example, if you were to take the Metaphone of 'steven', then you would get 'STFN'. You would also get the same value for 'steeven' or 'stefen'.
So if we use the above example again, the Metaphones for 'python' and 'exception' are 'P0N' and 'AKSPXN' respectively. We can now use those values instead of the words in our Redis keys:
# using the Metaphone phonetic algorithm word:P0N = SET(9, 12, 23) word:AKSPXN = SET(2, 11) ...
So now we will get matches as long as our spelling has a reasonable resemblance to the word we mean.
We actually took our system slightly further and used the Double Metaphone phonetic algorithm (python lib), which may return two values for words with different possible pronunciations (sort of). So now, 'python' actually gets two codes, 'P0N' and 'PTN', which each get their own keys:
# using the Double Metaphone phonetic algorithm word:P0N = SET(9, 12, 23) word:PTN = SET(9, 12, 23) word:AKSPXN = SET(2, 11) ...
Yes this is more storage, but we thought it would be a worth it for the sake of a more lenient search engine.
Step 2: Searching (and the SPEED)
So now we actually want to do a search. This is a two step process, but each step is lightning fast:
- Take the Metaphone of our keyword
- Get the contents of that set from Redis.
That can be summed up with this single line of pseudo-code:
redis.smembers("word:" + metaphone("python"))
And that is it, and Redis could easily cope with doing tens of thousands of these searches per second, even on the most basic hardware.
But let's make it a little more complex. Say we want to search for multiple keywords:
# Search for multiple keywords keywords = ["python", "exception"] metaphones = [metaphone(k) for k in keywords] redis_keys = ["word:" + m for m in metaphones] item_ids = redis.sunion(redis_keys)
Here we use the SUNION Redis command to perform a union (i.e. an "OR") on all of our results. If you want to perform an "AND" on the keywords, then you can use the SINTER Redis command instead.
Summing up
I hope this has shown how simple it can be to implement blazingly fast fuzzy searching using Redis. PlayNice.ly performs all of its searching and filtering operations within Redis using nothing more than sets and sunion/sinter commands.
For the inquisitive, we put a stand-alone version of our indexing code here:
In the future we hope to cover more advanced Redis searching functionality.
If you liked this blog post then we would absolutely love you to tell us what you think. Please post comments and talk to us on Twitter, we are all very friendly.
And if you'd like to talk more about Redis, definitely join our Redis London group on Meetup.













May 5th, 2010 at 6:04 pm
Are you considering doing any tf-idf style indexing with Redis?
May 5th, 2010 at 6:42 pm
Hey Jacob! Yep, this is something we have thought about but we have yet to see a use-case for it in PlayNice.ly. This simply due to the way our filtering interface works, which places emphasis on speed and quick feedback.
However, there are a couple of ways I could imagine this working:
* Counting frequency of words
* Assigning weightings to different fields (a word in the subject is probably worth more than a word in a comment)
* Looking at differences between the search keyword and the word which the metaphone represents (using something like the Levenshtein Difference)
But if you do need this added complexity then you may be better served by something like Sphinx/Lucene. But who knows, maybe someone will develop an equivalent which backs onto Redis
May 6th, 2010 at 12:32 am
When I was making indexed search for my wiki, I discovered that finding the items is just a part of the task. A much more difficult task is ordering them. At first I just did a simple sort by the number of hits, but in the end I get the best results with word relevancy — the number of hits in given item divided by the number of hits of the given word in all items together, summed for all words in the search.
This may be a little problematic with your model, especially since you don’t record how many times the word appeared in given item.
May 6th, 2010 at 2:18 am
Have you tried lucene/solr and/or ferret? These optimize the hell out of inverted indexes and baked for many years. xtractr (http://www.pcapr.net/xtractr) uses ferret, for example.
May 6th, 2010 at 10:56 pm
The get_words_from_text() function will skip checking any word that immediately follows a stopword or a word that is too short.
This is the list comprehension that does what you want it to do:
words = [word for word in text.split() if len(word) >= MIN_WORD_LENGTH and word.lower() not in STOP_WORDS]
May 7th, 2010 at 9:20 am
Ah yes, well noticed! I have updated the gist.
Thank you
May 7th, 2010 at 6:33 pm
You said: “Lucene was a very strong candidate, but we felt there could be a third option”
That would be interesting to try both lucene and the solution with redis and do a few benchmarks and/or see if you get stucked on any issues.
May 8th, 2010 at 12:47 am
Indeed it would. I strongly suspect that this Redis solution would outperform Lucene purely because, a) Redis is outrageously fast, but mostly because b) the solution above is much simpler and lighter than Lucene. However, we really should do some benchmarking when we get the chance.
May 11th, 2010 at 12:07 pm
I like very much the solution! I’d like to remark something regarding updating the information.
Let’s imagine you change the descriptions of one single item (that’s pretty common in any real world situation). Your Sets would quickly be outdated because the sets would contain item IDs that are no longer in a specific metaphone.
To solve this, you should compare the old description with the new description and create a list of removed words (difference), and then parse remove the unused IDs from those sets. If you do not update descriptions quite often is assumable, but in any case must be taken into account.
Excellent article!
Off the record: I am Sphinx and Redis lover
May 11th, 2010 at 12:59 pm
Hey Albert! Thanks for the comment – the idea of doing a diff hadn’t occurred to me.
It is true that, in this implementation, words will remain in the index even if they get removed from an item. We don’t actually think this will be a problem for us (and may even make the search more useful), but it is something we are going to keep an eye on.
May 11th, 2010 at 10:47 pm
I’m assuming you’re going to see horrible index compression here.
Most search frameworks spend a good deal of time on index maintenance and compression … Lucene does a decent job.
May 11th, 2010 at 11:40 pm
Hi Kevin. I must confess that I am not too familiar with index compression (and Google hasn’t been massively forthcoming). Do you have a URL where I can read up on it?
I will take a stab in the dark and point out that Redis will optimise the storage of sets of integers, which can greatly reduce memory usage. However, by the look of Spinn3r I guess that you are dealing with a much greater data volume than we initially expect.
May 19th, 2010 at 2:20 pm
[...] you know, we’re big fans of Redis. Along with other NoSQL datastores, Redis is changing the way web apps are built. With its [...]
May 24th, 2010 at 1:08 pm
[...] post follows on from A fast, fuzzy, full-text index using Redis. Also, there are still a few places left for the first London Redis Meetup on [...]
August 18th, 2010 at 10:39 pm
I really like this approach.
You could use a sorted zset to assign weightings to different fields. For example you could increment score by 2 when indexing if the term is in the title and 1 if the word is in any other field. You could also increment score several times if it’s found more than 1 time.
August 18th, 2010 at 11:11 pm
Hi Zenx, that could be a very good idea, especially now that Redis has the ZUNIONSTORE / ZINTERSTORE commands.