A fast, fuzzy, full-text index using Redis
Posted by adam
Say “search” to a developer and many will think of MySQL/MyISAM or Lucene. Both of these are pretty rockin’ in their own way. MySQL’s MyISAM storage engine can work as a pretty straight-forward search system, and Lucene is very powerful, and great for larger datasets.
However, we have found that MySQL’s full-text search can get pretty sluggish when you are dealing with gigabytes of data. Lucene was a very strong candidate, but we felt there could be a third option…
PlayNice.ly is entirely based on a data-structure server called Redis. Redis is one of several new key-value databases which break away from traditional relational data architecture. It is simple, flexible, and blazingly fast. So why not use the tools we have already?
On a winter afternoon, we came up with the following…
Step 1: Indexing
Indexing your content into an easily searchable format is the first step with most search systems. But what format should we use?
At the most basic level, we would need to create a lookup to provide a list of item IDs for a given search keyword. In PlayNice.ly terms, an ‘item’ is a bug or task. Really though, it could be any piece of content, it just depends on your application. At PlayNice.ly we have items, so that’s what we use here. ![]()
So, in Redis, we would create a set of item IDs for each word found in every item. Each set would be identified by a key containing the word. For example:
word:python = SET(9, 12, 23) word:exception = SET(2, 11) ...
We create these sets by extracting words from item descriptions, title, tags, comments etc. So if a user searched for ‘exception’, we could instantly find out which items contained the word ‘exception’.
Yes, this system could result in a lot of keys. However, there are only a few hundred thousand words in the English language, and even that many keys is not likely to give Redis any problems!
Step 1.5: Fuzzy indexing
That is all well and good, but what if someone searched for ‘exceptions’? Or what if, like me, their spelling was somewhat sub-par and they typed ‘exceptins’? Well, they would be out of luck. So the forgiving world of PlayNice.ly needed to do better.
Enter phonetic algorithms, a blessing to the linguistically challenged.
Phonetic algorithms are actually pretty simple. They are just a way of taking a word, dropping out some letters (according to a set of rules), and providing you with a rough representation of how that word sounds. Common phonetic algorithms are Soundex and Metaphone. The former is pretty dated now, but we understand the latter to be more up-to-scratch (any to play nicely with other Western languages).
To give you an example, if you were to take the Metaphone of ‘steven’, then you would get ‘STFN’. You would also get the same value for ‘steeven’ or ‘stefen’.
So if we use the above example again, the Metaphones for ‘python’ and ‘exception’ are ‘P0N’ and ‘AKSPXN’ respectively. We can now use those values instead of the words in our Redis keys:
# using the Metaphone phonetic algorithm word:P0N = SET(9, 12, 23) word:AKSPXN = SET(2, 11) ...
So now we will get matches as long as our spelling has a reasonable resemblance to the word we mean.
We actually took our system slightly further and used the Double Metaphone phonetic algorithm (python lib), which may return two values for words with different possible pronunciations (sort of). So now, ‘python’ actually gets two codes, ‘P0N’ and ‘PTN’, which each get their own keys:
# using the Double Metaphone phonetic algorithm word:P0N = SET(9, 12, 23) word:PTN = SET(9, 12, 23) word:AKSPXN = SET(2, 11) ...
Yes this is more storage, but we thought it would be a worth it for the sake of a more lenient search engine.
Step 2: Searching (and the SPEED)
So now we actually want to do a search. This is a two step process, but each step is lightning fast:
- Take the Metaphone of our keyword
- Get the contents of that set from Redis.
That can be summed up with this single line of pseudo-code:
redis.smembers("word:" + metaphone("python"))
And that is it, and Redis could easily cope with doing tens of thousands of these searches per second, even on the most basic hardware.
But let’s make it a little more complex. Say we want to search for multiple keywords:
# Search for multiple keywords keywords = ["python", "exception"] metaphones = [metaphone(k) for k in keywords] redis_keys = ["word:" + m for m in metaphones] item_ids = redis.sunion(redis_keys)
Here we use the SUNION Redis command to perform a union (i.e. an “OR”) on all of our results. If you want to perform an “AND” on the keywords, then you can use the SINTER Redis command instead.
Summing up
I hope this has shown how simple it can be to implement blazingly fast fuzzy searching using Redis. PlayNice.ly performs all of its searching and filtering operations within Redis using nothing more than sets and sunion/sinter commands.
For the inquisitive, we put a stand-alone version of our indexing code here:
In the future we hope to cover more advanced Redis searching functionality.
If you liked this blog post then we would absolutely love you to tell us what you think. Please post comments and talk to us on Twitter, we are all very friendly.
And if you’d like to talk more about Redis, definitely join our Redis London group on Meetup .












Are you considering doing any tf-idf style indexing with Redis?
Hey Jacob! Yep, this is something we have thought about but we have yet to see a use-case for it in PlayNice.ly. This simply due to the way our filtering interface works, which places emphasis on speed and quick feedback.
However, there are a couple of ways I could imagine this working:
* Counting frequency of words
* Assigning weightings to different fields (a word in the subject is probably worth more than a word in a comment)
* Looking at differences between the search keyword and the word which the metaphone represents (using something like the Levenshtein Difference)
But if you do need this added complexity then you may be better served by something like Sphinx/Lucene. But who knows, maybe someone will develop an equivalent which backs onto Redis
When I was making indexed search for my wiki, I discovered that finding the items is just a part of the task. A much more difficult task is ordering them. At first I just did a simple sort by the number of hits, but in the end I get the best results with word relevancy — the number of hits in given item divided by the number of hits of the given word in all items together, summed for all words in the search.
This may be a little problematic with your model, especially since you don’t record how many times the word appeared in given item.
Have you tried lucene/solr and/or ferret? These optimize the hell out of inverted indexes and baked for many years. xtractr (http://www.pcapr.net/xtractr) uses ferret, for example.
The get_words_from_text() function will skip checking any word that immediately follows a stopword or a word that is too short.
This is the list comprehension that does what you want it to do:
words = [word for word in text.split() if len(word) >= MIN_WORD_LENGTH and word.lower() not in STOP_WORDS]
Ah yes, well noticed! I have updated the gist.
Thank you
You said: “Lucene was a very strong candidate, but we felt there could be a third option”
That would be interesting to try both lucene and the solution with redis and do a few benchmarks and/or see if you get stucked on any issues.
Indeed it would. I strongly suspect that this Redis solution would outperform Lucene purely because, a) Redis is outrageously fast, but mostly because b) the solution above is much simpler and lighter than Lucene. However, we really should do some benchmarking when we get the chance.
I like very much the solution! I’d like to remark something regarding updating the information.
Let’s imagine you change the descriptions of one single item (that’s pretty common in any real world situation). Your Sets would quickly be outdated because the sets would contain item IDs that are no longer in a specific metaphone.
To solve this, you should compare the old description with the new description and create a list of removed words (difference), and then parse remove the unused IDs from those sets. If you do not update descriptions quite often is assumable, but in any case must be taken into account.
Excellent article!
Off the record: I am Sphinx and Redis lover
Hey Albert! Thanks for the comment – the idea of doing a diff hadn’t occurred to me.
It is true that, in this implementation, words will remain in the index even if they get removed from an item. We don’t actually think this will be a problem for us (and may even make the search more useful), but it is something we are going to keep an eye on.
I’m assuming you’re going to see horrible index compression here.
Most search frameworks spend a good deal of time on index maintenance and compression … Lucene does a decent job.
Hi Kevin. I must confess that I am not too familiar with index compression (and Google hasn’t been massively forthcoming). Do you have a URL where I can read up on it?
I will take a stab in the dark and point out that Redis will optimise the storage of sets of integers, which can greatly reduce memory usage. However, by the look of Spinn3r I guess that you are dealing with a much greater data volume than we initially expect.
[...] you know, we’re big fans of Redis. Along with other NoSQL datastores, Redis is changing the way web apps are built. With its [...]
[...] post follows on from A fast, fuzzy, full-text index using Redis. Also, there are still a few places left for the first London Redis Meetup on [...]
I really like this approach.
You could use a sorted zset to assign weightings to different fields. For example you could increment score by 2 when indexing if the term is in the title and 1 if the word is in any other field. You could also increment score several times if it’s found more than 1 time.
Hi Zenx, that could be a very good idea, especially now that Redis has the ZUNIONSTORE / ZINTERSTORE commands.
[...] quem quer se aprofundar neste case, sugiro a leitura do artigo indicado pelo @jdrowell que mostra um case similar de busca de textos utilizando o [...]
[...] have been covering some fairly complex Redis topics here so far, so I thought it would be nice to do a good ol' getting started [...]
Hi, and thanks for the post.
Maybe I’m missing something, but how do you provide support for searching terms and phrases that are *not* in the english language? Sometimes source code and error message references seem like they contain *mostly* non-english words, and there are symbols and other special characters to consider. Do you just punt on these? Supporting them seems like it would grow the potential list of keys dramatically. On the other hand, not supporting them seems like it’d kill any usefulness of the search, since a really common use case for search is “I got this weird error…” and another is “I think the function name is ‘def x_this_for_me’”
Hi Brian,
You make very good points. The quality of the results from this system when indexing source code is pretty poor. However, it should perform fine for non-English European languages which have similar phonics to English.
As PlayNice.ly grows it is becoming clear that we are going to need to replace this search infrastructure with something more robust. It is a way off yet, but it is on the horizon.
Adam
I advice Sphinx. It’s a little bit more complicated, but have a great results. It also quite easy to make a “suggested search results” like in search engines
bookmarked your site, will check from time to time what you have made