Implementing an Information Retrieval system is a fun thing to do. However, doing it efficiently is not (at least for me), so my first few attempts did not really end well (mostly using just Go with some bash tricks here and there, with or without a database). Then I jumped back to Python, which I am more familiar with, and was pleasantly surprised by all the options available. So I started with the Pandas and Scikit-learn combo.
The combo in itself is great, provided you have the resources to process the data. However, the amount of RAM is often fixed and not quite sufficient. While it is certainly enough to just hold my data, processing it reasonably fast requires a lot of tuning to avoid hitting the memory limit. To handle this, I migrated some of the code involving pandas DataFrames to Blaze and Odo.
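For reference, a minimal sketch of what that migration looks like (the file and table names here are made up; odo moves the data into an on-disk store, and blaze then operates on it lazily):

import pandas
from blaze import Data
from odo import odo

df = pandas.read_csv('schools.csv')  # a regular in-memory DataFrame (hypothetical file)
odo(df, 'sqlite:///schools.db::schools')  # push it into an on-disk SQLite table
data = Data('sqlite:///schools.db::schools')  # a blaze expression backed by the database
print(data.Negeri.distinct())  # computed by the SQL backend, not in pandas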
Also, as my feature vectors are quite big (relatively speaking), searching through them (which is really just comparing a query vector against every vector in the collection) is horribly slow. Then I found scikit-learn's LSHForest, which works well enough.
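For context, LSHForest usage looks roughly like this (a minimal sketch with made-up documents; the real data comes in below, and LSHForest was deprecated in later scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import LSHForest

docs = ['sekolah kebangsaan seri bintang', 'sekolah menengah kebangsaan damansara', 'sekolah kebangsaan taman megah', 'sekolah menengah kebangsaan ampang']
vectors = CountVectorizer().fit_transform(docs)

lshf = LSHForest(n_estimators=10, random_state=42)
lshf.fit(vectors)
distances, indices = lshf.kneighbors(vectors[0], n_neighbors=1)  # approximate nearest neighbour of the first document
print(indices)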
However, I also found Gensim and Annoy, so I decided to give them a try.
Porting the code to use Gensim and Annoy properly is not something I would consider fun. There are a lot of conversions involved, and they are not necessarily efficient. However, Annoy does work A LOT faster. I have not actually measured it, though; perhaps some time in the future? After putting some work into it, I figured I should document it for my own reference.
So the prerequisites are pandas, scikit-learn, blaze, odo, gensim, and annoy. The data used in the following example is fetched from the Malaysia Open Data portal, which lists all the public schools. With this dataset, we shall now build a quick search (for learning purposes, not really for practical use).
from blaze import Data
import pandas
Not sure why I wasn’t allowed to read the xlsx file with Data(), so I cheated and used pandas.
data = Data(pandas.read_excel('sekolah.xlsx', header=1, index_col=0).fillna('').applymap(str))
from sklearn.feature_extraction.text import CountVectorizer
While gensim has a generic tokenizer, I prefer scikit-learn’s as it is more customizable without me having to implement one from scratch. So I am building a document from a few fields in the xlsx file and turning them into vectors with CountVectorizer.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(map(lambda _: ', '.join(_), data[['AlamatSurat', 'PoskodSurat', 'BandarSurat', 'Negeri']]))
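As an aside on the customizability mentioned above, CountVectorizer takes quite a few knobs; the values below are purely illustrative, not what is used in this post:

custom_vectorizer = CountVectorizer(
    lowercase=True,            # normalise case
    token_pattern=r'\b\w+\b',  # also keep single-character tokens
    ngram_range=(1, 2),        # unigrams and bigrams
    min_df=2,                  # ignore tokens appearing in fewer than 2 documents
)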
from gensim.matutils import Sparse2Corpus, corpus2dense
First we need to turn the resulting SciPy CSR matrix into a corpus that Gensim would recognize.
corpus = Sparse2Corpus(vectors, documents_columns=False)
from gensim.corpora.dictionary import Dictionary
Then we build a dictionary from our corpus, using the vectorizer’s vocabulary (inverted into an id-to-word mapping).
dictionary = Dictionary.from_corpus(corpus, id2word=dict((id, word) for word, id in vectorizer.vocabulary_.items()))
from gensim import models
Next, we build a tfidf model based on the dictionary.
tfidf = models.TfidfModel(dictionary=dictionary)
Annoy doesn’t seem to like huge feature vectors, so we reduce the dimensionality with LSI (I tried the theoretically superior LDA, but for some reason it returned a very sparse matrix in my case). 200-500 topics is considered good enough according to the gensim tutorial, so we stick with 500.
topics = models.LsiModel(tfidf[corpus], num_topics=500, id2word=dictionary, chunksize=5000)
from annoy import AnnoyIndex
Next we build the actual index to speed up searching without sacrificing much accuracy. We are returning just 1 nearest neighbour for now, so we will worry about recall and precision some other time.
index = AnnoyIndex(500)
for i, corpora in enumerate(topics[tfidf[corpus]]):  # apply tfidf first, matching how the LSI model was trained
    index.add_item(i, corpus2dense([corpora], num_terms=500).transpose()[0])
index.build(500)
So we are searching for each document’s own nearest neighbour (itself, hopefully); how hard can it be, right?
result = []
for i in range(len(corpus)):
    result.append(i in index.get_nns_by_item(i, 1))
Let’s see the numbers, shall we?
result_true = [r for r in result if r]
print('{} out of {} (accuracy: {:.4f}%)'.format(len(result_true), len(result), 100 * len(result_true) / len(result)))
A larger number of trees passed to index.build() usually increases accuracy, but it also results in a larger index file, and memory usage during the build is much higher. So depending on the resource constraints, one might want to test a couple of figures. (Yes, I need to read more about Annoy and Approximate Nearest Neighbour algorithms to understand how they work; it still feels like magic to me.)
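For completeness, here is a rough sketch of what an actual free-text query against the index would look like (the query string is made up; the steps simply mirror the indexing pipeline above):

query = 'jalan ampang kuala lumpur'  # made-up query string
query_bow = Sparse2Corpus(vectorizer.transform([query]), documents_columns=False)
query_vector = corpus2dense(topics[tfidf[query_bow]], num_terms=500).transpose()[0]
print(index.get_nns_by_vector(query_vector, 5))  # ids of the 5 approximate nearest neighbours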
For model persistence, most of the components have their own methods to serialize themselves to disk. scikit-learn, however, gives us a couple of options: one can pickle the object, which is probably the easiest way, or use joblib.dump instead. So, for example, to dump our CountVectorizer above:
from sklearn.externals import joblib
joblib.dump(vectorizer, 'vectorizer')
Then you can load it back as follows (mmap_mode='r' attempts to avoid loading everything into memory at once, which can be useful at times, but the file should then stay on an SSD for obvious reasons).
vectorizer = joblib.load('vectorizer', mmap_mode='r')
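As for the gensim and annoy components, their own save/load methods look roughly like this (the filenames are arbitrary examples):

dictionary.save('schools.dict')
tfidf.save('schools.tfidf')
topics.save('schools.lsi')
index.save('schools.ann')

# Loading them back later:
dictionary = Dictionary.load('schools.dict')
tfidf = models.TfidfModel.load('schools.tfidf')
topics = models.LsiModel.load('schools.lsi')
index = AnnoyIndex(500)  # same dimensionality as when the index was built
index.load('schools.ann')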
So this is very much it for now. Next I plan to continue this and build a classifier with scikit-learn, though I am not sure how useful that would be for this dataset.