So I first heard about pandas probably a year ago when I was in my previous job. It looked nice, but I didn’t really get the chance to use it. Practically speaking, it is a library that makes data look like a mix of a relational database table and an Excel sheet. It is easy to query with, and it provides a way to process data fast if you know how to do it properly (no, I don’t, so I cheated).
In order to make the best use of the tool, I needed to know how to write vectorized code. Turns out it is not easy (I am taking a machine learning course that uses Octave, which also emphasizes vectorized solutions, and it drives me crazy), and I didn’t want to go down the Cython path just yet. Therefore I cheated by splitting the data into chunks, to be processed in separate processes.
So for example, if I want to create a new column based on whatever I have on each row, I would normally call the .apply method, and (im)patiently wait for the thing to complete (at this stage iterating through 2.5 million rows).
dataframe.apply(foo, axis=1)
So I cheated by splitting the data into chunks, and processing the chunks concurrently as follows
from concurrent.futures import ProcessPoolExecutor
from itertools import chain

# foo must be a module-level function so it can be pickled for the workers
with ProcessPoolExecutor(max_workers=njobs) as pool:
    jobs = []
    for i in range(0, data.shape[0], batch_size):
        max_size = min(data.shape[0], i + batch_size)
        # submit each slice's .apply as a separate job
        jobs.append(pool.submit(
            data[i:max_size].apply,
            foo,
            axis=1))
result = list(chain.from_iterable([job.result() for job in jobs]))
I was also asked to keep track of the changes, which sometimes means returning two columns. What I did was ensure foo returns a list/tuple of n items, and reconstruct the result as follows
def foo(row):
    return ('lorem', 'ipsum')

# the code above...
new_result = pandas.DataFrame(result,
                              index=data.index,
                              columns=('result', 'meta'))
Sometimes you need to pass other stuff into the processing function, but a lambda is out of the question because Python refuses to pickle it (pickle stores functions by name, and a lambda has no importable name to look up later). This is when “partial” becomes useful (if I had known this earlier it would have saved me from all the lambdas). So instead of doing this
def foo(row, something):
    return '{}'.format(something)

dataframe.apply(lambda row: foo(row, 'i am cute'), axis=1)
use partial instead
from functools import partial
dataframe.apply(partial(foo, something='i am cute'), axis=1)
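Roughly speaking, pickle looks functions up by name, so a lambda can’t make the round trip to a worker process, while a partial wrapping a module-level function can. A quick sketch to illustrate:

# rough sketch: pickling a lambda fails, pickling a partial works,
# because pickle serializes functions by (module, name) reference
import pickle
from functools import partial

def foo(row, something=None):
    return '{}'.format(something)

try:
    pickle.dumps(lambda row: foo(row, 'i am cute'))
except Exception as exc:
    print(exc)  # can't pickle the lambda: it has no importable name

func = pickle.loads(pickle.dumps(partial(foo, something='i am cute')))
print(func(None))  # 'i am cute' -- foo itself is found by name just fine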
Besides dealing with pandas, I also spent quite some time wrestling with scikit-learn (and even submitted my first ever pull request to a public project, which was eventually merged, w00t): from vectorizing raw data, to building classifiers, to querying. At one point I wanted the sparse matrix returned by the vectorizer (and the whole set of transformers) and a pandas Series to form a new dataframe, mainly for indexing convenience, so I did this
dataframe = pandas.DataFrame(zip(list(sparse_matrix_by_vectorizer),
                                 some_series.values.tolist()),
                             index=some_series.index)
However, with a large matrix this would easily OOM the server, so I usually do this in conjunction with the chunking code above.
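Roughly, the combination looks something like this (just a sketch reusing names from the snippets above; in practice each chunk would go through the ProcessPoolExecutor the same way as earlier, so each worker only holds one chunk’s worth of intermediate lists):

# sketch: build the dataframe chunk by chunk instead of materializing
# one huge list of rows; names are taken from the snippets above
import pandas

frames = []
for i in range(0, sparse_matrix_by_vectorizer.shape[0], batch_size):
    max_size = min(sparse_matrix_by_vectorizer.shape[0], i + batch_size)
    chunk = sparse_matrix_by_vectorizer[i:max_size]
    series_chunk = some_series.iloc[i:max_size]
    frames.append(pandas.DataFrame(
        list(zip(list(chunk), series_chunk.values.tolist())),
        index=series_chunk.index))
dataframe = pandas.concat(frames)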
Speaking of memory usage, this combo is not really friendly. I have lost count of how many times I OOM’ed the development server (and got important processes killed by the oom-killer). Plus, the performance of some of the components is just really bad. I still can’t get TruncatedSVD to reduce my collection from 58,000 dimensions down to 500. Brute-force kneighbors queries are REALLY slow, so I settled on LSHForest (not sure how it impacts accuracy for now). And with the LSHForest “index”, the pickled file is even larger than the pickled tf-idf matrix itself.
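For reference, the two query styles look roughly like this (a sketch only; tfidf_matrix and query stand in for my actual data):

# sketch of the two query styles; tfidf_matrix is the sparse tf-idf
# matrix and query is a single vectorized document (1 x n_features)
from sklearn.neighbors import LSHForest, NearestNeighbors

# exact but slow: every query scans the whole matrix
brute = NearestNeighbors(n_neighbors=10, algorithm='brute', metric='cosine')
brute.fit(tfidf_matrix)
distances, indices = brute.kneighbors(query)

# approximate but much faster to query; the fitted forest is what
# ends up pickled, and it can outgrow the matrix itself
lshf = LSHForest(n_estimators=10, n_candidates=50)
lshf.fit(tfidf_matrix)
distances, indices = lshf.kneighbors(query, n_neighbors=10)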
Over the past few days I discovered gensim, Blaze and Spotify’s Annoy, which claim to help with the memory usage problem, search speed, etc. Hopefully I will have time to work them into the code.
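I haven’t actually wired any of them in yet, but Annoy at least looks straightforward; something along these lines (assuming dense vectors, e.g. the 500-dimension output I want out of TruncatedSVD):

# sketch based on Annoy's README; reduced_vectors is assumed to be a
# dense (n_samples, 500) array, e.g. the output of TruncatedSVD
from annoy import AnnoyIndex

index = AnnoyIndex(500, 'angular')   # 500 = vector dimensionality
for i, vector in enumerate(reduced_vectors):
    index.add_item(i, list(vector))
index.build(10)                      # more trees: better accuracy, bigger index
index.save('vectors.ann')            # the saved index is mmap'ed on load

neighbour_ids = index.get_nns_by_item(0, 10)   # 10 nearest neighbours of item 0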