Notes on codes, projects and everything
So apparently Annoy is now splitting points by using the centroids of 2 means clustering. It is claimed that it provides better results for ANN search, however, how does this impact regression? Purely out of curiosity, I plugged a new point splitting function and generated a new set of points.
(more…)After a year and half, a lot of things changed, and annoy also changed the splitting strategy too. However, I always wanted to do a proper follow up to the original post, where I compared boosting to Annoy. I still remember the reason I started that (flawed) experiment was because I found boosting easy.
(more…)While following through the Statistical Learning course, I came across this part on doing regression with boosting. Then reading through the material, and going through it makes me wonder, the same method may be adapted to Erik Bernhardsson‘s annoy algorithm.
(more…)Traversing a tree structure often involves writing a recursive function. However, Python isn’t the best language for this purpose. Therefore I started flattening the tree into a key-value dictonary structure. Logically it is still a tree, but it is physically stored as a dictionary. Therefore it is now easier to write a simple loop to traverse it.
In the previous post, I re-implemented Annoy in 2D with some linear algebra maths. Then I spent some time going through some tutorial on vectors, and expanded the script to handle data in 3D and more. So instead of finding gradient, the perpendicular line in the middle of two points, I construct a plane, and find the distance between it and points to construct the tree.
Recently I switched my search code to Annoy because the input dataset is huge (7.5mil records with 20k dictionary count). It wasn’t without issues though, however I would probably talk about it next time. In order to figure out what each parameters meant, I spent some time watching through the talk given by the author @fulhack.
So I first heard about Panda probably a year ago when I was in my previous job. It looked nice, but I didn’t really get the chance to use it. So practically it is a library that makes data looks like a mix of relational database table and excel sheet. It is easy to do query with it, and provides a way to process it fast if you know how to do it properly (no, I don’t, so I cheated).
I haven’t got much time lately, so didn’t write about this new phone that I recently imported. For some reason, this new phone of mine do not act as mass storage device like its predecessors (to certain extend). Thankfully I can still ssh in the phone and this makes it possible to mount it as a sshfs volume.
It is very much expected that there will be endless stream of new (and often times better) tools introduced to solve the same set of problems. While I am slowly resuming my programming work, and in the process of reviving my very much dead postgrad project, I found some alternative to the tools I had used in the past. I suppose I shall just jot them down here so that there’s a reference for later use.
After a miserable trip back to academic world, I finally re-gained the courage to get back to job-market. For the time spent in university, I spent quite some time reading about Semantic Web and RDF. Then I thought, I should have published more in this format in future. However, that didn’t really happen, mostly because I am too lazy.
I was trying to learn scala and clojure to find one that I may want to use in my postgraduate project. After trying to learn scala for a couple of days, I gave up because I really don’t like the syntax (too OO for my liking). Then I continued with clojure and found myself liking the syntax better.
In the previous post, I re-implemented Annoy in 2D with some linear algebra maths. Then I spent some time going through some tutorial on vectors, and expanded the script to handle data in 3D and more. So instead of finding gradient, the perpendicular line in the middle of two points, I construct a plane, and find the distance between it and points to construct the tree.