Notes on codes, projects and everything
So apparently Annoy is now splitting points by using the centroids of 2 means clustering. It is claimed that it provides better results for ANN search, however, how does this impact regression? Purely out of curiosity, I plugged a new point splitting function and generated a new set of points.
(more…)After a year and half, a lot of things changed, and annoy also changed the splitting strategy too. However, I always wanted to do a proper follow up to the original post, where I compared boosting to Annoy. I still remember the reason I started that (flawed) experiment was because I found boosting easy.
(more…)While following through the Statistical Learning course, I came across this part on doing regression with boosting. Then reading through the material, and going through it makes me wonder, the same method may be adapted to Erik Bernhardsson‘s annoy algorithm.
(more…)In the previous post, I re-implemented Annoy in 2D with some linear algebra maths. Then I spent some time going through some tutorial on vectors, and expanded the script to handle data in 3D and more. So instead of finding gradient, the perpendicular line in the middle of two points, I construct a plane, and find the distance between it and points to construct the tree.
Recently I switched my search code to Annoy because the input dataset is huge (7.5mil records with 20k dictionary count). It wasn’t without issues though, however I would probably talk about it next time. In order to figure out what each parameters meant, I spent some time watching through the talk given by the author @fulhack.
Implementing a Information Retrieval system is a fun thing to do. However, doing it efficiently is not (at least to me). So my first few attempts didn’t really end well (mostly uses just Go/golang with some bash tricks here and there, with or without a database). Then I jumped back to Python, which I am more familiar with and was very surprised with all the options available. So I started with Pandas and Scikit-learn combo.
In recent years, I start to make my development environment decouple from the tools delivered by the package manager used by the operating system. The tools (compiler, interpreters, libraries etc) are usually best left unmodified so other system packages that rely on them keeps working as intended. Also another reason for the setup is I wanted to follow the latest release as much as possible, which cannot be done unless I enroll myself to a rolling release distro.
(more…)Traversing a tree structure often involves writing a recursive function. However, Python isn’t the best language for this purpose. Therefore I started flattening the tree into a key-value dictonary structure. Logically it is still a tree, but it is physically stored as a dictionary. Therefore it is now easier to write a simple loop to traverse it.
My cloud storage is nearly exploding, and I am not in a good position to start subscribing for more storage (Yes, I am still #opentowork). Considering I just moved my domain settings to Cloudflare and started using Cloudflare tunnel, I figure I probably should just back up some of my photos, and host it on my workstation.
(more…)This is the formal draft of my statistical analysis report for the social audit project previously mentioned here. As the project is public by nature, I am cross-posting here for own reference.
(more…)Been trying my best to stick to the well-known UNIX Philosophy – “Do one thing and do it well”, so I have been breaking down my projects into numerous pieces of small tasks and rely on existing tools whenever possible. One of the existing tool that I use a lot is the GNU sort tool. Generally sort utility is really doing fine and dandy without having to configure anything, at least not until I realize the problem that leads to this post.