Notes on codes, projects and everything
In the previous post, I re-implemented Annoy in 2D with some linear algebra maths. Then I spent some time going through some tutorial on vectors, and expanded the script to handle data in 3D and more. So instead of finding gradient, the perpendicular line in the middle of two points, I construct a plane, and find the distance between it and points to construct the tree.
Recently I switched my search code to Annoy because the input dataset is huge (7.5mil records with 20k dictionary count). It wasn’t without issues though, however I would probably talk about it next time. In order to figure out what each parameters meant, I spent some time watching through the talk given by the author @fulhack.
Implementing a Information Retrieval system is a fun thing to do. However, doing it efficiently is not (at least to me). So my first few attempts didn’t really end well (mostly uses just Go/golang with some bash tricks here and there, with or without a database). Then I jumped back to Python, which I am more familiar with and was very surprised with all the options available. So I started with Pandas and Scikit-learn combo.
Sometimes I really doubt about the advantage of recycling old stuff to fund for new units beyond goodwill. Sure you get to convince yourself that you are saving the environment by doing so, and it also saves money in the long run. However, I didn’t realize how much it generates it may be after trying to work out an answer for a fictional IQ question.
So I first heard about Panda probably a year ago when I was in my previous job. It looked nice, but I didn’t really get the chance to use it. So practically it is a library that makes data looks like a mix of relational database table and excel sheet. It is easy to do query with it, and provides a way to process it fast if you know how to do it properly (no, I don’t, so I cheated).
With most of my stuff more or less set, I guess it is time to start documenting the steps before I forget. So I heard a lot of good things about docker for quite some time, but haven’t really have the time to do it due to laziness (plus my relatively n00b-ness in the field of dev-ops). Just a few months ago, I decided to finally migrate away from webfaction (thanks for all the superb support) to a VPS so I can run more things on it.
While working on a text classification task, I spent quite some time preparing the training set for a given document collection. The project is supposed to be a pure golang implementation, so after some quick searching I found some libraries that are either a wrapper to libsvm, or a re-implementation. So I happily started to prepare my training set in the libsvm format.
This is the second part of the golang learning rant log. Previously on (note (code cslai)) I managed to make each line in the CSV into a hash map. So today I am going to make it into JSON Lines.
I was invited to try Go (the programming language, not that board game) a few months ago, however I didn’t complete back then. The main reason was because it felt raw, compared to other languages that I know a fair bit better (for example Ruby). There was no much syntatic sugar around, and getting some work done with it feels “dirty”.
A new day, and a new post on job application. So this time instead of asking a snippet, I was actually asked to deliver some sort of a full application. Not sure why this was required, but I had fun creating them nonetheless. Though I would say I am not really a fan of creating visual stuff though (oh the crappy animation nearly killed me).
I don’t quite remember when did I first heard about Category Theory, but the term stuck in my head for quite a while. Eventually I attempted to start looking for tutorials on the topic, but it is hard to find one that I actually understand. Most of them are either leaning too much to the Mathematics side, or too much to the Programming side.
(more…)I finally put in some time and effort learning myself a bit of Rust. Though I am still struggling with ownership and lifetimes (which is essentially everything about the language, to be honest), I find it more interesting compared to Golang, which is relatively boring, though being functional (no pun intended). While learning the language, the one thing I came across often is the Option
enum, then I remembered that I read something about Monad.
As the name implies, Resource Definition Framework, or RDF in short, is a language to represent information about resources in world wide web. Information that can be represented is mostly metadata like title (assuming the resource is a web-page), author, last modified date etc. Besides representing resource that is network-accessible, it can be used to represent things that cannot be accessed through the network, as long as it can be identified using a URI.
So I first heard about Panda probably a year ago when I was in my previous job. It looked nice, but I didn’t really get the chance to use it. So practically it is a library that makes data looks like a mix of relational database table and excel sheet. It is easy to do query with it, and provides a way to process it fast if you know how to do it properly (no, I don’t, so I cheated).
After a miserable trip back to academic world, I finally re-gained the courage to get back to job-market. For the time spent in university, I spent quite some time reading about Semantic Web and RDF. Then I thought, I should have published more in this format in future. However, that didn’t really happen, mostly because I am too lazy.