Notes on codes, projects and everything
Sometimes, letting a piece of code evolving by itself without much planning does not usually end well. However I was quite pleased with a by-product of it and I am currently formalizing it. So the by-product is some sort of DSL for a rule engine that I implemented to process records. It started as some lambda functions in Python but eventually becomes something else.
So my cheat with dask worked fine and dandy, until I started inspecting the output (which was to be used as an input for another script). While the script seemed to work fine, however when I started to parse each line I was hit with some funny syntax errors. After some quick inspection I found some of the lines was not printed completely.
Often times, I am dealing with JSONL files, though panda’s DataFrame is great (and blaze to certain extend), however it is offering too much for the job. Most of the received data is in the form of structured text and I do all sorts of work with them. For example checking for consistency, doing replace based on values of other columns, stripping whitespace etc.
I came across a video on Youtube on Pi day. Coincidently it was about estimating the value of Pi produced by Matt Parker aka standupmaths. While I am not quite interested in knowing the best way to estimate Pi, I am quite interested in the algorithm he showed in the video however. Specifically, I am interested to find out how easy it is to implement in Python.
Traversing a tree structure often involves writing a recursive function. However, Python isn’t the best language for this purpose. Therefore I started flattening the tree into a key-value dictonary structure. Logically it is still a tree, but it is physically stored as a dictionary. Therefore it is now easier to write a simple loop to traverse it.
In the previous post, I re-implemented Annoy in 2D with some linear algebra maths. Then I spent some time going through some tutorial on vectors, and expanded the script to handle data in 3D and more. So instead of finding gradient, the perpendicular line in the middle of two points, I construct a plane, and find the distance between it and points to construct the tree.
Recently I switched my search code to Annoy because the input dataset is huge (7.5mil records with 20k dictionary count). It wasn’t without issues though, however I would probably talk about it next time. In order to figure out what each parameters meant, I spent some time watching through the talk given by the author @fulhack.
Implementing a Information Retrieval system is a fun thing to do. However, doing it efficiently is not (at least to me). So my first few attempts didn’t really end well (mostly uses just Go/golang with some bash tricks here and there, with or without a database). Then I jumped back to Python, which I am more familiar with and was very surprised with all the options available. So I started with Pandas and Scikit-learn combo.
Sometimes I really doubt about the advantage of recycling old stuff to fund for new units beyond goodwill. Sure you get to convince yourself that you are saving the environment by doing so, and it also saves money in the long run. However, I didn’t realize how much it generates it may be after trying to work out an answer for a fictional IQ question.
Another half a day spent on figuring out how to package my daemon properly, fortunately with help from friends over at #harmattan IRC channel as well as cckwes, I finally get the deb package generated properly. So just a quick reminder on what my daemon does, it is just a quick hack that toggles the ‘allow background connections’ on and off depending which kind of data network a user is connected to. Apparently I am not the only one who are looking for this, as a feature request was filed long long time ago.
Writing a usable form and database library has always been a painful experience. So why bother re-inventing the wheel when there are so many to choose from already? I am writing one mostly for learning purpose. After numerous attempts, I finally get my form and database library in shape. It is nowhere complete, but nor it is perfect, but it is currently the implementation that is closest to my original design. I will keep working on it so it can be used in my personal projects in the future.
With most of my stuff more or less set, I guess it is time to start documenting the steps before I forget. So I heard a lot of good things about docker for quite some time, but haven’t really have the time to do it due to laziness (plus my relatively n00b-ness in the field of dev-ops). Just a few months ago, I decided to finally migrate away from webfaction (thanks for all the superb support) to a VPS so I can run more things on it.
I have been following this excellent guide written by Benjamin Thomas to set up my virtual machine for development purpose. However, when I am starting to configure a Ubuntu Quantal alpha machine, parts of the guide became inapplicable. Hence, this post is written as a small revision to the previously mentioned guide.
I finally put in some time and effort learning myself a bit of Rust. Though I am still struggling with ownership and lifetimes (which is essentially everything about the language, to be honest), I find it more interesting compared to Golang, which is relatively boring, though being functional (no pun intended). While learning the language, the one thing I came across often is the
Option enum, then I remembered that I read something about Monad.