(note (code cslai))

Notes on codes, projects and everything

tfidf

Information Retrieving with ….. a lot of libraries

Implementing an Information Retrieval system is a fun thing to do. However, doing it efficiently is not (at least for me). My first few attempts didn’t really end well (mostly using just Go/golang with some bash tricks here and there, with or without a database). Then I jumped back to Python, which I am more familiar with, and was very surprised by all the options available. So I started with the Pandas and Scikit-learn combo.


Random Posts

Linting a libsvm-formatted data

While working on a text classification task, I spent quite some time preparing the training set for a given document collection. The project is supposed to be a pure golang implementation, so after some quick searching I found some libraries that are either wrappers around libsvm or re-implementations of it. So I happily started to prepare my training set in the libsvm format.

Quick Hack: Terminal Appearing from Right

It is useful to have the terminal around whenever I code. However, screen real estate is finite, and a terminal further limits the amount of information that can be displayed at the same time. The problem with the terminal is that I don’t really need it all the time, so I usually find it buried under a group of windows.

Statistical Analysis for Social Audit Project

This is the formal draft of my statistical analysis report for the social audit project previously mentioned here. As the project is public by nature, I am cross-posting it here for my own reference.
Protobuf and dask

JSON is a fine data-interchange format, but it does have some limitations. It is well known for its simplicity: even a non-programmer can easily compose a JSON file ~~(but humanity will surprise you IRL)~~. As a result, it is found almost everywhere, from numerous web APIs to geospatial data (GeoJSON) and even the semantic web (RDF/JSON).

Regression with 2-means clustering Annoy (non-scientific)

So apparently Annoy now splits points using the centroids of 2-means clustering. This is claimed to give better results for ANN search, but how does it impact regression? Purely out of curiosity, I plugged in a new point-splitting function and generated a new set of points.
