## (note (code cslai))

Notes on codes, projects and everything

# Regression with 2-means clustering Annoy (non-scientific)

So apparently Annoy now splits points by using the centroids from 2-means clustering. It is claimed that this provides better results for ANN search; however, how does this impact regression? Purely out of curiosity, I plugged in a new point-splitting function and generated a new set of points.

Compared to the previous revision, I am now doing a fairly standard 2-means clustering to split points into clusters.

1. Select two random centroid points out of the available points, making sure they are not the same vector.
2. For each point in the vector space, calculate its distance to the two centroid points.
3. Assign each point to the closest centroid.
4. (This perhaps diverges from the original algorithm.) Count the points assigned to each centroid and find the `ratio` (count of points assigned to centroid `alpha` divided by count of points assigned to centroid `beta`).
5. Repeat from (2); stop when the `ratio` converges, or when it falls within `0.4 < ratio < 0.6`.

(This is why I am not a good computer/data scientist)

The algorithm is not ideal and not scientific, but I just wanted a quick idea of how things may change. For those who aren't aware, the flaw is that I cannot tell whether the centroids converge between two iterations even when the ratios do (though I secretly hope they are good enough).

As this is a fairly minor revision, I am pasting the splitter function here instead of updating my previous gist.

```python
# sample and isclose come from the standard library; Vectors and the
# branch/centroid helpers come from the previous revision of the code.
from math import isclose
from random import sample


def splitter_balanced(points_x, _points_y, idx_pool):
    result = None
    points_current = Vectors(
        points_x.dimension,
        tuple(points_x[idx] for idx in sample(tuple(set(idx_pool)), 2)),
    )
    ratio_current = 1.0

    while not result:
        if points_are_different(points_current):
            decider = branch_centroid_get(points_current)

            branches = points_assign_branches(decider, points_x, idx_pool)
            ratio = min(len(branches[True]), len(branches[False])) / max(
                len(branches[True]), len(branches[False])
            )

            if isclose(ratio_current, ratio) or 0.4 < ratio < 0.6:
                result = decider, branches
            else:
                ratio_current = ratio
                points_current = Vectors(
                    points_x.dimension,
                    tuple(
                        centroid_update(
                            Vectors(
                                points_x.dimension,
                                tuple(points_x[idx] for idx in idx_sub),
                            )
                        )
                        for _, idx_sub in branches.items()
                    ),
                )
        else:
            points_current = Vectors(
                points_x.dimension,
                tuple(points_x[idx] for idx in sample(tuple(set(idx_pool)), 2)),
            )

    return result
```

Starting with my own approximation to Bagging.

For some reason, it looks like the result on the same dataset is not as good as before. However, given the random nature of both random splitting and the k-means algorithm, it could just be that I got a very unfortunate set of starting points this time. (The points shown are from the best model after performing a 5-fold cross-validation.)

Interestingly, running this caught a minor problem in my original code. Apparently there is a chance for my code to bootstrap an N-sized training set consisting of repetitions of a single point.

`choices(idx_pool, k=len(idx_pool))`

which then fails this line of code

`sample(tuple(set(idx_pool)), 2)`

While this should have a very low chance of happening, it happened anyway; in short, I should go buy a lottery ticket.
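One simple guard (a hypothetical sketch, not the fix I actually shipped; `bootstrap_indices` is a made-up name) is to redraw the bootstrap sample until it contains at least two distinct indices, so the `sample` call above cannot fail:

```python
# Hypothetical guard: redraw the bootstrap sample until it contains at
# least two distinct indices. Assumes idx_pool itself has two or more
# distinct values, otherwise this would loop forever.
from random import choices


def bootstrap_indices(idx_pool):
    resampled = choices(idx_pool, k=len(idx_pool))
    while len(set(resampled)) < 2:
        resampled = choices(idx_pool, k=len(idx_pool))
    return resampled
```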

Things get more interesting when we see my approximation to Random Forest.

The result does not follow a similar trend: it starts out great, but as the size of the forest increases it becomes worse, which is very counter-intuitive. It is rather rare to see even the training error increase as the forest grows; does it become underfitted when we have more trees?

Next, we have boosting.

While the result is worse than the previous attempt using the legacy point-splitting method, it is likely just due to an unfortunate choice of initial random centroids.

Comparing the methods

So in the end, this is just a fun follow-up to the previous experiment. Overall, the results this round are worse, and still very far from those of the classic original methods.
