## (note (code cslai))

Notes on codes, projects and everything

# Regression with 2-means clustering Annoy (non-scientific)

So apparently Annoy is now splitting points by using the centroids of 2 means clustering. It is claimed that it provides better results for ANN search, however, how does this impact regression? Purely out of curiosity, I plugged a new point splitting function and generated a new set of points.

Compared to the previous revision, I am doing a fairly standard 2-means clustering to split points into clusters.

1. Select two random centroid points, out of the available points, and make sure they are not the same vector.
2. For each point in the vector space, calculate the distance of them to the two centroid points.
3. Assign each point to the closest centroid.
4. (this perhaps diverge from the original algorithm) Calculate the number of points assigned to each centroid, find the `ratio` (count of points assigned to centroid `alpha` divide by cound of points to centroid `beta`)
5. Repeat (2), stop when the `ratio` converges, or it is within `0.4 < ratio < 0.6`.

(This is why I am not a good computer/data scientist)

The algorithm is not ideal and not scientific, but I just wanted to have a quick idea of how things may change. For those who aren’t aware, this is because I cannot tell if the centroids between two iterations converge even though if the ratios converge (though I secretly hope they are good enough).

As this is a fairly minor revision, I am pasting the splitter function here instead of updating my previous gist.

```def splitter_balanced(points_x, _points_y, idx_pool):
result = None
points_current = Vectors(
points_x.dimension,
tuple(points_x[idx] for idx in sample(tuple(set(idx_pool)), 2)),
)
ratio_current = 1.0

while not result:
if points_are_different(points_current):
decider = branch_centroid_get(points_current)

branches = points_assign_branches(decider, points_x, idx_pool)
ratio = min(len(branches[True]), len(branches[False])) / max(
len(branches[True]), len(branches[False])
)

if isclose(ratio_current, ratio) or 0.4 &lt; ratio &lt; 0.6:
result = decider, branches
else:
ratio_current = ratio
points_current = Vectors(
points_x.dimension,
tuple(
centroid_update(
Vectors(
points_x.dimension,
tuple(points_x[idx] for idx in idx_sub),
)
)
for _, idx_sub in branches.items()
),
)
else:
points_current = Vectors(
points_x.dimension,
tuple(points_x[idx] for idx in sample(set(idx_pool), 2)),
)

return result
```

Starting with my own approximation to Bagging.

For some reason it looks like with the same dataset, the result is not as good as previously. However given the random nature of both random splitting and k-means algorithm, it could be just I am getting a set of very unfortunate starting points this time. (The points shown are the best model after performing a 5-fold cross-validation)

Interestingly, running this caught a minor problem in my original code. Apparently there is a chance for my code to bootstrap a N-sized training set consisting of repetitions of 1 point.

`choices(idx_pool, k=len(idx_pool))`

Thus failing this line of code

`sample(tuple(set(idx_pool)), 2))`

While this should have a very low chance of happening, but yea, in short I should go buy lottery.

Things get more interesting when we see my approximation to Random Forest.

The result is not following a similar trend, it started out great, but when the size of the forest increases it becomes worse, very very counter-intuitive. It is rather rare to see even the training error increases when the forest size increases, so it gets underfitted when we have more trees?

Next we have boosting

While the result is worst than the previous attempt using the legacy point splitting method, but the result is likely just due to the unfortunate choice of initial random centroids.

Comparing the methods

So in the end, it is just a fun follow-up to the previous experiment, overall this round the result is worse, and still very far from the classic original methods. 