Notes on codes, projects and everything

Regression with 2-means clustering Annoy (non-scientific)

So apparently Annoy is now splitting points by using the centroids of 2 means clustering. It is claimed that it provides better results for ANN search, however, how does this impact regression? Purely out of curiosity, I plugged a new point splitting function and generated a new set of points.

Compared to the previous revision, I am doing a fairly standard 2-means clustering to split points into clusters.

  1. Select two random centroid points, out of the available points, and make sure they are not the same vector.
  2. For each point in the vector space, calculate the distance of them to the two centroid points.
  3. Assign each point to the closest centroid.
  4. (this perhaps diverge from the original algorithm) Calculate the number of points assigned to each centroid, find the ratio (count of points assigned to centroid alpha divide by cound of points to centroid beta)
  5. Repeat (2), stop when the ratio converges, or it is within 0.4 < ratio < 0.6.

(This is why I am not a good computer/data scientist)

The algorithm is not ideal and not scientific, but I just wanted to have a quick idea of how things may change. For those who aren’t aware, this is because I cannot tell if the centroids between two iterations converge even though if the ratios converge (though I secretly hope they are good enough).

As this is a fairly minor revision, I am pasting the splitter function here instead of updating my previous gist.

def splitter_balanced(points_x, _points_y, idx_pool):
    result = None
    points_current = Vectors(
        points_x.dimension,
        tuple(points_x[idx] for idx in sample(tuple(set(idx_pool)), 2)),
    )
    ratio_current = 1.0

    while not result:
        if points_are_different(points_current):
            decider = branch_centroid_get(points_current)

            branches = points_assign_branches(decider, points_x, idx_pool)
            ratio = min(len(branches[True]), len(branches[False])) / max(
                len(branches[True]), len(branches[False])
            )

            if isclose(ratio_current, ratio) or 0.4 &lt; ratio &lt; 0.6:
                result = decider, branches
            else:
                ratio_current = ratio
                points_current = Vectors(
                    points_x.dimension,
                    tuple(
                        centroid_update(
                            Vectors(
                                points_x.dimension,
                                tuple(points_x[idx] for idx in idx_sub),
                            )
                        )
                        for _, idx_sub in branches.items()
                    ),
                )
        else:
            points_current = Vectors(
                points_x.dimension,
                tuple(points_x[idx] for idx in sample(set(idx_pool), 2)),
            )

    return result

Starting with my own approximation to Bagging.

For some reason it looks like with the same dataset, the result is not as good as previously. However given the random nature of both random splitting and k-means algorithm, it could be just I am getting a set of very unfortunate starting points this time. (The points shown are the best model after performing a 5-fold cross-validation)

Interestingly, running this caught a minor problem in my original code. Apparently there is a chance for my code to bootstrap a N-sized training set consisting of repetitions of 1 point.

choices(idx_pool, k=len(idx_pool))

Thus failing this line of code

sample(tuple(set(idx_pool)), 2))

While this should have a very low chance of happening, but yea, in short I should go buy lottery.

Things get more interesting when we see my approximation to Random Forest.

The result is not following a similar trend, it started out great, but when the size of the forest increases it becomes worse, very very counter-intuitive. It is rather rare to see even the training error increases when the forest size increases, so it gets underfitted when we have more trees?

Next we have boosting

While the result is worst than the previous attempt using the legacy point splitting method, but the result is likely just due to the unfortunate choice of initial random centroids.

Comparing the methods

So in the end, it is just a fun follow-up to the previous experiment, overall this round the result is worse, and still very far from the classic original methods.

Thanks for your time.

leave your comment

name is required

email is required

have a blog?

This blog uses scripts to assist and automate comment moderation, and the author of this blog post does not hold responsibility in the content of posted comments. Please note that activities such as flaming, ungrounded accusations as well as spamming will not be entertained.

Click to change color scheme