After a year and half, a lot of things changed, and annoy also changed the splitting strategy too. However, I always wanted to do a proper follow up to the original post, where I compared boosting to Annoy. I still remember the reason I started that (flawed) experiment was because I found boosting easy.
Turns out boosting took me the longest to properly understand when I finally revisit the material.
So I am currently kinda waiting for the next project to finally kick start, so had a little bit of time to revisit this random experiment. So I went through the relevant parts of the Introduction to Statistics Learning course (even the original course site moved to a new location).
Considering I don’t remember what I did wrong in the previous attempt (also gave up finding out again), I am retrying this from scratch. I gave up the previous code that relied a lot on numpy as numpy code is rather not intuitive to write (though it is indeed very efficient). Also, I am going to re-attempt this with a “good enough” comparison to the three forest building methods detailed in the course as much as possible.
I should probably describe the way I am building a tree, as Annoy did previously. So given sample points scattered around a vector space (or you can imagine the points in 2D space), these are the steps I am taking to build a tree out of them.
- First select two random points (make sure they are not the same vector),
- Find a middle point between them.
- Draw a plane, treating the vector from
betaas normal, and the middle point we found in (2) lies on the plane. (In 2D space, it would be drawing a line that is perpendicular to the line from
betapassing through the point we found in (2))
- Now we have the space-separated by the plane, now calculate the distance from each point to the plane. All points that are less than
0are assigned to the left side of the tree, while the remaining to the right. (In 2D it means, for each point below the line we assign to the left, otherwise to the right)
- Repeat step (1) for each subset of the points in each branch, until we have at most
leaf_maxnumber of points (e.g.
5), a leaf node.
It is rather easy to understand, and I remember I was very amazed at how simple it is. It is rather excellent in finding good enough neighbours for a given point, without having to do an exhaustive check, which takes forever. I am not sure if the original author did further experiment on those and published any academic paper but it just worked mostly when I applied it to the problem I had at work.
So extending this idea, I am abusing this to do regression. The above-mentioned strategy still applies, just that in the end, I am calculating a mean of the corresponding response variable for the cluster of points in the leaf node.
Now that we are done with background information, we now move on to the TL;DR part of this post, the result is not great.
But I am happy I tried, properly.
The main reason we are building a forest is that, we want to reduce variance. Depending on how one picks his training set, the resulting tree can look very different from one another. One of the ways to fix this problem is to build a lot of trees (many trees => a forest) from multiple training sets, and averaging the result to reduce the high variance problem.
However, it is not always easy to be able to collect multiple different copies of training sets. This is where bootstrap sampling is helpful. Long story short, one just assumes the training set of size
n as a population. In order to build a tree, this person just picks a training set of the same size
n with replacement. In short if I have 2 points in the training set, chances are when I build a new training set to build a tree, I would have 2 identical points (as the picks is done with replacement).
So my implementation follows this, however, instead of picking the best pair to do branching, I just randomly pick 2, like how the original annoy algorithm does (to be fair I tried doing the best split but my implementation would likely take forever to complete).
In order to get the best result, I am doing a 5 fold cross validation to pick the best forest for the job. So the graph I am showing here all shows the result for the best forest.
This is the MSE vs Forest size (count of trees) for proper bagging vs my own lame implementation (calling it Lann, which is Lame Approximate Neighbour Search).
While the prediction quality increases somewhat with bigger forests, my implementation is rather bad compared to the real Bagging algorithm.
Of course, there may some bugs in my code that contributes to the result (which is very likely), so the problem may well be with me instead of the algorithm.
Next, we have the Random Forest. So instead of looking at all available features (or predictors, sorry I am a bit relaxed
with my term choice as I am not doing my MPhil research here heh), we consider only a subset of features, hence only consider a subset of points when we decide the split. Again, in our implementation, we don’t care TM and just proceed with the original “find 2 points at random, and then draw a plane in between to separate them all” strategy.
So instead of doing a bootstrap, we build the tree with the whole training set here. Not the most accurate comparison one may think, but I don’t care
(am really tempted to insert a meme gif here).
Again, we see the prediction gets better with the size of the forest increasing. I actually did this first, and I really am tempted to continue and build bigger forests. However, after finishing the whole experiment and looking at the performance of the proper Random Forest algorithm I just gave up.
Why bother lol?
Next, we have Boosting, which is something I thought I knew. Turns out I spent a lot more time understanding this when I was trying to build a roughly comparable version
(I know the previous two are not THAT similar but I tried heh).
So instead of building roughly independent big complex trees, one big difference here is that we usually build smaller trees, with just a handful of branches, at most. Another part of this I spent a lot of time to understand is that, it builds on the result of previous iteration.
So just a lengthy note so the future me understands. So far we build the tree with the same response variable (the
Y variable), so it is possible to brute-force concurrency/parallelism with the building part. However, this time, we initialize a new variable named
residual with our response variable, to be updated in each iteration. Also, we do not need to find the mean of prediction from all trees, but we do have a new variable to consider, the learning rate, or shrinking parameter,
- We start by fitting a tree as usual with respect to the residual instead of the the response variable.
- Now we have a tree (which is really a function f) with respect to the current residual. We can now add this to the collection of trees, but remember to multiply the prediction function with the learning rate
- Now, we throw the training set to this new tree in (1), multiply each of them with the learning rate. Next, we update by taking the original residual value for each point and deduct the new value we just obtained.
- Repeat (1) until we have enough trees in the forest. The final model would be the sum of learning rate multiply with the tree predictor functions.
From what I understand, each iteration of the tree function uses its own residual, so keep the residual value somewhere when the forest is used to do a prediction (hopefully the code is obvious on this).
Now back to the original post, my approach is roughly the same here. So there are two more variables to control, but I am sticking to
0.01 as the learning rate, and branching only once.
Again, the original method is doing much better here, nothing new.
So in general, yes, while building tree like how original Annoy does can be adapted to do regression (or even classification), and it does show the bigger the forest is, the prediction quality increases (almost in all cases the MSE drops), but it may not be the best way to do it.
BUT, it could be just me being incompetent in implementing it (it is rather inefficient and there may be bugs I didn’t catch). If you are interested in seeing the code, it can be found here (please forgive me for the horrible code and you may want to comment out the last part of code where it says Best split bootstrap sampling), and the original methods are done here. If you are interested in seeing the numbers instead of pretty graphs, check here.
Lastly, feel free to contact me for corrections.