In the previous post, I re-implemented Annoy in 2D with some linear algebra. Since then I have spent some time going through tutorials on vectors, and expanded the script to handle data in 3D and beyond. So instead of finding the gradient of the perpendicular bisector between two points, I now construct a plane, and compute the signed distance from each point to that plane to build the tree.
In order to construct a plane right in the middle of two random points, we first need the vector connecting the two points, which serves as the plane's normal.
```python
# vec(BA) = vec(OA) - vec(OB)
def pnormal(points):
    return tuple(map(lambda incoming: incoming[0] - incoming[1], zip(*points)))
```
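A quick sanity check with a pair of made-up 3D points: each component of the result is just the difference of the corresponding components.

```python
def pnormal(points):
    # component-wise difference of the two points: vec(BA) = vec(OA) - vec(OB)
    return tuple(map(lambda incoming: incoming[0] - incoming[1], zip(*points)))

# (1, 2, 3) minus (4, 6, 8), component by component
print(pnormal([(1, 2, 3), (4, 6, 8)]))  # → (-3, -4, -5)
```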
Then we also need to find the middle point, which is similar to the previous version, but with support for any number of dimensions.
```python
# vec(OM) = (vec(OA) + vec(OB)) / 2
def middle(points):
    return tuple(map(lambda component: (component[0] + component[1]) / 2, zip(*points)))
```
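Again a small check with made-up points; note the components come back as floats because of the division by 2.

```python
def middle(points):
    # component-wise average of the two points
    return tuple(map(lambda component: (component[0] + component[1]) / 2, zip(*points)))

print(middle([(1, 2, 3), (4, 6, 8)]))  # → (2.5, 4.0, 5.5)
```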
Then we can build a closure to calculate the signed distance of a point to a given plane.
```python
from functools import reduce
from math import sqrt

def pd_calculator(pnormal, middle):
    # d in the plane equation n·x + d = 0, with `middle` lying on the plane
    d = reduce(lambda result, incoming: result + -(incoming[0] * incoming[1]),
               zip(pnormal, middle), 0)

    def _(point):
        # (n·p + d) / |n| gives the signed distance of `point` to the plane
        return ((reduce(lambda result, incoming: result + incoming[0] * incoming[1],
                        zip(pnormal, point), 0) + d)
                / sqrt(reduce(lambda result, incoming: result + pow(incoming, 2),
                              pnormal, 0)))

    return _
```
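To see the closure in action, here is a sketch with a plane simple enough to check by hand: normal (-2, 0, 0) through the point (1, 0, 0), i.e. the plane x = 1 (the function is repeated so the snippet runs standalone).

```python
from functools import reduce
from math import sqrt

def pd_calculator(pnormal, middle):
    # d in the plane equation n·x + d = 0, with `middle` lying on the plane
    d = reduce(lambda result, incoming: result + -(incoming[0] * incoming[1]),
               zip(pnormal, middle), 0)

    def _(point):
        # (n·p + d) / |n| gives the signed distance of `point` to the plane
        return ((reduce(lambda result, incoming: result + incoming[0] * incoming[1],
                        zip(pnormal, point), 0) + d)
                / sqrt(reduce(lambda result, incoming: result + pow(incoming, 2),
                              pnormal, 0)))

    return _

calc = pd_calculator((-2, 0, 0), (1, 0, 0))  # the plane x = 1
print(calc((0, 0, 0)))  # → 1.0 (on the normal's side, one unit away)
print(calc((3, 0, 0)))  # → -2.0 (opposite side, two units away)
```

The sign tells you which side of the plane the point is on, and the magnitude is the ordinary Euclidean distance to the plane.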
The tree-building part is again the same: group the points in the partition depending on whether their signed distance is positive or negative, and keep splitting until each cluster holds at most N points (20 here).
```python
from uuid import uuid4

def tree(points):
    result = {}
    if len(points) <= 20:
        result = {
            'type': 'leaf',
            'count': len(points),
            'uuid': uuid4(),
            'children': points,
        }
    else:
        # pick two points to split on, and build the bisecting plane between them
        split = split_points(points)
        pd_calc = pd_calculator(pnormal(split), middle(split))
        positive, negative = [], []
        for point in points:
            if pd_calc(point) > 0:
                positive.append(point)
            else:
                negative.append(point)
        result = {
            'type': 'branch',
            'func': pd_calc,
            'count': len(points),
            'uuid': uuid4(),
            'children': [tree(negative), tree(positive)],
        }
    return result
```
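split_points comes from the previous post and is not reproduced here; as far as this script is concerned it only needs to return two points to bisect. A minimal stand-in, assuming the pick-two-random-points behavior from the previous post (the sampling strategy is my assumption, not necessarily the original):

```python
from random import sample

def split_points(points):
    # assumption: choose two distinct random points to define the splitting plane
    return sample(points, 2)
```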
Then searching is the same as in the last version; practically I didn't change the search code. The only exception is the part that looks for the nearest leaves: in this revision, as long as a point has a negative distance to the plane within the threshold, that branch is also considered in the search.
```python
from itertools import chain

from numpy import argmin
from scipy.spatial.distance import euclidean

def leaves_nearest(point, tree, threshold):
    result = []
    if tree['type'] == 'leaf':
        result.append(tree)
    else:
        delta = tree['func'](point)
        if delta > 0:
            result = leaves_nearest(point, tree['children'][1], threshold)
        elif delta > -threshold:
            # close enough to the plane: descend into both branches
            result = (leaves_nearest(point, tree['children'][0], threshold)
                      + leaves_nearest(point, tree['children'][1], threshold))
        else:
            result = leaves_nearest(point, tree['children'][0], threshold)
    return result

def distances_find(query, points):
    return [euclidean(query, point) for point in points]

def search_tree(query, nleaves):
    candidates = list(chain.from_iterable([leaf['children'] for leaf in nleaves]))
    distances = distances_find(query, candidates)
    idx_min = argmin(distances)
    return (distances[idx_min], candidates[idx_min])
```
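To make the threshold behavior concrete, here is a toy run on a hand-built two-leaf tree (the tree, the split function, and the points are all made up; scipy's euclidean() and numpy's argmin() are swapped for stdlib equivalents so the snippet has no dependencies):

```python
from itertools import chain
from math import dist  # Python 3.8+

def leaves_nearest(point, tree, threshold):
    result = []
    if tree['type'] == 'leaf':
        result.append(tree)
    else:
        delta = tree['func'](point)
        if delta > 0:
            result = leaves_nearest(point, tree['children'][1], threshold)
        elif delta > -threshold:
            # close enough to the plane: descend into both branches
            result = (leaves_nearest(point, tree['children'][0], threshold)
                      + leaves_nearest(point, tree['children'][1], threshold))
        else:
            result = leaves_nearest(point, tree['children'][0], threshold)
    return result

def search_tree(query, nleaves):
    # stdlib stand-ins for scipy's euclidean and numpy's argmin
    candidates = list(chain.from_iterable(leaf['children'] for leaf in nleaves))
    distances = [dist(query, c) for c in candidates]
    idx_min = min(range(len(distances)), key=distances.__getitem__)
    return (distances[idx_min], candidates[idx_min])

# hand-built toy tree: one splitting plane, x = 5, in 2D
tree_ = {
    'type': 'branch',
    'func': lambda p: p[0] - 5,  # signed distance to the plane x = 5
    'children': [
        {'type': 'leaf', 'children': [(1, 1), (2, 3)]},  # negative side
        {'type': 'leaf', 'children': [(8, 8), (9, 1)]},  # positive side
    ],
}

query = (4.5, 2)  # just on the negative side of the plane
print(len(leaves_nearest(query, tree_, 0)))  # → 1 (only the negative leaf)
print(len(leaves_nearest(query, tree_, 1)))  # → 2 (both leaves searched)
print(search_tree(query, leaves_nearest(query, tree_, 1)))
```

With threshold 0 the query only ever reaches one leaf per level; with threshold 1 the query is within the slack of the plane, so both sides are searched and the true nearest point (2, 3) is found.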
Yes, I got lazy and used scipy's euclidean(). I was curious to see whether the threshold really helps in a multi-dimensional space, so I did this:
```python
import time
from random import randint

def search_brute(query, points):
    distances = distances_find(query, points)
    idx_min = argmin(distances)
    return (distances[idx_min], points[idx_min])

dim = 10
points = []
print('Generating Points')
for _ in range(10000):
    points.append(tuple([randint(0, 499) for __ in range(dim)]))

print('Building Tree')
_tree = tree(points)
query = tuple([randint(0, 499) for __ in range(dim)])

print('Brutal Search Answer')
t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
ganswer = search_brute(query, points)
print('Search took {} seconds'.format(time.perf_counter() - t0))
print(ganswer)

for threshold in range(0, 10001, 10):
    print('Cluster Answer for threshold {}'.format(threshold))
    t0 = time.perf_counter()
    nleaves = leaves_nearest(query, _tree, threshold)
    canswer = search_tree(query, nleaves)
    print('Search took {} seconds'.format(time.perf_counter() - t0))
    print(canswer)
    if canswer[0] == ganswer[0]:
        break
```
And the respective output.
It seems to work, at least in low dimensions (e.g. dim = 5). However, when I tried with 10-dimensional random data, there was a much higher chance of not finding the real closest point, despite a ridiculously high threshold. I don't have a better idea on how to fix this at the moment. But for now, the code handles multiple dimensions properly.
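One way to see what is going wrong, at least for uniform random data, is to look at how pairwise distances behave as the dimension grows: the gap between the nearest and the farthest point shrinks, so a single splitting plane separates near-equidistant points almost arbitrarily, and the threshold has to cover nearly everything to recover them. A quick sketch of that effect (the point counts and dimensions are arbitrary choices of mine):

```python
from itertools import combinations
from math import dist
from random import random, seed

def relative_contrast(dim, n=300):
    # (max - min) / min over all pairwise distances of n random points in [0, 1]^dim
    points = [tuple(random() for _ in range(dim)) for _ in range(n)]
    distances = [dist(a, b) for a, b in combinations(points, 2)]
    return (max(distances) - min(distances)) / min(distances)

seed(0)
print(relative_contrast(2))   # large: near and far points are very different
print(relative_contrast(50))  # small: most pairs are roughly equally far apart
```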