One of my recent tasks involving crawling a lot of geo-tagged data from a given service. The most recent one is crawling files containing a point cloud for a given location. So I began by observing the behavior in the browser. After exporting the list of HTTP requests involved in loading the application, I noticed there are a lot of requests fetching resources with a common `rXXX`

pattern.

So the colleague who gave me the task didn’t really know if there’s a reason behind the pattern. All the information I get was the documentation, which for some reason I couldn’t understand at all. Fortunately, with help from Martin Heidegger over gitter I finally managed to find a way to figure out how the filenames of the group of resources `rXXX`

is derived.

So apparently potree expects the point cloud in an octree, which is a tree structure where each node has exactly 8 children nodes. The tree structure is then serialized into a long list of binary bits stored in a hierarchy file (in this case I am going to it as `index.hrc`

).

So each node is stored in the hierarchy index file as a tuple of two integers. First is an unsigned 8bit integer, and the second is an unsigned 32bit integer. The first element is to be discussed later, but the second one denotes how many points the node contains. Therefore in order to read them in Python, I need code like this.

import requests from bitstring import BitArray resp = requests.get('http://some/service/returning/index.hrc') tree_serialized = BitArray(bytes=resp.content).bin for node_offset in range(0, len(tree_serialized), 40): node = (tree_serialized[node_offset:node_offset + 8], BitArray(bin=tree_serialized[node_offset + 8:node_offset + 40]).uint)

So I am storing the first element of the tuple as a bit string, and the second as integer. This is because the bit string has a specific use — to denote which children contain points. Recall that we are dealing with a octree here, so each bit (from right to left) corresponds to the exact child `0`

to `7`

.

Since we are now dealing with a tree, the first pair is naturally the root of the tree. Now that we are concerned only in deriving the `rXXX`

names, we are going to disregard the second component in the tuple from here on. The numbers in `XXX`

corresponding to the position of the node. The root node is `r`

, then its children are `r0, r1, r2, ..., r7`

. Expanding from there, the children of node `r0`

are `r01, r02, r03, ..., r07`

, and so on, and so forth. There’s a name for serializing a tree into a list of number like this, but I don’t remember the name right now (materialized path?).

Suppose we have `01010101`

in the first element, it means we are expecting the following list of children with points (reading from right to left).

`r0`

`r2`

`r4`

`r6`

The tree is serialized following a breadth first search pattern, so after the root node, I should expect a tuple for node `r0`

. Then, I should be expecting `r1`

, which is empty according to the root node `r`

. However, the serialized tree is a sparse data structure, hence nodes with no points would not appear. Therefore in our little imaginary example here, the next tuple after `r0`

is `r2`

.

Repeat the same steps until we reach `r6`

, after which we go another level deeper, by referring `r0, r2, r4, r6`

to figure out which of their children contain points. The sparse nature proves to be a challenge when rebuilding the tree. Not only we need to keep track of the node number increment, we also needed to ensure the node actually carry points by referring to the parent.

However, that is already out of the scope of this post (plus my solution is so bad that it needed a lot of polish to be presentable). I am already happy enough to be able to correctly derive the filenames properly.

## Postscript

Some additional note on re-building the tree, apparently I could also skip the increment steps. The name of the children can be found by referring to the parents. As the tree is serialized in breadth first search, all I need to do is to keep appendding the name of newly discovered children to the end of the name list (vs. keep incrementing the current node name and check the tree to see if it should exist).

from functools import reduce tree, current_node, node_labels = dict(), '', () for node_offset in range(0, len(tree_serialized), 40): tree[current_node] = (tree_serialized[node_offset:node_offset + 8], BitArray(bin=tree_serialized[node_offset + 8:node_offset + 40]).uint) node_labels = reduce(lambda current, incoming: current + ('{}{}'.format(current_node, incoming[0]), ), filter(lambda _: int(_[-1]) == 1, enumerate(reversed(tree[current_node][0]))), node_labels[1:]) current_node = node_labels[0]