Notes on codes, projects and everything

Understanding potree hierarchy file

One of my recent tasks involving crawling a lot of geo-tagged data from a given service. The most recent one is crawling files containing a point cloud for a given location. So I began by observing the behavior in the browser. After exporting the list of HTTP requests involved in loading the application, I noticed there are a lot of requests fetching resources with a common rXXX pattern.

So the colleague who gave me the task didn’t really know if there’s a reason behind the pattern. All the information I get was the documentation, which for some reason I couldn’t understand at all. Fortunately, with help from Martin Heidegger over gitter I finally managed to find a way to figure out how the filenames of the group of resources rXXX is derived.

So apparently potree expects the point cloud in an octree, which is a tree structure where each node has exactly 8 children nodes. The tree structure is then serialized into a long list of binary bits stored in a hierarchy file (in this case I am going to it as index.hrc).

So each node is stored in the hierarchy index file as a tuple of two integers. First is an unsigned 8bit integer, and the second is an unsigned 32bit integer. The first element is to be discussed later, but the second one denotes how many points the node contains. Therefore in order to read them in Python, I need code like this.

import requests
from bitstring import BitArray

resp = requests.get('http://some/service/returning/index.hrc')
tree_serialized = BitArray(bytes=resp.content).bin

for node_offset in range(0, len(tree_serialized), 40):
    node = (tree_serialized[node_offset:node_offset + 8],
            BitArray(bin=tree_serialized[node_offset + 8:node_offset + 40]).uint)

So I am storing the first element of the tuple as a bit string, and the second as integer. This is because the bit string has a specific use — to denote which children contain points. Recall that we are dealing with a octree here, so each bit (from right to left) corresponds to the exact child 0 to 7.

Since we are now dealing with a tree, the first pair is naturally the root of the tree. Now that we are concerned only in deriving the rXXX names, we are going to disregard the second component in the tuple from here on. The numbers in XXX corresponding to the position of the node. The root node is r, then its children are r0, r1, r2, ..., r7. Expanding from there, the children of node r0 are r01, r02, r03, ..., r07, and so on, and so forth. There’s a name for serializing a tree into a list of number like this, but I don’t remember the name right now (materialized path?).

Suppose we have 01010101 in the first element, it means we are expecting the following list of children with points (reading from right to left).

  • r0
  • r2
  • r4
  • r6

The tree is serialized following a breadth first search pattern, so after the root node, I should expect a tuple for node r0. Then, I should be expecting r1, which is empty according to the root node r. However, the serialized tree is a sparse data structure, hence nodes with no points would not appear. Therefore in our little imaginary example here, the next tuple after r0 is r2.

Repeat the same steps until we reach r6, after which we go another level deeper, by referring r0, r2, r4, r6 to figure out which of their children contain points. The sparse nature proves to be a challenge when rebuilding the tree. Not only we need to keep track of the node number increment, we also needed to ensure the node actually carry points by referring to the parent.

However, that is already out of the scope of this post (plus my solution is so bad that it needed a lot of polish to be presentable). I am already happy enough to be able to correctly derive the filenames properly.

Postscript

Some additional note on re-building the tree, apparently I could also skip the increment steps. The name of the children can be found by referring to the parents. As the tree is serialized in breadth first search, all I need to do is to keep appendding the name of newly discovered children to the end of the name list (vs. keep incrementing the current node name and check the tree to see if it should exist).

from functools import reduce
tree, current_node, node_labels = dict(), '', ()
for node_offset in range(0, len(tree_serialized), 40):
    tree[current_node] = (tree_serialized[node_offset:node_offset + 8],
                          BitArray(bin=tree_serialized[node_offset + 8:node_offset + 40]).uint)

    node_labels = reduce(lambda current, incoming: current + ('{}{}'.format(current_node, incoming[0]), ),
                         filter(lambda _: int(_[-1]) == 1,
                                enumerate(reversed(tree[current_node][0]))),
                         node_labels[1:])
    current_node = node_labels[0]

leave your comment

name is required

email is required

have a blog?

This blog uses scripts to assist and automate comment moderation, and the author of this blog post does not hold responsibility in the content of posted comments. Please note that activities such as flaming, ungrounded accusations as well as spamming will not be entertained.

Click to change color scheme