While working on a text classification task, I spent quite some time preparing the training set for a given document collection. The project is supposed to be a pure golang implementation, so after some quick searching I found some libraries that are either wrappers around libsvm or re-implementations of it. So I happily started to prepare my training set in the libsvm format.
The format itself is very simple, and is much more sensible than the Matrix Market format I dealt with during my postgraduate time. The format is as follows:
<int-label> <int-index1>:<float-value> <int-index2>:<float-value> ...
In practice, the label is an integer, either 0 or 1 in my case (I have not done multi-class classification just yet), and each indexN:value pair gives a feature index, which I start from 0, and its value. Each line represents one vector, and the format is meant for sparse data: if an item has a value of 0, it need not appear in the vector definition at all. In addition, the indices have to be in ascending order from left to right.
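To make the rules above concrete, here is a minimal sketch of how one sparse vector could be written out as a line in this format. The function name to_libsvm_line and the dict-based input are my own assumptions for illustration, not part of libsvm or any library.

```python
def to_libsvm_line(label, features):
    """Format one vector as '<label> <index>:<value> ...'.

    `features` is assumed to be a dict mapping an integer index to a
    float value (a hypothetical input shape chosen for this sketch).
    """
    parts = [str(label)]
    # Emit indices in ascending order, as the format requires.
    for index in sorted(features):
        value = features[index]
        # Zero-valued entries are skipped, since the format is sparse.
        if value == 0:
            continue
        parts.append("{}:{}".format(index, value))
    return " ".join(parts)


line = to_libsvm_line(1, {3: 0.5, 0: 1.2, 2: 0.0})
```

Because the input is a dict, each index can only appear once, which also happens to rule out duplicate entries at the point where the line is written.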
My document vectors are in tf-idf format and assume a “bag of words” model. The problem came when I did not filter out duplicate terms inside a vector, so certain indices had duplicate entries. This was not obvious while I was trying out the golang implementation of libsvm. However, when I switched to scikit-learn due to this problem, I immediately got an error saying I had duplicate entries.
Out of frustration I took some time to write a quick linting function in Python, and it found the problem in the dataset immediately. Just thought I would share it, so here it is:
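For illustration, one way to avoid duplicate entries is to collapse them before writing the vector out. This is only a sketch under an assumption: it keeps the first value seen for each index, which is fine when a repeated term carries the same tf-idf weight every time. If the duplicates were partial counts, summing them would be the better choice. The name dedupe_entries is hypothetical.

```python
def dedupe_entries(pairs):
    """Collapse repeated indices in a list of (index, value) pairs.

    Assumption for this sketch: the first value seen for an index wins.
    Returns the pairs sorted by index, ready to be written out.
    """
    seen = {}
    for index, value in pairs:
        # setdefault only stores the value if the index is new.
        seen.setdefault(index, value)
    return sorted(seen.items())


entries = dedupe_entries([(4, 0.3), (1, 0.7), (4, 0.3)])
```

The result is also sorted by index, so it satisfies the ascending-order requirement at the same time.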
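def lint(training_set):
    # Check that every line has strictly ascending (hence unique) indices.
    for index, document in enumerate(training_set.strip().split('\n')):
        current_index = -1
        # Skip the label, then split each "index:value" pair.
        for item in map(lambda x: x.split(':'), document.split()[1:]):
            if int(item[0]) > current_index:
                current_index = int(item[0])
            else:
                # Show the offending line before failing.
                print(document)
                raise Exception("Index not unique/ascending order on line {}".format(index + 1))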