This post is pure speculation on my part, as I have not run any experiment on real-life data to back the arguments. I am currently writing up a plan for my experiment(s) on recommender systems (which reminds me that I have not released the Flickr data collection tool :/), and my supervisor advised me to write a paragraph or two on some of the key points. Since he is not going to read it, I might as well post it here as a note.
The idea of using tag clusters comes from the papers published by Gemmell et al. in 2008-2009, in which they used tag clusters to determine the relevance between users and resources in their recommender system. The main reason for using tag clusters instead of tags alone is to address two problems commonly found in folksonomies: tag ambiguity (polysemy) and tag redundancy (synonymy).
Fuzzy C-Means (FCM) clustering is a fuzzy clustering method in which a node does not necessarily belong to only one cluster. Instead, it may be a member of one or more clusters, each with a different degree of membership. For example, if one models tags as points in a vector space, each tag may belong to one or more clusters, which is theoretically useful because some tags carry different meanings (polysemy) depending on the context (user interest).
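To make the "degree of membership" idea concrete, here is a minimal sketch of the textbook FCM membership update for a single point, given cluster centroids that are already known. The function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the use of Euclidean distance are my assumptions for illustration; this is not taken from the Gemmell et al. papers or from my Clojure code.

```python
import math


def fcm_memberships(point, centroids, m=2.0):
    """Return the degree of membership of `point` in each cluster.

    Uses the textbook FCM update u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)),
    where d_i is the Euclidean distance to centroid i and m > 1 is the
    fuzzifier (assumed to be 2.0 here).
    """
    dists = [math.dist(point, c) for c in centroids]

    # A point sitting exactly on a centroid belongs fully to that cluster
    # (shared equally if several centroids coincide with it).
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# A tag at (1, 0) with centroids at (0, 0) and (4, 0): memberships are
# roughly 0.9 and 0.1, so the tag belongs to both clusters, just unequally.
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

The memberships always sum to 1 for a given point, so a polysemous tag that sits between two clusters would end up with a noticeable share in each of them.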
While the diagram in my previous post (my attempt to implement the FCM clustering algorithm in Clojure, not of production quality though) only shows each point belonging to one cluster, each point may in fact belong to more than one cluster. The line connecting each point to a cluster centroid only shows that point's most prominent relationship among all the available clusters.
Similar to the approach used by Gemmell et al., we model users and resources as vectors of tags. The weighting scheme that produces each tag entry would most probably be an adaptation of tf*idf, which is used by most tag-based recommender systems. Since each cluster is also a vector of tags, its degree-of-membership values are treated as weights as well.
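As a rough illustration of what such tag vectors could look like, the sketch below builds sparse tf*idf-weighted vectors from raw tag assignments. The exact adaptation is not decided yet, so everything here is an assumption: tf is the number of times a tag is attached to an entity (a user or a resource), idf is log(N / df) over all entities, and the name `tfidf_tag_vectors` and the sample tags are made up.

```python
import math
from collections import Counter


def tfidf_tag_vectors(annotations):
    """Build sparse tf*idf tag vectors.

    `annotations` maps an entity id (a user or a resource) to the list of
    tags attached to it, with repeats. Returns {entity: {tag: weight}}.
    """
    n = len(annotations)
    # Document frequency: in how many entities does each tag appear?
    df = Counter(tag for tags in annotations.values() for tag in set(tags))

    vectors = {}
    for entity, tags in annotations.items():
        tf = Counter(tags)
        vectors[entity] = {
            tag: count * math.log(n / df[tag]) for tag, count in tf.items()
        }
    return vectors


# Hypothetical resources: "java" appears everywhere, so its weight drops to 0,
# while the rarer tags keep a positive weight.
vectors = tfidf_tag_vectors({
    "r1": ["java", "java", "coffee"],
    "r2": ["java", "island"],
})
```

A cluster can be stored in the same `{tag: weight}` shape, with the FCM membership values as the weights, which keeps users, resources and clusters directly comparable.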
To calculate the similarity between resources and tag clusters (both of which are vectors), there are several approaches that may be used (I just discovered this after a quick Google search and will read up after publishing this). Adapting collaborative filtering to the original design published by Gemmell et al. is not too difficult, at least theoretically (yes, as mentioned, this post is full of crap speculation). All we need to do is find a list of the n most similar users using the similarity measure, and then compare this group of users with the tag clusters to find the degree of relationship between them.
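The neighbourhood step could be sketched roughly as follows. I have not settled on a similarity measure, so cosine similarity is only an assumption here, as is averaging the neighbours' similarity to a cluster to get a single "degree of relationship". The function names and the toy users are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(w * b.get(tag, 0.0) for tag, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(target, users, n):
    """Ids of the n users most similar to `target`, excluding the target."""
    scored = [
        (cosine(users[target], vec), uid)
        for uid, vec in users.items()
        if uid != target
    ]
    scored.sort(reverse=True)
    return [uid for _, uid in scored[:n]]


def neighbourhood_cluster_affinity(neighbours, users, cluster):
    """Average similarity between a group of users and one tag cluster."""
    if not neighbours:
        return 0.0
    return sum(cosine(users[u], cluster) for u in neighbours) / len(neighbours)


# Toy data: bob shares a tag with alice, carol does not.
users = {
    "alice": {"python": 1.0, "clojure": 1.0},
    "bob": {"python": 1.0},
    "carol": {"flickr": 1.0},
}
cluster = {"python": 0.5, "clojure": 0.5}  # membership values as weights

neighbours = top_n_similar("alice", users, 1)
affinity = neighbourhood_cluster_affinity(neighbours, users, cluster)
```

Averaging is just the simplest way to collapse the group into one number; weighting each neighbour by their similarity to the target user would be an obvious variation to try.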
Information extracted by traversing the social graph, community/group membership, and ratings (mark as favorite, +1, like) may also be applied somewhere in between, but I still can't figure out where it could be used for now. I will post a follow-up whenever there is an update.