Back then, when I was still working on my postgraduate degree research, I used RDF, which was the preferred format for representing data in the world of the Semantic Web. I eventually dropped the degree and stopped following the development of the related technologies and standards, until I volunteered to update the import script for popit while looking for my next job/project.
Just a TL;DR: this is a quick “informal tech report” + miscellaneous notes for the import script.
Apparently, throughout the past decade, RDF-related standards were still being adopted in a lot of different areas. I was told some local government agency actually required submitted projects to be representable through the specifications published by the Dublin Core Metadata Initiative.
The adoption did not stop there: while reading through the requirements of this import script, I was given a couple of specifications published by the Popolo project. Also, apparently people eventually got fed up with XML and adopted formats that are easier to read (and possibly parse).
One of the main objectives of the Semantic Web is to create a huge network of linked entities. Adopting this in the importer script is a great help for people researching the relevant topics, as it is now possible to link together data exported from different instances of popit (a CMS to manage network data, spec’ed by Popolo).
While the project itself involves published information about political figures, the script is a much simpler piece of work. This is my first time managing a project with Poetry that requires me to properly publish the resulting script.
As a language, Python is very likeable: it is relatively clean and sufficiently expressive. Though the recommended runtime is not known for efficiency, it is still a great language for this project (where the script is usually run one-off, not periodically).
However, publishing completed work is a hairy problem which is still not properly addressed IMHO. The closest solution is through Docker (which I may spend time writing about next, for another project), but in this one I am experimenting with building wheels through Poetry.
Starting a new project is the usual
poetry new --src <project name>
Without the --src flag, a sub-directory with the same name as <project name> is created in the project root. I don’t quite like this, as I want all source code to be stored in a folder whose name is standardized across all my projects. Also, it is clearer when the project consists of multiple smaller libraries: they will all be stored alongside the main one in the src folder.
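For reference, the resulting layout looks roughly like this (the exact files generated may differ between Poetry versions):

<project name>/
├── pyproject.toml
├── src/
│   └── <project name>/
│       └── __init__.py
└── tests/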
I was told click is a good library for a program that runs purely in the text terminal. While I managed to hack around it to get asyncio and a git-like command structure working, it is still rather cumbersome lol. I will probably try another one in the future.
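For the record, the hack looks roughly like the minimal sketch below; this is not the actual primport code, and the crawl helper is made up for illustration:

import asyncio

import click

@click.group()
def main():
    """Top-level command group, giving the git-like 'primport subcommand' structure."""

@main.command()
@click.option("--depth", default=1, help="How many hops to crawl from each seed node.")
@click.argument("uris", nargs=-1)
def visualize(depth, uris):
    # click callbacks are synchronous, so the coroutine is driven explicitly here
    asyncio.run(crawl(uris, depth))

async def crawl(uris, depth):
    # placeholder for the real asynchronous fetching/drawing logic
    for uri in uris:
        print(f"would fetch {uri} up to depth {depth}")

if __name__ == "__main__":
    main()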
In order to get this project built as a wheel which installs a script called primport, these lines are to be added to the pyproject.toml project specification:
[tool.poetry.scripts]
primport = 'popit_relationship.primport:main'
Where the left-hand side is the name of the script to install, and the right-hand side tells Poetry where to find the function to run, as specified in the documentation.
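Concretely (this is the general Poetry entry-point mechanism, not code taken from the project), popit_relationship.primport:main means the installed primport script will do roughly the equivalent of:

# what the generated console script effectively runs
import sys

from popit_relationship.primport import main

sys.exit(main())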
Doing a
poetry install
results in an editable install. Now I can do
poetry run primport
to execute the script, while still working on it. When I am happy with it, I can build the wheel by doing
poetry build
The resulting file will be stored in the dist folder, which end users can then install through
pip3 install popit_relationship-0.1.0-py3-none-any.whl
And it should work, as long as the user has a working Python installation AND has wheel installed. Required libraries will be pulled in automatically while the wheel is being installed.
I am aware of PyInstaller and the newer PyOxidizer project for distribution purposes. However, I can’t really do a cross-compile with the former, and the latter still feels very experimental. Hence the current compromise is to distribute through a wheel package. I am still contemplating whether I should just skip the procedures above and somehow get GitHub to serve just the wheel package.
Now that the script is installed, we should be able to run primport to import things into a locally pickled NetworkX graph. At the current stage, the amount of data makes it feasible to pickle the file directly; however, when the data eventually grows, storing it in something like SQLite would make more sense.
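As an illustration (the file name, edge attribute, and graph contents here are assumptions, not necessarily what primport uses), pickling and reloading a NetworkX graph is as simple as:

import pickle

import networkx as nx

# build a tiny graph with a URI-labelled edge, mimicking the imported data
g = nx.DiGraph()
g.add_edge(
    "https://politikus.sinarproject.org/persons/example-person",
    "https://politikus.sinarproject.org/organizations/example-organization",
    relationship="http://www.w3.org/ns/org#memberOf",
)

# save the whole graph into a local pickle file
with open("graph.pickle", "wb") as handle:
    pickle.dump(g, handle)

# ...and load it back later for analysis or visualization
with open("graph.pickle", "rb") as handle:
    g = pickle.load(handle)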
Once stored as a NetworkX graph, we can attempt to visualize it in places like a Jupyter notebook. Since I am not researching in this area, I didn’t contribute much here. However, I did a demo of how it could be useful by providing a visualize subcommand.
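In a notebook, exploring the pickled graph could look something like this; it is a rough sketch, and the drawing choices are mine rather than what the visualize subcommand actually does:

import pickle

import matplotlib.pyplot as plt
import networkx as nx

# load the previously pickled graph (the path is an assumption)
with open("graph.pickle", "rb") as handle:
    g = pickle.load(handle)

# force-directed layout, then a plain matplotlib rendering
pos = nx.spring_layout(g)
nx.draw(g, pos, with_labels=True, node_size=300, font_size=8)
plt.show()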
This is an example pulling from Politikus (a popit instance), representing our current Prime Minister in network graph form.
$ primport visualize --depth=1 https://politikus.sinarproject.org/persons/muhyiddin-yassin
The graph does not look pretty, but limiting the depth through the --depth argument keeps the nodes from getting too crowded and makes the graph easier to read.
In order to find out whether two or more people are somehow linked through the data stored in Politikus, we can pass in more nodes and play with the --depth parameter to explore.
$ primport visualize --depth=2 https://politikus.sinarproject.org/persons/lim-guan-eng https://politikus.sinarproject.org/persons/ong-kian-ming
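One plausible way to combine several starting nodes under a depth limit (this is my guess for illustration, not necessarily how primport implements it) is to take the union of each node’s neighbourhood:

import networkx as nx

def neighbourhood(graph, seeds, depth):
    # ego_graph keeps only nodes reachable within 'depth' hops of each seed;
    # compose_all merges the per-seed subgraphs into a single graph
    return nx.compose_all(nx.ego_graph(graph, seed, radius=depth) for seed in seeds)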
For most users wanting a quick visualization, Neo4j is also supported. Since the community edition only allows one database to be created, running Neo4j in a Docker instance is recommended so as not to pollute an existing pool of data.
Once the server is up, it is possible to save the graph to a Neo4j instance by specifying the environment variables
$ NEO4J_URL="bolt:localhost:7687" NEO4J_AUTH="neo4j/abc123" primport save
With the data saved into a Neo4j instance, different visualizations can be done by just pointing and clicking.
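Under the hood, pushing the graph into Neo4j with the official Python driver could look roughly like the sketch below; the node label, property names, and bolt:// URL format are my assumptions, and the actual save command may well differ:

import os
import pickle

from neo4j import GraphDatabase

# read connection settings from the same environment variables as above
url = os.environ.get("NEO4J_URL", "bolt://localhost:7687")
user, password = os.environ.get("NEO4J_AUTH", "neo4j/abc123").split("/", 1)

with open("graph.pickle", "rb") as handle:
    g = pickle.load(handle)

driver = GraphDatabase.driver(url, auth=(user, password))
with driver.session() as session:
    for source, target, data in g.edges(data=True):
        # MERGE is idempotent, so re-running the save does not duplicate nodes
        session.run(
            "MERGE (a:Entity {uri: $source}) "
            "MERGE (b:Entity {uri: $target}) "
            "MERGE (a)-[:RELATED {kind: $kind}]->(b)",
            source=source,
            target=target,
            kind=data.get("relationship", ""),
        )
driver.close()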
As much as possible, the URI of the schema is included in the data stored in both the NetworkX graph and Neo4j. For relationships that are not published as RDF (e.g. the control statement relationships above), a mock URI is inserted in place of those schemas (Sinar Project may want to consider publishing them for real, based on the existing ones?).
While it is still unavailable now, the information stored should be sufficient to generate RDF triples, which can then be processed by tools written for linked data (e.g. Weka). I was told there would probably be more graph analysis work done with the data in the future, so feel free to contribute to the project if you are interested.
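To close, here is a rough sketch of how the stored URIs could eventually be turned into RDF triples; the use of rdflib, the file paths, and the fallback predicate are all my assumptions rather than something the script ships with:

import pickle

from rdflib import Graph, URIRef

with open("graph.pickle", "rb") as handle:
    g = pickle.load(handle)

rdf = Graph()
for source, target, data in g.edges(data=True):
    # fall back to a mock predicate when the edge carries no schema URI
    predicate = data.get("relationship", "https://example.org/schema/related")
    rdf.add((URIRef(source), URIRef(predicate), URIRef(target)))

# serialize into Turtle, ready for consumption by linked data tools
rdf.serialize(destination="graph.ttl", format="turtle")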