I am currently preparing to apply for a postgraduate programme and am looking for a research topic. At first I wanted to do something related to cloud computing, but after some discussion with the people around me, they suggested that I do something on the semantic web. While posting my notes here, I realized that I had already posted something on semantic networks, which look like the basis of the semantic web, here (the post is still “Under construction” as of writing; I will post the diagrams later tonight).
I thought it would be better to start by stating a problem. Imagine that one day Alice is looking for the price of a movie DVD. What she would do is go to the movie store's website and search for the DVD. However, Alice cannot get the price when she is not around to do this herself, because the task cannot be automated easily using a computer. This is because the web page that displays the price is prepared for human beings like us.
The semantic web is therefore a proposed solution to this problem, where information is represented not only for human beings like us to read, but also for machines to understand and manipulate. In short, the definition says that the Semantic Web is an extension of the World Wide Web in which information is given well-defined meaning, enabling both lay people and computers to work in co-operation.
From what I remember, before Google emerged as the most popular search engine, webmasters used to include descriptive meta tags within an HTML document, as follows:
<meta name="keywords" content="good looking, handsome, single">
<meta name="description" content="About Jeffrey04">
<meta name="author" content="Jeffrey04">
That way, when a user searched with the keywords specified above, the document would be returned as one of the search results. However, although metadata is exposed via meta tags as shown above, it is never enough to enable computers to understand that “Jeffrey04 is staying in Malaysia” or “Jeffrey04 works in Kuala Lumpur”.
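To make the limitation concrete, here is a minimal sketch in Python of what a machine can actually get out of those meta tags. It uses only the standard library's html.parser module; the names MetaTagParser and extract_meta are my own and purely illustrative, not part of any standard.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_meta(html):
    """Return a dict mapping each meta name to its content string."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta


page = """
<meta name="keywords" content="good looking, handsome, single">
<meta name="description" content="About Jeffrey04">
<meta name="author" content="Jeffrey04">
"""

meta = extract_meta(page)
keywords = [keyword.strip() for keyword in meta["keywords"].split(",")]
print(keywords)        # ['good looking', 'handsome', 'single']
print(meta["author"])  # Jeffrey04
```

All the script ends up with is a handful of flat strings. Nothing in them says how "Jeffrey04" relates to "Malaysia" or "Kuala Lumpur", which is exactly the gap described above.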
Then CSS got popular as people started encouraging the separation of content and presentation. As a result, more people started using HTML tags like <strong> instead of <b>, because <b> only describes appearance and doesn’t carry any meaning. Using tags with semantic meaning also helps users who rely on screen readers to better understand the material.
However, oftentimes, especially while HTML5 is still being drafted, we group content into sections enclosed within a <div> tag. For formatting purposes, as well as to make the block of information meaningful to machines, a class name is often given to the block, e.g. <div class="header">. This use of tag attributes to give meaning to a piece of information led to the development of various microformats.
As web developers started using tags that accurately represent the information within a document, numerous efforts were made to further mark up documents according to standards to ease machine processing. Anyone who has ever done screen-scraping will know the pain of trying to make a script scrape the right information out of an HTML document.
To enable information to be read easily out of an HTML document, a web developer can mark up the information following a specific standard. One such standard is hCalendar, which is used to describe dated events. For example, to mark up an event that is taking place on 6th June 2010, we would do:
<p class="vevent">John Doe is <span class="summary">getting married</span> on 6th June 2010 at the <span class="location">Community Hall</span>. The ceremony will be held from <abbr class="dtstart" title="2010-06-06T14:00:00+08:00">2PM</abbr> till <abbr class="dtend" title="2010-06-06T16:00:00+08:00">4PM</abbr>.</p>
Therefore, to do screen-scraping, one would just need to search (via either CSS selectors or XPath) for the particular block of content above and read the values from the elements with the known class names.
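As a rough sketch of what that search could look like, the Python snippet below pulls the event details out of the hCalendar paragraph above using the limited XPath support in the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML, which real-world HTML pages often are not, and that each class attribute holds exactly one class name; the function name parse_vevent is my own invention.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

snippet = (
    '<p class="vevent">John Doe is <span class="summary">getting married</span> '
    'on 6th June 2010 at the <span class="location">Community Hall</span>. '
    'The ceremony will be held from '
    '<abbr class="dtstart" title="2010-06-06T14:00:00+08:00">2PM</abbr> till '
    '<abbr class="dtend" title="2010-06-06T16:00:00+08:00">4PM</abbr>.</p>'
)


def parse_vevent(markup):
    """Extract summary, location, start and end from one vevent fragment."""
    root = ET.fromstring(markup)

    def node(class_name):
        # Exact match on the class attribute; a page that uses several
        # classes on one element (class="summary big") would need more work.
        return root.find(f".//*[@class='{class_name}']")

    return {
        "summary": node("summary").text,
        "location": node("location").text,
        # The machine-readable timestamps live in the title attribute.
        "start": datetime.fromisoformat(node("dtstart").get("title")),
        "end": datetime.fromisoformat(node("dtend").get("title")),
    }


event = parse_vevent(snippet)
print(event["summary"], "at", event["location"])  # getting married at Community Hall
print(event["end"] - event["start"])              # 2:00:00
```

Because the class names come from the hCalendar standard rather than from one particular site's layout, the same lookup should keep working even if the sentence around the values is reworded.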
(To be continued. Next is the Resource Description Framework, RDF, which is loosely related to my previous post on semantic networks as linked above.)