Notes on codes, projects and everything

Selecting Nodes with XPath

To select nodes for DOM operations, one typically uses CSS selectors, as (probably) popularized by jQuery. However, there is an alternative that is just as powerful, if less well known: XPath. XPath can do a lot more than select nodes (which I have no time to explore for now), but in this blog post I will focus only on node selection.

The main inspiration for this post is John Resig’s blog post comparing CSS selectors to XPath selectors. Recently I have been involved in a project where I need to scrape information from a bunch of badly marked up HTML files. Although I was advised to use regular expressions to scrape the information out of the files, I decided to give XPath a try.

The initial version was written in PHP, but it did not work efficiently: the files are not correctly marked up (nor marked up semantically), so I had to run the markup through Tidy before processing it. The cleaned markup was then passed to the DOM library classes for data extraction. To sum up the performance, it was terribly slow (mainly because of the Tidy step).

Then I headed to Stack Overflow to look for alternatives. As recommended there, I now use lxml and Beautiful Soup together to scrape information out of the source HTML files (details to be discussed in another post).

Back to the topic: suppose an HTML file is given that is marked up as follows:


<html>
<head>
<title>Foo</title>
</head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<th><h1>Main Content</h1></th>
</tr>
<tr>
<td><span>I am a terribly marked up HTML file</span></td>
</tr>
<tr>
<td><span id="horrible_text">Just too horrible</span></td>
</tr>
</table>
</td>
<td>
<table class="sidebar">
<tr>
<th colspan="2"><font>Fancy Sidebar</font></th>
</tr>
<tr>
<td><font>Fizz</font></td>
<td><a href="another_broken_link.html">Fancy menu 1</a></td>
</tr>
<tr>
<td><font>Buzz</font></td>
<td><a href="i_am_also_broken.html">Fancy menu 2</a></td>
</tr>
<tr>
<td><font>Bar</font></td>
<td><a href="some_fancy_broken_link.html">Fancy menu 3</a></td>
</tr>
</table>
</td>
</tr>
</table>
<div id="footer">
<p>Some meaningless copyright text</p>
</div>
</body>
</html>

As seen in the HTML above, it is not as badly marked up, but it basically shows the kind of HTML files I am dealing with (except that the real files do not define a single element ID). To get started with XPath quickly, I went through the w3schools tutorials while working with the files.

Let’s start with the simplest and most obvious case: selecting the paragraph node that holds the copyright text.

//*[@id='footer']/p

In CSS, one would probably get this done via

#footer > p

However, as I am doing this to extract text, I need to select the text content itself:

//*[@id='footer']/p/text()

To summarize what those symbols mean, let me translate the last XPath selector into a layman-friendly statement: select any element anywhere in the document (//, the descendant shorthand, followed by the wildcard *) that has ([...]) an attribute named id (@id) whose value is ‘footer’ (@id='footer'); from there, select its child paragraph node (/p) and extract the text contained within that node (/text()).
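As a quick way to try this out, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. It assumes the markup is well-formed enough to parse as XML (the trimmed-down DOC string below is my own stand-in for the sample file). ElementTree only understands a subset of XPath and has no text() step, so the sketch selects the paragraph and reads its .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the sample document (assumed well-formed).
DOC = """<html><body>
<div id="footer">
<p>Some meaningless copyright text</p>
</div>
</body></html>"""

root = ET.fromstring(DOC)

# ElementTree paths are relative to the element, hence the leading "."
paragraph = root.find(".//*[@id='footer']/p")

# No text() step in ElementTree's XPath subset, so read .text instead.
footer_text = paragraph.text
print(footer_text)  # Some meaningless copyright text
```

A full XPath engine such as lxml should accept the /text() version of the selector directly; the standard library is used here only to keep the sketch dependency-free.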

Now that you have some idea, let’s proceed to a slightly more complicated selector. Suppose I want to extract the content of the third row in the main content table; I would need to do this (as I’m new to this, it may not be the most efficient way of selecting):

//tr[*/descendant-or-self::*[contains(text(), 'Main') and contains(text(), 'Content')]]
/following-sibling::tr[2]
/td
/descendant-or-self::text()

Well, in the markup above there is no id, no class, and basically nothing that uniquely identifies the node I want. (I don’t know how to select the node using CSS, so assume there is no way for now.) Therefore, to select the text I want, I select any tr node in the document (//tr) that contains ([...]) any node (*/descendant-or-self::*) which in turn contains (the nested [...]) the text ‘Main’ and ‘Content’ (contains(text(), 'Main') and contains(text(), 'Content')). From there, I find the second tr sibling after each selected tr (/following-sibling::tr[2]) and get its child table data node (/td). Then, regardless of what node wraps the text content, I return every text node found within the selected table data node (/descendant-or-self::text()).
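To make those steps concrete, here is a rough sketch that walks through them by hand in Python. The standard-library ElementTree does not support contains() or the following-sibling axis, so this is an approximation of the selector, not the selector itself; row_matches and text_two_rows_after are names I made up for illustration, and reading el.text only approximates what text() sees.

```python
import xml.etree.ElementTree as ET

# Only the main content table from the sample document.
DOC = """<table>
<tr><th><h1>Main Content</h1></th></tr>
<tr><td><span>I am a terribly marked up HTML file</span></td></tr>
<tr><td><span id="horrible_text">Just too horrible</span></td></tr>
</table>"""

def row_matches(tr, words):
    # Stand-in for tr[*/descendant-or-self::*[contains(text(), w) and ...]]:
    # look at every child of the row and each of that child's descendants.
    for child in tr:
        for el in child.iter():
            text = el.text or ""  # approximation of the first text node
            if all(word in text for word in words):
                return True
    return False

def text_two_rows_after(table, words):
    rows = table.findall("tr")
    found = []
    for i, tr in enumerate(rows):
        # rows[i + 2] plays the role of following-sibling::tr[2]
        if row_matches(tr, words) and i + 2 < len(rows):
            for td in rows[i + 2].findall("td"):
                # itertext() plays the role of descendant-or-self::text()
                found.extend(td.itertext())
    return found

table = ET.fromstring(DOC)
result = text_two_rows_after(table, ["Main", "Content"])
print(result)  # ['Just too horrible']
```

With a library that implements full XPath (lxml, for example), the selector string itself could be evaluated and none of this hand-written walking would be needed.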

You may be curious why I don’t just test contains(text(), 'Main Content') instead of splitting it into two tests. The reason is that although browsers render any run of whitespace and newline characters as a single space, those characters are still distinct in the markup. So if my table header node happens to be marked up as follows

<th><h1>Main
Content</h1></th>

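Although it is displayed identically in the browser, a selector that tests contains(text(), 'Main Content') will fail, because the text now holds a newline character where the test expects a space. Writing contains(text(), "Main\nContent") would not help on its own either, since XPath string literals have no escape sequences; a more robust test is probably to collapse the whitespace first with normalize-space(), maybe something like contains(normalize-space(text()), 'Main Content'). There is a weakness in my two-test approach though, as the order of the words is ignored. So if there is a table within the document as follows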

<table>
<tr>
<th><h1>Content not in the Main</h1></th>
</tr>
<tr>
<td><span>I am a terribly marked up HTML file</span></td>
</tr>
<tr>
<td><span id="horrible_text">Not supposed to be selected</span></td>
</tr>
</table>

Then the text ‘Not supposed to be selected’ will also be selected.
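The trade-off between the two tests can be illustrated with plain Python string operations (this is not XPath, just the same logic). The function names has_both_words and has_phrase are made up for this sketch; has_phrase collapses runs of whitespace before comparing, which is roughly what XPath's normalize-space() function does.

```python
def has_both_words(text):
    # Same idea as: contains(text(), 'Main') and contains(text(), 'Content')
    return "Main" in text and "Content" in text

def has_phrase(text):
    # Collapse every run of whitespace into one space before testing,
    # roughly like contains(normalize-space(text()), 'Main Content')
    return "Main Content" in " ".join(text.split())

headers = ["Main Content", "Main\nContent", "Content not in the Main"]

for header in headers:
    print(repr(header), has_both_words(header), has_phrase(header))
# 'Main Content' True True
# 'Main\nContent' True True
# 'Content not in the Main' True False
```

The two-word test tolerates the newline but also accepts the unwanted header, while the whitespace-collapsing test handles the newline and still respects word order.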

Last example of the day: suppose I want to select the second-column table data nodes within ‘Fancy Sidebar’ whenever the first column has the text content ‘Fizz’ or ‘Buzz’:

//td[*/descendant-or-self::*[contains(text(), 'Fizz') or contains(text(), 'Buzz')]]
/following-sibling::td[position() = 1]

The selector basically says: select any table data node (//td) that contains ([...]) a node of any type (*/descendant-or-self::*), as long as that node contains (the nested [...]) either the text ‘Fizz’ or ‘Buzz’ (contains(text(), 'Fizz') or contains(text(), 'Buzz')). From there, select its following sibling table data nodes (/following-sibling::td) and pick the first among them ([position() = 1]).
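The same walk can be sketched by hand in Python. Again, the standard-library ElementTree lacks contains() and following-sibling, so this only approximates the selector's behaviour; cell_matches and next_cells are hypothetical helper names, and the DOC string is the sidebar table from the sample with whitespace trimmed.

```python
import xml.etree.ElementTree as ET

DOC = """<table class="sidebar">
<tr><th colspan="2"><font>Fancy Sidebar</font></th></tr>
<tr><td><font>Fizz</font></td><td><a href="another_broken_link.html">Fancy menu 1</a></td></tr>
<tr><td><font>Buzz</font></td><td><a href="i_am_also_broken.html">Fancy menu 2</a></td></tr>
<tr><td><font>Bar</font></td><td><a href="some_fancy_broken_link.html">Fancy menu 3</a></td></tr>
</table>"""

def cell_matches(td, words):
    # Stand-in for td[*/descendant-or-self::*[contains(text(), w1) or ...]]
    for child in td:
        for el in child.iter():
            if any(word in (el.text or "") for word in words):
                return True
    return False

def next_cells(table, words):
    picked = []
    for tr in table.iter("tr"):
        cells = tr.findall("td")
        for i, td in enumerate(cells):
            # cells[i + 1] plays the role of following-sibling::td[position() = 1]
            if cell_matches(td, words) and i + 1 < len(cells):
                picked.append(cells[i + 1])
    return picked

table = ET.fromstring(DOC)
labels = ["".join(td.itertext()).strip() for td in next_cells(table, ["Fizz", "Buzz"])]
print(labels)  # ['Fancy menu 1', 'Fancy menu 2']
```

The ‘Bar’ row is skipped because its first cell matches neither word, and the header row is skipped because it has no td cells at all.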

If you are observant enough, you will notice that for this particular markup, where each matching cell has only one following sibling, the selector may be simplified as follows

//td[*/descendant-or-self::*[contains(text(), 'Fizz') or contains(text(), 'Buzz')]]
/following-sibling::td

The previous selector was used because I wanted to show the position() function XD (the predicate [position() = 1] can also be abbreviated to [1]).

Lastly, if you use Firebug, FireXPath is a great add-on for it that lets web developers test XPath selectors.


Comments

If performance of XPath is important, you may want to investigate vtd-xml

http://vtd-xml.sf.net

Posted by barriers on 2009-11-19 at 11:39:04