Extracting Information from Web Pages Part I - XPath

Introduction

The dramatic growth of the Internet in recent years has made it the largest and probably the most important data source currently available. The range of information that can be gathered on the web is virtually unlimited, including information about people, events, products, homemade or professional videos, pictures, scientific and technical resources and much more. In contrast to many other data sources, such as closed databases, most information on the web is freely available. Moreover, on the web, the traditional distinction between information provider and information consumer is blurred. Anyone can contribute information by creating web pages, by writing reviews or by using any of the numerous Web 2.0 applications that emerge on a daily basis.

In this tutorial we discuss how to extract information from web pages so that it becomes available for further processing, e.g. by data mining. A key technology for extracting content from the web is XPath, a language for querying structured XML documents for specific information. We first review the general structure of web content and then discuss XPath as a way to extract information from web pages. Then we show how to apply XPath from Java applications and within the data mining platform RapidMiner, using the TextInput Plugin. The latter enables you to extract and process information in a single application. Finally, we give some hints on common problems users face when using XPath on HTML or other sources. In particular, we discuss how to create XPath queries visually using the Firefox browser.

XML, HTML, XHTML and XPath

Most content on the Internet is encoded using the Hypertext Markup Language (HTML). HTML documents consist of a set of tags that represent the structure and partially the presentation of the page content.

HTML emerged in the early nineties. It was considered an application of the Standard Generalized Markup Language (SGML) and was later explicitly formulated as such. SGML is a meta-language for specifying human- and machine-readable document standards. It also provides basic tools such as parsers, which can be reused across different applications. Many of the properties of HTML are inherited from SGML, e.g. the fact that tags are not case sensitive. One major common aim of SGML and HTML is the separation of structure, content and presentation. This aim reflects the need for documents that are readable by humans and machines alike. Therefore, original HTML contained mostly logical markup defining the function of an entity in the document, not its visualization. The latter was left to the browser. Unfortunately, as HTML became more and more popular, this separation was somewhat blurred. The official standard, set by the World Wide Web Consortium (W3C), was often extended on a prototype basis: browsers such as Mosaic started to support new features, e.g. frames, even though they were not part of the standard. This also led to a mix-up of logical markup and presentation, which was clearly unsatisfactory.

A partial solution to this problem came with the introduction of style sheets. The idea of style sheets was to allow elements in HTML to be annotated with identifiers and classes in a flexible way. The expression '<h1 class="tutorial">' would express that this heading is of logical type 'tutorial'. A so-called CSS style sheet then contains the instructions that allow the browser to determine things like the font size or color of this heading. In particular, the special tags '<div>' and '<span>' were introduced, which by themselves do not have any semantic meaning. They obtain their meaning solely through annotation with a class or an identifier. These tags are used a lot in current HTML and partially even replace traditional tags such as '<h1>'.

What does this mean for our overall aim of extracting structured information from web pages? Even with the use of style sheets, structure, content and presentation are not well separated. Still, HTML tags can help us to find and extract content from a web page. In the following example we see that the title of the product is surrounded by a <div> tag that is itself a descendant of a <td> tag and has an attribute "class" with the value "product_name".

<html>
...
<table>
<tr>
<td><div class="product_name">Darjeeling Puttabong</div></td>
<td><div class="price">10 Euro</div></td>
</tr>
</table>
...
</html>

The <div> tag thus leads us directly to the desired information. How can such a query be formulated such that it can be applied automatically to extract this information from a potentially large number of product pages?

The key to this problem is a language that was developed in the mid-nineties as another application of SGML, namely XML, the eXtensible Markup Language. While technically XML can be seen as an application of SGML, in practice it has become its successor. On the one hand, XML is much simpler than SGML, which increased its acceptance by potential users. On the other hand, it is more restrictive than SGML. This made it possible, among other things, to develop more efficient parsers. Today, XML takes the role envisioned for SGML. It is applied in almost all application areas to define structured document standards, ranging from office documents to geographical markup. The relative restrictiveness of XML also made it possible to develop an efficient query language for XML documents that can reference individual tags in a document directly. This query language is called XPath. It was first introduced in the late nineties, shortly after the specification of XML. The following example is an XPath query that would extract exactly the content of the <div> tag holding the product name in the above example:

 //h:div[@class='product_name']/text()

A more detailed introduction to XPath is given below. First, however, it is necessary to clarify in which sense XPath (developed for XML) can actually be applied to HTML tags, or to put it another way: how to get XML from HTML?

How to get XML from HTML?

In principle, XPath can be applied to HTML, as HTML is a special case of XML, right? The answer is yes and no. As discussed above, HTML is an application of SGML. XML is, however, more restrictive than SGML. Thus, while HTML and XML are very similar, HTML does not meet all requirements of the XML language. For instance, HTML does not require closing tags for some markup elements (e.g. <br>), whereas XML requires every opening tag to be matched by a closing tag (or to be written as an empty-element tag such as <br/>). As this situation is not fully satisfying, there are efforts to make HTML XML compliant, most notably XHTML. XHTML was developed around the millennium and is an official standard supported by the W3C. Unfortunately, you still find a lot of pages on the Internet that are not XHTML compliant. Even worse, there are many pages that do not even fulfill the basic HTML standard. How can XPath be applied to such pages as well? The solution is to use a parser that is able to repair corrupted and non-standard HTML and to transform it into valid XHTML. If you use RapidMiner, you can skip the remainder of this subsection, as RapidMiner takes care of converting HTML to XHTML automatically by internally using such a tool.
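The difference between HTML and XML can be checked directly with a standard XML parser. The following is a minimal sketch that uses only the XML parser shipped with the JDK (not TagSoup or HTML Tidy); the class name, the method name and the two sample strings are made up for illustration. The parser rejects the markup with the unclosed <br> and accepts the XHTML-style variant.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library mentioned in this tutorial
class WellFormedCheck {

    /** Returns true if a standard XML parser can parse the given markup. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML,
            // e.g. an opening tag without a matching closing tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup: legal HTML, but not well-formed XML
        String html = "<html><body><p>Line one<br>Line two</p></body></html>";
        // The same content with the empty element closed, as XHTML requires
        String xhtml = "<html><body><p>Line one<br/>Line two</p></body></html>";

        System.out.println(isWellFormedXml(html));   // prints false
        System.out.println(isWellFormedXml(xhtml));  // prints true
    }
}
```

The JDK parser may additionally print its own error message to the console for the first string. This is exactly the kind of input that a repairing parser has to fix before XPath can be applied.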

A very popular tool that cleans up bad HTML code and makes it compliant with XHTML is HTML Tidy. It can be used to preprocess documents that are afterwards parsed by an XML parser. In this tutorial we use another tool called TagSoup. The following shows HTML code before and after clean-up.

<html>



<head></head>

<body>

<H2>Product details</h2>

<P>This is a product information
<table>
<tr>
<td>
<div class="product_name">
Darjeeling Puttabong
</div>

</table>

</body>
</html>


Original HTML

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:html="http://www.w3.org/1999/xhtml"
version="-//W3C//DTD HTML 4.01 Transitional//EN">
<head />

<body>

<h2>Product details</h2>

<p>This is a product information</p>
<table>
<tr>
<td colspan="1" rowspan="1">
<div class="product_name">
Darjeeling Puttabong
</div>
</td>
</tr>
</table>

</body>
</html>

XHTML using TagSoup

As you probably know, there are two paradigms for parsing XML. One is based on the Simple API for XML (SAX). SAX uses stream parsing and parse events to process an XML tree. In this case, information is transformed into an internal representation 'on the fly'. The other paradigm is based on the Document Object Model (DOM). In this case, the complete XML tree is read into main memory. Then the application logic can extract information from it. As you might have guessed, XPath requires a DOM tree in memory to be fully applicable. The reason is that XPath allows you to navigate an XML tree in every possible direction, while SAX only allows for linear parsing. While there have been some approaches to evaluate XPath without creating an explicit DOM representation (e.g. SXPath), they are only applicable under certain restrictions.
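The two paradigms can be compared side by side with the parsers that ship with the JDK. The following sketch is only an illustration under simple assumptions: the class and method names are invented, and the input is a small, already well-formed fragment of the product table. The SAX part sees each element exactly once as an event, while the DOM part holds the whole tree and can step from a node back to its parent.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing stream parsing and tree parsing
class SaxVersusDom {

    static final String XML =
        "<table><tr><td><div class=\"product_name\">"
        + "Darjeeling Puttabong</div></td></tr></table>";

    static InputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    /** SAX: elements arrive one by one as events, in document order. */
    static int countElementsWithSax() {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input(),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        count[0]++;  // one event per opening tag
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    /** DOM: the tree is in memory, so we can walk from the div to its parent. */
    static String parentOfFirstDivWithDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input());
            Node div = doc.getElementsByTagName("div").item(0);
            return div.getParentNode().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElementsWithSax());     // prints 4
        System.out.println(parentOfFirstDivWithDom());  // prints td
    }
}
```

Moving backwards or sideways in the tree, as in the second method, is what XPath axes such as parent or ancestor rely on, and it is not available in a pure event stream.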

The TagSoup SAX XML parser reads any HTML file and produces XHTML parse events. There are, however, several ways to use TagSoup to create a DOM tree. One is based on the JDOM package, another on the XALAN package. JDOM is an alternative API to DOM that is specifically tailored to the Java programming language. It can be seen as a response to the first Java implementations of DOM, which many Java programmers considered poor in performance and convenience. While some things have changed in the meantime, JDOM is still an interesting alternative. The following code shows how to create a document tree from HTML using JDOM.

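// This is an adapted example of the one that can be found on
// http://www.hackdiary.com/archives/000029.html
import java.io.File;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;

// Parse the HTML file using the tagsoup parser as SAX driver;
// JDOM builds its document tree from the XHTML parse events
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(new File("E:test.html"));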


To execute this example you need to have the JDOM jar files on your classpath (which can be obtained here). Also the tagsoup jar file is needed (can be obtained here). A short example of how a similar thing can be achieved with XALAN SAX2DOM class can be found here.  

RapidMiner and the TextInput Plugin (WVTool) internally use TagSoup to generate an XML DOM tree from the input documents. If you use these tools, you do not have to worry about the things discussed in this section.

Using XPath to extract information from XML and (X)HTML

For now we assume that we have obtained a DOM tree that represents the HTML document from which we would like to extract the information. We can now easily pose any XPath query. An XPath expression represents a sequence of steps from a start node (often called the context node) to another node or a set of nodes. For HTML, we assume that we always start with the root of the HTML document as start node. Each step in an XPath expression selects a set of nodes, which are then passed to the following step until all steps are processed. The selection of nodes in a step can be based on axis specifiers, node tests and predicates. An axis specifier determines in which direction to move from the current node (e.g. to its children, descendants or parent), a node test restricts the result to nodes with a given name or type, and predicates select among the remaining nodes based on attribute values, the position of the node, the text it contains, etc.
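The following sketch illustrates steps, axes and predicates using the XPath API that ships with the JDK (javax.xml.xpath) instead of JDOM and Jaxen, so that it runs without additional jar files. The class name, the helper method and the second table row are invented for illustration. The sample page declares no namespace, which is why the element names carry no prefix here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the second product row is made-up sample data
class XPathSteps {

    static final String PAGE =
        "<html><body><table>"
        + "<tr><td><div class=\"product_name\">Darjeeling Puttabong</div></td>"
        + "<td><div class=\"price\">10 Euro</div></td></tr>"
        + "<tr><td><div class=\"product_name\">Assam Mangalam</div></td>"
        + "<td><div class=\"price\">8 Euro</div></td></tr>"
        + "</table></body></html>";

    /** Evaluates the expression and returns the text of the first match. */
    static String query(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression,
                                  new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Child steps with position predicates: second row, second cell
        System.out.println(query("/html/body/table/tr[2]/td[2]/div/text()"));
        // Descendant step with an attribute predicate (first match is used)
        System.out.println(query("//div[@class='product_name']/text()"));
        // Predicate on the text content of the element
        System.out.println(query("//div[contains(.,'Assam')]/text()"));
        // Explicit axis: go up to the enclosing row, then down to its first cell
        System.out.println(
            query("//div[text()='8 Euro']/ancestor::tr/td[1]/div/text()"));
    }
}
```

Under these assumptions the four queries yield "8 Euro", "Darjeeling Puttabong", "Assam Mangalam" and "Assam Mangalam".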

Typical XPath queries on HTML

/h:html/h:body/h:table/h:tr/h:td/h:div/text()
Extract the content of the div tag in the first cell of the first table.

/h:html/h:body/h:table/h:tr[2]/h:td[3]/h:div/text()
Extract the content of the div tag from the cell in the second row and the third column.  Numbering starts with one.

//h:div[@class='product_name']/text()
Extract the content of the first div element that has the style class  'product_name'.

//h:div[contains(.,'Steeping')]/text()
Extract the content of the first div element that contains the text 'Steeping' in one of its subtree elements. The dot denotes the current node itself.

//h:div[contains(@class,'product')]/text()
Extract the content of the first div element that has a style class containing 'product'.

//h:div[@type='A' and @class='B']/text()
Extract the content of the first div element that has a style class 'B' and an attribute type with value 'A'.

It is beyond the scope of this tutorial to give a complete introduction to XPath. The table above shows some templates of XPath queries that are typical for extracting information from HTML. They can be used as a starting point for your own queries. For a reference on XPath as well as additional examples, refer to the XPath resources listed further below.

The following example shows how these queries can be applied from a Java application.

// This is an adapted example of the one that can be found on
// http://www.hackdiary.com/archives/000029.html
import java.io.File;
import org.jaxen.jdom.JDOMXPath;
import org.jdom.Document;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;

// Parse the HTML file using the tagsoup parser
SAXBuilder builder =
    new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(new File("test.html"));

// Construct the XPath query and add a namespace binding
JDOMXPath titlePath =
    new JDOMXPath("//h:div[@class='product_name']/text()");

titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");

// Extract the actual information (the first matching text node)
String productName =
    ((Text) titlePath.selectSingleNode(doc)).getText();

To run this example you need a current version of the Jaxen XPath processor on your classpath. You can download the jar files here.

The code is quite straightforward. The only thing worth noting is probably the use of namespaces. Each XML element is implicitly or explicitly bound to a namespace. This allows heterogeneous XML formats that use the same element names to be processed without naming conflicts. As namespace identifiers are quite lengthy, you can introduce short ids for them. In the case above, 'h' is used as short id for the XHTML namespace. You then have to put 'h:' in front of every element to indicate that this element belongs to the XHTML namespace.

Posing XPath queries with RapidMiner

If you use RapidMiner, the procedure is much simpler. We assume that you have downloaded all HTML files to a local directory (e.g. using a web crawler) and that you have installed RapidMiner and the TextInput Plugin (formerly known as WVTool Plugin). Now start RapidMiner, create a new experiment and add the FeatureExtraction operator to it. Then you must specify the directory or directories in which your files are located. You can do this using the 'texts' parameter of the operator. In the following example only a single directory is added:

[Screenshot: RapidMiner text selection]

Now you can specify the actual queries by clicking on the attributes parameter. Use "Add" to add a first attribute. In the left field, you must enter a name for the attribute, say "product_name". The field on the right should contain an XPath query that extracts the desired information. You can add as many rows (and thus attributes) as you want. The following is an example with 2 attributes:

[Screenshot: RapidMiner XPath attributes]

You can check whether your queries work by clicking on the "Preview" button. If everything looks fine, start the experiment. You should then see an example set in the result view that looks like this:

[Screenshot: RapidMiner example set]

You can now store the data into a file or add additional processing steps.

If you use Java to extract values from HTML, you have to deal with numeric values yourself. If you use RapidMiner, this is achieved automatically. In the example above, you see a '#' character before the attribute 'price'. This tells RapidMiner to interpret the attribute as a numeric one.

Usually, it is not possible to select only a single number with XPath. In the example above, the div element that contains the price also contains 'Euro'. Even worse, the element that contains the steeping time contains a range '2-3' instead of a single number. The WVTool uses several heuristics to extract numbers from such mixed content. First, it replaces all commas with dots. Then it tries to ignore prefixes and suffixes so as to obtain the number only. If it identifies a substring of the kind '2-3', it parses both numbers and uses the arithmetic mean as the final value. If the TextInput plugin cannot parse the value even with these approaches, it assigns the value unknown.
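If you extract values from Java yourself, a comparable clean-up step has to be written by hand. The following is a minimal sketch of the heuristics described above. It is not the WVTool implementation: the class name, the regular expressions and the use of NaN for "unknown" are assumptions made for this illustration, and negative numbers are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-implementation of the described heuristics, not WVTool code
class NumberHeuristics {

    // A range such as "2-3" or "2.5 - 3.5"
    private static final Pattern RANGE =
        Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*-\\s*(\\d+(?:\\.\\d+)?)");
    // A single number such as "10" or "3.50"
    private static final Pattern NUMBER =
        Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Returns the extracted number, or NaN if no number could be found. */
    static double parseNumber(String raw) {
        // Step 1: replace all commas with dots
        String text = raw.replace(',', '.');

        // Step 2: a range is reduced to the arithmetic mean of its bounds
        Matcher range = RANGE.matcher(text);
        if (range.find()) {
            double low = Double.parseDouble(range.group(1));
            double high = Double.parseDouble(range.group(2));
            return (low + high) / 2.0;
        }

        // Step 3: otherwise take the first number, ignoring prefix and suffix
        Matcher number = NUMBER.matcher(text);
        if (number.find()) {
            return Double.parseDouble(number.group());
        }

        // Step 4: nothing parseable, so the value is unknown
        return Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(parseNumber("10 Euro"));    // prints 10.0
        System.out.println(parseNumber("3,50 Euro"));  // prints 3.5
        System.out.println(parseNumber("2-3 min"));    // prints 2.5
        System.out.println(parseNumber("n/a"));        // prints NaN
    }
}
```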

Some hints on using XPath

XPath resources and tutorials

http://en.wikipedia.org/wiki/XPath
Wikipedia entry on XPath

http://www.w3.org/TR/xpath
The official specification

http://www.zvon.org/xxl/XPathTutorial/General/examples.html
A collection of XPath examples with short description

http://www.futurelab.ch/xmlkurs/xpath.de.html
An interactive XPath evaluator

http://www.getfirebug.com/
A Firefox plugin that allows you to create XPath queries visually

This tutorial cannot cover XPath in detail. For tutorials and references, take a look at the XPath resources listed above. In the following we discuss only some less obvious pitfalls that you should be aware of.

Namespaces are a common source of errors in XML processing. Namespaces help to combine XML from different sources in cases where elements have identical names but different meanings. In general, every element and attribute can be annotated with a namespace. In most cases a default namespace is assigned to a complete document. In the case of XHTML this is specified in the outermost tag 'html':

<html xmlns="http://www.w3.org/1999/xhtml">
...
</html>


Posing a query like

//div[@class='product']/text()

on such an HTML document would fail, as the unqualified div tag in the query is not identical to the div tag bound to the XHTML namespace. Therefore, you must use namespaces in your query and you must bind the identifiers you use in the XPath query to a namespace. These identifiers are not necessarily the same as in the HTML file. The following is the example above, using the identifier "j" instead of "h" for XHTML. The output is the same.

// This is an adapted example of the one that can be found on
// http://www.hackdiary.com/archives/000029.html

// Parse the HTML file using the tagsoup parser
SAXBuilder builder =
    new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(new File("test.html"));

// Construct the XPath query and add a namespace binding
JDOMXPath titlePath =
    new JDOMXPath("//j:div[@class='product_name']/text()");
titlePath.addNamespace("j", "http://www.w3.org/1999/xhtml");

// Extract the actual information
String productName =
    ((Text) titlePath.selectSingleNode(doc)).getText();
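
The namespace pitfall can also be reproduced without JDOM and Jaxen. The following sketch uses the XPath API of the JDK (javax.xml.xpath) on a small XHTML string; the class name and the helper method are invented for illustration. With a namespace-aware parser, the query without prefix finds nothing, while the prefixed queries work regardless of whether the prefix is called "h" or "j".

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, using the JDK XPath engine instead of Jaxen
class NamespacePitfall {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static final String PAGE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div class=\"product_name\">Darjeeling Puttabong</div>"
        + "</body></html>";

    /** Evaluates the expression with the given prefix bound to XHTML. */
    static String query(String expression, final String prefix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // off by default in the JDK
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? prefix : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    String p = getPrefix(uri);
                    return p == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singleton(p).iterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No prefix: matches only elements without a namespace, so no result
        System.out.println(
            "[" + query("//div[@class='product_name']/text()", "h") + "]");
        // Prefix bound to the XHTML namespace: the element is found
        System.out.println(query("//h:div[@class='product_name']/text()", "h"));
        // The prefix name itself is arbitrary
        System.out.println(query("//j:div[@class='product_name']/text()", "j"));
    }
}
```

The first call prints an empty pair of brackets, the other two print the product name.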

In the TextInput operator, XHTML is bound automatically to the identifier "h". If you need to apply the TextInput plugin to XML other than (X)HTML, you can add additional namespace bindings using the parameter "namespaces". The following shows additional bindings for processing the output of the hostip service.

[Screenshot: Namespaces for IP Lookup]

In general, an XPath query can match more than one element. Applied in the way described above, only the first match is used.

Given the following HTML code:

<html>
<div>Products</div>
<table>
<tr>
<td><div class="product_name">Darjeeling Puttabong</div></td>
</tr>
</table>
...
</html>

The query

//h:div/text()

would result in the value "Products", instead of the title of the product. To obtain the product name, the query has to be made more specific, e.g. by stating additionally the class of the element:

//h:div[@class='product_name']/text()
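
To see all matches of a query instead of only the first one, the result can be requested as a node set. The following sketch again uses the XPath API of the JDK; the class and method names are invented for illustration, and the sample page declares no namespace, so the 'h:' prefix is omitted here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class showing first match versus all matches
class FirstVersusAllMatches {

    static final String PAGE =
        "<html><div>Products</div><table><tr><td>"
        + "<div class=\"product_name\">Darjeeling Puttabong</div>"
        + "</td></tr></table></html>";

    /** Returns the text of every node matched by the expression. */
    static List<String> allMatches(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(PAGE)),
                XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The unspecific query matches both div elements
        System.out.println(allMatches("//div/text()"));
        // The more specific query matches only the product name
        System.out.println(allMatches("//div[@class='product_name']/text()"));
    }
}
```

Listing all matches in this way is a quick check of whether a query is specific enough before relying on its first match.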

Another common problem is the distinction between an element itself and its content or text. The query

//h:div[@class='product_name']

would return "<div class="product_name">Darjeeling Puttabong</div>" if applied in RapidMiner. The correct query is therefore again

//h:div[@class='product_name']/text()

If you use RapidMiner, you can use the Preview function to get a quick overview on whether your query worked the way you intended it.

A final hint: there is a great plugin for the Firefox browser called Firebug. It allows you to inspect your HTML interactively and visually: you can select an element in the tag tree and Firebug highlights this element in the browser. This also works the other way around: you can click on an element in the page that you are currently browsing and Firebug will select the corresponding tag in the tag tree. Firebug has a second very useful function, namely generating an XPath query for an element in the tag tree. Together, both functions make it very easy to create XPath queries for web page extraction.

In the following example, we assume that we would like to extract the name of a person from homepages of a particular university unit. First, we right click on the name of the person and select 'Inspect Element':

[Screenshot: Firebug: Inspecting an element]

Now we see in the tag tree that '<h1>Michael Wurst</h1>' is selected. We right click on this entry and choose 'Copy XPath'.

[Screenshot: Firebug: Extracting an XPath query]

Now the following XPath expression leading to the tag is in the clipboard:

/html/body/center/table/tbody/tr/td/h1

Before we can use this expression, we must first transform it a little to conform to the namespace requirements of RapidMiner mentioned above. Essentially, we need to put an 'h:' in front of each element name and append '/text()' to the end to indicate that we do not want the tags themselves in our result. The resulting query is:

/h:html/h:body/h:center/h:table/h:tbody/h:tr/h:td/h:h1/text()

This query can now be used directly as an extraction query in RapidMiner. This method allows you to create basic XPath queries even if you do not have any in-depth knowledge of XHTML or XPath.
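
If many such paths need to be converted, the transformation can be automated. The following is a small sketch under the assumption that the copied path is absolute and consists only of element steps, possibly with positions such as 'tr[2]'. The class and method names are invented for illustration.

```java
// Hypothetical helper for converting a copied Firebug path
class FirebugXPathConverter {

    /**
     * Prefixes every step of an absolute element path with the given
     * namespace prefix and appends /text() to select the text content.
     */
    static String toNamespacedTextQuery(String firebugPath, String prefix) {
        StringBuilder query = new StringBuilder();
        for (String step : firebugPath.split("/")) {
            if (step.isEmpty()) {
                continue;  // skip the empty piece before the leading slash
            }
            query.append('/').append(prefix).append(':').append(step);
        }
        return query.append("/text()").toString();
    }

    public static void main(String[] args) {
        System.out.println(toNamespacedTextQuery(
            "/html/body/center/table/tbody/tr/td/h1", "h"));
        // prints /h:html/h:body/h:center/h:table/h:tbody/h:tr/h:td/h:h1/text()
    }
}
```

Paths that contain '//' or attribute steps are outside the scope of this sketch and would need to be adapted by hand.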
