JSoup: HTML Parsing Made Easy!

Sunday, February 3, 2013 at 5:50PM

Do you ever get excited by a software library? I do. Good software is like beautiful art to me. I like to look at it, marvel at its power and revel in its simplicity. Good libraries are fun and easy to use, and they just feel right. Google Guava felt right from the first time I used it, and I just had a similar experience with another Java library - JSoup. Note that I have no affiliation with JSoup whatsoever - I just think that it’s great software that ought to be shared.

JSoup is an HTML parser (among other things), but it differs from most HTML parsers in three ways:

Its selection syntax is incredibly powerful, yet easy to understand.
It works with “real-world” HTML (in other words, HTML that isn’t well-formed).
It has no dependencies.
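
To make the second point concrete, here is a minimal sketch of JSoup parsing sloppy markup. The class name MessyHtml and the sample fragment are made up for illustration; the fragment has no html or body element, and none of its td or p tags are closed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MessyHtml {

    // Hypothetical markup: no <html> or <body>, and no closing </td>, </tr> or </p> tags.
    static final String MESSY = "<table><tr><td>a<td>b</table><p>First<p>Second";

    // Jsoup.parse builds a full document tree even from an incomplete fragment.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document document = parse(MESSY);

        // Count the elements the parser recovered from the unclosed tags.
        System.out.println("td elements: " + document.select("td").size());
        System.out.println("p elements:  " + document.select("p").size());

        // Print the repaired markup to see how the parser filled in the gaps.
        System.out.println(document.body().html());
    }
}
```

Printing the body is a handy way to see how the parser interpreted a page before you write selectors against it.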

As an example, suppose I need to parse the goals scored in an Ontario Hockey League (OHL) game, such as the one today where Kitchener beat Erie 11-4.

In an OHL game summary, goals scored are rows in an HTML table. But the game summary contains many HTML tables, so how do I specify the table I want? Well, the first row of the goals scored table is a title row that appears in HTML as follows:

<tr class="content-w">
   <td class="dark" align="center" colspan="2">
      <b class="content-w">Scoring</b>
   </td>
</tr>

Each goal follows the title row and appears in HTML as follows:

<tr class="light">
   <td><i>1</i>. ER
      <a href='player.php?id=6330'>H. Hodgson</a>, (5) (
      <a href='player.php?id=5572'>L. Cairns</a>), 2:51
   </td>
   <!-- Stuff removed for brevity. -->
</tr>

Therefore, the goal rows are tr elements with a class of light that come after a sibling tr element containing the text Scoring somewhere in its content. Here’s how easy that query is in JSoup:

final Elements elements = document.select("tr:matches(Scoring) ~ tr.light");

Amazing! Look at how powerful this one line of code is:

JSoup’s matches pseudo-selector means “elements whose text (or the text of any of their descendants) matches the specified regular expression”. (In this case, the pattern is a plain literal, so it matches any element whose text contains Scoring.)
JSoup’s dot (‘.’) selector means “element with the specified class”.
JSoup’s tilde (‘~’) combinator means “element preceded, not necessarily immediately, by the specified sibling”.

Taken together, we get exactly the HTML elements we want - one element for each goal. I don’t even want to think about doing this type of selection with other HTML parsers! You can see the full power of selection statements in the Javadoc of JSoup’s Selector class.
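
If you want to try the query yourself, here is a minimal, self-contained sketch. The class name GoalRows, the second goal row, and the extra Penalties table are made up for illustration; only the title row and the first goal row come from the markup shown earlier. It assumes the scorer is the first link in each goal row, as in that markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoalRows {

    // Trimmed-down, partly hypothetical game summary: a "Scoring" table with two goal
    // rows, plus an unrelated table whose "light" row must NOT be selected.
    static final String HTML =
          "<table>"
        + "<tr class=\"content-w\"><td class=\"dark\" align=\"center\" colspan=\"2\">"
        + "<b class=\"content-w\">Scoring</b></td></tr>"
        + "<tr class=\"light\"><td><i>1</i>. ER <a href='player.php?id=6330'>H. Hodgson</a>, (5) ("
        + "<a href='player.php?id=5572'>L. Cairns</a>), 2:51</td></tr>"
        // Placeholder row with an invented player, for illustration only.
        + "<tr class=\"light\"><td><i>2</i>. XX <a href='player.php?id=0'>Player X</a>, (1), 9:99</td></tr>"
        + "</table>"
        + "<table>"
        + "<tr class=\"content-w\"><td><b>Penalties</b></td></tr>"
        + "<tr class=\"light\"><td>Not a goal</td></tr>"
        + "</table>";

    // Returns the tr.light rows that follow the title row containing "Scoring".
    static Elements goalRows(String html) {
        Document document = Jsoup.parse(html);
        return document.select("tr:matches(Scoring) ~ tr.light");
    }

    public static void main(String[] args) {
        for (Element row : goalRows(HTML)) {
            // Assumption: the first link in a goal row is the scorer.
            Element scorer = row.select("a").first();
            String name = (scorer == null) ? "(unknown)" : scorer.text();
            System.out.println(name + " | " + row.text());
        }
    }
}
```

The row in the Penalties table is left out because it is not a sibling of the row that contains Scoring.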

Another beautiful element of JSoup’s design is that it has its own collection classes, but they implement interfaces in the Java Collections Framework. For example, the Elements class is org.jsoup.select.Elements but it implements Iterable<Element>, Collection<Element>, and List<Element>. Therefore, you can use the convenience methods provided by JSoup (such as elements.first() to get the first element in a list) but you can also treat the collection as an Iterable, Collection, or List. Brilliant!
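
Here is a small sketch of what that looks like in practice. The class name ElementsAsList, its helper methods, and the sample list markup are hypothetical; the point is only that the same Elements object works with JSoup’s convenience methods and with ordinary java.util.List code (the stream call assumes Java 8 or later).

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementsAsList {

    // Hypothetical helper: selects every li element in the given HTML.
    static Elements listItems(String html) {
        return Jsoup.parse(html).select("li");
    }

    // Accepts any List<Element>; an Elements instance can be passed in directly.
    static List<String> texts(List<Element> elements) {
        return elements.stream().map(Element::text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Elements items = listItems("<ul><li>one</li><li>two</li><li>three</li></ul>");

        Element first = items.first();    // JSoup convenience method
        Element alsoFirst = items.get(0); // plain java.util.List method
        System.out.println(first == alsoFirst);

        // Iterable: works in an enhanced for loop.
        for (Element item : items) {
            System.out.println(item.text());
        }

        // List: works with code that knows nothing about JSoup's Elements class.
        System.out.println(texts(items));
    }
}
```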

This article just scratches the surface of JSoup. I invite you to check it out - you won’t be disappointed!
