Software Blog

Sunday
Feb 03, 2013

JSoup: HTML Parsing Made Easy!

Do you ever get excited by a software library? I do. Good software is like beautiful art to me. I like to look at it, marvel at its power and revel in its simplicity. Good libraries are fun and easy to use and they just feel right. Google Guava felt right from the first time I used it and I just had a similar experience with another Java library - JSoup. Note that I have no affiliation with JSoup whatsoever - I just think that it’s great software that ought to be shared.

JSoup is an HTML parser (among other things), but it differs from most HTML parsers:

  • Its selection syntax is incredibly powerful, yet easy to understand.
  • It works with “real-world” HTML (in other words, HTML that isn’t well-formed).
  • It has no dependencies.

As an example, suppose I need to parse the goals scored in an Ontario Hockey League (OHL) game, such as the one today where Kitchener beat Erie 11-4.

In an OHL game summary, goals scored are rows in an HTML table. But the game summary contains many HTML tables, so how do I specify the table I want? Well, the first row of the goals scored table is a title row that appears in HTML as follows:

<tr class="content-w">
   <td class="dark" align="center" colspan="2">
      <b class="content-w">Scoring</b>
   </td>
</tr>

Each goal follows the title row and appears in HTML as follows:

<tr class="light">
   <td><i>1</i>. ER
      <a href='player.php?id=6330'>H. Hodgson</a>, (5) (
      <a href='player.php?id=5572'>L. Cairns</a>), 2:51
   </td>
   <!-- Stuff removed for brevity. -->
</tr>

Therefore, the rows for each goal are tr elements with a class of light that follow an earlier sibling tr element containing the text Scoring somewhere in its content. Here’s how easy that query is in JSoup:

final Elements elements = document.select("tr:matches(Scoring) ~ tr.light");

Amazing! Look at how powerful this one line of code is:

  • JSoup’s matches pseudo-selector means “elements whose text (or the text in any of its descendants) matches the specified regular expression”. (In this case, the regular expression is just the literal text Scoring, which matches any element whose text contains that word.)
  • JSoup’s dot (‘.’) operator means “element with a specified class”.
  • JSoup’s tilde (‘~’) operator means “element preceded by sibling”.

Taken together, we get exactly the HTML elements we want - one element for each goal. I don’t even want to think about doing this type of selection with other HTML parsers! You can see the full power of selection statements in the Javadoc of JSoup’s Selector class.

Another beautiful element of JSoup’s design is that it has its own collection classes, but they implement interfaces in the Java Collections Framework. For example, the Elements class is org.jsoup.select.Elements but it implements Iterable<Element>, Collection<Element>, and List<Element>. Therefore, you can use the convenience methods provided by JSoup (such as elements.first() to get the first element in a list) but you can also treat the collection as an Iterable, Collection, or List. Brilliant!

This article just scratches the surface of JSoup. I invite you to check it out - you won’t be disappointed!

Tuesday
Jan 22, 2013

Well-named methods are better than comments

Today I’m going to discuss a benefit of one of the greatest inventions in the history of computer science - the method. One reason I love writing software so much is that the field is so young. We haven’t even had a century to consider the best alternatives for writing robust software and methods are only 50 years old or so, depending on your definition. Compare that with other fields like construction. Imagine how much we had to learn about construction 50 years after the first building was ever made. I bet there was ample room for improvement! That’s the state we’re in today in software.

Methods are normally taught as a way to remove duplication and that is undoubtedly one of their benefits. However, I want to discuss another benefit of methods that is often overlooked - their ability to improve program readability.

Consider a program I’ve written that converts from National Hockey League (NHL) depth charts on RotoWorld into a comma-separated values (CSV) player file suitable for use in another program I’ve written that runs the draft for an NHL pool.

At its heart, the depth chart parser reads from an input file (the depth charts saved as a text file) and writes to an output file (the CSV file). Therefore, the first time I wrote the program’s main method, it looked like this:

public static void main(final String[] cmdLineArgs) throws IOException
{
  final String inputFileSpec = "C:/data/hockeypool/rotoWorldDepthCharts.txt";
  System.out.println("Reading players from " + inputFileSpec + "...");
  final List<Player> players = new PlayersSupplier(
        new FileReader(inputFileSpec)).get();

  final String outputFileSpec = "C:/data/hockeypool/regularSeasonPlayers.csv.txt";
  System.out.println("Writing players to " + outputFileSpec + "...");
  new PlayersWriter(new FileWriter(outputFileSpec)).write(players);
}

(Ignore the hard-coded paths for now. I haven’t yet made the program flexible enough to accept configuration as to where to read and write data.)

At first glance, there’s nothing wrong with this method. It’s pretty short. There is only one path for readers to comprehend (in other words, there are no conditional statements). However, I still think it can be improved because the lines that read from and write to files are hard to digest. There’s a lot going on there that obscures the real purpose of those lines.

Some literature encourages the use of comments in a situation like this:

public static void main(final String[] cmdLineArgs) throws IOException
{
  // Read players from the input file.
  //
  final String inputFileSpec = "C:/data/hockeypool/rotoWorldDepthCharts.txt";
  System.out.println("Reading players from " + inputFileSpec + "...");
  final List<Player> players = new PlayersSupplier(
        new FileReader(inputFileSpec)).get();

  // Write players to the output file.
  //
  final String outputFileSpec = "C:/data/hockeypool/regularSeasonPlayers.csv.txt";
  System.out.println("Writing players to " + outputFileSpec + "...");
  new PlayersWriter(new FileWriter(outputFileSpec)).write(players);
}

Don’t do this! As Robert Martin so eloquently puts it in Clean Code, comments are failures; they indicate that we have failed to express ourselves in code. We should not be proud when we write a comment. Rather, we should use comments only as a last resort when all other mechanisms for expression have failed.

As Martin recommends, let’s try to express ourselves in code by using well-named methods:

public static void main(final String[] cmdLineArgs) throws IOException
{
  final List<Player> players = readPlayers(
        "C:/data/hockeypool/rotoWorldDepthCharts.txt");
  writePlayers(players, "C:/data/hockeypool/regularSeasonPlayers.csv.txt");
}

private static List<Player> readPlayers(final String fileSpec) throws IOException
{
  System.out.println("Reading players from " + fileSpec + "...");
  return new PlayersSupplier(new FileReader(fileSpec)).get();
}

private static void writePlayers(final List<Player> players, final String fileSpec)
      throws IOException
{
  System.out.println("Writing players to " + fileSpec + "...");
  new PlayersWriter(new FileWriter(fileSpec)).write(players);
}

Doesn’t this make the main method easier to understand? The method now clearly does two things - read players and write players. If you want to know the implementation details of those operations, you can look at the private methods, but if you don’t care about those details, you can ignore them. For example, if your team took over responsibility for maintaining this class, the first time you looked at the code you should not have to burden your mind with the minutiae of how the files are read and written. Note that by introducing private methods we were also able to forgo the use of the inputFileSpec and outputFileSpec variables, further condensing the main method and making it even easier to understand. Additionally, the implementation details of reading and writing players can now more easily be refactored into their own classes, should we choose to do so.
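One more payoff of the extraction, sketched here under some assumptions: once reading and writing have names of their own, it is a small step to have them accept a Reader and a Writer, which lets them close what they open and be exercised entirely in memory. This is not the depth chart parser itself. PlayerFileSketch is a hypothetical stand-in for PlayersSupplier and PlayersWriter, a "player" is reduced to one line of text, and the sample names are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the depth chart parser: a "player" is one line of text.
final class PlayerFileSketch
{
  public static void main(final String[] cmdLineArgs) throws IOException
  {
    // Made-up sample data, read from and written to memory instead of files.
    final List<String> players = readPlayers(new StringReader("Player One\nPlayer Two\n"));
    final StringWriter csv = new StringWriter();
    writePlayers(players, csv);
    System.out.print(csv);
  }

  // Reads one player per non-blank line and closes the reader when done.
  static List<String> readPlayers(final Reader source) throws IOException
  {
    final List<String> players = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(source))
    {
      String line;
      while ((line = reader.readLine()) != null)
      {
        if (!line.trim().isEmpty())
        {
          players.add(line.trim());
        }
      }
    }
    return players;
  }

  // Writes one player per line and closes the writer when done.
  static void writePlayers(final List<String> players, final Writer destination)
      throws IOException
  {
    try (Writer writer = destination)
    {
      for (final String player : players)
      {
        writer.write(player);
        writer.write('\n');
      }
    }
  }
}
```

The main method keeps its two-step shape (read players, write players), and callers that want real files can still pass in a FileReader and a FileWriter.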

In summary, use well-named methods liberally, even if they’re only called once. They make it easier to understand your code and they’re so much better than comments!

Sunday
Jan 20, 2013

A million small decisions

Software development is all about decisions. Developers make many tiny decisions every day that add up to seriously impact the overall quality of a code base. In this example, I will discuss decisions made while attempting to solve a seemingly simple problem - How do we eliminate duplication in argument validation?

(The class used in this article is available on GitHub as part of the kblaney-assertions project, which I hope to eventually publish to the Maven Central Repository.)

The Problem

It is often recommended that public methods validate their arguments (for example, see javapractices.com). Consistently doing so leads to systems that have the admirable quality of being fail-fast. For example, suppose I have a method that accepts two Strings, neither of which is allowed to be null:

public void foo(final String s1, final String s2)
{
  if (s1 == null)
  {
    throw new IllegalArgumentException("s1 is null");
  }
  if (s2 == null)
  {
    throw new IllegalArgumentException("s2 is null");
  }

  // Remainder of method elided for brevity.
}

Before we even get into the duplication problem, we’ve made some decisions. What exception should we throw if an argument is null? I prefer to throw an IllegalArgumentException when an argument is invalid in any way (including when it is null), but others prefer to throw a NullPointerException. I prefer to leave NullPointerExceptions to indicate specifically that a null pointer was dereferenced. (Perhaps this confusion could have been avoided if NullPointerException were named NullDereferenceException? Naming really is so important to maintainability!) We’ve also made a minor decision that we should have a blank line after parameter validation. Blank lines in software should mean something and I use them in cases like this to indicate a different “paragraph” of code. Code above the blank line is parameter validation; code below the blank line is the method body that assumes that parameters are valid.
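For comparison, the JDK's own helper takes the other side of this debate: java.util.Objects.requireNonNull (available since Java 7) throws a NullPointerException when given a null argument. The sketch below only contrasts the two conventions. ExceptionChoiceSketch and its method names are made up for illustration.

```java
import java.util.Objects;

// Hypothetical class contrasting the two conventions for rejecting null arguments.
final class ExceptionChoiceSketch
{
  // The style preferred in this article: any invalid argument, null included,
  // is reported with an IllegalArgumentException.
  static int lengthOrIllegalArgument(final String s)
  {
    if (s == null)
    {
      throw new IllegalArgumentException("s is null");
    }

    return s.length();
  }

  // The style used by the JDK helper: Objects.requireNonNull throws a
  // NullPointerException that carries the supplied message.
  static int lengthOrNullPointer(final String s)
  {
    Objects.requireNonNull(s, "s is null");

    return s.length();
  }

  // Returns the simple class name of whatever the action throws, or "none".
  static String thrownBy(final Runnable action)
  {
    try
    {
      action.run();
      return "none";
    }
    catch (final RuntimeException e)
    {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(final String[] cmdLineArgs)
  {
    System.out.println(thrownBy(() -> lengthOrIllegalArgument(null)));
    System.out.println(thrownBy(() -> lengthOrNullPointer(null)));
  }
}
```

Whichever exception you pick, the two near-identical checks in foo are still there.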

There is clearly duplication here and, as we all know, duplication is the root of all evil, so as professional software developers we should remove it. But how? Time for more decisions!

The easiest solution is to introduce a private method in the same class. However, it is highly likely that the same parameter validation exists in many classes, so a private method won’t do. Therefore, we need a new class. But what should it be called?

For now, the new class will only have methods related to null-checking, so should we call it NullCheck? What about ArgNullChecker? For this decision, I consider the Single Responsibility Principle. What is the one thing our new class is going to be really good at? Well, the class is going to assert certain conditions about method arguments. For now, the only methods are related to nullness, but there are many other criteria that could conceivably become methods in the same class:

  • numbers that must be positive or negative
  • numbers that must be greater than a minimum
  • numbers that must be smaller than a maximum
  • collections that must be non-empty
  • strings that must be non-empty or non-blank

With this in mind (even though we won’t write these methods yet), let’s name the new class ArgAssert. This name uses a common short form (‘arg’ for ‘argument’) and makes it clear what the class is responsible for - asserting on arguments. With that decision made, what should we name the method that checks whether a specified argument is null?

For this decision, I consider how I would write a short description of the method (for example, the opening sentence in the method’s Javadoc). In this case, the description would be “Asserts that a specified argument is not null.” With that in mind, let’s name the method assertNotNull. What arguments should this method have?

Well, all the method does is check whether an argument is null, so the naive approach is for the method to have only one parameter:

public final class ArgAssert
{
  public static void assertNotNull(final String arg)
  {
    if (arg == null)
    {
      throw new IllegalArgumentException("Argument is null");
    }
  }
}

Note other decisions made here. What should we name the method’s parameter? I considered stringToCheck, valueToCheck, value, and s before settling on arg. We want to remain consistent in our use of the “arg” short form that we use in the class name. Should the method be static? I’m generally not a fan of static methods, because they can make in-memory unit testing a pain. (Yes, I know about the heavyweight mocking frameworks that allow one to mock static method calls. I’m not a fan of those either, but we don’t have time to get into that discussion here.) However, in this case we have a method that runs entirely in memory that nobody will ever need to mock. Therefore, it makes sense to make the method static.

Now we can remove the duplication and make our original method easier to read:

public void foo(final String s1, final String s2)
{
  ArgAssert.assertNotNull(s1);
  ArgAssert.assertNotNull(s2);

  // Remainder of method elided for brevity.
}

Note what’s missing if the assertNotNull method only has one parameter. If either argument is null, the exception’s message does not indicate the name of the argument. With an exception message alone, one can’t determine whether s1 or s2 was null. This is a significant debugging hurdle if we don’t have a stack trace that shows which lines of code were executed. Therefore, let’s allow calling classes to pass in a String that indicates the name of the argument being validated:

public final class ArgAssert
{
  public static void assertNotNull(final String arg, final String argName)
  {
    if (arg == null)
    {
      throw new IllegalArgumentException(argName + " is null");
    }
  }
}

Now our method allows calling classes to validate that strings are not null. But wait! What about methods that accept other types of objects? Shouldn’t we be able to use the same method to validate those objects? Of course! Java generics to the rescue!

public final class ArgAssert
{
  public static <T> void assertNotNull(final T arg, final String argName)
  {
    if (arg == null)
    {
      throw new IllegalArgumentException(argName + " is null");
    }
  }
}

Note that we made the method generic, not the class. That’s so that different invocations of the method can have T represent different types:

public void foo(final String s, final Bar bar)
{
  ArgAssert.assertNotNull(s, "s");
  ArgAssert.assertNotNull(bar, "bar");

  // Remainder of method elided for brevity.
}

I guess we’re done now, right? Nope. Not yet. Our method can still be significantly improved. Consider a constructor that needs to validate its parameters before storing them in members:

public final class A
{
  private final B b;
  private final C c;

  public A(final B b, final C c)
  {
    ArgAssert.assertNotNull(b, "b");
    ArgAssert.assertNotNull(c, "c");

    this.b = b;
    this.c = c;
  }
}

Wouldn’t it be convenient if this constructor could combine argument validation and storing the argument in a member? It can if the assertNotNull method returns the validated parameter:

public final class A
{
  private final B b;
  private final C c;

  public A(final B b, final C c)
  {
    this.b = ArgAssert.assertNotNull(b, "b");
    this.c = ArgAssert.assertNotNull(c, "c");
  }
}

That looks pretty nice, so let’s make that change. Here is the final result, including Javadoc:

/**
 * Provides static utility methods that make assertions about arguments.
 * When assertions fail, an {@code IllegalArgumentException} is thrown.
 */
public final class ArgAssert
{
  /**
   * Asserts that a specified argument is not null. If the argument is null,
   * this method throws an {@code IllegalArgumentException}. If the argument
   * is not null, this method returns it.
   * 
   * @param <T> the type of the argument to check
   * @param arg the argument to check
   * @param argName the name of the argument (that gets included in the
   * {@code IllegalArgumentException} that is thrown if {@code arg} is null)
   * 
   * @return the non-null argument
   */
  public static <T> T assertNotNull(final T arg, final String argName)
  {
    if (arg == null)
    {
      throw new IllegalArgumentException(argName + " is null");
    }
    return arg;
  }
}

With this class in place, we are able to write concise parameter validation early in our public methods.
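To make the earlier list of candidate assertions concrete, here is a sketch of how a few of them might look if they follow the same pattern of throwing an IllegalArgumentException or returning the validated argument. Only assertNotNull comes from this article. The class name ArgAssertSketch, the other method names, and the message wording are assumptions, not part of the published class.

```java
import java.util.Collection;

// Hypothetical extension of the article's ArgAssert. Only assertNotNull comes
// from the article; the other method names and messages are assumptions.
final class ArgAssertSketch
{
  // Asserts that an argument is not null and returns it (as in the article).
  static <T> T assertNotNull(final T arg, final String argName)
  {
    if (arg == null)
    {
      throw new IllegalArgumentException(argName + " is null");
    }
    return arg;
  }

  // Asserts that a number is greater than zero and returns it.
  static int assertPositive(final int arg, final String argName)
  {
    if (arg <= 0)
    {
      throw new IllegalArgumentException(argName + " is not positive: " + arg);
    }
    return arg;
  }

  // Asserts that a collection is neither null nor empty and returns it.
  static <C extends Collection<?>> C assertNotEmpty(final C arg, final String argName)
  {
    if (assertNotNull(arg, argName).isEmpty())
    {
      throw new IllegalArgumentException(argName + " is empty");
    }
    return arg;
  }

  // Asserts that a string is not null and has a non-whitespace character.
  static String assertNotBlank(final String arg, final String argName)
  {
    if (assertNotNull(arg, argName).trim().isEmpty())
    {
      throw new IllegalArgumentException(argName + " is blank");
    }
    return arg;
  }
}
```

Because every method returns its argument, a constructor could still validate and assign in one line, for example this.name = ArgAssertSketch.assertNotBlank(name, "name").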

Summary

To summarize, think of all the decisions we had to make just for this simple method:

  • What should we name the class?
  • What should we name the method?
  • How many parameters should the method accept?
  • What should be the type of the method’s parameters?
  • What should we name the method parameters?
  • Should the method return a value?
  • What should the method’s return type be?
  • Should the method be static?
  • Should the method be generic?
  • Should the class be generic?
  • What type of exception should the method throw if it encounters a null object?
  • What should the exception’s message be if the method encounters a null object?
  • Where should I put blank lines to make the method easier to comprehend?

We didn’t discuss the following decisions, but they have to be made too:

  • What package does the class belong in?
  • What project does the class belong in? In other words, should the class be part of a separate artifact?
  • Should the method validate its parameters?

Now, consider that we must make these decisions again for every new method. Further, we must be open to future decisions impacting the decisions we have already made.

When you write software, don’t just dump some code into an editor, get things working and then think you’re done. Question the decisions you made while writing the code and get an outside opinion on your significant decisions, even before a formal code review. Doing so will lead to more maintainable software. Future maintainers will thank you!

Monday
Dec 10, 2012

Meet Google Guava's Optional class

Google Guava is a great library that should be part of every Java developer’s arsenal. As its user guide states, it includes classes for “collections, caching, primitives support, concurrency libraries, common annotations, string processing, I/O, and so forth.” I often peruse Google Guava’s Javadoc. There’s just so much good stuff in there! I can learn about API design and discover ways to improve my software, and seeing such beautiful code just makes me feel good. (Yeah, I know. That’s about as cool as driving a moped on the sidewalk while wearing khaki flood pants with a ketchup stain from a street meat hot dog but whatever… it’s what I do!)

I stumbled upon another gem in Google Guava last week while improving my program that creates gamesheets for Ontario Hockey League games. When I attend an OHL game (and I often do… go Bulls!) I take gamesheets that include details about the players on both teams. The details include player name, player type (rookie or veteran), season statistics and streaks, biographical details, and sweater number. It’s the last detail that introduced me to Google Guava’s Optional class.

Optional is a simple immutable generic class that stores an optional non-null value. As its Javadoc states, Optional “allows you to represent ‘a T that must be present’ and ‘a T that might be absent’ as two distinct types in your program, which can aid clarity.” For hockey players, sweater numbers are sometimes not known. For example, when a player is traded to a new team, he can’t always use the sweater number of his old team because a more veteran player on his new team might already use it. Until he plays his first game for his new team, his sweater number is often unknown. So, what Java type should I use to represent a player’s sweater number?

Well, sweater numbers in hockey are positive integers ranging from 1 to 99, so I started out with the obvious - int. For example, the Player class contained the following method:

class Player
{
  int getSweaterNum()
  // Other methods elided for brevity.
}

This poses a problem for players with an unknown sweater number. What should getSweaterNum return for such a player? I initially took the lazy approach and used a magic number (0). However, I forgot to check for that magic number in my code that uses the Player class. It blindly called getSweaterNum and put 0 in the sweater number column of my gamesheets. How ugly! Rather than print a 0, the sweater number column should remain blank so that I can fill it in at the game when I see what number the player wears during the warmup.

There are a few approaches I could use to solve this problem:

  • Define a constant (say Player.UNKNOWN_SWEATER_NUM = 0) for an unknown sweater number.
  • Use an Integer instead of an int for sweater number and use null to indicate that the sweater number is unknown.
  • Create a SweaterNum class to encapsulate the concept of a sweater number.

I don’t really like any of these approaches. The first approach leads to ugly code at the calling site and doesn’t make it any more likely that the caller will remember to check for the magic value. The second approach violates the best practice of not returning null (see Clean Code: A Handbook of Agile Software Craftsmanship by Robert Martin - it’s a fantastic book!). The third approach seems a bit like overkill.

Instead of these approaches, I chose to use Optional<Integer> for the return value of Player.getSweaterNum:

class Player
{
  Optional<Integer> getSweaterNum()
  // Other methods elided for brevity.
}

This increases clarity; human readers know by the method’s return type that a player’s sweater number is optional. This makes it less likely that calling classes will forget to check whether a sweater number is known. Calling classes do the following instead:

private String getSweaterNumCellValue(final Optional<Integer> sweaterNum)
{
  if (sweaterNum.isPresent())
  {
    return Integer.toString(sweaterNum.get());
  }
  else
  {
    return EMPTY_TABLE_CELL;
  }
}

The method that determines a player’s sweater number does the following:

try
{
  final String sweaterNumString = // get from OHL website
  return Optional.of(Integer.parseInt(sweaterNumString));
}
catch (final NumberFormatException e)
{
  return Optional.absent();
}

This is by no means a perfect solution as it reveals some of Java’s clunkiness. For example, it would be nicer if we could use Optional<int> but, alas, Java does not allow that syntax. However, I think it’s a good compromise. What do you think?
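On the Optional<int> point: Java 8, released after this was written, added java.util.OptionalInt to the JDK as a primitive counterpart to its own Optional class. The sketch below uses that JDK class rather than Guava's Optional, so OptionalInt.empty() plays the role of Optional.absent() and getAsInt() the role of get(). The class name SweaterNumSketch, the method name parseSweaterNum, and the empty-string value of EMPTY_TABLE_CELL are assumptions for illustration.

```java
import java.util.OptionalInt;

// Hypothetical sketch combining the two snippets above with java.util.OptionalInt.
final class SweaterNumSketch
{
  // Assumed value; the article does not show how EMPTY_TABLE_CELL is defined.
  static final String EMPTY_TABLE_CELL = "";

  // Returns the sweater number, or an empty OptionalInt if the text is not a number.
  static OptionalInt parseSweaterNum(final String sweaterNumString)
  {
    try
    {
      return OptionalInt.of(Integer.parseInt(sweaterNumString));
    }
    catch (final NumberFormatException e)
    {
      return OptionalInt.empty();
    }
  }

  // Returns the text for the sweater number column, blank if the number is unknown.
  static String getSweaterNumCellValue(final OptionalInt sweaterNum)
  {
    if (sweaterNum.isPresent())
    {
      return Integer.toString(sweaterNum.getAsInt());
    }
    else
    {
      return EMPTY_TABLE_CELL;
    }
  }

  public static void main(final String[] cmdLineArgs)
  {
    System.out.println("[" + getSweaterNumCellValue(parseSweaterNum("19")) + "]");
    System.out.println("[" + getSweaterNumCellValue(parseSweaterNum("")) + "]");
  }
}
```

The calling code keeps the same isPresent check, and no boxing to Integer is needed.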

Wednesday
Aug 22, 2012

The False Economies of Software

I’ve been thinking a lot lately about how software companies can fall victim to false economies. Companies can take actions that save money in the short term, but have detrimental long-term impacts. The most telling examples are:

  • Use of low-cost offshore contractors
  • Failure to provide employees with the tools they need to get their job done
  • Inadequate hiring practices

Offshore Contractors

Offshore contractors seem like an attractive solution to a cost-conscious business. Their salaries are much lower than those of North American employees with the same job title, so why not use them? I’ll tell you why not! The true long-term cost of using offshore contractors is difficult to measure, but is typically very high because of:

  • Location difficulties
  • Language barriers
  • Skill level differences

It’s difficult enough to work on a team split across multiple locations. It’s even more challenging when some members are separated from others by many time zones. Given the location of prevalent contracting companies, it’s even possible that contractors never end up working at the same time as their full-time counterparts. This causes untold delays. Instead of discussing a problem right away in person with a colleague, an employee sends an email to a faraway land and it takes an entire day for it to be read and acted upon.

Delays are further compounded by language barriers. It is often the case that contractors don’t have the same primary language as their full-time counterparts, so more clarity than normal is needed in communication, which costs time and money. Even more formality is required when there is a disparity in skill level between cheap contractors and their full-time counterparts.

Inadequate Tools

It’s easy for companies to think they are saving money by restricting spending on tools. However, the true cost of inadequate tools is wasted time and low employee morale. For example, if a company doesn’t buy its developers modern computers, builds take longer and developers spend more time waiting (and moving on to other activities like browsing the Internet and looking for a new place to work!). When developers go to lunch together, they share horror stories about their poor equipment and this further feeds the downward spiral of employee morale.

Inadequate Hiring Practices

The hiring process should be a great experience for applicants. Don’t cheap out and expect an interviewee to pay for their own transportation or lunch. Even if you wouldn’t hire a candidate to pick the weeds out of your Aunt Ida’s carrot garden, treat them like royalty. You never know who their friends are! You want them to tell everyone they know how great an experience they had and how amazing it would be if they were hired to work at your company.