Processing (Perl & LWP) - e-Reading Library

Once you have parsed some HTML, you need to process it. Exactly what you do will depend on the nature of your problem. Two common models are extracting information and producing a transformed version of the HTML (for example, to remove banner advertisements).

Whether extracting or transforming, you'll probably want to find the bits of the document you're interested in. They might be all headings, all bold italic regions, or all paragraphs with class="blinking". HTML::Element provides several functions for searching the tree.

9.3.1. Methods for Searching the Tree

In scalar context, these methods return the first node that satisfies the criteria. In list context, all such nodes are returned. The methods can be called on the root of the tree or any node in it.

$node->find_by_tag_name(tag [, ...])

Return node(s) for tags of the names listed. For example, to find all h1 and h2 nodes:

@headings = $root->find_by_tag_name('h1', 'h2');
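The difference between scalar and list context is easy to overlook, so here is a minimal sketch of both. The sample HTML and the variable names are invented for illustration; only find_by_tag_name and as_text come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<h1>One</h1><p>text</p><h2>Two</h2><h2>Three</h2>'
);

# List context: every h1 and h2 node, in document order.
my @headings = $root->find_by_tag_name('h1', 'h2');

# Scalar context: only the first h2 node (undef if there is none).
my $first_h2 = $root->find_by_tag_name('h2');

print scalar(@headings), " headings found\n";
print "First h2: ", $first_h2->as_text, "\n" if $first_h2;
```

Checking the scalar result for definedness before calling a method on it avoids a fatal error when nothing matches.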

$node->find_by_attribute(attribute, value)

Returns the node(s) with the given attribute set to the given value. For example, to find all nodes with class="blinking":

@blinkers = $root->find_by_attribute("class", "blinking");

$node->look_down(...)

$node->look_up(...)

These two methods search $node and its children (and children's children, and so on) in the case of look_down, or its parent (and the parent's parent, and so on) in the case of look_up, looking for nodes that match whatever criteria you specify. The parameters are either attribute => value pairs (where the special attribute _tag represents the tag name), or a subroutine that is passed a current node and returns true to indicate that this node is of interest.

For example, to find all h2 nodes in the tree with class="blinking":

@blinkers = $root->look_down(_tag => 'h2', class => 'blinking');

We'll discuss look_down in greater detail later.
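As a rough sketch of the two kinds of criteria, the following mixes an attribute => value pair with a subroutine in one look_down call, then uses look_up to find an enclosing element. The sample HTML, the "story" class, and the variable names are assumptions made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<div class="story"><h2 class="blinking">News</h2>'
  . '<p>See <a href="http://example.com/">this</a> and <a name="x">that</a>.</p></div>'
);

# Pairs and subroutines can be combined; a node must satisfy
# every criterion to match. Here: a elements that have an href.
my @links = $root->look_down(
  _tag => 'a',
  sub { defined $_[0]->attr('href') },
);
print $_->attr('href'), "\n" for @links;

# look_up checks the starting node first, then walks toward the root.
my $story = $links[0]->look_up(_tag => 'div', class => 'story');
print "Enclosing tag: ", $story->tag, "\n" if $story;
```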

9.3.2. Attributes of a Node

Four methods give access to the basic information in a node:

$node->tag( ): The tag name string of this element. Example values: html, img, blockquote. Note that this is always lowercase.
$node->parent( ): This returns the node object that is the parent of this node. If $node is the root of the tree, $node->parent( ) will return undef.
$node->content_list( ): This returns the (potentially empty) list of nodes that are this node's children.
$node->attr(attributename): This returns the value of the HTML attributename attribute for this element. If there is no such attribute for this element, this returns undef. For example: if $node is parsed from <img src="x1.jpg" alt="Looky!">, then $node->attr("src") will return the string x1.jpg.
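To see the four accessors together, here is a small sketch built around the img example above. The surrounding p element and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<p>Look: <img src="x1.jpg" alt="Looky!"></p>'
);

my $img = $root->look_down(_tag => 'img');

my $tag    = $img->tag;             # tag name, always lowercase
my $src    = $img->attr('src');     # value of an attribute that is present
my $width  = $img->attr('width');   # undef, because there is no such attribute
my $parent = $img->parent;          # the enclosing p element

print "tag: $tag\n";
print "src: $src\n";
print "width: ", (defined $width ? $width : '(none)'), "\n";
print "parent: ", $parent->tag, "\n";

# Children are either element objects or plain text strings.
foreach my $child ($parent->content_list) {
  if (ref $child) {
    print "element child: ", $child->tag, "\n";
  } else {
    print "text child: $child\n";
  }
}
```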

Four more methods convert a tree or part of a tree into another format, such as HTML or text.

$node->as_HTML([ entities [, indent_char [, optional_end_tags ]]]);

Returns a string consisting of the node and its children as HTML. The entities parameter is a string containing characters that should be entity escaped (if empty, all potentially unsafe characters are encoded as entities; if you pass just <>&, just those characters will get encoded—a bare minimum for valid HTML). The indent_char parameter is a string used for indenting the HTML. The optional_end_tags parameter is a reference to a hash that has a true value for every key that is the name of a tag whose closing tag is optional. The most common value for this parameter is {} to force all tags to be closed:

$html = $node->as_HTML("", "", {});

For example, this will emit </li> tags for any li nodes under $node, even though </li> tags are technically optional, according to the HTML specification.

Using $node->as_HTML( ) with no parameters should be fine for most purposes.

$node->as_text( )

Returns a string consisting of all the text nodes from this element and its children.

$node->starttag([entities])

Returns the HTML for the start-tag for this node. The entities parameter is a string of characters to entity escape, as in the as_HTML( ) method; you can omit this. For example, if this node came from parsing <TD class=loud>Hooboy</TD>, then $node->starttag( ) returns <td class="loud">. Note that the original source text is not reproduced exactly, because insignificant differences, such as the capitalization of the tag name or attribute names, will have been discarded during parsing.

$node->endtag( )

Returns the HTML for the end-tag for this node. For example, if this node came from parsing <TD class=loud>Hooboy</TD>, then $node->endtag( ) returns </td>.
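A short sketch of these conversion methods applied to one node follows. The table wrapper and the nested b element are assumptions added so that the cell has some structure to flatten; the td itself mirrors the example above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<table><tr><TD class=loud>Hoo<b>boy</b></TD></tr></table>'
);

my $cell = $root->look_down(_tag => 'td');

my $start = $cell->starttag;   # normalized start-tag
my $text  = $cell->as_text;    # text of the node and its descendants
my $end   = $cell->endtag;     # end-tag
my $html  = $cell->as_HTML;    # the whole subtree, serialized again

print "$start\n$text\n$end\n";
print $html, "\n";
```

Note that as_text flattens the nested b element away, while as_HTML keeps it.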

These methods are useful once you've found the desired content. Example 9-4 prints all the bold italic text in a document.

Example 9-4. Bold-italic headline printer

#!/usr/bin/perl -w

use HTML::TreeBuilder;
use strict;

my $root = HTML::TreeBuilder->new_from_content(<<"EOHTML");
<b><i>Shatner wins Award!</i></b>
Today in <b>Hollywood</b> ...
<b><i>End of World Predicted!</i></b>
Today in <b>Washington</b> ...
EOHTML
$root->eof( );

# print contents of <b><i>...</i></b>
my @bolds = $root->find_by_tag_name('b');
foreach my $node (@bolds) {
  my @kids = $node->content_list( );
  if (@kids and ref $kids[0] and $kids[0]->tag( ) eq 'i') {
    print $kids[0]->as_text( ), "\n";
  }
}

Example 9-4 is fairly straightforward. Having parsed the string into a new tree, we get a list of all the bold nodes. Some of these will be the headlines we want, while others will simply be bolded text. In this case, we can identify headlines by checking that the first node a bold node contains represents <i>...</i>. If it is an italic node, we print its text content.

The only complicated part of Example 9-4 is the test to see whether it's an interesting node. This test has three parts:

@kids: True if there are children of this node. An empty <b></b> would fail this test.
ref $kids[0]: True if the first child of this node is an element. This is false in cases such as <b>Washington</b>, where the first (and here, only) child is text. If we fail to check this, the next expression, $kids[0]->tag( ), would produce an error when $kids[0] isn't an object value.
$kids[0]->tag( ) eq 'i': True if the first child of this node is an i element. This would weed out anything like <b><img src="shatner.jpg"></b>, where $kids[0]->tag( ) would return img, or <b><strong>Yes, Shatner!</strong></b>, where $kids[0]->tag( ) would return strong.
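The same three-part test can also be handed to look_down as a subroutine criterion, so that the filtering happens during the search. This is only a sketch of one alternative, not a replacement for Example 9-4; the @headlines name is made up here, and the HTML is the sample from the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content(<<'EOHTML');
<b><i>Shatner wins Award!</i></b>
Today in <b>Hollywood</b> ...
<b><i>End of World Predicted!</i></b>
Today in <b>Washington</b> ...
EOHTML

# Keep only the b nodes whose first child is an i element.
my @headlines = $root->look_down(
  _tag => 'b',
  sub {
    my @kids = $_[0]->content_list;
    return (@kids && ref($kids[0]) && $kids[0]->tag eq 'i');
  },
);

print $_->as_text, "\n" for @headlines;
```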

9.3.3. Traversing

For many tasks, you can use the built-in search functions. Sometimes, though, you'd like to visit every node of the tree. You have two choices: you can use the existing traverse( ) function or write your own using either recursion or your own stack.

The act of visiting every node in a tree is called a traversal. Traversals can either be preorder (where you process the current node before processing its children) or postorder (where you process the current node after processing its children). The traverse( ) method lets you do both:

$node->traverse(callbacks [, ignore_text]);

The traverse( ) method calls a callback before processing the children and again afterward. If the callbacks parameter is a single function reference, the same function is called before and after processing the children. If the callbacks parameter is an array reference, the first element is a reference to a function called before the children are processed, and the second element is similarly called after the children are processed, unless this node is a text segment or an element that is prototypically empty, such as br or hr. (This last quirk of the traverse( ) method is one of the reasons that I discourage its use.)

Callbacks get called with three values, plus two more when the node is a text segment:

sub callback {
  my ($node, $startflag, $depth,
      $parent, $my_index) = @_;
  # ...
}

The current node is the first parameter. The next is a Boolean value indicating whether we're being called before (true) or after (false) the children, and the third is a number indicating how deep into the traversal we are. The fourth and fifth parameters are supplied only for text elements: the parent node object and the index of the current node in its parent's list of children.

A callback can return any of the following values:

HTML::Element::OK (or any true value): Continue traversing.
HTML::Element::PRUNE (or any false value): Do not go into the children. The postorder callback is not called. (Ignored if returned by a postorder callback.)
HTML::Element::ABORT: Abort the traversal immediately.
HTML::Element::PRUNE_UP: Do not go into this node's children or into its parent node.
HTML::Element::PRUNE_SOFTLY: Do not go into the children, but do call this node's postorder callback.
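To make the calling convention and return values concrete, here is a sketch that builds an indented outline of a tree using separate preorder and postorder callbacks, pruning the head element along the way. The sample HTML and the @lines array are invented for this illustration; passing a true ignore_text value keeps text segments out of the callbacks.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<html><head><title>T</title></head>'
  . '<body><h1>Hi</h1><p>One<br>two</p></body></html>'
);

my @lines;
$root->traverse(
  [
    sub {   # preorder: called before the children
      my ($node, $startflag, $depth) = @_;
      # Skip head and everything under it; its postorder call is skipped too.
      return HTML::Element::PRUNE() if $node->tag eq 'head';
      push @lines, ('  ' x $depth) . '<' . $node->tag . '>';
      return HTML::Element::OK();
    },
    sub {   # postorder: not called for empty elements such as br
      my ($node, $startflag, $depth) = @_;
      push @lines, ('  ' x $depth) . '</' . $node->tag . '>';
      return HTML::Element::OK();
    },
  ],
  1,        # ignore_text: do not call the callbacks for text segments
);

print "$_\n" for @lines;
```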

For example, to extract text from a node but not go into table elements:

my $text;
sub text_no_tables {
  return if ref $_[0] && $_[0]->tag eq 'table';
  $text .= $_[0] unless ref $_[0];  # only append text nodes
  return 1;                         # all is copacetic
}

$root->traverse([\&text_no_tables]);

This prevents descent into the contents of tables, while accumulating the text nodes in $text.

It can be hard to think in terms of callbacks, though, and the multiplicity of return values and calling parameters you get with traverse( ) makes for confusing code, as you will likely note when you come across its use in existing programs that use HTML::TreeBuilder.

Instead, it's usually easier and clearer to simply write your own recursive subroutine, like this one:

my $text = '';
sub scan_for_non_table_text {
  my $element = $_[0];
  return if $element->tag eq 'table';   # prune!
  foreach my $child ($element->content_list) {
    if (ref $child) {  # it's an element
      scan_for_non_table_text($child);  # recurse!
    } else {           # it's a text node!
      $text .= $child;
    }
  }
  return;
}
scan_for_non_table_text($root);

Alternatively, implement it using a stack, doing the same work:

my $text = '';
my @stack = ($root);  # where to start
 
while (@stack) {
  my $node = shift @stack;
  next if ref $node and $node->tag eq 'table';  # skip tables
  if (ref $node) {
    unshift @stack, $node->content_list;        # add children
  } else {
    $text .= $node;                             # add text
  }
}

The while( ) loop version can be faster than the recursive version, but at the cost of being much less clear to people who are unfamiliar with this technique. If speed is a concern, you should always benchmark the two versions to make sure you really need the speedup and that the while( ) loop version actually delivers. The speed difference is sometimes insignificant. The manual page perldoc HTML::Element::traverse discusses writing more complex traverser routines, in the rare cases where you might find this necessary.
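One way to run such a comparison is with the standard Benchmark module. The sketch below is only an assumption about how you might set it up: the sample HTML and the names text_recursive and text_stack are invented here, and both subroutines are reworked to return the text rather than append to a shared variable, so that each run starts fresh.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Benchmark qw(cmpthese);

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(
  '<p>Keep this.</p><table><tr><td>Skip this.</td></tr></table><p>And this.</p>'
);

# Recursive version: returns the non-table text under $element.
sub text_recursive {
  my ($element) = @_;
  return '' if $element->tag eq 'table';   # prune
  my $text = '';
  foreach my $child ($element->content_list) {
    $text .= ref($child) ? text_recursive($child) : $child;
  }
  return $text;
}

# Stack version: same result, no recursion.
sub text_stack {
  my @stack = ($_[0]);
  my $text  = '';
  while (@stack) {
    my $node = shift @stack;
    if (!ref $node) { $text .= $node; next; }   # text segment
    next if $node->tag eq 'table';              # skip tables
    unshift @stack, $node->content_list;        # children go to the front
  }
  return $text;
}

# Run each version for at least one CPU second and print a comparison table.
cmpthese(-1, {
  recursive => sub { text_recursive($root) },
  stack     => sub { text_stack($root) },
});
```

A tiny document like this one says little about real pages, so measure with trees of the size you actually process.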

