6.6. Example: Extracting Links from Arbitrary HTML

Suppose that the links we want to check are in a remote HTML file that's not quite as rigidly formatted as my local bookmark file. Suppose, in fact, that a representative section looks like this:

<p>Dear Diary,
<br>I was listening to <a href="">Fresh
Air</a> the other day and they had <a href
="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on,
and he was going on about how he wrote some kinda
<a href="">program</a> or something.  If
he's so smart, why didn't he write something useful, like <a
href="why_I_love_tetris.html">Tetris</a> or <a href="../minesweeper_hints/"
>Minesweeper</a>, huh?

In the case of the bookmarks, we noted that links were each alone on a line, all absolute, and each capturable with m/ HREF="([^"\s]+)" /. But none of those things are true here! Some links (such as href="why_I_love_tetris.html") are relative, some lines have more than one link in them, and one link even has a newline between its href attribute name and its ="..." attribute value.

Regexps are still usable, though—it's just a matter of applying them to a whole document (instead of to individual lines) and also making the regexp a bit more permissive:

while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
  my $url = $1;
  # ...do something with $url...
}

(The /g modifier ("g" originally for "globally") on the regexp tries to match the pattern as many times as it can, each time picking up where the last match left off.)
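To see the whole-document approach in isolation, here is a minimal sketch that runs the same pattern over a string instead of a fetched page. The HTML snippet and its example.com URL are placeholders invented for illustration, not part of the diary page; the snippet deliberately puts two links on one line and splits one href from its value with a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a fetched page (placeholder URLs).
my $document = <<'END_HTML';
<p>See <a href="http://www.example.com/one.html">one</a> and <a
href="two.html">two</a>, plus <a href
="../three/">three</a>.
END_HTML

my @found;
# In a while condition, /g makes each match resume where the last one
# ended; /i accepts HREF as well as href.
while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
  push @found, $1;
}

print "$_\n" for @found;
# Prints:
#   http://www.example.com/one.html
#   two.html
#   ../three/
```

Because the pattern is applied to the document as one string, the line breaks inside the tags are just more whitespace for \s to match.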

Example 6-5 shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on it.

Example 6-5. diary-link-checker

#!/usr/bin/perl -w
# diary-link-checker - check links from diary page

use strict;
use LWP;

my $doc_url = "";
my $document;
my $browser;
init_browser( );

{  # Get the page whose links we want to check:
  my $response = $browser->get($doc_url);
  die "Couldn't get $doc_url: ", $response->status_line
    unless $response->is_success;
  $document = $response->content;
  # The response's base URL, in case we need to resolve
  # relative URLs later
  $doc_url = $response->base;
}

while ($document =~ m/href\s*=\s*"([^"\s]+)"/gi) {
  my $absolute_url = absolutize($1, $doc_url);
  check_url($absolute_url);
}

sub absolutize {
  my($url, $base) = @_;
  use URI;
  # Resolve $url against $base, then normalize the result
  return URI->new_abs($url, $base)->canonical;
}

sub init_browser {
  $browser = LWP::UserAgent->new;
  # ...And any other initialization we might need to do...
  return $browser;
}

sub check_url {
  # A temporary placeholder...
  print "I should check $_[0]\n";
}

When run, this prints:

I should check
I should check http://www.cs.Helsinki.FI/u/torvalds/
I should check
I should check
I should check

So our while (regexp) loop is indeed successfully matching all five links in the document. (Note that our absolutize routine is correctly making the URLs absolute, resolving relative URLs such as why_I_love_tetris.html and ../minesweeper_hints/ against the document's base URL, by using the URI class that we explained in Chapter 4, "URLs".)
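The absolutizing step can also be tried on its own. The sketch below assumes a made-up base URL, http://www.example.com/diary/page.html, chosen only for illustration (it is not the diary page's real address), and resolves the two relative links against it with the same URI->new_abs call that absolutize uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical base URL, standing in for the fetched document's URL.
my $base = 'http://www.example.com/diary/page.html';

my @resolved;
for my $relative ('why_I_love_tetris.html', '../minesweeper_hints/') {
  # new_abs resolves $relative against $base; canonical normalizes it.
  my $absolute = URI->new_abs($relative, $base)->canonical;
  push @resolved, "$absolute";   # stringify the URI object
  print "$relative => $absolute\n";
}
# Prints:
#   why_I_love_tetris.html => http://www.example.com/diary/why_I_love_tetris.html
#   ../minesweeper_hints/ => http://www.example.com/minesweeper_hints/
```

Note how the ../ in the second link climbs out of the base URL's diary/ directory before descending into minesweeper_hints/.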

Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from Example 6-4, and it will actually check the URLs that our placeholder check_url routine promised we'd check.

