Rewrite for Features (Perl & LWP)

8.6.1. Debuggability

The greatest change is the introduction of all the lines with "DEBUG" in them. Because the DEBUG constant is declared with value 0, every test of whether DEBUG is nonzero is always false, so none of these lines is ever run; in fact, the Perl compiler removes them from the parse tree of this program, so they're discarded the moment they're parsed. (Incidentally, there's nothing magic about the name "DEBUG"; you can call it "TRACE" or "Talkytalky" or "_mumbles" or whatever you want. Using all caps is just a matter of convention.) So, with a DEBUG value of 0, when you run this program, it simply prints this:

Listen to Current Show
  http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
  http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
  http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
  http://www.npr.org/ramfiles/fa/20010702.fa.02.ram

(That first link is superfluous, but we'll deal with that in a bit; otherwise, it all works okay.) So these DEBUG lines do nothing. And when we deploy the above program with some code that harvests the pages instead of working from the local test page, the DEBUG lines will continue to do nothing. But suppose that, months later, the program just stops working. That is, it runs, but prints nothing, and we don't know why. Did NPR change the Fresh Air site so much that the old program listings' URLs no longer serve any content? Or has some part of the format changed? If we just change DEBUG => 0 to DEBUG => 1 and rerun the program, we can see that parse_fresh_stream( ) is definitely being called on a stream from an HTML page, because we see the messages from the print statements in that routine:

About to parse stream with base
http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001
End of stream

Change the DEBUG level to 2, and we get more detailed output:

About to parse stream with base
http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001
Considering {<A HREF="index.cfm">}
Host is no good in http://freshair.npr.org/index.cfm
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20011011.fa.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20011011.fa.prok
Considering {<A HREF="dayFA.cfm?todayDate=current">}
[...]
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.prok
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.01.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.01.prok
Considering {<A HREF="http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn">}
Host is no good in http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.02.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.02.prok
Considering {<A HREF="http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin">}
Host is no good in http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin
Considering {<A HREF="dayFA.cfm?todayDate=06%2F29%2F2001">}
Host is no good in http://freshair.npr.org/dayFA.cfm?todayDate=06%2F29%2F2001
Considering {<A HREF="dayFA.cfm?todayDate=07%2F03%2F2001">}
Host is no good in http://freshair.npr.org/dayFA.cfm?todayDate=07%2F03%2F2001
End of stream

Our parse_fresh_stream( ) routine is still correctly rejecting index.cfm and the like, for having a "no good" host (i.e., not www.npr.org). And we can see that it's happening on those "ramfiles" links, and it's not rejecting their host, because they are on www.npr.org. But it rejects their paths. When we look back at the code that triggers rejection based on the path, it kicks in only when the path fails to match m{/ramfiles/.*\.ram$}. Why don't our ramfiles paths match that regexp anymore? Ah ha, because they don't end in .ram anymore; they end in .prok, some new audio format that NPR has switched to! This is evident at the end of the lines beginning "Path is no good." Change our regexp to accept .prok, rerun the program, and go about our business. Similarly, if the audio files moved to a different server, we'd be alerted to their host being "no good" now, and we could adjust the regexp that checks that.
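The regexp adjustment described above might look something like the following sketch. The helper name path_is_good( ) is made up for illustration, and accepting both .ram and .prok is an assumption about what we'd want; in the real program, the pattern lives inline in parse_fresh_stream( ).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the path looks like a Fresh Air
# audio file, 0 otherwise. Accepting .prok alongside .ram is an
# assumption based on the debug output shown above.
sub path_is_good {
  my($path) = @_;
  return $path =~ m{/ramfiles/.*\.(?:ram|prok)$} ? 1 : 0;
}

for my $path (qw(
  /ramfiles/fa/20010702.fa.ram
  /ramfiles/fa/20010702.fa.prok
  /index.cfm
)) {
  printf "%-32s %s\n", $path, path_is_good($path) ? 'good' : 'no good';
}
```

Keeping the old .ram alternative in the pattern means the program still works against the saved local test page.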

We had to make some fragile assumptions to tell interesting links apart from uninteresting ones, but having all these DEBUG statements means that when the assumptions no longer hold, we can quickly isolate the problem.
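Stripped of everything specific to Fresh Air, the leveled-DEBUG idiom can be sketched as below. The check_host( ) routine is a hypothetical stand-in for the checks in parse_fresh_stream( ), not code from the program itself.

```perl
use strict;
use warnings;
use constant DEBUG => 0;   # 0 = silent, 1 = milestones, 2 = every decision

# Hypothetical stand-in for one of the link checks.
sub check_host {
  my($host) = @_;
  DEBUG and print "Checking host $host\n";            # shown at level 1 and up
  unless($host eq 'www.npr.org') {
    DEBUG > 1 and print "Host is no good: $host\n";   # shown only at level 2 and up
    return 0;
  }
  return 1;
}

print check_host('www.npr.org')      ? "accepted\n" : "rejected\n";
print check_host('freshair.npr.org') ? "accepted\n" : "rejected\n";
```

With DEBUG at 0, only the accepted/rejected lines appear; raising the constant to 1 or 2 turns on the corresponding print statements without touching any other code.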

8.6.2. Images and Applets

Speaking of assumptions, what about the fact that (back to our pre-.prok local test file and setting DEBUG back to 0) we get an extra link at the start of the output here?

Listen to Current Show
  http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
  http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
  http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
  http://www.npr.org/ramfiles/fa/20010702.fa.02.ram

If we go to our browser and use the "Find in Page" function to see where "Listen to Current Show" appears in the rendered page, we'll probably find no match. So where's it coming from? Try the same search on the source, and you'll see:

<A HREF="http://www.npr.org/ramfiles/fa/20011011.fa.ram">
  <IMG SRC="images/listen.gif" ALT="Listen to Current Show"
    WIDTH="124" HEIGHT="47" BORDER="0" HSPACE="0" VSPACE="0">
</A>

Recall that get_text( ) and get_trimmed_text( ) give special treatment to img and applet elements; they treat them as virtual text tags with contents from their alt values (or in the absence of any alt value, the strings [IMG] or [APPLET]). That might be a useful feature normally, but it's bothersome now. So we turn it off by adding this line just before our while loop starts reading from the stream:

$stream->{'textify'} = {};

We know that's the line to use partly because I mentioned it as an aside much earlier, and partly because it's in the HTML::TokeParser manpage (where you can also read about how to do things with the textify feature other than just turn it off). With that change made, our program prints this:

??
  http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
  http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
  http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
  http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
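To see the effect of textify in isolation, here is a small sketch that parses just the image-only link from the page source, once with the default settings and once with textify turned off. The first_link_text( ) helper and the trimmed-down HTML snippet are made up for this illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A cut-down copy of the image-only link from the test page.
my $html = <<'END_HTML';
<A HREF="http://www.npr.org/ramfiles/fa/20011011.fa.ram">
  <IMG SRC="images/listen.gif" ALT="Listen to Current Show">
</A>
END_HTML

# Hypothetical helper: return the trimmed text of the first link,
# optionally with the textify feature switched off.
sub first_link_text {
  my($textify_off) = @_;
  my $stream = HTML::TokeParser->new(\$html)
   || die "What, couldn't make a stream?!";
  $stream->{'textify'} = {} if $textify_off;
  $stream->get_tag('a') or return '';
  return $stream->get_trimmed_text('/a');
}

printf "default textify: {%s}\n", first_link_text(0);
printf "textify off:     {%s}\n", first_link_text(1);
```

The first call should report the alt text as if it were link text; the second should report an empty string, which is what triggers the ?? in the output above.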

8.6.3. Link Text

That ?? is there because when the first link had no link text (and we're no longer counting alt text), it caused get_trimmed_text( ) to return an empty string. That is a false value in Perl, so it causes the fallthrough to ?? here:

my $text = $stream->get_trimmed_text('/a') || "??";

If we want to explicitly skip things with no link text, we change that to:

my $text = $stream->get_trimmed_text('/a');
unless(length $text) {
  DEBUG > 1 and print "Skipping link with no link-text\n";
  next;
}

That makes the program give this output, as we wanted it:

Listen to Monday - July 2, 2001
  http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
  http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
  http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
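A side benefit of testing length instead of relying on || is that link text consisting of just "0" is also false in Perl, so the || version would replace it with ??, while the length test keeps it. The sketch below compares the two approaches; both helper names are invented for illustration and are not part of the program.

```perl
use strict;
use warnings;

# Hypothetical helpers mimicking the two ways of handling link text.
sub label_with_or {       # the original "|| '??'" approach
  my($text) = @_;
  return $text || "??";
}

sub label_with_length {   # the explicit-skip approach; undef means "skip"
  my($text) = @_;
  return length($text) ? $text : undef;
}

for my $text ('Listen to Monday', '', '0') {
  my $verdict = defined label_with_length($text) ? 'kept' : 'skipped';
  printf "{%s}: || gives {%s}; length test: %s\n",
    $text, label_with_or($text), $verdict;
}
```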

8.6.4. Live Data

All we need to do to actually pull data from the Fresh Air web site is comment out the code that calls the local test file and substitute some simple code to get the data for a block of days. Here is the whole program source, with those changes and additions:

use strict;
use constant DEBUG => 0;
use HTML::TokeParser;

#parse_fresh_stream(
#  HTML::TokeParser->new('fresh1.html') || die($!),
#  'http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001'
#);

scan_last_month( );

sub scan_last_month {
  use LWP::UserAgent;
  my $browser = LWP::UserAgent->new( );
  foreach my $date_mdy (weekdays_last_month( )) {
    my $url = sprintf(
     'http://freshair.npr.org/dayFA.cfm?todayDate=%02d%%2f%02d%%2f%04d',
     @$date_mdy
    );
    DEBUG and print "Getting @$date_mdy URL $url\n";
    sleep 3; # Don't hammer the NPR server!
    my $response = $browser->get($url);
    unless($response->is_success) {
      print "Error getting $url: ", $response->status_line, "\n";
      next;
    }
    my $stream = HTML::TokeParser->new($response->content_ref)
     || die "What, couldn't make a stream?!";
    parse_fresh_stream($stream, $response->base);
  }
}

sub weekdays_last_month { # Boring date handling. Feel free to skip.
  my($now) = time;
  my $this_month = (gmtime $now)[4];
  my(@out, $last_month, $that_month);

  do { # Get to end of last month.
    $now -= (24 * 60 * 60); # go back a day
    $that_month = (gmtime $now)[4];
  } while($that_month == $this_month);
  $last_month = $that_month;

  do { # Go backwards thru last month
    my(@then) = (gmtime $now);
    unshift @out, [$then[4] + 1 , $then[3], $then[5] + 1900] # m,d,yyyy
      unless $then[6] == 0 or $then[6] == 6;
    $now -= (24 * 60 * 60); # go back one day
    $that_month = (gmtime $now)[4];
  } while($that_month == $last_month);
  return @out;
}

# Unchanged since you last saw it:
sub parse_fresh_stream {
  use URI;
  my($stream, $base_url) = @_;
  DEBUG and print "About to parse stream with base $base_url\n";

  while(my $a_tag = $stream->get_tag('a')) {
    DEBUG > 1 and printf "Considering {%s}\n", $a_tag->[3];
    my $url = URI->new_abs( ($a_tag->[1]{'href'} || next), $base_url);
    unless($url->scheme eq 'http') {
      DEBUG > 1 and print "Scheme is no good in $url\n";
      next;
    }
    unless($url->host =~ m/www\.npr\.org/) {
      DEBUG > 1 and print "Host is no good in $url\n";
      next;
    }
    unless($url->path =~ m{/ramfiles/.*\.ram$}) {
      DEBUG > 1 and print "Path is no good in $url\n";
      next;
    }
    DEBUG > 1 and print "IT'S GOOD!\n";
    my $text = $stream->get_trimmed_text('/a') || "??";
    printf "%s\n  %s\n", $text, $url;
  }
  DEBUG and print "End of stream\n";
  return;
}
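If the sprintf format in scan_last_month( ) looks cryptic, this sketch shows what it produces for a single date. The day_url( ) wrapper is hypothetical; the format string is copied from the program above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the format string from scan_last_month().
# Each "%%" yields a literal "%", so "%%2f" becomes "%2f", a
# URL-encoded slash between the month, day, and year.
sub day_url {
  my($month, $day, $year) = @_;
  return sprintf(
    'http://freshair.npr.org/dayFA.cfm?todayDate=%02d%%2f%02d%%2f%04d',
    $month, $day, $year
  );
}

print day_url(7, 2, 2001), "\n";
# prints http://freshair.npr.org/dayFA.cfm?todayDate=07%2f02%2f2001
```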
