So far in this book, we've been using one type of user-agent object: objects of the class LWP::UserAgent. This is generally appropriate for a program that makes only a few undemanding requests of a remote server. But for cases in which we want to be quite sure that the robot behaves itself, the best way to start is by using LWP::RobotUA instead of LWP::UserAgent.
An LWP::RobotUA object is like an LWP::UserAgent object, with these exceptions:
Instead of calling $browser = LWP::UserAgent->new( ), you call:
$robot = LWP::RobotUA->new( 'botname/1.2', '[email protected]' )
Specify a reasonably unique name for the bot (with an X.Y version number) and an email address where you can be contacted about the program, if anyone needs to do so.
When you call $robot->get(...) or any other method that performs a request (head( ), post( ), request( ), simple_request( )), LWP calls sleep( ) to wait until enough time has passed since the last request was made to that server.
When you request anything from a given HTTP server using an LWP::RobotUA $robot object, LWP will make sure it has consulted that server's robots.txt file, where the server's administrator can stipulate that certain parts of his server are off limits to some or all bots. If you request something that's off limits, LWP won't actually request it, and will return a response object with a 403 (Forbidden) error, with the explanation "Forbidden by robots.txt."
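To tell a robots.txt refusal apart from a 403 that the server itself sent, one option is to look at the status message as well as the code. What follows is only a minimal sketch: the helper name blocked_by_robots_txt is hypothetical, and it assumes the message contains the "Forbidden by robots.txt" text quoted above. The two response objects are built by hand as stand-ins so that the example runs without touching the network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns 1 if the response looks like the 403
# that LWP::RobotUA produces instead of sending a forbidden request,
# and 0 otherwise.
sub blocked_by_robots_txt {
    my ($response) = @_;
    return ( $response->code == 403
          && $response->message =~ /robots\.txt/i ) ? 1 : 0;
}

# Hand-built stand-ins for what $robot->get($url) might return.
my $blocked = HTTP::Response->new( 403, 'Forbidden by robots.txt' );
my $denied  = HTTP::Response->new( 403, 'Forbidden' );

for my $response ( $blocked, $denied ) {
    print $response->status_line, ': ',
        ( blocked_by_robots_txt($response)
            ? "skipped because of robots.txt\n"
            : "refused for some other reason\n" );
}
```

In a real robot, you would pass the object returned by $robot->get($url) to the helper rather than a hand-built response.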
For specifics on robots.txt files, see the documentation for the LWP module called WWW::RobotRules, and also be sure to read http://www.robotstxt.org/wc/robots.html.
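To get a feel for what a robots.txt rule does, you can hand WWW::RobotRules a robots.txt body yourself and ask it about particular URLs. This is a small offline sketch, not a picture of what LWP::RobotUA does internally; the host www.example.com, the /private/ path, and the robots.txt content are all made up for illustration.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# The rules object needs to know our bot's name so that it can match
# any User-agent lines addressed to us.
my $rules = WWW::RobotRules->new('botname/1.2');

# An invented robots.txt that puts /private/ off limits to all bots.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Associate that content with the (hypothetical) host it came from.
$rules->parse( 'http://www.example.com/robots.txt', $robots_txt );

# allowed() returns a true value for URLs we may fetch.
my $ok  = $rules->allowed('http://www.example.com/index.html');
my $off = $rules->allowed('http://www.example.com/private/notes.html');

print "index.html:         ", ( $ok  ? "allowed" : "off limits" ), "\n";
print "private/notes.html: ", ( $off ? "allowed" : "off limits" ), "\n";
```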
Besides having all the attributes of an LWP::UserAgent object, an LWP::RobotUA object has one additional interesting attribute, $robot->delay($minutes), which controls how long this object should wait between requests to the same host. The current default value is one minute. Note that you can set it to a non-integer number of minutes. For example, to set the delay to seven seconds, use $robot->delay(7/60).
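Because delay( ) works in minutes, it can help to convert back to seconds when checking what you have set. Here is a brief sketch that makes no requests at all; the email address is a placeholder, not a real contact.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder bot name and contact address, for illustration only.
my $robot = LWP::RobotUA->new( 'botname/1.2', 'me@example.com' );

# Called with no arguments, delay() reports the current value in minutes.
my $default = $robot->delay;

# Ask for seven seconds between requests to the same host.
$robot->delay( 7 / 60 );
my $seconds = $robot->delay * 60;

printf "default delay: %s minute(s); new delay: %.0f seconds\n",
    $default, $seconds;
```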
So we can take our New York Times program from Chapter 11, "Cookies, Authentication, and Advanced Requests", and make it into a scrupulously well-behaved robot by changing this one line:
my $browser = LWP::UserAgent->new( );
to this:
use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'JamiesNYTBot/1.0',
  '[email protected]'   # my address
);
$browser->delay(5/60);   # 5 second delay between requests
We may not notice any particular effect on how the program behaves, but it makes quite sure that the $browser object won't perform its requests too quickly, nor request anything the Times's webmaster thinks robots shouldn't request.
In new programs, I typically use $robot as the variable for holding LWP::RobotUA objects instead of $browser. But this is a merely cosmetic difference; nothing requires us to replace every $browser with $robot in the Times program when we change it from using an LWP::UserAgent object to an LWP::RobotUA object.
You can freely use LWP::RobotUA anywhere you could use LWP::UserAgent, in a Type One or Type Two spider. And you really should use LWP::RobotUA as the basis for any Type Three or Type Four spiders. You should use it not just so you can effortlessly abide by robots.txt rules, but also so that you don't have to remember to write sleep statements all over your programs to keep them from using too much of the remote server's bandwidth, or yours!
Copyright © 2002 O'Reilly & Associates. All rights reserved.