
Learning Perl, 3rd Edition

16.4. Variable-length (Text) Databases

Many simple databases are merely text files written in a format that allows a program to read and maintain them. For example, a configuration file for some program might be a text file, with one configuration parameter being set on each line. Or maybe the file is a mailing list, with one name and address on each line (probably with the components of the name and address separated by tab characters).

Updating text files is more difficult than it probably seems at first. But that's only because we're used to seeing text files rendered as pages (or screens) of text. If you could see the file as it is written in the filesystem, the difficulty would be more apparent. Since we can't show you the file as it's actually written without opening up a disk drive, here's our rendition of a piece of a text file[356]:

[356]Of course, the real file wouldn't have lines at all; it's one long stream of text. And the newline character should really be a single-character code. But these differences don't hurt this as an example.

He had bought a large map representing the sea,\n  Without the l
east vestige of land:\nAnd the crew were much pleased when they 
found it to be\n  A map they could all understand.\n\n"What's th
e good of Mercator's North Poles and Equators,\n  Tropics, Zones
, and Meridian Lines?"\nSo the Bellman would cry: and the crew w
ould reply\n  "They are merely conventional signs!\n\n"Other map
s are such shapes, with their islands and capes!\n  But we've go
t our brave Captain to thank:"\n(So the crew would protest) "tha
t he's bought us the best-\n  A perfect and absolute blank!"\n\n

If you had this file open in your text editor, it would be easy to change a word, add a comma, or fix a misspelling. If your editor is powerful enough, in fact, you could change the indentation of each line with a single command. But the text file is a stream of bytes; if you wanted to add even a single comma, the remainder of the text file (possibly thousands or millions of bytes) would have to move over to make room. Nearly every tiny change would mean lots of slow copying operations on the file. So how can we edit the file efficiently?

The most common way of programmatically updating a text file is by writing an entirely new file that looks similar to the old one, but making whatever changes we need as we go along. As you'll see, this technique gives nearly the same result as updating the file itself, but it has some beneficial side effects as well.
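Before we let Perl do the work for us, here is a minimal hand-rolled sketch of that copy-and-modify idea. It is not the program this section builds; the sub name rewrite_file, the ".new" scratch extension, and the sample text are all made up for illustration, and it assumes a system where rename may replace an existing file (as on Unix).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy $old to a new file, passing every line through $edit, then
# move the new file into place. All names here are illustrative.
sub rewrite_file {
  my ($old, $edit) = @_;
  my $new = "$old.new";                  # hypothetical scratch name

  open my $in,  '<', $old or die "Can't read $old: $!";
  open my $out, '>', $new or die "Can't write $new: $!";
  while (my $line = <$in>) {
    print {$out} $edit->($line);         # write the (possibly changed) line
  }
  close $in;
  close $out or die "Can't finish $new: $!";

  rename $new, $old or die "Can't replace $old: $!";
}

# Demonstration on a throwaway file in a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.txt";
open my $fh, '>', $file or die "Can't create $file: $!";
print {$fh} "He had bought a large map\n", "representing the sea\n";
close $fh;

# Add a comma to the end of the first line only.
rewrite_file($file, sub {
  my $line = shift;
  $line =~ s/map$/map,/;
  return $line;
});
```

Nothing in the old file ever has to "move over" to make room for the comma; every byte is simply written once to the new file.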

For our example, suppose we've got hundreds of files with a similar format. One of them is fred03.dat, and it's full of lines like these:

Program name: granite
Author: Gilbert Bates
Company: RockSoft
Department: R&D
Phone: +1 503 555-0095
Date: Tues March 9, 1999
Version: 2.1
Size: 21k
Status: Final beta

We need to fix this file so that it has some different information. Here's roughly what this one should look like when we're done:

Program name: granite
Author: Randal L. Schwartz
Company: RockSoft
Department: R&D
Date: June 12, 2002 6:38 pm
Version: 2.1
Size: 21k
Status: Final beta

In short, we need to make three changes: the name of the Author should be changed, the Date should be updated to today's date, and the Phone should be removed completely. And we have to make these changes in hundreds of similar files as well.

Perl supports a way of in-place editing of files with a little extra help from the diamond operator ("<>"). Here's a program to do what we want, although it may not be obvious how it works at first. This program's only new feature is the special variable $^I; ignore that for now, and we'll come back to it:

#!/usr/bin/perl -w

use strict;

chomp(my $date = `date`);
@ARGV = glob "fred*.dat" or die "no files found";
$^I = ".bak";

while (<>) {
  s/^Author:.*/Author: Randal L. Schwartz/;
  s/^Phone:.*\n//;
  s/^Date:.*/Date: $date/;
  print;
}

Since we need today's date, the program starts by using the system date command. A better way to get the date (in a slightly different format) would almost surely be to use Perl's own localtime function in a scalar context:

my $date = localtime;

To get the list of files for the diamond operator, we read them from a glob. The next line sets $^I, but keep ignoring that for the moment.

The main loop reads, updates, and prints one line at a time. (With what you know so far, that means that all of the files' newly modified contents will be dumped to your terminal, scrolling furiously past your eyes, without the files being changed at all. But stick with us.) Note that the second substitution can replace the entire line containing the phone number with an empty string -- leaving not even a newline -- so when that's printed, nothing comes out, and it's as if the Phone never existed. Most input lines won't match any of the three patterns, and those will be unchanged in the output.
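To watch the three substitutions at work without touching any files, here is a small sketch that runs them over a few sample lines held in an array. The fixed $date string and the $result variable are stand-ins chosen for this illustration; the real program gets its lines from the diamond operator and its date from the system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = (
  "Program name: granite\n",
  "Author: Gilbert Bates\n",
  "Phone: +1 503 555-0095\n",
  "Date: Tues March 9, 1999\n",
  "Version: 2.1\n",
);
my $date = "June 12, 2002 6:38 pm";    # fixed stand-in for today's date

my $result = '';
for my $line (@lines) {
  my $copy = $line;                    # leave @lines itself untouched
  $copy =~ s/^Author:.*/Author: Randal L. Schwartz/;
  $copy =~ s/^Phone:.*\n//;            # removes the line, newline and all
  $copy =~ s/^Date:.*/Date: $date/;
  $result .= $copy;                    # stands in for print
}

print $result;
```

Because the dot doesn't match a newline, the Author and Date substitutions leave the line ending alone; only the Phone pattern, which names the newline explicitly, takes it away.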

So this result is close to what we want, except that we haven't shown you how the updated information gets back out onto the disk. The answer is in the variable $^I. By default it's undef, and everything is normal. But when it's set to some string, it makes the diamond operator ("<>") even more magical than usual.

We already know about much of the diamond's magic -- it will automatically open and close a series of files for you, or read from the standard-input stream if there aren't any filenames given. But when there's a string in $^I, that string is used as a backup filename's extension. Let's see that in action.

Let's say it's time for the diamond to open our file fred03.dat. It opens it like before, but now it renames it, calling it fred03.dat.bak.[357] We've still got the same file open, but now it has a different name on the disk. Next, the diamond creates a new file and gives it the name fred03.dat. That's okay; we weren't using that name any more. And now the diamond selects the new file as the default for output, so that anything that we print will go into that file.[358]

[357]Some of the details of this procedure will vary on non-Unix systems, but the end result should be nearly the same. See the release notes for your port of Perl.

[358]The diamond also tries to duplicate the original file's permission and ownership settings as much as possible; for example, if the old one was world-readable, the new one should be, as well.

So now the while loop will read a line from the old file, update that, and print it out to the new file. This program can update hundreds of files in a few seconds on a typical machine. Pretty powerful, huh?
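If you'd like to see the trick without the magic, the steps can be imitated by hand for a single file. This is only a rough sketch of the sequence described above (open, rename, create, select), not Perl's actual implementation; the sub name edit_with_backup and its callback argument are invented for the example, and it doesn't try to copy permissions or ownership.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Roughly what the diamond does for one file when $^I is ".bak".
# The sub name and its arguments are made up for this sketch.
sub edit_with_backup {
  my ($file, $ext, $edit) = @_;
  my $backup = "$file$ext";

  open my $in, '<', $file or die "Can't open $file: $!";
  rename $file, $backup or die "Can't rename $file: $!";  # $in is still open
  open my $out, '>', $file or die "Can't create $file: $!";
  my $old_fh = select $out;        # a plain print now goes to the new file

  while (<$in>) {
    $edit->();                     # the callback changes $_ in place
    print;
  }

  select $old_fh;                  # put the previous default output back
  close $out or die "Can't close $file: $!";
  close $in;
}

# Demonstration on a throwaway copy of a few lines from fred03.dat.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fred03.dat";
open my $fh, '>', $file or die "Can't create $file: $!";
print {$fh} "Author: Gilbert Bates\n",
            "Phone: +1 503 555-0095\n",
            "Version: 2.1\n";
close $fh;

edit_with_backup($file, ".bak", sub {
  s/^Author:.*/Author: Randal L. Schwartz/;
  s/^Phone:.*\n//;
});
```

Renaming a file that is still open for reading works on Unix-like systems; as the footnote says, the details may differ elsewhere.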

Once the program has finished, what does the user see? The user says, "Ah, I see what happened! Perl edited my file fred03.dat, making the changes I needed, and saved me a copy of the original in the backup file fred03.dat.bak just to be helpful!" But we now know the truth: Perl didn't really edit any file. It made a modified copy, said "Abracadabra!", and switched the files around while we were watching sparks come out of the magic wand. Tricky.

Some folks use a tilde ("~") as the value for $^I, since that resembles what emacs does for backup files. Another possible value for $^I is the empty string. This enables in-place editing, but doesn't save the original data in a backup file. But since a small typo in your pattern could wipe out all of the old data, using the empty string is recommended only if you want to find out how good your backup tapes are. It's easy enough to delete the backup files when you're done. And when something goes wrong and you need to rename the backup files to their original names, you'll be glad that you know how to use Perl to do that (see the multiple-file rename example in Chapter 13, "Manipulating Files and Directories").
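As a reminder of what that recovery might look like, here is one possible sketch that renames every backup in a directory back to its original name. It isn't the Chapter 13 example itself; the sub name restore_backups is hypothetical, it reads the directory with opendir rather than a glob, and it assumes rename may overwrite the damaged file (as on Unix).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Rename every "NAME$ext" in $dir back to "NAME", replacing the
# edited file. Returns the number of files restored.
sub restore_backups {
  my ($dir, $ext) = @_;
  my $restored = 0;

  opendir my $dh, $dir or die "Can't read $dir: $!";
  for my $name (readdir $dh) {     # list context: whole listing read first
    next unless $name =~ /\A(.+)\Q$ext\E\z/;
    my $original = $1;
    if (rename "$dir/$name", "$dir/$original") {
      $restored++;
    } else {
      warn "Can't restore $original: $!\n";
    }
  }
  closedir $dh;
  return $restored;
}

# Demonstration: a damaged file sitting next to its good backup.
my $dir = tempdir(CLEANUP => 1);
for my $pair (["fred01.dat", "oops\n"],
              ["fred01.dat.bak", "Author: Gilbert Bates\n"]) {
  my ($name, $text) = @$pair;
  open my $fh, '>', "$dir/$name" or die "Can't create $name: $!";
  print {$fh} $text;
  close $fh;
}

my $count = restore_backups($dir, ".bak");
print "Restored $count file(s)\n";
```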

16.4.1. In-place Editing from the Command Line

A program like the example from the previous section is fairly easy to write. But Larry decided it wasn't easy enough.

Imagine that you need to update hundreds of files that have the misspelling Randall instead of the one-l name Randal. You could write a program like the one in the previous section. Or you could do it all with a one-line program, right on the command line:

$ perl -p -i.bak -w -e 's/Randall/Randal/g' fred*.dat

Perl has a whole slew of command-line options that can be used to build a complete program in a few keystrokes.[359] Let's see what these few do.

[359]See the perlrun manpage for the complete list.

Starting the command with perl does something like putting #!/usr/bin/perl at the top of a file does: it says to use the program perl to process what follows.

The -p option tells Perl to write a program for you. It's not much of a program, though; it looks something like this:[360]

[360]Actually, the print occurs in a continue block. See the perlsyn and perlrun manpages for more information.

while (<>) { print; }

If you want even less, you could use -n instead; that leaves out the print statement. (Fans of awk will recognize -p and -n.) Again, it's not much of a program, but it's pretty good for the price of a few keystrokes.
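The continue block mentioned in the footnote makes a visible difference when the code you supply skips ahead with next. Here's a small sketch of the loop's shape as the perlrun manpage gives it, reading from an in-memory filehandle instead of the diamond so that it can run on its own; the sample names and the two statements in the loop body are stand-ins for whatever -e code you'd write.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input  = "fred\nbarney\nwilma\n";
my $output = '';
open my $in,  '<', \$input  or die "Can't read from string: $!";
open my $out, '>', \$output or die "Can't write to string: $!";

# The shape of the -p loop, reading $in where -p would use <>.
LINE:
while (<$in>) {
  next LINE if /barney/;     # stand-in for -e code: skip the rest...
  tr/a-z/A-Z/;
}
continue {
  print {$out} $_ or die "-p destination: $!\n";   # ...but still print
}
close $out;

print $output;
```

The barney line skips the tr but is printed anyway, because the print lives in the continue block. Under -n there is no print at all, so the same next would simply drop the line unless your own code printed it first.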

The next option is -i.bak, which you might have guessed sets $^I to ".bak" before the program starts. If you don't want a backup file, you can use -i alone, with no extension.

We've seen -w before -- it turns on warnings.

The -e option says "executable code follows." That means that the s/Randall/Randal/g string is treated as Perl code. Since we've already got a while loop (from the -p option), this code is put inside the loop, before the print. For technical reasons, the last semicolon in the -e code is optional. But if you have more than one -e, and thus more than one chunk of code, only the semicolon at the end of the last one may safely be omitted.

The last command-line parameter is fred*.dat, which says that @ARGV should hold the list of filenames that match that glob. Put the pieces all together, and it's as if we had written a program like this:

#!/usr/bin/perl -w

@ARGV = glob "fred*.dat";
$^I = ".bak";

while (<>) {
  s/Randall/Randal/g;
  print;
}

Compare this program to the one we used in the previous section. It's pretty similar. These command-line options are pretty handy, aren't they?




Copyright © 2002 O'Reilly & Associates. All rights reserved.