Book HomePerl CookbookSearch this book

6.15. Greedy and Non-Greedy Matches

Problem

You have a pattern with a greedy quantifier like *, +, ?, or {}, and you want to stop it from being greedy.

A classic case of this is the na�ve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi, actually deletes everything from the first open TT tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even effectively", completely changing the meaning of the sentence!

Solution

Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and {} into *?, +?, ??, and {}?, respectively.

Discussion

Perl has two sets of quantifiers: the maximal ones *, +, ?, and {} (sometimes called greedy) and the minimal ones *?, +?, ??, and {}? (sometimes called stingy). For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains" whereas /(r.*?s)/ matches "rl is".

With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the "or more" portion of that description. Thus /foo.*bar/ matches from the first "foo" up to the last "bar" in the string, rather than merely the next "bar", as some might expect. To make any of the regular expression repetition operators prefer stingy matching over greedy matching, add an extra ? . So *? matches zero or more times, but rather than match as much as it possibly can the way * would, it matches as little as possible.

# greedy pattern
s/<.*>//gs;                     # try to remove tags, very badly

# non-greedy pattern
s/<.*?>//gs;                    # try to remove tags, still rather badly

This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is not an acceptable replacement for a real parser. See Recipe 20.6 for the right way to do this.

Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern amidst other elements will always match the shortest amount of text between occurrences of BEGIN and END. Imagine the pattern /BEGIN(.*?)END/. If matched against the string "BEGIN and BEGIN and END", $1 would contain "and BEGIN and". This is probably not what you want.

Imagine if we were trying to pull out everything between bold-italic pairs:

<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic HTML pairs, that is, text that doesn't include them, might appear to be this one:

m{ <b><i>(.*?)</i></b> }sx

You might be surprised to learn that the pattern doesn't do that. Many people incorrectly understand this as matching a "<b><i>" sequence, then something that's not "<b><i>", and then "</i></b>", leaving the intervening text in $1. While often it works out that way due to the input data, that's not really what it says. It just matches the shortest leftmost substring that satisfies the entire pattern. In this case, that's the entire string. If the intention were to extract only stuff between "<b><i>" and its corresponding "</i></b>", with no other bold-italic tags in between, it would be incorrect.

If the string in question is just one character, a negated class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. But the general way to say "match BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN and END is as follows (this also stores the intervening part in $1):

/BEGIN((?:(?!BEGIN).)*)END/

Applying this to the HTML-matching code, we end up with something like:

m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }sx

or perhaps:

m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx

Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed really matters, such as:

m{
    <b><i> 
    [^<]*  # stuff not possibly bad, and not possibly the end.
    (?:
 # at this point, we can have '<' if not part of something bad
     (?!  </?[ib]>  )   # what we can't have
     <                  # okay, so match the '<'
     [^<]*              # and continue with more safe stuff
    ) *
    </i></b>
 }sx

This is a variation on Jeffrey's unrolling-the-loop technique, described in Chapter 5 of Mastering Regular Expressions.

See Also

The non-greedy quantifiers in the "Regular Expressions" section of perlre (1), and in the "the rules of regular expression matching" section of Chapter 2 of Programming Perl


Previous: 6.14. Matching from Where the Last Pattern Left OffPerl CookbookNext: 6.16. Detecting Duplicate Words
6.14. Matching from Where the Last Pattern Left OffBook Index6.16. Detecting Duplicate Words

Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.