Many tools that support regular expressions (regexes) also offer positive and negative lookahead. What good is lookahead? Why would you ever use it?
A Positive Example
Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:
What then, said I, shall I do? You shan't, he replied, do anything.
As a first attempt, I could use this regular expression to get one or more word parts followed by a comma:
[A-Za-z']+,
This yields four results over the string:
- then,
- I,
- shan't,
- replied,
Notice that each match includes the comma, which I would then have to remove. Wouldn’t it be better if we could say that we want a word that is followed by a comma, without also matching the comma?
We can do that by modifying our regex as follows:
[A-Za-z']+(?=,)
This matches runs of word characters that are followed by a comma, but because lookahead only checks what comes next without consuming it, the comma is not part of the matched text. The modified regex results in these matches:
- then
- I
- shan't
- replied
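To try both regexes side by side, here is a minimal sketch in Python using the standard re module. Python is only my choice for illustration (the regexes themselves are not tied to any one tool), and the variable names are made up for this example.

```python
import re

# The example string from above.
TEXT = "What then, said I, shall I do? You shan't, he replied, do anything."

# First attempt: the comma is consumed, so it is part of each match.
with_comma = re.findall(r"[A-Za-z']+,", TEXT)

# Positive lookahead: the comma must follow, but it is not consumed.
without_comma = re.findall(r"[A-Za-z']+(?=,)", TEXT)

print(with_comma)     # ['then,', 'I,', "shan't,", 'replied,']
print(without_comma)  # ['then', 'I', "shan't", 'replied']
```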
A Negative Example
What if I wanted to match all the words not followed by a comma? I would use negative lookahead:
(?>[A-Za-z']+)(?!,)
(Okay, negative lookahead and atomic grouping)
…to get these matches:
- What
- said
- shall
- I
- do
- You
- he
- do
- anything
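Here is the same kind of sketch for the negative case, again with Python as an illustrative choice. One assumption to flag: Python's re module only accepts the atomic-group syntax (?>…) from version 3.11 onward, so the call is guarded by a version check. The function name is hypothetical.

```python
import re
import sys

TEXT = "What then, said I, shall I do? You shan't, he replied, do anything."

def words_not_before_comma(text):
    """Return the words in text that are not immediately followed by a comma."""
    # (?>...) is an atomic group; (?!,) is negative lookahead for a comma.
    return re.findall(r"(?>[A-Za-z']+)(?!,)", text)

# Assumption: atomic groups need Python 3.11+; older versions raise re.error.
if sys.version_info >= (3, 11):
    print(words_not_before_comma(TEXT))
    # ['What', 'said', 'shall', 'I', 'do', 'You', 'he', 'do', 'anything']
```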
Huh? Atomic Grouping?
Yep. Otherwise you’ll get the following, where the, shan' and replie are unintended matches:
- What
- the
- said
- shall
- I
- do
- You
- shan'
- he
- replie
- do
- anything
Without atomic grouping (the (?>…) in the regex), when the regex engine sees that a match in progress has run up against a disqualifying comma, it simply backs off one character to complete the match: the + in the regex gives the engine that flexibility. Atomic grouping disallows this; it tells the engine not to give up characters it has already matched.
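The backing-off is easy to see on a single word. In this small Python sketch (an illustration, not tied to any particular tool), the regex without the atomic group settles for "the" when it is given "then,", and the full sentence produces the unintended matches listed above. The atomic version runs only under the assumption of Python 3.11 or later, where re understands (?>…).

```python
import re
import sys

TEXT = "What then, said I, shall I do? You shan't, he replied, do anything."

# No atomic group: "then" is followed by a comma, so the engine gives back
# the "n" and accepts "the", which is followed by "n" rather than a comma.
plain = re.match(r"[A-Za-z']+(?!,)", "then,")
print(plain.group())  # the

# Over the whole sentence, the same backing-off yields the unintended matches.
plain_all = re.findall(r"[A-Za-z']+(?!,)", TEXT)
print(plain_all)

# Atomic group (assumes Python 3.11+): the word is kept whole, so the
# lookahead fails and there is no match at all for "then,".
if sys.version_info >= (3, 11):
    atomic = re.match(r"(?>[A-Za-z']+)(?!,)", "then,")
    print(atomic)  # None
```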
When Lookahead Does You No Good
Lookahead doesn’t really help if you only care whether or not there was a match (that is, you don’t care what text was matched). If all I care about is whether or not the string contains any words followed by a comma, I would dump lookahead and use the simpler regex:
[A-Za-z']+,
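As a minimal sketch of that yes-or-no check, again in Python and with a hypothetical helper name:

```python
import re

def has_word_before_comma(text):
    """Return True if any word in text is immediately followed by a comma."""
    # No lookahead needed: we only ask whether a match exists.
    return re.search(r"[A-Za-z']+,", text) is not None

print(has_word_before_comma("What then, said I, shall I do?"))  # True
print(has_word_before_comma("You shall do nothing."))           # False
```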
Acknowledgments
Thanks to Jeffrey Friedl for writing Mastering Regular Expressions, 3rd ed., before reading which I had not even heard of regex lookahead.
Also, thanks to Sergey Evdokimov for his online Regular Expression Editor. Handy!