Positive examples of positive and negative lookahead

Many tools that support regular expressions (regexes) support positive and negative lookahead.  What good is lookahead?  Why would you ever use it?

A Positive Example

Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:

What then, said I, shall I do?  You shan't, he replied, do anything.

As a first attempt, I could use this regular expression to match one or more word characters (letters or apostrophes) followed by a comma:

[A-Za-z']+,

This yields four results over the string:

  1. then,
  2. I,
  3. shan't,
  4. replied,

Notice that this gets me the comma too, though, which I would then have to remove.  Wouldn’t it be better if we could express that we want to match a word that is followed by a comma without also matching the comma?

We can do that by modifying our regex as follows:

[A-Za-z']+(?=,)

This matches runs of word characters that are followed by a comma. Because lookahead only checks what comes next without consuming it, the comma is not part of the matched text, which is just what we want.  The modified regex results in these matches:

  1. then
  2. I
  3. shan't
  4. replied
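
To try the two regexes side by side, here is a minimal sketch using Python's re module. The choice of Python and the variable names are my own assumptions for illustration; any engine with lookahead support should behave the same way on this string.

```python
import re

# The example sentence from the text above.
TEXT = "What then, said I, shall I do?  You shan't, he replied, do anything."

# First attempt: the comma is consumed, so it ends up in every match.
with_comma = re.findall(r"[A-Za-z']+,", TEXT)

# Positive lookahead: (?=,) asserts that a comma comes next without consuming it.
without_comma = re.findall(r"[A-Za-z']+(?=,)", TEXT)

print(with_comma)     # ['then,', 'I,', "shan't,", 'replied,']
print(without_comma)  # ['then', 'I', "shan't", 'replied']
```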

A Positive Negative Example

What if I wanted to match all the words not followed by a comma?  I would use negative lookahead:

(?>[A-Za-z']+)(?!,)

(Okay, negative lookahead and atomic grouping)

…to get these matches:

  1. What
  2. said
  3. shall
  4. I
  5. do
  6. You
  7. he
  8. do
  9. anything
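
A rough Python sketch of the same search follows. Python's re module accepts the (?>...) atomic group syntax only from version 3.11, so the sketch falls back to a common emulation on older versions: a lookahead that captures the word, followed by a backreference. The fallback and the variable names are my own additions, not part of the original regex.

```python
import re

# The example sentence from the text above.
TEXT = "What then, said I, shall I do?  You shan't, he replied, do anything."

try:
    # Atomic group syntax, accepted by Python's re module from version 3.11.
    pattern = re.compile(r"(?>[A-Za-z']+)(?!,)")
except re.error:
    # Fallback for older versions: the lookahead captures the whole word and
    # the backreference \1 consumes it, so the engine cannot shorten it later.
    pattern = re.compile(r"(?=([A-Za-z']+))\1(?!,)")

# group(0) is the whole match in both variants.
not_before_comma = [m.group(0) for m in pattern.finditer(TEXT)]

print(not_before_comma)
# ['What', 'said', 'shall', 'I', 'do', 'You', 'he', 'do', 'anything']
```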

Huh? Atomic Grouping?

Yep.  Otherwise you’ll get the following (unintended matches marked with an asterisk):

  1. What
  2. the *
  3. said
  4. shall
  5. I
  6. do
  7. You
  8. shan' *
  9. he
  10. replie *
  11. do
  12. anything

Without atomic grouping (the (?>...) in the regex), when the regex engine finds that a match-in-progress runs up against a disqualifying comma, it simply backs off one character and completes the match there, since the character that now follows is a letter rather than a comma.  The + in the regex gives the engine that flexibility.  Applying atomic grouping disallows this: it tells the engine not to give up characters it has already matched.
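
The backing-off is easy to see by running the regex without the atomic group. The sketch below is again in Python, with names of my own choosing. It also flags a match as unintended when the character right after it is a letter, which is a simple heuristic that happens to be enough for this sentence.

```python
import re

# The example sentence from the text above.
TEXT = "What then, said I, shall I do?  You shan't, he replied, do anything."

# Negative lookahead without atomic grouping: the + may give back characters.
plain = re.compile(r"[A-Za-z']+(?!,)")

backtracked = [m.group(0) for m in plain.finditer(TEXT)]

# A match that stops in the middle of a word is one where the engine gave up
# the last character; here that shows as a letter directly after the match.
unintended = [
    m.group(0)
    for m in plain.finditer(TEXT)
    if TEXT[m.end():m.end() + 1].isalpha()
]

print(backtracked)
# ['What', 'the', 'said', 'shall', 'I', 'do', 'You', "shan'", 'he', 'replie', 'do', 'anything']
print(unintended)
# ['the', "shan'", 'replie']
```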

When Lookahead Does You No Good

Lookahead doesn’t really help if you only care whether or not there was a match (that is, you don’t care what text was matched).  If all I care about is whether or not the string contains any words followed by a comma, I would dump lookahead and use the simpler regex:

[A-Za-z']+,
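
As a quick check of that claim, this Python sketch asks only whether a match exists, once with and once without lookahead. The variable names and the second, comma-free sentence are made up for illustration.

```python
import re

# The example sentence from the text above, plus a made-up one with no commas.
TEXT = "What then, said I, shall I do?  You shan't, he replied, do anything."
NO_COMMAS = "Nothing to see here."

simple = re.compile(r"[A-Za-z']+,")
lookahead = re.compile(r"[A-Za-z']+(?=,)")

# When only the yes/no answer matters, both regexes agree.
simple_hit = simple.search(TEXT) is not None
lookahead_hit = lookahead.search(TEXT) is not None
simple_miss = simple.search(NO_COMMAS) is not None
lookahead_miss = lookahead.search(NO_COMMAS) is not None

print(simple_hit, lookahead_hit)    # True True
print(simple_miss, lookahead_miss)  # False False
```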

Acknowledgments

Thanks to Jeffrey Friedl for writing Mastering Regular Expressions, 3rd ed., before reading which I had not even heard of regex lookahead.

Also, thanks to Sergey Evdokimov for his online Regular Expression Editor.  Handy!


  1. #1 by Aaron Alexander on March 25, 2009 - 2:26 pm

    I thought I’d point out another online regular expression editor that I use: Rubular.

    Thanks for the article, by the way. I didn’t know about the lookahead feature.

  2. #2 by Witek on June 11, 2013 - 11:39 am

    It would be from performance and regular expression semantic, much better to use matching and negative character class.

    ([A-Za-z']+),
    ([A-Za-z']+)([^,]|$) # followed by non-comma, or end of string.

    and use group 1.

  3. #3 by sars on September 4, 2013 - 6:30 am

    I need more explanation on atomic grouping,,,
    Thanks in advance

  4. #4 by Max on September 26, 2013 - 10:47 am

    I want a regex that will validate number between 1 to 9 except 7. i.e, it will initially check if a given number is in the range of 1 to 9 , if true then another check will make if that given number is 7 or not. If its 7 then fail else pass. Does anyone have any idea. Your help will be highly appreciated .

  5. #5 by danielmeyer on September 26, 2013 - 10:59 am

    Dear Max,
    Why don’t you write out your best guess, and then we’ll work on it from there.

    Daniel

