Book review: Mastering Regular Expressions

This review covers Mastering Regular Expressions, 3rd ed., by Jeffrey E.F. Friedl.  Sebastopol, CA: O’Reilly Media, Inc., 2006.

Background: The Pain

Regular expressions:  They tend to be difficult to write and difficult to read… but it’s hard to get away from them.  They can help you manipulate text in sophisticated ways.

Often it is possible to avoid regular expressions and get by with simple (non-regex-enabled) text search and replace, and for several years I have done so where possible.  But every so often the task is complex enough that I fumblingly try my hand at regular expressions again.  When it works, I am glad; when it doesn’t, I’ve often not been clear about what went wrong.  Still, problems for which regular expressions would be an elegant help keep arising, and that persistence has convinced me of this:  Regular expressions are a tool that a professional programmer should have in his or her toolbox.

Tired of my stabs in the dark, I decided it was time to work toward a deep understanding of these regex beasts.

Summary

This is a book about mastering regular expressions.  It’s not primarily a regex quick-reference guide, nor a “Get up to speed on regular expressions in 24 hours” book.  Rather, it’s a steady climb from the chapter 1 bunny hills to chapter 6’s double black diamonds.

The book is divided into three sections: the introduction (chapters 1-3); the details (chapters 4-6); and tool-specific information (chapters 7-10: one chapter each for Perl, Java, .NET, and PHP).

As Friedl says at the beginning of chapters 8, 9, and 10:

[T]his book’s foremost intention is not to be a reference, but a detailed instruction on how to master regular expressions.

The author recommends reading the first six chapters before jumping into one of the tool-specific chapters.

The Good

  • Great topic coverage: Here are some of the topics covered by the book:
    • Greedy vs. lazy quantifiers and how they affect matching
    • Backtracking
    • How a regex engine’s “transmission bump-along” works
    • Comparison of the three types of regex engines: DFA, Traditional NFA, and POSIX NFA – and how the engine type affects matching and efficiency
    • In what ways the “language” used for a character class is different from the language used for the larger regex
    • How to be careful when using greedy quantifiers like .*
    • Atomic grouping and possessive matching
    • Non-capturing parentheses
    • Positive and negative lookahead and lookbehind (collectively, “lookaround”)
    • Differences among regex flavors
  • Grizzled wisdom: In addition to all the topics covered, Friedl frequently notes caveats such as “this chart is only the tip of the iceberg — for every feature shown, there are a dozen important issues that are overlooked” (p. 91) and then explains what he means.  He’s not only explaining the table at hand — he’s also helping you learn how to think about tables of regex engine features — to learn what information you can safely glean from such a table and what important details tend to be left unstated.
  • Attention to detail: (This is similar to the previous point, but seems distinct in my mind.)  Some technical books assert that if you do X, the system does Y, but leave out the rare (but important) cases where you do X and the system doesn’t do Y, as well as the cases where the system does Y without you doing X, leaving you to stumble into those cases on your own.  This book does vanishingly little of that.  As just one example, on page 442, while explaining PHP’s m and D pattern modifiers, Friedl discusses the effect if you don’t use either modifier; the effect of the m modifier; and the effect of the D modifier.  But then, in characteristic form, he adds that “If both the m and D pattern modifiers are used, D is ignored.”  This type of attention to precision is bound to save the reader the research and debugging time of discovering such cases on their own.
  • Steady guidance along to the advanced topics: If Friedl started out with some of the advanced topics from chapters 5 and 6, I might have lost hope and given up.  Instead, he starts out simple and builds as he goes.  While I did not always take the time to fully understand each example in chapters 5 and 6, I found chapters 1-4 very approachable when taken in order.
  • Diagrams and tables: I found the diagrams sprinkled throughout, which show the text and which parts of the regex match where, very helpful.  There are others, too: the backtracking diagrams on pages 229, 230, and 231, and the tables on page 92 and in many other places.
  • Helpful cross-references everywhere: Whenever a concept is mentioned that is developed further somewhere else, the text points to the page number of that further development.  Also, at the beginning of some chapters there is a table-o-pointers (like a mini table of contents) to topics discussed in the chapter, for later quick reference.
  • Brain-jogging quizzes: I found the quizzes sprinkled through the book to be helpful in getting my brain going.  If the quizzes had been lumped together at the end of each chapter, I would have skipped them — but since they were few and sprinkled in among the reading at odd times, they piqued my interest, and my comprehension was aided by doing them.
  • Respectful tone: Though he starts from the beginning building a foundation to help the reader understand regular expressions, Friedl avoids a condescending tone.  He also avoids an apologetic “I want the reader to think I’m cool” tone when dealing with deep technical content.
  • Good craftsmanship: Each diagram, table, section, quiz — as well as the organization of the chapters and the progression of the examples — has a purpose and contributes toward the single purpose of helping the reader master regular expressions.  Even the unique typographic conventions contribute to understanding.  No diagram is there just for fluff.  This coherence is refreshing.
  • Now it’s a quick reference: Now that I have read the book, I can use it as a quick reference.
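
To make two of the points above concrete, here is a minimal sketch in Python. It is not taken from the book, and Python is not one of the four languages the book's tool-specific chapters cover; the sample strings and variable names are made up for illustration. The first part contrasts a greedy and a lazy quantifier. The second part shows Python's rough analogue of the end-of-string subtlety behind PHP's m and D modifiers: Python has no D modifier, so `\Z` is used here as the closest counterpart, and that mapping is my assumption rather than anything the book states.

```python
import re

# Greedy vs. lazy: .* grabs as much as it can, .*? as little as it can.
text = 'The "quick" brown "fox"'
greedy = re.findall(r'".*"', text)   # one long match spanning both quoted words
lazy = re.findall(r'".*?"', text)    # one short match per quoted word

# End-of-string anchors in Python's re module (an analogy to PHP's m and D):
lines = "one\ntwo\n"
default_dollar = re.findall(r"\w+$", lines)                  # $ also matches just before a final newline
multiline_dollar = re.findall(r"\w+$", lines, re.MULTILINE)  # $ matches before every newline
strict_end = re.findall(r"\w+\Z", lines)                     # \Z matches only at the very end

print(greedy)            # ['"quick" brown "fox"']
print(lazy)              # ['"quick"', '"fox"']
print(default_dollar)    # ['two']
print(multiline_dollar)  # ['one', 'two']
print(strict_end)        # []
```

The empty result for `strict_end` is the kind of edge case that is easy to stumble into: the string ends with a newline, so no word character sits directly before the very end.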

The Bad

  • Not a quick reference at first: I had barely started the book when a regex question came up in our development.  I thought that lookaround might be the solution to our problem, so I flipped to the point in the book where the concept is introduced (Adding Commas to a Number with Lookaround, pp. 59 and following)… and had little idea what I was reading.  I found I could not skip to later sections without going through the sections leading up to them.
  • Takes some commitment: Reading this book was less like using a vending machine and more like a two-month apprenticeship.
  • You have to do work: You have to think!
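
For readers curious what the "adding commas to a number" problem mentioned above looks like, here is a minimal sketch in Python. The book develops the idea in Perl; this is my own adaptation, and the function name `add_commas` is hypothetical. The pattern matches no characters at all: it only finds positions that have a digit to the left (lookbehind) and a multiple of three digits, followed by a non-digit or the end, to the right (lookahead).

```python
import re

# Matches an empty position, never any characters:
#   (?<=\d)          there is a digit immediately to the left
#   (?=(?:\d{3})+    to the right are one or more groups of exactly three digits
#   (?!\d))          ...and the digits stop there
COMMA_POSITION = re.compile(r"(?<=\d)(?=(?:\d{3})+(?!\d))")

def add_commas(text: str) -> str:
    """Insert thousands separators into every run of digits in text."""
    return COMMA_POSITION.sub(",", text)

print(add_commas("1234567"))                      # 1,234,567
print(add_commas("The population is 298444215"))  # The population is 298,444,215
print(add_commas("123"))                          # 123
```

Because the sketch treats every run of digits the same way, it would also add commas to the fractional part of a decimal number; handling that needs more care than shown here.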

Overall

There’s territory to master, and I can’t fault the author for that.  This basically neutralizes the “bad” comments about the book.  I had fiddled around with regular expressions for several years without growing much better at them, but this book has launched me into being able to use regular expressions in much more advanced ways.  What a joy it was, for instance (I think I was probably in chapter 4 at the time), to be able to help a co-worker using my newly gained knowledge of lookbehind!

After studying Friedl’s book, I’m finally not a regex beginner any longer.  I understand the territory better, and if a regex doesn’t match as I expect, I believe I can look into it and have a shot at figuring out the cause (before, my practice was more hack-and-hope).

I heartily recommend Mastering Regular Expressions to anyone who feels the time has finally come to really understand regexes.
