Posts Tagged regex

Automation-assisted manual transformations

I had been unit testing an SQL generator and had a bunch of tests that gave various input to the generator and tested its output against expected SQL.   The SQL in my tests looked like this:

Now I was ready to feed the SQL to a database engine to verify that it was validly formed.  I would generally use grep for this type of task; but here my SQL statements were formatted multiline for easier reading, and grep operates in a line-by-line mode.  There were over 100 test cases, so it was worth figuring out an automated solution.  I also wanted to avoid writing a single-purpose text-processing utility if possible.

I ended up writing down the following steps for myself:

  1. Turn the tests into a single line of text for ease of working with tools
    On command line: cat Test*.cpp | tr -d "\n\r" > all-one-line.out
  2. Discard everything but the queries, inserting a newline after each
    In editor (SlickEdit for me), open all-one-line.out and Replace All (using Perl-style regexes):
    .*?("(?:\( )*SELECT[^;]+?;)
  3. Clean up what the regex didn’t
    Delete the last line
  4. Get rid of quotes
    Replace \" with nothing
  5. Get rid of semicolons
    Replace ;$ with nothing
  6. Get rid of extra spaces
    Replace <space>+ with <space>
  7. Save the file in the editor
  8. Get rid of Oracle-specific tests
    grep --invert-match TO_DATE < all-one-line.out > all-one-line.sql
  9. Let cool for 5 minutes before serving
    Paste all-one-line.sql into MS SQL Server Management Studio and execute (with Results to Text)

This may look like a large number of steps, but I got to where I could run through them in about 30 seconds and test all 130 queries on the server.  Nice!

Future improvements

Once I had the ability to test my test output against the database server, I wanted to do that each time the tests’ expected results changed.  So where I had originally envisioned a single smoke test run, I ended up going through these automation-assisted manual steps ten or twenty times.  In retrospect, the single-purpose utility script would clearly have been the better approach after all.  I need to get more comfortable whipping up such scripts to lower the barrier to writing them when these occasions arise.
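For what it's worth, here is a minimal sketch of what such a single-purpose utility might look like in Java.  It assumes the queries sit in adjacent C++ string literals, as the step-2 regex suggests.  The class name QueryExtractor and the sample input are made up for illustration, and it only prints the queries rather than sending them to a server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExtractor {
    // Same shape as the step-2 regex: a quote, optional "( " groups,
    // then SELECT and everything up to the first semicolon.
    private static final Pattern QUERY =
            Pattern.compile("\"(?:\\( )*SELECT[^;]+?;");

    static List<String> extract(String testSource) {
        // Step 1: collapse the source to one line.
        String oneLine = testSource.replaceAll("[\\r\\n]", "");
        List<String> queries = new ArrayList<>();
        Matcher m = QUERY.matcher(oneLine);
        while (m.find()) {                        // step 2: keep only the queries
            String q = m.group()
                    .replace("\"", "")            // step 4: drop quotes
                    .replaceAll(";$", "")         // step 5: drop the trailing semicolon
                    .replaceAll(" +", " ")        // step 6: squeeze runs of spaces
                    .trim();
            if (!q.contains("TO_DATE")) {         // step 8: skip Oracle-specific tests
                queries.add(q);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical test source, not the real tests.
        String sample = "check(\"SELECT a\"\n      \"  FROM t;\");\n"
                + "check(\"SELECT TO_DATE(x) FROM u;\");";
        extract(sample).forEach(System.out::println);
    }
}
```

Step 3 (deleting the leftover last line) is not needed here, since only the matched text is kept.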

Twelve years in, I would think I would be at the top of my craft by now, but there are still things that seem pretty basic that I’m learning.  Hmmm…I wonder if life really does begin at 40?



Positive examples of positive and negative lookahead

Many tools that support regular expressions (regexes) support positive and negative lookahead.  What good is lookahead?  Why would you ever use it?

A Positive Example

Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:

What then, said I, shall I do?  You shan't, he replied, do anything.

As a first attempt, I could use this regular expression to get one or more word parts followed by a comma:


This yields four results over the string:

  1. then,
  2. I,
  3. shan't,
  4. replied,

Notice that this gets me the comma too, though, which I would then have to remove.  Wouldn’t it be better if we could express that we want to match a word that is followed by a comma without also matching the comma?

We can do that by modifying our regex as follows:


This matches groups of word characters that are followed by a comma, but because of the use of lookahead the comma is not part of the matched text (just as we want it not to be).  The modified regex results in these matches:

  1. then
  2. I
  3. shan't
  4. replied
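As a minimal sketch, the two searches can be tried out in Java.  The patterns below are an assumption: [\w']+, and [\w']+(?=,) are simply one pair of regexes that produces the results listed above for the example string.  The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositiveLookaheadDemo {
    // Collects the text of every match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "What then, said I, shall I do?  You shan't, he replied, do anything.";
        // Assumed first attempt: the comma is consumed as part of the match.
        System.out.println(findAll("[\\w']+,", text));
        // Assumed lookahead version: the comma is required but not consumed.
        System.out.println(findAll("[\\w']+(?=,)", text));
    }
}
```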

A Positive Negative Example

What if I wanted to match all the words not followed by a comma?  I would use negative lookahead:


(Okay, negative lookahead and atomic grouping)

…to get these matches:

  1. What
  2. said
  3. shall
  4. I
  5. do
  6. You
  7. he
  8. do
  9. anything

Huh? Atomic Grouping?

Yep.  Otherwise you’ll get the following (the unintended matches are “the”, “shan’”, and “replie”):

  1. What
  2. the
  3. said
  4. shall
  5. I
  6. do
  7. You
  8. shan’
  9. he
  10. replie
  11. do
  12. anything

Without atomic grouping (the (?>) in the regex), when the regex engine sees that a match-in-progress comes up against a disqualifying comma, it simply backs off one letter to complete the match: the + in the regex gives the engine that flexibility.  Applying atomic grouping disallows this and says, don’t give up characters you’ve matched.
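Here is a small Java sketch of the difference.  The two patterns are assumptions: (?>[\w']+)(?!,) and [\w']+(?!,) are one pair of regexes that reproduces the two result lists above.  The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // Collects the text of every match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "What then, said I, shall I do?  You shan't, he replied, do anything.";
        // With atomic grouping the engine may not give back characters,
        // so "then" cannot shrink to "the" to dodge the comma.
        System.out.println(findAll("(?>[\\w']+)(?!,)", text));
        // Without it, the + backs off one character and still matches.
        System.out.println(findAll("[\\w']+(?!,)", text));
    }
}
```

The first call finds the nine intended words; the second finds twelve, including the three partial words.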

When Lookahead Does You No Good

Lookahead doesn’t really help if you only care whether or not there was a match (that is, you don’t care what text was matched).  If all I care about is whether or not the string contains any words followed by a comma, I would dump lookahead and use the simpler regex:
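A minimal sketch of that yes-or-no check in Java, assuming the simpler regex is a run of word characters or apostrophes followed by a literal comma (the pattern and the names here are illustrative, not taken from a real project):

```java
import java.util.regex.Pattern;

class CommaWordCheck {
    // Assumed pattern: word characters or apostrophes, then a literal comma.
    private static final Pattern WORD_THEN_COMMA = Pattern.compile("[\\w']+,");

    // True if any word in text is immediately followed by a comma.
    static boolean hasWordBeforeComma(String text) {
        return WORD_THEN_COMMA.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWordBeforeComma("What then, said I"));
        System.out.println(hasWordBeforeComma("No commas here."));
    }
}
```

Since only the boolean result is used, it does not matter that the comma ends up inside the matched text.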



Thanks to Jeffrey Friedl for writing Mastering Regular Expressions, 3rd ed.; before reading it I had not even heard of regex lookahead.

Also, thanks to Sergey Evdokimov for his online Regular Expression Editor.  Handy!



MRE3 errata

I recently finished reading Mastering Regular Expressions, 3rd ed., by Jeffrey E.F. Friedl.  In the main chapters (the first six) I didn’t find a single typo. (I thought the PHP example in the table at the bottom of page 190 was missing a backslash, but I was wrong!  See the explanation at the top of page 445.)  I found this lack of errors (at least ones that I could find) really impressive given the level of technical detail.  I did find a few minor errors in the tool-specific chapters (chapters 7-10) and sent them along to the author  — all part of my ploy to attain fame and fortune by pointing out others’ mistakes ;).  So that I don’t lose them, here they are:

From: Daniel Meyer
To: ‘Jeffrey Friedl’
Subject: MRE3 errata

Dear Mr. Friedl,

I have just finished reading Mastering Regular Expressions, 3rd ed., and I found it both helpful and enjoyable.  In the spirit of contributing to the betterment of the work, here are some things I believe are errata:

p.293, under Context bullet point: s/in which they’re in/which they’re in/
p.298, second paragraph: s/looking localization/looking at localization/
p.428 sidebar: Based on the VB.NET code, the output in both columns should have “Options are” instead of “Option are”
p.457, last paragraph: s/but since they retained/but since they are retained/
p.459, last paragraph: s/are note interpolated/are not interpolated/
p.462, code snippet at bottom of page: Shouldn’t both be $0 ?
p.467, after first code snippet: “The S pattern modifier is used for efficiency” (but the code snippet does not use the S pattern modifier)
p.473, first if-statement in preg_regex_to_pattern function: the comment says ‘/’ followed by ‘\’ or EOS (but the code matches ‘\’ followed by ‘/’ or EOS)

Thank you!

Daniel Meyer


Book review: Mastering Regular Expressions

This review covers Mastering Regular Expressions, 3rd ed., by Jeffrey E.F. Friedl.  Sebastopol, CA: O’Reilly Media, Inc., 2006.


Background: The Pain

Regular expressions:  They tend to be difficult to write and difficult to read… but it’s hard to get away from them.  They can help you manipulate text in sophisticated ways.

Often it is possible to avoid regular expressions and get by with simple (non-regex-enabled) text search and replace — and for several years I have done so where possible; but every so often the task is complex enough that I fumblingly try my hand at regular expressions again.  When it works, I am glad; when it doesn’t work, I’ve often not been clear about what went wrong.  But the persistence with which problems continue to arise for which regular expressions would be an elegant help has convinced me of this:  Regular expressions are a tool that a professional programmer should have in his or her toolbox.

Tired of my stabs in the dark, I decided it was time I work to gain a deep understanding of these regex beasts.


This is a book about mastering regular expressions.  It’s not primarily a regex quick-reference guide, nor a “Get up to speed on regular expressions in 24 hours” book.  Rather, it’s a steady climb from the chapter 1 bunny hills to chapter 6’s double black diamonds.

The book is divided into three sections: the introduction (chapters 1-3); the details (chapters 4-6); and tool-specific information (chapters 7-10: one chapter each for Perl, Java, .NET, and PHP).

As Friedl says at the beginning of chapters 8, 9, and 10:

[T]his book’s foremost intention is not to be a reference, but a detailed instruction on how to master regular expressions.

The author recommends reading the first six chapters before jumping into one of the tool-specific chapters.

The Good

  • Great topic coverage: Here are some of the topics covered by the book:
    • Greedy vs. lazy quantifiers and how they affect matching
    • Backtracking
    • How a regex engine’s “transmission bump-along” works
    • Comparison of the three types of regex engines: DFA, Traditional NFA, and POSIX NFA – and how the engine type affects matching and efficiency
    • In what ways the “language” used for a character class is different from the language used for the larger regex
    • How to be careful when using greedy quantifiers like .*
    • Atomic grouping and possessive matching
    • Non-capturing parentheses
    • Positive and negative lookahead and lookbehind (collectively, “lookaround”)
    • Differences among regex flavors
  • Grizzled wisdom: In addition to all the topics covered, Friedl frequently notes caveats such as “this chart is only the tip of the iceberg — for every feature shown, there are a dozen important issues that are overlooked” (p. 91) and then explains what he means.  He’s not only explaining the table at hand — he’s also helping you learn how to think about tables of regex engine features — to learn what information you can safely glean from such a table and what important details tend to be left unstated.
  • Attention to detail: (This is similar to the previous point, but seems distinct in my mind.)  Some technical books assert that if you do X, the system does Y, but leave out the rare (but important) cases where if you do X, the system doesn’t do Y, and the cases where the system does Y without you doing X, leaving you to stumble into those cases on your own.  This book does vanishingly little of that.  As just one example, on page 442 while he is explaining PHP’s m and D pattern modifiers, Friedl discusses the effect if you don’t use either modifier; the effect of the m modifier; and the effect of the D modifier.  But then, in characteristic form, he adds that “If both the m and D pattern modifiers are used, D is ignored.”  This type of attention to precision is bound to save the reader research and debugging time discovering such cases on their own.
  • Steady guidance along to the advanced topics: If Friedl started out with some of the advanced topics from chapters 5 and 6, I might have lost hope and given up.  Instead, he starts out simple and builds as he goes.  While I did not always take the time to fully understand each example in chapters 5 and 6, I found chapters 1-4 very approachable when taken in order.
  • Diagrams and tables: I found the diagrams sprinkled throughout, showing the text and which parts of the regex match where, very helpful.  There are others too: the backtracking diagrams on pages 229, 230, and 231, and the tables on page 92 and in many other places.
  • Helpful cross-references everywhere: Whenever a concept is mentioned that is developed further somewhere else, the text points to the page number of that further development.  Also, at the beginning of some chapters there is a table-o-pointers (like a mini table of contents) to topics discussed in the chapter, for later quick reference.
  • Brain-jogging quizzes: I found the quizzes sprinkled through the book to be helpful in getting my brain going.  If the quizzes had been lumped together at the end of each chapter, I would have skipped them — but since they were few and sprinkled in among the reading at odd times, they piqued my interest, and my comprehension was aided by doing them.
  • Respectful tone: Though he starts from the beginning building a foundation to help the reader understand regular expressions, Friedl avoids a condescending tone.  He also avoids an apologetic “I want the reader to think I’m cool” tone when dealing with much deep technical content.
  • Good craftsmanship: Each diagram, table, section, quiz — as well as the organization of the chapters and the progression of the examples — has a purpose and contributes toward the single purpose of helping the reader master regular expressions.  Even the unique typographic conventions contribute to understanding.  No diagram is there just for fluff.  This coherence is refreshing.
  • Now it’s a quick reference: Now that I have read the book, I can use it as a quick reference.

The Bad

  • Not a quick reference at first: I had barely started the book when a regex question came up in our development.  I thought that lookaround might be the solution to our problem, so I flipped to the point in the book where the concept is introduced (Adding Commas to a Number with Lookaround, pp. 59 and following)… and had little idea what I was reading.  I found I could not skip to later sections without going through the sections leading up to them.
  • Takes some commitment: Reading this book was less like using a vending machine and more like a two-month apprenticeship.
  • You have to do work: You have to think!


There’s territory to master, and I can’t fault the author for that.  This basically neutralizes the “bad” comments about the book.  I had fiddled around with regular expressions for several years without growing much better at them, but this book has launched me into being able to use regular expressions in much more advanced ways.  What a joy it was, for instance (I think I was probably in chapter 4 at the time), to be able to help a co-worker using my newly gained knowledge of lookbehind!

After studying through Friedl’s book, I’m finally not a regex beginner any longer.  I understand the territory better, and if a regex didn’t match like I expected I believe I could look into it and have a shot at figuring out the cause (before, my practice was more hack-and-hope).

I heartily recommend Mastering Regular Expressions for anyone who feels the time has finally come for them to take the time to really understand regexes.


A regex to selectively remove package from class names

I wanted to post to a forum the call stack at which I was experiencing a problem.  I had expanded the display in Eclipse to show fully-qualified class names:

Thread [main] (Suspended (breakpoint at line 66 in$BasicSetter))$BasicSetter.set(java.lang.Object, java.lang.Object, org.hibernate.engine.SessionFactoryImplementor) line: 66
org.hibernate.tuple.entity.PojoEntityTuplizer(org.hibernate.tuple.entity.AbstractEntityTuplizer).setIdentifier(java.lang.Object, line: 234
org.hibernate.persister.entity.SingleTableEntityPersister(org.hibernate.persister.entity.AbstractEntityPersister).setIdentifier(java.lang.Object,, org.hibernate.EntityMode) line: 3624
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.AbstractSaveEventListener).performSave(java.lang.Object,, org.hibernate.persister.entity.EntityPersister, boolean, java.lang.Object, org.hibernate.event.EventSource, boolean) line: 194
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.AbstractSaveEventListener).saveWithGeneratedId(java.lang.Object, java.lang.String, java.lang.Object, org.hibernate.event.EventSource, boolean) line: 144
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).saveWithGeneratedOrRequestedId(org.hibernate.event.SaveOrUpdateEvent) line: 210
org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(org.hibernate.event.SaveOrUpdateEvent) line: 56
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).entityIsTransient(org.hibernate.event.SaveOrUpdateEvent) line: 195
org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(org.hibernate.event.SaveOrUpdateEvent) line: 50
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).onSaveOrUpdate(org.hibernate.event.SaveOrUpdateEvent) line: 93
org.hibernate.impl.SessionImpl.fireSave(org.hibernate.event.SaveOrUpdateEvent) line: 562, java.lang.Object) line: 550, java.lang.Object) line: 67
org.jboss.envers.synchronization.VersionsSync.executeInSession(org.hibernate.Session) line: 120
org.jboss.envers.synchronization.VersionsSync.beforeCompletion() line: 135 line: 366 line: 142 line: 96
org.springframework.transaction.jta.JtaTransactionManager.doCommit( line: 1028
org.springframework.transaction.jta.JtaTransactionManager( line: 732
org.springframework.transaction.jta.JtaTransactionManager( line: 701
org.springframework.transaction.interceptor.TransactionInterceptor(org.springframework.transaction.interceptor.TransactionAspectSupport).commitTransactionAfterReturning(org.springframework.transaction.interceptor.TransactionAspectSupport$TransactionInfo) line: 321
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(org.aopalliance.intercept.MethodInvocation) line: 116
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(org.springframework.aop.framework.ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(org.aopalliance.intercept.MethodInvocation) line: 89
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(org.springframework.aop.framework.ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(java.lang.Object, java.lang.reflect.Method, java.lang.Object[], net.sf.cglib.proxy.MethodProxy) line: 635
com.ontsys.db.GreetingSetDAO$$EnhancerByCGLIB$$cd993582.create(com.ontsys.db.GreetingSetPO) line: not available
com.ontsys.db.EnversWithCollectionsTest.testComplexCreate() line: 112
sun.reflect.NativeMethodAccessorImpl.invoke0(java.lang.reflect.Method, java.lang.Object, java.lang.Object[]) line: not available [native method]
sun.reflect.NativeMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) line: 39
sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) line: 25
java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object...) line: 597
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall() line: 44
org.junit.runners.model.FrameworkMethod$1(org.junit.internal.runners.model.ReflectiveCallable).run() line: 15
org.junit.runners.model.FrameworkMethod.invokeExplosively(java.lang.Object, java.lang.Object...) line: 41
org.junit.internal.runners.statements.InvokeMethod.evaluate() line: 20
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner.runChild(org.junit.runners.model.FrameworkMethod, org.junit.runner.notification.RunNotifier) line: 73
org.junit.runners.BlockJUnit4ClassRunner.runChild(java.lang.Object, org.junit.runner.notification.RunNotifier) line: 46
org.junit.runners.BlockJUnit4ClassRunner(org.junit.runners.ParentRunner<T>).runChildren(org.junit.runner.notification.RunNotifier) line: 180
org.junit.runners.ParentRunner<T>.access$000(org.junit.runners.ParentRunner, org.junit.runner.notification.RunNotifier) line: 41
org.junit.runners.ParentRunner$1.evaluate() line: 173
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner(org.junit.runners.ParentRunner<T>).run(org.junit.runner.notification.RunNotifier) line: 220
org.eclipse.jdt.internal.junit4.runner.JUnit4TestMethodReference(org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference).run(org.eclipse.jdt.internal.junit.runner.TestExecution) line: 38[]) line: 38
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(java.lang.String[], java.lang.String, org.eclipse.jdt.internal.junit.runner.TestExecution) line: 460
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(org.eclipse.jdt.internal.junit.runner.TestExecution) line: 673 line: 386
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(java.lang.String[]) line: 196

This made it easy to see when the thread of control was in JUnit, Spring, Bitronix, Envers, Hibernate, etc…

But it was way too busy.  I decided that I wanted to keep the fully qualified method names but drop the package names off the method parameter class names.

The Show Qualified Names option in Eclipse (3.3.2 anyway, which I’m using) either shows ’em all or hides ’em all; there doesn’t seem to be a directly supported way to get the view I want there.

I decided to try crafting a regular expression to do the replacement, and came up with this:

   (?<=\(|, )[a-z0-9.$]+([A-Z]\w+)

I used Sergey's regular expression tester Eclipse plug-in to try out my regular expression and see visually whether it was matching what I expected (actually, I used it to come up with the above regex).

[Screenshot: regular-expression-tester]

Here's what the regex does, bit by bit:

   (?<=\(|, )

This part uses lookbehind (?<=) to say "match if just before the current position is a left-parenthesis or a comma+space."

   [a-z0-9.$]+

This section matches the package part of a fully-qualified class name: one or more lowercase letters, digits, periods, or dollar signs.

   ([A-Z]\w+)

This final section matches only a capital letter, followed by one or more “word”-constitutin’ characters.  This matches the class name without the package.  We enclose this section in capturing parentheses, allowing us to replace the matched text with simply the contents of that capturing group (the bare class name).


Performing the replacement yields this still-thick-but-not-quite-as-bad version of the call stack:

Thread [main] (Suspended (breakpoint at line 66 in$BasicSetter))$BasicSetter.set(Object, Object, SessionFactoryImplementor) line: 66
org.hibernate.tuple.entity.PojoEntityTuplizer(AbstractEntityTuplizer).setIdentifier(Object, Serializable) line: 234
org.hibernate.persister.entity.SingleTableEntityPersister(AbstractEntityPersister).setIdentifier(Object, Serializable, EntityMode) line: 3624
org.hibernate.event.def.DefaultSaveEventListener(AbstractSaveEventListener).performSave(Object, Serializable, EntityPersister, boolean, Object, EventSource, boolean) line: 194
org.hibernate.event.def.DefaultSaveEventListener(AbstractSaveEventListener).saveWithGeneratedId(Object, String, Object, EventSource, boolean) line: 144
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).saveWithGeneratedOrRequestedId(SaveOrUpdateEvent) line: 210
org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(SaveOrUpdateEvent) line: 56
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).entityIsTransient(SaveOrUpdateEvent) line: 195
org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(SaveOrUpdateEvent) line: 50
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).onSaveOrUpdate(SaveOrUpdateEvent) line: 93
org.hibernate.impl.SessionImpl.fireSave(SaveOrUpdateEvent) line: 562, Object) line: 550, Object) line: 67
org.jboss.envers.synchronization.VersionsSync.executeInSession(Session) line: 120
org.jboss.envers.synchronization.VersionsSync.beforeCompletion() line: 135 line: 366 line: 142 line: 96
org.springframework.transaction.jta.JtaTransactionManager.doCommit(DefaultTransactionStatus) line: 1028
org.springframework.transaction.jta.JtaTransactionManager(AbstractPlatformTransactionManager).processCommit(DefaultTransactionStatus) line: 732
org.springframework.transaction.jta.JtaTransactionManager(AbstractPlatformTransactionManager).commit(TransactionStatus) line: 701
org.springframework.transaction.interceptor.TransactionInterceptor(TransactionAspectSupport).commitTransactionAfterReturning(TransactionAspectSupport$TransactionInfo) line: 321
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(MethodInvocation) line: 116
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(MethodInvocation) line: 89
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Object, Method, Object[], MethodProxy) line: 635
com.ontsys.db.GreetingSetDAO$$EnhancerByCGLIB$$cd993582.create(GreetingSetPO) line: not available
com.ontsys.db.EnversWithCollectionsTest.testComplexCreate() line: 112
sun.reflect.NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
sun.reflect.NativeMethodAccessorImpl.invoke(Object, Object[]) line: 39
sun.reflect.DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
java.lang.reflect.Method.invoke(Object, Object...) line: 597
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall() line: 44
org.junit.runners.model.FrameworkMethod$1(ReflectiveCallable).run() line: 15
org.junit.runners.model.FrameworkMethod.invokeExplosively(Object, Object...) line: 41
org.junit.internal.runners.statements.InvokeMethod.evaluate() line: 20
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner.runChild(FrameworkMethod, RunNotifier) line: 73
org.junit.runners.BlockJUnit4ClassRunner.runChild(Object, RunNotifier) line: 46
org.junit.runners.BlockJUnit4ClassRunner(ParentRunner<T>).runChildren(RunNotifier) line: 180
org.junit.runners.ParentRunner<T>.access$000(ParentRunner, RunNotifier) line: 41
org.junit.runners.ParentRunner$1.evaluate() line: 173
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner(ParentRunner<T>).run(RunNotifier) line: 220
org.eclipse.jdt.internal.junit4.runner.JUnit4TestMethodReference(JUnit4TestReference).run(TestExecution) line: 38[]) line: 38
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(String[], String, TestExecution) line: 460
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(TestExecution) line: 673 line: 386
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(String[]) line: 196
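The same replacement can be sketched outside the editor in a few lines of Java.  This assumes Java's replacement-string syntax, where $1 stands for the first capturing group; the class name PackageStripper is made up for illustration.

```java
class PackageStripper {
    // Lookbehind for "(" or ", ", then the package part,
    // then the class name in a capturing group.
    private static final String QUALIFIED_PARAM =
            "(?<=\\(|, )[a-z0-9.$]+([A-Z]\\w+)";

    // Replaces each qualified parameter type with just its class name.
    static String strip(String stackLine) {
        return stackLine.replaceAll(QUALIFIED_PARAM, "$1");
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object...) line: 597"));
    }
}
```

Because of the lookbehind, the fully qualified method name at the start of each line is left alone; only names right after a left-parenthesis or a comma+space are shortened.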



A simplifying regex

Regular expressions can get pretty complex.  I’m in the process of trying to master them, but from time to time I worry that I should just give up on the concept — that perhaps there are generally easier ways of accomplishing the same thing, and regexes are a waste of time.

Then I see an example like today’s that renews my belief that  it’s worthwhile to master these regexes:


A co-worker had a method that received a String of comma-separated values.  Inside the method, he did a split(), with a comma as the delimiter. A simplified version of the method in question:

import java.util.*;

public class Splitter {
    public List<String> splitBasic(String s) {
        return Arrays.asList(s.split(","));
    }
}

…and its test:

import static org.junit.Assert.assertEquals;
import java.util.List;
import org.junit.Test;

public class SplitterTest {
	@Test
	public void testSplitBasic() {
		List<String> splitString = new Splitter().splitBasic("bob, sam,harry");
		assertEquals(3, splitString.size());
		assertEquals("bob", splitString.get(0));
		assertEquals(" sam", splitString.get(1));
		assertEquals("harry", splitString.get(2));
	}
}

This test passes.  So far, so good.

The Problem

He wanted to enhance the split method to support escaping a comma within the string.  When the split method encountered an escaped comma (“\,”), it should not consider the comma a delimiter (and it should eat the backslash).

So we would want this test to pass:

        List<String> splitString = new Splitter().splitSmart("bob\\, sam,harry");
        assertEquals(2, splitString.size());
        assertEquals("bob, sam", splitString.get(0));
        assertEquals("harry", splitString.get(1));

splitSmart(): a Non-Regex Implementation

He had a working implementation that looked something like this:

    public List<String> splitSmart(String s) {
        List<String> list = new ArrayList<String>();

        String concat = "";
        boolean concatenating = false;
        for (String x : s.split(",")) {
            if (x.endsWith("\\")) {
                // Escaped comma: drop the backslash, put the comma back, keep going.
                concat += x.substring(0, x.length() - 1) + ",";
                concatenating = true;
            } else if (concatenating) {
                // Last piece of an escaped run: finish it and store it.
                concat += x;
                list.add(concat);
                concat = "";
                concatenating = false;
            } else {
                list.add(x);
                concatenating = false;
            }
        }
        return list;
    }

This makes the test pass, but my co-worker was not happy with it.  Too clunky.

A Simplification using a Regex

I had to think about it for a few minutes, but eventually it came to me that what we wanted as a delimiter was a comma not preceded by a backslash.  Looks like a great opportunity to use… negative lookbehind!

The regex way to say “a comma not preceded by a backslash” is:

(?<!\\),

(?<!X) is the regex way of saying “not preceded by X” (in this case, a backslash, which has to be escaped) and the comma is the thing to match.
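As a minimal sketch of just the splitting step: in Java source the regex (?<!\\), has to be written with each backslash doubled again for the string literal.  The class name EscapedCommaSplit is hypothetical.

```java
class EscapedCommaSplit {
    // Regex:         (?<!\\),     a comma not preceded by a backslash
    // Java literal: "(?<!\\\\),"  each regex backslash doubled for the string
    static String[] split(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // The Java literal "bob\\, sam,harry" is the text bob\, sam,harry
        for (String part : split("bob\\, sam,harry")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The escaped comma stays inside the first piece, backslash and all, so the backslash still has to be removed afterwards.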

Now we can simplify splitSmart() down to this:

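    public List<String> splitSmart(String s) {
        List<String> list = new ArrayList<String>();

        for (String x : s.split("(?<!\\\\),")) {
            list.add(x.replaceAll("\\\\(?=,)", ""));
        }
        return list;
    }

Inside the loop, we also remove a backslash that is followed by a comma, using positive lookahead!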


My co-worker was pleased to use the regex-totin’ split-n-replace version of the code.  We both agreed it looked cleaner and simpler, even with the somewhat odd-looking lookbehind syntax and the double-escaped backslashes.  For my part, I was happy to be able to apply my regex learning to help someone.  :)



Regex unit test suite?

I’m reading Mastering Regular Expressions, by Jeffrey Friedl.  Regular expressions come in a lot of different flavors and dialects.  In reading the book, I realized that when I’ve used grep for text searching, sometimes my regex has failed because I was using the + metacharacter, which grep doesn’t support in its default mode!  (I’m using Cygwin’s GNU grep 2.5.3; there a bare + is a literal character unless you write \+ or run grep -E.)

Wouldn’t it be nice to have a regex unit test suite that you could run a utility against and see for certain what metacharacters it supports?  I’m envisioning something sort of like a “configure” script, except instead of storing configuration settings it would just print them to the screen.

Some settings that might be useful:

  • Does this tool support the + metacharacter?
  • For grouping, should I use ( ) or \( \) ?
  • Does this tool support the {min,max} (or \{min,max\}) syntax?

Update 1/15/2009: I’m now in Chapter 4 of Jeff Friedl’s Mastering Regular Expressions, and by now I know of other things I’d like to test:

  • Lazy quantifiers: ??, *?, +?, {min,max}?
  • Possessive quantifiers: *+, ++, ?+, {min,max}+
  • Atomic grouping: (?>…)
  • Which kind of regex engine does the tool use: Traditional NFA, DFA, or POSIX NFA?


Though I’m calling it a suite, probably a fairly monolithic single file o’ tests would be sufficient.  It seems that a separate version of the suite would need to be made for each language, but all the same tests would be there in each version…
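Here is a rough sketch of what one such version might look like, written against Java's java.util.regex.  The class name, the probe strings, and the choice of features are all made up for illustration.  Each probe compiles a small pattern and checks what its first match is; the last one uses the nfa|nfa not test from Friedl's book to tell a Traditional NFA from the other engine types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexFeatureProbe {
    // True if regex compiles and its first match in input equals expected.
    static boolean supports(String regex, String input, String expected) {
        try {
            Matcher m = Pattern.compile(regex).matcher(input);
            return m.find() && m.group().equals(expected);
        } catch (PatternSyntaxException e) {
            return false; // this flavor rejects the syntax outright
        }
    }

    // Runs every probe and returns feature name -> supported?
    static Map<String, Boolean> probe() {
        Map<String, Boolean> r = new LinkedHashMap<>();
        r.put("+ quantifier", supports("ab+", "abbb", "abbb"));
        r.put("unescaped ( ) grouping", supports("(ab)+", "abab", "abab"));
        r.put("{min,max} interval", supports("ab{1,2}", "abbb", "abb"));
        r.put("lazy quantifier +?", supports("ab+?", "abbb", "ab"));
        r.put("possessive quantifier ++", supports("a++b", "aab", "aab"));
        r.put("atomic grouping (?>...)", supports("(?>a+)b", "aab", "aab"));
        // A Traditional NFA takes the first alternative that works ("nfa");
        // a DFA or POSIX NFA would return the longest match ("nfa not").
        r.put("Traditional NFA alternation",
                supports("nfa|nfa not", "nfa not", "nfa"));
        return r;
    }

    public static void main(String[] args) {
        probe().forEach((feature, ok) ->
                System.out.println((ok ? "yes  " : "no   ") + feature));
    }
}
```

A version for a command-line tool such as grep would have to shell out to the tool and inspect its output instead of catching an exception, but the list of probes could stay the same.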

Has anyone done something like this already, I wonder?