Automation-assisted manual transformations

I had been unit testing an SQL generator and had a bunch of tests that gave various input to the generator and tested its output against expected SQL.   The SQL in my tests looked like this:

Now I was ready to feed the SQL to a database engine to verify that it was validly formed.  I would generally use grep for this type of task; but here my SQL statements were formatted multiline for easier reading, and grep operates in a line-by-line mode.  There were over 100 test cases, so it was worth figuring out an automated solution.  I also wanted to avoid writing a single-purpose text-processing utility if possible.

I ended up writing down the following steps for myself:

  1. Turn the tests into a single line of text for ease of working with tools
    On command line: cat Test*.cpp | tr -d "\n\r" > all-one-line.out
  2. Discard everything but the queries, inserting a newline after each
    In editor (SlickEdit for me), open all-one-line.out and Replace All (using Perl-style regexes):
    .*?("(?:\( )*SELECT[^;]+?;)
  3. Clean up what the regex didn’t
    Delete the last line
  4. Get rid of quotes
    Replace \" with nothing
  5. Get rid of semicolons
    Replace ;$ with nothing
  6. Get rid of extra spaces
    Replace <space>+ with <space>
  7. Save the file in the editor
  8. Get rid of Oracle-specific tests
    grep --invert-match TO_DATE < all-one-line.out > all-one-line.sql
  9. Let cool for 5 minutes before serving
    Paste all-one-line.sql into MS SQL Server Management Studio and execute (with Results to Text)

This may look like a large number of steps, but I got to where I could run through them in about 30 seconds and test all 130  queries on the server.  Nice!

Future improvements

Once I had the ability to test my test output against the database server, I wanted to do that each time the tests’ expected results changed.  So where I had originally envisioned a single smoke test run, I ended up going through these automation-assisted manual steps ten or twenty times.  In retrospect, the single-purpose utility script would clearly have been the better approach after all.  I need to get more comfortable whipping up such scripts to lower the barrier to writing them when these occasions arise.
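
For what it's worth, here is a minimal sketch of what such a single-purpose utility might look like in Java. It folds steps 1 through 8 into one method. The class name, the method name, and the sample input are all made up for illustration, and the regex is the one from step 2 minus its leading .*? (find() does the skipping instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExtractor {
    // Step 2's pattern: a quote, optional "( " groups, SELECT, then everything up to the first semicolon.
    private static final Pattern QUERY = Pattern.compile("\"(?:\\( )*SELECT[^;]+?;");

    // Returns one cleaned-up query per element (hypothetical helper, not the original workflow).
    static List<String> extractQueries(String source) {
        // Step 1: turn the source into a single line.
        String oneLine = source.replaceAll("[\\r\\n]", "");
        List<String> queries = new ArrayList<>();
        Matcher m = QUERY.matcher(oneLine);
        while (m.find()) {
            String query = m.group()
                    .replace("\"", "")       // step 4: get rid of quotes
                    .replaceAll(";$", "")    // step 5: get rid of the trailing semicolon
                    .replaceAll(" +", " ")   // step 6: collapse runs of spaces
                    .trim();
            if (!query.contains("TO_DATE")) { // step 8: skip Oracle-specific queries
                queries.add(query);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Invented sample resembling a multiline SQL literal in a C++ test.
        String sample = "EXPECT_EQ(\n    \"SELECT a \"\n    \"FROM t;\", gen(x));\n"
                + "EXPECT_EQ(\"SELECT TO_DATE(d) FROM t;\", gen(y));\n";
        for (String query : extractQueries(sample)) {
            System.out.println(query); // prints: SELECT a FROM t
        }
    }
}
```

Step 3 (deleting the leftover last line) has no counterpart here because find() never produces that leftover in the first place.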

Twelve years in, I would think I would be at the top of my craft by now, but there are still things that seem pretty basic that I’m learning.  Hmmm…I wonder if life really does begin at 40?

Positive examples of positive and negative lookahead

Many tools that support regular expressions (regexes) support positive and negative lookahead.  What good is lookahead?  Why would you ever use it?

A Positive Example

Say I want to retrieve from a text document all the words that are immediately followed by a comma. We’ll use this example string:

What then, said I, shall I do?  You shan't, he replied, do anything.

As a first attempt, I could use this regular expression to get one or more word parts followed by a comma:


This yields four results over the string:

  1. then,
  2. I,
  3. shan't,
  4. replied,

Notice that this gets me the comma too, though, which I would then have to remove.  Wouldn’t it be better if we could express that we want to match a word that is followed by a comma without also matching the comma?

We can do that by modifying our regex as follows:


This matches groups of word characters that are followed by a comma, but because of the use of lookahead the comma is not part of the matched text (just as we want it not to be).  The modified regex results in these matches:

  1. then
  2. I
  3. shan't
  4. replied
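
A minimal Java sketch of the two approaches, for anyone who wants to try them. It assumes a "word part" is a word character or an apostrophe, written [\w'], which is an assumed character class that reproduces the matches listed above; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    static final String TEXT =
            "What then, said I, shall I do?  You shan't, he replied, do anything.";

    // Collects every match of the regex over the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The comma is consumed, so it is part of each match.
        System.out.println(findAll("[\\w']+,", TEXT));     // [then,, I,, shan't,, replied,]
        // The comma is only looked at, so it stays out of each match.
        System.out.println(findAll("[\\w']+(?=,)", TEXT)); // [then, I, shan't, replied]
    }
}
```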

A Positive Negative Example

What if I wanted to match all the words not followed by a comma?  I would use negative lookahead:


(Okay, negative lookahead and atomic grouping)

…to get these matches:

  1. What
  2. said
  3. shall
  4. I
  5. do
  6. You
  7. he
  8. do
  9. anything

Huh? Atomic Grouping?

Yep.  Otherwise you’ll get the following (unintended matches marked with an asterisk):

  1. What
  2. the *
  3. said
  4. shall
  5. I
  6. do
  7. You
  8. shan' *
  9. he
  10. replie *
  11. do
  12. anything

Without atomic grouping (the (?>) in the regex), when the regex engine sees that a match-in-progress comes up against a disqualifying comma, it simply backs off one letter to complete the match: the + in the regex gives the engine that flexibility.  Applying atomic grouping disallows this and says, don’t give up characters you’ve matched.
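
Here is a small Java sketch of the difference, again assuming [\w'] as the "word part" character class (an assumption chosen because it reproduces both lists above). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomicGroupDemo {
    static final String TEXT =
            "What then, said I, shall I do?  You shan't, he replied, do anything.";

    // Collects every match of the regex over the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Plain negative lookahead: the + backs off one character to dodge the comma.
        System.out.println(findAll("[\\w']+(?!,)", TEXT));
        // [What, the, said, shall, I, do, You, shan', he, replie, do, anything]

        // Atomic group: once the word is matched, none of it can be given back.
        System.out.println(findAll("(?>[\\w']+)(?!,)", TEXT));
        // [What, said, shall, I, do, You, he, do, anything]
    }
}
```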

When Lookahead Does You No Good

Lookahead doesn’t really help if you only care whether or not there was a match (that is, you don’t care what text was matched).  If  all I care about is whether or not the string contains any words followed by a comma, I would dump lookahead and use the simpler regex:



Thanks to Jeffrey Friedl for writing Mastering Regular Expressions, 3rd ed.; before reading it, I had not even heard of regex lookahead.

Also, thanks to Sergey Evdokimov for his online Regular Expression Editor.  Handy!


MRE3 errata

I recently finished reading Mastering Regular Expressions, 3rd ed., by Jeffrey E.F. Friedl.  In the main chapters (the first six) I didn’t find a single typo. (I thought the PHP example in the table at the bottom of page 190 was missing a backslash, but I was wrong!  See the explanation at the top of page 445.)  I found this lack of errors (at least ones that I could find) really impressive given the level of technical detail.  I did find a few minor errors in the tool-specific chapters (chapters 7-10) and sent them along to the author  — all part of my ploy to attain fame and fortune by pointing out others’ mistakes ;).  So that I don’t lose them, here they are:

From: Daniel Meyer
To: ‘Jeffrey Friedl’
Subject: MRE3 errata

Dear Mr. Friedl,

I have just finished reading Mastering Regular Expressions, 3rd ed., and I found it both helpful and enjoyable.  In the spirit of contributing to the betterment of the work, here are some things I believe are errata:

p.293, under Context bullet point: s/in which they’re in/which they’re in/
p.298, second paragraph: s/looking localization/looking at localization/
p.428 sidebar: Based on the VB.NET code, the output in both columns should have “Options are” instead of “Option are”
p.457, last paragraph: s/but since they retained/but since they are retained/
p.459, last paragraph: s/are note interpolated/are not interpolated/
p.462, code snippet at bottom of page: Shouldn’t both be $0 ?
p.467, after first code snippet: “The S pattern modifier is used for efficiency” (but the code snippet does not use the S pattern modifier)
p.473, first if-statement in preg_regex_to_pattern function: the comment says ‘/’ followed by ‘\’ or EOS (but the code matches ‘\’ followed by ‘/’ or EOS)

Thank you!

Daniel Meyer

Book review: Mastering Regular Expressions

This review covers Mastering Regular Expressions, 3rd ed., by Jeffrey E.F. Friedl.  Sebastopol, CA: O’Reilly Media, Inc., 2006.


Background: The Pain

Regular expressions:  They tend to be difficult to write and difficult to read… but it’s hard to get away from them.  They can help you manipulate text in sophisticated ways.

Often it is possible to avoid regular expressions and get by with simple (non-regex-enabled) text search and replace — and for several years I have done so where possible; but every so often the task is complex enough that I fumblingly try my hand at regular expressions again.  When it works, I am glad; when it doesn’t work, I’ve often not been clear about what went wrong.  But the persistence with which problems continue to arise for which regular expressions would be an elegant help has convinced me of this:  Regular expressions are a tool that a professional programmer should have in his or her toolbox.

Tired of my stabs in the dark, I decided it was time I work to gain a deep understanding of these regex beasts.


This is a book about mastering regular expressions.  It’s not primarily a regex quick-reference guide, nor a “Get up to speed on regular expressions in 24 hours” book.  Rather, it’s a steady climb from the chapter 1 bunny hills to chapter 6’s double black diamonds.

The book is divided into three sections: the introduction (chapters 1-3); the details (chapters 4-6); and tool-specific information (chapters 7-10: one chapter each for Perl, Java, .NET, and PHP).

As Friedl says at the beginning of chapters 8, 9, and 10:

[T]his book’s foremost intention is not to be a reference, but a detailed instruction on how to master regular expressions.

The author recommends reading the first six chapters before jumping into one of the tool-specific chapters.

The Good

  • Great topic coverage: Here are some of the topics covered by the book:
    • Greedy vs. lazy quantifiers and how they affect matching
    • Backtracking
    • How a regex engine’s “transmission bump-along” works
    • Comparison of the three types of regex engines: DFA, Traditional NFA, and POSIX NFA – and how the engine type affects matching and efficiency
    • In what ways the “language” used for a character class is different from the language used for the larger regex
    • How to be careful when using greedy quantifiers like .*
    • Atomic grouping and possessive matching
    • Non-capturing parentheses
    • Positive and negative lookahead and lookbehind (collectively, “lookaround”)
    • Differences among regex flavors
  • Grizzled wisdom: In addition to all the topics covered, Friedl frequently notes caveats such as “this chart is only the tip of the iceberg — for every feature shown, there are a dozen important issues that are overlooked” (p. 91) and then explains what he means.  He’s not only explaining the table at hand — he’s also helping you learn how to think about tables of regex engine features — to learn what information you can safely glean from such a table and what important details tend to be left unstated.
  • Attention to detail: (This is similar to the previous point, but seems distinct in my mind.)  Some technical books assert that if you do X, the system does Y, but leave out the rare (but important) cases where if you do X, the system doesn’t do Y, and the cases where the system does Y without you doing X, leaving you to stumble into those cases on your own.  This book does vanishingly little of that.  As just one example, on page 442 while he is explaining PHP’s m and D pattern modifiers, Friedl discusses the effect if you don’t use either modifier; the effect of the m modifier; and the effect of the D modifier.  But then, in characteristic form, he adds that “If both the m and D pattern modifiers are used, D is ignored.”  This type of attention to precision is bound to save the reader the research and debugging time of discovering such cases on their own.
  • Steady guidance along to the advanced topics: If Friedl started out with some of the advanced topics from chapters 5 and 6, I might have lost hope and given up.  Instead, he starts out simple and builds as he goes.  While I did not always take the time to fully understand each example in chapters 5 and 6, I found chapters 1-4 very approachable when taken in order.
  • Diagrams and tables: I found the diagrams sprinkled throughout, showing the text and which parts of the regex match where, very helpful.  There are others, too: the backtracking diagrams on pages 229, 230, and 231, and the tables on page 92 and in many other places.
  • Helpful cross-references everywhere: Whenever a concept is mentioned that is developed further somewhere else, the text points to the page number of that further development.  Also, at the beginning of some chapters there is a table-o-pointers (like a mini table of contents) to topics discussed in the chapter, for later quick reference.
  • Brain-jogging quizzes: I found the quizzes sprinkled through the book to be helpful in getting my brain going.  If the quizzes had been lumped together at the end of each chapter, I would have skipped them — but since they were few and sprinkled in among the reading at odd times, they piqued my interest, and my comprehension was aided by doing them.
  • Respectful tone: Though he starts from the beginning building a foundation to help the reader understand regular expressions, Friedl avoids a condescending tone.  He also avoids an apologetic “I want the reader to think I’m cool” tone when dealing with deep technical content.
  • Good craftsmanship: Each diagram, table, section, quiz — as well as the organization of the chapters and the progression of the examples — has a purpose and contributes toward the single purpose of helping the reader master regular expressions.  Even the unique typographic conventions contribute to understanding.  No diagram is there just for fluff.  This coherence is refreshing.
  • Now it’s a quick reference: Now that I have read the book, I can use it as a quick reference.

The Bad

  • Not a quick reference at first: I had barely started the book when a regex question came up in our development.  I thought that lookaround might be the solution to our problem, so I flipped to the point in the book where the concept is introduced (Adding Commas to a Number with Lookaround, pp. 59 and following)… and had little idea what I was reading.  I found I could not skip to later sections without going through the sections leading up to them.
  • Takes some commitment: Reading this book was less like using a vending machine and more like a two-month apprenticeship.
  • You have to do work: You have to think!


There’s territory to master, and I can’t fault the author for that.  This basically neutralizes the “bad” comments about the book.  I had fiddled around with regular expressions for several years without growing much better at them, but this book has launched me into being able to use regular expressions in much more advanced ways.  What a joy it was, for instance (I think I was probably in chapter 4 at the time), to be able to help a co-worker using my newly gained knowledge of lookbehind!

After studying through Friedl’s book, I’m finally not a regex beginner any longer.  I understand the territory better, and if a regex didn’t match like I expected I believe I could look into it and have a shot at figuring out the cause (before, my practice was more hack-and-hope).

I heartily recommend Mastering Regular Expressions for anyone who feels the time has finally come to really understand regexes.

A regex to selectively remove package from class names

I wanted to post to a forum the call stack at which I was experiencing a problem.  I had expanded the display in Eclipse to show fully-qualified class names:

Thread [main] (Suspended (breakpoint at line 66 in$BasicSetter))$BasicSetter.set(java.lang.Object, java.lang.Object, org.hibernate.engine.SessionFactoryImplementor) line: 66
org.hibernate.tuple.entity.PojoEntityTuplizer(org.hibernate.tuple.entity.AbstractEntityTuplizer).setIdentifier(java.lang.Object, line: 234
org.hibernate.persister.entity.SingleTableEntityPersister(org.hibernate.persister.entity.AbstractEntityPersister).setIdentifier(java.lang.Object,, org.hibernate.EntityMode) line: 3624
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.AbstractSaveEventListener).performSave(java.lang.Object,, org.hibernate.persister.entity.EntityPersister, boolean, java.lang.Object, org.hibernate.event.EventSource, boolean) line: 194
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.AbstractSaveEventListener).saveWithGeneratedId(java.lang.Object, java.lang.String, java.lang.Object, org.hibernate.event.EventSource, boolean) line: 144
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).saveWithGeneratedOrRequestedId(org.hibernate.event.SaveOrUpdateEvent) line: 210
org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(org.hibernate.event.SaveOrUpdateEvent) line: 56
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).entityIsTransient(org.hibernate.event.SaveOrUpdateEvent) line: 195
org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(org.hibernate.event.SaveOrUpdateEvent) line: 50
org.hibernate.event.def.DefaultSaveEventListener(org.hibernate.event.def.DefaultSaveOrUpdateEventListener).onSaveOrUpdate(org.hibernate.event.SaveOrUpdateEvent) line: 93
org.hibernate.impl.SessionImpl.fireSave(org.hibernate.event.SaveOrUpdateEvent) line: 562, java.lang.Object) line: 550, java.lang.Object) line: 67
org.jboss.envers.synchronization.VersionsSync.executeInSession(org.hibernate.Session) line: 120
org.jboss.envers.synchronization.VersionsSync.beforeCompletion() line: 135 line: 366 line: 142 line: 96
org.springframework.transaction.jta.JtaTransactionManager.doCommit( line: 1028
org.springframework.transaction.jta.JtaTransactionManager( line: 732
org.springframework.transaction.jta.JtaTransactionManager( line: 701
org.springframework.transaction.interceptor.TransactionInterceptor(org.springframework.transaction.interceptor.TransactionAspectSupport).commitTransactionAfterReturning(org.springframework.transaction.interceptor.TransactionAspectSupport$TransactionInfo) line: 321
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(org.aopalliance.intercept.MethodInvocation) line: 116
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(org.springframework.aop.framework.ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(org.aopalliance.intercept.MethodInvocation) line: 89
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(org.springframework.aop.framework.ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(java.lang.Object, java.lang.reflect.Method, java.lang.Object[], net.sf.cglib.proxy.MethodProxy) line: 635
com.ontsys.db.GreetingSetDAO$$EnhancerByCGLIB$$cd993582.create(com.ontsys.db.GreetingSetPO) line: not available
com.ontsys.db.EnversWithCollectionsTest.testComplexCreate() line: 112
sun.reflect.NativeMethodAccessorImpl.invoke0(java.lang.reflect.Method, java.lang.Object, java.lang.Object[]) line: not available [native method]
sun.reflect.NativeMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) line: 39
sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) line: 25
java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object...) line: 597
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall() line: 44
org.junit.runners.model.FrameworkMethod$1(org.junit.internal.runners.model.ReflectiveCallable).run() line: 15
org.junit.runners.model.FrameworkMethod.invokeExplosively(java.lang.Object, java.lang.Object...) line: 41
org.junit.internal.runners.statements.InvokeMethod.evaluate() line: 20
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner.runChild(org.junit.runners.model.FrameworkMethod, org.junit.runner.notification.RunNotifier) line: 73
org.junit.runners.BlockJUnit4ClassRunner.runChild(java.lang.Object, org.junit.runner.notification.RunNotifier) line: 46
org.junit.runners.BlockJUnit4ClassRunner(org.junit.runners.ParentRunner<T>).runChildren(org.junit.runner.notification.RunNotifier) line: 180
org.junit.runners.ParentRunner<T>.access$000(org.junit.runners.ParentRunner, org.junit.runner.notification.RunNotifier) line: 41
org.junit.runners.ParentRunner$1.evaluate() line: 173
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner(org.junit.runners.ParentRunner<T>).run(org.junit.runner.notification.RunNotifier) line: 220
org.eclipse.jdt.internal.junit4.runner.JUnit4TestMethodReference(org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference).run(org.eclipse.jdt.internal.junit.runner.TestExecution) line: 38[]) line: 38
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(java.lang.String[], java.lang.String, org.eclipse.jdt.internal.junit.runner.TestExecution) line: 460
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(org.eclipse.jdt.internal.junit.runner.TestExecution) line: 673 line: 386
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(java.lang.String[]) line: 196

This made it easy to see when the thread of control was in JUnit, Spring, Bitronix, Envers, Hibernate, etc…

But it was way too busy.  I decided that I wanted to keep the fully qualified method names but drop the package names off the method parameter class names.

The Show Qualified Names option in Eclipse (3.3.2 anyway, which I’m using) either shows ’em all or hides ’em all; there doesn’t seem to be a directly supported way to get the view I want there.

I decided to try crafting a regular expression to do the replacement, and came up with this:

   (?<=\(|, )[a-z0-9.$]+([A-Z]\w+)

I used Sergey's regular expression tester Eclipse plug-in to try out my regular expression and visually see if it was matching what I expected (actually I used it to come up with the above regex):

[Screenshot: regular-expression-tester]

Here's what the regex does, bit by bit:

   (?<=\(|, )

This part uses lookbehind (?<=) to say "match if just before the current position is a left-parenthesis or a comma+space."

   [a-z0-9.$]+

This section matches the package part of a fully-qualified class name: one or more lowercase letters, digits, periods, or dollar signs.

   ([A-Z]\w+)

This final section matches only a capital letter, followed by one or more “word”-constitutin’ characters.  This matches the class name without the package.  We enclose this section in capturing parentheses, allowing us to replace the matched text with simply…


Performing the replacement yields this still-thick-but-not-quite-as-bad version of the call stack:

Thread [main] (Suspended (breakpoint at line 66 in$BasicSetter))$BasicSetter.set(Object, Object, SessionFactoryImplementor) line: 66
org.hibernate.tuple.entity.PojoEntityTuplizer(AbstractEntityTuplizer).setIdentifier(Object, Serializable) line: 234
org.hibernate.persister.entity.SingleTableEntityPersister(AbstractEntityPersister).setIdentifier(Object, Serializable, EntityMode) line: 3624
org.hibernate.event.def.DefaultSaveEventListener(AbstractSaveEventListener).performSave(Object, Serializable, EntityPersister, boolean, Object, EventSource, boolean) line: 194
org.hibernate.event.def.DefaultSaveEventListener(AbstractSaveEventListener).saveWithGeneratedId(Object, String, Object, EventSource, boolean) line: 144
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).saveWithGeneratedOrRequestedId(SaveOrUpdateEvent) line: 210
org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(SaveOrUpdateEvent) line: 56
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).entityIsTransient(SaveOrUpdateEvent) line: 195
org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(SaveOrUpdateEvent) line: 50
org.hibernate.event.def.DefaultSaveEventListener(DefaultSaveOrUpdateEventListener).onSaveOrUpdate(SaveOrUpdateEvent) line: 93
org.hibernate.impl.SessionImpl.fireSave(SaveOrUpdateEvent) line: 562, Object) line: 550, Object) line: 67
org.jboss.envers.synchronization.VersionsSync.executeInSession(Session) line: 120
org.jboss.envers.synchronization.VersionsSync.beforeCompletion() line: 135 line: 366 line: 142 line: 96
org.springframework.transaction.jta.JtaTransactionManager.doCommit(DefaultTransactionStatus) line: 1028
org.springframework.transaction.jta.JtaTransactionManager(AbstractPlatformTransactionManager).processCommit(DefaultTransactionStatus) line: 732
org.springframework.transaction.jta.JtaTransactionManager(AbstractPlatformTransactionManager).commit(TransactionStatus) line: 701
org.springframework.transaction.interceptor.TransactionInterceptor(TransactionAspectSupport).commitTransactionAfterReturning(TransactionAspectSupport$TransactionInfo) line: 321
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(MethodInvocation) line: 116
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(MethodInvocation) line: 89
org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation(ReflectiveMethodInvocation).proceed() line: 171
org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Object, Method, Object[], MethodProxy) line: 635
com.ontsys.db.GreetingSetDAO$$EnhancerByCGLIB$$cd993582.create(GreetingSetPO) line: not available
com.ontsys.db.EnversWithCollectionsTest.testComplexCreate() line: 112
sun.reflect.NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
sun.reflect.NativeMethodAccessorImpl.invoke(Object, Object[]) line: 39
sun.reflect.DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
java.lang.reflect.Method.invoke(Object, Object...) line: 597
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall() line: 44
org.junit.runners.model.FrameworkMethod$1(ReflectiveCallable).run() line: 15
org.junit.runners.model.FrameworkMethod.invokeExplosively(Object, Object...) line: 41
org.junit.internal.runners.statements.InvokeMethod.evaluate() line: 20
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner.runChild(FrameworkMethod, RunNotifier) line: 73
org.junit.runners.BlockJUnit4ClassRunner.runChild(Object, RunNotifier) line: 46
org.junit.runners.BlockJUnit4ClassRunner(ParentRunner<T>).runChildren(RunNotifier) line: 180
org.junit.runners.ParentRunner<T>.access$000(ParentRunner, RunNotifier) line: 41
org.junit.runners.ParentRunner$1.evaluate() line: 173
org.junit.internal.runners.statements.RunBefores.evaluate() line: 28
org.junit.internal.runners.statements.RunAfters.evaluate() line: 31
org.junit.runners.BlockJUnit4ClassRunner(ParentRunner<T>).run(RunNotifier) line: 220
org.eclipse.jdt.internal.junit4.runner.JUnit4TestMethodReference(JUnit4TestReference).run(TestExecution) line: 38[]) line: 38
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(String[], String, TestExecution) line: 460
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(TestExecution) line: 673 line: 386
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(String[]) line: 196
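
The same replacement can be sketched in Java, whose regex flavor accepts this lookbehind. The class and method names below are made up for illustration; "$1" is Java's way of referring to the first capturing group in a replacement string.

```java
import java.util.regex.Pattern;

class PackageStripper {
    // Lookbehind for "(" or ", ", then the package part, then the captured class name.
    private static final Pattern PARAM_CLASS =
            Pattern.compile("(?<=\\(|, )[a-z0-9.$]+([A-Z]\\w+)");

    // Replaces each package-qualified name inside the parentheses with just its class name.
    static String stripParamPackages(String frame) {
        return PARAM_CLASS.matcher(frame).replaceAll("$1");
    }

    public static void main(String[] args) {
        String frame = "org.junit.runners.BlockJUnit4ClassRunner.runChild("
                + "java.lang.Object, org.junit.runner.notification.RunNotifier) line: 46";
        System.out.println(stripParamPackages(frame));
        // org.junit.runners.BlockJUnit4ClassRunner.runChild(Object, RunNotifier) line: 46
    }
}
```

The leading class name on each line is left alone because nothing before it satisfies the lookbehind.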

A simplifying regex

Regular expressions can get pretty complex.  I’m in the process of trying to master them, but from time to time I worry that I should just give up on the concept — that perhaps there are generally easier ways of accomplishing the same thing, and regexes are a waste of time.

Then I see an example like today’s that renews my belief that  it’s worthwhile to master these regexes:


A co-worker had a method that received a String of comma-separated values.  Inside the method, he did a split(), with a comma as the delimiter. A simplified version of the method in question:

public class Splitter {
    public List<String> splitBasic(String s) {
        // Arrays and List come from java.util
        return Arrays.asList(s.split(","));
    }
}

…and its test:

public class SplitterTest {
	public void testSplitBasic() {
		List<String> splitString = new Splitter().splitBasic("bob, sam,harry");
		assertEquals(3, splitString.size());
		assertEquals("bob", splitString.get(0));
		assertEquals(" sam", splitString.get(1));
		assertEquals("harry", splitString.get(2));
	}
}

This test passes.  So far, so good.

The Problem

He wanted to enhance the split method to support escaping a comma within the string.  When the split method encountered an escaped comma (“\,”), it should not consider the comma a delimiter (and it should eat the backslash).

So we would want this test to pass:

        List<String> splitString = new Splitter().splitSmart("bob\\, sam,harry");
        assertEquals(2, splitString.size());
        assertEquals("bob, sam", splitString.get(0));
        assertEquals("harry", splitString.get(1));

splitSmart(): a Non-Regex Implementation

He had a working implementation that looked something like this:

    public List<String> splitSmart(String s) {
        List<String> list = new ArrayList<String>();

        String concat = "";
        boolean concatenating = false;
        for (String x : s.split(",")) {
            if (x.endsWith("\\")) {
                // Escaped comma: drop the backslash, keep the comma, keep accumulating
                concat += x.substring(0, x.length() - 1) + ",";
                concatenating = true;
            } else if (concatenating) {
                // Last piece of an escaped run: finish it and store it
                concat += x;
                list.add(concat);
                concat = "";
                concatenating = false;
            } else {
                list.add(x);
                concatenating = false;
            }
        }
        return list;
    }

This makes the test pass, but my co-worker was not happy with it.  Too clunky.

A Simplification using a Regex

I had to think about it for a few minutes, but eventually it came to me that what we wanted as a delimiter was a comma not preceded by a backslash.  Looks like a great opportunity to use… negative lookbehind!

The regex way to say “a comma not preceded by a backslash” is:

   (?<!\\),

(?<!X) is the regex way of saying “not preceded by X” (in this case, a backslash, which has to be escaped) and the comma is the thing to match.

Now we can simplify splitSmart() down to this:

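    public List<String> splitSmart(String s) {
        List<String> list = new ArrayList<String>();

        for (String x : s.split("(?<!\\\\),")) {
            list.add(x.replaceAll("\\\\(?=,)", ""));
        }
        return list;
    }

The replaceAll() call is there to remove a backslash that is followed by a comma, using positive lookahead!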


My co-worker was pleased to use the regex-totin’ split-n-replace version of the code.  We both agreed it looked cleaner and simpler, even with the somewhat odd-looking lookbehind syntax and the double-escaped backslashes.  For my part, I was happy to be able to apply my regex learning to help someone.  :)

Regex unit test suite?

I’m reading Mastering Regular Expressions, by Jeffrey Friedl.  Regular expressions come in a lot of different flavors and dialects.  In reading the book, I realized that when I’ve used grep for text searching, sometimes my regex has failed because I was using the + metacharacter, which grep doesn’t treat as a quantifier in its default basic-regex mode! (I’m using Cygwin’s GNU grep 2.5.3, which wants \+ there, or egrep / grep -E for a bare +.)

Wouldn’t it be nice to have a regex unit test suite that you could run a utility against and see for certain what metacharacters it supports?  I’m envisioning something sort of like a “configure” script, except instead of storing configuration settings it would just print them to the screen.

Some settings that might be useful:

  • Does this tool support the + metacharacter?
  • For grouping, should I use ( ) or \( \) ?
  • Does this tool support the {min,max} (or \{min,max\}) syntax?

[Update 1/15/2009: I’m now in Chapter 4 of Jeff Friedl’s Mastering Regular Expressions, and by now I know of other things I’d like to test:

  • Lazy quantifiers: ??, *?, +?, {min,max}?
  • Possessive quantifiers: *+, ++, ?+, {min,max}+
  • Atomic grouping: (?>…)
  • Which kind of regex engine does the tool use: Traditional NFA, DFA, or POSIX NFA?]


Though I’m calling it a suite, probably a fairly monolithic single file o’ tests would be sufficient.  It seems that a separate version of the suite would need to be made for each language, but all the same tests would be there in each version…

Has anyone done something like this already, I wonder?
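
As a rough illustration of the idea, here is a minimal sketch that probes one "tool", Java's java.util.regex, for the features listed above. Every name in it is hypothetical, and the probes are simple behavioral checks rather than a thorough suite. The engine check is an alternation test along the lines of the one in the book: a Traditional NFA settles for the first alternative that works.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexFeatureProbe {
    // True if the tool accepts the regex at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // First match of regex in text, or null if there is none (or the regex is rejected).
    static String firstMatch(String regex, String text) {
        if (!compiles(regex)) {
            return null;
        }
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    static Map<String, Boolean> probe() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("+ quantifier", "aaa".equals(firstMatch("a+", "aaa")));
        results.put("( ) grouping", "abab".equals(firstMatch("(ab)+", "abab")));
        results.put("\\( \\) grouping", "abab".equals(firstMatch("\\(ab\\)+", "abab")));
        results.put("{min,max} interval", "aaa".equals(firstMatch("a{2,3}", "aaaa")));
        // A lazy quantifier stops at the shortest match.
        results.put("lazy quantifiers", "a".equals(firstMatch("a+?", "aaa")));
        // Possessive and atomic forms refuse to give back an 'a', so these must NOT match.
        results.put("possessive quantifiers",
                compiles("a++a") && firstMatch("a++a", "aaa") == null);
        results.put("atomic grouping",
                compiles("(?>a+)a") && firstMatch("(?>a+)a", "aaa") == null);
        // A Traditional NFA takes the first alternative that works instead of the longest.
        results.put("Traditional NFA", "nfa".equals(firstMatch("nfa|nfa not", "nfa not")));
        return results;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> entry : probe().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

For java.util.regex every probe reports true except the \( \) grouping one, since Java treats \( as a literal parenthesis. A version for grep or sed would need to shell out to the tool instead of calling Pattern.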

Sometimes there’s an easier way

I was working to get snapshots of the Hibernate AnnotationConfiguration properties in effect when Hibernate local transactions are in use versus when a JTA transaction manager is in use, and halfway through I found there was a much easier way to capture my data.

What I did First

  1. While a debugging session was stopped at a breakpoint at line 732 of org.springframework.orm.hibernate3.LocalSessionFactoryBean (In buildSessionFactory() where the AnnotationConfiguration is done being built), I went to the Expressions view and selected the AnnotationConfiguration named “config”;
  2. Opened config’s properties field and manually expanded all 116 nodes, exposing their key/value pairs;
  3. Selected the 116 nodes-with-key/value-pairs and chose Copy Expressions from the context menu;
  4. Pasted into Notepad and manually removed the “[0…99]” and “[100…116]” lines;
  5. Saved the modified file as hibernate-annotation-config-jta.txt
  6. Ran the following commands:
    egrep --invert-match "\[" hibernate-annotation-config-jta.txt > keys-values.txt
    egrep key= keys-values.txt | sed 's/[ \t]*key= \"\(.*\)\".*$/\1/' > keys.txt
    egrep value= keys-values.txt | sed 's/[ \t]*value= \"\(.*\)\".*$/\1/' > values.txt
    paste --delimiter== keys.txt values.txt > keys-values-jta.txt

This worked — it got all 116 entries into key=value format, one key/value pair per line — and I prepared to do it again while running my test in Hibernate mode.
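For comparison, the four commands in step 6 could probably be folded into a single pass.  This is only a sketch: to_pairs is a made-up name, and it assumes the pasted text has the same shape my sed patterns above expect (a key= "…" line followed by its value= "…" line).

```shell
# to_pairs: read the pasted Expressions text and print key=value lines.
# Assumes every key line is followed by its matching value line.
to_pairs() {
  awk '
    /^[ \t]*key= "/ {
      sub(/^[ \t]*key= "/, "")    # drop everything up to the opening quote
      sub(/"[^"]*$/, "")          # drop the last quote and whatever follows it
      key = $0
      next
    }
    /^[ \t]*value= "/ {
      sub(/^[ \t]*value= "/, "")
      sub(/"[^"]*$/, "")
      print key "=" $0
    }
  ' "$@"
}

# Made-up sample input, not copied from a real debugging session:
printf '\t[0]\t(id=1)\n\t\tkey= "hibernate.show_sql" (id=2)\n\t\tvalue= "true" (id=3)\n' | to_pairs
# prints: hibernate.show_sql=true
```

Lines that are neither key nor value lines (such as the [0] entries) simply fall through, so the separate --invert-match step is not needed here.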

Then I noticed…

Then I saw another pane in the Expressions view I hadn’t really noticed before, on the right hand side.  When I clicked on one of the AnnotationConfig’s properties in the Expressions window (even without expanding that entry to show the key and value), I saw the entry in key=value format over in that right hand pane.  And… if I selected all 116 nodes in the left pane of the Expressions view, all 116 key=value pairs showed up in the right pane, ready for me to select and copy!

My new workflow, then, is:

  1. While a debugging session is stopped at a breakpoint at line 732 of org.springframework.orm.hibernate3.LocalSessionFactoryBean (In buildSessionFactory() where the AnnotationConfiguration is done being built), go to the Expressions view, select the AnnotationConfiguration named “config”, and within that select the properties field;
  2. Make sure the “Show Logical Structure” button is pressed;
  3. Expand the properties field so that you can see the individual [0], [1], [2]… entries;
  4. Select all these entries;
  5. In the Expressions view’s right hand pane, Select All and copy.

Those are nicer steps than what I did first!  Eclipse comes through again!

SVN Searcher

Last time, we ended up with this, our pinnacle of achievement, to find all classes in our framework layer that instantiate non-PO public classes:

for c in `find -wholename */main/*.java | xargs grep "public class [A-Z]" | sed -e "s/.*public class \([^ ]*\) .*/\1/"`; do find -wholename */main/*.java | xargs egrep " = new $c" | grep --invert-match " = new [A-Za-z]*PO"; done

With SVN Searcher, we can replace that with this search:

+FileBody:"public class" AND +FileBody:" = new " AND +Name:/src/main/java/


Let’s assess the pros and cons of this:

  • + It doesn’t require checking out the world;
  • + It completes in seconds rather than minutes (though we could optimize our script to not do the same directory search multiple times, which would help it);
  • + It doesn’t require installing Cygwin (I doubt many of our developers have it installed);
  • – This doesn’t allow us to strain out the PO instantiations, because SVN Searcher’s underlying search technology, Lucene, doesn’t support wildcards in phrases:

    Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries)…Note: You cannot use a * or ? symbol as the first character of a search.  (from the Lucene 2.4.0 Query Parser Syntax document)

    …so all those’ll show up too;

  • – In addition to the limitations on where a wildcard can be used, the Lucene query parsing syntax doesn’t support full regular expression searching.
  • 0 This one is a weakness of both approaches: I think SVN Searcher uses a non-real-time indexing approach, so the index you’re searching will not be up to date if anyone’s committed since the last time the index was rebuilt; but with the download-the-world approach you similarly always have to remember to do an svn update or you’re searching an out-of-date working directory, and it’s still not real-time (you never know when someone might have just committed a new thing that depends on the thing you’re querying about…)


SVN Searcher could be a good first resort, for when I want to know where in the system a certain class is being used.  But I think there will be times when I need more power than it offers.

The download-the-world-and-execute-a-line-of-Greek approach is too clunky for many to feel comfortable with, and it’s going to get less workable for me too as we start having branches and tags in the repository.  I need a better solution too.

Finding the magic non-beans

Someone asked me yesterday what examples we have in the framework services “layer” where we provide functionality through a class that is not exposed as a Spring bean.  I could think of just a couple of examples off the top of my head, but both were kind of odd ones.  I wanted to search the framework services codebase to see if we have other uses of non-beans.

I reasoned that a distinctive characteristic of using a non-bean X is that you tend to see … = new X(... in code using the class.  Here’s what I decided I wanted:

  1. For *.java in the framework services code base, show me the names of the classes that are public (e.g., search for “public class”)
  2. For each class C in this list, search our entire code base for ” = new C(“

1. Finding the Public Classes

1.1. Listing the .java files

From the Windows command prompt I navigated to the directory of my workspace and issued a simple

dir /s /b *.java

This output the filenames in C:… format, which would normally be fine, except that the Cygwin utilities I use for this type of file processing don’t deal well with the backslashes.  They expect a more Unixy output.  I can provide that by using the following instead:

find -name *.java

This output the same filenames in ./…/path/to/the/ format.

1.2. Whittling away the non-public classes

Next, I wanted to search each file in the above list (281 java files) for the string “public class”.  (I could probably have skipped this step and just chopped the .java from the filenames to yield the class names except that we have several package-private classes that I don’t care about for purposes of this search.)

Xargs to the rescue!

find -name *.java | xargs grep "public class"

This pares our list down to 172 classes.

1.3. Whittling away the test classes

I notice as I examine the output from step 1.2 that several of the public classes are in …/src/test/java/… .  For purposes of this search, I don’t care about those — I only want to see the public classes in production code.  Without bothering to spend time reading the find utility’s manpage,  I modify the search to be like this:

find -name */main/*.java | xargs grep "public class"

At this point, I get one of the most helpful warnings I’ve ever seen (thanks, findutils team!):

find: warning: Unix filenames usually don't contain slashes (though pathnames do).  That means that '-name `*/main/*.java'' will probably evaluate to false all the time on this system.  You might find the '-wholename' test more useful, or perhaps '-samefile'.
Alternatively, if you are using GNU grep, you could use 'find ... -print0 | grep -FzZ `*/main/*.java''.

Sure enough, no results found.  I fix the search to use the -wholename test, as the warning suggests:

find -wholename */main/*.java | xargs grep "public class"

This works, and now my list is down to 79 public classes, all in src/main/java.
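One caution about the command above: the pattern is unquoted, so it only reaches find intact when the shell finds nothing in the current directory to expand it against.  A slightly more defensive sketch (list_main_sources is a made-up name) quotes the pattern and uses -path, which GNU find documents as the same test as -wholename:

```shell
# list_main_sources [ROOT]
# Lists production Java sources under ROOT (default: current directory).
# Quoting the pattern keeps the shell from expanding the wildcards itself.
list_main_sources() {
  find "${1:-.}" -path '*/main/*.java'
}

# Usage, equivalent in intent to the command above:
#   list_main_sources . | xargs grep "public class"
```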

1.4. Just the class names

Actually, what I have is 79 lines like this:

./exceptions/src/main/java/com/ontsys/fw/exception/ class InvalidDataImpl implements InvalidData {

I want just the class names.

A while later…

Here’s our command line now (broken into its three parts for readability; it’s all one line when I run it):

find -wholename */main/*.java
| xargs grep "public class "
| sed -e "s/.*public class \([^ ]*\) .*/\1/"

To put into words what this is doing:

  • Line 1 lists all production Java files (excludes src/test/java/)
  • Line 2 looks in each file listed and prints the lines that contain the string “public class “.
  • Line 3 looks for public class X (where X is a bunch of non-space characters) and prints just the class name.
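To see the sed step on its own, here it is applied to a single made-up line of grep output (the path and class names are hypothetical).  The -n option and the p flag are my additions so that lines where nothing was substituted are not printed.

```shell
# class_names: reduce "…public class X …" lines on stdin to just X.
# \( \) captures the run of non-space characters after "public class ",
# and \1 in the replacement refers back to that capture.
class_names() {
  sed -n -e 's/.*public class \([^ ]*\) .*/\1/p'
}

printf '%s\n' './demo/src/main/java/Foo.java:public class Foo implements Bar {' | class_names
# prints: Foo
```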

1.5. Just the classnames: a minor tweak

The regular expression we’re passing to grep in line 2 above has matched a Javadoc comment line in which the phrase “public class ” was used.  Let’s tweak line 2 to only match lines that have a capital letter after public class:

find -wholename */main/*.java | xargs grep "public class [A-Z]" | sed -e "s/.*public class \([^ ]*\) .*/\1/"

So: now we have 78 class names printing out.  Now to see who instantiates these.

2. Who Instantiates These?

Now, for each class C in our list (78 of ’em), I want to know all the places in our codebase where ” = new C” appears.

2.1. Who-all instantiates these?

Here’s an approach (O(N^2) at best, but I find it’s often easier for me to make it work fast once it works at all…):


bash-3.2$ for c in `find -wholename */main/*.java | xargs grep "public class [A-Z]" | sed -e "s/.*public class \([^ ]*\) .*/\1/"`; do find -wholename */main/*.java | xargs grep " = new $c"; done

Notice that we’re running bash to get the backquote goodness.

For each class in our list-o-78, we search the working directory for production java source files that instantiate that class directly.  This took five-and-a-half minutes on my PC, not searching the whole codebase (which I guess I’d have to check out in its entirety…hmm…) but just framework services.

2.2. Who that we care about instantiates these?

The results generated by step 2.1 consist mostly of instantiations of our POs (persistence objects?).  I would like to remove instantiations of POs from our results and see what’s left.

I can tell a PO because its classname always ends with PO.  So here’s our updated command line:

bash-3.2$ for c in `find -wholename */main/*.java | xargs grep "public class [A-Z]" | sed -e "s/.*public class \([^ ]*\) .*/\1/"`; do find -wholename */main/*.java | xargs egrep " = new $c" | grep --invert-match " = new [A-Za-z]*PO"; done

This gets us just the five interesting instantiations.
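The loop above walks the directory tree once per class.  One way it might be sped up is to walk the tree once, turn the class list into a file of fixed strings, and let a single grep search for all of them.  This is an untimed sketch; find_non_po_instantiations is a made-up name, and it assumes no file name contains whitespace.

```shell
# find_non_po_instantiations [ROOT]
# Prints lines in production sources that contain " = new X" for some public
# class X, leaving out instantiations of classes whose names end in PO.
find_non_po_instantiations() {
  root=${1:-.}
  files=$(mktemp) || return 1
  patterns=$(mktemp) || return 1

  # One directory walk, reused by both greps below.
  find "$root" -path '*/main/*.java' > "$files"

  # Public class names, each rewritten as the fixed string " = new Name".
  xargs grep 'public class [A-Z]' < "$files" |
    sed -e 's/.*public class \([^ ]*\) .*/ = new \1/' > "$patterns"

  # One grep for all names at once, then strain out the POs.
  # /dev/null makes grep print file names even when only one file is listed.
  xargs grep -F -f "$patterns" /dev/null < "$files" |
    grep -v ' = new [A-Za-z]*PO'

  rm -f "$files" "$patterns"
}

# Usage: find_non_po_instantiations .
```

As in the original loop, a class name also matches when it is a prefix of a longer one (Foo matches " = new FooBar"), so the results still need a glance by eye.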

Next time: How to avoid all this using SVN Searcher!