Regular expressions can get pretty complex. I’m in the process of trying to master them, but from time to time I worry that I should just give up on the concept: perhaps there are generally easier ways of accomplishing the same thing, and regexes are a waste of time.
Then I see an example like today’s that renews my belief that it’s worthwhile to master these regexes:
Background
A co-worker had a method that received a String of comma-separated values. Inside the method, he called split() with a comma as the delimiter. A simplified version of the class in question:
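import java.util.Arrays;
import java.util.List;

public class Splitter {
    // Splits on every comma; whitespace around the pieces is preserved.
    public List<String> splitBasic(String s) {
        return Arrays.asList(s.split(","));
    }
}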
…and its test:
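// Imports assume JUnit 4; adjust them if you use a different test framework.
import static org.junit.Assert.assertEquals;
import java.util.List;
import org.junit.Test;

public class SplitterTest {
    @Test
    public void testSplitBasic() {
        List<String> splitString = new Splitter().splitBasic("bob, sam,harry");
        assertEquals(3, splitString.size());
        assertEquals("bob", splitString.get(0));
        assertEquals(" sam", splitString.get(1));
        assertEquals("harry", splitString.get(2));
    }
}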
This test passes. So far, so good.
The Problem
He wanted to enhance the split method to support escaping a comma within the string. When the split method encountered an escaped comma (“\,”), it should not treat the comma as a delimiter (and it should eat the backslash).
So we would want this test to pass:
List<String> splitString = new Splitter().splitSmart("bob\\, sam,harry");
assertEquals(2, splitString.size());
assertEquals("bob, sam", splitString.get(0));
assertEquals("harry", splitString.get(1));
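For contrast, here is a minimal sketch of what the plain comma split does with that same input. The class NaiveSplitDemo and its naiveSplit() helper are names made up for this illustration; they are not part of my co-worker's code.

```java
public class NaiveSplitDemo {
    // Hypothetical helper: the original comma-only split, with no escape handling.
    static String[] naiveSplit(String s) {
        return s.split(",");
    }

    public static void main(String[] args) {
        // Runtime value of the string: bob\, sam,harry
        String[] pieces = naiveSplit("bob\\, sam,harry");
        System.out.println(pieces.length); // prints 3
        for (String piece : pieces) {
            // Brackets make the leading space and the trailing backslash visible.
            System.out.println("[" + piece + "]");
        }
    }
}
```

The escaped comma is treated like any other delimiter, so we get three pieces instead of two, and the first one still carries its trailing backslash.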
splitSmart(): a Non-Regex Implementation
He had a working implementation that looked something like this:
public List<String> splitSmart(String s) {
    List<String> list = new ArrayList<String>();
    String concat = "";
    boolean concatenating = false;
    for (String x : s.split(",")) {
        if (x.endsWith("\\")) {
            // Escaped comma: drop the backslash, restore the comma, keep accumulating.
            concat += x.substring(0, x.length() - 1) + ",";
            concatenating = true;
        } else if (concatenating) {
            // Last piece of an escaped run: emit the accumulated value.
            list.add(concat + x);
            concat = ""; // reset so a later escaped comma starts from scratch
            concatenating = false;
        } else {
            list.add(x);
        }
    }
    return list;
}
This makes the test pass, but my co-worker was not happy with it. Too clunky.
A Simplification using a Regex
I had to think about it for a few minutes, but eventually it came to me that what we wanted as a delimiter was a comma not preceded by a backslash. Looks like a great opportunity to use… negative lookbehind!
The regex way to say “a comma not preceded by a backslash” is:
(?<!\\),
…where (?<!X) is the regex way of saying “not preceded by X” (in this case, a backslash, which has to be escaped) and the comma is the thing to match.
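To see the lookbehind at work on its own, here is a small sketch that lists the positions of the commas the pattern accepts as delimiters. The class LookbehindDemo and the method delimiterPositions() are hypothetical names for this example; Pattern and Matcher are the standard java.util.regex classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // The regex (?<!\\), written as a Java string literal.
    static final Pattern UNESCAPED_COMMA = Pattern.compile("(?<!\\\\),");

    // Returns the index of every comma that is NOT preceded by a backslash.
    static List<Integer> delimiterPositions(String s) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = UNESCAPED_COMMA.matcher(s);
        while (m.find()) {
            // The lookbehind is zero-width, so the match is just the comma itself.
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Runtime value of the string: bob\, sam,harry
        String input = "bob\\, sam,harry";
        System.out.println(delimiterPositions(input)); // prints [9]
    }
}
```

The comma at index 4 follows a backslash, so the lookbehind rejects it; only the comma at index 9 counts as a delimiter.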
Now we can simplify splitSmart() down to this:
public List<String> splitSmart(String s) {
    List<String> list = new ArrayList<String>();
    for (String x : s.split("(?<!\\\\),")) {
        list.add(x.replace("\\,", ","));
    }
    return list;
}
…and the tests still pass.
Note the double-escaped backslash in the split() call: the backslash has to be escaped once because it is a regex metacharacter, and then each of those two backslashes has to be escaped again because the backslash is also the escape character in Java string literals, which gives four backslashes in the source. The x.replace() line then does a literal (non-regex) replacement of each backslash-comma pair with a comma; it could equivalently have been written
x.replaceAll("\\\\(?=,)", "")
…to more explicitly remove a backslash that is followed by a comma, using positive lookahead!
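A quick sketch to check both claims: that the four source backslashes collapse to two by the time the regex engine sees them, and that the literal replace() and the lookahead replaceAll() produce the same result. The class EscapeLayersDemo and its two helper methods are names invented for this illustration.

```java
public class EscapeLayersDemo {
    // Literal replacement: no regex involved, "\\," is the two characters \ and ,
    static String unescapeLiteral(String s) {
        return s.replace("\\,", ",");
    }

    // Regex replacement: remove a backslash only when a comma follows it.
    static String unescapeRegex(String s) {
        return s.replaceAll("\\\\(?=,)", "");
    }

    public static void main(String[] args) {
        // What the regex engine actually receives after Java unescapes the literal.
        String regexSource = "(?<!\\\\),";
        System.out.println(regexSource);          // prints (?<!\\),
        System.out.println(regexSource.length()); // prints 8

        // Runtime value of the string: bob\, sam
        String piece = "bob\\, sam";
        System.out.println(unescapeLiteral(piece)); // prints bob, sam
        System.out.println(unescapeRegex(piece));   // prints bob, sam
    }
}
```

Both versions leave a backslash alone when it is not followed by a comma, so for this job the choice between them is mostly a matter of taste.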
Conclusion
My co-worker was pleased to use the regex-totin’ split-n-replace version of the code. We both agreed it looked cleaner and simpler, even with the somewhat odd-looking lookbehind syntax and the double-escaped backslashes. For my part, I was happy to be able to apply my regex learning to help someone. :)