Regex: Splitting by Character, Unless in Quotes

When you're parsing text, you often need to split a string on a delimiter such as a comma (or a newline, tab, etc.). But what if a comma appears inside one of the values and you don't want to split on it? A large number written with thousands separators is a common example. So maybe we'd have a string like this:

age: 28, favorite number: 26, salary: $1,234,108

Splitting by commas on this would yield:

age: 28
favorite number: 26
salary: $1
234
108


Close, but not quite.

Many numbers are formatted with commas like this, so we can't simply avoid the problem.

One way to solve this problem is to put quotes around the string that shouldn't be split. So our example from above would then look like this:

age: 28, favorite number: 26, "salary: $1,234,108"

So now, to split this, we'll need a regex that says "split on every comma unless it's between quotes". Using Java and regex, this should work:

String[] strArray = text.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");


Using the regex string above, here is how we'd split a string using Java:

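String input = "age: 28, favorite number: 26, \"salary: $1,234,108\"";
// Split on commas that are followed by an even number of quotation marks
String[] splits = input.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
for (int i = 0; i < splits.length; i++) {
    // trim() removes the space left over after each comma
    System.out.println(splits[i].trim());
}
// Output (split() removes only the commas, so the quotes stay on the last piece):
// age: 28
// favorite number: 26
// "salary: $1,234,108"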
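
When run as written, the last piece still carries its surrounding quotation marks, because split() only removes the delimiter. If you want the bare values, one option is to strip a single pair of enclosing quotes from each piece after splitting. The sketch below is a minimal illustration of that idea; the class name QuoteAwareSplit and the method splitAndUnquote are hypothetical names chosen for this example, and it assumes the quotes in the input are balanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class QuoteAwareSplit {
    // Same pattern as above: a comma followed by an even number of quotation marks.
    private static final String COMMA_OUTSIDE_QUOTES = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Splits on commas outside quotes, trims each piece, and drops one pair of
    // enclosing quotes if present. Assumes the quotes in the input are balanced.
    static List<String> splitAndUnquote(String text) {
        List<String> fields = new ArrayList<>();
        for (String piece : text.split(COMMA_OUTSIDE_QUOTES)) {
            String field = piece.trim();
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                field = field.substring(1, field.length() - 1);
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "age: 28, favorite number: 26, \"salary: $1,234,108\"";
        for (String field : splitAndUnquote(input)) {
            System.out.println(field);
        }
        // Prints:
        // age: 28
        // favorite number: 26
        // salary: $1,234,108
    }
}
```

Stripping the quotes in a separate step keeps the regex itself unchanged, which makes it easier to reason about each part on its own.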


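This regex uses what's called a "positive lookahead": the (?=...) group checks the text that follows each comma without consuming it. Here the lookahead only succeeds when the rest of the string contains an even number of quotation marks, which means the comma sits outside any quoted section. A comma inside quotes is followed by an odd number of quotation marks, so it is skipped. Note that this assumes the quotes in the input are balanced.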
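
To see the lookahead at work, it can help to list exactly which commas the pattern accepts. The sketch below builds the same regex piece by piece with a comment on each part, then reports the index of every matching comma. The class name LookaheadDemo and the method splitPoints are illustrative names, not part of the original example, and the input is assumed to have balanced quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class LookaheadDemo {
    private static final Pattern COMMA_OUTSIDE_QUOTES = Pattern.compile(
            ","                        // the comma itself: the only character consumed
            + "(?="                    // lookahead: inspect what follows without consuming it
            + "([^\"]*\"[^\"]*\")*"    // zero or more complete pairs of quotation marks...
            + "[^\"]*$"                // ...then no further quotation marks up to the end
            + ")");

    // Returns the index of every comma that the pattern would split on.
    static List<Integer> splitPoints(String text) {
        List<Integer> indices = new ArrayList<>();
        Matcher matcher = COMMA_OUTSIDE_QUOTES.matcher(text);
        while (matcher.find()) {
            indices.add(matcher.start());
        }
        return indices;
    }

    public static void main(String[] args) {
        String input = "age: 28, favorite number: 26, \"salary: $1,234,108\"";
        // The commas at indices 41 and 45 are inside the quotes and are not reported.
        System.out.println(splitPoints(input)); // prints [7, 28]
    }
}
```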

Lookaheads are a really powerful feature in regex, but they can be difficult to write and to read.

To practice, try looking at the regex we gave and see if you can modify it to split on a different character, like a semicolon (;). If that was easy, try modifying it so it needs to see two quotation marks on each side of the string.
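
If you'd like to compare notes on the first exercise, here is one possible answer, sketched as a small helper that accepts any single-character delimiter. The names DelimiterSplit and splitOutsideQuotes are made up for this example. It uses Pattern.quote so that delimiters with a special meaning in regex (such as | or .) are treated literally, and it assumes that the quotes are balanced and that the delimiter is not itself a quotation mark.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DelimiterSplit {
    // Splits text on the given delimiter wherever it appears outside quotes.
    // Assumes balanced quotes and a delimiter other than the quotation mark.
    static String[] splitOutsideQuotes(String text, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter)) // literal delimiter
                + "(?=([^\"]*\"[^\"]*\")*[^\"]*$)";              // same lookahead as before
        return text.split(regex);
    }

    public static void main(String[] args) {
        String input = "age: 28; favorite number: 26; \"note: a; b\"";
        for (String piece : splitOutsideQuotes(input, ';')) {
            System.out.println(piece.trim());
        }
        // Prints:
        // age: 28
        // favorite number: 26
        // "note: a; b"
    }
}
```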

Have a simpler regex string, or some tips on creating them? Let us know in the comments!