Human readable regular expressions

Published: February 27, 2018  •  java, javascript

Regular expressions are a powerful tool for input validation, text extraction, and find-and-replace operations. Most programming languages support them, either through the standard library or built directly into the language.

In the following example, we need to process strings that arrive in this format: first 2 or 3 uppercase ASCII letters from A to Z, then a hyphen (-), followed by a number between 0 and 999 (one to three digits), then a dot, and finally one of the letters x, y or z. Our task is to split the string into three components. For instance, for the input "AB-0.z" we want the three substrings "AB", "0" and "z".


Java

In Java, regular expression support is located in the java.util.regex package and consists of the Pattern and Matcher classes (with some additional supporting classes).

    String[] inputs = new String[] { "AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x" };

    Pattern pattern = Pattern.compile("^([A-Z]{2,3})-(\\d{1,3})\\.([xyz])$");

    for (String input : inputs) {
      Matcher matcher = pattern.matcher(input);
      if (matcher.matches()) {
        System.out.printf("Group 1: %s, Group 2: %s, Group 3: %s%n", matcher.group(1),
            matcher.group(2), matcher.group(3));
      }
      else {
        System.out.println(input + " does not match");
      }
    }

Native.java


JavaScript

In JavaScript, we can use either the RegExp constructor, new RegExp("..."), or a regular expression literal, /.../


const regex = /^([A-Z]{2,3})-(\d{1,3})\.([xyz])$/g;

const inputs = ["AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x"];

for (const input of inputs) {
  regex.lastIndex = 0;
  const groups = regex.exec(input);
  if (groups != null) {
    console.log(`Group 1: ${groups[1]}, Group 2: ${groups[2]}, Group 3: ${groups[3]}`);
  }
  else {
    console.log(`${input} does not match`);
  }
}
    
    

native.js

Regular expressions are powerful but also a bit cryptic. Even this simple example uses all kinds of braces, brackets and parentheses. If you don't write regular expressions every day, it might take a few seconds, and maybe a look into the documentation, to really comprehend what the expression is doing.
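
With nothing but the standard library, one way to make such an expression less cryptic is the Pattern.COMMENTS flag, which lets you spread the pattern over several lines and annotate each part. The following is only a sketch of that idea, applied to the expression from the Java example above. The class name CommentedRegex and the split() helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentedRegex {

  // Pattern.COMMENTS ignores whitespace in the pattern and treats
  // everything from '#' to the end of the line as a comment.
  static final Pattern PATTERN = Pattern.compile(
      "^             # start of the input\n"
    + "([A-Z]{2,3})  # group 1: two or three uppercase letters\n"
    + "-             # a literal hyphen\n"
    + "(\\d{1,3})    # group 2: one to three digits\n"
    + "\\.           # a literal dot\n"
    + "([xyz])       # group 3: x, y or z\n"
    + "$             # end of the input",
      Pattern.COMMENTS);

  // Hypothetical helper: returns the three captured parts,
  // or null if the input does not match.
  static String[] split(String input) {
    Matcher matcher = PATTERN.matcher(input);
    if (!matcher.matches()) {
      return null;
    }
    return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
  }

  public static void main(String[] args) {
    for (String input : new String[] { "AB-0.z", "ab-999.x" }) {
      String[] parts = split(input);
      System.out.println(parts == null
          ? input + " does not match"
          : String.join(", ", parts));
    }
  }
}
```

The expression itself is unchanged, so you still have to know the syntax, but each piece now carries its own explanation.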

Fortunately there are many tools that help you write, test and visualize regular expressions. Here are some online services that I've found:


VerbalExpressions

It would be nice if we could write the expressions in a more human readable way. And this is exactly the purpose of the VerbalExpressions library.

It lets you write regular expressions in a builder-style syntax. VerbalExpressions is available for over 30 programming languages. It is only an expression builder: under the hood, it uses the built-in regular expression functionality of the programming language.

Java

If you want to use VerbalExpression in a Java application, add this dependency to your project.

    <dependency>
      <groupId>ru.lanwen.verbalregex</groupId>
      <artifactId>java-verbal-expressions</artifactId>
      <version>1.5</version>
    </dependency>

pom.xml

Our previous example, rewritten with VerbalExpression, looks like this. You have to write more code, but you can read the regular expression like English text. VerbalExpression does not teach you how to write regular expressions. You still need to know their basics and capabilities, but VerbalExpression helps you formulate and self-document the expression.

    VerbalExpression regex = VerbalExpression.regex()
        .startOfLine()
        .capture().range("A", "Z").count(2, 3).endCapture()
        .then("-")
        .capture().digit().count(1, 3).endCapture()
        .then(".")
        .capture().anyOf("xzy").endCapture()
        .endOfLine()
        .build();

    System.out.println(regex.toString());

    String[] inputs = new String[] { "AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x" };

    Pattern pattern = Pattern.compile(regex.toString());

    for (String input : inputs) {
      Matcher matcher = pattern.matcher(input);
      if (matcher.matches()) {
        System.out.printf("Group 1: %s, Group 2: %s, Group 3: %s%n", matcher.group(1),
            matcher.group(2), matcher.group(3));
      }
      else {
        System.out.println(input + " does not match");
      }
    }

VerbalRegex.java

The toString() method returns the expression as a string, which we can then pass to Pattern.compile(), just as we did with the regular expression in the first example. You can still write erroneous expressions: VerbalExpression does not, for instance, check that every capturing group is closed correctly (capture() / endCapture()). One thing you don't have to worry about is quoting special characters like the dot in our example. The then() method escapes these characters automatically.
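
When you write an expression by hand, this escaping is your own job. As a rough sketch of why it matters, the following snippet compares an unescaped dot with one wrapped in Pattern.quote() from the standard library. The class name QuoteLiteral, the method names and the sample pattern are made up for illustration, and this is not necessarily how then() escapes characters internally.

```java
import java.util.regex.Pattern;

public class QuoteLiteral {

  // An unescaped dot matches any single character.
  static boolean unescapedDotMatches(String input) {
    return Pattern.matches("a.c", input);
  }

  // Pattern.quote() turns its argument into a literal pattern,
  // so the dot only matches a real dot.
  static boolean quotedDotMatches(String input) {
    return Pattern.matches("a" + Pattern.quote(".") + "c", input);
  }

  public static void main(String[] args) {
    System.out.println(unescapedDotMatches("abc")); // true
    System.out.println(quotedDotMatches("abc"));    // false
    System.out.println(quotedDotMatches("a.c"));    // true
  }
}
```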


JavaScript

In JavaScript, you can add the library with

npm install verbal-expressions

to your npm project or load it from the CDN with a script tag

<script src="https://cdn.jsdelivr.net/npm/verbal-expressions@0.3.0/VerbalExpressions.min.js"></script>

The code looks almost the same as the Java code. There are some differences, like the method name for beginning a capture group (beginCapture(), in Java capture()) and the missing count() method in the JavaScript library. Fortunately, we can work around this omission with the add() method, which allows us to add anything to the expression.

The VerbalExpression JavaScript library augments the standard RegExp object and adds all these builder methods when you call VerEx(). Because the return value of the builder methods is a standard RegExp object, the code that uses the regular expression is exactly the same as in the first example.

const VerEx = require('verbal-expressions');

const regex = VerEx()
                .startOfLine()
                .beginCapture()
                  .range("A", "Z").add("{2,3}")
                .endCapture()
                .then("-")
                .beginCapture()
                  .digit().add("{1,3}")
                .endCapture()
                .then(".")
                .beginCapture()
                  .anyOf("xyz")
                .endCapture()
                .endOfLine();

const inputs = ["AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x"];

for (const input of inputs) {
  regex.lastIndex = 0;
  const groups = regex.exec(input);
  if (groups != null) {
    console.log(`Group 1: ${groups[1]}, Group 2: ${groups[2]}, Group 3: ${groups[3]}`);
  }
  else {
    console.log(`${input} does not match`);
  }
}

verbalregex.js


Complex expressions

Another interesting feature is the ability to extract common parts of an expression and reuse them multiple times.

In this example, the first and last parts of the expression are the same: three digits from 1 to 9, then two lowercase letters.

    VerbalExpression regex = VerbalExpression.regex()
        .startOfLine()
        .range("1", "9").count(3).range("a", "z").count(2)
        .then("-")
        .range("a", "z").count(2)
        .then("-")
        .range("1", "9").count(3).range("a", "z").count(2)
        .endOfLine()
        .build();

    String input = "123xy-ab-311de";
    System.out.println(regex.test(input));

VerbalRegex2.java

Instead of repeating the expression we can create a builder instance for this part and then reuse it with the add() method.

    VerbalExpression.Builder part = VerbalExpression.regex()
        .range("1", "9").count(3).range("a", "z").count(2);
    
    regex = VerbalExpression.regex()
        .startOfLine()
        .add(part)
        .then("-")
        .range("a", "z").count(2)
        .then("-")
        .add(part)
        .endOfLine()
        .build();

    System.out.println(regex.test(input));

VerbalRegex2.java

On this wiki page you will find a more complex example that demonstrates the process of extracting duplicate expressions in more detail.

You can use a similar approach when you write regular expressions by hand. Extract common parts into separate string variables and then append them to the expression.
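
A minimal sketch of that approach with plain java.util.regex follows. The class and constant names are hypothetical, and the character classes are assumed to be equivalent to what the range() and count() calls in the VerbalExpression example produce.

```java
import java.util.regex.Pattern;

public class ComposedRegex {

  // Common part, written once: three digits from 1 to 9, then two lowercase letters.
  static final String DIGITS_AND_LETTERS = "[1-9]{3}[a-z]{2}";

  // Middle part: two lowercase letters.
  static final String TWO_LETTERS = "[a-z]{2}";

  // The full expression is assembled from the named parts.
  static final Pattern PATTERN = Pattern.compile(
      "^" + DIGITS_AND_LETTERS + "-" + TWO_LETTERS + "-" + DIGITS_AND_LETTERS + "$");

  static boolean matches(String input) {
    return PATTERN.matcher(input).matches();
  }

  public static void main(String[] args) {
    System.out.println(matches("123xy-ab-311de")); // true
    System.out.println(matches("123xy-ab-301de")); // false, 0 is outside 1-9
  }
}
```

Descriptive variable names do part of the documentation work here, much like the builder method names do in VerbalExpression.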

The last example also showed one of the convenience methods of the VerbalExpression class: test(), which checks whether the given string matches the regular expression. VerbalExpression provides a few more such methods that simplify common use cases. If your use case is not covered by one of them, you can always get the pattern string with toString() and compile it into a Pattern instance, as we did in the Java example above.

You can find the source code of all the examples on GitHub:
https://github.com/ralscha/blog/tree/master/verbalregex