Chapter 26 Regular Expressions

Regular expressions are all about matching strings and extracting substrings from strings. Make sure that you have read through Chapter 25 first, since it covers the details of strings that this chapter builds on.

Regular expressions come up often in scientific computing because data frequently arrives as text, and we need to parse strings to extract what we need. We will use regular expressions in two ways: 1) to determine if a string matches some pattern and 2) to extract information out of a string. Often the first step must be done before the second.

As with much of this text, although the examples here will be done using Julia, regular expressions are ubiquitous and nearly every language has support for them. In general they use largely the same syntax, although some details differ between languages.

Section 26.1 Motivating Example: parsing a polynomial

As with most chapters in this book, we start with a motivating example. Polynomials are a very important mathematical object, and as we saw in Section 12.3, we created a struct to handle polynomials; we will create a relatively full-featured module for them in Chapter 47. For example, if we have a polynomial as a string in the form "3x^2-9x+10", we would like to parse it as a polynomial (that is, determine the coefficients and powers). We will see how to do this at the end of this chapter.

Section 26.2 Basic Regular Expressions

A regular expression is a sequence of characters (like a string) that is used to match other strings. Some of the characters are regular characters, which match the corresponding character. The others are special characters, which often match any one of several characters (for example, any digit).

In Julia, we can make a regular expression by prefixing a string literal with an r. For example, let’s say that we want to match the string "cat". These are all regular characters, so the regular expression is r"cat". If we run

map(s-> occursin(r"cat", s), ["catastrophe", "scatter", "tigercat"])

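the result is [1, 1, 1], indicating that all three match. (The result is a Boolean vector, which Julia displays using 1 for true and 0 for false.) Note that this is the same as using occursin with the plain string "cat", which always works when the regular expression contains only regular characters.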

We can also match the beginning and end of the string with a regular expression. The special character ^ at the beginning of a regular expression requires the rest of the pattern to match at the start of the string, as in r"^cat". The following:

map(s-> occursin(r"^cat", s), ["catastrophe", "scatter", "tigercat"])

returns [1, 0, 0] showing that only "catastrophe" starts with "cat".

Another special character is the $ which means to match the end of the string. So

map(s-> occursin(r"cat$", s), ["catastrophe", "scatter", "tigercat"])

will return [0, 0, 1] indicating that only "tigercat" matches "cat" at the end.
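The two anchors can also be combined. As a small sketch (the word list here is just an illustration), using both ^ and $ requires the entire string to be exactly "cat":

```julia
# With both anchors, nothing may come before or after "cat".
words = ["cat", "catastrophe", "scatter", "tigercat"]
exact = map(s -> occursin(r"^cat$", s), words)
# exact is [true, false, false, false]
```

This is the pattern to reach for when a whole string, rather than some part of it, must fit the regular expression.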

Subsection 26.2.1 Character Class and Ranges

Typically we want more flexibility with regular expressions. For example, instead of matching "cat", what if we want to match "cat", "cot" or "cut"? We can do this by listing the allowed characters inside []. The regular expression r"c[aou]t" will match "cat", "cot" or "cut" (the order of the characters inside the [] does not matter), so

map(s -> occursin(r"c[auo]t", s), ["catalog", "scotch", "cutlery", "settle"])

returns [1, 1, 1, 0], indicating that the first three match but the last one does not.

If we want a range of values, we can use a - within the [], which is known as a character class in the world of regular expressions. For example, to find words that start with the lowercase letters a through f, we can use the regular expression r"^[a-f]". A test for this would be

map(s -> occursin(r"^[a-f]",s), ["apple", "checkmate", "frosted flakes", "zebra"])

which returns [1, 1, 1, 0], showing that the first three match but the fourth does not. Now to confuse things, a ^ as the first character inside the [] negates the class, so that it matches any character not listed. Let’s say we don’t want a string to start with a through f. Entering

map(s -> occursin(r"^[^a-f]",s), ["apple", "frosted flakes", "poutine", "zebra"])

returns [0, 0, 1, 1] showing that the last two strings match.

If we want to match any single character (other than a newline), we use a period (.). For example, the regular expression r"c.t" matches a c and a t with exactly one character between them, so

map(s -> occursin(r"c.t",s), ["catalog", "tactile", "yacht"])

returns [1, 0, 1], where "tactile" does not match because there is no character between the "c" and "t".

Subsection 26.2.2 Optional sets of characters

Another important feature is a choice between optional sets of characters. If we want to match "cat" or "dog", we can construct a regular expression with these strings separated by a | (like an or). Consider the following:

map(s -> occursin(r"dog|cat",s), ["dogma", "catalog", "chair"])

which results in [1, 1, 0]. When the options are only part of a larger pattern, we add parentheses to group them. (We will use parentheses for another kind of grouping later.) For example

map(s -> occursin(r"(dog|cat)fish",s), ["dogfish", "catfish", "clownfish"])

returns [1, 1, 0].

Check Your Understanding 26.1.

Write a regular expression that tests for four-letter words that have "oo" or "ee" in the middle like fool or seen. Test on these words as well as ones that aren’t four letters and don’t have the double "o" or "e".

Subsection 26.2.3 Digits and alphabetic characters

Using the technique in the previous section, we can match a digit with the regular expression r"[0-9]". For example,

occursin(r"^[0-9]","1234")

returns true. However since matching digits is a common occurrence, \d can be used to match digits. Therefore,

occursin(r"^\d","1234")

returns true as well. To match alphabetic characters, there are a couple of options. If you are looking for precisely the 26 lowercase Latin letters, then [a-z] is the best way to do this. However, many other characters are alphabetic (such as letters with accents and letters from other alphabets in Unicode), and the class [[:alpha:]] matches these as well. For example,

map(s -> occursin(r"^[a-z]", s), ["apple", "zebra", "1234"])

returns true for the first two and false for the third. As an example of the broader [:alpha:] class of characters,

map(s -> occursin(r"^[[:alpha:]]",s) , ["apple", "ωγ", "1234"])

returns true for the first two and false for the third one. Note that [:alpha:] is written inside a [] block, since it names a set of characters to choose from.

Lastly, another helpful special character is that of a word character, \w. This will match an alphabetic character, a digit or an underscore _. An example is

map(s -> occursin(r"\w\w\w", s), ["r2c", "ww_c", "i a"])

which returns true, true and false. The last one fails because its second character is a space.

White space is also something to detect, and the special character \s matches it. Although it can be used with occursin, a more practical use is with split, which takes a string and splits it. We saw this in Section 6.10, but in that section only specific strings were used as separators. For example,

split("The dog jumps over	the log", r"\s")

where some of the spaces are tabs and others are a space character. This returns the array

["The", "dog", "jumps", "over", "the", "log"]

Another example would be

split("1.0,2.0,3.0;4.5,7.9;-10", r"[,;]")

which returns ["1.0", "2.0", "3.0", "4.5", "7.9", "-10"] (the pieces are still strings, not numbers). Recall that the regular expression [,;] matches either a , or a ;.
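Since split returns pieces of the original string, a natural follow-up is to convert them to numbers. The following is a minimal sketch, assuming every piece is a valid number; the variable names are only for illustration. It also shows \s+ (one or more white space characters, using the + quantifier), which avoids empty pieces when words are separated by several spaces or tabs.

```julia
# Split on commas or semicolons, then parse each piece as a Float64.
pieces = split("1.0,2.0,3.0;4.5,7.9;-10", r"[,;]")
nums = parse.(Float64, pieces)
# nums is [1.0, 2.0, 3.0, 4.5, 7.9, -10.0]

# Splitting on a run of white space treats "  " and "\t " as single separators.
letters = split("a  b\t c", r"\s+")
# letters is ["a", "b", "c"]
```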

Check Your Understanding 26.2.

Write a regular expression to match five-letter words (which may be useful for solving Wordle). Start first with the lowercase Latin letters and then allow other characters like the Greek letters. Test against 4, 5 and 6 letter words that should and should not match.

Subsection 26.2.4 Quantifiers

We often want to know if a character is repeated some number of times. For example, if we want to check that c is repeated three times, we can use r"ccc", however, we can also write this as r"c{3}", which is actually longer, but sometimes easier to read. If we want to test if there are between 2 and 4 consecutive cs, then we can use r"c{2,4}". As an example, consider:

map(s -> occursin(r"^c{2,4}$",s),["c","cc","ccc","cccc","ccccc"])

which returns [false, true, true, true, false]. Note that the regular expression includes a ^ and a $, so the entire string must consist of the repeated c. If we want to match 3 or more times, we can use r"c{3,}" as in

map(s -> occursin(r"^c{3,}$",s),["c","cc","ccc","cccc","ccccc"])

which returns [false, false, true, true, true]. If we want a maximum of 3, use r"c{0,3}" as in

map(s -> occursin(r"^c{0,3}$",s),["c","cc","ccc","cccc","ccccc"])

which returns [true, true, true, false, false]. There are two quantifiers that are used more than any other: matching 0 or more times and 1 or more times. Because of this, these have the special characters * and +. To match 0 or more c, we use r"c*". For example,

map(s -> occursin(r"^c*a",s),["a","ca","cca","ccca","cccca"])

will match zero or more c at the beginning of the string followed by an a. All five of the above match. In contrast, if we use r"^c+a", as in

map(s -> occursin(r"^c+a",s),["a","ca","cca","ccca","cccca"])

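then the first one does not match, but the others do. The last example of this is the regular expression .*, which matches any character 0 or more times. At first glance this doesn’t seem useful, but it is handy for skipping over arbitrary text between the parts of a pattern that we do care about.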
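As a small sketch of where .* earns its keep, suppose we want strings that start with "report" and end with ".csv", with anything (or nothing) in between. The file names below are made up for illustration.

```julia
# .* skips over whatever sits between the fixed start and the fixed ending.
# The period in ".csv" is escaped so that it matches a literal period.
names = ["report_2021.csv", "report_final.txt", "summary_2021.csv", "report.csv"]
is_report = map(s -> occursin(r"^report.*\.csv$", s), names)
# is_report is [true, false, false, true]
```

Note that "report.csv" matches because .* is allowed to match zero characters.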

Check Your Understanding 26.3.

Rewrite the regular expression that finds four-letter words that you did above in Check Your Understanding 26.2 to use quantifiers.

Section 26.3 Extracting substrings with Regular Expressions

Although it is often nice to know whether or not a string matches a particular pattern, the power of regular expressions comes with extracting information. Let’s start with an example. Sports scores are stored as strings in the form "X-X", where each "X" is an integer and the home team is listed first. Examples are "4-3" and "123-97". First, we can determine if these match with the following

map(s -> occursin(r"^\d+-\d+$", s), ["78-75","5-3", "123-97"])

which returns [1, 1, 1], so all three match. Note that although - seems to be a special character, which would need escaping, it is only special inside the character ranges [ ], and since this doesn’t use that, no escaping is needed.

Ultimately, we want to extract the integer information, so we will group the two numbers using parentheses, giving the regular expression r"^(\d+)-(\d+)$". If you replace the regular expression above with this one, you will still see that it matches (returns true). To extract the numbers, we instead use the match function, which returns more information. If we enter:

m = match(r"^(\d+)-(\d+)$", "78-75")

then the following is returned:

RegexMatch("78-75", 1="78", 2="75")

This is an object of type RegexMatch. The first part is the matched string and the other parts are the captured substrings. With the result stored in m, we can get those substrings with m[1] and m[2].

Next, let’s parse the substrings as numbers. For this we can use the parse function, and we’ll collect the results in a named tuple with:

(home = parse(Int, m[1]), away = parse(Int, m[2]))
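The matching and parsing steps can be wrapped into one function. This is a minimal sketch; parse_score is a hypothetical helper name, not something defined elsewhere in this text. The key detail is that match returns nothing when the string does not fit the pattern, so that case should be checked before indexing into the result.

```julia
# Hypothetical helper: returns a named tuple of scores,
# or nothing if the string is not in the form "X-X".
function parse_score(s::AbstractString)
  m = match(r"^(\d+)-(\d+)$", s)
  m === nothing && return nothing
  (home = parse(Int, m[1]), away = parse(Int, m[2]))
end

parse_score("78-75")       # returns (home = 78, away = 75)
parse_score("rained out")  # returns nothing
```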

Note 26.4.

The author has done a lot of work with sports scores over the years and uses both Julia and this exact method to extract scores from webpages. The extraction of scores from a webpage can be automated so that years’ worth of data can be combed through quickly. Generally, once the scores are extracted, they are stored in a database (see Chapter 50) for processing later.

Remark 26.5.

If you only need to test if a string matches a pattern (using a regular expression), use occursin. If you need to extract substrings, use match.

Check Your Understanding 26.6.

Although the parts of a U.S. phone number have little meaning these days, the first 3 digits are the area code, the second three are the region (or region code) and the last 4 are the number within the region.

First, extract phone numbers in the U.S. form of "XXX-XXX-XXXX" or "(XXX) XXX-XXXX" where the separator - can be a .. Secondly, parse the three substrings as integers.

Section 26.4 Matching integers and decimals

Examples below will parse integers and decimals from a string, and we’ll use regular expressions to do this work. As we saw above, to match a digit we use \d, and if we want one or more digits, we can follow that with a +. For an integer written without a sign, the regular expression is r"\d+". For example,

occursin(r"\d+","1234")

returns true. If we want to allow a + or - sign with the integer, we can put [+-]? in front of the digits, where the ? means that the preceding item is optional (it matches 0 or 1 times). Therefore,

int_re = r"^[+-]?\d+$"
map(s-> occursin(int_re,s),["1234", "+1234", "-1234"])

which returns true for all three (as a boolean vector). Note: below, we’ll do a better job with testing these.

To match a decimal, we can start with the integer match and tack on a decimal point (written \. since it needs to be escaped) followed by any further digits. A reasonably robust decimal regular expression is:

dec_re = r"^[-+]?\d+\.\d*$"

There are a few things to note here:

The front of this regular expression is the same as the integer version: [+-]?\d+. (The order of the characters inside the [] does not matter, so [-+] and [+-] are equivalent.)

The remaining part of the regular expression (\.\d*) is for the decimal point and any trailing digits. The decimal point is escaped because an unescaped . matches any character. The \d* matches any number (0 or more) of digits.

Here’s a small number of tests for this:

map(s -> occursin(dec_re, s),["-1.3", "-1.", "14.0343", "14", "-15"])

and the first three return true. The last two do not since they are missing decimal points. Lastly, if we want this to match either an integer or a decimal number (which we will do below), we’ll take the decimal point and trailing digits as a group and make it optional. That is,

int_or_dec_re = r"^[-+]?\d+(\.\d*)?$"

and we’ll test it with the same strings as before:

map(s -> occursin(int_or_dec_re, s),["-1.3", "-1.", "14.0343", "14", "-15"])

which returns all true.
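It is just as important to check strings that should not match. The following sketch runs the same regular expression against a handful of malformed inputs; the list of bad strings is our own choice for illustration.

```julia
int_or_dec_re = r"^[-+]?\d+(\.\d*)?$"

# None of these should be accepted as an integer or decimal.
bad = ["", "abc", ".5", "1.2.3", "--4", "3 ", "1,000"]
rejected = map(s -> !occursin(int_or_dec_re, s), bad)
# rejected is all true
```

Note that ".5" is rejected because the pattern requires at least one digit before the decimal point. Whether that is the desired behavior is a design choice; accepting it would require changing the regular expression.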

Check Your Understanding 26.7.

Recall that floating point numbers can be written in scientific notation, like 1.234e-4, which represents $1.234 \times 10^{-4}$. Update the regular expression above to handle these types of numbers, recalling that the front is just a decimal, so you only need to handle the e-4 part. Make this part optional.

Section 26.5 Matching and Capture Groups

Generally, it is desirable to pull out a substring from a string, and regular expressions are good at this. For example, we may want to take a string and pull out an integer or decimal to parse as an Int or a Float64.

Let’s say that we have a string "(14,-34)" that we wish to parse as a point in the plane. We will start with a pair of integers. Since we want to capture both numbers, we will surround these by ( ). Define

pt_re = r"^\(([+-]?\d+),([+-]?\d+)\)$"

where the parentheses at the beginning and end need to be escaped as \( and \). We can then test for matching with

occursin(pt_re,"(14,-17)")

which returns true. It is nice to know that it matches, but we would like to determine the substrings for the two coordinates. We can do this with the match function like

m = match(pt_re, "(14,-17)")

which returns

RegexMatch("(14,-17)", 1="14", 2="-17")

which is an object (struct) that captures information about the match. The first part is the string that matches the regular expression, which here is the full string. The next two are the two capture groups. We can access these with m[1] and m[2].
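Capture groups can also be given names, which can make the code easier to read than numbered groups. The following is a sketch of the same integer point pattern with named groups; the group names x and y and the variable named_pt_re are our own choices.

```julia
# (?<name>...) is a capture group that can be looked up by name.
named_pt_re = r"^\((?<x>[+-]?\d+),(?<y>[+-]?\d+)\)$"
m = match(named_pt_re, "(14,-17)")

# Named groups are accessed with a symbol (or a string such as m["x"]).
pt = (x = parse(Int, m[:x]), y = parse(Int, m[:y]))
# pt is (x = 14, y = -17)
```

The numbered access m[1] and m[2] still works with named groups, so the names are purely a convenience.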

Check Your Understanding 26.8.

Update the regular expression above to handle a point of two integers or floating point numbers. Test your regular expression with the match command.

Section 26.6 Replacing substrings with regular expressions

Above, we used regular expressions in three ways. First, occursin returns true or false depending on whether a substring or a regular expression matches within a larger string. Secondly, we used the match function to extract parts of a string. Lastly, we used a regular expression to split a string into an array of other strings. In this section, we will see a couple of other ways regular expressions can be used.

Replacing strings with another string is quite helpful. A simple non-regex version is the following:

replace("Alice like cookies.  Also, Alice doesn't like carrots.", "Alice" => "Ben")

which returns

"Ben like cookies.  Also, Ben doesn't like carrots."

The replace function is more powerful than simple string substitution: we can use a regular expression as the pattern as well. Consider

replace("Is there a doctor in the house?  There is.", r"[Ii]s" => "are")

which returns

"are there a doctor in the house?  There are."

And lastly, we can use capture groups as well. Consider the following:

replace("Are the kids still in the pond?  Are the adults sitting on the beach?", "Are" => "Is", r"\s(\w+)s\s" => s" \1 ")

returns "Is the kid still in the pond?  Is the adult sitting on the beach?". First, notice that we have multiple replacements in a single call (passing several replacement pairs requires Julia 1.7 or later). Also, the second replacement pairs a regular expression containing a capture group (the parentheses) with a substitution string s" \1 ", which starts with an s. The \1 in the substitution string stands for whatever the first capture group matched. This allows us to replace both adults and kids.
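The replacement does not have to be a string. As a further sketch (the sentence is invented for illustration), replace also accepts a function on the right-hand side of the pair; the function is called with each matched substring and its result is used as the replacement.

```julia
# Double every integer that appears in the string.
doubled = replace("3 apples and 12 pears", r"\d+" => s -> string(2 * parse(Int, s)))
# doubled is "6 apples and 24 pears"
```

This is useful when the replacement has to be computed from the match rather than rearranged from capture groups.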

Section 26.7 Parsing a Polynomial

We now turn to an example of parsing a string as a polynomial. The goal is to add a constructor to our Polynomial module that takes a string and, if it can be written as a polynomial, parses it and stores it as a polynomial. Here are some examples:

5x-10
3x^2-7x+11
-5.6x^11+7.349x^5+x

Although we hope to eventually do general polynomials, let’s start with a linear function, like the top one above. Also for simplicity, we’ll assume that the coefficients are integers and that the variable is x. To start, let’s build a regular expression that will start with an integer (to be captured) followed by an x followed by another integer (to be captured). The following will do this

lin = r"([+-]?\d+)x([+-]\d+)"

A few things to note about this:

Although we typically don’t put a + in front of a positive integer, it is a valid way to write one, so the [+-] handles this.

The [+-] doesn’t need to be there (we assume the number is positive if it is missing), so we append a ? to it.

The \d+ matches a sequence of digits of any length, so this is the integer without the sign.

The sign and the digits are surrounded by (), so this is our first capture group. This will be parsed as our first coefficient.

The second capture group is the same as the first except that the sign is not optional.

Subsection 26.7.1 Testing the regular expression

Clearly the regular expression should be tested. We will use the tools from the Test package that was explained in Chapter 23. To start with we’ll make a testset for linear functions using

using Test
@testset "Linear Functions" begin
  @test match(lin, "5x-10") !== nothing
  @test match(lin, "-5x-10") !== nothing
  @test match(lin, "5x+10") !== nothing
  @test match(lin, "-5x+10") !== nothing
end

Right now, we just want to make sure that the four strings above match the regular expression for linear functions. This seems to take care of all linear functions (with integer coefficients), in that either of the coefficients can be positive or negative. Running this, you will see that all tests pass. However, did we capture all linear functions? A couple stand out after thinking a bit: "5x" itself is a linear function, as is "5*x+10", and if we add these to the tests, they will fail.

Since we want the * character between the coefficient and the x to be optional, and the constant term to be optional as well, we append a ? to each of these:

lin = r"([-+]?\d+)\*?x([-+]?\d+)?"

noting that * is a special character so needs to be escaped as \*. Now the test set:

@testset "Linear Functions" begin
  @test match(lin, "5x-10") !== nothing
  @test match(lin, "-5x-10") !== nothing
  @test match(lin, "5x+10") !== nothing
  @test match(lin, "-5x+10") !== nothing
  @test match(lin, "5x") !== nothing
  @test match(lin, "5*x-10") !== nothing
end

will pass for all the tests.
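The tests above only check that a match exists. To actually pull out the two coefficients, the optional constant term needs care, since its capture group is nothing when the term is missing. Here is a minimal sketch; linear_coeffs is a hypothetical helper name, and a missing constant is taken to mean 0.

```julia
lin = r"([-+]?\d+)\*?x([-+]?\d+)?"

# Hypothetical helper: returns (slope, intercept), or nothing if there is no match.
function linear_coeffs(s::AbstractString)
  m = match(lin, s)
  m === nothing && return nothing
  slope = parse(Int, m[1])
  # The constant term is optional, so its capture group may be nothing.
  intercept = m[2] === nothing ? 0 : parse(Int, m[2])
  (slope, intercept)
end

linear_coeffs("5x-10")    # returns (5, -10)
linear_coeffs("-5*x+10")  # returns (-5, 10)
linear_coeffs("5x")       # returns (5, 0)
```

Because lin has no ^ or $ anchors, it will also match a linear expression embedded in a longer string; adding the anchors would make the check stricter.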

Subsection 26.7.2 Linear Functions with decimal coefficients

As we saw above, we have a regular expression to handle decimal coefficients. This is

dec = r"^([-+]?\d+(\.\d*)?)$"

We can update our linear function regular expression to include decimal coefficients by replacing the integer coefficient pattern with the one above (without the ^ and $ anchors).

lin_dec = r"([-+]?\d+(\.\d*)?)\*?x([-+]?\d+(\.\d*)?)?"

The above set of tests can be updated to use this instead (try it!), and we can also include a new set of tests with decimal coefficients:

@testset "Linear Functions with decimal coefficients" begin
  @test match(lin_dec, "5.0x-10.5") !== nothing
  @test match(lin_dec, "-5.9x-10.2") !== nothing
  @test match(lin_dec, "5.3x+10.4") !== nothing
  @test match(lin_dec, "-5.25x+10.8") !== nothing
  @test match(lin_dec, "5.x") !== nothing
  @test match(lin_dec, "5.3*x-10.55") !== nothing
end

Subsection 26.7.3 Splitting a Polynomial into terms

With the success of parsing a linear function, we could next write a regular expression for a quadratic polynomial. However, we would like to parse a general polynomial, and since a term could have any power (that is, the degree of the polynomial could be any nonnegative integer), it isn’t practical to write a separate regular expression for each degree.

Instead, we will split up the polynomial into terms. For an example, let’s start with $4x^3-2x^2+6$, which we will write as the string "4x^3-2x^2+6". The split function can split the string on either a + or -, but then the sign is not saved. Instead, the following will do what we want:

function splitPoly(p::String)
  local terms = String[]
  # if the first character is a +/-, start the index at 2
  local ind1 = occursin(r"^[+-]",p) ? 2 : 1
  while true
    ind2 = findnext(r"[+-]", p, ind1)
    if ind2 === nothing
      # Push the last term onto the term stack (max avoids index 0 when p has no sign at all).
      push!(terms, string(SubString(p, max(ind1-1, 1))))
      break
    end
    # The first time through the loop, the substring calculation is different.
    push!(terms, string(SubString(p,(ind1 == 1 ? 1 : ind1 -1):first(ind2)-1)))
    ind1 = first(ind2)+1
  end
  terms
end

And since this is relatively large, let’s walk through the steps:

Line 2: terms stores the terms as an array of strings. This will act like a stack.

Line 4: the variable ind1 will be the first index of the substring. Later in the function, we will pull out the part of the polynomial string between ind1 and ind2 (adjusted a bit). If the polynomial string starts with a sign, then we start the index at 2, otherwise at 1.

Line 5: in general, while true is a surefire way to get an infinite loop since there is no stopping condition. However, the if statement starting on line 7 will check if we’re done and break out of the loop.

Line 6: this finds the next + or - at or after ind1, which determines where the current substring ends. Note that the result of this is a UnitRange.

Line 7: if there is no match for a +/-, the result of ind2 is nothing.

Line 9: As the comment says, we are at the end of the string, and we push the last term onto the terms array. When SubString is called with only one index, it runs to the end of the string. Also, we convert the SubString to a String with the string command.

Line 13: For a match on a +/-, we push the proper substring onto the terms array. If ind1 is 1, we are at the beginning of a string with no leading sign, so the substring starts at index 1 instead of ind1-1. Also, note that since ind2 is a UnitRange, we take the first value of the range.

Line 14: set ind1 to the character after ind2 before repeating the loop.

Let’s check with the string 4x^3-2x+6 that this works. Calling

splitPoly("4x^3-2x+6")

returns

3-element Vector{String}:
 "4x^3"
 "-2x"
 "+6"

which appears to work well. (Note: later, we will do some testing after converting to Polynomial objects.)
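For comparison, here is a shorter alternative sketch (not the approach used in this text) that leans on the regular expression itself. The function eachmatch iterates over every match in a string, and the pattern below describes one term: an optional sign followed by one or more characters that are not a sign. The name split_terms is hypothetical.

```julia
# Each match is an optional sign followed by a run of non-sign characters.
split_terms(p::AbstractString) = [String(m.match) for m in eachmatch(r"[+-]?[^+-]+", p)]

split_terms("4x^3-2x+6")  # returns ["4x^3", "-2x", "+6"]
split_terms("-x^2+1")     # returns ["-x^2", "+1"]
```

Like splitPoly, this treats every + or - as the start of a new term, so it would not handle a term with a negative exponent such as x^-2.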

Subsection 26.7.4 Parsing Polynomial Terms

The next step is to parse the individual terms of the polynomial, and for this we will use a regular expression. As a first step, let’s assume that the constant out front is an integer. The following will parse these terms:

poly_re = r"^([+-]?\d+)(x(\^(\d+))?)?$"

And let’s test this on the 4x^3 term with

m1 = match(poly_re, "4x^3")

which returns

RegexMatch("4x^3", 1="4", 2="x^3", 3="^3", 4="3")

and you should notice that m1[1] will be the coefficient and m1[4] will be the power. If we test the next term with

m2 = match(poly_re, "-2x")

this returns

RegexMatch("-2x", 1="-2", 2="x", 3=nothing, 4=nothing)

and you should notice that again m2[1] is the coefficient and since m2[4] is nothing that this is a linear term. And finally,

m3 = match(poly_re, "+6")

returns

RegexMatch("+6", 1="+6", 2=nothing, 3=nothing, 4=nothing)

and there is no match for the last 3 capture groups, so this is just a constant term.

Although this looks good, thinking a bit ahead, a polynomial term like -x^2 will not be parsed by this. Indeed,

match(poly_re, "-x^2")

returns nothing. We will tweak the regular expression above to allow a + or - with no number in front of the x term.

poly_re = r"^([+-]?)(\d+)?(x(\^(\d+))?)?$"

and now rerunning the line above results in

RegexMatch("-x^2", 1="-", 2=nothing, 3="x^2", 4="^2", 5="2")

which shows that it matches. Note that the capture groups are different from those above: there are now 5 of them. The 2nd group matched nothing because there was no numerical coefficient before the x^2 term.
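With the five capture groups in hand, a term can be turned into a coefficient and a power. The following is a minimal sketch under the same assumptions as above (integer coefficients, variable x); parse_term is a hypothetical helper name. Since every part of the pattern is optional, an empty string or a bare sign also matches, so the helper rejects those cases explicitly.

```julia
poly_re = r"^([+-]?)(\d+)?(x(\^(\d+))?)?$"

# Hypothetical helper: returns (coefficient, power),
# or nothing if the string is not a valid term.
function parse_term(t::AbstractString)
  m = match(poly_re, t)
  m === nothing && return nothing
  # Neither a number nor an x: an empty string or a lone sign.
  m[2] === nothing && m[3] === nothing && return nothing
  sgn = m[1] == "-" ? -1 : 1
  # A missing number means an implied coefficient of 1, as in "-x^2".
  coeff = m[2] === nothing ? 1 : parse(Int, m[2])
  # No x gives power 0; x without ^ gives power 1; otherwise use the exponent.
  power = m[3] === nothing ? 0 : (m[5] === nothing ? 1 : parse(Int, m[5]))
  (sgn * coeff, power)
end

parse_term("4x^3")  # returns (4, 3)
parse_term("-2x")   # returns (-2, 1)
parse_term("+6")    # returns (6, 0)
parse_term("-x^2")  # returns (-1, 2)
```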
