A
regular expression is a sequence of characters (like a string) that is used to match other strings. Some of the characters are
regular characters, which will match the corresponding character. The other type are
special characters which often match multiple characters (for example any digit).
In Julia, we can make a regular expression by prepending a string with an
r. For example, let’s say that we want to match the string
"cat". These are all regular characters, so the regular expression is
r"cat". Running
map(s-> occursin(r"cat", s), ["catastrophe", "scatter", "tigercat"])
returns
[1, 1, 1], indicating that all three match. Note that this gives the same result as calling
occursin with the plain string
"cat", which is always the case when the regular expression consists only of regular characters.
We can also match the beginning and end of the string with a regular expression. The special character
^ at the beginning of a regular expression requires the rest of the pattern to match at the beginning of the string. Take
r"^cat" as an example. The following:
map(s-> occursin(r"^cat", s), ["catastrophe", "scatter", "tigercat"])
returns
[1, 0, 0] showing that only "catastrophe" starts with "cat".
Another special character is the
$ which means to match the end of the string. So
map(s-> occursin(r"cat$", s), ["catastrophe", "scatter", "tigercat"])
will return
[0, 0, 1] indicating that only "tigercat" matches "cat" at the end.
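As a minimal sketch of how the two anchors combine, using both ^ and $ in one pattern requires the whole string to match. The word list below is made up for illustration, and the comparison with startswith and endswith only holds for plain text such as "cat".

```julia
# Hypothetical word list, chosen only for illustration.
words = ["cat", "catastrophe", "scatter", "tigercat"]

# With both anchors, the pattern must span the entire string.
whole = map(s -> occursin(r"^cat$", s), words)

# For plain text, startswith and endswith give the same answers as the anchors.
starts = map(s -> startswith(s, "cat"), words)
ends = map(s -> endswith(s, "cat"), words)
```

Here only "cat" itself satisfies the doubly anchored pattern, while starts and ends agree with what r"^cat" and r"cat$" would report.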
Subsection 26.2.1 Character Class and Ranges
Typically we want more flexibility with regular expressions. For example, instead of matching
"cat", what if we want to match
"cat",
"cot" or
"cut"? We can do this with a set of
[]. For example a regular expression
r"c[aou]t" will match
"cat",
"cot" or
"cut". For example,
map(s -> occursin(r"c[aou]t", s), ["catalog", "scotch", "cutlery", "settle"])
returns
[1, 1, 1, 0], indicating that the first three match but the last one does not.
If we want a range of values, we can use a
- within the
[], which is known as a
character class in the world of regular expressions. For example, to find words that start with the lowercase letters
a through
f, we can use the regular expression
r"^[a-f]". For example,
map(s -> occursin(r"^[a-f]",s), ["apple", "checkmate", "frosted flakes", "zebra"])
returns
[1, 1, 1, 0], showing that the first 3 match, but the 4th does not. Now to confuse things, when
^ is the first character within the
[] it means
not in. Let’s say we don’t want a string to start with
a through
f. Entering
map(s -> occursin(r"^[^a-f]",s), ["apple", "frosted flakes", "poutine", "zebra"])
returns
[0, 0, 1, 1] showing that the last two strings match.
If we want to match any single character (other than a newline), we use a
.. For example, the regular expression
r"c.t" can be used as in
map(s -> occursin(r"c.t",s), ["catalog", "tactile", "yacht"])
which returns
[1, 0, 1], where "tactile" does not match because there is no character between the "c" and "t".
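The character classes above can also be used to select strings rather than just test them. The following is a minimal sketch using filter, with a made-up word list, showing that a range and its negation split a list into two parts.

```julia
# Hypothetical word list, chosen only for illustration.
words = ["apple", "Banana", "cherry", "zebra", "42nd"]

# Keep the words that start with a lowercase letter from a through f.
early = filter(s -> occursin(r"^[a-f]", s), words)

# Keep the words that start with any character outside a through f.
# "Banana" and "42nd" are kept, since B and 4 are not in the range a-f.
others = filter(s -> occursin(r"^[^a-f]", s), words)
```

Note that the range is case sensitive, so the uppercase B falls in the negated class.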
Subsection 26.2.2 Optional sets of characters
Another important regular expression is that of an optional set of characters. If we want to match "cat" or "dog" we can construct a regular expression with these strings separated with a
| (like an or). Consider the following:
map(s -> occursin(r"dog|cat",s), ["dogma", "catalog", "chair"])
which results in
[1, 1, 0]. Often, if we want to match an option together with other strings, we can add parentheses to group the option. (We will use parentheses for other grouping later.) For example
map(s -> occursin(r"(dog|cat)fish",s), ["dogfish", "catfish", "clownfish"])
returns
[1, 1, 0], since "clownfish" contains neither "dogfish" nor "catfish".
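To see what the parentheses change, here is a minimal sketch comparing a grouped and an ungrouped pattern. The word list is made up for illustration.

```julia
# Hypothetical word list, chosen only for illustration.
words = ["dogfish", "catfish", "dogma", "clownfish"]

# Without parentheses the alternatives are "dog" and "catfish".
ungrouped = map(s -> occursin(r"dog|catfish", s), words)

# With parentheses the alternatives are "dog" and "cat", each followed by "fish".
grouped = map(s -> occursin(r"(dog|cat)fish", s), words)
```

The two results differ only on "dogma", which contains "dog" but not "dogfish".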
Check Your Understanding 26.1.
Write a regular expression that tests for four-letter words that have "oo" or "ee" in the middle like
fool or
seen. Test on these words as well as ones that aren’t four letters and don’t have the double "o" or "e".
Subsection 26.2.3 Digits and alphabetic characters
Using the technique in the previous section, we can match a digit with the regular expression
r"[0-9]". For example,
occursin(r"^[0-9]","1234")
returns
true. However since matching digits is a common occurrence,
\d can be used to match digits. Therefore,
occursin(r"^\d","1234")
returns
true as well. To match alphabetic characters, there are a couple of options. If you are looking for precisely the 26 lowercase Latin characters then
[a-z] is the best way to do this. However, there are many other alphabetic characters (such as letters with accents and other Unicode letters), and for these there is the broader class
[[:alpha:]]. Some examples are:
map(s -> occursin(r"^[a-z]", s), ["apple", "zebra", "1234"])
returns true for the first two and false for the third. To use the broader
[:alpha:] class of characters, as an example
map(s -> occursin(r"^[[:alpha:]]",s) , ["apple", "ωγ", "1234"])
returns true for the first two and false for the third one. Note that
[:alpha:] is itself written inside a
[] block, since it names a set of characters to choose from.
Lastly, another helpful special character is that of a
word character,
\w. This will match an alphabetic character, a digit or an underscore
_. An example is
map(s -> occursin(r"\w\w\w", s), ["r2c", "ww_c", "i a"])
which returns true, true and false. The last one fails because the 2nd character is a space.
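The three classes in this subsection can be compared side by side. The following is a minimal sketch on a made-up list of inputs, and it assumes Julia's default regular expression settings, under which these classes also cover non-Latin letters.

```julia
# Hypothetical inputs, chosen only for illustration.
samples = ["x1", "_tmp", "ωγ", "42", "a b"]

# Starts with an alphabetic character (an underscore or digit does not count).
alpha = map(s -> occursin(r"^[[:alpha:]]", s), samples)

# Starts with a digit.
digit = map(s -> occursin(r"^\d", s), samples)

# Starts with two word characters (letters, digits or underscores).
word2 = map(s -> occursin(r"^\w\w", s), samples)
```

The underscore in "_tmp" is a word character but not an alphabetic one, and the space in "a b" belongs to none of the three classes.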
White space, which is matched by the special character
\s, is also something to detect. Although it can be used with
occursin, a more practical way to use this is with
split, which takes a string and splits it. We saw this in
Section 6.10, but in that section only specific strings were allowed. For example,
split("The dog jumps over the log", r"\s")
where some of the spaces are tabs and others are a space character. This returns the array
["The", "dog", "jumps", "over", "the", "log"]
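We can also split on more than one kind of separator. For example,
split("1.0,2.0,3.0;4.5,7.9;-10", r"[,;]")
returns
["1.0", "2.0", "3.0", "4.5", "7.9", "-10"], an array of strings rather than numbers. Recall that the regular expression matches either
, or
;.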
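Since split returns pieces of the original string, a further step is needed to get numbers. The following is a minimal sketch that converts the pieces with parse. The variable names are made up for illustration.

```julia
# A tab and a space both count as white space for \s.
tokens = split("The\tdog jumps", r"\s")

# split returns pieces of the original string, not numbers.
pieces = split("1.0,2.0,3.0;4.5,7.9;-10", r"[,;]")

# Convert each piece to a Float64 by broadcasting parse over the array.
numbers = parse.(Float64, pieces)
```

After this, numbers is a Vector{Float64} that can be used in arithmetic, whereas pieces still holds text.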
Check Your Understanding 26.2.
Write a regular expression to match five-letter words (which may be useful for solving Wordle). Start first with the lowercase Latin characters and then allow other characters like the Greek letters. Test against 4, 5 and 6 letter words that should and should not match.
Subsection 26.2.4 Quantifiers
We often want to know if a character is repeated some number of times. For example, if we want to check that
c is repeated three times, we can use
r"ccc". However, we can also write this as
r"c{3}", which is actually longer, but sometimes easier to read. If we want to test if there are between 2 and 4 consecutive
cs, then we can use
r"c{2,4}". As an example, consider:
map(s -> occursin(r"^c{2,4}$",s),["c","cc","ccc","cccc","ccccc"])
which returns
[false, true, true, true, false]. Note that the regular expression includes a
^ and
$, so the pattern must match the entire string. If we want to say it matches 3 or more times, we can use
r"c{3,}" as in
map(s -> occursin(r"^c{3,}$",s),["c","cc","ccc","cccc","ccccc"])
which returns
[false, false, true, true, true]. If we want a maximum of 3, use
r"c{0,3}" as in
map(s -> occursin(r"^c{0,3}$",s),["c","cc","ccc","cccc","ccccc"])
which returns
[true, true, true, false, false]. There are two quantifiers that are used more than any other: matching 0 or more times and 1 or more times. Because of this, these have special characters
* and
+. To match 0 or more
c, we use
r"c*". For example,
map(s -> occursin(r"^c*a",s),["a","ca","cca","ccca","cccca"])
will match zero or more
c at the beginning of the string, followed by an a. All five of the above match. In contrast, if we use
r"^c+a", as in
map(s -> occursin(r"^c+a",s),["a","ca","cca","ccca","cccca"])
then the first one does not match, but the others do. The last example of this is the regular expression
.*, which matches any character 0 or more times. At first glance this doesn’t seem useful
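Quantifiers become more useful when combined with the character classes from earlier. The following is a minimal sketch with made-up inputs: one pattern checks a fixed layout of digits, and the other uses .* to allow anything between a first and last character.

```julia
# Hypothetical inputs, chosen only for illustration.
codes = ["555-1234", "55-1234", "555-12345", "5551234"]

# Exactly three digits, a hyphen, then exactly four digits, and nothing else.
valid = map(s -> occursin(r"^\d{3}-\d{4}$", s), codes)

# Starts with c, ends with t, with zero or more characters in between.
words = ["cat", "ct", "carrot", "cats"]
c_to_t = map(s -> occursin(r"^c.*t$", s), words)
```

Only the first code has the required layout, and only "cats" fails the second pattern because it does not end in t.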
Check Your Understanding 26.3.