Chapter 25 Using Strings

We were introduced to strings in Chapter 2, which explains how to create strings, concatenate them and use interpolation. This chapter goes into more depth: how strings are stored, how UTF-8 characters are encoded and how to load strings from files efficiently.

Section 25.1 Loading a file of strings

Let’s start with our motivating problem: loading a file and reading the results. We will use an IMDB dataset from Kaggle.com. Note: you will need to make an account to download this. Use your Finder or Explorer window to move the file into the same directory as your Julia notebook.

Check that the file exists with isfile("IMBD.csv"), which should return true. If not, double check that the file is in the correct directory. Note that the file is named IMBD.csv, not IMDB.csv, perhaps a typo. We can read the lines in with the command

movies = readlines("IMBD.csv")

and the result is an array of nearly 10,000 strings.

There is a package called CSV that will do a better job loading in files of this type, but we want something simple.

Each element of the array is one line of the file and consists of 9 values.
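
The loading step can be tried without the Kaggle download. The following is a minimal sketch that assumes a small temporary file as a stand-in for IMBD.csv; the file contents and the variable names are made up for illustration. It also shows eachline, which reads one line at a time instead of building the whole array.

```julia
# Hypothetical stand-in for the downloaded file: three short lines in a temporary file.
path, io = mktemp()
write(io, "title,year\nFirst Movie,1999\nSecond Movie,2005\n")
close(io)

isfile(path)                 # true

# readlines loads every line into a Vector{String}, dropping each trailing newline.
lines = readlines(path)
length(lines)                # 3
lines[1]                     # "title,year"

# eachline iterates lazily, so only one line needs to be held in memory at a time.
n_long = count(line -> length(line) > 12, eachline(path))
```

For a file of roughly 10,000 lines either approach is fine; the lazy version mainly matters for files too large to hold comfortably in memory.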

Throughout this section, we will examine how to process this file with the intent of understanding strings and how we can manipulate them in Julia.

Section 25.2 Strings of Unicode Characters

From the previous section, each movie is a string that needs to be parsed. Since the readlines function returns an array of strings, each of which is a line of the file, we can access the first line of the file with m1 = movies[1]. This returns

"title,year,certificate,duration,genre,rating,description,stars,votes"

which lists the attributes stored for each movie. You can think of a CSV file as a table of values, and these are the headers of the columns. The variable m1 is a String and behaves in many ways like an array of characters. For example, m1[1] returns

't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)

which contains a lot of information. Here’s a breakdown:

't' is the character itself. Note that it is surrounded by single quotes, which distinguishes a character from a string.

ASCII indicates that the character also belongs to ASCII, an older way of encoding characters. We’ll see this in more detail below.

Unicode U+0074 indicates that the character is Unicode, a more comprehensive way of encoding characters. The 0074 is the numerical code for the 't' character, written in hexadecimal.

(category Ll: Letter, lowercase) lists characteristics of the character. Ll is shorthand for Letter, lowercase.

Unicode characters can be added to a string by prefacing their hexadecimal code with \u or \U: the lowercase \u takes up to 4 hexadecimal digits and \U takes up to 8. For example, in Chapter 12 we used the suits of playing cards, where \u2660 represented the spade suit. Many emoji have longer codes, such as \U1f602, which produces "😂".
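
As a short illustration of the two escape forms, here is a sketch using the codes mentioned above; the variable names are only for this example. It also shows the conversion between a character and its numerical code.

```julia
spade = "\u2660"      # four hexadecimal digits, so \u suffices
joy   = "\U1f602"     # five hexadecimal digits, so \U is needed

println(spade, " ", joy)

# A Char converts to its numerical code point and back.
code = Int('t')       # 116, which is 0x74 in hexadecimal
letter = Char(0x74)   # 't'
```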

ASCII stands for American Standard Code for Information Interchange and was an early form of encoding characters. It has 128 code points, so it can represent only that many characters: the keys on U.S. keyboards as well as some non-printing characters such as tabs and line feeds.

In Julia, we can test if either a character or a string is ASCII with the isascii function. For example, isascii('1') and isascii('A') both return true.
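
A few more cases may help, including a string that is not ASCII. The helper nonascii below is hypothetical, written only for this sketch, and is not part of Julia.

```julia
isascii('A')            # true
isascii("title,year")   # true: every character is ASCII
isascii('²')            # false
isascii("x²")           # false: one non-ASCII character is enough

# Hypothetical helper: collect the non-ASCII characters of a string.
nonascii(s::AbstractString) = [c for c in s if !isascii(c)]

nonascii("(x+1)² = x²+2x+1")   # ['²', '²']
```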

Section 25.3 Strings are indexable, really?

We noted above that we can access the characters in a string using array notation. Another example is the following string, which contains some non-ASCII characters:

quad = "(x+1)² = x²+2x+1"

Many of these can be accessed: quad[1] returns '(', quad[end-1] returns '+' and quad[12] returns '²'. We can also use a range of indices, such as quad[1:6], which returns "(x+1)²". However, if we access quad[13], we get the following error:

StringIndexError: invalid index [13], valid nearby indices [12]=>'²', [14]=>'+'

This error seems odd, but it follows from how Unicode characters are stored. An ASCII character is a single byte and can be stored as a UInt8, whereas a Unicode character in the UTF-8 encoding that Julia uses for String takes 1, 2, 3 or 4 bytes. String indices count bytes, not characters, so an index is valid only where a character starts. We’ll dig a bit deeper into this. We can determine which indices are valid for this string with collect(eachindex(quad)), which returns all integers between 1 and 18 except 7 and 13, the positions of the second byte of each '²'.
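
The gap between the number of characters and the number of bytes can be checked directly. This is a small sketch using the same quad string; the variable name invalid is only for illustration.

```julia
quad = "(x+1)² = x²+2x+1"

length(quad)        # 16 characters
ncodeunits(quad)    # 18 bytes, because each '²' occupies two
lastindex(quad)     # 18, the value that `end` stands for

isvalid(quad, 12)   # true: a character starts at byte 12
isvalid(quad, 13)   # false: byte 13 is the second byte of '²'

# Byte positions that eachindex skips
invalid = setdiff(1:ncodeunits(quad), collect(eachindex(quad)))   # [7, 13]
```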

The function codeunits applied to a string will return an array of all of the individual bytes of the string. For example codeunits("²") returns

2-element Base.CodeUnits{UInt8, String}:
 0xc2
 0xb2

which shows that '²' is a 2-byte character. Alternatively, the emoji we saw above, examined with codeunits("😂"), returns

4-element Base.CodeUnits{UInt8, String}:
 0xf0
 0x9f
 0x98
 0x82

which shows that this is a 4-byte Unicode character.
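
The byte width of a single character can also be queried with ncodeunits. The sketch below covers one character of each width and shows that the bytes convert back to the original string; the name bytes is just for this example.

```julia
ncodeunits('t')     # 1
ncodeunits('²')     # 2
ncodeunits('♠')     # 3
ncodeunits('😂')    # 4

# The bytes round-trip back to the original string.
bytes = collect(codeunits("²"))    # UInt8[0xc2, 0xb2]
String(copy(bytes)) == "²"         # true
```

The copy is there because String may take ownership of the byte vector it is given, leaving the original vector empty.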

So does this mean that you can’t treat a string like an array of characters? Not quite. As long as you are careful to select valid indices and keep in mind that Unicode characters can be wider than 1 byte, indexing works fine.
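
Being careful can be made concrete. The following sketch shows three ways to reach characters without landing on an invalid index: iterating, asking nextind for the position of the n-th character, and collecting into a vector of characters. The names i11 and chars are illustrative.

```julia
quad = "(x+1)² = x²+2x+1"

# Iterating over a string yields whole characters at valid indices only.
for (i, c) in pairs(quad)
    c == '²' && println("found ² at byte index ", i)
end

# nextind(s, 0, n) gives the byte index at which the n-th character starts.
i11 = nextind(quad, 0, 11)   # 12
quad[i11]                    # '²'

# Collecting into a Vector{Char} allows plain positional indexing.
chars = collect(quad)
chars[11]                    # '²'
```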

Section 25.4 Strings are immutable

Another way that strings differ from arrays is that they are immutable. We saw that we can access characters using array notation, but if we try to set or update a character, we get an error. For example, quad[1] = '[' returns

MethodError: no method matching setindex!(::String, ::Char, ::Int64)
The function `setindex!` exists, but no method is defined for this combination of argument types.

There are a number of advantages to immutability, the primary one being that the language can make optimizations for speed. If you need a mutable string, you can use an array of characters instead.
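
Here is one possible way to work around immutability, sketched with the quad string from the previous section: either edit an array of characters and join it back together, or build a new string with replace. The variable names are illustrative.

```julia
quad = "(x+1)² = x²+2x+1"

# A Vector{Char} is mutable, so individual characters can be overwritten.
chars = collect(quad)
chars[1] = '['
chars[5] = ']'
bracketed = join(chars)            # "[x+1]² = x²+2x+1"

# replace builds a new string and leaves the original untouched.
renamed = replace(quad, "x" => "y")   # "(y+1)² = y²+2y+1"
quad                                  # still "(x+1)² = x²+2x+1"
```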

Section 25.5 Splitting up a String

Let’s return to the movies file and recall that we stored the first line as m1, which holds the headings for the columns (variables) separated by commas. We can split this line up with categories = split(m1,","), which returns

9-element Vector{SubString{String}}:
 "title"
 "year"
 "certificate"
 "duration"
 "genre"
 "rating"
 "description"
 "stars"
 "votes"

which shows that there are 9 header strings in this file. Note, however, that instead of getting a vector of Strings, this array is of type Vector{SubString{String}}. What is a SubString? It looks just like a regular string.

A SubString does not copy its characters; for efficiency, it points into the original string.

The differences between the two types get a bit into the weeds, and the current version of the Julia documentation does not cover them in detail. A Discourse discussion on this gives a little more insight.
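
For everyday use, the main things to know are that a SubString compares like a String and that it can be converted to one when needed. A minimal sketch, using the header line from the movie file:

```julia
m1 = "title,year,certificate,duration,genre,rating,description,stars,votes"

sub = SubString(m1, 1, 5)     # a view of bytes 1 through 5 of m1
sub == "title"                # true: comparison is by content
typeof(sub)                   # SubString{String}

# String(...) copies the characters into a new, independent String.
str = String(sub)
typeof(str)                   # String

# Applied elementwise to the output of split, this yields a Vector{String}.
categories = String.(split(m1, ","))
```

Converting is only needed when some other function insists on String; otherwise keeping the SubStrings avoids the extra copies.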

Section 25.6 Writing functions with string arguments

Although there are plenty of ways to concatenate two (or more) strings, let’s write a concat function that does this. For example:

concat(str1::String, str2::String) = str1 * str2

and if we call it with concat("string 1 and ", "string 2"), then we get the expected result "string 1 and string 2". This seems great, but if we use a couple of the categories from the previous section, concat(categories[1], categories[2]), we get the error:

MethodError: no method matching concat(::SubString{String}, ::SubString{String})
The function `concat` exists, but no method is defined for this combination of argument types.

This is expected because the arguments taken from the categories variable are SubString{String} objects and not String objects. It may seem annoying that split returns SubStrings, but there is an easy fix because the two types share a supertype. We can find this with:

supertype(String), supertype(SubString{String}), which returns (AbstractString, AbstractString), showing that AbstractString is a supertype of both. Therefore, if we declare the arguments of the concat function above to be of type AbstractString, it will work for both Strings and SubStrings. That is

concat(str1::AbstractString, str2::AbstractString) = str1 * str2

A takeaway message is that anytime you write a function with string arguments, use AbstractString instead of String and it should work with any type of string.
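
To check the takeaway, the AbstractString version of concat can be called with every combination of the two types. This sketch uses a shortened header line so that it stands alone.

```julia
concat(str1::AbstractString, str2::AbstractString) = str1 * str2

categories = split("title,year,certificate", ",")

concat("string 1 and ", "string 2")        # "string 1 and string 2"
concat(categories[1], categories[2])       # "titleyear"
concat(categories[1], " of the movie")     # "title of the movie"

String <: AbstractString                   # true
SubString{String} <: AbstractString        # true
```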

Check Your Understanding 25.1.

Another string type, which like String is a subtype of AbstractString, is called LazyString. Investigate the LazyString type in the Julia documentation and when you might use it.

Section 25.7 Simple Pattern Matching in Strings

Let’s start with some examples of pattern matching. If we want to determine if a string matches a pattern, we can use the exact string. For example, let’s see if a string matches "cat". We can do this with the occursin function like:

occursin("cat","cat")

which will return true. Of course, we don’t need to use occursin to test that two strings are equal,

"cat"=="cat"

will do this. The occursin function determines whether the first string occurs somewhere inside the second string. So occursin("cat", "scatter") returns true, but occursin("cat", "cottage") returns false. Throughout this chapter, we will test several strings at once to see how matches occur. We can do both of these matches together with

map(s -> occursin("cat", s), ["scatter", "cottage"])

and this returns

2-element Vector{Bool}:
  1
  0

and recall that Boolean vectors display 1 for true and 0 for false. Thus "scatter" matches, but "cottage" does not.
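
The same test can be written with broadcasting instead of map, and combined with filter to keep only the matching strings. This is a sketch with a made-up word list; note that the match is case sensitive.

```julia
words = ["scatter", "cottage", "Cathedral"]

hits = occursin.("cat", words)          # [1, 0, 0]; "Cat" does not match "cat"
filter(w -> occursin("cat", w), words)  # ["scatter"]

# Lowercasing first gives a case-insensitive test.
hits_ci = occursin.("cat", lowercase.(words))   # [1, 0, 1]
```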

There are also the functions startswith and endswith, which match a string at the beginning or the end of another string.

map(s->startswith(s, "cat"), ["catastrophe", "scatter", "tigercat"])

returns the vector [1, 0, 0], indicating that only the first string matched. Note that the startswith function has its arguments in the opposite order from the occursin function. An example with endswith is

map(s->endswith(s, "cat"), ["catastrophe", "scatter", "tigercat"])

returns [0,0,1] indicating that only the last string matched.
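
When the matching strings themselves are wanted rather than a vector of Booleans, filter can be used with the same tests. The sketch below also uses findfirst, which reports where a match occurs; it goes slightly beyond the functions discussed above and is included only as a pointer.

```julia
words = ["catastrophe", "scatter", "tigercat"]

starts = filter(w -> startswith(w, "cat"), words)   # ["catastrophe"]
ends   = filter(w -> endswith(w, "cat"), words)     # ["tigercat"]

# Position of the first match as a range of indices, or nothing if there is none.
findfirst("cat", "scatter")    # 2:4
findfirst("cat", "cottage")    # nothing
```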

The next chapter will introduce you to regular expressions, which are a powerful way to do pattern matching and extraction in strings.
