Chapter 2 introduced strings: how to create them, concatenate them, and use interpolation. This chapter goes into more depth on how strings are stored, including UTF-8 characters, and on how to load strings from files efficiently.
Let's start with our motivating problem: loading a file and reading the results. We will use an IMDB dataset from Kaggle.com. Note: you will need to make an account to download this. Use your Finder or Explorer window to move the file into the same directory as your Julia notebook.
Check that the file exists with isfile("IMBD.csv"), which should return true. If not, double-check that the file is in the correct directory. Note that the file is named IMBD.csv, not IMDB.csv, perhaps a typo. We can read the lines in with the command movies = readlines("IMBD.csv").
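As a minimal sketch of the same steps without the Kaggle download, the following writes a tiny stand-in file and reads it back. The file name sample.csv and its contents are hypothetical and are not taken from the real dataset.

```julia
# Hypothetical stand-in for the downloaded dataset: a tiny CSV written to a
# temporary directory so the example runs without the Kaggle file.
dir = mktempdir()
path = joinpath(dir, "sample.csv")
write(path, "title,year\nAlien,1979\nHeat,1995\n")

file_found = isfile(path)     # true once the file is in place
lines = readlines(path)       # Vector{String}, one entry per line of the file
first_line = lines[1]         # the header line, without the trailing newline
```

With the real dataset, the path would simply be "IMBD.csv", provided the file sits in the notebook's directory.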
As we saw in the previous section, each movie is a string that needs to be parsed. Since readlines returns an array of strings, one per line of the file, we can access the first line of the file with m1 = movies[1]. This returns the header line, which lists the stored aspects of each movie.
You can think of a CSV file as a table of values, and these are the headers of the columns. The variable m1 is a String and behaves in many ways like an array of characters. For example, m1[1] returns 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase).
The U+0074 is the character's Unicode code point; Unicode is a more robust way of encoding characters than ASCII. The 0074 is the hexadecimal code (116 in decimal) for the 't' character.
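The relationship between a character and its numerical code can be checked directly. This is a small sketch using only functions from Julia's Base library; the variable names are illustrative.

```julia
c = 't'
cp = codepoint(c)                 # the code point as a UInt32, 0x00000074
dec = Int(c)                      # the same value in decimal
hex = string(dec, base = 16)      # the hexadecimal digits as a string
back = Char(0x74)                 # converting the code back into a character
```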
Unicode characters can be added to a string by prefacing their code with \u or \U, where the lowercase \u takes up to 4 hexadecimal digits and \U takes up to 8. For example, in Chapter 12 we used the suits of the playing cards, where \u2660 represented the spade suit. Many emoji have longer codes, such as \U1f602, which produces "😂".
ASCII stands for American Standard Code for Information Interchange and was an early form of encoding characters. It has 128 code points, so it can represent only that many characters: the keys on U.S. keyboards as well as some hidden (control) characters such as tabs and line feeds.
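A short sketch of both escape forms, using the two codes mentioned above. The variable names are illustrative.

```julia
spade = "\u2660"             # four hex digits, so \u is enough
joy = "\U1f602"              # five hex digits, so \U is needed

spade_len = length(spade)    # number of characters in the string
joy_len = length(joy)
joy_bytes = ncodeunits(joy)  # number of bytes used to store the emoji in UTF-8
suit_char = '\u2660'         # the same escape also works in a Char literal
```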
In Julia, we can test whether a character or a string is ASCII with the isascii function. For example, isascii('1') and isascii('A') both return true.
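A few more cases, including a string, help show where the boundary lies. The particular test values here are only illustrations.

```julia
a = isascii('1')            # a digit is ASCII
b = isascii('A')            # so is an uppercase letter
c = isascii('²')            # U+00B2 lies outside the 128 ASCII code points
d = isascii("IMBD.csv")     # a string is ASCII when every character is
e = isascii("(x+1)²")       # a single non-ASCII character makes this false
```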
We noted above that we can access the characters in a string using array notation. Another example is a string, which we will call quad, that contains some non-ASCII characters.
Many of its characters can be accessed directly: quad[1] returns '(', quad[end-1] returns '+' and quad[12] returns '²'. We can also use a range of indices, such as quad[1:6], which returns "(x+1)²". However, if we access quad[13], we get a StringIndexError reporting that 13 is an invalid index.
This error seems odd, but that is because Unicode characters are a bit odd. Unlike ASCII characters, which are a single byte and can be stored as a UInt8, Unicode characters encoded in UTF-8 take 1, 2, 3 or 4 bytes. Let's dig a bit deeper. We can determine which indices are valid for this string with collect(eachindex(quad)), which returns all integers between 1 and 18 except 7 and 13, the positions of the second byte of each '²'.
So does this mean that you can't treat a string like an array of characters? Not quite. If you are careful to select valid indices, and you understand that Unicode characters can be wider than 1 byte, you're good.
Another way that strings differ from arrays is that they are immutable. We saw that we can access characters using array notation, but if we try to set or update a character, we get an error. quad[1] = '[' returns
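The index behavior described here can be explored with a few Base functions. The definition of quad below is an assumption: it was chosen to agree with every index quoted in the text and may differ from the original string.

```julia
# Assumed example string, consistent with the indices discussed in the text.
quad = "(x+1)² = x²+2x+1"

nchars = length(quad)                # number of characters
nbytes = ncodeunits(quad)            # number of bytes; each '²' takes two
sq_bytes = ncodeunits("²")           # bytes needed for '²' alone
valid = collect(eachindex(quad))     # only the indices where a character starts
ok13 = isvalid(quad, 13)             # byte 13 is the second byte of a '²'
after12 = nextind(quad, 12)          # the next valid index after 12

# Indexing at an invalid position throws a StringIndexError.
err = try
    quad[13]
catch e
    e
end
```

When stepping through a string by index, nextind and eachindex avoid the invalid positions; iterating with for c in quad avoids indices entirely.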
MethodError: no method matching setindex!(::String, ::Char, ::Int64)
The function `setindex!` exists, but no method is defined for this combination of argument types.
There are a number of advantages to immutability, the primary one being that immutable values allow optimizations for speed. If you need a mutable string, you can use an array of characters instead.
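One way to work around immutability is sketched below: collect the characters into a Vector{Char}, change them, and join them back into a new string. The string quad is the same assumed example as before, and the other names are illustrative.

```julia
quad = "(x+1)² = x²+2x+1"       # assumed example string

chars = collect(quad)           # Vector{Char}, one element per character
chars[1] = '['                  # arrays are mutable, so assignment works
chars[5] = ']'                  # position 5 counts characters, not bytes
modified = join(chars)          # build a new String from the characters

# replace also returns a new string and leaves the original untouched.
replaced = replace(quad, '(' => '[')
```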
Let's return to the movies file and recall that we stored the first line as m1, which holds the headings for the columns (variables) separated by commas. We can split this line up with categories = split(m1, ","), which returns an array of 9 header strings.
Note, however, that instead of a vector of Strings, this array is of type Vector{SubString{String}}. What is something of type SubString? Just looking at it, it seems like a regular string.
A SubString actually points to the original string rather than copying it, for efficiency.
The differences between the two types get a bit into the weeds, and the current version of the Julia documentation does not go into much detail. A Discourse discussion on this gives a little more insight.
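A small sketch shows how a SubString behaves next to a String. The header line used here is hypothetical and stands in for m1.

```julia
line = "title,year,rating"        # hypothetical header line
parts = split(line, ",")

parts_type = typeof(parts)        # the element type is SubString{String}
same_text = parts[1] == "title"   # compares equal to an ordinary String
is_string = parts[1] isa String   # but it is not itself a String
owned = String(parts[1])          # copies the characters into a standalone String
sub = SubString(line, 7, 10)      # a view of bytes 7 through 10 of line
```

Converting with String is only needed when a function insists on a String argument or when the original line should not be kept alive by the view.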
Suppose we write a function concat that takes two arguments declared as String and returns their concatenation. If we call it like concat("string 1 and ", "string 2"), then we get the expected result "string 1 and string 2". This seems great, but if we use a couple of the categories from the previous section, concat(categories[1], categories[2]), we will get the error:
MethodError: no method matching concat(::SubString{String}, ::SubString{String})
The function `concat` exists, but no method is defined for this combination of argument types.
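The situation can be reproduced with a sketch like the following. The body of concat is an assumption (the text only tells us that its arguments are declared as String), and the header line is hypothetical.

```julia
# Assumed definition: a concat that only accepts String arguments.
concat(s1::String, s2::String) = s1 * s2

joined = concat("string 1 and ", "string 2")

parts = split("title,year", ",")     # hypothetical header line
# The elements of parts are SubStrings, so no method of concat applies.
result = try
    concat(parts[1], parts[2])
catch err
    err
end
```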
This is expected because the arguments taken from the categories variable are SubString{String} objects and not String objects. It seems a bit annoying that split returns SubStrings, but this is an easy fix because both types share a supertype. We can find this with:
supertype(String), supertype(SubString{String}) returns (AbstractString, AbstractString), showing that AbstractString is a supertype of both. Therefore, if we declare the arguments of the concat function above to be of type AbstractString, then it will work for both Strings and SubStrings.
A takeaway message is that any time you write a function with string arguments, use AbstractString instead of String and it should work with any type of string.
Another string type that is a subtype of AbstractString is called LazyString. Investigate the LazyString type in the Julia documentation and when you might use it.
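A version with the looser type annotations might look like the sketch below. As before, the one-line body and the header line are assumptions made for illustration.

```julia
# Assumed definition, now accepting any kind of string.
concat(s1::AbstractString, s2::AbstractString) = s1 * s2

parts = split("title,year", ",")          # hypothetical header line

plain = concat("string 1 and ", "string 2")   # two Strings
subs = concat(parts[1], parts[2])             # two SubStrings
mixed = concat(parts[1], " and more")         # one of each

supers = (supertype(String), supertype(SubString{String}))
```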
Let's start with some examples of pattern matching. The simplest pattern is an exact string. For example, to see whether a string contains "cat", we can use the occursin function.
The occursin function just determines whether the first string occurs somewhere inside the second string. So occursin("cat", "scatter") returns true, but occursin("cat", "cottage") returns false. Throughout this chapter, we will test a few different strings to determine how matches occur, so we may perform several of these matches together in a single call.
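The two matches above, and one way of running them together, can be sketched as follows. Broadcasting with the dot syntax is an assumption about how the matches are combined; a comprehension gives the same answers.

```julia
m1 = occursin("cat", "scatter")      # "cat" appears inside "scatter"
m2 = occursin("cat", "cottage")      # no "cat" inside "cottage"

words = ["scatter", "cottage"]
both = occursin.("cat", words)       # the dot broadcasts over the vector words
same = [occursin("cat", w) for w in words]
```

The broadcast form returns a BitVector, which Julia displays with 1 and 0 in place of true and false.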
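For three test strings, such a call returns the vector [1, 0, 0], indicating that only the first string matched. Note that the startswith function has its argument list switched from that of the occursin function: the string being tested comes first and the prefix second. The endswith function takes its arguments in the same order as startswith.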
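The argument order is easiest to see side by side. The three test strings below are hypothetical, chosen so that only the first one starts with "cat".

```julia
words = ["category", "scatter", "bobcat"]   # hypothetical test strings

single = startswith("category", "cat")   # string first, then the prefix
starts = startswith.(words, "cat")       # displayed as [1, 0, 0]
ends = endswith.(words, "cat")           # string first, then the suffix
inside = occursin.("cat", words)         # pattern first, then the string
```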