Sets and Strings
Exercise 2.1. You’re trying to break into a safe that has a PIN lock. The safe has two buttons: 0 and 1. The PIN you’re trying to guess is a three-digit sequence of binary numbers, and accepts the last three digits you’ve typed in without needing you to hit enter: i.e. if you typed in “00010”, the safe would open if the pin was either “000” or “001” or “010”.
Sounds easy, right? There’s only eight possible PINs to check (two possibilities per digit, three digits in the PIN possible pins), so we should be able to brute-force the lock by checking all possibilities.
However, the safe is wired to call the cops if more than ten buttons are pressed and the correct PIN is not entered. As such, we can’t use our brute-force approach: that could take entries!
Is there an approach that is guaranteed to break us into the safe?
Exercise 2.2. You’re a geneticist! As such, you’re working with DNA strands, which we can think of as long strings over the alphabet , if we let these letters represent the nucleotides adenine, cytosine, guanine and thymine.
You’ve designed a clever little combination of DNA restriction+polymerase enzymes that do the following: given any string of DNA strands, every time there’s a substring of the form “\ldots AC \ldots” in , that substring gets cut out and replaced with “\ldots CCA \ldots”
So, for example, if your DNA strand was “ACGT,” it would get turned into “CCAGT” and then would stay stable from there. If your strand was “ACCT”, however, it would first turn into “CCACT”, and then “CCCCAT.”
Suppose you’re originally working with strings of DNA all of the form “AAAAC,” and you dump them into a bath with your enzymes in it. What would you expect to see at the end of this process?
Earlier in this coursebook, we discussed various properties about numbers (divisibility, modular arithmetic, etc) that are very useful in computer science!
However, numbers are not the only things that we work with in computing systems. We also work heavily with things like passwords, user IDs, databases full of names: i.e. strings! We study these objects in the following section:
Strings
First, let’s define what an alphabet is:
Example 2.1. You’re already familiar with the Roman alphabet:
Another alphabet that comes in handy is the collection of decimal digits! We use this to describe numbers:
If we’re working in binary, we use a much smaller alphabet:
If we’re working with DNA, we’d use
as our alphabet, where these represent the four nucleotide bases cytosine [C], guanine [G], adenine [A] or thymine [T].
There are other alphabets that are too big to write down here: for example, the set of all Unicode symbols, or the set of all emojis!
Given an alphabet, it’s often useful to be able to refer to the whole thing with a symbol. We’ll do this by writing something like . This notation, where we list our symbols between a pair of curly braces and separate them with commas, tells us that is an alphabet containing the ten symbols .
With the definition of an alphabet in hand, we can define strings:
Some people refer to strings as “words:” if you see an author referring to a collection of words over a given alphabet, this is just a synonym for strings!
Example 2.2. If we let be the Roman alphabet described earlier, then “cat,” “mongoose,” and “ssssssssssss” are all strings over this alphabet. Note that these strings don’t have to correspond to any particular meaning; they’re just sequences of symbols!
If we let be the decimal alphabet, then “123,” “00012,” and “999” are each possible strings over this alphabet. Again, these don’t always have to correspond to numbers! In particular, notice that as strings we think that “00012” and “12” are different things. Even though as numbers they’re equal, as strings they’re quite different: “00012” has zeroes in it, while “12”does not. (That is, think about entering a password on your phone. There, if someone has a password of “00012,” entering “12” shouldn’t unlock your phone!)
We will sometimes not specify an alphabet, and instead just refer to strings by listing their entries. If so, we assume that their alphabet is the most reasonable one to work with that string in (usually either the Roman alphabet, decimal, or binary.)
A particularly useful string to refer to is the empty string "", i.e. the string containing no symbols. We denote this string by writing .
Strings are incredibly useful in computer science! Essentially every program we have works with data in the form of strings, in the form of ID numbers, names, IP addresses, and just simply the binary strings that encode literally everything that a computer does.
Perhaps the simplest operation to define on strings is length:
Example 2.3. The string “abcdef” has length 6, the string “00000” has length 5, and the string “0123” has length 4.
The idea of length is useful when we’re trying to describe a general string! Many arguments involving strings will start with the sentence “Take a string over the alphabet . Let be the length of , and write as .”
Note that when we’re working with strings, writing something like ” ” does not mean that we’re multiplying these things all together as if they were numbers! That is: the string “0123” is not the same thing as the product .
This is why it’s important to keep track of the type of thing you’re working with / in general, at the start of problems, to define your variables and notation.
In particular, we can use this to define what it means for two strings to be equal:
In other words, two strings are equal if and only if they are literally character-for-character identical! Note that two strings of different lengths are always nonequal.
Example 2.4.
- The strings “00001” and “1” are different. Even though the underlying numbers they represent are the same, these are different-length strings.
- If we take the alphabet given by all characters on a keyboard, the strings “12+23” and “10+25” are different. Even though these are the same length and represent the same underlying integer, the characters are different in some places: for instance, the second character of the first string is 2, while the second character of the second string is 0.
A particularly useful operation on strings in computer science is concatenation:
Example 2.5.
-
Let “song” and “bird”. Then is the string “songbird”.
-
Let “12” and “0”. Then is equal to “120”. Notice that this is very different to what we would mean by writing if we thought of as integers; there, would denote !
In general, if you’re using string concatenation on strings of numbers, make sure to indicate this to your reader repeatedly through your working so that they know what you’re doing. The use of quotation marks can help keep things clear: that is, because we wrote “12” and “0”, we’ve told you that we are thinking of as strings, and thus that concatenation is the appropriate way to combine them.
-
You can concatenate multiple strings at once: i.e. if “3”, ”.”, and “14159265…”, then is just “3.14159265…”
Notice that if has length and has length , then has length .
Concatenation is used in tons of practical applications:
- Every bank account number is a concatenation of a bank code (telling you what company you bank with,) an account number (which tells the bank who owns this account,) and an account type code (telling you what kind of account that number is attached to.)
- We saw that many ID numbers have “check digits” when we worked with modular arithmetic! As such, your full ID number is usually created by concatenating your account number with the check digit.
Related to the idea of concatenation, we have the concepts of prefixes, substring and suffixes:
Similarly, we say that is a suffix of if is just with some additional stuff possibly tacked on the front: i.e. if we can find a third string such that .
Finally, we say that is a substring (alternately, an “infix”) of if is just with some stuff possibly tacked on both the front and end: i.e. if we can find strings such that .
Example 2.6.
- If “snowball,” then “snow” is a prefix of , “ball” is a suffix of , and “now” is a substring of .
- If “112323411,” then “112” is a prefix of , “323411” is a suffix of , and “2” is a substring of .
As a bit of practice with writing arguments, we study a few claims:
Proof. Take any string . Notice that , because attaching the empty string to the start or end of any string doesn’t change it. Therefore meets the definition of being a prefix, suffix, and substring for any other string !
Proof. If is a prefix of , then there is some string such that . Therefore, we have as well, because concatenating the empty string with any string doesn’t change it! This shows us that satisfies the definition of substring, as claimed.
We can use this idea of a “substring” to answer our safe-cracking problem:
Answer to Exercise 2.1. Think of the sequence of keys we’re entering into the safe as a string . If we do this, then the properties we want to have are the following: we want every three-digit binary string to occur as a substring of , and we want to have length at most 10.
As it turns out, we can do this! Enter the following string: “0001011100.” This string has length 10, and contains all possible three-digit pins as subsequences, as shown below. Success!
We can also use it to answer our DNA puzzle:
Answer to Exercise 2.2.
On one hand, we could just brute-force the answer, by repeatedly looking for “AC” substrings and replacing them with “CCA” substrings:
AAAC | AAACCA | |
AAACCA | AACCACA | |
AACCACA | ACCACCCAA | |
ACCACCCAA | CCACCCACCAA | |
CCACCCACCAA | CCCCACCCCACAA | |
CCCCACCCCACAA | CCCCCCACCCCCAAA | |
CCCCCCACCCCCAAA | CCCCCCCCACCCCAAA | |
CCCCCCCCACCCCAAA | CCCCCCCCCCACCCAAA | |
CCCCCCCCCCACCCAAA | CCCCCCCCCCCCACCAAA | |
CCCCCCCCCCCCACCAAA | CCCCCCCCCCCCCCACAAA | |
CCCCCCCCCCCCCCACAAA | CCCCCCCCCCCCCCCCAAAA |
In other words: the result is a string of 16 C’s, followed by 4 A’s.
Alternately, we could just notice the following: every time a C moves past an A, we replace that with two . Therefore, if we move a past two ’s in a row, we’d expect to repeat this “doubling” process twice, and have four ’s; in general, if we move a past ’s in a row, we’d expect to see ’s at the end, as we’ve doubled our ’s times in this process! This matches our results, as .
Sets
A second useful object, that we will often study in relation to strings, is the concept of a set:
To describe a set, we just list its elements between a pair of curly braces: for example, would be how we would describe the set consisting of the three numbers 1, 2 and 3.
Basically every collection of things in real life can be thought of as a set:
Example 2.7.
- The collection of all strings in the Oxford English Dictionary is a set. It contains elements like “heart” and “number,” but not things like “arbleorble.”
- The collection of all words in Māori is a set. This set contains elements like “tapawhāa” and “tau” (the Māori words for rectangle and number,) but does not contain strings like “123abc.”
- The collection of all commands in C is a set.
- The collection of all binary strings of length at most 2 is a set. We could write this set out by listing its elements: .
- The “empty” set containing no elements is a set! We call this the empty set, and refer to this by drawing the symbol . This is a fairly useful set to be able to refer to, for the same reasons that 0 is a useful number; it can be handy to talk about “nothing” in a concrete way!
- The set of all prime numbers is a set:
- The set of integers , the set of rational numbers , the set of natural numbers , and the set of real numbers are all sets.
- The set of all polynomials with degree at most 3 is a set: it contains things like and .
- The set of all irrational numbers is a set.
- The set of all numbers that are solutions to the equation is a set. (Specifically, because , this set is just , the set containing only one object, namely 1).
Notice that sets can be finite (in the case of things like “the collection of all English words”) or infinite (in the case of the set of all prime numbers!)
To make our lives easier when working with sets, let’s make a few notational conventions about how we should treat them:
-
When we’re describing a set, we don’t care about the order in which we list our elements: i.e. cat, tag, tact and tag, cat, tact are both the “same” to us! This is because we only care about what things are contained within a set; the order is something that we’ll wind up changing a lot depending on the context (i.e. sometimes alphabetical, sometimes by length…) and isn’t itself something we want to care about.
-
Similarly, when we’re describing a set, we only want to list each element once. This is because otherwise it would be quite irritating to try to look things up in our set: imagine a dictionary that just listed the word “mongoose” forty times in a row!
As such, if someone gives you a set in which an element is repeated twice, we just remove duplicates: i.e. we say cat, tag, tact, tact, tact and cat, tag, tact are the same, and would never write the first thing if we couldn’t help it.
-
In the case of cat, tag, tact , we were able to describe our set by just listing its elements. This works for small cases, but becomes quite unwieldly for larger sets: imagine having to write out all of the words in French before discussing the French language!
To deal with this, we have an alternate way of writing sets: you can describe them by giving a property. For instance, when we say “the set of all words in Māori” above, we’re giving you a property that a given string of letters may or may not satisfy (i.e. “is it a word in Māori”), and then taking the set of all words that satisfy that property.
While the sentences we used in our examples above do work as definitions for sets, you can also use the following more “math-y” construction: to describe the set of all strings with property blah, you can just write .
For instance, the set of all odd-length binary strings could be described as the following:
We use the notation ” ” as shorthand for the word “in.”
The ” ” on the left tells you the variable name, the divider just separates the variable from its property, and the text at the right gives the required property.
You can also use the left-hand part to describe the structure of your set’s elements: i.e. something like
gives you all binary strings that start with the prefix “001”.
One useful concept when working with sets is a notion of “size:”
In maths, the word cardinality is used to refer to the size of a set. If you take papers like Maths 190 or Compsci 225, you can learn to study the idea of “different sizes of infinity” by working with cardinality! In particular, using the idea of a bijection in those courses, you can show that the integers, rationals, and natural numbers somehow all have the same “countable” size of infinity, while the real numbers somehow have a larger and “uncountable” size of infinity.
Example 2.8.
- The set has size 5.
- The set of all binary strings of length 2, i.e. , has size 4.
Another useful concept when working with sets is the idea of a “subset:”
Example 2.9.
-
Let be the collection of all University of Auckland ID numbers, and let be the collection of all University of Auckland ID numbers corresponding to active Compsci 120 students. Then is a subset of !
-
Let be the set of all binary strings of length 3, and let be the set of all binary strings with exactly two 1’s.
Then is not a subset of . This is because contains things like ” ”, which are not in . Similarly, is not a subset of , because contains things like ” ” that are not in !
-
Let be the English language, and be the collection of all English words that rhyme with “avocado.” Then is a subset of , as every word in is by definition a word in !
We also have a number of useful operations that we perform on sets:
Example 2.10.
-
Let be the collection of all English words with even length and be the collection of all English words with odd length. In this case, is the collection of all English words.
-
Let be the collection of all Compsci 120 students that turned in assignment 1, and be the collection of all Compsci 120 students that attended tutorial 1. Then is the collection of all Compsci 120 students who either attended tutorial 1 or turned in assignment 1, or both.
In general, unions work like “or” operations: the union of a set defined by property with a set defined by property is just the collection of all elements that satisfy property or .
-
Let be the collection of the 1000 most common phrases used in spam emails (things like “You be a Winner!!!1!!”) and be a collection of dodgy email addresses (e.g.
"bi11.gates@micr0soft.ie"
). Then, the union is a good start for a “block list,” i.e. something that an email filter can use to automatically trash certain emails.
Example 2.11.
-
Let be the English language and be the German language. Then is the set of words that are both in English and German at the same time: i.e. words like “alphabet,” “computer” and ”tag” would be in , as they are all both English and German words.
-
Let be the set of numbers that are multiples of 3, and be the set of numbers who are multiples of 2. Then is the set of numbers that are multiples of both 2 and 3; i.e. it’s the set of all numbers that are multiples of 6!
Like how union was an “or,” intersection works like an “and” operation: that is, the intersection of a set defined by property with a set defined by property is just the collection of all elements that satisfy property and .
-
If is the set consisting of ID numbers of current Compsci 120 students, and is the set consisting of ID numbers of current Compsci 720 students, then , the empty set. (This is because there are no students simultaneously taking 120 and 720!)
Example 2.12.
-
If was the set of ID numbers for all current Compsci 120 students, and was the set of ID numbers for Compsci 120 students who attended at least eight tutorials, then is the set of ID numbers for students who attended seven or fewer tutorials (i.e. the ID numbers of students who will not have perfect marks for tutorials. Don’t be in this set!)
-
If is the set of prime numbers, and is set of odd integers, then is the collection of all primes that are not odd: that is, .
-
Let denote the set of all ASCII strings of length at least 10, be the set of all English words, and be a list of the 10,000 most common passwords. The set is a good start to a list of “acceptable” passwords: i.e. if you were making a login system, you could require all of your users to pick words in . Doing this would mean that they have to pick passwords that
- Have length at least 10 (i.e. are in ),
- Aren’t in a dictionary (i.e. not in ), and
- Aren’t commonly used (i.e. not in ).
Useful!
Finally, we describe what it means for two sets to be equal:
- Every element in is a element in , and
- Every element in is also a element in .
If you go back to our remarks earlier, this should make sense. We said that the only thing we cared about for a set was the elements it contained; i.e. we didn’t care about the order, and we ignored repeats/etc. Therefore, two sets should be the same if they contain the same elements!
A useful proof technique, that we’ll often use to show that two sets are the same is the following. Take two sets that you want to show are equal. Suppose you showed that
- every element in is a element in , and also
- every element in is a element in .
Then, by the definition above, we would know that and are equal! As such, we can use this two-part approach to prove that many pairs of objects are equal. We study a few examples here, to get the hang of this:
Proof.
We proceed as suggested above:
- First, we show that every element in is in . To do this, we note that by definition is the set of all elements that are in either or . Therefore, if we take any element , we either have or . This lets us work in cases:
-
If , then recall that we’ve assumed that . By definition, this means that every element in is also in . Therefore, we have .
-
If , then we trivially have .
Therefore, we’ve shown that for any , we have , as desired.
- Second, we show that every element in is in . This is not too challenging.
Just notice that , by definition, contains all elements in either or . Therefore, for any , we have by definition.
This completes the two-way argument, as desired!
This is not the only way to prove that two sets are equal! As always, simply expanding the definitions of both sets can often do the same trick:
Proof.
We proceed by expanding our definitions:
- First, notice that , by definition, is the collection of all elements in that are not in .
- By definition again, is “(the collection of all elements in that are not in ) that are also not in .”
- We can simplify this to “the collection of all elements in that are not in or .”
- Every element of is, um, in .
- Therefore, “the collection of all elements in that are not in or ” is the empty set, as the “not in ” condition eliminates all of the elements in .
So we’ve proven our claim!
To close our chapter, we study a trickier example of this process:
Proof.
We again proceed by expanding our definitions:
- On the left-hand-side, we have . By definition, consists of all of the elements that are in , but not in .
As well, by definition we know that is the set of all elements that are in either or , or both.
Therefore, can just be thought of as all of the elements that are in , but not in either or .
-
On the right-hand-side, we can similarly use our definitions to notice that is the set of all elements that are in but not , and that is the set of all elements that are in but not .
As a consequence, we have that is the set of all elements that are both (in but not ) and (in but not ). Logically, we can simplify this sentence to the condition “in , but not in either or .”
These are the same statements; therefore we’ve shown that these sets are equal!
Practice Problems
- (-) Show that any suffix of a word is a substring of .
- If is both a prefix and a suffix of , then must ? Either show that this is true, or find a counterexample.
- Show that if is a substring of and is a substring of , then .
- (+) Suppose our safe in Exercise 2.1 had PIN numbers with decimal digits, not binary (i.e. they could be any digit from 0-9, instead of just 0 and 1). What is the smallest number of buttons we would have to press to guarantee that the safe would open in this situation?
- Suppose that and are two sets with the following property: every string in is a prefix of a string in . Is it possible for to contain more elements than ? Either find such an example, or explain why this is impossible.
- (-) Explain why for any set .
- Show that for any three set , that . (Try doing so using the method from Claim 2.3, and then try with the method from Claim 2.4! Which do you prefer?)
- Take two binary strings of the same length. We say that and are orthogonal if they disagree at precisely half of their locations: for example, “1111” and “1100” are orthogonal.
- (-) Show that if are odd-length strings, then and cannot be orthogonal.
- Find a set consisting of four length-4 strings that are all orthogonal to each other (i.e. every possible pair of strings in your set should be orthogonal).
- (+) Find a set consisting of length- strings that are all orthogonal to each other.
- (++) What is the largest set of orthogonal length-668 strings that you can make?