COMPSCI 120:
Mathematics for Computer Science

Sets and Strings

Exercise 2.1. You’re trying to break into a safe that has a PIN lock. The safe has two buttons: 0 and 1. The PIN you’re trying to guess is a three-digit sequence of binary numbers, and accepts the last three digits you’ve typed in without needing you to hit enter: i.e. if you typed in “00010”, the safe would open if the pin was either “000” or “001” or “010”.

Sounds easy, right? There’s only eight possible PINs to check (two possibilities per digit, three digits in the PIN 23=8\Rightarrow 2^3 = 8 possible pins), so we should be able to brute-force the lock by checking all possibilities.

However, the safe is wired to call the cops if more than ten buttons are pressed and the correct PIN is not entered. As such, we can’t use our brute-force approach: that could take 838\cdot 3 entries!

Is there an approach that is guaranteed to break us into the safe?

Exercise 2.2. You’re a geneticist! As such, you’re working with DNA strands, which we can think of as long strings over the alphabet {A,C,G,T}\{A, C,G,T\} , if we let these letters represent the nucleotides adenine, cytosine, guanine and thymine.

You’ve designed a clever little combination of DNA restriction+polymerase enzymes that do the following: given any string ss of DNA strands, every time there’s a substring of the form “\ldots AC \ldots” in ss , that substring gets cut out and replaced with “\ldots CCA \ldots”

So, for example, if your DNA strand was “ACGT,” it would get turned into “CCAGT” and then would stay stable from there. If your strand was “ACCT”, however, it would first turn into “CCACT”, and then “CCCCAT.”

Suppose you’re originally working with strings of DNA all of the form “AAAAC,” and you dump them into a bath with your enzymes in it. What would you expect to see at the end of this process?

Earlier in this coursebook, we discussed various properties about numbers (divisibility, modular arithmetic, etc) that are very useful in computer science!

However, numbers are not the only things that we work with in computing systems. We also work heavily with things like passwords, user IDs, databases full of names: i.e. strings! We study these objects in the following section:

Strings

First, let’s define what an alphabet is:

Definition 2.1.Link copied An alphabet is any collection of symbols.

Example 2.1. You’re already familiar with the Roman alphabet:

A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z\fbox{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z}

Another alphabet that comes in handy is the collection of decimal digits! We use this to describe numbers:

0,1,2,3,4,5,6,7,8,9\fbox{0,1,2,3,4,5,6,7,8,9}

If we’re working in binary, we use a much smaller alphabet:

0,1\fbox{0,1}

If we’re working with DNA, we’d use

A,C,G,T\fbox{A,C,G,T}

as our alphabet, where these represent the four nucleotide bases cytosine [C], guanine [G], adenine [A] or thymine [T].

There are other alphabets that are too big to write down here: for example, the set of all Unicode symbols, or the set of all emojis!

Given an alphabet, it’s often useful to be able to refer to the whole thing with a symbol. We’ll do this by writing something like Σ={0,1,2,3,4,5,6,7,8,9}\Sigma = \{0,1,2,3,4,5,6,7,8,9\} . This notation, where we list our symbols between a pair of curly braces and separate them with commas, tells us that Σ\Sigma is an alphabet containing the ten symbols 0,1,90,1,\ldots 9 .

With the definition of an alphabet in hand, we can define strings:

Definition 2.2.Link copied Take any alphabet Σ\Sigma . A string over the alphabet Σ\Sigma is any sequence of letters in an alphabet.

Some people refer to strings as “words:” if you see an author referring to a collection of words over a given alphabet, this is just a synonym for strings!

Example 2.2. If we let Σ\Sigma be the Roman alphabet described earlier, then “cat,” “mongoose,” and “ssssssssssss” are all strings over this alphabet. Note that these strings don’t have to correspond to any particular meaning; they’re just sequences of symbols!

If we let Σ\Sigma be the decimal alphabet, then “123,” “00012,” and “999” are each possible strings over this alphabet. Again, these don’t always have to correspond to numbers! In particular, notice that as strings we think that “00012” and “12” are different things. Even though as numbers they’re equal, as strings they’re quite different: “00012” has zeroes in it, while “12”does not. (That is, think about entering a password on your phone. There, if someone has a password of “00012,” entering “12” shouldn’t unlock your phone!)

We will sometimes not specify an alphabet, and instead just refer to strings by listing their entries. If so, we assume that their alphabet is the most reasonable one to work with that string in (usually either the Roman alphabet, decimal, or binary.)

A particularly useful string to refer to is the empty string "", i.e. the string containing no symbols. We denote this string by writing λ\lambda .

Strings are incredibly useful in computer science! Essentially every program we have works with data in the form of strings, in the form of ID numbers, names, IP addresses, and just simply the binary strings that encode literally everything that a computer does.

Perhaps the simplest operation to define on strings is length:

Definition 2.3.Link copied The length of any string is the number of characters in that string.

Example 2.3. The string “abcdef” has length 6, the string “00000” has length 5, and the string “0123” has length 4.

The idea of length is useful when we’re trying to describe a general string! Many arguments involving strings will start with the sentence “Take a string ss over the alphabet Σ\Sigma . Let nn be the length of ss , and write ss as s1s2s3sns_1s_2s_3\ldots s_n .”

Note that when we’re working with strings, writing something like ”s1s2s3sns_1s_2s_3\ldots s_n ” does not mean that we’re multiplying these things all together as if they were numbers! That is: the string “0123” is not the same thing as the product 0123=00\cdot1\cdot2\cdot3 = 0 .

This is why it’s important to keep track of the type of thing you’re working with / in general, at the start of problems, to define your variables and notation.

In particular, we can use this to define what it means for two strings to be equal:

Definition 2.4.Link copied Take any two strings s=s1s2sns = s_1s_2\ldots s_n and t=t1t2tnt = t_1t_2 \ldots t_n of the same length. We say that ss and tt are equal if s1=t1,s2=t2s_1 = t_1, s_2 = t_2 , \ldots and sn=tns_n = t_n .

In other words, two strings are equal if and only if they are literally character-for-character identical! Note that two strings of different lengths are always nonequal.

Example 2.4.

A particularly useful operation on strings in computer science is concatenation:

Definition 2.5.Link copied Take any two strings s=s1s2sns = s_1s_2 \ldots s_n and t=t1t2tmt = t_1t_2\ldots t_m . The concatenation of ss and tt , written stst , is the string s1s2snt1t2tms_1s_2\ldots s_nt_1t_2\ldots t_m .

Example 2.5.

Notice that if ss has length nn and tt has length mm , then stst has length n+mn+m .

Concatenation is used in tons of practical applications:

Related to the idea of concatenation, we have the concepts of prefixes, substring and suffixes:

Definition 2.6.Link copied Let ss and tt be strings. We say that ss is a prefix of tt if tt is just ss with some additional stuff possibly tacked on the end: i.e. if we can find a third string uu such that su=tsu = t .

Similarly, we say that ss is a suffix of tt if tt is just ss with some additional stuff possibly tacked on the front: i.e. if we can find a third string uu such that us=tus = t .

Finally, we say that ss is a substring (alternately, an “infix”) of tt if tt is just ss with some stuff possibly tacked on both the front and end: i.e. if we can find strings u,vu,v such that usv=tusv = t .

Example 2.6.

As a bit of practice with writing arguments, we study a few claims:

Claim 2.1.Link copied The empty string λ\lambda is a prefix, suffix, and substring of every string tt .

Proof. Take any string tt . Notice that t=λt=tλ=λtλt = \lambda t = t \lambda = \lambda t \lambda , because attaching the empty string to the start or end of any string doesn’t change it. Therefore λ\lambda meets the definition of being a prefix, suffix, and substring for any other string tt ! \square

Claim 2.2.Link copied If ss is a prefix of tt , then ss is a substring of tt .

Proof. If ss is a prefix of tt , then there is some string uu such that su=tsu = t . Therefore, we have λsu=t\lambda s u = t as well, because concatenating the empty string λ\lambda with any string doesn’t change it! This shows us that ss satisfies the definition of substring, as claimed.

We can use this idea of a “substring” to answer our safe-cracking problem:

Answer to Exercise 2.1. Think of the sequence of keys we’re entering into the safe as a string ss . If we do this, then the properties we want ss to have are the following: we want every three-digit binary string to occur as a substring of ss , and we want ss to have length at most 10.

As it turns out, we can do this! Enter the following string: “0001011100.” This string has length 10, and contains all possible three-digit pins as subsequences, as shown below. Success!

00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00
00 00 00 11 00 11 11 11 00 00

We can also use it to answer our DNA puzzle:

Answer to Exercise 2.2.

On one hand, we could just brute-force the answer, by repeatedly looking for “AC” substrings and replacing them with “CCA” substrings:

AAAC\to AAACCA
AAACCA\to AACCACA
AACCACA\to ACCACCCAA
ACCACCCAA\to CCACCCACCAA
CCACCCACCAA\to CCCCACCCCACAA
CCCCACCCCACAA\to CCCCCCACCCCCAAA
CCCCCCACCCCCAAA\to CCCCCCCCACCCCAAA
CCCCCCCCACCCCAAA\to CCCCCCCCCCACCCAAA
CCCCCCCCCCACCCAAA\to CCCCCCCCCCCCACCAAA
CCCCCCCCCCCCACCAAA\to CCCCCCCCCCCCCCACAAA
CCCCCCCCCCCCCCACAAA\to CCCCCCCCCCCCCCCCAAAA

In other words: the result is a string of 16 C’s, followed by 4 A’s.

Alternately, we could just notice the following: every time a C moves past an A, we replace that CC with two CC . Therefore, if we move a CC past two AA ’s in a row, we’d expect to repeat this “doubling” process twice, and have four CC ’s; in general, if we move a CC past nn AA ’s in a row, we’d expect to see 2n2^n CC ’s at the end, as we’ve doubled our CC ’s nn times in this process! This matches our results, as 24=162^4 = 16 .

Sets

A second useful object, that we will often study in relation to strings, is the concept of a set:

Definition 2.7.Link copied A set AA is just a collection of things. We call those things the elements of AA , and write xAx \in A to denote with symbols the statement ”xx is an element of AA “.

To describe a set, we just list its elements between a pair of curly braces: for example, {1,2,3}\{1,2,3\} would be how we would describe the set consisting of the three numbers 1, 2 and 3.

Basically every collection of things in real life can be thought of as a set:

Example 2.7.

Notice that sets can be finite (in the case of things like “the collection of all English words”) or infinite (in the case of the set of all prime numbers!)

To make our lives easier when working with sets, let’s make a few notational conventions about how we should treat them:

One useful concept when working with sets is a notion of “size:”

Definition 2.8.Link copied A set AA has size nn if it contains precisely nn different elements. If AA contains infinitely many different elements, we say that AA has “infinite” size. We denote the size of AA by writing A|A| .

In maths, the word cardinality is used to refer to the size of a set. If you take papers like Maths 190 or Compsci 225, you can learn to study the idea of “different sizes of infinity” by working with cardinality! In particular, using the idea of a bijection in those courses, you can show that the integers, rationals, and natural numbers somehow all have the same “countable” size of infinity, while the real numbers somehow have a larger and “uncountable” size of infinity.

Example 2.8.

Another useful concept when working with sets is the idea of a “subset:”

Definition 2.9.Link copied Take two sets A,BA, B We say that BB is a subset of AA , and write BAB \subseteq A , if every object in BB is also an object in AA .

Example 2.9.

We also have a number of useful operations that we perform on sets:

Definition 2.10.Link copied Let A,BA, B be a pair of sets. We define the union of these two sets, ABA \cup B , to be the collection of all elements that are in either AA or BB or both.

Example 2.10.

Definition 2.11.Link copied Let A,BA, B be a pair of sets. We define the intersection of these two sets, ABA \cap B , to be the collection of all elements that are in both AA and BB at the same time.

Example 2.11.

Definition 2.12.Link copied Let A,BA, B be a pair of sets. We define the difference of these two sets, written ABA \setminus B or alternately ABA - B , to be the collection of all elements that are both in AA and not in BB at the same time.

Example 2.12.

Finally, we describe what it means for two sets to be equal:

Definition 2.13.Link copied We say that two sets A,BA, B are equal if they both consist of the same elements; that is, if
  • Every element in AA is a element in BB , and
  • Every element in BB is also a element in AA .

If you go back to our remarks earlier, this should make sense. We said that the only thing we cared about for a set was the elements it contained; i.e. we didn’t care about the order, and we ignored repeats/etc. Therefore, two sets should be the same if they contain the same elements!

A useful proof technique, that we’ll often use to show that two sets are the same is the following. Take two sets A,BA, B that you want to show are equal. Suppose you showed that

  1. every element in AA is a element in BB , and also
  2. every element in BB is a element in AA .

Then, by the definition above, we would know that AA and BB are equal! As such, we can use this two-part approach to prove that many pairs of objects are equal. We study a few examples here, to get the hang of this:

Claim 2.3.Link copied Let A,BA, B be any two sets such that ABA \subseteq B . Then AB=BA \cup B = B .

Proof.

We proceed as suggested above:

  1. First, we show that every element in ABA \cup B is in BB . To do this, we note that by definition ABA \cup B is the set of all elements that are in either AA or BB . Therefore, if we take any element sABs \in A \cup B , we either have sAs \in A or sBs \in B . This lets us work in cases:

Therefore, we’ve shown that for any sABs\in A \cup B , we have sBs \in B , as desired.

  1. Second, we show that every element in BB is in ABA \cup B . This is not too challenging.

Just notice that ABA \cup B , by definition, contains all elements in either AA or BB . Therefore, for any sBs \in B , we have sABs \in A \cup B by definition.

This completes the two-way argument, as desired!

This is not the only way to prove that two sets are equal! As always, simply expanding the definitions of both sets can often do the same trick:

Claim 2.4.Link copied Let A,BA, B be any two sets. Then (AB)A=(A \setminus B) \setminus A = \emptyset .

Proof.

We proceed by expanding our definitions:

  1. First, notice that ABA \setminus B , by definition, is the collection of all elements in AA that are not in BB .
  2. By definition again, (AB)A(A \setminus B) \setminus A is “(the collection of all elements in AA that are not in BB ) that are also not in AA .”
  3. We can simplify this to “the collection of all elements in AA that are not in BB or AA .”
  4. Every element of AA is, um, in AA .
  5. Therefore, “the collection of all elements in AA that are not in BB or AA ” is the empty set, as the “not in AA ” condition eliminates all of the elements in AA .

So we’ve proven our claim!

To close our chapter, we study a trickier example of this process:

Claim 2.5.Link copied If A,B,CA, B, C are three sets, then A(BC)=(AB)(AC)A \setminus (B \cup C) = (A \setminus B) \cap (A \setminus C) .

Proof.

We again proceed by expanding our definitions:

As well, by definition we know that BCB \cup C is the set of all elements that are in either BB or CC , or both.

Therefore, A(BC)A \setminus (B\cup C) can just be thought of as all of the elements that are in AA , but not in either BB or CC .

These are the same statements; therefore we’ve shown that these sets are equal! \square

Practice Problems

  1. (-) Show that any suffix of a word tt is a substring of tt .
  2. If ss is both a prefix and a suffix of tt , then must s=ts=t ? Either show that this is true, or find a counterexample.
  3. Show that if ss is a substring of tt and tt is a substring of ss , then s=ts=t .
  4. (+) Suppose our safe in Exercise 2.1 had PIN numbers with decimal digits, not binary (i.e. they could be any digit from 0-9, instead of just 0 and 1). What is the smallest number of buttons we would have to press to guarantee that the safe would open in this situation?
  5. Suppose that AA and BB are two sets with the following property: every string in AA is a prefix of a string in BB . Is it possible for AA to contain more elements than BB ? Either find such an example, or explain why this is impossible.
  6. (-) Explain why L=L\emptyset \cup L = L for any set LL .
  7. Show that for any three set A,B,CA, B, C , that A(BC)=(AB)(AC)A \setminus (B \cap C) = (A \setminus B) \cup (A \setminus C) . (Try doing so using the method from Claim 2.3, and then try with the method from Claim 2.4! Which do you prefer?)
  8. Take two binary strings s,ts, t of the same length. We say that ss and tt are orthogonal if they disagree at precisely half of their locations: for example, s=s= “1111” and t=t= “1100” are orthogonal.