COMS W3261 CS Theory
Lecture 3: Regular Expressions

1. Nondeterministic Finite Automata with Epsilon-Transitions

An ε-NFA is an NFA (Q, Σ, δ, q₀, F) whose transition function δ is a mapping from Q × (Σ ∪ {ε}) to P(Q), the set of subsets of Q.
The language of an ε-NFA is the set of all strings that spell out a path from the start state to a final state. There can be ε-transitions along this path.
Epsilon-closures

We define ECLOSE(q), the ε-closure of a state q of an ε-NFA, recursively as follows:

State q is in ECLOSE(q).
If state p is in ECLOSE(q), then all states in δ(p, ε) are also in ECLOSE(q).
ECLOSE(q) is the set all states that can be reached from state q using zero or more ε-transitions.

We can compute the ε-closure of a set of nondeterministic states S by taking the union of the ε-closures of all the individual states in S.

2. Converting an ε-NFA to an Equivalent DFA Using the Subset Construction

We can eliminate all ε-transitions from an ε-NFA by converting it into an equivalent DFA using the subset construction.
Given an ε-NFA E = (Q_E, Σ, δ_E, q_E, F_E), we construct the DFA D = (Q_D, Σ, δ_D, q_D, F_D) as follows:

Q_D = P(Q_E).
δ_D is computed for all a in Σ and S in P(Q_E) as follows:

Let S = {p₁, p₂, …, p_k} and let {r₁, r₂, …, r_m} be the union of δ_E(p_i, a) for i = 1, 2, …, k.

Then, δ_D(S, a) = ECLOSE({r₁, r₂, …, r_m}).

q_D = ECLOSE(q_E).
F_D = { S | S is in Q_D and S contains a state in F_E }.

As with the subset construction for NFA's, we can prove by induction that L(D) = L(E).

3. Regular Expressions

Regular expressions are another very different formalism for defining the regular languages.
A regular expression E is an algebraic expression that denotes a language L(E).
Programming languages such as awk, java, javascript, perl, python use regular expressions to match patterns in strings.
There are differences in the regular expression notations used by various programming languages, the most common variants being POSIX regular expressions and perl-compatible regular expressions.
However, virtually all regular-expression notations have the operations of union, concatenation, and Kleene closure. We shall call regular expressions with just these three operators Kleene regular expressions.

Kleene regular expressions

We can specify Kleene regular expressions over an alphabet Σ and the languages they denote using the following inductive definition:

Basis

The constants ε and ∅ are regular expressions that denote the languages { ε } and { }, respectively.
A symbol c in Σ by itself is a regular expression that denotes the language { c }.

Induction: Let E and F be regular expressions denoting the languages L(E) and L(F), respectively.

E + F is a regular expression that denotes L(E) ∪ L(F).
EF is a regular expression that denotes L(E)L(F), the concatenation of L(E) and L(F).
E* is a regular expression that denotes (L(E))*.
(E) is a regular expression that denotes L(E).

Precedence and associativity of the regular-expression operators

The regular-expression operator star has the highest precedence and is left associative.
The regular-expression operator concatenation has the next highest precedence and is left associative.
The regular-expression operator + has the lowest precedence and is left associative.
Thus the regular expression a + b*c would be grouped a + ((b*)c).

If a regular expression E denotes a language L and a string w is in L, we will often say that E matches w.

Examples of Kleene regular expressions and the languages they denote

0*10* denotes the set of all strings of 0's and 1's containing a single 1.
(0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at least one 1.
1*(01*01*)* denotes the set of all strings of 0's and 1's containing an even number of 0's.
(a+b)*(abba+baab)(a+b)* denotes the set of all strings of a's and b's containing the substring abba or the substring baab (or both).

4. POSIX Regular Expressions

The IEEE standards group POSIX added a number of additional operators to Kleene regular expressions to make it easier to specify languages. It also tried to standardize the different regular-expression conventions used by various Unix utilities.
Here is a list of some useful POSIX regular-expression operators and the strings they match.

Some POSIX regular expression operators

POSIX uses ? to mean "zero or one instance of".

The regular expression a?b?c? denotes the language {ε, a, b, c, ab, ac, bc, abc}. Thus a?b?c? matches any of the eight strings in this language.

. matches any character except a newline.
^ matches the empty string at the beginning of a line.
$ matches the empty string at the end of a line.
[abc] matches an a, b, or c.
[a-z] matches any lowercase letter from a to z.
[A-Za-z0-9] matches any alphanumeric character.
[^abc] matches any character except an a, b, or c.
[^0-9] matches any nonnumeric character.
a* matches any string of zero or more a's (including the empty string).
a? matches any string of zero or one a's (including the empty string).
a{2,5} matches any string consisting of two to five a's.
(a) matches an a.
Note that in POSIX regular expressions the operator | (rather than +) is used to denote union. In POSIX regular expressions + is a unary postfix operator that matches one or more instances of its operand. For example, [abc]+ matches any nonempty string of a's, b's, and c's.
\ is a metacharacter that turns off any special meaning of the following character. For example, d\*g matches the string d*g. Another example, \\ matches the string consisting of the single character \.

Examples of POSIX regular expressions and the strings they match

The Unix command egrep 'regexp' file prints all lines in file that contain a substring matched by the POSIX regular expression regexp. Examples:

The command egrep 'dog' file would print all lines in file containing the substring dog.
The command egrep '^a?b?c?d?e?$' file would print all lines in file consisting of an optional a followed by an optional b followed by an optional c followed by an optional d followed by an optional e. The metacharacters ^ and $ match the empty strings at the beginning and end of a line, respectively.

We can extend this example to find all words in a wordlist in which the letters are in strictly increasing alphabetic order. (aegilops is the longest English word whose letters are in strictly increasing alphabetic order.)

When we say "regular expression" we mean a Kleene regular expression.

5. A detailed proof that a DFA defines a given language

To prove that a DFA D defines a given language L we need to show that every string in L(D) is in L and that every string in L is in L(D). Here is a detailed example of how this can be done using induction for both parts.
Consider the DFA D = ({A, B}, {0, 1}, δ, A, {A}) in Example 1 from Lecture 2 with start state A, set of final states {A}, and the following transition function δ:

`State`	`Input Symbol`
`State`	`0`	`1`
`A`	`B`	`A`
`B`	`A`	`B`

Let L be the language consisting of all strings of 0’s and 1’s with an even number of zeros. We shall prove that L(D) = L.

To prove L(D) is contained in L, we shall prove the following inductive hypothesis on n, the number of transitions made by D processing a string w, for n ≥ 0:

Inductive Hypothesis IH1:

(a) If δⁿ(A, w) = A, then w has an even number of 0’s. Here, δⁿ means n transitions by D.

(b) If δⁿ(A, w) = B, then w has an odd number of 0’s.

Basis. n = 0.

Part (a). When n = 0, w must be the empty string so by definition δ⁰(A, ε) = A. Since the empty string has an even number of 0’s, part (a) is true.
Part (b) is trivially true since δ⁰(A, ε) = B is false.

Induction. Assume the inductive hypothesis IH1 is true for n transitions and suppose D makes n+1 transitions on a string w. There are four cases to consider depending on whether w ends with a 0 or a 1 and whether D was in state A or state B before processing the last symbol of w.

Suppose w = x0 for some x and δⁿ(A, x) = A. From the inductive hypothesis we know that x has an even number of 0’s so the string x0 has an odd number of 0's. In this case D makes the sequence of transitions δⁿ⁺¹(A, x0) = δ(A, 0) = B, satisfying part (b) of the inductive hypothesis.
Suppose w = x0 and δⁿ(A, x) = B. From the inductive hypothesis we know that x has an odd number of 0’s so x0 has an even number of 0's. In this case D makes the transitions δⁿ⁺¹(A, x0) = δ(B, 0) = A, satisfying part (a) of the inductive hypothesis.
Suppose w = x1 and δⁿ(A, x) = A. From the inductive hypothesis we know that x has an even number of 0’s so x1 still has an even number of 0's. In this case D makes the transitions δⁿ⁺¹(A, x1) = δ(A, 1) = A, satisfying part (a) of the inductive hypothesis.
Suppose w = x1 and δⁿ(A, x) = B. From the inductive hypothesis we know that x has an odd number of 0’s so x1 still has an odd number of 0's. In this case D makes the transitions δⁿ⁺¹(A, x1) = δ(B, 1) = B, satisfying part (b) of the inductive hypothesis.

We have now shown that IH1 is true for all sequences of n moves, where n ≥ 0. Part (a) of IH1 implies that if D accepts a string w, then w must have an even number of 0's. This proves that L(D) is contained in L

To prove L is contained in L(D), we need to show that every string in L is accepted by D. To do this, we shall prove the following inductive hypothesis by induction on n, the length of an input string w, for n ≥ 0:

Inductive Hypothesis IH2:

(a) If w has an even number of 0’s, then δ*(A,w) = A. Here, δ* means zero or more moves by D. (δ* is the same as delta-hat in the text.)

(b) If w has an odd number of 0’s, then δ*(A,w) = B.

Basis. n = 0; that is, w = ε.

The empty string has an even number of 0's. D accepts the empty string since the start state is a final state.

Induction. Assume the inductive hypothesis IH2 is true for all strings w of length 0, 1, 2, . . . , n and consider a string w of length n+1. Here there are four cases to consider depending on whether w ends with a 0 or a 1 and whether the string consisting of the first n symbols of w has an even or odd number of 0's.

Suppose w = x0 and x has an even number of 0’s. From the inductive hypothesis IH2, δ*(A,x) = A. Since since x0 has an odd number of 0's and D makes the sequence of moves δ*(A,x0) = δ(A, 0) = B, this case satisfies part (b) of IH2.
Suppose w = x0 and x has an odd number of 0’s. From IH2, δ*(A,x) = B. Since since x0 has an even number of 0's and D makes the sequence of moves δ*(A,x0) = δ(B, 0) = A, this case satisfies part (a) of IH2.
Suppose w = x1 and x has an even number of 0’s. From IH2, δ*(A,x) = A. Since since x1 has an even number of 0's and D makes the sequence of moves δ*(A,x1) = δ(A, 1) = A, this case satisfies part (a) of IH2.
Suppose w = x1 and x has an odd number of 0’s. From IH2, δ*(A,x) = B. Since since x1 has an odd number of 0's and D makes the sequence of moves δ*(A,x0) = δ(B, 1) = B, this case satisfies part (b) of IH2.

We have now shown that IH2 is true for all strings of length n, n ≥ 0.

From the two inductive hypotheses, we can conclude that w is accepted by D iff w is in L. In other words, L(D) = L.

6. Practice Problems

Construct an epsilon-NFA that accepts all strings of a's and b's ending in abb or in baa.

Show all sequences of moves that your NFA can make on the input string ababb.
Use the subset construction to convert your epsilon-NFA into an equivalent DFA.

Do the two regular expressions (a+b)* and (a*b*)* denote the same language?
Write a Kleene regular expression for all strings of 0's and 1's with an even number of 0's and an even number of 1's.
Let L be the language { abxba | x is any string of a's, b's, and c's not containing ba }. This language models comments in the programming language C.

Construct a regular expression for L.
Explain how your regular expression defines the string abcbaba.

Write a Kleene regular expression for all strings of a's and b's that begin and end with an a.
Write a POSIX regular expression that matches all English words ending with the substring dous.
Write a POSIX regular expression that matches all English words with the five vowels a, e, i, o, u in order. All five vowels have to appear in order but they do not have to be next to one another. (Example: facetious.) Note that vowels can be repeated. (Example: sacrilegious.)

7. References

aho@cs.columbia.edu

verma@cs.columbia.edu

COMS W3261 CS Theory Lecture 3: Regular Expressions