COMS W3261 CS Theory
Lecture 3: Regular Expressions
1. Nondeterministic Finite Automata with Epsilon-Transitions
- An ε-NFA is an
NFA (Q, Σ, δ, q0, F)
whose transition function δ is a mapping from
Q × (Σ ∪ {ε}) to P(Q), the set of subsets of
Q.
- The language of an ε-NFA is the set of all strings that
spell out a path from the start state to a final state.
There can be ε-transitions along this path.
- Epsilon-closures
- We define ECLOSE(q), the ε-closure of a state
q of an ε-NFA,
recursively as follows:
- State q is in ECLOSE(q).
- If state p is in ECLOSE(q), then all states
in δ(p, ε)
are also in ECLOSE(q).
- ECLOSE(q) is the set all states that can be reached from
state q using zero or more ε-transitions.
- We can compute the ε-closure of a set of
nondeterministic states S by taking the
union of the ε-closures of all the individual states in S.
2. Converting an ε-NFA to an Equivalent DFA Using the Subset Construction
- We can eliminate all ε-transitions from an ε-NFA
by converting it into an equivalent DFA using the subset construction.
- Given an ε-NFA E =
(QE, Σ, δE,
qE,
FE),
we construct the DFA D =
(QD, Σ, δD,
qD,
FD)
as follows:
- QD = P(QE).
- δD is computed for all a in Σ
and S in P(QE) as follows:
- Let S = {p1, p2, …,
pk} and let
{r1, r2, …, rm}
be the union of
δE(pi, a) for
i = 1, 2, …, k.
- Then, δD(S, a) =
ECLOSE({r1, r2, …,
rm}).
- qD = ECLOSE(qE).
- FD = { S |
S is in QD and
S contains a state in
FE }.
- As with the subset construction for NFA's, we can prove by induction that
L(D) = L(E).
3. Regular Expressions
- Regular expressions are another very different formalism
for defining the regular languages.
- A regular expression E is an algebraic expression that denotes a language
L(E).
- Programming languages such as awk, java,
javascript, perl, python use regular expressions to match patterns in strings.
- There are differences in the regular expression notations used by various programming languages,
the most common variants being POSIX regular expressions and perl-compatible regular expressions.
- However, virtually all regular-expression notations have the operations
of union, concatenation, and Kleene closure.
We shall call regular expressions with just these three operators
Kleene regular expressions.
Kleene regular expressions
- We can specify Kleene regular expressions over an
alphabet Σ and the languages they denote using the following
inductive definition:
- Basis
- The constants ε and ∅ are regular expressions that denote
the languages { ε } and { }, respectively.
- A symbol c in Σ by itself is a regular expression that denotes the
language { c }.
- Induction: Let E and F be regular expressions
denoting the languages L(E) and L(F),
respectively.
- E + F is
a regular expression that denotes L(E) ∪ L(F).
- EF is
a regular expression that denotes L(E)L(F),
the concatenation of L(E) and L(F).
- E* is
a regular expression that denotes (L(E))*.
- (E) is
a regular expression that denotes L(E).
- Precedence and associativity of the regular-expression operators
- The regular-expression operator star has the highest precedence and is
left associative.
- The regular-expression operator concatenation has the next highest precedence and is
left associative.
- The regular-expression operator + has the lowest precedence and is
left associative.
- Thus the regular expression a + b*c would be grouped
a + ((b*)c).
- If a regular expression E denotes a language L
and a string w is in L, we will often say that
E matches w.
Examples of Kleene regular expressions and the languages they denote
- 0*10* denotes the set of all strings of 0's and 1's containing a single 1.
- (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at
least one 1.
- 1*(01*01*)* denotes the set of all strings of 0's and 1's containing an
even number of 0's.
- (a+b)*(abba+baab)(a+b)*
denotes the set of all strings of a's and b's
containing the substring abba
or the substring baab (or both).
4. POSIX Regular Expressions
- The IEEE standards group POSIX added a number of additional operators to Kleene regular expressions
to make it easier to specify languages. It also tried to standardize the different regular-expression
conventions used by various Unix utilities.
- Here is a list of some useful POSIX regular-expression operators
and the strings they match.
Some POSIX regular expression operators
- POSIX uses
?
to mean "zero or one instance of".
- The regular expression
a?b?c?
denotes the language
{ε, a, b, c, ab, ac, bc, abc
}. Thus a?b?c?
matches any of the eight strings in this language.
.
matches any character except a newline.
^
matches the empty string at the beginning of a line.
$
matches the empty string at the end of a line.
[abc]
matches an a
, b
, or c
.
[a-z]
matches any lowercase letter from a
to z
.
[A-Za-z0-9]
matches any alphanumeric character.
[^abc]
matches any character except an
a
, b
, or c
.
[^0-9]
matches any nonnumeric character.
a*
matches any string of zero or more a
's
(including the empty string).
a?
matches any string of zero or one a
's
(including the empty string).
a{2,5}
matches any string consisting of two to five a
's.
(a)
matches an a
.
- Note that in POSIX regular expressions the operator
|
(rather than +
) is used to denote union.
In POSIX regular expressions +
is a unary postfix operator
that matches one or more instances of its operand.
For example, [abc]+
matches any nonempty string
of a
's, b
's, and c
's.
\
is a metacharacter that turns off any special meaning of
the following character. For example, d\*g
matches the string d*g
.
Another example, \\
matches the string consisting of the
single character \
.
Examples of POSIX regular expressions and the strings they match
- The Unix command
egrep 'regexp' file
prints all lines in
file
that contain a substring matched by the
POSIX regular expression regexp
. Examples:
- The command
egrep 'dog' file
would print all lines in file
containing the substring
dog
.
- The command
egrep '^a?b?c?d?e?$' file
would print all lines in file
consisting of
an optional a
followed by
an optional b
followed by
an optional c
followed by
an optional d
followed by
an optional e
.
The metacharacters ^
and $
match the
empty strings at the beginning and end of a line, respectively.
- We can extend this example to find all words in a
wordlist in which the
letters are in strictly increasing alphabetic order.
(
aegilops
is the longest English word whose letters
are in strictly increasing alphabetic order.)
- When we say "regular expression" we mean a Kleene regular expression.
5. A detailed proof that a DFA defines a given language
- To prove that a DFA D defines a given language
L we need to show that every string in
L(D) is in L and that every string in L is in L(D).
Here is a detailed example of how
this can be done using induction for both parts.
- Consider the DFA D =
({
A, B
}, {0, 1}, δ, A
, {A
})
in Example 1 from Lecture 2 with start state A
,
set of final states {A
}, and the following transition
function δ:
State |
Input Symbol |
0 |
1 |
A |
B |
A |
B |
A |
B |
- Let L be the language consisting of all strings of 0’s and 1’s with an
even number
of zeros. We shall prove that L(D) = L.
- To prove L(D) is contained in L,
we shall prove the following inductive
hypothesis on n, the number of transitions made by D
processing a string w,
for n ≥ 0:
- Inductive Hypothesis IH1:
- (a) If δn(
A
, w) = A
,
then w has an even number of 0’s.
Here, δn means n transitions by D.
- (b) If δn(
A
, w) = B
,
then w has an odd number of 0’s.
- Basis. n = 0.
- Part (a). When n = 0, w must be the empty string
so by definition δ0(
A
, ε) = A
.
Since the empty string has an even number of 0’s, part (a) is true.
- Part (b) is trivially true since
δ0(
A
, ε) = B
is false.
- Induction. Assume the inductive hypothesis IH1 is true for n transitions
and suppose D makes n+1 transitions on a string w.
There are four cases to consider depending on whether w ends with a
0 or a 1 and whether D was in state
A
or state B
before processing the last symbol of w.
- Suppose w = x0 for some x and
δn(
A
, x) = A
.
From the inductive hypothesis we know that x has
an even number of 0’s so the string x0 has an odd number of 0's.
In this case D makes the sequence of transitions
δn+1(A
, x0)
= δ(A
, 0) = B
, satisfying
part (b) of the inductive hypothesis.
- Suppose w = x0 and
δn(
A
, x) = B
.
From the inductive hypothesis we know that x has
an odd number of 0’s so x0 has an even number of 0's.
In this case D makes the transitions
δn+1(A
, x0)
= δ(B
, 0) = A
, satisfying
part (a) of the inductive hypothesis.
- Suppose w = x1 and
δn(
A
, x) = A
.
From the inductive hypothesis we know that x has
an even number of 0’s so x1 still has an even number of 0's.
In this case D makes the transitions
δn+1(A
, x1)
= δ(A
, 1) = A
, satisfying
part (a) of the inductive hypothesis.
- Suppose w = x1 and
δn(
A
, x) = B
.
From the inductive hypothesis we know that x has
an odd number of 0’s so x1 still has an odd number of 0's.
In this case D makes the transitions
δn+1(A
, x1)
= δ(B
, 1) = B
, satisfying
part (b) of the inductive hypothesis.
- We have now shown that IH1 is true for all sequences of n moves, where
n ≥ 0.
Part (a) of IH1 implies that if D accepts a string w,
then w must have an even number of 0's.
This proves that L(D) is contained in L
- To prove L is contained in L(D), we need to show that
every string in L is accepted by D.
To do this, we shall prove the following inductive
hypothesis by induction on n, the length of an input string w,
for n ≥ 0:
- Inductive Hypothesis IH2:
- (a) If w has an even number of 0’s,
then δ*(
A
,w) = A
.
Here, δ* means zero or more moves by D.
(δ* is the same as delta-hat in the text.)
- (b) If w has an odd number of 0’s,
then δ*(
A
,w) = B
.
- Basis. n = 0; that is, w = ε.
- The empty string has an even number of 0's.
D accepts the empty string since
the start state is a final state.
- Induction. Assume the inductive hypothesis
IH2 is true for all strings w of
length 0, 1, 2, . . . , n and consider a string
w of length n+1.
Here there are four cases to consider depending on whether w
ends with a 0 or a 1 and whether the string consisting of the first
n symbols of w has an even or odd number of 0's.
- Suppose w = x0 and x has an even number of 0’s.
From the inductive hypothesis IH2,
δ*(
A
,x) = A
.
Since since x0 has an odd number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(A
, 0) = B
,
this case satisfies part (b) of IH2.
- Suppose w = x0 and x has an odd number of 0’s.
From IH2,
δ*(
A
,x) = B
.
Since since x0 has an even number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(B
, 0) = A
,
this case satisfies part (a) of IH2.
- Suppose w = x1 and x has an even number of 0’s.
From IH2, δ*(
A
,x) = A
.
Since since x1 has an even number of 0's and D makes the
sequence of moves
δ*(A
,x1) =
δ(A
, 1) = A
,
this case satisfies part (a) of IH2.
- Suppose w = x1 and x has an odd number of 0’s.
From IH2, δ*(
A
,x) = B
.
Since since x1 has an odd number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(B
, 1) = B
,
this case satisfies part (b) of IH2.
- We have now shown that IH2 is true for all strings of length n, n ≥ 0.
- From the two inductive hypotheses, we can conclude that w is accepted by D
iff w is in L.
In other words, L(D) = L.
6. Practice Problems
- Construct an epsilon-NFA that accepts all strings of
a
's
and b
's ending in abb
or in baa
.
- Show all sequences of moves that your NFA can make on the input string ababb.
- Use the subset construction to convert your epsilon-NFA into an equivalent DFA.
- Do the two regular expressions
(a+b)* and (a*b*)* denote the same language?
- Write a Kleene regular expression for all strings of 0's and 1's with an
even number of 0's and an even number of 1's.
- Let L be the language
{
abxba
| x
is any string of a
's,
b
's, and c
's not containing ba
}.
This language models comments in the programming language C.
- Construct a regular expression for L.
- Explain how your regular expression defines the string
abcbaba
.
- Write a Kleene regular expression for all strings of a's
and b's that begin
and end with an a.
- Write a POSIX regular expression that matches all English words ending with the substring
dous
.
- Write a POSIX regular expression that matches all English words with the five vowels
a, e, i, o, u in order.
All five vowels have to appear in order but they do not have to be next
to one another. (Example:
facetious
.)
Note that vowels can be repeated. (Example: sacrilegious
.)
7. References
aho@cs.columbia.edu
verma@cs.columbia.edu