COMS W3261 CS Theory
Lecture 7: Context-Free Grammars
1. Definition of Context-Free Grammar
- We now begin the study of context-free languages,
the next class of languages in the Chomsky hierarchy that properly includes
the class of regular languages.
- A context-free language is defined by a context-free grammar, a formalism
that generates the strings in the language defined by the grammar.
- Context-free grammars are a key formalism for describing the syntactic structure
of programming languages. They are also useful in the study of natural languages.
- A context-free grammar (CFG) G has four components
(V, T, P, S) where
- V is a finite set of variables called nonterminals,
or sometimes syntactic categories.
- T is a finite set of symbols called terminals.
- The set of terminals is the alphabet of the language
defined by the grammar.
- P is a finite set of productions, rewrite rules of the form
A → α
where A is a nonterminal called the head of the production and
α is a string (possibly empty)
of nonterminals and terminals called the body of the production.
- S is a nonterminal, called the start symbol.
- Example context-free grammar G1:
- V = {
S
}
- T = { ( , ) }
- P is the set with the two productions
S → S ( S )
S → ε
S
is the start symbol.
- We shall see that G1 generates the language consisting of all strings of balanced parentheses.
2. Derivations Using a Grammar
- A grammar defines a language. One common way is by means of derivations.
- Let αAβ be a string of terminals and nonterminals of
a grammar G and
let A → γ be a production of G. We can use this
production to rewrite A by γ in αAβ to create the string
αγβ. We define a relation ⇒ (read derives) on strings of
terminals and nonterminals and write
αAβ ⇒ αγβ to model this rewriting step.
This rewrite is one step of a derivation.
- A derivation is a sequence of zero or more consecutive rewrite steps.
We will use
⇒* to represent the reflexive and transitive closure of ⇒.
We will also use ⇒n to represent an n-step
derivation, n ≥ 0.
- A string of consisting only of terminals (possibly empty) that can be derived
from the start symbol of G is called a sentence of G.
- A string of terminals and nonterminals that can be derived from
the start symbol of a grammar is called a sentential form.
- L(G), the language defined by G, is the set of all sentences
that can be derived from the start symbol of G.
- Here is an example of a derivation of the string
( )( )
from S
in G1:
S ⇒ S ( S )
⇒ S ( S ) ( S )
⇒ ( S ) ( S )
⇒ ( ) ( S )
⇒ ( ) ( )
This derivation shows that ( )( )
is sentence in the
language defined by G1. In each step of the derivation a nonterminal symbol S
in a sentential form has been replaced by the body of a production
that has S
on the left hand side.
We can write S ⇒* ( ) ( )
and
S ⇒5 ( ) ( )
.
3. Leftmost and Rightmost Derivations
- A derivation in which at each step we replace the leftmost nonterminal
by one of its production bodies is called a leftmost derivation.
- The derivation above is a leftmost derivation of
( )( )
from S
in G1.
- A rightmost derivation is one in which at each step we replace the
rightmost nonterminal by one of its production bodies.
- Here is a rightmost derivation of
( )( )
from S
in G1:
S ⇒ S ( S )
⇒ S ( )
⇒ S ( S ) ( )
⇒ S ( ) ( )
⇒ ( ) ( )
4. Parse Trees
- A derivation can be represented by a parse tree.
- Let G = (V, T, P, S)
be a CFG. A parse tree for G is a tree in which:
- Each interior node is labeled by a nonterminal in V.
- Each leaf is labeled by a nonterminal, or a terminal, or ε.
- If an interior node is labeled by a nonterminal A and its children are
labeled X1, X2, …,
Xk, then
A →
X1X2…Xk
is a production in P.
- The yield of a parse tree is the string obtained by
concatenating the labels of its leaves from left to right.
- A parser for a grammar G is a program that takes as input a string
and produces as output a parse tree for the string or a message
saying that the string cannot be generated by G.
- A parser generator is a program that takes as input a grammar G
and produces as output a parser for G. YACC is a widely used
bottom-up parser generator.
5. Ambiguity
- A grammar G is ambiguous if there is a sentence in L(G)
with two or more distinct parse trees.
- The following grammar G2 for arithmetic expressions is ambiguous
because
a + a * a
has two parse trees.
E → E + E | E * E | ( E ) | a
We can remove the ambiguity by specifying the associativity
and precedence of the +
and *
.
The grammar G3 below is unambiguous and makes *
have higher precedence than +
and makes both
*
and +
left associative.
E → E + T | T
T → T * F | F
F → ( E ) | a
A context-free language is inherently ambiguous if it
cannot be generated by an unambiguous grammar.
6. Practice Problems
- Construct a CFG that generates the language
{
anbn
| n
≥ 0 }.
- Prove that the language generated by the grammar G1 in section 2 consists of all
and only all strings of balanced parentheses.
- Construct a CFG that generates ELP = {
wwR
| w
is any string of a
's and b
's }. This is the
language of even-length palindromes over the alphabet {a
, b
}.
A palindrome is a string that reads the same in both directions.
- Prove that ELP is not a regular language.
- Construct an unambiguous CFG for all regular expressions over the alphabet {a, b}.
7. Reading
aho@cs.columbia.edu
verma@cs.columbia.edu