COMS W4115
Programming Languages and Translators
Lecture 6: Context-Free Grammars
February 9, 2015
Language whitepapers due in your team folder
on Courseworks/COMSW4115/Files&Resources by Feb. 25, 2015.
Lecture Outline
- Context-free grammars
- Derivations and parse trees
- Ambiguity
- Examples of context-free grammars
- Yacc: a language for specifying syntax-directed translators
1. Context-Free Grammars (CFG's)
- CFG's are very useful for representing the syntactic structure
of programming languages.
- A CFG is sometimes called Backus-Naur Form (BNF).
- A context-free grammar consists of
- A finite set of terminal symbols,
- A finite nonempty set of nonterminal symbols,
- One distinguished nonterminal called the start symbol, and
- A finite set of rewrite rules, called productions, each of the form
A → α
where A is a nonterminal and α is a string (possibly empty)
of terminals and nonterminals.
- Consider the context-free grammar G with the productions
E → E + T | T
T → T * F | F
F → ( E ) | id
- The terminal symbols are the alphabet from which strings are formed.
In this grammar the set of terminal symbols is
{ id, +, *, (, ) }. The terminal symbols are the token names.
- The nonterminal symbols are syntactic variables that denote sets
of strings of terminal symbols. In this grammar the set of nonterminal
symbols is {
E
, T
, F
}.
- The start symbol is
E
.
2. Derivations and Parse Trees
- L(G), the language generated by a grammar G, consists of all strings of
terminal symbols that can be derived from the start symbol of G.
- A leftmost derivation expands the leftmost nonterminal in
each sentential form:
E ⇒ E + T
⇒ T + T
⇒ F + T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
A rightmost derivation expands the rightmost nonterminal in each sentential form:
E ⇒ E + T
⇒ E + T * F
⇒ E + T * id
⇒ E + F * id
⇒ E + id * id
⇒ T + id * id
⇒ F + id * id
⇒ id + id * id
Note that these two derivations have the same parse tree.
3. Ambiguity
- Consider the context-free grammar G with the productions
E → E + E | E * E | ( E ) | id
This grammar has the following leftmost derivation for
id + id * id
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
This grammar also has the following leftmost derivation for
id + id * id
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
These derivations have different parse trees.
A grammar is ambiguous if there is a sentence with two
or more parse trees.
The problem is that the grammar above does not specify
- the precedence of the + and * operators, or
- the associativity of the + and * operators
However, the grammar in section (1) generates the same language
and is unambiguous because
it makes * of higher precedence than +, and makes both operators
left associative.
A context-free language is inherently ambiguous if it
cannot be generated by any unambiguous context-free grammar.
The context-free language
{ ambmanbn
| m > 0 and n > 0} ∪
{ ambnanbm
| m > 0 and n > 0}
is inherently ambiguous.
Most (all?) natural languages are inherently ambiguous but no
programming languages are inherently ambiguous.
Unfortunately, there is no algorithm to determine whether a CFG is ambiguous;
that is, the problem of determining whether a CFG is ambiguous is undecidable.
We can, however, give some practically useful sufficient conditions to guarantee that a CFG
is unambiguous.
4. Examples of Context-Free Grammars
- Nonempty palindromes of
a
's and b
's.
(A palindrome is a string that reads the same forwards as backwards;
e.g., abba
.)
- CFG:
S → a S a | b S b | a a | b b | a | b
- Note that the language generated by this grammar is not regular.
Can you prove this using the pumping lemma for regular languages?
- Strings with an equal number of
a
's and b
's:
- CFG:
S → a S b | b S a | S S | ε
- Note that this grammar is ambiguous.
Can you find an equivalent unambiguous grammar?
- If- and if-else statements:
stmt → if ( expr ) stmt else stmt
| if (expr) stmt
| other
Note that this grammar is ambiguous.
Some typical programming language constructs:
stmt → expr ;
| if (expr) stmt
| for ( optexpr; optexpr; optexpr;) stmt
| other
optexpr → ε
| expr
5. Yacc: a Language for Specifying Syntax-Directed Translators
- Yacc is popular domain-specific language, created by
Steve Johnson of Bell Labs, for specifying and implementing syntax-directed
translators based on context-free grammars.
- Bison is a gnu version of Yacc, upwards compatible with the original Yacc,
written by Charles Donnelly and Richard Stallman.
Many other versions of Yacc are also available.
- The original Yacc used C for semantic actions. Yacc has been rewritten for
many other languages including Java (BYACC/J, CUP, etc.), OCaml (ocamlyacc),
Python (PLY [Python Lex-Yacc]), and Haskell (Happy).
- Yacc specifications
- A Yacc program has three parts:
declarations
%%
translation rules
%%
supporting C-routines
The declarations part may be empty and the last part (%%
followed by the supporting C-routines) may also be omitted.
Each translation rule consists of a context-free grammar production
followed by a semantic action consisting of statement or block
of C code.
Here is a Yacc program for a simple desk calculator
that adds and multiplies integers:
%{
#include <stdio.h>
int yylex(void);
void yyerror (char *s);
%}
%token INTEGER
%left '+'
%left '*'
%%
lines : lines expr '\n' { printf("%d\n", $2); }
| lines '\n'
| /* empty */
;
expr : expr '+' expr { $$ = $1 + $3; }
| expr '*' expr { $$ = $1 * $3; }
| '(' expr ')' { $$ = $2; }
| INTEGER { $$ = $1; }
;
%%
void yyerror(char * s) {
fprintf(stderr, "%s\n", s);
}
int main(void) {
yyparse();
return 0;
}
The declarations
%left '+'
%left '*'
make the operator +
left associative and of lower
precedence than the left-associative operator *
.
The leftmost nonterminal (here lines
) in the production
of the first translation rule is the start symbol of the grammar.
Lex and Yacc have been designed to work together. We can use the following
Lex program to create a lexical analyzer for our desk calculator:
%{
#include <stdio.h>
void yyerror(char *s);
#include "y.tab.h";
%}
[0-9]+ { yylval = atoi(yytext);
return INTEGER; }
[()+*\n] return *yytext;
[ \t] ; /* skip whitespace */
. yyerror("invalid character");
%%
int yywrap(void) {
return 1;
}
Let us put the Yacc program into a file say
desk.y
and the Lex program into a file say
desk.l
.
Then in a Unix-based environment we can create the desk calculator as follows.
- Invoke the command
yacc -d desk.y
to create the yacc output file
y.tab.c
.
(The -d
flag causes yacc
to generate definitions
for tokens and put them in a file y.tab.h
.
The parser is in a function yyparse
included in
y.tab.c
.)
- Invoke the command
lex desk.l
to create the yacc output file
lex.yy.c
.
(Lex includes the file of tokens y.tab.h
. The lexer is in a
function yylex
included in lex.yy.c
.)
- Invoke the command
gcc lex.yy.c y.tab.c -ly
to create the desk calculator in the executable output file a.out
.
- Tom Niemann's Lex & Yacc tutorial uses these kinds of examples to clearly explain
how to construct translators using Lex and Yacc.
6. Practice Problems
- Let G be the grammar
S → a S b S | b S a S | ε.
- What language is generated by this grammar?
- Draw all parse trees for the sentence
abab
.
- Is this grammar ambiguous?
- Let G be the grammar
S → a S b | ε.
Prove that L(G) =
{
a
nb
n | n ≥ 0 }.
- Consider a sentence of the form
id + id + ... + id
where there are
n plus signs. Let G be the grammar in section (3) above.
How many parse trees are there in G for this sentence when n equals
- 1
- 2
- 3
- 4
- m?
- Write down a CFG for regular expressions over the alphabet {
a
, b
}.
Show a parse tree for the regular expression a | b*a
.
7. Reading
aho@cs.columbia.edu