COMS W4115
Programming Languages and Translators
Lecture 6: Context-Free Grammars
February 9, 2015

Language whitepapers due in your team folder on Courseworks/COMSW4115/Files&Resources by Feb. 25, 2015.

Lecture Outline

Context-free grammars
Derivations and parse trees
Ambiguity
Examples of context-free grammars
Yacc: a language for specifying syntax-directed translators

1. Context-Free Grammars (CFG's)

CFG's are very useful for representing the syntactic structure of programming languages.
A CFG is sometimes called Backus-Naur Form (BNF).
A context-free grammar consists of

A finite set of terminal symbols,
A finite nonempty set of nonterminal symbols,
One distinguished nonterminal called the start symbol, and
A finite set of rewrite rules, called productions, each of the form A → α where A is a nonterminal and α is a string (possibly empty) of terminals and nonterminals.

Consider the context-free grammar G with the productions


      E → E + T | T
      T → T * F | F
      F → ( E ) | id

The terminal symbols are the alphabet from which strings are formed. In this grammar the set of terminal symbols is { id, +, *, (, ) }. The terminal symbols are the token names.
The nonterminal symbols are syntactic variables that denote sets of strings of terminal symbols. In this grammar the set of nonterminal symbols is { E, T, F}.
The start symbol is E.

2. Derivations and Parse Trees

L(G), the language generated by a grammar G, consists of all strings of terminal symbols that can be derived from the start symbol of G.
A leftmost derivation expands the leftmost nonterminal in each sentential form:


      E ⇒ E + T
        ⇒ T + T
        ⇒ F + T
        ⇒ id + T
        ⇒ id + T * F
        ⇒ id + F * F
        ⇒ id + id * F
        ⇒ id + id * id

A rightmost derivation expands the rightmost nonterminal in each sentential form:


      E ⇒ E + T
        ⇒ E + T * F
        ⇒ E + T * id
        ⇒ E + F * id
        ⇒ E + id * id
        ⇒ T + id * id
        ⇒ F + id * id
        ⇒ id + id * id

Note that these two derivations have the same parse tree.

3. Ambiguity

Consider the context-free grammar G with the productions


      E → E + E | E * E | ( E ) | id

This grammar has the following leftmost derivation for id + id * id


      E ⇒ E + E
        ⇒ id + E
        ⇒ id + E * E
        ⇒ id + id * E
        ⇒ id + id * id

This grammar also has the following leftmost derivation for id + id * id


      E ⇒ E * E
        ⇒ E + E * E
        ⇒ id +  E * E
        ⇒ id + id * E
        ⇒ id + id * id

These derivations have different parse trees.
A grammar is ambiguous if there is a sentence with two or more parse trees.
The problem is that the grammar above does not specify

the precedence of the + and * operators, or
the associativity of the + and * operators

However, the grammar in section (1) generates the same language and is unambiguous because it makes * of higher precedence than +, and makes both operators left associative.
A context-free language is inherently ambiguous if it cannot be generated by any unambiguous context-free grammar.
The context-free language { a^mb^maⁿbⁿ | m > 0 and n > 0} ∪ { a^mbⁿaⁿb^m | m > 0 and n > 0} is inherently ambiguous.
Most (all?) natural languages are inherently ambiguous but no programming languages are inherently ambiguous.
Unfortunately, there is no algorithm to determine whether a CFG is ambiguous; that is, the problem of determining whether a CFG is ambiguous is undecidable.
We can, however, give some practically useful sufficient conditions to guarantee that a CFG is unambiguous.

4. Examples of Context-Free Grammars

Nonempty palindromes of a's and b's. (A palindrome is a string that reads the same forwards as backwards; e.g., abba.)

CFG: S → a S a | b S b | a a | b b | a | b

Note that the language generated by this grammar is not regular. Can you prove this using the pumping lemma for regular languages?

Strings with an equal number of a's and b's:

CFG: S → a S b | b S a | S S | ε

Note that this grammar is ambiguous. Can you find an equivalent unambiguous grammar?

If- and if-else statements:


      stmt → if ( expr ) stmt else stmt
           | if (expr) stmt
           | other

Note that this grammar is ambiguous.

Some typical programming language constructs:


      stmt → expr ;
           | if (expr) stmt
           | for ( optexpr; optexpr; optexpr;) stmt
           | other
      optexpr → ε
           | expr

5. Yacc: a Language for Specifying Syntax-Directed Translators

Yacc is popular domain-specific language, created by Steve Johnson of Bell Labs, for specifying and implementing syntax-directed translators based on context-free grammars.
Bison is a gnu version of Yacc, upwards compatible with the original Yacc, written by Charles Donnelly and Richard Stallman. Many other versions of Yacc are also available.
The original Yacc used C for semantic actions. Yacc has been rewritten for many other languages including Java (BYACC/J, CUP, etc.), OCaml (ocamlyacc), Python (PLY [Python Lex-Yacc]), and Haskell (Happy).
Yacc specifications

A Yacc program has three parts:

      declarations
      %%
      translation rules
      %%
      supporting C-routines

The declarations part may be empty and the last part (%% followed by the supporting C-routines) may also be omitted.

Each translation rule consists of a context-free grammar production followed by a semantic action consisting of statement or block of C code.

Here is a Yacc program for a simple desk calculator that adds and multiplies integers:


      %{
      #include <stdio.h>
      int yylex(void);
      void yyerror (char *s);
      %}

      %token INTEGER

      %left '+'
      %left '*'

      %%
      lines : lines expr '\n'   { printf("%d\n", $2); }
            | lines '\n'
            | /* empty */
            ;

      expr  : expr '+' expr     { $$ = $1 + $3; }
            | expr '*' expr     { $$ = $1 * $3; }
            | '(' expr ')'      { $$ = $2; }
            | INTEGER		{ $$ = $1; }
            ;

      %%
      void yyerror(char * s) {
        fprintf(stderr, "%s\n", s);
      }

      int main(void) {
        yyparse();
        return 0;
      }

The declarations

%left '+'

%left '*'

make the operator + left associative and of lower precedence than the left-associative operator *.

The leftmost nonterminal (here lines) in the production of the first translation rule is the start symbol of the grammar.

Lex and Yacc have been designed to work together. We can use the following Lex program to create a lexical analyzer for our desk calculator:


      %{
      #include <stdio.h>
      void yyerror(char *s);
      #include "y.tab.h";
      %}

      [0-9]+		{ yylval = atoi(yytext);
			  return INTEGER; }
      [()+*\n]		return *yytext;
      [ \t]		; /* skip whitespace */
      .			yyerror("invalid character");
      %%
      int yywrap(void) {
        return 1;
      }

Let us put the Yacc program into a file say desk.y and the Lex program into a file say desk.l. Then in a Unix-based environment we can create the desk calculator as follows.

Invoke the command yacc -d desk.y to create the yacc output file y.tab.c. (The -d flag causes yacc to generate definitions for tokens and put them in a file y.tab.h. The parser is in a function yyparse included in y.tab.c.)
Invoke the command lex desk.l to create the yacc output file lex.yy.c. (Lex includes the file of tokens y.tab.h. The lexer is in a function yylex included in lex.yy.c.)
Invoke the command gcc lex.yy.c y.tab.c -ly to create the desk calculator in the executable output file a.out.
Tom Niemann's Lex & Yacc tutorial uses these kinds of examples to clearly explain how to construct translators using Lex and Yacc.

6. Practice Problems

Let G be the grammar S → a S b S | b S a S | ε.
1. What language is generated by this grammar?
2. Draw all parse trees for the sentence abab.
3. Is this grammar ambiguous?
Let G be the grammar S → a S b | ε. Prove that L(G) = { aⁿbⁿ | n ≥ 0 }.
Consider a sentence of the form id + id + ... + id where there are n plus signs. Let G be the grammar in section (3) above. How many parse trees are there in G for this sentence when n equals

Write down a CFG for regular expressions over the alphabet {a, b}. Show a parse tree for the regular expression a | b*a.

7. Reading

ALSU Sects. 4.1-4.2, 4.9
Lex & Yacc tutorial by Tom Niemann

aho@cs.columbia.edu

COMS W4115 Programming Languages and Translators Lecture 6: Context-Free Grammars February 9, 2015