COMS W4115
Programming Languages and Translators
Lecture 12: The Compiler Front End
March 2, 2015

Lecture Outline

The compiler front end
Producing abstract syntax trees
Semantic analysis
Intermediate representations
Directed acyclic graphs (DAGs)

1. The Compiler Front End

The primary tasks of the front end of a compiler are

lexical analysis in which the source program is mapped into a stream of tokens
parsing in which the token stream from the lexer is mapped into some kind of syntax tree to make sure the source program conforms to the grammatical rules of the source language
semantic analysis in which the syntax tree is checked to make sure the source program has followed the semantic rules of the source language
intermediate code generation in which some form of an intermediate representation for the source program is produced from the syntax tree

Other tasks of the front end include

the collection of information into a symbol table about the types and other properties of the syntactic constructs in the source program
the detection and reporting of lexical, syntactic, and semantic errors found in the source program
the removal of comments and extraneous characters, such as whitespace, from the source program

2. Producing Abstract Syntax Trees

In many compilers the parser produces an abstract syntax tree (AST) as output. An AST represents the hierarchical structure of the source program in which each node corresponds to a programming language construct such as an operator and the children of the node represent its meaningful components such as its operands.

An AST can be easily constructed using syntax-directed translation with synthesized attributes during bottom-up parsing. Here is an S-attributed SDD based on an SLR(1) grammar that translates arithmetic expressions into ASTs.


   E → E₁ + T     { E.node = new Node('+', E₁.node, T.node) }

   E → T          { E.node = T.node }

   T → ( E )      { T.node = E.node }

   T → id         { T.node = new Leaf(id, id.value) }

The nonterminal E has the synthesized attributed E.node and T the synthesized attribute T.node. E.node and T.node are pointers to nodes in the AST being created. The function Node(op, left, right) returns a pointer to the node it creates which has three fields: the first contains the operator op, the second a pointer to a subtree representing the left operand of the operator, and the third a pointer to a subtree representing the right operand. The function Leaf(id, value) returns a pointer to the node it creates with two fields: the first labeled id, the second containing the value of the token that is in id.value. See Example 5.11 in ALSU.

An AST can also be constructed using syntax-directed translation with synthesized and inherited attributes during top-down parsing. Here is an L-attributed SDD based on an LL(1) grammar that translates arithmetic expressions into ASTs. This SDD uses inherited attributes to transfer information from the left side of the syntax tree to the right side.


   E → T A        { E.node = A.s
                    A.i = T.node }

   A → + T A₁     { A₁.i = new Node('+', A.i, T.node)
                    A.s = A₁.s }

   A → ε          { A.s = A.i; }

   T → ( E )      { T.node = E.node }

   T → id         { T.node = new Leaf(id, id.value) }

In the first semantic rule E uses a synthesized attribute E.node to point to the root of the constructed AST which has been passed up the right edge of the syntax tree using the synthesized attribute A.s. The nonterminal A also has an inherited attribute A.i to which is passed the value of the synthesized attribute T.node from the left side of the syntax tree.

In the second semantic rule, the nonterminal A on the right side of the production has an inherited attribute whose value becomes the pointer to the node created by calling the function Node('+', A.i, T.node). Note that the left operand of '+' is the inherited attribute A.i which is associated with the nonterminal A on the the left side of the production. The value of A.i is a pointer to the node that is the left argument of the '+' operator.

For more details see Example 5.12 in ALSU.

3. Semantic Analysis

After parsing, the front end tries to make sure the source program has followed the semantic rules of the source language. During semantic analysis additional information may be added to the symbol table or the nodes of the syntax tree.
The exact semantic checks that can be performed by the front end are language dependent and for some languages many semantic checks can only be done at run time.
Here are some semantic checks performed by a static checker for the ANSI C language. These checks can be carried out by traversing the syntax tree or some other intermediate representation for the source program.

ensuring every identifier has been declared before it is used
making sure an operator is applied to type-compatible operands
checking that functions are called with the correct number and types of arguments
checking that a break-statement is enclosed within a while-, for-, do-, or switch-statement

Some semantic rules can only be enforced at run time. For example,

checking that a variable has been given a value before it is used
checking that a pointer refers to a valid object before it is dereferenced
checking that the value of an array subscript lies within the bounds of the array

If the front end cannot enforce a rule statically, it may generate code to perform the appropriate checks at run time. However, a compiler for a given language may not enforce all semantic rules for that language because it is too expensive to do so or it is not possible to do so.

4. Intermediate Representations

In the process of translating a source program into a target program, a compiler will usually construct a sequence of intermediate representations of the program.
An intermediate representation is said to be high level if it is close to the source language, low level if it is close to the target language.
In compiler collections like gcc or LLVM, a front end for a particular source language produces as output a common intermediate representation (RTL for gcc, LLVM IR for LLVM) allowing multiple front ends to be combined with multiple back ends. A back end for a particular target language takes as input the common intermediate representation and maps it into code for that target language. This way m*n compilers can be created just by implementing m front ends and n back ends and combining a particular front end with a particular back end.
Common intermediate representations used in compilers include

Graphical forms such as syntax trees and directed acyclic graphs
Linear forms such as three-address code, static single-assignment form and stack-machine code
Hybrid representations that combine graphical and linear forms

5. Directed acyclic graphs

A directed acyclic graph (DAG) is a directed graph that contains no cycles.
DAGs are a useful space-saving intermediate representation for expressions because they coalesce common subexpressions into a single subtree.
Consider the expression a+a*(b-c)+(b-c)*d. The variable a and the expression (b-c) are common subexpressions because they are each used twice as operands. The DAG for this expression is shown in Fig. 6.3, ALSU, p. 359.
Constructing a DAG

The S-attributed SDD in Section 2 above can be easily adapted to construct a DAG instead of an AST. All we need to do is modify the functions Node and Leaf to first check if an identical node already exists. If it does, the function returns a pointer to the previously constructed node. If not, the function constructs a new node and returns a pointer to the newly constructed node.
Value-number method for constructing a DAG

We can represent the nodes of a syntax tree or DAG as an array of records consisting of three fields. See Fig. 6.6, ALSU, p. 361. Each row of the array represents one record and hence one node. In this array, the integer index of the record is often called the value number for the node.
To implement the function Node('+', left, right) in the SDD, we first search the array to see if there is a record in which the first field contains '+', the second field contains left, and the third field contains right. If so, we return the label number (index) of that node. If not, we create a new record with these three fields and return the index of the newly created record.
To make this algorithm more efficient we can use a hash table for the nodes of a DAG. This allows each call of Node to take constant expected time.
For details, see Section 6.1.2 of ALSU (pp. 360-362).

6. Practice Problems

Construct a DAG for the expression

((x + y) - ((x + y) * (x - y))) + ((x + y) * (x - y))

Translate the following assignments into (a) syntax trees, (b) quadruples, (c) triples, (d) three-address code:

x = a + -(b+c)
x[i] = y[i] + z[i]
x = f(y+1) + 2

7. Reading

ALSU, Section 6.1

aho@cs.columbia.edu

COMS W4115 Programming Languages and Translators Lecture 12: The Compiler Front End March 2, 2015