COMS W4115
Programming Languages and Translators
Lecture 12: The Compiler Front End
March 2, 2015
Lecture Outline
- The compiler front end
- Producing abstract syntax trees
- Semantic analysis
- Intermediate representations
- Directed acyclic graphs (DAGs)
1. The Compiler Front End
- The primary tasks of the front end of a compiler are
- lexical analysis in which the source program is mapped into a stream of tokens
- parsing in which the token stream from the
lexer is mapped into some kind of syntax tree to make
sure the source program conforms to the grammatical rules of the source language
- semantic analysis in which the syntax tree is checked to make sure the source
program has followed the semantic rules of the source language
- intermediate code generation in which some form of an intermediate representation
for the source program is produced from the syntax tree
- Other tasks of the front end include
- the collection of information into a symbol table about the types and other
properties of the syntactic constructs in the source program
- the detection and reporting of lexical, syntactic, and semantic
errors found in the source program
- the removal of comments and extraneous characters, such as whitespace, from the
source program
2. Producing Abstract Syntax Trees
- In many compilers the parser produces an abstract syntax tree (AST) as output.
An AST represents the hierarchical structure of the source program in which
each node corresponds to a programming language construct such as an operator
and the children of the node represent its meaningful components such as its operands.
- An AST can be easily constructed using syntax-directed translation
with synthesized attributes during bottom-up parsing.
Here is an S-attributed SDD based on an SLR(1) grammar that translates
arithmetic expressions into ASTs.
E → E1 + T { E.node = new Node('+', E1.node, T.node) }
E → T { E.node = T.node }
T → ( E ) { T.node = E.node }
T → id { T.node = new Leaf(id, id.value) }
The nonterminal E
has the synthesized attributed E.node
and
T
the synthesized attribute T.node
.
E.node
and T.node
are pointers to nodes in the AST
being created.
The function Node(op, left, right)
returns a pointer to the node it creates which has three fields:
the first contains the operator op
, the second a pointer
to a subtree representing the left operand of the operator,
and the third a pointer to a subtree representing the right operand.
The function Leaf(id, value)
returns a pointer to the node
it creates with two fields: the first labeled id
, the second containing the
value of the token that is in id.value
. See Example 5.11 in ALSU.
An AST can also be constructed using syntax-directed translation
with synthesized and inherited attributes during top-down parsing.
Here is an L-attributed SDD based on an LL(1) grammar that translates
arithmetic expressions into ASTs. This SDD uses inherited attributes
to transfer information from the left side of the syntax tree
to the right side.
E → T A { E.node = A.s
A.i = T.node }
A → + T A1 { A1.i = new Node('+', A.i, T.node)
A.s = A1.s }
A → ε { A.s = A.i; }
T → ( E ) { T.node = E.node }
T → id { T.node = new Leaf(id, id.value) }
In the first semantic rule E
uses a synthesized attribute E.node
to point to the root of the constructed AST
which has been passed up the right edge of the syntax tree using the synthesized attribute
A.s
.
The nonterminal A
also has an inherited attribute A.i
to which is passed the value of the synthesized attribute T.node
from the left side of the syntax tree.
In the second semantic rule, the nonterminal A on the right side of the production
has an inherited attribute whose value becomes the pointer to the node
created by calling the function Node('+', A.i, T.node)
.
Note that the left operand of '+' is the inherited attribute
A.i
which is associated with the nonterminal A
on the
the left side of the production. The value of A.i
is a pointer
to the node that is the left argument of the '+' operator.
For more details see Example 5.12 in ALSU.
3. Semantic Analysis
- After parsing, the front end tries to make sure the source program has followed
the semantic rules of the source language.
During semantic analysis additional information may be added to the symbol table
or the nodes of the syntax tree.
- The exact semantic checks that can
be performed by the front end are language dependent and for some languages
many semantic checks can only be done at run time.
- Here are some semantic checks performed by a static checker for the ANSI C language.
These checks can be carried out by traversing the syntax tree or some other
intermediate representation for the source program.
- ensuring every identifier has been declared before it is used
- making sure an operator is applied to type-compatible operands
- checking that functions are called with the correct number and types of arguments
- checking that a break-statement is enclosed within a while-, for-,
do-, or switch-statement
- Some semantic rules can only be enforced at run time. For example,
- checking that a variable has been given a value before it is used
- checking that a pointer refers to a valid object before it is dereferenced
- checking that the value of an array subscript lies within the bounds of the array
- If the front end cannot enforce a rule statically, it may generate code to
perform the appropriate checks at run time. However, a compiler for a given language
may not enforce all semantic rules for that language because it is too expensive
to do so or it is not possible to do so.
4. Intermediate Representations
- In the process of translating a source program into a target program,
a compiler will usually construct a sequence of intermediate representations
of the program.
- An intermediate representation is said to be high level if it is close to the
source language, low level if it is close to the target language.
- In compiler collections like gcc or LLVM, a front end for a particular
source language produces as output a common intermediate representation (RTL for gcc, LLVM IR for LLVM)
allowing multiple front ends to be combined with multiple back ends.
A back end for a particular target language takes as input
the common intermediate representation and maps it into code for that target language.
This way m*n compilers can be created just by implementing m
front ends and n back ends and combining a particular front end
with a particular back end.
- Common intermediate representations used in compilers include
- Graphical forms such as syntax trees and directed acyclic graphs
- Linear forms such as three-address code,
static single-assignment form and stack-machine code
- Hybrid representations that combine graphical and linear forms
5. Directed acyclic graphs
- A directed acyclic graph (DAG) is a directed graph that contains no cycles.
- DAGs are a useful space-saving intermediate representation for expressions
because they coalesce common subexpressions into a single subtree.
- Consider the expression
a+a*(b-c)+(b-c)*d
. The variable a
and the expression (b-c)
are common subexpressions because they
are each used twice as operands. The DAG for this expression is shown in
Fig. 6.3, ALSU, p. 359.
- Constructing a DAG
- The S-attributed SDD in Section 2 above can be easily adapted to construct
a DAG instead of an AST. All we need to do is modify the functions
Node
and Leaf
to first check if an identical
node already exists. If it does, the function returns a pointer to the
previously constructed node. If not, the function constructs a new node
and returns a pointer to the newly constructed node.
- Value-number method for constructing a DAG
- We can represent the nodes of a syntax tree or DAG as an array of records
consisting of three fields. See Fig. 6.6, ALSU, p. 361.
Each row of the array represents one record and hence one node.
In this array, the integer index of the record is often called the
value number for the node.
- To implement the function
Node('+', left, right)
in the SDD,
we first search the array to see if there is a record in which the first field
contains '+', the second field contains left
, and the third
field contains right
. If so, we return the label number (index)
of that node. If not, we create a new record with these three fields
and return the index of the newly created record.
- To make this algorithm more efficient we can use a hash table for the
nodes of a DAG. This allows each call of
Node
to take
constant expected time.
- For details, see Section 6.1.2 of ALSU (pp. 360-362).
6. Practice Problems
- Construct a DAG for the expression
((x + y) - ((x + y) * (x - y))) + ((x + y) * (x - y))
- Translate the following assignments into (a) syntax trees, (b) quadruples, (c) triples,
(d) three-address code:
x = a + -(b+c)
x[i] = y[i] + z[i]
x = f(y+1) + 2
7. Reading
aho@cs.columbia.edu