title | date | headerImg |
---|---|---|
Numbers, Unary Operations, Variables |
2016-09-30 |
adder.jpg |
Our goal is to write a compiler which is a function:
compiler :: SourceProgram -> TargetProgram
In 131 TargetProgram
is going to be a binary executable.
SourceProgram
will be a sequence of four tiny "languages"
- Numbers
- e.g.
7
,12
,42
...
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, ...
- Numbers + Increment + Decrement
- e.g.
add1(7)
,add1(add1(12))
,sub1(add1(42))
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
An input source program is converted to an executable binary in many stages:
- Parsed into a data structure called an Abstract Syntax Tree
- Checked to make sure code is well-formed (and well-typed)
- Simplified into some convenient Intermediate Representation
- Optimized into (equivalent) but faster program
- Generated into assembly
x86
- Linked against a run-time (usually written in C)
Goal: Compile source into executable that, when run, prints the result of evaluating the source.
Approach: Lets figure out how to write
- A compiler from the input string into assembly,
- A run-time that will let us do the printing.
Next, lets see how to do (1) and (2) using our
sequence of adder
languages.
- Numbers
- e.g.
7
,12
,42
...
Lets work backwards and start with the run-time.
Here's what it looks like as a C
program main.c
#include <stdio.h>
extern int our_code() asm("our_code_label");
int main(int argc, char** argv) {
int result = our_code();
printf("%d\n", result);
return 0;
}
main
just callsour_code
and prints its return value,our_code
is (to be) implemented in assembly,- Starting at label
our_code_label
, - With the desired return value stored in register
EAX
- per, the
C
calling convention
- Starting at label
Key idea in SW-Engg:
Decouple systems so you can test one component without (even implementing) another.
Lets test our "run-time" without even building the compiler.
Given a SourceProgram
42
We want to compile the above into an assembly
file forty_two.s
that looks like:
section .text
global our_code_label
our_code_label:
mov eax, 42
ret
For now, lets just
- write that file by hand, and test to ensure
- object-generation and then
- linking works
$ nasm -f aout -o forty_two.o forty_two.s
$ clang -g -m32 -o forty_two.run forty_two.o main.c
On a Mac use -f macho
instead of -f aout
We can now run it:
$ forty_two.run
42
Hooray!
Recall, that compilers were invented to avoid writing assembly by hand
To go from source to assembly, we must do:
Our first step will be to model the problem domain using types.
Lets create types that represent each intermediate value:
Text
for the raw input sourceExpr
for the ASTAsm
for the output x86 assembly
Text
is raw strings, i.e. sequences of characters
texts :: [Text]
texts =
[ "It was a dark and stormy night..."
, "I wanna hold your hand..."
, "12"
]
We convert the Text
into a tree-structure defined by the datatype
data Expr = Number Int
Note: As we add features to our language, we will keep adding cases to Expr
.
Lets also do this gradually as the x86 instruction set is HUGE!
Recall, we need to represent
section .text
global our_code_label
our_code_label:
mov eax, 42
ret
An Asm
program is a list of instructions each of which can:
- Create a
Label
, or - Move a
Arg
into aRegister
Return
back to the run-time.
type Asm = [Instruction]
data Instruction
= ILabel Text
| IMov Arg Arg
| IRet
Where we have
data Register
= EAX
data Arg
= Const Int -- a fixed number
| Reg Register -- a register
Ok, now we just need to write the functions:
parse :: Text -> Expr -- 1. Transform source-string into AST
compile :: Expr -> Asm -- 2. Transform AST into assembly
asm :: Asm -> Text -- 3. Transform assembly into output-string
Pretty straightforward:
parse :: Text -> Expr
parse = parseWith expr
where
expr = integer
compile :: Expr -> Asm
compile (Number n) =
[ IMov (Reg EAX) (Const n)
, IRet
]
asm :: Asm -> Text
asm is = L.intercalate "\n" [instr i | i <- is]
Where instr
is a Text
representation of each Instruction
instr :: Instruction -> Text
instr (IMov a1 a2) = printf "mov %s, %s" (arg a1) (arg a2)
arg :: Arg -> Text
arg (Const n) = printf "%d" n
arg (Reg r) = reg r
reg :: Register -> Text
reg EAX = "eax"
Note that above we have four separate functions that crunch
different types to the Text
representation of x86 assembly:
asm :: Asm -> Text
instr :: Instruction -> Text
arg :: Arg -> Text
reg :: Register -> Text
Remembering names is hard.
We can write an overloaded function, and let the compiler figure out the correct implementation from the type, using Typeclasses.
The following defines an interface for all those types a
that
can be converted to x86 assembly:
class ToX86 a where
asm :: a -> Text
Now, to overload, we say that each of the types
Asm
, Instruction
, Arg
and Register
implements or has an instance of ToX86
instance ToX86 Asm where
asm is = L.intercalate "\n" [asm i | i <- is]
instance ToX86 Instruction where
asm (IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)
instance ToX86 Arg where
asm (Const n) = printf "%d" n
asm (Reg r) = asm r
instance ToX86 Register where
asm EAX = "eax"
Note in each case above, the compiler figures out the correct implementation, from the types...
Well that was easy! Lets beef up the language!
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, ...
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
First, lets look at some examples.
How should we compile?
add1(7)
In English
- Move
7
into theeax
register - Add
1
to the contents ofeax
In ASM
mov eax, 7
add eax, 1
Aha, note that add
is a new kind of Instruction
How should we compile
add1(add1(12))
In English
- Move
12
into theeax
register - Add
1
to the contents ofeax
- Add
1
to the contents ofeax
In ASM
mov eax, 12
add eax, 1
add eax, 1
Note correspondence between sub-expressions of source and assembly
We will write compiler in compositional manner
- Generating
Asm
for each sub-expression (AST subtree) independently, - Generating
Asm
for super-expression, assuming the value of sub-expression is inEAX
Next, lets extend the types to incorporate new language features
Source Expressions
data Expr = ...
| Add1 Expr
Assembly Instructions
data Instruction
= ...
| IAdd Arg Arg
src1 = "add1(7)"
exp1 = Add1 (Number 7)
asm1 = [ IMov (Reg EAX) (Const 7)
, IAdd (Reg EAX) (Const 1)
]
src2 = "add1(add1(12))"
exp2 = Add1 (Add1 (Number 12))
asm2 = [ IMov (Reg EAX) (Const 12)
, IAdd (Reg EAX) (Const 1)
, IAdd (Reg EAX) (Const 1)
]
Now lets go back and suitably extend the transforms:
parse :: Text -> Expr -- 1. Transform source-string into AST
compile :: Expr -> Asm -- 2. Transform AST into assembly
asm :: Asm -> Text -- 3. Transform assembly into output-string
Lets do the easy bits first, namely parse
and asm
parse :: Text -> Expr
parse = parseWith expr
expr :: Parser Expr
expr = try primExpr
<|> integer
primExpr :: Parser Expr
primExpr = Add1 <$> rWord "add1" *> parens expr
To update asm
just need to handle case for IAdd
instance ToX86 Instruction where
asm (IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)
asm (IAdd a1 a2) = printf "add %s, %s" (asm a1) (asm a2)
Note
- GHC will tell you exactly which functions need to be extended (Types, FTW!)
- We will not discuss
parse
andasm
any more...
Finally, the key step is
compile :: Expr -> Asm
compile (Number n)
= [ IMov (Reg EAX) (Const n)
, IRet
]
compile (Add1 e)
= compile e -- EAX holds value of result of `e` ...
++ [ IAdd (Reg EAX) (Const 1) ] -- ... so just increment it.
Lets check that compile behaves as desired:
ghci> (compile (Number 12)
[ IMov (Reg EAX) (Const 12) ]
ghci> compile (Add1 (Number 12))
[ IMov (Reg EAX) (Const 12)
, IAdd (Reg EAX) (Const 1)
]
ghci> compile (Add1 (Add1 (Number 12)))
[ IMov (Reg EAX) (Const 12)
, IAdd (Reg EAX) (Const 1)
, IAdd (Reg EAX) (Const 1)
]
You do it!
- Numbers + Increment + Double
- e.g.
add1(7)
,twice(add1(12))
,twice(twice(add1(42)))
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
Local variables make things more interesting
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
Lets look at some examples
let x = 10
in
x
Need to store 1 variable -- x
let x = 10 -- x = 10
, y = add1(x) -- y = 11
, z = add1(y) -- z = 12
in
add1(z) -- 13
Need to store 3 variable -- x, y, z
let a = 10
, c = let b = add1(a)
in
add1(b)
in
add1(c)
Need to store 3 variables -- a
, b
, c
-- but at most 2 at a time
- First
a, b
, thena, c
- Don't need
b
andc
simultaneously
A single register eax
is useless:
- May need 2 or 3 or 4 or 5 ... values.
There is only a fixed number (say, N
) of registers:
- And our programs may need to store more than
N
values, so - Need to dig for more storage space!
Here's what the memory -- i.e. storage -- looks like:
Lets zoom into the stack region, which when we start looks like this:
The stack grows downward (i.e. to smaller addresses)
We have lots of 4-byte slots on the stack at offsets from the "stack pointer" at addresses:
[EBP - 4 * 1]
,[EBP - 4 * 2]
, ...,
The i
-th stack-variable lives at address [EBP - 4 * i]
Required A mapping
- From source variables (
x
,y
,z
...) - To stack positions (
1
,2
,3
...)
Solution The structure of the let
s is stack-like too...
- Maintain an
Env
that mapsId |-> StackPosition
let x = e1 in e2
addsx |-> i
toEnv
- where
i
is current height of stack.
- where
let x = 1 -- []
in -- [ x |-> 1 ]
x
let x = 1 -- []
, y = add1(x) -- [x |-> 1]
, z = add1(y) -- [y |-> 2, x |-> 1]
in
add1(z) -- [z |- 3, y |-> 2, x |-> 1]
At what position on the stack do we store variable c
?
-- []
let a = 1 -- [a |-> 1]
, c = -- [a |-> 1]
let b = add1(a) -- [b |-> 2, a |-> 1]
in add1(b) -- [a |-> 1]
-- [c |-> 2, a |-> 1]
in
add1(c)
A. 1 B. 2 C. 3 D. 4 E. not on stack!
-- []
let a = 1 -- [a |-> 1]
, b = 2 -- [b |-> 2, a |-> 1]
, c = --
let b = add1(a) -- [b |-> 3, b -> 2, a |-> 1]
in add1(b) --
-- [b |-> 2, a |-> 1]
in
add1(b)
-- ENV(n)
let x = STUFF
-- [x |-> n+1, ENV(n)]
in OTHERSTUFF
-- ENV(n)
At each point, we have env
that maps (previously defined) Id
to StackPosition
Variable Use
To compile x
given env
- Move
[ebp - 4 * i]
intoeax
(where env
maps x |-> i
)
Variable Definition
To compile let x = e1 in e2
we
- Compile
e1
usingenv
(i.e. resulting value will be stored ineax
) - Move
eax
into[ebp - 4 * i]
- Compile
e2
usingenv'
(where env'
be env
with x |-> i
i.e. push x
onto env
at position i
)
Lets see how our strategy works by example:
When we compile
let x = 10
, y = add1(x)
in
add1(y)
The assembly looks like
mov eax, 10 ; LHS of let x = 10
mov [ebp - 4*1], eax ; save x on the stack
mov eax, [ebp - 4*1] ; LHS of , y = add1(x)
add eax, 1 ; ""
???
add eax, 1
What .asm instructions shall we fill in for ???
mov [ebp - 4 * 1], eax ; A
mov eax, [ebp - 4 * 1]
mov [ebp - 4 * 1], eax ; B
mov [ebp - 4 * 2], eax ; C
mov [ebp - 4 * 2], eax ; D
mov eax, [ebp - 4 * 2]
; E (empty! no instructions)
Lets compile
let a = 10
, c = let b = add1(a)
in
add1(b)
in
add1(c)
Lets figure out what the assembly looks like!
mov eax, 10 ; LHS of let a = 10
mov [ebp - 4*1], eax ; save a on the stack
???
Now, we're ready to move to the implementation!
Lets extend the types for Source Expressions
type Id = Text
data Expr = ...
| Let Id Expr Expr -- `let x = e1 in e2` modeled as is `Let x e1 e2`
| Var Id
Lets enrich the Instruction
to include the register-offset [ebp - 4*i]
data Arg = ...
| RegOffset Reg Int -- `[ebp - 4*i]` modeled as `RegOffset EBP i`
Lets create a new Env
type to track stack-positions of variables
data Env = [(Id, Int)]
data Maybe a = Nothing | Just a
lookupEnv :: Env -> Id -> Maybe Int
lookupEnv [] x = Nothing
lookupEnv ((y, n) : rest) x = if x == y
then Just n
else lookupEnv rest x
pushEnv :: Env -> Id -> (Int, Env)
pushEnv env x = (xn , env')
where
env' = (x, xn) : env
xn = 1 + length env
compile env (Let x e1 e2) =
compile env e1
++ -- EAX hold the value of "x"
[IMov (RegOffset EBP xn) EAX ]
++
compile env' e2
where
(xn, env') = pushEnv env x
compile env (Var x) = [IMov EAX (RegOffset EBP xn)]
where
xn = case lookupEnv env x of
Just n -> n
Nothing -> error "variable out of scope"
API:
- Push variable onto
Env
(returning its position), - Lookup variable's position in
Env
push :: Id -> Env -> (Int, Env)
push x env = (i, (x, i) : env)
where
i = 1 + length env
lookup :: Id -> Env -> Maybe Int
lookup x [] = Nothing
lookup x ((y, i) : env)
| x == y = Just i
| otherwise = lookup x env
Ok, now we're almost done. Just add the code formalizing the above strategy
Variable Use
compileEnv env (Var x) = [ IMov (Reg EAX) (RegOffset EBP i) ]
where
i = fromMaybe err (lookup x env)
err = error (printf "Error: Variable '%s' is unbound" x)
Variable Definition
compileEnv env (Let x e1 e2 l) = compileEnv env e1
++ IMov (RegOffset EBP i) (Reg EAX)
: compileEnv env' e2
where
(i, env') = pushEnv x env
Lets take our adder
compiler out for a spin!
SourceProgram
will be a sequence of four tiny "languages"
- Numbers
- e.g.
7
,12
,42
...
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, ...
- Numbers + Increment + Decrement
- e.g.
add1(7)
,add1(add1(12))
,sub1(add1(42))
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
Will iterate on this till we have a pretty kick-ass language.