A toy C compiler

A toy compiler for a (small) subset of the C11 standard. It generates 64-bit x86 assembly (AT&T syntax), following the System V AMD64 ABI. gcc is called for assembling and linking (so functions from the standard C library are available). This is intended for fun and learning purposes, see limitations here and an overview of the source files here.

Developed following the excellent "Writing a C Compiler" series by Nora Sandler, available at: https://norasandler.com/2017/11/29/Write-a-Compiler.html.

Example

The following C program is an example of what can be currently compiled. The program receives an integer N on stdin and prints the Nth first Fibonacci numbers on stdout.

/* We don't have includes, so we must declare the prototypes. */
int getchar(void);
int putchar(int c);

int fibbonacci(int n)
{
	if (n == 1)
		return 0;
	if (n <= 3)
		return 1;
	/* Very inefficient. Only to demonstrate recursion. */
	return fibbonacci(n-1) + fibbonacci(n-2);
}

/*
 * Get an unsigned int from stdin. Return -1 on failure. The integer must
 * be followed by a newline and cannot be surrounded by spaces.
 *
 * Note that we only support `int` for now, so we obviously cannot use scanf.
 */
int getint(void)
{
	int val = 0, got_input = 0;
	/* We don't have 'EOF' define, so stop on any negative number. */
	for (int c = getchar(); c >= 0; c = getchar()) {
		if (c == 10) // newline
			break;
		if (c < 48 || c > 57) // Error: not a digit
			return -1;
		c -= 48;
		val *= 10;
		val += c;
		got_input = 1;
	}
	return got_input ? val : -1;
}

/* Prints an unsigned integer followed by a newline. */
void putint(int val)
{
	int divisor = 1;
	for (int val_cpy = val; val_cpy / 10; val_cpy /= 10)
		divisor *= 10;

	while (divisor) {
		int digit = val / divisor;
		putchar(digit + 48);
		val -= digit * divisor;
		divisor /= 10;
	}
	putchar(10);
}

/*
 * Receives a positive integer N from stdin and prints the first Nth
 * Fibonacci numbers to stdout.
 */
int main()
{
	int val = getint();
	if (val <= 0)
		return 1;
	for (int i = 1; i <= val; i++) {
		/* Again: inefficient, only for demonstration */
		putint(fibbonacci(i));
	}
	return 0;
}

Clone and build the compiler

$ git clone --recurse-submodules # To clone the tests submodule too
$ make
# Produces the binary `./cc`.
# Note: it requires `gcc` for assembling and linking.

Running

$ ./cc file.c
# Creates the binary file "a.out"
$ ./cc -S file.c
# Creates the assembly file "file.s"

$ ./cc -l file.c
# Outputs the identified tokens, one per line.
$ ./cc -t file.c | dot -Tpng | display
# Outputs the abstract syntax tree in dot format. Use dot and
# ImageMagick (display) to show it.

Check ./cc -h for all available options.

Current features and limitations

Implemented:

Types: signed integer only (int)
Operators (for int):
- Arithmetic (+, *, etc.)
- Relational (>=, <, ==, etc.)
- Bitwise (<<, &, etc.)
- Logical (&&, ||, etc.)
- Compound assignment (+=, *=, etc.)
- Suffix/postfix increment and decrement (++, --, etc.)
- Ternary operator
Statements:
- for, while, and do-while loops
- if and if-else statements
- int and void types
- goto and labels
- Function calls
Other features:
- Inline and block comments
- Local variables (with scoping rules)
- Global variables
- Multiple source files support.

Missing:

Overview of the source files

Main source files:

cc.c: the CLI option parser and main entry point
lexer.c: the tokenizer
parser.c: a recursive descent parser. Syntactic errors are detected and printed out at this step, but semantic errors (like function redefinition), are only detected during code generation.
x86.c: code generation to x86_64 assembly (AT&T syntax). Also implements some semantic validations.

Auxiliary source files:

lib: miscellaneous utilities to handle errors, strings, arrays, hashtables, temporary files, etc.
dot-printer.[ch]: prints an AST (abstract syntax tree) in dot format.
symtable.[ch]: table of "currently known symbols" (variables and functions) during assembly generation. This is where we check for errors like symbol redefinition and use-before-declaration. This table is duplicated when entering an inner scope (e.g. a code block) to allow for symbol shadowing without affecting the code generation on the upper scope.
labelset.[ch]: set of user defined labels (i.e. those used in goto statements) to assist the assembly generation. Like symtable.c, labelset.c checks for redefinition and use-before-declaration errors regarding labels.

Testing

Tests are divided in three parts:

Tests of internal lib routines and APIs, available at lib-tests.
General end-to-end tests, which are available at the submodule compiler-tests. This is a fork of the test repo provided by Nora. These tests are input based. They check that the compiler fails on invalid input and succeeds on valid input (i.e produces the same exit code and output as gcc).
extra-tests: these are additional end-to-end tests that would not fit in the framework of the previous item. They check CLI options, default filenames, x86 conventions, etc.

To run all tests, execute:

$ make tests

Or select only one type:

$ make compiler-tests [STAGES="X Y ..."]
$ make lib-tests
$ make extra-tests

Use STAGES=... to limit the compiler-tests to a desired set of stages. (See available stage names at the compiler-tests directory.) This can also be used with the "tests" make rule.

Resources

Nora's tutorial posts on writing a C compiler: https://norasandler.com/2017/11/29/Write-a-Compiler.html
C11 Standard: latest draft. (In particular, Annex A's "Language syntax summary".)
Eli Bendersky's "Parsing expressions by precedence climbing"
Brown University's CS033 "x64 Cheat Sheet"
Yale University's CS421 "x86 Assembly Guide"
cppreference's "operator precedence" page
Wikipedia's "x86-64 calling conventions"

Name		Name	Last commit message	Last commit date
Latest commit History 147 Commits
compiler-tests @ b0f3c48		compiler-tests @ b0f3c48
extra-tests		extra-tests
lib-tests		lib-tests
lib		lib
.gitignore		.gitignore
.gitmodules		.gitmodules
COPYING		COPYING
Makefile		Makefile
README.md		README.md
cc.c		cc.c
dot-printer.c		dot-printer.c
dot-printer.h		dot-printer.h
labelset.c		labelset.c
labelset.h		labelset.h
lexer.c		lexer.c
lexer.h		lexer.h
parser.c		parser.c
parser.h		parser.h
symtable.c		symtable.c
symtable.h		symtable.h
util.h		util.h
x86.c		x86.c
x86.h		x86.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A toy C compiler

Example

Clone and build the compiler

Running

Current features and limitations

Implemented:

Missing:

Overview of the source files

Testing

Resources

About

Languages

License

matheustavares/toy-C-compiler

Folders and files

Latest commit

History

Repository files navigation

A toy C compiler

Example

Clone and build the compiler

Running

Current features and limitations

Implemented:

Missing:

Overview of the source files

Testing

Resources

About

Topics

Resources

License

Stars

Watchers

Forks

Languages