Skip to content

matheustavares/toy-C-compiler

Repository files navigation

A toy C compiler

A toy compiler for a (small) subset of the C11 standard. It generates 64-bit x86 assembly (AT&T syntax), following the System V AMD64 ABI. gcc is called for assembling and linking (so functions from the standard C library are available). This is intended for fun and learning purposes, see limitations here and an overview of the source files here.

Developed following the excellent "Writing a C Compiler" series by Nora Sandler, available at: https://norasandler.com/2017/11/29/Write-a-Compiler.html.

Example

The following C program is an example of what can be currently compiled. The program receives an integer N on stdin and prints the Nth first Fibonacci numbers on stdout.

/* We don't have includes, so we must declare the prototypes. */
int getchar(void);
int putchar(int c);

int fibbonacci(int n)
{
	if (n == 1)
		return 0;
	if (n <= 3)
		return 1;
	/* Very inefficient. Only to demonstrate recursion. */
	return fibbonacci(n-1) + fibbonacci(n-2);
}

/*
 * Get an unsigned int from stdin. Return -1 on failure. The integer must
 * be followed by a newline and cannot be surrounded by spaces.
 *
 * Note that we only support `int` for now, so we obviously cannot use scanf.
 */
int getint(void)
{
	int val = 0, got_input = 0;
	/* We don't have 'EOF' define, so stop on any negative number. */
	for (int c = getchar(); c >= 0; c = getchar()) {
		if (c == 10) // newline
			break;
		if (c < 48 || c > 57) // Error: not a digit
			return -1;
		c -= 48;
		val *= 10;
		val += c;
		got_input = 1;
	}
	return got_input ? val : -1;
}

/* Prints an unsigned integer followed by a newline. */
void putint(int val)
{
	int divisor = 1;
	for (int val_cpy = val; val_cpy / 10; val_cpy /= 10)
		divisor *= 10;

	while (divisor) {
		int digit = val / divisor;
		putchar(digit + 48);
		val -= digit * divisor;
		divisor /= 10;
	}
	putchar(10);
}

/*
 * Receives a positive integer N from stdin and prints the first Nth
 * Fibonacci numbers to stdout.
 */
int main()
{
	int val = getint();
	if (val <= 0)
		return 1;
	for (int i = 1; i <= val; i++) {
		/* Again: inefficient, only for demonstration */
		putint(fibbonacci(i));
	}
	return 0;
}

Clone and build the compiler

$ git clone --recurse-submodules # To clone the tests submodule too
$ make
# Produces the binary `./cc`.
# Note: it requires `gcc` for assembling and linking.

Running

$ ./cc file.c
# Creates the binary file "a.out"
$ ./cc -S file.c
# Creates the assembly file "file.s"

$ ./cc -l file.c
# Outputs the identified tokens, one per line.
$ ./cc -t file.c | dot -Tpng | display
# Outputs the abstract syntax tree in dot format. Use dot and
# ImageMagick (display) to show it.

Check ./cc -h for all available options.

Current features and limitations

Implemented:

  • Types: signed integer only (int)
  • Operators (for int):
    • Arithmetic (+, *, etc.)
    • Relational (>=, <, ==, etc.)
    • Bitwise (<<, &, etc.)
    • Logical (&&, ||, etc.)
    • Compound assignment (+=, *=, etc.)
    • Suffix/postfix increment and decrement (++, --, etc.)
    • Ternary operator
  • Statements:
    • for, while, and do-while loops
    • if and if-else statements
    • int and void types
    • goto and labels
    • Function calls
  • Other features:
    • Inline and block comments
    • Local variables (with scoping rules)
    • Global variables
    • Multiple source files support.

Missing:

  • Storage classifiers and qualifiers (static, extern, volatile, const, etc.)
  • Pointers and arrays (and the * and & operators)
  • Structs and unions (and the . and -> operators)
  • Preprocessing (macros, includes, ifdef, pragmas, etc.)
  • Strings
  • Many other types: char, float, unsigned, etc.
  • Type casting
  • Switch-case
  • CRLF line ending support
  • Variadic functions and parameter list without names (e.g. int func(int))
  • Code optimization
  • Certainly a lot more :)

Overview of the source files

Main source files:

  • cc.c: the CLI option parser and main entry point
  • lexer.c: the tokenizer
  • parser.c: a recursive descent parser. Syntactic errors are detected and printed out at this step, but semantic errors (like function redefinition), are only detected during code generation.
  • x86.c: code generation to x86_64 assembly (AT&T syntax). Also implements some semantic validations.

Auxiliary source files:

  • lib: miscellaneous utilities to handle errors, strings, arrays, hashtables, temporary files, etc.
  • dot-printer.[ch]: prints an AST (abstract syntax tree) in dot format.
  • symtable.[ch]: table of "currently known symbols" (variables and functions) during assembly generation. This is where we check for errors like symbol redefinition and use-before-declaration. This table is duplicated when entering an inner scope (e.g. a code block) to allow for symbol shadowing without affecting the code generation on the upper scope.
  • labelset.[ch]: set of user defined labels (i.e. those used in goto statements) to assist the assembly generation. Like symtable.c, labelset.c checks for redefinition and use-before-declaration errors regarding labels.

Testing

Tests are divided in three parts:

  • Tests of internal lib routines and APIs, available at lib-tests.

  • General end-to-end tests, which are available at the submodule compiler-tests. This is a fork of the test repo provided by Nora. These tests are input based. They check that the compiler fails on invalid input and succeeds on valid input (i.e produces the same exit code and output as gcc).

  • extra-tests: these are additional end-to-end tests that would not fit in the framework of the previous item. They check CLI options, default filenames, x86 conventions, etc.

To run all tests, execute:

$ make tests

Or select only one type:

$ make compiler-tests [STAGES="X Y ..."]
$ make lib-tests
$ make extra-tests

Use STAGES=... to limit the compiler-tests to a desired set of stages. (See available stage names at the compiler-tests directory.) This can also be used with the "tests" make rule.

Resources