Skip to content

Latest commit

 

History

History
361 lines (267 loc) · 24.8 KB

README.md

File metadata and controls

361 lines (267 loc) · 24.8 KB

Build Status codecov.io codecov.io

tipc

A compiler from TIP to llvm bitcode

TIP Language, Interpreter, and Analyzers

TIP is a "Tiny Imperative Programming" language developed by Anders Møller and Michael I. Schwartzbach for the Static Program Analysis lecture notes that they developed for graduate instruction at Aarhus University.

Accompanying those notes is a Scala implementation that provides a number of static analysis implementations and interpreter-based evaluators.

This project implements tipc which compiles TIP programs into LLVM bitcode. Linking that bitcode with the runtime library routines and standard libraries will produce an executable.

Dependencies

tipc is implemented in C++17 and depends on a number of tools and packages, e.g., ANTLR4, Catch2, CMake, Doxygen, loguru, Java, LLVM. To simplify dependency management the project provides a bootstrap script to install all of the required dependencies on linux ubuntu and mac platforms; if you are using portal.cs.virginia.edu to build then you can replace this script with running module load <pathto>/tipc/conf/modulefiles/tipc/F24, where <pathto> is the path to where you have installed tipc.

Building tipc

The project uses GitHub Actions for building and testing and CodeCov for reporting code and documentation coverage. The build-and-test.yml file provides details of this process. If you would prefer to build and test manually then read on.

After cloning this repository you can build the compiler by moving to into the top-level directory and issuing these commands:

  1. ./bin/bootstrap.sh
  2. . ~/.bashrc
  3. mkdir build
  4. cd build
  5. cmake ..
  6. make

The build process will download an up to date version of ANTLR4 if needed, build the C++ target for ANTLR4, and then build all of tipc including its substantial body of unit tests. This may take some time - to speed it up use multiple threads in the make command, e.g., make -j6.

You may see some warnings, e.g., CMake policy warnings, due to some of the packages we use in the project. As those projects are updated, to avoid CMake feature deprecation, these will go away.

When finished the tipc executable will be located in build/src/. You can copy it to a more convenient location if you like, but a number of scripts in the project expect it to be in this location so don't move it.

The project includes more than 300 unit tests grouped into several executables. The project also includes more than 90 system tests. These are TIP programs that have built in test oracles that check for the expected results. For convenience, there is a runtests.sh script provided in the bin directory. You can run this script to invoke the entire collection of tests. See the README in the bin directory for more information.

All of the tests should pass.

Ubuntu Linux

Our continuous integration process builds on both Ubuntu 22.04 and 20.04, so these are well-supported. We do not support other linux distributions, but we know that people in the past have ported tipc to different distributions.

Mac OS

Our continuous integration process builds on macOS 12 and macOS 13 so modern versions of macOS are well-supported. tipc builds on both Intel and Apple Silicon, i.e., Apple's M1 ARM processor.

Windows Subsystem for Linux

If you are using a Windows machine, tipc can be built in the Windows Subsystem for Linux (WSL). Here are instructions to install WSL and upgrade to WSL2. It is highly recommended to upgrade to WSL2. Once installed, you should install Ubuntu 20.04. Once finished, you can open a virtual instance of Ubuntu and follow the instructions above to build tipc.

You may recieve an error saying "No CMAKE_CXX_COMPILER could be found" when running cmake ... If this is the case, you should install g++ with the command: sudo apt-get install g++.

Using tipc

The tipc compiler has a limited set of options available through the --help flag.

OVERVIEW: tipc - a TIP to llvm compiler

USAGE: tipc [options] <tip source file>

OPTIONS:

Generic Options:

  --help                 - Display available options (--help-hidden for more)
  --help-list            - Display list of available options (--help-list-hidden for more)
  --version              - Display the version of this program

tipc Options:
Options for controlling the TIP compilation process.

  --asm                          - emit human-readable LLVM assembly language
  --do                           - disable bitcode optimization
  --log=<logfile>                - log all messages to logfile (enables --verbose 3)
  -o=<outputfile>                - write output to <outputfile>
  --pa=<AST output file>         - print AST to a file in dot syntax
  --pcg=<call graph output file> - print call graph to a file in dot syntax
  --pi                           - perform polymorphic type inference
  --pp                           - pretty print
  --ps                           - print symbols
  --pt                           - print symbols with types (supercedes --ps)
  --verbose=<int>                - enable log messages (Levels 1-3) 
                                    Level 1 - Basic logging for every phase.
                                    Level 2 - Level 1 and type constraints being unified.
                                    Level 3 - Level 2 and union-find solving steps.

By default it will accept a .tip file, parse it, perform a series of semantic analyses to determine if it is a legal TIP program, generate LLVM bitcode, and emit a .bc file which is a binary encoding of the bitcodes. You can see a human readable version of the bitcodes by running llvm-dis on the .bc file.

To produce an executable version of a TIP program, the .bc file must be linked with the bitcode for tip_rtlib.c. Running the build.sh script in the rtlib directory once will create that library bitcode file.

The link step is performed using clang which will include additional libraries needed by tip_rtlib.c.

For convenience, we provide a script build.sh that will compile the tip program and perform the link step. The script can be used within this git repository, or if you define the shell variable TIPDIR to the path to the root of the repository you can run it from any location as follows:

$ cd
$ more hello.tip
main() { return 42; }
$ $HOME/tipc/bin/build.sh hello.tip
$ ./hello
Program output: 42
$ $HOME/tipc/bin/build.sh -pp -pt hello.tip
main() 
{
  return 42;
}

Functions : {
  main : () -> int
}

Locals for function main : {

}

Working with tipc

The instructions above, and the scripts described below, make it possible to develop from the command line. This gives you lots of control, but it means you will miss the benefit of modern IDEs. Below we describe how to set up the CLion IDE for use with the project.

Command line

During development you need only run build steps 1 through 5 a single time, unless you modify some CMakeLists.txt file. Just run make in the build directory to rebuild after making changes to the source.

If you do need to add a source file then you will have to edit the appropriate CMakeLists.txt file to add it. In this case, you should:

  • cd build
  • rm CMakeCache.txt
  • cmake ..

which will regenerate the makefiles that you can then run, by typing make, to build.

Note that the tipg4 directory has a standalone ANTLR4 grammar. It's README describes how to build it in isolation and run it using the ANTLR4 jar file.

The bin directory

To facilitate development of tipc we have collected a number of helper scripts into the bin directory of the project. Among them are scripts to run the entire test bed (runtests.sh), to run a code coverage analysis (gencov.sh), and to generate the project documentation (gendocs.sh). Please see the README in the bin directory for example usages.

When rebuilding and rerunning tests you may get errors about failing to merge gcov files. This happens when gcov files linger from previous runs. To cleanup these messages, simply run the cleancov.sh script.

CLion

CLion is C++ IDE that can be used to develop and build tipc. CLion can be installed with the JetBrains suite of tools, or as a standalone tool here. Once installed, you can start a 30 day trial license or, as a student, you can get a free educational license here.

These instructions are with respect to CLion 2023.1.3, but older or new versions work similarly - though the UI may be a bit different.

If you are building for the first time with CLion, follow the first two steps of the installation process to install any needed tipc dependencies.

From the File menu select New and then Project from Version Control. You can type in the URL for this github repository and then hit the Clone button. The scripts described above assume a directory structure, but a little bit of setup will synchronize your CLion project with those assumptions and allow for easy development using both CLion and scripts, when needed.

From the CLion menu select Build, Execution, Deployment and then CMake. You want to change the Build directory to build and then define an Environment variable. When you ran the bootstrap.sh script it defined a shell variable LLVM_DIR in your .bashrc. Copy that definition into the Environment field under Cache variables. Your Settings should look as follows:

CLion CMake Settings

Now you can click Apply and then OK to complete the setup.

The project can now be built or rebuilt by clicking the "Build" button in the toolbar.

CLion has great debugging support as well as test coverage support for the Catch2 tests included in the project. You will rarely need to use the commandline scripts, but if you do just move to ~/CLionProjects/tipc and you can execute them there to:

  1. resolve gcov merge errors by running cleancov.sh
  2. run system tests with runtests.sh
  3. generate documentation with gendocs.sh

CLion also has some nice plugins. For example, there is an ANTLR v4 plugin that allows you to more easily develop extensions to the grammar for TIP. Installation is easy, just click on the Install to CLion ... button on the web-page. Then right click on any rule and select Test Rule ... and two frames pop up at the bottom of the UI: the left frame allows you to type in fragments of input and the right frame shows the resulting parse tree.

Log Messages

When working on the tipc compiler, it may be helpful to enable logging messages when testing your changes on programs. We have inserted logging messages using loguru. These can be turned using the flag --verbose [x] where x is a number between 1-3. These messages get more verbose as you increase x. The first setting shows when symbols are added to the symbol table and when type constraints are generated for the type solver. The second setting shows the previous information and type constraints being unified. The third setting shows types being search for and added into the type graph. When adding to theses features, you can add logging messages by adding a line LOG_S(x) where x is an integer to describe the level of log verbosity you want. You can use the existing levels or make new levels.

Code Style

tipc follows llvm coding standards. clang-format is used to apply the llvm style rules. The following command can be used to apply the llvm style across the tipc src directory.

find src -iname *.h -o -iname *.cpp | xargs clang-format -style=llvm -i

Using pre-commit we can enforce styling before each commit. This is encourged to keep a uniform style across the codebase. Install pre-commit by following the instructions in their documentation. Then, install the tipc hooks by running,

pre-commit install

Now, c++ and cmake formatting will be checked before each commit.

Documentation

The TIP grammar, tipg4, is implemented using ANTLR4. This grammar is free of any semantic actions, though it does use ANTLR4 rule features which allow for control over the tree visitors that form key parts of the compiler. This allows the structure of the grammar to remain relatively clean, i.e., no grammar factoring or stratification needed.

The tipc compiler is has a pretty classic design. It is comprised of four phases:

  • frontend takes care of parsing, constructing an AST representation, and pretty printing
  • semantic analysis that performs assignability, symbol, and type checking
  • code generation that produces LLVM bitcode from an AST and emits a binary
  • optimization that runs a few LLVM optimization passes to improve the bitcode

Doxygen documentation for the project is available for the project. The documentation is a work in progress and will improve over time.

The tipc driver program only produces a bitcode file, .bc. You need to link it with the runtime library which define the processing of command line arguments, which is non-trivial for TIP, establish necessary runtime structures, and implement IO routines. A script is available to link binaries compiled by tipc with the runtime library.

Goals and Plans

The goal of this project is to provide a starting point for project work in an undergraduate compilers course. As such it similar to lots of other compiler projects, but there are some differences.

First and foremost, the TIP language includes a number of rich features, e.g., high-order functions, and type inference, and the tipc compiler targets LLVM - a key component of a production compiler infrastructure. These choices are intentional and while they create some challenges the project is intended to help demystify complex language features, e.g., parametric polymorphism, by illustrating how they can be realized.

Second, the project attempts to use modern software development practices, e.g. Doxygen for in-code documentation, unit testing with Catch2, continuous integration with GitHub Actions, and code coverage with lcov.

Third, the project intentionally makes heavy use of the Visitor pattern which is quite appropriate in the context of a compiler. Our use of it is intended to demonstrate how this type of abstract design element in a system can yield conceptual simplicity and savings in development. The project currently uses 6 visitors that extend ASTVisitor and another visitor from ANTLR4.

Finally, the project is implemented in C++17 using modern features. For example, all memory allocation uses smart pointers, we use unique pointers where possible and shared pointers as well, to realize the RAII pattern. Again this presents some challenges, but addressing them is illustrated in the tipc code base and hopefully they provide a good example for students.

The project is a work-in-progress in the sense that we are planning to perform corrective maintenance, as needed, as well as perfective maintenance. For the latter, we expect to make a new release of the project in early August every year. This release will focus on improving the use of modern C++ and to incorporate better design principles, patterns, and practices. We welcome issue reports and pull-requests along these lines.

Differences from TIP and Limitations

Other than issues related to the efficiency of the code that it generates, the tipc compiler has two limitations.

tipc supports a variant of the TIP language semantics in a few ways. It implements the != operator which allows us to conveniently write self-contained system tests and it implements the C operator precedence rules, whereas the original TIP uses a few different rules. This surfaces in the interplay between pointer dereference and field access. An expression *r.f is interpreted as *(r.f) according t the C precedence rules and as (*r).f according to the TIP Scala implementation. If in doubt, add parentheses to express your meaning.

By default tipc implements the unification-based monomorphic type inference algorithm used in the Scala implementation. tipc also support a form of polymorphic type inference using the --pi option. To use polymorphic type inference programmers can add the poly keyword to a function declaration, but tipc will only perform polymorphic type inference for such functions if they are non-recursive.

There is also a difference in type inference related to records. The type system that ensures that any expression appearing in the record position of a field access expression is in fact a record, but it does not infer precise record typing. Instead the strategy used is to define an uber record that consists of all of the fields referenced across the program. Code generation for records will allocate a global record and then explicitly initialize fields present in a record expression. If the record is being created via an alloc expresion, the fields that aren't explictly set will be set to a default value 0, as we allocate memory using the C function calloc. If the record is being created without explictly allocating memory for it, the fields that aren't explictly set will not be given any value, but may still be referenced. This can lead to some unexpected behavior. Consider this TIP program:

main() { var r; r = {g:1}; return access(r); }
access(r) { return r.f; }

The record expression, {g:1}, forces the global record to contain a field g, and the access expression r.f, forces the presence of field f. Because r is not being allocated using an alloc expression, access will return whatever value was in memory at the location of r.f. One might prefer that this program yield a type error, but that would require a more sophisticated type system. We chose not to implement that to manage the complexity of what is primarily a pedagogical project. If instead, r is allocated using alloc like the following:

main() { var r; r = alloc {g:1}; return access(r); }
access(r) { return r.f; }

The record expression default initializes f to 0 and this is the value that is accessed and returned from the call to access and then from main.

Second, TIP allows memory allocation, yet its runtime system does not include a garbage collector.
The TIP program recordLeak.tip, shown below, leaks memory because the alloc constructor causes the record to be allocated on the heap:

// recordLeak.tip
foo(x,y,z){
    var rec;
    rec = alloc {l: x, m: y, n: z};
    return (*rec).m;
}

main(){
    var i, j, a;
    a = 0;
    i = 0;
    j = 0;
    while (1000000000 > i) { 
      while (1000000000 > j) { 
        a = a + foo(3,2,4);
        j = j + 1;
      }
      i = i + 1;
    }
    return 0;
}

This is a valid tip program which can be compiled into an executable using and observe it's memory usage using:

/path/to/tipc/bin/build.sh --do test/system/leak/recordLeak.tip
./recordLeak &; top

You can then kill top using Ctrl+C and then kill the ./recordleak with fg and Ctrl+C. It's important that you disable the optimizer with the --do flag. Otherwise, the optimizer would be smart enough to simply return the y parameter's value. If we remove the alloc from foo, as we do in recordNoLeak.tip, then the record is allocated on the call stack and it is reclained when the call to foo returns:

// recordNoLeak.tip
foo(x,y,z){
    var rec;
    rec = {l: x, m: y, n: z};
    return (*rec).m;
}

main(){
    var i, j, a;
    a = 0;
    i = 0;
    j = 0;
    while (1000000000 > i) { 
      while (1000000000 > j) { 
        a = a + foo(3,2,4);
        j = j + 1;
      }
      i = i + 1;
    }
    return 0;
}

We can find that this program will not create a memory leak because rec will be allocated on the stack instead of the heap as the alloc would.

Incorporating a garbage collector is a possible future extension to the runtime library.

Resources

To fully understand this project quite a bit of background is required. We collect a number of resources that we think can be helpful in that regard.

C++ Resources

If you find yourself unfamiliar with certain aspects of the C++ programming language we encourage you to explore the Back To Basics videos that have been presented at CppCon. Provided below are links to a number of these videos, as well as to other resources that are relevant to this project.

Move Semantics

Value Categories

Smart pointers

CMake Resources

Catch2 and Unit Testing Resources

LLVM Resources

To understand this code, and perhaps extend it, you will want to become familiar with the core LLVM classes. It can be difficult to absorb all of the information in this type of documentation just by reading it. A goal-directed strategy where you move back and forth between reading code and reading this documentation seems to work well for many people.

If you are familiar with the LLVM tutorial you will see its influence on this compiler which leverages idioms, strategies, and code fragments from the tutorial. The LLVM tutorials are a great starting point for understanding the APIs in the context of compiling.

There is lots of great advice about using LLVM available:

  • https://www.cs.cornell.edu/~asampson/blog/llvm.html
  • the LLVM Programmer's Manual is a key resource
  • someone once told me to just use a search engine to find the LLVM APIs and its a standard use case for me, e.g., I don't remember where the docs are I just search for llvm irbuilder
  • LLVM has some nuances that take a bit to understand. For instance, the GEP instruction, which tipc uses quite a bit given that it emits calls through a function table.
  • LLVM15+ have implemented the concept of "Opaque Pointers", this removes all the typed pointer implementations and associated functions.

Git Resources

CLion Resources