A simple library to extract a code property graph out of source code. It supports multiple passes that can extend the analysis after the graph is constructed. It currently supports C/C++ (C17) and Java (Java 13), and has experimental support for Golang, Python and TypeScript. Furthermore, it supports LLVM IR and thus, in theory, all languages that compile using LLVM.
A code property graph (CPG) is a representation of source code in the form of a labelled directed multi-graph. Think of it as a directed graph where each node and edge is assigned a (possibly empty) set of key-value pairs (properties). This representation is supported by a range of graph databases such as Neptune, Cosmos, Neo4j, Titan, and Apache Tinkergraph and can be used to store the source code of a program in a searchable data structure. Thus, the code property graph makes it possible to use existing graph query languages such as Cypher, NQL, SQL, or Gremlin either to manually navigate through interesting parts of the source code or to automatically find "interesting" patterns.
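To make the data structure concrete, here is a minimal sketch of a labelled directed multi-graph with properties in plain Python. It is an illustration only and not this library's API: the class names (Node, Edge, PropertyGraph), the edge labels ("AST", "DFG") and the example nodes are all assumptions made up for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    labels: set[str]
    properties: dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: dict[str, object] = field(default_factory=dict)


class PropertyGraph:
    """Labelled directed multi-graph: parallel edges between two nodes are allowed."""

    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node_id: int, labels: list[str], **properties: object) -> Node:
        node = Node(node_id, set(labels), dict(properties))
        self.nodes[node_id] = node
        return node

    def add_edge(self, source: int, target: int, label: str, **properties: object) -> Edge:
        # Edges are kept in a list, so the same (source, target) pair may occur
        # several times with different labels or properties.
        edge = Edge(source, target, label, dict(properties))
        self.edges.append(edge)
        return edge


def build_example() -> PropertyGraph:
    """Hypothetical graph for a function `main` that calls `strcpy`."""
    graph = PropertyGraph()
    graph.add_node(1, ["FunctionDeclaration"], name="main")
    graph.add_node(2, ["CallExpression"], name="strcpy")
    # Two parallel edges between the same nodes, distinguished by their label.
    graph.add_edge(1, 2, "AST", index=0)
    graph.add_edge(1, 2, "DFG")
    return graph


def find_calls(graph: PropertyGraph, name: str) -> list[Node]:
    """A tiny 'query': all call nodes whose name property matches."""
    return [
        node
        for node in graph.nodes.values()
        if "CallExpression" in node.labels and node.properties.get("name") == name
    ]
```

A graph database stores the same kind of structure persistently, and a query language such as Cypher plays the role of find_calls in a far more general way.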
This library uses Eclipse CDT for parsing C/C++ source code and JavaParser for parsing Java. In contrast to compiler AST generators, both are "forgiving" parsers that can cope with incomplete or even semantically incorrect source code. That makes it possible to analyze source code even without being able to compile it (due to missing dependencies or minor syntax errors). Furthermore, it uses LLVM through the javacpp project to parse LLVM IR. Note that the LLVM IR parser is not forgiving, i.e., the LLVM IR code needs to be at least considered valid by LLVM. The necessary native libraries are shipped by the javacpp project for most platforms.
In order to improve some formal aspects of our library, we created several specifications of our core concepts. Currently, the following specifications exist:
We aim to provide more specifications over time and also include them in a new generated documentation site.
In order to get familiar with the graph itself, you can use the subproject cpg-neo4j. It uses this library to generate the CPG for a set of user-provided code files. The graph is then persisted to a Neo4j graph database. The advantage for the user is that Neo4j's visualization software, Neo4j Browser, can be used to look at the CPG nodes and edges graphically, instead of at their Java representations.
The most recent version is published to Maven Central and can be used as a simple dependency, using either Maven or Gradle. Since Eclipse CDT is not published on Maven Central, it is necessary to add a repository with a custom layout to find the released CDT files. For example, using Gradle's Kotlin syntax:
repositories {
    ivy {
        setUrl("https://download.eclipse.org/tools/cdt/releases/10.3/cdt-10.3.2/plugins")
        metadataSources {
            artifact()
        }
        patternLayout {
            artifact("/[organisation].[module]_[revision].[ext]")
        }
    }
}
dependencies {
    val cpgVersion = "5.1.0"
    // if you want to include all published cpg modules
    implementation("de.fraunhofer.aisec", "cpg", cpgVersion)
    // if you only want to use some of the cpg modules
    // use the 'cpg-core' module
    // and then add the needed extra modules, such as Go and Python
    implementation("de.fraunhofer.aisec", "cpg-core", cpgVersion)
    implementation("de.fraunhofer.aisec", "cpg-language-go", cpgVersion)
    implementation("de.fraunhofer.aisec", "cpg-language-python", cpgVersion)
}
Beware that the cpg module includes all optional features and might potentially be HUGE (especially because of the LLVM support). If you do not need LLVM, we suggest just using the cpg-core module with the needed extra modules like cpg-language-go. We are working on extracting more optional features into separate modules.
A published artifact of every commit can be requested through JitPack. This is especially useful if your external project makes use of a specific feature that is not yet merged or not yet published as a version. Please follow the instructions on the JitPack page. Please be aware that, similar to release builds, the CDT repository needs to be added as well (see above).
The library can be used on the command line using the cpg-console subproject. Please refer to the README.md of the cpg-console as well as our small tutorial for further details.
Some languages, such as Golang, are experimental and depend on other native libraries. Therefore, they are not included as Gradle submodules by default.
To include them as submodules, simply toggle them on in the gradle properties file by setting the value of the corresponding property to true, e.g., enableGoFrontend=true.
Instead of manually editing the gradle properties file, you can also use the configure_frontends.sh script, which edits the properties for you.
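For illustration, the toggle described above amounts to rewriting one key=value line. The following Python sketch shows that idea; it is not the configure_frontends.sh script and makes no claim about how that script works. The helper names enable_property and enable_in_file are hypothetical, and the file name gradle.properties is assumed from Gradle's usual convention.

```python
from pathlib import Path


def enable_property(text: str, key: str) -> str:
    """Return the properties text with `key` set to true.

    Existing assignments of `key` are overwritten; if the key is absent,
    a new line is appended. Comment lines (starting with '#') are kept as-is.
    """
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}=true"
            found = True
    if not found:
        lines.append(f"{key}=true")
    return "\n".join(lines) + "\n"


def enable_in_file(path: str, key: str) -> None:
    """Apply enable_property to a file on disk (path is an assumption, e.g. gradle.properties)."""
    file = Path(path)
    file.write_text(enable_property(file.read_text(), key))
```

Calling enable_in_file("gradle.properties", "enableGoFrontend") from the project root would then switch on the Go frontend, assuming the properties file lives there.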
In the case of Golang, the necessary native code can be found in the src/main/golang folder of the cpg-language-go submodule. Gradle should automatically find the JNI headers and store the finished library in the src/main/golang folder. This currently only works for Linux and macOS. In order to use it in an external project, the resulting library needs to be placed somewhere in java.library.path.
You need to install jep. This can either be system-wide or in a virtual environment. Your jep version has to match the version used by the CPG (see version catalog).
Currently, only Python 3.10 is supported.
Follow the instructions at https://github.com/ninia/jep/wiki/Getting-Started#installing-jep.
python3 -m venv ~/.virtualenvs/cpg
source ~/.virtualenvs/cpg/bin/activate
pip3 install jep
Through the JepSingleton, the CPG library will look for well-known paths on Linux and macOS. JepSingleton will prefer a virtualenv with the name cpg; this can be adjusted with the environment variable CPG_PYTHON_VIRTUALENV.
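Before running the Python frontend, it can help to check the requirements listed above (Python 3.10, jep installed, the expected virtualenv). The following pre-flight sketch does that. It does not reproduce JepSingleton's lookup logic, which is not described here; it assumes that the virtualenv lives under ~/.virtualenvs (as in the commands above) and that CPG_PYTHON_VIRTUALENV holds the virtualenv's name. The function names are hypothetical.

```python
from __future__ import annotations

import importlib.util
import os
import sys
from pathlib import Path

SUPPORTED_PYTHON = (3, 10)  # "Currently, only Python 3.10 is supported."


def expected_virtualenv(env=None, home=None) -> Path:
    """Path of the virtualenv we expect to be used.

    Assumption: CPG_PYTHON_VIRTUALENV contains a virtualenv *name* and
    virtualenvs are stored under ~/.virtualenvs.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    name = env.get("CPG_PYTHON_VIRTUALENV", "cpg")
    return home / ".virtualenvs" / name


def check_environment(version=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version != SUPPORTED_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, but only 3.10 is supported"
        )
    # find_spec returns None when the top-level package cannot be located.
    if importlib.util.find_spec("jep") is None:
        problems.append("jep is not importable in this interpreter")
    return problems
```

Run inside the activated virtualenv, check_environment() reports what is missing, and expected_virtualenv() shows which directory the assumptions above point to.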
For parsing TypeScript, the necessary NodeJS-based code can be found in the src/main/nodejs directory of the cpg-language-typescript submodule. Gradle should build the script automatically, provided NodeJS (>=16) is installed. The bundled script will be placed inside the jar's resources and should work out of the box.
We use Google Java Style for formatting. Please install the appropriate plugin for your IDE, such as the google-java-format IntelliJ plugin or google-java-format Eclipse plugin.
Straightforward; however, three things are recommended:
- Enable gradle "auto-import"
- Enable google-java-format
- Hook gradle spotlessApply into "before build" (might be obsolete with IDEA 2019.1)
You can use the hook in style/pre-commit to check for formatting errors:
cp style/pre-commit .git/hooks
The following authors have contributed to this project (in alphabetical order):
- fwendland
- JulianSchuette
- konradweiss
- KuechA
- Masrepus
- maximiliankaul
- maximilian-galanis
- obraunsdorf
- oxisto
- peckto
- titze
- vfsrfs
We are currently discussing the implementation of a Contributor License Agreement (CLA). Unfortunately, we cannot merge external pull requests until this issue is resolved.
A quick write-up of our CPG has been published on arXiv:
[1] Konrad Weiss, Christian Banse. A Language-Independent Analysis Platform for Source Code. https://arxiv.org/abs/2203.08424
A preliminary version of this cpg has been used to analyze ARM binaries of iOS apps:
[2] Julian Schütte, Dennis Titze. liOS: Lifting iOS Apps for Fun and Profit. Proceedings of the ESORICS International Workshop on Secure Internet of Things (SIoT), Luxembourg, 2019. https://arxiv.org/abs/2003.12901
An initial publication on the concept of using code property graphs for static analysis:
[3] Yamaguchi et al. - Modeling and Discovering Vulnerabilities with Code Property Graphs. https://www.sec.cs.tu-bs.de/pubs/2014-ieeesp.pdf
[4] is an unrelated, yet similar project by the authors of the above publication, which is used by the open source software Joern [5] for analysing C/C++ code. While [4] is a specification and implementation of the data structure, this project includes various language frontends (currently C/C++ and Java, Python to come) and allows creating custom graphs by configuring passes which extend the graph as necessary for a specific analysis:
[4] https://github.com/ShiftLeftSecurity/codepropertygraph
[5] https://github.com/ShiftLeftSecurity/joern/
Additional extensions of the CPG into the field of Cloud security:
[6] Christian Banse, Immanuel Kunz, Angelika Schneider and Konrad Weiss. Cloud Property Graph: Connecting Cloud Security Assessments with Static Code Analysis. IEEE CLOUD 2021. https://doi.org/10.1109/CLOUD53861.2021.00014