This is the data set and scripts for our paper "Uncovering the Hidden Dangers: Finding Unsafe Go Code in the Wild".
Authors:
Johannes Lauinger, Lars Baumgärtner, Anna-Katharina Wickert, and Mira Mezini
Technische Universität Darmstadt, D-64289 Darmstadt, Germany
E-mail: {baumgaertner, wickert, mezini} (with) cs.tu-darmstadt.de, jlauinger (with) seemoo.tu-darmstadt.de
To create and process the data for our study, we used the following pipeline:
- Raw Projects and Dependencies. This set contains the 500 open-source Go projects that we crawled from GitHub. The projects at the specific revision that we examined are referenced in this repository through Git submodules. The `projects/` directory contains the submodules.
- Package and Unsafe Data. From the projects and their dependencies, we compiled the list of all packages used transitively. Within all packages, we identified usages of `unsafe` Go code. The results of this stage are included in the `data/` directory.
- Labeled Unsafe Usages. We used Python to examine the data and sample 1,400 code snippets for manual classification by unsafe usage type and purpose. The results of this stage are included in the `analysis/` and `labeled-usages-dataset/` directories.
The directories in this repository contain the following:
- `data/` contains gzipped versions of the CSV files holding project, package, and unsafe code block information, as well as the sampled and labeled code snippets.
- `figures/` contains Figures 1 to 5 as included in our paper.
- `labeled-usages-dataset/` contains our data set of labeled usages of unsafe code blocks in Go code. The data set is divided into 400 Go standard library usages (std) and 1,000 application code (non-standard library) usages (app). Each directory contains subfolders with names similar to `efficiency__cast-struct`, where the purpose label and usage label as used in our paper are included, separated by two underscores. Each of the directories contains one file for each classified usage, as described in more detail below.
- `projects/` contains Git submodules for each of the 500 projects under examination, set to the specific revision that we analyzed.
- `scripts/` contains Python scripts to replicate the figures and tables included in our paper, as well as the data acquisition tool that we used to extract unsafe code blocks from the projects and a Jupyter notebook with the Python code that we used to explore the data.
We analyzed the 500 top-rated Go projects from GitHub. They are referenced through Git submodules in this repository. From these, we excluded 150 projects that did not yet support Go modules, and 7 that did not compile.
The projects that did not support Go modules can be identified from the `data/projects.csv.gz` file using the `project_used_modules` column. The projects that did not compile are listed in the `data/projects_with_errors.txt` file.
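For example, with pandas (a minimal sketch; it assumes that `data/projects_with_errors.txt` lists one project per line and that the module column parses as shown):

```python
import pandas as pd

# Load the project metadata; pandas decompresses the .gz file transparently.
projects = pd.read_csv("data/projects.csv.gz")

# How many projects did (not) use Go modules? The exact encoding of the
# column (boolean vs. string) may differ from what this sketch assumes.
print(projects["project_used_modules"].value_counts())

# Projects that did not compile (assumed to be listed one per line).
with open("data/projects_with_errors.txt") as f:
    failed_projects = [line.strip() for line in f if line.strip()]
print(len(failed_projects), "projects did not compile")
```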
To make it easier to reproduce our study, to fetch the projects we analyzed at their exact revision, and to get the correct dependency versions, we also provide the projects and dependencies as `tar.gz`-compressed archives on the Zenodo platform: https://zenodo.org/record/4001728
The `data/geiger_findings.csv.gz` file contains the unsafe code findings. Each line in the file represents one finding. It holds the corresponding code line along with +/- 5 lines of code context, as well as metadata about the finding. This metadata includes the line number, column, file, package, module, and project where the finding was located. The package and project data are foreign keys into the `data/packages.csv.gz` and `data/projects.csv.gz` files, respectively, which provide more detailed information. For example, the packages file contains total finding counts for each package.
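For example, the findings can be joined with the package and project information using pandas (a sketch; the join key names below are assumptions and should be checked against the actual CSV headers):

```python
import pandas as pd

findings = pd.read_csv("data/geiger_findings.csv.gz")
packages = pd.read_csv("data/packages.csv.gz")
projects = pd.read_csv("data/projects.csv.gz")

# Inspect the headers first; the join keys "package" and "project" below are
# assumptions about the column names.
print(findings.columns.tolist())

enriched = (
    findings
    .merge(packages, on="package", suffixes=("", "_pkg"))
    .merge(projects, on="project", suffixes=("", "_proj"))
)
print(enriched.head())
```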
The `data/vet_findings.csv.gz` and `data/gosec_findings.csv.gz` files contain warnings that were generated by `go vet` and `gosec` on the packages.
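They can be loaded the same way; for example (a small sketch, assuming pandas):

```python
import pandas as pd

vet = pd.read_csv("data/vet_findings.csv.gz")
gosec = pd.read_csv("data/gosec_findings.csv.gz")
print(len(vet), "go vet warnings,", len(gosec), "gosec warnings")
```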
The `data/` directory also contains the `sampled_usages_app.csv.gz` and `sampled_usages_std.csv.gz` files, which are sampled subsets of the `geiger_findings.csv.gz` file containing 1,000 and 400 unique samples, respectively, together with two labels for each line.
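For a first look at these files, a minimal pandas sketch (the names of the two label columns are not spelled out here, so inspect the header before filtering):

```python
import pandas as pd

app = pd.read_csv("data/sampled_usages_app.csv.gz")
std = pd.read_csv("data/sampled_usages_std.csv.gz")

print(len(app), "application samples,", len(std), "standard-library samples")
# Inspect the header to find the two label columns, then e.g.:
# print(app["<label-column>"].value_counts())
print(app.columns.tolist())
```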
As described in our paper, we randomly sampled 1,400 unique unsafe usages from the 10 projects with the most overall unsafe usages. We then manually classified these samples in two dimensions: by what is being done and for what purpose.
We identified the following classes for the first dimension, what is being done:
- `cast-struct`, `cast-basic`, `cast-bytes`, `cast-pointer`, `cast-header` (all summarized as `cast` in our paper to save space): all kinds of casts between arbitrary types and structs, basic Go types, `[]byte` slices or `[N]byte` arrays, actual `unsafe.Pointer` values, or `reflect.SliceHeader` and `reflect.StringHeader` values, respectively.
- `memory-access`: dereferencing of unsafe pointers, manipulation of referenced memory, or comparison of the actual stored addresses.
- `pointer-arithmetic`: all kinds of arithmetic manipulation of addresses, such as manually advancing a slice.
- `definition`: groups usages where a field or method of type `unsafe.Pointer` is declared for later usage.
- `delegate`: instances where unsafe is needed only because another function requires an argument of type `unsafe.Pointer`.
- `syscall`: groups calls to `syscall.Syscall` or other native syscalls.
- `unused`: occurrences that are not actually being used, e.g. dead code or unused function parameters.
Purpose of usage is labeled with the following classes:
- `efficiency`: all uses of unsafe to improve time or space complexity, such as in-place casts. Code contained in this class could also be written without the use of unsafe, decreasing efficiency.
- `serialization`: contains marshalling and serialization operations.
- `generics`: contains usages of unsafe that achieve functionality that could have been written without unsafe if Go provided support for generics.
- `no-gc` (avoid garbage collection): contains usages where unsafe is used to tell the compiler not to free a value until a function returns, such as when calling assembly code.
- `atomic` (atomic operations): contains usages of the atomic package which require unsafe.
- `ffi` (foreign function interface): contains calls to Cgo or other function interfaces that require unsafe by their contract.
- `hide-escape`: contains snippets where unsafe is used to hide a value from the Go escape analysis.
- `layout` (memory layout control): contains unsafe usages to achieve low-level memory management, such as precise alignment.
- `types`: contains unsafe usages needed to implement the Go type system itself. Only present in the `std` samples.
- `reflect`: contains instances of type reflection and re-implementations of some types from the reflect package, such as using `unsafe.Pointer` instead of `uintptr` for slice headers.
- `unused`: again, contains occurrences that are not actually being used.
The `labeled-usages-dataset/` directory is organized as follows: the `app` and `std` directories contain 1,000 and 400 samples, respectively, divided into application (non-standard library) and standard-library usages. Each of them contains subdirectories grouping the snippets by their combination of labels. The subdirectories are named similar to `efficiency__cast-struct`: both labels of the samples are concatenated using two underscores. Every combination of labels that actually contains samples has its own directory.
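Given this layout, the number of samples per label combination can be tallied with a short Python sketch (it assumes only the directory structure described above, with sample files stored directly in the label directories):

```python
from pathlib import Path

# Count samples per purpose__usage label combination in the app and std parts.
for part in ("app", "std"):
    print(f"== {part} ==")
    for label_dir in sorted(Path("labeled-usages-dataset", part).iterdir()):
        if label_dir.is_dir() and "__" in label_dir.name:
            purpose, usage = label_dir.name.split("__", 1)
            n_samples = sum(1 for f in label_dir.iterdir() if f.is_file())
            print(f"{purpose} / {usage}: {n_samples}")
```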
The samples are provided as one file per sample. The file name is a hash of the line number, file, package, etc. of the finding, which guarantees a unique name. Each file contains four sections divided by dashes. The first section provides information about the module, version, package, file, and line of the snippet. It also states which project included this snippet (although more projects in the data set may share usage of the snippet), and the labels as already included in the directory name. This information is guaranteed to be on the same line numbers across files. The second section contains the snippet code line. The third and fourth sections contain +/- 5 lines and +/- 100 lines of context, respectively.
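The separator between sections is only described here as dashes, so the following Python sketch assumes separator lines consisting solely of dash characters and may need adjusting to the actual file format:

```python
import re
from pathlib import Path

def parse_sample(path):
    """Split a labeled-usage sample file into its four sections.

    Assumes the sections are separated by lines made up only of dashes;
    adjust the pattern if the actual files use a different separator.
    """
    text = Path(path).read_text()
    sections = re.split(r"^-+\s*$", text, flags=re.MULTILINE)
    if len(sections) != 4:
        raise ValueError(f"expected 4 sections, got {len(sections)} in {path}")
    metadata, code_line, short_context, long_context = (s.strip("\n") for s in sections)
    return metadata, code_line, short_context, long_context

# Example: parse every sample in one label directory (directory name from the
# example above).
for sample in Path("labeled-usages-dataset/app/efficiency__cast-struct").glob("*"):
    metadata, code_line, _, _ = parse_sample(sample)
    print(sample.name, "->", code_line.strip())
```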
Additionally, the labeled data set is included in machine-readable CSV format in the `data/sampled_usages_app.csv.gz` and `data/sampled_usages_std.csv.gz` files, as described previously.
To reproduce the figures and tables included in our paper, simply execute the corresponding scripts in the `scripts/` directory. They also serve as formal documentation of the specific data analysis that we did:
cd scripts
./create-figure-distribution-unsafe-types.py
./create-figure-unsafe-import-depth.py
./create-table-dataset-labels.py
./create-table-dataset-projects.py
Figures are saved as PDF files in the same directory; tables are written to the terminal as LaTeX code.
To execute the scripts, you need the following Python libraries:
- Pandas
- NumPy
- Matplotlib
- Tikzplotlib
- Seaborn
To reproduce the data set, first obtain the raw project code and dependencies. The easiest way to do this is to get the compressed archive with the exact project code that we used from our Zenodo record: https://zenodo.org/record/4001728
Alternatively, you can recursively clone this repository to check out the projects data set submodules. The projects are included in this repository as submodules at the correct revision that we used for analysis. They are located in the `projects/` directory. After recursively cloning the repositories, you may need to run `go mod vendor` in the root directory of each repository to make sure that all dependencies are properly downloaded. This step is unnecessary when using the Zenodo record.
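If you prefer to script that step, a minimal Python sketch (assuming the Go toolchain is installed and the submodules were cloned recursively) could look like this:

```python
import subprocess
from pathlib import Path

# Run `go mod vendor` in every project submodule to fetch its dependencies.
for project in sorted(Path("projects").iterdir()):
    if project.is_dir():
        print("vendoring", project.name)
        subprocess.run(["go", "mod", "vendor"], cwd=project, check=False)
```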
Then, build and execute the data acquisition tool in the `scripts/data-acquisition-tool` directory. The folder contains a README file with build instructions and usage information.
All project and dependency code is licensed under the terms of the respective licenses for the specific projects.
Our study material and data set are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.