Archived; the v2 implementation is at https://github.com/haxscramper/haxorg/tree/master/scripts/cxx_repository – much better data model, should have tests and so on. This repo is kept around for my own reference; some things from here are still worth copying.
Historical data analysis for git repositories. Largely inspired by projects such as src-d/hercules (gaining advanced insights from Git repository history), adamtornhill/code-maat (a command line tool to mine and analyze data from version-control systems), adamtornhill/maat-scripts (scripts used to post-process the results from Code Maat), and most importantly the book *Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs* by Adam Tornhill.
This project consists of two parts: (1) a C++ CLI tool that generates an SQLite database with information from the repository, and (2) a collection of user scripts for generating concrete analyses from that data. Some scripts come with the repository (for more details on the usage see the related wiki pages), but you are free to write your own in any language that can interact with an SQLite database file.
```
code_forensics user_repo(1) [ --filter-script=file.py(2) ] [ --outfile=db.sqlite(3) ]
```
The main CLI application `code_forensics` is a self-contained binary that takes in a (1) user-provided repository, an (2) optional python filter script, and possibly a (3) previously generated database (for incremental updates), and generates a new database file (or updates the existing one).
- A user-provided git repository (1) - no special constraints on the structure, size etc., although large repositories generally require a (2) filter script in order to pick only certain commits - a full analysis might take too much time.
- A user-provided filter script (2) - a regular python script that customizes the handling of certain commits and files. The script should import the `forensics.config` object and configure it as needed (see the sketch below). For more details on the script configuration see the "Scripting" section of this readme.
- Path to the output database file (3).
- `--branch=` - which repository branch to analyze (defaults to `master`).
Custom analysis logic can be injected directly into the database processing stage using a user-provided python script passed via the `--filter-script` option.
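A filter script might look like the following sketch. `forensics.config` is the object the tool injects, but the hook names used below (`allow_commit`, `allow_path`) are hypothetical placeholders, not the documented API:

```python
# Hypothetical filter script; the attribute names on forensics.config
# are illustrative placeholders, not the tool's documented API.
import forensics

# Only sample commits from 2015 onwards (hypothetical commit hook).
forensics.config.allow_commit = lambda commit: commit.time.year >= 2015

# Restrict the analysis to C++ sources (hypothetical file hook).
forensics.config.allow_path = lambda path: path.endswith((".cpp", ".hpp"))
```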
NOTE: The filtering configuration script and the subsequent analysis can be tied together via the period indexing scheme or some other means. The examples in `scripts` also use this feature. For example, the sampling period mapping is done using the function below, which effectively translates 2011Q2 into 4022 (2011 * 2, with no increment because the second quarter ends before July). Later on this value is unpacked by the data visualization part.
```python
import datetime

def sample_period_mapping(date: datetime.date) -> int:
    # Two sampling periods per year: Jan-Jun and Jul-Dec.
    result = date.year * 2
    if 6 < date.month:
        result += 1
    return result
```
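Since the visualization side later unpacks these period indices, the inverse mapping could look like the sketch below (the `unpack_period` helper and its "H1"/"H2" label format are illustrative, not part of the tool):

```python
def unpack_period(period: int) -> str:
    # Inverse of sample_period_mapping: 4022 -> "2011H1", 4023 -> "2011H2".
    year, half = divmod(period, 2)
    return f"{year}H{half + 1}"
```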
The database consists of several tables, mapping the different types declared in the `git_ir.hpp` file. The mapping is done via the ORM library using the `ir::` types. A general overview of each table present in the database:
- `line` - interned information about different lines in the code. Each line's content is represented uniquely and includes the `.author`, `.time`, and `.content` ID fields. `.nesting` is a line complexity metric.
- `files` - full list of all processed files - each separate version of a file is represented as a different database entry, with a new ID, statistics etc.
- `file lines` - auxiliary table for the data found in the analyzed files. `.file` refers to the `files` table, `index` shows the position in the file, and `.line` refers to the information in the `line` table.
- `rcommit` - full list of all processed commits. Contains information about the author ID, the time and timezone of the commit, the period to which the commit was attributed, and the commit message.
- `edited_files` - files explicitly modified in each commit. Contains a commit ID, the number of added and removed lines, and the path ID of the file.
- `renamed` - pairs of old/new file path IDs for each commit.
- `paths` - interned list of paths used in different tables. Contains the full path of the file and the ID of its direct parent directory.
- `author` - list of commit and line authors - name and email.
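These tables can be queried directly from a user script. A minimal Python sketch counting commits per author; the column names (`author.id`, `author.name`, `rcommit.author`) and the database file name are assumptions here, so check `git_ir.hpp` for the authoritative schema:

```python
import sqlite3

# Count commits per author. The column names below are assumptions
# about the schema; see git_ir.hpp for the real definitions.
con = sqlite3.connect("db.sqlite")  # placeholder output file name
query = """
select author.name, count(*) as commits
from rcommit
join author on author.id = rcommit.author
group by author.name
order by commits desc;
"""
for name, commits in con.execute(query):
    print(name, commits)
con.close()
```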
In addition to the main tables, a collection of views is added to the database after it is generated:

- `file_path_with_dir` - simplifies getting the full file name from a path ID (without this view it would require two joins - on the path ID and on the path's name)
- `file_version_with_path` - simplifies getting the full name, commit, and directory of each file version
- `file_version_with_path_dir` - extension of the previous view with a joined path of the parent directory
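A user script can read such a view directly instead of writing the joins by hand. A minimal sketch; `select *` is used because the exact columns the view exposes are not documented here, and the database path is a placeholder:

```python
import sqlite3

con = sqlite3.connect("db.sqlite")  # placeholder path
# Peek at the first few rows of the path-resolution view.
for row in con.execute("select * from file_path_with_dir limit 5;"):
    print(row)
con.close()
```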
Select every instance of the file, find its relative complexity, and order by it:

```sql
select strings.`text`, file.total_complexity / cast(file.line_count as real) as result
from file, strings
where file.name = strings.id
order by result;
```
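Queries like this can be run from a user script in any language with SQLite bindings; a minimal Python sketch, where "db.sqlite" is a placeholder for whatever `--outfile` was used:

```python
import sqlite3

# Run the relative-complexity query above against a generated database.
con = sqlite3.connect("db.sqlite")  # placeholder path
query = """
select strings.`text`, file.total_complexity / cast(file.line_count as real) as result
from file, strings
where file.name = strings.id
order by result;
"""
for name, complexity in con.execute(query):
    print(name, complexity)
con.close()
```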