Identifiers size 1.0GB

Paper (accepted to ML4P'18).

The dataset was extracted from Public Git Archive and consists of:

  1. 49 million distinct identifiers - 1 GB
  2. identifiers per language - 1 GB, same processing as (1) but restricted to files written in a specific programming language: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.

Format

CSV, columns:

  • num_files - number of files where the identifier was found
  • num_occ - number of times the identifier was found overall
  • num_repos - number of repositories in which the identifier was found
  • token - the value of the identifier
  • token_split - the identifier split into parts using the sourced-ml heuristics

All the stats correspond to the HEAD revision of each repository in PGA.
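
As a rough illustration of working with these columns, the sketch below streams rows from a CSV and keeps the identifiers found in the most repositories. It assumes the file starts with a header row naming the five columns above. The sample rows are invented for the example and are not taken from the dataset, and the space-separated `token_split` values in them are a guess at the layout. The helper names `read_identifiers` and `top_by_repos` are hypothetical.

```python
import csv
import heapq
import io
from typing import Dict, Iterable, Iterator, List

# The three count columns described in the Format section.
COUNT_COLUMNS = ("num_files", "num_occ", "num_repos")


def read_identifiers(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row, casting the count columns to int.

    Assumes the first line is a header with the column names.
    """
    for row in csv.DictReader(lines):
        for col in COUNT_COLUMNS:
            row[col] = int(row[col])
        yield row


def top_by_repos(rows: Iterable[Dict[str, object]], n: int) -> List[Dict[str, object]]:
    """Return the n rows with the highest num_repos, without loading all rows."""
    return heapq.nlargest(n, rows, key=lambda r: r["num_repos"])


# Invented sample rows, for illustration only (not real dataset values).
SAMPLE = """num_files,num_occ,num_repos,token,token_split
120,950,40,getValue,get value
15,30,7,parse_http_header,parse http header
300,4100,95,i,i
"""

if __name__ == "__main__":
    for row in top_by_repos(read_identifiers(io.StringIO(SAMPLE)), 2):
        print(row["token"], row["num_repos"])
```

To run this against a downloaded file instead of the sample, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the CSV. Because the rows are streamed and only the top `n` are retained, memory use stays small even with tens of millions of identifiers.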

Code examples

  • Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.

License

Open Data Commons Open Database License (ODbL)