GitHub Pull Request Review Comments (size: 1.5 GB)

Download link.

25.3 million pull request review comments posted on GitHub from January 2015 through December 2018.

Format

xz-compressed CSV, with columns:

  • COMMENT_ID - identifier of the comment in the parent dataset, GH Archive
  • COMMIT_ID - commit hash to which the review comment is attached
  • URL - path to the GitHub pull request the comment comes from
  • AUTHOR - GitHub username of the comment's author
  • CREATED_AT - creation date of the comment
  • BODY - raw content of the comment
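
As a quick sanity check before processing the full file, the column layout above can be verified against a CSV header. This is a minimal sketch: it assumes the header uses exactly the names listed above, and `missing_columns` as well as the sample row are hypothetical, made up for illustration.

```python
import csv
import io

# Column names as listed above; assumed to match the CSV header exactly.
EXPECTED_COLUMNS = ("COMMENT_ID", "COMMIT_ID", "URL", "AUTHOR", "CREATED_AT", "BODY")


def missing_columns(fieldnames):
    """Return the expected columns that are absent from a CSV header."""
    present = set(fieldnames or ())
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Tiny in-memory sample with made-up values, standing in for the real file.
sample = io.StringIO(
    "COMMENT_ID,COMMIT_ID,URL,AUTHOR,CREATED_AT,BODY\n"
    '1,abc123,example/repo/pull/1,alice,2015-01-01,"Looks good, thanks"\n'
)
reader = csv.DictReader(sample)
print(missing_columns(reader.fieldnames))  # an empty list means the header is complete
first = next(reader)
print(first["AUTHOR"], first["BODY"])
```

The same check works on the real archive by passing the `fieldnames` of a `csv.DictReader` opened on the decompressed stream.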

Sample code

Python:

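# The file is too big to load in one go with pandas.read_csv, so stream it row by row.
import csv
import lzma

# Text mode decodes UTF-8 on the fly; newline="" lets the csv module handle line breaks inside quoted fields.
with lzma.open("review_comments.csv.xz", "rt", encoding="utf-8", newline="") as archf:
    reader = csv.DictReader(archf)
    for record in reader:
        # Each record is a dict keyed by the column names listed above.
        print(record)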
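
Since the file does not fit comfortably in memory, aggregates are easiest to compute while streaming. The sketch below counts comments per author in a single pass. `count_comments_by_author` is a hypothetical helper, not part of the dataset tooling, and the demo runs on a small made-up sample compressed in memory rather than on the real archive.

```python
import csv
import io
import lzma
from collections import Counter


def count_comments_by_author(source):
    """Stream an xz-compressed CSV (path or binary file object) and count rows per AUTHOR."""
    counts = Counter()
    with lzma.open(source, "rt", encoding="utf-8", newline="") as archf:
        for record in csv.DictReader(archf):
            counts[record["AUTHOR"]] += 1
    return counts


# Made-up sample rows, compressed in memory to stand in for review_comments.csv.xz.
raw = (
    "COMMENT_ID,COMMIT_ID,URL,AUTHOR,CREATED_AT,BODY\n"
    "1,abc123,example/repo/pull/1,alice,2015-01-01,first\n"
    "2,abc123,example/repo/pull/1,bob,2015-01-02,second\n"
    "3,def456,example/repo/pull/2,alice,2015-01-03,third\n"
).encode("utf-8")
top = count_comments_by_author(io.BytesIO(lzma.compress(raw))).most_common(1)
print(top)
```

On the real data, pass the file name instead, for example `count_comments_by_author("review_comments.csv.xz")`. Only the counter is held in memory, never the rows themselves.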

Origin

The dataset was generated from GH Archive in the following notebook. Comments longer than Python's csv.field_size_limit of 128 KB (the default) were discarded, which affected about 10 comments.
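
To illustrate the size rule, here is a rough sketch of a filter that drops records with an oversized field. It is not the notebook's actual logic: `fits_field_limit` and `FIELD_LIMIT` are hypothetical names, and the limit is hard-coded to the 128 KB quoted above and counted in characters.

```python
# 128 KB, the limit quoted above; hard-coded here instead of read from the csv module.
FIELD_LIMIT = 128 * 1024


def fits_field_limit(record, limit=FIELD_LIMIT):
    """Return True if every field in the record is at most `limit` characters long."""
    return all(len(value) <= limit for value in record.values())


# Made-up records: one ordinary comment and one that exceeds the limit by a single character.
records = [
    {"COMMENT_ID": "1", "BODY": "short comment"},
    {"COMMENT_ID": "2", "BODY": "x" * (FIELD_LIMIT + 1)},
]
kept = [r for r in records if fits_field_limit(r)]
print(len(kept))
```

Because oversized comments were already removed, the published file can be read with the csv module's default settings, as in the sample code.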

We gathered some statistics about the dataset.

License

Open Data Commons Open Database License (ODbL)