GitSchemas (Permissive Licenses)
This repository contains scripts to crawl SQL files from GitHub, parse them, and extract structured database schema information from them. We do this because we want to learn about the semantics of database tables in the wild (table names, column names, foreign key relations, etc.).
Disclaimer: The dataset was created with research problems in the area of large-scale data integration in mind, such as foreign key detection and CSV header detection. For other use cases, we strongly encourage users to revisit the data collection, parsing, and extraction methods and align them with their concrete research objective.
Samples containing only data from repos with a permissive license (e.g., MIT, Apache-2.0) are available for direct download below. We can provide other datasets for reproduction purposes upon request.
- GitSchemas Raw SQL File URLs (CSV)
- GitHub SQL Files Dataset (ZIP)
- GitSchemas (JSON; Full Dataset)
- Foreign Key Detection Training Dataset (CSV, Parquet)
The GitHub SQL Crawler was built using the GitHub Search API. We search GitHub for SQL code that contains a CREATE TABLE and a FOREIGN KEY statement to make sure we retrieve SQL scripts which actually define a schema.
The GitHub Search index contains ~7.8M SQL files, of which ~0.8M match the "CREATE TABLE FOREIGN KEY" keywords and ~0.7M could be downloaded successfully (see table below).
Query | # Search Results (Nov. 2021) |
---|---|
language: SQL | ~7.8M |
keywords: "CREATE TABLE FOREIGN KEY" language: SQL | ~0.8M |
keywords: "CREATE TABLE FOREIGN KEY" language: SQL (successfully downloaded) | ~0.7M |
keywords: "CREATE TABLE FOREIGN KEY" language: SQL (deduplicated) | ~0.37M |
The file sizes of all SQL scripts containing the desired keywords are distributed as follows:
It is interesting to note that there are no files larger than 400 KB. Possibly, GitHub does not perform language detection on files above that size.
The crawler performs the following three steps (a simplified sketch follows the list):
- (1) crawl a list of URLs
- (2) download files based on the list of URLs
- (3) deduplicate the downloaded files based on their sha256 hash
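A minimal sketch of these three steps in Python might look as follows. The query string, token handling, file naming, and output paths are assumptions for illustration; the actual crawler additionally has to deal with pagination, rate limits, and the result-count restrictions of the GitHub Search API.

```python
import hashlib
import os
import time

import requests

HEADERS = {
    "Authorization": "token <GITHUB_TOKEN>",   # personal access token (placeholder)
    "Accept": "application/vnd.github+json",
}

def crawl_url_list(max_pages=10):
    """Step (1): collect raw file URLs via the GitHub code search API."""
    urls = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/code",
            params={"q": '"CREATE TABLE" "FOREIGN KEY" language:SQL',
                    "per_page": 100, "page": page},
            headers=HEADERS,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            # store the blob URL with ?raw=true so the file can be fetched directly
            urls.append(item["html_url"] + "?raw=true")
        time.sleep(6)  # stay below the code-search rate limit
    return urls

def download_files(urls, out_dir="sql_files"):
    """Step (2): download every file on the URL list."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        resp = requests.get(url)
        if resp.ok:
            with open(os.path.join(out_dir, f"{i:06d}.sql"), "wb") as f:
                f.write(resp.content)

def deduplicate(out_dir="sql_files"):
    """Step (3): remove files whose sha256 hash has already been seen."""
    seen = set()
    for name in sorted(os.listdir(out_dir)):
        path = os.path.join(out_dir, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
        else:
            seen.add(digest)
```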
To facilitate downstream use cases, we want to extract structured schema information from the crawled SQL files: table names, column names, primary keys, and foreign keys, including their reference table and reference column names. However, the extraction is not trivial, because the crawled SQL files may use different SQL dialects, contain comments, be incomplete, and/or contain syntax errors.
We tried different parsing options available in Python, including the libraries sqlparse (non-validating), mysqlparse (MySQL), pglast (Postgres), and queryparser (MySQL/Postgres). Ultimately, we found pglast to provide the best tradeoff between convenience and parsing success (~16%), leaving us with a total of 61,038 schemas (see table below).
Description | # of files |
---|---|
All SQL files | 373,156 |
Parsable with pglast | 61,038 (16.36%) |
pglast extracts an abstract syntax tree (AST) from the SQL script, which we ultimately translate to JSON to facilitate further analysis.
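As an illustration of this step (a simplified sketch, not the actual extraction code of this repository), CREATE TABLE statements can be translated into the JSON structure shown below roughly as follows. The attribute names of the typed AST follow recent pglast releases and may differ between versions (e.g., `.sval` vs. `.str` on string nodes):

```python
from pglast import ast, parse_sql
from pglast.enums import ConstrType

def extract_tables(sql_text):
    """Map CREATE TABLE statements to the TABLES structure used in the dataset."""
    tables = {}
    for raw in parse_sql(sql_text):
        stmt = raw.stmt
        if not isinstance(stmt, ast.CreateStmt):
            continue
        columns, pks, fks = [], [], []
        for elt in stmt.tableElts or ():
            if isinstance(elt, ast.ColumnDef):
                # e.g. ('pg_catalog', 'varchar') -> 'varchar'
                columns.append([elt.colname, elt.typeName.names[-1].sval])
                for con in elt.constraints or ():
                    if con.contype == ConstrType.CONSTR_PRIMARY:
                        pks.append(elt.colname)
            elif isinstance(elt, ast.Constraint):
                if elt.contype == ConstrType.CONSTR_PRIMARY:
                    pks.extend(k.sval for k in elt.keys)
                elif elt.contype == ConstrType.CONSTR_FOREIGN:
                    fks.append({
                        "FOREIGN_KEY": [a.sval for a in elt.fk_attrs or ()],
                        "REFERENCE_TABLE": elt.pktable.relname,
                        "REFERENCE_COLUMN": [a.sval for a in elt.pk_attrs or ()],
                    })
        tables[stmt.relation.relname] = {
            "COLUMNS": columns, "PRIMARY_KEYS": pks, "FOREIGN_KEYS": fks}
    return tables

print(extract_tables("CREATE TABLE state (id serial PRIMARY KEY, state_name varchar);"))
```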
Note: There should be quite some room for improvement in terms of the parsing success rate. For example, MySQL-style backtick quotes are incompatible with the Postgres parser and currently lead to an immediate error. While something like this might be easy to solve with a search-and-replace, other issues are more intricate. There are many different SQL dialects, and an open-source parser is not available for every dialect, so simply trying all possible parsers/dialects does not seem feasible. Creating a robust SQL parser that can extract schema information from SQL files, without knowing which particular database system the query was written for, could be an interesting project in and of itself. Any useful pointers regarding this are highly appreciated.
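For example, a crude (and by no means dialect-aware) fallback could retry parsing with backticks naively rewritten to standard double quotes; this is only a heuristic sketch and may of course corrupt backticks inside string literals:

```python
from pglast import parse_sql
from pglast.parser import ParseError

def parse_with_backtick_fallback(sql_text):
    """Try to parse as-is; on failure, retry with MySQL-style backtick
    quotes naively replaced by standard SQL double quotes."""
    try:
        return parse_sql(sql_text)
    except ParseError:
        return parse_sql(sql_text.replace("`", '"'))
```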
The parsing step results in a JSON file which looks as follows; a sample dataset can be downloaded (see the Download Links section):
{'schema_000001':
{'INFO': {
'user': 'user_xxx',
'url': 'https://github.com/2212khushboo/Project/blob/5a6746d../db/sql/tables.sql?raw=true\n',
'filename': 'tables.sql',
'project': 'Project',
'filesize': 2885},
'TABLES': {
'state': {
'COLUMNS': [
['id', 'serial'],
['uuid', 'varchar'],
['state_name', 'varchar']],
'PRIMARY_KEYS': ['id'],
'FOREIGN_KEYS': []},
'address': {
'COLUMNS': [
['id', 'serial'],
['uuid', 'varchar'],
['flat_buil_number', 'varchar'],
['locality', 'varchar'],
['city', 'varchar'],
['pincode', 'varchar'],
['state_id', 'int4']],
'PRIMARY_KEYS': ['id'],
'FOREIGN_KEYS': [{
'FOREIGN_KEY': ['state_id'],
'REFERENCE_TABLE': 'state',
'REFERENCE_COLUMN': ['id']}]}}},
'schema_000002': {....}
}
The following table gives an overview of some properties of the dataset (a short snippet for recomputing such statistics from the JSON file follows the table).
Entity | # |
---|---|
schemas | 61,038 |
tables | 393,653 |
columns | 2,544,164 |
unique column names | 303,443 |
primary keys | 322,627 |
unique primary key names | 31,599 |
foreign keys | 175,589 |
unique foreign key names | 31,041 |
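Statistics like the ones above can be recomputed from the JSON file with a few lines of Python; the file name below is a placeholder for whichever JSON sample was downloaded:

```python
import json

with open("gitschemas_sample.json") as f:          # placeholder file name
    schemas = json.load(f)

tables = [t for s in schemas.values() for t in s["TABLES"].values()]
columns = [c for t in tables for c in t["COLUMNS"]]

print("schemas:            ", len(schemas))
print("tables:             ", len(tables))
print("columns:            ", len(columns))
print("unique column names:", len({name for name, _ in columns}))
print("primary keys:       ", sum(len(t["PRIMARY_KEYS"]) for t in tables))
print("foreign keys:       ", sum(len(t["FOREIGN_KEYS"]) for t in tables))
```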
Apart from the raw JSON data, we derived a tabular dataset specifically for the foreign key detection problem. The dataset contains only those tables that have a non-composite foreign key relation to another table. For tables with multiple foreign key relations, we added one row per relation. It has the following schema, and a sample can be downloaded as a CSV or Parquet file (see Download Links section); a sketch of the flattening step follows the table:
schema | table_name_a | table_name_b | columns_a | columns_b | primary_keys_a | primary_keys_b | key_a | key_b |
---|---|---|---|---|---|---|---|---|
000361_xxxx.sql | event | resourcetbl | planno, lineno, locno, resno, timestart, time... | id | id | resno | resno | |
000376_yyyy.derby.sql | category | db_category | id, name, parent_id | id, name, parent_id | id | id | parent_id | id |
Total rows: 124,605
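A rough sketch of this flattening step, assuming the JSON structure from above (the composite-key filtering and the handling of references to tables outside the same file are simplified here, and the file names are placeholders):

```python
import json

import pandas as pd

def fk_rows(schemas):
    """Yield one row per non-composite foreign key relation."""
    for schema_id, schema in schemas.items():
        tables = schema["TABLES"]
        for table_a, info_a in tables.items():
            for fk in info_a["FOREIGN_KEYS"]:
                if len(fk["FOREIGN_KEY"]) != 1:      # skip composite foreign keys
                    continue
                info_b = tables.get(fk["REFERENCE_TABLE"])
                if info_b is None:                   # referenced table not defined in this file
                    continue
                yield {
                    "schema": schema_id,
                    "table_name_a": table_a,
                    "table_name_b": fk["REFERENCE_TABLE"],
                    "columns_a": ", ".join(c[0] for c in info_a["COLUMNS"]),
                    "columns_b": ", ".join(c[0] for c in info_b["COLUMNS"]),
                    "primary_keys_a": ", ".join(info_a["PRIMARY_KEYS"]),
                    "primary_keys_b": ", ".join(info_b["PRIMARY_KEYS"]),
                    "key_a": fk["FOREIGN_KEY"][0],
                    "key_b": fk["REFERENCE_COLUMN"][0],
                }

with open("gitschemas_sample.json") as f:            # placeholder file name
    schemas = json.load(f)

df = pd.DataFrame(fk_rows(schemas))
df.to_parquet("fk_dataset.parquet")                  # or df.to_csv("fk_dataset.csv")
```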
📄 GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
@inproceedings{gitschemas2022,
author = {D{\"o}hmen, Till and Hulsebos, Madelon and Beecks, Christian and Schelter, Sebastian},
title = {GitSchemas: A Dataset for Automating Relational Data Preparation Tasks},
year = {2022},
maintitle = {ICDE 2022 – 38th IEEE International Conference on Data Engineering},
booktitle = {Workshop on Databases and Machine Learning (DBML)},
organization={IEEE}
}