httdty/REDSQL_VLDB

REDSQL

This README explains how to set up the environment, organize the datasets, and run REDSQL.

Dataset Structure

The dataset directory (referred to as ./datasets in the commands below) is organized as follows:

.
├── bird
│   ├── database             # Database directory
│   ├── dev_annotation.json  # Generated annotations
│   ├── dev.json             # Development set input
│   ├── dev.sql              # Ground-truth SQL queries
│   └── dev_tables.json      # Database schema
├── spider
│   ├── database
│   ├── dev_annotation.json
│   ├── dev_gold.sql
│   ├── dev.json
│   └── dev_tables.json
├── preds
│   └── Predicted_SQLs       # SQL predictions from baseline methods (e.g., PURPLE, CodeS)
...

Dataset Components

  • bird: Contains the BIRD dataset files: the databases, annotations, development set, ground-truth SQL queries, and schema information.
  • spider: Contains the Spider dataset files, organized like BIRD except that the ground-truth queries are in dev_gold.sql.
  • preds: Contains SQL predictions from various baseline methods.
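
As a quick sanity check, the layout above can be verified before running anything. The following is a minimal sketch, not part of REDSQL: the function name `missing_dataset_paths` and the choice of `./datasets` as the root are assumptions made for illustration, and the expected entries are copied from the directory tree above.

```python
from pathlib import Path

# Expected layout, copied from the directory tree in this README.
EXPECTED = {
    "bird": ["database", "dev_annotation.json", "dev.json", "dev.sql", "dev_tables.json"],
    "spider": ["database", "dev_annotation.json", "dev_gold.sql", "dev.json", "dev_tables.json"],
    "preds": [],  # only the folder itself is checked
}


def missing_dataset_paths(root):
    """Return the expected paths (relative to `root`) that do not exist."""
    root = Path(root)
    missing = []
    for folder, entries in EXPECTED.items():
        if not (root / folder).is_dir():
            missing.append(folder)
            continue
        for entry in entries:
            if not (root / folder / entry).exists():
                missing.append(f"{folder}/{entry}")
    return missing


if __name__ == "__main__":
    # "./datasets" is an assumed root; adjust it to your local layout.
    for path in missing_dataset_paths("./datasets"):
        print("missing:", path)
```

An empty result means every file and folder listed in the tree was found.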

Environment Setup

1. System Requirements

sudo apt-get update
sudo apt-get install -y openjdk-11-jdk

2. Create Conda Environment

conda create -n red python=3.9
conda activate red

3. Install Dependencies

# Install PyTorch
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

# Install NMSLib
conda install -c conda-forge nmslib

# Install remaining requirements
pip install -r requirements.txt
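
After installation, it can help to confirm that the packages named above are visible inside the `red` environment. This is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and packages pulled in by requirements.txt are not checked because that file's contents are not listed here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules installed by the conda commands above.
    missing = missing_modules(["torch", "torchvision", "torchaudio", "nmslib"])
    print("all found" if not missing else f"missing: {missing}")
```

`find_spec` only locates a module without importing it, so the check does not confirm that CUDA or the native NMSLib extension loads correctly.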

Usage Instructions

1. Directory Setup

mkdir output logs

2. Build Value Search Index

python -m pre_processing.build_contents_index \
    --output_dir=./index/bird/db_contents_index/ \
    --db_dir=./datasets/bird/database/
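
To index more than one dataset, the same command can be generated in a loop. The sketch below is an illustration only: `index_command` and `build_indexes` are hypothetical helpers, the Spider paths are assumed to mirror the BIRD ones, and `db_subdir` defaults to `database` as in the directory tree above, so pass a different value if your database folder is named differently.

```python
import shlex
import subprocess
import sys


def index_command(dataset, db_subdir="database"):
    """Build the argv list for indexing one dataset's database contents."""
    return [
        sys.executable, "-m", "pre_processing.build_contents_index",
        f"--output_dir=./index/{dataset}/db_contents_index/",
        f"--db_dir=./datasets/{dataset}/{db_subdir}/",
    ]


def build_indexes(datasets, dry_run=True):
    """Print each indexing command; run it as well when dry_run is False."""
    for dataset in datasets:
        cmd = index_command(dataset)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    build_indexes(["bird", "spider"])
```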

3. Generate Annotations (Optional)

Note: This step can be skipped as we provide pre-generated annotations for the following datasets:

  • BIRD
  • Science
  • Spider

These annotations are available in our open-source repository.

If you need to generate annotations for a custom dataset, use the following command:

python -m pre_processing.doc \
    --model_name=gpt-4o-2024-08-06 \
    --output_file=./annotation.json \
    --table_file=./datasets/bird/dev_tables.json \
    --db_dir=./datasets/bird/database/

4. Run REDSQL

python -m main.run \
    --model_name=model_name \
    --batch_size=2 \
    --exp_name=exp_name \
    --bug_fix \
    --consistency_num=30 \
    --stage=dev \
    --preds=/path/to/predicted/sql.txt \
    --db_content_index_path=/path/to/db/content/index \
    --annotation=/path/to/dev_annotation.json \
    --output_dir=./output \
    --dev_file=/path/to/dev.json \
    --table_file=/path/to/dev_tables.json \
    --db_dir=/path/to/database
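
When running several experiments, it may be convenient to assemble this command from a dictionary rather than editing it by hand. The following sketch assumes nothing about REDSQL beyond the flags shown above; `redsql_command` is a hypothetical helper and the values are the same placeholders used in the command.

```python
import shlex
import sys


def redsql_command(options, flags=()):
    """Turn {"name": value} options and bare flags into an argv list."""
    argv = [sys.executable, "-m", "main.run"]
    argv += [f"--{name}={value}" for name, value in options.items()]
    argv += [f"--{flag}" for flag in flags]
    return argv


if __name__ == "__main__":
    # Placeholder values, as in the command above; replace them before use.
    options = {
        "model_name": "model_name",
        "batch_size": 2,
        "exp_name": "exp_name",
        "consistency_num": 30,
        "stage": "dev",
        "preds": "/path/to/predicted/sql.txt",
        "db_content_index_path": "/path/to/db/content/index",
        "annotation": "/path/to/dev_annotation.json",
        "output_dir": "./output",
        "dev_file": "/path/to/dev.json",
        "table_file": "/path/to/dev_tables.json",
        "db_dir": "/path/to/database",
    }
    # Prints the command; pass the list to subprocess.run to execute it.
    print(shlex.join(redsql_command(options, flags=["bug_fix"])))
```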

Command Line Arguments

| Argument | Description |
| --- | --- |
| --model_name | Name of the language model to use |
| --batch_size | Batch size for processing (default: 2) |
| --exp_name | Name of the experiment |
| --bug_fix | Enable bug fixing functionality |
| --bug_only | Only fix SQL when errors are detected |
| --consistency_num | Number of consistency checks (default: 30) |
| --stage | Processing stage (e.g., 'dev') |
| --preds | Path to predicted SQL statements |
| --db_content_index_path | Path to database content index |
| --annotation | Path to annotation file |
| --output_dir | Directory for output files |
| --dev_file | Path to development set file |
| --table_file | Path to table schema file |
| --db_dir | Path to database directory |

Note: Ensure all required files are in place and paths are correctly configured before running the commands.
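
One way to follow this note is a pre-flight check over the path-valued arguments. The sketch below is an assumption-laden illustration rather than REDSQL code: `missing_inputs` is a hypothetical helper, the list of input arguments is taken from the argument list above, and only arguments that are actually supplied get checked.

```python
from pathlib import Path

# Path-valued input arguments from the argument list above.
# --output_dir is left out because it is created in the Directory Setup step.
INPUT_ARGS = (
    "preds", "db_content_index_path", "annotation",
    "dev_file", "table_file", "db_dir",
)


def missing_inputs(options):
    """Return {argument: path} for each supplied input path that does not exist."""
    return {
        name: options[name]
        for name in INPUT_ARGS
        if name in options and not Path(options[name]).exists()
    }


if __name__ == "__main__":
    # Placeholder paths; replace them with the values passed to main.run.
    problems = missing_inputs({
        "dev_file": "/path/to/dev.json",
        "table_file": "/path/to/dev_tables.json",
    })
    for name, path in problems.items():
        print(f"--{name}: {path} does not exist")
```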
