HasanAlpCaferoglu/SING-SQL


SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

Overview

SING-SQL is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations.

The framework:

  • Partitions the database schema hierarchically into table-level and column-level sub-schemas.
  • Applies a quality-aware SQL–Text generation pipeline that incorporates complexity-controlled SQL synthesis, LLM-as-a-judge validation, executability checks, automatic SQL repair, and reasoning trace generation.
  • Balances schema coverage with a column-focused generation step to include underrepresented attributes.
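
As a rough illustration of the first two steps above, the sketch below enumerates table-level and column-level sub-schemas from a toy schema and checks whether a candidate query executes against SQLite. The `TOY_SCHEMA` dictionary, the function names, the combination-based table partitioning, and the sliding column window are all assumptions made for illustration; they are not taken from the SING-SQL code base.

```python
import itertools
import sqlite3

# Hypothetical toy schema: table name -> list of column names.
TOY_SCHEMA = {
    "schools": ["CDSCode", "School", "County"],
    "satscores": ["cds", "NumTstTakr", "AvgScrMath"],
}


def table_level_subschemas(schema, max_tables=2):
    """Yield every combination of 1..max_tables tables as a sub-schema."""
    tables = sorted(schema)
    for k in range(1, max_tables + 1):
        for combo in itertools.combinations(tables, k):
            yield {table: schema[table] for table in combo}


def column_level_subschemas(sub_schema, window=2):
    """Slide a fixed-size window over the columns of each table."""
    for table, columns in sub_schema.items():
        # At least one window per table, even if it has fewer columns than `window`.
        for start in range(max(len(columns) - window + 1, 1)):
            yield {table: columns[start:start + window]}


def is_executable(conn, sql):
    """Return True if the query runs without raising a SQLite error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schools (CDSCode TEXT, School TEXT, County TEXT)")
    for sub in table_level_subschemas(TOY_SCHEMA):
        print(sorted(sub))
    print(is_executable(conn, "SELECT COUNT(*) FROM schools"))
```

In a pipeline of this kind, queries that fail the executability check would typically be passed to a repair step rather than discarded outright; that step is omitted here.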

Main Results

Comparison of system performance on the California Schools subset of the BIRD development benchmark

Comparison of system performance (Candidate SQL count is 8) on the Synthetic California Schools Development and Test Sets of SING Framework

Data Statistics

SING-SQL produces datasets with comprehensive schema coverage and broad SQL diversity.
Below, we compare question counts, join complexity, and aggregation usage across BIRD and our synthetic splits.

Question Count Comparison

Question counts per difficulty level for the California Schools database in BIRD-Dev and synthetic splits:

| Dataset | Overall | Simple | Moderate | Challenging | Window |
|---|---|---|---|---|---|
| BIRD-Dev | 89 | 54 | 30 | 5 | 2 |
| Synthetic Train | 34,266 | 8,685 | 8,556 | 8,046 | 9,286 |
| Synthetic Dev | 1,124 | 297 | 259 | 259 | 319 |
| Synthetic Test | 1,124 | 299 | 248 | 248 | 340 |

Join Count Comparison

The average number of joins per SQL query across difficulty levels.
This demonstrates that SING-SQL captures more complex relational reasoning patterns compared to BIRD.

Aggregation Comparison

The proportion of SQL queries with aggregation operators across levels.
Our synthetic dataset provides richer coverage of aggregations at moderate, challenging, and window levels.
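
The two statistics in this section, average joins per query and the share of queries that use aggregation, can be approximated with simple keyword matching. The sketch below is one assumed way to compute them, not the script used to produce the reported figures; the function names and the sample queries are hypothetical.

```python
import re

# Matches JOIN as a whole word (INNER JOIN, LEFT JOIN, ... each count once).
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)
# Matches the standard SQL aggregate functions followed by an opening parenthesis.
AGG_RE = re.compile(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", re.IGNORECASE)


def average_joins(queries):
    """Average number of JOIN keywords per query."""
    if not queries:
        return 0.0
    return sum(len(JOIN_RE.findall(q)) for q in queries) / len(queries)


def aggregation_ratio(queries):
    """Fraction of queries containing at least one aggregate function call."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if AGG_RE.search(q)) / len(queries)


# Hypothetical sample queries, for illustration only.
SAMPLE_QUERIES = [
    "SELECT COUNT(*) FROM schools s JOIN satscores t ON s.CDSCode = t.cds",
    "SELECT s.School FROM schools s JOIN satscores t ON s.CDSCode = t.cds "
    "LEFT JOIN frpm f ON f.CDSCode = s.CDSCode",
    "SELECT School FROM schools WHERE County = 'Alameda'",
]

if __name__ == "__main__":
    print(average_joins(SAMPLE_QUERIES))
    print(aggregation_ratio(SAMPLE_QUERIES))
```

Keyword matching also counts occurrences inside string literals or comments, so a SQL parser would be the more robust choice for anything beyond a quick estimate.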

Models

We release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data for the California Schools database, achieving state-of-the-art in-domain performance.

| Model | Specialized Database | Base Model | Train Method | HuggingFace |
|---|---|---|---|---|
| SingSQL-LM-1.5B-R32_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-1.5B-R64_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R32_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R64_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |

Dataset

  • The synthetic data generated using SING-SQL is available for the California Schools Database.
  • It includes train, dev, and test splits with full schema coverage and balanced complexity levels.

| Dataset | HuggingFace |
|---|---|
| California Schools | 🤗 HuggingFace |

Citation

If you find this repository helpful, please cite the following paper:

@misc{caferoğlu2025singsqlsyntheticdatageneration,
      title={SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation}, 
      author={Hasan Alp Caferoğlu and Mehmet Serhat Çelik and Özgür Ulusoy},
      year={2025},
      eprint={2509.25672},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.25672}, 
}
