SING-SQL is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations.
The framework:
- Partitions the database schema hierarchically into table-level and column-level sub-schemas.
- Runs a quality-aware SQL–Text generation pipeline that incorporates complexity-controlled SQL synthesis, LLM-as-a-judge validation, executability checks, automatic SQL repair, and reasoning trace generation.
- Balances schema coverage with a column-focused generation step to include underrepresented attributes.
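
As a rough illustration of two of the steps above, the sketch below enumerates table-level and column-level sub-schemas from a toy schema and runs a basic executability check against SQLite. The function names, the toy schema, and the partitioning strategy (table combinations up to a fixed size, fixed-size column chunks) are assumptions made for illustration; they are not taken from the SING-SQL codebase.

```python
import sqlite3
from itertools import combinations

# Toy schema (hypothetical): table name -> column names.
SCHEMA = {
    "schools": ["CDSCode", "County", "District"],
    "satscores": ["cds", "NumTstTakr", "AvgScrMath"],
    "frpm": ["CDSCode", "Enrollment"],
}


def table_level_subschemas(schema, max_tables=2):
    """Enumerate every group of 1..max_tables tables (assumed strategy)."""
    names = sorted(schema)
    return [
        combo
        for size in range(1, max_tables + 1)
        for combo in combinations(names, size)
    ]


def column_level_subschemas(schema, tables, window=2):
    """Split each table's columns into chunks of `window` columns (assumed strategy)."""
    subs = []
    for table in tables:
        cols = schema[table]
        for start in range(0, len(cols), window):
            subs.append((table, cols[start:start + window]))
    return subs


def build_empty_db(schema):
    """Create an in-memory SQLite database with empty tables matching the schema."""
    conn = sqlite3.connect(":memory:")
    for table, cols in schema.items():
        cols_sql = ", ".join(f'"{c}" TEXT' for c in cols)
        conn.execute(f'CREATE TABLE "{table}" ({cols_sql})')
    return conn


def is_executable(conn, sql):
    """Return (True, None) if the query runs, else (False, error message)."""
    try:
        conn.execute(sql).fetchall()
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)


if __name__ == "__main__":
    conn = build_empty_db(SCHEMA)
    for tables in table_level_subschemas(SCHEMA):
        print(tables, column_level_subschemas(SCHEMA, tables))
    print(is_executable(conn, "SELECT County FROM schools"))
    print(is_executable(conn, "SELECT nope FROM schools"))
```

In a pipeline like the one described, a query that fails this check would presumably be passed to the repair step together with the error message, which is why the sketch returns the message rather than only a boolean.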
Comparison of system performance (with a candidate SQL count of 8) on the synthetic California Schools development and test sets generated by the SING-SQL framework.
SING-SQL produces datasets with comprehensive schema coverage and broader SQL diversity.
Below, we highlight join complexity and aggregation usage across BIRD and our synthetic splits.
Question counts per difficulty level for the California Schools database in BIRD-Dev and synthetic splits:
| Dataset | Overall | Simple | Moderate | Challenging | Window |
|---|---|---|---|---|---|
| BIRD-Dev | 89 | 54 | 30 | 5 | 2 |
| Synthetic Train | 34,266 | 8,685 | 8,556 | 8,046 | 9,286 |
| Synthetic Dev | 1,124 | 297 | 259 | 259 | 319 |
| Synthetic Test | 1,124 | 299 | 248 | 248 | 340 |
The average number of joins per SQL query across difficulty levels.
This demonstrates that SING-SQL captures more complex relational reasoning patterns compared to BIRD.
The proportion of SQL queries with aggregation operators across levels.
Our synthetic dataset provides richer coverage of aggregations at moderate, challenging, and window levels.
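
Both statistics can be approximated directly from the SQL text. The sketch below is a minimal keyword-based heuristic, not the measurement procedure used in the paper: it counts `JOIN` keywords and checks for the standard aggregate functions, grouped by difficulty level. The function names and the sample queries are illustrative assumptions, and a keyword match can miscount queries that contain these words inside string literals.

```python
import re

# Heuristic patterns: whole-word JOIN, and aggregate function calls.
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)
AGG_RE = re.compile(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", re.IGNORECASE)


def join_count(sql):
    """Number of JOIN keywords in the query text."""
    return len(JOIN_RE.findall(sql))


def uses_aggregation(sql):
    """True if the query text calls at least one aggregate function."""
    return AGG_RE.search(sql) is not None


def summarize(examples):
    """Group (difficulty, sql) pairs and compute per-level statistics."""
    buckets = {}
    for difficulty, sql in examples:
        buckets.setdefault(difficulty, []).append(sql)
    return {
        difficulty: {
            "avg_joins": sum(join_count(q) for q in queries) / len(queries),
            "agg_ratio": sum(uses_aggregation(q) for q in queries) / len(queries),
        }
        for difficulty, queries in buckets.items()
    }


if __name__ == "__main__":
    # Hypothetical sample queries, not drawn from the released dataset.
    sample = [
        ("simple", "SELECT County FROM schools"),
        ("simple", "SELECT COUNT(*) FROM schools"),
        (
            "moderate",
            "SELECT s.County, AVG(t.AvgScrMath) FROM schools s "
            "JOIN satscores t ON s.CDSCode = t.cds GROUP BY s.County",
        ),
    ]
    for level, stats in summarize(sample).items():
        print(level, stats)
```

A SQL parser would give more reliable counts than regular expressions; the regex version is used here only to keep the example dependency-free.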
We released SingSQL-LM, a family of compact language models fine-tuned on the synthetic data generated for the California Schools database. The models achieve state-of-the-art in-domain performance.
| Model | Specialized Database | Base Model | Train Method | HuggingFace |
|---|---|---|---|---|
| SingSQL-LM-1.5B-R32_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-1.5B-R64_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R32_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R64_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |
- The synthetic data generated using SING-SQL is available for the California Schools Database.
- It includes train, dev, and test splits with full schema coverage and balanced complexity levels.
| Dataset | HuggingFace |
|---|---|
| California Schools | 🤗 HuggingFace |
If you find this repository helpful, please cite the following paper:
@misc{caferoğlu2025singsqlsyntheticdatageneration,
title={SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation},
author={Hasan Alp Caferoğlu and Mehmet Serhat Çelik and Özgür Ulusoy},
year={2025},
eprint={2509.25672},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.25672},
}




