BIRD Benchmark (Text-to-SQL) by redoctopus · Pull Request #1132 · NVIDIA-NeMo/Skills

redoctopus · 2025-12-18T18:43:16Z

Added BIRD's SQL execution accuracy metric to code generation benchmarks.
Also added some documentation on how to download/process data and run the benchmark.

Original dataset website: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
SQL execution from BIRD: https://github.com/AlibabaResearch/DAMO-ConvAI/blob/main/bird/llm/src/evaluation.py

Summary by CodeRabbit

New Features
- Added BIRD benchmark evaluation framework for text-to-SQL tasks with difficulty-categorized accuracy metrics and database validation testing.
Documentation
- Updated evaluation documentation with BIRD benchmark instructions and data preparation guides; added BIRD to code evaluation references.
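The difficulty-categorized accuracy listed above can be pictured with a minimal sketch. The function name `accuracy_by_difficulty` and the record keys `difficulty` and `res` are assumptions made for illustration; they are not taken from the PR's `bird_metrics.py`.

```python
from collections import defaultdict


def accuracy_by_difficulty(predictions):
    """Return percentage accuracy per difficulty level plus an overall total.

    Each prediction is a dict with hypothetical keys:
      "difficulty": "simple", "moderate", or "challenging"
      "res": 1 if the predicted SQL matched the ground truth, else 0
    """
    buckets = defaultdict(list)
    for p in predictions:
        buckets[p["difficulty"]].append(p["res"])

    report = {}
    for level in ("simple", "moderate", "challenging"):
        scores = buckets.get(level, [])
        # Guard against empty buckets to avoid division by zero.
        report[level] = 100.0 * sum(scores) / len(scores) if scores else 0.0

    all_scores = [s for scores in buckets.values() for s in scores]
    report["total_acc"] = (
        100.0 * sum(all_scores) / len(all_scores) if all_scores else 0.0
    )
    return report
```

The empty-bucket guard mirrors the concern behind the "Added checks for zero length lists in BIRD metrics" commit, though the actual fix may look different.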

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

gwarmstrong · 2025-12-19T22:51:00Z

docs/evaluation/code.md

+This will download and unpack a file into `<output_directory>/dev_20240627`, which contains the BIRD dev manifest, table information, and database schemas. By default, `output_directory` will be under `nemo_skills/dataset/birdbench/`, though this can be changed via command line argument.
+
+The script will also process the original manifest into `<output_directory>/dev.jsonl`, which will be the input for evaluation.
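As a rough illustration of the manifest-processing step described in this hunk, the sketch below converts a JSON-array manifest into JSONL. The input keys (`question`, `db_id`, `SQL`, `difficulty`) are assumed from the public BIRD dev manifest layout, and the output key names and the function name `manifest_to_jsonl` are hypothetical; the PR's `prepare.py` may differ.

```python
import json
from pathlib import Path


def manifest_to_jsonl(manifest_path, output_path):
    """Convert a JSON-array manifest into one JSON object per line.

    Returns the number of entries written.
    """
    entries = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    with open(output_path, "w", encoding="utf-8") as fout:
        for idx, entry in enumerate(entries):
            record = {
                # Hypothetical id so an evaluator can look up ground truth later.
                "id": idx,
                "question": entry["question"],
                "db_id": entry["db_id"],
                "gt_sql": entry["SQL"],
                "difficulty": entry.get("difficulty", "unknown"),
            }
            fout.write(json.dumps(record) + "\n")
    return len(entries)
```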


Is this needed now that we have `data_dir`? The rest of this is pretty consistent across evaluations.

greptile-apps · 2026-01-06T22:42:13Z

Greptile Summary

Adds BIRD benchmark support for text-to-SQL evaluation with SQL execution accuracy metrics. The implementation includes data preparation scripts that download and process the BIRD dataset, an evaluator that compares SQL execution results with a 30-second timeout, and metrics calculation grouped by difficulty levels (simple/moderate/challenging). The changes properly integrate with the existing evaluation framework by registering the new evaluator and metrics classes, and include comprehensive documentation for setup and usage.
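To make "compares SQL execution results with a 30-second timeout" concrete, here is a minimal sketch of execution accuracy with a deadline. It is not the PR's implementation: the timeout mechanism used in `bird.py` is not shown on this page, so this sketch swaps in `sqlite3`'s progress handler. The function names `run_with_deadline` and `execution_match` are hypothetical.

```python
import sqlite3
import time


def run_with_deadline(conn, sql, timeout_s):
    """Execute `sql`, aborting if it runs longer than `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    # The handler runs every 1000 SQLite VM instructions; a truthy return
    # value makes SQLite abort the running statement.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Clear the handler so later queries on this connection are unaffected.
        conn.set_progress_handler(None, 1)


def execution_match(conn, predicted_sql, ground_truth_sql, timeout_s=30.0):
    """Return 1 if both queries yield the same set of rows, else 0."""
    try:
        predicted = run_with_deadline(conn, predicted_sql, timeout_s)
    except sqlite3.Error:
        # Invalid SQL and timeouts both count as a miss.
        return 0
    expected = run_with_deadline(conn, ground_truth_sql, timeout_s)
    # Set comparison ignores row order and duplicate rows.
    return int(set(predicted) == set(expected))
```

Because rows are compared as sets, a prediction that returns the right rows in a different order still scores 1, which matches the "Compare sets" step in the sequence diagram.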

Confidence Score: 5/5

- This PR is safe to merge with minimal risk.
- The implementation is well-structured with proper error handling, timeout protection, MIT-licensed code attribution, and follows existing framework patterns. The SQL comment removal regex was already fixed in the latest commit, and all components integrate cleanly into the existing codebase.
- No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_skills/evaluation/evaluator/bird.py` | Adds BIRD SQL evaluator with proper licensing, comment removal, timeout handling, and SQL execution comparison |
| `nemo_skills/evaluation/metrics/bird_metrics.py` | Implements difficulty-categorized accuracy metrics for BIRD benchmark evaluation |
| `nemo_skills/dataset/birdbench/prepare.py` | Downloads BIRD data, extracts database schemas via `sqlite3.Connection.iterdump()`, and formats manifest |
| `nemo_skills/prompt/config/generic/text_to_sql.yaml` | Prompt template for text-to-SQL with step-by-step reasoning and SQL code block format |
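The schema-extraction step can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `prepare.py`: the function name `dump_schema` is hypothetical, and it keeps only `CREATE TABLE` statements, whereas the PR reportedly also keeps sample data.

```python
import sqlite3


def dump_schema(db_path):
    """Return the CREATE TABLE statements of a SQLite database as one string."""
    conn = sqlite3.connect(db_path)
    try:
        # iterdump() yields the SQL statements needed to recreate the
        # database (CREATE and INSERT statements among them).
        statements = [
            stmt
            for stmt in conn.iterdump()
            if stmt.upper().startswith("CREATE TABLE")
        ]
    finally:
        conn.close()
    return "\n".join(statements)
```

The resulting string is the kind of text that could be passed to the model as schema context alongside the question.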

Sequence Diagram

sequenceDiagram
    participant User
    participant ns_prepare_data
    participant Download as wget/zipfile
    participant SQLite as sqlite3
    participant ns_eval
    participant BirdEvaluator
    participant LLM as Model
    participant BirdMetrics
    
    User->>ns_prepare_data: prepare_data birdbench
    ns_prepare_data->>Download: Download dev.zip from bird-bench
    Download-->>ns_prepare_data: dev_20240627/
    ns_prepare_data->>SQLite: Extract table schemas via iterdump()
    SQLite-->>ns_prepare_data: Table definitions + sample data
    ns_prepare_data->>ns_prepare_data: Format entries to dev.jsonl
    ns_prepare_data-->>User: Data ready
    
    User->>ns_eval: eval --benchmarks=birdbench
    ns_eval->>LLM: Generate SQL from question + sql_context
    LLM-->>ns_eval: Generated SQL in ```sql blocks
    ns_eval->>BirdEvaluator: eval_single(data_point)
    BirdEvaluator->>BirdEvaluator: Extract SQL with regex
    BirdEvaluator->>BirdEvaluator: Remove comments (-- and /* */)
    BirdEvaluator->>SQLite: Execute predicted SQL
    SQLite-->>BirdEvaluator: predicted_results
    BirdEvaluator->>SQLite: Execute ground truth SQL
    SQLite-->>BirdEvaluator: ground_truth_results
    BirdEvaluator->>BirdEvaluator: Compare sets (timeout: 30s)
    BirdEvaluator-->>ns_eval: res = 0 or 1
    ns_eval->>BirdMetrics: update(predictions)
    BirdMetrics->>BirdMetrics: Group by difficulty (simple/moderate/challenging)
    BirdMetrics-->>User: Accuracy per difficulty + total_acc
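The "Extract SQL with regex" and "Remove comments" steps in the diagram could look roughly like the sketch below. The patterns and the function name `extract_sql` are assumptions for illustration; the regex that the PR settled on after the "Fix SQL comment regex" commit is not shown on this page. This naive version would also strip `--` or `/* */` sequences that appear inside string literals.

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
SQL_BLOCK_RE = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
BLOCK_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT_RE = re.compile(r"--[^\n]*")


def extract_sql(generation):
    """Return the last fenced sql block in a model response, comments stripped.

    Returns None when the response contains no fenced sql block.
    """
    blocks = SQL_BLOCK_RE.findall(generation)
    if not blocks:
        return None
    sql = blocks[-1]
    sql = BLOCK_COMMENT_RE.sub(" ", sql)
    sql = LINE_COMMENT_RE.sub("", sql)
    # Collapse whitespace so the query fits on a single line.
    return " ".join(sql.split())
```

Taking the last block is a design choice for responses that reason step by step and may show draft queries before the final answer.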

10 files reviewed, 1 comment

nemo_skills/evaluation/evaluator/bird.py

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com> # Conflicts: # nemo_skills/evaluation/evaluator/__init__.py

redoctopus added 29 commits December 18, 2025 10:42

Add BIRD benchmark preparation script

f98958f

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Add prompt config for text-to-sql (qwen3)

ac51852

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Moved text-to-SQL prompt config to generic/

5db1d9c

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

BIRD evaluation class draft

81970f1

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Draft eval/metrics for BIRD benchmark

92823ae

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Try different timeout method

b403c47

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Remove func_timeout from evaluation file (and trim other functions)

86b95e4

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Try fixing metrics arg for birdbench init file

98fa12e

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Commit of shame: minor typo correction

72d31bb

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

s/-/_/ in config path

1932ae6

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix config path for Bird eval init file

be8b619

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix BIRD eval config, update paths to data

7bef844

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Whoops load file before trying to parse it

6a98d4b

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Add id to data point to allow for ground truth comparison

cee066f

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Path to string for jsom serialization

e652fe6

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix SQL eval hanging on timeout

50bfcb1

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Remove unnecessary sort, fixed minor bugs in birdbench

4ba8729

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Debugging metrics

20cf10f

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Added checks for zero length lists in BIRD metrics

dd9adb1

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix BIRD metrics update step and reporting

044261e

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

BIRD eval printing

e290cc6

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Cleaning up some loose code (birdbench)

a69e895

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Ruff style fixes

588fd86

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Add BIRD benchmark documentation

1ad1ddb

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Remove dockerfile install in favor of commandline arg

ecde59a

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix style issues

cf0e0d0

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Refactor to avoid re-loading files during evaluation (BIRD), plus coderabbit fixes

37ebe15

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Add pred to data dict, fix sqlite3 call

d3b8fbf

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Lint fixes

996ea0d

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

redoctopus requested a review from gwarmstrong December 18, 2025 18:44

gwarmstrong assigned redoctopus Dec 18, 2025

redoctopus added 6 commits December 18, 2025 16:39

Add check for db_dir, move evaluation function, remove superfluous eval_full() in BIRD bench

490efe6

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Lint fixes

4356c00

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Whitespace massaging to appease the linter

8cf5373

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Remove manual check for BIRD eval file param

ac39390

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Remove db_dir arg in favor of relative path

21be22c

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Clean up unused arg

66aace8

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

gwarmstrong reviewed Dec 19, 2025

docs/evaluation/code.md (outdated, resolved)

gwarmstrong reviewed Dec 19, 2025

gwarmstrong added the run GPU tests label Dec 19, 2025

redoctopus added 3 commits December 19, 2025 15:41

BIRD: remove output dir arg in favor of data_dir

f6cfae2

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Whitespace removal

6d4daec

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Merge branch 'main' into birdbench

1a259df

greptile-apps bot reviewed Jan 6, 2026

nemo_skills/evaluation/evaluator/bird.py (outdated, resolved)

gwarmstrong added run GPU tests and removed run GPU tests labels Jan 6, 2026

redoctopus added 4 commits January 6, 2026 15:21

Slight clarification to BIRD docs re data_dir

d5c37eb

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

More robust SQL comment removal

f4efad7

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Increment counter for dependency fix

97bf47b

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

Fix SQL comment regex

465e781

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

redoctopus added run GPU tests and removed run GPU tests labels Jan 6, 2026

gwarmstrong approved these changes Jan 7, 2026

gwarmstrong merged commit bfaf83f into main Jan 7, 2026
7 checks passed

gwarmstrong deleted the birdbench branch January 7, 2026 17:11

hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026

BIRD Benchmark (Text-to-SQL) (#1132)

2dcfe41

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

BIRD Benchmark (Text-to-SQL) (#1132)

5e32002

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

BIRD Benchmark (Text-to-SQL) (#1132)

df07374

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

gwarmstrong merged 48 commits into `main` from `birdbench`

2 participants
