BIRD Benchmark (Text-to-SQL) #1132

Merged
gwarmstrong merged 48 commits into main from birdbench
Jan 7, 2026
Collaborator

@redoctopus redoctopus commented Dec 18, 2025

Added BIRD's SQL execution accuracy metric to code generation benchmarks.
Also added some documentation on how to download/process data and run the benchmark.

Original dataset website: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
SQL execution from BIRD: https://github.com/AlibabaResearch/DAMO-ConvAI/blob/main/bird/llm/src/evaluation.py

Summary by CodeRabbit

  • New Features

    • Added BIRD benchmark evaluation framework for text-to-SQL tasks with difficulty-categorized accuracy metrics and database validation testing.
  • Documentation

    • Updated evaluation documentation with BIRD benchmark instructions and data preparation guides; added BIRD to code evaluation references.
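
The difficulty-categorized accuracy mentioned in the summary can be pictured with a minimal sketch like the one below. It is not the code from `bird_metrics.py`: the function name and the `difficulty` key are assumptions made for illustration, and `res` is assumed to hold the 0/1 execution result for each sample.

```python
from collections import defaultdict


def accuracy_by_difficulty(predictions):
    """Return per-difficulty and overall accuracy in percent.

    Each prediction is a dict with a hypothetical 'difficulty' key
    (e.g. simple/moderate/challenging) and a 'res' key equal to 0 or 1.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for prediction in predictions:
        level = prediction["difficulty"]
        total[level] += 1
        correct[level] += prediction["res"]

    # One accuracy entry per difficulty level that actually occurs.
    report = {f"{level}_acc": 100.0 * correct[level] / total[level] for level in total}

    # Overall accuracy across all samples, guarding against an empty input.
    n = sum(total.values())
    report["total_acc"] = 100.0 * sum(correct.values()) / n if n else 0.0
    return report
```

Reporting the overall number alongside the per-level numbers keeps results comparable with single-figure leaderboards while still showing where a model struggles.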

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
…erabbit fixes

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
…al_full() in BIRD bench

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Comment on lines +331 to +333
This will download and unpack a file into `<output_directory>/dev_20240627`, which contains the BIRD dev manifest, table information, and database schemas. By default, `output_directory` will be under `nemo_skills/dataset/birdbench/`, though this can be changed via command line argument.

The script will also process the original manifest into `<output_directory>/dev.jsonl`, which will be the input for evaluation.
Collaborator

Is this needed now that we have data_dir? The rest of this is pretty consistent across evaluations.

Collaborator Author

Removed!

Contributor

greptile-apps bot commented Jan 6, 2026

Greptile Summary

Adds BIRD benchmark support for text-to-SQL evaluation with SQL execution accuracy metrics. The implementation includes data preparation scripts that download and process the BIRD dataset, an evaluator that compares SQL execution results with a 30-second timeout, and metrics calculation grouped by difficulty levels (simple/moderate/challenging). The changes properly integrate with the existing evaluation framework by registering the new evaluator and metrics classes, and include comprehensive documentation for setup and usage.
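
As a rough illustration of what "compares SQL execution results" means, here is a minimal sketch of set-based execution matching with `sqlite3`. It is not the evaluator from this PR: the function name and signature are assumptions, the 30-second timeout is left out for brevity, and a failure of either query is simply scored as 0.

```python
import sqlite3


def execution_match(db_path, predicted_sql, gold_sql):
    """Return 1 if both queries yield the same set of rows, else 0."""
    conn = sqlite3.connect(db_path)
    try:
        # Comparing sets ignores row order and duplicate rows.
        predicted = set(conn.execute(predicted_sql).fetchall())
        gold = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        # Invalid or failing SQL counts as an incorrect prediction here.
        return 0
    finally:
        conn.close()
    return int(predicted == gold)
```

Because the comparison is on result sets rather than query text, two differently written queries score as a match whenever they return the same rows on the target database.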

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured with proper error handling, timeout protection, MIT-licensed code attribution, and follows existing framework patterns. The SQL comment removal regex was already fixed in the latest commit, and all components integrate cleanly into the existing codebase.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_skills/evaluation/evaluator/bird.py | Adds BIRD SQL evaluator with proper licensing, comment removal, timeout handling, and SQL execution comparison |
| nemo_skills/evaluation/metrics/bird_metrics.py | Implements difficulty-categorized accuracy metrics for BIRD benchmark evaluation |
| nemo_skills/dataset/birdbench/prepare.py | Downloads BIRD data, extracts database schemas via sqlite3.iterdump(), and formats manifest |
| nemo_skills/prompt/config/generic/text_to_sql.yaml | Prompt template for text-to-SQL with step-by-step reasoning and SQL code block format |
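
For readers unfamiliar with `iterdump()`, the following is a minimal sketch of pulling table definitions out of a SQLite file. It is an illustration only, not the code in `prepare.py`: the function name is hypothetical, and it keeps just the `CREATE TABLE` statements, whereas the PR's script is described as also including sample data.

```python
import sqlite3


def dump_table_schemas(db_path):
    """Return the CREATE TABLE statements of a SQLite database as one string."""
    conn = sqlite3.connect(db_path)
    try:
        # iterdump() yields the SQL statements needed to recreate the database;
        # keep only the table definitions and skip the INSERT statements.
        statements = [
            statement
            for statement in conn.iterdump()
            if statement.upper().startswith("CREATE TABLE")
        ]
    finally:
        conn.close()
    return "\n".join(statements)
```

A string like this can then be placed in the prompt as the schema context for each question.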

Sequence Diagram

sequenceDiagram
    participant User
    participant ns_prepare_data
    participant Download as wget/zipfile
    participant SQLite as sqlite3
    participant ns_eval
    participant BirdEvaluator
    participant LLM as Model
    participant BirdMetrics
    
    User->>ns_prepare_data: prepare_data birdbench
    ns_prepare_data->>Download: Download dev.zip from bird-bench
    Download-->>ns_prepare_data: dev_20240627/
    ns_prepare_data->>SQLite: Extract table schemas via iterdump()
    SQLite-->>ns_prepare_data: Table definitions + sample data
    ns_prepare_data->>ns_prepare_data: Format entries to dev.jsonl
    ns_prepare_data-->>User: Data ready
    
    User->>ns_eval: eval --benchmarks=birdbench
    ns_eval->>LLM: Generate SQL from question + sql_context
    LLM-->>ns_eval: Generated SQL in ```sql blocks
    ns_eval->>BirdEvaluator: eval_single(data_point)
    BirdEvaluator->>BirdEvaluator: Extract SQL with regex
    BirdEvaluator->>BirdEvaluator: Remove comments (-- and /* */)
    BirdEvaluator->>SQLite: Execute predicted SQL
    SQLite-->>BirdEvaluator: predicted_results
    BirdEvaluator->>SQLite: Execute ground truth SQL
    SQLite-->>BirdEvaluator: ground_truth_results
    BirdEvaluator->>BirdEvaluator: Compare sets (timeout: 30s)
    BirdEvaluator-->>ns_eval: res = 0 or 1
    ns_eval->>BirdMetrics: update(predictions)
    BirdMetrics->>BirdMetrics: Group by difficulty (simple/moderate/challenging)
    BirdMetrics-->>User: Accuracy per difficulty + total_acc
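
The "Extract SQL with regex" and "Remove comments" steps in the diagram could look roughly like the sketch below. The patterns and the function name are assumptions rather than the ones used in `bird.py`, and this naive approach would also strip comment markers that happen to appear inside string literals.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
SQL_BLOCK = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(generation):
    """Return the last fenced sql block with comments removed, or None."""
    blocks = SQL_BLOCK.findall(generation)
    if not blocks:
        return None
    sql = blocks[-1]  # assume the final block is the model's answer
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)  # block comments
    sql = re.sub(r"--[^\n]*", " ", sql)  # line comments
    return " ".join(sql.split())  # collapse whitespace into single spaces
```

Taking the last block is a design choice that favors a model's final answer over intermediate drafts in step-by-step reasoning.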

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 1 comment

@gwarmstrong gwarmstrong merged commit bfaf83f into main Jan 7, 2026
7 checks passed
@gwarmstrong gwarmstrong deleted the birdbench branch January 7, 2026 17:11
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>

# Conflicts:
#	nemo_skills/evaluation/evaluator/__init__.py
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>