feat(examples): Modernize example data loading with Parquet and YAML configs #36538

Merged: rusackas merged 108 commits into master from revamped-example-loading on Jan 21, 2026

rusackas (Member) commented Dec 11, 2025

Summary

This PR modernizes the Superset example data loading system by migrating to a Parquet-based approach with YAML configuration files, organized by dashboard for a better developer experience.

Key Changes

  1. New Directory Structure by Dashboard

    • Each dashboard is self-contained in its own directory
    • Data and configuration co-located for easy maintenance
    • Shared configs in _shared/ directory
    • Support for both simple (single dataset) and complex (multiple datasets) examples
    superset/examples/
    ├── _shared/
    │   ├── database.yaml      # Database connection config
    │   └── metadata.yaml      # Import metadata
    ├── birth_names/           # Simple example (single dataset)
    │   ├── data.parquet
    │   ├── dataset.yaml
    │   ├── dashboard.yaml
    │   └── charts/
    ├── deck_gl/               # Complex example (multiple datasets)
    │   ├── dashboard.yaml
    │   ├── charts/
    │   ├── datasets/          # Multiple dataset configs
    │   │   ├── long_lat.yaml
    │   │   ├── flights.yaml
    │   │   ├── bart_lines.yaml
    │   │   └── sf_population_polygons.yaml
    │   └── data/              # Multiple parquet files
    │       ├── long_lat.parquet
    │       ├── flights.parquet
    │       ├── bart_lines.parquet
    │       └── sf_population_polygons.parquet
    └── ... (11 example dashboards total)
    
  2. Migrated to Parquet Storage Format

    • Converted all example datasets to compressed Parquet files (Snappy compression)
    • Reduced total data size from 79MB to 58MB (27% smaller)
    • Parquet is an Apache project, which makes it a natural fit for an ASF codebase
  3. Auto-Discovery System

    • Just drop a data.parquet file in a new directory to add an example
    • YAML configs are auto-discovered and imported
    • No Python code changes needed to add new examples
  4. Generic Loading System

    • Implemented load_parquet_table() for unified data loading
    • Removed most dataset-specific Python modules (flights.py, energy.py, etc.); birth_names.py and world_bank.py remain for now because integration tests still need them
    • Added robust error handling with directory traversal prevention
  5. Export as Example Feature ✨ NEW

    • Added "Export as Example" option to the Dashboard header Download menu
    • Exports any dashboard in the new Parquet + YAML format
    • Makes it easy for developers to create new examples from existing dashboards
    • Includes CLI command: superset export-example --dashboard-id <id> --name <name> --output-dir <dir>
    • API endpoint: GET /api/v1/dashboard/<pk>/export_as_example/
    • Protected by the same export permission as the regular YAML export
  6. Showtime/Ephemeral Environment Support 🐤 NEW

    • Removed dependency on pre-built examples.duckdb file from external repo
    • Showtime environments now load examples directly from Parquet files at runtime
    • Removed LOAD_EXAMPLES_DUCKDB build argument from Dockerfile
    • Examples stay automatically in sync with YAML configs
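To make the directory layout and auto-discovery described above concrete, here is a minimal sketch of how a loader might locate example directories and their Parquet files. It is an illustration only: `discover_examples` is a hypothetical name, and skipping underscore-prefixed directories is an assumption based on the `_shared/` directory. The PR's actual discovery code may work differently.

```python
from pathlib import Path


def discover_examples(root: Path) -> dict[str, list[Path]]:
    """Map each example directory name to the Parquet files found in it.

    Hypothetical helper, not the PR's implementation.
    """
    found: dict[str, list[Path]] = {}
    for example_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if example_dir.name.startswith("_"):
            continue  # assumed convention: _shared/ holds config, not an example
        files: list[Path] = []
        single = example_dir / "data.parquet"
        if single.is_file():
            files.append(single)  # simple layout: one data.parquet
        data_dir = example_dir / "data"
        if data_dir.is_dir():
            files.extend(sorted(data_dir.glob("*.parquet")))  # complex layout
        if files:
            found[example_dir.name] = files
    return found
```

With this shape, adding an example is a matter of creating a directory; no registry or Python module has to be edited.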

Export as Example UI

The "Export as Example" option appears in the Download submenu alongside "Export YAML":

  • Export YAML - Standard export format for import/export workflows
  • Export as Example - New format with Parquet data for the examples system
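Besides the menu entry, the description above lists a CLI command and an API endpoint for the same export. The sketch below only assembles the command line and the request URL from those two strings. The helper names and the base URL are hypothetical, and authentication and the response format are not described in this PR, so they are left out.

```python
from urllib.parse import urljoin


def export_example_url(base_url: str, dashboard_id: int) -> str:
    """Build the URL for GET /api/v1/dashboard/<pk>/export_as_example/."""
    # base_url is an assumption, e.g. a local development instance.
    return urljoin(
        base_url.rstrip("/") + "/",
        f"api/v1/dashboard/{dashboard_id}/export_as_example/",
    )


def export_example_argv(dashboard_id: int, name: str, output_dir: str) -> list[str]:
    """Build the argument list for the `superset export-example` command."""
    return [
        "superset", "export-example",
        "--dashboard-id", str(dashboard_id),
        "--name", name,
        "--output-dir", output_dir,
    ]


print(export_example_url("http://localhost:8088", 12))
print(" ".join(export_example_argv(12, "my_example", "superset/examples")))
```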

Example Dashboards (10 total)

| Dashboard | Datasets | Description |
|---|---|---|
| birth_names | 1 | Baby names data |
| cleaned_sales_data | 1 | Sales data for Featured Charts |
| deck_gl | 4 | Geospatial visualizations (long_lat, flights, bart_lines, sf_population_polygons) |
| fcc_2018_survey | 1 | FCC survey data |
| featured_charts | 2 | SQL virtual datasets (hierarchical_dataset, project_management) |
| misc_charts | 2 | Miscellaneous chart types (birth_france_by_region, energy_usage) |
| slack_dashboard | 11 | Slack analytics (channels, users, messages, etc.) |
| video_game_sales | 1 | Video game sales data |
| wb_health_population | 1 | World Bank health & population data |

Why Parquet?

  • Apache-friendly: Parquet is an Apache project, making it ideal for ASF codebases
  • Compressed: Built-in Snappy compression reduces storage by ~27%
  • Widely supported: Compatible with pandas, pyarrow, DuckDB, Spark, and many other tools
  • Self-describing: Schema is embedded in the file
  • Industry standard: De facto standard for columnar data storage
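The ~27% figure quoted above follows directly from the two sizes given earlier (79MB before, 58MB after). A quick check of the arithmetic, using a hypothetical helper name:

```python
def percent_reduction(before_mb: float, after_mb: float) -> float:
    """Size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


# (79 - 58) / 79 = 21 / 79, which rounds to the quoted 27%
print(round(percent_reduction(79, 58), 1))  # 26.6
```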

Benefits

  • Better DevEx: Examples grouped by dashboard, data and configs together
  • Smaller footprint: 27% reduction in example data size
  • Maintainability: YAML configs are easier to update than Python code
  • Consistency: Single source of truth for example data across tests and production
  • Security: Added validation to prevent directory traversal
  • Extensibility: Easy to add new examples by dropping in a directory
  • Easy contribution: "Export as Example" lets anyone create properly-formatted examples
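The directory traversal validation mentioned under Security is not shown in this description. A common way to implement that kind of check is to resolve the requested path and confirm it still lies under the examples directory. The sketch below shows that general technique with a hypothetical `resolve_inside` helper. It is not the PR's code.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, relative: str) -> Path:
    """Resolve `relative` under `base_dir`, rejecting paths that escape it."""
    base = base_dir.resolve()
    candidate = (base / relative).resolve()
    # After resolving "..", symlinks, and absolute overrides, the result
    # must still be located under the base directory.
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"path escapes the examples directory: {relative!r}")
    return candidate
```

A call such as `resolve_inside(examples_dir, "birth_names/data.parquet")` returns the resolved path, while `"../outside.parquet"` or an absolute path raises `ValueError`.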

Testing

  • All Python unit and integration tests pass
  • Cypress tests updated to use dynamic ID lookups instead of hardcoded IDs
  • Some Cypress tests temporarily skipped (see Next Steps below)
  • Playwright E2E tests added for Export as Example functionality

Cypress Test Status

| Test | Status | Notes |
|---|---|---|
| box_plot.test.js | ✅ Fixed | Dynamic dataset lookup |
| bubble.test.js | ✅ Fixed | Dynamic dataset lookup |
| nativeFilters.test.ts | ✅ Fixed | Dynamic dataset/chart lookup |
| filter.test.ts (chart_list) | ✅ Fixed | Datasets & dashboards exist in YAML |
| _skip.tabs.test.ts | ⏸️ Skipped | Needs charts not in YAML (Treemap, Box plot, etc.) |
| _skip.AdhocMetrics.test.ts | ⏸️ Skipped | Needs "Num Births Trend" chart |
| _skip.advanced_analytics.test.ts | ⏸️ Skipped | Needs "Num Births Trend" chart |
| _skip.link.test.ts | ⏸️ Skipped | Needs "Growth Rate" chart |
| _skip.annotations.test.ts | ⏸️ Skipped | Was already skipped |

Breaking Changes

None for end users. The superset load-examples command works exactly as before.

For developers:

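  • Most dataset-specific Python modules under superset.examples (flights.py, energy.py, etc.) are removed; birth_names.py and world_bank.py remain for backwards compatibility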
  • Test fixtures now use the config-based loading system
  • Example data moved from superset/examples/data/ to superset/examples/{name}/data.parquet
  • Docker: LOAD_EXAMPLES_DUCKDB build argument removed - examples are loaded from Parquet at runtime

Next Steps (Follow-up PRs)

The following items are out of scope for this PR but should be addressed in follow-up work:

1. Add Missing Charts to YAML Configs

The skipped Cypress tests depend on specific charts that were created by the old Python code but aren't in the YAML configs yet:

  • Treemap - needed for tabs.test.ts
  • Box plot - needed for tabs.test.ts
  • Growth Rate - needed for tabs.test.ts, link.test.ts
  • Num Births Trend - needed for AdhocMetrics.test.ts, advanced_analytics.test.ts
  • Number of Girls - needed for tabs.test.ts
  • Names Sorted by Num in California - needed for tabs.test.ts

2. Convert Tabbed Dashboard to YAML

The tabbed_dashboard.py creates a special dashboard for testing tab navigation. This should be converted to YAML format with all required charts.

3. Apply Dynamic ID Lookup Pattern to More Tests

The pattern introduced in this PR (getDatasetId(), getChartId()) can be applied to other tests that may have hardcoded IDs, making them more resilient to changes in example data.
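The `getDatasetId()` and `getChartId()` helpers are Cypress helpers in this PR, and their implementation is not reproduced here. The underlying idea, resolving an ID from a stable name at test time instead of hardcoding it, can be illustrated in Python as follows. The row shape (`id`, `table_name`) and the function name are assumptions made for the illustration.

```python
def find_id_by_name(rows: list[dict], name_field: str, name: str) -> int:
    """Return the id of the single row whose `name_field` equals `name`."""
    matches = [row["id"] for row in rows if row.get(name_field) == name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one row with {name_field}={name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical rows, shaped like a list response that a test might fetch first.
datasets = [
    {"id": 7, "table_name": "birth_names"},
    {"id": 12, "table_name": "flights"},
]
print(find_id_by_name(datasets, "table_name", "flights"))  # 12
```

Because the lookup keys on the name, the test keeps working when example data is re-imported and IDs shift.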

4. Remove Remaining Python Example Loaders

A few Python modules remain for backwards compatibility (birth_names.py, world_bank.py). Once the Cypress tests are fully migrated, these can be removed.

5. Deprecate Pre-built examples.duckdb

The pre-built examples.duckdb file in the apache-superset/examples-data repo is no longer used. It can be removed or marked as deprecated in a follow-up.

@github-actions github-actions Bot added the doc Namespace | Anything related to documentation label Dec 11, 2025
@dosubot dosubot Bot added the doc:examples Related to example datasets and dashboards label Dec 11, 2025
@codeant-ai-for-open-source codeant-ai-for-open-source Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Dec 11, 2025
Comment thread superset/cli/examples.py Outdated
Comment thread superset/commands/importers/v1/utils.py Outdated
Comment thread superset/commands/importers/v1/utils.py Outdated
Comment thread superset/examples/generic_loader.py Outdated
Comment thread superset/examples/generic_loader.py
Comment thread superset/examples/helpers.py
Comment thread superset/examples/helpers.py Outdated
Comment thread superset/examples/helpers.py Outdated
@rusackas rusackas force-pushed the revamped-example-loading branch from fd5b47e to be5e9f1 Compare December 12, 2025 23:12
@rusackas rusackas force-pushed the revamped-example-loading branch 3 times, most recently from 2131e36 to cd9cf76 Compare December 16, 2025 22:30
@rusackas rusackas changed the title from "feat(examples): Revamp example data loading with DuckDB and fix chart issues" to "feat(examples): Modernize example data loading with Parquet and YAML configs" Dec 16, 2025
@apache apache deleted 11 bot comments (10 from codeant-ai-for-open-source, 1 from bito-code-review) Dec 17, 2025
@rusackas rusackas added the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025
@github-actions github-actions Bot removed the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025
@codeant-ai-for-open-source (Contributor)

CodeAnt AI is running Incremental review

@codeant-ai-for-open-source (Contributor)

CodeAnt AI Incremental review completed.

rusackas and others added 6 commits January 18, 2026 18:13
The css_templates.py was removed in the previous commit but references
remained in data_loading.py and cli/examples.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

These files were incorrectly deleted:
- css_templates.py: Creates CSS templates in DB, not a data loader
- countries.py: Used by superset/viz.py for country lookups
- countries.md: Documentation about data source

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The data_file field is used by the example loading system to explicitly
reference Parquet files. Without this field in the schema, the import
validation rejects the YAML files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Virtual datasets (SQL-based) were being exported with their SQL queries,
but the underlying tables they reference weren't included. This caused
errors when loading the examples.

Changes:
- Export script now sets sql=null to convert virtual to physical datasets
- Fixed existing virtual datasets in slack_dashboard, featured_charts
- Fixed empty string sql values to null in fcc_survey, video_game_sales

The parquet files contain the query results, so no data is lost.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
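
The two `sql` fixes in the commit above (forcing `sql` to null so a dataset is treated as physical, and turning empty strings into null) can be sketched on a plain dict standing in for a parsed dataset YAML. Both function names are hypothetical, and the export script's real code is not shown in this conversation.

```python
from typing import Any, Dict


def normalize_sql(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy where a missing, empty, or blank `sql` is None."""
    cleaned = dict(config)
    sql = cleaned.get("sql")
    if sql is None or not sql.strip():
        cleaned["sql"] = None  # YAML null, i.e. a physical dataset
    return cleaned


def as_physical(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with `sql` cleared, as done for materialized datasets."""
    cleaned = dict(config)
    cleaned["sql"] = None  # the Parquet file already holds the query results
    return cleaned
```
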
Resolved conflicts in:
- superset/examples/data_loading.py (keep our Parquet-based auto-discovery)
- superset/cli/examples.py (keep our simplified loader)

…ion)

The international_sales example was added to master as a Python loader.
It needs to be converted to Parquet/YAML format in a follow-up PR to
work with the new example loading system.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment thread superset/examples/data_loading.py Outdated
rusackas and others added 6 commits January 19, 2026 15:38
Converts the international_sales example from Python loader to
Parquet format with YAML metadata. This dataset demonstrates
multi-currency transactions for the Dynamic Currency feature:

- Multiple currencies: USD, EUR, GBP, JPY, CAD, AUD
- Case variations for normalization testing
- NULL and empty currency values for fallback testing
- Multiple monetary columns (revenue, cost, profit, unit_price)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add ASF license headers to all example YAML files (135 files)
- Update export_example.py to auto-generate license headers in YAML output
- Restore birth_names.py and world_bank.py needed by integration tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
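
Auto-generating the license header on export, as described in the commit above, amounts to prepending the standard ASF header as YAML comments. The sketch below shows one idempotent way to do that. `with_license_header` is a hypothetical name, and how export_example.py actually does it is not shown here.

```python
ASF_LICENSE_HEADER = """\
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""


def with_license_header(yaml_text: str) -> str:
    """Prepend the ASF header unless the text already starts with it."""
    if yaml_text.startswith(ASF_LICENSE_HEADER):
        return yaml_text
    return ASF_LICENSE_HEADER + yaml_text
```
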
Tests were failing because they depended on members_channels_2 being
a virtual dataset, but the Parquet conversion made it physical.
Virtual datasets show a "Duplicate" button; physical datasets don't.

Changes:
- Add createTestVirtualDataset helper to create test-owned virtual datasets
- Update delete/duplicate tests to create their own virtual datasets
- Update navigation test to use birth_names (physical dataset is fine)

This decouples tests from example data changes, making them hermetic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Virtual datasets that only reference physical tables within the same
export are now preserved with their SQL intact, instead of being
materialized to Parquet. This prevents data duplication when a dashboard
has multiple virtual datasets built on the same underlying table.

Logic:
- Physical datasets: export data as Parquet (unchanged)
- Virtual datasets referencing only exported physical tables: preserve SQL
- Virtual datasets referencing external tables: materialize to Parquet

Uses Superset's existing SQL parser (superset.sql.parse.SQLStatement)
to extract table references from virtual dataset SQL queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
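
The three-way rule in the commit above can be written as a small decision function. In the PR the table references come from superset.sql.parse.SQLStatement. To stay self-contained, this sketch takes the referenced tables as an already-extracted argument. The names `ExportMode` and `choose_export_mode` are hypothetical, and treating a query with no table references as "preserve SQL" is an assumption.

```python
from enum import Enum
from typing import Iterable, Optional


class ExportMode(Enum):
    PARQUET = "parquet"            # physical dataset: export its data
    PRESERVE_SQL = "preserve_sql"  # virtual dataset: keep the SQL intact
    MATERIALIZE = "materialize"    # virtual dataset: write results to Parquet


def choose_export_mode(
    sql: Optional[str],
    referenced_tables: Iterable[str],
    exported_physical_tables: Iterable[str],
) -> ExportMode:
    """Pick how a dataset is exported, following the commit's stated logic."""
    if not sql or not sql.strip():
        return ExportMode.PARQUET  # no SQL means a physical dataset
    # Every referenced table must be part of the same export.
    if set(referenced_tables) <= set(exported_physical_tables):
        return ExportMode.PRESERVE_SQL
    return ExportMode.MATERIALIZE
```
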
Re-exported all example dashboards from running Superset instance
using the Export as Example API. This ensures examples are up-to-date
with recent dashboard edits and use the new export format with ASF
license headers.

Dashboards exported:
- world_health (World Bank's Data)
- usa_births_names (USA Births Names)
- misc_charts (Misc Charts)
- deckgl_demo (deck.gl Demo)
- slack_dashboard (Slack Dashboard)
- sales_dashboard (Sales Dashboard)
- fcc_new_coder_survey (FCC New Coder Survey 2018)
- featured_charts (Featured Charts)
- video_game_sales (Video Game Sales)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Wrap discover_datasets() call in try/except to gracefully handle
imports outside of an active Flask application context (e.g., in
tests or tooling).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
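
The guard described in the commit above can be sketched as a wrapper around the discovery call. Flask raises `RuntimeError` when code needs an application context that is not active, so that is the exception caught here. Which exception the PR actually catches is not stated, and `discover_or_empty` is a hypothetical name.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def discover_or_empty(discover: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Call `discover`, falling back to an empty mapping on RuntimeError."""
    try:
        return discover()
    except RuntimeError as ex:
        # Assumed failure mode: imported in tests or tooling without an
        # active Flask application context.
        logger.debug("Skipping example discovery: %s", ex)
        return {}
```
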
@github-actions (Contributor)

🎪 Showtime is building environment on GHA for dda62d3

richardfogaca (Contributor) left a comment

lgtm :)

@github-actions (Contributor)

🎪 Showtime is building environment on GHA for d6f4b6f

betodealmeida (Member) left a comment

Yes!

My suggestion for future improvements: let's put all virtual datasets in per-DB directories, and use them depending on the examples DB:

  • datasets/virtual/postgres/
  • datasets/virtual/mysql/

We can have only Postgres initially, and add support for more over time.
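
One way the per-DB layout suggested above could be resolved at load time is sketched below. Everything here is hypothetical: the function name, the mapping from backend names to directory names, and the choice to return `None` for unsupported backends are assumptions, and nothing in this PR implements it.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from a database backend name to its directory name.
BACKEND_DIRS = {"postgresql": "postgres", "mysql": "mysql"}


def virtual_dataset_dir(example_dir: Path, backend: str) -> Optional[Path]:
    """Return datasets/virtual/<db>/ for a supported backend, else None."""
    sub = BACKEND_DIRS.get(backend)
    if sub is None:
        return None  # no virtual datasets are provided for this backend
    return example_dir / "datasets" / "virtual" / sub
```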

sfirke (Member) commented Jan 22, 2026

I love the decision to go with Parquet instead of DuckDB for this use case. Nicely done!

Labels

  • api (Related to the REST API)
  • data (Namespace | Anything related to data, including databases configurations, datasets, etc.)
  • doc:examples (Related to example datasets and dashboards)
  • doc (Namespace | Anything related to documentation)
  • preset-io
  • size/XXL
  • size:XXL (This PR changes 1000+ lines, ignoring generated files)

5 participants