feat(examples): Modernize example data loading with Parquet and YAML configs #36538
The css_templates.py was removed in the previous commit but references remained in data_loading.py and cli/examples.py.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

These files were incorrectly deleted:
- css_templates.py: Creates CSS templates in DB, not a data loader
- countries.py: Used by superset/viz.py for country lookups
- countries.md: Documentation about data source
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The data_file field is used by the example loading system to explicitly reference Parquet files. Without this field in the schema, the import validation rejects the YAML files.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Virtual datasets (SQL-based) were being exported with their SQL queries, but the underlying tables they reference weren't included. This caused errors when loading the examples.
Changes:
- Export script now sets sql=null to convert virtual to physical datasets
- Fixed existing virtual datasets in slack_dashboard, featured_charts
- Fixed empty string sql values to null in fcc_survey, video_game_sales
The parquet files contain the query results, so no data is lost.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Resolved conflicts in:
- superset/examples/data_loading.py (keep our Parquet-based auto-discovery)
- superset/cli/examples.py (keep our simplified loader)

…ion) The international_sales example was added to master as a Python loader. It needs to be converted to Parquet/YAML format in a follow-up PR to work with the new example loading system.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converts the international_sales example from Python loader to Parquet format with YAML metadata. This dataset demonstrates multi-currency transactions for the Dynamic Currency feature:
- Multiple currencies: USD, EUR, GBP, JPY, CAD, AUD
- Case variations for normalization testing
- NULL and empty currency values for fallback testing
- Multiple monetary columns (revenue, cost, profit, unit_price)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add ASF license headers to all example YAML files (135 files)
- Update export_example.py to auto-generate license headers in YAML output
- Restore birth_names.py and world_bank.py needed by integration tests
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests were failing because they depended on members_channels_2 being a virtual dataset, but the Parquet conversion made it physical. Virtual datasets show a "Duplicate" button; physical datasets don't.
Changes:
- Add createTestVirtualDataset helper to create test-owned virtual datasets
- Update delete/duplicate tests to create their own virtual datasets
- Update navigation test to use birth_names (physical dataset is fine)
This decouples tests from example data changes, making them hermetic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Virtual datasets that only reference physical tables within the same export are now preserved with their SQL intact, instead of being materialized to Parquet. This prevents data duplication when a dashboard has multiple virtual datasets built on the same underlying table.
Logic:
- Physical datasets: export data as Parquet (unchanged)
- Virtual datasets referencing only exported physical tables: preserve SQL
- Virtual datasets referencing external tables: materialize to Parquet
Uses Superset's existing SQL parser (superset.sql.parse.SQLStatement) to extract table references from virtual dataset SQL queries.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
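
The three-way export rule in the commit above can be sketched roughly as follows. This is an illustration only, not the PR's code: the Dataset class, the export_strategy function, and the pre-extracted referenced_tables field are hypothetical names, and the real export script is described as extracting table references with superset.sql.parse.SQLStatement rather than receiving them ready-made.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset that is about to be exported."""

    table_name: str
    sql: Optional[str] = None  # None or "" is treated as a physical dataset
    # Tables read by the SQL; assumed to be extracted by a SQL parser elsewhere.
    referenced_tables: frozenset = frozenset()


def export_strategy(dataset: Dataset, exported_physical: Set[str]) -> str:
    """Return "parquet" (materialize data) or "preserve_sql" (keep the query)."""
    if not dataset.sql:
        # Physical dataset: its rows are always written to Parquet.
        return "parquet"
    if dataset.referenced_tables <= exported_physical:
        # Every referenced table ships in the same export, so the SQL still
        # resolves after import and no data needs to be duplicated.
        return "preserve_sql"
    # At least one referenced table is outside the export: store the results.
    return "parquet"
```

For example, a virtual dataset that selects only from an exported birth_names table would keep its SQL, while one that also reads a table outside the export would be materialized.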
Re-exported all example dashboards from running Superset instance using the Export as Example API. This ensures examples are up-to-date with recent dashboard edits and use the new export format with ASF license headers.
Dashboards exported:
- world_health (World Bank's Data)
- usa_births_names (USA Births Names)
- misc_charts (Misc Charts)
- deckgl_demo (deck.gl Demo)
- slack_dashboard (Slack Dashboard)
- sales_dashboard (Sales Dashboard)
- fcc_new_coder_survey (FCC New Coder Survey 2018)
- featured_charts (Featured Charts)
- video_game_sales (Video Game Sales)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Wrap discover_datasets() call in try/except to gracefully handle imports outside of an active Flask application context (e.g., in tests or tooling).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
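
A minimal sketch of the guard described in the last commit, under stated assumptions: the discover_datasets below is a stand-in that only simulates the failure, safe_discover is a hypothetical wrapper name, and catching RuntimeError (the exception Flask raises when code runs outside an application context) is an assumption, since the commit message does not say which exception is caught or what the fallback value is.

```python
import logging

logger = logging.getLogger(__name__)


def discover_datasets():
    """Stand-in for the real function, which needs a Flask app context."""
    raise RuntimeError("Working outside of application context.")


def safe_discover(discover=discover_datasets):
    """Run discovery, falling back to an empty mapping when it cannot run."""
    try:
        return discover()
    except RuntimeError as ex:
        # Assumption: an empty result is an acceptable fallback at import
        # time, so tests and tooling can import the module without an app.
        logger.debug("Skipping example discovery: %s", ex)
        return {}
```

The design point is that importing the examples module should never fail just because no app is running; discovery can be retried later from inside an application context.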
betodealmeida
left a comment
Yes!
My suggestion for future improvements: let's put all virtual datasets in per-DB directories, and use them depending on the examples DB:
- datasets/virtual/postgres/
- datasets/virtual/mysql/
We can have only Postgres initially, and add support for more over time.
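
One way the suggested per-DB layout could be resolved at load time is sketched below. Nothing here exists in the PR: virtual_dataset_dir and the DIALECT_DIRS alias table are hypothetical, and the mapping from the SQLAlchemy dialect name "postgresql" to a postgres/ directory is an assumption made to match the directory names in the suggestion.

```python
from pathlib import Path
from typing import Optional

# Assumption: SQLAlchemy dialect names mapped to the suggested directory names.
DIALECT_DIRS = {"postgresql": "postgres", "mysql": "mysql"}


def virtual_dataset_dir(root: Path, dialect: str) -> Optional[Path]:
    """Return datasets/virtual/<db>/ for the examples DB, or None if absent."""
    name = DIALECT_DIRS.get(dialect.lower(), dialect.lower())
    candidate = root / "datasets" / "virtual" / name
    # Unsupported databases simply have no directory yet, so callers can
    # skip virtual datasets instead of failing.
    return candidate if candidate.is_dir() else None
```

Returning None for a missing directory fits the "only Postgres initially" plan: other databases load the physical datasets and skip the virtual ones until support is added.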
I love the decision to go with Parquet instead of DuckDB for this use case. Nicely done!
Summary
This PR modernizes the Superset example data loading system by migrating to a Parquet-based approach with YAML configuration files, organized by dashboard for better developer experience.
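
As a rough illustration of what directory-based discovery can look like, the sketch below scans for one data.parquet per example directory, following the superset/examples/{name}/data.parquet layout named in this description. The discover_examples function is hypothetical and is not the PR's implementation, which may also read the YAML configs or treat some directories specially.

```python
from pathlib import Path
from typing import Dict


def discover_examples(examples_root: Path) -> Dict[str, Path]:
    """Map each example name to its data.parquet file."""
    found: Dict[str, Path] = {}
    # One level deep: <examples_root>/<name>/data.parquet
    for data_file in sorted(examples_root.glob("*/data.parquet")):
        found[data_file.parent.name] = data_file
    return found
```

With a layout like this, adding an example needs no registration code: a new directory containing a data.parquet file is picked up on the next scan.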
Key Changes
New Directory Structure by Dashboard
- _shared/ directory
Migrated to Parquet Storage Format
Auto-Discovery System
- data.parquet file in a new directory to add an example
Generic Loading System
- load_parquet_table() for unified data loading
Export as Example Feature ✨ NEW
- superset export-example --dashboard-id <id> --name <name> --output-dir <dir>
- GET /api/v1/dashboard/<pk>/export_as_example/
- export permission as the regular YAML export
Showtime/Ephemeral Environment Support 🐤 NEW
- examples.duckdb file from external repo
- LOAD_EXAMPLES_DUCKDB build argument from Dockerfile
Export as Example UI
The "Export as Example" option appears in the Download submenu alongside "Export YAML":
Example Dashboards (10 total)
Why Parquet?
Benefits
Testing
Cypress Test Status
- box_plot.test.js
- bubble.test.js
- nativeFilters.test.ts
- filter.test.ts (chart_list)
- _skip.tabs.test.ts
- _skip.AdhocMetrics.test.ts
- _skip.advanced_analytics.test.ts
- _skip.link.test.ts
- _skip.annotations.test.ts
Breaking Changes
None for end users. The superset load-examples command works exactly as before.
For developers:
- superset.examples.birth_names are removed
- superset/examples/data/ to superset/examples/{name}/data.parquet
- LOAD_EXAMPLES_DUCKDB build argument removed - examples are loaded from Parquet at runtime
Next Steps (Follow-up PRs)
The following items are out of scope for this PR but should be addressed in follow-up work:
1. Add Missing Charts to YAML Configs
The skipped Cypress tests depend on specific charts that were created by the old Python code but aren't in the YAML configs yet:
- tabs.test.ts
- tabs.test.ts
- tabs.test.ts, link.test.ts
- AdhocMetrics.test.ts, advanced_analytics.test.ts
- tabs.test.ts
- tabs.test.ts
2. Convert Tabbed Dashboard to YAML
The tabbed_dashboard.py creates a special dashboard for testing tab navigation. This should be converted to YAML format with all required charts.
3. Apply Dynamic ID Lookup Pattern to More Tests
The pattern introduced in this PR (getDatasetId(), getChartId()) can be applied to other tests that may have hardcoded IDs, making them more resilient to changes in example data.
4. Remove Remaining Python Example Loaders
A few Python modules remain for backwards compatibility (birth_names.py, world_bank.py). Once the Cypress tests are fully migrated, these can be removed.
5. Deprecate Pre-built examples.duckdb
The pre-built examples.duckdb file in the apache-superset/examples-data repo is no longer used. It can be removed or marked as deprecated in a follow-up.