@trentlavoie
Contributor

SUMMARY

Add a cache for table schemas in SQL Lab.

Motivation: Fetching a table schema can be slow on query engines such as Trino and Athena. We have seen the schema take up to 300s to appear when a Trino cluster is under load. Caching the schema gives users a more interactive experience in SQL Lab.

Design: Added a new schema cache to the cache manager so it can be managed separately from the other caches. Many users, including those on relational databases such as Postgres or MySQL, likely won't need this feature; it is aimed at query engines where fetching a table schema is not a trivial operation. The feature is disabled by default and is enabled by setting SCHEMA_CACHE_CONFIG in config.py.
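To make the opt-in behaviour concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `NullSchemaCache`, `InMemorySchemaCache`, `build_schema_cache`, and `get_table_schema` are hypothetical names, and the in-memory TTL cache only stands in for a real backend such as Redis.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class NullSchemaCache:
    """Stand-in for a disabled cache: stores nothing and always misses."""

    def get(self, key: str) -> Optional[Any]:
        return None

    def set(self, key: str, value: Any) -> None:
        pass


class InMemorySchemaCache:
    """Toy TTL cache (hypothetical) standing in for a real backend."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its timeout: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._timeout, value)


def build_schema_cache(config: Dict[str, Any]):
    """Return a no-op cache unless the config opts in to a real backend."""
    if config.get("CACHE_TYPE", "NullCache") == "NullCache":
        return NullSchemaCache()
    return InMemorySchemaCache(config.get("CACHE_DEFAULT_TIMEOUT", 300))


def get_table_schema(cache, key: str, fetch: Callable[[], Any]) -> Any:
    """Serve from cache when possible; otherwise fetch and store."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = fetch()
    cache.set(key, value)
    return value
```

With the default `NullCache` setting every call goes to the query engine, which matches the current behaviour; with a real backend configured, only the first call within the timeout pays the fetch cost.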

TESTING INSTRUCTIONS

  1. Add Trino cluster as a database
  2. Test to ensure there are no breaking changes (schema cache disabled by default): open SQL Lab and select a schema and table. Opening the table schema takes up to 300s.
  3. Enable the schema cache: add the following to config.py and restart:
SCHEMA_CACHE_CONFIG = {
    'CACHE_TYPE': 'redis',
    'CACHE_DEFAULT_TIMEOUT': 60 * 60 * 12, # 12 hr cache
    'CACHE_KEY_PREFIX': 'superset_schema_',
    'CACHE_REDIS_URL': f"redis://{REDIS}:{REDIS_PORT}/{REDIS_DB}"
}
  4. Test with the schema cache enabled: open SQL Lab and select a schema and table. Opening the table schema is slow on a cache miss, but subsequent requests are fast.

ADDITIONAL INFORMATION

  • Has associated issue:
  • [x] Required feature flags:
    SCHEMA_CACHE_CONFIG = {
        'CACHE_TYPE': 'redis',
        'CACHE_DEFAULT_TIMEOUT': 60 * 60 * 12, # 12 hr cache
        'CACHE_KEY_PREFIX': 'superset_schema_',
        'CACHE_REDIS_URL': f"redis://{REDIS}:{REDIS_PORT}/{REDIS_DB}"
    }
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dosubot (bot) added the infra:caching and sqllab labels Mar 24, 2025

@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue | Fix Detected
Documentation | Incomplete circular dependency comment |
Performance | Ineffective Schema Cache Configuration |

Suppressed issues based on your team's Korbit activity

This issue | Is similar to | Because
line 76: Cache key construction doesn't account for potential special characters in table or schema names that could cause key collisions | Cache key not handling None catalog properly | Similar issues were not addressed in the past
lines 77:79: The cache implementation lacks TTL (Time To Live) settings which could lead to stale data being served indefinitely. | Missing Cache Timeout in Slack Channels Cache | Similar issues were not addressed in the past
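The suppressed cache-key concerns (special characters in names, and a `None` catalog) can be illustrated with a small sketch. This is one possible approach, not the key format used in this PR: the function name `schema_cache_key` and the `table_schema:` prefix are assumptions made for the example. Serializing the parts as a JSON list keeps each field's boundary unambiguous, and `None` serializes to `null`, which differs from the string `"None"`.

```python
import hashlib
import json
from typing import Optional


def schema_cache_key(
    database_id: int,
    catalog: Optional[str],
    schema: Optional[str],
    table: str,
) -> str:
    """Build a collision-resistant cache key (hypothetical helper).

    A JSON list preserves field boundaries, so ("a.b", "c") and
    ("a", "b.c") produce different payloads, and a missing catalog
    (None -> null) is distinct from a catalog literally named "None".
    """
    payload = json.dumps([database_id, catalog, schema, table], ensure_ascii=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"table_schema:{digest}"
```

Hashing the payload also keeps keys at a fixed length regardless of how long or unusual the identifiers are, at the cost of keys no longer being human-readable.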

Files scanned
File Path Reviewed
superset/utils/cache_manager.py
superset/databases/utils.py
superset/config.py

Comment on lines +74 to +75
# Lazy import to prevent circular dependency
from superset.extensions import cache_manager

Incomplete circular dependency comment (category: Documentation)

What is the issue?

The inline comment about lazy import is too brief and doesn't explain which modules are involved in the circular dependency.

Why this matters

Future maintainers may inadvertently create import problems without understanding which modules are affected.

Suggested change ∙ Feature Preview

Lazy import to prevent circular dependency between superset.extensions and superset.databases modules

DATA_CACHE_CONFIG: CacheConfig = {"CACHE_TYPE": "NullCache"}

# Schema caching configuration
SCHEMA_CACHE_CONFIG: CacheConfig = {"CACHE_TYPE": "NullCache"}

Ineffective Schema Cache Configuration (category: Performance)

What is the issue?

The default schema cache configuration is set to use NullCache, which effectively disables caching.

Why this matters

The NullCache setting will prevent any actual caching from happening, defeating the purpose of implementing the schema caching mechanism for faster table metadata retrieval.

Suggested change ∙ Feature Preview

Set a default cache backend that actually performs caching, for example:

SCHEMA_CACHE_CONFIG: CacheConfig = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 86400,  # 24 hours
    "CACHE_KEY_PREFIX": "schema_",
    "CACHE_REDIS_URL": "redis://localhost:6379/1"
}

@mistercrunch
Member

mistercrunch commented Mar 24, 2025

Cool feature! But wherever we use caching in the app, we try to couple the feature with:

  • info/metadata about the data being served from cache
  • a way for the user to force-refresh, bypassing the cache

Would be nice to have something similar here. Otherwise, say I run ALTER TABLE in SQL Lab: I might hit a dead end, unable to see my own changes reflected in the sidebar or elsewhere.

Examples

[Screenshots: Screenshot 2025-03-24 at 1 48 16 PM, Screenshot 2025-03-24 at 1 48 01 PM]

@codecov

codecov bot commented Mar 24, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.41%. Comparing base (76d897e) to head (d36033b).
Report is 2164 commits behind head on master.

Files with missing lines | Patch % | Lines
superset/databases/utils.py | 87.50% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #32830       +/-   ##
===========================================
+ Coverage   60.48%   83.41%   +22.92%     
===========================================
  Files        1931      549     -1382     
  Lines       76236    39476    -36760     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32927    -13187     
+ Misses      28017     6549    -21468     
+ Partials     2105        0     -2105     
Flag | Coverage Δ
hive | 48.42% <35.71%> (-0.74%) ⬇️
javascript | ?
mysql | 75.64% <92.85%> (?)
postgres | 75.71% <92.85%> (?)
presto | 52.90% <35.71%> (-0.90%) ⬇️
python | 83.41% <92.85%> (+19.90%) ⬆️
sqlite | 75.22% <92.85%> (?)
unit | 61.38% <35.71%> (+3.75%) ⬆️

@sadpandajoe requested a review from justinpark March 25, 2025 17:24
@sadpandajoe changed the title from "feature(sqllab): Add cache sqllab sidebar table schema (#32796)" to "feat(sqllab): Add cache sqllab sidebar table schema (#32796)" Mar 25, 2025
@justinpark
Member

To add to Max's point, there is already a UI for refreshing the table schema.
[Screenshot 2025-03-31 at 3 03 08 PM]

However, this UI clears the cache on the frontend and re-calls the existing /api/.../table_metadata and /api/.../table_metadata/extra endpoints. Since those calls will now return the metadata held in the server-side cache, the server needs logic to refresh its cached data, for example through a parameter similar to the force refresh (force=true) that already applies to the schema and table lists.

@mistercrunch
Member

Oh right. Forgot about this button. So if that button adds a ?force=true and the normal call doesn't, things should work.

Extra points if you add a latest_refreshed or similar to the payload(s) metadata, so that the frontend can surface it, say, on hover of that button.
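A rough sketch of the behaviour discussed here, under stated assumptions: `get_table_metadata`, the module-level `_schema_cache` dict, and the `is_cached` field are hypothetical and only illustrate the idea; `force` and `latest_refreshed` are the names used in this thread. A forced call bypasses the cached entry, re-fetches, and resets the refresh timestamp.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict

# Hypothetical in-process store; a real deployment would use the
# configured cache backend instead of a module-level dict.
_schema_cache: Dict[str, Dict[str, Any]] = {}


def get_table_metadata(
    key: str,
    fetch: Callable[[], Dict[str, Any]],
    force: bool = False,
) -> Dict[str, Any]:
    """Return table metadata plus cache info, honouring force refresh."""
    # A forced request ignores whatever is cached for this key.
    entry = None if force else _schema_cache.get(key)
    served_from_cache = entry is not None
    if entry is None:
        entry = {
            "metadata": fetch(),
            "latest_refreshed": datetime.now(timezone.utc).isoformat(),
        }
        _schema_cache[key] = entry
    return {
        **entry["metadata"],
        "latest_refreshed": entry["latest_refreshed"],
        "is_cached": served_from_cache,
    }
```

In this sketch the refresh button would call with `force=True`, while ordinary sidebar loads leave it at the default, so the frontend could show `latest_refreshed` on hover as suggested above.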

@trentlavoie
Contributor Author

Makes sense. Let me work on adding a force refresh and latest_refreshed to the API.

@trentlavoie marked this pull request as draft April 3, 2025 15:42
Labels

infra:caching, review:draft, size/S, sqllab