feat: add global task framework #36368
**michael-s-molina** left a comment:
Thanks for the latest changes @villebro. Second-pass review:
Thanks for the re-review @michael-s-molina! Please check the latest commit with the following changes (these should address the issues you raised, plus some other improvements):
```ts
      setTimeout(() => setCopied(false), 2000);
    })
    .catch(() => {
      // Failed to copy, ignore
```
Perhaps we could display an error toast instead of failing silently?
**michael-s-molina** left a comment:
Thank you for this great PR @villebro and for addressing hundreds of comments 😱
My only remaining comment is a non-blocking one about the log messages in the tasks manager, which I think should be debug instead of info. My main concern is the volume of messages, as you might have multiple entries for each scheduled task.
Sounds reasonable. The main thing I'm slightly worried about is having multiple execution paths in the repo for extended periods of time. I would love to have all tasks go through the same general code paths/decorators/abstractions, even if some tasks don't provide the same guarantees (dedup on/off, database tracking on/off, ...). This doesn't have to be in this PR necessarily. I'm not sure how complex that would be either, and would have to spend more time reviewing, but some sort of
The majority of the logic should be shared. I'll be updating the actual functional code to have optional abort handlers that will simply be undefined for the legacy paths. But I'll give it another go once this is merged and I start working in earnest on the first migration.
SUMMARY
This PR introduces the Global Task Framework (GTF) (the scope changed from GATF to GTF to also support sync tasks), a unified system for managing long-running tasks in Superset. GTF provides task execution, progress tracking, cancellation with graceful abort handling, and deduplication with scope-aware visibility.
Screen.Recording.2026-01-22.at.7.04.18.PM.mov
Key Features
Infrastructure Improvements
This PR also introduces foundational infrastructure improvements that benefit GTF but are designed for broader reuse:
- `SIGNAL_CACHE_CONFIG` configuration enables Redis-based pub/sub for real-time notifications and low-overhead distributed locking. This reduces metastore load and provides near-instant event delivery for abort signals and task completion notifications. In the future, `GLOBAL_ASYNC_QUERIES_CACHE_BACKEND` will be consolidated into this unified signal cache, providing a single Redis configuration for all signaling and coordination features.
- Distributed locking backed by Redis `SET NX EX` operations, moving lock overhead entirely to the cache with only two round trips per lock cycle.

TECHNICAL DETAILS
Architecture Overview
Cancellation Logic
GTF uses a unified Cancel action that determines whether to abort a task or unsubscribe the user, based on task scope and subscriber count.
The Core Principle
The cancellation system is designed around a simple idea: a task should only be fully aborted when no one needs it anymore. For private tasks, this is straightforward—there's only one user. For shared tasks, the system distinguishes between "I don't need this anymore" (unsubscribe) and "stop this for everyone" (abort).
How Cancel Decides What to Do
When a user clicks Cancel, the system evaluates the situation in this order:
Private Tasks: Abort. Since only the creator can see private tasks, cancelling means stopping the task.
System Tasks: Abort. Only admins can see system tasks.
Shared Tasks with Single Subscriber: Abort. The sole subscriber cancelling it stops the task.
Shared Tasks with Multiple Subscribers: The user is unsubscribed, and the task continues for others.
Admin with Force Abort: Stops the task for all subscribers. For admins on shared tasks with multiple subscribers, a "Force abort (stops task for all subscribers)" checkbox appears in the cancel modal. If the admin is not subscribed to the task, this checkbox is pre-checked and disabled (since aborting is the only sensible action—they can't unsubscribe from something they're not subscribed to). If the admin is subscribed, the checkbox is enabled and unchecked by default. When the admin is the sole subscriber, the checkbox is hidden since cancelling will automatically abort the task.
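The decision order above can be sketched as a small function. This is an illustrative sketch only; the names and signature are hypothetical and not the actual GTF implementation:

```python
from enum import Enum

class Scope(Enum):
    PRIVATE = "private"
    SYSTEM = "system"
    SHARED = "shared"

# Hypothetical sketch of the cancel decision order described above.
def resolve_cancel(scope: Scope, subscriber_count: int, force_abort: bool = False) -> str:
    if force_abort:
        return "abort"  # admin "Force abort": stops the task for all subscribers
    if scope in (Scope.PRIVATE, Scope.SYSTEM):
        return "abort"  # only the creator / only admins can see these tasks
    if subscriber_count <= 1:
        return "abort"  # the sole subscriber cancelling stops the task
    return "unsubscribe"  # task continues for the remaining subscribers
```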
The Last Subscriber Rule
When the final subscriber unsubscribes from a shared task, the task is automatically aborted since no users remain to receive the results. The user clicks Cancel, and if they are the last subscriber, the task stops.
Abortability and In-Progress Tasks
Not all in-progress tasks can be aborted. A task is only abortable if it has registered an abort handler (via `ctx.on_abort()`). This ensures developers must explicitly opt in to abort support by providing abort-handling logic. If no abort handler is provided, the cancellation action is disabled in the list view for in-progress tasks, with an explanatory tooltip.
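A minimal sketch of the opt-in pattern. The context class below is a stand-in written for this example; only the `on_abort()` hook mirrors the framework API described here:

```python
import threading

# Stand-in for the GTF task context; only on_abort() mirrors the
# framework API described above, the rest is illustrative.
class FakeContext:
    def __init__(self) -> None:
        self._handlers: list = []

    def on_abort(self, handler) -> None:
        self._handlers.append(handler)

    @property
    def is_abortable(self) -> bool:
        # A task only becomes abortable once a handler is registered
        return bool(self._handlers)

    def trigger_abort(self) -> None:
        for handler in self._handlers:
            handler()

def process_items(ctx: FakeContext, items: list) -> int:
    stop = threading.Event()
    ctx.on_abort(stop.set)  # explicit opt-in to abort support
    processed = 0
    for _ in items:
        if stop.is_set():
            break  # graceful abort: exit cleanly mid-run
        processed += 1
    return processed
```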
Timeout Handling
Tasks can be configured with a timeout that automatically triggers the abort flow when the duration is exceeded.
Setting Timeouts
Timeout Behavior
The timeout timer starts when the task transitions to `IN_PROGRESS`. When the timeout expires:

With an abort handler registered:
- the task enters the `ABORTING` state
- a graceful abort ends with `TIMED_OUT` status
- a failed abort ends with `FAILURE` status

Without an abort handler:
ABORTED vs TIMED_OUT
Both flows use the same `ABORTING` intermediate state, but the terminal state differs based on cause:

| Cause | Terminal status |
| --- | --- |
| User-initiated abort | `ABORTED` |
| Timeout | `TIMED_OUT` |
| Error during abort/execution | `FAILURE` |

This distinction allows users to immediately understand why a task stopped without reading error messages.
Abort Detection: Polling vs Pub/Sub
GTF supports two abort detection mechanisms:
1. Database Polling (Default)
Pros: No infrastructure dependencies
Cons: Up to N-second latency, periodic database queries
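A simplified polling loop illustrating the default mechanism. The function name and signature are hypothetical; in the real framework the interval comes from `TASK_ABORT_POLLING_DEFAULT_INTERVAL`:

```python
import time

def wait_for_abort(fetch_status, interval: float, max_polls: int) -> bool:
    """Poll the task's stored status; return True once it reaches ABORTING.

    fetch_status is a callable standing in for a database query, which is
    why the latency can be up to one polling interval.
    """
    for _ in range(max_polls):
        if fetch_status() == "ABORTING":
            return True
        time.sleep(interval)
    return False
```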
2. Redis Pub/Sub (When Configured)
When `SIGNAL_CACHE_CONFIG` is configured, GTF uses Redis pub/sub for instant abort notification.

Concurrency & Performance
GTF uses two complementary strategies to handle concurrent operations, optimized for their different usage patterns and performance requirements.
Why Two Strategies?
GTF uses atomic SQL wherever possible—it adds minimal overhead and provides the same safety guarantees as locking when the operation can be expressed as a conditional update. Distributed locks are reserved for operations that require read-then-write semantics.
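For example, an abort transition can be expressed as a single conditional `UPDATE`, where the affected row count tells the caller whether it won the race. This sketch uses `sqlite3` for self-containment; the actual framework uses SQLAlchemy and a richer schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (uuid TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks VALUES ('abc', 'IN_PROGRESS')")

def try_abort(conn: sqlite3.Connection, uuid: str) -> bool:
    # Race-safe: the status guard makes the transition a single atomic
    # query; if another writer already moved the task out of
    # IN_PROGRESS, zero rows are affected and the caller knows it lost.
    cur = conn.execute(
        "UPDATE tasks SET status = 'ABORTING' "
        "WHERE uuid = ? AND status = 'IN_PROGRESS'",
        (uuid,),
    )
    return cur.rowcount == 1

first = try_abort(conn, "abc")   # wins the race
second = try_abort(conn, "abc")  # loses: task is no longer IN_PROGRESS
```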
Atomic conditional updates (e.g., guarded by `WHERE status = 'IN_PROGRESS'`) provide race-safe updates in a single query.

Operations Reference
Lock overhead for the KV backend: DELETE expired + INSERT to acquire, DELETE to release (3 metastore queries, reduced from 4 by eliminating a redundant SELECT). With `SIGNAL_CACHE_CONFIG`, locking uses Redis `SET NX EX` + `DELETE` (2 cache operations), moving lock operations to the cache and freeing the metastore entirely.

When atomic updates return 0 rows affected (race detected), the executor accepts the concurrent outcome gracefully—no retries or errors. Cleanup handlers always run via the `finally` block regardless of which transition "won."
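The two-operation lock cycle can be illustrated with a toy in-memory stand-in for the Redis commands (key name and TTL below are arbitrary; this is not the framework's lock implementation):

```python
import time

# Toy stand-in for the Redis SET NX EX + DELETE lock cycle described above.
class CacheLock:
    def __init__(self) -> None:
        self._store: dict = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key: str, value: str, ttl: float) -> bool:
        """Emulates Redis `SET key value NX EX ttl`: set only if absent or expired."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return False  # lock is held by someone else
        self._store[key] = (value, now + ttl)
        return True

    def delete(self, key: str) -> None:
        # Releasing the lock is the second (and last) cache operation
        self._store.pop(key, None)
```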
Progress Update Throttling
`update_task()` calls are throttled to a configurable minimum interval (default: 2 seconds) between database writes. Configure via `TASK_PROGRESS_UPDATE_THROTTLE_INTERVAL` (set to 0 to disable).

Implementation Details
Lock ordering: Submit and Cancel acquire the lock before opening a database transaction. This prevents holding a DB connection while waiting for the lock. DAOs perform pure data operations and assume the caller holds the lock.
Post-commit notifications: When aborting a task, `publish_abort()` is called after the transaction commits. This ensures the `ABORTING` state is visible in the database before any pub/sub listener queries it.

When to Leverage GTF
Recommended for:
May be unnecessary for:
Deduplication Strategy
Deduplication uses a hashed `dedup_key` column with a unique index. The composite key is hashed using the configured `HASH_ALGORITHM` (default: SHA-256) to produce a fixed-length key that avoids performance and compatibility issues with long `task_key` values.

Unified join semantics: When a task with a matching `dedup_key` exists, the framework adds the user as a subscriber (if not already subscribed) and returns the existing task. This applies uniformly to all scopes—private tasks naturally have only one subscriber since their `dedup_key` includes the user_id.

Sync join-and-wait: When a sync caller (direct function call, not `.schedule()`) joins an existing active task, it blocks until the task completes rather than returning immediately. This uses the same notification mechanism as abort detection (Redis pub/sub if configured, otherwise database polling).

When a task completes, its `dedup_key` is changed to its UUID (36 chars, no hashing needed), freeing the slot.

Ambient Context Pattern
Tasks access execution context via `get_context()` using Python's `contextvars`. This eliminates the need to pass context through function signatures—tasks call `get_context()` wherever needed.

REST API Endpoints
The Task API provides read-only access to task data and cancellation capabilities. Tasks are created programmatically through the `@task` decorator and cleaned up automatically by scheduled prune jobs.

- `GET /api/v1/task/`
- `GET /api/v1/task/{uuid}`
- `GET /api/v1/task/{uuid}/status`
- `POST /api/v1/task/{uuid}/cancel`
- `GET /api/v1/task/related/created_by`
- `GET /api/v1/task/related/subscribers`

Why No DELETE or CREATE Endpoints?
No CREATE: Tasks are created programmatically when code calls `my_task.schedule()` or `my_task()`. The framework handles task creation, deduplication, and subscriber management internally.

No DELETE: Tasks are cleaned up by a scheduled prune job rather than manual deletion. This prevents accidental deletion of in-progress tasks and maintains audit trails. Configure the prune job in your Celery beat schedule (see `superset/config.py` for an example).

ADDITIONAL INFORMATION
Configuration
Enabling GTF
GTF is disabled by default and must be enabled via the `GLOBAL_TASK_FRAMEWORK` feature flag. When GTF is disabled:

- `/api/v1/task/*` endpoints return 404
- Calling a `@task`-decorated function raises `GlobalTaskFrameworkDisabledError`

This flag controls all GTF functionality. In the future, enabling this flag will also switch built-in features (thumbnails, alerts & reports, etc.) to use GTF-based tasks instead of legacy Celery tasks.
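A hypothetical `superset_config.py` fragment enabling GTF. The `SIGNAL_CACHE_CONFIG` keys below follow the Flask-Caching convention used by Superset's other caches, but they are an assumption; verify the exact keys against the documentation:

```python
# superset_config.py (illustrative sketch)
FEATURE_FLAGS = {
    "GLOBAL_TASK_FRAMEWORK": True,
}

# Optional: Redis-backed signal cache for pub/sub abort notifications
# and low-overhead distributed locking (keys assumed, Flask-Caching style).
SIGNAL_CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
    "CACHE_REDIS_URL": "redis://localhost:6379/2",
}
```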
Configuration Options
| Option | Default | Description |
| --- | --- | --- |
| `GLOBAL_TASK_FRAMEWORK` | `False` | Set to `True` to use GTF features. |
| `SIGNAL_CACHE_CONFIG` | `None` | Redis-based cache for pub/sub signaling and distributed locking. In the future, `GLOBAL_ASYNC_QUERIES_CACHE_BACKEND` will be consolidated into this cache. |
| `DISTRIBUTED_LOCK_DEFAULT_TTL` | `30` | Default TTL (seconds) for distributed locks. |
| `TASKS_ABORT_CHANNEL_PREFIX` | `"gtf:abort:"` | Prefix for abort pub/sub channels. |
| `TASK_ABORT_POLLING_DEFAULT_INTERVAL` | `10` | Default polling interval (seconds) for abort detection. |
| `TASK_PROGRESS_UPDATE_THROTTLE_INTERVAL` | `2` | Minimum interval (seconds) between `update_task()` DB writes. Set to 0 to disable throttling. |
| `SHOW_STACKTRACE` | `True` (dev) | Whether stack traces are surfaced in task error properties. |

Database Schema
New tables:

- `tasks` - Task metadata, status, and properties
- `task_subscribers` - Many-to-many relationship for shared task subscriptions

The migration includes indexes for deduplication lookups, scope-based filtering in the list view, and efficient pruning of old completed tasks.
Relationship Loading Strategy
To avoid N+1 query issues when listing tasks with subscriber information, the models use SQLAlchemy eager loading:

- `Task.subscribers`: Uses `lazy="selectin"` - when listing N tasks, this fires 2 queries total (1 for tasks + 1 IN-clause query for all subscribers) instead of N+1
- `TaskSubscriber.user`: Uses `lazy="joined"` - fetches user info (first_name, last_name) via a JOIN in the same query as the subscribers

This ensures the API list endpoint remains efficient regardless of task count or subscriber count.
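The loading strategy can be sketched with simplified models. Column lists are trimmed and not the actual schema; only the `lazy` settings mirror the description above:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)

class Task(Base):
    __tablename__ = "tasks"
    id = Column(Integer, primary_key=True)
    # selectin: listing N tasks issues one extra IN-clause query for all
    # subscribers instead of N separate lazy loads
    subscribers = relationship("TaskSubscriber", lazy="selectin")

class TaskSubscriber(Base):
    __tablename__ = "task_subscribers"
    id = Column(Integer, primary_key=True)
    task_id = Column(Integer, ForeignKey("tasks.id"))
    user_id = Column(Integer, ForeignKey("users.id"))
    # joined: user names come back in the same query as the subscriber rows
    user = relationship("User", lazy="joined")
```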
Properties JSON Column
The `tasks` table uses a `properties` JSON column to store runtime state and execution configuration. This design provides schema flexibility for incrementally adding new features without requiring database migrations.

Current properties structure:

| Property | Type | Description |
| --- | --- | --- |
| `execution_mode` | `Literal["async", "sync"]` | `"async"` for scheduled (Celery) execution, `"sync"` for inline execution |
| `timeout` | `int` | |
| `is_abortable` | `bool` | |
| `progress_percent` | `float` | |
| `progress_current` | `int` | |
| `progress_total` | `int` | |
| `error_message` | `str` | |
| `exception_type` | `str` | |
| `stack_trace` | `str` | |

Why JSON over columns:
- `properties` dict with consistent shape

TypedDict-based implementation:
Properties use a `TypedDict` with `total=False` (all fields optional) for type safety without runtime overhead.

Access pattern (sparse dict):
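A sketch of the pattern; the field list is abridged from the table above, and the class name is illustrative:

```python
from typing import Literal, TypedDict

class TaskProperties(TypedDict, total=False):
    # total=False: every field is optional, so rows only store what they use
    execution_mode: Literal["async", "sync"]
    timeout: int
    is_abortable: bool
    progress_percent: float
    error_message: str

# Sparse dict: absent keys simply aren't stored in the JSON column
props: TaskProperties = {"execution_mode": "async", "progress_percent": 42.0}

# Access with .get() so missing fields fall back to a default
percent = props.get("progress_percent", 0.0)
abortable = props.get("is_abortable", False)
```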
Helper functions (in `superset/tasks/utils.py`):

Breaking Changes
None - this is a new feature.
TODO / Future Work
During migration, abort handlers will be added to tasks that currently don't support termination, and deduplication support will be added where relevant to prevent duplicate task execution. Also, new properties will likely be added to the `@task` decorator signature and the `TaskOptions` data class as needed (retries, max queue times, etc.).

TESTING INSTRUCTIONS
Unit Tests
Integration Tests
Manual Testing
Create and execute a task:
Test abort via UI: Navigate to Task List, click Cancel on an in-progress task
Test Redis pub/sub: Configure `SIGNAL_CACHE_CONFIG` and verify instant abort response.

ADDITIONAL INFORMATION