@michael-s-molina
Member

SUMMARY

This PR fixes a problem with the log table retention policy. Currently, there is no way to limit the number of records retained in that table, which causes problems in Superset when the table contains millions of rows. The table is queried from Superset's Welcome page to get recently modified items, so the inability to set a retention policy is a critical issue.

To keep backward compatibility, this PR does not change the current retention policy which is to preserve all records. Adding a default retention policy would be a good practice that we can do for 6.0.

TESTING INSTRUCTIONS

Configure the Celery task called prune_logs and check that the records are deleted according to the configured retention_period_days.
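
As a rough sketch of the configuration step above, a beat schedule entry could look like the following. The `prune_logs` task name and the `retention_period_days` keyword come from the task signature in this PR; the daily interval, the 180-day value, and the `CeleryConfig` / `CELERY_CONFIG` layout are assumptions for illustration. The schedule is a plain `timedelta` so the snippet needs nothing beyond the standard library, and broker settings are left out.

```python
from datetime import timedelta


class CeleryConfig:
    # Assumed layout: a config class exposed through CELERY_CONFIG below.
    beat_schedule = {
        "prune_logs": {
            "task": "prune_logs",  # task name registered in this PR
            "schedule": timedelta(days=1),  # assumed interval: once a day
            "kwargs": {"retention_period_days": 180},  # assumed retention value
        },
    }


CELERY_CONFIG = CeleryConfig
```

With an entry like this in place, rows older than the chosen number of days should disappear from the log table after the next scheduled run.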

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dosubot dosubot bot added the logging Creates a UI or API endpoint that could benefit from logging. label Mar 10, 2025
@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Fix Detected
Performance Memory-intensive ID loading ▹ view
Readability Magic Number Should Be Named Constant ▹ view
Performance Missing Celery Task Performance Guards ▹ view
Logging Generic exception log missing retention period context ▹ view
Files scanned
File Path Reviewed
superset/tasks/scheduler.py
superset/commands/logs/prune.py
superset/config.py

Comment on lines +59 to +68
ids_to_delete = (
    db.session.execute(
        sa.select(Log.id).where(
            Log.dttm
            < datetime.now() - timedelta(days=self.retention_period_days)
        )
    )
    .scalars()
    .all()
)

Memory-intensive ID loading category Performance

Tell me more
What is the issue?

Loading all IDs into memory at once could cause memory issues with large log tables.

Why this matters

For tables with millions of records to delete, this approach could exhaust available memory and crash the application.

Suggested change ∙ Feature Preview
def run(self) -> None:
    batch_size = 999
    total_deleted = 0
    start_time = time.time()
    
    while True:
        # Select only the next batch of IDs
        ids_to_delete = (
            db.session.execute(
                sa.select(Log.id)
                .where(Log.dttm < datetime.now() - timedelta(days=self.retention_period_days))
                .limit(batch_size)
            )
            .scalars()
            .all()
        )
        
        if not ids_to_delete:
            break
            
        result = db.session.execute(sa.delete(Log).where(Log.id.in_(ids_to_delete)))
        total_deleted += result.rowcount
        db.session.commit()
        
        logger.info(
            "Deleted %s rows from the logs table older than %s days",
            total_deleted,
            self.retention_period_days,
        )
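
The batching idea in this comment can be tried in isolation. The sketch below is not the code from the PR: it uses the standard-library `sqlite3` module and a hypothetical `logs` table with `id` and `dttm` columns, and the function names are made up for illustration. It selects at most 999 ids per round, deletes them, commits, and repeats until nothing older than the cutoff is left.

```python
import sqlite3
from datetime import datetime, timedelta

BATCH_SIZE = 999  # mirrors the SQLite IN clause limit mentioned in the PR


def prune_logs(conn: sqlite3.Connection, retention_period_days: int) -> int:
    """Delete rows older than the retention period, one batch at a time."""
    cutoff = (datetime.now() - timedelta(days=retention_period_days)).isoformat()
    total_deleted = 0
    while True:
        # Fetch only the next batch of ids, never the full list.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM logs WHERE dttm < ? LIMIT ?", (cutoff, BATCH_SIZE)
            )
        ]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        cursor = conn.execute(f"DELETE FROM logs WHERE id IN ({placeholders})", ids)
        total_deleted += cursor.rowcount
        conn.commit()  # commit per batch to keep transactions short
    return total_deleted


def demo() -> tuple[int, int]:
    """Build an in-memory table with old and recent rows, then prune it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, dttm TEXT)")
    now = datetime.now()
    old = [((now - timedelta(days=400)).isoformat(),) for _ in range(2500)]
    recent = [((now - timedelta(days=1)).isoformat(),) for _ in range(10)]
    conn.executemany("INSERT INTO logs (dttm) VALUES (?)", old + recent)
    conn.commit()
    deleted = prune_logs(conn, retention_period_days=365)
    remaining = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    return deleted, remaining
```

Calling `demo()` removes the 2500 old rows in three rounds and leaves the 10 recent ones. Memory use stays bounded by the batch size rather than by the number of expired rows.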

"""
Executes the prune command
"""
batch_size = 999 # SQLite has a IN clause limit of 999

Magic Number Should Be Named Constant category Readability

Tell me more
What is the issue?

The magic number 999 should be defined as a named constant at the module or class level.

Why this matters

Magic numbers make code harder to maintain and understand their purpose without the comment. A named constant makes the intent clear and provides a single point of change.

Suggested change ∙ Feature Preview
# At module or class level
SQLITE_IN_CLAUSE_LIMIT = 999

def run(self) -> None:
    batch_size = SQLITE_IN_CLAUSE_LIMIT

Comment on lines +148 to +149
@celery_app.task(name="prune_logs")
def prune_logs(retention_period_days: Optional[int] = None) -> None:

Missing Celery Task Performance Guards category Performance

Tell me more
What is the issue?

The prune_logs task lacks performance-related task options that could help manage resource consumption during log pruning operations.

Why this matters

Without proper task options like rate limiting or soft/hard time limits, large log pruning operations could consume excessive system resources or run indefinitely, potentially impacting other operations.

Suggested change ∙ Feature Preview

Add appropriate Celery task options to manage resource consumption:

@celery_app.task(
    name="prune_logs",
    soft_time_limit=3600,  # 1 hour soft timeout
    time_limit=3900,      # 1 hour + 5 min hard timeout
    rate_limit="1/hour"   # Limit to one execution per hour
)

try:
    LogPruneCommand(retention_period_days).run()
except CommandException as ex:
    logger.exception("An error occurred while pruning logs: %s", ex)

Generic exception log missing retention period context category Logging

Tell me more
What is the issue?

The exception log message is too generic and lacks context about the retention period being used.

Why this matters

During troubleshooting, it would be difficult to determine which retention period was active when the pruning failed, making debugging more time-consuming.

Suggested change ∙ Feature Preview
logger.exception(
    "An error occurred while pruning logs with retention period of %s days: %s",
    retention_period_days,
    ex
)
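
To see what the suggested message produces, here is a small standalone sketch using only the standard `logging` module. The `CommandException` class and the `failing_prune` helper are stand-ins invented for this example, not Superset's own, and the log output is captured in a string so it can be inspected.

```python
import io
import logging


class CommandException(Exception):
    """Stand-in for the exception type caught in the PR."""


def failing_prune(retention_period_days: int) -> None:
    # Hypothetical helper that always fails, to exercise the handler.
    raise CommandException("database unavailable")


def run_task(retention_period_days: int) -> str:
    """Run the failing prune and return everything that was logged."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("prune_logs_demo")
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    try:
        failing_prune(retention_period_days)
    except CommandException as ex:
        logger.exception(
            "An error occurred while pruning logs with retention period of %s days: %s",
            retention_period_days,
            ex,
        )
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

`run_task(30)` returns the formatted message followed by the traceback that `logger.exception` appends, so the retention period is visible next to the failure.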

@codecov
codecov bot commented Mar 10, 2025

Codecov Report

❌ Patch coverage is 32.60870% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.45%. Comparing base (76d897e) to head (091581f).
⚠️ Report is 2419 commits behind head on master.

Files with missing lines Patch % Lines
superset/commands/logs/prune.py 35.29% 22 Missing ⚠️
superset/tasks/scheduler.py 25.00% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #32572       +/-   ##
===========================================
+ Coverage   60.48%   83.45%   +22.96%     
===========================================
  Files        1931      548     -1383     
  Lines       76236    39358    -36878     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32847    -13267     
+ Misses      28017     6511    -21506     
+ Partials     2105        0     -2105     
Flag Coverage Δ
hive 48.16% <32.60%> (-1.00%) ⬇️
javascript ?
mysql 74.78% <32.60%> (?)
postgres 74.80% <32.60%> (?)
presto 51.73% <32.60%> (-2.07%) ⬇️
python 83.26% <32.60%> (+19.76%) ⬆️
sqlite 74.22% <32.60%> (?)
unit 61.10% <0.00%> (+3.47%) ⬆️

@rusackas rusackas requested a review from sadpandajoe March 10, 2025 21:19
@michael-s-molina michael-s-molina merged commit 89b6d7f into apache:master Mar 10, 2025
44 checks passed
michael-s-molina added a commit that referenced this pull request Mar 11, 2025
@michael-s-molina michael-s-molina added the v5.0 Label added by the release manager to track PRs to be included in the 5.0 branch label Mar 11, 2025
michael-s-molina added a commit that referenced this pull request Mar 17, 2025
@sadpandajoe sadpandajoe added the v4.1 Label added by the release manager to track PRs to be included in the 4.1 branch label Mar 26, 2025
sadpandajoe pushed a commit that referenced this pull request Mar 26, 2025
@landryb
Contributor

landryb commented Apr 9, 2025

@michael-s-molina a bit puzzled, doing a pip install from master branch here doesn't install the new superset/commands/logs subdir, is there something needed in the build system/python goo so that the prune.py file gets installed ?

without it, triggering a job on a celery worker blows up with:

Apr 09 10:07:02 demo celery[4084022]:   File "/srv/apps/superset/venv/lib/python3.11/site-packages/superset/tasks/scheduler.py", line 28, in <module>
Apr 09 10:07:02 demo celery[4084022]:     from superset.commands.logs.prune import LogPruneCommand
Apr 09 10:07:02 demo celery[4084022]: ModuleNotFoundError: No module named 'superset.commands.logs'

landryb added a commit to georchestra/ansible that referenced this pull request Apr 9, 2025
cf apache/superset#32572, for some reason that file is missing
when installing superset in a virtualenv.

that file is needed to properly start the celery worker
@michael-s-molina
Member Author

is there something needed in the build system/python goo so that the prune.py file gets installed ?

Nothing that I'm aware of.

@landryb
Contributor

landryb commented Apr 9, 2025

weird, probably something bogus on my side then, or a wrong invocation via ansible.. but that's reproducible in a new empty virtualenv, deploying the last commit from master (6b7394e78998):

$ ansible mygeorchestra -i host -u root -m pip -a 'name="git+https://github.com/apache/superset@6b7394e78998#egg=apache-superset" virtualenv=/usr/venv virtualenv_site_packages=true'
...
...
        "Building wheels for collected packages: apache-superset",
        "  Building wheel for apache-superset (pyproject.toml): started",
        "  Building wheel for apache-superset (pyproject.toml): finished with status 'done'",
        "  Created wheel for apache-superset: filename=apache_superset-0.0.0.dev0-py3-none-any.whl size=4617711 sha256=9c6c1087e3a9ce1b309c8632691511c3ab1a543100794ab3fdf2965f6eceee88",
        "  Stored in directory: /tmp/pip-ephem-wheel-cache-1h96rixi/wheels/a7/4e/65/283b7f2fcde75ded0fb99b186d00bf35378e26be358d2ca9ab",
        "Successfully built apache-superset",

and on the target venv where the resulting wheel is installed, the dir is missing:

[09/04 13:28] [email protected]:/srv/apps/superset $ls -d /usr/venv/lib/python3.11/site-packages/superset/commands/logs
ls: cannot access '/usr/venv/lib/python3.11/site-packages/superset/commands/logs': No such file or directory

@landryb
Contributor

landryb commented Apr 9, 2025

and same thing, doing a pip wheel . from a clone builds a wheel that doesn't contain the file:

/data/src/georchestra/superset-core $pip wheel --wheel-dir=/tmp/wheel .
/data/src/georchestra/superset-core $unzip -l /tmp/wheel/apache_superset-0.0.0.dev0-py3-none-any.whl |grep /prune

the last command yields nothing (and ofc all the other files are here)

the RECORD file inside the wheel doesn't contain a line for the missing /prune.py file.

@michael-s-molina
Member Author

@landryb I downloaded the 5.0.0rc2 Pypi package and that folder is indeed not there 🤔 Could you open an issue for this and tag me? I'll add it to the 5.0.0 board.

@michael-s-molina
Member Author

@landryb I discovered why the folder was missing. It was because the folder was missing the __init__.py file. I provided a fix in #33059.
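
A quick way to catch this kind of omission before building a wheel is to scan the source tree for directories that contain Python files but no `__init__.py`. The sketch below assumes the build discovers packages by the presence of `__init__.py`, as the diagnosis above implies; the function names are made up for illustration, and the demo builds a throwaway tree that mirrors the layout from this thread.

```python
import tempfile
from pathlib import Path


def dirs_missing_init(package_root: Path) -> list[Path]:
    """Return subdirectories that hold .py files but no __init__.py."""
    missing = []
    for directory in sorted(p for p in package_root.rglob("*") if p.is_dir()):
        has_python = any(directory.glob("*.py"))
        if has_python and not (directory / "__init__.py").exists():
            missing.append(directory.relative_to(package_root))
    return missing


def demo() -> list[str]:
    # Throwaway tree: superset/commands/logs/prune.py without an __init__.py.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "superset"
        (root / "commands" / "logs").mkdir(parents=True)
        for init_dir in (root, root / "commands"):
            (init_dir / "__init__.py").touch()
        (root / "commands" / "logs" / "prune.py").touch()
        return [path.as_posix() for path in dirs_missing_init(root)]
```

In the demo tree, `demo()` reports `commands/logs`, the same directory that was missing from the installed package.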

@mistercrunch mistercrunch added 🍒 4.1.3 Cherry-picked to 4.1.3 🍒 5.0.0 Cherry-picked to 5.0.0 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels labels Jul 29, 2025
cyber-jessie added a commit to CybercentreCanada/superset that referenced this pull request Jan 8, 2026
* chore: bump base image in Dockerfile with `ARG PY_VER=3.11.11-slim-bookworm` (apache#32780)

* chore: Revert "chore: bump base image in Dockerfile with `ARG PY_VER=3.11.11-slim-bookworm`" (apache#32782)

* fix(chart data): removing query from /chart/data payload when accessing as guest user (apache#30858)

(cherry picked from commit dd39138)

* fix: upgrade to 3.11.11-slim-bookworm to address critical vulnerabilities (apache#32240)

(cherry picked from commit ad05732)

* fix(model/helper): represent RLS filter clause in proper textual SQL string (apache#32406)

Signed-off-by: hainenber <[email protected]>
(cherry picked from commit ff0529c)

* fix: Log table retention policy (apache#32572)

(cherry picked from commit 89b6d7f)

* fix(welcome): perf on distinct recent activities (apache#32608)

(cherry picked from commit 832e028)

* fix(log): Update recent_activity by event name (apache#32681)

(cherry picked from commit 449f51a)

* fix: Signature of Celery pruner jobs (apache#32699)

(cherry picked from commit df06bdf)

* fix(logging): missing path in event data (apache#32708)

(cherry picked from commit cd5a943)

* fix(fe/dashboard-list): display modifier info for `Last modified` data (apache#32035)

Signed-off-by: hainenber <[email protected]>
(cherry picked from commit 88cf2d5)

* fix: make packages PEP 625 compliant (apache#32866)

Co-authored-by: Michael S. Molina <[email protected]>
(cherry picked from commit 6e02d19)

* all cccs changes

* fix: Downgrade to marshmallow<4 (apache#33216)

* fix(log): store navigation path to get correct logging path (apache#32795)

(cherry picked from commit 4a70065)

* fix(pivot-table): Revert "fix(Pivot Table): Fix column width to respect currency config (apache#31414)" (apache#32968)

(cherry picked from commit a36e636)

* fix: improve error type on parse error (apache#33048)

(cherry picked from commit ed0cd5e)

* fix(plugin-chart-echarts): remove erroneous upper bound value (apache#32473)

(cherry picked from commit 5766c36)

* fix(pinot): revert join and subquery flags (apache#32382)

(cherry picked from commit 822d72c)

* fix: loading examples from raw.githubusercontent.com fails with 429 errors (apache#33354)

(cherry picked from commit f045a73)

* chore: creating 4.1.3rc1 change log and updating frontend json

(cherry picked from commit 72cf9b6)

* chore(🦾): bump python sqlglot 26.1.3 -> 26.11.1 (apache#32745)

Co-authored-by: GitHub Action <[email protected]>
(cherry picked from commit 66c1a6a)

* chore(🦾): bump python h11 0.14.0 -> 0.16.0 (apache#33339)

Co-authored-by: GitHub Action <[email protected]>
(cherry picked from commit 8252686)

* docs: CVEs fixed on 4.1.2 (apache#33435)

(cherry picked from commit 8a8fb49)

* feat(api): Added uuid to list api calls (apache#32414)

(cherry picked from commit 8decc9e)

* fix(table-chart): time shift is not working (apache#33425)

(cherry picked from commit dc44748)

* fix(Sqllab):  Autocomplete got stuck in UI when open it too fast (apache#33522)

(cherry picked from commit b4e2406)

* chore: update Dockerfile - Upgrade to 3.11.12 (apache#33612)

(cherry picked from commit f0b6e87)

* chore: updating 4.1.3rc2 change log

* Select all Drag and Drop (#546)

* add a select all button for the dnd select

* remove cypress

* chore(deps): bump cryptography from 43.0.3 to 44.0.1 (apache#32236)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
(cherry picked from commit fa09d81)

* fix: Adds missing __init__ file to commands/logs (apache#33059)

(cherry picked from commit c1159c5)

* fix: Saved queries list break if one query can't be parsed (apache#34289)

(cherry picked from commit 1e5a4e9)

* chore: Adds 4.1.4RC1 data to CHANGELOG.md and UPDATING.md

* tag bump for select all drag and drop

* Fix package-lock.json

* Add db migration, bump Docker image base

* gevent for gunicorn

* remove threads and make worker-connections configurable

* Fix package-lock.json

* tag bump for cccs build

* Remove CCCS Dataset Explorer (#550)

* tag bump for CCCS build

---------

Signed-off-by: hainenber <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: gpchandran <[email protected]>
Co-authored-by: Joe Li <[email protected]>
Co-authored-by: Jack <[email protected]>
Co-authored-by: Đỗ Trọng Hải <[email protected]>
Co-authored-by: Michael S. Molina <[email protected]>
Co-authored-by: JUST.in DO IT <[email protected]>
Co-authored-by: Michael S. Molina <[email protected]>
Co-authored-by: Andreas Motl <[email protected]>
Co-authored-by: Ville Brofeldt <[email protected]>
Co-authored-by: Yuri <[email protected]>
Co-authored-by: Maxime Beauchemin <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Action <[email protected]>
Co-authored-by: sha174n <[email protected]>
Co-authored-by: Paul Rhodes <[email protected]>
Co-authored-by: Rafael Benitez <[email protected]>
Co-authored-by: cccs-RyanK <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: cyber-jessie <[email protected]>

Labels

🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels logging Creates a UI or API endpoint that could benefit from logging. size/L v4.1 Label added by the release manager to track PRs to be included in the 4.1 branch v5.0 Label added by the release manager to track PRs to be included in the 5.0 branch 🍒 4.1.3 Cherry-picked to 4.1.3 🍒 4.1.4 🍒 5.0.0 Cherry-picked to 5.0.0

5 participants