
fix(ds query): isolate temp table names#1321

Merged
shcheklein merged 1 commit into main from isolate-temp-table-names
Oct 9, 2025

Conversation

@shcheklein (Contributor) commented Sep 7, 2025

Updated: might also fix #722

When we run multiple joins within the same chain, this recursive line in SQLJoin:

temp_tables.extend(dq.temp_table_names)

can leave us with a list of 8K+ items, with a great many duplicates.

This means the query can run for a very long time at the end, when the temp tables are cleaned up.
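
To see how a recursive extend like this can inflate the list, here is a toy model. `ToyQuery`, `build_chain`, and their fields are hypothetical and far simpler than DataChain's real SQLJoin. The sketch assumes only that each join extends its own list with the lists of its operands, and that nothing resets those lists between runs.

```python
class ToyQuery:
    """Hypothetical stand-in for a query node; not DataChain's real class."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left    # left operand of a join (None for a base query)
        self.right = right  # right operand of a join (None for a base query)
        self.temp_table_names = []  # assumption: never reset between runs

    def apply_steps(self):
        if self.left is not None and self.right is not None:
            for dq in (self.left, self.right):
                dq.apply_steps()
                # Same shape as the line quoted above: pull in the operand's names.
                self.temp_table_names.extend(dq.temp_table_names)
            # In this model the join itself creates one temp table.
            self.temp_table_names.append(f"tmp_{self.name}")
        return self


def build_chain(n_joins):
    """Build a left-deep chain: base0 JOIN base1 JOIN ... with n_joins joins."""
    query = ToyQuery("base0")
    for i in range(1, n_joins + 1):
        query = ToyQuery(f"join{i}", left=query, right=ToyQuery(f"base{i}"))
    return query


chain = build_chain(5)
for run in (1, 2, 3):  # repeated executions of the same chain
    chain.apply_steps()
    names = chain.temp_table_names
    print(f"run {run}: {len(names)} names, {len(set(names))} unique")
```

In this toy model the list grows from 5 to 20 to 55 entries over three runs while only 5 names are unique. The real numbers depend on how DataChain clones and executes queries, which the model does not attempt to reproduce.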

Script to reproduce this. Note that we run both save and show at the end, which essentially doubles the list as well.

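from dotenv import load_dotenv

import datachain as dc


# Load credentials; the bucket is read with anon=False below.
load_dotenv("local/.env.test")

# Read the bucket listing once and persist it; every chain below derives from it.
all_files = dc.read_storage("gs://datachain-demo", anon=False).mutate(s3_path=dc.C("file.path")).persist()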

laion_files = all_files.filter(dc.C("file.path").glob("50k-laion-files/*")).mutate(laion_file=dc.C("file"))
aspset510_files = all_files.filter(dc.C("file.path").glob("aspset510/*")).mutate(aspset510_file=dc.C("file"))
coco2017_files = all_files.filter(dc.C("file.path").glob("coco2017/*")).mutate(coco2017_file=dc.C("file"))
datacomp_small_files = all_files.filter(dc.C("file.path").glob("datacomp-small/*")).mutate(datacomp_small_file=dc.C("file"))
open_images_v6_files = all_files.filter(dc.C("file.path").glob("open-images-v6/*")).mutate(open_images_v6_file=dc.C("file"))
nlp_cnn_stories_files = all_files.filter(dc.C("file.path").glob("nlp-cnn-stories/*")).mutate(nlp_cnn_stories_file=dc.C("file"))

raw_data = (
    laion_files
    .merge(aspset510_files, on="s3_path", inner=True)
    .merge(coco2017_files, on="s3_path", inner=True)
    .merge(datacomp_small_files, on="s3_path", inner=True)
    .merge(open_images_v6_files, on="s3_path", inner=True)
    .merge(nlp_cnn_stories_files, on="s3_path", inner=False)
    .select(
        "s3_path",
        "laion_file",
        "aspset510_file",
        "coco2017_file",
        "datacomp_small_file",
        "open_images_v6_file",
        "nlp_cnn_stories_file",
    )
)

raw_data.save("datachain-demo-merge")
raw_data.show(1000)

TODO:

  • Run all tests
  • Add description
  • Add proper tests
  • Confirm semantics one more time

Summary by Sourcery

Isolate temporary table names per dataset clone and enforce no duplicates to prevent runaway temp table list growth, and fix cleanup to target the cloned query instance.

Bug Fixes:

  • Add assertion to ensure temp_table_names has no duplicates before cleanup
  • Reset temp_table_names on dataset.clone to avoid carrying over names from previous queries
  • Modify exec to clean up temp tables on the cloned query instance rather than the original
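
The three fixes listed above can be sketched together in a toy class. `FakeWarehouse` and `ToyDatasetQuery` are hypothetical stand-ins, not the code in src/datachain/query/dataset.py; the sketch assumes only the behavior described in the bullets.

```python
class FakeWarehouse:
    """Hypothetical stand-in that only records which tables were dropped."""

    def __init__(self):
        self.dropped = []

    def cleanup_tables(self, names):
        self.dropped.extend(names)


class ToyDatasetQuery:
    """Illustrative only; not the real DataChain query class."""

    def __init__(self, warehouse, steps=()):
        self.warehouse = warehouse
        self.steps = list(steps)
        self.temp_table_names = []

    def clone(self):
        obj = ToyDatasetQuery(self.warehouse, self.steps)
        # Every clone starts with an empty list, so names from earlier
        # runs are not carried over.
        obj.temp_table_names = []
        return obj

    def apply_steps(self):
        # Pretend each step creates one temp table.
        for i, step in enumerate(self.steps):
            self.temp_table_names.append(f"tmp_{step}_{i}")
        return self

    def cleanup(self):
        # Duplicates would mean names leaked in from another query instance.
        assert len(self.temp_table_names) == len(set(self.temp_table_names))
        self.warehouse.cleanup_tables(self.temp_table_names)
        self.temp_table_names = []

    def exec(self):
        # Clone before the try block so `query` is always bound in `finally`.
        query = self.clone()
        try:
            query.apply_steps()
        finally:
            # Clean up the tables created by the clone, not by `self`.
            query.cleanup()
        return query


warehouse = FakeWarehouse()
original = ToyDatasetQuery(warehouse, steps=["filter", "join"])
original.exec()
original.exec()
print(warehouse.dropped)
print(original.temp_table_names)
```

Because each run works on a fresh clone, the original query keeps an empty list no matter how many times it is executed, and each run drops only the tables it created.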

@sourcery-ai bot (Contributor) commented Sep 7, 2025

Reviewer's Guide

This PR isolates temporary table names by resetting the list on each cloned dataset and ensuring no duplicates, then performs cleanup on the cloned instance rather than the original.

Sequence diagram for updated query execution and cleanup

sequenceDiagram
participant "Dataset"
participant "Cloned Dataset"
participant "Metastore"
participant "Warehouse"

"Dataset"->>"Cloned Dataset": clone(new_table=True)
"Cloned Dataset"->>"Cloned Dataset": apply_steps()
"Cloned Dataset"->>"Metastore": cleanup_tables(temp_table_names)
"Cloned Dataset"->>"Warehouse": cleanup_tables(temp_table_names)

Class diagram for Dataset temp_table_names handling

classDiagram
class Dataset {
  +List temp_table_names
  +clone(new_table=True)
  +cleanup()
  +exec()
}
Dataset : clone() resets temp_table_names to []
Dataset : cleanup() asserts no duplicates in temp_table_names

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Enforce uniqueness of temp_table_names before cleanup | Add assertion in cleanup() to verify no duplicate entries | src/datachain/query/dataset.py |
| Reset temp_table_names when cloning a dataset | Initialize temp_table_names to an empty list in clone() | src/datachain/query/dataset.py |
| Perform cleanup on cloned query instance | Move query cloning outside try, then call cleanup() on cloned object instead of self in exec() | src/datachain/query/dataset.py |

Possibly linked issues

  • It hangs for cleaning up tables #722: The PR's changes to isolate and clean up temporary table names ensure the Dataset object remains stateless and its resources are managed, preventing the 'broken dataset' state after an unsuccessful operation.

@cloudflare-workers-and-pages bot commented Sep 7, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: f5a0558
Status: ✅ Deploy successful!
Preview URL: https://aab39580.datachain-documentation.pages.dev
Branch Preview URL: https://isolate-temp-table-names.datachain-documentation.pages.dev

@shcheklein shcheklein force-pushed the isolate-temp-table-names branch from ee46cea to be6eeaf September 7, 2025 17:47
@codecov bot commented Sep 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.72%. Comparing base (13fe476) to head (f5a0558).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1321   +/-   ##
=======================================
  Coverage   87.72%   87.72%           
=======================================
  Files         160      160           
  Lines       14995    14997    +2     
  Branches     2156     2156           
=======================================
+ Hits        13154    13156    +2     
  Misses       1349     1349           
  Partials      492      492           

| Flag | Coverage Δ |
| --- | --- |
| datachain | 87.67% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/datachain/query/dataset.py | 93.62% <100.00%> (+0.01%) ⬆️ |

@dreadatour (Contributor)

Fixes for the tests in separate PR: #1322

@shcheklein (Contributor, Author)

> Fixes for the tests in separate PR: #1322

Yep, thanks @dreadatour. I'll keep looking into this PR. It is probably right for the current approach with temp tables, but I need to understand the whole temp table mechanics a bit better.

@shcheklein shcheklein self-assigned this Sep 8, 2025
@shcheklein shcheklein added the bug and performance labels Sep 8, 2025
@shcheklein shcheklein force-pushed the isolate-temp-table-names branch from 33e80f8 to f91da0a October 8, 2025 21:44
@shcheklein shcheklein force-pushed the isolate-temp-table-names branch from f91da0a to f5a0558 October 9, 2025 00:29
# This is needed to always use a new connection with all metastore and warehouse
# implementations, as errors may close or render unusable the existing
# connections.
assert len(self.temp_table_names) == len(set(self.temp_table_names))

@shcheklein (Contributor, Author)

C: I'm trying to add this assertion instead of tests (we should not be getting duplicates).

In tests we have an additional check that no tables are left behind.
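
The assertion compares the list length with the set length, so a failure does not say which names repeat. If it ever fires, a variant that reports the offending names might be easier to debug. A minimal sketch, where `find_duplicates` and `assert_unique` are hypothetical helper names that are not part of this PR:

```python
from collections import Counter


def find_duplicates(names):
    """Return a mapping of each repeated name to its number of occurrences."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}


def assert_unique(names):
    """Fail with a message that lists the duplicated names, if any."""
    duplicates = find_duplicates(names)
    assert not duplicates, f"duplicate temp table names: {duplicates}"


assert_unique(["tmp_a", "tmp_b"])  # passes silently
print(find_duplicates(["tmp_a", "tmp_b", "tmp_a"]))
```

The plain length comparison is cheaper and enough to detect the problem; the extra detail only matters when diagnosing a failure.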

@shcheklein (Contributor, Author)

cc @dreadatour: not sure if it makes sense to add a specific, complicated test ... this should be working better

@shcheklein shcheklein marked this pull request as ready for review October 9, 2025 00:45
@sourcery-ai sourcery-ai bot (Contributor) left a comment

Hey there - I've reviewed your changes and they look great!



@shcheklein shcheklein merged commit de18e0b into main Oct 9, 2025
38 checks passed
@shcheklein shcheklein deleted the isolate-temp-table-names branch October 9, 2025 02:49
Labels

bug, performance

Development

Successfully merging this pull request may close these issues.

It hangs for cleaning up tables

2 participants