-
Notifications
You must be signed in to change notification settings - Fork 114
Distributed compute refactoring #1047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Deploying datachain-documentation with
|
Latest commit: |
ef13025
|
Status: | ✅ Deploy successful! |
Preview URL: | https://7ff4cdb2.datachain-documentation.pages.dev |
Branch Preview URL: | https://distributed-refactoring.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a refactoring of the distributed compute components along with several improvements and cleanups across tests, UDF dispatching, SQL query handling, and CLI commands. Key changes include:
- Consolidation and cleanup of test functions and internal UDF dispatch APIs.
- Refinements in error handling, queue management, and process parallelism to improve robustness.
- Updates in SQL query constructions, particularly in the metastore update, that require careful validation.
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/unit/test_dispatch.py | Consolidation of imports and removal of a redundant test case. |
tests/test_query_e2e.py | Updated expected process return codes in the interruption test. |
src/datachain/utils.py | Minor improvements in parallel processing configuration. |
src/datachain/query/utils.py | Stricter type checking for query column retrieval. |
src/datachain/query/udf.py | Removal of the unused abstract static method run_worker. |
src/datachain/query/dispatch.py | Refactoring of UDF dispatch logic with new catalog property and queue handling. |
src/datachain/query/dataset.py | Improved exception handling and use of walrus operator in dataset population. |
src/datachain/query/batch.py | Deferred import for SELECT_BATCH_SIZE moved into the function. |
src/datachain/lib/listing.py | Updated list function naming to better reflect its purpose. |
src/datachain/data_storage/warehouse.py | New method added to select dataset rows from a list of IDs. |
src/datachain/data_storage/metastore.py | Revised dataset version update with a new SQL WHERE clause. |
src/datachain/cli/parser/utils.py | Updated internal commands for argument parsing. |
src/datachain/cli/init.py | Adjustments to handle UDF dispatch commands. |
Comments suppressed due to low confidence (5)
tests/unit/test_dispatch.py:38
- The removal of the test_send_stop_signal_to_workers function reduces the test coverage for UDFDispatcher's behavior when sending stop signals. Consider reintroducing a test or ensuring coverage through alternative tests.
- def test_send_stop_signal_to_workers():
tests/test_query_e2e.py:200
- The expected process return code has been changed from 1 to 11. Please verify that the update to (interrupt_exit_code, 11) is intentional and consistent with the underlying interruption handling logic.
- if process.returncode not in (interrupt_exit_code, 1):
src/datachain/query/udf.py:44
- The abstract static method run_worker has been removed. Ensure that its removal aligns with the overall UDF design and that no consumers rely on this method for worker execution.
- @staticmethod
src/datachain/query/dispatch.py:166
- The call to self.run_udf_parallel within run_udf may unintentionally recurse into the same method rather than invoking the separate parallel processing logic defined later. Consider reviewing the method naming or call structure to avoid potential confusion or recursion issues.
+ self.run_udf_parallel(n_workers, input_rows, download_cb, processed_cb, generated_cb)
src/datachain/query/utils.py:25
- The updated check now raises an error if the column is not an instance of Column, which might exclude valid ColumnElement types like TextClause. Confirm that this stricter type check is intended.
+ if col is None or not isinstance(col, Column):
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1047 +/- ##
==========================================
- Coverage 88.14% 88.06% -0.09%
==========================================
Files 146 146
Lines 12508 12572 +64
Branches 1742 1758 +16
==========================================
+ Hits 11025 11071 +46
- Misses 1059 1067 +8
- Partials 424 434 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
for more information, see https://pre-commit.ci
In this PR I am mostly updating internals for the distributed compute refactoring needed for SaaS. Also some related refactoring was made and some tests were added, more tests will be added in the follow-up PRs.