-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experimental support for scanning Delta Tables inside Mount Points #1095
Conversation
for path in table_paths: | ||
table = Table( | ||
catalog="hive_metastore", | ||
database="", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Database can't be empty. Infer it from path and default to "default"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really want to do the mapping of paths -> databases/tables there. As this will be handled later in the process.
What if instead just using the mount name as a database ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it would even better to default the database of this to 'mounts' or something that would be easily searcheable for downstream applications that must use this table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and what if there are duplicate table names?...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that can totally happen ... the strategy of picking the parent folder as the table name is irrelevant then.
Maybe let's simply use the table path in mount as the table name instead ? ...
@@ -47,6 +47,10 @@ class WorkspaceConfig: # pylint: disable=too-many-instance-attributes | |||
# Whether the assessment should capture a specific list of databases, if not specified, it will list all databases. | |||
include_databases: list[str] | None = None | |||
|
|||
# Whether the tables in mounts crawler should crawl a specific list of mounts. | |||
# If not specified, it will list all moubts. | |||
include_mounts: list[str] | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, alternatively, we can ask for it during the installer
def test_mount_listing_one_table(): | ||
client = create_autospec(WorkspaceClient) | ||
client.dbutils.fs.ls.return_value = [ | ||
FileInfo("/mnt/lmao/_delta_log", "_delta_log", "", ''), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use different name in public code ;)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1095 +/- ##
==========================================
+ Coverage 89.80% 89.85% +0.05%
==========================================
Files 61 61
Lines 7249 7347 +98
Branches 1300 1318 +18
==========================================
+ Hits 6510 6602 +92
- Misses 475 481 +6
Partials 264 264 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few bugs
if delta_log_folders is None: | ||
delta_log_folders = {} | ||
|
||
entries = self._dbutils.fs.ls(root_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From below code, it would be wise to rename it as paths. and rename "path" variable from line 326 to directory_name to precise.
elif self._is_parquet(entry.path): | ||
delta_log_folders[path] = TableInMount(format="PARQUET", is_partitioned=False) | ||
else: | ||
self._find_delta_log_folders(entry.path, delta_log_folders) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems dangerous logic, as you're forcing to use heap structure in recursion, that can slow down the process, if you've huge list of paths. Line no 328 will also contribute more to this slowdown. Rather, use a logic of processing single path, process it till leaf path, make TableInMount object there and return from stack to accumulate.
❌ 131/133 passed, 4 flaky, 2 failed, 20 skipped, 1h54m11s total ❌ test_running_real_assessment_job: TimeoutError: timed out after 0:20:00: (21m2.648s)
❌ test_partitioned_tables: TimeoutError: Timed out after 0:05:00 (21m1.32s)
Flaky tests:
Running from acceptance #1882 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix db name detection
92847a6
to
e442123
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
* Ensure proper sequencing of view migrations ([#1157](#1157)). In this release, we have introduced a `views_migrator` module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue [#1132](#1132). The module contains two main classes: `ViewToMigrate` and `ViewsMigrator`. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, `__hash__`, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file `tables_and_views.json` for unit testing and added a `views_migrator` module that takes a `TablesCrawler` object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references. This release is focused on improving the sequencing of view migrations and is accompanied by appropriate tests. * Experimental support for scanning Delta Tables inside Mount Points ([#1095](#1095)). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new `TablesInMounts` crawler. Users can now scan specific mount points using the `--include-mounts` flag and include Parquet files in the scan results with the `--include-parquet-files` flag. Additionally, the `--filter-paths` flag allows for filtering paths in a mount point and the `--max-depth` flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use `databricks-labs-lsql~=0.3.0`. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases. * Fixed `NULL` values in `ucx.views.table_format` to have `UNKNOWN` value instead ([#1156](#1156)). This commit includes a fix for handling NULL values in the `table_format` column of Views in the `ucx.views.table_format` module. Previously, NULL values were displayed as-is, but now they will be replaced with the string "UNKNOWN". This change is part of the fix for issue [#115](#115) * Fixing run_workflow functionality for better error handling ([#1159](#1159)). In this release, the `run_workflow` method in the `workflows.py` file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, `job_initial_run`, has been added to initiate a job run and return the run ID, raising a `NotFound` exception if the job run is not found. The `run_workflow` functionality in the `WorkflowsInstall` module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an `OperationFailed` exception is raised during the installation process. These changes improve the robustness and stability of the system. * Use experimental Permissions Migration API also for Legacy Table ACLs ([#1161](#1161)). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now being utilized for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The `TableAclSupport` import and class have been removed, as they are no longer needed. A new `apply_to_renamed_groups` method has been added for production usage, and a `apply_to_groups_with_different_names` method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The `permission_manager` parameter has been removed from several test functions in the `test_generic.py` file and replaced with the `MigrationState` class, which is used directly with the `WorkspaceClient` object to apply permissions to groups with different names. The `test_some_entitlements` function in the `test_scim.py` file has also been updated to use the `MigratedGroup` class and the `MigrationState` class's `apply_to_groups_with_different_names` method. Finally, new tests for the Permissions Migration API have been added to the `test_tacl.py` file in the `tests/integration/workspace_access` directory to verify the behavior of the Permissions Migration API when migrating different grants.
* Ensure proper sequencing of view migrations ([#1157](#1157)). In this release, we have introduced a `views_migrator` module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue [#1132](#1132). The module contains two main classes: `ViewToMigrate` and `ViewsMigrator`. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, `__hash__`, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file `tables_and_views.json` for unit testing and added a `views_migrator` module that takes a `TablesCrawler` object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references. This release is focused on improving the sequencing of view migrations and is accompanied by appropriate tests. * Experimental support for scanning Delta Tables inside Mount Points ([#1095](#1095)). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new `TablesInMounts` crawler. Users can now scan specific mount points using the `--include-mounts` flag and include Parquet files in the scan results with the `--include-parquet-files` flag. Additionally, the `--filter-paths` flag allows for filtering paths in a mount point and the `--max-depth` flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use `databricks-labs-lsql~=0.3.0`. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases. * Fixed `NULL` values in `ucx.views.table_format` to have `UNKNOWN` value instead ([#1156](#1156)). This commit includes a fix for handling NULL values in the `table_format` column of Views in the `ucx.views.table_format` module. Previously, NULL values were displayed as-is, but now they will be replaced with the string "UNKNOWN". This change is part of the fix for issue [#115](#115) * Fixing run_workflow functionality for better error handling ([#1159](#1159)). In this release, the `run_workflow` method in the `workflows.py` file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, `job_initial_run`, has been added to initiate a job run and return the run ID, raising a `NotFound` exception if the job run is not found. The `run_workflow` functionality in the `WorkflowsInstall` module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an `OperationFailed` exception is raised during the installation process. These changes improve the robustness and stability of the system. * Use experimental Permissions Migration API also for Legacy Table ACLs ([#1161](#1161)). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now being utilized for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The `TableAclSupport` import and class have been removed, as they are no longer needed. A new `apply_to_renamed_groups` method has been added for production usage, and a `apply_to_groups_with_different_names` method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The `permission_manager` parameter has been removed from several test functions in the `test_generic.py` file and replaced with the `MigrationState` class, which is used directly with the `WorkspaceClient` object to apply permissions to groups with different names. The `test_some_entitlements` function in the `test_scim.py` file has also been updated to use the `MigratedGroup` class and the `MigrationState` class's `apply_to_groups_with_different_names` method. Finally, new tests for the Permissions Migration API have been added to the `test_tacl.py` file in the `tests/integration/workspace_access` directory to verify the behavior of the Permissions Migration API when migrating different grants.
Changes
This feature adds a new crawler that scans inside mount points for Delta Tables.
Functionality
Required
Nice to have