
[FEATURE]: History log encoder for tables #3061

Closed
1 task done
Tracked by #2074
JCZuurmond opened this issue Oct 24, 2024 · 0 comments · Fixed by #3083 or #3373
Assignees: JCZuurmond
Labels: enhancement (New feature or request), needs-triage

Comments

@JCZuurmond (Member)

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

To create a history log from a table, we need an encoder that captures the failures.

Proposed Solution

Create a history log encoder for tables that encodes the failures; a rough sketch follows.
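
A minimal sketch of what such an encoder could look like, assuming a dataclass-based record model; every class and field name here is a hypothetical illustration, not the eventual UCX implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Simplified stand-in for the inventory table record."""
    catalog: str
    database: str
    name: str


@dataclass
class Historical:
    """One history-log row: object identity plus any encoded failures."""
    object_type: str
    object_id: list[str]
    failures: list[str] = field(default_factory=list)


class TableHistoryEncoder:
    """Encode a table and its migration failures into a history record."""

    def encode(self, table: Table, failures: list[str]) -> Historical:
        return Historical(
            object_type="Table",
            object_id=[table.catalog, table.database, table.name],
            failures=failures,
        )


encoder = TableHistoryEncoder()
record = encoder.encode(Table("hive_metastore", "sales", "orders"), ["Pending migration"])
```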

Additional Context

No response

@JCZuurmond added the `enhancement` (New feature or request) and `needs-triage` labels Oct 24, 2024
@JCZuurmond added this to UCX Oct 24, 2024
@github-project-automation bot moved this to Triage in UCX Oct 24, 2024
@JCZuurmond self-assigned this Oct 25, 2024
@JCZuurmond moved this from Triage to Archive in UCX Oct 28, 2024
nfx pushed a commit that referenced this issue Oct 28, 2024
## Changes
Add pipeline progress encoder

### Linked issues

Resolves #3061
Resolves #3064

### Functionality

- [x] modified existing workflow: `migration-progress`

### Tests

- [x] added unit tests
@nfx mentioned this issue Oct 30, 2024
nfx added a commit that referenced this issue Oct 30, 2024
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). In this release, we have added a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration process without executing it. This feature also introduces the `hms-fed` flag, allowing migration of HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file has been updated to include new parameters, `sql_backend` and `inventory_database`, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new `retrieve` method has been added to the `ACLMigrator` class to retrieve a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. We have also introduced a new `inferred_grants` table in the deployment schema to store inferred grants during the migration process.
* Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). In this release, we introduce a new class `WorkspacePathOwnership` in the `owners.py` module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of `Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as inputs. It has methods to infer the owner from the first `CAN_MANAGE` permission level in the access control list. We also added a new property `workspace_path_ownership` to the existing `HiveMetastoreContext` class, which returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to `test_owners.py`. The new tests, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. The `AdministratorLocator` is used to locate the administrators group for the workspace and the `PermissionLevel` class is used to specify the permission level for the notebook permissions.
* Added `mosaicml-streaming` to known list ([#3029](#3029)). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931) and enhances the system's ability to identify and work with a wider variety of packages.
* Added `msal-extensions` to known list ([#3030](#3030)). In this release, we have added support for two new packages, `msal-extensions` and `portalocker`, to our project. The `msal-extensions` package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The `portalocker` package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications.
* Added `multimethod` to known list ([#3031](#3031)). In this release, we have added support for the `multimethod` package to the library. The entry has been added to the `known.json` file, which partially resolves issue [#193](#193).
* Added `murmurhash` to known list ([#3032](#3032)). A new hash function, MurmurHash, has been added to the library's supported list, addressing part of issue [#1931](#1931). The entry includes two variants, `murmurhash` and `murmurhash.about`, with distinct roles: `murmurhash` offers the core hashing functionality, while `murmurhash.about` contains metadata or documentation related to the MurmurHash function. This integration enables developers to leverage MurmurHash for data processing tasks.
* Added `ninja` to known list ([#3050](#3050)). In this release, we have added Ninja, a fast, lightweight build system, to the known list in the `known.json` file. This change partially resolves issue [#1931](#1931). It does not modify any existing functionality or introduce new methods; the alteration is limited to including Ninja in the known list, improving the management and identification of components within the project.
* Added `nvidia-ml-py` to known list ([#3051](#3051)). In this release, we have added support for the `nvidia-ml-py` package to our project. This addition consists of two components: `example` and `pynvml`. `example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This is a step towards resolving issue [#1931](#1931).
* Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard.
* Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue [#3058](#3058), which was related to grant progress encoding. The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database.
* Added table progress encoder ([#3083](#3083)). In this release, we've added a table progress encoder to the `WorkflowTask` context to enhance the tracking of table-related operations in the `migration-progress` workflow. This new encoder, implemented in the `TableProgressEncoder` class, is connected to the `sql_backend`, `table_ownership`, and `migration_status_refresher` objects. The `GrantsProgressEncoder` class has been refactored to `GrantProgressEncoder`, with additional parameters for improved encoding of grants. We've also introduced the `refresh_table_migration_status` task to scan and record the migration status of tables and views in the inventory, storing results in the `$inventory.migration_status` inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues [#3061](#3061) and [#3064](#3064); a sketch of the failure derivation appears after these release notes.
* Combine static code analysis results with historical job snapshots ([#3074](#3074)). In this release, we have added a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This method is used to track the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method which only tracked the progress of grants. The `JobsProgressEncoder` method takes in additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs and is used in the `grants_progress` method to track the progress of jobs in the context of a workflow task. We have also added a new unit test for the `JobsProgressEncoder` class in the `databricks.labs.ucx` project to ensure that the encoding of job information works as expected with different types of failures and job details. Additionally, this revision introduces the ability to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class and provides additional functionality for tracking the progress of jobs.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes in `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. Additionally, the `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and connect with `WorkspacePathOwnership`, enhancing the tool's functionality by determining access ownership in DirectFS and improving overall system security and permissions management. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include the addition of a `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and `run_id`. They also include a `TableMigrationStatus` class with a new class variable `__id_attributes__` to specify the attributes used to uniquely identify a table, and a `destination()` method on that class to return the fully qualified name of the destination table (see the `__id_attributes__` sketch after these release notes). Additionally, unit and integration tests have been added and updated to ensure the functionality works as expected. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class, and the `__id_attributes__` class variable has also been added to these classes to specify the attributes used to uniquely identify them.
* Determine ownership of tables based on grants and source code ([#3066](#3066)). In this release, changes have been made to the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class `LegacyQueryOwnership` has been added to the `databricks.labs.ucx.framework.owners` module to determine the owner of a table based on the queries that write to it. The `TableOwnership` class has been updated to accept additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class has also been updated to accept a new `legacy_query_ownership` argument. Additionally, a new method `owner_of_path` has been added to the `Ownership` class, and the `LegacyQueryOwnership` class has been added as a subclass of `Ownership`. A new file `ownership.py` defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory. These changes provide more accurate and consistent ownership information for tables in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been updated to improve its resiliency in the event of a pipeline deletion during crawling. Instead of failing, the crawler now logs a warning and continues to crawl when a pipeline is deleted. A new test method, `test_pipeline_disappears_during_crawl`, has been added to verify that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added to ensure the proper functioning of the changes. Additionally, the `test_pipeline_list_with_no_config` test case has been added to check the behavior of the pipelines crawler when there is no configuration present. This pull request aims to enhance the robustness of the assessment feature and ensure its continued operation even in the face of unexpected pipeline deletions.
* Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). In this release, we have enhanced error handling by fixing a `UnicodeDecodeError` that occurred when fetching init scripts in the `_get_init_script_data` method. `UnicodeDecodeError` and `FileNotFoundError` have been added to the list of exceptions handled in the method: when either occurs, the method now returns `None` and logs a warning instead of raising an unhandled exception. The behavior of the `_check_cluster_init_script` method is unchanged; it continues to verify the correct setup of init scripts in the cluster. A sketch of this pattern appears after these release notes.
* Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)). In this release, we have made significant improvements to the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). We have resolved an `UnknownHostException` problem in a specific KeyVault and implemented error handling for invalid Azure Key Vaults, ensuring more robust and reliable system behavior. Additionally, we have expanded `NotFound` exception handling to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret will be skipped, and a warning message will be logged. This enhancement provides a more comprehensive solution to handle various exceptions that may arise when dealing with secrets stored in Azure Key Vaults.
* Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). The recent change addresses the `Unsupported schema: XXX` error in the `assess_workflows` function by introducing a new exception class, `InvalidPath`, in the `WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath` in `jobs.py`. The `InvalidPath` exception provides a more specific error message for unsupported schema paths. Additionally, `test_cached_workspace_path.py` has been updated to test the `WorkspaceCache` object, including raising `InvalidPath` for non-absolute paths, with a new test function for this exception.
* Fixed `assert curr.location is not None` ([#3105](#3105)). In this release, we have addressed a potential issue in the `_external_locations` method which failed to check if the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. To rectify this, we have introduced a check for `None` that will bypass the current iteration of the loop if the location is not set, thereby improving the robustness of the code. The method continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit. This improvement will ensure that the method functions smoothly and avoids errors when dealing with Hive tables that do not have a location set.
* Fixed dynamic import issue ([#3053](#3053)). In this release, we've addressed an issue related to dynamic import inference in our open-source library. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. Additionally, we've introduced new functions, `get_global(self, name: str)`, `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, has been added to ensure the proper functioning of the dynamic import feature. Moreover, a new function `is_from_module` has been introduced to check if a given name is from a specific module. This commit, co-authored by Eric Vergnaud, significantly enhances the code's ability to infer imports in dynamic import scenarios.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the `CONVERT_TO_EXTERNAL` scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is covered by unit and integration tests. The migrate-tables logic for external sync ensures the table is migrated to UC as external post-conversion. The changes add a new workflow and modify an existing one, renaming the `migrate_tables` function to `convert_managed_hms_to_external`; the new function handles the conversion of managed HMS tables to external and updates the `object_type` property of the table in the inventory database to `EXTERNAL` once the conversion completes. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process.
* Fixed issue with table location on storage root ([#3094](#3094)). In this release, we have implemented changes to address the incorrect identification of the parent folder as an external location when a single table has a prefix that matches a parent folder. Additionally, we have improved the storage and retrieval of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore, including S3 bucket URLs without a specific file or path, and URLs whose path does not end in a file. New test cases cover these URL formats, and existing ones have been updated to include them. These changes ensure correct identification of external locations and improve the flexibility of the Hive Metastore's support for external table locations.
* Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues related to snapshot loading for the DFSA and used-table crawlers when using the Spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the Spark-based lsql backend; the fix replaces this method with `.asDict()` (see the sketch after these release notes). Additionally, integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the `test_queries.py` file and do not affect other parts of the project.
* Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file in the `databricks/labs/ucx/source_code` directory has been updated to improve the user experience in the notebook editor. Certain advice codes are no longer propagated: `cannot-autofix-table-reference`, `default-format-changed-in-dbr8`, `dependency-not-found`, `not-supported`, `notebook-run-cannot-compute-value`, `sql-parse-error`, `sys-path-cannot-compute-value`, and `unsupported-magic-line`. A new variable `DEBUG_MESSAGE_CODES` stores the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes (a sketch of the filtering appears after these release notes). These updates reduce the number of unnecessary error messages and improve the accuracy of the linter for supported codes.
* Improve scan tables in mounts ([#2767](#2767)). In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved, providing a more robust and comprehensive solution. Previously, the implementation skipped most directories, only finding 8 tables, but this issue has been addressed, allowing the updated version to parse many more tables. The commit includes bug fixes and the addition of new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. While manual and unit tests have been added and verified, integration tests are still pending implementation. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This change simplifies the codebase, removing the need for a separate linter while maintaining essential functionality for ensuring Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed due to the integration of the `WorkflowLinter` into the `Assessment` workflow. Manual testing has been conducted to ensure the correctness of these changes and the continued proper functioning of the assessment workflow.
* Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit updates the permissions crawling process for secret scopes, addressing task failure when a secret scope disappears before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method have been improved to handle such cases, logging a warning instead of failing the task (a sketch of the pattern appears after these release notes). The return type of the `get_crawler_tasks` method has been updated to `Iterable[Callable[[], Permissions | None]]`. These changes improve the reliability and robustness of permissions crawling for secret scopes, ensuring task completion despite unexpected scope disappearances.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). In this pull request, we have updated the `sqlglot` library requirement to incorporate the latest version, which includes various bug fixes, refactors, and new features. The latest version supports the TO_DOUBLE and TRY_TO_TIMESTAMP functions in Snowflake and the EDIT_DISTANCE (Levenshtein) function in BigQuery, addresses an issue with the ARRAY JOIN function in ClickHouse, and changes the hive dialect hierarchy. We encourage users to update to this latest version to benefit from these enhancements and fixes.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). In this release, we have updated the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This change allows for the use of the latest features and bug fixes available in `sqlglot`, while avoiding the breaking changes introduced in version 25.27. The new version of `sqlglot` offers several improvements, including enhanced query optimization, expanded support for various SQL dialects, and better error handling. We recommend that all users upgrade to the latest version of `sqlglot` to take advantage of these new features and improvements.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release includes an update to the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28, to a range that includes 25.5.0 up to but excluding 25.29. This change allows for the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line.

Dependency updates:

 * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)).
 * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)).
 * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
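
For the table progress encoder ([#3083](#3083)), a rough sketch of how failures could be derived from migration status; the `"Pending migration"` label and the set-based lookup are assumptions for illustration, not the exact implementation:

```python
def derive_failures(full_name: str, migrated_tables: set[str]) -> list[str]:
    """Return history-log failures for a table: empty once it has been
    migrated to Unity Catalog, a pending marker otherwise."""
    if full_name in migrated_tables:
        return []
    return ["Pending migration"]


assert derive_failures("hive_metastore.sales.orders", set()) == ["Pending migration"]
assert derive_failures("hive_metastore.sales.orders", {"hive_metastore.sales.orders"}) == []
```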
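For the history journal ([#2743](#2743)), a sketch of how `__id_attributes__` and `destination()` could look on a record class; the field names are illustrative, not the actual schema:

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class TableMigrationStatus:
    src_schema: str
    src_table: str
    dst_catalog: str | None = None

    # Attributes that uniquely identify this record in the history table.
    __id_attributes__: ClassVar[tuple[str, ...]] = ("src_schema", "src_table")

    def destination(self) -> str:
        """Fully qualified name of the destination table."""
        return f"{self.dst_catalog}.{self.src_schema}.{self.src_table}".lower()
```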
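For the init-script fix ([#3103](#3103)), a sketch of the hardened pattern; `read_workspace_file` is a hypothetical stand-in for the actual workspace export call:

```python
import base64
import logging

logger = logging.getLogger(__name__)


def read_workspace_file(path: str) -> bytes:
    # Hypothetical stand-in for fetching a file through the workspace API.
    with open(path, "rb") as f:
        return f.read()


def get_init_script_data(path: str) -> str | None:
    """Decode an init script, returning None (with a warning) instead of
    raising on the failure modes named in the release note."""
    try:
        return base64.b64decode(read_workspace_file(path)).decode("utf-8")
    except (UnicodeDecodeError, FileNotFoundError) as e:
        logger.warning(f"Cannot fetch init script data from {path}: {e}")
        return None
```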
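For the snapshot-loading fix ([#3046](#3046)), the difference in a nutshell: PySpark's `Row` exposes `asDict()`, while `as_dict()` does not exist on it:

```python
from pyspark.sql import Row

row = Row(catalog="main", database="sales", name="orders")
record = row.asDict()  # works on Spark rows
# row.as_dict()        # AttributeError on a Spark Row
```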
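For the linter change ([#3087](#3087)), a sketch of the filtering; the code names come from the release note, while the diagnostic structure here is a simplified assumption:

```python
DEBUG_MESSAGE_CODES = {
    "cannot-autofix-table-reference",
    "default-format-changed-in-dbr8",
    "dependency-not-found",
    "not-supported",
    "notebook-run-cannot-compute-value",
    "sql-parse-error",
    "sys-path-cannot-compute-value",
    "unsupported-magic-line",
}


def visible_diagnostics(advices: list[dict]) -> list[dict]:
    """Drop advice codes that should not surface in the notebook editor."""
    return [a for a in advices if a["code"] not in DEBUG_MESSAGE_CODES]
```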
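For the permissions-crawling fix ([#3070](#3070)), a sketch of tolerating a secret scope that disappears mid-crawl; the SDK call illustrates the pattern rather than the exact crawler code:

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

logger = logging.getLogger(__name__)


def scope_acls(ws: WorkspaceClient, scope_name: str) -> list | None:
    """Fetch ACLs for a secret scope, logging a warning instead of failing
    when the scope has disappeared since it was listed."""
    try:
        return list(ws.secrets.list_acls(scope_name))
    except NotFound:
        logger.warning(f"Secret scope disappeared, cannot assess: {scope_name}")
        return None
```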
nfx added a commit that referenced this issue Oct 30, 2024
* Added `--dry-run` option for ACL migrate
([#3017](#3017)). In this
release, we have added a `--dry-run` option to the `migrate-acls`
command in the `labs.yml` file, enabling a preview of the migration
process without executing it. This feature also introduces the `hms-fed`
flag, allowing migration of HMS-FED ACLs while migrating tables. The
`ACLMigrator` class in the `application.py` file has been updated to
include new parameters, `sql_backend` and `inventory_database`, to
perform a dry run migration of Access Control Lists (ACLs).
Additionally, a new `retrieve` method has been added to the
`ACLMigrator` class to retrieve a list of grants based on the source and
destination objects, and a `CrawlerBase` class has been introduced for
fetching grants. We have also introduced a new `inferred_grants` table
in the deployment schema to store inferred grants during the migration
process.
* Added `WorkspacePathOwnership` to determine transitive owners for
files and notebooks
([#3047](#3047)). In this
release, we introduce a new class `WorkspacePathOwnership` in the
`owners.py` module to determine the transitive owners for files and
notebooks within a workspace. This class is added as a subclass of
`Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as
inputs. It has methods to infer the owner from the first `CAN_MANAGE`
permission level in the access control list. We also added a new
property `workspace_path_ownership` to the existing
`HiveMetastoreContext` class, which returns a `WorkspacePathOwnership`
object initialized with an `AdministratorLocator` object and a
`workspace_client`. This addition enables the determination of owners
for files and notebooks within the workspace. The functionality is
demonstrated through new tests added to `test_owners.py`. The new tests,
`test_notebook_owner` and `test_file_owner`, create a notebook and a
workspace file and verify the owner of each using the `owner_of` method.
The `AdministratorLocator` is used to locate the administrators group
for the workspace and the `PermissionLevel` class is used to specify the
permission level for the notebook permissions.
* Added `mosaicml-streaming` to known list
([#3029](#3029)). In this
release, we have expanded the range of recognized packages in our system
by adding several new libraries to the known list in the JSON file. The
additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`,
`python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new
entries, `simulation` and `streaming`, while the other packages have a
single entry each. This update addresses issue
[#1931](#1931) and enhances
the system's ability to identify and work with a wider variety of
packages.
* Added `msal-extensions` to known list
([#3030](#3030)). In this
release, we have added support for two new packages, `msal-extensions`
and `portalocker`, to our project. The `msal-extensions` package
includes modules for extending the Microsoft Authentication Library
(MSAL), including cache lock, libsecret, osx, persistence, token cache,
and windows. This addition enhances the library's authentication
capabilities and provides greater flexibility when working with MSAL.
The `portalocker` package offers functionalities for handling file
locking with various backends such as Redis, as well as constants,
exceptions, and utilities. This package enables developers to manage
file locking more efficiently, preventing conflicts and ensuring data
consistency. These new packages extend the range of supported packages
and functionalities for handling authentication and file locking in the
project, providing more options for software engineers to develop robust
and secure applications.
* Added `multimethod` to known list
([#3031](#3031)). In this
release, we have added support for the `multimethod` programming concept
to the library. This feature has been added to the `known.json` file,
which partially resolves issue
[#193](#193)
* Added `murmurhash` to known list
([#3032](#3032)). A new hash
function, MurmurHash, has been added to the library's supported list,
addressing part of issue
[#1931](#1931). The
MurmurHash function includes two variants, `murmurhash` and
"murmurhash.about", with distinct functionalities. The `murmurhash`
variant offers core hashing functionality, while "murmurhash.about"
contains metadata or documentation related to the MurmurHash function.
This integration enables developers to leverage MurmurHash for data
processing tasks, enhancing the library's functionality and versatility.
Users familiar with the project can now incorporate MurmurHash into
their applications and configurations, taking advantage of its unique
features and capabilities.
* Added `ninja` to known list
([#3050](#3050)). In this
release, we have added Ninja to the known list in the `known.json` file.
Ninja is a fast, lightweight build system that enables better
integration and handling within the project's larger context. This
change partially resolves issue
[#1931](#1931), which may
have been caused by challenges in integrating or using Ninja. It is
important to note that this change does not modify any existing
functionality or introduce new methods. The alteration is limited to
including Ninja in the known list, improving the management and
identification of various components within the project.
* Added `nvidia-ml-py` to known list
([#3051](#3051)). In this
release, we have added support for the `nvidia-ml-py` package to our
project. This addition consists of two components: `example` and
'pynvml'. `Example` is likely a placeholder or sample usage of the
package, while `pynvml` is a module that enables interaction with
NVIDIA's system management library (NVML) through Python. This
enhancement is a significant step towards resolving issue
[#1931](#1931), which may
require the use of NVIDIA-related tools or libraries, thereby improving
the project's functionality and capabilities.
* Added dashboard for tracking migration progress
([#3016](#3016)). This
change introduces a new dashboard for tracking migration progress in a
project, called "migration-progress", which displays real-time insights
into migration progress and facilitates planning and task division. A
new method, `_create_dashboard`, has been added to generate the
dashboard from SQL queries in a specified folder and replace database
and catalog references to match the configuration settings. The changes
include updating the install to replace the UCX catalog in queries,
adding a new object serializer, and updating integration tests and
manual testing on a staging environment. The new functionality covers
the migration of tables, views, UDFs, grants, jobs, workflow problems,
clusters, pipelines, and policies. Additionally, a new SQL file has been
added to track the percentage of various objects migrated and display
the results in the new dashboard.
* Added grant progress encoder
([#3079](#3079)). A new
`GrantsProgressEncoder` class has been introduced in the
`progress/grants.py` file to encode `Grant` objects into `History`
objects for the `migration-progress` workflow. This change includes the
addition of unit tests to ensure proper functionality and handles cases
where `Grant` objects fail to map to the Unity Catalog by adding a list
of failures to the `History` object. The commit also modifies the
`migration-progress` workflow to incorporate the new
`GrantsProgressEncoder` class, enhancing the grant processing
capabilities and improving the testing of this functionality. This
change addresses issue
[#3058](#3058), which was
related to grant progress encoding. The `GrantsProgressEncoder` class
can encode grant properties, such as the principal, action, database,
schema, table, and UDF, into a format that can be written to a backend,
ensuring successful migration of grants in the database.
* Added table progress encoder
([#3083](#3083)). In this
release, we've added a table progress encoder to the WorkflowTask
context to enhance the tracking of table-related operations in the
migration-progress workflow. This new encoder, implemented in the
TableProgressEncoder class, is connected to the sql_backend,
table_ownership, and migration_status_refresher objects. The
GrantsProgressEncoder class has been refactored to GrantProgressEncoder,
with additional parameters for improved encoding of grants. We've also
introduced the refresh_table_migration_status task to scan and record
the migration status of tables and views in the inventory, storing
results in the $inventory.migration_status inventory table. Two new unit
tests have been added to ensure proper encoding and migration status
handling. This change improves progress tracking and reporting in the
table migration process, addressing issues
[#3061](#3061) and
[#3064](#3064).
* Combine static code analysis results with historical job snapshots
([#3074](#3074)). In this
release, we have added a new method, `JobsProgressEncoder`, to the
`WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This
method is used to track the progress of jobs in the context of a
workflow task, replacing the existing `jobs_progress` method which only
tracked the progress of grants. The `JobsProgressEncoder` method takes
in additional arguments, including `inventory_database`, to provide more
detailed progress tracking for jobs and is used in the `grants_progress`
method to track the progress of jobs in the context of a workflow task.
We have also added a new unit test for the `JobsProgressEncoder` class
in the `databricks.labs.ucx` project to ensure that the encoding of job
information works as expected with different types of failures and job
details. Additionally, this revision introduces the ability to include
workflow problem records in the historical job snapshots, providing
additional context for debugging and analysis. The `JobsProgressEncoder`
class is a subclass of the `ProgressEncoder` class and provides
additional functionality for tracking the progress of jobs.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership`
([#3049](#3049)). In this
revision, the `DirectFsAccessCrawler` class from the
`databricks.labs.ucx.source_code.directfs_access` module is imported as
`DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new
`cached_property` called `directfs_access_ownership` is added to the
`TableCrawler` class. This property returns an instance of the
`DirectFsAccessOwnership` class, which takes in `administrator_locator`,
`workspace_path_ownership`, and `workspace_client` as arguments.
Additionally, the `DirectFsAccessOwnership` class has been updated to
determine DirectFS access ownership for a given table and connect with
`WorkspacePathOwnership`, enhancing the tool's functionality by
determining access ownership in DirectFS and improving overall system
security and permissions management. The `test_directfs_access.py` file
has also been updated to test the ownership of query and path records
using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available
([#2743](#2743)). This
commit introduces a history table to store snapshots after each crawling
operation, addressing issues
[#2572](#2572) and
[#2573](#2573). The changes
include the addition of a `HistoryLog` class, which handles appending
inventory snapshots to the history table within a specific catalog,
workspace, and run_id. The new methods also include a
`TableMigrationStatus` class with a new class variable
`__id_attributes__` to specify the attributes used to uniquely identify
a table. The `destination()` method has been added to the
`TableMigrationStatus` class to return the fully qualified name of the
destination table. Additionally, unit and integration tests have been
added and updated to ensure the functionality works as expected. The
`Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a
new `history` attribute to store a string representing a problem
associated with the respective class. The `__id_attributes__` class
variable has also been added to these classes to specify the attributes
used to uniquely identify them.
* Determine ownership of tables based on grants and source code
([#3066](#3066)). In this
release, changes have been made to the `application.py` file in the
`databricks/labs/ucx/contexts` directory to improve the accuracy of
determining table ownership in the inventory. A new class
`LegacyQueryOwnership` has been added to the
`databricks.labs.ucx.framework.owners` module to determine the owner of
a table based on the queries that write to it. The `TableOwnership`
class has been updated to accept additional arguments for determining
ownership based on grants, queries, and workspace paths. The
`DirectFsAccessOwnership` class has also been updated to accept a new
`legacy_query_ownership` argument. Additionally, a new method
`owner_of_path` has been added to the `Ownership` class, and the
`LegacyQueryOwnership` class has been added as a subclass of
`Ownership`. A new file `ownership.py` has been introduced, which
defines the `TableOwnership` and `TableMigrationOwnership` classes for
determining ownership of tables and table migration records in the
inventory. These changes provide a more accurate and consistent
ownership information for tables in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted…
([#3034](#3034)). In this
pull request, the pipelines crawler of the DLT assessment feature has
been updated to improve its resiliency in the event of a pipeline
deletion during crawling. Instead of failing, the crawler now logs a
warning and continues to crawl when a pipeline is deleted. A new test
method, `test_pipeline_disappears_during_crawl`, has been added to
verify that the crawler can handle the deletion of a pipeline after
listing the pipelines but before assessing them. The `assessment` and
`migration-progress-experimental` workflows have been modified, and new
unit tests have been added to ensure the proper functioning of the
changes. Additionally, the `test_pipeline_list_with_no_config` test case
has been added to check the behavior of the pipelines crawler when there
is no configuration present. This pull request aims to enhance the
robustness of the assessment feature and ensure its continued operation
even in the face of unexpected pipeline deletions.
* Fixed `UnicodeDecodeError` when fetching init scripts
([#3103](#3103)). In this
release, we have enhanced the error handling capabilities of the
open-source library by fixing a `UnicodeDecodeError` issue that occurred
when fetching init scripts in the `_get_init_script_data` method. To
address this, we have added `UnicodeDecodeError` and `FileNotFoundError`
to the list of exceptions handled in the method. Now, when any of these
exceptions occur, the method will return `None` and a warning message
will be logged instead of raising an unhandled exception. This change
ensures that the function operates smoothly and provides better error
handling in the library, without modifying the behavior of the
`_check_cluster_init_script` method, which remains unchanged and
continues to verify the correct setup of init scripts in the cluster.
* Fixed `UnknownHostException` on the specified KeyVault
([#3102](#3102)). In this
release, we have made significant improvements to the Azure Key Vault
integration, addressing issues
[#3102](#3102) and
[#3090](#3090). We have
resolved an `UnknownHostException` problem in a specific KeyVault and
implemented error handling for invalid Azure Key Vaults, ensuring more
robust and reliable system behavior. Additionally, we have expanded
`NotFound` exception handling to include the `InvalidState` exception.
When the Azure Key Vault is in an invalid state, the corresponding
secret will be skipped, and a warning message will be logged. This
enhancement provides a more comprehensive solution to handle various
exceptions that may arise when dealing with secrets stored in Azure Key
Vaults.
* Fixed `Unsupported schema: XXX` error on `assess_workflows`
([#3104](#3104)). The recent
change to the open-source library addresses the 'Unsupported schema:
XXX' error in the `assess_workflows` function. This was achieved by
introducing a new exception class, 'InvalidPath', in the
`WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath`
in the 'jobs.py' file. The `InvalidPath` exception is used to provide a
more specific error message for unsupported schema paths. The
`WorkspaceCache` mixin now includes an `InvalidPath` exception for
caching workspace paths. The error handling in the 'jobs.py' file has
been modified to raise `InvalidPath` instead of `ValueError` for better
error messages. Additionally, the 'test_cached_workspace_path.py' file
has updates for testing the `WorkspaceCache` object, including the
addition of the `InvalidPath` exception for non-absolute paths, and a
new test function for this exception. The `WorkspaceCache` class has an
ellipsis in the `__init__` method, indicating additional initialization
code not shown in this diff.
* Fixed `assert curr.location is not None`
([#3105](#3105)). In this
release, we have addressed a potential issue in the
`_external_locations` method which failed to check if the location of
the current Hive table is `None` before proceeding. This oversight could
result in unnecessary exceptions when accessing the location of a Hive
table. To rectify this, we have introduced a check for `None` that will
bypass the current iteration of the loop if the location is not set,
thereby improving the robustness of the code. The method continues to
return a list of `ExternalLocation` objects, each representing a Hive
table or partition location with the corresponding number of tables or
partitions present. The `ExternalLocation` class remains unchanged in
this commit. This improvement will ensure that the method functions
smoothly and avoids errors when dealing with Hive tables that do not
have a location set.
* Fixed dynamic import issue
([#3053](#3053)). In this
release, we've addressed an issue related to dynamic import inference in
our open-source library. Previously, the code did not infer import names
when using `importlib.import_module(some_name)`. This has been resolved
by implementing a new method, `_make_sources_for_import_call_node`,
which infers the import name from the provided node argument.
Additionally, we've introduced new functions, `get_global(self, name:
str)`, `_adjust_node_for_import_member(self, name: str, match_node:
type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG,
depth: int)` method to handle attributes as global names. A new unit
test, `test_graph_imports_dynamic_import()`, has been added to ensure
the proper functioning of the dynamic import feature. Moreover, a new
function `is_from_module` has been introduced to check if a given name
is from a specific module. This commit, co-authored by Eric Vergnaud,
significantly enhances the code's ability to infer imports in dynamic
import scenarios.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for
`CONVERT_TO_EXTERNAL` scenario
([#3020](#3020)). This
change updates the process for converting a managed Hive Metastore (HMS)
table to external in the CONVERT_TO_EXTERNAL scenario. The functionality
is split into a separate workflow task, executed from a non-Unity
Catalog (UC) cluster, and is tested with unit and integration tests. The
migrate table function for external sync ensures the table is migrated
as external to UC post-conversion. The changes include adding a new
workflow and modifying an existing one, and updates the existing
workflow to rename the migrate_tables function to
convert_managed_hms_to_external. The new function handles the conversion
of managed HMS tables to external, and updates the object_type property
of the table in the inventory database to `EXTERNAL` after the
conversion is completed. The pull request resolves issue
[#2840](#2840) and removes
the existing functionality of applying grants during the migration
process.
* Fixed issue with table location on storage root
([#3094](#3094)). In this
release, we have implemented changes to address an issue related to the
incorrect identification of the parent folder as an external location
when there is a single table with a prefix that matches a parent folder.
Additionally, we have improved the storage and retrieval of table
locations in the root directory of a storage service by adding support
for additional S3 bucket URL formats in the unit tests for the Hive
Metastore. This includes handling S3 bucket URLs that do not include a
specific file or path, and those with a path that does not include a
file. We have also added new test cases for these URL formats and
modified existing ones to include them. These changes ensure correct
identification of external locations and improve functionality and
flexibility of the Hive Metastore's support for external table
locations. The new methods added are not explicitly stated, but they
likely involve functions for parsing and processing the new S3 bucket
URL formats.
* Fixed snapshot loading for DFSA and used-table crawlers
([#3046](#3046)). This
commit resolves issues related to snapshot loading for the DFSA and
used-table crawlers when using the spark-based lsql backend. The root
cause was the use of `.as_dict()` to convert rows to dictionaries, which
is unavailable in the spark-based lsql backend. The fix involves
replacing this method with `.asDict()`. Additionally, integration and
unit tests were updated to include snapshot loading for these crawlers,
and a typo in a test name was corrected. The changes are confined to the
test_queries.py file and do not affect other parts of the project. No
new methods were added, and existing functionality changes were limited
to updating the snapshot loading process.
* Ignore failed inference codes when presenting results to Databricks
Runtime ([#3087](#3087)). In
this release, the `lsp_plugin.py` file has been updated in the
`databricks/labs/ucx/source_code` directory to improve the user
experience in the notebook editor. The changes include disabling certain
advice codes from being propagated, specifically:
'cannot-autofix-table-reference', 'default-format-changed-in-dbr8',
'dependency-not-found', 'not-supported',
'notebook-run-cannot-compute-value', 'sql-parse-error',
'sys-path-cannot-compute-value', and 'unsupported-magic-line'. A new
variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of
advice codes to be ignored, and the list comprehension that creates
`diagnostics` in the `pylsp_lint` function has been updated to exclude
these codes. These updates aim to reduce the number of unnecessary error
messages and improve the accuracy of the linter for supported codes.
* Improve scan tables in mounts
([#2767](#2767)). In this
release, the `scan-tables-in-mounts` functionality in the hive metastore
has been significantly improved, providing a more robust and
comprehensive solution. Previously, the implementation skipped most
directories, only finding 8 tables, but this issue has been addressed,
allowing the updated version to parse many more tables. The commit
includes bug fixes and the addition of new unit tests. The reviewer is
encouraged to refactor the code in future iterations to use the `os`
module instead of `dbutils` for listing directories, enabling
parallelization and improving scalability. The commit resolves issue
[#2540](#2540) and updates
the `scan-tables-in-mounts-experimental` workflow. While manual and unit
tests have been added and verified, integration tests are still pending
implementation. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow
([#3036](#3036)). In this
release, the `WorkflowLinter` has been removed as it is now integrated
into the `Assessment` workflow, addressing issue
[#3035](#3035). This change
simplifies the codebase, removing the need for a separate linter while
maintaining essential functionality for ensuring Unity Catalog
compatibility. The linter's functionality has been merged with other
parts of the assessment workflow, with results persisted in the
`$inventory_database.workflow_problems` and
`$inventory_database.directfs_in_paths` tables. The `assess_workflows`
and `assess_dashboards` methods have been updated accordingly, removing
`WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter`
class has been removed from the `workflows.py` file, along with its
associated methods `lint_all_workflows` and `lint_all_queries`. The
`test_running_real_workflow_linter_job` function has also been removed
due to the integration of the `WorkflowLinter` into the `Assessment`
workflow. Manual testing has been conducted to ensure the correctness of
these changes and the continued proper functioning of the assessment
workflow.
* Updated permissions crawling so that it doesn't fail if a secret scope
disappears during crawling
([#3070](#3070)). This
commit enhances the open-source library by updating the permissions
crawling process for secret scopes, addressing the issue of task failure
when a secret scope disappears before ACL retrieval. The `assessment`
workflow has been modified to incorporate these updates, and new unit
tests have been added, including one that simulates the disappearance of
a secret scope during crawling. The `PermissionsCrawler` class and the
`Threads.gather` method have been improved to handle such cases, logging
a warning instead of failing the task. The return type of the
`get_crawler_tasks` method has been updated to Iterable[Callable[[],
Permissions | None]]. These changes improve the reliability and
robustness of the permissions crawling process for secret scopes,
ensuring task completion in the face of unexpected scope disappearances.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27
([#3041](#3041)). In this
pull request, we have updated the sqlglot library requirement to
incorporate the latest version, which includes various bug fixes,
refactors, and exciting new features. The latest version now supports
the TO_DOUBLE and TRY_TO_TIMESTAMP functions in Snowflake and the
EDIT_DISTANCE (Levinshtein) function in BigQuery. Moreover, we've
addressed an issue with the ARRAY JOIN function in Clickhouse and made
changes to the hive dialect hierarchy. We encourage users to update to
this latest version to benefit from these enhancements and fixes,
ensuring optimal performance and functionality of the library.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28
([#3048](#3048)). In this
release, we have updated the requirement for the `sqlglot` library to a
version greater than or equal to 25.5.0 and less than 25.28. This change
was made to allow for the use of the latest features and bug fixes
available in `sqlglot`, while guarding against potential breaking
changes in version 25.28 and later. The new version of `sqlglot`
offers several
improvements, including but not limited to enhanced query optimization,
expanded support for various SQL dialects, and better error handling. We
recommend that all users upgrade to the latest version of `sqlglot` to
take advantage of these new features and improvements.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29
([#3093](#3093)). This
release updates the `sqlglot` dependency, raising the exclusive
upper bound of the allowed version range from 25.28 to 25.29 while
keeping the minimum at 25.5.0. This change allows for
the use of the latest `sqlglot` version and includes all the updates and
bug fixes from this library since the previous version. The pull request
provides a list of changes made in `sqlglot` since the previous version,
as well as a list of relevant commits. Dependabot has been configured to
handle any merge conflicts for this pull request and includes commands
to trigger various Dependabot actions. This update was made by
Dependabot and is indicated by a signed-off-by line.
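
As a rough illustration of the `os`-based refactor suggested in the
scan-tables entry above, the sketch below lists directories with
`os.scandir` and fans each level out across threads. This is a
hypothetical sketch, not UCX code: the function names and the
breadth-first strategy are assumptions made here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_subdirectories(path: str) -> list[str]:
    # os.scandir can run from plain worker threads, unlike dbutils,
    # which makes the listing straightforward to parallelize.
    with os.scandir(path) as entries:
        return [entry.path for entry in entries if entry.is_dir()]


def scan_mount(root: str, max_workers: int = 8) -> list[str]:
    # Breadth-first scan: directories at one depth are listed concurrently.
    found: list[str] = []
    frontier = [root]
    while frontier:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            children = [c for batch in pool.map(list_subdirectories, frontier) for c in batch]
        found.extend(children)
        frontier = children
    return found
```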
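
The secret-scope handling described above can be sketched as follows.
This is a minimal illustration assuming the Databricks SDK's
`NotFound` error; the `Permissions` record and the
`secret_scope_acl_task` helper are hypothetical stand-ins for the UCX
internals.

```python
import logging
from collections.abc import Callable
from dataclasses import dataclass

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

logger = logging.getLogger(__name__)


@dataclass
class Permissions:  # hypothetical stand-in for the inventory record
    object_id: str
    object_type: str
    raw: str


def secret_scope_acl_task(ws: WorkspaceClient, scope: str) -> Callable[[], Permissions | None]:
    def fetch() -> Permissions | None:
        try:
            acls = list(ws.secrets.list_acls(scope))
        except NotFound:
            # The scope vanished between enumeration and ACL retrieval:
            # log a warning and skip it rather than failing the crawl.
            logger.warning("Secret scope disappeared before ACLs could be read: %s", scope)
            return None
        return Permissions(scope, "secret-scope", repr(acls))
    return fetch
```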
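
For instance, one of the newly supported Snowflake functions can be
exercised through sqlglot's standard `transpile` API; the target
dialect below is an arbitrary choice for illustration.

```python
import sqlglot

# Transpile a Snowflake query that uses TRY_TO_TIMESTAMP into another dialect.
sql = "SELECT TRY_TO_TIMESTAMP('2024-10-24 12:00:00')"
print(sqlglot.transpile(sql, read="snowflake", write="duckdb")[0])
```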

Dependency updates:

* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27
([#3041](#3041)).
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28
([#3048](#3048)).
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29
([#3093](#3093)).
@nfx nfx removed this from UCX Oct 31, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 26, 2024
## Changes

Track `UsedTables` on `TableProgressEncoder`

### Linked issues

Resolves #3061

### Functionality

- [x] modified existing workflow: `migration-progress-experimental`


### Tests

- [ ] manually tested
- [x] added unit tests
- [x] added integration tests
gueniai added a commit that referenced this issue Dec 2, 2024
* Added `assign-owner-group` command
([#3111](#3111)). The
Databricks Labs Unity Catalog Exporter (UCX) tool now includes a new
`assign-owner-group` command, allowing users to assign an owner group to
the workspace. This group will be designated as the owner for all
migrated tables and views, providing better control and organization of
resources. The command can be executed in the context of a specific
workspace or across multiple workspaces. The implementation includes new
classes, methods, and attributes in various files, such as `cli.py`,
`config.py`, and `groups.py`, enhancing ownership management
functionality. The `assign-owner-group` command replaces the
functionality of issue
[#3075](#3075) and addresses
issue [#2890](#2890),
ensuring proper schema ownership and handling of crawled grants.
Developers should be aware that running the `migrate-tables` workflow
will result in assigning a new owner group for the Hive Metastore
instance in the workspace installation.
* Added `opencensus` to known list
([#3052](#3052)). In this
release, we have added OpenCensus to the list of known libraries in our
configuration file. OpenCensus is a popular set of tools for distributed
tracing and monitoring, and its inclusion in our system will enhance
support and integration for users who utilize this tool. This change
does not affect existing functionality, but instead adds a new entry in
the configuration file for OpenCensus. This enhancement will allow our
library to better recognize and work with OpenCensus, enabling improved
performance and functionality for our users.
* Added default owner group selection to the installer
([#3370](#3370)). A new
class, AccountGroupLookup, has been added to the AccountGroupLookup
module to select the default owner group during the installer process,
addressing previous issue
[#3111](#3111). This class
uses the workspace_client to determine the owner group, and a
pick_owner_group method to prompt the user for a selection if necessary.
The ownership selection process has been improved with the addition of a
check in the installer's `_static_owner` method to determine if the
current user is part of the default owner group. The GroupManager class
has been updated to use the new AccountGroupLookup class and its
methods, `pick_owner_group` and `validate_owner_group`. A new variable,
`default_owner_group`, is introduced in the ConfigureGroups class to
configure groups during installation based on user input. The installer
now includes a unit test, "test_configure_with_default_owner_group", to
demonstrate how it sets expected workspace configuration values when a
default owner group is specified during installation.
* Added handling for non UTF-8 encoded notebook error explicitly
([#3376](#3376)). A new
enhancement has been implemented to address the issue of non-UTF-8
encoded notebooks failing to load by introducing explicit error handling
for this case. A UnicodeDecodeError exception is now caught and logged
as a warning, while the notebook is skipped and returned as None. This
change is implemented in the load_dependency method in the loaders.py
file, which is a part of the assessment workflow. Additionally, a new
unit test has been added to verify the behavior of this change, and the
assessment workflow has been updated accordingly. The new test function
in test_loaders.py checks for different types of exceptions,
specifically PermissionError and UnicodeDecodeError, ensuring that the
system can handle notebooks with non-UTF-8 encoding gracefully. This
enhancement resolves issue
[#3374](#3374), thereby
improving the overall robustness of the application. A sketch of
this handling follows after this list.
* Added migration progress documentation
([#3333](#3333)). In this
release, we have updated the `migration-progress-experimental` workflow
to track the migration progress of a subset of inventory tables related
to workspace resources being migrated to Unity Catalog (UCX). The
workflow updates the inventory tables and tracks the migration progress
in the UCX catalog tables. To use this workflow, users must attach a UC
metastore to the workspace, create a UCX catalog, and ensure that the
assessment job has run successfully. The `Migration Progress` section in
the documentation has been updated with a new markdown file that
provides details about the migration progress, including a migration
progress dashboard and an experimental migration progress workflow that
generates historical records of inventory objects relevant to the
migration progress. These records are stored in the UCX UC catalog,
which contains a historical table with information about the object
type, object ID, data, failures, owner, and UCX version. The migration
process also tracks dangling Hive or workspace objects that are not
referenced by business resources, and the progress is persisted in the
UCX UC catalog, allowing for cross-workspace tracking of migration
progress.
* Added note about running assessment once
([#3398](#3398)). In this
release, we have introduced an update to the UCX assessment workflow,
which will now only be executed once and will not update existing
results in repeated runs. To accommodate this change, we have updated
the README file with a note clarifying that the assessment workflow is a
one-time process. Additionally, we have provided instructions on how to
update the inventory and findings by uninstalling and reinstalling the
UCX. This will ensure that the inventory and findings for a workspace
are up-to-date and accurate. We recommend that software engineers take
note of this change and follow the updated instructions when using the
UCX assessment workflow.
* Allowing skipping TACLs migration during table migration
([#3384](#3384)). A new
optional flag, "skip_tacl_migration", has been added to the
configuration file, providing users with more flexibility during
migration. This flag allows users to control whether or not to skip the
Table Access Control Language (TACL) migration during table migrations.
It can be set when creating catalogs and schemas, as well as when
migrating tables or using the `migrate_grants` method in
`application.py`. Additionally, the `install.py` file now includes a new
variable, `skip_tacl_migration`, which can be set to `True` during the
installation process to skip TACL migration. New test cases have been
added to verify the functionality of skipping TACL migration during
grants management and table migration. These changes enhance the
flexibility of the system for users managing table migrations and TACL
operations in their infrastructure, addressing issues
[#3384](#3384) and
[#3042](#3042).
* Bump `databricks-sdk` and `databricks-labs-lsql` dependencies
([#3332](#3332)). In this
update, the `databricks-sdk` and `databricks-labs-lsql` dependencies are
upgraded to versions 0.38 and 0.14.0, respectively. The `databricks-sdk`
update addresses conflicts, bug fixes, and introduces new API additions
and changes, notably impacting methods like `create()`,
`execute_message_query()`, and others in workspace-level services. While
`databricks-labs-lsql` updates ensure compatibility, its changelog and
specific commits are not provided. This pull request also includes
ignore conditions for the `databricks-sdk` dependency to prevent future
Dependabot requests. It is strongly advised to rigorously test these
updates to avoid any compatibility issues or breaking changes with the
existing codebase. This pull request mirrors another
([#3329](#3329)), resolving
integration CI issues that prevented the original from merging.
* Explain failures when cluster encounters Py4J error
([#3318](#3318)). In this
release, we have made significant improvements to the error handling
mechanism in our open-source library. Specifically, we have addressed
issue [#3318](#3318), which
involved handling failures when the cluster encounters Py4J errors in
the `databricks/labs/ucx/hive_metastore/tables.py` file. We have added
code to raise noisy failures instead of swallowing the error with a
warning when a Py4J error occurs. The functions `_all_databases()` and
`_list_tables()` have been updated to check if the error message
contains "py4j.security.Py4JSecurityException", and if so, log an error
message with instructions to update or reinstall UCX. If the error
message does not contain "py4j.security.Py4JSecurityException", the
functions log a warning message and return an empty list. These changes
also resolve the linked issue
[#3271](#3271). The
functionality has been thoroughly tested and verified on the labs
environment. These improvements provide more informative error messages
and enhance the overall reliability of our library; a simplified
sketch of the Py4J check follows after this list.
* Rearranged job summary dashboard columns and make job_name clickable
([#3311](#3311)). In this
update, the job summary dashboard columns have been improved and the
need for the `30_3_job_details.sql` file, which contained a SQL query
for selecting job details from the `inventory.jobs` table, has been
eliminated. The dashboard columns have been rearranged, and the
`job_name` column is now clickable, providing easy access to job details
via the corresponding job ID. The changes include modifying the
dashboard widget and adding new methods for making the `job_name` column
clickable and linking it to the job ID. Additionally, the column titles
have been updated to display more relevant information. These
improvements have been manually tested and verified in a labs
environment.
* Refactor refreshing of migration-status information for tables,
eliminate another redundant refresh
([#3270](#3270)). This pull
request refactors the way table records are enriched with
migration-status information during encoding for the history log in the
`migration-progress-experimental` workflow. It ensures that the refresh
of migration-status information is explicit and under the control of the
workflow, addressing a previously expressed intent. A redundant refresh
of migration-status information has been eliminated and additional unit
test coverage has been added to the `migration-progress-experimental`
workflow. The changes include modifying the existing workflow, adding
new methods for refreshing table migration status without updating the
history log, and splitting the crawl and update-history-log tasks into
three steps. The `TableMigrationStatusRefresher` class has been
introduced to obtain the migration status of a table, and new tests have
been added to ensure correctness, making the
`migration-progress-experimental` workflow more efficient and reliable.
* Safe read files in more places
([#3394](#3394)). This
release introduces significant improvements to file handling, addressing
issue [#3386](#3386). A new
function, `safe_read_text`, has been implemented for safe reading of
files, catching and handling exceptions and returning None if reading
fails. This function is utilized in the `is_a_notebook` function and
replaces the existing `read_text` method in specific locations,
enhancing error handling and robustness. The `databricks labs ucx
lint-local-code` command and the `assessment` workflow have been updated
accordingly. Additionally, new test files and methods have been added
under the `tests/integration/source_code` directory to ensure
comprehensive testing of file handling, including handling of
unsupported file types, encoding checks, and ignorable files. A
sketch of the `safe_read_text` contract follows after this list.
* Track `DirectFsAccess` on `JobsProgressEncoder`
([#3375](#3375)). In this
release, the open-source library has been updated with new features
related to tracking Direct File System Access (DirectFsAccess) in the
JobsProgressEncoder. This change includes the addition of a new
`_direct_fs_accesses` method, which detects direct filesystem access by
code used in a job and generates corresponding failure messages. The
DirectFsAccessCrawler object is used to crawl and track file system
access for directories and queries, providing more detailed tracking and
encoding of job progress. Additionally, new methods `make_job` and
`make_dashboard` have been added to create instances of Job and
Dashboard, respectively, and new unit and integration tests have been
added to ensure the proper functionality of the updated code. These
changes improve the functionality of JobsProgressEncoder by providing
more comprehensive job progress information, making the code more
modular and maintainable for easier management of jobs and dashboards.
This release resolves issue
[#3059](#3059) and enhances
the tracking and encoding of job progress in the system, ensuring more
comprehensive and accurate reporting of job status and issues; a
sketch of the failure-message generation follows after this list.
* Track `UsedTables` on `TableProgressEncoder`
([#3373](#3373)). In this
release, the tracking of `UsedTables` has been implemented on the
`TableProgressEncoder` in the `tables_progress` function, addressing
issue [#3061](#3061). The
workflow `migration-progress-experimental` has been updated to
incorporate this change. New objects,
`self.used_tables_crawler_for_paths` and
`self.used_tables_crawler_for_queries`, have been added as instances of
a class responsible for crawling used tables. A `full_name` property has
been introduced as a read-only attribute for a source code class,
providing a more convenient way of accessing and manipulating the full
name of the source code object. A new integration test for the
`TableProgressEncoder` component has also been added, specifically
testing table failure scenarios. The `TableProgressEncoder` class has
been updated to track `UsedTables` using the `UsedTablesCrawler` class,
and a new class, `UsedTable`, has been introduced to represent the
catalog, schema, and table name of a table. Two new unit tests have been
added to ensure the correct functionality of this feature; a sketch
of such a record follows after this list.
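
To make a few of the entries above concrete, some hedged sketches
follow. First, the non-UTF-8 notebook handling: a minimal
illustration with a hypothetical loader name, assuming only that
`UnicodeDecodeError` is caught, logged as a warning, and turned into
`None`.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_notebook_source(path: Path) -> str | None:
    """Return the notebook source, or None when it is not valid UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError as e:
        # Skip the notebook with a warning instead of failing the workflow.
        logger.warning("Cannot decode non-UTF-8 encoded notebook: %s", path, exc_info=e)
        return None
```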
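
Second, the Py4J failure explanation reduces to inspecting the
exception message. A simplified sketch with hypothetical names,
assuming a Spark session is available; re-raising after logging is a
design choice made here to keep the failure noisy.

```python
import logging

logger = logging.getLogger(__name__)


def list_databases_safely(spark) -> list[str]:
    try:
        return [row[0] for row in spark.sql("SHOW DATABASES").collect()]
    except Exception as e:  # broad on purpose: mirror a crawler's behaviour
        if "py4j.security.Py4JSecurityException" in str(e):
            # Security-restricted cluster: surface a loud, actionable error.
            logger.error("Py4J security error; please update or reinstall UCX", exc_info=e)
            raise
        logger.warning("Failed to list databases: %s", e)
        return []
```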
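
Third, the `safe_read_text` helper's contract, as described, reduces
to something like the sketch below; the optional `size` parameter and
the notebook-header check are assumptions made for illustration.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

NOTEBOOK_HEADER = "# Databricks notebook source"


def safe_read_text(path: Path, size: int = -1) -> str | None:
    """Read a text file, returning None instead of raising on failure."""
    try:
        with path.open("r", encoding="utf-8") as f:
            return f.read(size)
    except (OSError, UnicodeError) as e:
        # Permission errors, missing files and bad encodings all become
        # a warning so linting can continue with the remaining files.
        logger.warning("Cannot read file: %s", path, exc_info=e)
        return None


def is_a_notebook(path: Path) -> bool:
    # A Databricks notebook source file starts with a magic header line.
    return safe_read_text(path, size=len(NOTEBOOK_HEADER)) == NOTEBOOK_HEADER
```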
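
Fourth, the failure messages that `JobsProgressEncoder` derives from
direct filesystem accesses might be generated roughly like this; the
record fields and the message format are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectFsAccess:  # hypothetical stand-in for the crawled record
    path: str
    is_write: bool
    source_id: str  # the notebook, file or query where the access occurs


def direct_fs_access_failures(records: list[DirectFsAccess]) -> list[str]:
    # One human-readable failure per access, so the history log can
    # explain why a job is not yet Unity Catalog compatible.
    failures = []
    for record in records:
        mode = "write" if record.is_write else "read"
        failures.append(f"Direct file system access ({mode}) in {record.source_id}: {record.path}")
    return failures
```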
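
Finally, the `UsedTable` record and its read-only `full_name`
property might look roughly like this; the field names are inferred
from the description rather than copied from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsedTable:
    """A table reference found in source code or queries."""

    catalog_name: str
    schema_name: str
    table_name: str

    @property
    def full_name(self) -> str:
        # Read-only convenience accessor for the fully qualified name.
        return f"{self.catalog_name}.{self.schema_name}.{self.table_name}"


# Example: a used table still living in the legacy Hive metastore.
used = UsedTable("hive_metastore", "sales", "orders")
assert used.full_name == "hive_metastore.sales.orders"
```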