Releases: databrickslabs/ucx

v0.53.1

30 Dec 16:57
a77ca8b
  • Removed packaging package dependency (#3469). In this release, we have removed the dependency on the packaging package in the open-source library to address a release issue. The import statements for "packaging.version.Version" and "packaging.version.InvalidVersion" have been removed. The function _external_hms in the federation.py file has been updated to retrieve the Hive Metastore version using the "spark.sql.hive.metastore.version" configuration key and validate it using a regular expression pattern. If the version is not valid, the function logs an informational message and returns None. This change modifies the Hive Metastore version validation logic and improves the overall reliability and maintainability of the library.
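The regex-based validation described above can be sketched as follows. This is a minimal illustration and not the actual body of _external_hms: the function name parse_hms_version, the exact pattern, and the choice to return a "major.minor" string are all assumptions made for the example. Only the configuration key and the "log an informational message and return None" behavior come from the release note.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical pattern: "major.minor" with an optional patch part, e.g. "2.3" or "2.3.9".
_VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?$")


def parse_hms_version(spark_conf: dict) -> Optional[str]:
    """Return the "major.minor" Hive Metastore version, or None if missing or malformed."""
    raw = spark_conf.get("spark.sql.hive.metastore.version")
    if raw is None:
        logger.info("Hive Metastore version is not configured")
        return None
    match = _VERSION_PATTERN.match(raw.strip())
    if match is None:
        # Mirrors the release note: an invalid version is logged at info level, not raised.
        logger.info(f"Hive Metastore version is not valid: {raw}")
        return None
    return f"{match.group(1)}.{match.group(2)}"
```

Using a plain regular expression keeps the check free of third-party imports, which is the point of dropping the packaging dependency.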

Contributors: @FastLee

v0.53.0

23 Dec 18:28
dcfe27e
  • Added dashboard crawlers (#3397). The open-source library has been updated with new dashboard crawlers for the assessment workflow, Redash migration, and QueryLinter. These crawlers are responsible for crawling and persisting dashboards, as well as migrating or reverting them during Redash migration. They also lint the queries of the crawled dashboards using QueryLinter. This change resolves issues #3366 and #3367, and progresses #2854. The 'databricks labs ucx {migrate-dbsql-dashboards|revert-dbsql-dashboards}' command and the assessment workflow have been modified to incorporate these new features. Unit tests and integration tests have been added to ensure proper functionality of the new dashboard crawlers. Additionally, two new tables, $inventory.redash_dashboards and $inventory.lakeview_dashboards, have been introduced to hold a list of all Redash or Lakeview dashboards and are used by the QueryLinter and Redash migration. These changes improve the assessment, migration, and linting processes for dashboards in the library.
  • DBFS Root Support for HMS Federation (#3425). The commit DBFS Root Support for HMS Federation introduces changes to support the DBFS root location for HMS federation. A new method, external_locations_with_root, is added to the ExternalLocations class to return a list of external locations including the DBFS root location. This method is used in various functions and test cases, such as test_create_uber_principal_no_storage, test_create_uc_role_multiple_raises_error, test_create_uc_no_roles, test_save_spn_permissions, and test_create_access_connectors_for_storage_accounts, to ensure that the DBFS root location is correctly identified and tested in different scenarios. Additionally, the external_locations.snapshot.return_value is changed to external_locations.external_locations_with_root.return_value in test functions test_create_federated_catalog and test_already_existing_connection to retrieve a list of external locations including the DBFS root location. This commit closes issue #3406, which was related to this functionality. Overall, these changes improve the handling and testing of DBFS root location in HMS federation.
  • Log message as error when legacy permissions API is enabled/disabled depending on the workflow ran (#3443). In this release, logging behavior has been updated in several methods in the 'workflows.py' file. When the use_legacy_permission_migration configuration is set to False and specific conditions are met, error messages are now logged instead of info messages for the methods 'verify_metastore_attached', 'rename_workspace_local_groups', 'reflect_account_groups_on_workspace', 'apply_permissions_to_account_groups', 'apply_permissions', and 'validate_groups_permissions'. This change is intended to address issue #3388 and provides clearer guidance to users when the legacy permissions API is not functioning as expected. Users will now see an error message advising them to run the migrate-groups job or set use_legacy_permission_migration to True in the config.yml file. These updates will help ensure smoother workflow runs and more accurate logging for better troubleshooting.
  • MySQL External HMS Support for HMS Federation (#3385). This commit adds support for MySQL-based Hive Metastore (HMS) in HMS Federation, enhances the CLI for creating a federated catalog, and improves external HMS functionality. It introduces a new parameter enable_hms_federation in the Locations class constructor, allowing users to enable or disable MySQL-based HMS federation. The external_locations method in application.py now accepts enable_hms_federation as a parameter, enabling more granular control of the federation feature. Additionally, the CLI for creating a federated catalog has been updated to accept a prompts parameter, providing more flexibility. The commit also introduces a new dataclass ExternalHmsInfo for external HMS connection information and updates the HiveMetastoreFederationEnabler and HiveMetastoreFederation classes to support non-Glue external metastores. Furthermore, it adds methods to handle the creation of a Federated Catalog from the command-line interface, split JDBC URLs, and manage external connections and permissions.
  • Skip listing built-in catalogs to update table migration process (#3464). In this release, the migration process for updating tables in the Hive Metastore has been optimized with the introduction of the TableMigrationStatusRefresher class, which inherits from CrawlerBase. This new class includes modifications to the _iter_schemas method, which now filters out built-in catalogs and schemas when listing catalogs and schemas, thereby skipping unnecessary processing during the table migration process. Additionally, the get_seen_tables method has been updated to include checks for schema.name and schema.catalog_name, and the _crawl and _try_fetch methods have been modified to reflect changes in the TableMigrationStatus constructor. These changes aim to improve the efficiency and performance of the migration process by skipping built-in catalogs and schemas. The release also includes modifications to the existing migrate-tables workflow and adds unit tests that demonstrate the exclusion of built-in catalogs during the table migration status update process. The test case utilizes the CatalogInfoSecurableKind enumeration to specify the kind of catalog and verifies that the seen tables only include the non-builtin catalogs. These changes should prevent unnecessary processing of built-in catalogs and schemas during the table migration process, leading to improved efficiency and performance.
  • Updated databricks-sdk requirement from <0.39,>=0.38 to >=0.39,<0.40 (#3434). In this release, the requirement for the databricks-sdk package in the pyproject.toml file has been updated to greater than or equal to 0.39 and less than 0.40, allowing the 0.39 releases of the package while excluding version 0.40 and above. The upstream release notes and changelog for version 0.39 list bug fixes, internal changes, and API changes such as the addition of the cleanrooms package, a delete() method for workspace-level services, and new fields for various request and response objects.
  • Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 (#3456). In this pull request, the version range of the databricks-sdk dependency has been updated from '<0.40,>=0.39' to '>=0.39,<0.41', allowing the use of the latest version of the databricks-sdk while ensuring that it is less than 0.41. The pull request also includes release notes detailing the API changes in version 0.40.0, such as the addition of new fields to various compute, dashboard, job, and pipeline services. A changelog is provided, outlining the bug fixes, internal changes, new features, and improvements in versions 0.39.0, 0.40.0, and 0.38.0. A list of commits is also included, showing the development progress of these versions.
  • Use LTS Databricks runtime version (#3459). This release switches the Databricks runtime version to a Long-Term Support (LTS) release to address issues encountered during the migration to external tables. The previous runtime version caused the convert to external table migration strategy to fail, and this change serves as a temporary solution. The migrate-tables workflow has been modified, and existing integration tests have been reused to ensure functionality. The test_job_cluster_policy function now uses the LTS version instead of the latest version, ensuring a specified Spark version for the cluster policy. The function also checks for matching node type ID, Spark version, and necessary resources. However, users may still encounter problems with the latest UCX release. The _convert_hms_table_to_external method in the table_migrate.py file has been updated to return a boolean value, with a new TODO comment about a possible failure with Databricks runtime 16.0 due to a JDK update.
  • Use CREATE_FOREIGN_CATALOG instead of CREATE_FOREIGN_SECURABLE with HMS federation enablement commands (#3309). A change has been made to update the databricks-sdk dependency version from >=0.38,<0.39 to >=0.39 in the pyproject.toml file, which may affect the project's functionality related to the databricks-sdk library. In the Hive Metastore Federation codebase, CREATE_FOREIGN_CATALOG is now used instead of CREATE_FOREIGN_SECURABLE for HMS federation enablement commands, aligned with issue #3308. The _add_missing_permissions_if_needed method has been updated to check for CREATE_FOREIGN_SECURABLE instead of CREATE_FOREIGN_CATALOG when granting permissions. Additionally, a unit test file for HiveMetastore Federation has ...
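The MySQL External HMS Support entry in this release mentions splitting JDBC URLs. A minimal sketch of one way to do that with the standard library is shown here. The names JdbcParts and split_jdbc_url are hypothetical and are not the ExternalHmsInfo dataclass or any UCX method; the example host and database are made up.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class JdbcParts:
    """Illustrative container only; not the ExternalHmsInfo dataclass from the release."""

    database_type: str
    host: str
    port: Optional[int]
    database: str


def split_jdbc_url(jdbc_url: str) -> JdbcParts:
    """Split a URL such as jdbc:mysql://host:3306/db into its components."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError(f"Not a JDBC URL: {jdbc_url}")
    # After stripping "jdbc:", the remainder parses like an ordinary URL.
    parsed = urlparse(jdbc_url[len(prefix):])
    if not parsed.scheme or not parsed.hostname:
        raise ValueError(f"Unsupported JDBC URL: {jdbc_url}")
    return JdbcParts(parsed.scheme, parsed.hostname, parsed.port, parsed.path.lstrip("/"))
```

For example, split_jdbc_url("jdbc:mysql://hms.example.com:3306/metastore") yields the database type "mysql", the host, the port 3306, and the database name "metastore" as separate fields.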

v0.52.0

12 Dec 14:42
136c536
  • Added handling for Databricks errors during workspace listings in the table migration status refresher (#3378). In this release, we have implemented changes to enhance error handling and improve the stability of the table migration status refresher in the open-source library. We have resolved issue #3262, which addressed Databricks errors during workspace listings. The assessment workflow has been updated, and new unit tests have been added to ensure proper error handling. The changes include the import of DatabricksError from the databricks.sdk.errors module and the addition of a new method _iter_catalogs to list catalogs with error handling for DatabricksError. The _iter_schemas method now replaces _ws.catalogs.list() with self._iter_catalogs(), also including error handling for DatabricksError. Furthermore, new unit tests have been developed to check the logging of the TableMigration class when listing tables in the Databricks workspace, focusing on handling errors during catalog, schema, and table listings. These changes improve the library's robustness and ensure that it can gracefully handle errors during the table migration status refresher process.
  • Convert READ_METADATA to UC BROWSE permission for tables, views and database (#3403). The uc_grant_sql method in the grants.py file has been modified to convert READ_METADATA permissions to BROWSE permissions for tables, views, and databases. This change involves adding new entries to the dictionary used to map permission types to their corresponding UC actions and has been manually tested. The behavior of the grant_loader function in the hive_metastore module has also been modified to change the action type of a grant from READ_METADATA to EXECUTE for a specific case. Additionally, the test_grants.py unit test file has been updated to include a new test case that verifies the conversion of READ_METADATA to BROWSE for a grant on a database and handles the conversion of READ_METADATA permission to UC BROWSE for a new udf="function" parameter. These changes resolve issue #2023 and have been tested through manual testing and unit tests. No new methods have been added, and existing functionality has been changed in a limited scope. No new unit or integration tests have been added as it is assumed that the existing tests will continue to pass after these changes have been made.
  • Migrates Pipelines crawled during the assessment phase (#2778). A new utility class, PipelinesMigrator, has been introduced in this release to facilitate the migration of Delta Live Tables (DLT) pipelines. This class is used in a new workflow that tests pipeline migration, which involves cloning the DLT pipelines crawled in the assessment phase, with specific configurations, to new Unity Catalog (UC) pipelines. The migration can be skipped for certain pipelines by specifying their pipeline IDs in a list. Three test scenarios, each with different pipeline specifications, are defined to ensure the proper functioning of the migration process under various conditions. The class and the migration process are tested with manual testing, unit tests, and integration tests, with no reliance on a staging environment. The migration process takes into account the WorkspaceClient, WorkspaceContext, AccountClient, and a flag for running the command as a collection. The PipelinesMigrator class uses a PipelinesCrawler and JobsCrawler to perform the migration. The commit also introduces a new command, migrate_dlt_pipelines, to the CLI of the ucx package, which helps migrate DLT pipelines. The command is tested using a mock installation; the tests cover the scenario where the installation has two jobs, test and 'assessment', with job IDs 123 and 456 respectively. The state of the installation is recorded in a state.json file. A configuration file pipeline_mapping.csv is used to map the source pipeline ID to the target catalog, schema, pipeline, and workspace names.
  • Removed try-except around verifying the migration progress prerequisites in the migrate-tables cli command (#3439). In the latest release, the ucx package's migrate-tables CLI command has changed how it handles the progress tracking prerequisites. The previous try-except block surrounding the verification has been removed, and the RuntimeWarning is now propagated, providing a more specific and helpful error message. If the prerequisites are not met, the verify method will raise an exception, and the migration will not proceed. This change enhances the accuracy of error messages for users and ensures that the prerequisites for migration are properly met. The tests for migrate_tables have been updated accordingly, including a new test case test_migrate_tables_errors_out_before_assessment that checks that the migration does not proceed when the verification fails. This change affects the existing databricks labs ucx migrate-tables command.
  • Removed redundant internal methods from create_account_group (#3395). In this change, the create_account_group function's internal methods have been removed, and its signature has been modified to retrieve the workspace ID from accountworkspace._workspaces() instead of passing it as a parameter. This resolves issue #3170 and improves code efficiency by removing unnecessary parameters and methods. The AccountWorkspaces class now accepts a list of workspace IDs upon instantiation, enhancing code readability and eliminating redundancy. The function has been tested with unit tests, ensuring it creates a group if it doesn't exist, throws an exception if a group already exists, filters system groups, and handles cases where a group already has the required number of members in a workspace. These changes simplify the codebase, eliminate redundancy, and improve the maintainability of the project.
  • Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.34 (#3407). In this release, the upper bound of the sqlglot requirement has been raised from <25.33 to <25.34, so the 25.33.x releases of sqlglot are now allowed in addition to the previously allowed versions from 25.5.0 onward. This update allows us to use the latest version of sqlglot, which includes various bug fixes and new features. In v25.33.0, there were two breaking changes: the TIMESTAMP data type now maps to Type.TIMESTAMPTZ, and the NEXT keyword is now treated as a function keyword. Several new features were also introduced, including support for generated columns in PostgreSQL and the ability to preserve tables in the replace_table method. Additionally, there were several bug fixes, including fixes for issues related to BigQuery, Presto, and Spark. The v25.32.1 release contained two bug fixes related to BigQuery and one bug fix related to Presto. Furthermore, v25.32.0 had three breaking changes: support for ATTACH/DETACH statements, tokenization of hints as comments, and a fix to datetime coercion in the canonicalize rule. That release also introduced new features, such as support for TO_TIMESTAMP* variants in Snowflake and improved error messages in the Redshift transpiler. Lastly, there were several bug fixes, including fixes for issues related to SQL Server, MySQL, and PostgreSQL.
  • Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.35 (#3413). In this release, the sqlglot dependency has been updated from a version range that allows up to 25.33, but excludes 25.34, to a version range that allows 25.5.0 and above, but excludes 25.35. This update was made to enable the latest version of sqlglot, which includes one breaking change related to the alias expansion of USING STRUCT fields. This version also introduces two new features, an optimization for alias expansion of USING STRUCT fields, and support for generated columns in PostgreSQL. Additionally, two bug fixes were implemented, addressing proper consumption of dashed table parts and removal of parentheses from CURRENT_USER in Presto. The update also includes a fix to make TIMESTAMP map to Type.TIMESTAMPTZ, a fix to parse DEFAULT in VALUES clause into a Var, and changes to the BigQuery and Snowflake dialects to improve transpilation and JSONPathTokenizer leniency. The commit message includes a reference to issue [#3413](https://github.com/databrickslabs/ucx/issues/3413) and a link to the sqlglot changelog for further reference.
  • Updated sqlglot requirement from <25.35,>=25.5.0 to >=25.5.0,<26.1 (#3433). In this release, we have updated the required version of the sqlglot library to a range that includes version 25.5.0 but excludes version 26.1. This change is crucial due to the breaking changes introduced in sqlglot v26.0.0 that are not yet compatible with our project. The commit message includes the changelog for sqlglot v26.0.0, which highlights the breaking changes, new features, bug fixes, and other modifications in this version. Additionally, the commit includes a list of commits merged into the sqlglot repository for a comprehensive understanding of the changes. As a software engineer, I recommend approving this change to maintain compatibility with sqlglot. However, I advise thorough testing to ensure the updated version does n...
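The error handling added to the table migration status refresher in this release (#3378) follows a common pattern: wrap the listing call in a generator that logs a failure instead of propagating it. The sketch below shows that pattern under stated assumptions. ListingError is a stand-in for databricks.sdk.errors.DatabricksError so the example runs without the SDK, and iter_catalogs with its callable parameter is a hypothetical shape, not the actual _iter_catalogs method.

```python
import logging
from collections.abc import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


class ListingError(Exception):
    """Stand-in for databricks.sdk.errors.DatabricksError (assumption for this sketch)."""


def iter_catalogs(list_catalogs: Callable[[], Iterable[str]]) -> Iterator[str]:
    """Yield catalog names; log and stop instead of propagating a listing failure."""
    try:
        # Listing APIs are often lazy, so the error may surface mid-iteration.
        yield from list_catalogs()
    except ListingError as e:
        logger.error("Cannot list catalogs", exc_info=e)
```

Because the try block wraps the whole iteration, items yielded before the failure are still delivered to the caller; only the remainder is lost.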

v0.51.0

02 Dec 20:39
b422e78
  • Added assign-owner-group command (#3111). UCX now includes a new assign-owner-group command, allowing users to assign an owner group to the workspace. This group will be designated as the owner for all migrated tables and views, providing better control and organization of resources. The command can be executed in the context of a specific workspace or across multiple workspaces. The implementation includes new classes, methods, and attributes in various files, such as cli.py, config.py, and groups.py, enhancing ownership management functionality. The assign-owner-group command replaces the functionality of issue #3075 and addresses issue #2890, ensuring proper schema ownership and handling of crawled grants. Developers should be aware that running the migrate-tables workflow will result in assigning a new owner group for the Hive Metastore instance in the workspace installation.
  • Added opencensus to known list (#3052). In this release, we have added OpenCensus to the list of known libraries in our configuration file. OpenCensus is a popular set of tools for distributed tracing and monitoring, and its inclusion in our system will enhance support and integration for users who utilize this tool. This change does not affect existing functionality, but instead adds a new entry in the configuration file for OpenCensus. This enhancement will allow our library to better recognize and work with OpenCensus, enabling improved performance and functionality for our users.
  • Added default owner group selection to the installer (#3370). A new class, AccountGroupLookup, has been added to the AccountGroupLookup module to select the default owner group during the installer process, addressing previous issue #3111. This class uses the workspace_client to determine the owner group, and a pick_owner_group method to prompt the user for a selection if necessary. The ownership selection process has been improved with the addition of a check in the installer's _static_owner method to determine if the current user is part of the default owner group. The GroupManager class has been updated to use the new AccountGroupLookup class and its methods, pick_owner_group and validate_owner_group. A new variable, default_owner_group, is introduced in the ConfigureGroups class to configure groups during installation based on user input. The installer now includes a unit test, "test_configure_with_default_owner_group", to demonstrate how it sets expected workspace configuration values when a default owner group is specified during installation.
  • Added handling for non UTF-8 encoded notebook error explicitly (#3376). A new enhancement has been implemented to address the issue of non-UTF-8 encoded notebooks failing to load by introducing explicit error handling for this case. A UnicodeDecodeError exception is now caught and logged as a warning, while the notebook is skipped and returned as None. This change is implemented in the load_dependency method in the loaders.py file, which is a part of the assessment workflow. Additionally, a new unit test has been added to verify the behavior of this change, and the assessment workflow has been updated accordingly. The new test function in test_loaders.py checks for different types of exceptions, specifically PermissionError and UnicodeDecodeError, ensuring that the system can handle notebooks with non-UTF-8 encoding gracefully. This enhancement resolves issue #3374, thereby improving the overall robustness of the application.
  • Added migration progress documentation (#3333). In this release, we have updated the migration-progress-experimental workflow to track the migration progress of a subset of inventory tables related to workspace resources being migrated to Unity Catalog (UC). The workflow updates the inventory tables and tracks the migration progress in the UCX catalog tables. To use this workflow, users must attach a UC metastore to the workspace, create a UCX catalog, and ensure that the assessment job has run successfully. The Migration Progress section in the documentation has been updated with a new markdown file that provides details about the migration progress, including a migration progress dashboard and an experimental migration progress workflow that generates historical records of inventory objects relevant to the migration progress. These records are stored in the UCX UC catalog, which contains a historical table with information about the object type, object ID, data, failures, owner, and UCX version. The migration process also tracks dangling Hive or workspace objects that are not referenced by business resources, and the progress is persisted in the UCX UC catalog, allowing for cross-workspace tracking of migration progress.
  • Added note about running assessment once (#3398). In this release, we have introduced an update to the UCX assessment workflow, which will now only be executed once and will not update existing results in repeated runs. To accommodate this change, we have updated the README file with a note clarifying that the assessment workflow is a one-time process. Additionally, we have provided instructions on how to update the inventory and findings by uninstalling and reinstalling the UCX. This will ensure that the inventory and findings for a workspace are up-to-date and accurate. We recommend that software engineers take note of this change and follow the updated instructions when using the UCX assessment workflow.
  • Allowing skipping TACLs migration during table migration (#3384). A new optional flag, "skip_tacl_migration", has been added to the configuration file, providing users with more flexibility during migration. This flag allows users to control whether or not to skip the Table Access Control Language (TACL) migration during table migrations. It can be set when creating catalogs and schemas, as well as when migrating tables or using the migrate_grants method in application.py. Additionally, the install.py file now includes a new variable, skip_tacl_migration, which can be set to True during the installation process to skip TACL migration. New test cases have been added to verify the functionality of skipping TACL migration during grants management and table migration. These changes enhance the flexibility of the system for users managing table migrations and TACL operations in their infrastructure, addressing issues #3384 and #3042.
  • Bump databricks-sdk and databricks-labs-lsql dependencies (#3332). In this update, the databricks-sdk and databricks-labs-lsql dependencies are upgraded to versions 0.38 and 0.14.0, respectively. The databricks-sdk update addresses conflicts, bug fixes, and introduces new API additions and changes, notably impacting methods like create(), execute_message_query(), and others in workspace-level services. While databricks-labs-lsql updates ensure compatibility, its changelog and specific commits are not provided. This pull request also includes ignore conditions for the databricks-sdk dependency to prevent future Dependabot requests. It is strongly advised to rigorously test these updates to avoid any compatibility issues or breaking changes with the existing codebase. This pull request mirrors another (#3329), resolving integration CI issues that prevented the original from merging.
  • Explain failures when cluster encounters Py4J error (#3318). In this release, we have made significant improvements to the error handling mechanism in our open-source library. Specifically, we have addressed issue #3318, which involved handling failures when the cluster encounters Py4J errors in the databricks/labs/ucx/hive_metastore/tables.py file. We have added code to raise noisy failures instead of swallowing the error with a warning when a Py4J error occurs. The functions _all_databases() and _list_tables() have been updated to check if the error message contains "py4j.security.Py4JSecurityException", and if so, log an error message with instructions to update or reinstall UCX. If the error message does not contain "py4j.security.Py4JSecurityException", the functions log a warning message and return an empty list. These changes also resolve the linked issue #3271. The functionality has been thoroughly tested and verified on the labs environment. These improvements provide more informative error messages and enhance the overall reliability of our library.
  • Rearranged job summary dashboard columns and make job_name clickable (#3311). In this update, the job summary dashboard columns have been improved and the need for the 30_3_job_details.sql file, which contained a SQL query for selecting job details from the inventory.jobs table, has been eliminated. The dashboard columns have been rearranged, and the job_name column is now clickable, providing easy access to job details via the corresponding job ID. The changes include modifying the...
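The handling of non-UTF-8 encoded notebooks described in this release (#3376) can be illustrated with a short sketch: catch the decode error, log a warning, and return None so the caller can skip the file. The function name load_notebook_source is hypothetical and this is not the body of load_dependency in loaders.py; treating PermissionError the same way is an assumption based on the test description in the release note.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def load_notebook_source(path: Path) -> Optional[str]:
    """Return the notebook source, or None if it cannot be read as UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except PermissionError:
        logger.warning(f"Permission error while reading notebook: {path}")
        return None
    except UnicodeDecodeError:
        # Skip the notebook rather than failing the whole assessment run.
        logger.warning(f"Cannot decode non-UTF-8 encoded notebook: {path}")
        return None
```

Returning None keeps the decision about what to do with an unreadable notebook in the caller, which can record the skip and continue with the next file.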

v0.50.0

18 Nov 14:48
@nfx nfx
2483f3f
  • Added pytesseract to known list (#3235). A new addition has been made to the known.json file, which tracks packages with native code, to include pytesseract, an Optical Character Recognition (OCR) tool for Python. This change improves the handling of pytesseract within the codebase and addresses part of issue #1931, likely concerning the seamless incorporation of pytesseract and its native components. However, specific details on the usage of pytesseract within the project are not provided in the diff. Thus, further context or documentation may be necessary for a complete understanding of the integration. Nonetheless, this commit simplifies and clarifies the codebase's treatment of pytesseract and its native dependencies, making it easier to work with.
  • Added hyperlink to database names in database summary dashboard (#3310). The recent change to the Database Summary dashboard includes the addition of clickable database names, opening a new tab with the corresponding database page. This has been accomplished by adding a linkUrlTemplate property to the database field in the encodings object within the overrides property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue #3258. Furthermore, the display of various other statistics, such as the number of tables, views, and grants, have been improved by converting them to links, enhancing the overall usability and navigation of the dashboard.
  • Bump codecov/codecov-action from 4 to 5 (#3316). In this release, the version of the codecov/codecov-action dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including binary, gcov_args, gcov_executable, gcov_ignore, gcov_include, report_type, skip_validation, and swift_project. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking.
  • Depend on a Databricks SDK release compatible with 0.31.0 (#3273). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new InvalidState error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the pyproject.toml file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
  • Eliminate redundant migration-index refresh and loads during view migration (#3223). In this pull request, we have optimized the view migration process in the databricks/labs/ucx/hive_metastore/table_metastore.py file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class TableMigrationIndex and imported the TableMigrationStatusRefresher class. The _migrate_views method now takes an additional argument migration_index, which is used in the ViewsMigrationSequencer and in the _migrate_view method. The _view_can_be_migrated and _sql_migrate_view methods now also take migration_index as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly.
  • Fixed backwards compatibility breakage from Databricks SDK (#3324). In this release, we have addressed a backwards compatibility issue (#3324) that was caused by an update to the Databricks SDK. This was done by adding new methods to the databricks.sdk.service module to interact with dashboards. Additionally, we have fixed bug #3322 and updated the create function in the conftest.py file to use the new dashboards module and its Dashboard class. The function now returns the dashboard object as a dictionary and calls the publish method on this object to publish the dashboard. These changes also include an update to the pyproject.toml file, which affects the test and coverage scripts used in the default environment. The minimum required test coverage has been lowered from 90% to 89%: the test command now includes the --cov-fail-under=89 flag, so the run fails if coverage drops below that threshold.
  • Fixed issue with cleanup of failed create-missing-principals command (#3243). In this update, we have improved the create_uc_roles method within the access.py file of the databricks/labs/ucx/aws directory to handle failures during role creation caused by permission issues. If a failure occurs, the method now deletes any created roles before raising the exception, restoring the system to its initial state. This ensures that the system remains consistent and prevents the accumulation of partially created roles. The update includes a try-except block around the code that creates the role and adds a policy to it, and it logs an error message, deletes any previously created roles, and raises the exception again if a PermissionDenied or NotFound exception is raised during this process. We have also added unit tests to verify the behavior of the updated method, covering the scenario where a failure occurs and the roles are successfully deleted. These changes aim to improve the robustness of the databricks labs ucx create-missing-principals command by handling permission errors and restoring the system to its initial state.
  • Improve error handling for assess_workflows task (#3255). This pull request introduces improvements to the assess_workflows task in the databricks/labs/ucx module, focusing on error handling and logging. A new error type, DatabricksError, has been added to handle Databricks-specific exceptions in the _temporary_copy method, ensuring proper handling and re-raising of Databricks-related errors as InvalidPath exceptions. Additionally, log levels for various errors have been updated to better reflect their severity. Recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors now have their log levels changed from error to warning. These adjustments provide more fine-grained control over error messages' severity and help avoid unnecessary alarm when these issues occur. These changes improve the robustness, error handling, and logging of the assess_workflows task, ensuring appropriate handling and logging of any errors that may occur during execution.
  • Require at least 4 cores for UCX VMs (#3229). In this release, the selection of node_type_id in the policy.py file has been updated to require at least 4 cores for UCX VMs, in addition to local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the node_type_id parameter, so that only virtual machines (VMs) with at least 4 cores are selected for UCX.
  • Skip test_feature_tables integration test (#3326). This change skips the test_feature_tables integration test. It is linked to issues #3304 and #3.
  • Speed up update_migration_status jobs by eliminating lots of redundant SQL queries (#3200). In this relea...
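
The cleanup behaviour described for create-missing-principals (#3243) can be sketched as follows. This is a minimal illustration, not the UCX implementation: FakeIam, create_roles_with_rollback, and the local PermissionDenied class are hypothetical stand-ins, and the real create_uc_roles method also attaches policies and handles NotFound.

```python
from dataclasses import dataclass, field
from typing import Optional


class PermissionDenied(Exception):
    """Stand-in for the permission error raised by the cloud API (assumption)."""


@dataclass
class FakeIam:
    """Hypothetical in-memory IAM client, used only for this illustration."""

    fail_on: Optional[str] = None
    roles: set = field(default_factory=set)

    def create_role(self, name: str) -> None:
        if name == self.fail_on:
            raise PermissionDenied(f"cannot create {name}")
        self.roles.add(name)

    def delete_role(self, name: str) -> None:
        self.roles.discard(name)


def create_roles_with_rollback(iam: FakeIam, names: list) -> list:
    """Create every role, or none: undo partial work before re-raising."""
    created = []
    try:
        for name in names:
            iam.create_role(name)
            created.append(name)
    except PermissionDenied:
        # Delete the roles created so far, then surface the original error.
        for name in created:
            iam.delete_role(name)
        raise
    return created
```

Re-raising after the cleanup keeps the failure visible to the caller while leaving no partially created roles behind.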

v0.49.0

08 Nov 15:37
@nfx nfx
f97883e
  • Added MigrationSequencer for jobs (#3008). In this commit, a MigrationSequencer class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of MigrationStep objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality is working correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, and new cluster definitions. This change is linked to issue #1415 and supersedes issue #2980. Additionally, the commit removes some unnecessary imports and fixtures from a test file.
  • Added phik to known list (#3198). In this release, we have added phik to the known list in the provided JSON file. This change addresses part of issue #1931, as outlined in the linked issues. The phik key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. It is important to note that no existing functionality has been altered and no new methods have been introduced in this commit. The scope of the change is confined to updating the known list in the JSON file by adding the phik key.
  • Added pmdarima to known list (#3199). In this release, we have added pmdarima, an open-source Python library for time series analysis, to the known list of libraries. The library is particularly useful for fitting ARIMA models and testing for seasonality. This change partly resolves issue #1931.
  • Added preshed to known list (#3220). The preshed library has been added to the known list. Developed using Cython, preshed provides hash tables for pre-hashed keys. Two modules are included, preshed and "preshed.about". This addition partially resolves issue #1931.
  • Added py-cpuinfo to known list (#3221). In this release, we have added support for the py-cpuinfo library to our project, enabling the use of the cpuinfo functionality that it provides. With this addition, developers can now access detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue #1931 and does not affect any existing functionality or add new methods to the codebase. We believe that this improvement will enhance the capabilities of our project and enable more efficient use of CPU resources.
  • Cater for empty python cells (#3212). In this release, we have resolved an issue where certain notebook cells in the dependency builder were causing crashes. Specifically, empty or comment-only cells were identified as the source of the problem. To address this, we have implemented a check to account for these cases, ensuring that an empty tree is stored in the _python_trees dictionary if the input cell does not produce a valid tree. This change helps prevent crashes in the dependency builder caused by empty or comment-only cells. Furthermore, we have added a test to verify the fix on a failed repository. If a cell does not produce a tree, the _load_children_from_tree method will not be executed for that cell, skipping the loading of any children trees. This enhancement improves the overall stability and reliability of the library by preventing crashes caused by invalid input.
  • Create TODO issues every nightly run (#3196). A commit has been made to update the acceptance repository version in the acceptance.yml GitHub workflow from acceptance/v0.4.0 to acceptance/v0.4.2, which affects the integration tests. The Run nightly tests step in the GitHub repository's workflow has also been updated to use a newer version of the databrickslabs/sandbox/acceptance action, from v0.3.1 to v0.4.2. Software engineers should verify that the new version of the acceptance repository contains all necessary updates and fixes, and that the integration tests continue to function as expected. Additionally, testing the updated action is important to ensure that the nightly tests run successfully with up-to-date code and can catch potential issues.
  • Fixed Integration test failure of migration_tables (#3108). This release includes a fix for two integration tests (test_migrate_managed_table_to_external_table_without_conversion and test_migrate_managed_table_to_external_table_with_clone) related to Hive Metastore table migration, addressing issues #3054 and #3055. Previously skipped due to underlying problems, these tests have now been unskipped, enhancing the migration feature's test coverage. No changes have been made to the existing functionality, as the focus is solely on including the previously skipped tests in the testing suite. The changes involve removing @pytest.mark.skip markers from the test functions, ensuring they run and provide a more comprehensive test coverage for the Hive Metastore migration feature. In addition, this release includes an update to DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types, with no modifications made to other parts of the codebase.
  • Replace MockInstallation with MockPathLookup for testing fixtures (#3215). In this release, we have updated the testing fixtures in our unit tests by replacing the MockInstallation class with MockPathLookup. Specifically, we have modified the _load_sources function to use MockPathLookup instead of MockInstallation for loading sources. This change not only enhances the testing capabilities of the module but also introduces a new logger, logger, for more precise logging within the module. Additionally, we have updated the _load_sources function calls in the test_notebook.py file to pass the file path directly instead of a SourceContainer object. This modification allows for more flexible and straightforward testing of file-related functionality, thereby fixing issue #3115.
  • Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 (#3224). The open-source library sqlglot has been updated to version 25.29.0 with this release, incorporating several breaking changes, new features, and bug fixes. The breaking changes include transpiling ANY to EXISTS, supporting the MEDIAN() function, wrapping values in NOT value IS ..., and parsing information schema views into a single identifier. New features include support for the JSONB_EXISTS function in PostgreSQL, transpiling ANY to EXISTS in Spark, transpiling Snowflake's TIMESTAMP() function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a Move edge case in the semantic differ, adding a NULL filter on ARRAY_AGG only for columns, improving parsing of WITH FILL ... INTERPOLATE in Clickhouse, generating LOG(...) for exp.Ln in TSQL, and optionally parsing a Stream expression. The full changelog can be found in the pull request, which also includes a list of the commits included in this release.
  • Use acceptance/v0.4.0 (#3192). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the databrickslabs/sandbox/acceptance runner to acceptance/v0.4.0 and granting write permissions for the issues field in the permissions section. These updates will allow for the use of the latest version of the acceptance tests and provide the necessary permissions to interact with issues. A TODO comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly.
  • Warn about errors instead to a...
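
The dependency-graph idea behind the MigrationSequencer (#3008) can be illustrated with a topological sort. This is a sketch under stated assumptions: Step and sequence are hypothetical names, the standard-library graphlib module stands in for whatever graph analysis UCX performs, and the real MigrationStep objects carry more fields (object type, name, owner).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Step:
    """Simplified, hypothetical counterpart of a migration step."""

    step_number: int
    object_id: str
    required_ids: tuple


def sequence(dependencies: dict) -> list:
    """Order objects so that each one comes after everything it depends on.

    `dependencies` maps an object id to the set of ids it requires.
    """
    order = TopologicalSorter(dependencies).static_order()
    return [
        Step(number, object_id, tuple(sorted(dependencies.get(object_id, ()))))
        for number, object_id in enumerate(order, start=1)
    ]
```

For example, with a job that requires a task and a task that requires a cluster, the cluster is sequenced first and the job last.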

v0.48.0

30 Oct 16:37
@nfx nfx
b184c04
  • Added --dry-run option for ACL migrate (#3017). In this release, we have added a --dry-run option to the migrate-acls command in the labs.yml file, enabling a preview of the migration process without executing it. This feature also introduces the hms-fed flag, allowing migration of HMS-FED ACLs while migrating tables. The ACLMigrator class in the application.py file has been updated to include new parameters, sql_backend and inventory_database, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new retrieve method has been added to the ACLMigrator class to retrieve a list of grants based on the source and destination objects, and a CrawlerBase class has been introduced for fetching grants. We have also introduced a new inferred_grants table in the deployment schema to store inferred grants during the migration process.
  • Added WorkspacePathOwnership to determine transitive owners for files and notebooks (#3047). In this release, we introduce a new class WorkspacePathOwnership in the owners.py module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of Ownership and takes AdministratorLocator and WorkspaceClient as inputs. It has methods to infer the owner from the first CAN_MANAGE permission level in the access control list. We also added a new property workspace_path_ownership to the existing HiveMetastoreContext class, which returns a WorkspacePathOwnership object initialized with an AdministratorLocator object and a workspace_client. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to test_owners.py. The new tests, test_notebook_owner and test_file_owner, create a notebook and a workspace file and verify the owner of each using the owner_of method. The AdministratorLocator is used to locate the administrators group for the workspace and the PermissionLevel class is used to specify the permission level for the notebook permissions.
  • Added mosaicml-streaming to known list (#3029). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include mosaicml-streaming, oci, pynacl, pyopenssl, python-snapy, and zstd. Notably, mosaicml-streaming has two new entries, simulation and streaming, while the other packages have a single entry each. This update addresses issue #1931 and enhances the system's ability to identify and work with a wider variety of packages.
  • Added msal-extensions to known list (#3030). In this release, we have added support for two new packages, msal-extensions and portalocker, to our project. The msal-extensions package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The portalocker package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications.
  • Added multimethod to known list (#3031). In this release, we have added the multimethod package to the known list in the known.json file, which partially resolves issue #193
  • Added murmurhash to known list (#3032). The murmurhash package has been added to the known list, addressing part of issue #1931. The entry includes two modules, murmurhash and "murmurhash.about": the former provides the core hashing functionality, while the latter contains metadata about the package.
  • Added ninja to known list (#3050). In this release, we have added Ninja to the known list in the known.json file. Ninja is a fast, lightweight build system that enables better integration and handling within the project's larger context. This change partially resolves issue #1931, which may have been caused by challenges in integrating or using Ninja. It is important to note that this change does not modify any existing functionality or introduce new methods. The alteration is limited to including Ninja in the known list, improving the management and identification of various components within the project.
  • Added nvidia-ml-py to known list (#3051). In this release, we have added support for the nvidia-ml-py package to our project. This addition consists of two components: example and 'pynvml'. Example is likely a placeholder or sample usage of the package, while pynvml is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This enhancement is a significant step towards resolving issue #1931, which may require the use of NVIDIA-related tools or libraries, thereby improving the project's functionality and capabilities.
  • Added dashboard for tracking migration progress (#3016). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, _create_dashboard, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard.
  • Added grant progress encoder (#3079). A new GrantsProgressEncoder class has been introduced in the progress/grants.py file to encode Grant objects into History objects for the migration-progress workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where Grant objects fail to map to the Unity Catalog by adding a list of failures to the History object. The commit also modifies the migration-progress workflow to incorporate the new GrantsProgressEncoder class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue #3058, which was related to grant progress encoding. The GrantsProgressEncoder class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database.
  • Added table progress encoder (#3083). In this release, we've added a table progress encoder to the WorkflowTask context to enhance the tracking of table-related operations in the migration-progress workflow. This new encoder, implemented in the TableProgressEncoder class, is connected to the sql_backend, table_ownership, and migration_status_refresher objects. The GrantsProgressEncoder class has been refactored to GrantProgressEncoder, with additional parameters for improved encoding of grants. We've also introduced the refresh_table_migration_status task to scan and record the migration status of tables and views in the inventory, storing results in the $inventory.migration_status inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues #3061 and #3064.
  • Combine static code analysis results with historical job snapshots (#3074). In this release, we have added a new method, JobsProgressEncoder, to the WorkflowTask class in the databricks.labs.ucx.contexts module. This method is used to track the progress of jobs in the context of a workflow task, re...
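
The ownership rule described for WorkspacePathOwnership (#3047), taking the owner from the first CAN_MANAGE entry in the access control list, can be sketched as below. AclEntry and infer_owner are hypothetical names, and falling back to a workspace administrator when no entry matches is an assumption of this sketch rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """Hypothetical, flattened view of one access-control-list entry."""

    principal: str
    permission_level: str


def infer_owner(acl: list, fallback_admin: str) -> str:
    """Return the first principal holding CAN_MANAGE, else the fallback admin."""
    for entry in acl:
        if entry.permission_level == "CAN_MANAGE":
            return entry.principal
    # Assumption: with no CAN_MANAGE entry, ownership falls to an administrator.
    return fallback_admin
```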

v0.47.0

21 Oct 12:18
@nfx nfx
8837bb4
  • Added mdit-py-plugins to known list (#3013). In this release, the mdit-py-plugins package has been added to the known list.
  • Added memray to known list (#3014). In this release, we have integrated two new libraries to enhance the project's functionality and maintainability. We have added memray to our list of known libraries, which allows for memory profiling and analysis within the project's environment. Additionally, we have added the textual library and its related modules, a TUI (Text User Interface) library, which provides a wide variety of user interface components. These additions partially resolve issue #1931, enabling the development of more sophisticated and user-friendly interfaces, and improving memory profiling capabilities.
  • Added mlflow-skinny to known list (#3015). A new version of our library includes the addition of mlflow-skinny to the known packages list in a JSON file. mlflow-skinny is a lightweight version of the widely-used machine learning platform, MLflow. This integration enables users to utilize mlflow-skinny in their projects and have their runs automatically tracked and logged. Furthermore, this commit partially addresses issue #1931, hinting at a possible connection to a larger issue or feature request. Software engineers will now have access to a more streamlined MLflow package, allowing for easier and more efficient integration in their projects.
  • Added handling for installing libraries multiple times in PipResolver (#3024). In this commit, the PipResolver class has been updated to handle the installation of libraries multiple times, resolving issues #3022 and #3023. The _resolve_libraries method has been modified to resolve pip installs as libraries or paths based on whether they are found in the path lookup or not, and whether they are already installed in the temporary virtual environment. The _install_pip method has also been updated to include the --upgrade flag to upgrade libraries if they are already installed. Code linting has been improved, and integration tests have been added to the test_libraries.py file to ensure the proper functioning of the updated code. These tests include installing the pytest library twice in a Databricks notebook and then importing it to verify its installation. These changes aim to improve the reliability and robustness of the library installation process in the context of multiple installations.
  • Fixed errors related to unsupported cell languages (#3026). In this release, we have improved the _Collector abstract base class by adding support for multiple cell languages in the _collect_from_source method. Previously, the implementation only supported Python and SQL; with this update, it also handles R, Scala, Shell, Markdown, Run, and Pip cells. The new methods handle source code collection for their respective languages and return an empty iterable or log a warning if a language is not supported yet. This commit resolves issue #2977 and adds new methods to the DfsaCollectorWalker class, allowing it to collect information from cells of any language. The test case test_collector_supports_all_cell_languages has been added to ensure that the collector supports all cell languages. The change was manually tested, includes new unit tests, and is co-authored by Eric Vergnaud.
  • Preemptively fix unknown errors of Python AST parsing coming from astroid and ast libraries (#3027). A new update has been implemented in the library to improve Python AST parsing and error handling. The maybe_parse function has been enhanced to catch all types of exceptions using a broad exception clause, extending from the previous limitation of only catching AstroidSyntaxError and SystemError. The _definitely_failure function now includes the type of exception in the error message for better visibility and troubleshooting. In the test cases, the graph_builder_parse_error function's test has been updated to check for a system-error code instead of syntax-error to preemptively fix unknown errors from Python AST parsing. Additionally, the test for parses_python_cell_with_magic_commands function has been added, ensuring that any Python cell with magic commands is correctly parsed. These changes aim to increase robustness in handling exceptional cases during parsing, provide more informative error messages, and prevent potential unknown parsing errors.
  • Updated migration progress workflow to also re-lint dashboards and jobs (#3025). In this release, we have updated the table utilization documentation to include the ability to lint directFS paths and queries, and modified the migration-progress-experimental workflow to re-run linting tasks for dashboard queries and notebooks associated with jobs. Additionally, we have updated the MigrationProgress workflow to include the scanning of dashboards and jobs for migration issues, assessing SQL code in embedded widgets of dashboards and inventory & linting of jobs. To support these changes, we have added unit tests and updated existing integration tests in test_workflows.py. The new test function, test_linter_runtime_refresh, tests the linter refresh behavior for dashboard and workflow tasks. These updates aim to ensure consistent linting and maintain the accuracy of the experimental-migration-progress workflow for users who adopt the project.
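
The broad exception handling described for maybe_parse (#3027) can be sketched as follows. This is an illustration only: it uses the standard-library ast module rather than astroid, and the MaybeTree shape shown here is a hypothetical simplification of whatever UCX returns.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaybeTree:
    """Either a parsed tree or a failure message, never both."""

    tree: Optional[ast.AST] = None
    failure: Optional[str] = None


def maybe_parse(code: str) -> MaybeTree:
    """Parse Python source, converting any parser error into a failure value."""
    try:
        return MaybeTree(tree=ast.parse(code))
    except Exception as e:  # deliberately broad: parser errors are not all SyntaxError
        # Include the exception type so the failure is easier to troubleshoot.
        return MaybeTree(failure=f"{type(e).__name__}: {e}")
```

Returning a value instead of raising lets the caller record the failure for one cell and continue with the rest of the job.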

Contributors: @pritishpai, @JCZuurmond, @asnare, @ericvergnaud, @nfx

v0.46.0

17 Oct 16:16
@nfx nfx
462abe6
  • Added lazy_loader to known list (#2991). With this commit, the lazy_loader module has been added to the known list in the configuration file, addressing a portion of issue #193, which may have been caused by the discovery or loading of this module. The lazy_loader is a package or module that, once added to the known list, will be recognized and loaded by the system. This change does not affect any existing functionality or introduce new methods. The commit solely updates the known.json file to include lazy_loader with an empty list, indicating that it is ready for use. This modification will enable the correct loading and recognition of the lazy_loader module in the system.
  • Added librosa to known list (#2992). In this update, we have added several open-source libraries to the known list in the configuration file, including librosa, llvmlite, msgpack, pooch, soundfile, and soxr. These libraries are commonly used in data engineering, machine learning, and scientific computing tasks. librosa is a Python library for audio and music analysis, while llvmlite is a lightweight Python interface to the LLVM compiler infrastructure. msgpack is a binary serialization format like JSON, pooch is a package for managing external data files, soundfile is a library for reading and writing audio files, and soxr is a library for high-quality audio resampling. Each library has an empty list next to it for specifying additional configuration related to the library. This update partially resolves issue #1931 by adding librosa to the known list, ensuring that these libraries will be properly recognized and utilized by the codebase.
  • Added linkify-it-py to known list (#2993). In this release, we have added support for two new open-source packages, linkify-it-py and uc-micro-py, to enhance the software's functionality and compatibility. The addition of linkify-it-py and its constituent modules, as well as the incorporation of uc-micro-py with its modules and classes, aims to expand the software's capabilities. These changes are related to the resolution of issue #1931, and they will enable the software to work seamlessly with these packages, thereby providing a better user experience.
  • Added lz4 to known list (#2994). In this release, we have added lz4 to the known list. The package provides Python bindings for the LZ4 lossless data compression algorithm, which is known for its focus on compression and decompression speed. The entry includes four modules: lz4, lz4.block, lz4.frame, and lz4.version. This addition partially addresses issue #1931.
  • Fixed SystemError: AST constructor recursion depth mismatch failing the entire job (#3000). This PR introduces more deterministic, Go-style error handling for parsing Python code, so that a SystemError: AST constructor recursion depth mismatch no longer fails the entire job (#3000, bug #2976). It removes the AstroidSyntaxError import, adds an import for SqlglotError, and replaces the SqlParseError exception with SqlglotError in the lint method of the SqlLinter class. The abstract classes TablePyCollector and DfsaPyCollector, along with their methods for collecting tables and direct file system accesses, have been removed. The PythonSequentialLinter class, which previously handled multiple responsibilities, has also been removed to improve modularity and testability. The changes affect the base.py, python_ast.py, and python_sequential_linter.py modules.
  • Skip applying permissions for workspace system groups to Unity Catalog resources (#2997). This commit changes the ACL-related code in the databricks labs ucx create-catalogs-schemas command and the migrate-table-* workflow so that permissions for workspace system groups are no longer applied to Unity Catalog resources. These system groups, which include 'admins', do not exist at the account level. Unit and integration tests have been added, including a test that checks the handling of privileges held by system groups during catalog and schema creation. The AccessControlResponse object has been updated for the admins and users groups, granting them specific permissions on a workspace and a warehouse object, respectively.
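The Go-style error handling mentioned for #3000 can be sketched roughly as follows. This is a minimal illustration, not the UCX implementation: it uses the standard-library ast module rather than astroid, and ParseResult, maybe_parse, and collect_failures are hypothetical names. The idea is that a parse failure is returned as a value, so one bad file is reported without aborting the rest of the job.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical result type: exactly one of `tree` or `failure` is set."""

    tree: Optional[ast.AST] = None
    failure: Optional[str] = None


def maybe_parse(source: str) -> ParseResult:
    """Parse Python source, returning the failure as a value instead of raising."""
    try:
        return ParseResult(tree=ast.parse(source))
    except (SyntaxError, ValueError, RecursionError, SystemError) as e:
        # SystemError is included because that is the error class named in #3000.
        return ParseResult(failure=f"{type(e).__name__}: {e}")


def collect_failures(sources: dict) -> list:
    """Check every file; a failure in one file does not stop the others."""
    problems = []
    for path, code in sources.items():
        result = maybe_parse(code)
        if result.failure is not None:
            problems.append(f"{path}: {result.failure}")
    return problems
```

Callers check result.failure before using result.tree, much like checking an err return value in Go.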

Contributors: @pritishpai, @asnare, @JCZuurmond, @nfx

v0.45.0

16 Oct 15:50
@nfx
2cbd166
  • Added DBFS Root resolution when HMS Federation is enabled (#2947). This commit introduces a DBFS resolver for use with HMS (Hive Metastore) federation, enabling accurate resolution of DBFS root locations when HMS federation is enabled. A new _resolve_dbfs_root() class method is added to the MountsCrawler class, and a boolean argument enable_hms_federation is included in the MountsCrawler constructor, providing better handling of federation functionality. The commit also adds a test function, test_resolve_dbfs_root_in_hms_federation, to validate the resolution of DBFS roots with HMS federation. The test covers special cases, such as the /user/hive/metastore path, and utilizes LocationTrie for more accurate location guessing. These changes aim to improve the overall DBFS root resolution when using HMS federation.
  • Added jax-jumpy to known list (#2959). This release adds the jax-jumpy package to the known list. jax-jumpy is a Python-based numerical computation library. The following modules are now recognized: jumpy, jumpy._base_fns, jumpy.core, jumpy.lax, jumpy.numpy, jumpy.numpy._factory_fns, jumpy.numpy._transform_data, jumpy.numpy._types, jumpy.numpy.linalg, jumpy.ops, and jumpy.random. This change partially resolves issue #1931.
  • Added joblibspark to known list (#2960). This release adds the joblibspark library to the known.json file, which keeps track of known libraries and their modules. The new entries are doc, doc.conf, joblibspark, joblibspark.backend, and joblibspark.utils. This change partially resolves issue #1931.
  • Added jsonpatch to known list (#2969). This release adds jsonpatch to the list of known libraries in the known.json file. jsonpatch is a library for applying JSON patches, which describe partial updates to a JSON document. This change partially addresses issue #1931.
  • Added langchain-community to known list (#2970). This release adds an entry for langchain-community to the known list. The entry includes the modules 'langchain_community.agents', 'langchain_community.callbacks', 'langchain_community.chat_loaders', 'langchain_community.chat_message_histories', 'langchain_community.chat_models', 'langchain_community.cross_encoders', 'langchain_community.docstore', 'langchain_community.document_compressors', 'langchain_community.document_loaders', 'langchain_community.document_transformers', 'langchain_community.embeddings', 'langchain_community.example_selectors', 'langchain_community.graph_vectorstores', 'langchain_community.graphs', 'langchain_community.indexes', 'langchain_community.llms', 'langchain_community.memory', 'langchain_community.output_parsers', 'langchain_community.query_constructors', 'langchain_community.retrievers', 'langchain_community.storage', 'langchain_community.tools', 'langchain_community.utilities', and 'langchain_community.utils'. Each of these entries is currently empty. This change partially resolves issue #1931.
  • Added langcodes to known list (#2971). The langcodes library has been added to the known list, addressing part of issue #1931. Its modules provide functionality for working with language codes: langcodes, langcodes.build_data, langcodes.data_dicts, langcodes.language_distance, langcodes.language_lists, langcodes.registry_parser, langcodes.tag_parser, and langcodes.util. marisa-trie, a memory-efficient trie (prefix tree) data structure library, has also been added to the known list. No existing functionality is altered in this commit.
  • Addressing Ownership Conflict when creating catalog/schemas (#2956). This release handles ownership conflicts during catalog and schema creation. The _apply_from_legacy_table_acls method now uses two loops to process non-own grants and own grants separately, generating and executing the UC grant SQL for each grant type with appropriate exception handling. A new helper function, this_type_and_key(), has been added to improve code readability. The release also introduces GrantsCrawler and Rule in the hive_metastore package of the labs.ucx module, responsible for populating views and mapping source and destination objects. The test_catalog_schema.py file has been updated with tests for creating catalogs and schemas with legacy ACLs, using Rule and GrantsCrawler. These changes address issue #2932.
  • Clarify skip and unskip commands work on views (#2962). The skip and unskip commands of databricks labs ucx now state explicitly that they work on views, and a --view flag has been added. These commands let users skip or unskip specific schemas, tables, or views during table migration, which is useful for temporarily excluding an object from migration. Unit tests have been added for the commands' behavior on views: two tests cover unskip when a schema or table is specified, two cover unskip when a view or no schema is specified, and two verify that an error message is logged when both the --table and --view flags are specified.
  • Fixed issue with migrating MANAGED hive_metastore table to UC (#2928). This commit addresses an issue with migrating Hive Metastore (HMS) MANAGED tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table can result in data loss. To prevent this, a new option CONVERT_TO_EXTERNAL has been added to the migrate_tables method for migrating managed tables to UC as external, ensuring that the HMS managed table is converted to an external table in HMS and UC, and protecting against data loss when deleting a managed table that has been migrated to UC as external. Additionally, new caching properties have been added for better performance, and existing methods have been modified to handle the migration of managed tables to UC as external. Tests, including unit and integration tests, have been added to ensure the proper functioning of these changes. It is important to note that changing MANAGED tables to EXTERNAL can have potential consequences on regulatory data cleanup, and the impact of this change should be carefully validated for existing workloads.
  • Let create-catalogs-schemas reuse MigrateGrants so that it applies group renaming (#2955). The create-catalogs-schemas command in the databricks labs ucx package now reuses the MigrateGrants class, which enables group renaming and eliminates redundant code. The migrate-tables workflow remains functionally the same. Changes include modifying the CatalogSchema class to accept a migrate_grants argument, introducing new Catalog and Schema dataclasses, and updating various methods in the hive_metastore module. The MigrateGrants class has been updated to accept two SecurableObject arguments and to sort matched grants. The from_src_dst function in mapping.py now includes a new as_uc_table method and updates to as_uc_table_key. The changes address issues #2934, #2932, and #2955, and also include a new key property in tables.py and updates to the test_create_catalogs_schemas and test_migrate_tables test functions. Unit and integration tests have been added and manually verified.
  • Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](...
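The two-loop handling of grants described for #2956 can be sketched roughly as follows. This is a simplified illustration, not the UCX code: the Grant record, grant_sql, and ordered_statements are hypothetical, and the sketch assumes that non-ownership grants are emitted first and ownership changes last, which the release note does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical, simplified grant record (not the UCX Grant class)."""

    principal: str
    action_type: str  # e.g. "SELECT", "USE SCHEMA", or "OWN"
    object_type: str  # e.g. "CATALOG" or "SCHEMA"
    object_key: str   # e.g. "main" or "main.sales"


def grant_sql(grant: Grant) -> str:
    """Render one grant as SQL; ownership is a transfer, not a GRANT."""
    target = f"{grant.object_type} {grant.object_key}"
    if grant.action_type == "OWN":
        return f"ALTER {target} OWNER TO `{grant.principal}`"
    return f"GRANT {grant.action_type} ON {target} TO `{grant.principal}`"


def ordered_statements(grants: list) -> list:
    """Two passes: non-ownership grants first, ownership grants second (assumed order)."""
    statements = []
    for grant in grants:
        if grant.action_type != "OWN":
            statements.append(grant_sql(grant))
    for grant in grants:
        if grant.action_type == "OWN":
            statements.append(grant_sql(grant))
    return statements
```

Keeping ownership in its own pass means a failure while changing an owner can be handled separately from ordinary privilege grants.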