Releases: databrickslabs/ucx
v0.53.1
- Removed `packaging` package dependency (#3469). In this release, we have removed the dependency on the `packaging` package in the open-source library to address a release issue. The import statements for `packaging.version.Version` and `packaging.version.InvalidVersion` have been removed. The function `_external_hms` in the `federation.py` file has been updated to retrieve the Hive Metastore version using the `spark.sql.hive.metastore.version` configuration key and validate it using a regular expression pattern. If the version is not valid, the function logs an informational message and returns None. This change modifies the Hive Metastore version validation logic and improves the overall reliability and maintainability of the library.
Contributors: @FastLee
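
The regex-based version check described for #3469 could look roughly like the sketch below. It is not the code from `federation.py`: the function name `hms_major_minor`, the pattern, and the plain-dict configuration are assumptions made for illustration. Only the configuration key and the "log at info level and return None" behaviour come from the release note.

```python
import logging
import re
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Configuration key named in the release note.
HMS_VERSION_KEY = "spark.sql.hive.metastore.version"

# Assumed pattern: "major.minor" with an optional ".patch" suffix.
_HMS_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?$")


def hms_major_minor(spark_conf: Mapping[str, str]) -> Optional[str]:
    """Return "major.minor" for a valid version string, otherwise None."""
    raw = spark_conf.get(HMS_VERSION_KEY)
    if raw is None:
        logger.info("Hive Metastore version is not configured")
        return None
    match = _HMS_VERSION.match(raw.strip())
    if match is None:
        # Described behaviour: log an informational message and return None.
        logger.info("Unsupported Hive Metastore version: %s", raw)
        return None
    return f"{match.group(1)}.{match.group(2)}"
```

A plain regular expression is enough here because the value only needs to be recognised as a dotted numeric version, which is what makes dropping the `packaging` dependency possible.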
v0.53.0
- Added dashboard crawlers (#3397). The open-source library has been updated with new dashboard crawlers for the assessment workflow, Redash migration, and QueryLinter. These crawlers are responsible for crawling and persisting dashboards, as well as migrating or reverting them during Redash migration. They also lint the queries of the crawled dashboards using QueryLinter. This change resolves issues #3366 and #3367, and progresses #2854. The `databricks labs ucx {migrate-dbsql-dashboards|revert-dbsql-dashboards}` command and the `assessment` workflow have been modified to incorporate these new features. Unit tests and integration tests have been added to ensure proper functionality of the new dashboard crawlers. Additionally, two new tables, `$inventory.redash_dashboards` and `$inventory.lakeview_dashboards`, have been introduced to hold a list of all Redash or Lakeview dashboards and are used by the `QueryLinter` and `Redash` migration. These changes improve the assessment, migration, and linting processes for dashboards in the library.
- DBFS Root Support for HMS Federation (#3425). The commit
`DBFS Root Support for HMS Federation` introduces changes to support the DBFS root location for HMS federation. A new method, `external_locations_with_root`, is added to the `ExternalLocations` class to return a list of external locations including the DBFS root location. This method is used in various functions and test cases, such as `test_create_uber_principal_no_storage`, `test_create_uc_role_multiple_raises_error`, `test_create_uc_no_roles`, `test_save_spn_permissions`, and `test_create_access_connectors_for_storage_accounts`, to ensure that the DBFS root location is correctly identified and tested in different scenarios. Additionally, the `external_locations.snapshot.return_value` is changed to `external_locations.external_locations_with_root.return_value` in test functions `test_create_federated_catalog` and `test_already_existing_connection` to retrieve a list of external locations including the DBFS root location. This commit closes issue #3406, which was related to this functionality. Overall, these changes improve the handling and testing of DBFS root location in HMS federation.
- Log message as error when legacy permissions API is enabled/disabled depending on the workflow ran (#3443). In this release, logging behavior has been updated in several methods in the `workflows.py` file. When the
`use_legacy_permission_migration` configuration is set to False and specific conditions are met, error messages are now logged instead of info messages for the methods `verify_metastore_attached`, `rename_workspace_local_groups`, `reflect_account_groups_on_workspace`, `apply_permissions_to_account_groups`, `apply_permissions`, and `validate_groups_permissions`. This change is intended to address issue #3388 and provides clearer guidance to users when the legacy permissions API is not functioning as expected. Users will now see an error message advising them to run the `migrate-groups` job or set `use_legacy_permission_migration` to True in the config.yml file. These updates will help ensure smoother workflow runs and more accurate logging for better troubleshooting.
- MySQL External HMS Support for HMS Federation (#3385). This commit adds support for MySQL-based Hive Metastore (HMS) in HMS Federation, enhances the CLI for creating a federated catalog, and improves external HMS functionality. It introduces a new parameter
`enable_hms_federation` in the `Locations` class constructor, allowing users to enable or disable MySQL-based HMS federation. The `external_locations` method in `application.py` now accepts `enable_hms_federation` as a parameter, enabling more granular control of the federation feature. Additionally, the CLI for creating a federated catalog has been updated to accept a `prompts` parameter, providing more flexibility. The commit also introduces a new dataclass `ExternalHmsInfo` for external HMS connection information and updates the `HiveMetastoreFederationEnabler` and `HiveMetastoreFederation` classes to support non-Glue external metastores. Furthermore, it adds methods to handle the creation of a Federated Catalog from the command-line interface, split JDBC URLs, and manage external connections and permissions.
- Skip listing built-in catalogs to update table migration process (#3464). In this release, the migration process for updating tables in the Hive Metastore has been optimized with the introduction of the
`TableMigrationStatusRefresher` class, which inherits from `CrawlerBase`. This new class includes modifications to the `_iter_schemas` method, which now filters out built-in catalogs and schemas when listing catalogs and schemas, thereby skipping unnecessary processing during the table migration process. Additionally, the `get_seen_tables` method has been updated to include checks for `schema.name` and `schema.catalog_name`, and the `_crawl` and `_try_fetch` methods have been modified to reflect changes in the `TableMigrationStatus` constructor. These changes aim to improve the efficiency and performance of the migration process by skipping built-in catalogs and schemas. The release also includes modifications to the existing `migrate-tables` workflow and adds unit tests that demonstrate the exclusion of built-in catalogs during the table migration status update process. The test case utilizes the `CatalogInfoSecurableKind` enumeration to specify the kind of catalog and verifies that the seen tables only include the non-builtin catalogs. These changes should prevent unnecessary processing of built-in catalogs and schemas during the table migration process, leading to improved efficiency and performance.
- Updated databricks-sdk requirement from <0.39,>=0.38 to >=0.39,<0.40 (#3434). In this release, the requirement for the
`databricks-sdk` package has been updated in the pyproject.toml file to be greater than or equal to 0.39 and less than 0.40, allowing for the use of the latest version of the package while preventing the use of version 0.40 and above. This change is based on the release notes and changelog for version 0.39 of the package, which includes bug fixes, internal changes, and API changes such as the addition of the `cleanrooms` package, delete() method for workspace-level services, and fields for various request and response objects. The commit history for the package is also provided. Dependabot has been configured to resolve any conflicts with this PR and can be manually triggered to perform various actions as needed. Additionally, Dependabot can be used to ignore specific dependency versions or close the PR.
- Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 (#3456). In this pull request, the version range of the
`databricks-sdk` dependency has been updated from '<0.40,>=0.39' to '>=0.39,<0.41', allowing the use of the latest version of the `databricks-sdk` while ensuring that it is less than 0.41. The pull request also includes release notes detailing the API changes in version 0.40.0, such as the addition of new fields to various compute, dashboard, job, and pipeline services. A changelog is provided, outlining the bug fixes, internal changes, new features, and improvements in versions 0.39.0, 0.40.0, and 0.38.0. A list of commits is also included, showing the development progress of these versions.
- Use LTS Databricks runtime version (#3459). This release introduces a change in the Databricks runtime version to a Long-Term Support (LTS) release to address issues encountered during the migration to external tables. The previous runtime version caused the
`convert to external table` migration strategy to fail, and this change serves as a temporary solution. The `migrate-tables` workflow has been modified, and existing integration tests have been reused to ensure functionality. The `test_job_cluster_policy` function now uses the LTS version instead of the latest version, ensuring a specified Spark version for the cluster policy. The function also checks for matching node type ID, Spark version, and necessary resources. However, users may still encounter problems with the latest UCX release. The `_convert_hms_table_to_external` method in the `table_migrate.py` file has been updated to return a boolean value, with a new TODO comment about a possible failure with Databricks runtime 16.0 due to a JDK update.
- Use
`CREATE_FOREIGN_CATALOG` instead of `CREATE_FOREIGN_SECURABLE` with HMS federation enablement commands (#3309). A change has been made to update the `databricks-sdk` dependency version from `>=0.38,<0.39` to `>=0.39` in the `pyproject.toml` file, which may affect the project's functionality related to the `databricks-sdk` library. In the Hive Metastore Federation codebase, `CREATE_FOREIGN_CATALOG` is now used instead of `CREATE_FOREIGN_SECURABLE` for HMS federation enablement commands, aligned with issue #3308. The `_add_missing_permissions_if_needed` method has been updated to check for `CREATE_FOREIGN_SECURABLE` instead of `CREATE_FOREIGN_CATALOG` when granting permissions. Additionally, a unit test file for HiveMetastore Federation has ...
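
To make the built-in catalog filtering from #3464 concrete, here is a minimal, self-contained sketch. The `Catalog`, `Schema`, and `CatalogKind` types are hypothetical stand-ins for the SDK objects (the release note only names the `CatalogInfoSecurableKind` enumeration), and the `BUILTIN_SCHEMAS` set is an assumption, so treat this as an illustration of the idea rather than the refresher's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator


class CatalogKind(Enum):
    """Hypothetical stand-in for the SDK's catalog securable-kind enum."""

    STANDARD = "standard"
    BUILTIN = "builtin"


@dataclass(frozen=True)
class Catalog:
    name: str
    kind: CatalogKind = CatalogKind.STANDARD


@dataclass(frozen=True)
class Schema:
    catalog_name: str
    name: str


# Assumption: schema names that are treated as built-in and skipped.
BUILTIN_SCHEMAS = frozenset({"information_schema"})


def iter_user_schemas(
    catalogs: Iterable[Catalog], schemas: Iterable[Schema]
) -> Iterator[Schema]:
    """Yield schemas, skipping built-in catalogs and built-in schemas."""
    user_catalogs = {c.name for c in catalogs if c.kind is not CatalogKind.BUILTIN}
    for schema in schemas:
        if schema.catalog_name not in user_catalogs:
            continue  # schema belongs to a built-in (or unknown) catalog
        if schema.name in BUILTIN_SCHEMAS:
            continue  # built-in schema inside a user catalog
        yield schema
```

Filtering before any table listing happens is what saves work: tables in skipped catalogs and schemas are never requested at all.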
v0.52.0
- Added handling for Databricks errors during workspace listings in the table migration status refresher (#3378). In this release, we have implemented changes to enhance error handling and improve the stability of the table migration status refresher in the open-source library. We have resolved issue #3262, which concerned Databricks errors during workspace listings. The `assessment` workflow has been updated, and new unit tests have been added to ensure proper error handling. The changes include the import of `DatabricksError` from the `databricks.sdk.errors` module and the addition of a new method `_iter_catalogs` to list catalogs with error handling for `DatabricksError`. The `_iter_schemas` method now replaces `_ws.catalogs.list()` with `self._iter_catalogs()`, also including error handling for `DatabricksError`. Furthermore, new unit tests have been developed to check the logging of the `TableMigration` class when listing tables in the Databricks workspace, focusing on handling errors during catalog, schema, and table listings. These changes improve the library's robustness and ensure that it can gracefully handle errors during the table migration status refresher process.
- Convert READ_METADATA to UC BROWSE permission for tables, views and database (#3403). The
`uc_grant_sql` method in the `grants.py` file has been modified to convert `READ_METADATA` permissions to `BROWSE` permissions for tables, views, and databases. This change involves adding new entries to the dictionary used to map permission types to their corresponding UC actions and has been manually tested. The behavior of the `grant_loader` function in the `hive_metastore` module has also been modified to change the action type of a grant from `READ_METADATA` to `EXECUTE` for a specific case. Additionally, the `test_grants.py` unit test file has been updated to include a new test case that verifies the conversion of `READ_METADATA` to `BROWSE` for a grant on a database and handles the conversion of `READ_METADATA` permission to `UC BROWSE` for a new `udf="function"` parameter. These changes resolve issue #2023 and have been tested through manual testing and unit tests. No new methods have been added, and existing functionality has been changed in a limited scope. No new unit or integration tests have been added as it is assumed that the existing tests will continue to pass after these changes have been made.
- Migrates Pipelines crawled during the assessment phase (#2778). A new utility class,
`PipelinesMigrator`, has been introduced in this release to facilitate the migration of Delta Live Tables (DLT) pipelines. This class is used in a new workflow that tests pipeline migration, which involves cloning DLT pipelines in the assessment phase with specific configurations to a new Unity Catalog (UC) pipeline. The migration can be skipped for certain pipelines by specifying their pipeline IDs in a list. Three test scenarios, each with different pipeline specifications, are defined to ensure the proper functioning of the migration process under various conditions. The class and the migration process are thoroughly tested with manual testing, unit tests, and integration tests, with no reliance on a staging environment. The migration process takes into account the `WorkspaceClient`, `WorkspaceContext`, `AccountClient`, and a flag for running the command as a collection. The `PipelinesMigrator` class uses a `PipelinesCrawler` and `JobsCrawler` to perform the migration and ensures better functionality for the users with additional parameters. The commit also introduces a new command, `migrate_dlt_pipelines`, to the CLI of the ucx package, which helps migrate DLT pipelines. The migration process is tested using a mock installation, unit tests, and integration tests. The tests cover the scenario where the installation has two jobs, `test` and `assessment`, with job IDs `123` and `456` respectively. The state of the installation is recorded in a `state.json` file. A configuration file `pipeline_mapping.csv` is used to map the source pipeline ID to the target catalog, schema, pipeline, and workspace names.
- Removed
`try-except` around verifying the migration progress prerequisites in the `migrate-tables` cli command (#3439). In the latest release, the `ucx` package's `migrate-tables` CLI command has undergone a significant modification in the handling of progress tracking prerequisites. The previous try-except block surrounding the verification has been removed, and the RuntimeWarning is now propagated, providing a more specific and helpful error message. If the prerequisites are not met, the `verify` method will raise an exception, and the migration will not proceed. This change enhances the accuracy of error messages for users and ensures that the prerequisites for migration are properly met. The tests for `migrate_tables` have been updated accordingly, including a new test case `test_migrate_tables_errors_out_before_assessment` that checks that the migration does not proceed when the verification fails. This change affects the existing `databricks labs ucx migrate-tables` command and brings improved precision and reliability to the migration process.
- Removed redundant internal methods from create_account_group (#3395). In this change, the
`create_account_group` function's internal methods have been removed, and its signature has been modified to retrieve the workspace ID from `accountworkspace._workspaces()` instead of passing it as a parameter. This resolves issue #3170 and improves code efficiency by removing unnecessary parameters and methods. The `AccountWorkspaces` class now accepts a list of workspace IDs upon instantiation, enhancing code readability and eliminating redundancy. The function has been tested with unit tests, ensuring it creates a group if it doesn't exist, throws an exception if a group already exists, filters system groups, and handles cases where a group already has the required number of members in a workspace. These changes simplify the codebase, eliminate redundancy, and improve the maintainability of the project.
- Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.34 (#3407). In this release, we have widened the allowed sqlglot version range from >=25.5.0,<25.33 to >=25.5.0,<25.34. This update allows us to utilize the latest version of sqlglot, which includes various bug fixes and new features. In v25.33.0, there were two breaking changes: the TIMESTAMP data type now maps to Type.TIMESTAMPTZ, and the NEXT keyword is now treated as a function keyword. Several new features were also introduced, including support for generated columns in PostgreSQL and the ability to preserve tables in the replace_table method. Additionally, there were several bug fixes, including fixes for issues related to BigQuery, Presto, and Spark. The v25.32.1 release contained two bug fixes related to BigQuery and one bug fix related to Presto. Furthermore, v25.32.0 had three breaking changes: support for ATTACH/DETACH statements, tokenization of hints as comments, and a fix to datetime coercion in the canonicalize rule. This release also introduced new features, such as support for TO_TIMESTAMP* variants in Snowflake and improved error messages in the Redshift transpiler. Lastly, there were several bug fixes, including fixes for issues related to SQL Server, MySQL, and PostgreSQL.
- Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.35 (#3413). In this release, the `sqlglot` dependency has been updated from a version range that allows up to `25.33`, but excludes `25.34`, to a version range that allows `25.5.0` and above, but excludes `25.35`. This update was made to enable the latest version of `sqlglot`, which includes one breaking change related to the alias expansion of USING STRUCT fields. This version also introduces two new features, an optimization for alias expansion of USING STRUCT fields, and support for generated columns in PostgreSQL. Additionally, two bug fixes were implemented, addressing proper consumption of dashed table parts and removal of parentheses from CURRENT_USER in Presto. The update also includes a fix to make TIMESTAMP map to Type.TIMESTAMPTZ, a fix to parse DEFAULT in VALUES clause into a Var, and changes to the BigQuery and Snowflake dialects to improve transpilation and JSONPathTokenizer leniency. The commit message includes a reference to issue [#3413](https://github.com/databrickslabs/ucx/issues/3413) and a link to the `sqlglot` changelog for further reference.
- Updated sqlglot requirement from <25.35,>=25.5.0 to >=25.5.0,<26.1 (#3433). In this release, we have updated the required version of the
`sqlglot` library to a range that includes version 25.5.0 but excludes version 26.1. This change is crucial due to the breaking changes introduced in `sqlglot` v26.0.0 that are not yet compatible with our project. The commit message includes the changelog for `sqlglot` v26.0.0, which highlights the breaking changes, new features, bug fixes, and other modifications in this version. Additionally, the commit includes a list of commits merged into the `sqlglot` repository for a comprehensive understanding of the changes. Approving this change is recommended to maintain compatibility with `sqlglot`. However, thorough testing is advised to ensure the updated version does n...
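
The error handling added to catalog listing in #3378 follows a common generator pattern, sketched below under stated assumptions. `ListingError` is a hypothetical stand-in for `DatabricksError`, and `iter_catalogs` takes the listing call as a parameter so the example runs without the SDK; neither is the actual `_iter_catalogs` method.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


class ListingError(Exception):
    """Hypothetical stand-in for the SDK error named in the release note."""


def iter_catalogs(list_catalogs: Callable[[], Iterable[str]]) -> Iterator[str]:
    """Yield catalog names; log an error and stop if the listing fails."""
    try:
        # The try block wraps the iteration itself, because a lazy listing
        # may raise only once its results are actually consumed.
        yield from list_catalogs()
    except ListingError as e:
        logger.error("Cannot list catalogs: %s", e)
```

Items yielded before the failure are still delivered to the caller, so a partially failing listing degrades to a shorter result instead of aborting the whole refresh.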
v0.51.0
- Added `assign-owner-group` command (#3111). The UCX tool now includes a new `assign-owner-group` command, allowing users to assign an owner group to the workspace. This group will be designated as the owner for all migrated tables and views, providing better control and organization of resources. The command can be executed in the context of a specific workspace or across multiple workspaces. The implementation includes new classes, methods, and attributes in various files, such as `cli.py`, `config.py`, and `groups.py`, enhancing ownership management functionality. The `assign-owner-group` command replaces the functionality of issue #3075 and addresses issue #2890, ensuring proper schema ownership and handling of crawled grants. Developers should be aware that running the `migrate-tables` workflow will result in assigning a new owner group for the Hive Metastore instance in the workspace installation.
- Added
`opencensus` to known list (#3052). In this release, we have added OpenCensus to the list of known libraries in our configuration file. OpenCensus is a popular set of tools for distributed tracing and monitoring, and its inclusion in our system will enhance support and integration for users who utilize this tool. This change does not affect existing functionality, but instead adds a new entry in the configuration file for OpenCensus. This enhancement will allow our library to better recognize and work with OpenCensus, enabling improved performance and functionality for our users.
- Added default owner group selection to the installer (#3370). A new class, AccountGroupLookup, has been added to the AccountGroupLookup module to select the default owner group during the installer process, addressing previous issue #3111. This class uses the workspace_client to determine the owner group, and a pick_owner_group method to prompt the user for a selection if necessary. The ownership selection process has been improved with the addition of a check in the installer's `_static_owner` method to determine if the current user is part of the default owner group. The GroupManager class has been updated to use the new AccountGroupLookup class and its methods, `pick_owner_group` and `validate_owner_group`. A new variable, `default_owner_group`, is introduced in the ConfigureGroups class to configure groups during installation based on user input. The installer now includes a unit test, "test_configure_with_default_owner_group", to demonstrate how it sets expected workspace configuration values when a default owner group is specified during installation.
- Added handling for non UTF-8 encoded notebook error explicitly (#3376). A new enhancement has been implemented to address the issue of non-UTF-8 encoded notebooks failing to load by introducing explicit error handling for this case. A UnicodeDecodeError exception is now caught and logged as a warning, while the notebook is skipped and returned as None. This change is implemented in the load_dependency method in the loaders.py file, which is a part of the assessment workflow. Additionally, a new unit test has been added to verify the behavior of this change, and the assessment workflow has been updated accordingly. The new test function in test_loaders.py checks for different types of exceptions, specifically PermissionError and UnicodeDecodeError, ensuring that the system can handle notebooks with non-UTF-8 encoding gracefully. This enhancement resolves issue #3374, thereby improving the overall robustness of the application.
- Added migration progress documentation (#3333). In this release, we have updated the `migration-progress-experimental` workflow to track the migration progress of a subset of inventory tables related to workspace resources being migrated to Unity Catalog (UC). The workflow updates the inventory tables and tracks the migration progress in the UCX catalog tables. To use this workflow, users must attach a UC metastore to the workspace, create a UCX catalog, and ensure that the assessment job has run successfully. The `Migration Progress` section in the documentation has been updated with a new markdown file that provides details about the migration progress, including a migration progress dashboard and an experimental migration progress workflow that generates historical records of inventory objects relevant to the migration progress. These records are stored in the UCX UC catalog, which contains a historical table with information about the object type, object ID, data, failures, owner, and UCX version. The migration process also tracks dangling Hive or workspace objects that are not referenced by business resources, and the progress is persisted in the UCX UC catalog, allowing for cross-workspace tracking of migration progress.
- Added note about running assessment once (#3398). In this release, we have introduced an update to the UCX assessment workflow, which will now only be executed once and will not update existing results in repeated runs. To accommodate this change, we have updated the README file with a note clarifying that the assessment workflow is a one-time process. Additionally, we have provided instructions on how to update the inventory and findings by uninstalling and reinstalling UCX. This will ensure that the inventory and findings for a workspace are up-to-date and accurate. We recommend that software engineers take note of this change and follow the updated instructions when using the UCX assessment workflow.
- Allowing skipping TACLs migration during table migration (#3384). A new optional flag, "skip_tacl_migration", has been added to the configuration file, providing users with more flexibility during migration. This flag allows users to control whether or not to skip the table access control list (TACL) migration during table migrations. It can be set when creating catalogs and schemas, as well as when migrating tables or using the `migrate_grants` method in `application.py`. Additionally, the `install.py` file now includes a new variable, `skip_tacl_migration`, which can be set to `True` during the installation process to skip TACL migration. New test cases have been added to verify the functionality of skipping TACL migration during grants management and table migration. These changes enhance the flexibility of the system for users managing table migrations and TACL operations in their infrastructure, addressing issues #3384 and #3042.
- Bump
`databricks-sdk` and `databricks-labs-lsql` dependencies (#3332). In this update, the `databricks-sdk` and `databricks-labs-lsql` dependencies are upgraded to versions 0.38 and 0.14.0, respectively. The `databricks-sdk` update addresses conflicts, bug fixes, and introduces new API additions and changes, notably impacting methods like `create()`, `execute_message_query()`, and others in workspace-level services. While `databricks-labs-lsql` updates ensure compatibility, its changelog and specific commits are not provided. This pull request also includes ignore conditions for the `databricks-sdk` dependency to prevent future Dependabot requests. It is strongly advised to rigorously test these updates to avoid any compatibility issues or breaking changes with the existing codebase. This pull request mirrors another (#3329), resolving integration CI issues that prevented the original from merging.
- Explain failures when cluster encounters Py4J error (#3318). In this release, we have made significant improvements to the error handling mechanism in our open-source library. Specifically, we have addressed issue #3318, which involved handling failures when the cluster encounters Py4J errors in the
`databricks/labs/ucx/hive_metastore/tables.py` file. We have added code to raise noisy failures instead of swallowing the error with a warning when a Py4J error occurs. The functions `_all_databases()` and `_list_tables()` have been updated to check if the error message contains "py4j.security.Py4JSecurityException", and if so, log an error message with instructions to update or reinstall UCX. If the error message does not contain "py4j.security.Py4JSecurityException", the functions log a warning message and return an empty list. These changes also resolve the linked issue #3271. The functionality has been thoroughly tested and verified on the labs environment. These improvements provide more informative error messages and enhance the overall reliability of our library.
- Rearranged job summary dashboard columns and make job_name clickable (#3311). In this update, the job summary dashboard columns have been improved and the need for the
`30_3_job_details.sql` file, which contained a SQL query for selecting job details from the `inventory.jobs` table, has been eliminated. The dashboard columns have been rearranged, and the `job_name` column is now clickable, providing easy access to job details via the corresponding job ID. The changes include modifying the...
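
The explicit handling of non-UTF-8 notebooks described for #3376 can be illustrated with a short sketch. The function name `load_source` and the use of a local `Path` are assumptions for the sake of a runnable example; the real `load_dependency` method in `loaders.py` works against workspace objects. What it borrows from the release note is the behaviour: `PermissionError` and `UnicodeDecodeError` are caught, a warning is logged, and None is returned.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def load_source(path: Path) -> Optional[str]:
    """Return the file's text, or None if it cannot be read as UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except PermissionError:
        logger.warning("Permission error while reading: %s", path)
        return None
    except UnicodeDecodeError:
        # Skip the file instead of failing the whole run.
        logger.warning("Cannot decode non-UTF-8 encoded file: %s", path)
        return None
```

Returning None lets the caller skip the one unreadable notebook and continue with the rest of the assessment.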
v0.50.0
- Added `pytesseract` to known list (#3235). A new addition has been made to the `known.json` file, which tracks packages with native code, to include `pytesseract`, an Optical Character Recognition (OCR) tool for Python. This change improves the handling of `pytesseract` within the codebase and addresses part of issue #1931, likely concerning the seamless incorporation of `pytesseract` and its native components. However, specific details on the usage of `pytesseract` within the project are not provided in the diff. Thus, further context or documentation may be necessary for a complete understanding of the integration. Nonetheless, this commit simplifies and clarifies the codebase's treatment of `pytesseract` and its native dependencies, making it easier to work with.
- Added hyperlink to database names in database summary dashboard (#3310). The recent change to the
`Database Summary` dashboard includes the addition of clickable database names, opening a new tab with the corresponding database page. This has been accomplished by adding a `linkUrlTemplate` property to the `database` field in the `encodings` object within the `overrides` property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue #3258. Furthermore, the display of various other statistics, such as the number of tables, views, and grants, has been improved by converting them to links, enhancing the overall usability and navigation of the dashboard.
- Bump codecov/codecov-action from 4 to 5 (#3316). In this release, the version of the
`codecov/codecov-action` dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including `binary`, `gcov_args`, `gcov_executable`, `gcov_ignore`, `gcov_include`, `report_type`, `skip_validation`, and `swift_project`. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking.
- Depend on a Databricks SDK release compatible with 0.31.0 (#3273). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new
`InvalidState` error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the `pyproject.toml` file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
- Eliminate redundant migration-index refresh and loads during view migration (#3223). In this pull request, we have optimized the view migration process in the
`databricks/labs/ucx/hive_metastore/table_metastore.py` file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class `TableMigrationIndex` and imported the `TableMigrationStatusRefresher` class. The `_migrate_views` method now takes an additional argument `migration_index`, which is used in the `ViewsMigrationSequencer` and in the `_migrate_view` method. The `_view_can_be_migrated` and `_sql_migrate_view` methods now also take `migration_index` as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly.
- Fixed backwards compatibility breakage from Databricks SDK (#3324). In this release, we have addressed a backwards compatibility issue (Issue #3324) that was caused by an update to the Databricks SDK. This was done by adding new methods to the
databricks.sdk.service module to interact with dashboards. Additionally, we have fixed bug #3322 and updated the create function in the conftest.py file to use the new dashboards module and its Dashboard class. The function now returns the dashboard object as a dictionary and calls the publish method on this object to publish the dashboard. These changes also include an update to the pyproject.toml file, which affects the test and coverage scripts used in the default environment. The minimum required test coverage has been lowered from 90% to 89%: the test command now includes the --cov-fail-under=89 flag, so that the run fails if coverage drops below that threshold.
- Fixed issue with cleanup of failed
create-missing-principals command (#3243). In this update, we have improved the create_uc_roles method within the access.py file of the databricks/labs/ucx/aws directory to handle failures during role creation caused by permission issues. The code that creates a role and adds a policy to it is now wrapped in a try-except block. If a PermissionDenied or NotFound exception is raised, the method logs an error message, deletes any roles it has already created, and re-raises the exception, restoring the system to its initial state and preventing the accumulation of partially created roles. We have also added unit tests that cover the scenario where a failure occurs and the roles are successfully deleted. These changes make the databricks labs ucx create-missing-principals command more robust against permission errors.
- Improve error handling for
assess_workflows task (#3255). This pull request improves error handling and logging in the assess_workflows task in the databricks/labs/ucx module. The _temporary_copy method now handles DatabricksError exceptions and re-raises them as InvalidPath exceptions. Additionally, the log level for recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors has been changed from error to warning, which better reflects their severity and avoids unnecessary alarm when these issues occur.
- Require at least 4 cores for UCX VMs (#3229). In this release, the selection of
node_type_id in the policy.py file has been updated to require a minimum of 4 cores for UCX VMs, in addition to requiring local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the node_type_id parameter, so that only virtual machines (VMs) with at least 4 cores can be used for UCX.
- Skip
test_feature_tables
integration test (#3326). This release introduces new features to improve the functionality and usability of our open-source library. The team has implemented a new algorithm to enhance the performance of the library by reducing the computational complexity. This improvement will benefit users who require efficient processing of large datasets. Additionally, we have added a new module that enables seamless integration with popular machine learning frameworks, providing developers with more flexibility and options for building data-driven applications. These enhancements resolve issues #3304 and #3, addressing the community's requests for improved performance and integration capabilities. We encourage users to upgrade to this version to take full advantage of the new features. - Speed up
update_migration_status
jobs by eliminating lots of redundant SQL queries (#3200). In this relea...
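The cleanup behaviour described above for create-missing-principals (#3243) can be sketched as follows. This is a minimal illustration and not the UCX implementation: the function name create_roles_with_cleanup, the injected callables, and the local PermissionDenied and NotFound classes are hypothetical stand-ins for the real AWS and SDK calls.

```python
import logging

logger = logging.getLogger(__name__)


class PermissionDenied(Exception):
    """Hypothetical stand-in for the SDK's permission error."""


class NotFound(Exception):
    """Hypothetical stand-in for the SDK's not-found error."""


def create_roles_with_cleanup(role_names, create_role, attach_policy, delete_role):
    """Create roles one by one; on failure, delete what was created and re-raise."""
    created = []
    try:
        for name in role_names:
            create_role(name)
            created.append(name)  # track the role as soon as it exists
            attach_policy(name)
    except (PermissionDenied, NotFound):
        logger.error("Role creation failed, removing %d created role(s)", len(created))
        # Roll back in reverse order so the system returns to its initial state.
        for name in reversed(created):
            delete_role(name)
        raise
    return created
```

Passing the create, attach, and delete operations as callables keeps the sketch testable without cloud access; a role is recorded as created before its policy is attached, so a failure while attaching the policy still triggers its deletion.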
v0.49.0
- Added
MigrationSequencer for jobs (#3008). In this commit, a MigrationSequencer class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of MigrationStep objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests, which exercise the sequencing feature with tasks that reference existing or non-existing clusters or job clusters, and with new cluster definitions. This change is linked to issue #1415 and supersedes issue #2980. Additionally, the commit removes some unnecessary imports and fixtures from a test file.
- Added
phik to known list (#3198). In this release, we have added phik to the known list in the provided JSON file. This change addresses part of issue #1931. The phik key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. No existing functionality has been altered and no new methods have been introduced; the change is confined to adding the phik key to the known list.
- Added
pmdarima to known list (#3199). In this release, we have added pmdarima, an open-source Python library for time series analysis that is particularly useful for fitting ARIMA models and testing for seasonality, to our known list of libraries. This change partly resolves issue #1931.
- Added
preshed to known list (#3220). The preshed library has been added to the known list. Written in Cython, preshed provides hash tables that assume their keys are pre-hashed. The entry includes two modules, preshed and preshed.about, and partially resolves issue #1931.
- Added
py-cpuinfo to known list (#3221). In this release, we have added the py-cpuinfo library to the known list, covering the cpuinfo module that it provides. py-cpuinfo exposes detailed information about the CPU, such as the number of cores, current frequency, and vendor. This change partially resolves issue #1931 and does not affect any existing functionality or add new methods to the codebase.
- Cater for empty python cells (#3212). In this release, we have resolved an issue where empty or comment-only notebook cells caused crashes in the dependency builder. An empty tree is now stored in the _python_trees dictionary if the input cell does not produce a valid tree. If a cell does not produce a tree, the _load_children_from_tree method is not executed for that cell, so no child trees are loaded. We have also added a test to verify the fix on a repository that previously failed.
- Create
TODO issues every nightly run (#3196). A commit has been made to update the acceptance repository version in the acceptance.yml GitHub workflow from acceptance/v0.4.0 to acceptance/v0.4.2, which affects the integration tests. The Run nightly tests step in the GitHub repository's workflow has also been updated to use a newer version of the databrickslabs/sandbox/acceptance action, from v0.3.1 to v0.4.2. Software engineers should verify that the new version of the acceptance repository contains all necessary updates and fixes, and that the integration and nightly tests continue to run successfully.
- Fixed Integration test failure of migration_tables (#3108). This release includes a fix for two integration tests (test_migrate_managed_table_to_external_table_without_conversion and test_migrate_managed_table_to_external_table_with_clone) related to Hive Metastore table migration, addressing issues #3054 and #3055. These tests were previously skipped due to underlying problems; the @pytest.mark.skip markers have now been removed from the test functions, so they run again and extend the test coverage of the Hive Metastore migration feature. No changes have been made to existing functionality. In addition, this release updates the DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types.
- Replace MockInstallation with MockPathLookup for testing fixtures (#3215). In this release, we have updated the testing fixtures in our unit tests by replacing the MockInstallation class with MockPathLookup. Specifically, the _load_sources function now uses MockPathLookup instead of MockInstallation for loading sources. The change also introduces a new logger, named logger, for logging within the module. Additionally, the _load_sources calls in the test_notebook.py file now pass the file path directly instead of a SourceContainer object, which makes testing file-related functionality more straightforward and fixes issue #3115.
- Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 (#3224). The sqlglot requirement now allows version 25.29.0, which incorporates several breaking changes, new features, and bug fixes. The breaking changes include transpiling ANY to EXISTS, supporting the MEDIAN() function, wrapping values in NOT value IS ..., and parsing information schema views into a single identifier. New features include support for the JSONB_EXISTS function in PostgreSQL, transpiling ANY to EXISTS in Spark, transpiling Snowflake's TIMESTAMP() function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a Move edge case in the semantic differ, adding a NULL filter on ARRAY_AGG only for columns, improving parsing of WITH FILL ... INTERPOLATE in Clickhouse, generating LOG(...) for exp.Ln in TSQL, and optionally parsing a Stream expression. The full changelog can be found in the pull request, which also lists the commits included in this release.
- Use acceptance/v0.4.0 (#3192). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the databrickslabs/sandbox/acceptance runner to acceptance/v0.4.0 and granting write permissions for the issues field in the permissions section. These updates allow the use of the latest version of the acceptance tests and provide the necessary permissions to interact with issues. A TODO
comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly. - Warn about errors instead to a...
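The dependency-graph sequencing described for MigrationSequencer (#3008) above can be illustrated with a topological sort. The sketch below is an assumption-laden illustration, not the UCX class: it uses the standard library's graphlib, a hypothetical sequence_steps function, and a simplified MigrationStep whose fields follow the list given in the release note.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class MigrationStep:
    """Simplified, hypothetical step record (fields follow the release note)."""
    step_id: str
    step_number: int
    object_type: str
    object_name: str
    object_owner: str
    required_step_ids: tuple


def sequence_steps(nodes, dependencies):
    """Order nodes so that every object comes after the objects it depends on.

    nodes: maps an ID to a (type, name, owner) tuple.
    dependencies: maps an ID to the set of IDs that must be migrated first.
    Every ID mentioned in dependencies is assumed to be present in nodes.
    """
    graph = {node_id: dependencies.get(node_id, set()) for node_id in nodes}
    steps = []
    # static_order() yields predecessors before the nodes that require them.
    for number, node_id in enumerate(TopologicalSorter(graph).static_order(), start=1):
        object_type, name, owner = nodes[node_id]
        required = tuple(sorted(dependencies.get(node_id, ())))
        steps.append(MigrationStep(node_id, number, object_type, name, owner, required))
    return steps
```

For example, a job that depends on a task, which in turn depends on a cluster, is sequenced as cluster, task, job. graphlib raises CycleError on circular dependencies; how the real sequencer treats cycles is not stated in the release note.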
v0.48.0
- Added
--dry-run option for ACL migrate (#3017). In this release, we have added a --dry-run option to the migrate-acls command in the labs.yml file, enabling a preview of the migration process without executing it. This feature also introduces the hms-fed flag, allowing migration of HMS-FED ACLs while migrating tables. The ACLMigrator class in the application.py file has been updated to include new parameters, sql_backend and inventory_database, to perform a dry-run migration of Access Control Lists (ACLs). Additionally, a new retrieve method has been added to the ACLMigrator class to retrieve a list of grants based on the source and destination objects, and a CrawlerBase class has been introduced for fetching grants. We have also introduced a new inferred_grants table in the deployment schema to store inferred grants during the migration process.
- Added
WorkspacePathOwnership to determine transitive owners for files and notebooks (#3047). In this release, we introduce a new class WorkspacePathOwnership in the owners.py module to determine the transitive owners for files and notebooks within a workspace. This class is a subclass of Ownership and takes AdministratorLocator and WorkspaceClient as inputs. It infers the owner from the first CAN_MANAGE permission level in the access control list. We also added a new property workspace_path_ownership to the existing HiveMetastoreContext class, which returns a WorkspacePathOwnership object initialized with an AdministratorLocator object and a workspace_client. The functionality is demonstrated through new tests added to test_owners.py. The new tests, test_notebook_owner and test_file_owner, create a notebook and a workspace file and verify the owner of each using the owner_of method. The AdministratorLocator is used to locate the administrators group for the workspace and the PermissionLevel class is used to specify the permission level for the notebook permissions.
- Added
mosaicml-streaming to known list (#3029). In this release, we have expanded the range of recognized packages by adding several new libraries to the known list in the JSON file. The additions include mosaicml-streaming, oci, pynacl, pyopenssl, python-snapy, and zstd. Notably, mosaicml-streaming has two new entries, simulation and streaming, while the other packages have a single entry each. This update addresses issue #1931.
- Added msal-extensions to known list (#3030). In this release, we have added two new packages, msal-extensions and portalocker, to the known list. The msal-extensions package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. The portalocker package offers functionality for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities.
- Added
multimethod to known list (#3031). In this release, we have added the multimethod package to the known list in the known.json
file, which partially resolves issue #193 - Added
murmurhash to known list (#3032). The murmurhash package has been added to the known list, addressing part of issue #1931. The entry includes two modules, murmurhash and murmurhash.about: murmurhash offers the core hashing functionality, while murmurhash.about contains metadata about the package.
- Added ninja to known list (#3050). In this release, we have added Ninja, a fast, lightweight build system, to the known list in the known.json file. This change partially resolves issue #1931. It does not modify any existing functionality or introduce new methods.
- Added nvidia-ml-py to known list (#3051). In this release, we have added the nvidia-ml-py package to the known list. This addition consists of two components: example and pynvml. example is likely a placeholder or sample usage of the package, while pynvml is a module that enables interaction with the NVIDIA Management Library (NVML) through Python. This change partially resolves issue #1931.
- Added dashboard for tracking migration progress (#3016). This change introduces a new dashboard, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method,
_create_dashboard, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests; the change was also tested manually on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard.
- Added grant progress encoder (#3079). A new GrantsProgressEncoder class has been introduced in the progress/grants.py file to encode Grant objects into History objects for the migration-progress workflow. Where a Grant object fails to map to Unity Catalog, the encoder adds a list of failures to the History object. The commit also modifies the migration-progress workflow to use the new GrantsProgressEncoder class and adds unit tests. This change addresses issue #3058. The GrantsProgressEncoder class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend.
- Added table progress encoder (#3083). In this release, we've added a table progress encoder to the WorkflowTask context to enhance the tracking of table-related operations in the migration-progress workflow. This new encoder, implemented in the TableProgressEncoder class, is connected to the sql_backend, table_ownership, and migration_status_refresher objects. The GrantsProgressEncoder class has been refactored to GrantProgressEncoder, with additional parameters for improved encoding of grants. We've also introduced the refresh_table_migration_status task to scan and record the migration status of tables and views in the inventory, storing results in the $inventory.migration_status inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change addresses issues #3061 and #3064.
- Combine static code analysis results with historical job snapshots (#3074). In this release, we have added a new method, JobsProgressEncoder, to the WorkflowTask class in the databricks.labs.ucx.contexts
module. This method is used to track the progress of jobs in the context of a workflow task, re...
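The owner inference described for WorkspacePathOwnership (#3047) above, taking the first CAN_MANAGE entry in the access control list, can be sketched as below. The AclEntry class and infer_owner function are hypothetical simplifications, and the fallback to a workspace administrator when no CAN_MANAGE entry exists is an assumption, not something the release note states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """Hypothetical, simplified access-control entry."""
    principal: str
    permission_level: str


def infer_owner(acl, fallback_admin):
    """Return the principal of the first CAN_MANAGE entry in the ACL.

    Falls back to fallback_admin when nobody has CAN_MANAGE (assumed behaviour).
    """
    for entry in acl:
        if entry.permission_level == "CAN_MANAGE":
            return entry.principal
    return fallback_admin
```

With an ACL of a CAN_READ entry for one user followed by CAN_MANAGE entries for two others, the first CAN_MANAGE principal in list order is returned.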
v0.47.0
- Added
mdit-py-plugins
to known list (#3013). In this release, the open-source library has been updated with several new features to enhance its functionality and usability for software engineers. Firstly, a new module has been introduced to support multi-threading, allowing for more efficient processing of large datasets. Additionally, a new configuration system has been implemented, providing users with greater flexibility in customizing the library's behavior to their specific needs. Furthermore, the library now includes a set of diagnostic tools to help developers identify and troubleshoot issues more effectively. These new features are expected to significantly improve the performance and productivity of the library, making it an even more powerful tool for software development projects. - Added
memray to known list (#3014). In this release, we have added two libraries to the known list. memray allows memory profiling and analysis, and textual, together with its related modules, is a TUI (Text User Interface) library that provides a wide variety of user interface components. These additions partially resolve issue #1931.
- Added mlflow-skinny to known list (#3015). This release adds mlflow-skinny to the known packages list in a JSON file. mlflow-skinny is a lightweight version of the widely used machine learning platform MLflow. This commit partially addresses issue #1931.
- Added handling for installing libraries multiple times in
PipResolver (#3024). In this commit, the PipResolver class has been updated to handle the installation of libraries multiple times, resolving issues #3022 and #3023. The _resolve_libraries method has been modified to resolve pip installs as libraries or paths based on whether they are found in the path lookup and whether they are already installed in the temporary virtual environment. The _install_pip method has also been updated to include the --upgrade flag to upgrade libraries if they are already installed. Code linting has been improved, and integration tests have been added to the test_libraries.py file. These tests install the pytest library twice in a Databricks notebook and then import it to verify its installation.
- Fixed errors related to unsupported cell languages (#3026). In this release, we have improved the
_Collector abstract base class by adding support for multiple cell languages in the _collect_from_source method. Previously, the implementation only supported Python and SQL, but with this update, we have added handling for R, Scala, Shell, Markdown, Run, and Pip cells. The new methods handle source code collection for their respective languages and return an empty iterable or log a warning if a language is not supported yet. This commit resolves issue #2977 and adds new methods to the DfsaCollectorWalker class, allowing it to collect information from cells of any language. The test case test_collector_supports_all_cell_languages has also been added to ensure that the collector supports all cell languages. The change was manually tested, includes new unit tests, and is co-authored by Eric Vergnaud.
- Preemptively fix unknown errors of Python AST parsing coming from
astroid and ast libraries (#3027). Python AST parsing and error handling have been made more robust. The maybe_parse function now catches all types of exceptions using a broad exception clause, where it previously caught only AstroidSyntaxError and SystemError. The _definitely_failure function now includes the type of exception in the error message for better visibility and troubleshooting. In the test cases, the graph_builder_parse_error test has been updated to check for a system-error code instead of syntax-error. Additionally, a parses_python_cell_with_magic_commands test has been added to ensure that any Python cell with magic commands is correctly parsed.
- Updated migration progress workflow to also re-lint dashboards and jobs (#3025). In this release, we have updated the table utilization documentation to include the ability to lint directFS paths and queries, and modified the
migration-progress-experimental workflow to re-run linting tasks for dashboard queries and notebooks associated with jobs. Additionally, we have updated the MigrationProgress workflow to scan dashboards and jobs for migration issues, assessing SQL code in embedded dashboard widgets and taking inventory of and linting jobs. To support these changes, we have added unit tests and updated existing integration tests in test_workflows.py. The new test function, test_linter_runtime_refresh, tests the linter refresh behavior for dashboard and workflow tasks. These updates aim to keep linting results consistent and accurate in the migration-progress-experimental workflow.
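The parsing change described for maybe_parse (#3027) above, a broad exception clause that reports the exception type instead of letting the error propagate, can be sketched as follows. This is an illustration only: it uses the standard library's ast module as a stand-in for astroid, and the ParseResult class is a hypothetical return type, not the UCX one.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseResult:
    """Hypothetical result holder: exactly one of tree or failure is set."""
    tree: Optional[ast.AST] = None
    failure: Optional[str] = None


def maybe_parse(source: str) -> ParseResult:
    """Parse Python source and return a failure description instead of raising."""
    try:
        return ParseResult(tree=ast.parse(source))
    except Exception as e:  # deliberately broad: any parser error becomes a failure
        # Include the exception type so the message is easier to troubleshoot.
        return ParseResult(failure=f"{type(e).__name__}: {e}")
```

Callers then check result.failure rather than wrapping every call in try-except, so a single unparsable cell is reported and the rest of the job continues.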
Contributors: @pritishpai, @JCZuurmond, @asnare, @ericvergnaud, @nfx
v0.46.0
- Added
lazy_loader to known list (#2991). With this commit, the lazy_loader module has been added to the known list in the configuration file, addressing a portion of issue #193. The commit solely updates the known.json file to include lazy_loader with an empty list, so that the module is recognized by the system. This change does not affect any existing functionality or introduce new methods.
- Added
librosa to known list (#2992). In this update, we have added several open-source libraries to the known list in the configuration file, including librosa, llvmlite, msgpack, pooch, soundfile, and soxr. librosa is a Python library for audio and music analysis, while llvmlite is a lightweight Python interface to the LLVM compiler infrastructure. msgpack is a binary serialization format like JSON, pooch is a package for managing external data files, soundfile is a library for reading and writing audio files, and soxr is a library for high-quality audio resampling. Each library has an empty list next to it for specifying additional configuration related to the library. This update partially resolves issue #1931.
- Added
linkify-it-py to known list (#2993). In this release, we have added two open-source packages, linkify-it-py and uc-micro-py, together with their constituent modules, to the known list. These changes are related to the resolution of issue #1931.
- Added lz4 to known list (#2994). In this release, we have added lz4, the Python bindings for the LZ4 lossless data compression algorithm, which is known for its focus on compression and decompression speed, to the known list. The entry covers four modules: lz4, lz4.block, lz4.frame, and lz4.version. This change partially addresses issue #1931.
- Fixed
SystemError: AST constructor recursion depth mismatch failing the entire job (#3000). This PR introduces more deterministic, Go-style error handling for parsing Python code, addressing issues that caused the entire job to fail due to a SystemError: AST constructor recursion depth mismatch (#3000) and bug #2976. It includes removing the AstroidSyntaxError import, adding an import for SqlglotError, and updating the SqlParseError exception to SqlglotError in the lint method of the SqlLinter class. Additionally, the abstract classes TablePyCollector and DfsaPyCollector and their respective methods for collecting tables and direct file system accesses have been removed. The PythonSequentialLinter class, which previously handled multiple responsibilities, has also been removed, improving modularity, maintainability, and testability. The changes affect the base.py, python_ast.py, and python_sequential_linter.py modules.
- Skip applying permissions for workspace system groups to Unity Catalog resources (#2997). This commit introduces changes to the ACL-related code in the
databricks labs ucx create-catalogs-schemas command and the migrate-table-* workflow, skipping the application of permissions for workspace system groups in Unity Catalog. These system groups, which include 'admins', do not exist at the account level. To ensure the correctness of these modifications, unit and integration tests have been added, including a test that checks the proper handling of user privileges in system groups during catalog schema creation. The AccessControlResponse object has been updated for the admins and users groups, granting them specific permissions for a workspace and warehouse object, respectively.
Contributors: @pritishpai, @asnare, @JCZuurmond, @nfx
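The Go-style error handling described in the #3000 entry above can be illustrated with a minimal sketch. It is not the ucx implementation: the names ParseResult, parse_python, and lint_all are hypothetical, and the standard-library ast module stands in for the parser the project actually uses. The idea is that the parse step returns either a tree or a failure message instead of raising, so one unparsable file does not abort the whole run.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical value-or-error pair, in the spirit of Go's (value, err)."""

    tree: ast.AST | None = None
    failure: str | None = None


def parse_python(source: str) -> ParseResult:
    """Parse source code, returning a failure message instead of raising."""
    try:
        return ParseResult(tree=ast.parse(source))
    except (SyntaxError, ValueError, RecursionError, SystemError) as e:
        # SystemError is included because the entry above mentions
        # "AST constructor recursion depth mismatch" surfacing that way.
        return ParseResult(failure=f"{type(e).__name__}: {e}")


def lint_all(sources: dict[str, str]) -> dict[str, str]:
    """Process every source; a parse failure is recorded, not propagated."""
    report: dict[str, str] = {}
    for name, source in sources.items():
        result = parse_python(source)
        report[name] = "ok" if result.failure is None else result.failure
    return report


if __name__ == "__main__":
    print(lint_all({"good.py": "x = 1\n", "bad.py": "def f(:\n"}))
```

Callers check the failure field explicitly, which keeps the control flow for "could not parse" visible at each call site rather than hidden in exception handlers further up the stack.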
v0.45.0
- Added DBFS Root resolution when HMS Federation is enabled (#2947). This commit introduces a DBFS resolver for use with HMS (Hive Metastore) federation, enabling accurate resolution of DBFS root locations when HMS federation is enabled. A new _resolve_dbfs_root() class method is added to the MountsCrawler class, and a boolean argument enable_hms_federation is included in the MountsCrawler constructor. The commit also adds a test function, test_resolve_dbfs_root_in_hms_federation, to validate the resolution of DBFS roots with HMS federation. The test covers special cases, such as the /user/hive/metastore path, and utilizes LocationTrie for more accurate location guessing.
- Added
jax-jumpy to known list (#2959). In this release, we have added the jax-jumpy package to the list of known packages. jax-jumpy is a Python-based numerical computation library, which includes modules such as jumpy, jumpy._base_fns, jumpy.core, jumpy.lax, jumpy.numpy, jumpy.numpy._factory_fns, jumpy.numpy._transform_data, jumpy.numpy._types, jumpy.numpy.linalg, jumpy.ops, and jumpy.random. These modules are now recognized, which partially resolves issue #1931.
- Added
joblibspark to known list (#2960). In this release, we have added the joblibspark library by updating the known.json file, which keeps track of known libraries and their associated components. This change is part of the resolution for issue #1931 and adds the entries doc, doc.conf, joblibspark, joblibspark.backend, and joblibspark.utils, so that these components are recognized.
- Added
jsonpatch to known list (#2969). In this release, we have added jsonpatch to the list of known libraries in the known.json file. jsonpatch is a library for applying JSON patches, which allow for partial updates to a JSON document. This change partially addresses issue #1931.
- Added
langchain-community to known list (#2970). A new entry for langchain-community has been added to the known list in this release. This entry includes several sub-components such as 'langchain_community.agents', 'langchain_community.callbacks', 'langchain_community.chat_loaders', 'langchain_community.chat_message_histories', 'langchain_community.chat_models', 'langchain_community.cross_encoders', 'langchain_community.docstore', 'langchain_community.document_compressors', 'langchain_community.document_loaders', 'langchain_community.document_transformers', 'langchain_community.embeddings', 'langchain_community.example_selectors', 'langchain_community.graph_vectorstores', 'langchain_community.graphs', 'langchain_community.indexes', 'langchain_community.llms', 'langchain_community.memory', 'langchain_community.output_parsers', 'langchain_community.query_constructors', 'langchain_community.retrievers', 'langchain_community.storage', 'langchain_community.tools', 'langchain_community.utilities', and 'langchain_community.utils'. Currently, these sub-components are empty and have no additional configuration. This change partially resolves issue #1931.
- Added
langcodes to known list (#2971). The langcodes library has been added to the known list, addressing part of issue #1931. This library includes several modules that provide functionality related to language codes and their manipulation, including langcodes, langcodes.build_data, langcodes.data_dicts, langcodes.language_distance, langcodes.language_lists, langcodes.registry_parser, langcodes.tag_parser, and langcodes.util. Additionally, the memory-efficient trie (prefix tree) data structure library, marisa-trie, has been included in the known list. No existing functionality has been altered in this commit.
- Addressing Ownership Conflict when creating catalog/schemas (#2956). This release introduces new functionality to handle ownership conflicts during catalog/schema creation in our open-source library. The
_apply_from_legacy_table_acls method has been enhanced with two loops to address non-own grants and own grants separately. This ensures proper handling of ownership conflicts by generating and executing UC grant SQL for each grant type, with appropriate exceptions. Additionally, a new helper function, this_type_and_key(), has been added to improve code readability. The release also introduces new methods, GrantsCrawler and Rule, in the hive_metastore package of the labs.ucx module, responsible for populating views and mapping source and destination objects. The test_catalog_schema.py file has been updated to include tests for creating catalogs and schemas with legacy ACLs, utilizing the new Rule method and GrantsCrawler. Issue #2932 has been addressed with these changes.
- Clarify
skip and unskip commands work on views (#2962). In this release, the skip and unskip commands in the databricks labs UCX tool have been updated to clarify that they work on views, which is made explicit by the addition of the --view flag. These commands allow users to skip or unskip certain schemas, tables, or views during the table migration process, which is useful for temporarily disabling migration of a particular schema, table, or view. Unit tests have been added to ensure the correct behavior of these commands when working with views. Two new methods test the behavior of the unskip command when a schema or table is specified, and two additional methods test the behavior of the unskip command when a view or no schema is specified. Finally, two methods test that an error message is logged when both the --table and --view flags are specified.
- Fixed issue with migrating MANAGED hive_metastore table to UC (#2928). This commit addresses an issue with migrating Hive Metastore (HMS) MANAGED tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table can result in data loss. To prevent this, a new option
CONVERT_TO_EXTERNAL has been added to the migrate_tables method for migrating managed tables to UC as external. It ensures that the HMS managed table is converted to an external table in both HMS and UC, protecting against data loss when deleting a managed table that has been migrated to UC as external. Additionally, new caching properties have been added for better performance, and existing methods have been modified to handle the migration of managed tables to UC as external. Unit and integration tests have been added to ensure the proper functioning of these changes. Note that changing MANAGED tables to EXTERNAL can have consequences for regulatory data cleanup, and the impact of this change should be carefully validated for existing workloads.
- Let
create-catalogs-schemas reuse MigrateGrants so that it applies group renaming (#2955). The create-catalogs-schemas command in the databricks labs ucx package has been enhanced to reuse the MigrateGrants class, enabling group renaming and eliminating redundant code. The migrate-tables workflow remains functionally the same. Changes include modifying the CatalogSchema class to accept a migrate_grants argument, introducing new Catalog and Schema dataclasses, and updating various methods in the hive_metastore module. Unit and integration tests have been added and manually verified to ensure proper functionality. The MigrateGrants class has been updated to accept two SecurableObject arguments and sort matched grants. The from_src_dst function in mapping.py now includes a new as_uc_table method and updates to as_uc_table_key. Addressing issues #2934, #2932, and #2955, the changes also include a new key property for the tables.py file, and updates to the test_create_catalogs_schemas and test_migrate_tables test functions.
- Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](...