Releases: databrickslabs/ucx
v0.27.0
- Added `mlflow` to known packages (#1895). The `mlflow` package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of `mlflow` in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to `sparkContext`, `_conf`, and `RDD` APIs. Additionally, the artifact storage system of `mlflow` in Databricks and DBFS has undergone changes. The `known.json` file has also been updated with several new packages, such as `alembic`, `aniso8601`, `cloudpickle`, `docker`, `entrypoints`, `flask`, `graphene`, `graphql-core`, `graphql-relay`, `gunicorn`, `html5lib`, `isort`, `jinja2`, `markdown`, `markupsafe`, `mccabe`, `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-semantic-conventions`, `packaging`, `pyarrow`, `pyasn1`, `pygments`, `pyrsistent`, `python-dateutil`, `pytz`, `pyyaml`, `regex`, `requests`, and more. These packages are now acknowledged and incorporated into the project's functionality.
- Added `tensorflow` to known packages (#1897). In this release, we are excited to announce the addition of the `tensorflow` package to our known packages list. Tensorflow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as `tensorflow`, `tensorboard`, `tensorboard-data-server`, and `tensorflow-io-gcs-filesystem`, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as `gast`, `grpcio`, `h5py`, `keras`, `libclang`, `mdurl`, `namex`, `opt-einsum`, `optree`, `pygments`, `rich`, `rsa`, `termcolor`, `pyasn1_modules`, `sympy`, and `threadpoolctl`. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
- Added `torch` to known packages (#1896). In this release, the `known.json` file has been updated to include several new packages and their respective modules for a specific project or environment. These packages include `torch`, `functorch`, `mpmath`, `networkx`, `sympy`, and `isympy`. The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the `_analyze_dist_info` method in the `known.py` file has been improved to handle recursion errors during package analysis. A try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a `RecursionError` occurs. This enhancement increases the robustness of the package analysis process.
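A minimal sketch of the guarded analysis loop described in the entry above, assuming a per-file `_analyze_file` helper; the names and signatures are illustrative and need not match the actual code in `known.py`.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _analyze_file(path: Path) -> None:
    ...  # placeholder for the real per-file analysis


def _analyze_dist_info_folder(dist_info_folder: Path) -> None:
    """Analyze every file in a dist-info folder, skipping files that exceed the recursion limit."""
    for file in dist_info_folder.glob("**/*"):
        try:
            _analyze_file(file)
        except RecursionError:
            # Deeply nested sources can blow the parser's recursion limit;
            # log the error and move on to the next file instead of failing the whole package.
            logger.error(f"Recursion error analyzing {file}, skipping")
            continue
```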
- Added more known libraries (#1894). In this release, the `known` library has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
- Added more value inference for `dbutils.notebook.run(...)` (#1860). In this release, the `dbutils.notebook.run(...)` functionality in `graph.py` has been significantly updated to enhance value inference. The change includes the introduction of new methods for handling `NotebookRunCall` and `SysPathChange` objects, as well as the refactoring of the `get_notebook_path` method into `get_notebook_paths`. This new method now returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, `_get_notebook_paths`, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the `load_dependency` method in `loaders.py` has been updated to detect the language of a notebook based on the file path, in addition to its content. The `Notebook` class now includes a new parameter, `SUPPORTED_EXTENSION_LANGUAGES`, which maps file extensions to their corresponding languages. In the `databricks.labs.ucx` project, more value inference has been added to the linter, including new methods and enhanced functionality for `dbutils.notebook.run(...)`. Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the `NotebookLoader` class in the `databricks.labs.ucx.source_code.notebooks.loaders` module has been added, with a new class, `NotebookLoaderForTesting`, that overrides the `detect_language` method to make it a class method. This allows for more robust testing of the `NotebookLoader` class. Overall, these changes improve the accuracy and reliability of value inference for `dbutils.notebook.run(...)` and enhance the testing and usability of the related classes and methods.
- Added nightly workflow to use industry solution accelerators for parser validation (#1883). A nightly workflow has been added to validate the parser using industry solution accelerators, which can be triggered locally with the `make solacc` command. This workflow involves a new Makefile target, 'solacc', which runs a Python script located at 'tests/integration/source_code/solacc.py'. The workflow is designed to run on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs on a daily basis at 7am using a cron schedule, and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
- Complete support for pip install command (#1853). In this release, we've made significant enhancements to support the `pip install` command in our open-source library. The `register_library` method in the `DependencyResolver`, `NotebookResolver`, and `LocalFileResolver` classes has been modified to accept variable numbers of libraries instead of just one, allowing for more efficient dependency management. Additionally, the `resolve_import` method has been introduced in the `NotebookResolver` and `LocalFileResolver` classes for improved import resolution. Moreover, the `_split` static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full `pip install` command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
- Infer simple f-string values when computing values during linting (#1876). This commit enhances the open-source library by adding support for inferring simple f-string values during linting, addressing issue #1871 and progressing #1205. The new functionality works for simple f-strings but currently does not support nested f-strings. It introduces the InferredValue class and updates the visit_call, visit_const, and _check_str_constant methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to error location in code. The commit also presents an example of simple f-string handling, emphasizing the limitations yet providing a solid foundation for future development. Co-authored by Eric Vergnaud.
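As a generic illustration of the distinction the entry above draws (not code from the repository), the first f-string below is simple enough for value inference during linting, while the nested form is still out of scope.

```python
# Simple f-string: the resulting value can be inferred statically during linting.
schema = "inventory"
table = "grants"
query = f"SELECT * FROM {schema}.{table}"      # inferable as "SELECT * FROM inventory.grants"

# Nested f-string: inference of this form is not yet supported.
catalog = "hive_metastore"
nested = f"SELECT * FROM {f'{catalog}.{schema}'}.{table}"

print(query)
print(nested)
```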
- Propagate widget parameters and data security mode to `CurrentSessionState` (#1872). In this release, the `spark_version_compatibility` function in `crawlers.py` has been refactored to `runtime_version_tuple`, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the `CurrentSessionState` class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attribu...
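A minimal sketch of the version parsing described in the entry above, assuming a Databricks `spark_version` string such as `14.3.x-scala2.12`; the real `runtime_version_tuple` also handles custom runtimes and DLT, so treat this as illustrative.

```python
def runtime_version_tuple(spark_version: str) -> tuple[int, int]:
    """Parse '14.3.x-scala2.12' into (14, 3), raising ValueError when components are not integers."""
    first_part = spark_version.split("-")[0]        # '14.3.x'
    major, minor = first_part.split(".")[:2]        # '14', '3'
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"cannot parse spark version: {spark_version}")
    return int(major), int(minor)


assert runtime_version_tuple("14.3.x-scala2.12") == (14, 3)
```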
v0.26.0
- Added migration for Python linters from `ast` (standard library) to `astroid` package (#1835). In this release, the Python linters have been migrated from the `ast` package in the standard library to the `astroid` package, version 3.2.2 or higher, with minimal inference implementation. This change includes updates to the `pyproject.toml` file to include `astroid` as a dependency and bump the version of `pylint`. No changes have been made to user documentation, CLI commands, workflows, or tables. Testing has been conducted through the addition of unit tests. This update aims to improve the functionality and accuracy of the Python linters.
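For context, the snippet below contrasts the standard-library `ast` parser with `astroid`, whose node-level inference is what the migrated linters build on. This is a generic illustration of the two libraries, not code from the UCX linters.

```python
import ast
import astroid

source = "prefix = 'dbfs:/mnt'\npath = prefix + '/data'"

# Standard-library ast: a syntax tree only, with no notion of what `path` evaluates to.
print(type(ast.parse(source).body[1].value).__name__)   # BinOp

# astroid: the same structure, plus best-effort inference of values.
module = astroid.parse(source)
binop = module.body[1].value                             # the `prefix + '/data'` expression
for inferred in binop.infer():
    print(getattr(inferred, "value", inferred))          # 'dbfs:/mnt/data' when inference succeeds
```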
- Added workflow linter for delta live tables task (#1825). In this release, there are updates to the `_register_pipeline_task` method in the `jobs.py` file. The method now checks for the existence of the pipeline and its libraries, and registers each notebook or jar library found in the pipeline as a task. If the library is a Maven or file type, it will raise a `DependencyProblem`, as that is not yet implemented. Additionally, new functions and tests have been added to improve the quality and functionality of the project, including a workflow linter for Delta Live Tables (DLT) tasks and a linter that checks for issues with specified DLT tasks. A new method, `test_workflow_linter_dlt_pipeline_task`, has been added to test the workflow linter for DLT tasks, verifying the correct creation and functioning of the pipeline task and checking the building of the dependency graph for the task. These changes enhance the project's ability to ensure the proper configuration and correctness of DLT tasks and prevent potential issues.
- Consistent 0-based line tracking for linters (#1855). 0-based line tracking has been consistently implemented for linters in various files and methods throughout the project, addressing issue #1855. This change includes removing direct filesystem references in favor of using the Unity Catalog for table migration and format changes. It also updates comments and warnings to improve clarity and consistency. In particular, the spark-table.py file has been updated to ensure that the spark.log.level is set correctly for UC Shared Clusters, and that the Spark Driver JVM is no longer accessed directly. The new file, simple_notebook.py, demonstrates the consistent line tracking for linters across different cell types, such as Python, Markdown, SQL, Scala, Shell, Pip, and Python (with magic commands). These changes aim to improve the accuracy and reliability of linters, making the codebase more maintainable and adaptable.
Dependency updates:
- Updated sqlglot requirement from <24.2,>=23.9 to >=23.9,<25.1 (#1856).
Contributors: @ericvergnaud, @JCZuurmond, @FastLee, @pritishpai, @dependabot[bot], @asnare
v0.25.0
- Added handling for legacy ACL `DENY` permission in group migration (#1815). In this release, the handling of `DENY` permissions during group migrations in our legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of `DENIED` in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue #1803. A new test function, test_hive_deny_sql(), has also been added to test the behavior of the `DENY` permission.
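A minimal sketch of the prefix handling described in the entry above, under the assumption that legacy deny grants arrive with a `DENIED_`-style action type; the real `_apply_grant_sql` builds complete Hive GRANT/DENY statements and differs in detail.

```python
def _grant_or_deny_sql(action_type: str, object_key: str, principal: str) -> str:
    """Translate a legacy ACL action type into a GRANT or DENY statement (illustrative only)."""
    if action_type.startswith("DENIED_"):
        # Legacy deny grants carry a DENIED prefix: strip it and emit DENY instead of GRANT,
        # enclosing the action in backticks to avoid syntax errors.
        action = action_type.removeprefix("DENIED_")
        return f"DENY `{action}` ON TABLE {object_key} TO `{principal}`"
    return f"GRANT `{action_type}` ON TABLE {object_key} TO `{principal}`"


print(_grant_or_deny_sql("DENIED_SELECT", "hive_metastore.db.tbl", "analysts"))
```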
- Added handling for parsing corrupted log files (#1817). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code has been added to check if the logs match the expected format, and if they don't, a warning message is logged and the function returns, preventing further processing and potential production of incorrect results. The changes include a new method, `test_parse_logs_warns_for_corrupted_log_file`, that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files.
- Added known problems with `pyspark` package (#1813). In this release, updates have been made to the `src/databricks/labs/ucx/source_code/known.json` file to document known issues with the `pyspark` package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new `KnownProblem` dataclass has been added to the `known.py` file, which includes methods for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method has also been updated to use a `known_problems` set of `KnownProblem` objects, improving readability and management of known problems within the application. These changes address issue #1813 and improve the documentation of known issues with `pyspark`.
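The following is a minimal sketch of a problem-tracking dataclass with dictionary encoding, along the lines described above; the field names are assumptions for illustration and need not match the actual `KnownProblem` definition in `known.py`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KnownProblem:
    """A known incompatibility detected in a package module (illustrative fields)."""

    code: str       # e.g. "jvm-access-in-shared-clusters"
    message: str    # human-readable advice

    def as_dict(self) -> dict[str, str]:
        # Frozen (hashable) so instances can live in a set; the dict form is used for JSON encoding.
        return asdict(self)


known_problems = {KnownProblem("legacy-context-in-shared-clusters", "sc is not supported on UC Shared Clusters")}
print([problem.as_dict() for problem in known_problems])
```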
- Added library linting for jobs launched on shared clusters (#1689). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue #1637. A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the `test_jobs.py` file in the `tests/integration/source_code` directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by ensuring that jobs run smoothly on shared clusters by checking for and handling missing libraries. Software engineers will benefit from these improvements, as they will reduce the occurrence of errors due to missing libraries on shared clusters.
- Added linters to check for spark logging and configuration access (#1808). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected.
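For orientation, the short pyspark script below shows the kinds of driver-bound patterns these linters target: legacy-context logging, configuration access, and RDD APIs. It is a generic illustration that requires a local Spark installation to run, and the exact advice produced by `SparkConnectLinter` may be worded differently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Patterns flagged on UC Shared Clusters:
sc.setLogLevel("INFO")                               # Spark logging via the legacy context
app_name = sc.getConf().get("spark.app.name")        # configuration access through the SparkContext
doubled = sc.parallelize([1, 2, 3]).mapPartitions(lambda part: (x * 2 for x in part)).collect()  # RDD APIs
print(app_name, doubled)
```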
- Added list of known dependency compatibilities and regeneration infrastructure for it (#1747). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library.
- Added more known libraries from Databricks Runtime (#1812). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios.
- Added more known packages from Databricks Runtime (#1814). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility.
- Added support for `.egg` Python libraries in jobs (#1789). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue #1643. It includes the addition of a new method, `PythonLibraryResolver`, which replaces the old `PipResolver`, and is used to register egg library dependencies in the `DependencyGraph`. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the 'test_dependencies.py' file, specifically in the import section where `PipResolver` is replaced with `PythonLibraryResolver` from the 'databricks.labs.ucx.source_code.python_libraries' package. These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files.
- Added table migration workflow guide (#1607). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience.
- Added workflow linter for spark python tasks (#1810). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. The `test_job_spark_python_task_linter_happy_path` t...
v0.24.0
- Added `%pip` cell resolver (#1697). A newly developed pip resolver has been integrated into the ImportResolver for future use, addressing issue #1642 and following up on #1694. The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the build_dependency_graph method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
- Added downloads of `requirements.txt` dependency locally to register it to the dependency graph (#1753). This commit introduces support for linting job tasks that require a 'requirements.txt' file for specifying dependencies. It resolves issue #1644 and is similar to #1704. The changes include the addition of a new CLI command, modification of the existing 'databricks labs ucx ...' command, and modification of the `experimental-workflow-linter` workflow. The `lint_job` method has been updated to handle dependencies specified in a 'requirements.txt' file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the 'jobs.py' file to register libraries specified in a 'requirements.txt' file to the dependency graph. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements, such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements files and constraints files.
- Added ability to install UCX on workspaces without Public Internet connectivity (#1566). A new flag, `upload_dependencies`, has been added to the WorkspaceConfig to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that is set to False by default and can be set by the user through the installation prompt. This feature resolves issue #573 and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the version of `databricks-labs-blueprint` from `<0.7.0` to `>=0.6.0`, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the `upload_dependencies` flag is set to True.
- Added initial interface for data comparison framework (#1695). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new `StandardDataComparator` class has been implemented for comparing the data of two tables, and a `StandardSchemaComparator` class tests the comparison of table schemas. The framework also includes the `DatabricksTableMetadataRetriever` class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as `StandardDataProfiler` for profiling data, `SchemaComparator` and `DataComparator` for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
- Added lint local code command (#1710). A new `lint local code` command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The `lint-local-code` command is implemented in the `application.py` file, with supporting methods and classes added to the `workspace_cli.py` and `databricks.labs.ucx.source_code` packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
- Added table in mount migration (#1225). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with include_paths_in_mount not being present in workflows.py, and adding the ability to set default ownership on each created table. A new method, ScanTablesInMounts, has been added to scan tables in mounts, and a TableMigration class creates tables in the Unity Catalog based on the table mapping. Two new methods, Rule and TableMapping, have been added to manage mappings of tables, and TableToMigrate is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the workflows.py file and the addition of several new methods, including Rule, TableMapping, TableToMigrate, create_autospec, and MockBackend.
- Added workflows to trigger table reconciliations (#1721). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's `$inventory_database.reconciliation_results` view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the WorkspaceConfig class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
- Always refresh HMS stats when getting table size (#1713). A change has been implemented in the hive_metastore library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This has been achieved by calling the ANALYZE TABLE command with the COMPUTE STATISTICS NOSCAN option before computing the table size, thus preventing the use of stale stats. Specifically, the "backend.queries" list has been updated to include two ANALYZE statements for tables "db1.table1" and "db1.table2", ensuring that their statistics are updated and accurate. The test case `test_table_size_crawler` in the "test_table_size.py" file has been revised to validate the presence of the two ANALYZE statements in the "backend.queries" list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality.
- Automatically retrieve `aws_account_id` from aws profile instead of prompting (#1715). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of `aws_account_id` by automatically retrieving it from the AWS profile. An optional `kms-key` flag has been documented for creating roles, providing more flexibility. The `create-missing-principals` command now accepts optional parameters such as KMS Key, Role Name, Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue #1714. Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing `aws_cli_run_command`, ensuring automated retrieval of `aws_account_id`. A test has also been added to raise an error when AWS CLI is not found in the system path.
- Detect dependencies of libraries installed via pip (#1703). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues #1642 and [#1202](https://github.com/databrickslabs/u...
v0.23.1
- Improved error handling for `migrate-tables` workflows (#1674). This commit enhances the error handling for `migrate-tables` workflows by introducing new tests that cover specific scenarios where failures may occur during table migration. The changes include the creation of mock objects and injecting failures for the `get_tables_to_migrate` method of the `TableMapping` class. Three new tests have been added, each testing a specific scenario, including token errors when checking table properties, errors when trying to get properties for a non-existing table, and errors when trying to unset the `upgraded_to` property. The commit also asserts that specific error messages are logged during these failures. These improvements ensure better visibility and debugging capabilities during table migration. The code was manually tested, and unit tests were added and verified on a staging environment, ensuring that the new error handling mechanisms function as intended.
- Improved error handling for all queries executed during table migration (#1679). This release includes improved error handling during table migration in our data workflow, resolving issue #167
- Removed dependency on internal `pathlib` implementations (#1672). In this release, we have introduced a new custom `_DatabricksFlavor` class as a replacement for the internal `pathlib._Flavor` implementations, specifically designed for the Databricks environment. This class handles various functionalities, including separation of paths, joining and parsing of parts, and casefolding of strings, among others. The `make_uri` method has also been updated to generate the correct URI for the workspace. This change removes the dependency on internal `pathlib._Flavor` implementations, which were not available on Windows. As part of this change, the `test_wspath.py` file in the `tests/integration/mixins` directory has been updated, with the `test_exists` and `test_mkdirs` methods being modified to reflect the removal of `_Flavor`. These updates improve the compatibility and reliability of the codebase on Windows systems.
- Updated databricks-labs-blueprint requirement from ~=0.4.3 to >=0.4.3,<0.6.0 (#1670). In this update, we have adjusted the requirement for `databricks-labs-blueprint` from version `~=0.4.3` to `>=0.4.3,<0.6.0`, ensuring the latest version can be installed while remaining below `0.6.0`. This change is part of issue #1670 and includes the release notes and changelog in the commit message, highlighting improvements and updates in version `0.5.0`. These enhancements consist of content assertion in `MockInstallation`, better handling of partial functions in `parallel.Threads`, and adjusted configurations aligned with the UCX project. The commit also covers various dependency updates and bug fixes, providing a more robust and efficient library experience for software engineers.
Dependency updates:
- Updated databricks-labs-blueprint requirement from ~=0.4.3 to >=0.4.3,<0.6.0 (#1670).
Contributors: @nkvuong, @dependabot[bot], @nfx, @JCZuurmond
v0.23.0
- Added DBSQL queries & dashboard migration (#1532). The Databricks Labs Unified Command Extensions (UCX) project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
- Added UDFs assessment (#1610). A User Defined Function (UDF) assessment feature has been introduced, addressing issue #1610. A new method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter, 'comment', initially left blank in the test function. Additionally, two new columns, `success` and 'failures', have been added to the udf table in the inventory database to store assessment data for UDFs. The UdfsCrawler class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the $inventory.udfs table, with a widget displaying this information as a counter visualization named "Total UDF Count".
- Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS (#1495). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Unity Catalog (UC) roles in AWS for S3 locations that lack a UC compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests have been implemented to test the functionality of `create_missing_principals` for AWS and Azure, as well as testing the behavior when the command is not approved.
- Added baseline for workflow linter (#1613). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`. Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
- Added linter to check for RDD use and JVM access (#1606). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments.
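A minimal sketch of the kind of AST utility described above: building the dotted name of an attribute access so a matcher can recognise patterns such as `spark._jvm` or RDD usage. The function name and the use of the standard-library `ast` module are illustrative assumptions.

```python
import ast


def attribute_dotted_name(node: ast.Attribute) -> str:
    """Build a 'spark._jvm.org.apache...' style name from nested ast.Attribute nodes."""
    parts = [node.attr]
    value = node.value
    while isinstance(value, ast.Attribute):
        parts.append(value.attr)
        value = value.value
    if isinstance(value, ast.Name):
        parts.append(value.id)
    return ".".join(reversed(parts))


call = ast.parse("spark._jvm.org.apache.log4j.LogManager.getLogger('x')").body[0].value
print(attribute_dotted_name(call.func))   # spark._jvm.org.apache.log4j.LogManager.getLogger
```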
- Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow (#1621). The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers migrating non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
- Added "seen tables" feature (#1465). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and `_get_table_location` have been included to facilitate these improvements. In the testing realm, a new test, `test_mount_listing_seen_tables`, has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore.
- Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command (#1660). This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature is software engineers who adopt the project.
- Added support for migrating external location permissions from interactive cluster mounts (#1487). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues #1192 and #1193, ensuring a more robust and controlled user experience with interactive clusters.
- Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN (#1631). In this release, we've implement...
v0.22.0
- A notebook linter to detect DBFS references within notebook cells (#1393). A new linter has been implemented in the open-source library to identify references to Databricks File System (DBFS) mount points or folders within SQL and Python cells of Notebooks, raising Advisory or Deprecated alerts when detected. This feature, resolving issue #1108, enhances code maintainability by discouraging DBFS usage, and improves security by avoiding hard-coded DBFS paths. The linter's functionality includes parsing the code and searching for Table elements within statements, raising warnings when DBFS references are found. Implementation changes include updates to the `NotebookLinter` class, a new `from_source` class method, and an `original_offset` argument in the `Cell` class. The linter now also supports the `databricks` dialect for SQL code parsing. This feature improves the library's security and maintainability by ensuring better data management and avoiding hard-coded DBFS paths.
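As an illustration, the function below contains the kind of DBFS references such a linter flags in Python and SQL cells; the paths and table names are invented, the function is defined but not executed, and the exact advisory wording comes from the linter itself.

```python
from pyspark.sql import SparkSession


def load_landing_zone():
    """Notebook-style code with DBFS references the linter warns about (placeholder paths)."""
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("dbfs:/mnt/landing/sales")                          # dbfs:/ mount reference
    spark.sql("CREATE TABLE sales_raw LOCATION 'dbfs:/mnt/landing/sales'")      # DBFS location in SQL
    with open("/dbfs/mnt/config/settings.json") as fh:                          # posix-style /dbfs path
        return df, fh.read()
```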
- Added CLI commands to trigger table migration workflow (#1511). A new `migrate_tables` command has been added to the 'databricks.labs.ucx.cli' module, which triggers the `migrate-tables` workflow and, optionally, the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate-tables` workflow is responsible for managing table migrations, while the `migrate-external-hiveserde-tables-in-place-experimental` workflow handles migrations for external hiveserde tables. The new `What` class from the 'databricks.labs.ucx.hive_metastore.tables' module is used to identify hiveserde tables. If hiveserde tables are detected, the user is prompted to confirm running the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate_tables` command requires WorkspaceClient and Prompts objects and accepts an optional WorkspaceContext object, which is set to the WorkspaceContext of the WorkspaceClient if not provided. Additionally, a new `migrate_external_hiveserde_tables_in_place` command has been added, which will run the `migrate-external-hiveserde-tables-in-place-experimental` workflow if it finds any hiveserde tables, making it easier to manage table migrations from the command line.
- Added CSV, JSON and include path in mounts (#1329). In this release, the TablesInMounts function has been enhanced to support CSV and JSON file formats, along with the existing Parquet and Delta table formats. The new `include_paths_in_mount` parameter has been introduced, enabling users to specify a list of paths to crawl within all mounts. The WorkspaceConfig class in the config.py file has been updated to accommodate these changes. Additionally, a new `_assess_path` method has been introduced to assess the format of a given file and return a `TableInMount` object accordingly. Several existing methods, such as `_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and `_path_is_delta`, have been updated to reflect these improvements. Furthermore, two new unit tests, `test_mount_include_paths` and `test_mount_listing_csv_json`, have been added to ensure the proper functioning of the TablesInMounts function with the new file formats and the `include_paths_in_mount` parameter. These changes aim to improve the functionality and flexibility of the TablesInMounts library, allowing for more precise crawling and identification of tables based on specific file formats and paths.
- Added CTAS migration workflow for external tables that cannot be migrated in place (#1510). In this release, we have added a new CTAS (Create Table As Select) migration workflow for external tables that cannot be migrated in-place. This feature includes a `MigrateExternalTablesCTAS` class with three tasks to migrate non-SYNC supported and non-HiveSerde external tables, migrate HiveSerde tables, and migrate views from the Hive Metastore to the Unity Catalog. We have also added new methods for managed and external table migration, deprecated old methods, and added a new test function to ensure proper CTAS migration for external tables using HiveSerDe. This change also introduces a new JSON file for external table configurations and a mock backend to simulate the Hive Metastore and test the migration process. Overall, these changes improve the migration capabilities for external tables and ensure a more flexible and reliable migration process.
- Added Python linter for table creation with implicit format (#1435). A new linter has been added to the Python library to advise on implicit table formats when the 'writeTo', 'table', 'insertInto', or `saveAsTable` methods are invoked without an explicit format specified in the same chain of calls. This feature is useful for software engineers working with Databricks Runtime (DBR) v8.0 and later, where the default table format changed from `parquet` to 'delta'. The linter, implemented in 'table_creation.py', utilizes reusable AST utilities from 'python_ast_util.py' and is not automated, providing advice instead of fixing the code. The linter skips linting when a DBR version of 8.0 or higher is passed, as the default format change only applies to versions prior to 8.0. Unit tests have been added for both files as part of the code migration workflow.
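The snippet below illustrates the pattern this linter targets: a `saveAsTable` call with no explicit format in the chain versus one that states the format. Table names are placeholders and the function is defined but not executed.

```python
from pyspark.sql import DataFrame


def persist(df: DataFrame) -> None:
    # Implicit format: on DBR 8.0+ this writes Delta rather than Parquet, so the linter raises an advisory.
    df.write.saveAsTable("analytics.events")

    # Explicit format in the same chain of calls: nothing to advise on.
    df.write.format("delta").saveAsTable("analytics.events_delta")
```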
- Added Support for Migrating Table ACL of Interactive clusters using SPN (#1077). This change introduces support for migrating table Access Control Lists (ACLs) of interactive clusters using a Security Principal Name (SPN) for Azure Databricks environments in the UCX project. It includes modifications to the `hive_metastore` and `workspace_access` modules, as well as the addition of new classes, methods, and import statements for handling ACLs and grants. This feature enables more secure and granular control over table permissions when using SPN authentication for interactive clusters in Azure. This will benefit software engineers working with interactive clusters in Azure Databricks by enhancing security and providing more control over data access.
- Added Support for migrating Schema/Catalog ACL for Interactive cluster (#1413). This commit adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, with partial fixes for issues #1192 and #1193. The changes identify and filter database ACL grants, create mappings from Hive metastore schema to Unity Catalog schema and catalog, and replace Hive metastore actions with equivalent Unity Catalog actions for both schema and catalog. External location permission is not included in this commit and will be addressed separately. New methods for creating mappings, updating principal ACLs, and getting catalog schema grants have been added, and existing functionalities have been modified to handle both AWS and Azure. The code has undergone manual testing and passed unit and integration tests. The changes are targeted towards software engineers who adopt the project.
- Added `databricks labs ucx logs` command (#1350). A new command, 'databricks labs ucx logs', has been added to the open-source library to enhance logging and debugging capabilities. This command allows developers and administrators to view logs from the latest job run or specify a particular workflow name to display its logs. By default, logs with levels of INFO, WARNING, and ERROR are shown, but the --debug flag can be used for more detailed DEBUG logs. This feature utilizes the relay_logs method from the deployed_workflows object in the WorkspaceContext class and addresses issue #1282. The addition of this command aims to improve the usability and maintainability of the framework, making it easier for users to diagnose and resolve issues.
- Added check for DBFS mounts in SQL code (#1351). A new feature has been introduced to check for Databricks File System (DBFS) mounts within SQL code, enhancing data management and accessibility in the Databricks environment. The `dbfsqueries.py` file in the `databricks/labs/ucx/source_code` directory now includes a function that verifies the presence of DBFS mounts in SQL queries and returns appropriate messages. The `Languages` class in the `__init__` method has been updated to incorporate a new class, `FromDbfsFolder`, which replaces the existing `from_table` linter with a new linter, `DBFSUsageLinter`, for handling DBFS usage in SQL code. In addition, a Staff Software Engineer has improved the functionality of a DBFS usage linter tool by adding new methods to check for deprecated DBFS mounts in SQL code, returning deprecation warnings as needed. These enhancements ensure more robust handling of DBFS mounts throughout the system, allowing for better integration and management of DBFS-related issues in SQL-based operations.
- Added check for circular view dependency (#1502). A circular view dependency check has been implemented to prevent issues caused by circular dependencies in views. This includes a new test for chained circular dependencies (A->B, B->C, C->A) and an update to the existing circular view dependency test. The checks have been implemented through modifications to the tests in `test_views_sequencer.py`, including a new test method and an update to the existing test method. If any circular dependencies are encountered during migration, ...
v0.21.0
- Ensure proper sequencing of view migrations (#1157). In this release, we have introduced a `views_migrator` module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue #1132. The module contains two main classes: `ViewToMigrate` and `ViewsMigrator`. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, `__hash__`, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file, `tables_and_views.json`, for unit testing and added a `views_migrator` module that takes a `TablesCrawler` object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references. This release is focused on improving the sequencing of view migrations and is accompanied by appropriate tests.
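A minimal sketch of a key-based `__hash__` as described above, on a simplified stand-in for the Table class; the real dataclass has more fields and its key may be composed differently.

```python
from dataclasses import dataclass


@dataclass
class Table:
    catalog: str
    database: str
    name: str

    @property
    def key(self) -> str:
        return f"{self.catalog}.{self.database}.{self.name}".lower()

    def __hash__(self) -> int:
        # Hash on the table key so Table objects can be deduplicated in sets and used as dict keys.
        return hash(self.key)


tables = {Table("hive_metastore", "db1", "orders"), Table("hive_metastore", "db1", "orders")}
print(len(tables))  # 1: equal tables collapse to a single entry
```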
- Experimental support for scanning Delta Tables inside Mount Points (#1095). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new `TablesInMounts` crawler. Users can now scan specific mount points using the `--include-mounts` flag and include Parquet files in the scan results with the `--include-parquet-files` flag. Additionally, the `--filter-paths` flag allows for filtering paths in a mount point, and the `--max-depth` flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use `databricks-labs-lsql~=0.3.0`. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases.
- Fixed `NULL` values in `ucx.views.table_format` to have `UNKNOWN` value instead (#1156). This commit includes a fix for handling NULL values in the `table_format` column of Views in the `ucx.views.table_format` module. Previously, NULL values were displayed as-is, but now they will be replaced with the string "UNKNOWN". This change is part of the fix for issue #115
- Fixing run_workflow functionality for better error handling (#1159). In this release, the `run_workflow` method in the `workflows.py` file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, `job_initial_run`, has been added to initiate a job run and return the run ID, raising a `NotFound` exception if the job run is not found. The `run_workflow` functionality in the `WorkflowsInstall` module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an `OperationFailed` exception is raised during the installation process. These changes improve the robustness and stability of the system.
- Use experimental Permissions Migration API also for Legacy Table ACLs (#1161). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now being utilized for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The `TableAclSupport` import and class have been removed, as they are no longer needed. A new `apply_to_renamed_groups` method has been added for production usage, and an `apply_to_groups_with_different_names` method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The `permission_manager` parameter has been removed from several test functions in the `test_generic.py` file and replaced with the `MigrationState` class, which is used directly with the `WorkspaceClient` object to apply permissions to groups with different names. The `test_some_entitlements` function in the `test_scim.py` file has also been updated to use the `MigratedGroup` class and the `MigrationState` class's `apply_to_groups_with_different_names` method. Finally, new tests for the Permissions Migration API have been added to the `test_tacl.py` file in the `tests/integration/workspace_access` directory to verify the behavior of the Permissions Migration API when migrating different grants.
Contributors: @ericvergnaud, @qziyuan, @nfx, @FastLee, @william-conti, @dmoore247, @pritishpai
v0.20.0
- Added ACL migration to `migrate-tables` workflow (#1135).
- Added AVRO to supported format to be upgraded by SYNC (#1134). In this release, the `hive_metastore` package's `tables.py` file has been updated to add AVRO as a supported format for the SYNC upgrade functionality. This change includes AVRO in the list of supported table formats in the `is_format_supported_for_sync` method, which checks if the table format is not `None` and if the format's uppercase value is one of the supported formats. The addition of AVRO enables it to be upgraded using the SYNC functionality. Moreover, a new format called BINARYFILE has been introduced, which is not supported for SYNC upgrade. This release is part of the implementation of issue #1134, improving the compatibility of the SYNC upgrade functionality with various data formats.
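A minimal sketch of a check along the lines described above; the actual set of supported formats and the surrounding code in `tables.py` differ, so treat the constant below as an assumption.

```python
_SYNC_SUPPORTED_FORMATS = {"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"}


def is_format_supported_for_sync(table_format: str | None) -> bool:
    # SYNC can only upgrade tables whose format is in the supported list; formats such as
    # BINARYFILE remain unsupported.
    if table_format is None:
        return False
    return table_format.upper() in _SYNC_SUPPORTED_FORMATS


print(is_format_supported_for_sync("avro"))        # True
print(is_format_supported_for_sync("BINARYFILE"))  # False
```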
is_partitioned
column (#1130). A new column,is_partitioned
, has been added to theucx.tables
table in the assessment module, indicating whether the table is partitioned or not with valuesYes
or "No". This change addresses issue #871 and has been manually tested. The commit also includes updated documentation for the modified table. No new methods, CLI commands, workflows, or tests (unit, integration) have been introduced as part of this change. - Added assessment of interactive cluster usage compared to UC compute limitations (#1123).
- Added external location validation when creating catalogs with
create-catalogs-schemas
command (#1110). - Added flag to Job to identify Job submitted by jar (#1088). The open-source library has been updated with several new features aimed at enhancing user functionality and convenience. These updates include the addition of a new sorting algorithm, which provides users with an efficient and customizable method for organizing data. Additionally, a new caching mechanism has been implemented, improving the library's performance and reducing the amount of time required to access frequently used data. Furthermore, the library now supports multi-threading, enabling users to perform multiple operations simultaneously and increase overall productivity. Lastly, a new error handling system has been developed, providing users with more informative and actionable feedback when unexpected issues arise. These changes are a significant step forward in improving the library's performance, functionality, and usability for all users.
- Bump databricks-sdk from 0.22.0 to 0.23.0 (#1121). In this version update, `databricks-sdk` is upgraded from 0.22.0 to 0.23.0, introducing significant changes to the handling of AWS and Azure identities. The `AwsIamRole` class is replaced with `AwsIamRoleRequest` in the `databricks.sdk.service.catalog` module, affecting the creation of AWS storage credentials using IAM roles. The `create` function in `src/databricks/labs/ucx/aws/credentials.py` is updated to accommodate this modification. Additionally, the `AwsIamRole` argument in the `create` function of `fixtures.py` in the `databricks/labs/ucx/mixins` directory is replaced with `AwsIamRoleRequest`. The tests in `tests/integration/aws/test_access.py` are also updated to utilize `AwsIamRoleRequest`, and `StorageCredentialInfo` in `tests/unit/azure/test_credentials.py` now uses `AwsIamRoleResponse` instead of `AwsIamRole`. The new classes, `AwsIamRoleRequest` and `AwsIamRoleResponse`, likely include new features or bug fixes for AWS IAM roles. These changes require software engineers to thoroughly assess their codebase and adjust any relevant functions accordingly. A hedged sketch of creating a storage credential with `AwsIamRoleRequest` is included after the dependency updates below.
- Deploy static views needed by #1123 interactive dashboard (#1139). In this update, we have added two new views, `misc_patterns_vw` and `code_patterns_vw`, to the `install.py` script in the `databricks/labs/ucx` directory. These views were originally intended to be deployed with a previous update (#1123) but were inadvertently overlooked. The addition of these views addresses issues with queries in the `interactive` dashboard. The `deploy_schema` function has been updated with two new lines, `deployer.deploy_view("misc_patterns", "queries/views/misc_patterns.sql")` and `deployer.deploy_view("code_patterns", "queries/views/code_patterns.sql")`, to deploy the new views using their respective SQL files from the `queries/views` directory. No other modifications have been made to the file.
- Fixed Table ACL migration logic (#1149). The open-source library has been updated with several new features, providing enhanced functionality for software engineers. A new utility class has been added to simplify the process of working with collections, offering methods to filter, map, and reduce elements in a performant manner. Additionally, a new configuration system has been implemented, allowing users to easily customize library behavior through a simple JSON format. Finally, we have added support for asynchronous processing, enabling efficient handling of I/O-bound tasks and improving overall application performance. These features have been thoroughly tested and are ready for use in your projects.
- Fixed `AssertionError: assert '14.3.x-scala2.12' == '15.0.x-scala2.12'` from nightly integration tests (#1120). In this release, the open-source library has been updated with several new features to enhance functionality and provide more options to users. The library now supports multi-threading, allowing for more efficient processing of large datasets. Additionally, a new algorithm for data compression has been implemented, resulting in reduced memory usage and faster data transfer. The library API has also been expanded, with new methods for sorting and filtering data, as well as improved error handling. These changes aim to provide a more robust and performant library, making it an even more valuable tool for software engineers.
- Increase code coverage by 1 percent (#1125).
- Skip installation if remote and local version is the same, provide prompt to override (#1084). In this release, the `new_installation` workflow in the open-source library has been enhanced to include a new use case for handling identical remote and local versions of UCX. When the remote and local versions are the same, the user is now prompted, and if no override is requested, a `RuntimeWarning` is raised. Additionally, users are now prompted to update the existing installation and, if confirmed, the installation proceeds. These modifications include manual testing and new unit tests to ensure functionality. These changes provide users with more control over their installation process and address a specific use case for handling identical UCX versions.
- Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137). The open-source library has been updated with several new features to enhance usability and functionality. Firstly, we have added support for asynchronous processing, allowing for more efficient handling of large data sets and improving overall performance. Additionally, a new configuration system has been implemented, which simplifies the setup process for users and increases customization options. We have also included a new error handling mechanism that provides more detailed and actionable information, making it easier to diagnose and resolve issues. Lastly, we have made significant improvements to the library's documentation, including updated examples, guides, and an expanded API reference. These changes are part of our ongoing commitment to improving the library and providing the best possible user experience.
- [Experimental] Add support for permission migration API (#1080).
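To make the AVRO change above concrete: the SYNC eligibility check described there amounts to a case-insensitive set-membership test on the table format. The sketch below is illustrative only; the exact set of supported formats and the surrounding class are assumptions, not the actual `hive_metastore/tables.py` code.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Illustrative table record; the real crawler row has more columns."""

    name: str
    table_format: str | None = None

    # Assumed set of SYNC-compatible formats; AVRO is now included, while
    # BINARYFILE is deliberately left out because SYNC cannot upgrade it.
    _SYNC_SUPPORTED_FORMATS = frozenset({"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"})

    @property
    def is_format_supported_for_sync(self) -> bool:
        # A NULL format can never be upgraded; otherwise compare case-insensitively.
        if self.table_format is None:
            return False
        return self.table_format.upper() in self._SYNC_SUPPORTED_FORMATS


# Usage:
assert Table(name="events", table_format="avro").is_format_supported_for_sync
assert not Table(name="blobs", table_format="BINARYFILE").is_format_supported_for_sync
```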
Dependency updates:
- Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137).
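As a usage note for the databricks-sdk 0.23.0 bump above, here is a hedged sketch of creating an AWS storage credential with the new request-side dataclass. The credential name, role ARN, and comment are placeholders, and the exact keyword arguments may differ between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import AwsIamRoleRequest

ws = WorkspaceClient()

# AwsIamRoleRequest replaces the old AwsIamRole on the request side when
# creating Unity Catalog storage credentials backed by an IAM role.
credential = ws.storage_credentials.create(
    name="ucx-example-credential",  # placeholder name
    aws_iam_role=AwsIamRoleRequest(
        role_arn="arn:aws:iam::123456789012:role/example-uc-role",  # placeholder ARN
    ),
    comment="Illustrative credential showing the SDK 0.23.0 API shape",
)

# The response side uses AwsIamRoleResponse, which also carries the external
# ID that Unity Catalog generates for the role's trust relationship.
print(credential.aws_iam_role)
```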
Contributors: @nkvuong, @nfx, @ericvergnaud, @pritishpai, @dleiva04, @dmoore247, @dependabot[bot], @qziyuan, @FastLee, @prajin-29
v0.19.0
- Added instance pool id to WorkspaceConfig (#1087). In this release, the `create` method of the `_policy_installer` object has been updated to return an additional value, `instance_pool_id`, which is then assigned and passed as an argument to the `WorkspaceConfig` object in the `_configure_new_installation` method. The `ClusterPolicyInstaller` class in the `v0.15.0_added_cluster_policy.py` file has also been updated to return a fourth value, `instance_pool_id`, from the `create` method, allowing for more flexibility in future enhancements. Additionally, the test function `test_table_migration_job` in the `test_installation.py` file has been updated to skip when the script is not being run as part of a nightly test job or in debug mode, and the test functions in the `test_policy.py` file have been updated to reflect the new return value in the `create` method. These changes enable better management and scaling of resources through instance pools, provide more granular control in the WorkspaceConfig, and improve testing efficiency.
- Added more cross-linking between CLI commands (#1091). In this release, we have introduced several enhancements to our open-source library's Command Line Interface (CLI) and documentation. Specifically, we have added more cross-linking between CLI commands to improve navigation and usability. The documentation has been updated to include a new step in the UCX installation process, where users are required to run the assessment workflow after installing UCX. This workflow is the first step in the migration process and checks the compatibility of the user's workspace with Unity Catalog. Additionally, we have added new commands for `principal-prefix-access`, `migrate-credentials`, and `migrate-locations`, which are part of the table migration process. These new commands require the assessment workflow and group migration workflow to be completed before they can be executed. Overall, these changes aim to provide a more streamlined and detailed installation and migration process, improving the user experience for software engineers.
- Fixed command references in README.md (#1093). In this release, we have made improvements to the command references in the README.md file to enhance the overall readability and usability of the documentation for software engineers. Specifically, we have updated the links for the `migrate-locations` and `validate_external_locations` commands to use the correct syntax, enclosing them in backticks to denote code. This change ensures that the links are correctly interpreted as commands and addresses any issues that may have arisen with their previous formatting. It is important to note that no new methods have been added in this release, and the existing functionality of the commands has not been changed in scope or functionality.
- Fixing the issue in workspace id flag in create-account-group command (#1094). In this update, we have improved the `create_account_group` command related to the `workspace_ids` flag in our open-source library. The `workspace_ids` flag's type has been changed from `list[int] | None` to `str | None`, allowing for easier input of multiple workspace IDs as a string of comma-separated integers. The `create_account_level_groups` function in the `AccountWorkspaces` class has been updated to accept this string and convert it to a list of integers before proceeding. To ensure proper functioning, we added a new test case `test_create_account_groups_with_id()` to check if the command handles the case when no workspace IDs are provided in the configuration. The `create_account_groups()` method now checks for this condition and raises a `ValueError`. Furthermore, the `manual_workspace_info()` method has been updated to handle workspace name input by the user, receiving the `ws` object, along with prompts that contain the user input for the workspace name and the next workspace ID.
- Rely UCX on the latest 14.3 LTS DBR instead of 15.x (#1097). In this release, we have implemented a quick fix to rely on the Long Term Support (LTS) version 14.3 of the Databricks Runtime (DBR) instead of 15.x for UCX, addressing issue #1096. This change affects the `_definition` function, which has been modified to use the latest LTS DBR instead of the latest Spark version. The `latest_lts_dbr` variable is now assigned the value returned by the `select_spark_version` method with the `latest=True` and `long_term_support=True` parameters. The `spark_version` key in the `policy_definition` dictionary is set to the value returned by the `_policy_config` method with `latest_lts_dbr` as the argument. Additionally, in the `tests/unit/installer/test_policy.py` file, the `select_spark_version` method of the `clusters` object has been updated to accept any number of arguments and consistently return the string "14.2.x-scala2.12", allowing for greater flexibility. This is a temporary solution, with a more comprehensive fix being tracked in issue #1098. Developers should be aware of how the `clusters` object is used in the codebase when adopting this project. A hedged sketch of selecting the latest LTS runtime follows this list.
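To illustrate the LTS pinning described in the last item: the databricks-sdk exposes a spark-version selector that can return the newest Long Term Support runtime. The sketch below is illustrative; the policy fragment and its keys are assumptions rather than the exact ucx cluster policy.

```python
import json

from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()

# Ask the SDK for the newest Long Term Support runtime instead of the absolute
# latest release (e.g. a 14.3.x LTS version rather than a newer 15.x one).
latest_lts_dbr = ws.clusters.select_spark_version(latest=True, long_term_support=True)

# Illustrative cluster-policy fragment that pins the runtime to that LTS version.
policy_definition = {
    "spark_version": {"type": "fixed", "value": latest_lts_dbr},
}
print(json.dumps(policy_definition, indent=2))
```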
Contributors: @nfx, @qziyuan, @prajin-29