
Releases: databrickslabs/ucx

v0.27.0

12 Jun 23:10
@nfx
520f886
  • Added mlflow to known packages (#1895). The mlflow package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of mlflow in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to sparkContext, _conf, and RDD APIs. Additionally, the artifact storage system of mlflow in Databricks and DBFS has undergone changes. The known.json file has also been updated with several new packages, such as alembic, aniso8601, cloudpickle, docker, entrypoints, flask, graphene, graphql-core, graphql-relay, gunicorn, html5lib, isort, jinja2, markdown, markupsafe, mccabe, opentelemetry-api, opentelemetry-sdk, opentelemetry-semantic-conventions, packaging, pyarrow, pyasn1, pygments, pyrsistent, python-dateutil, pytz, pyyaml, regex, requests, and more. These packages are now acknowledged and incorporated into the project's functionality.
  • Added tensorflow to known packages (#1897). In this release, we are excited to announce the addition of the tensorflow package to our known packages list. Tensorflow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as tensorflow, tensorboard, tensorboard-data-server, and tensorflow-io-gcs-filesystem, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as gast, grpcio, h5py, keras, libclang, mdurl, namex, opt-einsum, optree, pygments, rich, rsa, termcolor, pyasn1_modules, sympy, and threadpoolctl. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
  • Added torch to known packages (#1896). In this release, the "known.json" file has been updated to include several new packages and their respective modules for a specific project or environment. These packages include "torch", "functorch", "mpmath", "networkx", "sympy", "isympy". The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the _analyze_dist_info method in the known.py file has been improved to handle recursion errors during package analysis. A try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a RecursionError occurs. This enhancement increases the robustness of the package analysis process.
  • Added more known libraries (#1894). In this release, the known library has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
  • Added more value inference for dbutils.notebook.run(...) (#1860). In this release, the dbutils.notebook.run(...) functionality in graph.py has been significantly updated to enhance value inference. The change includes the introduction of new methods for handling NotebookRunCall and SysPathChange objects, as well as the refactoring of the get_notebook_path method into get_notebook_paths. This new method now returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, _get_notebook_paths, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the load_dependency method in loaders.py has been updated to detect the language of a notebook based on the file path, in addition to its content. The Notebook class now includes a new parameter, SUPPORTED_EXTENSION_LANGUAGES, which maps file extensions to their corresponding languages. In the databricks.labs.ucx project, more value inference has been added to the linter, including new methods and enhanced functionality for dbutils.notebook.run(...). Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the NotebookLoader class in the databricks.labs.ucx.source_code.notebooks.loaders module has been added, with a new class, NotebookLoaderForTesting, that overrides the detect_language method to make it a class method. This allows for more robust testing of the NotebookLoader class. Overall, these changes improve the accuracy and reliability of value inference for dbutils.notebook.run(...) and enhance the testing and usability of the related classes and methods.
  • Added nightly workflow to use industry solution accelerators for parser validation (#1883). A nightly workflow has been added to validate the parser using industry solution accelerators, which can be triggered locally with the make solacc command. This workflow involves a new Makefile target, 'solacc', which runs a Python script located at 'tests/integration/source_code/solacc.py'. The workflow is designed to run on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs on a daily basis at 7am using a cron schedule, and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
  • Complete support for pip install command (#1853). In this release, we've made significant enhancements to support the pip install command in our open-source library. The register_library method in the DependencyResolver, NotebookResolver, and LocalFileResolver classes has been modified to accept variable numbers of libraries instead of just one, allowing for more efficient dependency management. Additionally, the resolve_import method has been introduced in the NotebookResolver and LocalFileResolver classes for improved import resolution. Moreover, the _split static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full pip install command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
  • Infer simple f-string values when computing values during linting (#1876). This commit enhances the open-source library by adding support for inferring simple f-string values during linting, addressing issue #1871 and progressing #1205. The new functionality works for simple f-strings but currently does not support nested f-strings. It introduces the InferredValue class and updates the visit_call, visit_const, and _check_str_constant methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to the error locations reported in code. The commit also presents an example of simple f-string handling, emphasizing the limitations yet providing a solid foundation for future development (an illustrative snippet appears after this list). Co-authored by Eric Vergnaud.
  • Propagate widget parameters and data security mode to CurrentSessionState (#1872). In this release, the spark_version_compatibility function in crawlers.py has been refactored to runtime_version_tuple, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the CurrentSessionState class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attribu...
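To make the runtime-version handling described in the last bullet concrete, here is a minimal, hypothetical sketch of turning a spark_version string into an integer tuple. The function name matches the description, but the parsing details and the treatment of custom runtimes and DLT are illustrative assumptions, not the actual UCX implementation.

```python
def runtime_version_tuple(spark_version: str | None) -> tuple[int, int] | None:
    """Illustrative parser: '13.3.x-scala2.12' -> (13, 3)."""
    if not spark_version or "custom" in spark_version or "dlt" in spark_version:
        return None  # assumption: custom runtimes and DLT have no comparable version
    first_part = spark_version.split("-")[0]      # '13.3.x'
    major, minor = first_part.split(".")[:2]      # '13', '3'
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"cannot parse version components from {spark_version!r}")
    return int(major), int(minor)

print(runtime_version_tuple("13.3.x-scala2.12"))  # (13, 3)
```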
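As an illustration of the simple f-string inference added in #1876 above, the snippet below is example notebook code that the linter can now reason about (spark is the notebook-provided SparkSession); nested f-strings remain out of scope.

```python
# Example notebook code that the linter can now reason about:
schema = "prod"
table = f"{schema}.customers"       # simple f-string: value inferred as "prod.customers"
spark.table(table)                  # linted against the table-migration index

nested = f"{f'{schema}'}.orders"    # nested f-string: value not inferred yet
```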

v0.26.0

07 Jun 23:38
@nfx
b19c848
  • Added migration for Python linters from ast (standard library) to astroid package (#1835). In this release, the Python linters have been migrated from the ast package in the standard library to the astroid package, version 3.2.2 or higher, with a minimal inference implementation. This change includes updates to the pyproject.toml file to include astroid as a dependency and to bump the version of pylint. No changes have been made to user documentation, CLI commands, workflows, or tables. Testing has been conducted through the addition of unit tests. This update aims to improve the functionality and accuracy of the Python linters (a small inference example appears after this list).
  • Added workflow linter for delta live tables task (#1825). In this release, there are updates to the _register_pipeline_task method in the jobs.py file. The method now checks for the existence of the pipeline and its libraries, and registers each notebook or jar library found in the pipeline as a task. If the library is a Maven or file type, it will raise a DependencyProblem as it is not yet implemented. Additionally, new functions and tests have been added to improve the quality and functionality of the project, including a workflow linter for Delta Live Tables (DLT) tasks and a linter that checks for issues with specified DLT tasks. A new method, test_workflow_linter_dlt_pipeline_task, has been added to test the workflow linter for DLT tasks, verifying the correct creation and functioning of the pipeline task and checking the building of the dependency graph for the task. These changes enhance the project's ability to ensure the proper configuration and correctness of DLT tasks and prevent potential issues.
  • Consistent 0-based line tracking for linters (#1855). 0-based line tracking has been consistently implemented for linters in various files and methods throughout the project, addressing issue #1855. This change includes removing direct filesystem references in favor of using the Unity Catalog for table migration and format changes. It also updates comments and warnings to improve clarity and consistency. In particular, the spark-table.py file has been updated to ensure that the spark.log.level is set correctly for UC Shared Clusters, and that the Spark Driver JVM is no longer accessed directly. The new file, simple_notebook.py, demonstrates the consistent line tracking for linters across different cell types, such as Python, Markdown, SQL, Scala, Shell, Pip, and Python (with magic commands). These changes aim to improve the accuracy and reliability of linters, making the codebase more maintainable and adaptable.
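For context on the ast-to-astroid migration in #1835 above, the snippet below shows the kind of minimal inference astroid offers over the standard ast module, using astroid's public extract_node/infer API; it is a standalone illustration rather than UCX code.

```python
import astroid

# The "#@" marker tells extract_node which expression to return.
node = astroid.extract_node(
    """
    prefix = "/Workspace/notebooks/"
    path = prefix + "child_notebook"
    path  #@
    """
)
inferred = next(node.infer())
print(inferred.value)  # /Workspace/notebooks/child_notebook
```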

Dependency updates:

  • Updated sqlglot requirement from <24.2,>=23.9 to >=23.9,<25.1 (#1856).

Contributors: @ericvergnaud, @JCZuurmond, @FastLee, @pritishpai, @dependabot[bot], @asnare

v0.25.0

04 Jun 18:26
@nfx
a9f874d
  • Added handling for legacy ACL DENY permission in group migration (#1815). In this release, the handling of DENY permissions during group migrations in our legacy ACL table has been improved. Previously, DENY operations were denoted with a DENIED prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of DENIED in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue #1803. A new test function, test_hive_deny_sql(), has also been added to test the behavior of the DENY permission (a sketch of this handling appears after this list).
  • Added handling for parsing corrupted log files (#1817). The logs.py file in the src/databricks/labs/ucx/installer directory has been updated to improve the handling of corrupted log files. A new block of code has been added to check if the logs match the expected format, and if they don't, a warning message is logged and the function returns, preventing further processing and the potential production of incorrect results. The changes include a new test, test_parse_logs_warns_for_corrupted_log_file, that verifies the expected warning message and the corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files (a sketch appears after this list).
  • Added known problems with pyspark package (#1813). In this release, updates have been made to the src/databricks/labs/ucx/source_code/known.json file to document known issues with the pyspark package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new KnownProblem dataclass has been added to the known.py file, which includes methods for converting the object to a dictionary for better encoding of problems. The _analyze_file method has also been updated to use a known_problems set of KnownProblem objects, improving readability and management of known problems within the application. These changes address issue #1813 and improve the documentation of known issues with pyspark.
  • Added library linting for jobs launched on shared clusters (#1689). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue #1637. A new function, _register_existing_cluster_id(graph: DependencyGraph), has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the test_jobs.py file in the tests/integration/source_code directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the jobs and compute modules from the databricks.sdk.service package. Additionally, a new WorkflowTaskContainer method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by ensuring that jobs run smoothly on shared clusters by checking for and handling missing libraries. Software engineers will benefit from these improvements as it will reduce the occurrence of errors due to missing libraries on shared clusters.
  • Added linters to check for spark logging and configuration access (#1808). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via sc.conf, and rdd.mapPartitions. The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the SparkConnectLinter class and are executed as part of the databricks labs ucx command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected.
  • Added list of known dependency compatibilities and regeneration infrastructure for it (#1747). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the known.json file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library.
  • Added more known libraries from Databricks Runtime (#1812). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios.
  • Added more known packages from Databricks Runtime (#1814). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility.
  • Added support for .egg Python libraries in jobs (#1789). This commit adds support for .egg Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue #1643. It includes the addition of a new method, PythonLibraryResolver, which replaces the old PipResolver, and is used to register egg library dependencies in the DependencyGraph. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the 'test_dependencies.py' file, specifically in the import section where PipResolver is replaced with PythonLibraryResolver from the 'databricks.labs.ucx.source_code.python_libraries' package. These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from .egg files.
  • Added table migration workflow guide (#1607). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, and Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience.
  • Added workflow linter for spark python tasks (#1810). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the _register_spark_python_task method in the jobs.py file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. The test_job_spark_python_task_linter_happy_path t...
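A rough sketch of the DENY handling described for #1815 above: the action type is checked for the DENIED prefix, the prefix is stripped, and the action is wrapped in backticks before the SQL is built. The function signature and SQL shape are illustrative assumptions rather than the exact UCX code.

```python
def _apply_grant_sql(action_type: str, object_type: str, object_key: str, principal: str) -> str:
    # The legacy ACL table stores denies as e.g. "DENIED_SELECT"
    if "DENIED_" in action_type:
        action = action_type.replace("DENIED_", "")
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT {action_type} ON {object_type} {object_key} TO `{principal}`"

print(_apply_grant_sql("DENIED_SELECT", "TABLE", "hive_metastore.db.t", "analysts"))
# DENY `SELECT` ON TABLE hive_metastore.db.t TO `analysts`
```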
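Similarly, a hedged sketch of the corrupted-log guard from #1817 above: the first line is checked against the expected log-line format, and parsing stops with a warning when it does not match. The regex and function shape are assumptions for illustration only.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Expected shape of a log line, e.g. "12:34:56 INFO [module] message" (pattern is illustrative)
_LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+\w+\s+\[.+\]")

def parse_logs(lines: list[str]) -> list[str]:
    if lines and not _LOG_LINE.match(lines[0]):
        logger.warning(f"Logs do not match expected format, skipping: {lines[0]}")
        return []  # stop instead of producing incorrect results
    return lines   # real parsing would turn these into structured records
```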

v0.24.0

27 May 12:26
@nfx
9b83666
  • Added %pip cell resolver (#1697). A newly developed pip resolver has been integrated into the ImportResolver for future use, addressing issue #1642 and following up on #1694. The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the build_dependency_graph method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
  • Added local download of requirements.txt dependencies to register them in the dependency graph (#1753). This commit introduces support for linting job tasks that require a 'requirements.txt' file for specifying dependencies. It resolves issue #1644 and is similar to #1704. The changes include the addition of a new CLI command, modification of the existing 'databricks labs ucx ...' command, and modification of the experimental-workflow-linter workflow. The lint_job method has been updated to handle dependencies specified in a 'requirements.txt' file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the 'jobs.py' file to register libraries specified in a 'requirements.txt' file to the dependency graph. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements files and constraints files.
  • Added ability to install UCX on workspaces without Public Internet connectivity (#1566). A new flag, upload_dependencies, has been added to the WorkspaceConfig to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that is set to False by default and can be set by the user through the installation prompt. This feature resolves issue #573 and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the databricks-labs-blueprint version requirement to >=0.6.0,<0.7.0, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the upload_dependencies flag is set to True (a configuration sketch appears after this list).
  • Added initial interface for data comparison framework (#1695). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new StandardDataComparator class has been implemented for comparing the data of two tables, and a StandardSchemaComparator class tests the comparison of table schemas. The framework also includes the DatabricksTableMetadataRetriever class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as StandardDataProfiler for profiling data, SchemaComparator and DataComparator for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
  • Added lint local code command (#1710). A new lint local code command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The lint-local-code command is implemented in the application.py file, with supporting methods and classes added to the workspace_cli.py and databricks.labs.ucx.source_code packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
  • Added table in mount migration (#1225). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with include_paths_in_mount not being present in workflows.py, and adding the ability to set default ownership on each created table. A new method ScanTablesInMounts has been added to scan tables in mounts, and a TableMigration class creates tables in the Unity Catalog based on the table mapping. Two new methods, Rule and TableMapping, have been added to manage mappings of tables, and TableToMigrate is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the workflows.py file and the addition of several new methods, including Rule, TableMapping, TableToMigrate, create_autospec, and MockBackend.
  • Added workflows to trigger table reconciliations (#1721). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's $inventory_database.reconciliation_results view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the WorkspaceConfig class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
  • Always refresh HMS stats when getting table size (#1713). A change has been implemented in the hive_metastore library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This has been achieved by calling the ANALYZE TABLE command with the COMPUTE STATISTICS NOSCAN option before computing the table size, thus preventing the use of stale stats. Specifically, the "backend.queries" list has been updated to include two ANALYZE statements for tables "db1.table1" and "db1.table2", ensuring that their statistics are updated and accurate. The test case test_table_size_crawler in the "test_table_size.py" file has been revised to validate the presence of the two ANALYZE statements in the "backend.queries" list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality works as expected (a sketch appears after this list).
  • Automatically retrieve aws_account_id from aws profile instead of prompting (#1715). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of aws_account_id by automatically retrieving it from the AWS profile. An optional kms-key flag has been documented for creating roles, providing more flexibility. The create-missing-principals command now accepts optional parameters such as KMS Key, Role Name, Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue #1714. Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing aws_cli_run_command, ensuring automated retrieval of aws_account_id. A test has also been added to raise an error when AWS CLI is not found in the system path.
  • Detect dependencies of libraries installed via pip (#1703). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues #1642 and #1202...
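A minimal configuration sketch for the upload_dependencies flag from #1566 above; the import path and the inventory_database field are assumptions made for the example and may differ from the actual WorkspaceConfig.

```python
from databricks.labs.ucx.config import WorkspaceConfig  # import path assumed for illustration

config = WorkspaceConfig(
    inventory_database="ucx",   # assumed field: where UCX keeps its inventory tables
    upload_dependencies=True,   # upload wheels to the workspace instead of fetching from PyPI
)
```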
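And to illustrate the stats refresh from #1713 above, a simplified sketch that issues ANALYZE TABLE ... COMPUTE STATISTICS NOSCAN before reading a table size; the size lookup shown (DESCRIBE DETAIL) is only one way to read it and is not necessarily what UCX does, and spark stands in for the notebook SparkSession.

```python
def refreshed_table_size_bytes(spark, full_table_name: str) -> int:
    # Refresh HMS statistics first so the size is not read from stale stats
    spark.sql(f"ANALYZE TABLE {full_table_name} COMPUTE STATISTICS NOSCAN")
    # Read back a size (simplified: real code may parse other metadata instead)
    rows = spark.sql(f"DESCRIBE DETAIL {full_table_name}").collect()
    return rows[0]["sizeInBytes"]

# size = refreshed_table_size_bytes(spark, "hive_metastore.db1.table1")
```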

v0.23.1

10 May 15:07
@nfx
7d07554
  • Improved error handling for migrate-tables workflows (#1674). This commit enhances the error handling for migrate-tables workflows by introducing new tests that cover specific scenarios where failures may occur during table migration. The changes include the creation of mock objects and injecting failures for the get_tables_to_migrate method of the TableMapping class. Three new tests have been added, each testing a specific scenario, including token errors when checking table properties, errors when trying to get properties for a non-existing table, and errors when trying to unset the upgraded_to property. The commit also asserts that specific error messages are logged during these failures. These improvements ensure better visibility and debugging capabilities during table migration. The code was manually tested, and unit tests were added and verified on a staging environment, ensuring that the new error handling mechanisms function as intended.
  • Improved error handling for all queries executed during table migration (#1679). This release includes improved error handling during table migration in our data workflow, resolving issue #167
  • Removed dependency on internal pathlib implementations (#1672). In this release, we have introduced a new custom _DatabricksFlavor class as a replacement for the internal pathlib._Flavor implementations, specifically designed for the Databricks environment. This class handles various functionalities including separation of paths, joining and parsing of parts, and casefolding of strings, among others. The make_uri method has also been updated to generate the correct URI for the workspace. This change removes the dependency on internal pathlib._Flavor implementations which were not available on Windows. As part of this change, the test_wspath.py file in the tests/integration/mixins directory has been updated, with the test_exists and test_mkdirs methods being modified to reflect the removal of _Flavor. These updates improve the compatibility and reliability of the codebase on Windows systems.
  • Updated databricks-labs-blueprint requirement from ~=0.4.3 to >=0.4.3,<0.6.0 (#1670). In this update, we have adjusted the requirement for databricks-labs-blueprint from version ~=0.4.3 to >=0.4.3,<0.6.0, ensuring the latest version can be installed while remaining below 0.6.0. This change is part of issue #1670 and includes the release notes and changelog in the commit message, highlighting improvements and updates in version 0.5.0. These enhancements consist of content assertion in MockInstallation, better handling of partial functions in parallel.Threads, and adjusted configurations aligned with the UCX project. The commit also covers various dependency updates and bug fixes, providing a more robust and efficient library experience for software engineers.

Dependency updates:

  • Updated databricks-labs-blueprint requirement from ~=0.4.3 to >=0.4.3,<0.6.0 (#1670).

Contributors: @nkvuong, @dependabot[bot], @nfx, @JCZuurmond

v0.23.0

08 May 15:44
@nfx
2f58963
  • Added DBSQL queries & dashboard migration (#1532). The Databricks Labs Unified Command Extensions (UCX) project has been updated with two new experimental commands: migrate-dbsql-dashboards and revert-dbsql-dashboards. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The migrate-dbsql-dashboards command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with migrated by UCX and backing up original queries. The revert-dbsql-dashboards command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a --dashboard-id flag for migrating or reverting a specific dashboard. Additionally, two new functions, migrate_dbsql_dashboards and revert_dbsql_dashboards, have been added to the cli.py file, and new classes have been added to interact with Redash for data visualization and querying. The make_dashboard fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
  • Added UDFs assessment (#1610). A User Defined Function (UDF) assessment feature has been introduced, addressing issue #1610. A new method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter 'comment', initially left blank in the test function. Additionally, two new columns, success and 'failures', have been added to the udf table in the inventory database to store assessment data for UDFs. The UdfsCrawler class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the $inventory.udfs table, with a widget displaying this information as a counter visualization named "Total UDF Count".
  • Added databricks labs ucx create-missing-principals command to create the missing UC roles in AWS (#1495). The databricks labs ucx tool now includes a new command, create-missing-principals, which creates missing Unity Catalog (UC) roles in AWS for S3 locations that lack a UC-compatible role. This command is implemented using IamRoleCreation from databricks.labs.ucx.aws.credentials and updates AWSRoleAction with the corresponding role_arn while adding AWSUCRoleCandidate. The new command only supports AWS and does not affect Azure. The existing migrate_credentials function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including AWSUCRoleCandidate in aws.py, and create_missing_principals and list_uc_roles methods in access.py. The create_uc_roles_cli method in access.py has been refactored and renamed to list_uc_roles. New unit tests have been implemented to test the functionality of create_missing_principals for AWS and Azure, as well as testing the behavior when the command is not approved.
  • Added baseline for workflow linter (#1613). This change introduces the WorkflowLinter class in the application.py file of the databricks.labs.ucx.source_code.jobs package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as workspace_client, dependency_resolver, path_lookup, and migration_index. Several properties have been moved from dependency_resolver to the CliContext class, and the NotebookLoader class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The generic and redash modules from databricks.labs.ucx.workspace_access and the GroupManager class from databricks.labs.ucx.workspace_access.groups are used. The VerifyHasMetastore, UdfsCrawler, and TablesMigrator classes from databricks.labs.ucx.hive_metastore and the DeployedWorkflows class from databricks.labs.ucx.installer.workflows are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
  • Added linter to check for RDD use and JVM access (#1606). A new AstHelper class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the AstHelper class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues related to RDD usage and JVM access early in the development process. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments (example flagged patterns are shown after this list).
  • Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow (#1621). The migrate_tables workflow in workflows.py has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The migrate_external_tables_sync, migrate_dbfs_root_delta_tables, and migrate_views tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
  • Added "seen tables" feature (#1465). The seen tables feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a snapshot method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The _crawl function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and _get_table_location have been included to facilitate these improvements. In the testing realm, a new test test_mount_listing_seen_tables has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore.
  • Added support for migrate-tables-ctas workflow in the databricks labs ucx migrate-tables CLI command (#1660). This commit adds support for the migrate-tables-ctas workflow in the databricks labs ucx migrate-tables command, which checks for external tables that cannot be synced and prompts the user to run the migrate-tables-ctas workflow. Two new methods, test_migrate_external_tables_ctas(ws) and migrate_tables(ws, prompts, ctx=ctx), have been added. The first method checks if the migrate-external-tables-ctas workflow is called correctly, while the second method runs the workflow after prompting the user. The method test_migrate_external_hiveserde_tables_in_place(ws) has been modified to test if the migrate-external-hiveserde-tables-in-place-experimental workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature are software engineers who adopt the project.
  • Added support for migrating external location permissions from interactive cluster mounts (#1487). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing databricks labs ucx command is modified, with the addition of the new method create_external_locations and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues #1192 and #1193, ensuring a more robust and controlled user experience with interactive clusters.
  • Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN (#1631). In this release, we've implement...
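To make the RDD/JVM linter from #1606 above more tangible, the snippet shows the kind of user code it is meant to flag on UC Shared Clusters; it is example code being linted (with spark as the notebook SparkSession), not the linter implementation.

```python
# Patterns a linter like this would flag on UC Shared Clusters:
log4j = spark._jvm.org.apache.log4j.LogManager.getLogger("app")  # JVM access: not available
rdd = spark.createDataFrame([(1,)]).rdd                           # RDD API: prefer DataFrame API
rdd.mapPartitions(lambda it: it)                                   # RDD transformation: rewrite with DataFrames
```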

v0.22.0

26 Apr 15:20
@nfx
5273936
  • A notebook linter to detect DBFS references within notebook cells (#1393). A new linter has been implemented in the open-source library to identify references to Databricks File System (DBFS) mount points or folders within SQL and Python cells of Notebooks, raising Advisory or Deprecated alerts when detected. This feature, resolving issue #1108, enhances code maintainability by discouraging DBFS usage, and improves security by avoiding hard-coded DBFS paths. The linter's functionality includes parsing the code and searching for Table elements within statements, raising warnings when DBFS references are found. Implementation changes include updates to the NotebookLinter class, a new from_source class method, and an original_offset argument in the Cell class. The linter now also supports the databricks dialect for SQL code parsing. This feature improves the library's security and maintainability by ensuring better data management and avoiding hard-coded DBFS paths.
  • Added CLI commands to trigger table migration workflow (#1511). A new migrate_tables command has been added to the 'databricks.labs.ucx.cli' module, which triggers the migrate-tables workflow and, optionally, the migrate-external-hiveserde-tables-in-place-experimental workflow. The migrate-tables workflow is responsible for managing table migrations, while the migrate-external-hiveserde-tables-in-place-experimental workflow handles migrations for external hiveserde tables. The new What class from the 'databricks.labs.ucx.hive_metastore.tables' module is used to identify hiveserde tables. If hiveserde tables are detected, the user is prompted to confirm running the migrate-external-hiveserde-tables-in-place-experimental workflow. The migrate_tables command requires a WorkspaceClient and Prompts objects and accepts an optional WorkspaceContext object, which is set to the WorkspaceContext of the WorkspaceClient if not provided. Additionally, a new migrate_external_hiveserde_tables_in_place command has been added which will run the migrate-external-hiveserde-tables-in-place-experimental workflow if it finds any hiveserde tables, making it easier to manage table migrations from the command line.
  • Added CSV, JSON and include path in mounts (#1329). In this release, the TablesInMounts function has been enhanced to support CSV and JSON file formats, along with the existing Parquet and Delta table formats. The new include_paths_in_mount parameter has been introduced, enabling users to specify a list of paths to crawl within all mounts. The WorkspaceConfig class in the config.py file has been updated to accommodate these changes. Additionally, a new _assess_path method has been introduced to assess the format of a given file and return a TableInMount object accordingly. Several existing methods, such as _find_delta_log_folders, _is_parquet, _is_csv, _is_json, and _path_is_delta, have been updated to reflect these improvements. Furthermore, two new unit tests, test_mount_include_paths and test_mount_listing_csv_json, have been added to ensure the proper functioning of the TablesInMounts function with the new file formats and the include_paths_in_mount parameter. These changes aim to improve the functionality and flexibility of the TablesInMounts library, allowing for more precise crawling and identification of tables based on specific file formats and paths.
  • Added CTAS migration workflow for external tables cannot be in place migrated (#1510). In this release, we have added a new CTAS (Create Table As Select) migration workflow for external tables that cannot be migrated in-place. This feature includes a MigrateExternalTablesCTAS class with three tasks to migrate non-SYNC supported and non-HiveSerde external tables, migrate HiveSerde tables, and migrate views from the Hive Metastore to the Unity Catalog. We have also added new methods for managed and external table migration, deprecated old methods, and added a new test function to ensure proper CTAS migration for external tables using HiveSerDe. This change also introduces a new JSON file for external table configurations and a mock backend to simulate the Hive Metastore and test the migration process. Overall, these changes improve the migration capabilities for external tables and ensure a more flexible and reliable migration process.
  • Added Python linter for table creation with implicit format (#1435). A new linter has been added to the Python library to advise on implicit table formats when the 'writeTo', 'table', 'insertInto', or saveAsTable methods are invoked without an explicit format specified in the same chain of calls. This matters because in Databricks Runtime (DBR) v8.0 and later the default table format changed from parquet to 'delta'. The linter, implemented in 'table_creation.py', utilizes reusable AST utilities from 'python_ast_util.py' and is not automated, providing advice instead of fixing the code. The linter skips linting when a DBR version of 8.0 or higher is passed, since from 8.0 onwards the default is already delta and the advice only applies to code written for earlier runtimes. Unit tests have been added for both files as part of the code migration workflow (an example appears after this list).
  • Added Support for Migrating Table ACL of Interactive clusters using SPN (#1077). This change introduces support for migrating table Access Control Lists (ACLs) of interactive clusters using a Security Principal Name (SPN) for Azure Databricks environments in the UCX project. It includes modifications to the hive_metastore and workspace_access modules, as well as the addition of new classes, methods, and import statements for handling ACLs and grants. This feature enables more secure and granular control over table permissions when using SPN authentication for interactive clusters in Azure. This will benefit software engineers working with interactive clusters in Azure Databricks by enhancing security and providing more control over data access.
  • Added Support for migrating Schema/Catalog ACL for Interactive cluster (#1413). This commit adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, with partial fixes for issues #1192 and #1193. The changes identify and filter database ACL grants, create mappings from Hive metastore schema to Unity Catalog schema and catalog, and replace Hive metastore actions with equivalent Unity Catalog actions for both schema and catalog. External location permission is not included in this commit and will be addressed separately. New methods for creating mappings, updating principal ACLs, and getting catalog schema grants have been added, and existing functionalities have been modified to handle both AWS and Azure. The code has undergone manual testing and passed unit and integration tests. The changes are targeted towards software engineers who adopt the project.
  • Added databricks labs ucx logs command (#1350). A new command, 'databricks labs ucx logs', has been added to the open-source library to enhance logging and debugging capabilities. This command allows developers and administrators to view logs from the latest job run or specify a particular workflow name to display its logs. By default, logs with levels of INFO, WARNING, and ERROR are shown, but the --debug flag can be used for more detailed DEBUG logs. This feature utilizes the relay_logs method from the deployed_workflows object in the WorkspaceContext class and addresses issue #1282. The addition of this command aims to improve the usability and maintainability of the framework, making it easier for users to diagnose and resolve issues.
  • Added check for DBFS mounts in SQL code (#1351). A new feature has been introduced to check for Databricks File System (DBFS) mounts within SQL code, enhancing data management and accessibility in the Databricks environment. The dbfsqueries.py file in the databricks/labs/ucx/source_code directory now includes a function that verifies the presence of DBFS mounts in SQL queries and returns appropriate messages. The Languages class in the __init__ method has been updated to incorporate a new class, FromDbfsFolder, which replaces the existing from_table linter with a new linter, DBFSUsageLinter, for handling DBFS usage in SQL code. In addition, a Staff Software Engineer has improved the functionality of a DBFS usage linter tool by adding new methods to check for deprecated DBFS mounts in SQL code, returning deprecation warnings as needed. These enhancements ensure more robust handling of DBFS mounts throughout the system, allowing for better integration and management of DBFS-related issues in SQL-based operations.
  • Added check for circular view dependency (#1502). A circular view dependency check has been implemented to prevent issues caused by circular dependencies in views. This includes a new test for chained circular dependencies (A->B, B->C, C->A) and an update to the existing circular view dependency test. The checks have been implemented through modifications to the tests in test_views_sequencer.py, including a new test method and an update to the existing test method. If any circular dependencies are encountered during migration, ...
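A small, self-contained sketch of the kind of cycle check described in the last bullet (#1502); the real sequencer derives dependencies by parsing view SQL, whereas here the dependency map is passed in directly.

```python
def find_view_cycle(deps: dict[str, set[str]]) -> list[str] | None:
    """Return a dependency cycle among views (e.g. A->B->C->A), or None if there is none."""
    visiting, done = set(), set()

    def visit(view: str, path: list[str]) -> list[str] | None:
        if view in visiting:                       # back-edge found: a cycle
            return path[path.index(view):] + [view]
        if view in done:
            return None
        visiting.add(view)
        for dep in deps.get(view, set()):
            cycle = visit(dep, path + [view])
            if cycle:
                return cycle
        visiting.discard(view)
        done.add(view)
        return None

    for view in deps:
        cycle = visit(view, [])
        if cycle:
            return cycle
    return None

print(find_view_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}}))  # ['A', 'B', 'C', 'A']
```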
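And an illustration of what the implicit-format linter from #1435 above advises on: example user code with the explicit-format variant shown for contrast (df stands for any DataFrame).

```python
# Advised against on DBR < 8.0: no explicit format, so the runtime default (parquet) is assumed
df.write.saveAsTable("catalog.schema.events")

# Preferred: state the format explicitly so behaviour does not depend on the runtime default
df.write.format("delta").saveAsTable("catalog.schema.events")
```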

v0.21.0

29 Mar 19:21
@nfx
107fc5b
  • Ensure proper sequencing of view migrations (#1157). In this release, we have introduced a views_migrator module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue #1132. The module contains two main classes: ViewToMigrate and ViewsMigrator. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, __hash__, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file tables_and_views.json for unit testing and added a views_migrator module that takes a TablesCrawler object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references. This release is focused on improving the sequencing of view migrations and is accompanied by appropriate tests.
  • Experimental support for scanning Delta Tables inside Mount Points (#1095). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new TablesInMounts crawler. Users can now scan specific mount points using the --include-mounts flag and include Parquet files in the scan results with the --include-parquet-files flag. Additionally, the --filter-paths flag allows for filtering paths in a mount point and the --max-depth flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use databricks-labs-lsql~=0.3.0. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases.
  • Fixed NULL values in ucx.views.table_format to have UNKNOWN value instead (#1156). This commit includes a fix for handling NULL values in the table_format column of Views in the ucx.views.table_format module. Previously, NULL values were displayed as-is, but now they will be replaced with the string "UNKNOWN". This change is part of the fix for issue #115
  • Fixing run_workflow functionality for better error handling (#1159). In this release, the run_workflow method in the workflows.py file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, job_initial_run, has been added to initiate a job run and return the run ID, raising a NotFound exception if the job run is not found. The run_workflow functionality in the WorkflowsInstall module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an OperationFailed exception is raised during the installation process. These changes improve the robustness and stability of the system.
  • Use experimental Permissions Migration API also for Legacy Table ACLs (#1161). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now being utilized for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The TableAclSupport import and class have been removed, as they are no longer needed. A new apply_to_renamed_groups method has been added for production usage, and a apply_to_groups_with_different_names method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The permission_manager parameter has been removed from several test functions in the test_generic.py file and replaced with the MigrationState class, which is used directly with the WorkspaceClient object to apply permissions to groups with different names. The test_some_entitlements function in the test_scim.py file has also been updated to use the MigratedGroup class and the MigrationState class's apply_to_groups_with_different_names method. Finally, new tests for the Permissions Migration API have been added to the test_tacl.py file in the tests/integration/workspace_access directory to verify the behavior of the Permissions Migration API when migrating different grants.

Contributors: @ericvergnaud, @qziyuan, @nfx, @FastLee, @william-conti, @dmoore247, @pritishpai

v0.20.0

28 Mar 13:35
@nfx
f445a3d
  • Added ACL migration to migrate-tables workflow (#1135).
  • Added AVRO to supported format to be upgraded by SYNC (#1134). In this release, the hive_metastore package's tables.py file has been updated to add AVRO as a supported format for the SYNC upgrade functionality. AVRO is now included in the list of supported table formats in the is_format_supported_for_sync method, which checks that the table format is not None and that its uppercase value is one of the supported formats (see the format-check sketch after this list). A new format, BINARYFILE, has also been introduced and is not supported for SYNC upgrade. This release is part of the implementation of issue #1134, improving the compatibility of the SYNC upgrade functionality with various data formats.
  • Added is_partitioned column (#1130). A new column, is_partitioned, has been added to the ucx.tables table in the assessment module, indicating whether the table is partitioned, with values "Yes" or "No". This change addresses issue #871 and has been manually tested. The commit also includes updated documentation for the modified table. No new methods, CLI commands, workflows, or tests (unit, integration) have been introduced as part of this change.
  • Added assessment of interactive cluster usage compared to UC compute limitations (#1123).
  • Added external location validation when creating catalogs with create-catalogs-schemas command (#1110).
  • Added flag to Job to identify Job submitted by jar (#1088). This change adds a flag to the job information collected by UCX that identifies jobs submitted by a JAR, so that JAR-based job submissions can be distinguished from other jobs.
  • Bump databricks-sdk from 0.22.0 to 0.23.0 (#1121). In this version update, databricks-sdk is upgraded from 0.22.0 to 0.23.0, introducing significant changes to the handling of AWS and Azure identities. The AwsIamRole class is replaced with AwsIamRoleRequest in the databricks.sdk.service.catalog module, affecting the creation of AWS storage credentials using IAM roles (see the storage credential sketch after this list). The create function in src/databricks/labs/ucx/aws/credentials.py is updated to accommodate this modification. Additionally, the AwsIamRole argument in the create function of fixtures.py in the databricks/labs/ucx/mixins directory is replaced with AwsIamRoleRequest. The tests in tests/integration/aws/test_access.py are also updated to utilize AwsIamRoleRequest, and StorageCredentialInfo in tests/unit/azure/test_credentials.py now uses AwsIamRoleResponse instead of AwsIamRole. The new classes, AwsIamRoleRequest and AwsIamRoleResponse, split the request and response representations of AWS IAM roles. These changes require software engineers to thoroughly assess their codebase and adjust any relevant functions accordingly.
  • Deploy static views needed by #1123 interactive dashboard (#1139). In this update, we have added two new views, misc_patterns_vw and code_patterns_vw, to the install.py script in the databricks/labs/ucx directory. These views were originally intended to be deployed with a previous update (#1123) but were inadvertently overlooked. The addition of these views addresses issues with queries in the interactive dashboard. The deploy_schema function has been updated with two new lines, deployer.deploy_view("misc_patterns", "queries/views/misc_patterns.sql") and deployer.deploy_view("code_patterns", "queries/views/code_patterns.sql"), to deploy the new views using their respective SQL files from the queries/views directory. No other modifications have been made to the file.
  • Fixed Table ACL migration logic (#1149). This change corrects the logic used when migrating legacy Table ACLs.
  • Fixed AssertionError: assert '14.3.x-scala2.12' == '15.0.x-scala2.12' from nightly integration tests (#1120). The nightly integration tests were failing with an assertion mismatch between the expected and actual Databricks Runtime versions (14.3.x-scala2.12 versus 15.0.x-scala2.12); the tests have been updated so the expected runtime version matches the one actually used.
  • Increase code coverage by 1 percent (#1125).
  • Skip installation if remote and local version is the same, provide prompt to override (#1084). In this release, the new_installation workflow in the open-source library has been enhanced to include a new use case for handling identical remote and local versions of UCX. When the remote and local versions are the same, the user is now prompted; if no override is requested, a RuntimeWarning is raised. Additionally, users are now prompted to update the existing installation and, if confirmed, the installation proceeds. These modifications have been manually tested and are covered by new unit tests. These changes provide users with more control over their installation process and address a specific use case for handling identical UCX versions.
  • Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137). The dependency constraint for databricks-labs-lsql has been relaxed from ~=0.2.2 to >=0.2.2,<0.4.0, allowing the 0.3.x series to be installed while still excluding 0.4.0 and later.
  • [Experimental] Add support for permission migration API (#1080).
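
A minimal sketch of the kind of format check described in the AVRO entry above; the helper name and the exact set of supported formats are illustrative assumptions, not a copy of the hive_metastore code.

```python
# Sketch: decide whether a table format can be upgraded with SYNC.
# The concrete list of supported formats here is illustrative only.
SUPPORTED_FORMATS_FOR_SYNC = {"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"}


def is_format_supported_for_sync(table_format: str | None) -> bool:
    # Unknown/None formats (and formats such as BINARYFILE) are not eligible.
    return table_format is not None and table_format.upper() in SUPPORTED_FORMATS_FOR_SYNC
```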
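
A minimal sketch of creating an AWS storage credential with the renamed request class after the databricks-sdk 0.23.0 bump, assuming the SDK's storage credentials API; the credential name and role ARN are placeholders.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import AwsIamRoleRequest

ws = WorkspaceClient()

# After the 0.23.0 upgrade, the request side uses AwsIamRoleRequest instead of AwsIamRole.
credential = ws.storage_credentials.create(
    name="ucx-migration-credential",  # placeholder name
    aws_iam_role=AwsIamRoleRequest(role_arn="arn:aws:iam::123456789012:role/example-role"),
)
# The response carries the role back as an AwsIamRoleResponse.
print(credential.id, credential.aws_iam_role)
```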

Dependency updates:

  • Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137).

Contributors: @nkvuong, @nfx, @ericvergnaud, @pritishpai, @dleiva04, @dmoore247, @dependabot[bot], @qziyuan, @FastLee, @prajin-29

v0.19.0

25 Mar 10:42
@nfx
4478be3
  • Added instance pool id to WorkspaceConfig (#1087). In this release, the create method of the _policy_installer object has been updated to return an additional value, instance_pool_id, which is then assigned and passed as an argument to the WorkspaceConfig object in the _configure_new_installation method. The ClusterPolicyInstaller class in the v0.15.0_added_cluster_policy.py file has also been updated to return a fourth value, instance_pool_id, from the create method, allowing for more flexibility in future enhancements. Additionally, the test function test_table_migration_job in the test_installation.py file has been updated to skip when the script is not being run as part of a nightly test job or in debug mode, and the test functions in the test_policy.py file have been updated to reflect the new return value in the create method. These changes enable better management and scaling of resources through instance pools, provide more granular control in the WorkspaceConfig, and improve testing efficiency.
  • Added more cross-linking between CLI commands (#1091). In this release, we have introduced several enhancements to our open-source library's Command Line Interface (CLI) and documentation. Specifically, we have added more cross-linking between CLI commands to improve navigation and usability. The documentation has been updated to include a new step in the UCX installation process, where users are required to run the assessment workflow after installing UCX. This workflow is the first step in the migration process and checks the compatibility of the user's workspace with Unity Catalog. Additionally, we have added new commands for principal-prefix-access, migrate-credentials, and migrate-locations, which are part of the table migration process. These new commands require the assessment workflow and group migration workflow to be completed before they can be executed. Overall, these changes aim to provide a more streamlined and detailed installation and migration process, improving the user experience for software engineers.
  • Fixed command references in README.md (#1093). In this release, we have made improvements to the command references in the README.md file to enhance the overall readability and usability of the documentation for software engineers. Specifically, we have updated the links for the migrate-locations and validate_external_locations commands to use the correct syntax, enclosing them in backticks to denote code. This change ensures that the links are correctly interpreted as commands and addresses any issues that may have arisen with their previous formatting. It is important to note that no new methods have been added in this release, and the existing functionality of the commands has not been changed in scope or functionality.
  • Fixing the issue in workspace id flag in create-account-group command (#1094). In this update, the workspace_ids flag of the create_account_group command has been improved. The flag's type has been changed from list[int] | None to str | None, allowing multiple workspace IDs to be passed as a comma-separated string of integers (see the parsing sketch after this list). The create_account_level_groups function in the AccountWorkspaces class has been updated to accept this string and convert it to a list of integers before proceeding. To ensure proper functioning, a new test case, test_create_account_groups_with_id(), checks how the command behaves when no workspace IDs are provided in the configuration; the create_account_groups() method now checks for this condition and raises a ValueError. Furthermore, the manual_workspace_info() method has been updated to handle workspace names entered by the user, receiving the ws object along with prompts that contain the user input for the workspace name and the next workspace ID.
  • Rely UCX on the latest 14.3 LTS DBR instead of 15.x (#1097). In this release, we have implemented a quick fix to rely on the Long Term Support (LTS) version 14.3 of the Databricks Runtime (DBR) instead of 15.x for UCX, addressing issue #1096. This change affects the _definition function, which has been modified to use the latest LTS DBR instead of the latest Spark version. The latest_lts_dbr variable is now assigned the value returned by the select_spark_version method with the latest=True and long_term_support=True parameters (see the cluster policy sketch after this list). The spark_version key in the policy_definition dictionary is set to the value returned by the _policy_config method with latest_lts_dbr as the argument. Additionally, in the tests/unit/installer/test_policy.py file, the select_spark_version method of the clusters object has been updated to accept any number of arguments and consistently return the string "14.2.x-scala2.12", allowing for greater flexibility. This is a temporary solution, with a more comprehensive fix being tracked in issue #1098. Developers should be aware of how the clusters object is used in the codebase when adopting this project.
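
A minimal sketch of the comma-separated parsing described in the create-account-group fix above; the function name is a hypothetical stand-in for the UCX helper.

```python
def parse_workspace_ids(workspace_ids: str | None) -> list[int]:
    # The flag now arrives as a comma-separated string, e.g. "123,456,789".
    if not workspace_ids:
        raise ValueError("workspace ids are not provided in the configuration")
    return [int(part.strip()) for part in workspace_ids.split(",") if part.strip()]


assert parse_workspace_ids("123, 456") == [123, 456]
```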
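
A minimal sketch of pinning a cluster policy to the latest LTS runtime, as described in the entry above, assuming the Databricks SDK's select_spark_version helper; the policy fragment is illustrative, not the exact UCX policy definition.

```python
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()

# Pick the newest long-term-support DBR (e.g. 14.3 LTS) rather than the latest 15.x release.
latest_lts_dbr = ws.clusters.select_spark_version(latest=True, long_term_support=True)

# Illustrative policy fragment: fix the runtime version to the selected LTS release.
policy_definition = {
    "spark_version": {"type": "fixed", "value": latest_lts_dbr},
}
print(policy_definition)
```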

Contributors: @nfx, @qziyuan, @prajin-29