[FEATURE]: Unify file loading #1455

Closed
1 task done
Tracked by #1085
ericvergnaud opened this issue Apr 19, 2024 · 0 comments · Fixed by #1557
Labels
feat/cli CLI commands migrate/code Abstract Syntax Trees and other dark magic tech debt chores and design flaws

Comments

@ericvergnaud
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

As part of #1440, we have 3 different classes for files: LocalFile, WorkspaceFile and PackageFile. There is an opportunity to unify these classes and re-use code.

Proposed Solution

Create integration tests to cover all scenarios:

  • running as a job
  • running locally via the CLI
  • running locally outside the CLI

Based on the outcome, simplify the above classes.
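
As an illustration of the kind of simplification proposed above, here is a minimal sketch, not the UCX implementation. It assumes the three classes differ only in how raw bytes are fetched, so shared logic can live in one base class. `SourceFile`, `_read_bytes`, and `InMemoryFile` are hypothetical names introduced for this example; only `LocalFile` is named in the issue, and its real interface may differ.

```python
from __future__ import annotations

import abc
from pathlib import Path


class SourceFile(abc.ABC):
    """Hypothetical common base for LocalFile, WorkspaceFile and PackageFile."""

    def __init__(self, path: str):
        self.path = path

    @abc.abstractmethod
    def _read_bytes(self) -> bytes:
        """The only backend-specific step: fetch the raw content."""

    def read_text(self, encoding: str = "utf-8") -> str:
        # Decoding is shared, so it is written once in the base class.
        return self._read_bytes().decode(encoding)


class LocalFile(SourceFile):
    """Reads from the local file system (assumed behaviour)."""

    def _read_bytes(self) -> bytes:
        return Path(self.path).read_bytes()


class InMemoryFile(SourceFile):
    """Hypothetical stand-in for a workspace- or package-backed file in tests."""

    def __init__(self, path: str, content: bytes):
        super().__init__(path)
        self._content = content

    def _read_bytes(self) -> bytes:
        return self._content
```

With this shape, the three scenarios listed above (job, CLI, outside the CLI) would each exercise the same `read_text` path and vary only the backend.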

Additional Context

No response

@ericvergnaud ericvergnaud added enhancement New feature or request needs-triage labels Apr 19, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Apr 19, 2024
@nfx nfx added tech debt chores and design flaws feat/cli CLI commands migrate/code Abstract Syntax Trees and other dark magic and removed enhancement New feature or request needs-triage labels Apr 22, 2024
@nfx nfx moved this from Triage to Quarter Backlog in UCX Apr 24, 2024
nfx added a commit that referenced this issue Apr 25, 2024
This PR allows for more unified work with workspace notebooks and files

Fix #1455
@nfx nfx closed this as completed in #1557 Apr 25, 2024
@github-project-automation github-project-automation bot moved this from Quarter Backlog to Archive in UCX Apr 25, 2024
nfx added a commit that referenced this issue Apr 26, 2024
* Fix test failure: `test_running_real_remove_backup_groups_job` ([#1445](https://github.com/databrickslabs/ucx/issues/1445)). This release fixes a failure in the `test_running_real_remove_backup_groups_job` function in the `tests/integration/test_installation.py` file. The fix adds a retry mechanism that waits for the group to be deleted: it retries the `ws.groups.get` command for up to a minute in case of a `NotFound` or `InvalidParameterValue` exception. The change was verified by manual testing; no new unit or integration tests were added.
* A notebook linter to detect DBFS references within notebook cells ([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new linter has been developed for Notebooks that examines SQL and Python cells for references to DBFS (Databricks File System) mount points or folders, raising Advisory or Deprecated warnings as necessary. This enhances code security and maintainability by helping developers avoid potential issues when working with DBFS. The `NotebookLinter` class accepts a `Languages` object and a `Notebook` object in its constructor, and the `lint` method now checks for DBFS references in the notebook's cells. Two new methods, `original_offset` and `new_cell`, have been added to the `Cell` class, and the `extract_cells` method has been updated accordingly. The `_remove_magic_wrapper` method has also been improved for better code processing and reusability. This linter uses the sqlglot library with the `databricks` dialect to parse SQL statements, recognizing Databricks-specific SQL functions and syntax. This ensures that code using DBFS follows best practices and is up-to-date.
* Added CLI commands to trigger table migration workflow ([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new CLI command, "migrate-tables", has been added to facilitate table migration in a more flexible and convenient manner. This command, implemented in the "cli.py" file of the "databricks/labs/ucx" package, triggers the `migrate-tables` and `migrate-external-hiveserde-tables-in-place-experimental` workflows. It identifies tables with the `EXTERNAL_HIVESERDE` attribute and prompts the user to confirm running the migration for external HiveSerDe tables. The migration process can be assigned to a specific metastore and default catalog, with the latter set to `hive_metastore` if not specified. These changes provide improved table management and migration capabilities, offering greater control and ease of use for our software engineering audience.
* Added CSV, JSON and include path in mounts ([#1329](https://github.com/databrickslabs/ucx/issues/1329)). The latest open-source library update introduces CSV and JSON support in the TablesInMounts class, which crawls for tables within mounts. A new parameter, 'include_paths_in_mount', has been included to specify a list of paths for crawling. This feature allows users to crawl and include specific paths in their mount crawl, providing more fine-grained control over the crawling process. Additionally, new methods have been added to detect CSV, JSON, and partitioned Parquet files, while existing methods have been updated to handle the new parameter. New tests have been added to ensure that only the specified paths are included in the crawl and that the correct file formats are detected. These changes enhance the functionality and flexibility of the TablesInMounts feature, providing greater control and precision in crawling and detecting various file formats.
* Added CTAS migration workflow for external tables cannot be in place migrated ([#1510](https://github.com/databrickslabs/ucx/issues/1510)). A new CTAS (Create Table As Select) migration workflow has been added for external tables that cannot be migrated in-place, enabling more efficient and flexible data management. The `MigrateExternalTablesCTAS` method is added, facilitating the creation of a Change Data Capture (CDC) task for external tables using CTAS queries. New integration tests have been introduced, covering HiveSerDe format migration, and handling potential NotFound errors with retry decorators and timeout settings. Additionally, a new JSON file for testing has been added, enabling testing of migration workflows for external Hive tables that cannot be in-place migrated. New modules and methods for migrating hive serde tables in-place, handling other external CTAS tables, and managing hive serde CTAS tables have been added, and test cases have been updated to include these new methods.
* Added Python linter for table creation with implicit format ([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new linter has been implemented for Python code to advise on explicit table format specification when using Databricks Runtime (DBR) 8.0 and later versions. This change comes in response to the default table format changing from `parquet` to `delta` in DBR 8.0 when no format is specified. The linter checks for `writeTo`, `table`, `insertInto`, and `saveAsTable` method invocations without an explicit format and suggests updates to include an explicit format. It supports `format` invocation in the same chain of calls and as a direct argument for `saveAsTable`. Linting is the only functionality provided, and the linter skips linting when the DBR version is 8.0 or later. The linter is implemented in `table_creation.py`, making use of reusable AST utilities in `python_ast_util.py`, and is accompanied by unit tests. The `code migration` workflow has been updated to include this new linting functionality.
* Added Support for Migrating Table ACL of Interactive clusters using SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). A new class `ServicePrincipalClusterMapping` has been added to store the mapping between an interactive cluster and its corresponding Service Principal, and the `AzureServicePrincipalCrawler` class has been updated to include a new method `get_cluster_to_storage_mapping`. This method retrieves the mapping between clusters and their corresponding SPNs by iterating through all the clusters in the workspace, filtering out job clusters and clusters with specific data security modes, and retrieving the corresponding SPNs using the existing `_get_azure_spn_from_cluster_config` method. The retrieved mapping is then returned as a list of `ServicePrincipalClusterMapping` objects. Additionally, the commit adds support for migrating table ACLs of interactive clusters using a Service Principal Name (SPN) in Azure environments, achieved through the introduction of new classes, functions, and changes to existing functionality in various modules. These changes facilitate more flexible and secure ACL management for interactive clusters in Azure environments.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster ([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This pull request adds support for migrating schema and catalog ACLs for interactive clusters on AWS and Azure, partially addressing issues [#1192](https://github.com/databrickslabs/ucx/issues/1192) and [#1193](https://github.com/databrickslabs/ucx/issues/1193). It identifies database ACL grants from the PrincipalACL class, maps Hive Metastore schema to Unity Catalog (UC) schema and catalog using Table Mapping, and replaces Hive Metastore actions with equivalent UC actions. While it covers both cloud platforms, external location permission is excluded and will be addressed in a separate PR. Changes include updating the `_SPARK_CONF` variable in the test_migrate.py file and modifying the `test_migrate_external_tables_with_principal_acl_azure` function to skip tests in non-Azure environments. The `CatalogSchema` class now accepts a `principal_acl` parameter, and a new test function, `test_catalog_schema_acl`, has been added. This PR introduces new methods, modifies existing functionality, and includes unit, integration, and manual tests.
* Added `.suffix` override for notebooks in `WorkspacePath` ([#1557](https://github.com/databrickslabs/ucx/issues/1557)). A new `.suffix` override has been added to the `WorkspacePath` class for more consistent handling of workspace notebooks and files, addressing issue [#1455](https://github.com/databrickslabs/ucx/issues/1455). This enhancement includes a `Language` class import from `databricks.sdk.service.workspace` and enables setting the language and import format for uploading notebooks using the `make_notebook` fixture's `language` parameter. The commit also adds an `overwrite` parameter to handle existing notebook overwriting and modifies the `ws.workspace.upload` function to support new `language` and `format` parameters. Additionally, a new test case `test_file_and_notebook_in_same_folder_with_different_suffixes` in `test_wspath.py` ensures proper behavior when working with multiple file types in a single folder within the workspace.
* Added `databricks labs ucx logs` command ([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new `databricks labs ucx logs` command has been introduced, facilitating the logging of events in UCX installations, addressing issue [#1350](https://github.com/databrickslabs/ucx/issues/1350) and fixing [#1282](https://github.com/databrickslabs/ucx/issues/1282). The command is implemented in the `logs.py` file, and retrieves and logs the most recent run of each job, displaying a warning if there are no jobs to relay logs for. The implementation includes the `relay_logs` method, which logs records using `logging.getLogger`, and the `_fetch_logs` method to retrieve logs for a specified workflow and run. The `tests/unit/test_cli.py` file has been updated to include a new test case for the `logs` function, ensuring the logs are fetched correctly from the Databricks workspace. The `cli_command.py` module includes the new `logs` function, responsible for fetching logs and printing them to the console. Overall, this feature enhances the diagnostic capabilities of the UCX installer, providing a dedicated command for generating and managing logs.
* Added assessment workflow test with external hms ([#1460](https://github.com/databrickslabs/ucx/issues/1460)). This release introduces a new assessment workflow test using an external Hive Metastore Service (hms), which has been manually tested and verified on the staging environment. The `validate_workflow` function has been updated to allow skipping known failed tasks. A new method, `test_running_real_assessment_job_ext_hms`, has been added, which sets up an external hms cluster with specific configurations, grants permissions to a group, deploys and runs a workflow, and validates its success while skipping failed tasks on the SQL warehouse. The `test_migration_job_ext_hms` method has also been updated to include an assertion to check if the Hive Metastore version and GlueCatalog are enabled. Additionally, integration tests have been added to ensure the functionality of these new features. This release is aimed at improving the library's integration with external hms and providing more flexibility in testing and validating workflows.
* Added back prompts for table migration job cluster configuration ([#1195](https://github.com/databrickslabs/ucx/issues/1195)). A new function, `_config_table_migration`, has been added to the `install.py` file to improve the configuration of the table migration job cluster. This function allows users to set the parallelism, minimum and maximum number of workers for auto-scaling. The `spark_conf_dict` parameter is updated with the new spark configuration. The code has been refactored, simplifying the creation of schemas, catalogs, and tables, and improving readability. The `test_table_migration_job` function has been updated to utilize the new schema object, checking if the tables are migrated correctly and validating the configuration of the cluster. Additional properties such as `parallelism_for_migrating`, `min_workers_for_auto_scale`, and `max_workers_for_auto_scale` have been introduced for configuring the parallelism and number of workers for auto-scaling. These properties are passed as arguments to the `test_fresh_install` function. The `test_install_cluster_override_jobs` function has replaced the `WorkspaceInstallation` instance with `WorkflowsDeployment`, which may affect how the installation process handles clusters and jobs. The `test_fresh_install` function now includes configurations for SQL warehouse type, mapping workspace groups, and configuring the number of days for submit runs history, number of threads, policy ID, minimum and maximum workers, renamed group prefix, warehouse ID, workspace start path, and `spark_conf` with the parallelism configuration for spark SQL sources.
* Added check for DBFS mounts in SQL code ([#1351](https://github.com/databrickslabs/ucx/issues/1351)). In this release, we have added a check for Databricks File System (DBFS) mounts in SQL code, enhancing the system's ability to handle DBFS-related operations within SQL contexts. We have introduced a new `FromDbfsFolder` class in the DBFS module of the source code, which is added to the SQL SequentialLinter for SQL code checking. This change ensures that any references to DBFS mounts in SQL code are valid and properly formatted, improving the system's ability to validate SQL code that interacts with DBFS mounts. Additionally, we have updated the `test_dbfs.py` file with new methods to test DBFS-related functionality, and the `FromDbfsFolder` class is now responsible for identifying deprecated DBFS usage in SQL code. These updates provide developers with better insights into how DBFS usage is handled in SQL code and facilitate smoother data manipulation and retrieval for end-users and software engineers adopting this project.
* Added check for circular view dependency ([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular view dependency check has been implemented in the hive metastore to prevent infinite loops during view migrations. This check ensures that views do not depend on each other in a circular manner, handling cases where view A depends on view B, view B depends on view C, and view C depends on view A. Two new methods, `test_migrate_circular_views_raises_value_error` and `test_migrate_circular_view_chain_raises_value_error`, have been added to the `test_views_sequencer.py` file to test for circular view dependencies and circular dependency chains. These methods utilize a mock backend to simulate a real SQL backend and check if the code raises a `ValueError` with the correct error message when circular view dependencies are detected. Additionally, an existing test has been updated, and an error message related to circular view references has been modified. The changes have been manually tested and verified with unit tests. Integration tests and staging environment verification are pending.
* Added commands for metastores listing & assignment ([#1489](https://github.com/databrickslabs/ucx/issues/1489)). A new feature has been implemented in UCX to enhance metastore management and assignment. This feature includes two new commands: `assign-metastore` and `show-all-metastores`. The `assign-metastore` command automatically assigns a Unity Catalog metastore to a specified workspace, while the `show-all-metastores` command displays all possible metastores that can be assigned to a workspace. These changes have been tested manually and with unit tests, and new user documentation has been added to support this functionality. However, verification on a staging environment is still pending. The new methods have been implemented in the `cli_command.py` file, and the diff shows the addition of the `AccountMetastores` class and its import in the `cli_command.py` file. A new default catalog can be set using the default_namespace setting API.
* Added document for table migration workflow ([#1229](https://github.com/databrickslabs/ucx/issues/1229)). This release introduces detailed documentation for a table migration workflow, designed to facilitate the migration of tables from the Hive Metastore to the Unity Catalog in Databricks. The migration process consists of three stages: assessment, group migration, and the table migration workflow. The table migration workflow includes several tasks such as creating table mappings, migrating credentials, and creating catalogs and schemas in the Unity Catalog. The documentation includes the necessary commands to perform these tasks, along with dependency CLI commands like `create-table-mapping`, `principal-prefix-access`, `migrate-credentials`, `migrate-locations`, `create-catalogs-schemas`, and `create-uber-principal`. Additionally, the document covers table migration workflow tasks, including `migrate_dbfs_root_delta_tables` and `migrate_external_tables_sync`, along with other considerations such as running the workflow multiple times, setting higher workers for auto-scale, creating an instance pool, and manually editing job cluster configurations. The table migration workflow requires the assessment workflow and group migration workflow to be completed before running the table migration commands. The utility commands section includes the `ensure-assessment-run` command, the `repair-run` command, and other commands for UCX installation, configuration, and troubleshooting. This comprehensive documentation should assist developers and administrators in migrating tables from the Hive Metastore to the Unity Catalog in Databricks.
* Added error handling to udf crawling ([#1459](https://github.com/databrickslabs/ucx/issues/1459)). This commit addresses error handling in UDF (User Defined Function) crawling, specifically resolving flakiness in `test_permission_for_files_anonymous_func`. Changes include updates to the `apply_group_permissions` method in `manager.py`, introducing error gathering, checking for errors, and raising a `ManyError` exception if necessary. Additionally, the `test_tables_returning_error_when_show_tables` test has been modified to correctly check for a non-existent schema in the Hive Metastore, resolving inconsistencies in test behavior. The `snapshot` method in `logs.py` has been revised to handle specific error messages during testing, enhancing the reliability of UDF crawling. These changes have been manually tested and verified in a staging environment.
* Added functionality to migrate external tables using Create Table (No Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A new feature has been implemented in the open-source library to enable migrating external tables in Databricks' Hive metastore using the "Create Table (No Sync)" method. This feature introduces new methods `_migrate_non_sync_table` and `_get_create_in_place_sql` for handling migration and SQL query generation. The existing methods `_migrate_dbfs_root_table` and `_migrate_acl` have been updated to accommodate these changes. Additionally, a new test case has been added to demonstrate migration of external tables while preserving their location and properties. During migration, SQL queries are generated using the `sqlglot` library, with the SQL create statement for a given table key being obtained through the newly implemented `sql_show_create` method. The `sql_migrate_view` method has also been updated to create a view if it doesn't already exist. The implementation includes a new file in the `tests/unit/hive_metastore/tables/` directory, containing JSON data representing source and destination of migration, including catalog, database, name, object type, table format, workspace name, catalog name, schema mappings, and table mappings.
* Added initial version of account-level installer ([#1339](https://github.com/databrickslabs/ucx/issues/1339)). The commit introduces the initial version of an account-level installer that enables account administrators to install UCX (Unity Catalog eXtensions) on all workspaces in a Databricks account simultaneously. The installer performs necessary authentication to log in to the account, prompts for configuration for the first workspace, runs the installer, and then confirms if the user wants to repeat the process for the remaining workspaces. A new method `prompt_for_new_installation` saves answers to a new `InstallationConfig` data class, allowing answers to be reused for other workspaces. The command `databricks labs install ucx` now supports an account-level installation mode with the environment variable `UCX_FORCE_INSTALL` set to `account`. The changes include handling for `PermissionDenied`, `NotFound`, and `ValueError` exceptions, as well as modifications to the `sync_workspace_info` method to accept a list of workspaces. The `README.md` file has been updated with new sections on advanced force install over existing UCX and installing UCX on all workspaces within a Databricks account. The commit also modifies the `hms_lineage.py` method `apply` to include a new parameter `is_account_install`, which determines whether the HMS lineage init script should be enabled or a new global init script should be added, regardless of the user's response to prompts. Relevant user documentation and tests have been added, and the changes are manually tested. The commit additionally introduces a new method `AccountInstaller` and modifies the existing command `databricks labs install ucx ...`.
* Added integration tests with external HMS & Glue ([#1408](https://github.com/databrickslabs/ucx/issues/1408)). In this release, we have added integration tests for end-to-end workflows with an external Hive Metastore (HMS) and AWS Glue. The new test suite `test_ext_hms.py` utilizes a `sql_backend` fixture with `CommandContextBackend` to execute queries on a cluster with the external HMS set up, and requires a new environment variable `TEST_EXT_HMS_CLUSTER_ID`. Additionally, we have introduced a `make_mounted_location` fixture in `fixtures.py` for testing mounted locations in DBFS with a random suffix. The changes include updates to existing tests for migrating managed tables, tables with cache, external tables, and views, and the addition of tests for reverting migrated tables and handling table mappings. We have also added tests for migrating managed tables with ACLs and introduced a new `CommandContextBackend` class with methods for executing and fetching SQL commands, saving tables, and truncating tables. The new test suite includes manual testing, integration tests, and verification on a staging environment.
* Added linting for DBFS usage ([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new file, dbfs.py, has been added to the project, implementing a linter to detect DBFS (Databricks File System) file system paths in Python code. The linter uses an AST (Abstract Syntax Tree) visitor pattern to search for file system paths within the code, returning Deprecation or Advisory warnings for deprecated usage in calls or constant strings, respectively. This will help project maintainers and users identify and migrate away from deprecated file system paths in their Python code. The linter is also capable of detecting the usage of DBFS paths in string constants, function calls, and variable assignments, recognizing three types of DBFS path patterns and spark.read.parquet() function calls that use DBFS paths. The addition of this feature will ensure the proper usage of file systems in the code and aid in the transition from DBFS to other file systems.
* Added log task to parse logs and store the logs in the ucx database ([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log task has been added that parses logs and stores them in the ucx database, with the ability to only store logs that exceed a minimum log level. The log crawler task has been added to all workflows after other tasks have run. A new CLI command has been added to retrieve errors and warnings from the latest workflow run. The LogRecord has been updated to include all relevant fields. The functionality is thoroughly tested with unit and integration tests. Existing workflows are modified and a new table for logs is added to the SQL database. User documentation and new methods have been added where necessary. This commit resolves issues [#1148](https://github.com/databrickslabs/ucx/issues/1148) and [#1283](https://github.com/databrickslabs/ucx/issues/1283).
* Added migration for non delta dbfs tables using Create Table As Select (CTAS). Convert such tables to Delta tables ([#1434](https://github.com/databrickslabs/ucx/issues/1434)). This release introduces enhancements to migrate non-Delta DBFS root tables to managed Delta tables, expanding support for various table types and configurations during migration. New methods have been added to improve CTAS functionality and SQL statement generation safety. Grant assignments are now supported during migration, along with updated integration tests and additional table format compatibility. The release includes code modifications to import `escape_sql_identifier`, add new methods like `_migrate_table_create_ctas` and `_get_create_in_place_sql`, and update existing methods such as `_migrate_non_sync_table`. Specific changes in the diff file include modifications to "fixtures.py", where the `table_type` variable is set to "TableType.EXTERNAL" for non-Delta tables, and the SQL statement is adjusted accordingly. Additionally, a new test has been added for migrating non-Delta DBFS root tables, ensuring migration success by checking target table properties.
* Added migration for views sequentially ([#1177](https://github.com/databrickslabs/ucx/issues/1177)). The `Migrate views sequentially` feature modifies the views migration process in the Hive metastore to provide better clarity and control. The `ViewsMigrator` class has been renamed to `ViewsMigrationSequencer` and now processes a list of `TableToMigrate` instances instead of fetching tables from `TablesCrawler`. This change introduces a new method, `_migrate_views`, to manage batches of views during migration, ensuring that preliminary tasks have succeeded before running tasks. The `migrate_table` method of `TableMigrate` now requires a mandatory `what` argument to prevent accidental view migrations, and the corresponding tests are updated accordingly. This feature does not add new documentation, CLI commands, or tables, but it modifies an existing command and workflow. Unit tests are added for the new functionality, and the target audience is software engineers who adopt this project. While this commit resolves issue [#1172](https://github.com/databrickslabs/ucx/issues/1172), integration tests are still required for comprehensive validation. Software engineers reviewing the code should focus on understanding the logic behind the renaming and the new `__hash__` and `__eq__` methods in the `TableToMigrate` class to maintain and extend the functionality in a consistent manner.
* Added missing step sync-workspace-info ([#1330](https://github.com/databrickslabs/ucx/issues/1330)). A new step, "sync-workspace-info," has been added to the table migration workflow in the CLI subgraph, prior to the `create-table-mapping` step. This step is designed to synchronize workspace information, ensuring its accuracy and currency before creating table mappings. These changes are confined to the table migration workflow and do not affect other parts of the project. The README file has been updated to reflect the new step in the Table Migration Workflow section, providing detailed information for software engineers. The addition of `sync-workspace-info` aims to streamline the migration process, enhancing the overall efficiency and reliability of the open-source library.
* Added roadmap workflows and tasks to Table Migration Workflow document ([#1274](https://github.com/databrickslabs/ucx/issues/1274)). The table migration workflow has been significantly enhanced in this release to provide additional functionality and flexibility. The `migrate-tables` workflow now includes new tasks for creating table mappings, catalogs and schemas, a principal, prefixing access, migrating credentials and locations, and creating catalog schemas. Additionally, there are new workflows for migrating views, migrating tables using CTAS, and experimentally migrating ParquetHiveSerDe, OrcSerde, AvroSerde, LazySimpleSerDe, JsonSerDe, and OpenCSVSerde tables in place. An experimental workflow for migrating Delta and Parquet data found in DBFS mounts but not registered as Hive Metastore tables into UC tables has also been introduced. Due to the complexity of the migration process, multiple runs of the workflow may be necessary to ensure successful migration of all tables. For more detailed information, please refer to the table migration design.
* Added support for %pip cells ([#1401](https://github.com/databrickslabs/ucx/issues/1401)). The recent commit introduces support for a new `PipCell` in the notebook functionality, enabling the execution of pip commands directly in the notebook environment. The `PipCell` comes with methods specific to its functionality, such as `language`, `is_runnable`, `build_dependency_graph`, and `migrate_notebook_path`. The `language` property returns the string `PIP`, and the `is_runnable` method returns `True`, indicating that this type of cell can be executed. The `build_dependency_graph` and `migrate_notebook_path` methods are currently empty but may be implemented in the future. Additionally, the `CellLanguage` enumeration has been updated to include a new item for the `PIP` language. This change also includes the addition of a new magic command `%pip install some-package`, allowing for easy installation and management of python packages within the notebook. Furthermore, the commit introduces a new tuple `PIP_NOTEBOOK_SAMPLE` in the `test_notebook.py` file for testing pip cells in the notebook, thereby enhancing the versatility of the project. Overall, this commit adds a new, useful functionality for running pip commands within a notebook context.
* Added support for %sh cells ([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new cell type, SHELL, has been introduced in this release, which is implemented in the `ShellCell` class. The `language` property of this class returns `CellLanguage.SHELL`. The `is_runnable` method has been added and returns `True`, but it is marked as `TODO`. The `build_dependency_graph` and `migrate_notebook_path` methods are no-ops. A new case for the `SHELL` CellLanguage has been added to the `CellLanguage` Enum and assigned to the `ShellCell` class. The release also includes a new sample notebook, "notebook-with-shell-cell.py.txt", with a shell script that can be executed using the `%sh` magic command. Two new tuples, `SHELL_NOTEBOOK_SAMPLE` and `PIP_NOTEBOOK_SAMPLE`, have been added to `source_code/test_notebook.py` for testing the new `%sh` cell functionality. Overall, this release adds support for the new `SHELL` cell type, but does not implement any specific behavior for it yet.
* Added support for migrating Table ACL for interactive cluster in AWS using Instance Profile ([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This change adds support for migrating table Access Control Lists (ACLs) in AWS for interactive clusters utilizing Instance Profiles. The update introduces a new method, `get_iam_role_from_cluster_policy`, which replaces the previous `_get_iam_role_from_cluster_policy` method. This new method extracts the IAM role ARN from the cluster policy JSON object and returns the IAM role name. The `create_uber_principal` method has also been updated to use the new `get_iam_role_from_cluster_policy` method for determining the IAM role name in the cluster policy. Additionally, AWS and Google Cloud Platform (GCP) support has been added to the `principal_locations` method, which now checks for Azure, AWS, and GCP in that order. If GCP is not detected, a `NotImplementedError` is raised. These enhancements improve the migration process for table ACLs in AWS interactive clusters by utilizing Instance Profiles and providing unified handling for ACL migration across multiple cloud providers.
* Added support for views in `table-migration` workflow ([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new file, `migration_status.py`, has been added to track table migration status in a Hive metastore, and the `MigrationStatusRefresher` class has been updated to use a new approach for migrating views. The files `views_sequencer.py` and `test_views_sequencer.py` have been renamed to `view_migrate.py` and `test_view_migrate.py`, respectively. A new `MigrationIndex` class has been introduced in the `migration_status` module to keep track of the migration status of tables. The `ViewMigrationSequencer` class has been updated to accept a `migration_index` as an argument, which is used to determine the migration order of views. Relevant tests have been updated to reflect these changes and cover different scenarios of view migration, including views with no dependencies, direct views, views with dependencies, and deep nested views. The changes also include rewriting view code to point to the migrated tables and decoupling the queries module from `table_migrate`.
* Added workflow for in-place migrating external Parquet, Orc, Avro hiveserde tables ([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This change introduces a workflow for in-place upgrading external Hive tables with Parquet, ORC, or Avro hiveserde formats. A new workflow, `MigrateHiveSerdeTablesInPlace`, has been added, which upgrades the specified hiveserde tables to Unity Catalog. The `tables.py` module includes new functions to describe the table, extract hiveserde details, and update the DDL with the new table name and mount point if necessary. A new function, `_migrate_external_table_hiveserde`, has been added to `table_migrate.py`, and the `TablesMigrator` class now includes two new arguments: `mounts` and `hiveserde_in_place_migrate`. These arguments control which hiveserde to migrate and replace the DBFS mnt table location if any. This allows for multiple tasks to run in parallel and migrate only one type of hiveserde at a time. The majority of the code from a previous pull request has been removed as only a subset of table formats can be in-place migrated to UC with DDL from `show create table`. This change includes new unit and integration tests and has been manually tested.
* Addressed Issue with Disabled Feature in certain regions ([#1275](https://github.com/databrickslabs/ucx/issues/1275)). This release addresses issue [#1275](https://github.com/databrickslabs/ucx/issues/1275), which is related to a feature that is disabled in certain regions. A new class attribute, `ERRORS_TO_IGNORE`, with a value of `["FEATURE_DISABLED"]`, has been added to the `PermissionManager` class. The `inventorize_permissions` method now handles the `FEATURE_DISABLED` error by logging it and skipping it instead of raising an exception, which makes the inventory more robust. A new test, `test_manager_inventorize_ignore_error`, covers this case: it defines a helper function, `raise_error`, that raises a `DatabricksError` with a specific error message and code, initializes a `PermissionManager` with a mock `some_crawler` object, calls `inventorize_permissions`, and asserts that the expected data is written to the `hive_metastore.test_database.permissions` table. The test changes are limited to `tests/unit/workspace_access/test_manager.py`, where `test_manager_inventorize` is modified and `test_manager_inventorize_ignore_error` is added.
* Addressed a bug with AWS UC Role Update. Adding unit tests ([#1429](https://github.com/databrickslabs/ucx/issues/1429)). A bug in the AWS Unity Catalog (UC) Role's trust policy update feature has been resolved by updating the `aws.py` file with a new method `_databricks_trust_statement`. This enhancement accurately updates the trust policy for UC roles, with modifications in the `create_uc_role` and `update_uc_trust_role` methods. New unit tests have been added, including the `test_update_uc_trust_role_append` function, which incorporates a mocked AWS CLI command for updating trust relationship policies and checks for updated trust relationships containing two principals and external ID matching conditions. The test function also includes a new mocked response for the `iam get-role` command, returning the role details with an updated ARN to verify if the trust relationship policy is updated correctly. This improvement simplifies the trust policy document generation and enhances the overall functionality of the feature.
* Allow reinstall when retry the failed table migration integration test ([#1224](https://github.com/databrickslabs/ucx/issues/1224)). This update makes it possible to reinstall a table migration job when a failed integration test is retried, addressing issue [#1224](https://github.com/databrickslabs/ucx/issues/1224). Previously, if the table migration job failed during an integration test, the test could not be retried because the installation was marked as failed. Now, when `test_table_migration_job` and `test_table_migration_job_cluster_override` run, the "Do you want to update the existing installation?" prompt can be answered with `yes` to proceed with the reinstallation. This is implemented by passing the `extend_prompts` parameter, a dictionary containing the new prompt, to the `new_installation` call in both functions. As a result, the test can be retried and the installation is marked as successful if the table migration job succeeds.
* Build dependency graph for local files ([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This commit introduces a local file dependency graph builder and refactors the dependency classes to differentiate between resolution and loading. A new method, `LocalFileMigrator.build_dependency_graph`, follows the pattern of `NotebookMigrator` for building dependency graphs of local files. The `DependencyResolver` class and its methods have been refactored for clarity. The `Whitelist` class is used to parse a compatibility catalog file, and the `get_advices` method of `DependencyResolver` returns recommendations for updating to compatible module versions. Test functions compare expected and actual advice objects for correct recommendations. No changes to user documentation, CLI commands, workflows, or tables are made in this commit. Unit tests have been added to ensure that the changes work as expected.
* Build dependency graph for site packages ([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This commit introduces a dependency graph for site packages, adding package files as dependencies if they're not recognized during import, and addressing an infinite loop issue in cyclical graphs. The changes involve introducing new classes `WrappingLoader`, `SitePackageContainer`, and `SitePackage`, as well as updating the `DependencyResolver` class to use the new `SitePackages` object for locating dependencies. Additionally, this commit resolves an issue where the `locate_dependency` method would incorrectly match graph paths starting with './', and includes new unit tests and the removal of a deprecation warning for a specific dependency. The target audience for this commit is software engineers adopting the project.
* Build notebook dependency graph for `%run` cells ([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new `Notebook` class is implemented to parse source code and split it into cells, and a `NotebookDependencyGraph` class is added with related utilities to discover dependencies in `%run` cells, addressing issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). This new functionality allows for the creation of a dependency graph, aiding in better code organization and understanding dependencies in the code. The `Notebook` class defines a `parse` method to process source code and return a `Notebook` object, and a `to_migrated_code` method to apply necessary modifications for running code in different environments. The `NotebookDependencyGraph` class offers a `build_dependency_graph` method to construct a directed acyclic graph (DAG) of dependencies between cells. The commit also includes renaming the test file for the notebook migrator and updating the `Notebooks` class to `NotebookMigrator` in the test functions.
* Bump actions/checkout from 3 to 4 ([#1191](https://github.com/databrickslabs/ucx/issues/1191)). In this release, the "actions/checkout" dependency version has been updated from 3 to 4 in the project's acceptance workflow, addressing issue [#1191](https://github.com/databrickslabs/ucx/issues/1191). The new "actions/checkout@v4" improves the reliability and performance of the code checkout process, with better handling of shallow clones and submodules. This is achieved by setting the fetch-depth to 0 for a full clone of the repository, ensuring all bug fixes and improvements of the latest version are utilized. The update provides enhanced submodule handling and improved performance, resulting in a more stable and efficient checkout process for the project's CI/CD pipeline.
* Bump actions/setup-python from 4 to 5 ([#1189](https://github.com/databrickslabs/ucx/issues/1189)). In this release, the `actions/setup-python` action has been updated from version 4 to 5 in several workflow files, including `.github/workflows/acceptance.yml`, `.github/workflows/push.yml`, and `.github/workflows/release.yml`. This update ensures the usage of the latest available version of the Python environment setup action, which may include bug fixes, performance improvements, and new features. The `setup-python@v5` action is configured with appropriate cache and installation parameters. Software engineers should review this change to ensure compatibility with the project's specific requirements and configurations related to `actions/setup-python`.
* Bump codecov/codecov-action from 1 to 4 ([#1190](https://github.com/databrickslabs/ucx/issues/1190)). In this update, we have improved the project's CI/CD workflow by upgrading the `codecov-action` version from 1 to 4. This change primarily affects the `Publish test coverage` job, where we have replaced `codecov/codecov-action@v1` with `codecov/codecov-action@v4`. By implementing this update, the most recent features and bug fixes from `codecov-action` will be utilized in the project's testing and coverage reporting. Moreover, this update may introduce modifications to the `codecov-action` configuration options and input parameters, requiring users to review the updated documentation to ensure their usage remains correct. The anticipated benefit of this change is enhanced accuracy and up-to-date test coverage reporting for the project.
* Bump databricks-sdk from 0.23.0 to 0.24.0 ([#1223](https://github.com/databrickslabs/ucx/issues/1223)). In this release, the dependency for the `databricks-sdk` package has been updated from version 0.23.0 to 0.24.0, which may include bug fixes, performance improvements, or new features. Additionally, specific versions have been fixed for `databricks-labs-lsql` and `databricks-labs-blueprint`, and the `PyYAML` package version has been constrained to a range between 6.0.0 and 7.0.0. This update enhances the reliability and compatibility of the project with other libraries and packages. However, the `crawl` function in `jobs.py` now uses the `RunType` type instead of `ListRunsRunType` when calling the `list_runs` method, which could affect the job's behavior. Therefore, further investigation is required to ensure that the updated functionality aligns with the expected behavior.
* Bump softprops/action-gh-release from 1 to 2 ([#1188](https://github.com/databrickslabs/ucx/issues/1188)). The softprops/action-gh-release package has been updated from version 1 to version 2 in this release, enhancing the reliability and efficiency of release automation. The update specifically affects the "release.yml" file in the ".github/workflows" directory, where the action-gh-release is called. While there are no specific details about the changes included in this version, it is expected to contain bug fixes, performance improvements, and possibly new features. By updating to the latest version, software engineers can ensure the smooth operation of their release processes, taking advantage of the enhanced functionality and improved performance.
* Bumped databricks-sdk from 0.24.0 to 0.26.0 ([#1388](https://github.com/databrickslabs/ucx/issues/1388)). In this release, the databricks-sdk version has been updated from 0.24.0 to 0.26.0 to resolve a breaking change where the `AzureManagedIdentity` struct has been split into `AzureManagedIdentityResponse` and `AzureManagedIdentityRequest`. This change enhances the library's compatibility with Azure services. The `tests/unit/aws/test_credentials.py` file has been updated to replace `AzureManagedIdentity` instances with `AzureManagedIdentityResponse`. The `AzureManagedIdentityResponse` struct represents the response from the Databricks SDK when requesting information about an Azure managed identity, while the `AzureManagedIdentityRequest` struct represents the request sent to the Databricks SDK when requesting the creation or modification of an Azure managed identity. These changes improve the codebase's modularity and maintainability, allowing for clearer separation of concerns and more flexibility in handling managed identities in Databricks. The updated `AzureManagedIdentityResponse` and `AzureManagedIdentityRequest` structs have been manually tested, but they have not been verified on a staging environment. The functionality of the code remains mostly the same, except for the split `AzureManagedIdentity` struct. The revised dependencies list includes "databricks-sdk~=0.26.0", "databricks-labs-lsql~=0.4.0", "databricks-labs-blueprint~=0.4.3", "PyYAML>=6.0.0,<7.0.0", and "sqlglot>=23.9,<23.12".
* Cleaned up integration test suite ([#1422](https://github.com/databrickslabs/ucx/issues/1422)). In this release, the integration test suite has been improved through the removal of the outdated `test_mount_listing` test and the fixing of `test_runtime_crawl_permissions`. The fixed test now accurately checks permissions for runtime crawlers and verifies that tables are dropped before being created. The API client and permissions of the WorkspaceClient are now properly handled, with the `do` and `permissions.get` methods returning an empty value. These changes ensure that the integration tests are up-to-date, accurate, functional, and have been manually tested. Issue [#1129](https://github.com/databrickslabs/ucx/issues/1129) has been resolved as a result of these improvements.
* Create UC External Location, Schema, and Table Grants based on workspace-wide Azure SPN mount points ([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This PR introduces new functionality to create UC external location, schema, and table grants based on workspace-wide Azure SPN mount points. The `get_interactive_cluster_grants` function in the `grants.py` file has been updated to include new grants for principals and catalog grants for the `hive_metastore` catalog. The `_get_privilege` function has also been updated to accept `locations` and `mounts` inputs. Additionally, new test methods `test_migrate_external_tables_with_principal_acl_azure` and `test_migrate_external_tables_with_spn_azure` have been added to test migrating managed and external tables with principal and SPN ACLs in Azure. Existing test methods have been modified to support a new user and UCX group access. These changes improve the management of UC resources in a Databricks workspace and are tested through manual testing and the addition of unit tests. However, there is no mention of integration tests or verification on staging environments in this PR.
* Decouple `InstallState` from `WorkflowsDeployment` constructor ([#1246](https://github.com/databrickslabs/ucx/issues/1246)). In pull request [#1209](https://github.com/databrickslabs/ucx/issues/1209), the `InstallState` class was decoupled from the `WorkflowsDeployment` constructor in the `install.py` file. This refactoring allows for increased modularity and maintainability by representing the state of an installation with the `InstallState` class, which includes information such as status and configuration. The `InstallState` class is created from the `Installation` object using the `from_installation` class method in the `run` and `current` methods, and is then passed as an argument to the `WorkflowsDeployment` constructor. This change also affects several methods, such as `create_jobs`, `repair_run`, `latest_job_status`, and others, which have been updated to use `InstallState` instead of `Installation`. In the `test_installation.py` file, the `WorkflowsDeployment` constructor has been updated to accept an `InstallState` object as a separate argument. This refactoring improves the code's decoupling, readability, and flexibility, allowing for more customization and configuration of `InstallState` independently of the `installation` object.
* Decouple `InstallState` from `WorkspaceDeployment` constructor. In this refactoring change, the `InstallState` object has been decoupled from the `WorkspaceDeployment` constructor, improving modularity and maintainability. The `InstallState` object is now initialized separately and passed as an argument to the `WorkspaceInstallation` and `WorkflowsDeployment` classes. The `state` property has been removed from the `WorkspaceDeployment` class, and the `run_workflow` and `validate_step` methods now access the `_install_state` object directly. The `_create_dashboards`, `_trigger_workflow`, and `_remove_jobs` methods have been updated to use `self._install_state` instead of `self._workflows_installer.state`. This change does not impact functionality but enhances the flexibility of managing the installation state and enables easier testing and modification of the `InstallState` object separately from the `WorkspaceDeployment` class.
* Delete src/databricks/labs/ucx/source_code/dbfsqueries.py ([#1396](https://github.com/databrickslabs/ucx/issues/1396)). In this release, we have removed the DBFS querying functionality previously implemented in the `dbfsqueries.py` module located at `src/databricks/labs/ucx/source_code/`. This change is intended to streamline the project and improve maintainability. As a result, users should note that any existing code that relies on this functionality will no longer work and should be updated accordingly. We have decided to remove this file because the DBFS querying code is either no longer needed or will be implemented differently in the future. We recommend that users familiarize themselves with this change and adjust their code accordingly. Specific details about the implementation cannot be provided, as the file has been completely deleted.
* Detect DBFS use in SQL statements in notebooks ([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new linter, 'notebook-linter', has been implemented to detect and raise advisories for DBFS (Databricks File System) usage in SQL statements within notebooks. This feature helps in identifying and migrating away from DBFS references. The linter is a class that parses a Databricks notebook and applies available linters to the code cells based on the language of the cell. It specifically checks for DBFS usage in SQL statements and raises advisories accordingly. New unit tests have been added to ensure the functionality of the linter and manual testing has been conducted. This feature resolves issue [#1108](https://github.com/databrickslabs/ucx/issues/1108) and promotes best practices for file system usage within notebooks.
* Detect `sys.path` manipulation ([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A new feature has been added to our open-source library that enables the detection of `sys.path` manipulation in Python code. This functionality is implemented through updates to the `PythonLinter` class, which now includes methods to parse the abstract syntax tree (AST) and identify modifications to `sys.path`. Additionally, the linter can now list imported sources and appended `sys.path` entries with the new `list_import_sources` and `list_appended_sys_paths` methods, respectively. The new functionality is covered by several test cases and is accompanied by updates to the documentation, CLI command, and existing workflows. Unit tests have been added to ensure the proper functioning of the new feature, which resolves issue [#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202).
* Detect direct access to cloud storage and raise a deprecation warning ([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this release, the PySpark linter has been updated to detect and issue deprecation warnings for direct access to cloud storage. This new check complements the existing functionality of detecting table names that need migration. The changes include the addition of a new `AstHelper` class to extract fully-qualified function names from PySpark AST nodes and a `DirectFilesystemAccessMatcher` class to match calls to functions that perform direct filesystem access. The `TableNameMatcher` class has been updated to check if the table name is a constant string and raise a deprecation warning if the table has been migrated in the Unity Catalog. These updates aim to encourage the use of more secure and recommended methods for accessing cloud storage in PySpark code. This feature resolves issue [#1133](https://github.com/databrickslabs/ucx/issues/1133) and is signed off by Jim Idle.
* Detect imported files and packages ([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This commit introduces new functionality to parse Python code for `import` and `import from` processing instructions, allowing for the detection of imported files and packages. This resolves issue [#1346](https://github.com/databrickslabs/ucx/issues/1346) and is related to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). The changes include a new method for loading files, modifications to an existing method for loading objects, and new conditions for error checking. A new `WorkspaceFile` object is created for each imported file, and the `_load_source` method has been updated to validate object info. The functionality is tested through unit tests and is not dependent on any other files. User documentation and a new CLI command have been added. Additionally, a new workflow and table have been added, and an existing workflow and table have been modified. Co-authored by Cor.
* Document troubleshooting guide ([#1226](https://github.com/databrickslabs/ucx/issues/1226)). A new troubleshooting guide has been added to the UCX toolkit documentation, providing comprehensive guidance on identifying and resolving common errors. The guide includes instructions for gathering and interpreting logs from both Databricks workspace and UCX command line, as well as resources for further assistance, such as the UCX GitHub repository, Databricks community, Databricks support, and Databricks partners. It also covers specific error scenarios, including cryptic authentication errors and issues with UCX installation, with detailed steps for troubleshooting and resolution. The guide can be found in the docs/troubleshooting.md file and is linked from the main README.md, which has undergone minor revisions to installation and migration processes, as well as the removal of the previous link for questions and bug fixes in favor of the new troubleshooting guide.
* Don't fail `main` branch build with `no-cheat` ([#1461](https://github.com/databrickslabs/ucx/issues/1461)). A new GitHub Actions workflow called `no-cheat` has been developed to maintain code quality and consistency in pull requests. This workflow checks out the code with full history and verifies that no linter directives have been disabled in the new code added in the pull request. If any linter directives are found, the workflow will cause the build to fail and print a message indicating the number of instances of linter directives found. This feature is especially useful for projects that value code quality and consistency, as it helps to maintain a uniform code style throughout the codebase. Additionally, changes have been made to the push.yml file in the .github/workflows directory to ensure that the linter is not disabled in new code added to the main branch, causing the build to fail if it is. This reinforces the project's commitment to maintaining high code standards.
* Enforced removal of commented-out code on `make fmt` ([#1493](https://github.com/databrickslabs/ucx/issues/1493)). In this release, the `make fmt` process has been updated to enforce the removal of dead commented-out code through the use of `pylint`. The `pyproject.toml` file has been modified to utilize the new version of `databricks-labs-pylint` (0.3.0) and incorporate the `databricks.labs.pylint.eradicate` plugin for identifying and removing dead code. The `databricks.labs.ucx.hive_metastore.locations` module, specifically the `locations.py` and `lsp.py` files, has undergone changes to eliminate dead code and update commented-out lines. The `do_POST` and `do_GET` methods in the `lsp.py` file have been affected. No functional changes have been introduced; the modifications focus on removing dead code and improving code quality. This update enhances code readability and maintainability, promotes consistency, and eliminates unnecessary code accumulation throughout the project. A test case in the "test_generic.py" file verifies the removal of dead code, further ensuring the codebase's integrity and reliability. This release offers a cleaner, more efficient, and consistent codebase for software engineers to adopt and work with.
* Enhanced migrate views task to support views created with explicit column list ([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The `migrate views` task has been enhanced to support views that are created with an explicit column list, addressing issue [#1375](https://github.com/databrickslabs/ucx/issues/1375). A lookup based on `SHOW CREATE TABLE` has been added to extract the column list from the create script, allowing for better handling of views with defined column lists. The commit also introduces a new dependency, `sqlglot~=23.9.0`, and updates the `PyYAML` dependency. Test functions and methods have been updated in the test file `test_table_migrate.py` to ensure that views with explicit column lists are migrated correctly, including a new test function, `test_migrate_view_with_columns`. This improvement helps make the migrate views task more robust and capable of handling a wider variety of views.
* Ensure that USE statements are recognized and apply to table references without a qualifying schema in SQL and pyspark ([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This change introduces a new class, `CurrentSessionState`, in the `databricks.labs.ucx.source_code.base` module to manage the current session's database name during table initialization in SQL and PySpark. The purpose of this enhancement is to ensure proper recognition and application of `USE` statements to table references without a qualifying schema, addressing schema ambiguity and aligning with Spark documentation. The linter class has been updated to track the session schema, and SQL parsing methods have been modified to accurately interpret the `read` argument as the dialect for parsing. Improved handling of table references and schema alignment in SQL and PySpark contexts enhances code robustness and user-friendliness.
* Expand documentation for end to end workflows with external HMS ([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The updated UCX library now supports integration with an external Hive Metastore, providing users with the flexibility to choose between the default workspace metastore or an external one. Upon detecting an external metastore in cluster policies and Spark configurations, UCX will prompt the user to connect to it, creating a new policy with the chosen external metastore configuration. This change does not affect SQL Warehouse data access configurations, and users must ensure both job clusters and SQL Warehouses are configured for the same external Hive metastore. When setting up UCX with an external metastore, the assessment workflow scans tables and views, and the table migration workflow upgrades them accordingly. The inventory database is stored in the external Hive metastore and can only be queried with the correct configuration. When using multiple external Hive metastores, users can choose between having multiple UCX installations or manually modifying the cluster policy and SQL data access configuration to point to the correct external Hive metastore.
* Extend service principal migration with option to create access connectors with managed identity for each storage account ([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This commit extends the service principal migration feature by adding the capability to create access connectors with managed identities for each storage account. A new CLI command and updated existing command are included, as well as new methods for creating, listing, getting, and deleting access connectors. The `AccessConnector` class is added to represent an access connector with properties such as id, name, location, and tags. The necessary permissions for these new access connectors will be set in a later PR. The changes also include updates to user documentation and new unit and integration tests. This feature will allow users to migrate their service principals to UC storage credentials and create Databricks Access Connectors for their storage accounts, all with the convenience of managed identities, improving security and management.
* Extended wait time for group checking during tests ([#1464](https://github.com/databrickslabs/ucx/issues/1464)). Group APIs are eventually consistent, which can cause tests to fail. This change extends the wait for group checks in `tests/integration/workspace_access/test_groups.py` to up to 2 minutes: the `timeout` parameter of the retry decorator on the `wait` function goes from 60 seconds to 120 seconds. Group verification in the tests now tolerates longer delays and inconsistencies in the group APIs.
* Fix: `test_delete_ws_groups_should_delete_renamed_and_reflected_groups_only` and `test_running_real_remove_backup_groups_job` ([#1476](https://github.com/databrickslabs/ucx/issues/1476)). This release fixes issues [#1473](https://github.com/databrickslabs/ucx/issues/1473) and [#1472](https://github.com/databrickslabs/ucx/issues/1472) in the tests for deleting workspace groups. In `test_groups.py`, the `test_delete_ws_groups_should_delete_renamed_and_reflected_groups_only` and `test_running_real_remove_backup_groups_job` tests now check more reliably whether a group is deleted. Previously, the tests failed if the deletion was not visible immediately. Now they pass if a `NotFound` exception is raised within a few retries, and fail only if the group still exists after a couple of minutes. The check goes through a new `get_group` function: it raises a `KeyError` while the group can still be fetched, which triggers a retry, and the tests expect the call to end with a `NotFound` error once the deletion has propagated.
* Fixed UCX policy creation when instance pool is specified ([#1457](https://github.com/databrick…
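
The retry-until-`NotFound` pattern described in the two group-test entries above can be sketched roughly as follows. This is a standard-library-only illustration, not the project's test code: `retried`, `NotFound`, `FakeGroupsApi`, and `wait_until_deleted` are hypothetical stand-ins written for this example, and the fake API simply pretends that a deleted group stays visible for a few more reads.

```python
import time
from datetime import timedelta
from functools import wraps


class NotFound(Exception):
    """Hypothetical stand-in for the error raised when a group no longer exists."""


def retried(on, timeout, sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call while it raises one of `on`, until `timeout` elapses."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + timeout.total_seconds()
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except tuple(on) as err:
                    if clock() >= deadline:
                        raise TimeoutError(f"timed out after {timeout}") from err
                    attempt += 1
                    sleep(min(attempt, 10))  # linear backoff, capped at 10 s
        return wrapper
    return decorator


class FakeGroupsApi:
    """Hypothetical eventually consistent API: a deleted group stays
    visible for `stale_reads` reads before `NotFound` is raised."""

    def __init__(self, stale_reads):
        self.stale_reads = stale_reads
        self.calls = 0

    def get(self, group_id):
        self.calls += 1
        if self.calls > self.stale_reads:
            raise NotFound(group_id)
        return {"id": group_id}


def wait_until_deleted(api, group_id, timeout=timedelta(minutes=2), sleep=time.sleep):
    """Return True once the API reports NotFound; raise TimeoutError otherwise."""

    @retried(on=[KeyError], timeout=timeout, sleep=sleep)
    def get_group():
        api.get(group_id)  # NotFound is not in `on`, so it propagates immediately
        raise KeyError(f"Group is not deleted: {group_id}")  # still visible: retry

    try:
        get_group()
    except NotFound:
        return True


# Usage: the group disappears on the fourth read (sleep is stubbed out here).
api = FakeGroupsApi(stale_reads=3)
deleted = wait_until_deleted(api, "group-a", sleep=lambda _: None)
```

Raising `KeyError` on a successful read turns "the group is still there" into the retryable condition, so the only ways out of the loop are the expected `NotFound` or a timeout.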