Added baseline for workflow linter #1613

Merged
merged 26 commits into main from feat/lint/job
May 7, 2024

nfx
Collaborator

@nfx nfx commented May 2, 2024

flowchart TD
    job -->|has many| job_task
    job_task -.-> notebook_task
    job_task -.-> wheel_task 

    job -.-> git_source

    job_task -.->|execute on| interactive_cluster
    interactive_cluster -.-> library

    job_task -.-> library
    library -.-> wheel_on_dbfs
    library -.-> wheel_on_wsfs
    library -.-> wheel_on_volumes
    library -.-> egg_on_dbfs
    library -.-> egg_on_wsfs
    library -.-> pypi
    wheel_task -.-> wheel_on_dbfs
    wheel_task -.-> wheel_on_wsfs

    wheel_on_dbfs -.-> python_file
    wheel_on_wsfs -.-> python_file
    egg_on_dbfs -.-> python_file
    egg_on_wsfs -.-> python_file
    pypi -.-> python_file
    wsfs_file -.-> python_file
    python_file -.->|import| python_file
    notebook_task -.-> notebook
    notebook -.->|import| python_file
    notebook -.->|can run| notebook

    job_task -.-> dependency_graph
    python_file --> dependency_graph
    notebook --> dependency_graph

    git_source -.-> python_file
    git_source -.-> notebook
    lint_local_code_cli --> dependency_graph

    workflow_linter --> dependency_graph
    workflow_linter -.-> job_problems
    dependency_graph -.-> job_problems
    job_problems -.->|viz| redash_dashboard

This PR adds a baseline for linting workflows.
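
The relationships in the flowchart can be read as a graph walk: a notebook or Python file pulls in other files, and each one is visited once. A minimal sketch, assuming a hypothetical `Node` type standing in for notebooks and Python files (neither `Node` nor `walk` comes from this PR):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a notebook or Python file in the graph."""
    path: str
    imports: list["Node"] = field(default_factory=list)


def walk(root: Node) -> list[str]:
    """Return every path reachable from root, visiting each node once."""
    seen: set[str] = set()
    order: list[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.path in seen:
            # Files can import each other and notebooks can run notebooks,
            # so cycles are possible and must not loop forever.
            continue
        seen.add(node.path)
        order.append(node.path)
        # Reverse so that imports are visited in declaration order.
        stack.extend(reversed(node.imports))
    return order
```

The `seen` set is what keeps the `python_file -.->|import| python_file` edge from turning into an infinite loop.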

Related to:

closes #1559
closes #1468
closes #1286

@ericvergnaud
Contributor

`test_dependency_graph_builder_visits_site_packages` only succeeds on the main branch because it's not checking problems. Importing core from `__init__.py` doesn't work yet (it requires cwd, which is in my PR). Core ends up being loaded because it's described in dist-info, hence the illusion. It's failing in this PR because of the changes in error handling.

codecov bot commented May 3, 2024

Codecov Report

Attention: Patch coverage is 62.15753%, with 221 lines in your changes missing coverage. Please review.

Project coverage is 88.34%. Comparing base (5222aaa) to head (9628618).
Report is 20 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| src/databricks/labs/ucx/source_code/jobs.py | 32.22% | 122 Missing ⚠️ |
| ...tabricks/labs/ucx/source_code/notebooks/sources.py | 46.66% | 24 Missing ⚠️ |
| ...abricks/labs/ucx/source_code/notebooks/migrator.py | 0.00% | 14 Missing ⚠️ |
| ...tabricks/labs/ucx/source_code/notebooks/loaders.py | 70.00% | 8 Missing and 4 partials ⚠️ |
| ...databricks/labs/ucx/source_code/notebooks/cells.py | 73.80% | 8 Missing and 3 partials ⚠️ |
| src/databricks/labs/ucx/source_code/files.py | 77.77% | 7 Missing and 3 partials ⚠️ |
| src/databricks/labs/ucx/source_code/graph.py | 93.07% | 9 Missing ⚠️ |
| ...c/databricks/labs/ucx/source_code/site_packages.py | 68.42% | 5 Missing and 1 partial ⚠️ |
| ...c/databricks/labs/ucx/source_code/python_linter.py | 50.00% | 5 Missing ⚠️ |
| src/databricks/labs/ucx/contexts/application.py | 60.00% | 4 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1613      +/-   ##
==========================================
- Coverage   89.70%   88.34%   -1.36%     
==========================================
  Files          82       87       +5     
  Lines       10301    10826     +525     
  Branches     1813     1907      +94     
==========================================
+ Hits         9240     9564     +324     
- Misses        691      895     +204     
+ Partials      370      367       -3     


@nfx nfx force-pushed the feat/lint/job branch from 3b63d04 to 16502d7 Compare May 3, 2024 14:56
Member

@JCZuurmond JCZuurmond left a comment

Generally LGTM. I see the work this PR is preparing for, let's discuss that in our meeting.

I read through it today and two or three more times during the last couple of days. It's a big PR with new functionality, so let's merge it so that I can get a better understanding by extending the functionality.

Have some minor nits/questions in the comments 👇

self._migration_index = migration_index
self._whitelist = whitelist

def lint_job(self, job_id: int):
Member

Suggested change
def lint_job(self, job_id: int):
def lint_job(self, job_id: int) -> list[JobProblem]:

@nfx nfx requested a review from JCZuurmond May 7, 2024 12:51
@nfx nfx marked this pull request as ready for review May 7, 2024 12:51
@nfx nfx requested a review from a team May 7, 2024 12:51
@nfx nfx enabled auto-merge May 7, 2024 13:01
return None
container.build_dependency_graph(child_graph)
return child_graph
problem = DependencyProblem('dependency-register-failed', 'Failed to register dependency', dependency.path)
Contributor

This is a big change! We might want to return a problem, but we almost certainly don't want to return an ill graph

Collaborator Author

return MaybeGraph(child_graph, [problem]) is not nil
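
A minimal sketch of the result shape discussed in this thread, assuming a simplified `MaybeGraph` that carries both a graph and a list of problems. The field names, the `failed` property, the two-argument `DependencyProblem` and the `register` helper are illustrative assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DependencyProblem:
    # Simplified stand-in: a problem code and a human-readable message.
    code: str
    message: str


@dataclass
class MaybeGraph:
    # The graph may be present even when problems were recorded.
    graph: Optional[dict]
    problems: list[DependencyProblem] = field(default_factory=list)

    @property
    def failed(self) -> bool:
        return len(self.problems) > 0


def register(path: str, known: set[str]) -> MaybeGraph:
    """Hypothetical helper: build a child graph, recording a problem if path is unknown."""
    child_graph = {"path": path, "children": []}
    if path not in known:
        problem = DependencyProblem(
            "dependency-register-failed",
            f"Failed to register dependency: {path}",
        )
        return MaybeGraph(child_graph, [problem])
    return MaybeGraph(child_graph)
```

With this shape the caller checks `failed` instead of testing the graph for `None`, which is how a non-nil graph and a reported problem can coexist.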

def resolve_notebook(self, path: Path, problem_collector: Callable[[DependencyProblem], None]) -> Dependency | None:
return None
def resolve_notebook(self, path: Path) -> MaybeDependency:
return self._fail('notebook-not-found', f"Notebook not found: {path.as_posix()}")
Contributor

That's incorrect. Sub-resolvers shouldn't be reporting not-found problems, that's the job of the DependencyResolver

Contributor

The rationale being that not resolving a dependency might not be a problem, it's for the caller to make that call. (like file.exists() doesn't raise errors)

Collaborator Author

Why do we need sub-resolvers then? Perhaps we can remove them?
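
The division of responsibility argued for in this thread can be sketched as follows. This is only an illustration under assumptions: `InMemoryResolver`, `TopLevelResolver`, `locate` and the tuple return type are hypothetical names and shapes, not the PR's `DependencyResolver` API.

```python
from typing import Optional


class InMemoryResolver:
    """Hypothetical sub-resolver: only answers whether it can locate a name."""

    def __init__(self, known: dict[str, str]):
        self._known = known

    def locate(self, name: str) -> Optional[str]:
        # Like a file-exists check: absence is not an error at this level.
        return self._known.get(name)


class TopLevelResolver:
    """Hypothetical top-level resolver: tries each sub-resolver, then decides what to report."""

    def __init__(self, sub_resolvers: list[InMemoryResolver]):
        self._sub_resolvers = sub_resolvers

    def resolve(self, name: str) -> tuple[Optional[str], list[str]]:
        for sub_resolver in self._sub_resolvers:
            located = sub_resolver.locate(name)
            if located is not None:
                return located, []
        # Only the caller of the sub-resolvers knows every option was exhausted.
        return None, [f"notebook-not-found: {name}"]
```

In this arrangement a miss in one sub-resolver stays silent, and a problem is produced only once all of them have missed.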

else:
self.add_problems(problems)
return dependency
def resolve_notebook(self, path: Path) -> MaybeDependency:
Contributor

This change incorrectly moves the responsibility of reporting a not-found problem from the DependencyResolver to sub-resolvers. Sub-resolvers should only answer the question: can you locate this thing?

end_col=call.end_col_offset or 0,
)

def build_graph_from_python_source(self, python_code: str) -> MaybeGraph:
linter = ASTLinter.parse(python_code)
Contributor

This is missing augmenting `sys.path`, see #1633

Collaborator Author

@nfx nfx May 7, 2024

got it
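
For context on the comment above, one plausible reading of "augmenting `sys.path`" is that the linter has to notice when the linted source itself extends the import search path. A rough sketch under that assumption, using only the standard `ast` module; `find_sys_path_appends` is a hypothetical name and #1633 may take a different approach:

```python
import ast


def find_sys_path_appends(source: str) -> list[str]:
    """Collect string literals passed to sys.path.append(...) in the given source."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the attribute chain sys.path.append
        is_append = (
            isinstance(func, ast.Attribute)
            and func.attr == "append"
            and isinstance(func.value, ast.Attribute)
            and func.value.attr == "path"
            and isinstance(func.value.value, ast.Name)
            and func.value.value.id == "sys"
        )
        if not is_append:
            continue
        for arg in node.args:
            # Only literal strings can be resolved statically.
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                found.append(arg.value)
    return found
```

Paths computed at runtime (for example from `os.getcwd()`) are deliberately ignored here, since they cannot be resolved without executing the code.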

@nfx nfx changed the title Added workflow linter Added baseline for workflow linter May 7, 2024
@nfx nfx disabled auto-merge May 7, 2024 14:30
@nfx nfx enabled auto-merge May 7, 2024 14:30
@nfx nfx requested a review from ericvergnaud May 7, 2024 15:27
@nfx nfx disabled auto-merge May 7, 2024 15:44
@nfx nfx merged commit 1ae345a into main May 7, 2024
4 of 7 checks passed
@nfx nfx deleted the feat/lint/job branch May 7, 2024 15:44
nfx added a commit that referenced this pull request May 8, 2024
* Added DBSQL queries & dashboard migration ([#1532](#1532)). The Databricks Labs Unified Command Extensions (UCX) project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
* Added UDFs assessment ([#1610](#1610)). A User Defined Function (UDF) assessment feature has been introduced, addressing issue [#1610](#1610). A new method, `DESCRIBE_FUNCTION`, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter `comment`, initially left blank in the test function. Additionally, two new columns, `success` and `failures`, have been added to the udf table in the inventory database to store assessment data for UDFs. The `UdfsCrawler` class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the `$inventory.udfs` table, with a widget displaying this information as a counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS ([#1495](#1495)). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Unity Catalog (UC) roles in AWS for S3 locations that lack a UC-compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests have been implemented to test the functionality of `create_missing_principals` for AWS and Azure, as well as testing the behavior when the command is not approved.
* Added baseline for workflow linter ([#1613](#1613)). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`. Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access ([#1606](#1606)). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments.
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow ([#1621](#1621)). The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
* Added "seen tables" feature ([#1465](#1465)). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and `_get_table_location` have been included to facilitate these improvements. In the testing realm, a new test `test_mount_listing_seen_tables` has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command ([#1660](#1660)). This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature are software engineers who adopt the project.
* Added support for migrating external location permissions from interactive cluster mounts ([#1487](#1487)). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues [#1192](#1192) and [#1193](#1193), ensuring a more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN ([#1631](#1631)). In this release, we've implemented new features to enhance the security and control over data access during the migration process for the SQL warehouse data access configuration. The `databricks labs ucx create-uber-principal` command now creates a service principal with read-only access to all the storage used by tables in the workspace. The UCX Cluster Policy and SQL Warehouse data access configuration will be updated to use this service principal for migration workflows. A new method, `_update_sql_dac_with_instance_profile`, has been introduced in the `access.py` file to update the SQL data access configuration with the provided AWS instance profile, ensuring a more streamlined management of instance profiles within the SQL data access configuration during the creation of an uber service principal (SPN). Additionally, new methods and tests have been added to the sql module of the databricks.sdk.service package to improve Azure resource permissions, handling different scenarios related to creating a global SPN in the presence or absence of various conditions, such as storage, cluster policies, or secrets.
* Addressed issue with disabled features in certain regions ([#1618](#1618)). In this release, we have implemented improvements to address an issue where certain features were disabled in specific regions. We have added error handling when listing serving endpoints to raise a NotFound error if a feature is disabled, preventing the code from failing silently and providing better error messages. A new method, test_serving_endpoints_not_enabled, has been added, which creates a mock WorkspaceClient and raises a NotFound error if serving endpoints are not enabled for a shard. The GenericPermissionsSupport class uses this method to get crawler tasks, and if serving endpoints are not enabled, an error message is logged. These changes increase the reliability and robustness of the codebase by providing better error handling and messaging for this particular issue. Additionally, the change includes unit tests and manual testing to ensure the proper functioning of the new features.
* Aggregate UCX output across workspaces with CLI command ([#1596](#1596)). A new `report-account-compatibility` command has been added to the `databricks labs ucx` tool, enabling users to evaluate the compatibility of an entire Azure Databricks account with UCX (Unified Client Context). This command generates a readiness report for an Azure Databricks account, specifically for evaluating compatibility with UCX, by querying various aspects of the account such as clusters, configurations, and data formats. It uses Azure CLI authentication with AAD tokens for authentication and accepts a profile as an argument. The output includes warnings for workspaces that do not have UCX installed, and provides information about unsupported cluster types, unsupported configurations, data format compatibility, and more. Additionally, a new feature has been added to aggregate UCX output across workspaces in an account through a new CLI command, "report-account-compatibility", which can be run at the account level. The existing `manual-workspace-info` command remains unchanged. These changes will help assess the readiness and compatibility of an Azure Databricks account for UCX integration and simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy ([#1665](#1665)). In this release, we have implemented a change to ensure the presence of the display name of a specific workspace group (ws_group_a) in the cluster policy. This is to prevent a key error previously encountered. The cluster policy is now loaded as a dictionary, and the group name is checked to confirm its presence. If the group is not found, a message is raised alerting users. Additionally, the permission level for the group is verified to ensure it is set to CAN_USE. No new methods have been added, and existing functionality remains unchanged. The test file test_ext_hms.py has been updated to include the new assertion and has undergone both unit tests and manual testing to ensure proper implementation. This change is intended for software engineers who adopt the project.
* Automatically retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure ([#1650](#1650)). This commit introduces automatic retrying with 'auth_type=azure-cli' when constructing `workspace_clients` on Azure, resolving TODO items for `AccountWorkspaces` and adding relevant suggestions in 'troubleshooting.md'. It closes issues [#1574](#1574) and [#1430](#1430), and includes new methods for generating readiness reports in `AccountAggregate` and testing the `get_accessible_workspaces` method in 'test_workspaces.py'. User documentation has been updated and the changes have been manually verified in a staging environment. For macOS and Windows users, explicit auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure storage account for principal-prefix-access ([#1576](#1576)). This release introduces several enhancements to the identification of service principals with custom roles on Azure storage accounts for principal-prefix-access. New methods such as `_get_permission_level`, `_get_custom_role_privilege`, and `_get_role_privilege` have been added to improve the functionality of the module. Additionally, two new classes, AzureRoleAssignment and AzureRoleDetails, have been added to enable more detailed management and access control for custom roles on Azure storage accounts. The 'test_access.py' file has been updated to include tests for saving custom roles in Azure storage accounts and ensuring the correct identification of service principals with custom roles. A new unit test function, test_role_assignments_custom_storage(), has also been added to verify the behavior of custom roles in Azure storage accounts. Overall, these changes provide a more efficient and fine-grained way to manage and control custom roles on Azure storage accounts.
* Clarified unsupported config in compute crawler ([#1656](#1656)). In this release, we have made significant changes to clarify and improve the handling of unsupported configurations in our compute crawler related to the Hive metastore. We have expanded error messages for unsupported configurations and provided detailed recommendations for remediation. Additionally, we have added relevant user documentation and manually tested the changes. The changes include updates to the configuration for external Hive metastore and passthrough security model for Unity Catalog, which are incompatible with the current configurations. We recommend removing or altering the configs while migrating existing tables and views using UCX or other compatible clusters, and mapping the passthrough security model to a security model compatible with Unity Catalog. The code modifications include the addition of new methods for checking cluster init script and Spark configurations, as well as refining the error messages for unsupported configurations. We also added a new assertion in the `test_cluster_with_multiple_failures` unit test to check for the presence of a specific message regarding the use of the `spark.databricks.passthrough.enabled` configuration. This release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is detected ([#1579](#1579)). A new default database `ucx` is introduced for storing inventory in the hive metastore, with a suffix consisting of the workspace's client ID to ensure uniqueness when an external hive metastore is detected. The `has_ext_hms()` method is added to the `InstallationPolicy` class to detect external HMS and thereby create a unique default schema. The `_prompt_for_new_installation` method's default value for the `Inventory Database stored in hive_metastore` prompt is updated to use the new default database name, modified to include the workspace's client ID if external HMS is detected. Additionally, a test function `test_save_config_ext_hms` is implemented to demonstrate the `WorkspaceInstaller` class's behavior with external HMS, creating a unique default schema for improved system functionality and customization. This change is part of issue [#1579](#1579).
* Extend service principal migration to create storage credentials for access connectors created for each storage account ([#1426](#1426)). This commit extends the service principal migration to create storage credentials for access connectors associated with each storage account, resolving issues [#1384](#1384) and [#875](#875). The update includes modifications to the existing `databricks labs ucx` command for creating access connectors, adds a new CLI command for creating storage credentials, and updates the documentation. A new workflow has been added for creating credentials for access connectors and service principals, and updates have been made to existing workflows. The commit includes manual, unit, and integration tests, and no new or modified methods are specified in the diff. The focus is on the feature description and its impact on the project's functionality. The commit has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to access Azure Storage Accounts behind firewall ([#1589](#1589)). In this release, we have introduced a new feature to improve access to Azure Storage Accounts that are protected by firewalls. Due to limitations with service principals in such scenarios, we have developed Access Connectors with Managed Identities for more reliable connectivity. This change includes updates to the 'credentials.py' file, which introduces new methods for managing the migration of service principals to Access Connectors using Managed Identities. Users are warned that migrating to this new feature may cause issues when transitioning to UC, and are advised to validate external locations after running the migration command. This update enhances the security and functionality of the system, providing a more dependable method for accessing Azure Storage Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have different target schemas ([#1581](#1581)). In this release, we have implemented a fix to address an issue where catalog/schema grants were not being handled correctly when tables with the same source schema had different target schemas. This was causing problems with granting appropriate permissions to users. We have modified the prepare_test function to include an additional test case with a different target schema for the same source table. Furthermore, we have updated the test_catalog_schema_acl function to ensure that grants are being created correctly for all catalogs, schemas, and tables. We have also added an extra query to grant use schema permissions for catalog2.schema3 to user1. Additionally, we have introduced a new `SchemaInfo` class to store information about catalogs and schemas, and refactored the `_get_database_source_target_mapping` method to return a dictionary that maps source databases to a list of `SchemaInfo` objects instead of a single dictionary. These changes ensure that grants are being handled correctly for catalogs, schemas, and tables, even when tables with the same source schema have different target schemas. This will improve the overall functionality and reliability of the system, making it easier for users to manage their catalogs and schemas.
* Fixed Spark configuration parameter referencing secret ([#1635](#1635)). In this release, the code related to the Spark configuration parameter reference for a secret has been updated in the `access.py` file, specifically within the `_update_cluster_policy_definition` method. The change modifies the method to retrieve the OAuth client secret for a given storage account using an f-string to reference the secret, replacing the previous concatenation operator. This enhancement is aimed at improving the readability and maintainability of the code while preserving its functionality. Furthermore, the commit includes additional changes, such as new methods `test_create_global_spn` and "cluster_policies.edit", which may be related to this fix. These changes address the secret reference issue, ensuring secure access control and improved integration, particularly with the Spark configuration, benefiting engineers utilizing this project for handling sensitive information and managing clusters securely and effectively.
* Fixed `migration-locations` and `assign-metastore` definitions in `labs.yml` ([#1627](#1627)). In this release, the `migration-locations` command in the `labs.yml` file has been updated to include new flags `subscription-id` and `aws-profile`. The `subscription-id` flag allows users to specify the subscription to scan the storage account in, and the `aws-profile` flag allows for authentication using a specified AWS Profile. The `assign-metastore` command has also been updated with a new description: "Enable Unity Catalog features on a workspace by assigning a metastore to it." The `is_account_level` parameter remains unchanged, and the new optional flag `workspace-id` has been added, allowing users to specify the Workspace ID to assign a metastore to. This change enhances the functionality of the `migration-locations` and `assign-metastore` commands, providing more options for users to customize their storage scanning and metastore assignment processes. The `migration-locations` and `assign-metastore` definitions in the `labs.yml` file have been fixed in this release.
* Fixed prompt for using external metastore ([#1668](#1668)). A fix has been implemented in the `create` function of the `policy.py` file to correctly prompt users for using an external metastore. Previously, a missing period and space in the prompt caused potential confusion. The updated prompt now includes a clarifying sentence and the `_prompts.confirm` method has been modified to check if the user wants to set UCX to connect to an external metastore in two scenarios: when one or more cluster policies are set up for an external metastore, and when the workspace warehouse is configured for an external metastore. If the user chooses to set up an external metastore, an informational message will be recorded in the logger. This change ensures clear and precise communication with users during the external metastore setup process.
* Fixed storage account network ACLs retrieved from properties ([#1620](#1620)). This release includes a fix to the storage account network ACLs retrieval in the open-source library, addressing issue [#1](#1). Previously, the network ACLs were being retrieved from an incorrect location, but this commit corrects that by obtaining the network ACLs from the storage account's `properties.networkAcls` field. The `StorageAccount` class has been updated to modify the way default network action is retrieved, with a new value `Unknown` added to the previous values `Deny` and `Allow`. The `from_raw_resource` class method has also been updated to retrieve the default network action from the `properties.networkAcls` field instead of the `networkAcls` field. This change may affect any functionality that relies on network ACL information and impacts the existing command `databricks labs ucx ...`. Relevant tests, including a new test `test_azure_resource_storage_accounts_list_non_zero`, have been added, and the fix has been verified through manual and unit testing.
* Fully refresh table migration status in table migration workflow ([#1630](#1630)). This release introduces a new method, `index_full_refresh()`, to the table migration workflow for fully refreshing the migration status, addressing an oversight from a previous commit ([#1623](#1623)) and resolving issue [#1628](#1628). The new method resets the `_migration_status_refresher` before computing the index, ensuring the latest migration status is used for determining whether view dependencies have been migrated. The `index()` method was previously used to refresh the migration status, but it only provided a partial refresh. With this update, `index_full_refresh()` is utilized for a comprehensive refresh, affecting the `refresh_migration_status` task in multiple workflows such as `migrate_views`, `scan_tables_in_mounts_experimental`, and others. This change ensures a more accurate migration report, presenting the updated migration status.
* Ignore existing corrupted installations when refreshing ([#1605](#1605)). A recent update has enhanced the error handling during the loading of installations in the `install.py` file. Specifically, the `installation.load` function now handles certain errors, including `PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by logging a warning message and skipping the corrupted installation instead of raising an error. This behavior has been incorporated into both the `configure` and `_check_inventory_database_exists` functions, allowing the installation process to continue even in the presence of issues with existing installations, while providing improved error messages. This change resolves issue [#1601](#1601) and introduces a new test case for a corrupted installation configuration, as well as an updated existing test case for `test_save_config` that includes a mock installation.
* Improved exception handling ([#1584](#1584)). In this release, the exception handling during the upload of a wheel file to DBFS has been significantly improved. Previously, only `PermissionDenied` errors were caught and handled. Now, both `BadRequest` and `PermissionDenied` exceptions will be caught and logged as a warning. This change enhances the robustness of the code by handling a wider range of exceptions during the upload process. In addition, cluster overrides have been configured and DBFS write permissions have been set up. The specific changes made to the code include updating the import statement for `NotFound` to include `BadRequest` and modifying the except block in the `_get_init_script_data` method to catch both `NotFound` and `BadRequest` exceptions. These improvements ensure that the code can handle more types of errors, providing more helpful error messages and preventing crash scenarios, thereby enhancing the reliability and robustness of the code.
* Improved exception handling for `migrate_acl` ([#1590](#1590)). In this release, the `migrate_acl` functionality has been enhanced to improve exception handling, addressing a flakiness issue in the `test_migrate_managed_tables_with_acl` test. Previously, unhandled `not found` exceptions during parallel test execution caused the flakiness. This release resolves this issue ([#1549](#1549)) by introducing error handling in the `test_migrate_acls_should_produce_proper_queries` test. A controlled error is now introduced to simulate a failed grant migration due to a `TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise testing of error handling and logging mechanisms when migration fails for specific objects, ensuring a more reliable testing environment for the `migrate_acl` functionality.
* Improved reliability of table migration status refresher ([#1623](#1623)). This release introduces improvements to the table migration status refresher in the open-source library, enhancing its reliability and robustness. The `table_migrate` function has been updated to ensure that the table migration status is always reset when requesting the latest snapshot, addressing issues [#1623](#1623), [#1622](#1622), and [#1615](#1615). Additionally, the function now handles `NotFound` errors when refreshing migration status. The `get_seen_tables` function has been modified to convert the returned iterator to a list and raise a `NotFound` exception if the schema does not exist, which is then caught and logged as a warning. Furthermore, the migration status reset behavior has been improved, and the `migration_status_refresher` parameter type in the `TableMigrate` class constructor has been modified. New private methods `_index_with_reset()` and updated `_migrate_views()` and `_view_can_be_migrated()` methods have been added to ensure a more accurate and consistent table migration process. The changes have been thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows ([#1599](#1599)). In this release, updates have been made to the migration status at the end of the `migrate_tables` workflows, with no new or modified tables or methods introduced. The `_migration_status_refresher.reset()` method has been added in two locations to ensure accurate migration status updates. A new `refresh_migration_status` method has been included in the `RuntimeContext` class in the `databricks.labs.ucx.hive_metastore.workflows` module, which refreshes the migration status for presentation in the dashboard. The changes also include the addition of the `refresh_migration_status` task in `migrate_views`, `migrate_views_with_acl`, and `scan_tables_in_mounts_experimental` workflows, and the `migration_report` method is now dependent on the `refresh_migration_status` task. Thorough testing has been conducted, including the creation of a new integration test in the file `tests/integration/hive_metastore/test_workflows.py` to verify that the migration status is refreshed after the migration job is run. These changes aim to ensure that the migration status is up-to-date and accurately presented in the dashboard.
* Removed DBFS library installations ([#1554](#1554)). In this release, the "configure.py" file has been removed, which previously contained the `ConfigureClusterOverrides` class with methods for validating cluster IDs, distinguishing between classic and Table Access Control (TACL) clusters, and building a prompt for users to select a valid active cluster ID. The removal of this file signifies that these functionalities are no longer available. This change is part of a larger commit that also removes DBFS library installations and updates the Estimates Dashboard to remove metastore assignment, addressing issue [#1098](#1098). The commit has been tested via integration tests and manual installation and running of UCX on a no-uc environment. Please note that the `create_jobs` method in the `install.py` file has been updated to reflect these changes, ensuring a more straightforward installation experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt ([#1664](#1664)). In this release, we have removed the `is_terraform_used` prompt from the configuration file and the installation process in the ucx package. This prompt was not being utilized and had been a source of confusion for some users. Although the variable that stored its outcome will be retained for backwards compatibility, no new methods or modifications to existing functionality have been introduced. No tests have been added or modified as part of this change. The removal of this prompt simplifies the configuration process and aligns with the project's future plans to eliminate the use of Terraform state for ucx migration. Manual testing has been conducted to ensure that the removal of the prompt does not affect the functionality of other properties in the configuration file or the installation process.
* Resolve relative paths when building dependency graph ([#1608](#1608)). This commit introduces support for resolving relative paths when building a dependency graph in the UCX project, addressing issues 1202, 1499, and 1287. The SysPathProvider now includes a `cwd` attribute, and a new class, LocalNotebookLoader, has been implemented to handle local files and folders. The PathLookup class is used to resolve paths, and new methods have been added to support these changes. Unit tests have been provided to ensure the correct functioning of the new functionality. This commit replaces issue 1593 and enhances the project's ability to handle local files and folders, resulting in a more robust and reliable dependency graph.
* Show tables migration status in migration dashboard ([#1507](#1507)). A migration dashboard has been added to display the status of data object migrations, addressing issue [#323](#323). This new feature includes a query to show the migration status of tables, a new CLI command, and a modification to an existing command. The `migration-*` workflow has been updated to include a refresh migration dashboard option. The `mock_installation` function has been modified with an updated state.json file. The changes consist of manual testing and can be found in the `migrations/main` directory as a new SQL query file. This migration dashboard provides users with an easier way to monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of `sys.path` ([#1633](#1633)). This commit updates the PathLookup process during the construction of the dependency graph, addressing issues [#1202](#1202) and [#1468](#1468). It simplifies the DependencyGraphBuilder by directly using the DependencyResolver with resolvers and lookup passed as arguments, and removes the DependencyGraphBuilder. The changes include new methods for handling compatibility checks, but no new user-facing features or changes to command-line interfaces or existing workflows are introduced. Unit tests are included to ensure correct behavior. The modifications aim to improve the internal handling of dependency resolution and compatibility checks.
* Test if `create-catalogs-schemas` works with tables defined as mount paths ([#1578](#1578)). This release includes a new unit test for the `create-catalogs-schemas` logic that verifies the correct creation and management of catalogs and schemas defined as mount paths. The test checks the storage location of catalogs, ensures non-existing schemas are properly created, and prevents the creation of catalogs without a storage location. It also verifies the catalog schema ACL is set correctly. Using the `CatalogSchema` class and various test functions, the test creates and grants permissions to catalogs and schemas. This change resolves issue [#1039](#1039) without modifying any existing commands or workflows. The release contains no new CLI commands or user documentation, but includes unit tests and assertion calls to validate the behavior of the `create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27 ([#1626](#1626)). In this release, the `databricks-sdk` package has been upgraded to version 0.27, bringing updated methods for Redash objects. The `_install_query` method in the `dashboards.py` file has been updated to include a `tags` parameter, set to `None`, when calling `self._ws.queries.update` and `self._ws.queries.create`. This ensures that the updated SDK version is used and that tags are not applied during query updates and creation. Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint` packages have been updated to versions 0.4.0 and 0.4.3 respectively, and the dependency for PyYAML has been updated to a version between 6.0.0 and 7.0.0. These updates may impact the functionality of the project. The changes have been manually tested, but there is no verification on a staging environment.
* Use stack of dependency resolvers ([#1560](#1560)). This pull request introduces a stack-based implementation of resolvers, resolving issues [#1202](#1202), [#1499](#1499), and [#1421](#1421), and implements an initial version of SysPathProvider, while eliminating previous hacks. The new functionality includes modified existing commands, a new workflow, and the addition of unit tests. No new documentation or CLI commands have been added. The `problem_collector` parameter is not addressed in this PR and has been moved to a separate issue. The changes include renaming and moving a Python file, as well as modifications to the `Notebook` class and its related methods for handling notebook dependencies and dependency checking. The code has been tested, but manual testing and integration tests are still pending.
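The network ACL fix in #1620 above can be illustrated with a minimal sketch. It assumes the raw Azure resource is a plain dictionary and that the default action sits under `properties.networkAcls.defaultAction`, as in the Azure Resource Manager representation of a storage account. The `DefaultNetworkAction` enum and the `default_network_action` helper are hypothetical names used for illustration only; they are not the UCX `StorageAccount.from_raw_resource` implementation.

```python
from enum import Enum
from typing import Any


class DefaultNetworkAction(Enum):
    # Hypothetical enum; the three values are the ones named in the release note.
    DENY = "Deny"
    ALLOW = "Allow"
    UNKNOWN = "Unknown"


def default_network_action(raw: dict[str, Any]) -> DefaultNetworkAction:
    """Read the default action from properties.networkAcls, not from a top-level networkAcls."""
    properties = raw.get("properties") or {}
    network_acls = properties.get("networkAcls") or {}
    value = network_acls.get("defaultAction", "Unknown")
    try:
        return DefaultNetworkAction(value)
    except ValueError:
        # Any unrecognised value is reported as Unknown instead of raising.
        return DefaultNetworkAction.UNKNOWN


# Illustrative raw resource; the account name is made up.
raw_resource = {
    "name": "examplestorage",
    "properties": {"networkAcls": {"defaultAction": "Deny"}},
}
print(default_network_action(raw_resource))
```

Falling back to `Unknown` when the field is absent keeps the lookup from failing on resources that do not report network ACLs, which is one plausible reason for the extra value.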
@nfx nfx mentioned this pull request May 8, 2024
nfx added a commit that referenced this pull request May 8, 2024
* Added DBSQL queries & dashboard migration
([#1532](#1532)). The
Databricks Labs UCX project has been updated with two new experimental
commands: `migrate-dbsql-dashboards`
and `revert-dbsql-dashboards`. These commands are designed for migrating
and reverting the migration of Databricks SQL dashboards in the
workspace. The `migrate-dbsql-dashboards` command transforms all
Databricks SQL dashboards in the workspace after table migration,
tagging migrated dashboards and queries with `migrated by UCX` and
backing up original queries. The `revert-dbsql-dashboards` command
returns migrated Databricks SQL dashboards to their original state
before migration. Both commands accept a `--dashboard-id` flag for
migrating or reverting a specific dashboard. Additionally, two new
functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`,
have been added to the `cli.py` file, and new classes have been added to
interact with Redash for data visualization and querying. The
`make_dashboard` fixture has been updated to enhance testing
capabilities, and new unit tests have been added for migrating and
reverting DBSQL dashboards.
* Added UDFs assessment
([#1610](#1610)). A User
Defined Function (UDF) assessment feature has been introduced,
addressing issue
[#1610](#1610). A new
method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed
information about UDFs, including function description, input
parameters, and return types. This method has been integrated into
existing test cases, enhancing the validation of UDF metadata and
associated privileges, and ensuring system reliability. The UDF
constructor has been updated with a new parameter `comment`, initially
left blank in the test function. Additionally, two new columns,
`success` and `failures`, have been added to the udf table in the
inventory database to store assessment data for UDFs. The UdfsCrawler
class has been updated to return a list of UDF objects, and the
assertions in the test have been updated accordingly. Furthermore, a new
SQL file has been added to calculate the total count of UDFs in the
$inventory.udfs table, with a widget displaying this information as a
counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to
create the missing UC roles in AWS
([#1495](#1495)). The
`databricks labs ucx` tool now includes a new command,
`create-missing-principals`, which creates missing Unity Catalog (UC)
roles in AWS for S3 locations that lack a UC compatible role. This
command is implemented using `IamRoleCreation` from
`databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with
the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new
command only supports AWS and does not affect Azure. The existing
`migrate_credentials` function has been updated to handle Azure Service
Principals migration. Additionally, new classes and methods have been
added, including `AWSUCRoleCandidate` in `aws.py`, and
`create_missing_principals` and `list_uc_roles` methods in `access.py`.
The `create_uc_roles_cli` method in `access.py` has been refactored and
renamed to `list_uc_roles`. New unit tests have been implemented to test
the functionality of `create_missing_principals` for AWS and Azure, as
well as testing the behavior when the command is not approved.
* Added baseline for workflow linter
([#1613](#1613)). This
change introduces the `WorkflowLinter` class in the `application.py`
file of the `databricks.labs.ucx.source_code.jobs` package. The class is
used to lint workflows by checking their dependencies and ensuring they
meet certain criteria, taking in arguments such as `workspace_client`,
`dependency_resolver`, `path_lookup`, and `migration_index`. Several
properties have been moved from `dependency_resolver` to the
`CliContext` class, and the `NotebookLoader` class has been moved to a
new location. Additionally, several classes and methods have been
introduced to build a dependency graph, resolve dependencies, and manage
allowed dependencies, site packages, and supported programming
languages. The `generic` and `redash` modules from
`databricks.labs.ucx.workspace_access` and the `GroupManager` class from
`databricks.labs.ucx.workspace_access.groups` are used. The
`VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from
`databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class
from `databricks.labs.ucx.installer.workflows` are also used. This
commit is part of a larger effort to improve workflow linting and
addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access
([#1606](#1606)). A new
`AstHelper` class has been added to provide utility functions for
working with abstract syntax trees (ASTs) in Python code, including
methods for extracting attribute and function call node names.
Additionally, a linter has been integrated to check for RDD use and JVM
access, utilizing the `AstHelper` class, which has been moved to a
separate module. A new file, `spark_connect.py`, introduces a linter
with three matchers to ensure conformance to best practices and catch
potential issues early in the development process related to RDD usage
and JVM access. The linter is environment-aware, accommodating shared
cluster and serverless configurations, and includes new test methods to
validate its functionality. These improvements enhance codebase quality,
promote reusability, and ensure performance and stability in Spark
cluster environments.
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in
migrate_table workflow
([#1621](#1621)). The
`migrate_tables` workflow in `workflows.py` has been enhanced to support
a new scenario, DBFS_ROOT_NON_DELTA, which covers migrating non-Delta
tables stored in DBFS root from the Hive Metastore to Unity Catalog
using CTAS. Additionally, the ACL migration strategy has been updated to
include the AclMigrationWhat.PRINCIPAL strategy. The
`migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and
`migrate_views` tasks now incorporate the new ACL migration strategy.
These changes have been thoroughly tested through unit tests and
integration tests, ensuring the continued functionality of the existing
workflow while expanding its capabilities.
* Added "seen tables" feature
([#1465](#1465)). The `seen
tables` feature has been introduced, allowing for better handling of
existing tables in the hive metastore and supporting their migration to
UC. This enhancement includes the addition of a `snapshot` method that
fetches and crawls table inventory, appending or overwriting records
based on assessment results. The `_crawl` function has been updated to
check for and skip existing tables in the current workspace. New methods
such as `_get_tables_paths_from_assessment`, `_overwrite_records`, and
`_get_table_location` have been included to facilitate these
improvements. In the testing realm, a new test
`test_mount_listing_seen_tables` has been implemented, replacing
`test_partitioned_csv_jsons`. This test checks the behavior of the
TablesInMounts class when enumerating tables in mounts for a specific
context, accounting for different table formats and managing external
and managed tables. The diff modifies the `locations.py` file in the
databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks
labs ucx migrate-tables` CLI command
([#1660](#1660)). This
commit adds support for the `migrate-tables-ctas` workflow in the
`databricks labs ucx migrate-tables` command, which checks for external
tables that cannot be synced and prompts the user to run the
`migrate-tables-ctas` workflow. Two new methods,
`test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts,
ctx=ctx)`, have been added. The first method checks if the
`migrate-external-tables-ctas` workflow is called correctly, while the
second method runs the workflow after prompting the user. The method
`test_migrate_external_hiveserde_tables_in_place(ws)` has been modified
to test if the `migrate-external-hiveserde-tables-in-place-experimental`
workflow is called correctly. No new methods or significant
modifications to existing functionality have been made in this commit.
The changes include updated unit tests and user documentation. The
target audience for this feature is software engineers who adopt the
project.
* Added support for migrating external location permissions from
interactive cluster mounts
([#1487](#1487)). This
commit adds support for migrating external location permissions from
interactive cluster mounts in Databricks Labs' UCX project, enhancing
security and access control. It retrieves interactive cluster locations
and user mappings from the AzureACL class, granting necessary
permissions to each cluster principal for each location. The existing
`databricks labs ucx` command is modified, with the addition of the new
method `create_external_locations` and thorough testing through manual,
unit, and integration tests. This feature is developed by vuong-nguyen
and Vuong and addresses issues
[#1192](#1192) and
[#1193](#1193), ensuring a
more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access
configuration when creating uber-SPN
([#1631](#1631)). In this
release, we've implemented new features to enhance the security and
control over data access during the migration process for the SQL
warehouse data access configuration. The `databricks labs ucx
create-uber-principal` command now creates a service principal with
read-only access to all the storage used by tables in the workspace. The
UCX Cluster Policy and SQL Warehouse data access configuration will be
updated to use this service principal for migration workflows. A new
method, `_update_sql_dac_with_instance_profile`, has been introduced in
the `access.py` file to update the SQL data access configuration with
the provided AWS instance profile, ensuring a more streamlined
management of instance profiles within the SQL data access configuration
during the creation of an uber service principal (SPN). Additionally,
new methods and tests have been added to the sql module of the
databricks.sdk.service package to improve Azure resource permissions,
handling different scenarios related to creating a global SPN in the
presence or absence of various conditions, such as storage, cluster
policies, or secrets.
* Addressed issue with disabled features in certain regions
([#1618](#1618)). In this
release, we have implemented improvements to address an issue where
certain features were disabled in specific regions. We have added error
handling when listing serving endpoints to raise a NotFound error if a
feature is disabled, preventing the code from failing silently and
providing better error messages. A new method,
test_serving_endpoints_not_enabled, has been added, which creates a mock
WorkspaceClient and raises a NotFound error if serving endpoints are not
enabled for a shard. The GenericPermissionsSupport class uses this
method to get crawler tasks, and if serving endpoints are not enabled,
an error message is logged. These changes increase the reliability and
robustness of the codebase by providing better error handling and
messaging for this particular issue. Additionally, the change includes
unit tests and manual testing to ensure the proper functioning of the
new features.
* Aggregate UCX output across workspaces with CLI command
([#1596](#1596)). A new
`report-account-compatibility` command has been added to the `databricks
labs ucx` tool, enabling users to evaluate the compatibility of an
entire Azure Databricks account with UCX. This
command generates a readiness report for an Azure Databricks account,
specifically for evaluating compatibility with UCX, by querying various
aspects of the account such as clusters, configurations, and data
formats. It uses Azure CLI authentication with AAD tokens for
authentication and accepts a profile as an argument. The output includes
warnings for workspaces that do not have UCX installed, and provides
information about unsupported cluster types, unsupported configurations,
data format compatibility, and more. Additionally, a new feature has
been added to aggregate UCX output across workspaces in an account
through a new CLI command, `report-account-compatibility`, which can be
run at the account level. The existing `manual-workspace-info` command
remains unchanged. These changes will help assess the readiness and
compatibility of an Azure Databricks account for UCX integration and
simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy
([#1665](#1665)). In this
release, we have implemented a change to ensure the presence of the
display name of a specific workspace group (ws_group_a) in the cluster
policy. This is to prevent a key error previously encountered. The
cluster policy is now loaded as a dictionary, and the group name is
checked to confirm its presence. If the group is not found, a message is
raised alerting users. Additionally, the permission level for the group
is verified to ensure it is set to CAN_USE. No new methods have been
added, and existing functionality remains unchanged. The test file
test_ext_hms.py has been updated to include the new assertion and has
undergone both unit tests and manual testing to ensure proper
implementation. This change is intended for software engineers who adopt
the project.
* Automatically retrying with `auth_type=azure-cli` when constructing
`workspace_clients` on Azure
([#1650](#1650)). This
commit introduces automatic retrying with `auth_type=azure-cli` when
constructing `workspace_clients` on Azure, resolving TODO items for
`AccountWorkspaces` and adding relevant suggestions in
`troubleshooting.md`. It closes issues
[#1574](#1574) and
[#1430](#1430), and includes
new methods for generating readiness reports in `AccountAggregate` and
testing the `get_accessible_workspaces` method in `test_workspaces.py`.
User documentation has been updated and the changes have been manually
verified in a staging environment. For macOS and Windows users, explicit
auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure
storage account for principal-prefix-access
([#1576](#1576)). This
release introduces several enhancements to the identification of service
principals with custom roles on Azure storage accounts for
principal-prefix-access. New methods such as `_get_permission_level`,
`_get_custom_role_privilege`, and `_get_role_privilege` have been added
to improve the functionality of the module. Additionally, two new
classes, AzureRoleAssignment and AzureRoleDetails, have been added to
enable more detailed management and access control for custom roles on
Azure storage accounts. The `test_access.py` file has been updated to
include tests for saving custom roles in Azure storage accounts and
ensuring the correct identification of service principals with custom
roles. A new unit test function, test_role_assignments_custom_storage(),
has also been added to verify the behavior of custom roles in Azure
storage accounts. Overall, these changes provide a more efficient and
fine-grained way to manage and control custom roles on Azure storage
accounts.
* Clarified unsupported config in compute crawler
([#1656](#1656)). In this
release, we have made significant changes to clarify and improve the
handling of unsupported configurations in our compute crawler related to
the Hive metastore. We have expanded error messages for unsupported
configurations and provided detailed recommendations for remediation.
Additionally, we have added relevant user documentation and manually
tested the changes. The changes include updates to the configuration for
external Hive metastore and passthrough security model for Unity
Catalog, which are incompatible with the current configurations. We
recommend removing or altering the configs while migrating existing
tables and views using UCX or other compatible clusters, and mapping the
passthrough security model to a security model compatible with Unity
Catalog. The code modifications include the addition of new methods for
checking cluster init script and Spark configurations, as well as
refining the error messages for unsupported configurations. We also
added a new assertion in the `test_cluster_with_multiple_failures` unit
test to check for the presence of a specific message regarding the use
of the `spark.databricks.passthrough.enabled` configuration. This
release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is
detected ([#1579](#1579)). A
new default database `ucx` is introduced for storing inventory in the
hive metastore, with a suffix consisting of the workspace's client ID to
ensure uniqueness when an external hive metastore is detected. The
`has_ext_hms()` method is added to the `InstallationPolicy` class to
detect external HMS and thereby create a unique default schema. The
`_prompt_for_new_installation` method's default value for the `Inventory
Database stored in hive_metastore` prompt is updated to use the new
default database name, modified to include the workspace's client ID if
external HMS is detected. Additionally, a test function
`test_save_config_ext_hms` is implemented to demonstrate the
`WorkspaceInstaller` class's behavior with external HMS, creating a
unique default schema for improved system functionality and
customization. This change is part of issue
[#1579](#1579).
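
The naming rule described in this entry can be sketched as follows. This is a minimal illustration, not the UCX implementation: the helper name `default_inventory_database`, its parameters, and the underscore separator are assumptions made for the example.

```python
def default_inventory_database(has_ext_hms: bool, workspace_client_id: str) -> str:
    """Suggest a default inventory database name (hypothetical helper)."""
    if has_ext_hms:
        # An external HMS can be shared by several workspaces, so a
        # workspace-specific suffix keeps each inventory schema separate.
        # The "_" separator is an assumption for this sketch.
        return f"ucx_{workspace_client_id}"
    # Without an external HMS the plain default name is enough.
    return "ucx"
```

With an external HMS detected, each workspace would be offered a distinct default, so two installations sharing one metastore do not overwrite each other's inventory.
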
* Extend service principal migration to create storage credentials for
access connectors created for each storage account
([#1426](#1426)). This
commit extends the service principal migration to create storage
credentials for access connectors associated with each storage account,
resolving issues
[#1384](#1384) and
[#875](#875). The update
includes modifications to the existing `databricks labs ucx` command for
creating access connectors, adds a new CLI command for creating storage
credentials, and updates the documentation. A new workflow has been
added for creating credentials for access connectors and service
principals, and updates have been made to existing workflows. The commit
includes manual, unit, and integration tests, and no new or modified
methods are specified in the diff. The focus is on the feature
description and its impact on the project's functionality. The commit
has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to
access Azure Storage Accounts behind firewall
([#1589](#1589)). In this
release, we have introduced a new feature to improve access to Azure
Storage Accounts that are protected by firewalls. Due to limitations
with service principals in such scenarios, we have developed Access
Connectors with Managed Identities for more reliable connectivity. This
change includes updates to the `credentials.py` file, which introduces
new methods for managing the migration of service principals to Access
Connectors using Managed Identities. Users are warned that migrating to
this new feature may cause issues when transitioning to UC, and are
advised to validate external locations after running the migration
command. This update enhances the security and functionality of the
system, providing a more dependable method for accessing Azure Storage
Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have
different target schemas
([#1581](#1581)). In this
release, we have implemented a fix to address an issue where
catalog/schema grants were not being handled correctly when tables with
the same source schema had different target schemas. This was causing
problems with granting appropriate permissions to users. We have
modified the `prepare_test` function to include an additional test case
with a different target schema for the same source table. Furthermore,
we have updated the `test_catalog_schema_acl` function to ensure that
grants are being created correctly for all catalogs, schemas, and
tables. We have also added an extra query to grant `USE SCHEMA`
permissions on `catalog2.schema3` to `user1`. Additionally, we have
introduced a new `SchemaInfo` class to store information about catalogs
and schemas, and refactored the `_get_database_source_target_mapping`
method to return a dictionary that maps source databases to a list of
`SchemaInfo` objects instead of a single dictionary. These changes
ensure that grants are being handled correctly for catalogs, schemas,
and tables, even when tables with the same source schema have different
target schemas. This will improve the overall functionality and
reliability of the system, making it easier for users to manage their
catalogs and schemas.
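
To illustrate why a one-to-many mapping is needed here, the sketch below groups migration rules by source schema. Only the `SchemaInfo` class name comes from the entry above; its fields, the `source_target_mapping` function, and the tuple layout of the rules are assumptions for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaInfo:
    # Field names are assumptions for this sketch.
    catalog: str
    schema: str


def source_target_mapping(rules):
    """Map each source schema to every distinct target it migrates to.

    `rules` is an iterable of (src_schema, dst_catalog, dst_schema) tuples.
    """
    mapping = defaultdict(list)
    for src_schema, dst_catalog, dst_schema in rules:
        target = SchemaInfo(dst_catalog, dst_schema)
        # Keep one entry per distinct target so grants are issued once.
        if target not in mapping[src_schema]:
            mapping[src_schema].append(target)
    return dict(mapping)
```

A mapping from source schema to a single target would silently keep only the last target seen, which matches the symptom this fix addresses.
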
* Fixed Spark configuration parameter referencing secret
([#1635](#1635)). In this
release, the code related to the Spark configuration parameter reference
for a secret has been updated in the `access.py` file, specifically
within the `_update_cluster_policy_definition` method. The change
modifies the method to retrieve the OAuth client secret for a given
storage account using an f-string to reference the secret, replacing the
previous concatenation operator. This enhancement is aimed at improving
the readability and maintainability of the code while preserving its
functionality. Furthermore, the commit includes additional changes, such
as new methods `test_create_global_spn` and `cluster_policies.edit`,
which may be related to this fix. These changes address the secret
reference issue, ensuring secure access control and improved
integration, particularly with the Spark configuration, benefiting
engineers utilizing this project for handling sensitive information and
managing clusters securely and effectively.
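
Building a secret reference with an f-string is easy to get wrong because literal braces must be doubled. The sketch below assumes the reference takes the `{{secrets/<scope>/<key>}}` form; the `secret_reference` helper is hypothetical and is not the code in `access.py`.

```python
def secret_reference(scope: str, key: str) -> str:
    """Build a secret reference string (hypothetical helper)."""
    # Inside an f-string, "{{" and "}}" each emit one literal brace, so four
    # braces on each side are needed to produce the surrounding "{{ ... }}".
    return f"{{{{secrets/{scope}/{key}}}}}"
```

Writing only two braces on each side would produce a single-brace string, which would not be recognized as a secret reference under the assumed format.
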
* Fixed `migration-locations` and `assign-metastore` definitions in
`labs.yml` ([#1627](#1627)).
In this release, the `migration-locations` command in the `labs.yml`
file has been updated to include new flags `subscription-id` and
`aws-profile`. The `subscription-id` flag allows users to specify the
subscription to scan the storage account in, and the `aws-profile` flag
allows for authentication using a specified AWS Profile. The
`assign-metastore` command has also been updated with a new description:
"Enable Unity Catalog features on a workspace by assigning a metastore
to it." The `is_account_level` parameter remains unchanged, and the new
optional flag `workspace-id` has been added, allowing users to specify
the Workspace ID to assign a metastore to. This change enhances the
functionality of the `migration-locations` and `assign-metastore`
commands, providing more options for users to customize their storage
scanning and metastore assignment processes.
* Fixed prompt for using external metastore
([#1668](#1668)). A fix has
been implemented in the `create` function of the `policy.py` file to
correctly prompt users for using an external metastore. Previously, a
missing period and space in the prompt caused potential confusion. The
updated prompt now includes a clarifying sentence and the
`_prompts.confirm` method has been modified to check if the user wants
to set UCX to connect to an external metastore in two scenarios: when
one or more cluster policies are set up for an external metastore, and
when the workspace warehouse is configured for an external metastore. If
the user chooses to set up an external metastore, an informational
message will be recorded in the logger. This change ensures clear and
precise communication with users during the external metastore setup
process.
* Fixed storage account network ACLs retrieved from properties
([#1620](#1620)). This
release includes a fix to the storage account network ACLs retrieval in
the open-source library, addressing issue
[#1](#1). Previously, the
network ACLs were being retrieved from an incorrect location, but this
commit corrects that by obtaining the network ACLs from the storage
account's `properties.networkAcls` field. The `StorageAccount` class has
been updated to modify the way default network action is retrieved, with
a new value `Unknown` added to the previous values `Deny` and `Allow`.
The `from_raw_resource` class method has also been updated to retrieve
the default network action from the `properties.networkAcls` field
instead of the `networkAcls` field. This change may affect any
functionality that relies on network ACL information and impacts the
existing command `databricks labs ucx ...`. Relevant tests, including a
new test `test_azure_resource_storage_accounts_list_non_zero`, have been
added, and the fix has been verified with manual and unit tests.
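
The lookup described in this entry can be sketched as below. The function name and the `defaultAction` key are assumptions for the example; only the `properties.networkAcls` location and the `Unknown` fallback come from the entry.

```python
def default_network_action(raw: dict) -> str:
    """Read the default network action from a raw storage account resource."""
    # The ACLs are nested under "properties", not at the top level.
    acls = raw.get("properties", {}).get("networkAcls", {})
    # Fall back to "Unknown" when the resource carries no ACL information.
    return acls.get("defaultAction", "Unknown")
```

Reading from a top-level `networkAcls` key on the same payload would always hit the fallback, which is why the lookup location matters.
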
* Fully refresh table migration status in table migration workflow
([#1630](#1630)). This
release introduces a new method, `index_full_refresh()`, to the table
migration workflow for fully refreshing the migration status, addressing
an oversight from a previous commit
([#1623](#1623)) and
resolving issue
[#1628](#1628). The new
method resets the `_migration_status_refresher` before computing the
index, ensuring the latest migration status is used for determining
whether view dependencies have been migrated. The `index()` method was
previously used to refresh the migration status, but it only provided a
partial refresh. With this update, `index_full_refresh()` is utilized
for a comprehensive refresh, affecting the `refresh_migration_status`
task in multiple workflows such as `migrate_views`,
`scan_tables_in_mounts_experimental`, and others. This change ensures a
more accurate migration report, presenting the updated migration status.
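
The difference between a partial and a full refresh can be sketched with a cached snapshot. The classes below are simplified stand-ins written for this example, not the UCX classes; only the names `index()`, `index_full_refresh()`, `reset()`, and `_migration_status_refresher` are taken from the entry.

```python
class MigrationStatusRefresher:
    """Caches a snapshot of migrated tables (simplified stand-in)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning the current statuses
        self._snapshot = None    # cached result of the last fetch

    def snapshot(self):
        if self._snapshot is None:
            self._snapshot = list(self._fetch())
        return self._snapshot

    def reset(self):
        # Drop the cache so the next snapshot() call fetches again.
        self._snapshot = None


class TablesMigrator:
    def __init__(self, refresher):
        self._migration_status_refresher = refresher

    def index(self):
        # May be served from a stale cache.
        return set(self._migration_status_refresher.snapshot())

    def index_full_refresh(self):
        # Reset first, so the index reflects the latest migration status.
        self._migration_status_refresher.reset()
        return self.index()
```

In this sketch, a table migrated after the first `index()` call only becomes visible once `index_full_refresh()` is used.
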
* Ignore existing corrupted installations when refreshing
([#1605](#1605)). A recent
update has enhanced the error handling during the loading of
installations in the `install.py` file. Specifically, the
`installation.load` function now handles certain errors, including
`PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by
logging a warning message and skipping the corrupted installation
instead of raising an error. This behavior has been incorporated into
both the `configure` and `_check_inventory_database_exists` functions,
allowing the installation process to continue even in the presence of
issues with existing installations, while providing improved error
messages. This change resolves issue
[#1601](#1601) and
introduces a new test case for a corrupted installation configuration,
as well as an updated existing test case for `test_save_config` that
includes a mock installation.
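
The skip-and-warn behavior can be sketched as below. This is an illustration under stated assumptions: `load_all` and its `load` callback are hypothetical, and the built-in `PermissionError` stands in for the `PermissionDenied` and `SerdeError` types named in the entry, which are not imported here.

```python
import logging

logger = logging.getLogger(__name__)


def load_all(candidates, load):
    """Load every installation that can be loaded, skipping corrupted ones."""
    loaded = []
    for candidate in candidates:
        try:
            loaded.append(load(candidate))
        except (PermissionError, ValueError, AttributeError) as err:
            # Log and continue instead of aborting the whole refresh.
            logger.warning(f"Skipping corrupted installation {candidate}: {err}")
    return loaded
```

One unreadable configuration therefore no longer prevents the remaining installations from being processed.
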
* Improved exception handling
([#1584](#1584)). In this
release, the exception handling during the upload of a wheel file to
DBFS has been significantly improved. Previously, only `PermissionDenied`
errors were caught and handled. Now, both `BadRequest` and
`PermissionDenied` exceptions are caught and logged as a warning. In
addition, cluster overrides have been configured and DBFS write
permissions have been set up. The specific code changes include updating
the import statement for `NotFound` to include `BadRequest` and modifying
the `except` block in the `_get_init_script_data` method to catch both
`NotFound` and `BadRequest` exceptions. These improvements let the code
handle more types of errors, provide more helpful error messages, and
avoid crashes.
* Improved exception handling for `migrate_acl`
([#1590](#1590)). In this
release, the `migrate_acl` functionality has been enhanced to improve
exception handling, addressing a flakiness issue in the
`test_migrate_managed_tables_with_acl` test. Previously, unhandled `not
found` exceptions during parallel test execution caused the flakiness.
This release resolves this issue
([#1549](#1549)) by
introducing error handling in the
`test_migrate_acls_should_produce_proper_queries` test. A controlled
error is now introduced to simulate a failed grant migration due to a
`TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise
testing of error handling and logging mechanisms when migration fails
for specific objects, ensuring a more reliable testing environment for
the `migrate_acl` functionality.
* Improved reliability of table migration status refresher
([#1623](#1623)). This
release introduces improvements to the table migration status refresher
in the open-source library, enhancing its reliability and robustness.
The `table_migrate` function has been updated to ensure that the table
migration status is always reset when requesting the latest snapshot,
addressing issues
[#1623](#1623),
[#1622](#1622), and
[#1615](#1615).
Additionally, the function now handles `NotFound` errors when refreshing
migration status. The `get_seen_tables` function has been modified to
convert the returned iterator to a list and raise a `NotFound` exception
if the schema does not exist, which is then caught and logged as a
warning. Furthermore, the migration status reset behavior has been
improved, and the `migration_status_refresher` parameter type in the
`TableMigrate` class constructor has been modified. A new private method
`_index_with_reset()` has been added, and the `_migrate_views()` and
`_view_can_be_migrated()` methods have been updated, to ensure a more
accurate and consistent table migration process. The changes have been
thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows
([#1599](#1599)). In this
release, the migration status is now refreshed at the end of the
`migrate_tables` workflows, with no new or modified tables introduced. A
call to `_migration_status_refresher.reset()` has been added in two
locations to ensure accurate migration status updates.
A new `refresh_migration_status` method has been included in the
`RuntimeContext` class in the
`databricks.labs.ucx.hive_metastore.workflows` module, which refreshes
the migration status for presentation in the dashboard. The changes also
include the addition of the `refresh_migration_status` task in
`migrate_views`, `migrate_views_with_acl`, and
`scan_tables_in_mounts_experimental` workflows, and the
`migration_report` method is now dependent on the
`refresh_migration_status` task. Thorough testing has been conducted,
including the creation of a new integration test in the file
`tests/integration/hive_metastore/test_workflows.py` to verify that the
migration status is refreshed after the migration job is run. These
changes aim to ensure that the migration status is up-to-date and
accurately presented in the dashboard.
* Removed DBFS library installations
([#1554](#1554)). In this
release, the `configure.py` file has been removed, which previously
contained the `ConfigureClusterOverrides` class with methods for
validating cluster IDs, distinguishing between classic and Table Access
Control (TACL) clusters, and building a prompt for users to select a
valid active cluster ID. The removal of this file signifies that these
functionalities are no longer available. This change is part of a larger
commit that also removes DBFS library installations and updates the
Estimates Dashboard to remove metastore assignment, addressing issue
[#1098](#1098). The commit
has been tested via integration tests and manual installation and
running of UCX on a no-uc environment. Please note that the
`create_jobs` method in the `install.py` file has been updated to
reflect these changes, ensuring a more straightforward installation
experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt
([#1664](#1664)). In this
release, we have removed the `is_terraform_used` prompt from the
configuration file and the installation process in the ucx package. This
prompt was not being utilized and had been a source of confusion for
some users. Although the variable that stored its outcome will be
retained for backwards compatibility, no new methods or modifications to
existing functionality have been introduced. No tests have been added or
modified as part of this change. The removal of this prompt simplifies
the configuration process and aligns with the project's future plans to
eliminate the use of Terraform state for ucx migration. Manual testing
has been conducted to ensure that the removal of the prompt does not
affect the functionality of other properties in the configuration file
or the installation process.
* Resolve relative paths when building dependency graph
([#1608](#1608)). This
commit introduces support for resolving relative paths when building a
dependency graph in the UCX project, addressing issues
[#1202](#1202), [#1499](#1499), and [#1287](#1287). The
`SysPathProvider` now includes a `cwd` attribute, and a new class,
`LocalNotebookLoader`, has been implemented to handle local files and
folders. The `PathLookup` class is used to resolve paths, and new
methods have been added to support these changes. Unit tests cover the
new functionality.
This commit replaces issue [#1593](#1593) and enhances the project's ability to
handle local files and folders, resulting in a more robust and reliable
dependency graph.
* Show tables migration status in migration dashboard
([#1507](#1507)). A
migration dashboard has been added to display the status of data object
migrations, addressing issue
[#323](#323). This new
feature includes a query to show the migration status of tables, a new
CLI command, and a modification to an existing command. The
`migration-*` workflow has been updated to include a refresh migration
dashboard option. The `mock_installation` function has been modified
with an updated `state.json` file. The changes were tested manually, and
the new SQL query file can be found in the `migrations/main` directory.
This migration dashboard provides users with an easier way to
monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of
`sys.path` ([#1633](#1633)).
This commit updates the `PathLookup` process during the construction of
the dependency graph, addressing issues
[#1202](#1202) and
[#1468](#1468). It
removes the `DependencyGraphBuilder` in favor of using the
`DependencyResolver` directly, with resolvers and lookup passed as
arguments. The changes include new methods for
handling compatibility checks, but no new user-facing features or
changes to command-line interfaces or existing workflows are introduced.
Unit tests are included to ensure correct behavior. The modifications
aim to improve the internal handling of dependency resolution and
compatibility checks.
* Test if `create-catalogs-schemas` works with tables defined as mount
paths ([#1578](#1578)). This
release includes a new unit test for the `create-catalogs-schemas` logic
that verifies the correct creation and management of catalogs and
schemas defined as mount paths. The test checks the storage location of
catalogs, ensures non-existing schemas are properly created, and
prevents the creation of catalogs without a storage location. It also
verifies the catalog schema ACL is set correctly. Using the
`CatalogSchema` class and various test functions, the test creates and
grants permissions to catalogs and schemas. This change resolves issue
[#1039](#1039) without
modifying any existing commands or workflows. The release contains no
new CLI commands or user documentation, but includes unit tests and
assertion calls to validate the behavior of the
`create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27
([#1626](#1626)). In this
release, the `databricks-sdk` package has been upgraded to version 0.27,
bringing updated methods for Redash objects. The `_install_query` method
in the `dashboards.py` file has been updated to include a `tags`
parameter, set to `None`, when calling `self._ws.queries.update` and
`self._ws.queries.create`. This ensures that the updated SDK version is
used and that tags are not applied during query updates and creation.
Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint`
packages have been updated to versions 0.4.0 and 0.4.3 respectively, and
the dependency for PyYAML has been updated to a version between 6.0.0
and 7.0.0. These updates may impact the functionality of the project.
The changes have been manually tested, but there is no verification on a
staging environment.
* Use stack of dependency resolvers
([#1560](#1560)). This pull
request introduces a stack-based implementation of resolvers, resolving
issues [#1202](#1202),
[#1499](#1499), and
[#1421](#1421), and
implements an initial version of `SysPathProvider`, while eliminating
previous hacks. The new functionality includes modified existing
commands, a new workflow, and the addition of unit tests. No new
documentation or CLI commands have been added. The `problem_collector`
parameter is not addressed in this PR and has been moved to a separate
issue. The changes include renaming and moving a Python file, as well as
modifications to the `Notebook` class and its related methods for
handling notebook dependencies and dependency checking. The code has
been tested, but manual testing and integration tests are still pending.
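
One way to picture a stack of resolvers is a chain in which each resolver answers what it knows and delegates the rest. The sketch below is a generic illustration of that idea, not the resolver classes from this PR; `StackedResolver`, its constructor arguments, and the sample names are all assumptions.

```python
from typing import Optional


class StackedResolver:
    """Resolves a name locally, or delegates to the next resolver in the stack."""

    def __init__(self, known: dict, next_resolver: Optional["StackedResolver"] = None):
        self._known = known
        self._next = next_resolver

    def resolve(self, name: str) -> Optional[str]:
        if name in self._known:
            return self._known[name]
        if self._next is not None:
            # Fall through to the resolver below this one in the stack.
            return self._next.resolve(name)
        # Nothing in the stack could resolve the name.
        return None


# Hypothetical stack: notebooks are tried first, then local files.
files = StackedResolver({"utils": "/repo/utils.py"})
notebooks = StackedResolver({"etl": "/Workspace/etl"}, next_resolver=files)
```

Adding support for a new kind of dependency then means pushing one more resolver onto the stack rather than special-casing it in the graph builder.
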