diff --git a/CHANGELOG.md b/CHANGELOG.md index 0292ea7a65..caf14e4ac8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,21 @@ # Version changelog +## 0.15.0 + +* Added AWS S3 support for `migrate-locations` command ([#1009](https://github.com/databrickslabs/ucx/issues/1009)). In this release, the open-source library has been enhanced with AWS S3 support for the `migrate-locations` command, enabling efficient and secure management of S3 data. The new functionality includes the identification of missing S3 prefixes and the creation of corresponding roles and policies through the addition of methods `_identify_missing_paths`, `_get_existing_credentials_dict`, and `create_external_locations`. The library now also includes new classes `AwsIamRole`, `ExternalLocationInfo`, and `StorageCredentialInfo` for better handling of AWS-related functionality. Additionally, two new tests, `test_create_external_locations` and `test_create_external_locations_skip_existing`, have been added to ensure the correct behavior of the new AWS-related functionality. The new test function `test_migrate_locations_aws` checks the AWS-specific implementation of the `migrate-locations` command, while `test_missing_aws_cli` verifies the correct error message is displayed when the AWS CLI is not found in the system path. These changes enhance the library's capabilities, improving data security, privacy, and overall performance for users working with AWS S3. +* Added `databricks labs ucx create-uber-principal` command to create Azure Service Principal for migration ([#976](https://github.com/databrickslabs/ucx/issues/976)). The new CLI command, `databricks labs ucx create-uber-principal`, has been introduced to create an Azure Service Principal (SPN) and grant it STORAGE BLOB READER access on all the storage accounts used by the tables in the workspace. The SPN information is then stored in the UCX cluster policy. A new class, AzureApiClient, has been added to isolate Azure API calls, and unit and integration tests have been included to verify the functionality. This development enhances migration capabilities for Azure workspaces, providing a more streamlined and automated way to create and manage Service Principals, and improves the functionality and usability of the UCX tool. The changes are well-documented and follow the project's coding standards. +* Added `migrate-locations` command ([#1016](https://github.com/databrickslabs/ucx/issues/1016)). In this release, we've added a new CLI command, `migrate_locations`, to create Unity Catalog (UC) external locations. This command extracts candidates for location creation from the `guess_external_locations` assessment task and checks if corresponding UC Storage Credentials exist before creating the locations. Currently, the command only supports Azure, with plans to add support for AWS and GCP in the future. The `migrate_locations` function is marked with the `ucx.command` decorator and is available as a command-line interface (CLI) command. The pull request also includes unit tests for this new command, which check the environment (Azure, AWS, or GCP) before executing the migration and log a message if the environment is AWS or GCP, indicating that the migration is not yet supported on those platforms. No changes have been made to existing workflows, commands, or tables. +* Added handling for widget delete on upgrade platform bug ([#1011](https://github.com/databrickslabs/ucx/issues/1011)). 
In this release, the `_install_dashboard` method in `dashboards.py` has been updated to handle a platform bug that occurred when deleting dashboard widgets during an upgrade (issue [#1011](https://github.com/databrickslabs/ucx/issues/1011)). Previously, the method attempted to delete each widget using the `self._ws.dashboard_widgets.delete(widget.id)` command, which raised a `TypeError`. The updated method now includes a try/except block that catches this `TypeError` and logs a warning message, while also tracking the issue under bug ES-1061370. The rest of the method remains unchanged, creating a dashboard with the given name, role, and parent folder ID if no widgets are present. This enhancement improves the robustness of the `_install_dashboard` method by adding error handling for the SDK API response when deleting dashboard widgets, ensuring a smoother upgrade process. +* Create UC external locations in Azure based on migrated storage credentials ([#992](https://github.com/databrickslabs/ucx/issues/992)). The `locations.py` file in the `databricks.labs.ucx.azure` package has been updated to include a new class `ExternalLocationsMigration`, which creates UC external locations in Azure based on migrated storage credentials. This class takes various arguments, including `WorkspaceClient`, `HiveMetastoreLocations`, `AzureResourcePermissions`, and `AzureResources`. It has a `run()` method that lists any missing external locations in UC, extracts their location URLs, and attempts to create a UC external location with a mapped storage credential name if the missing external location is in the mapping. The class also includes helper methods for generating credential name mappings. Additionally, the `resources.py` file in the same package has been modified to include a new method `managed_identity_client_id`, which retrieves the client ID of a managed identity associated with a given access connector. Test functions for the `ExternalLocationsMigration` class and Azure external locations functionality have been added in the new file `test_locations.py`. The `test_resources.py` file has been updated to include tests for the `managed_identity_client_id` method. A new `mappings.json` file has also been added for tests related to Azure external location mappings based on migrated storage credentials. +* Deprecate legacy installer ([#1014](https://github.com/databrickslabs/ucx/issues/1014)). In this release, we have deprecated the legacy installer for the UCX project, which was previously implemented as a bash script. A warning message has been added to inform users about the deprecation and direct them to the UCX installation instructions. The functionality of the script remains unchanged, and it still performs tasks such as installing Python dependencies and building Python bindings. The script will eventually be replaced with the `databricks labs install ucx` command. This change is part of issue [#1014](https://github.com/databrickslabs/ucx/issues/1014) and is intended to streamline the installation process and improve the overall user experience. We recommend that users switch to the new installation method as soon as possible to avoid any issues with the legacy installer in the future. +* Prompt user if Terraform utilised for deploying infrastructure ([#1004](https://github.com/databrickslabs/ucx/issues/1004)). 
In this update, the `config.py` file has been modified to include a new attribute, `is_terraform_used`, in the `WorkspaceConfig` class. This boolean flag indicates whether Terraform has been used for deploying certain entities in the workspace. Issue [#393](https://github.com/databrickslabs/ucx/issues/393) has been addressed with this change. The `WorkspaceInstaller` configuration has also been updated to take advantage of this new attribute, allowing developers to determine if Terraform was used for infrastructure deployment, thereby increasing visibility into the deployment process. Additionally, a new prompt has been added to the `warehouse_type` function to ascertain if Terraform is being utilized for infrastructure deployment, setting the `is_terraform_used` variable to True if it is. This improvement is intended for software engineers adopting this open-source library. +* Updated CONTRIBUTING.md ([#1005](https://github.com/databrickslabs/ucx/issues/1005)). In this contribution to the open-source library, the CONTRIBUTING.md file has been significantly updated with clearer instructions on how to effectively contribute to the project. The previous command to print the Python path has been removed, as the IDE is now advised to be configured to use the Python interpreter from the virtual environment. A new step has been added, recommending the use of a consistent style guide and formatting of the code before every commit. Moreover, it is now encouraged to run tests before committing to minimize potential issues during the review process. The steps on how to fork the ucx repo and create a PR have been updated with links to official documentation. Lastly, the commit now includes information on handling dependency errors that may occur after `git pull`. +* Updated databricks-labs-blueprint requirement from ~=0.2.4 to ~=0.3.0 ([#1001](https://github.com/databrickslabs/ucx/issues/1001)). In this pull request, the pyproject.toml requirements file has been modified to upgrade the databricks-labs-blueprint package from version ~=0.2.4 to ~=0.3.0. This update integrates the latest features and bug fixes of the package, including an automated upgrade framework, a brute-forcing approach for handling SerdeError, and enhancements for running nightly integration tests with service principals. These improvements increase the testability and functionality of the software, ensuring its stable operation with service principals during nightly integration tests. Furthermore, the reliability of the test for detecting existing installations has been reinforced by adding a new test function that checks for the correct detection of existing installations and retries for up to 15 seconds if they are not detected. + +Dependency updates: + + * Updated databricks-labs-blueprint requirement from ~=0.2.4 to ~=0.3.0 ([#1001](https://github.com/databrickslabs/ucx/pull/1001)). + ## 0.14.0 * Added `upgraded_from_workspace_id` property to migrated tables to indicate the source workspace ([#987](https://github.com/databrickslabs/ucx/issues/987)). In this release, updates have been made to the `_migrate_external_table`, `_migrate_dbfs_root_table`, and `_migrate_view` methods in the `table_migrate.py` file to include a new parameter `upgraded_from_ws` in the SQL commands used to alter tables, views, or managed tables. This parameter is used to store the source workspace ID in the migrated tables, indicating the migration origin. 
A new utility method `sql_alter_from` has been added to the `Table` class in `tables.py` to generate the SQL command with the new parameter. Additionally, a new class-level attribute `UPGRADED_FROM_WS_PARAM` has been added to the `Table` class in `tables.py` to indicate the source workspace. A new property `upgraded_from_workspace_id` has been added to migrated tables to store the source workspace ID. These changes resolve issue [#899](https://github.com/databrickslabs/ucx/issues/899) and are tested through manual testing, unit tests, and integration tests. No new CLI commands, workflows, or tables have been added or modified, and there are no changes to user documentation. diff --git a/README.md b/README.md index 10eba79f50..6c8a886cfd 100644 --- a/README.md +++ b/README.md @@ -20,6 +20,7 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project. * [Install UCX](#install-ucx) * [Upgrading UCX for newer versions](#upgrading-ucx-for-newer-versions) * [Uninstall UCX](#uninstall-ucx) +* [Migration process](#migration-process) * [Workflows](#workflows) * [Readme notebook](#readme-notebook) * [Assessment workflow](#assessment-workflow) @@ -36,8 +37,10 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project. * [`principal-prefix-access` command](#principal-prefix-access-command) * [Access for AWS S3 Buckets](#access-for-aws-s3-buckets) * [Access for Azure Storage Accounts](#access-for-azure-storage-accounts) + * [`create-uber-principal` command](#create-uber-principal-command) * [`migrate-credentials` command](#migrate-credentials-command) * [`validate-external-locations` command](#validate-external-locations-command) + * [`migrate-locations` command](#migrate-locations-command) * [`create-table-mapping` command](#create-table-mapping-command) * [`skip` command](#skip-command) * [`revert-migrated-tables` command](#revert-migrated-tables-command) @@ -160,6 +163,41 @@ Databricks CLI will confirm a few options: [[back to top](#databricks-labs-ucx)] +# Migration process + +At a high level, the steps of the migration process can be described as: + +```mermaid +flowchart TD + subgraph workspace-admin + assessment --> group-migration + group-migration --> table-migration + table-migration --> code-migration + assessment --> create-table-mapping + create-table-mapping --> table-migration + create-table-mapping --> code-migration + validate-external-locations --> table-migration + table-migration --> revert-migrated-tables + revert-migrated-tables --> table-migration + end + subgraph account-admin + create-account-groups --> group-migration + sync-workspace-info --> create-table-mapping + group-migration --> validate-groups-membership + end + subgraph iam-admin + setup-account-scim --> create-account-groups + assessment --> create-uber-principal + create-uber-principal --> table-migration + assessment --> principal-prefix-access + principal-prefix-access --> migrate-credentials + migrate-credentials --> validate-external-locations + setup-account-scim + end +``` + +[[back to top](#databricks-labs-ucx)] + # Workflows Part of this application is deployed as [Databricks Workflows](https://docs.databricks.com/en/workflows/index.html). @@ -402,6 +440,18 @@ on each storage account. 
This requires Azure CLI to be installed and configured [[back to top](#databricks-labs-ucx)] +## `create-uber-principal` command + +```text +databricks labs ucx create-uber-principal [--subscription-id X] +``` + +**Requires Cloud IAM admin privileges.** Once the [`assessment` workflow](#assessment-workflow) is complete, you should run +this command to create a service principal with _**read-only access to all storage**_ used by tables in this +workspace and configure the [UCX Cluster Policy](#installation) with its details. Once migration is complete, this +service principal should be unprovisioned. On Azure, it creates a principal with a `Storage Blob Data Reader` role +assignment on every storage account using Azure Resource Manager APIs. + ## `migrate-credentials` command ```commandline @@ -427,12 +477,25 @@ databricks labs ucx validate-external-locations ``` Once the [`assessment` workflow](#assessment-workflow) has finished successfully and [storage credentials](#migrate-credentials-command) are configured, -run this command to ensure the relevant Unity Catalog external locations are created if they are missing. +run this command to validate and report the missing Unity Catalog external locations to be created. This command maps external tables to external locations and also produces the mapping as Terraform configurations. [[back to top](#databricks-labs-ucx)] + +## `migrate-locations` command + +```text +databricks labs ucx migrate-locations +``` + +Once the [`assessment` workflow](#assessment-workflow) has finished successfully and [storage credentials](#migrate-credentials-command) are configured, +run this command to create the Unity Catalog external locations. The candidate locations to be created are extracted from the guess_external_locations +task in the assessment job. You can run the [validate_external_locations](#validate-external-locations-command) command to check the candidate locations. + +[[back to top](#databricks-labs-ucx)] + ## `create-table-mapping` command ```text diff --git a/install.sh b/install.sh index 88c98b377f..de61fb3b6e 100755 --- a/install.sh +++ b/install.sh @@ -1,6 +1,6 @@ #!/bin/bash -# This script will eventually be replaced with `databricks labs install ucx` command. +echo -e "\033[0;31m\033[1m!!!!!!!!!!!!!!!\n!!! DEPRECATED: This script is going to be removed soon. See https://github.com/databrickslabs/ucx#installation\n!!!!!!!!!!!!!!!\033[0m" # Initialize an empty array to store Python binary paths python_binaries=() @@ -84,4 +84,4 @@ $py -m pip install --quiet -e . # without console_scripts entrypoint $py -m databricks.labs.ucx.install -rm -r "$tmp_dir" \ No newline at end of file +rm -r "$tmp_dir" diff --git a/labs.yml b/labs.yml index bb92770f41..45bc6fe290 100644 --- a/labs.yml +++ b/labs.yml @@ -122,7 +122,7 @@ commands: {{range .}}{{.wf_group_name}}\t{{.wf_group_members_count}}\t{{.acc_group_name}}\t{{.acc_group_members_count}}\t{{.group_members_difference}} {{end}} - - name: migrate_credentials + - name: migrate-credentials description: Migrate credentials for storage access to UC storage credential - name: create-account-groups @@ -133,3 +133,6 @@ commands: flags: - name: workspace_ids description: List of workspace IDs to create account groups from. + + - name: migrate-locations + description: Create UC external locations based on the output of the guess_external_locations assessment task. 
\ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 0ee3b31742..b176d4830e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -598,7 +598,7 @@ default-docstring-type = "default" [tool.pylint.refactoring] # Maximum number of nested blocks for function / method body -max-nested-blocks = 5 +max-nested-blocks = 3 # Complete name of functions that never returns. When checking for inconsistent- # return-statements if a never returning function is called then it will be diff --git a/src/databricks/labs/ucx/__about__.py b/src/databricks/labs/ucx/__about__.py index e041ef2c1b..637a26e632 100644 --- a/src/databricks/labs/ucx/__about__.py +++ b/src/databricks/labs/ucx/__about__.py @@ -1,2 +1,2 @@ # DO NOT MODIFY THIS FILE -__version__ = "0.14.0" +__version__ = "0.15.0" diff --git a/src/databricks/labs/ucx/account.py b/src/databricks/labs/ucx/account.py index 2ef728a54b..6c638cddcd 100644 --- a/src/databricks/labs/ucx/account.py +++ b/src/databricks/labs/ucx/account.py @@ -131,41 +131,37 @@ def _get_valid_workspaces_groups(self, prompts: Prompts, workspace_ids: list[int for workspace in self._workspaces(): if workspace.workspace_id not in workspace_ids: continue - client = self.client_for(workspace) - logger.info(f"Crawling groups in workspace {client.config.host}") + self._load_workspace_groups(prompts, workspace, all_workspaces_groups) - ws_group_ids = client.groups.list(attributes="id") - for group_id in ws_group_ids: - full_workspace_group = self._safe_groups_get(client, group_id.id) - if not full_workspace_group: - continue - group_name = full_workspace_group.display_name + return all_workspaces_groups - if self._is_group_out_of_scope(full_workspace_group): + def _load_workspace_groups(self, prompts, workspace, all_workspaces_groups): + client = self.client_for(workspace) + logger.info(f"Crawling groups in workspace {client.config.host}") + ws_group_ids = client.groups.list(attributes="id") + for group_id in ws_group_ids: + full_workspace_group = self._safe_groups_get(client, group_id.id) + if not full_workspace_group: + continue + group_name = full_workspace_group.display_name + if self._is_group_out_of_scope(full_workspace_group): + continue + if not group_name: + continue + if group_name in all_workspaces_groups: + if self._has_same_members(all_workspaces_groups[group_name], full_workspace_group): + logger.info(f"Workspace group {group_name} already found, ignoring") continue - - if group_name in all_workspaces_groups: - if self._has_same_members(all_workspaces_groups[group_name], full_workspace_group): - logger.info(f"Workspace group {group_name} already found, ignoring") - continue - - if prompts.confirm( - f"Group {group_name} does not have the same amount of members " - f"in workspace {client.config.host} than previous workspaces which contains the same group name," - f"it will be created at the account with name : {workspace.workspace_name}_{group_name}" - ): - all_workspaces_groups[f"{workspace.workspace_name}_{group_name}"] = full_workspace_group - continue - - if not group_name: + if prompts.confirm( + f"Group {group_name} does not have the same number of members " + f"in workspace {client.config.host} as in previous workspaces that contain the same group name. " + f"It will be created at the account level with the name {workspace.workspace_name}_{group_name}" + ): + all_workspaces_groups[f"{workspace.workspace_name}_{group_name}"] = full_workspace_group continue - - logger.info(f"Found new group {group_name}") - all_workspaces_groups[group_name] = 
full_workspace_group - - logger.info(f"Found a total of {len(all_workspaces_groups)} groups to migrate to the account") - - return all_workspaces_groups + logger.info(f"Found new group {group_name}") + all_workspaces_groups[group_name] = full_workspace_group + logger.info(f"Found a total of {len(all_workspaces_groups)} groups to migrate to the account") def _is_group_out_of_scope(self, group: Group) -> bool: if group.display_name in {"users", "admins", "account users"}: diff --git a/src/databricks/labs/ucx/assessment/jobs.py b/src/databricks/labs/ucx/assessment/jobs.py index 4b6fcfdb93..8dc05570ba 100644 --- a/src/databricks/labs/ucx/assessment/jobs.py +++ b/src/databricks/labs/ucx/assessment/jobs.py @@ -38,27 +38,34 @@ class JobInfo: class JobsMixin: - @staticmethod - def _get_cluster_configs_from_all_jobs(all_jobs, all_clusters_by_id): # pylint: disable=too-complex - for j in all_jobs: - if j.settings is None: + @classmethod + def _get_cluster_configs_from_all_jobs(cls, all_jobs, all_clusters_by_id): + for job in all_jobs: + if job.settings is None: continue - if j.settings.job_clusters is not None: - for job_cluster in j.settings.job_clusters: - if job_cluster.new_cluster is None: - continue - yield j, job_cluster.new_cluster - if j.settings.tasks is None: + if job.settings.job_clusters is not None: + yield from cls._job_clusters(job) + if job.settings.tasks is None: continue - for task in j.settings.tasks: - if task.existing_cluster_id is not None: - interactive_cluster = all_clusters_by_id.get(task.existing_cluster_id, None) - if interactive_cluster is None: - continue - yield j, interactive_cluster + yield from cls._task_clusters(job, all_clusters_by_id) - elif task.new_cluster is not None: - yield j, task.new_cluster + @classmethod + def _task_clusters(cls, job, all_clusters_by_id): + for task in job.settings.tasks: + if task.existing_cluster_id is not None: + interactive_cluster = all_clusters_by_id.get(task.existing_cluster_id, None) + if interactive_cluster is None: + continue + yield job, interactive_cluster + elif task.new_cluster is not None: + yield job, task.new_cluster + + @staticmethod + def _job_clusters(job): + for job_cluster in job.settings.job_clusters: + if job_cluster.new_cluster is None: + continue + yield job, job_cluster.new_cluster class JobsCrawler(CrawlerBase[JobInfo], JobsMixin, CheckClusterMixin): @@ -299,22 +306,10 @@ def _assess_job_runs(self, submit_runs: Iterable[BaseRun], all_clusters_by_id) - runs_per_hash: dict[str, list[int | None]] = {} for submit_run in submit_runs: - task_failures = [] + task_failures: list[str] = [] # v2.1+ API, with tasks if submit_run.tasks: - all_tasks: list[RunTask] = submit_run.tasks - for task in sorted(all_tasks, key=lambda x: x.task_key if x.task_key is not None else ""): - _task_key = task.task_key if task.task_key is not None else "" - _cluster_details = None - if task.new_cluster: - _cluster_details = ClusterDetails.from_dict(task.new_cluster.as_dict()) - if self._needs_compatibility_check(task.new_cluster): - task_failures.append("no data security mode specified") - if task.existing_cluster_id: - _cluster_details = all_clusters_by_id.get(task.existing_cluster_id, None) - if _cluster_details: - task_failures.extend(self._check_cluster_failures(_cluster_details, _task_key)) - + self._check_run_task(submit_run.tasks, all_clusters_by_id, task_failures) # v2.0 API, without tasks elif submit_run.cluster_spec: _cluster_details = ClusterDetails.from_dict(submit_run.cluster_spec.as_dict()) @@ -324,7 +319,6 @@ def 
_assess_job_runs(self, submit_runs: Iterable[BaseRun], all_clusters_by_id) - runs_per_hash[hashed_id].append(submit_run.run_id) else: runs_per_hash[hashed_id] = [submit_run.run_id] - result[hashed_id] = SubmitRunInfo( run_ids=json.dumps(runs_per_hash[hashed_id]), hashed_id=hashed_id, @@ -332,3 +326,16 @@ def _assess_job_runs(self, submit_runs: Iterable[BaseRun], all_clusters_by_id) - ) return list(result.values()) + + def _check_run_task(self, all_tasks: list[RunTask], clusters: dict[str, ClusterDetails], task_failures: list[str]): + for task in sorted(all_tasks, key=lambda x: x.task_key if x.task_key is not None else ""): + _task_key = task.task_key if task.task_key is not None else "" + cluster_details = None + if task.new_cluster: + cluster_details = ClusterDetails.from_dict(task.new_cluster.as_dict()) + if self._needs_compatibility_check(task.new_cluster): + task_failures.append("no data security mode specified") + if task.existing_cluster_id: + cluster_details = clusters.get(task.existing_cluster_id, None) + if cluster_details: + task_failures.extend(self._check_cluster_failures(cluster_details, _task_key)) diff --git a/src/databricks/labs/ucx/assessment/pipelines.py b/src/databricks/labs/ucx/assessment/pipelines.py index ab872a5e89..244081cad2 100644 --- a/src/databricks/labs/ucx/assessment/pipelines.py +++ b/src/databricks/labs/ucx/assessment/pipelines.py @@ -50,22 +50,24 @@ def _assess_pipelines(self, all_pipelines) -> Iterable[PipelineInfo]: pipeline_config = pipeline_response.spec.configuration if pipeline_config: failures.extend(self._check_spark_conf(pipeline_config, "pipeline")) - pipeline_cluster = pipeline_response.spec.clusters - if pipeline_cluster: - for cluster in pipeline_cluster: - if cluster.spark_conf: - failures.extend(self._check_spark_conf(cluster.spark_conf, "pipeline cluster")) - # Checking if cluster config is present in cluster policies - if cluster.policy_id: - failures.extend(self._check_cluster_policy(cluster.policy_id, "pipeline cluster")) - if cluster.init_scripts: - failures.extend(self._check_cluster_init_script(cluster.init_scripts, "pipeline cluster")) - + clusters = pipeline_response.spec.clusters + if clusters: + self._pipeline_clusters(clusters, failures) pipeline_info.failures = json.dumps(failures) if len(failures) > 0: pipeline_info.success = 0 yield pipeline_info + def _pipeline_clusters(self, clusters, failures): + for cluster in clusters: + if cluster.spark_conf: + failures.extend(self._check_spark_conf(cluster.spark_conf, "pipeline cluster")) + # Checking if cluster config is present in cluster policies + if cluster.policy_id: + failures.extend(self._check_cluster_policy(cluster.policy_id, "pipeline cluster")) + if cluster.init_scripts: + failures.extend(self._check_cluster_init_script(cluster.init_scripts, "pipeline cluster")) + def snapshot(self) -> Iterable[PipelineInfo]: return self._snapshot(self._try_fetch, self._crawl) diff --git a/src/databricks/labs/ucx/cli.py b/src/databricks/labs/ucx/cli.py index c3dd643111..ea687ab88e 100644 --- a/src/databricks/labs/ucx/cli.py +++ b/src/databricks/labs/ucx/cli.py @@ -15,6 +15,7 @@ from databricks.labs.ucx.aws.credentials import IamRoleMigration from databricks.labs.ucx.azure.access import AzureResourcePermissions from databricks.labs.ucx.azure.credentials import ServicePrincipalMigration +from databricks.labs.ucx.azure.locations import ExternalLocationsMigration from databricks.labs.ucx.config import WorkspaceConfig from databricks.labs.ucx.framework.crawlers import StatementExecutionBackend 
from databricks.labs.ucx.hive_metastore import ExternalLocations, TablesCrawler @@ -362,5 +363,30 @@ def create_uber_principal(w: WorkspaceClient, subscription_id: str): return +@ucx.command +def migrate_locations(w: WorkspaceClient, aws_profile: str | None = None): + """This command creates UC external locations. The candidate locations to be created are extracted from the guess_external_locations + task in the assessment job. You can run the validate_external_locations command to check the candidate locations. Please make sure + the credentials have been migrated before running this command. The command will only create the locations that have corresponding UC Storage Credentials. + """ + if w.config.is_azure: + logger.info("Running migrate_locations for Azure") + installation = Installation.current(w, 'ucx') + service_principal_migration = ExternalLocationsMigration.for_cli(w, installation) + service_principal_migration.run() + if w.config.is_aws: + logger.info("Running migrate_locations for AWS") + if not shutil.which("aws"): + logger.error("Couldn't find AWS CLI in path. Please install the CLI from https://aws.amazon.com/cli/") + return + installation = Installation.current(w, 'ucx') + config = installation.load(WorkspaceConfig) + sql_backend = StatementExecutionBackend(w, config.warehouse_id) + aws_permissions = AWSResourcePermissions.for_cli(w, sql_backend, aws_profile, config.inventory_database) + aws_permissions.create_external_locations() + if w.config.is_gcp: + logger.error("migrate_locations is not yet supported in GCP") + + if __name__ == "__main__": ucx() diff --git a/src/databricks/labs/ucx/config.py b/src/databricks/labs/ucx/config.py index 54796f5874..74b35c17d5 100644 --- a/src/databricks/labs/ucx/config.py +++ b/src/databricks/labs/ucx/config.py @@ -41,6 +41,9 @@ class WorkspaceConfig: # pylint: disable=too-many-instance-attributes + # Flag to see if terraform has been used for deploying certain entities + is_terraform_used: bool = False + def replace_inventory_variable(self, text: str) -> str: return text.replace("$inventory", f"hive_metastore.{self.inventory_database}") diff --git a/src/databricks/labs/ucx/framework/dashboards.py b/src/databricks/labs/ucx/framework/dashboards.py index 64652717e5..565b4847be 100644 --- a/src/databricks/labs/ucx/framework/dashboards.py +++ b/src/databricks/labs/ucx/framework/dashboards.py @@ -126,14 +126,17 @@ def validate(self): dashboard_folders = [f for f in step_folder.glob("*") if f.is_dir()] # Create separate dashboards per step, represented as second-level folders for dashboard_folder in dashboard_folders: - dashboard_ref = f"{step_folder.stem}_{dashboard_folder.stem}".lower() - for query in self._desired_queries(dashboard_folder, dashboard_ref): - try: - self._get_viz_options(query) - self._get_widget_options(query) - except Exception as err: - msg = f"Error in {query.name}: {err}" - raise AssertionError(msg) from err + self._validate_folder(dashboard_folder, step_folder) + + def _validate_folder(self, dashboard_folder, step_folder): + dashboard_ref = f"{step_folder.stem}_{dashboard_folder.stem}".lower() + for query in self._desired_queries(dashboard_folder, dashboard_ref): + try: + self._get_viz_options(query) + self._get_widget_options(query) + except Exception as err: + msg = f"Error in {query.name}: {err}" + raise AssertionError(msg) from err def _install_widget(self, query: SimpleQuery, dashboard_ref: str): dashboard_id = 
self._state.dashboards[dashboard_ref] diff --git a/src/databricks/labs/ucx/hive_metastore/locations.py b/src/databricks/labs/ucx/hive_metastore/locations.py index 8de19a8501..9efb23f218 100644 --- a/src/databricks/labs/ucx/hive_metastore/locations.py +++ b/src/databricks/labs/ucx/hive_metastore/locations.py @@ -7,6 +7,7 @@ from databricks.labs.blueprint.installation import Installation from databricks.sdk import WorkspaceClient +from databricks.sdk.service.catalog import ExternalLocationInfo from databricks.labs.ucx.framework.crawlers import CrawlerBase, SqlBackend from databricks.labs.ucx.framework.utils import escape_sql_identifier @@ -39,23 +40,27 @@ def _external_locations(self, tables: list[Row], mounts) -> Iterable[ExternalLoc external_locations: list[ExternalLocation] = [] for table in tables: location = table.location - if location is not None and len(location) > 0: - if location.startswith("dbfs:/mnt"): - for mount in mounts: - if location[5:].startswith(mount.name.lower()): - location = location[5:].replace(mount.name, mount.source) - break - if ( - not location.startswith("dbfs") - and (self._prefix_size[0] < location.find(":/") < self._prefix_size[1]) - and not location.startswith("jdbc") - ): - self._dbfs_locations(external_locations, location, min_slash) - if location.startswith("jdbc"): - self._add_jdbc_location(external_locations, location, table) - + if not location: + continue + if location.startswith("dbfs:/mnt"): + location = self._resolve_mount(location, mounts) + if ( + not location.startswith("dbfs") + and (self._prefix_size[0] < location.find(":/") < self._prefix_size[1]) + and not location.startswith("jdbc") + ): + self._dbfs_locations(external_locations, location, min_slash) + if location.startswith("jdbc"): + self._add_jdbc_location(external_locations, location, table) return external_locations + def _resolve_mount(self, location, mounts): + for mount in mounts: + if location[5:].startswith(mount.name.lower()): + location = location[5:].replace(mount.name, mount.source) + break + return location + @staticmethod def _dbfs_locations(external_locations, location, min_slash): dupe = False @@ -161,31 +166,33 @@ def _get_ext_location_definitions(self, missing_locations: list[ExternalLocation return tf_script def match_table_external_locations(self) -> tuple[dict[str, int], list[ExternalLocation]]: - uc_external_locations = list(self._ws.external_locations.list()) + existing_locations = list(self._ws.external_locations.list()) table_locations = self.snapshot() - matching_locations = {} + matching_locations: dict[str, int] = {} missing_locations = [] for table_loc in table_locations: # external_location.list returns url without trailing "/" but ExternalLocation.snapshot # does so removing the trailing slash before comparing - matched = False - for uc_loc in uc_external_locations: - if not uc_loc.url: - continue - if not uc_loc.name: - continue - uc_loc_path = uc_loc.url.lower() - if uc_loc_path in table_loc.location.rstrip("/").lower(): - if uc_loc.name not in matching_locations: - matching_locations[uc_loc.name] = table_loc.table_count - else: - matching_locations[uc_loc.name] = matching_locations[uc_loc.name] + table_loc.table_count - matched = True - break - if not matched: + if not self._match_existing(table_loc, matching_locations, existing_locations): missing_locations.append(table_loc) return matching_locations, missing_locations + @staticmethod + def _match_existing(table_loc, matching_locations: dict[str, int], existing_locations: list[ExternalLocationInfo]): + 
for uc_loc in existing_locations: + if not uc_loc.url: + continue + if not uc_loc.name: + continue + uc_loc_path = uc_loc.url.lower() + if uc_loc_path in table_loc.location.rstrip("/").lower(): + if uc_loc.name not in matching_locations: + matching_locations[uc_loc.name] = table_loc.table_count + else: + matching_locations[uc_loc.name] = matching_locations[uc_loc.name] + table_loc.table_count + return True + return False + def save_as_terraform_definitions_on_workspace(self, installation: Installation): matching_locations, missing_locations = self.match_table_external_locations() if len(matching_locations) > 0: diff --git a/src/databricks/labs/ucx/hive_metastore/table_migrate.py b/src/databricks/labs/ucx/hive_metastore/table_migrate.py index f87e6276cd..03d71093e2 100644 --- a/src/databricks/labs/ucx/hive_metastore/table_migrate.py +++ b/src/databricks/labs/ucx/hive_metastore/table_migrate.py @@ -93,12 +93,18 @@ def _migrate_view(self, src_table: Table, rule: Rule): self._backend.execute(src_table.sql_alter_from(rule.as_uc_table_key, self._ws.get_workspace_id())) return True - def _init_seen_tables(self): + def _iter_schemas(self): for catalog in self._ws.catalogs.list(): - for schema in self._ws.schemas.list(catalog_name=catalog.name): - for table in self._ws.tables.list(catalog_name=catalog.name, schema_name=schema.name): - if table.properties is not None and "upgraded_from" in table.properties: - self._seen_tables[table.full_name.lower()] = table.properties["upgraded_from"].lower() + yield from self._ws.schemas.list(catalog_name=catalog.name) + + def _init_seen_tables(self): + for schema in self._iter_schemas(): + for table in self._ws.tables.list(catalog_name=schema.catalog_name, schema_name=schema.name): + if table.properties is None: + continue + if "upgraded_from" not in table.properties: + continue + self._seen_tables[table.full_name.lower()] = table.properties["upgraded_from"].lower() def _table_already_upgraded(self, target) -> bool: return target in self._seen_tables @@ -391,23 +397,24 @@ def _reapply_grants(self, from_table_name, to_table_name, *, target_view: bool = except NotFound: logger.warning(f"removed on the backend {from_table_name}") return - if grants.privilege_assignments is not None: - logger.info(f"Applying grants on table {to_table_name}") - grants_changes = [] - for permission in grants.privilege_assignments: - if not permission.privileges: - continue - if not target_view: - grants_changes.append(PermissionsChange(list(permission.privileges), permission.principal)) - continue - privileges = set() - for privilege in permission.privileges: - if privilege != Privilege.MODIFY: - privileges.add(privilege) - if privileges: - grants_changes.append(PermissionsChange(list(privileges), permission.principal)) - - self._ws.grants.update(SecurableType.TABLE, to_table_name, changes=grants_changes) + if not grants.privilege_assignments: + return + logger.info(f"Applying grants on table {to_table_name}") + grants_changes = [] + for permission in grants.privilege_assignments: + if not permission.privileges: + continue + if not target_view: + grants_changes.append(PermissionsChange(list(permission.privileges), permission.principal)) + continue + privileges = set() + for privilege in permission.privileges: + if privilege != Privilege.MODIFY: + privileges.add(privilege) + if privileges: + grants_changes.append(PermissionsChange(list(privileges), permission.principal)) + + self._ws.grants.update(SecurableType.TABLE, to_table_name, changes=grants_changes) def _recreate_table(self, 
from_table_name, to_table_name): create_sql = str(next(self._backend.fetch(f"SHOW CREATE TABLE {from_table_name}"))[0]) diff --git a/src/databricks/labs/ucx/hive_metastore/tables.py b/src/databricks/labs/ucx/hive_metastore/tables.py index 40798b9634..27a21e50fe 100644 --- a/src/databricks/labs/ucx/hive_metastore/tables.py +++ b/src/databricks/labs/ucx/hive_metastore/tables.py @@ -88,14 +88,15 @@ def is_dbfs_root(self) -> bool: if not self.location: return False for prefix in self.DBFS_ROOT_PREFIXES: - if self.location.startswith(prefix): - for exception in self.DBFS_ROOT_PREFIX_EXCEPTIONS: - if self.location.startswith(exception): - return False - for db_datasets in self.DBFS_DATABRICKS_DATASETS: - if self.location.startswith(db_datasets): - return False - return True + if not self.location.startswith(prefix): + continue + for exception in self.DBFS_ROOT_PREFIX_EXCEPTIONS: + if self.location.startswith(exception): + return False + for db_datasets in self.DBFS_DATABRICKS_DATASETS: + if self.location.startswith(db_datasets): + return False + return True return False @property diff --git a/src/databricks/labs/ucx/hive_metastore/udfs.py b/src/databricks/labs/ucx/hive_metastore/udfs.py index 224a3d4a16..504b3dcbb2 100644 --- a/src/databricks/labs/ucx/hive_metastore/udfs.py +++ b/src/databricks/labs/ucx/hive_metastore/udfs.py @@ -67,21 +67,26 @@ def _crawl(self) -> Iterable[Udf]: # "target schema is not in the current catalog" self._exec(f"USE CATALOG {escape_sql_identifier(catalog)};") for (database,) in self._all_databases(): - try: - logger.debug(f"[{catalog}.{database}] listing udfs") - for (udf,) in self._fetch( - f"SHOW USER FUNCTIONS FROM {escape_sql_identifier(catalog)}.{escape_sql_identifier(database)};" - ): - if udf.startswith(f"{catalog}.{database}"): - udf_name = udf[udf.rfind(".") + 1 :] # remove catalog and database info from the name - tasks.append(partial(self._describe, catalog, database, udf_name)) - except Unknown as err: - logger.error(f"Problem with {database}: {err}") + for task in self._collect_tasks(catalog, database): + tasks.append(task) catalog_tables, errors = Threads.gather(f"listing udfs in {catalog}", tasks) if len(errors) > 0: logger.error(f"Detected {len(errors)} while scanning udfs in {catalog}") return catalog_tables + def _collect_tasks(self, catalog, database) -> Iterable[partial[Udf | None]]: + try: + logger.debug(f"[{catalog}.{database}] listing udfs") + for (udf,) in self._fetch( + f"SHOW USER FUNCTIONS FROM {escape_sql_identifier(catalog)}.{escape_sql_identifier(database)};" + ): + if not udf.startswith(f"{catalog}.{database}"): + continue + udf_name = udf[udf.rfind(".") + 1 :] # remove catalog and database info from the name + yield partial(self._describe, catalog, database, udf_name) + except Unknown as err: + logger.error(f"Problem with {database}: {err}") + def _describe(self, catalog: str, database: str, udf: str) -> Udf | None: """Fetches metadata like udf type, input, returns, data access and body if specified for a specific udf within the given catalog and database. 
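Most of the Python refactoring above follows one recipe, driven by the stricter `max-nested-blocks = 3` pylint limit set in `pyproject.toml`: invert conditions into guard clauses that `continue` early, and hoist inner loops into helper generators that the caller consumes. A minimal sketch of the pattern, using toy names rather than the actual UCX classes:

```python
from collections.abc import Iterable


def _collect_names(prefix: str, rows: list[str]) -> Iterable[str]:
    # Helper generator: one loop level lives here instead of in the caller.
    for row in rows:
        if not row.startswith(prefix):
            continue  # guard clause replaces an `if` nesting level
        yield row.removeprefix(prefix)


def collect_all(prefixes: list[str], rows: list[str]) -> list[str]:
    names: list[str] = []
    for prefix in prefixes:
        # Delegating to the helper keeps this body at most two blocks deep.
        names.extend(_collect_names(prefix, rows))
    return names


print(collect_all(["a."], ["a.x", "b.y"]))  # -> ['x']
```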
diff --git a/src/databricks/labs/ucx/install.py b/src/databricks/labs/ucx/install.py index 2b3b482876..d6a84e1dcb 100644 --- a/src/databricks/labs/ucx/install.py +++ b/src/databricks/labs/ucx/install.py @@ -6,6 +6,7 @@ import sys import time import webbrowser +from collections.abc import Callable from dataclasses import replace from datetime import datetime, timedelta from pathlib import Path @@ -16,6 +17,7 @@ from databricks.labs.blueprint.installer import InstallState from databricks.labs.blueprint.parallel import ManyError, Threads from databricks.labs.blueprint.tui import Prompts +from databricks.labs.blueprint.upgrades import Upgrades from databricks.labs.blueprint.wheels import ProductInfo, WheelsV2, find_project_root from databricks.sdk import WorkspaceClient from databricks.sdk.errors import ( # pylint: disable=redefined-builtin @@ -176,54 +178,65 @@ def __init__(self, prompts: Prompts, installation: Installation, ws: WorkspaceCl self._installation = installation self._prompts = prompts - def run(self): + def run( + self, + verify_timeout=timedelta(minutes=2), + sql_backend_factory: Callable[[WorkspaceConfig], SqlBackend] | None = None, + wheel_builder_factory: Callable[[], WheelsV2] | None = None, + ): logger.info(f"Installing UCX v{PRODUCT_INFO.version()}") config = self.configure() - sql_backend = StatementExecutionBackend(self._ws, config.warehouse_id) - wheels = WheelsV2(self._installation, PRODUCT_INFO) + if not sql_backend_factory: + sql_backend_factory = self._new_sql_backend + if not wheel_builder_factory: + wheel_builder_factory = self._new_wheel_builder workspace_installation = WorkspaceInstallation( config, self._installation, - sql_backend, - wheels, + sql_backend_factory(config), + wheel_builder_factory(), self._ws, self._prompts, - verify_timeout=timedelta(minutes=2), + verify_timeout=verify_timeout, ) - workspace_installation.run() + try: + workspace_installation.run() + except ManyError as err: + if len(err.errs) == 1: + raise err.errs[0] from None + raise err + + def _new_wheel_builder(self): + return WheelsV2(self._installation, PRODUCT_INFO) + + def _new_sql_backend(self, config: WorkspaceConfig) -> SqlBackend: + return StatementExecutionBackend(self._ws, config.warehouse_id) def configure(self) -> WorkspaceConfig: try: - return self._installation.load(WorkspaceConfig) + config = self._installation.load(WorkspaceConfig) + self._apply_upgrades() + return config except NotFound as err: logger.debug(f"Cannot find previous installation: {err}") + return self._configure_new_installation() + + def _apply_upgrades(self): + try: + upgrades = Upgrades(PRODUCT_INFO, self._installation) + upgrades.apply(self._ws) + except NotFound as err: + logger.warning(f"Installed version is too old: {err}") + return + + def _configure_new_installation(self) -> WorkspaceConfig: logger.info("Please answer a couple of questions to configure Unity Catalog migration") HiveMetastoreLineageEnabler(self._ws).apply(self._prompts) inventory_database = self._prompts.question( "Inventory Database stored in hive_metastore", default="ucx", valid_regex=r"^\w+$" ) - def warehouse_type(_): - return _.warehouse_type.value if not _.enable_serverless_compute else "SERVERLESS" - - pro_warehouses = {"[Create new PRO SQL warehouse]": "create_new"} | { - f"{_.name} ({_.id}, {warehouse_type(_)}, {_.state.value})": _.id - for _ in self._ws.warehouses.list() - if _.warehouse_type == EndpointInfoWarehouseType.PRO - } - warehouse_id = self._prompts.choice_from_dict( - "Select PRO or SERVERLESS SQL warehouse to run 
assessment dashboards on", pro_warehouses - ) - if warehouse_id == "create_new": - new_warehouse = self._ws.warehouses.create( - name=f"{WAREHOUSE_PREFIX} {time.time_ns()}", - spot_instance_policy=SpotInstancePolicy.COST_OPTIMIZED, - warehouse_type=CreateWarehouseRequestWarehouseType.PRO, - cluster_size="Small", - max_num_clusters=1, - ) - warehouse_id = new_warehouse.id - + warehouse_id = self._configure_warehouse() configure_groups = ConfigureGroups(self._prompts) configure_groups.run() log_level = self._prompts.question("Log level", default="INFO").upper() @@ -269,6 +282,29 @@ def warehouse_type(_): webbrowser.open(ws_file_url) return config + def _configure_warehouse(self): + def warehouse_type(_): + return _.warehouse_type.value if not _.enable_serverless_compute else "SERVERLESS" + + pro_warehouses = {"[Create new PRO SQL warehouse]": "create_new"} | { + f"{_.name} ({_.id}, {warehouse_type(_)}, {_.state.value})": _.id + for _ in self._ws.warehouses.list() + if _.warehouse_type == EndpointInfoWarehouseType.PRO + } + warehouse_id = self._prompts.choice_from_dict( + "Select PRO or SERVERLESS SQL warehouse to run assessment dashboards on", pro_warehouses + ) + if warehouse_id == "create_new": + new_warehouse = self._ws.warehouses.create( + name=f"{WAREHOUSE_PREFIX} {time.time_ns()}", + spot_instance_policy=SpotInstancePolicy.COST_OPTIMIZED, + warehouse_type=CreateWarehouseRequestWarehouseType.PRO, + cluster_size="Small", + max_num_clusters=1, + ) + warehouse_id = new_warehouse.id + return warehouse_id + @staticmethod def _policy_config(value: str): return {"type": "fixed", "value": value} @@ -370,7 +406,7 @@ def __init__( @classmethod def current(cls, ws: WorkspaceClient): - installation = Installation.current(ws, PRODUCT_INFO.product_name()) + installation = PRODUCT_INFO.current_installation(ws) config = installation.load(WorkspaceConfig) sql_backend = StatementExecutionBackend(ws, config.warehouse_id) wheels = WheelsV2(installation, PRODUCT_INFO) diff --git a/src/databricks/labs/ucx/mixins/sql.py b/src/databricks/labs/ucx/mixins/sql.py index 27503627fa..b8079f07a9 100644 --- a/src/databricks/labs/ucx/mixins/sql.py +++ b/src/databricks/labs/ucx/mixins/sql.py @@ -222,7 +222,7 @@ def execute_fetch_all( result_data = execute_response.result if result_data is None: return - while True: + while True: # pylint: disable=too-many-nested-blocks data_array = result_data.data_array if not data_array: data_array = [] diff --git a/src/databricks/labs/ucx/upgrades/v0.4.0_added_log_dir.py b/src/databricks/labs/ucx/upgrades/v0.4.0_added_log_dir.py new file mode 100644 index 0000000000..5a79f99263 --- /dev/null +++ b/src/databricks/labs/ucx/upgrades/v0.4.0_added_log_dir.py @@ -0,0 +1,11 @@ +# pylint: disable=invalid-name,unused-argument +import logging + +from databricks.labs.blueprint.installation import Installation +from databricks.sdk import WorkspaceClient + +logger = logging.getLogger(__name__) + + +def upgrade(installation: Installation, ws: WorkspaceClient): + installation.upload('logs/README.md', b'# This folder contains logs from UCX workflows') diff --git a/src/databricks/labs/ucx/workspace_access/listing.py b/src/databricks/labs/ucx/workspace_access/listing.py index 26af491090..eb640d7995 100644 --- a/src/databricks/labs/ucx/workspace_access/listing.py +++ b/src/databricks/labs/ucx/workspace_access/listing.py @@ -82,28 +82,25 @@ def walk(self, start_path="/"): logger.warning(f"removed on the backend {start_path}") return self.results self.results.append(root_object) - with 
ThreadPoolExecutor(self._num_threads) as executor: initial_future = executor.submit(self._list_and_analyze, root_object) initial_future.add_done_callback(self._progress_report) futures_to_objects = {initial_future: root_object} while futures_to_objects: futures_done, _ = wait(futures_to_objects, return_when=FIRST_COMPLETED) - for future in futures_done: futures_to_objects.pop(future) directories, others = future.result() self.results.extend(directories) self.results.extend(others) - - if directories: - new_futures = {} - for directory in directories: - new_future = executor.submit(self._list_and_analyze, directory) - new_future.add_done_callback(self._progress_report) - new_futures[new_future] = directory - futures_to_objects.update(new_futures) - + if not directories: + continue + new_futures = {} + for directory in directories: + new_future = executor.submit(self._list_and_analyze, directory) + new_future.add_done_callback(self._progress_report) + new_futures[new_future] = directory + futures_to_objects.update(new_futures) logger.info( f"Recursive WorkspaceFS listing finished at {dt.datetime.now()}. " f"Total time taken for workspace listing: {dt.datetime.now() - self.start_time}" diff --git a/tests/integration/azure/test_credentials.py b/tests/integration/azure/test_credentials.py index b12b7e2c4d..4308bdeadb 100644 --- a/tests/integration/azure/test_credentials.py +++ b/tests/integration/azure/test_credentials.py @@ -109,7 +109,7 @@ def inner( @retried(on=[InternalError], timeout=timedelta(minutes=2)) def test_spn_migration_existed_storage_credential(extract_test_info, make_storage_credential_spn, run_migration): # create a storage credential for this test - make_storage_credential( + make_storage_credential_spn( credential_name=extract_test_info.credential_name, application_id=extract_test_info.application_id, client_secret=extract_test_info.client_secret, diff --git a/tests/unit/assessment/test_aws.py b/tests/unit/assessment/test_aws.py index af27c723ca..4f599eb42b 100644 --- a/tests/unit/assessment/test_aws.py +++ b/tests/unit/assessment/test_aws.py @@ -1,4 +1,5 @@ import logging +from unittest import mock from unittest.mock import MagicMock, call, create_autospec import pytest @@ -6,6 +7,11 @@ from databricks.sdk import WorkspaceClient from databricks.sdk.errors import ResourceDoesNotExist from databricks.sdk.service import iam +from databricks.sdk.service.catalog import ( + AwsIamRole, + ExternalLocationInfo, + StorageCredentialInfo, +) from databricks.sdk.service.compute import InstanceProfile from databricks.labs.ucx.assessment.aws import ( @@ -986,3 +992,76 @@ def test_get_uc_compatible_roles(): }, ], ) + + +def test_create_external_locations(): + ws = create_autospec(WorkspaceClient) + aws = create_autospec(AWSResources) + installation = MockInstallation() + installation.load = MagicMock() + installation.load.return_value = [ + AWSRoleAction("arn:aws:iam::12345:role/uc-role1", "s3", "WRITE_FILES", "s3://BUCKET1/*"), + AWSRoleAction("arn:aws:iam::12345:role/uc-role1", "s3", "WRITE_FILES", "s3://BUCKET2/*"), + AWSRoleAction("arn:aws:iam::12345:role/uc-rolex", "s3", "WRITE_FILES", "s3://BUCKETX/FOLDERX"), + ] + rows = { + "external_locations": [["s3://BUCKET1/FOLDER1", 1], ["s3://BUCKET2/FOLDER2", 1], ["s3://BUCKETX/FOLDERX", 1]] + } + ws.storage_credentials.list.return_value = [ + StorageCredentialInfo( + id="1", + name="cred1", + aws_iam_role=AwsIamRole("arn:aws:iam::12345:role/uc-role1"), + ), + StorageCredentialInfo( + id="2", + name="credx", + 
aws_iam_role=AwsIamRole("arn:aws:iam::12345:role/uc-rolex"), + ), + ] + errors = {} + backend = MockBackend(rows=rows, fails_on_first=errors) + aws_resource_permissions = AWSResourcePermissions(installation, ws, backend, aws, "ucx") + aws_resource_permissions.create_external_locations() + calls = [ + call(mock.ANY, 's3://BUCKET1/FOLDER1', 'cred1'), + call(mock.ANY, 's3://BUCKET2/FOLDER2', 'cred1'), + call(mock.ANY, 's3://BUCKETX/FOLDERX', 'credx'), + ] + ws.external_locations.create.assert_has_calls(calls, any_order=True) + + +def test_create_external_locations_skip_existing(): + ws = create_autospec(WorkspaceClient) + aws = create_autospec(AWSResources) + installation = MockInstallation() + installation.load = MagicMock() + installation.load.return_value = [ + AWSRoleAction("arn:aws:iam::12345:role/uc-role1", "s3", "WRITE_FILES", "s3://BUCKET1/*"), + AWSRoleAction("arn:aws:iam::12345:role/uc-rolex", "s3", "WRITE_FILES", "s3://BUCKETX/FOLDERX"), + ] + rows = {"external_locations": [["s3://BUCKET1/FOLDER1", 1], ["s3://BUCKETX/FOLDERX", 1]]} + ws.storage_credentials.list.return_value = [ + StorageCredentialInfo( + id="1", + name="cred1", + aws_iam_role=AwsIamRole("arn:aws:iam::12345:role/uc-role1"), + ), + StorageCredentialInfo( + id="2", + name="credx", + aws_iam_role=AwsIamRole("arn:aws:iam::12345:role/uc-rolex"), + ), + ] + ws.external_locations.list.return_value = [ + ExternalLocationInfo(name="UCX_FOO_1", url="s3://BUCKETX/FOLDERX", credential_name="credx"), + ] + + errors = {} + backend = MockBackend(rows=rows, fails_on_first=errors) + aws_resource_permissions = AWSResourcePermissions(installation, ws, backend, aws, "ucx") + aws_resource_permissions.create_external_locations(location_init="UCX_FOO") + calls = [ + call("UCX_FOO_2", 's3://BUCKET1/FOLDER1', 'cred1'), + ] + ws.external_locations.create.assert_has_calls(calls, any_order=True) diff --git a/tests/unit/azure/test_credentials.py b/tests/unit/azure/test_credentials.py index eb2863d598..db72b716e3 100644 --- a/tests/unit/azure/test_credentials.py +++ b/tests/unit/azure/test_credentials.py @@ -38,8 +38,6 @@ from databricks.labs.ucx.hive_metastore import ExternalLocations from tests.unit import DEFAULT_CONFIG -from tests.unit import DEFAULT_CONFIG - @pytest.fixture def ws(): @@ -49,7 +47,8 @@ def ws(): @pytest.fixture def installation(): return MockInstallation( - DEFAULT_CONFIG | { + DEFAULT_CONFIG + | { "azure_storage_account_info.csv": [ { 'prefix': 'prefix1', diff --git a/tests/unit/test_install.py b/tests/unit/test_install.py index 14b9e5a7d6..76e55ae096 100644 --- a/tests/unit/test_install.py +++ b/tests/unit/test_install.py @@ -1339,3 +1339,56 @@ def test_open_config(ws, mocker, mock_installation): install.configure() webbrowser_open.assert_called_with('https://localhost/#workspace~/mock/config.yml') + + +def test_runs_upgrades_on_too_old_version(ws, any_prompt): + existing_installation = MockInstallation( + { + 'state.json': {'resources': {'dashboards': {'assessment_main': 'abc'}}}, + 'config.yml': { + 'inventory_database': 'x', + 'warehouse_id': 'abc', + 'connect': {'host': '...', 'token': '...'}, + }, + } + ) + install = WorkspaceInstaller(any_prompt, existing_installation, ws) + + sql_backend = MockBackend() + wheels = create_autospec(WheelsV2) + + # TODO: (HariGS-DB) remove this, once added the policy upgrade + # TODO: fix along https://github.com/databrickslabs/ucx/issues/1012 + with pytest.raises(InvalidParameterValue): + install.run( + verify_timeout=timedelta(seconds=1), + sql_backend_factory=lambda _: sql_backend, 
+ wheel_builder_factory=lambda: wheels, + ) + + +def test_runs_upgrades_on_more_recent_version(ws, any_prompt): + existing_installation = MockInstallation( + { + 'version.json': {'version': '0.3.0', 'wheel': '...', 'date': '...'}, + 'state.json': {'resources': {'dashboards': {'assessment_main': 'abc'}}}, + 'config.yml': { + 'inventory_database': 'x', + 'warehouse_id': 'abc', + 'policy_id': 'abc', # TODO: (HariGS-DB) remove this, once added the policy upgrade + 'connect': {'host': '...', 'token': '...'}, + }, + } + ) + install = WorkspaceInstaller(any_prompt, existing_installation, ws) + + sql_backend = MockBackend() + wheels = create_autospec(WheelsV2) + + install.run( + verify_timeout=timedelta(seconds=1), + sql_backend_factory=lambda _: sql_backend, + wheel_builder_factory=lambda: wheels, + ) + + existing_installation.assert_file_uploaded('logs/README.md')
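The new `_apply_upgrades` hook in `install.py` wires `databricks-labs-blueprint`'s `Upgrades` framework into `configure()`: scripts in the `upgrades` package encode in their file names the version that introduced them, and `Upgrades(PRODUCT_INFO, installation).apply(ws)` applies the ones newer than the version recorded in the installation's `version.json`. Modeled on the `v0.4.0_added_log_dir.py` script added above, a future migration would plausibly look like the sketch below (the file name and uploaded payload are hypothetical):

```python
# src/databricks/labs/ucx/upgrades/v0.15.0_example_note.py -- hypothetical file name
# pylint: disable=invalid-name,unused-argument
import logging

from databricks.labs.blueprint.installation import Installation
from databricks.sdk import WorkspaceClient

logger = logging.getLogger(__name__)


def upgrade(installation: Installation, ws: WorkspaceClient):
    # Applied at most once per installation, when the installed version
    # predates the version encoded in this module's file name.
    logger.info("Uploading an example note to the installation folder")
    installation.upload('notes/README.md', b'# hypothetical upgrade payload')
```

As `test_runs_upgrades_on_more_recent_version` shows, the injectable `sql_backend_factory` and `wheel_builder_factory` arguments on `WorkspaceInstaller.run` exist precisely so such upgrade paths can be exercised with a `MockBackend` and a stubbed `WheelsV2` instead of a live workspace.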