Skip listing built-in catalogs to update table migration process #3464

Merged
6 commits merged into main from fix/skip-listing-internal-catalogs
Dec 20, 2024

Conversation

@JCZuurmond JCZuurmond commented Dec 20, 2024

Changes

Skip listing built-in catalogs to update table migration process

Linked issues

Resolves #3462

Functionality

  • modified existing workflow: migrate-tables

Tests

  • added unit tests
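
A minimal sketch of the idea behind this change, assuming catalogs carry a securable kind that marks them as built-in. The `SecurableKind` enum, the `Catalog` dataclass, the `SKIP_KINDS` set and `iter_user_catalogs` are hypothetical stand-ins for illustration; the actual list of skipped kinds is truncated in this PR's diff and is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, Optional


class SecurableKind(Enum):
    """Hypothetical stand-in for the SDK's CatalogInfoSecurableKind enum."""

    STANDARD = "CATALOG_STANDARD"
    INTERNAL = "CATALOG_INTERNAL"
    SYSTEM = "CATALOG_SYSTEM"


@dataclass(frozen=True)
class Catalog:
    """Hypothetical stand-in for a catalog listing entry."""

    name: str
    securable_kind: Optional[SecurableKind] = None


# Assumption: these kinds mark built-in catalogs that hold no migrated tables.
SKIP_KINDS = frozenset({SecurableKind.INTERNAL, SecurableKind.SYSTEM})


def iter_user_catalogs(catalogs: Iterable[Catalog]) -> Iterator[Catalog]:
    """Yield only catalogs whose kind is not in SKIP_KINDS."""
    for catalog in catalogs:
        if catalog.securable_kind in SKIP_KINDS:
            continue  # built-in catalog: do not list its schemas or tables
        yield catalog


catalogs = [
    Catalog("main", SecurableKind.STANDARD),
    Catalog("system", SecurableKind.SYSTEM),
    Catalog("internal", SecurableKind.INTERNAL),
    Catalog("legacy"),  # no kind reported: kept, since it is not known to be built-in
]
print([c.name for c in iter_user_catalogs(catalogs)])
```

Filtering at the catalog level means the schemas and tables of a skipped catalog are never listed at all, which is where the saving would come from.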

@JCZuurmond JCZuurmond added the feat/migration-index mapping of databases to catalog or potentially other databases label Dec 20, 2024
@JCZuurmond JCZuurmond self-assigned this Dec 20, 2024
@JCZuurmond JCZuurmond requested a review from a team as a code owner December 20, 2024 14:40
@@ -79,6 +79,11 @@ class TableMigrationStatusRefresher(CrawlerBase[TableMigrationStatus]):
properties for the presence of the marker.
"""

_skip_catalogs_with_securable_kinds = [

@FastLee FastLee left a comment

LGTM

✅ 55/55 passed, 4 skipped, 2h14m4s total

Running from acceptance #7811

@JCZuurmond JCZuurmond added this pull request to the merge queue Dec 20, 2024
Merged via the queue into main with commit 3f6da0d Dec 20, 2024
7 checks passed
@JCZuurmond JCZuurmond deleted the fix/skip-listing-internal-catalogs branch December 20, 2024 16:00
gueniai added a commit that referenced this pull request Dec 23, 2024
* Added dashboard crawlers ([#3397](#3397)). The open-source library has been updated with new dashboard crawlers for the assessment workflow, Redash migration, and QueryLinter. These crawlers are responsible for crawling and persisting dashboards, as well as migrating or reverting them during Redash migration. They also lint the queries of the crawled dashboards using QueryLinter. This change resolves issues [#3366](#3366) and [#3367](#3367), and progresses [#2854](#2854). The 'databricks labs ucx {migrate-dbsql-dashboards|revert-dbsql-dashboards}' command and the `assessment` workflow have been modified to incorporate these new features. Unit tests and integration tests have been added to ensure proper functionality of the new dashboard crawlers. Additionally, two new tables, _$inventory_.redash_dashboards and _$inventory_.lakeview_dashboards, have been introduced to hold a list of all Redash or Lakeview dashboards and are used by the `QueryLinter` and `Redash` migration. These changes improve the assessment, migration, and linting processes for dashboards in the library.
* DBFS Root Support for HMS Federation ([#3425](#3425)). The commit `DBFS Root Support for HMS Federation` introduces changes to support the DBFS root location for HMS federation. A new method, `external_locations_with_root`, is added to the `ExternalLocations` class to return a list of external locations including the DBFS root location. This method is used in various functions and test cases, such as `test_create_uber_principal_no_storage`, `test_create_uc_role_multiple_raises_error`, `test_create_uc_no_roles`, `test_save_spn_permissions`, and `test_create_access_connectors_for_storage_accounts`, to ensure that the DBFS root location is correctly identified and tested in different scenarios. Additionally, the `external_locations.snapshot.return_value` is changed to `external_locations.external_locations_with_root.return_value` in test functions `test_create_federated_catalog` and `test_already_existing_connection` to retrieve a list of external locations including the DBFS root location. This commit closes issue [#3406](#3406), which was related to this functionality. Overall, these changes improve the handling and testing of DBFS root location in HMS federation.
* Log message as error when legacy permissions API is enabled/disabled depending on the workflow ran ([#3443](#3443)). In this release, logging behavior has been updated in several methods in the 'workflows.py' file. When the `use_legacy_permission_migration` configuration is set to False and specific conditions are met, error messages are now logged instead of info messages for the methods 'verify_metastore_attached', 'rename_workspace_local_groups', 'reflect_account_groups_on_workspace', 'apply_permissions_to_account_groups', 'apply_permissions', and 'validate_groups_permissions'. This change is intended to address issue [#3388](#3388) and provides clearer guidance to users when the legacy permissions API is not functioning as expected. Users will now see an error message advising them to run the `migrate-groups` job or set `use_legacy_permission_migration` to True in the config.yml file. These updates will help ensure smoother workflow runs and more accurate logging for better troubleshooting.
* MySQL External HMS Support for HMS Federation ([#3385](#3385)). This commit adds support for MySQL-based Hive Metastore (HMS) in HMS Federation, enhances the CLI for creating a federated catalog, and improves external HMS functionality. It introduces a new parameter `enable_hms_federation` in the `Locations` class constructor, allowing users to enable or disable MySQL-based HMS federation. The `external_locations` method in `application.py` now accepts `enable_hms_federation` as a parameter, enabling more granular control of the federation feature. Additionally, the CLI for creating a federated catalog has been updated to accept a `prompts` parameter, providing more flexibility. The commit also introduces a new dataclass `ExternalHmsInfo` for external HMS connection information and updates the `HiveMetastoreFederationEnabler` and `HiveMetastoreFederation` classes to support non-Glue external metastores. Furthermore, it adds methods to handle the creation of a Federated Catalog from the command-line interface, split JDBC URLs, and manage external connections and permissions.
* Skip listing built-in catalogs to update table migration process ([#3464](#3464)). In this release, the process for updating the table migration status skips built-in catalogs. The `_iter_schemas` method of the existing `TableMigrationStatusRefresher` class, which inherits from `CrawlerBase`, now filters out built-in catalogs and schemas when listing catalogs and schemas, so they are no longer processed during the table migration process. Additionally, the `get_seen_tables` method has been updated to include checks for `schema.name` and `schema.catalog_name`, and the `_crawl` and `_try_fetch` methods have been modified to reflect changes in the `TableMigrationStatus` constructor. The release also modifies the existing `migrate-tables` workflow and adds unit tests that demonstrate the exclusion of built-in catalogs during the table migration status update. The test case uses the `CatalogInfoSecurableKind` enumeration to specify the kind of catalog and verifies that the seen tables only include the non-builtin catalogs.
* Updated databricks-sdk requirement from <0.39,>=0.38 to >=0.39,<0.40 ([#3434](#3434)). In this release, the requirement for the `databricks-sdk` package has been updated in the pyproject.toml file to be greater than or equal to 0.39 and less than 0.40, allowing the 0.39 series of the package while excluding version 0.40 and later. This change is based on the release notes and changelog for version 0.39 of the package, which includes bug fixes, internal changes, and API changes such as the addition of the `cleanrooms` package, delete() method for workspace-level services, and fields for various request and response objects. The commit history for the package is also provided. Dependabot has been configured to resolve any conflicts with this PR and can be manually triggered to perform various actions as needed. Additionally, Dependabot can be used to ignore specific dependency versions or close the PR.
* Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 ([#3456](#3456)). In this pull request, the version range of the `databricks-sdk` dependency has been updated from '<0.40,>=0.39' to '>=0.39,<0.41', allowing the use of the latest version of the `databricks-sdk` while ensuring that it is less than 0.41. The pull request also includes release notes detailing the API changes in version 0.40.0, such as the addition of new fields to various compute, dashboard, job, and pipeline services. A changelog is provided, outlining the bug fixes, internal changes, new features, and improvements in versions 0.39.0, 0.40.0, and 0.38.0. A list of commits is also included, showing the development progress of these versions.
* Use LTS Databricks runtime version ([#3459](#3459)). This release introduces a change in the Databricks runtime version to a Long-Term Support (LTS) release to address issues encountered during the migration to external tables. The previous runtime version caused the `convert to external table` migration strategy to fail, and this change serves as a temporary solution. The `migrate-tables` workflow has been modified, and existing integration tests have been reused to ensure functionality. The `test_job_cluster_policy` function now uses the LTS version instead of the latest version, ensuring a specified Spark version for the cluster policy. The function also checks for matching node type ID, Spark version, and necessary resources. However, users may still encounter problems with the latest UCX release. The `_convert_hms_table_to_external` method in the `table_migrate.py` file has been updated to return a boolean value, with a new TODO comment about a possible failure with Databricks runtime 16.0 due to a JDK update.
* Use `CREATE_FOREIGN_CATALOG` instead of `CREATE_FOREIGN_SECURABLE` with HMS federation enablement commands ([#3309](#3309)). A change has been made to update the `databricks-sdk` dependency version from `>=0.38,<0.39` to `>=0.39` in the `pyproject.toml` file, which may affect the project's functionality related to the `databricks-sdk` library. In the Hive Metastore Federation codebase, `CREATE_FOREIGN_CATALOG` is now used instead of `CREATE_FOREIGN_SECURABLE` for HMS federation enablement commands, aligned with issue [#3308](#3308). The `_add_missing_permissions_if_needed` method has been updated to check for `CREATE_FOREIGN_SECURABLE` instead of `CREATE_FOREIGN_CATALOG` when granting permissions. Additionally, a unit test file for HiveMetastore Federation has been updated to reflect the use of `CREATE_FOREIGN_SECURABLE` in the import statements and test functions, although this change is limited to the test file and does not affect production code. Thorough testing is recommended after applying this update to ensure that the project functions as expected and to benefit from the potential security improvements associated with the updated privilege handling.

Dependency updates:

 * Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 ([#3456](#3456)).
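
To illustrate what a range such as `>=0.39,<0.41` admits, here is a simplified sketch. The helpers `parse_version` and `satisfies` are hypothetical and handle only plain dotted numeric versions; real installers follow the full Python version specifier rules, including pre-releases, which this sketch ignores.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as '0.40.2' into (0, 40, 2)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str = "0.39", upper: str = "0.41") -> bool:
    """Check lower <= version < upper using tuple comparison."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


for candidate in ("0.38.9", "0.39.0", "0.40.2", "0.41.0"):
    print(candidate, satisfies(candidate))
```

Under these assumptions the 0.39 and 0.40 series are accepted, while 0.38.x and 0.41.0 or later are rejected.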
@gueniai gueniai mentioned this pull request Dec 23, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 23, 2024
Labels
feat/migration-index mapping of databases to catalog or potentially other databases
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG]: TableMigrationStatusRefresher lists system catalog