Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let create-catalogs-schemas reuse MigrateGrants so that it applies group renaming #2955

Merged
merged 54 commits into from
Oct 16, 2024

Conversation

JCZuurmond
Copy link
Member

@JCZuurmond JCZuurmond commented Oct 14, 2024

Changes

Let create-catalogs-schemas reuse MigrateGrants so that it applies group renaming and eliminate duplicate code.

Linked issues

Resolves #2934
Resolves #2932

Functionality

  • modified existing command: databricks labs ucx create-catalogs-schemas
  • modified existing workflow: migrate-tables (*functionally no change)

Tests

  • manually tested
  • added unit tests
  • added integration tests

@JCZuurmond JCZuurmond added bug Something isn't working migrate/access-control Access Control to things labels Oct 14, 2024
@JCZuurmond JCZuurmond self-assigned this Oct 14, 2024
catalogs_existing, schemas_existing = self._catalogs_schemas_from_unity_catalog()
catalogs, schemas = self._catalogs_schemas_from_table_mapping()
for src_schema, dst_catalog in catalogs:
if dst_catalog in catalogs_existing:
Copy link
Member Author

@JCZuurmond JCZuurmond Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nfx and @FastLee : This introduces the question on "do we want to skip applying ACLs when the catalog exists?"

Before this PR:

  • ACLs is still applied to an existing catalog if a new schema will be created within that catalog
  • ACLs will be skipped for an existing schema.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FastLee to give final opinion. i think we should apply to existing catalog, but i don't feel strongly about it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaning to applying all ACLs present also if the catalog exist. Then, provide an option to the user to skip the ACLs alltogether. So either all or no ACLs are applied

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would skip.
Let's just alert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip what exactly? The ones for the catalog if the catalog exists, the ones for the schema if the schema exits, the ones for the catalog if the schema exits? All ACLs if any schema or catalog exists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently implemented that all legacy grants are migrated to UC and created an issue to make this part optional.

@nfx nfx requested a review from FastLee October 14, 2024 15:54
@JCZuurmond JCZuurmond force-pushed the fix/reuse-migrate-grants-in-create-catalogs-schemas branch 2 times, most recently from c634d3f to bb382f7 Compare October 15, 2024 13:23


@dataclass(frozen=True)
class Catalog:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this data structure seems to be used only in one place. why do we need to create whole new modules for those? also, it really is tuple[str,str] = ("CATALOG", name)

might be a good idea to co-locate next to Table in src/databricks/labs/ucx/hive_metastore/tables.py. make sure to implement only the required interface methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to a separate module to avoid circular imports. After introducing a protocol in the grants module (see comment below) this was not an issue anymore. Decided to move this to the catalog_schema module as this is closer to the pattern that we more often use for hive_metastore/ modules, i.e. the object being crawled in a module is represented by a dataclass defined in the same module.

Comment on lines 116 to 120
match self.action_type:
case "OWN":
return 1
case _:
return 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match self.action_type:
case "OWN":
return 1
case _:
return 0
if self.action_type == "OWN":
return 1
return 0

why do we need all this syntactic sugar to replace a simple if?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I like to be fancy and show off 💁‍♂️

It is copied from:

@staticmethod
def _library_dep_order(library: str):
match library:
case library if 'sdk' in library:
return 0
case library if 'blueprint' in library:
return 1
case library if 'sqlglot' in library:
return 2
case _:
return 3
.

Though, with just two values 0 and 1 it maybe does not matter

src/databricks/labs/ucx/hive_metastore/grants.py Outdated Show resolved Hide resolved
@@ -765,16 +782,14 @@ def _grants(self) -> list[Grant]:
grants.append(grant)
return grants

def _match_grants(self, table: Table) -> list[Grant]:
def _match_grants(self, src: Catalog | Schema | Table) -> list[Grant]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 22 to 34
@retried(on=[NotFound], timeout=timedelta(seconds=20))
def get_catalog(ws: WorkspaceClient, name: str) -> CatalogInfo:
return ws.catalogs.get(name)


@retried(on=[NotFound], timeout=timedelta(seconds=20))
def get_schema(ws: WorkspaceClient, full_name: str) -> SchemaInfo:
return ws.schemas.get(full_name)


@retried(on=[NotFound], timeout=timedelta(seconds=20))
def get_schema_permissions_list(ws: WorkspaceClient, full_name: str) -> PermissionsList:
return ws.grants.get(SecurableType.SCHEMA, full_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this logic down to CatalogSchema class, so that you handle eventual consistency in production code, not in the test code. remember - CREATE-READ-READ is the pattern used in Terraform to handle these situations.

e.g. this makes the wrong thing more robust :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the changes to the CatalogSchema class.

Note that I went for a (READ-)CREATE-READ pattern:

  • READ before CREATE to check if resource already exists.
  • One READ after create because all settings can be set during the CREATE which is not always the case for Terraform resources.

Also note that I kept get_schema_permissions_list as this uses a different API for CREATE (with SQL) and READ (with SDK/RestAPI). For now, I do not think it is worth it to implement a mapping between these two APIs to introduce the CREATE-READ pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@JCZuurmond JCZuurmond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@@ -0,0 +1,36 @@
from dataclasses import dataclass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nfx: Have a look at this module.

Notes:

  • Needed to be a separate module to avoid circular imports
  • Maybe we want to introduce a parent object securable object (later), to have a common interface for securable objects
  • Note some inconsistencies across this code base and including the sdk:
    • SchemaInfo(catalog_name="catalog") vs Schema(catalog="catalog"). I choose the later to be consistent with the already existing Table class
    • Should we have a Schema or Database?
    • When do we use full_name and when key?
    • More?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to be a separate module to avoid circular imports

Not a problem anymore after using a Protocol, #2955 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the above as code comments

@property
def as_uc_table(self) -> Table:
return Table(self.catalog_name, self.dst_schema, self.dst_table)

@property
def as_uc_table_key(self) -> str:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but would try to avoid using this key and use the Table data class instead

@@ -53,8 +53,8 @@ class Table: # pylint: disable=too-many-public-methods
catalog: str
database: str
name: str
object_type: str
table_format: str
object_type: str | None = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note change on this data type to support Rule.as_uc_table

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will cause table migration issues. can we get without this change?

schema_name=src_schema.name,
external_csv=make_mounted_location,
)
dst_catalog = make_catalog()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FastLee : I removed this to test create a new catalog. Please review if the test is correct after my changes



def prepare_test(ws, backend: MockBackend | None = None) -> CatalogSchema: # pylint: disable=too-complex
"""Prepare tests with the following setup:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a clean up of the prepare_tests to verify it is testing what I expect it does

@@ -737,20 +752,22 @@ def __init__(
self._group_manager = group_manager
self._grant_loaders = grant_loaders

def apply(self, src: Table, uc_table_key: str) -> bool:
def apply(self, src: Catalog | Schema | Table, dst: Catalog | Schema | Table) -> bool:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type change allows to match on a Catalog, Schema and Table not just table key.

@JCZuurmond JCZuurmond marked this pull request as ready for review October 15, 2024 13:45
@JCZuurmond JCZuurmond requested a review from a team as a code owner October 15, 2024 13:45
Copy link

github-actions bot commented Oct 15, 2024

✅ 52/52 passed, 4 skipped, 1h55m4s total

Running from acceptance #6813

Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last change and we're good to merge

src/databricks/labs/ucx/hive_metastore/catalog_schema.py Outdated Show resolved Hide resolved
@JCZuurmond JCZuurmond requested a review from nfx October 16, 2024 11:15
@JCZuurmond JCZuurmond force-pushed the fix/reuse-migrate-grants-in-create-catalogs-schemas branch from 7fda78f to c5069a9 Compare October 16, 2024 11:28
@@ -53,8 +53,8 @@ class Table: # pylint: disable=too-many-public-methods
catalog: str
database: str
name: str
object_type: str
table_format: str
object_type: str | None = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will cause table migration issues. can we get without this change?

@JCZuurmond JCZuurmond force-pushed the fix/reuse-migrate-grants-in-create-catalogs-schemas branch from 2cb3ca2 to 4a1aa19 Compare October 16, 2024 13:52
@nfx nfx force-pushed the fix/reuse-migrate-grants-in-create-catalogs-schemas branch from d9b2954 to a7559ce Compare October 16, 2024 15:44
@nfx nfx temporarily deployed to account-admin October 16, 2024 15:44 — with GitHub Actions Inactive
@nfx nfx merged commit fe06262 into main Oct 16, 2024
6 of 7 checks passed
@nfx nfx deleted the fix/reuse-migrate-grants-in-create-catalogs-schemas branch October 16, 2024 15:47
nfx added a commit that referenced this pull request Oct 16, 2024
* Added DBFS Root resolution when HMS Federation is enabled ([#2947](#2947)). This commit introduces a DBFS resolver for use with HMS (Hive Metastore) federation, enabling accurate resolution of DBFS root locations when HMS federation is enabled. A new `_resolve_dbfs_root()` class method is added to the `MountsCrawler` class, and a boolean argument `enable_hms_federation` is included in the `MountsCrawler` constructor, providing better handling of federation functionality. The commit also adds a test function, `test_resolve_dbfs_root_in_hms_federation`, to validate the resolution of DBFS roots with HMS federation. The test covers special cases, such as the `/user/hive/metastore` path, and utilizes `LocationTrie` for more accurate location guessing. These changes aim to improve the overall DBFS root resolution when using HMS federation.
* Added `jax-jumpy` to known list ([#2959](#2959)). In this release, we have added the `jax-jumpy` package to the list of known packages in our system. `jax-jumpy` is a Python-based numerical computation library, which includes modules such as `jumpy`, `jumpy._base_fns`, `jumpy.core`, `jumpy.lax`, `jumpy.numpy`, `jumpy.numpy._factory_fns`, `jumpy.numpy._transform_data`, `jumpy.numpy._types`, `jumpy.numpy.linalg`, `jumpy.ops`, and `jumpy.random`. These modules are now recognized by our system, which partially resolves issue [#1931](#1931), which may have been caused by the integration of the `jax-jumpy` package. Engineers can now utilize the capabilities of this library in their numerical computations.
* Added `joblibspark` to known list ([#2960](#2960)). In this release, we have added support for the `joblibspark` library in our system by updating the `known.json` file, which keeps track of various libraries and their associated components. This change is a part of the resolution for issue [#1931](#1931) and includes new elements such as `doc`, `doc.conf`, `joblibspark`, `joblibspark.backend`, and `joblibspark.utils`. These additions enable the system to recognize and manage the new components related to `joblibspark`, allowing for improved compatibility and functionality.
* Added `jsonpatch` to known list ([#2969](#2969)). In this release, we have added `jsonpatch` to the list of known libraries in the `known.json` file. Jsonpatch is a library used for applying JSON patches, which allow for partial updates to a JSON document. By including jsonpatch in the known list, developers can now easily utilize its functionality for JSON patching, and any necessary dependencies will be handled automatically. This change partially addresses issue [#1931](#1931), which may have been caused by the use or integration of jsonpatch. We encourage developers to take advantage of this new addition to enhance their code and efficiently make partial updates to JSON documents.
* Added `langchain-community` to known list ([#2970](#2970)). A new entry for `langchain-community` has been added to the configuration file for known language chain components in this release. This entry includes several sub-components such as 'langchain_community.agents', 'langchain_community.callbacks', 'langchain_community.chat_loaders', 'langchain_community.chat_message_histories', 'langchain_community.chat_models', 'langchain_community.cross_encoders', 'langchain_community.docstore', 'langchain_community.document_compressors', 'langchain_community.document_loaders', 'langchain_community.document_transformers', 'langchain_community.embeddings', 'langchain_community.example_selectors', 'langchain_community.graph_vectorstores', 'langchain_community.graphs', 'langchain_community.indexes', 'langchain_community.llms', 'langchain_community.memory', 'langchain_community.output_parsers', 'langchain_community.query_constructors', 'langchain_community.retrievers', 'langchain_community.storage', 'langchain_community.tools', 'langchain_community.utilities', and 'langchain_community.utils'. Currently, these sub-components are empty and have no additional configuration or code. This change partially resolves issue [#1931](#1931), but the specifics of the issue and how these components will be used are still unclear.
* Added `langcodes` to known list ([#2971](#2971)). A new `langcodes` library has been added to the project, addressing part of issue [#1931](#1931). This library includes several modules that provide functionalities related to language codes and their manipulation, including `langcodes`, `langcodes.build_data`, `langcodes.data_dicts`, `langcodes.language_distance`, `langcodes.language_lists`, `langcodes.registry_parser`, `langcodes.tag_parser`, and `langcodes.util`. Additionally, the memory-efficient trie (prefix tree) data structure library, `marisa-trie`, has been included in the known list. It is important to note that no existing functionality has been altered in this commit.
* Addressing Ownership Conflict when creating catalog/schemas ([#2956](#2956)). This release introduces new functionality to handle ownership conflicts during catalog/schema creation in our open-source library. The `_apply_from_legacy_table_acls` method has been enhanced with two loops to address non-own grants and own grants separately. This ensures proper handling of ownership conflicts by generating and executing UC grant SQL for each grant type, with appropriate exceptions. Additionally, a new helper function, `this_type_and_key()`, has been added to improve code readability. The release also introduces new methods, GrantsCrawler and Rule, in the hive_metastore package of the labs.ucx module, responsible for populating views and mapping source and destination objects. The test_catalog_schema.py file has been updated to include tests for creating catalogs and schemas with legacy ACLs, utilizing the new Rule method and GrantsCrawler. Issue [#2932](#2932) has been addressed with these changes, which include adding new methods and updating existing tests for hive_metastore.
* Clarify `skip` and `unskip` commands work on views ([#2962](#2962)). In this release, the `skip` and `unskip` commands in the databricks labs UCX tool have been updated to clarify their functionality on views and to make it more explicit with the addition of the `--view` flag. These commands allow users to skip or unskip certain schemas, tables, or views during the table migration process. This is useful for temporarily disabling migration of a particular schema, table, or view. Unit tests have been added to ensure the correct behavior of these commands when working with views. Two new methods have been added to test the behavior of the `unskip` command when a schema or table is specified, and two additional methods test the behavior of the `unskip` command when a view or no schema is specified. Finally, two methods test that an error message is logged when both the `--table` and `--view` flags are specified.
* Fixed issue with migrating MANAGED hive_metastore table to UC ([#2928](#2928)). This commit addresses an issue with migrating Hive Metastore (HMS) MANAGED tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table can result in data loss. To prevent this, a new option `CONVERT_TO_EXTERNAL` has been added to the `migrate_tables` method for migrating managed tables to UC as external, ensuring that the HMS managed table is converted to an external table in HMS and UC, and protecting against data loss when deleting a managed table that has been migrated to UC as external. Additionally, new caching properties have been added for better performance, and existing methods have been modified to handle the migration of managed tables to UC as external. Tests, including unit and integration tests, have been added to ensure the proper functioning of these changes. It is important to note that changing MANAGED tables to EXTERNAL can have potential consequences on regulatory data cleanup, and the impact of this change should be carefully validated for existing workloads.
* Let `create-catalogs-schemas` reuse `MigrateGrants` so that it applies group renaming ([#2955](#2955)). The `create-catalogs-schemas` command in the `databricks labs ucx` package has been enhanced to reuse the `MigrateGrants` function, enabling group renaming and eliminating redundant code. The `migrate-tables` workflow remains functionally the same. Changes include modifying the `CatalogSchema` class to accept a `migrate_grants` argument, introducing new `Catalog` and `Schema` dataclasses, and updating various methods in the `hive_metastore` module. Unit and integration tests have been added and manually verified to ensure proper functionality. The `MigrateGrants` class has been updated to accept two `SecurableObject` arguments and sort matched grants. The `from_src_dst` function in `mapping.py` now includes a new `as_uc_table` method and updates to `as_uc_table_key`. Addressing issues [#2934](#2934), [#2932](#2932), and [#2955](#2955), the changes also include a new `key` property for the `tables.py` file, and updates to the `test_create_catalogs_schemas` and `test_migrate_tables` test functions.
* Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)). A update has been made to the sqlglot requirement in the pyproject.toml file, changing the version range from allowing versions 25.5.0 and later, but less than 25.25, to a range that allows for versions 25.5.0 and later, but less than 25.26. This change was implemented to allow for the latest version of sqlglot compatible with the project's requirements. The commit message includes a detailed changelog for sqlglot version 25.25.0, highlighting breaking changes, new features, bug fixes, and refactors. The commits included in the pull request are also outlined in the message, providing a comprehensive overview of the updates made.
* Use `LocationTrie` to infer a list of UC external locations ([#2965](#2965)). This pull request introduces the `LocationTrie` class to improve the efficiency of determining external locations in UC storage. The `LocationTrie` class, implemented as a dataclass, maintains a hierarchical tree structure to store location paths and offers methods for inserting, finding, and checking if a table exists in the trie. It also refactors the existing `_parse_location` method and adds the `_parse_url` method to handle specific cases of JDBC URLs. The `find` method and `_external_locations` method are updated to use the new `LocationTrie` class, ensuring a more efficient way of determining overlapping storage prefixes. Additionally, the test file for external locations has been modified to include new URL formats and correct previously incorrect ones, enhancing compatibility with various URL formats. Overall, these changes are instrumental in preparing the codebase for future federated SQL connections by optimizing the determination of external locations and improving compatibility with diverse URL formats.
* Warn when table has column without no name during table migration ([#2984](#2984)). In this release, we have introduced changes to the table migration feature in the databricks project to address issue [#2891](#2891). We have renamed the `_migrate_table` method in the `TableMigration` class to `_safe_migrate_table` and added a new `_migrate_table` method that includes a try-except block to catch a specific Spark AnalysisException caused by an invalid column name. If this exception is caught, a warning is logged, and the function returns False. This change ensures that the migration process continues even when it encounters a table with a column without a name, generating a warning instead of raising an error. Additionally, we have added unit tests to verify the new functionality, including the introduction of a new mock backend `MockBackendWithGeneralException` to raise a general exception for testing purposes and a new test case `test_migrate_tables_handles_table_with_empty_column` to validate the warning message generated when encountering a table with an empty column during migration.

Dependency updates:

 * Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)).
@nfx nfx mentioned this pull request Oct 16, 2024
nfx added a commit that referenced this pull request Oct 16, 2024
* Added DBFS Root resolution when HMS Federation is enabled
([#2947](#2947)). This
commit introduces a DBFS resolver for use with HMS (Hive Metastore)
federation, enabling accurate resolution of DBFS root locations when HMS
federation is enabled. A new `_resolve_dbfs_root()` class method is
added to the `MountsCrawler` class, and a boolean argument
`enable_hms_federation` is included in the `MountsCrawler` constructor,
providing better handling of federation functionality. The commit also
adds a test function, `test_resolve_dbfs_root_in_hms_federation`, to
validate the resolution of DBFS roots with HMS federation. The test
covers special cases, such as the `/user/hive/metastore` path, and
utilizes `LocationTrie` for more accurate location guessing. These
changes aim to improve the overall DBFS root resolution when using HMS
federation.
* Added `jax-jumpy` to known list
([#2959](#2959)). In this
release, we have added the `jax-jumpy` package to the list of known
packages in our system. `jax-jumpy` is a Python-based numerical
computation library, which includes modules such as `jumpy`,
`jumpy._base_fns`, `jumpy.core`, `jumpy.lax`, `jumpy.numpy`,
`jumpy.numpy._factory_fns`, `jumpy.numpy._transform_data`,
`jumpy.numpy._types`, `jumpy.numpy.linalg`, `jumpy.ops`, and
`jumpy.random`. These modules are now recognized by our system, which
partially resolves issue
[#1931](#1931), which may
have been caused by the integration of the `jax-jumpy` package.
Engineers can now utilize the capabilities of this library in their
numerical computations.
* Added `joblibspark` to known list
([#2960](#2960)). In this
release, we have added support for the `joblibspark` library in our
system by updating the `known.json` file, which keeps track of various
libraries and their associated components. This change is a part of the
resolution for issue
[#1931](#1931) and includes
new elements such as `doc`, `doc.conf`, `joblibspark`,
`joblibspark.backend`, and `joblibspark.utils`. These additions enable
the system to recognize and manage the new components related to
`joblibspark`, allowing for improved compatibility and functionality.
* Added `jsonpatch` to known list
([#2969](#2969)). In this
release, we have added `jsonpatch` to the list of known libraries in the
`known.json` file. Jsonpatch is a library used for applying JSON
patches, which allow for partial updates to a JSON document. By
including jsonpatch in the known list, developers can now easily utilize
its functionality for JSON patching, and any necessary dependencies will
be handled automatically. This change partially addresses issue
[#1931](#1931), which may
have been caused by the use or integration of jsonpatch. We encourage
developers to take advantage of this new addition to enhance their code
and efficiently make partial updates to JSON documents.
* Added `langchain-community` to known list
([#2970](#2970)). A new
entry for `langchain-community` has been added to the configuration file
for known language chain components in this release. This entry includes
several sub-components such as 'langchain_community.agents',
'langchain_community.callbacks', 'langchain_community.chat_loaders',
'langchain_community.chat_message_histories',
'langchain_community.chat_models', 'langchain_community.cross_encoders',
'langchain_community.docstore',
'langchain_community.document_compressors',
'langchain_community.document_loaders',
'langchain_community.document_transformers',
'langchain_community.embeddings',
'langchain_community.example_selectors',
'langchain_community.graph_vectorstores', 'langchain_community.graphs',
'langchain_community.indexes', 'langchain_community.llms',
'langchain_community.memory', 'langchain_community.output_parsers',
'langchain_community.query_constructors',
'langchain_community.retrievers', 'langchain_community.storage',
'langchain_community.tools', 'langchain_community.utilities', and
'langchain_community.utils'. Currently, these sub-components are empty
and have no additional configuration or code. This change partially
resolves issue
[#1931](#1931), but the
specifics of the issue and how these components will be used are still
unclear.
* Added `langcodes` to known list
([#2971](#2971)). A new
`langcodes` library has been added to the project, addressing part of
issue [#1931](#1931). This
library includes several modules that provide functionalities related to
language codes and their manipulation, including `langcodes`,
`langcodes.build_data`, `langcodes.data_dicts`,
`langcodes.language_distance`, `langcodes.language_lists`,
`langcodes.registry_parser`, `langcodes.tag_parser`, and
`langcodes.util`. Additionally, the memory-efficient trie (prefix tree)
data structure library, `marisa-trie`, has been included in the known
list. It is important to note that no existing functionality has been
altered in this commit.
* Addressing Ownership Conflict when creating catalog/schemas
([#2956](#2956)). This
release introduces new functionality to handle ownership conflicts
during catalog/schema creation in our open-source library. The
`_apply_from_legacy_table_acls` method has been enhanced with two loops
to address non-own grants and own grants separately. This ensures proper
handling of ownership conflicts by generating and executing UC grant SQL
for each grant type, with appropriate exceptions. Additionally, a new
helper function, `this_type_and_key()`, has been added to improve code
readability. The release also introduces new methods, GrantsCrawler and
Rule, in the hive_metastore package of the labs.ucx module, responsible
for populating views and mapping source and destination objects. The
test_catalog_schema.py file has been updated to include tests for
creating catalogs and schemas with legacy ACLs, utilizing the new Rule
method and GrantsCrawler. Issue
[#2932](#2932) has been
addressed with these changes, which include adding new methods and
updating existing tests for hive_metastore.
* Clarify `skip` and `unskip` commands work on views
([#2962](#2962)). In this
release, the `skip` and `unskip` commands in the databricks labs UCX
tool have been updated to clarify their functionality on views and to
make it more explicit with the addition of the `--view` flag. These
commands allow users to skip or unskip certain schemas, tables, or views
during the table migration process. This is useful for temporarily
disabling migration of a particular schema, table, or view. Unit tests
have been added to ensure the correct behavior of these commands when
working with views. Two new methods have been added to test the behavior
of the `unskip` command when a schema or table is specified, and two
additional methods test the behavior of the `unskip` command when a view
or no schema is specified. Finally, two methods test that an error
message is logged when both the `--table` and `--view` flags are
specified.
* Fixed issue with migrating MANAGED hive_metastore table to UC
([#2928](#2928)). This
commit addresses an issue with migrating Hive Metastore (HMS) MANAGED
tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table
can result in data loss. To prevent this, a new option
`CONVERT_TO_EXTERNAL` has been added to the `migrate_tables` method for
migrating managed tables to UC as external, ensuring that the HMS
managed table is converted to an external table in HMS and UC, and
protecting against data loss when deleting a managed table that has been
migrated to UC as external. Additionally, new caching properties have
been added for better performance, and existing methods have been
modified to handle the migration of managed tables to UC as external.
Tests, including unit and integration tests, have been added to ensure
the proper functioning of these changes. It is important to note that
changing MANAGED tables to EXTERNAL can have potential consequences on
regulatory data cleanup, and the impact of this change should be
carefully validated for existing workloads.
* Let `create-catalogs-schemas` reuse `MigrateGrants` so that it applies
group renaming
([#2955](#2955)). The
`create-catalogs-schemas` command in the `databricks labs ucx` package
has been enhanced to reuse the `MigrateGrants` function, enabling group
renaming and eliminating redundant code. The `migrate-tables` workflow
remains functionally the same. Changes include modifying the
`CatalogSchema` class to accept a `migrate_grants` argument, introducing
new `Catalog` and `Schema` dataclasses, and updating various methods in
the `hive_metastore` module. Unit and integration tests have been added
and manually verified to ensure proper functionality. The
`MigrateGrants` class has been updated to accept two `SecurableObject`
arguments and sort matched grants. The `from_src_dst` function in
`mapping.py` now includes a new `as_uc_table` method and updates to
`as_uc_table_key`. Addressing issues
[#2934](#2934),
[#2932](#2932), and
[#2955](#2955), the changes
also include a new `key` property for the `tables.py` file, and updates
to the `test_create_catalogs_schemas` and `test_migrate_tables` test
functions.
* Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26
([#2968](#2968)). A update
has been made to the sqlglot requirement in the pyproject.toml file,
changing the version range from allowing versions 25.5.0 and later, but
less than 25.25, to a range that allows for versions 25.5.0 and later,
but less than 25.26. This change was implemented to allow for the latest
version of sqlglot compatible with the project's requirements. The
commit message includes a detailed changelog for sqlglot version
25.25.0, highlighting breaking changes, new features, bug fixes, and
refactors. The commits included in the pull request are also outlined in
the message, providing a comprehensive overview of the updates made.
* Use `LocationTrie` to infer a list of UC external locations
([#2965](#2965)). This pull
request introduces the `LocationTrie` class to improve the efficiency of
determining external locations in UC storage. The `LocationTrie` class,
implemented as a dataclass, maintains a hierarchical tree structure to
store location paths and offers methods for inserting, finding, and
checking if a table exists in the trie. It also refactors the existing
`_parse_location` method and adds the `_parse_url` method to handle
specific cases of JDBC URLs. The `find` method and `_external_locations`
method are updated to use the new `LocationTrie` class, ensuring a more
efficient way of determining overlapping storage prefixes. Additionally,
the test file for external locations has been modified to include new
URL formats and correct previously incorrect ones, enhancing
compatibility with various URL formats. Overall, these changes are
instrumental in preparing the codebase for future federated SQL
connections by optimizing the determination of external locations and
improving compatibility with diverse URL formats.
* Warn when table has column without no name during table migration
([#2984](#2984)). In this
release, we have introduced changes to the table migration feature in
the databricks project to address issue
[#2891](#2891). We have
renamed the `_migrate_table` method in the `TableMigration` class to
`_safe_migrate_table` and added a new `_migrate_table` method that
includes a try-except block to catch a specific Spark AnalysisException
caused by an invalid column name. If this exception is caught, a warning
is logged, and the function returns False. This change ensures that the
migration process continues even when it encounters a table with a
column without a name, generating a warning instead of raising an error.
Additionally, we have added unit tests to verify the new functionality,
including the introduction of a new mock backend
`MockBackendWithGeneralException` to raise a general exception for
testing purposes and a new test case
`test_migrate_tables_handles_table_with_empty_column` to validate the
warning message generated when encountering a table with an empty column
during migration.

Dependency updates:

* Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26
([#2968](#2968)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working migrate/access-control Access Control to things ready to merge this pull request is ready to merge
Projects
None yet
3 participants