-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added DBFS Root resolution when HMS Federation is enabled #2947
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
3 tasks
nfx
force-pushed
the
feat/hms-fed-dbfs
branch
from
October 16, 2024 13:09
e4ad964
to
55a9637
Compare
nfx
force-pushed
the
feat/hms-fed-dbfs
branch
from
October 16, 2024 13:37
55a9637
to
b2f1739
Compare
✅ 51/51 passed, 5 flaky, 4 skipped, 2h8m46s total Flaky tests:
Running from acceptance #6810 |
nfx
changed the title
Resolve DBFS Root with HMS Federation
Added DBFS Root resolution when HMS Federation is enabled
Oct 16, 2024
nfx
added a commit
that referenced
this pull request
Oct 16, 2024
* Added DBFS Root resolution when HMS Federation is enabled ([#2947](#2947)). This commit introduces a DBFS resolver for use with HMS (Hive Metastore) federation, enabling accurate resolution of DBFS root locations when HMS federation is enabled. A new `_resolve_dbfs_root()` class method is added to the `MountsCrawler` class, and a boolean argument `enable_hms_federation` is included in the `MountsCrawler` constructor, providing better handling of federation functionality. The commit also adds a test function, `test_resolve_dbfs_root_in_hms_federation`, to validate the resolution of DBFS roots with HMS federation. The test covers special cases, such as the `/user/hive/metastore` path, and utilizes `LocationTrie` for more accurate location guessing. These changes aim to improve the overall DBFS root resolution when using HMS federation. * Added `jax-jumpy` to known list ([#2959](#2959)). In this release, we have added the `jax-jumpy` package to the list of known packages in our system. `jax-jumpy` is a Python-based numerical computation library, which includes modules such as `jumpy`, `jumpy._base_fns`, `jumpy.core`, `jumpy.lax`, `jumpy.numpy`, `jumpy.numpy._factory_fns`, `jumpy.numpy._transform_data`, `jumpy.numpy._types`, `jumpy.numpy.linalg`, `jumpy.ops`, and `jumpy.random`. These modules are now recognized by our system, which partially resolves issue [#1931](#1931), which may have been caused by the integration of the `jax-jumpy` package. Engineers can now utilize the capabilities of this library in their numerical computations. * Added `joblibspark` to known list ([#2960](#2960)). In this release, we have added support for the `joblibspark` library in our system by updating the `known.json` file, which keeps track of various libraries and their associated components. This change is a part of the resolution for issue [#1931](#1931) and includes new elements such as `doc`, `doc.conf`, `joblibspark`, `joblibspark.backend`, and `joblibspark.utils`. These additions enable the system to recognize and manage the new components related to `joblibspark`, allowing for improved compatibility and functionality. * Added `jsonpatch` to known list ([#2969](#2969)). In this release, we have added `jsonpatch` to the list of known libraries in the `known.json` file. Jsonpatch is a library used for applying JSON patches, which allow for partial updates to a JSON document. By including jsonpatch in the known list, developers can now easily utilize its functionality for JSON patching, and any necessary dependencies will be handled automatically. This change partially addresses issue [#1931](#1931), which may have been caused by the use or integration of jsonpatch. We encourage developers to take advantage of this new addition to enhance their code and efficiently make partial updates to JSON documents. * Added `langchain-community` to known list ([#2970](#2970)). A new entry for `langchain-community` has been added to the configuration file for known language chain components in this release. This entry includes several sub-components such as 'langchain_community.agents', 'langchain_community.callbacks', 'langchain_community.chat_loaders', 'langchain_community.chat_message_histories', 'langchain_community.chat_models', 'langchain_community.cross_encoders', 'langchain_community.docstore', 'langchain_community.document_compressors', 'langchain_community.document_loaders', 'langchain_community.document_transformers', 'langchain_community.embeddings', 'langchain_community.example_selectors', 'langchain_community.graph_vectorstores', 'langchain_community.graphs', 'langchain_community.indexes', 'langchain_community.llms', 'langchain_community.memory', 'langchain_community.output_parsers', 'langchain_community.query_constructors', 'langchain_community.retrievers', 'langchain_community.storage', 'langchain_community.tools', 'langchain_community.utilities', and 'langchain_community.utils'. Currently, these sub-components are empty and have no additional configuration or code. This change partially resolves issue [#1931](#1931), but the specifics of the issue and how these components will be used are still unclear. * Added `langcodes` to known list ([#2971](#2971)). A new `langcodes` library has been added to the project, addressing part of issue [#1931](#1931). This library includes several modules that provide functionalities related to language codes and their manipulation, including `langcodes`, `langcodes.build_data`, `langcodes.data_dicts`, `langcodes.language_distance`, `langcodes.language_lists`, `langcodes.registry_parser`, `langcodes.tag_parser`, and `langcodes.util`. Additionally, the memory-efficient trie (prefix tree) data structure library, `marisa-trie`, has been included in the known list. It is important to note that no existing functionality has been altered in this commit. * Addressing Ownership Conflict when creating catalog/schemas ([#2956](#2956)). This release introduces new functionality to handle ownership conflicts during catalog/schema creation in our open-source library. The `_apply_from_legacy_table_acls` method has been enhanced with two loops to address non-own grants and own grants separately. This ensures proper handling of ownership conflicts by generating and executing UC grant SQL for each grant type, with appropriate exceptions. Additionally, a new helper function, `this_type_and_key()`, has been added to improve code readability. The release also introduces new methods, GrantsCrawler and Rule, in the hive_metastore package of the labs.ucx module, responsible for populating views and mapping source and destination objects. The test_catalog_schema.py file has been updated to include tests for creating catalogs and schemas with legacy ACLs, utilizing the new Rule method and GrantsCrawler. Issue [#2932](#2932) has been addressed with these changes, which include adding new methods and updating existing tests for hive_metastore. * Clarify `skip` and `unskip` commands work on views ([#2962](#2962)). In this release, the `skip` and `unskip` commands in the databricks labs UCX tool have been updated to clarify their functionality on views and to make it more explicit with the addition of the `--view` flag. These commands allow users to skip or unskip certain schemas, tables, or views during the table migration process. This is useful for temporarily disabling migration of a particular schema, table, or view. Unit tests have been added to ensure the correct behavior of these commands when working with views. Two new methods have been added to test the behavior of the `unskip` command when a schema or table is specified, and two additional methods test the behavior of the `unskip` command when a view or no schema is specified. Finally, two methods test that an error message is logged when both the `--table` and `--view` flags are specified. * Fixed issue with migrating MANAGED hive_metastore table to UC ([#2928](#2928)). This commit addresses an issue with migrating Hive Metastore (HMS) MANAGED tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table can result in data loss. To prevent this, a new option `CONVERT_TO_EXTERNAL` has been added to the `migrate_tables` method for migrating managed tables to UC as external, ensuring that the HMS managed table is converted to an external table in HMS and UC, and protecting against data loss when deleting a managed table that has been migrated to UC as external. Additionally, new caching properties have been added for better performance, and existing methods have been modified to handle the migration of managed tables to UC as external. Tests, including unit and integration tests, have been added to ensure the proper functioning of these changes. It is important to note that changing MANAGED tables to EXTERNAL can have potential consequences on regulatory data cleanup, and the impact of this change should be carefully validated for existing workloads. * Let `create-catalogs-schemas` reuse `MigrateGrants` so that it applies group renaming ([#2955](#2955)). The `create-catalogs-schemas` command in the `databricks labs ucx` package has been enhanced to reuse the `MigrateGrants` function, enabling group renaming and eliminating redundant code. The `migrate-tables` workflow remains functionally the same. Changes include modifying the `CatalogSchema` class to accept a `migrate_grants` argument, introducing new `Catalog` and `Schema` dataclasses, and updating various methods in the `hive_metastore` module. Unit and integration tests have been added and manually verified to ensure proper functionality. The `MigrateGrants` class has been updated to accept two `SecurableObject` arguments and sort matched grants. The `from_src_dst` function in `mapping.py` now includes a new `as_uc_table` method and updates to `as_uc_table_key`. Addressing issues [#2934](#2934), [#2932](#2932), and [#2955](#2955), the changes also include a new `key` property for the `tables.py` file, and updates to the `test_create_catalogs_schemas` and `test_migrate_tables` test functions. * Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)). A update has been made to the sqlglot requirement in the pyproject.toml file, changing the version range from allowing versions 25.5.0 and later, but less than 25.25, to a range that allows for versions 25.5.0 and later, but less than 25.26. This change was implemented to allow for the latest version of sqlglot compatible with the project's requirements. The commit message includes a detailed changelog for sqlglot version 25.25.0, highlighting breaking changes, new features, bug fixes, and refactors. The commits included in the pull request are also outlined in the message, providing a comprehensive overview of the updates made. * Use `LocationTrie` to infer a list of UC external locations ([#2965](#2965)). This pull request introduces the `LocationTrie` class to improve the efficiency of determining external locations in UC storage. The `LocationTrie` class, implemented as a dataclass, maintains a hierarchical tree structure to store location paths and offers methods for inserting, finding, and checking if a table exists in the trie. It also refactors the existing `_parse_location` method and adds the `_parse_url` method to handle specific cases of JDBC URLs. The `find` method and `_external_locations` method are updated to use the new `LocationTrie` class, ensuring a more efficient way of determining overlapping storage prefixes. Additionally, the test file for external locations has been modified to include new URL formats and correct previously incorrect ones, enhancing compatibility with various URL formats. Overall, these changes are instrumental in preparing the codebase for future federated SQL connections by optimizing the determination of external locations and improving compatibility with diverse URL formats. * Warn when table has column without no name during table migration ([#2984](#2984)). In this release, we have introduced changes to the table migration feature in the databricks project to address issue [#2891](#2891). We have renamed the `_migrate_table` method in the `TableMigration` class to `_safe_migrate_table` and added a new `_migrate_table` method that includes a try-except block to catch a specific Spark AnalysisException caused by an invalid column name. If this exception is caught, a warning is logged, and the function returns False. This change ensures that the migration process continues even when it encounters a table with a column without a name, generating a warning instead of raising an error. Additionally, we have added unit tests to verify the new functionality, including the introduction of a new mock backend `MockBackendWithGeneralException` to raise a general exception for testing purposes and a new test case `test_migrate_tables_handles_table_with_empty_column` to validate the warning message generated when encountering a table with an empty column during migration. Dependency updates: * Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)).
Merged
nfx
added a commit
that referenced
this pull request
Oct 16, 2024
* Added DBFS Root resolution when HMS Federation is enabled ([#2947](#2947)). This commit introduces a DBFS resolver for use with HMS (Hive Metastore) federation, enabling accurate resolution of DBFS root locations when HMS federation is enabled. A new `_resolve_dbfs_root()` class method is added to the `MountsCrawler` class, and a boolean argument `enable_hms_federation` is included in the `MountsCrawler` constructor, providing better handling of federation functionality. The commit also adds a test function, `test_resolve_dbfs_root_in_hms_federation`, to validate the resolution of DBFS roots with HMS federation. The test covers special cases, such as the `/user/hive/metastore` path, and utilizes `LocationTrie` for more accurate location guessing. These changes aim to improve the overall DBFS root resolution when using HMS federation. * Added `jax-jumpy` to known list ([#2959](#2959)). In this release, we have added the `jax-jumpy` package to the list of known packages in our system. `jax-jumpy` is a Python-based numerical computation library, which includes modules such as `jumpy`, `jumpy._base_fns`, `jumpy.core`, `jumpy.lax`, `jumpy.numpy`, `jumpy.numpy._factory_fns`, `jumpy.numpy._transform_data`, `jumpy.numpy._types`, `jumpy.numpy.linalg`, `jumpy.ops`, and `jumpy.random`. These modules are now recognized by our system, which partially resolves issue [#1931](#1931), which may have been caused by the integration of the `jax-jumpy` package. Engineers can now utilize the capabilities of this library in their numerical computations. * Added `joblibspark` to known list ([#2960](#2960)). In this release, we have added support for the `joblibspark` library in our system by updating the `known.json` file, which keeps track of various libraries and their associated components. This change is a part of the resolution for issue [#1931](#1931) and includes new elements such as `doc`, `doc.conf`, `joblibspark`, `joblibspark.backend`, and `joblibspark.utils`. These additions enable the system to recognize and manage the new components related to `joblibspark`, allowing for improved compatibility and functionality. * Added `jsonpatch` to known list ([#2969](#2969)). In this release, we have added `jsonpatch` to the list of known libraries in the `known.json` file. Jsonpatch is a library used for applying JSON patches, which allow for partial updates to a JSON document. By including jsonpatch in the known list, developers can now easily utilize its functionality for JSON patching, and any necessary dependencies will be handled automatically. This change partially addresses issue [#1931](#1931), which may have been caused by the use or integration of jsonpatch. We encourage developers to take advantage of this new addition to enhance their code and efficiently make partial updates to JSON documents. * Added `langchain-community` to known list ([#2970](#2970)). A new entry for `langchain-community` has been added to the configuration file for known language chain components in this release. This entry includes several sub-components such as 'langchain_community.agents', 'langchain_community.callbacks', 'langchain_community.chat_loaders', 'langchain_community.chat_message_histories', 'langchain_community.chat_models', 'langchain_community.cross_encoders', 'langchain_community.docstore', 'langchain_community.document_compressors', 'langchain_community.document_loaders', 'langchain_community.document_transformers', 'langchain_community.embeddings', 'langchain_community.example_selectors', 'langchain_community.graph_vectorstores', 'langchain_community.graphs', 'langchain_community.indexes', 'langchain_community.llms', 'langchain_community.memory', 'langchain_community.output_parsers', 'langchain_community.query_constructors', 'langchain_community.retrievers', 'langchain_community.storage', 'langchain_community.tools', 'langchain_community.utilities', and 'langchain_community.utils'. Currently, these sub-components are empty and have no additional configuration or code. This change partially resolves issue [#1931](#1931), but the specifics of the issue and how these components will be used are still unclear. * Added `langcodes` to known list ([#2971](#2971)). A new `langcodes` library has been added to the project, addressing part of issue [#1931](#1931). This library includes several modules that provide functionalities related to language codes and their manipulation, including `langcodes`, `langcodes.build_data`, `langcodes.data_dicts`, `langcodes.language_distance`, `langcodes.language_lists`, `langcodes.registry_parser`, `langcodes.tag_parser`, and `langcodes.util`. Additionally, the memory-efficient trie (prefix tree) data structure library, `marisa-trie`, has been included in the known list. It is important to note that no existing functionality has been altered in this commit. * Addressing Ownership Conflict when creating catalog/schemas ([#2956](#2956)). This release introduces new functionality to handle ownership conflicts during catalog/schema creation in our open-source library. The `_apply_from_legacy_table_acls` method has been enhanced with two loops to address non-own grants and own grants separately. This ensures proper handling of ownership conflicts by generating and executing UC grant SQL for each grant type, with appropriate exceptions. Additionally, a new helper function, `this_type_and_key()`, has been added to improve code readability. The release also introduces new methods, GrantsCrawler and Rule, in the hive_metastore package of the labs.ucx module, responsible for populating views and mapping source and destination objects. The test_catalog_schema.py file has been updated to include tests for creating catalogs and schemas with legacy ACLs, utilizing the new Rule method and GrantsCrawler. Issue [#2932](#2932) has been addressed with these changes, which include adding new methods and updating existing tests for hive_metastore. * Clarify `skip` and `unskip` commands work on views ([#2962](#2962)). In this release, the `skip` and `unskip` commands in the databricks labs UCX tool have been updated to clarify their functionality on views and to make it more explicit with the addition of the `--view` flag. These commands allow users to skip or unskip certain schemas, tables, or views during the table migration process. This is useful for temporarily disabling migration of a particular schema, table, or view. Unit tests have been added to ensure the correct behavior of these commands when working with views. Two new methods have been added to test the behavior of the `unskip` command when a schema or table is specified, and two additional methods test the behavior of the `unskip` command when a view or no schema is specified. Finally, two methods test that an error message is logged when both the `--table` and `--view` flags are specified. * Fixed issue with migrating MANAGED hive_metastore table to UC ([#2928](#2928)). This commit addresses an issue with migrating Hive Metastore (HMS) MANAGED tables to Unity Catalog (UC) as EXTERNAL, where deleting a MANAGED table can result in data loss. To prevent this, a new option `CONVERT_TO_EXTERNAL` has been added to the `migrate_tables` method for migrating managed tables to UC as external, ensuring that the HMS managed table is converted to an external table in HMS and UC, and protecting against data loss when deleting a managed table that has been migrated to UC as external. Additionally, new caching properties have been added for better performance, and existing methods have been modified to handle the migration of managed tables to UC as external. Tests, including unit and integration tests, have been added to ensure the proper functioning of these changes. It is important to note that changing MANAGED tables to EXTERNAL can have potential consequences on regulatory data cleanup, and the impact of this change should be carefully validated for existing workloads. * Let `create-catalogs-schemas` reuse `MigrateGrants` so that it applies group renaming ([#2955](#2955)). The `create-catalogs-schemas` command in the `databricks labs ucx` package has been enhanced to reuse the `MigrateGrants` function, enabling group renaming and eliminating redundant code. The `migrate-tables` workflow remains functionally the same. Changes include modifying the `CatalogSchema` class to accept a `migrate_grants` argument, introducing new `Catalog` and `Schema` dataclasses, and updating various methods in the `hive_metastore` module. Unit and integration tests have been added and manually verified to ensure proper functionality. The `MigrateGrants` class has been updated to accept two `SecurableObject` arguments and sort matched grants. The `from_src_dst` function in `mapping.py` now includes a new `as_uc_table` method and updates to `as_uc_table_key`. Addressing issues [#2934](#2934), [#2932](#2932), and [#2955](#2955), the changes also include a new `key` property for the `tables.py` file, and updates to the `test_create_catalogs_schemas` and `test_migrate_tables` test functions. * Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)). A update has been made to the sqlglot requirement in the pyproject.toml file, changing the version range from allowing versions 25.5.0 and later, but less than 25.25, to a range that allows for versions 25.5.0 and later, but less than 25.26. This change was implemented to allow for the latest version of sqlglot compatible with the project's requirements. The commit message includes a detailed changelog for sqlglot version 25.25.0, highlighting breaking changes, new features, bug fixes, and refactors. The commits included in the pull request are also outlined in the message, providing a comprehensive overview of the updates made. * Use `LocationTrie` to infer a list of UC external locations ([#2965](#2965)). This pull request introduces the `LocationTrie` class to improve the efficiency of determining external locations in UC storage. The `LocationTrie` class, implemented as a dataclass, maintains a hierarchical tree structure to store location paths and offers methods for inserting, finding, and checking if a table exists in the trie. It also refactors the existing `_parse_location` method and adds the `_parse_url` method to handle specific cases of JDBC URLs. The `find` method and `_external_locations` method are updated to use the new `LocationTrie` class, ensuring a more efficient way of determining overlapping storage prefixes. Additionally, the test file for external locations has been modified to include new URL formats and correct previously incorrect ones, enhancing compatibility with various URL formats. Overall, these changes are instrumental in preparing the codebase for future federated SQL connections by optimizing the determination of external locations and improving compatibility with diverse URL formats. * Warn when table has column without no name during table migration ([#2984](#2984)). In this release, we have introduced changes to the table migration feature in the databricks project to address issue [#2891](#2891). We have renamed the `_migrate_table` method in the `TableMigration` class to `_safe_migrate_table` and added a new `_migrate_table` method that includes a try-except block to catch a specific Spark AnalysisException caused by an invalid column name. If this exception is caught, a warning is logged, and the function returns False. This change ensures that the migration process continues even when it encounters a table with a column without a name, generating a warning instead of raising an error. Additionally, we have added unit tests to verify the new functionality, including the introduction of a new mock backend `MockBackendWithGeneralException` to raise a general exception for testing purposes and a new test case `test_migrate_tables_handles_table_with_empty_column` to validate the warning message generated when encountering a table with an empty column during migration. Dependency updates: * Updated sqlglot requirement from <25.25,>=25.5.0 to >=25.5.0,<25.26 ([#2968](#2968)).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds a DBFS Resolver which is used by HMS federation to resolve the DBFS root locations.
TODO:
ExternalLocations
explicitly depend onTablesCrawler
andMountsCrawler
#2954/user/hive/metastore
as special case of DBFS mountLocationTrie
to come guess locations - UseLocationTrie
to infer a list of UC external locations #2965co-authored by @vihangk-db