-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removed redundant pyspark
, databricks-connect
, delta-spark
, and pandas
dependencies
#193
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
… `pandas` dependencies Use consistent crawlers across HMS Crawling and Workspace Permissions
name and description are not relevant, and if we remove deps - we shall remove their usages too |
Merged
nfx
added a commit
that referenced
this pull request
Sep 18, 2023
# Version changelog ## 0.1.0 Features * Added interactive installation wizard ([#184](#184), [#117](#117)). * Added schedule of jobs as part of `install.sh` flow and created some documentation ([#187](#187)). * Added debug notebook companion to troubleshoot the installation ([#191](#191)). * Added support for Hive Metastore Table ACLs inventory from all databases ([#78](#78), [#122](#122), [#151](#151)). * Created `$inventory.tables` from Scala notebook ([#207](#207)). * Added local group migration support for ML-related objects ([#56](#56)). * Added local group migration support for SQL warehouses ([#57](#57)). * Added local group migration support for all compute-related resources ([#53](#53)). * Added local group migration support for security-related objects ([#58](#58)). * Added local group migration support for workflows ([#54](#54)). * Added local group migration support for workspace-level objects ([#59](#59)). * Added local group migration support for dashboards, queries, and alerts ([#144](#144)). Stability * Added `codecov.io` publishing ([#204](#204)). * Added more tests to group.py ([#148](#148)). * Added tests for group state ([#133](#133)). * Added tests for inventorizer and typed ([#125](#125)). * Added tests WorkspaceListing ([#110](#110)). * Added `make_*_permissions` fixtures ([#159](#159)). * Added reusable fixtures module ([#119](#119)). * Added testing for permissions ([#126](#126)). * Added inventory table manager tests ([#153](#153)). * Added `product_info` to track as SDK integration ([#76](#76)). * Added failsafe permission get operations ([#65](#65)). * Always install the latest `pip` version in `./install.sh` ([#201](#201)). * Always store inventory in `hive_metastore` and make only `inventory_database` configurable ([#178](#178)). * Changed default logging level from `TRACE` to `DEBUG` log level ([#124](#124)). * Consistently use `WorkspaceClient` from `databricks.sdk` ([#120](#120)). * Convert pipeline code to use fixtures. ([#166](#166)). * Exclude mixins from coverage ([#130](#130)). * Fixed codecov.io reporting ([#212](#212)). * Fixed configuration path in job task install code ([#210](#210)). * Fixed a bug with dependency definitions ([#70](#70)). * Fixed failing `test_jobs` ([#140](#140)). * Fixed the issues with experiment listing ([#64](#64)). * Fixed integration testing configuration ([#77](#77)). * Make project runnable on nightly testing infrastructure ([#75](#75)). * Migrated cluster policies to new fixtures ([#174](#174)). * Migrated clusters to the new fixture framework ([#162](#162)). * Migrated instance pool to the new fixture framework ([#161](#161)). * Migrated to `databricks.labs.ucx` package ([#90](#90)). * Migrated token authorization to new fixtures ([#175](#175)). * Migrated experiment fixture to standard one ([#168](#168)). * Migrated jobs test to fixture based one. ([#167](#167)). * Migrated model fixture to the standard fixtures ([#169](#169)). * Migrated warehouse fixture to standard one ([#170](#170)). * Organise modules by domain ([#197](#197)). * Prefetch all account-level and workspace-level groups ([#192](#192)). * Programmatically create a dashboard ([#121](#121)). * Properly integrate Python `logging` facility ([#118](#118)). * Refactored code to use Databricks SDK for Python ([#27](#27)). * Refactored configuration and remove global provider state ([#71](#71)). * Removed `pydantic` dependency ([#138](#138)). * Removed redundant `pyspark`, `databricks-connect`, `delta-spark`, and `pandas` dependencies ([#193](#193)). * Removed redundant `typer[all]` dependency and its usages ([#194](#194)). * Renamed `MigrationGroupsProvider` to `GroupMigrationState` ([#81](#81)). * Replaced `ratelimit` and `tenacity` dependencies with simpler implementations ([#195](#195)). * Reorganised integration tests to align more with unit tests ([#206](#206)). * Run `build` workflow also on `main` branch ([#211](#211)). * Run integration test with a single group ([#152](#152)). * Simplify `SqlBackend` and table creation logic ([#203](#203)). * Updated `migration_config.yml` ([#179](#179)). * Updated legal information ([#196](#196)). * Use `make_secret_scope` fixture ([#163](#163)). * Use fixture factory for `make_table`, `make_schema`, and `make_catalog` ([#189](#189)). * Use new fixtures for notebooks and folders ([#176](#176)). * Validate toolkit notebook test ([#183](#183)). Contributing * Added a note on external dependencies ([#139](#139)). * Added ability to run SQL queries on Spark when in Databricks Runtime ([#108](#108)). * Added some ground rules for contributing ([#82](#82)). * Added contributing instructions link from main readme ([#109](#109)). * Added info about environment refreshes ([#155](#155)). * Clarified documentation ([#137](#137)). * Enabled merge queue ([#146](#146)). * Improved `CONTRIBUTING.md` guide ([#135](#135), [#145](#145)).
FastLee
pushed a commit
that referenced
this pull request
Sep 19, 2023
# Version changelog ## 0.1.0 Features * Added interactive installation wizard ([#184](#184), [#117](#117)). * Added schedule of jobs as part of `install.sh` flow and created some documentation ([#187](#187)). * Added debug notebook companion to troubleshoot the installation ([#191](#191)). * Added support for Hive Metastore Table ACLs inventory from all databases ([#78](#78), [#122](#122), [#151](#151)). * Created `$inventory.tables` from Scala notebook ([#207](#207)). * Added local group migration support for ML-related objects ([#56](#56)). * Added local group migration support for SQL warehouses ([#57](#57)). * Added local group migration support for all compute-related resources ([#53](#53)). * Added local group migration support for security-related objects ([#58](#58)). * Added local group migration support for workflows ([#54](#54)). * Added local group migration support for workspace-level objects ([#59](#59)). * Added local group migration support for dashboards, queries, and alerts ([#144](#144)). Stability * Added `codecov.io` publishing ([#204](#204)). * Added more tests to group.py ([#148](#148)). * Added tests for group state ([#133](#133)). * Added tests for inventorizer and typed ([#125](#125)). * Added tests WorkspaceListing ([#110](#110)). * Added `make_*_permissions` fixtures ([#159](#159)). * Added reusable fixtures module ([#119](#119)). * Added testing for permissions ([#126](#126)). * Added inventory table manager tests ([#153](#153)). * Added `product_info` to track as SDK integration ([#76](#76)). * Added failsafe permission get operations ([#65](#65)). * Always install the latest `pip` version in `./install.sh` ([#201](#201)). * Always store inventory in `hive_metastore` and make only `inventory_database` configurable ([#178](#178)). * Changed default logging level from `TRACE` to `DEBUG` log level ([#124](#124)). * Consistently use `WorkspaceClient` from `databricks.sdk` ([#120](#120)). * Convert pipeline code to use fixtures. ([#166](#166)). * Exclude mixins from coverage ([#130](#130)). * Fixed codecov.io reporting ([#212](#212)). * Fixed configuration path in job task install code ([#210](#210)). * Fixed a bug with dependency definitions ([#70](#70)). * Fixed failing `test_jobs` ([#140](#140)). * Fixed the issues with experiment listing ([#64](#64)). * Fixed integration testing configuration ([#77](#77)). * Make project runnable on nightly testing infrastructure ([#75](#75)). * Migrated cluster policies to new fixtures ([#174](#174)). * Migrated clusters to the new fixture framework ([#162](#162)). * Migrated instance pool to the new fixture framework ([#161](#161)). * Migrated to `databricks.labs.ucx` package ([#90](#90)). * Migrated token authorization to new fixtures ([#175](#175)). * Migrated experiment fixture to standard one ([#168](#168)). * Migrated jobs test to fixture based one. ([#167](#167)). * Migrated model fixture to the standard fixtures ([#169](#169)). * Migrated warehouse fixture to standard one ([#170](#170)). * Organise modules by domain ([#197](#197)). * Prefetch all account-level and workspace-level groups ([#192](#192)). * Programmatically create a dashboard ([#121](#121)). * Properly integrate Python `logging` facility ([#118](#118)). * Refactored code to use Databricks SDK for Python ([#27](#27)). * Refactored configuration and remove global provider state ([#71](#71)). * Removed `pydantic` dependency ([#138](#138)). * Removed redundant `pyspark`, `databricks-connect`, `delta-spark`, and `pandas` dependencies ([#193](#193)). * Removed redundant `typer[all]` dependency and its usages ([#194](#194)). * Renamed `MigrationGroupsProvider` to `GroupMigrationState` ([#81](#81)). * Replaced `ratelimit` and `tenacity` dependencies with simpler implementations ([#195](#195)). * Reorganised integration tests to align more with unit tests ([#206](#206)). * Run `build` workflow also on `main` branch ([#211](#211)). * Run integration test with a single group ([#152](#152)). * Simplify `SqlBackend` and table creation logic ([#203](#203)). * Updated `migration_config.yml` ([#179](#179)). * Updated legal information ([#196](#196)). * Use `make_secret_scope` fixture ([#163](#163)). * Use fixture factory for `make_table`, `make_schema`, and `make_catalog` ([#189](#189)). * Use new fixtures for notebooks and folders ([#176](#176)). * Validate toolkit notebook test ([#183](#183)). Contributing * Added a note on external dependencies ([#139](#139)). * Added ability to run SQL queries on Spark when in Databricks Runtime ([#108](#108)). * Added some ground rules for contributing ([#82](#82)). * Added contributing instructions link from main readme ([#109](#109)). * Added info about environment refreshes ([#155](#155)). * Clarified documentation ([#137](#137)). * Enabled merge queue ([#146](#146)). * Improved `CONTRIBUTING.md` guide ([#135](#135), [#145](#145)).
nfx
added a commit
that referenced
this pull request
Oct 4, 2024
* Added `Farama-Notifications` to known list ([#2822](#2822)). A new configuration has been implemented in this release to integrate Farama-Notifications into the existing system, partially addressing issue [#193](#193) * Added `aiohttp-cors` library to known list ([#2775](#2775)). In this release, we have added the `aiohttp-cors` library to our project, providing asynchronous Cross-Origin Resource Sharing (CORS) handling for the `aiohttp` library. This addition enhances the robustness and flexibility of CORS management in our relevant projects. The library includes several new modules such as "aiohttp_cors", "aiohttp_cors.abc", "aiohttp_cors.cors_config", "aiohttp_cors.mixin", "aiohttp_cors.preflight_handler", "aiohttp_cors.resource_options", and "aiohttp_cors.urldispatcher_router_adapter", which offer functionalities for configuring and handling CORS in `aiohttp` applications. This change partially resolves issue [#1931](#1931) and further strengthens our application's security and cross-origin resource sharing capabilities. * Added `category-encoders` library to known list ([#2781](#2781)). In this release, we've added the `category-encoders` library to our supported libraries, which provides a variety of methods for encoding categorical variables as numerical data, including one-hot encoding and target encoding. This addition resolves part of issue [#1931](#1931), which concerned the support of this library. The library has been integrated into our system by adding a new entry for `category-encoders` in the known.json file, which contains several modules and classes corresponding to various encoding methods provided by the library. This enhancement enables software engineers to leverage the capabilities of `category-encoders` library to encode categorical variables more efficiently and effectively. * Added `cmdstanpy` to known list ([#2786](#2786)). In this release, we have added `cmdstanpy` and `stanio` libraries to our codebase. `cmdstanpy` is a Python library for interfacing with the Stan probabilistic programming language and has been added to the whitelist. This addition enables the use of `cmdstanpy`'s functionalities, including loading, inspecting, and manipulating Stan model objects, as well as running MCMC simulations. Additionally, we have included the `stanio` library, which provides functionality for reading and writing Stan data and model files. These additions enhance the codebase's capabilities for working with probabilistic models, offering expanded options for loading, manipulating, and simulating models written in Stan. * Added `confection` library to known list ([#2787](#2787)). In this release, the `confection` library, a lightweight, pure Python library for parsing and formatting cookies with two modules for working with cookie headers and utility functions, has been added to the known list of libraries and is now usable within the project. Additionally, several modules from the `srsly` library, a collection of serialization utilities for Python including support for JSON, MessagePack, cloudpickle, and Ruamel YAML, have been added to the known list of libraries, increasing the project's flexibility and functionality in handling serialized data. This partially resolves issue [#1931](#1931). * Added `configparser` library to known list ([#2796](#2796)). In this release, we have added support for the `configparser` library, addressing issue [#1931](#1931). `Configparser` is a standard Python library used for parsing configuration files. This change not only whitelists the library but also includes the "backports.configparser" and "backports.configparser.compat" modules, providing backward compatibility for older versions of Python. By recognizing and supporting the `configparser` library, users can now utilize it in their code with confidence, knowing that it is a known and supported library. This update also ensures that the backports for older Python versions are recognized, enabling users to leverage the library seamlessly, regardless of the Python version they are using. * Added `diskcache` library to known list ([#2790](#2790)). A new update has been made to include the `diskcache` library in our open-source library's known list, as detailed in the release notes. This addition brings in multiple modules, including `diskcache`, `diskcache.cli`, `diskcache.core`, `diskcache.djangocache`, `diskcache.persistent`, and `diskcache.recipes`. The `diskcache` library is a high-performance caching system, useful for a variety of purposes such as caching database queries, API responses, or any large data that needs frequent access. By adding the `diskcache` library to the known list, developers can now leverage its capabilities in their projects, partially addressing issue [#1931](#1931). * Added `dm-tree` library to known list ([#2789](#2789)). In this release, we have added the `dm-tree` library to our project's known list, enabling its integration and use within our software. The `dm-tree` library is a C++ API that provides functionalities for creating and manipulating tree data structures, with support for sequences and tree benchmarking. This addition expands our range of available data structures, addressing the lack of support for tree data structures and partially resolving issue [#1931](#1931), which may have been related to the integration of the `dm-tree` library. By incorporating this library, we aim to enhance our project's performance and versatility, providing software engineers with more options for handling tree data structures. * Added `evaluate` to known list ([#2821](#2821)). In this release, we have added the `evaluate` package and its dependent libraries to our open-source library. The `evaluate` package is a tool for evaluating and analyzing machine learning models, providing a consistent interface to various evaluation tasks. Its dependent libraries include `colorful`, `cmdstanpy`, `comm`, `eradicate`, `multiprocess`, and `xxhash`. The `colorful` library is used for colorizing terminal output, while `cmdstanpy` provides Python infrastructure for Stan, a platform for statistical modeling and high-performance statistical computation. The `comm` library is used for creating and managing IPython comms, and `eradicate` is used for removing unwanted columns from pandas DataFrame. The `multiprocess` library is used for spawning processes, and `xxhash` is used for the XXHash algorithms, which are used for fast hash computation. This addition partly resolves issue [#1931](#1931), providing enhanced functionality for evaluating machine learning models. * Added `future` to known list ([#2823](#2823)). In this commit, we have added the `future` module, a compatibility layer for Python 2 and Python 3, to the project's known list in the configuration file. This module provides a wide range of backward-compatible tools and fixers to smooth over the differences between the two major versions of Python. It includes numerous sub-modules such as "future.backports", "future.builtins", "future.moves", and "future.standard_library", among others, which offer backward-compatible features for various parts of the Python standard library. The commit also includes related modules like "libfuturize", "libpasteurize", and `past` and their respective sub-modules, which provide tools for automatically converting Python 2 code to Python 3 syntax. These additions enhance the project's compatibility with both Python 2 and Python 3, providing developers with an easier way to write cross-compatible code. By adding the `future` module and related tools, the project can take full advantage of the features and capabilities provided, simplifying the process of writing code that works on both versions of the language. * Added `google-api-core` to known list ([#2824](#2824)). In this commit, we have added the `google-api-core` and `proto-plus` packages to our codebase. The `google-api-core` package brings in a collection of modules for low-level support of Google Cloud services, such as client options, gRPC helpers, and retry mechanisms. This addition enables access to a wide range of functionalities for interacting with Google Cloud services. The `proto-plus` package includes protobuf-related modules, simplifying the handling and manipulation of protobuf messages. This package includes datetime helpers, enums, fields, marshaling utilities, message definitions, and more. These changes enhance the project's versatility, providing users with a more feature-rich environment for interacting with external services, such as those provided by Google Cloud. Users will benefit from the added functionality and convenience provided by these packages. * Added `google-auth-oauthlib` and dependent libraries to known list ([#2825](#2825)). In this release, we have added the `google-auth-oauthlib` and `requests-oauthlib` libraries and their dependencies to our repository to enhance OAuth2 authentication flow support. The `google-auth-oauthlib` library is utilized for Google's OAuth2 client authentication and authorization flows, while `requests-oauthlib` provides OAuth1 and OAuth2 support for the `requests` library. This change partially resolves the missing dependencies issue and improves the project's ability to handle OAuth2 authentication flows with Google and other providers. * Added `greenlet` to known list ([#2830](#2830)). In this release, we have added the `greenlet` library to the known list in the configuration file, addressing part of issue [#193](#193) * Added `gymnasium` to known list ([#2832](#2832)). A new update has been made to include the popular open-source `gymnasium` library in the project's configuration file. The library provides various environments, spaces, and wrappers for developing and testing reinforcement learning algorithms, and includes modules such as "gymnasium.core", "gymnasium.envs", "gymnasium.envs.box2d", "gymnasium.envs.classic_control", "gymnasium.envs.mujoco", "gymnasium.envs.phys2d", "gymnasium.envs.registration", "gymnasium.envs.tabular", "gymnasium.envs.toy_text", "gymnasium.experimental", "gymnasium.logger", "gymnasium.spaces", and "gymnasium.utils", each with specific functionality. This addition enables developers to utilize the library without having to modify any existing code and take advantage of the latest features and bug fixes. This change partly addresses issue [#1931](#1931), likely related to using `gymnasium` in the project, allowing developers to now use it for developing and testing reinforcement learning algorithms. * Added and populate UCX `workflow_runs` table ([#2754](#2754)). In this release, we have added and populated a new `workflow_runs` table in the UCX project to track the status of workflow runs and handle concurrent writes. This update resolves issue [#2600](#2600) and is accompanied by modifications to the `migration-process-experimental` workflow, new `WorkflowRunRecorder` and `ProgressTrackingInstallation` classes, and updated user documentation. We have also added unit tests, integration tests, and a `record_workflow_run` method in the `MigrationWorkflow` class. The new table and methods have been tested to ensure they correctly record workflow run information. However, there are still some issues to address, such as deciding on getting workflow run status from `parse_log_task`. * Added collection of used tables from Python notebooks and files and SQL queries ([#2772](#2772)). This commit introduces the collection and storage of table usage information as part of linting jobs to enable tracking of legacy table usage and lineage. The changes include the modification of existing workflows, addition of new tables and views, and the introduction of new classes such as `UsedTablesCrawler`, `LineageAtom`, and `TableInfoNode`. The new classes and methods support tracking table usage and lineage in Python notebooks, files, and SQL queries. Unit tests and integration tests have been added and updated to ensure the correct functioning of this feature. This is the first pull request in a series of three, with the next two focusing on using the table information in queries and displaying results in the assessment dashboard. * Changed logic of direct filesystem access linting ([#2766](#2766)). This commit modifies the direct filesystem access (DFSA) linting logic to reduce false positives and improve precision. Previously, all string constants matching a DFSA pattern were detected, with false positives filtered on a case-by-case basis. The new approach narrows DFSA detection to instances originating from `spark` or `dbutils` modules, ensuring relevance and minimizing false alarms. The commit introduces new methods, such as 'is_builtin()' and 'get_call_name()', to determine if a given node is a built-in or not. Additionally, it includes unit tests and updates to the test cases in `test_directfs.py` to reflect the new detection criteria. This change enhances the linting process and enables developers to maintain better control over direct filesystem access within the `spark` and `dbutils` modules. * Fixed integration issue when collecting tables ([#2817](#2817)). In this release, we have addressed integration issues related to table collection in the Databricks Labs UCX project. We have introduced a new `UsedTablesCrawler` class to crawl tables in paths and queries, which resolves issues reported in tickets [#2800](#2800) and [#2808](#2808). Additionally, we have updated the `directfs_access_crawler_for_paths` and `directfs_access_crawler_for_queries` methods to work with the new `UsedTablesCrawler` class. We have also made changes to the `workflow_linter` method to include the new `used_tables_crawler_for_paths` property. Furthermore, we have refactored the `lint` method of certain classes to a `collect_tables` method, which returns an iterable of `UsedTable` objects to improve table collection. The `lint` method now processes the collected tables and raises advisories as needed, while the `apply` method remains unchanged. Integration tests were executed as part of this commit. * Increase test coverage ([#2818](#2818)). In this update, we have expanded the test suite for the `Tree` class in our Python AST codebase with several new unit tests. These tests are designed to verify various behaviors, including checking for `None` returns, validating string truncation, ensuring `NotImplementedError` exceptions are raised during node appending and method calls, and testing the correct handling of global variables. Additionally, we have included tests that ensure a constant is not from a specific module. This enhancement signifies our dedication to improving test coverage and consistency, which will aid in maintaining code quality, detecting unintended side effects, and preventing regressions in future development efforts. * Strip preliminary comments in pip cells ([#2763](#2763)). In this release, we have addressed an issue in the processing of pip commands preceded by non-MAGIC comments, ensuring that pip-based library management in Databricks notebooks functions correctly. The changes include stripping preliminary comments and handling the case where the pip command is preceded by a single '%' or '!'. Additionally, a new unit test has been added to validate the behavior of a notebook containing a malformed pip cell. This test checks that the notebook can still be parsed and built into a dependency graph without issues, even in the presence of non-MAGIC comments preceding the pip install command. The code for the test is written in Python and uses the Notebook, Dependency, and DependencyGraph classes to parse the notebook and build the dependency graph. The overall functionality of the code remains unchanged, and the code now correctly processes pip commands in the presence of non-MAGIC comments. * Temporarily ignore `MANAGED` HMS tables on external storage location ([#2837](#2837)). This release introduces changes to the behavior of the `_migrate_external_table` method in the `table_migrate.py` file, specifically for handling managed tables located on external storage. Previously, the method attempted to migrate any external table, but with this change, it now checks if the object type is 'MANAGED'. If it is, a warning message is logged, and the migration is skipped due to UCX's lack of support for migrating managed tables on external storage. This change affects the existing workflow, specifically the behavior of the `migrate_dbfs_root_tables` function in the HMS table migration test suite. The function now checks for the absence of certain SQL queries, specifically those involving `SYNC TABLE` and `ALTER TABLE`, in the `backend.queries` list to ensure that queries related to managed tables on external storage locations are excluded. This release includes unit tests and integration tests to verify the changes and ensure proper behavior for the modified workflow. Issue [#2838](#2838) has been resolved with this commit. * Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)). In this release, we have updated the sqlglot requirement in the pyproject.toml file to allow for any version greater than or equal to 25.5.0 but less than 25.25. This resolves a conflict in the previous requirement, which ranged from >=25.5.0 to <25.23. The update includes several bug fixes, refactors, and new features, such as support for the OVERLAY function in PostgreSQL and a flag to automatically exclude Keep diff nodes. Additionally, the check_deploy job has been simplified, and the supported dialect count has increased from 21 to 23. This update ensures that the project remains up-to-date and compatible with the latest version of sqlglot, while also improving functionality and stability. * Whitelists catalogue library ([#2780](#2780)). In this release, we've implemented a change to whitelist the catalogue library, which partially addresses issue [#193](#193). This improvement allows for the reliable and secure use of the catalogue library in our open-source project. The whitelisting ensures that any potential security threats originating from this library are mitigated, enhancing the overall security of our software. This enhancement also promotes better code maintainability and readability, making it easier for software engineers to understand the library's role in the project. By addressing this issue, our library becomes more robust, dependable, and maintainable for both current and future developments. * Whitelists circuitbreaker ([#2783](#2783)). A circuit breaker pattern has been implemented in the library to enhance fault tolerance and prevent cascading failures by introducing a delay before retrying requests to a failed service. This feature is configurable and allows users to specify which services should be protected by the circuit breaker pattern via a whitelist in the `known.json` configuration file. A new entry for `circuitbreaker` is added to the configuration, containing an empty list for the circuit breaker whitelist. This development partially addresses issue [#1931](#1931), aimed at improving system resilience and fault tolerance, and is a significant stride towards building a more robust and reliable open-source library. * Whitelists cloudpathlib ([#2784](#2784)). In this release, we have whitelisted the cloudpathlib library by adding it to the known.json file. Cloudpathlib is a Python library for manipulating cloud paths, and includes several modules for interacting with various cloud storage systems. Each module has been added to the known.json file with an empty list, indicating that no critical issues have been found in these modules. However, we have added warnings for the use of direct filesystem references in specific classes and methods within the cloudpathlib.azure.azblobclient, cloudpathlib.azure.azblobpath, cloudpathlib.cloudpath, cloudpathlib.gs.gsclient, cloudpathlib.gs.gspath, cloudpathlib.local.implementations.azure, cloudpathlib.local.implementations.gs, cloudpathlib.local.implementations.s3, cloudpathlib.s3.s3client, and cloudpathlib.s3.sspath modules. The warning message indicates that the use of direct filesystem references is deprecated and will be removed in a future release. This change addresses a portion of issue [#1931](#1931). * Whitelists colorful ([#2785](#2785)). In this release, we have added support for the `colorful` library, a Python package for generating ANSI escape codes to colorize terminal output. The library contains several modules, including "ansi", "colors", "core", "styles", "terminal", and "utils", all of which have been whitelisted and added to the "known.json" file. This change resolves issue [#1931](#1931) and broadens the range of approved libraries that can be used in the project, enabling more flexible and visually appealing terminal output. * Whitelists cymem ([#2793](#2793)). In this release, we have made changes to the known.json file to whitelist the use of the cymem package in our project. This new entry includes sub-entries such as "cymem", "cymem.about", "cymem.tests", and "cymem.tests.test_import", which likely correspond to specific components or aspects of the package that require whitelisting. This change partially addresses issue [#1931](#1931), which may have been caused by the use or testing of the cymem package. It is important to note that this commit does not modify any existing functionality or add any new methods; rather, it simply grants permission for the cymem package to be used in our project. * Whitelists dacite ([#2795](#2795)). In this release, we have whitelisted the dacite library in our known.json file. Dacite is a library that enables the instantiation of Python classes with type hints, providing more robust and flexible object creation. By whitelisting dacite, users of our project can now utilize this library in their code without encountering any compatibility issues. This change partially addresses issue [#1931](#1931), which may have involved dacite or type hinting more generally, thereby enhancing the overall functionality and flexibility of our project for software engineers. * Whitelists databricks-automl-runtime ([#2794](#2794)). A new change has been implemented to whitelist the `databricks-automl-runtime` in the "known.json" file, enabling several nested packages and modules related to Databricks' auto ML runtime for forecasting and hyperparameter tuning. The newly added modules provide functionalities for data preprocessing and model training, including handling time series data, missing values, and one-hot encoding. This modification addresses a portion of issue [#1931](#1931), improving the library's compatibility with Databricks' auto ML runtime. * Whitelists dataclasses-json ([#2792](#2792)). A new configuration has been added to the "known.json" file, whitelisting the `dataclasses-json` library, which provides serialization and deserialization functionality to Python dataclasses. This change partially resolves issue [#1931](#1931) and introduces new methods for serialization and deserialization through this library. Additionally, the libraries `marshmallow` and its associated modules, as well as "typing-inspect," have also been whitelisted, adding further serialization and deserialization capabilities. It's important to note that these changes do not affect existing functionality, but instead provide new options for handling these data structures. * Whitelists dbl-tempo ([#2791](#2791)). A new library, dbl-tempo, has been whitelisted and is now approved for use in the project. This library provides functionality related to tempo, including interpolation, intervals, resampling, and utility methods. These new methods have been added to the known.json file, indicating that they are now recognized and approved for use. This change is critical for maintaining backward compatibility and project maintainability. It addresses part of issue [#1931](#1931) and ensures that any new libraries or methods are thoroughly vetted and documented before implementation. Software engineers are encouraged to familiarize themselves with the new library and its capabilities. * whitelist blis ([#2776](#2776)). In this release, we have added the high-performance computing library `blis` to our whitelist, partially addressing issue [#1931](#1931). The blis library is optimized for various CPU architectures and provides dense linear algebra capabilities, which can improve the performance of workloads that utilize these operations. With this change, the blis library and its components have been included in our system's whitelist, enabling users to leverage its capabilities. Familiarity with high-performance libraries and their impact on system performance is essential for software engineers, and the addition of blis to our whitelist is a testament to our commitment to providing optimal performance. * whitelists brotli ([#2777](#2777)). In this release, we have partially addressed issue [#1931](#1931) by adding support for the Brotli data compression algorithm in our project. The Brotli JSON object and an empty array for `brotli` have been added to the "known.json" configuration file to recognize and support its use. This change does not modify any existing functionality or introduce new methods, but rather whitelists Brotli as a supported algorithm for future use in the project. This enhancement allows for more flexibility and options when working with data compression, providing software engineers with an additional tool for optimization and performance improvements. Dependency updates: * Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)).
Merged
nfx
added a commit
that referenced
this pull request
Oct 4, 2024
* Added `Farama-Notifications` to known list ([#2822](#2822)). A new configuration has been implemented in this release to integrate Farama-Notifications into the existing system, partially addressing issue [#193](#193) * Added `aiohttp-cors` library to known list ([#2775](#2775)). In this release, we have added the `aiohttp-cors` library to our project, providing asynchronous Cross-Origin Resource Sharing (CORS) handling for the `aiohttp` library. This addition enhances the robustness and flexibility of CORS management in our relevant projects. The library includes several new modules such as "aiohttp_cors", "aiohttp_cors.abc", "aiohttp_cors.cors_config", "aiohttp_cors.mixin", "aiohttp_cors.preflight_handler", "aiohttp_cors.resource_options", and "aiohttp_cors.urldispatcher_router_adapter", which offer functionalities for configuring and handling CORS in `aiohttp` applications. This change partially resolves issue [#1931](#1931) and further strengthens our application's security and cross-origin resource sharing capabilities. * Added `category-encoders` library to known list ([#2781](#2781)). In this release, we've added the `category-encoders` library to our supported libraries, which provides a variety of methods for encoding categorical variables as numerical data, including one-hot encoding and target encoding. This addition resolves part of issue [#1931](#1931), which concerned the support of this library. The library has been integrated into our system by adding a new entry for `category-encoders` in the known.json file, which contains several modules and classes corresponding to various encoding methods provided by the library. This enhancement enables software engineers to leverage the capabilities of `category-encoders` library to encode categorical variables more efficiently and effectively. * Added `cmdstanpy` to known list ([#2786](#2786)). In this release, we have added `cmdstanpy` and `stanio` libraries to our codebase. `cmdstanpy` is a Python library for interfacing with the Stan probabilistic programming language and has been added to the whitelist. This addition enables the use of `cmdstanpy`'s functionalities, including loading, inspecting, and manipulating Stan model objects, as well as running MCMC simulations. Additionally, we have included the `stanio` library, which provides functionality for reading and writing Stan data and model files. These additions enhance the codebase's capabilities for working with probabilistic models, offering expanded options for loading, manipulating, and simulating models written in Stan. * Added `confection` library to known list ([#2787](#2787)). In this release, the `confection` library, a lightweight, pure Python library for parsing and formatting cookies with two modules for working with cookie headers and utility functions, has been added to the known list of libraries and is now usable within the project. Additionally, several modules from the `srsly` library, a collection of serialization utilities for Python including support for JSON, MessagePack, cloudpickle, and Ruamel YAML, have been added to the known list of libraries, increasing the project's flexibility and functionality in handling serialized data. This partially resolves issue [#1931](#1931). * Added `configparser` library to known list ([#2796](#2796)). In this release, we have added support for the `configparser` library, addressing issue [#1931](#1931). `Configparser` is a standard Python library used for parsing configuration files. This change not only whitelists the library but also includes the "backports.configparser" and "backports.configparser.compat" modules, providing backward compatibility for older versions of Python. By recognizing and supporting the `configparser` library, users can now utilize it in their code with confidence, knowing that it is a known and supported library. This update also ensures that the backports for older Python versions are recognized, enabling users to leverage the library seamlessly, regardless of the Python version they are using. * Added `diskcache` library to known list ([#2790](#2790)). A new update has been made to include the `diskcache` library in our open-source library's known list, as detailed in the release notes. This addition brings in multiple modules, including `diskcache`, `diskcache.cli`, `diskcache.core`, `diskcache.djangocache`, `diskcache.persistent`, and `diskcache.recipes`. The `diskcache` library is a high-performance caching system, useful for a variety of purposes such as caching database queries, API responses, or any large data that needs frequent access. By adding the `diskcache` library to the known list, developers can now leverage its capabilities in their projects, partially addressing issue [#1931](#1931). * Added `dm-tree` library to known list ([#2789](#2789)). In this release, we have added the `dm-tree` library to our project's known list, enabling its integration and use within our software. The `dm-tree` library is a C++ API that provides functionalities for creating and manipulating tree data structures, with support for sequences and tree benchmarking. This addition expands our range of available data structures, addressing the lack of support for tree data structures and partially resolving issue [#1931](#1931), which may have been related to the integration of the `dm-tree` library. By incorporating this library, we aim to enhance our project's performance and versatility, providing software engineers with more options for handling tree data structures. * Added `evaluate` to known list ([#2821](#2821)). In this release, we have added the `evaluate` package and its dependent libraries to our open-source library. The `evaluate` package is a tool for evaluating and analyzing machine learning models, providing a consistent interface to various evaluation tasks. Its dependent libraries include `colorful`, `cmdstanpy`, `comm`, `eradicate`, `multiprocess`, and `xxhash`. The `colorful` library is used for colorizing terminal output, while `cmdstanpy` provides Python infrastructure for Stan, a platform for statistical modeling and high-performance statistical computation. The `comm` library is used for creating and managing IPython comms, and `eradicate` is used for removing unwanted columns from pandas DataFrame. The `multiprocess` library is used for spawning processes, and `xxhash` is used for the XXHash algorithms, which are used for fast hash computation. This addition partly resolves issue [#1931](#1931), providing enhanced functionality for evaluating machine learning models. * Added `future` to known list ([#2823](#2823)). In this commit, we have added the `future` module, a compatibility layer for Python 2 and Python 3, to the project's known list in the configuration file. This module provides a wide range of backward-compatible tools and fixers to smooth over the differences between the two major versions of Python. It includes numerous sub-modules such as "future.backports", "future.builtins", "future.moves", and "future.standard_library", among others, which offer backward-compatible features for various parts of the Python standard library. The commit also includes related modules like "libfuturize", "libpasteurize", and `past` and their respective sub-modules, which provide tools for automatically converting Python 2 code to Python 3 syntax. These additions enhance the project's compatibility with both Python 2 and Python 3, providing developers with an easier way to write cross-compatible code. By adding the `future` module and related tools, the project can take full advantage of the features and capabilities provided, simplifying the process of writing code that works on both versions of the language. * Added `google-api-core` to known list ([#2824](#2824)). In this commit, we have added the `google-api-core` and `proto-plus` packages to our codebase. The `google-api-core` package brings in a collection of modules for low-level support of Google Cloud services, such as client options, gRPC helpers, and retry mechanisms. This addition enables access to a wide range of functionalities for interacting with Google Cloud services. The `proto-plus` package includes protobuf-related modules, simplifying the handling and manipulation of protobuf messages. This package includes datetime helpers, enums, fields, marshaling utilities, message definitions, and more. These changes enhance the project's versatility, providing users with a more feature-rich environment for interacting with external services, such as those provided by Google Cloud. Users will benefit from the added functionality and convenience provided by these packages. * Added `google-auth-oauthlib` and dependent libraries to known list ([#2825](#2825)). In this release, we have added the `google-auth-oauthlib` and `requests-oauthlib` libraries and their dependencies to our repository to enhance OAuth2 authentication flow support. The `google-auth-oauthlib` library is utilized for Google's OAuth2 client authentication and authorization flows, while `requests-oauthlib` provides OAuth1 and OAuth2 support for the `requests` library. This change partially resolves the missing dependencies issue and improves the project's ability to handle OAuth2 authentication flows with Google and other providers. * Added `greenlet` to known list ([#2830](#2830)). In this release, we have added the `greenlet` library to the known list in the configuration file, addressing part of issue [#193](#193) * Added `gymnasium` to known list ([#2832](#2832)). A new update has been made to include the popular open-source `gymnasium` library in the project's configuration file. The library provides various environments, spaces, and wrappers for developing and testing reinforcement learning algorithms, and includes modules such as "gymnasium.core", "gymnasium.envs", "gymnasium.envs.box2d", "gymnasium.envs.classic_control", "gymnasium.envs.mujoco", "gymnasium.envs.phys2d", "gymnasium.envs.registration", "gymnasium.envs.tabular", "gymnasium.envs.toy_text", "gymnasium.experimental", "gymnasium.logger", "gymnasium.spaces", and "gymnasium.utils", each with specific functionality. This addition enables developers to utilize the library without having to modify any existing code and take advantage of the latest features and bug fixes. This change partly addresses issue [#1931](#1931), likely related to using `gymnasium` in the project, allowing developers to now use it for developing and testing reinforcement learning algorithms. * Added and populate UCX `workflow_runs` table ([#2754](#2754)). In this release, we have added and populated a new `workflow_runs` table in the UCX project to track the status of workflow runs and handle concurrent writes. This update resolves issue [#2600](#2600) and is accompanied by modifications to the `migration-process-experimental` workflow, new `WorkflowRunRecorder` and `ProgressTrackingInstallation` classes, and updated user documentation. We have also added unit tests, integration tests, and a `record_workflow_run` method in the `MigrationWorkflow` class. The new table and methods have been tested to ensure they correctly record workflow run information. However, there are still some issues to address, such as deciding on getting workflow run status from `parse_log_task`. * Added collection of used tables from Python notebooks and files and SQL queries ([#2772](#2772)). This commit introduces the collection and storage of table usage information as part of linting jobs to enable tracking of legacy table usage and lineage. The changes include the modification of existing workflows, addition of new tables and views, and the introduction of new classes such as `UsedTablesCrawler`, `LineageAtom`, and `TableInfoNode`. The new classes and methods support tracking table usage and lineage in Python notebooks, files, and SQL queries. Unit tests and integration tests have been added and updated to ensure the correct functioning of this feature. This is the first pull request in a series of three, with the next two focusing on using the table information in queries and displaying results in the assessment dashboard. * Changed logic of direct filesystem access linting ([#2766](#2766)). This commit modifies the direct filesystem access (DFSA) linting logic to reduce false positives and improve precision. Previously, all string constants matching a DFSA pattern were detected, with false positives filtered on a case-by-case basis. The new approach narrows DFSA detection to instances originating from `spark` or `dbutils` modules, ensuring relevance and minimizing false alarms. The commit introduces new methods, such as 'is_builtin()' and 'get_call_name()', to determine if a given node is a built-in or not. Additionally, it includes unit tests and updates to the test cases in `test_directfs.py` to reflect the new detection criteria. This change enhances the linting process and enables developers to maintain better control over direct filesystem access within the `spark` and `dbutils` modules. * Fixed integration issue when collecting tables ([#2817](#2817)). In this release, we have addressed integration issues related to table collection in the Databricks Labs UCX project. We have introduced a new `UsedTablesCrawler` class to crawl tables in paths and queries, which resolves issues reported in tickets [#2800](#2800) and [#2808](#2808). Additionally, we have updated the `directfs_access_crawler_for_paths` and `directfs_access_crawler_for_queries` methods to work with the new `UsedTablesCrawler` class. We have also made changes to the `workflow_linter` method to include the new `used_tables_crawler_for_paths` property. Furthermore, we have refactored the `lint` method of certain classes to a `collect_tables` method, which returns an iterable of `UsedTable` objects to improve table collection. The `lint` method now processes the collected tables and raises advisories as needed, while the `apply` method remains unchanged. Integration tests were executed as part of this commit. * Increase test coverage ([#2818](#2818)). In this update, we have expanded the test suite for the `Tree` class in our Python AST codebase with several new unit tests. These tests are designed to verify various behaviors, including checking for `None` returns, validating string truncation, ensuring `NotImplementedError` exceptions are raised during node appending and method calls, and testing the correct handling of global variables. Additionally, we have included tests that ensure a constant is not from a specific module. This enhancement signifies our dedication to improving test coverage and consistency, which will aid in maintaining code quality, detecting unintended side effects, and preventing regressions in future development efforts. * Strip preliminary comments in pip cells ([#2763](#2763)). In this release, we have addressed an issue in the processing of pip commands preceded by non-MAGIC comments, ensuring that pip-based library management in Databricks notebooks functions correctly. The changes include stripping preliminary comments and handling the case where the pip command is preceded by a single '%' or '!'. Additionally, a new unit test has been added to validate the behavior of a notebook containing a malformed pip cell. This test checks that the notebook can still be parsed and built into a dependency graph without issues, even in the presence of non-MAGIC comments preceding the pip install command. The code for the test is written in Python and uses the Notebook, Dependency, and DependencyGraph classes to parse the notebook and build the dependency graph. The overall functionality of the code remains unchanged, and the code now correctly processes pip commands in the presence of non-MAGIC comments. * Temporarily ignore `MANAGED` HMS tables on external storage location ([#2837](#2837)). This release introduces changes to the behavior of the `_migrate_external_table` method in the `table_migrate.py` file, specifically for handling managed tables located on external storage. Previously, the method attempted to migrate any external table, but with this change, it now checks if the object type is 'MANAGED'. If it is, a warning message is logged, and the migration is skipped due to UCX's lack of support for migrating managed tables on external storage. This change affects the existing workflow, specifically the behavior of the `migrate_dbfs_root_tables` function in the HMS table migration test suite. The function now checks for the absence of certain SQL queries, specifically those involving `SYNC TABLE` and `ALTER TABLE`, in the `backend.queries` list to ensure that queries related to managed tables on external storage locations are excluded. This release includes unit tests and integration tests to verify the changes and ensure proper behavior for the modified workflow. Issue [#2838](#2838) has been resolved with this commit. * Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)). In this release, we have updated the sqlglot requirement in the pyproject.toml file to allow for any version greater than or equal to 25.5.0 but less than 25.25. This resolves a conflict in the previous requirement, which ranged from >=25.5.0 to <25.23. The update includes several bug fixes, refactors, and new features, such as support for the OVERLAY function in PostgreSQL and a flag to automatically exclude Keep diff nodes. Additionally, the check_deploy job has been simplified, and the supported dialect count has increased from 21 to 23. This update ensures that the project remains up-to-date and compatible with the latest version of sqlglot, while also improving functionality and stability. * Whitelists catalogue library ([#2780](#2780)). In this release, we've implemented a change to whitelist the catalogue library, which partially addresses issue [#193](#193). This improvement allows for the reliable and secure use of the catalogue library in our open-source project. The whitelisting ensures that any potential security threats originating from this library are mitigated, enhancing the overall security of our software. This enhancement also promotes better code maintainability and readability, making it easier for software engineers to understand the library's role in the project. By addressing this issue, our library becomes more robust, dependable, and maintainable for both current and future developments. * Whitelists circuitbreaker ([#2783](#2783)). A circuit breaker pattern has been implemented in the library to enhance fault tolerance and prevent cascading failures by introducing a delay before retrying requests to a failed service. This feature is configurable and allows users to specify which services should be protected by the circuit breaker pattern via a whitelist in the `known.json` configuration file. A new entry for `circuitbreaker` is added to the configuration, containing an empty list for the circuit breaker whitelist. This development partially addresses issue [#1931](#1931), aimed at improving system resilience and fault tolerance, and is a significant stride towards building a more robust and reliable open-source library. * Whitelists cloudpathlib ([#2784](#2784)). In this release, we have whitelisted the cloudpathlib library by adding it to the known.json file. Cloudpathlib is a Python library for manipulating cloud paths, and includes several modules for interacting with various cloud storage systems. Each module has been added to the known.json file with an empty list, indicating that no critical issues have been found in these modules. However, we have added warnings for the use of direct filesystem references in specific classes and methods within the cloudpathlib.azure.azblobclient, cloudpathlib.azure.azblobpath, cloudpathlib.cloudpath, cloudpathlib.gs.gsclient, cloudpathlib.gs.gspath, cloudpathlib.local.implementations.azure, cloudpathlib.local.implementations.gs, cloudpathlib.local.implementations.s3, cloudpathlib.s3.s3client, and cloudpathlib.s3.sspath modules. The warning message indicates that the use of direct filesystem references is deprecated and will be removed in a future release. This change addresses a portion of issue [#1931](#1931). * Whitelists colorful ([#2785](#2785)). In this release, we have added support for the `colorful` library, a Python package for generating ANSI escape codes to colorize terminal output. The library contains several modules, including "ansi", "colors", "core", "styles", "terminal", and "utils", all of which have been whitelisted and added to the "known.json" file. This change resolves issue [#1931](#1931) and broadens the range of approved libraries that can be used in the project, enabling more flexible and visually appealing terminal output. * Whitelists cymem ([#2793](#2793)). In this release, we have made changes to the known.json file to whitelist the use of the cymem package in our project. This new entry includes sub-entries such as "cymem", "cymem.about", "cymem.tests", and "cymem.tests.test_import", which likely correspond to specific components or aspects of the package that require whitelisting. This change partially addresses issue [#1931](#1931), which may have been caused by the use or testing of the cymem package. It is important to note that this commit does not modify any existing functionality or add any new methods; rather, it simply grants permission for the cymem package to be used in our project. * Whitelists dacite ([#2795](#2795)). In this release, we have whitelisted the dacite library in our known.json file. Dacite is a library that enables the instantiation of Python classes with type hints, providing more robust and flexible object creation. By whitelisting dacite, users of our project can now utilize this library in their code without encountering any compatibility issues. This change partially addresses issue [#1931](#1931), which may have involved dacite or type hinting more generally, thereby enhancing the overall functionality and flexibility of our project for software engineers. * Whitelists databricks-automl-runtime ([#2794](#2794)). A new change has been implemented to whitelist the `databricks-automl-runtime` in the "known.json" file, enabling several nested packages and modules related to Databricks' auto ML runtime for forecasting and hyperparameter tuning. The newly added modules provide functionalities for data preprocessing and model training, including handling time series data, missing values, and one-hot encoding. This modification addresses a portion of issue [#1931](#1931), improving the library's compatibility with Databricks' auto ML runtime. * Whitelists dataclasses-json ([#2792](#2792)). A new configuration has been added to the "known.json" file, whitelisting the `dataclasses-json` library, which provides serialization and deserialization functionality to Python dataclasses. This change partially resolves issue [#1931](#1931) and introduces new methods for serialization and deserialization through this library. Additionally, the libraries `marshmallow` and its associated modules, as well as "typing-inspect," have also been whitelisted, adding further serialization and deserialization capabilities. It's important to note that these changes do not affect existing functionality, but instead provide new options for handling these data structures. * Whitelists dbl-tempo ([#2791](#2791)). A new library, dbl-tempo, has been whitelisted and is now approved for use in the project. This library provides functionality related to tempo, including interpolation, intervals, resampling, and utility methods. These new methods have been added to the known.json file, indicating that they are now recognized and approved for use. This change is critical for maintaining backward compatibility and project maintainability. It addresses part of issue [#1931](#1931) and ensures that any new libraries or methods are thoroughly vetted and documented before implementation. Software engineers are encouraged to familiarize themselves with the new library and its capabilities. * whitelist blis ([#2776](#2776)). In this release, we have added the high-performance computing library `blis` to our whitelist, partially addressing issue [#1931](#1931). The blis library is optimized for various CPU architectures and provides dense linear algebra capabilities, which can improve the performance of workloads that utilize these operations. With this change, the blis library and its components have been included in our system's whitelist, enabling users to leverage its capabilities. Familiarity with high-performance libraries and their impact on system performance is essential for software engineers, and the addition of blis to our whitelist is a testament to our commitment to providing optimal performance. * whitelists brotli ([#2777](#2777)). In this release, we have partially addressed issue [#1931](#1931) by adding support for the Brotli data compression algorithm in our project. The Brotli JSON object and an empty array for `brotli` have been added to the "known.json" configuration file to recognize and support its use. This change does not modify any existing functionality or introduce new methods, but rather whitelists Brotli as a supported algorithm for future use in the project. This enhancement allows for more flexibility and options when working with data compression, providing software engineers with an additional tool for optimization and performance improvements. Dependency updates: * Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)).
nfx
added a commit
that referenced
this pull request
Oct 17, 2024
* Added `lazy_loader` to known list ([#2991](#2991)). With this commit, the `lazy_loader` module has been added to the known list in the configuration file, addressing a portion of issue [#193](#193), which may have been caused by the discovery or loading of this module. The `lazy_loader` is a package or module that, once added to the known list, will be recognized and loaded by the system. This change does not affect any existing functionality or introduce new methods. The commit solely updates the known.json file to include `lazy_loader` with an empty list, indicating that it is ready for use. This modification will enable the correct loading and recognition of the `lazy_loader` module in the system. * Added `librosa` to known list ([#2992](#2992)). In this update, we have added several open-source libraries to the known list in the configuration file, including `librosa`, `llvmlite`, `msgpack`, `pooch`, `soundfile`, and `soxr`. These libraries are commonly used in data engineering, machine learning, and scientific computing tasks. `librosa` is a Python library for audio and music analysis, while `llvmlite` is a lightweight Python interface to the LLVM compiler infrastructure. `msgpack` is a binary serialization format like JSON, `pooch` is a package for managing external data files, `soundfile` is a library for reading and writing audio files, and `soxr` is a library for high-quality audio resampling. Each library has an empty list next to it for specifying additional configuration related to the library. This update partially resolves issue [#1931](#1931) by adding `librosa` to the known list, ensuring that these libraries will be properly recognized and utilized by the codebase. * Added `linkify-it-py` to known list ([#2993](#2993)). In this release, we have added support for two new open-source packages, `linkify-it-py` and `uc-micro-py`, to enhance the software's functionality and compatibility. The addition of `linkify-it-py` and its constituent modules, as well as the incorporation of `uc-micro-py` with its modules and classes, aims to expand the software's capabilities. These changes are related to the resolution of issue [#1931](#1931), and they will enable the software to work seamlessly with these packages, thereby providing a better user experience. * Added `lz4` to known list ([#2994](#2994)). In this release, we have added support for the LZ4 lossless data compression algorithm, which is known for its focus on compression and decompression speed. The implementation includes four variants: lz4, lz4.block, lz4.frame, and lz4.version, each providing different levels of compression and decompression speed and flexibility. This addition expands the range of supported compression algorithms, providing more options for users to choose from and partially addressing issue [#1931](#1931) related to supporting additional compression algorithms. This improvement will be beneficial to software engineers working with data compression in their projects. * Fixed `SystemError: AST constructor recursion depth mismatch` failing the entire job ([#3000](#3000)). This PR introduces more deterministic, Go-style, error handling for parsing Python code, addressing issues that caused the entire job to fail due to a `SystemError: AST constructor recursion depth mismatch` ([#3000](#3000)) and bug [#2976](#2976). It includes removing the `AstroidSyntaxError` import, adding an import for `SqlglotError`, and updating the `SqlParseError` exception to `SqlglotError` in the `lint` method of the `SqlLinter` class. Additionally, abstract classes `TablePyCollector` and `DfsaPyCollector` and their respective methods for collecting tables and direct file system accesses have been removed. The `PythonSequentialLinter` class, previously handling multiple responsibilities, has also been removed, enhancing code modularity, understandability, maintainability, and testability. The changes affect the `base.py`, `python_ast.py`, and `python_sequential_linter.py` modules. * Skip applying permissions for workspace system groups to Unity Catalog resources ([#2997](#2997)). This commit introduces changes to the ACL-related code in the `databricks labs ucx create-catalog-schemas` command and the `migrate-table-*` workflow, skipping the application of permissions for workspace system groups in the Unity Catalog. These system groups, which include 'admins', do not exist at the account level. To ensure the correctness of these modifications, unit and integration tests have been added, including a test that checks the proper handling of user privileges in system groups during catalog schema creation. The `AccessControlResponse` object has been updated for the `admins` and `users` groups, granting them specific permissions for a workspace and warehouse object, respectively, enhancing the system's functionality in multi-user environments with system groups.
Merged
nfx
added a commit
that referenced
this pull request
Oct 17, 2024
* Added `lazy_loader` to known list ([#2991](#2991)). With this commit, the `lazy_loader` module has been added to the known list in the configuration file, addressing a portion of issue [#193](#193), which may have been caused by the discovery or loading of this module. The `lazy_loader` is a package or module that, once added to the known list, will be recognized and loaded by the system. This change does not affect any existing functionality or introduce new methods. The commit solely updates the known.json file to include `lazy_loader` with an empty list, indicating that it is ready for use. This modification will enable the correct loading and recognition of the `lazy_loader` module in the system. * Added `librosa` to known list ([#2992](#2992)). In this update, we have added several open-source libraries to the known list in the configuration file, including `librosa`, `llvmlite`, `msgpack`, `pooch`, `soundfile`, and `soxr`. These libraries are commonly used in data engineering, machine learning, and scientific computing tasks. `librosa` is a Python library for audio and music analysis, while `llvmlite` is a lightweight Python interface to the LLVM compiler infrastructure. `msgpack` is a binary serialization format like JSON, `pooch` is a package for managing external data files, `soundfile` is a library for reading and writing audio files, and `soxr` is a library for high-quality audio resampling. Each library has an empty list next to it for specifying additional configuration related to the library. This update partially resolves issue [#1931](#1931) by adding `librosa` to the known list, ensuring that these libraries will be properly recognized and utilized by the codebase. * Added `linkify-it-py` to known list ([#2993](#2993)). In this release, we have added support for two new open-source packages, `linkify-it-py` and `uc-micro-py`, to enhance the software's functionality and compatibility. The addition of `linkify-it-py` and its constituent modules, as well as the incorporation of `uc-micro-py` with its modules and classes, aims to expand the software's capabilities. These changes are related to the resolution of issue [#1931](#1931), and they will enable the software to work seamlessly with these packages, thereby providing a better user experience. * Added `lz4` to known list ([#2994](#2994)). In this release, we have added support for the LZ4 lossless data compression algorithm, which is known for its focus on compression and decompression speed. The implementation includes four variants: lz4, lz4.block, lz4.frame, and lz4.version, each providing different levels of compression and decompression speed and flexibility. This addition expands the range of supported compression algorithms, providing more options for users to choose from and partially addressing issue [#1931](#1931) related to supporting additional compression algorithms. This improvement will be beneficial to software engineers working with data compression in their projects. * Fixed `SystemError: AST constructor recursion depth mismatch` failing the entire job ([#3000](#3000)). This PR introduces more deterministic, Go-style, error handling for parsing Python code, addressing issues that caused the entire job to fail due to a `SystemError: AST constructor recursion depth mismatch` ([#3000](#3000)) and bug [#2976](#2976). It includes removing the `AstroidSyntaxError` import, adding an import for `SqlglotError`, and updating the `SqlParseError` exception to `SqlglotError` in the `lint` method of the `SqlLinter` class. Additionally, abstract classes `TablePyCollector` and `DfsaPyCollector` and their respective methods for collecting tables and direct file system accesses have been removed. The `PythonSequentialLinter` class, previously handling multiple responsibilities, has also been removed, enhancing code modularity, understandability, maintainability, and testability. The changes affect the `base.py`, `python_ast.py`, and `python_sequential_linter.py` modules. * Skip applying permissions for workspace system groups to Unity Catalog resources ([#2997](#2997)). This commit introduces changes to the ACL-related code in the `databricks labs ucx create-catalog-schemas` command and the `migrate-table-*` workflow, skipping the application of permissions for workspace system groups in the Unity Catalog. These system groups, which include 'admins', do not exist at the account level. To ensure the correctness of these modifications, unit and integration tests have been added, including a test that checks the proper handling of user privileges in system groups during catalog schema creation. The `AccessControlResponse` object has been updated for the `admins` and `users` groups, granting them specific permissions for a workspace and warehouse object, respectively, enhancing the system's functionality in multi-user environments with system groups.
nfx
added a commit
that referenced
this pull request
Oct 30, 2024
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). In this release, we have added a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration process without executing it. This feature also introduces the `hms-fed` flag, allowing migration of HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file has been updated to include new parameters, `sql_backend` and `inventory_database`, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new `retrieve` method has been added to the `ACLMigrator` class to retrieve a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. We have also introduced a new `inferred_grants` table in the deployment schema to store inferred grants during the migration process. * Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). In this release, we introduce a new class `WorkspacePathOwnership` in the `owners.py` module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of `Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as inputs. It has methods to infer the owner from the first `CAN_MANAGE` permission level in the access control list. We also added a new property `workspace_path_ownership` to the existing `HiveMetastoreContext` class, which returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to `test_owners.py`. The new tests, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. The `AdministratorLocator` is used to locate the administrators group for the workspace and the `PermissionLevel` class is used to specify the permission level for the notebook permissions. * Added `mosaicml-streaming` to known list ([#3029](#3029)). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931) and enhances the system's ability to identify and work with a wider variety of packages. * Added `msal-extensions` to known list ([#3030](#3030)). In this release, we have added support for two new packages, `msal-extensions` and `portalocker`, to our project. The `msal-extensions` package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The `portalocker` package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications. * Added `multimethod` to known list ([#3031](#3031)). In this release, we have added support for the `multimethod` programming concept to the library. This feature has been added to the `known.json` file, which partially resolves issue [#193](#193) * Added `murmurhash` to known list ([#3032](#3032)). A new hash function, MurmurHash, has been added to the library's supported list, addressing part of issue [#1931](#1931). The MurmurHash function includes two variants, `murmurhash` and "murmurhash.about", with distinct functionalities. The `murmurhash` variant offers core hashing functionality, while "murmurhash.about" contains metadata or documentation related to the MurmurHash function. This integration enables developers to leverage MurmurHash for data processing tasks, enhancing the library's functionality and versatility. Users familiar with the project can now incorporate MurmurHash into their applications and configurations, taking advantage of its unique features and capabilities. * Added `ninja` to known list ([#3050](#3050)). In this release, we have added Ninja to the known list in the `known.json` file. Ninja is a fast, lightweight build system that enables better integration and handling within the project's larger context. This change partially resolves issue [#1931](#1931), which may have been caused by challenges in integrating or using Ninja. It is important to note that this change does not modify any existing functionality or introduce new methods. The alteration is limited to including Ninja in the known list, improving the management and identification of various components within the project. * Added `nvidia-ml-py` to known list ([#3051](#3051)). In this release, we have added support for the `nvidia-ml-py` package to our project. This addition consists of two components: `example` and 'pynvml'. `Example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This enhancement is a significant step towards resolving issue [#1931](#1931), which may require the use of NVIDIA-related tools or libraries, thereby improving the project's functionality and capabilities. * Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard. * Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue [#3058](#3058), which was related to grant progress encoding. The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database. * Added table progress encoder ([#3083](#3083)). In this release, we've added a table progress encoder to the WorkflowTask context to enhance the tracking of table-related operations in the migration-progress workflow. This new encoder, implemented in the TableProgressEncoder class, is connected to the sql_backend, table_ownership, and migration_status_refresher objects. The GrantsProgressEncoder class has been refactored to GrantProgressEncoder, with additional parameters for improved encoding of grants. We've also introduced the refresh_table_migration_status task to scan and record the migration status of tables and views in the inventory, storing results in the $inventory.migration_status inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues [#3061](#3061) and [#3064](#3064). * Combine static code analysis results with historical job snapshots ([#3074](#3074)). In this release, we have added a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This method is used to track the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method which only tracked the progress of grants. The `JobsProgressEncoder` method takes in additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs and is used in the `grants_progress` method to track the progress of jobs in the context of a workflow task. We have also added a new unit test for the `JobsProgressEncoder` class in the `databricks.labs.ucx` project to ensure that the encoding of job information works as expected with different types of failures and job details. Additionally, this revision introduces the ability to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class and provides additional functionality for tracking the progress of jobs. * Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes in `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. Additionally, the `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and connect with `WorkspacePathOwnership`, enhancing the tool's functionality by determining access ownership in DirectFS and improving overall system security and permissions management. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object. * Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include the addition of a `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and run_id. The new methods also include a `TableMigrationStatus` class with a new class variable `__id_attributes__` to specify the attributes used to uniquely identify a table. The `destination()` method has been added to the `TableMigrationStatus` class to return the fully qualified name of the destination table. Additionally, unit and integration tests have been added and updated to ensure the functionality works as expected. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class. The `__id_attributes__` class variable has also been added to these classes to specify the attributes used to uniquely identify them. * Determine ownership of tables based on grants and source code ([#3066](#3066)). In this release, changes have been made to the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class `LegacyQueryOwnership` has been added to the `databricks.labs.ucx.framework.owners` module to determine the owner of a table based on the queries that write to it. The `TableOwnership` class has been updated to accept additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class has also been updated to accept a new `legacy_query_ownership` argument. Additionally, a new method `owner_of_path` has been added to the `Ownership` class, and the `LegacyQueryOwnership` class has been added as a subclass of `Ownership`. A new file `ownership.py` has been introduced, which defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory. These changes provide a more accurate and consistent ownership information for tables in the inventory. * Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been updated to improve its resiliency in the event of a pipeline deletion during crawling. Instead of failing, the crawler now logs a warning and continues to crawl when a pipeline is deleted. A new test method, `test_pipeline_disappears_during_crawl`, has been added to verify that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added to ensure the proper functioning of the changes. Additionally, the `test_pipeline_list_with_no_config` test case has been added to check the behavior of the pipelines crawler when there is no configuration present. This pull request aims to enhance the robustness of the assessment feature and ensure its continued operation even in the face of unexpected pipeline deletions. * Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). In this release, we have enhanced the error handling capabilities of the open-source library by fixing a `UnicodeDecodeError` issue that occurred when fetching init scripts in the `_get_init_script_data` method. To address this, we have added `UnicodeDecodeError` and `FileNotFoundError` to the list of exceptions handled in the method. Now, when any of these exceptions occur, the method will return `None` and a warning message will be logged instead of raising an unhandled exception. This change ensures that the function operates smoothly and provides better error handling in the library, without modifying the behavior of the `_check_cluster_init_script` method, which remains unchanged and continues to verify the correct setup of init scripts in the cluster. * Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)). In this release, we have made significant improvements to the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). We have resolved an `UnknownHostException` problem in a specific KeyVault and implemented error handling for invalid Azure Key Vaults, ensuring more robust and reliable system behavior. Additionally, we have expanded `NotFound` exception handling to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret will be skipped, and a warning message will be logged. This enhancement provides a more comprehensive solution to handle various exceptions that may arise when dealing with secrets stored in Azure Key Vaults. * Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). The recent change to the open-source library addresses the 'Unsupported schema: XXX' error in the `assess_workflows` function. This was achieved by introducing a new exception class, 'InvalidPath', in the `WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath` in the 'jobs.py' file. The `InvalidPath` exception is used to provide a more specific error message for unsupported schema paths. The `WorkspaceCache` mixin now includes an `InvalidPath` exception for caching workspace paths. The error handling in the 'jobs.py' file has been modified to raise `InvalidPath` instead of `ValueError` for better error messages. Additionally, the 'test_cached_workspace_path.py' file has updates for testing the `WorkspaceCache` object, including the addition of the `InvalidPath` exception for non-absolute paths, and a new test function for this exception. The `WorkspaceCache` class has an ellipsis in the `__init__` method, indicating additional initialization code not shown in this diff. * Fixed `assert curr.location is not None` ([#3105](#3105)). In this release, we have addressed a potential issue in the `_external_locations` method which failed to check if the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. To rectify this, we have introduced a check for `None` that will bypass the current iteration of the loop if the location is not set, thereby improving the robustness of the code. The method continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit. This improvement will ensure that the method functions smoothly and avoids errors when dealing with Hive tables that do not have a location set. * Fixed dynamic import issue ([#3053](#3053)). In this release, we've addressed an issue related to dynamic import inference in our open-source library. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. Additionally, we've introduced new functions, `get_global(self, name: str)`, `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, has been added to ensure the proper functioning of the dynamic import feature. Moreover, a new function `is_from_module` has been introduced to check if a given name is from a specific module. This commit, co-authored by Eric Vergnaud, significantly enhances the code's ability to infer imports in dynamic import scenarios. * Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the CONVERT_TO_EXTERNAL scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is tested with unit and integration tests. The migrate table function for external sync ensures the table is migrated as external to UC post-conversion. The changes include adding a new workflow and modifying an existing one, and updates the existing workflow to rename the migrate_tables function to convert_managed_hms_to_external. The new function handles the conversion of managed HMS tables to external, and updates the object_type property of the table in the inventory database to `EXTERNAL` after the conversion is completed. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process. * Fixed issue with table location on storage root ([#3094](#3094)). In this release, we have implemented changes to address an issue related to the incorrect identification of the parent folder as an external location when there is a single table with a prefix that matches a parent folder. Additionally, we have improved the storage and retrieval of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore. This includes handling S3 bucket URLs that do not include a specific file or path, and those with a path that does not include a file. We have also added new test cases for these URL formats and modified existing ones to include them. These changes ensure correct identification of external locations and improve functionality and flexibility of the Hive Metastore's support for external table locations. The new methods added are not explicitly stated, but they likely involve functions for parsing and processing the new S3 bucket URL formats. * Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues related to snapshot loading for the DFSA and used-table crawlers when using the spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the spark-based lsql backend. The fix involves replacing this method with `.asDict()`. Additionally, integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the test_queries.py file and do not affect other parts of the project. No new methods were added, and existing functionality changes were limited to updating the snapshot loading process. * Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file has been updated in the `databricks/labs/ucx/source_code` directory to improve the user experience in the notebook editor. The changes include disabling certain advice codes from being propagated, specifically: 'cannot-autofix-table-reference', 'default-format-changed-in-dbr8', 'dependency-not-found', 'not-supported', 'notebook-run-cannot-compute-value', 'sql-parse-error', 'sys-path-cannot-compute-value', and 'unsupported-magic-line'. A new variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes. These updates aim to reduce the number of unnecessary error messages and improve the accuracy of the linter for supported codes. * Improve scan tables in mounts ([#2767](#2767)). In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved, providing a more robust and comprehensive solution. Previously, the implementation skipped most directories, only finding 8 tables, but this issue has been addressed, allowing the updated version to parse many more tables. The commit includes bug fixes and the addition of new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. While manual and unit tests have been added and verified, integration tests are still pending implementation. The co-author of this commit is Dan Zafar. * Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This change simplifies the codebase, removing the need for a separate linter while maintaining essential functionality for ensuring Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed due to the integration of the `WorkflowLinter` into the `Assessment` workflow. Manual testing has been conducted to ensure the correctness of these changes and the continued proper functioning of the assessment workflow. * Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit enhances the open-source library by updating the permissions crawling process for secret scopes, addressing the issue of task failure when a secret scope disappears before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method have been improved to handle such cases, logging a warning instead of failing the task. The return type of the `get_crawler_tasks` method has been updated to Iterable[Callable[[], Permissions | None]]. These changes improve the reliability and robustness of the permissions crawling process for secret scopes, ensuring task completion in the face of unexpected scope disappearances. * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). In this pull request, we have updated the sqlglot library requirement to incorporate the latest version, which includes various bug fixes, refactors, and exciting new features. The latest version now supports the TO_DOUBLE and TRY_TO_TIMESTAMP functions in Snowflake and the EDIT_DISTANCE (Levinshtein) function in BigQuery. Moreover, we've addressed an issue with the ARRAY JOIN function in Clickhouse and made changes to the hive dialect hierarchy. We encourage users to update to this latest version to benefit from these enhancements and fixes, ensuring optimal performance and functionality of the library. * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). In this release, we have updated the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This change was made to allow for the use of the latest features and bug fixes available in 'sqlglot', while avoiding the breaking changes that were introduced in version 25.27. The new version of `sqlglot` offers several improvements, including but not limited to enhanced query optimization, expanded support for various SQL dialects, and better error handling. We recommend that all users upgrade to the latest version of `sqlglot` to take advantage of these new features and improvements. * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release includes an update to the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28, to a range that includes 25.5.0 up to but excluding 25.29. This change allows for the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line. Dependency updates: * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
Merged
nfx
added a commit
that referenced
this pull request
Oct 30, 2024
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). In this release, we have added a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration process without executing it. This feature also introduces the `hms-fed` flag, allowing migration of HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file has been updated to include new parameters, `sql_backend` and `inventory_database`, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new `retrieve` method has been added to the `ACLMigrator` class to retrieve a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. We have also introduced a new `inferred_grants` table in the deployment schema to store inferred grants during the migration process. * Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). In this release, we introduce a new class `WorkspacePathOwnership` in the `owners.py` module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of `Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as inputs. It has methods to infer the owner from the first `CAN_MANAGE` permission level in the access control list. We also added a new property `workspace_path_ownership` to the existing `HiveMetastoreContext` class, which returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to `test_owners.py`. The new tests, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. The `AdministratorLocator` is used to locate the administrators group for the workspace and the `PermissionLevel` class is used to specify the permission level for the notebook permissions. * Added `mosaicml-streaming` to known list ([#3029](#3029)). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931) and enhances the system's ability to identify and work with a wider variety of packages. * Added `msal-extensions` to known list ([#3030](#3030)). In this release, we have added support for two new packages, `msal-extensions` and `portalocker`, to our project. The `msal-extensions` package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The `portalocker` package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications. * Added `multimethod` to known list ([#3031](#3031)). In this release, we have added support for the `multimethod` programming concept to the library. This feature has been added to the `known.json` file, which partially resolves issue [#193](#193) * Added `murmurhash` to known list ([#3032](#3032)). A new hash function, MurmurHash, has been added to the library's supported list, addressing part of issue [#1931](#1931). The MurmurHash function includes two variants, `murmurhash` and "murmurhash.about", with distinct functionalities. The `murmurhash` variant offers core hashing functionality, while "murmurhash.about" contains metadata or documentation related to the MurmurHash function. This integration enables developers to leverage MurmurHash for data processing tasks, enhancing the library's functionality and versatility. Users familiar with the project can now incorporate MurmurHash into their applications and configurations, taking advantage of its unique features and capabilities. * Added `ninja` to known list ([#3050](#3050)). In this release, we have added Ninja to the known list in the `known.json` file. Ninja is a fast, lightweight build system that enables better integration and handling within the project's larger context. This change partially resolves issue [#1931](#1931), which may have been caused by challenges in integrating or using Ninja. It is important to note that this change does not modify any existing functionality or introduce new methods. The alteration is limited to including Ninja in the known list, improving the management and identification of various components within the project. * Added `nvidia-ml-py` to known list ([#3051](#3051)). In this release, we have added support for the `nvidia-ml-py` package to our project. This addition consists of two components: `example` and 'pynvml'. `Example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This enhancement is a significant step towards resolving issue [#1931](#1931), which may require the use of NVIDIA-related tools or libraries, thereby improving the project's functionality and capabilities. * Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard. * Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue [#3058](#3058), which was related to grant progress encoding. The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database. * Added table progress encoder ([#3083](#3083)). In this release, we've added a table progress encoder to the WorkflowTask context to enhance the tracking of table-related operations in the migration-progress workflow. This new encoder, implemented in the TableProgressEncoder class, is connected to the sql_backend, table_ownership, and migration_status_refresher objects. The GrantsProgressEncoder class has been refactored to GrantProgressEncoder, with additional parameters for improved encoding of grants. We've also introduced the refresh_table_migration_status task to scan and record the migration status of tables and views in the inventory, storing results in the $inventory.migration_status inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues [#3061](#3061) and [#3064](#3064). * Combine static code analysis results with historical job snapshots ([#3074](#3074)). In this release, we have added a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This method is used to track the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method which only tracked the progress of grants. The `JobsProgressEncoder` method takes in additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs and is used in the `grants_progress` method to track the progress of jobs in the context of a workflow task. We have also added a new unit test for the `JobsProgressEncoder` class in the `databricks.labs.ucx` project to ensure that the encoding of job information works as expected with different types of failures and job details. Additionally, this revision introduces the ability to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class and provides additional functionality for tracking the progress of jobs. * Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes in `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. Additionally, the `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and connect with `WorkspacePathOwnership`, enhancing the tool's functionality by determining access ownership in DirectFS and improving overall system security and permissions management. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object. * Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include the addition of a `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and run_id. The new methods also include a `TableMigrationStatus` class with a new class variable `__id_attributes__` to specify the attributes used to uniquely identify a table. The `destination()` method has been added to the `TableMigrationStatus` class to return the fully qualified name of the destination table. Additionally, unit and integration tests have been added and updated to ensure the functionality works as expected. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class. The `__id_attributes__` class variable has also been added to these classes to specify the attributes used to uniquely identify them. * Determine ownership of tables based on grants and source code ([#3066](#3066)). In this release, changes have been made to the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class `LegacyQueryOwnership` has been added to the `databricks.labs.ucx.framework.owners` module to determine the owner of a table based on the queries that write to it. The `TableOwnership` class has been updated to accept additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class has also been updated to accept a new `legacy_query_ownership` argument. Additionally, a new method `owner_of_path` has been added to the `Ownership` class, and the `LegacyQueryOwnership` class has been added as a subclass of `Ownership`. A new file `ownership.py` has been introduced, which defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory. These changes provide a more accurate and consistent ownership information for tables in the inventory. * Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been updated to improve its resiliency in the event of a pipeline deletion during crawling. Instead of failing, the crawler now logs a warning and continues to crawl when a pipeline is deleted. A new test method, `test_pipeline_disappears_during_crawl`, has been added to verify that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added to ensure the proper functioning of the changes. Additionally, the `test_pipeline_list_with_no_config` test case has been added to check the behavior of the pipelines crawler when there is no configuration present. This pull request aims to enhance the robustness of the assessment feature and ensure its continued operation even in the face of unexpected pipeline deletions. * Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). In this release, we have enhanced the error handling capabilities of the open-source library by fixing a `UnicodeDecodeError` issue that occurred when fetching init scripts in the `_get_init_script_data` method. To address this, we have added `UnicodeDecodeError` and `FileNotFoundError` to the list of exceptions handled in the method. Now, when any of these exceptions occur, the method will return `None` and a warning message will be logged instead of raising an unhandled exception. This change ensures that the function operates smoothly and provides better error handling in the library, without modifying the behavior of the `_check_cluster_init_script` method, which remains unchanged and continues to verify the correct setup of init scripts in the cluster. * Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)). In this release, we have made significant improvements to the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). We have resolved an `UnknownHostException` problem in a specific KeyVault and implemented error handling for invalid Azure Key Vaults, ensuring more robust and reliable system behavior. Additionally, we have expanded `NotFound` exception handling to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret will be skipped, and a warning message will be logged. This enhancement provides a more comprehensive solution to handle various exceptions that may arise when dealing with secrets stored in Azure Key Vaults. * Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). The recent change to the open-source library addresses the 'Unsupported schema: XXX' error in the `assess_workflows` function. This was achieved by introducing a new exception class, 'InvalidPath', in the `WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath` in the 'jobs.py' file. The `InvalidPath` exception is used to provide a more specific error message for unsupported schema paths. The `WorkspaceCache` mixin now includes an `InvalidPath` exception for caching workspace paths. The error handling in the 'jobs.py' file has been modified to raise `InvalidPath` instead of `ValueError` for better error messages. Additionally, the 'test_cached_workspace_path.py' file has updates for testing the `WorkspaceCache` object, including the addition of the `InvalidPath` exception for non-absolute paths, and a new test function for this exception. The `WorkspaceCache` class has an ellipsis in the `__init__` method, indicating additional initialization code not shown in this diff. * Fixed `assert curr.location is not None` ([#3105](#3105)). In this release, we have addressed a potential issue in the `_external_locations` method which failed to check if the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. To rectify this, we have introduced a check for `None` that will bypass the current iteration of the loop if the location is not set, thereby improving the robustness of the code. The method continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit. This improvement will ensure that the method functions smoothly and avoids errors when dealing with Hive tables that do not have a location set. * Fixed dynamic import issue ([#3053](#3053)). In this release, we've addressed an issue related to dynamic import inference in our open-source library. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. Additionally, we've introduced new functions, `get_global(self, name: str)`, `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, has been added to ensure the proper functioning of the dynamic import feature. Moreover, a new function `is_from_module` has been introduced to check if a given name is from a specific module. This commit, co-authored by Eric Vergnaud, significantly enhances the code's ability to infer imports in dynamic import scenarios. * Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the CONVERT_TO_EXTERNAL scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is tested with unit and integration tests. The migrate table function for external sync ensures the table is migrated as external to UC post-conversion. The changes include adding a new workflow and modifying an existing one, and updates the existing workflow to rename the migrate_tables function to convert_managed_hms_to_external. The new function handles the conversion of managed HMS tables to external, and updates the object_type property of the table in the inventory database to `EXTERNAL` after the conversion is completed. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process. * Fixed issue with table location on storage root ([#3094](#3094)). In this release, we have implemented changes to address an issue related to the incorrect identification of the parent folder as an external location when there is a single table with a prefix that matches a parent folder. Additionally, we have improved the storage and retrieval of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore. This includes handling S3 bucket URLs that do not include a specific file or path, and those with a path that does not include a file. We have also added new test cases for these URL formats and modified existing ones to include them. These changes ensure correct identification of external locations and improve functionality and flexibility of the Hive Metastore's support for external table locations. The new methods added are not explicitly stated, but they likely involve functions for parsing and processing the new S3 bucket URL formats. * Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues related to snapshot loading for the DFSA and used-table crawlers when using the spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the spark-based lsql backend. The fix involves replacing this method with `.asDict()`. Additionally, integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the test_queries.py file and do not affect other parts of the project. No new methods were added, and existing functionality changes were limited to updating the snapshot loading process. * Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file has been updated in the `databricks/labs/ucx/source_code` directory to improve the user experience in the notebook editor. The changes include disabling certain advice codes from being propagated, specifically: 'cannot-autofix-table-reference', 'default-format-changed-in-dbr8', 'dependency-not-found', 'not-supported', 'notebook-run-cannot-compute-value', 'sql-parse-error', 'sys-path-cannot-compute-value', and 'unsupported-magic-line'. A new variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes. These updates aim to reduce the number of unnecessary error messages and improve the accuracy of the linter for supported codes. * Improve scan tables in mounts ([#2767](#2767)). In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved, providing a more robust and comprehensive solution. Previously, the implementation skipped most directories, only finding 8 tables, but this issue has been addressed, allowing the updated version to parse many more tables. The commit includes bug fixes and the addition of new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. While manual and unit tests have been added and verified, integration tests are still pending implementation. The co-author of this commit is Dan Zafar. * Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This change simplifies the codebase, removing the need for a separate linter while maintaining essential functionality for ensuring Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed due to the integration of the `WorkflowLinter` into the `Assessment` workflow. Manual testing has been conducted to ensure the correctness of these changes and the continued proper functioning of the assessment workflow. * Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit enhances the open-source library by updating the permissions crawling process for secret scopes, addressing the issue of task failure when a secret scope disappears before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method have been improved to handle such cases, logging a warning instead of failing the task. The return type of the `get_crawler_tasks` method has been updated to Iterable[Callable[[], Permissions | None]]. These changes improve the reliability and robustness of the permissions crawling process for secret scopes, ensuring task completion in the face of unexpected scope disappearances. * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). In this pull request, we have updated the sqlglot library requirement to incorporate the latest version, which includes various bug fixes, refactors, and exciting new features. The latest version now supports the TO_DOUBLE and TRY_TO_TIMESTAMP functions in Snowflake and the EDIT_DISTANCE (Levinshtein) function in BigQuery. Moreover, we've addressed an issue with the ARRAY JOIN function in Clickhouse and made changes to the hive dialect hierarchy. We encourage users to update to this latest version to benefit from these enhancements and fixes, ensuring optimal performance and functionality of the library. * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). In this release, we have updated the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This change was made to allow for the use of the latest features and bug fixes available in 'sqlglot', while avoiding the breaking changes that were introduced in version 25.27. The new version of `sqlglot` offers several improvements, including but not limited to enhanced query optimization, expanded support for various SQL dialects, and better error handling. We recommend that all users upgrade to the latest version of `sqlglot` to take advantage of these new features and improvements. * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release includes an update to the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28, to a range that includes 25.5.0 up to but excluding 25.29. This change allows for the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line. Dependency updates: * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR removes redundant pyspark, databricks-connect, delta-spark, and pandas dependencies and their usages.
After it we can use consistent crawlers across HMS Crawling and Workspace Permissions.
This PR supersedes and closes #105