Download wheel dependency locally to register it to the dependency graph #1704

JCZuurmond · 2024-05-16T15:13:19Z

Changes

This PR downloads a wheel dependency to a temporary local directory so that it can be registered to the dependency graph.
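A minimal sketch of the idea, not the code in this PR: copy the remote wheel into a temporary directory, then register the local file. `LocalGraph`, `register_remote_wheel`, and the injected `download` callable are hypothetical names used only for illustration. The real change lives in `src/databricks/labs/ucx/source_code/jobs.py` and may look quite different.

```python
import io
import shutil
import tempfile
from pathlib import Path, PurePosixPath
from typing import BinaryIO, Callable


class LocalGraph:
    """Hypothetical stand-in for a dependency graph; it only records library paths."""

    def __init__(self) -> None:
        self.libraries: list[Path] = []

    def register_library(self, path: Path) -> None:
        self.libraries.append(path)


def register_remote_wheel(
    remote_path: str,
    download: Callable[[str], BinaryIO],
    graph: LocalGraph,
    workdir: Path,
) -> Path:
    """Copy a remote wheel into workdir and register the local copy."""
    # Keep the wheel's file name so version and platform tags survive the copy.
    local_path = workdir / PurePosixPath(remote_path).name
    with download(remote_path) as remote, local_path.open("wb") as local:
        shutil.copyfileobj(remote, local)
    graph.register_library(local_path)
    return local_path


if __name__ == "__main__":
    # Assumption: a real downloader would stream bytes from the workspace.
    def fake_download(path: str) -> BinaryIO:
        return io.BytesIO(b"not-a-real-wheel")

    with tempfile.TemporaryDirectory() as tmp:
        graph = LocalGraph()
        register_remote_wheel(
            "/Workspace/libs/demo-0.1-py3-none-any.whl", fake_download, graph, Path(tmp)
        )
        print([p.name for p in graph.libraries])
```

Injecting the downloader keeps the sketch testable without a workspace. The temporary directory has to outlive any linting that reads the registered file.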

Linked issues

Part of #1640

Functionality

added relevant user documentation
added new CLI command
modified existing command: databricks labs ucx ...
added a new workflow
modified existing workflow: experimental-linter
added a new table
modified existing table: ...

Tests

manually tested
added unit tests
added integration tests
verified on staging environment (screenshot attached)

src/databricks/labs/ucx/source_code/jobs.py

codecov · 2024-05-21T12:29:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.46%. Comparing base (a8e823f) to head (2f2ea52).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1704   +/-   ##
=======================================
  Coverage   89.45%   89.46%           
=======================================
  Files          95       95           
  Lines       11839    11843    +4     
  Branches     2073     2075    +2     
=======================================
+ Hits        10591    10595    +4     
  Misses        853      853           
  Partials      395      395


nfx

few nits

src/databricks/labs/ucx/source_code/jobs.py

tests/integration/source_code/test_jobs.py

src/databricks/labs/ucx/source_code/python_libraries.py

JCZuurmond

Waiting for changes from: https://github.com/databrickslabs/ucx/tree/support-pip-install-requirements

…e dependency graph (#1753)

## Changes
Support linting job tasks with requirements.txt dependency

### Linked issues
Resolves #1644
Similar to #1704

### Functionality
- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [x] modified existing workflow: `experimental-workflow-linter`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [x] manually tested
- [x] added unit tests
- [x] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Serge Smertin <[email protected]>

* Added `%pip` cell resolver ([#1697](#1697)). A newly developed pip resolver has been integrated into the ImportResolver for future use, addressing issue [#1642](#1642) and following up on [#1694](#1694). The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the build_dependency_graph method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
* Added downloads of `requirements.txt` dependency locally to register it to the dependency graph ([#1753](#1753)). This commit introduces support for linting job tasks that require a 'requirements.txt' file for specifying dependencies. It resolves issue [#1644](#1644) and is similar to [#1704](#1704). The changes include the addition of a new CLI command, modification of the existing 'databricks labs ucx ...' command, and modification of the `experimental-workflow-linter` workflow. The `lint_job` method has been updated to handle dependencies specified in a 'requirements.txt' file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the 'jobs.py' file to register libraries specified in a 'requirements.txt' file to the dependency graph. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements files and constraints files.
* Added ability to install UCX on workspaces without Public Internet connectivity ([#1566](#1566)). A new flag, `upload_dependencies`, has been added to the WorkspaceConfig to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that is set to False by default and can be set by the user through the installation prompt. This feature resolves issue [#573](#573) and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the version of `databricks-labs-blueprint` from `<0.7.0` to `>=0.6.0`, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the `upload_dependencies` flag is set to True.
* Added initial interface for data comparison framework ([#1695](#1695)). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new `StandardDataComparator` class has been implemented for comparing the data of two tables, and a `StandardSchemaComparator` class tests the comparison of table schemas. The framework also includes the `DatabricksTableMetadataRetriever` class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as `StandardDataProfiler` for profiling data, `SchemaComparator` and `DataComparator` for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
* Added lint local code command ([#1710](#1710)). A new `lint local code` command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The `lint-local-code` command is implemented in the `application.py` file, with supporting methods and classes added to the `workspace_cli.py` and `databricks.labs.ucx.source_code` packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
* Added table in mount migration ([#1225](#1225)). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with include_paths_in_mount not being present in workflows.py, and adding the ability to set default ownership on each created table. A new method ScanTablesInMounts has been added to scan tables in mounts, and a TableMigration class creates tables in the Unity Catalog based on the table mapping. Two new methods, Rule and TableMapping, have been added to manage mappings of tables, and TableToMigrate is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the workflows.py file and the addition of several new methods, including Rule, TableMapping, TableToMigrate, create_autospec, and MockBackend.
* Added workflows to trigger table reconciliations ([#1721](#1721)). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's `$inventory_database.reconciliation_results` view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the WorkspaceConfig class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
* Always refresh HMS stats when getting table size ([#1713](#1713)). A change has been implemented in the hive_metastore library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This has been achieved by calling the ANALYZE TABLE command with the COMPUTE STATISTICS NOSCAN option before computing the table size, thus preventing the use of stale stats. Specifically, the "backend.queries" list has been updated to include two ANALYZE statements for tables "db1.table1" and "db1.table2", ensuring that their statistics are updated and accurate. The test case `test_table_size_crawler` in the "test_table_size.py" file has been revised to validate the presence of the two ANALYZE statements in the "backend.queries" list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality.
* Automatically retrieve `aws_account_id` from aws profile instead of prompting ([#1715](#1715)). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of `aws_account_id` by automatically retrieving it from the AWS profile. An optional `kms-key` flag has been documented for creating roles, providing more flexibility. The `create-missing-principals` command now accepts optional parameters such as KMS Key, Role Name, Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue [#1714](#1714). Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing `aws_cli_run_command`, ensuring automated retrieval of `aws_account_id`. A test has also been added to raise an error when AWS CLI is not found in the system path.
* Detect dependencies of libraries installed via pip ([#1703](#1703)). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues [#1642](#1642) and [#1202](#1202). It modifies certain tests and reduces their execution time. The PipResolver class in `databricks.labs.ucx.source_code.graph` is used to detect and resolve library dependencies installed via pip, with methods to locate, install, and register libraries in a specified folder. A new Whitelist feature and updated DistInfoPackage class are also included. Although unit tests have been added, no new user documentation, CLI commands, workflows, or tables have been added or modified. The previous site_packages attribute has been removed from the GlobalContext class.
* Emit problems with code belonging to job ([#1730](#1730)). In this release, the jobs.py file has been updated with new functionality in the JobProblem class, enabling it to convert itself into a string message using the new as_message() method. The refresh_report() method has been modified to call a new _lint_job() method when provided with a job object, which returns a list of JobProblem instances. The lint_job() method has also been updated to call _lint_job() and return a list of JobProblem instances, with a new behavior to log warning messages when problems are found. The changes include the addition of a new method, `lint_job`, for linting a job and returning any problems found. The changes have been tested through the addition of a new integration test, `test_job_linter_some_notebook_graph_with_problems`, and are manually tested and covered with unit and integration tests. This release addresses issue [#1542](#1542) and improves the job linter functionality, specifically detecting and emitting problems related to code belonging to a job during the lint job. The new `JobProblem` class has an `as_message()` method that returns a string representation of the problem, and a unit test for this method has been added. The `DependencyResolver` in the `DependencyGraph` constructor has also been modified.
* Fixed `create-catalogs-schemas` to allow more than 1 level nesting more than the external location ([#1701](#1701)). The `create-catalogs-schemas` library has been updated to allow for more than one level of nesting beyond the external location, addressing issue [#1700](#1700). This release includes a new CLI command, as well as modifications to the existing `databricks labs ucx ...` command. A new workflow has been added and existing functionality has been changed to support the additional nesting levels. The changes have been thoroughly tested through manual testing, unit tests, and integration tests using the `fnmatch.fnmatch` method for validating location patterns. Software engineers adopting this project will benefit from these enhancements.
* Fixed local file resolver logic with relative paths and site-packages ([#1685](#1685)). This commit addresses an issue ([#1685](#1685)) related to the local file resolver logic for relative paths and site-packages. The resolver's logic has been updated to look for `_package_/__init__.py` instead of relying on `dist-info` metadata, and the resolver has been wired back into the global resolver chain with updated calling code. No changes have been made to user documentation, CLI commands, workflows, or tables. New methods have not been added, but existing functionality has been modified to enhance local file resolution handling. Unit tests have been added and manually verified to ensure proper functionality.
* Fixed look up logic where instance profile name does not match role name ([#1716](#1716)). A fix has been implemented to improve the robustness of the instance profile lookup mechanism in the open-source library. Previously, the code relied on the role name being the same as the instance profile name, which resulted in issues when the names did not match ([#1716](#1716), [#1711](#1711)). This has been addressed by updating the `role_name` method in the `AWSRoleAction` class to use a new regex pattern 'AWSResources.ROLE_NAME_REGEX', and renaming the `get_instance_profile` method in the `AWSResources` class to `get_instance_profile_arn` to reflect the change in return type from a string to an ARN. A new method, 'get_instance_profile_role_arn', has also been added to the `AWSResources` class to retrieve the role ARN from the instance profile. Additionally, new methods `get_instance_profile_arn` and `instance_lookup` have been added to improve testing capabilities.
* Fixed pip install in a multiline cell ([#1728](#1728)). This release includes a fix for an issue where pip install commands with multiline code were not being handled correctly (issue [#1728](#1728), issue [#1642](#1642)). The `build_dependency_graph` function of the `PipCell` class has been updated to properly register the library specified in the pip install command, even if it is spread over multiple lines. The function now splits the original code by spaces or new lines, allowing it to extract the library name correctly. These changes have been thoroughly tested through manual testing and unit tests to ensure that pip install commands with multiline code are now handled correctly, resulting in the library being installed and registered properly.
* README update about Standard workspaces ([#1734](#1734)). In this release, the README file of our open-source library has been updated to provide additional user documentation on compatibility with Standard Workspaces on Databricks. The changes include an outlined incompatibility section, specifically designed for users of Standard Workspaces. It is important to note that these updates are purely informational and do not involve any changes to existing commands, workflows, tables, or functionality within the code. No new methods or modifications have been made to the existing functionality. The commit does not include any tests, as the changes are limited to updating user documentation. The changes have been manually tested to ensure accuracy. The target audience for this release includes software engineers who are adopting the project and may require additional guidance on compatibility with Standard Workspaces. Additionally, please note that a Databricks Premium or Enterprise workspace is now a prerequisite for using this library.
* Show code problems found by workflow linter in the migration dashboard ([#1741](#1741)). This commit introduces a new feature to the migration dashboard: an experimental workflow linter that identifies code compatibility problems for Unity Catalog integration. The feature includes a new CLI command, `migration_report`, which refreshes the migration dashboard after all previous tasks are completed, and an existing command, `databricks labs ucx ...`, has been modified. The `experimental-workflow-linter` workflow has also been changed, and new functionality has been added in the form of a new workflow. A new SQL query for displaying code compatibility problems is located in the file "02_1_code_compatibility_problems.sql". User documentation has been added, and the changes have been manually tested. This feature aims to improve the migration dashboard's functionality and provide a better experience for users. Targeted at software engineers, this feature will help in identifying and resolving code compatibility issues during the migration process.
* Support for s3a/ s3n protocols when using mount point ([#1765](#1765)). In this release, we have added support for s3a and s3n protocols when using mount points in the metastore locations. A new static method, `_get_ext_location_definitions`, has been introduced, which generates a name for a resource defined by the location and now supports additional prefixes "s3a://" and "s3n://" for defining resources in S3. For Azure Blob Storage, the container name is extracted from the location and included in the resource name. If the location does not match the supported formats, a warning is logged, and the script is not generated. These changes offer more flexibility in defining resources and improve the system's ability to handle various cloud storage solutions. Additionally, the `test_save_external_location_mapping_missing_location` function in `test_locations.py` has been updated to include test cases for s3a and s3n protocols, enhancing the software's functionality.
* Support joining an existing collection when installing UCX ([#1675](#1675)). The AccountInstaller class has been updated to include a new functionality that allows users to join an existing collection during UCX installation. This is achieved by presenting the user with a list of workspaces they have access to, allowing them to select one, and then checking if there are existing workspace IDs present in the selected workspace. If so, the installation will join the corresponding collection; otherwise, a new collection will be created. This feature simplifies UCX migration for large organizations with multiple workspaces by allowing them to manage collections instead of individual workspaces. Relevant user documentation and CLI commands have been updated, along with new and modified tests to ensure proper functionality. The commit includes the addition of new methods, `join_collection` and `is_account_install`, as well as updates to the `install_on_account` method to call `join_collection` if specified. Unit tests and integration tests have been added to ensure the proper functioning of the new feature.
* Updated UCX job cluster policy AWS zone_id to `auto` ([#1735](#1735)). In this release, the UCX job cluster policy for AWS has been updated to use `auto` for the zone_id, allowing Databricks to choose the zone based on a default value in the region. This change, which resolves issue [#533](#533), affects the definition method in the policy.py file, where a check has been added to remove 'aws_attributes.zone_id' if an instance pool ID is provided. The tests for this change include manual testing and new unit tests, with modifications to existing workflows. The diff shows updates to the test_policy.py file, where the 'aws_attributes.zone_id' is set to `auto` in several functions. No new CLI commands or documentation have been provided as part of this update.
* Updated assessment.md - `spark.catalog.x` guidance needed updating ([#1708](#1708)). With the release of DBR 14+, the `spark.catalog.*` functions, which were previously not recommended for use on shared compute clusters due to security reasons, are now considered safe to use. This change in guidance is reflected in the updated assessment.md document, which also notes that `spark.sql("<sql command>")` may still be a more suitable alternative for certain common spark.catalog functions like tableExists, listTables, and setDefaultCatalog. The corresponding `spark._jsparkSession.catalog` methods are also mentioned as potential alternatives on DBR 14.1 and above. It is important to note that no new methods or functionality have been added, and no existing functionality has been changed - only the guidance in the documentation has been updated. This update has been manually tested and implemented in the documentation to ensure accuracy and reliability for software engineers.

Dependency updates:

* Updated sqlglot requirement from <23.15,>=23.9 to >=23.9,<23.16 ([#1681](#1681)).
* Updated databricks-labs-blueprint requirement from <0.6.0,>=0.4.3 to >=0.4.3,<0.7.0 ([#1688](#1688)).
* Updated sqlglot requirement from <23.16,>=23.9 to >=23.9,<23.18 ([#1724](#1724)).
* Updated sqlglot requirement from <23.18,>=23.9 to >=23.9,<24.1 ([#1745](#1745)).
* Updated databricks-sdk requirement from ~=0.27.0 to >=0.27,<0.29 ([#1756](#1756)).
* Bump databrickslabs/sandbox from acceptance/v0.2.1 to 0.2.2 ([#1769](#1769)).
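The multiline `%pip install` fix noted above ([#1728](#1728)) comes down to splitting the cell on any whitespace instead of on spaces only. A rough sketch of that approach, assuming a hypothetical helper named `extract_pip_libraries` (the actual `PipCell.build_dependency_graph` code is not shown in this thread and may differ):

```python
def extract_pip_libraries(cell_code: str) -> list[str]:
    """Return the library names from a `%pip install` cell.

    Hypothetical helper for illustration only. Flags that take a value
    (for example `-r requirements.txt`) are not handled here.
    """
    # str.split() with no argument splits on spaces, tabs and newlines alike,
    # so a command spread over several lines yields the same tokens.
    tokens = cell_code.split()
    if len(tokens) < 2 or tokens[0] != "%pip" or tokens[1] != "install":
        return []
    # Drop option flags and stray line-continuation backslashes.
    return [t for t in tokens[2:] if not t.startswith("-") and t != "\\"]


if __name__ == "__main__":
    print(extract_pip_libraries("%pip install\npandas"))
    print(extract_pip_libraries("%pip install pandas \\\n    numpy"))
```

A single-line cell and a cell broken after `install` produce the same token list, which is the behavior the changelog entry describes.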

github-actions · 2024-05-31T13:18:56Z

✅ 184/184 passed, 1 flaky, 23 skipped, 2h37m48s total

Flaky tests:

🤪 test_delete_ws_groups_should_delete_renamed_and_reflected_groups_only (3m33.391s)

_{Running from acceptance #3643}

* Added handling for legacy ACL `DENY` permission in group migration ([#1815](#1815)). In this release, the handling of `DENY` permissions during group migrations in our legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of `DENIED` in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue [#1803](#1803). A new test function, test_hive_deny_sql(), has also been added to test the behavior of the `DENY` permission. * Added handling for parsing corrupted log files ([#1817](#1817)). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code has been added to check if the logs match the expected format, and if they don't, a warning message is logged and the function returns, preventing further processing and potential production of incorrect results. The changes include a new method `test_parse_logs_warns_for_corrupted_log_file` that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files. * Added known problems with `pyspark` package ([#1813](#1813)). In this release, updates have been made to the `src/databricks/labs/ucx/source_code/known.json` file to document known issues with the `pyspark` package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. 
A new `KnownProblem` dataclass has been added to the `known.py` file, which includes methods for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method has also been updated to use a `known_problems` set of `KnownProblem` objects, improving readability and management of known problems within the application. These changes address issue [#1813](#1813) and improve the documentation of known issues with `pyspark`. * Added library linting for jobs launched on shared clusters ([#1689](#1689)). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue [#1637](#1637). A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the `test_jobs.py` file in the `tests/integration/source_code` directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by ensuring that jobs run smoothly on shared clusters by checking for and handling missing libraries. Software engineers will benefit from these improvements as it will reduce the occurrence of errors due to missing libraries on shared clusters. * Added linters to check for spark logging and configuration access ([#1808](#1808)). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. 
The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected. * Added list of known dependency compatibilities and regeneration infrastructure for it ([#1747](#1747)). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library. * Added more known libraries from Databricks Runtime ([#1812](#1812)). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. 
These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios. * Added more known packages from Databricks Runtime ([#1814](#1814)). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility. * Added support for `.egg` Python libraries in jobs ([#1789](#1789)). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue [#1643](#1643). It includes the addition of a new method, `PythonLibraryResolver`, which replaces the old `PipResolver`, and is used to register egg library dependencies in the `DependencyGraph`. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the 'test_dependencies.py' file, specifically in the import section where `PipResolver` is replaced with `PythonLibraryResolver` from the 'databricks.labs.ucx.source_code.python_libraries' package. 
These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files. * Added table migration workflow guide ([#1607](#1607)). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, and Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience. * Added workflow linter for spark python tasks ([#1810](#1810)). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. 
The `test_job_spark_python_task_linter_happy_path` test checks the linter on a valid job configuration where all required libraries are specified, while the `test_job_spark_python_task_linter_unhappy_path` test checks the linter on an invalid job configuration where required libraries are not specified. These tests ensure that the workflow linter for Spark Python tasks is functioning correctly and can help identify any potential issues in job configurations.
* Connect all linters to `LinterContext` and add functional testing framework ([#1811](#1811)). This commit connects all linters, including those related to JVM, to the critical path for improved code linting, and introduces a functional testing framework to simplify the writing of code linting verification tests. The `pyproject.toml` file has been updated to include a new configuration for the `ignore-paths` option, utilizing a regular expression to exclude certain files or directories from linting. The testing framework is particularly useful for verifying the correct functioning of linters, reducing the risk of errors and improving the overall development experience. These changes will help to improve the reliability and efficiency of the linting process, making it easier to write and maintain high-quality code.
* Deduplicate errors emitted by Spark Connect linter ([#1824](#1824)). This pull request introduces error deduplication for the Spark Connect linter and adds new functional tests using an updated framework. The modifications include the addition of user documentation and unit tests, as well as alterations to existing commands and workflows. Specifically, a new CLI command has been added, and the command `databricks labs ucx ...` has been modified. Additionally, a new workflow has been implemented, and an existing workflow has been updated. No new tables or modifications to existing tables are present.
Testing has been conducted through manual testing and new unit tests, with no integration tests or staging environment tests specified. The `verify` method in the `test_functional.py` file has been updated to sort the actual problems list before comparing it to the expected problems list, ensuring consistent ordering of results. The changes aim to improve the functionality and usability of the Spark Connect linter for our software engineer audience.
* Download wheel dependency locally to register it to the dependency graph ([#1704](#1704)). A new feature has been implemented in the open-source library to enhance dependency management for wheel files. Previously, when the library type was wheel, a `not-yet-implemented` DependencyProblem would be yielded. Now, the system downloads the wheel file from a remote location, saves it to a temporary directory, and registers the local file to the dependency graph. This allows for more comprehensive handling of wheel dependencies, as they are now downloaded and registered instead of simply being flagged as "not-yet-implemented". Additionally, new functions for creating jobs, making notebooks, and generating random values have been added to enable more comprehensive testing of the workflow linter. New tests have been implemented to check the linter's behavior when there is a missing library dependency and to verify that the linter correctly handles wheel dependencies. These changes improve the testing capabilities of the workflow linter and ensure that all dependencies are properly accounted for and managed within the system. A new test method, `test_workflow_task_container_builds_dependency_graph_for_python_wheel`, has been added to ensure that the dependency graph is built correctly for Python wheels and to improve test coverage.
* Drop pyspark `register` lint matcher ([#1818](#1818)).
In the latest release, the `register` lint matcher has been removed from pyspark, indicating that the specific usage pattern for the `register` method in UDTFRegistration is no longer required. This change affects the linting process during code reviews, but does not impact the functionality of the code directly. Other matchers for DataFrame, DataFrameReader, DataFrameWriter, and direct filesystem access remain unchanged. The `register` method, which was likely used to register a temporary table or view in pyspark, is no longer considered a best practice or necessary feature. If you previously relied on the `register` method in your pyspark code, you will need to find an alternative solution. This update aims to improve the quality and consistency of pyspark code by removing outdated or unnecessary functionality.
* Enabled joining an existing installation to a collection ([#1799](#1799)). This change introduces several new features and modifications to the open-source library, aimed at enhancing the management and organization of workspaces within a collection. A new command `join-collection` has been added to allow a workspace to join a collection using its workspace ID. The `report-account-compatibility` command has been updated with a new flag `--workspace-ids`, and the `alias` command has been updated with a new description. Two new commands `principal-prefix-access` and `create-missing-principals` have been introduced for AWS, and a new command `create-uber-principal` has been introduced for Azure to handle the creation of service principals with STORAGE BLOB READER access for storage accounts used by tables in the workspace. The code's readability and maintainability have been improved by modifying the method `_can_administer` to `can_administer` and `_load_workspace_info` to `load_workspace_info` in the `workspaces.py` file.
A new `join_collection` command has been added to the `ucx` application instance to enable joining an existing installation to a collection. Additionally, modifications to the `install.py` file and `test_installation.py` file have been made to facilitate the integration of existing installations into a collection. The tests have been updated to ensure that the joining process works correctly in various scenarios. Overall, these changes provide more flexibility and ease of use for users and improve the interoperability and security of the system.
* Fixed `migrate-credential` cli command on AWS ([#1732](#1732)). In this release, the `migrate-credential` CLI command for AWS has been improved and fixed. The command now includes changes to the `access.py` file in the `databricks/labs/ucx/aws` directory. Notable updates are the refactoring of the `role_name` method into a dataclass called `AWSCredentialCandidate`, the addition of the method `_aws_role_trust_doc`, and the removal of the `_databricks_trust_statement` method. The `_aws_s3_policy` method has been updated to include `s3:PutObjectAcl` in the allowed actions, and methods `_create_role` and `_get_role_access_task` have been updated to use `arn` instead of `role_name`. Additionally, the `create_uc_role` and `update_uc_trust_role` methods have been combined into a single `update_uc_role` method. The `migrate-credentials` command in the `cli.py` file has also been updated to support migration of AWS Instance Profiles to UC storage credentials. These improvements resolve issue [#1726](#1726) and enhance the functionality and reliability of the `migrate-credential` command for AWS.
* Fixed crasher when running migrate-local-code ([#1794](#1794)). In this release, we have addressed a crasher issue that occurred when running the `migrate-local-code` command.
The change involves modifying the `local_file_migrator` property in the `LocalCheckoutContext` class to use a lambda function instead of directly passing `self.languages`. This ensures that the languages are loaded only when the `local_file_migrator` property is accessed, preventing unnecessary load and potential crashes. The change does not introduce any new functionalities, but instead modifies existing commands related to local file migration. Comprehensive manual testing and unit tests have been conducted to ensure the fix works as expected without negatively impacting other parts of the system.
* Fixed inconsistent behavior in `%pip` cell handling ([#1785](#1785)). This PR addresses inconsistent behavior in `%pip` cell handling by modifying Python library installation to occur in a designated path lookup, rather than deep within the library tree. These changes impact various components, such as the `PipResolver` class, which no longer requires a `FileLoader` instance as an argument and now takes a `Whitelist` instance directly. Additionally, tests like `test_detect_s3fs_import` and `test_detect_s3fs_import_in_dependencies` are affected by these modifications. Overall, these changes streamline the `%pip` feature, improving library installation efficiency and consistency.
* Fixed issue when creating view using `WITH` clause ([#1809](#1809)). In this release, we have addressed an issue that occurred when creating a view using a `WITH` clause, which was causing potential errors or incorrect results due to improper handling of aliases. A new method, `_read_aliases`, has been introduced to read and store aliases from the `WITH` clause as a set, and during view dependency analysis, if an old table's name matches an alias, it is now skipped to prevent double-counting. This ensures improved accuracy and reliability of view creation with `WITH` clauses.
Moreover, the commit includes adjustments to import statements, addition of unit tests, and the introduction of a new class `TableView` in the `databricks.labs.ucx.hive_metastore.view_migrate` module to test whether a view with a local dataset should be skipped. This release also includes a test for migrating a view with columns, ensuring that views with local datasets are now handled correctly. The fix resolves issue [#1798](#1798).
* Fixed linting for non-UTF8 encoded files ([#1804](#1804)). This commit addresses linting issues for files that are not encoded in UTF-8, improving compatibility with non-UTF-8 encoded files in the databricks labs ucx project. Previously, the linter and fixer tools were unable to process non-UTF-8 encoded files, causing them to fail. This issue has been resolved by adding a check for file encoding during linting and handling the case where the file is not encoded in UTF-8 by returning a failure message. A new method, `getpreferredencoding(False)`, has been introduced to determine the file's encoding, ensuring UTF-8 compatibility. Additionally, a new test method, `test_file_linter_lints_non_ascii_encoded_file`, has been added to check the linter's behavior with non-ASCII encoded files. This enhancement simplifies the linting process, allowing for better file handling of non-UTF-8 encoded files, and is supported by manual testing and unit tests.
* Further fix for DENY permissions ([#1834](#1834)). This commit addresses issue [#1834](#1834) by implementing a fix for handling DENY permissions in the legacy TACL migration logic. Previously, all permissions were grouped in a single GRANT statement, but they have now been updated to be split into separate GRANT and DENY statements. This change improves the clarity and maintainability of the code and also increases test coverage with the addition of unit tests and integration tests.
A new test function `test_tacl_applier_deny_and_grant()` has been added to demonstrate the use of the updated logic for handling DENY permissions. The resulting SQL queries now include both GRANT and DENY statements, reflecting the updated logic. These changes ensure that the DENY permissions are correctly applied, increasing the overall test coverage and confidence in the code.
* Removed false warning on DataFrame.insertInto() about the default format changing from parquet to delta ([#1823](#1823)). This pull request removes a false warning related to the use of DataFrameWriter.insertInto(), which had been incorrectly flagging a potential issue due to the default format change from Parquet to Delta. The warning is now suppressed as it is no longer relevant, since the operation ignores any specified format and uses the existing format of the underlying table. Additionally, an unnecessary linting suppression has been removed. These changes improve the accuracy of the warning system and eliminate confusion for users, with no impact on functionality, usability, or performance. The changes have been manually tested and do not require any new unit or integration tests, CLI commands, workflows, or tables.
* Support linting python wheel tasks ([#1821](#1821)). This release introduces support for linting python wheel tasks, addressing issue [#1](#1)
* Updated linting checks for Spark table methods ([#1816](#1816)). This commit updates linting checks for PySpark's Spark table methods, focusing on improving handling of migrated tables and deprecating direct filesystem references in favor of the Unity Catalog. New tests and examples include literal and variable references to known and unknown tables, as well as cases with extra or out-of-position arguments. The commit also highlights false positives and trivial references in unrelated contexts.
These changes aim to ensure proper usage of Spark table methods, improve codebase consistency, and minimize potential issues related to migrations and format changes.

Dependency updates:

* Updated sqlglot requirement from <24.1,>=23.9 to >=23.9,<24.2 ([#1819](#1819)).
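
The #1704 entry in the release notes describes three steps: download the wheel from a remote location, save it to a temporary directory, and register the local file to the dependency graph. The following is a minimal sketch of that flow, not the actual UCX implementation. The `register_wheel` function and its `download` and `register` parameters are hypothetical stand-ins (for a workspace download call and the dependency graph's registration method), and plain strings stand in for `DependencyProblem` objects.

```python
import io
import shutil
import tempfile
from pathlib import Path, PurePosixPath
from typing import BinaryIO, Callable, Iterable, Iterator


def register_wheel(
    remote_path: str,
    download: Callable[[str], BinaryIO],
    register: Callable[[Path], Iterable[str]],
) -> Iterator[str]:
    """Download a wheel to a temporary directory and register the local copy.

    `download` and `register` are hypothetical callables supplied by the
    caller; they are assumptions for illustration, not UCX APIs.
    """
    with tempfile.TemporaryDirectory() as directory:
        # Keep the original file name so the wheel stays recognisable.
        local_path = Path(directory) / PurePosixPath(remote_path).name
        with download(remote_path) as remote, local_path.open("wb") as local:
            shutil.copyfileobj(remote, local)
        # Yield inside the context manager: the temporary directory is
        # removed on exit, so registration must read the file before then.
        yield from register(local_path)


def fake_download(path: str) -> BinaryIO:
    # Stand-in for a remote download; returns fixed bytes.
    return io.BytesIO(b"not a real wheel")


def fake_register(path: Path) -> Iterable[str]:
    # Stand-in for graph registration; reports a problem if the file is gone.
    return [] if path.exists() else [f"missing file: {path.name}"]


problems = list(
    register_wheel("/Workspace/libs/demo-0.1-py3-none-any.whl", fake_download, fake_register)
)
```

Keeping `yield from` inside the `with` block matters because the generator is consumed lazily: if the registration ran after the block exited, the downloaded file would already have been deleted.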

JCZuurmond requested review from ericvergnaud and pritishpai May 16, 2024 15:13

JCZuurmond force-pushed the feat/add-integration-test-for-linting-task-with-whl-dependency branch from c414984 to 748adb6 Compare May 16, 2024 15:17

JCZuurmond commented May 16, 2024

src/databricks/labs/ucx/source_code/jobs.py (resolved)

JCZuurmond marked this pull request as ready for review May 16, 2024 15:51

JCZuurmond requested a review from a team May 16, 2024 15:51

JCZuurmond had a problem deploying to account-admin May 16, 2024 15:51 — with GitHub Actions Failure

JCZuurmond mentioned this pull request May 16, 2024

Added dependency resolving for %pip cells #1662

Closed

JCZuurmond force-pushed the feat/add-integration-test-for-linting-task-with-whl-dependency branch from 242110c to a9eee05 Compare May 21, 2024 12:28

JCZuurmond had a problem deploying to account-admin May 21, 2024 12:28 — with GitHub Actions Failure

nfx approved these changes May 21, 2024

src/databricks/labs/ucx/source_code/jobs.py (outdated, resolved)

tests/integration/source_code/test_jobs.py (outdated, resolved)

tests/integration/source_code/test_jobs.py (outdated, resolved)

JCZuurmond had a problem deploying to account-admin May 21, 2024 15:16 — with GitHub Actions Failure

JCZuurmond commented May 21, 2024

src/databricks/labs/ucx/source_code/python_libraries.py (outdated, resolved)

JCZuurmond force-pushed the feat/add-integration-test-for-linting-task-with-whl-dependency branch from 2658515 to 20c3c3a Compare May 22, 2024 06:42

JCZuurmond had a problem deploying to account-admin May 22, 2024 06:42 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin May 22, 2024 06:44 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin May 22, 2024 06:53 — with GitHub Actions Failure

JCZuurmond enabled auto-merge May 22, 2024 06:55

nfx approved these changes May 22, 2024

src/databricks/labs/ucx/source_code/python_libraries.py (outdated, resolved)

src/databricks/labs/ucx/source_code/python_libraries.py (outdated, resolved)

JCZuurmond added the pr/do-not-merge label (this pull request is not ready to merge) May 22, 2024

JCZuurmond commented May 22, 2024

JCZuurmond mentioned this pull request May 23, 2024

Download requirements.txt dependency locally to register it to the dependency graph #1753

Merged

11 tasks

nfx mentioned this pull request May 27, 2024

Release v0.24.0 #1775

Merged

JCZuurmond force-pushed the feat/add-integration-test-for-linting-task-with-whl-dependency branch from 76bdf97 to cda765b Compare May 31, 2024 10:24

JCZuurmond had a problem deploying to account-admin May 31, 2024 10:24 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin May 31, 2024 10:26 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin May 31, 2024 12:23 — with GitHub Actions Error

JCZuurmond temporarily deployed to account-admin May 31, 2024 12:45 — with GitHub Actions Inactive

JCZuurmond added 20 commits May 31, 2024 15:50

Add integration test with PyPI dependency

4c320f6

Register wheel when present

9040832

Install wheel locally

d1f6f96

Add integration test for importing wheel

0eca54a

Fix indent

62174a2

Add unit test for wheel

073b1da

Move yield from into context manager

fbd22dc

Use make directory fixture

0e58181

Split tests

835968c

Fix test

4ad502a

Isort

125be1c

Fix import

1091508

Remove duplicate test

2aa959f

Lower not-yet-implemented test

2d92a7e

Fix white list

2f7ac79

Rename variable

3a2bb00

Move test back

b9f56df

Revert wrong changes from rebase

4bb4922

Fix integration test with default white list

229fba9

Force failing job to be present

2f2ea52

JCZuurmond force-pushed the feat/add-integration-test-for-linting-task-with-whl-dependency branch from ad9668f to 2f2ea52 Compare May 31, 2024 13:50

JCZuurmond temporarily deployed to account-admin May 31, 2024 13:50 — with GitHub Actions Inactive

nfx merged commit 0eb0d46 into main May 31, 2024
8 checks passed

nfx deleted the feat/add-integration-test-for-linting-task-with-whl-dependency branch May 31, 2024 13:58

nfx mentioned this pull request Jun 4, 2024

Release v0.25.0 #1836

Merged

github-actions bot commented May 31, 2024