
[BUG]: jobs, pipelines, policies and clusters assessment is incorrect and needs cleanup to extract common code paths #823

Closed
nfx opened this issue Jan 22, 2024 · 2 comments · Fixed by #855
Labels
step/assessment go/uc/upgrade - Assessment Step

Comments

nfx (Collaborator) commented Jan 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Cluster and cluster-policy scanning is duplicated four times, across the clusters, policies, jobs, and DLT crawlers. This is unmaintainable, contains incorrect checks, and has to be refactored to extract the common parts.
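
For illustration, the extracted common path could be a single shared check that every crawler calls. This is a hypothetical sketch with made-up names and rules, not the actual ucx code:

def spark_conf_failures(conf: dict[str, str] | None) -> list[str]:
    # Collect UC-incompatibility findings for a cluster's spark_conf.
    failures: list[str] = []
    if not conf:
        return failures
    if any(key.startswith("spark.databricks.passthrough") for key in conf):
        failures.append("uses credential passthrough")
    if any("fs.azure.account" in key for key in conf):
        failures.append("embeds service principal credentials in spark_conf")
    return failures

The clusters, policies, jobs, and DLT crawlers would then each call this one helper instead of re-implementing the check four times.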

Expected Behavior

Tests are concise and refer to fixtures in JSON files:

https://github.com/databrickslabs/ucx/blob/main/tests/unit/assessment/test_clusters.py#L111-L116

# imports assumed; module paths may differ across ucx versions
from databricks.labs.ucx.assessment.clusters import ClustersCrawler
# workspace_client_mock and MockBackend are ucx unit-test helpers

def test_cluster_assessment_cluster_policy_no_spark_conf():
    ws = workspace_client_mock(clusters="no-spark-conf.json")
    crawler = ClustersCrawler(ws, MockBackend(), "ucx")
    result_set1 = list(crawler.snapshot())
    assert len(result_set1) == 1
    assert result_set1[0].success == 1
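
The fixture referenced above is just the raw cluster payload serialized to JSON. A hypothetical no-spark-conf.json could look like this (field values are illustrative, not the actual fixture contents):

[
  {
    "autoscale": {"min_workers": 1, "max_workers": 6},
    "cluster_id": "0123-456789-abcdef12",
    "cluster_name": "example-cluster",
    "cluster_source": "UI",
    "policy_id": "single-user-policy",
    "spark_version": "13.3.x-scala2.12"
  }
]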

Steps To Reproduce

No response

Cloud

AWS

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

No response

qziyuan (Contributor) commented Jan 29, 2024

@nfx Do you want all unit tests to be refactored to use a JSON fixture file, like

ws = workspace_client_mock(clusters="assortment-conf.json")

instead of things like:

# these types come from the Databricks SDK:
# from databricks.sdk.service.compute import AutoScale, ClusterDetails, ClusterSource
sample_clusters = [
    ClusterDetails(
        autoscale=AutoScale(min_workers=1, max_workers=6),
        cluster_source=ClusterSource.UI,
        spark_context_id=5134472582179565315,
        spark_env_vars=None,
        spark_conf={
            "spark.hadoop.fs.azure.account.oauth2.client.id.abcde.dfs.core.windows.net": "1234567890",
            "spark.databricks.delta.formatCheck.enabled": "false",
        },
        spark_version="9.3.x-cpu-ml-scala2.12",
        cluster_id="0810-225833-atlanta69",
        cluster_name="Tech Summit FY24 Cluster-1",
        policy_id="bdqwbdqiwd1111",
    )
]
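
A minimal sketch of what such a fixture-backed helper could look like, assuming the JSON files sit in a fixtures directory next to the tests (the real workspace_client_mock in ucx may differ):

import json
from pathlib import Path
from unittest.mock import create_autospec

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails

_FIXTURES = Path(__file__).parent / "fixtures"

def workspace_client_mock(clusters: str) -> WorkspaceClient:
    # Return a WorkspaceClient mock whose clusters.list() yields
    # ClusterDetails deserialized from the named JSON fixture.
    ws = create_autospec(WorkspaceClient)
    raw = json.loads((_FIXTURES / clusters).read_text())
    ws.clusters.list.return_value = [ClusterDetails.from_dict(c) for c in raw]
    return ws

Each test scenario then shrinks to one small JSON file instead of a page of inline ClusterDetails construction.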

qziyuan (Contributor) commented Jan 29, 2024

@nfx To deduplicate the cluster scanning, does it make sense to have one crawler scan all cluster, init-script, and cluster-policy info once and save it to a Delta table, and then have the clusters, jobs, pipelines, and init-scripts assessments load the cluster info from that table instead of each calling the API individually?
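
One rough sketch of that shape, with hypothetical class and method names (save_table stands in for whatever persistence interface the backend actually exposes):

import json
from dataclasses import dataclass

@dataclass
class ClusterInfo:
    cluster_id: str
    policy_id: str | None
    spark_conf_json: str  # serialized spark_conf for downstream checks

class SharedClusterScan:
    # Scan all clusters once and persist the result, so the clusters, jobs,
    # pipelines, and init-script assessments read the table instead of the API.

    def __init__(self, ws, backend, full_table_name: str):
        self._ws = ws
        self._backend = backend
        self._table = full_table_name

    def run(self) -> None:
        rows = [
            ClusterInfo(c.cluster_id, c.policy_id, json.dumps(c.spark_conf or {}))
            for c in self._ws.clusters.list()
        ]
        self._backend.save_table(self._table, rows, ClusterInfo)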
