
Add migrate-tables workflow #1045

Closed

wants to merge 8 commits into from

Conversation

Contributor

@qziyuan commented Mar 12, 2024

Changes

Add the migrate-tables workflow. This PR includes only the workflow described in milestone 1 of #1035; the table_migration_progress table is out of scope for this PR.

Linked issues

Relates #333 #670

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: ...
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)


codecov bot commented Mar 12, 2024

Codecov Report

Attention: Patch coverage is 93.93939%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.94%. Comparing base (2b22656) to head (568d64e).
Report is 5 commits behind head on main.

❗ Current head 568d64e differs from pull request most recent head 23f93e8. Consider uploading reports for the commit 23f93e8 to get more accurate results

Files Patch % Lines
src/databricks/labs/ucx/install.py 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1045      +/-   ##
==========================================
- Coverage   88.98%   88.94%   -0.05%     
==========================================
  Files          51       52       +1     
  Lines        6501     6683     +182     
  Branches     1169     1197      +28     
==========================================
+ Hits         5785     5944     +159     
- Misses        466      482      +16     
- Partials      250      257       +7     


# spark_conf={"spark.sql.sources.parallelPartitionDiscovery.parallelism": "1000"},
autoscale=compute.AutoScale(  # the number of executors matters for file-copy parallelism
    min_workers=1,
    max_workers=10,
Collaborator

Make this configurable

Contributor Author

refactored, please review

Collaborator

Can you put 10 as a variable in config?
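The reviewer's suggestion could be sketched roughly as follows. This is a hypothetical illustration, not the actual ucx code: `MigrationClusterConfig` and `autoscale_args` are made-up names, and the defaults simply mirror the hard-coded values in the diff above.

```python
from dataclasses import dataclass


@dataclass
class MigrationClusterConfig:
    # defaults mirror the hard-coded values in the PR; overridable at install time
    min_workers: int = 1
    max_workers: int = 10


def autoscale_args(cfg: MigrationClusterConfig) -> dict[str, int]:
    # keyword arguments for compute.AutoScale(min_workers=..., max_workers=...)
    return {"min_workers": cfg.min_workers, "max_workers": cfg.max_workers}


print(autoscale_args(MigrationClusterConfig(max_workers=50)))
```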



@task("migrate-tables", job_cluster="migration_sync")
def migrate_views(cfg: WorkspaceConfig, ws: WorkspaceClient, sql_backend: SqlBackend, installation: Installation):
Collaborator

Skip the views for now

Contributor Author

fixed

)
)

mock_api_request = create_autospec(PreparedRequest)
Collaborator

Rewrite the code to use "get_workspace_id()" from workspace client and mock that

Contributor Author

I don't get this. Could you elaborate?

Collaborator

ws.get_workspace_id.return_value = 123
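The pattern the reviewer is pointing at, sketched with a stand-in class (in the real test, `create_autospec` would be applied to `databricks.sdk.WorkspaceClient` rather than this local stub):

```python
from unittest.mock import create_autospec


class WorkspaceClient:  # stand-in for databricks.sdk.WorkspaceClient
    def get_workspace_id(self) -> int:
        raise NotImplementedError


# autospec an instance so calls are signature-checked against the class
ws = create_autospec(WorkspaceClient, instance=True)
ws.get_workspace_id.return_value = 123

print(ws.get_workspace_id())  # 123
```

Mocking the high-level `get_workspace_id()` call keeps the test decoupled from the HTTP layer, instead of asserting on a `PreparedRequest`.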

@@ -779,6 +779,35 @@ def _job_clusters(self, names: set[str]):
),
)
)
if "migration_clone" in names:
Collaborator

Why do we need two clusters? Can we have one?

Make sure that override clusters work in the integration tests' new_installation()

Contributor Author

@qziyuan commented Mar 12, 2024

We have two types of migration:

  1. In-place migration that just migrates the metadata to UC, like the SYNC command; this only needs a small cluster.
  2. Migration that needs to copy data, like deep clone and CTAS for DBFS root tables; this needs a larger cluster.

Contributor Author

Let's use just one cluster, since we default autoscale to one worker.

# https://databricks.atlassian.net/browse/ES-975874
# The default of 200 partitions may not be enough for large tables, but it's hard
# to find a number that fits all cases. If we need higher parallelism, we can use
# the config below:
# spark_conf={"spark.sql.sources.parallelPartitionDiscovery.parallelism": "1000"},
Collaborator

Make it configurable during installer with a default value in workspace config

Contributor Author

refactored, please review

@@ -249,4 +254,7 @@ def trigger(*argv):
) as task_logger:
ucx_logger = logging.getLogger("databricks.labs.ucx")
ucx_logger.info(f"UCX v{__version__} After job finishes, see debug logs at {task_logger}")
current_task.fn(cfg, workspace_client, sql_backend)
if current_task.workflow == "migrate-tables":
Collaborator

Don't do this. Change signatures of all methods instead
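The alternative the reviewer asks for can be sketched as follows (stand-in names, not the real ucx code): give every task function the same signature so `trigger()` needs no `if current_task.workflow == "migrate-tables"` special case.

```python
from typing import Callable

# every task accepts the same four dependencies
TaskFn = Callable[[object, object, object, object], str]


def crawl_tables(cfg, ws, sql_backend, installation) -> str:
    return "crawled"


def migrate_tables(cfg, ws, sql_backend, installation) -> str:
    return "migrated"


def trigger(fn: TaskFn) -> str:
    # uniform call site: no per-workflow branching needed
    return fn(None, None, None, None)
```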

Contributor Author

fixed

Collaborator

@nfx left a comment

How can we make it without a breaking change?

@dataclass
class Task:
task_id: int
workflow: str
name: str
doc: str
fn: Callable[[WorkspaceConfig, WorkspaceClient, SqlBackend], None]
fn: AssessmentFunctions | MigrationFunctions
Collaborator

Revert this change.

@@ -170,6 +173,16 @@ def deploy_schema(sql_backend: SqlBackend, inventory_schema: str):
deployer.deploy_view("table_estimates", "queries/views/table_estimates.sql")


def load_cluster_specs() -> dict[str, ClusterSpec]:
Collaborator

Don't load something from a yaml file if it can be defined as 5 lines of python code. Revert
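"Five lines of Python" could look roughly like the sketch below. The `ClusterSpec` fields and node type are illustrative stand-ins; the point is that a static mapping needs no YAML file or loader.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:  # illustrative stand-in for the real cluster spec type
    node_type_id: str
    min_workers: int
    max_workers: int


def cluster_specs() -> dict[str, ClusterSpec]:
    # defined directly in code: readable, type-checked, and trivially diffable
    return {
        "main": ClusterSpec("i3.xlarge", 1, 4),
        "migration_sync": ClusterSpec("i3.xlarge", 1, 10),
    }
```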

@@ -247,6 +260,38 @@ def _configure_new_installation(self) -> WorkspaceConfig:

policy_id, instance_profile, spark_conf_dict = self._policy_installer.create(inventory_database)

# Load job cluster specifications
cluster_specs = load_cluster_specs()
Collaborator

Don't load from yaml, define it in code

num_workers=0,
policy_id=self.config.policy_id,
)
cluster_specs = self._config.cluster_specs
Collaborator

No, this gets too fragile. Installer is the critical component and this introduces severe risk of breakage. Revert.

@@ -43,7 +47,7 @@ def setup_tacl(*_):


@task("assessment", depends_on=[crawl_tables, setup_tacl], job_cluster="tacl")
def crawl_grants(cfg: WorkspaceConfig, _: WorkspaceClient, sql_backend: SqlBackend):
def crawl_grants(cfg: WorkspaceConfig, _: WorkspaceClient, sql_backend: SqlBackend, _installation: Installation):
Collaborator

Suggested change
def crawl_grants(cfg: WorkspaceConfig, _: WorkspaceClient, sql_backend: SqlBackend, _installation: Installation):
def crawl_grants(cfg: WorkspaceConfig, _: WorkspaceClient, sql_backend: SqlBackend, install: Installation):

policy_id = v16_config.policy_id

cluster_specs = load_cluster_specs()
for _, cluster_spec in cluster_specs.items():
Collaborator

This is too risky. How can we do it without a breaking change?

)
)

mock_api_request = create_autospec(PreparedRequest)
Collaborator

ws.get_workspace_id.return_value = 123

@@ -31,7 +32,7 @@ class WorkspaceConfig: # pylint: disable=too-many-instance-attributes
# Starting path for notebooks and directories crawler
workspace_start_path: str = "/"
instance_profile: str | None = None
spark_conf: dict[str, str] | None = None
cluster_specs: dict[str, ClusterSpec] | None = None
Collaborator

This is a breaking change. How can we do without it?

@qziyuan qziyuan closed this Mar 14, 2024
Contributor Author

qziyuan commented Mar 14, 2024

Closing this draft PR. Please review the new PR #1051.

@qziyuan qziyuan deleted the feature/migrate_table_wf branch March 14, 2024 02:29