Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define table migration process #307

Merged
merged 1 commit into from
Oct 3, 2023
Merged

Define table migration process #307

merged 1 commit into from
Oct 3, 2023

Conversation

FastLee
Copy link
Contributor

@FastLee FastLee commented Sep 27, 2023

Added a table migration doc.
Let's discuss the migration process.

docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
1. The assessment generates "databases" tables or an outline for a configuration file.
1. It can be used by the user to configure the database upgrade.

This is the structure we recommend:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume we don't have this table, what happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create it.

| upgrade_status | int | 0-Not Upgraded <br/> 1-Failed Upgrade <br/> 2-Partial Upgrade <br/> 3-Full Upgrade | 0 |
| upgrade_messages | string | Json with a list of all the upgrade errors. | empty |

1. We can use a notebook with IPYWidgets to update the table (or configuration file).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a job? What cluster configuration does this job run on? What about thousands of workspaces?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this should be a job - and it should operate on multiple tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking for the interactive step.
What UI do we provide the user to update the table/manifest

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm still challenging if we need a UI. How can we do it without the UI at all and have it fully automated?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested we have a fast-track option to have it fully automated that uses sensible defaults.

In the event a user wants to modify the mapping, we can generate an output file for the user to review and modify in lieu of a UI, csv, tsv, excel, etc. User would make any custom mapping overrides in this file.

This file could be uploaded back to the workspace via the tool, and the migration could proceed.

docs/table-upgrade.md Outdated Show resolved Hide resolved
@nfx nfx changed the title Created a table-migration.md Define table migration process Sep 27, 2023
Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work! could you answer the question I've left in this and the previous reviews?

docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved

We don't expect this process to be a "one and done" process. This typically is an iterative process and may require a few runs.

We suggest to keep track of the migration and provide the user a continuous feedback of the progress and status of the upgrade.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please elaborate how this continuous feedback should be delivered to the customer, if we're migrating 300 or 3000 workspaces.

docs/table-upgrade.md Outdated Show resolved Hide resolved
docs/table-upgrade.md Outdated Show resolved Hide resolved
1. The assessment runs and capture all the tables in HMS (Done).
1. Each table is categorized based on the type and storage
1. We use the "upgraded_to" table property to determine if the table was already upgraded to UC
1. The assessment generates a list of "recommended External Locations". These locations are used by the external tables and are required for an "in place" upgrade.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is done with that list? Please elaborate

| upgrade_status | int | 0-Not Upgraded <br/> 1-Failed Upgrade <br/> 2-Partial Upgrade <br/> 3-Full Upgrade | 0 |
| upgrade_messages | string | Json with a list of all the upgrade errors. | empty |

1. We can use a notebook with IPYWidgets to update the table (or configuration file).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm still challenging if we need a UI. How can we do it without the UI at all and have it fully automated?

Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now for tables, there also needs to be a report on table/db inconsistency - like
A: db1.tbl1, db1.tbl3
B: db1.tbl2

And the team(s) that are driving UC Migration within account would make a decision after some time in review (of excel spreadsheet). By the way, we can split UCX installation across different Azure Subscriptions. And every installation would just focus on defining target catalog mapping per database. But here are unanswered questions:

two workspaces, same dbs, all different tables and columns (all managed tables, effectively)
two workspaces, same dbs, 90% same tables, 10% are different tables
two workspaces, two different dbs
We can technically support both db_to_catalog and workspace_to_catalog, and even at the same time, but db_to_catalog will override workspace_to_catalog. We also need default_catalog_for_workspace, if workspace_to_catalog is set (default catalog for all workspaces is set per metastore)..

We can also do another override for tables, but we have unanswered questions:

what if same db, same workspace, same table, but different columns/order/types? Ignore and keep in hive metastore? And then rerun the scan for tables and grants?
what if during migration catalog/database/table were deleted either from hms and/or uc?
Speaking of metastores, in the beginning, there needs to be workspace_to_metastore mapping with default_metastore_for_workspace. Can we come up with a good default mapping here? Coarse or fine grained? Select between the two? Ask for inline input? How many conflicts we expect to justify the need to create/support custom mapping?

the last very important question is what future-proof configuration format might we need for this mapping. That's why I don't want any configuration after the assessment step.

@codecov
Copy link

codecov bot commented Sep 28, 2023

Codecov Report

Merging #307 (7759580) into main (035320c) will increase coverage by 0.11%.
The diff coverage is 87.71%.

❗ Current head 7759580 differs from pull request most recent head e75e4c9. Consider uploading reports for the commit e75e4c9 to get more accurate results

@@            Coverage Diff             @@
##             main     #307      +/-   ##
==========================================
+ Coverage   83.47%   83.58%   +0.11%     
==========================================
  Files          30       30              
  Lines        2269     2254      -15     
  Branches      395      394       -1     
==========================================
- Hits         1894     1884      -10     
+ Misses        290      285       -5     
  Partials       85       85              
Files Coverage Δ
src/databricks/labs/ucx/__about__.py 100.00% <100.00%> (ø)
src/databricks/labs/ucx/config.py 86.50% <100.00%> (ø)
src/databricks/labs/ucx/install.py 80.82% <ø> (ø)
src/databricks/labs/ucx/hive_metastore/tables.py 94.30% <90.56%> (ø)
src/databricks/labs/ucx/assessment/crawlers.py 73.29% <84.48%> (-0.22%) ⬇️

... and 3 files with indirect coverage changes

|finance| de_dev | finance |
|hr | de_dev | human_resources|
|sales | ucx-dev_ws | sales |
1. We need to provide a UI or other option to override these values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have capacity to build and maintain UI. How can we express these overrides as YAML configuration? How would command-line interface could look for it? please specify the sample format, try to be very explicit with corner-cases.

@nfx nfx added the feat/migration-index mapping of databases to catalog or potentially other databases label Oct 2, 2023
@nfx nfx added this to the 1 week milestone Oct 2, 2023
@zpappa zpappa removed this from the 1 week milestone Oct 2, 2023
1. We should migrate ACLs for the tables (where applicable). We should highlight cases where we can't (no direct translation/conflicts)
1. We should consider automating ACLs based on Instance Profiles / Service Principals and other legacy security mechanisms

## Tables (Parquet/Delta) on DBFS root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep Managed Tables separate from External Tables. This is how it maps to our upgrade steps.

External tables are easier and lower risk. We should be able to deliver on those faster.

Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address other comments and properly rebase the branch, so that it has changes only to docs/table-upgrade.md.

@FastLee FastLee force-pushed the feature/table_upgrade_doc branch from 7759580 to e75e4c9 Compare October 3, 2023 14:17
@FastLee FastLee requested a review from nfx October 3, 2023 14:18
@nfx nfx added this pull request to the merge queue Oct 3, 2023
Merged via the queue into main with commit 10abd0c Oct 3, 2023
zpappa pushed a commit that referenced this pull request Oct 4, 2023
Added a table migration doc.
Let's discuss the migration process.
@nfx nfx deleted the feature/table_upgrade_doc branch October 17, 2023 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat/migration-index mapping of databases to catalog or potentially other databases
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants