-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define table migration process #307
Conversation
docs/table-upgrade.md
Outdated
1. The assessment generates "databases" tables or an outline for a configuration file. | ||
1. It can be used by the user to configure the database upgrade. | ||
|
||
This is the structure we recommend: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume we don't have this table, what happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We create it.
docs/table-upgrade.md
Outdated
| upgrade_status | int | 0-Not Upgraded <br/> 1-Failed Upgrade <br/> 2-Partial Upgrade <br/> 3-Full Upgrade | 0 | | ||
| upgrade_messages | string | Json with a list of all the upgrade errors. | empty | | ||
|
||
1. We can use a notebook with IPYWidgets to update the table (or configuration file). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a job? What cluster configuration does this job run on? What about thousands of workspaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, this should be a job - and it should operate on multiple tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking for the interactive step.
What UI do we provide the user to update the table/manifest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm still challenging if we need a UI. How can we do it without the UI at all and have it fully automated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested we have a fast-track option to have it fully automated that uses sensible defaults.
In the event a user wants to modify the mapping, we can generate an output file for the user to review and modify in lieu of a UI, csv, tsv, excel, etc. User would make any custom mapping overrides in this file.
This file could be uploaded back to the workspace via the tool, and the migration could proceed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work! could you answer the question I've left in this and the previous reviews?
docs/table-upgrade.md
Outdated
|
||
We don't expect this process to be a "one and done" process. This typically is an iterative process and may require a few runs. | ||
|
||
We suggest to keep track of the migration and provide the user a continuous feedback of the progress and status of the upgrade. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please elaborate how this continuous feedback should be delivered to the customer, if we're migrating 300 or 3000 workspaces.
docs/table-upgrade.md
Outdated
1. The assessment runs and capture all the tables in HMS (Done). | ||
1. Each table is categorized based on the type and storage | ||
1. We use the "upgraded_to" table property to determine if the table was already upgraded to UC | ||
1. The assessment generates a list of "recommended External Locations". These locations are used by the external tables and are required for an "in place" upgrade. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is done with that list? Please elaborate
docs/table-upgrade.md
Outdated
| upgrade_status | int | 0-Not Upgraded <br/> 1-Failed Upgrade <br/> 2-Partial Upgrade <br/> 3-Full Upgrade | 0 | | ||
| upgrade_messages | string | Json with a list of all the upgrade errors. | empty | | ||
|
||
1. We can use a notebook with IPYWidgets to update the table (or configuration file). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm still challenging if we need a UI. How can we do it without the UI at all and have it fully automated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now for tables, there also needs to be a report on table/db inconsistency - like
A: db1.tbl1, db1.tbl3
B: db1.tbl2
And the team(s) that are driving UC Migration within account would make a decision after some time in review (of excel spreadsheet). By the way, we can split UCX installation across different Azure Subscriptions. And every installation would just focus on defining target catalog mapping per database. But here are unanswered questions:
two workspaces, same dbs, all different tables and columns (all managed tables, effectively)
two workspaces, same dbs, 90% same tables, 10% are different tables
two workspaces, two different dbs
We can technically support both db_to_catalog and workspace_to_catalog, and even at the same time, but db_to_catalog will override workspace_to_catalog. We also need default_catalog_for_workspace, if workspace_to_catalog is set (default catalog for all workspaces is set per metastore)..
We can also do another override for tables, but we have unanswered questions:
what if same db, same workspace, same table, but different columns/order/types? Ignore and keep in hive metastore? And then rerun the scan for tables and grants?
what if during migration catalog/database/table were deleted either from hms and/or uc?
Speaking of metastores, in the beginning, there needs to be workspace_to_metastore mapping with default_metastore_for_workspace. Can we come up with a good default mapping here? Coarse or fine grained? Select between the two? Ask for inline input? How many conflicts we expect to justify the need to create/support custom mapping?
the last very important question is what future-proof configuration format might we need for this mapping. That's why I don't want any configuration after the assessment step.
Codecov Report
@@ Coverage Diff @@
## main #307 +/- ##
==========================================
+ Coverage 83.47% 83.58% +0.11%
==========================================
Files 30 30
Lines 2269 2254 -15
Branches 395 394 -1
==========================================
- Hits 1894 1884 -10
+ Misses 290 285 -5
Partials 85 85
|
docs/table-upgrade.md
Outdated
|finance| de_dev | finance | | ||
|hr | de_dev | human_resources| | ||
|sales | ucx-dev_ws | sales | | ||
1. We need to provide a UI or other option to override these values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't have capacity to build and maintain UI. How can we express these overrides as YAML configuration? How would command-line interface could look for it? please specify the sample format, try to be very explicit with corner-cases.
docs/table-upgrade.md
Outdated
1. We should migrate ACLs for the tables (where applicable). We should highlight cases where we can't (no direct translation/conflicts) | ||
1. We should consider automating ACLs based on Instance Profiles / Service Principals and other legacy security mechanisms | ||
|
||
## Tables (Parquet/Delta) on DBFS root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep Managed Tables separate from External Tables. This is how it maps to our upgrade steps.
External tables are easier and lower risk. We should be able to deliver on those faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address other comments and properly rebase the branch, so that it has changes only to docs/table-upgrade.md
.
7759580
to
e75e4c9
Compare
Added a table migration doc. Let's discuss the migration process.
Added a table migration doc.
Let's discuss the migration process.