[FEATURE]: Migrate external tables not supported by the "sync" command #889

Closed
1 task done
FastLee opened this issue Feb 5, 2024 · 4 comments · Fixed by #1412 or #1510
Labels: migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step

Comments

@FastLee
Contributor

FastLee commented Feb 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Tables whose format is not one of those supported by the SYNC command are currently not migrated to UC.

Fine-grained:

Related issues:

Proposed Solution

Allow users to migrate tables of unsupported formats by converting them to Delta.
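
As a minimal sketch of that conversion (assuming an ambient Databricks `spark` session; the table names are placeholders, not UCX's mapping logic):

```python
# Sketch of the proposed fallback: recreate an HMS table whose format SYNC cannot
# handle as a managed Delta table in UC via CTAS. Names below are placeholders.
src = "hive_metastore.sales.raw_events"   # e.g. a BINARYFILE or Hive serde table
dst = "main.sales.raw_events"             # Unity Catalog target

# CTAS: the default table format on Databricks is Delta, so this rewrites the data.
spark.sql(f"CREATE TABLE IF NOT EXISTS {dst} AS SELECT * FROM {src}")
```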

Additional Context

No response

@FastLee FastLee added enhancement New feature or request needs-triage labels Feb 5, 2024
@FastLee FastLee added this to UCX Feb 5, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Feb 5, 2024
@nfx nfx added migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step and removed needs-triage labels Feb 5, 2024
@qziyuan qziyuan self-assigned this Mar 26, 2024
@qziyuan
Contributor

qziyuan commented Mar 27, 2024

  • Spark DataSources supported by SYNC:
    Delta, Parquet, CSV, JSON, ORC, TEXT, AVRO

  • Spark DataSources not supported by UC and SYNC:
    BINARYFILE, JDBC, LIBSVM, custom implementations of org.apache.spark.sql.sources.DataSourceRegister

  • No Hive serde tables are supported by SYNC (a small classification sketch follows this list).
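
For clarity, a small sketch of that split in code; the set contents mirror the lists above, and the helper name is illustrative, not the UCX API:

```python
# Providers that SYNC can upgrade directly vs. those that need another migration path.
SYNC_SUPPORTED = {"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"}
SYNC_UNSUPPORTED = {"BINARYFILE", "JDBC", "LIBSVM"}  # plus custom DataSourceRegister impls


def sync_supported(provider: str) -> bool:
    """Return True if a table with this provider can go through SYNC as-is.

    Note: Hive serde tables surface provider "HIVE" and always fall outside SYNC.
    """
    return provider.upper() in SYNC_SUPPORTED
```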

Migration strategy:

| table.provider | Hive serde (row format) and file format | Migration strategy |
|---|---|---|
| BINARYFILE | n/a | 1. By default, CTAS to Delta.<br>2. Prompt whether the user wants to keep the original file format instead of writing the binary content into Parquet files; if so, do not migrate. |
| JDBC | n/a | 1. Do not migrate for now.<br>2. In the future, migrate to Lakehouse Federation; if there is no supported federation connector, consider a view-based solution. |
| LIBSVM | n/a | Do not migrate. |
| Custom implementation of DataSourceRegister | n/a | Do not migrate. |
| HIVE | inputFormat=OrcInputFormat<br>outputFormat=OrcOutputFormat<br>serde=OrcSerde | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using ORC ... location ...`. |
| HIVE | inputFormat=MapredParquetInputFormat<br>outputFormat=MapredParquetOutputFormat<br>serde=ParquetHiveSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using PARQUET ... location ...`. |
| HIVE | inputFormat=AvroContainerInputFormat<br>outputFormat=AvroContainerOutputFormat<br>serde=AvroSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using AVRO ... location ...`. |
| HIVE | inputFormat=SequenceFileInputFormat<br>outputFormat=SequenceFileOutputFormat<br>serde=LazySimpleSerDe | CTAS to Delta. |
| HIVE | inputFormat=RCFileInputFormat<br>outputFormat=RCFileOutputFormat<br>serde=LazyBinaryColumnarSerDe | CTAS to Delta. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=LazySimpleSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using CSV ... location ...`; get the field and line delimiters from the HMS table metadata, set them accordingly on the UC CSV table, and disable quoting. If the HMS table storage properties contain the unsupported `escape.delim`, `mapkey.delim`, `collection.delim`, or `serialization.format`, do CTAS to Delta instead. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=RegexSerDe | CTAS to Delta. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=JsonSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using JSON ... location ...` (needs testing). |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=OpenCSVSerde | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using CSV ... location ...`. |
| HIVE | All other non-native serdes | CTAS to Delta; if it fails, skip the table. |
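
To make the decision in the table concrete, here is a rough sketch: a handful of native serdes can be re-created in UC with `CREATE TABLE ... USING <format> ... LOCATION ...`, while everything else falls back to CTAS-to-Delta. The mapping, the function name, and the omission of the original column definitions are simplifications for illustration, not the UCX implementation:

```python
# Native serdes that could be upgraded in place by pointing a UC external table at the
# existing files; anything else is rewritten to Delta via CTAS.
IN_PLACE_FORMATS = {
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde": "ORC",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe": "PARQUET",
    "org.apache.hadoop.hive.serde2.avro.AvroSerDe": "AVRO",
}


def upgrade_ddl(serde: str, source: str, target: str, location: str) -> str:
    """Return the DDL for one table; column definitions are omitted for brevity."""
    fmt = IN_PLACE_FORMATS.get(serde)
    if fmt is None:
        # Unsupported or non-native serde: rewrite the data into a Delta table.
        return f"CREATE TABLE {target} AS SELECT * FROM {source}"
    # In-place upgrade: keep the files, register them under the new UC name.
    return f"CREATE TABLE {target} USING {fmt} LOCATION '{location}'"
```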

Changes required

  1. The table crawler needs to crawl Hive serde and file format info.
  2. The Table class should store Hive serde and file format info (see the sketch below).
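
A rough sketch of what change 2 could look like; the existing fields are abridged and the added field names are hypothetical, not the actual `Table` class:

```python
from dataclasses import dataclass


@dataclass
class Table:
    # Existing fields (abridged).
    catalog: str
    database: str
    name: str
    table_format: str                    # e.g. DELTA, PARQUET, HIVE
    # Hypothetical additions so the migrator can tell Hive serde variants apart.
    hiveserde: str | None = None         # e.g. org.apache.hadoop.hive.ql.io.orc.OrcSerde
    input_format: str | None = None      # e.g. OrcInputFormat
    output_format: str | None = None     # e.g. OrcOutputFormat
```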

Reference:

@nfx
Collaborator

nfx commented Mar 27, 2024

@qziyuan isn't table format already there?

@nfx
Collaborator

nfx commented Mar 27, 2024

It looks like we have to pre-empt this decision making into create_table_mapping CSV

@qziyuan
Contributor

qziyuan commented Mar 27, 2024

> @qziyuan isn't table format already there?

@nfx For Hive serde tables, the current table format, derived from table.provider, will always be "HIVE", so we need extra info (serde, input/output format) to differentiate them.
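
As a sketch of where that extra info could come from, one option is to parse the output of `DESCRIBE TABLE EXTENDED`; the parsing below is illustrative, not the crawler's actual code:

```python
def describe_hiveserde(spark, full_name: str) -> dict[str, str]:
    """Collect serde library and input/output formats for one Hive table."""
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {full_name}").collect()
    wanted = {"Serde Library", "InputFormat", "OutputFormat"}
    # DESCRIBE output rows carry (col_name, data_type, comment); the detailed table
    # information section includes the serde and format entries we need.
    return {row.col_name: row.data_type for row in rows if row.col_name in wanted}
```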

> It looks like we have to pre-empt this decision making into create_table_mapping CSV

We could either:

  • Make the decision during installation, so it is stored in the config and we don't need to change the table_mapping CSV structure; but the decision would be global.
  • Or make the decision when running create_table_mapping and add a new field to the CSV file to store it; that way the user can later adjust the decision per table by editing the CSV file (illustrated below).
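
A sketch of the second option, with a hypothetical `upgrade_strategy` column added to the mapping file; the other column names only approximate the table-mapping CSV format:

```python
import csv
import io

# Hypothetical mapping rows: the last column records the per-table decision.
mapping_csv = """workspace_name,src_schema,src_table,dst_catalog,dst_schema,dst_table,upgrade_strategy
prod,sales,events_orc,main,sales,events_orc,IN_PLACE
prod,sales,events_bin,main,sales,events_bin,CTAS_DELTA
"""

for row in csv.DictReader(io.StringIO(mapping_csv)):
    print(row["src_table"], "->", row["upgrade_strategy"])
```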

@nfx nfx moved this from Triage to Active Backlog in UCX Apr 10, 2024
@nfx nfx removed the enhancement New feature or request label Apr 22, 2024
@nfx nfx closed this as completed in #1412 Apr 23, 2024
nfx pushed a commit that referenced this issue Apr 23, 2024
…erde tables (#1412)

## Changes
1. Add `MigrateHiveSerdeTablesInPlace` workflow to in-place upgrade
external Parquet, Orc, Avro hiveserde tables.
2. Add functions in `tables.py` to describe the table and extract the
hiveserde details, and update the DDL from `show create table` by replacing
the old table name with the migration target and the dbfs mount table location
if any; the new DDL is used to create the new table in UC for the
in-place migration.
3. Add `_migrate_external_table_hiveserde` function in
`table_migrate.py`. Add two new arguments, `mounts` and
`hiveserde_in_place_migrate`, to the `TablesMigrator` class: `mounts` is
used to replace the dbfs mount table location if any, and
`hiveserde_in_place_migrate` controls which hiveserde is migrated in the
current run, so we can have multiple tasks running in parallel, each
migrating one type of hiveserde.

This PR also removed the majority of the code from PR #1432, because only a
subset of table formats can be migrated in place to UC with DDL from
`show create table`; simply creating tables with the updated DDL for all
`What.EXTERNAL_NO_SYNC` would fail.

### Linked issues

Closes #889 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests
- [ ] verified on staging environment (screenshot attached)
@github-project-automation github-project-automation bot moved this from Active Backlog to Archive in UCX Apr 23, 2024
jincejames pushed a commit to jincejames/ucx that referenced this issue Apr 23, 2024