[FEATURE]: Migrate external tables not supported by the "sync" command #889

Closed
1 task done
FastLee opened this issue Feb 5, 2024 · 4 comments · Fixed by #1412 or #1510
Labels: migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step

Comments

@FastLee
Contributor

FastLee commented Feb 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Tables whose format is not one of those supported by the SYNC command are currently not migrated to UC.

Fine-grained:

Related issues:

Proposed Solution

Allow users to migrate tables of unsupported formats by converting them to Delta.
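
As a minimal sketch of that conversion (assuming an ambient Databricks `spark` session; the table names are placeholders, not UCX's mapping logic):

```python
# Sketch of the proposed fallback: recreate an HMS table whose format SYNC cannot
# handle as a managed Delta table in UC via CTAS. Names below are placeholders.
src = "hive_metastore.sales.raw_events"   # e.g. a BINARYFILE or Hive serde table
dst = "main.sales.raw_events"             # Unity Catalog target

# CTAS: the default table format on Databricks is Delta, so this rewrites the data.
spark.sql(f"CREATE TABLE IF NOT EXISTS {dst} AS SELECT * FROM {src}")
```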

Additional Context

No response

@FastLee FastLee added enhancement New feature or request needs-triage labels Feb 5, 2024
@FastLee FastLee added this to UCX Feb 5, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Feb 5, 2024
@nfx nfx added migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step and removed needs-triage labels Feb 5, 2024
@qziyuan qziyuan self-assigned this Mar 26, 2024
@qziyuan
Contributor

qziyuan commented Mar 27, 2024

  • Spark DataSources supported by SYNC:
    Delta, Parquet, CSV, JSON, ORC, TEXT, AVRO

  • Spark DataSources not supported by UC and SYNC:
    BINARYFILE, JDBC, LIBSVM, custom implementations of org.apache.spark.sql.sources.DataSourceRegister

  • No Hive serde tables are supported by SYNC (a small classification sketch follows this list).
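
For clarity, a small sketch of that split in code; the set contents mirror the lists above, and the helper name is illustrative, not the UCX API:

```python
# Providers that SYNC can upgrade directly vs. those that need another migration path.
SYNC_SUPPORTED = {"DELTA", "PARQUET", "CSV", "JSON", "ORC", "TEXT", "AVRO"}
SYNC_UNSUPPORTED = {"BINARYFILE", "JDBC", "LIBSVM"}  # plus custom DataSourceRegister impls


def sync_supported(provider: str) -> bool:
    """Return True if a table with this provider can go through SYNC as-is.

    Note: Hive serde tables surface provider "HIVE" and always fall outside SYNC.
    """
    return provider.upper() in SYNC_SUPPORTED
```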

Migration strategy:

| table.provider | Hive serde (row format) and file format | Migration strategy |
|---|---|---|
| BINARYFILE | n/a | 1. By default, CTAS to Delta.<br>2. Prompt whether the user wants to keep the original file format instead of writing the binary content into Parquet files; if so, do not migrate. |
| JDBC | n/a | 1. Do not migrate for now.<br>2. In the future, migrate to Lakehouse Federation; if there is no supported federation connector, consider a view-based solution. |
| LIBSVM | n/a | Do not migrate. |
| Custom implementation of DataSourceRegister | n/a | Do not migrate. |
| HIVE | inputFormat=OrcInputFormat<br>outputFormat=OrcOutputFormat<br>serde=OrcSerde | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using ORC ... location ...`. |
| HIVE | inputFormat=MapredParquetInputFormat<br>outputFormat=MapredParquetOutputFormat<br>serde=ParquetHiveSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using PARQUET ... location ...`. |
| HIVE | inputFormat=AvroContainerInputFormat<br>outputFormat=AvroContainerOutputFormat<br>serde=AvroSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using AVRO ... location ...`. |
| HIVE | inputFormat=SequenceFileInputFormat<br>outputFormat=SequenceFileOutputFormat<br>serde=LazySimpleSerDe | CTAS to Delta. |
| HIVE | inputFormat=RCFileInputFormat<br>outputFormat=RCFileOutputFormat<br>serde=LazyBinaryColumnarSerDe | CTAS to Delta. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=LazySimpleSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using CSV ... location ...`; get the field and line delimiters from the HMS table metadata, set them accordingly on the UC CSV table, and disable quoting. If the HMS table storage properties contain the unsupported `escape.delim`, `mapkey.delim`, `collection.delim`, or `serialization.format`, do CTAS to Delta instead. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=RegexSerDe | CTAS to Delta. |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=JsonSerDe | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using JSON ... location ...` (needs testing). |
| HIVE | inputFormat=TextInputFormat<br>outputFormat=HiveIgnoreKeyTextOutputFormat<br>serde=OpenCSVSerde | 1. By default, CTAS to Delta.<br>2. If the user prefers an in-place upgrade and confirmed it during installation, migrate with `create table ... using CSV ... location ...`. |
| HIVE | All other non-native serdes | CTAS to Delta; if it fails, skip the table. |
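
To make the decision in the table concrete, here is a rough sketch: a handful of native serdes can be re-created in UC with `CREATE TABLE ... USING <format> ... LOCATION ...`, while everything else falls back to CTAS-to-Delta. The mapping, the function name, and the omission of the original column definitions are simplifications for illustration, not the UCX implementation:

```python
# Native serdes that could be upgraded in place by pointing a UC external table at the
# existing files; anything else is rewritten to Delta via CTAS.
IN_PLACE_FORMATS = {
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde": "ORC",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe": "PARQUET",
    "org.apache.hadoop.hive.serde2.avro.AvroSerDe": "AVRO",
}


def upgrade_ddl(serde: str, source: str, target: str, location: str) -> str:
    """Return the DDL for one table; column definitions are omitted for brevity."""
    fmt = IN_PLACE_FORMATS.get(serde)
    if fmt is None:
        # Unsupported or non-native serde: rewrite the data into a Delta table.
        return f"CREATE TABLE {target} AS SELECT * FROM {source}"
    # In-place upgrade: keep the files, register them under the new UC name.
    return f"CREATE TABLE {target} USING {fmt} LOCATION '{location}'"
```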

Changes required

  1. The table crawler needs to crawl Hive serde and file format info.
  2. The Table class should store Hive serde and file format info (see the sketch below).
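
A rough sketch of what change 2 could look like; the existing fields are abridged and the added field names are hypothetical, not the actual `Table` class:

```python
from dataclasses import dataclass


@dataclass
class Table:
    # Existing fields (abridged).
    catalog: str
    database: str
    name: str
    table_format: str                    # e.g. DELTA, PARQUET, HIVE
    # Hypothetical additions so the migrator can tell Hive serde variants apart.
    hiveserde: str | None = None         # e.g. org.apache.hadoop.hive.ql.io.orc.OrcSerde
    input_format: str | None = None      # e.g. OrcInputFormat
    output_format: str | None = None     # e.g. OrcOutputFormat
```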

Reference:

@nfx
Collaborator

nfx commented Mar 27, 2024

@qziyuan isn't table format already there?

@nfx
Collaborator

nfx commented Mar 27, 2024

It looks like we have to pre-empt this decision making into create_table_mapping CSV

@qziyuan
Contributor

qziyuan commented Mar 27, 2024

> @qziyuan isn't table format already there?

@nfx For Hive serde tables, the current table format, derived from table.provider, will always be "HIVE", so we need extra info (serde, input/output format) to differentiate them.
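
As a sketch of where that extra info could come from, one option is to parse the output of `DESCRIBE TABLE EXTENDED`; the parsing below is illustrative, not the crawler's actual code:

```python
def describe_hiveserde(spark, full_name: str) -> dict[str, str]:
    """Collect serde library and input/output formats for one Hive table."""
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {full_name}").collect()
    wanted = {"Serde Library", "InputFormat", "OutputFormat"}
    # DESCRIBE output rows carry (col_name, data_type, comment); the detailed table
    # information section includes the serde and format entries we need.
    return {row.col_name: row.data_type for row in rows if row.col_name in wanted}
```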

> It looks like we have to pre-empt this decision making into create_table_mapping CSV

We could either:

  • Make the decision during installation, so it is stored in the config and we don't need to change the table_mapping CSV structure; but the decision would be global.
  • Or make the decision when running create_table_mapping and add a new field to the CSV file to store it; that way the user can later adjust the decision per table by editing the CSV file (illustrated below).
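
A sketch of the second option, with a hypothetical `upgrade_strategy` column added to the mapping file; the other column names only approximate the table-mapping CSV format:

```python
import csv
import io

# Hypothetical mapping rows: the last column records the per-table decision.
mapping_csv = """workspace_name,src_schema,src_table,dst_catalog,dst_schema,dst_table,upgrade_strategy
prod,sales,events_orc,main,sales,events_orc,IN_PLACE
prod,sales,events_bin,main,sales,events_bin,CTAS_DELTA
"""

for row in csv.DictReader(io.StringIO(mapping_csv)):
    print(row["src_table"], "->", row["upgrade_strategy"])
```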

@nfx nfx moved this from Triage to Active Backlog in UCX Apr 10, 2024
@nfx nfx removed the enhancement New feature or request label Apr 22, 2024
@nfx nfx closed this as completed in #1412 Apr 23, 2024
nfx pushed a commit that referenced this issue Apr 23, 2024
…erde tables (#1412)

## Changes
1. Add `MigrateHiveSerdeTablesInPlace` workflow to in-place upgrade
external Parquet, Orc, Avro hiveserde tables.
2. Add functions in `tables.py` to describe the table and extract the
hiveserde details, and update the DDL from `show create table` by replacing
the old table name with the migration target and the dbfs mount table location
if any; the new DDL is used to create the new table in UC for the
in-place migration.
3. Add `_migrate_external_table_hiveserde` function in
`table_migrate.py`. Add two new arguments, `mounts` and
`hiveserde_in_place_migrate`, to the `TablesMigrator` class: `mounts` is
used to replace the dbfs mount table location if any, and
`hiveserde_in_place_migrate` controls which hiveserde is migrated in the
current run, so we can have multiple tasks running in parallel, each
migrating one type of hiveserde.

This PR also removed the majority of the code from PR #1432, because only a
subset of table formats can be migrated in place to UC with DDL from
`show create table`; simply creating tables with the updated DDL for all
`What.EXTERNAL_NO_SYNC` would fail.

### Linked issues

Closes #889 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests
- [ ] verified on staging environment (screenshot attached)
@github-project-automation github-project-automation bot moved this from Active Backlog to Archive in UCX Apr 23, 2024
jincejames pushed a commit to jincejames/ucx that referenced this issue Apr 23, 2024