[SUPPORT] HiveSyncTool: missing partitions #6277

@matthiasdg

Describe the problem you faced

We have some IoT data tables with a few thousand partitions each, typically laid out as deviceId/year/month/day.
We do not sync to Hive on every commit, but at regular intervals.
For one of these tables I added a few months of historical data for an additional set of devices, rather than the usual daily updates for the existing set. The subsequent Hive sync with HiveSyncTool must have gone wrong, because not all of these partitions are present in Hive. Unfortunately I do not have the logs, so I cannot tell whether the sync failed or passed silently without detecting some partitions; I suspect the latter. If I run HiveSyncTool again now, I just get e.g. "Last commit time synced is 20220802000054258, Getting commits since then", and that is exactly what it does: it picks up partitions added since that commit, but the ones that were missed earlier are never added.
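To make the suspected failure mode concrete, here is a toy model in Python. It is not Hudi code: `Commit`, `ToyMetastore`, `incremental_sync` and the `drop` parameter are hypothetical names invented for this illustration, and the partition paths are made up. It only assumes what the log line suggests, namely that a sync looks at commits newer than a stored marker and then advances that marker.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Commit:
    instant: str          # fixed-width instant time, e.g. "20220802000054258"
    partitions: Set[str]  # relative partition paths written by this commit


@dataclass
class ToyMetastore:
    partitions: Set[str] = field(default_factory=set)
    last_commit_time_synced: Optional[str] = None


def incremental_sync(timeline, metastore, drop=frozenset()):
    """Register partitions from commits newer than the stored marker.

    `drop` simulates partitions that are silently lost during this run
    (hypothetical; it stands in for whatever went wrong in the real sync).
    """
    marker = metastore.last_commit_time_synced
    # Fixed-width digit strings compare correctly as plain strings.
    pending = [c for c in timeline if marker is None or c.instant > marker]
    for commit in pending:
        metastore.partitions |= commit.partitions - set(drop)
    if pending:
        # The marker advances even though some partitions were dropped.
        metastore.last_commit_time_synced = max(c.instant for c in pending)
    return metastore


timeline = [
    Commit("20220801000000000", {"dev1/2022/08/01"}),
    Commit("20220802000054258", {"dev9/2022/05/01", "dev9/2022/05/02"}),
    Commit("20220803000000000", {"dev1/2022/08/03"}),
]

ms = ToyMetastore()
# First sync: the historical backfill loses one partition without an error.
incremental_sync(timeline[:2], ms, drop={"dev9/2022/05/02"})
# Second sync: only the commit after the marker is considered.
incremental_sync(timeline, ms)

on_storage = set().union(*(c.partitions for c in timeline))
missing = on_storage - ms.partitions
print(sorted(missing))
```

In this model the lost partition can never reappear, because the only commit that mentions it is older than the marker. That matches what I observe, but it is only my guess at the mechanism.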

My current workaround is to drop the Hive table and rerun HiveSyncTool from scratch, which adds all the partitions.

Steps to reproduce the behavior:

  1. Have a dataset with a large number of partitions laid out as deviceId/year/month/day (MultiPartKeysValueExtractor) and sync it to Hive for the first time. All is fine, though it may take a long time.
  2. Add data for the existing devices (new month/day partitions are created) and sync again. This still works.
  3. Add a large amount of data for devices that were not in the set before and sync again. In my case there are partitions for every new device, but many of the underlying date partitions are missing.
  4. Drop the Hive table and resync from scratch. All partitions are there.
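As a less drastic alternative to step 4, the gap could in principle be found by diffing the partition paths on storage against those registered in the metastore. The sketch below is only an illustration under assumptions: both inputs are plain lists of relative partition paths that you have collected yourself, and the table name, base path and column names (`deviceid`, `year`, `month`, `day`) are placeholders. The generated statements use Hive's standard `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` syntax. I have not verified how manually added partitions interact with later HiveSyncTool runs.

```python
def missing_partitions(on_storage, in_metastore):
    """Return partition paths present on storage but absent from the metastore."""
    return sorted(set(on_storage) - set(in_metastore))


def add_partition_ddl(table, base_path, partition_path,
                      columns=("deviceid", "year", "month", "day")):
    """Build one ALTER TABLE statement for a multi-part partition path.

    The column names are assumptions; adjust them to the real table schema.
    """
    values = partition_path.strip("/").split("/")
    if len(values) != len(columns):
        raise ValueError(f"unexpected partition depth: {partition_path!r}")
    spec = ", ".join(f"{col}='{val}'" for col, val in zip(columns, values))
    location = base_path.rstrip("/") + "/" + partition_path.strip("/")
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")


# Hypothetical example data
storage = ["dev1/2022/08/01", "dev9/2022/05/01", "dev9/2022/05/02"]
metastore = ["dev1/2022/08/01", "dev9/2022/05/01"]

for path in missing_partitions(storage, metastore):
    print(add_partition_ddl("iot_table", "/data/iot_table", path))
```

An empty diff would also serve as a cheap sanity check after each sync run.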

Expected behavior
I would expect one of two things: either all partitions are detected and added right away, or the sync fails with an error when partitions cannot be synced, so that the last commit time synced is not updated.

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.2

  • Hive version : client side: 2.3.7 through hudi, standalone metastore 3.0

  • Hadoop version : 3.2.0

  • Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2

  • Running on Docker? (yes/no) : yes (k8s)
