
[SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release #7417

@Junyewu

Describe the problem you faced

When HoodieROTablePathFilter is configured and normal (non-Hudi) Parquet files are loaded, the load becomes very slow once the number of files reaches a certain order of magnitude.

For example: 500 partitions and 50000 data files

data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
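To make the scale concrete: 50000 files over 500 partitions is about 100 files per partition. The issue does not say where the time is spent, so the following is only a toy cost model, not a description of HoodieROTablePathFilter or of any Hudi code. It assumes, purely for illustration, that a path filter pays one storage lookup per file unless the answer for the file's parent directory has been cached. The file names and the single partition level are hypothetical.

```python
def simulate_lookups(files, cache_by_directory):
    """Count simulated storage lookups made while filtering `files`.

    Toy assumption: accepting a file costs one lookup on its parent
    directory, unless that directory's answer is already cached.
    """
    lookups = 0
    seen_dirs = set()
    for path in files:
        parent = path.rsplit("/", 1)[0]
        if cache_by_directory and parent in seen_dirs:
            continue  # answer for this directory is already known
        lookups += 1
        seen_dirs.add(parent)
    return lookups


# Hypothetical layout with the sizes from the example:
# 500 partition directories, 100 data files in each.
files = [
    f"s3://bucket1/base/p{p:03d}/file-{i:03d}.parquet"
    for p in range(500)
    for i in range(100)
]

uncached = simulate_lookups(files, cache_by_directory=False)
cached = simulate_lookups(files, cache_by_directory=True)
print(uncached, cached)  # 50000 500
```

Under this assumption, per-file work grows with the number of files while per-directory work grows with the number of partitions. On an object store such as S3, where each lookup is a remote request, that difference can dominate load time.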

To Reproduce
Steps to reproduce the behavior:

  1. Submit the Spark application:
spark-sql --master yarn \
--conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter
  2. Create a temporary view:
create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");

The load is then very slow.
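To compare timings with and without the path filter, the two steps above can be run as a single non-interactive command. The following is a minimal sketch. The helper name `build_repro_command`, the base path, and the view name are placeholders introduced here, and passing the statement through spark-sql's `-e` option is a choice made for this sketch rather than something taken from the report.

```python
import shlex

# Configuration key and filter class exactly as given in the reproduction steps.
FILTER_CONF = (
    "spark.hadoop.mapreduce.input.pathFilter.class="
    "org.apache.hudi.hadoop.HoodieROTablePathFilter"
)


def build_repro_command(base_path, view_name, use_filter=True):
    """Return the spark-sql argument list for one reproduction run."""
    sql = (
        f"create or replace temporary view {view_name} "
        f'using parquet options (path "{base_path}");'
    )
    cmd = ["spark-sql", "--master", "yarn"]
    if use_filter:
        cmd += ["--conf", FILTER_CONF]
    cmd += ["-e", sql]
    return cmd


# Placeholder path and view name; substitute real values before running.
cmd = build_repro_command("s3://bucket1/baseDir/", "user_view")
print(shlex.join(cmd))
```

Running the printed command once with `use_filter=True` and once with `use_filter=False`, and comparing the "Time taken" line that spark-sql prints, gives a rough measure of the overhead attributable to the filter.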

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.1

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Applying the PR [https://github.com//pull/3719] mitigates this problem. With it, running

create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");

again finishes in about 60 seconds:

22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.

Time taken: 61.771 seconds

At the same time, we have not been able to reproduce the problem reported in [https://github.com//issues/4188]. In our Spark cluster, this PR ([HUDI-3719]) has been used to query partitioned tables for half a year, for example:

==create table==
CREATE EXTERNAL TABLE `pickinglogs`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `id` string COMMENT 'ID',

.......

  `meta_es_offset` string COMMENT '',
  `meta_type` string COMMENT '',
  `meta_status` int COMMENT '',
  `meta_md5` string COMMENT '',
  `ptk_time_create` string COMMENT '')
PARTITIONED BY (
  `year` string COMMENT '',
  `month` string COMMENT '',
  `day` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION



==query for sparksql==
spark-sql> select count(id) from pickinglogs where year=2022 and month between '08' and '10';
441834287
Time taken: 22.095 seconds, Fetched 1 row(s)
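The partition columns in the table above are declared as strings, so the predicate `month between '08' and '10'` is a string comparison. The following sketch shows why that selects the intended months only when the values are zero-padded. The assumption that partition values are two-digit months is inferred from the query literals and is not stated in the report.

```python
# Zero-padded month values, as the query literals '08' and '10' suggest.
months = [f"{m:02d}" for m in range(1, 13)]
selected = [m for m in months if "08" <= m <= "10"]
print(selected)  # ['08', '09', '10']

# Without padding, lexicographic order breaks the range: "8" sorts after "10".
unpadded = [str(m) for m in range(1, 13)]
unpadded_selected = [m for m in unpadded if "8" <= m <= "10"]
print(unpadded_selected)  # []
```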

Stacktrace

Labels: area:performance (Performance optimizations), priority:high (Significant impact; potential bugs)
