Description
Describe the problem you faced
When HoodieROTablePathFilter is used to load normal (non-Hudi) parquet files, loading becomes too slow once the number of files reaches a certain order of magnitude.
For example: 500 partitions and 50,000 data files.
data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
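To put the numbers above in perspective, here is a toy model of a per-file path filter. It is only a sketch under a stated assumption: that each uncached filter call costs one remote metadata lookup on the file's parent directory. `CountingStore`, `make_filter`, and the `a=`/`b=` directory names are hypothetical and are not Hudi or Hadoop APIs; this is not a diagnosis of what HoodieROTablePathFilter actually does.

```python
class CountingStore:
    """Hypothetical stand-in for a remote object store that counts lookups."""

    def __init__(self):
        self.lookups = 0

    def is_table_dir(self, directory):
        # Assumption: every call here is one remote round trip.
        self.lookups += 1
        return False  # the model contains only plain parquet directories


def parent_dir(path):
    return path.rsplit("/", 1)[0]


def make_filter(store, use_cache):
    """Build an accept(path) function, optionally memoizing per directory."""
    cache = {}

    def accept(path):
        directory = parent_dir(path)
        if use_cache and directory in cache:
            is_table = cache[directory]
        else:
            is_table = store.is_table_dir(directory)
            cache[directory] = is_table
        # Plain (non-table) files are always accepted in this model.
        return not is_table

    return accept


# 500 leaf partition directories (25 x 20), 100 files each: 50,000 files.
paths = [
    f"s3://bucket1/base/a={i // 20}/b={i % 20}/part-{j}.parquet"
    for i in range(500)
    for j in range(100)
]

uncached_store = CountingStore()
uncached = make_filter(uncached_store, use_cache=False)
accepted_uncached = sum(uncached(p) for p in paths)

cached_store = CountingStore()
cached = make_filter(cached_store, use_cache=True)
accepted_cached = sum(cached(p) for p in paths)

uncached_lookups = uncached_store.lookups
cached_lookups = cached_store.lookups
print(uncached_lookups, cached_lookups)
```

In this model both filters accept the same files, but the uncached one issues one lookup per file while the cached one issues one per leaf directory. Against S3, where each lookup is a network request, a gap of that size could plausibly account for a large slowdown.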
To Reproduce
Steps to reproduce the behavior:
- submit spark application
spark-sql --master yarn \
--conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter
- create temp view
create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
The view then takes a very long time to load.
Environment Description
- Hudi version : 0.10.0
- Spark version : 3.1.1
- Hive version : 3.1.2
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context
Applying the PR [https://github.com//pull/3719] mitigates this problem. With it, running
create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
again finishes in about 60 seconds:
22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
Time taken: 61.771 seconds
At the same time, we have not been able to reproduce the problem described in [https://github.com//issues/4188]. In our Spark cluster, this PR [HUDI-3719] has been used to query partitioned tables for half a year, for example:
==create table==
CREATE EXTERNAL TABLE `pickinglogs`(
`_hoodie_commit_time` string COMMENT '',
`_hoodie_commit_seqno` string COMMENT '',
`_hoodie_record_key` string COMMENT '',
`_hoodie_partition_path` string COMMENT '',
`_hoodie_file_name` string COMMENT '',
`id` string COMMENT 'ID',
.......
`meta_es_offset` string COMMENT '',
`meta_type` string COMMENT '',
`meta_status` int COMMENT '',
`meta_md5` string COMMENT '',
`ptk_time_create` string COMMENT '')
PARTITIONED BY (
`year` string COMMENT '',
`month` string COMMENT '',
`day` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
==query for sparksql==
spark-sql> select count(id) from pickinglogs where year=2022 and month between '08' and '10';
441834287
Time taken: 22.095 seconds, Fetched 1 row(s)
Stacktrace