[SUPPORT] Loading plain Parquet files with HoodieROTablePathFilter is too slow in the Hudi release


**Describe the problem you faced**

When plain Parquet files are loaded with HoodieROTablePathFilter configured, loading becomes very slow once the number of files reaches a certain order of magnitude.

For example: 500 partitions and 50000 data files

data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
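To make the scale concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions: it assumes some check is performed once per accepted path, and compares that with a check cached per parent directory. The function name `count_lookups`, the partition names `p1`/`p2`, and the file names are hypothetical, and the model does not claim to describe what HoodieROTablePathFilter does internally.

```python
import posixpath


def count_lookups(paths, cache_by_dir):
    """Count simulated remote checks needed to filter `paths`.

    Assumption: one check per path, or one check per distinct parent
    directory when `cache_by_dir` is True.
    """
    seen_dirs = set()
    lookups = 0
    for p in paths:
        folder = posixpath.dirname(p)
        if cache_by_dir and folder in seen_dirs:
            continue  # result for this directory is already known
        lookups += 1
        seen_dirs.add(folder)
    return lookups


# Hypothetical layout matching the numbers above:
# 20 x 25 = 500 partition directories, 100 files each = 50000 files.
paths = [
    f"s3://bucket1/base/p1={a}/p2={b}/part-{i}.parquet"
    for a in range(20)
    for b in range(25)
    for i in range(100)
]

per_file = count_lookups(paths, cache_by_dir=False)
per_dir = count_lookups(paths, cache_by_dir=True)
print(len(paths), per_file, per_dir)
```

Under these assumptions, per-file work grows with the number of data files, while per-directory work is bounded by the number of partitions. On S3 each such check would be a remote request, which is why the difference matters at this scale.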



**To Reproduce**
Steps to reproduce the behavior:
1. Submit the Spark application:
```
spark-sql --master yarn \
--conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter
```

2. Create a temporary view:
```
create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
```

The load is then very slow.


**Environment Description**

* Hudi version : 0.10.0

* Spark version : 3.1.1

* Hive version : 3.1.2

* Hadoop version : 3.2.1

* Storage (HDFS/S3/GCS..) :  S3

* Running on Docker? (yes/no) : no


**Additional context**

Applying the PR [https://github.com/apache/hudi/pull/3719] mitigates this problem. Running the same statement again:
```
create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");
```
finishes in about 60 seconds:
```
22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.

Time taken: 61.771 seconds
```
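For comparing runs with and without the patch, the `Time taken:` lines can be pulled out of captured `spark-sql` output. The following is a minimal sketch, assuming the output has been saved as text; the helper name `time_taken_seconds` is hypothetical and the pattern only covers the line format shown in this issue.

```python
import re

# Matches lines such as "Time taken: 61.771 seconds" and
# "Time taken: 22.095 seconds, Fetched 1 row(s)".
TIME_TAKEN = re.compile(r"Time taken:\s*([0-9]+(?:\.[0-9]+)?)\s*seconds")


def time_taken_seconds(output):
    """Return every 'Time taken' value found in `output`, in order."""
    return [float(value) for value in TIME_TAKEN.findall(output)]


sample = (
    "22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table "
    "partition metadata from memory due to size constraints\n"
    "\n"
    "Time taken: 61.771 seconds\n"
)
print(time_taken_seconds(sample))
```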


We have also not been able to reproduce the problem described in [https://github.com/apache/hudi/issues/4188]. In our Spark cluster, this PR [HUDI-3719] has been used to query partitioned tables for half a year, for example:
```
==create table==
CREATE EXTERNAL TABLE `pickinglogs`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `id` string COMMENT 'ID',

.......

  `meta_es_offset` string COMMENT '',
  `meta_type` string COMMENT '',
  `meta_status` int COMMENT '',
  `meta_md5` string COMMENT '',
  `ptk_time_create` string COMMENT '')
PARTITIONED BY (
  `year` string COMMENT '',
  `month` string COMMENT '',
  `day` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION



==query for sparksql==
spark-sql> select count(id) from pickinglogs where year=2022 and month between '08' and '10';
441834287
Time taken: 22.095 seconds, Fetched 1 row(s)

```



**Stacktrace**



