[FEA] Exposing the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf, if makes sense #18074

JigaoLuo · 2025-02-24T10:15:44Z

Is your feature request related to a problem? Please describe.:

As a user of the Parquet reader, I find pull request #17594 to be extremely useful. The metrics it reports are related to filter effectiveness.
I'm wondering if it would be reasonable to first export these C++ metrics to pylibcudf. Having access to these metrics would enhance what pylibcudf offers to Python users.

Describe the solution you'd like:

I believe the implementation of such a solution could be straightforward, and I'm willing to take on the task if assigned. The change in python/pylibcudf/pylibcudf/io/types.pyx should look roughly like this:

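    @property
    def num_input_row_groups(self):
        return self.metadata.num_input_row_groups

    @property
    def num_row_groups_after_stats_filter(self):
        # std::optional checking: return None for std::nullopt (no filtering done)
        if self.metadata.num_row_groups_after_stats_filter.has_value():
            return self.metadata.num_row_groups_after_stats_filter.value()
        return None

    @property
    def num_row_groups_after_bloom_filter(self):
        # std::optional checking: return None for std::nullopt (no filtering done)
        if self.metadata.num_row_groups_after_bloom_filter.has_value():
            return self.metadata.num_row_groups_after_bloom_filter.value()
        return None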

However, there's one important point to note (which I think is also mentioned in the C++ comments): these metric variables are only valid for the Parquet reader. I'm unsure whether it would be necessary to provide additional documentation for pylibcudf users to clarify this limitation:

cudf/cpp/include/cudf/io/types.hpp

Lines 288 to 297 in d0e219e

    
    // The following variables are currently only computed for Parquet reader
    size_type num_input_row_groups{0};  //!< Total number of input row groups across all data sources
    std::optional<size_type>
      num_row_groups_after_stats_filter;  //!< Number of remaining row groups after stats filter.
                                          //!< std::nullopt if no filtering done. Currently only
                                          //!< reported by Parquet readers
    std::optional<size_type>
      num_row_groups_after_bloom_filter;  //!< Number of remaining row groups after bloom filter.
                                          //!< std::nullopt if no filtering done. Currently only
                                          //!< reported by Parquet readers
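
To make the intended Python-side semantics concrete, here is a minimal pure-Python sketch. `RowGroupMetrics` and `pruned_fraction` are hypothetical names used only for illustration; they are not part of pylibcudf. The sketch assumes that `std::nullopt` would surface as `None`, and that the bloom filter is applied to the row groups left over by the stats filter, so its count is the final one when present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowGroupMetrics:
    """Hypothetical stand-in for the three C++ metadata fields (Parquet reader only)."""

    num_input_row_groups: int = 0
    # None plays the role of std::nullopt: no filtering of that kind was done.
    num_row_groups_after_stats_filter: Optional[int] = None
    num_row_groups_after_bloom_filter: Optional[int] = None


def pruned_fraction(metrics: RowGroupMetrics) -> Optional[float]:
    """Return the fraction of input row groups removed, or None if no filter ran."""
    # Assumption: the bloom filter runs after the stats filter, so prefer its count.
    remaining = metrics.num_row_groups_after_bloom_filter
    if remaining is None:
        remaining = metrics.num_row_groups_after_stats_filter
    if remaining is None or metrics.num_input_row_groups == 0:
        return None
    return 1.0 - remaining / metrics.num_input_row_groups
```

With such properties exposed, a user could tell "no filter was applied" (`None`) apart from "the filter kept nothing" (`0`), which a plain int default would hide.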


mroeschke · 2025-02-24T18:35:05Z

Thanks for the report.

Sure, the properties you mentioned would be appropriate for TableWithMetadata in pylibcudf. A pull request would be welcome!

JigaoLuo · 2025-02-26T11:03:00Z

@mroeschke Hello! I spent some time working on this yesterday, but I encountered a strange issue when passing a C++ variable to Python .pyx.

I have a simple implementation for converting std::optional<size_t> to an int in Python.

Under normal circumstances, when filtering is applied, the std::optional<size_t> should hold a valid value, which is the expected behavior. ✅
However, when filtering is applied, the value received as a Python int is always 0, which doesn't match the size_t value in the C++ code, as I've verified by debugging through the stack. With Pdb, I am able to reach the .pyx stack frame but could not print anything:

-> res = func(*args, **kwds)
  /home/jluo/cudf-dev/python/pylibcudf/pylibcudf/tests/io/test_parquet.py(156)test_read_parquet_filters_TODO()
-> print("num_row_groups_after_stats_filter: ", plc_table_w_meta.num_row_groups_after_stats_filter)
> /home/jluo/cudf-dev/types.pyx(432)pylibcudf.io.types.TableWithMetadata.num_row_groups_after_stats_filter.__get__()

I've been unable to figure out where the value is getting lost because there shouldn't be any loss in this conversion. If you have some free time, could you take a look at my commit? Since this bug is present, I haven't submitted a pull request yet.
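
While debugging, a small sanity check on the Python side may help narrow things down. The helper below is a hypothetical sketch, not part of pylibcudf and not a diagnosis of the bug. It takes the three values as plain Python objects and assumes each filter stage can only keep as many row groups as the previous stage, or fewer.

```python
from typing import List, Optional


def check_row_group_metrics(
    num_input: int,
    after_stats: Optional[int],
    after_bloom: Optional[int],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if num_input < 0:
        problems.append(f"num_input_row_groups is negative: {num_input}")
    # Upper bound for the next stage; starts at the total input count.
    previous = max(num_input, 0)
    for name, value in (("stats", after_stats), ("bloom", after_bloom)):
        if value is None:
            # None stands for std::nullopt: this filter was not applied.
            continue
        if 0 <= value <= previous:
            previous = value
        else:
            problems.append(
                f"after_{name}_filter={value} is outside the range [0, {previous}]"
            )
    return problems
```

Note that 0 is a legitimate value when a filter prunes every row group, so this check alone cannot flag the behavior described above. It is more useful next to a test whose filter is known to keep at least one row group, where the Python value can be compared against the count seen in the C++ debugger.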

mroeschke · 2025-02-26T19:57:57Z

> If you have some free time, could you take a look at my commit?

Your implementation looks OK so far. I would suggest you open a PR with your changes and the test case where this is always returning 0, as that would make it easier to iterate and debug what you are seeing.

JigaoLuo added the feature request New feature or request label Feb 24, 2025

mroeschke added the pylibcudf Issues specific to the pylibcudf package label Feb 24, 2025

github-project-automation bot added this to cuDF Python Feb 24, 2025

github-project-automation bot moved this to Todo in cuDF Python Feb 24, 2025

JigaoLuo linked a pull request Feb 26, 2025 that will close this issue

Expose the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf #18106

Open

3 tasks
