Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Exposing the Number of Filtered Parquet Rowgroups (IO Metadata) to pylibcudf, if makes sense #18074

Open
JigaoLuo opened this issue Feb 24, 2025 · 3 comments · May be fixed by #18106
Open
Labels
feature request New feature or request pylibcudf Issues specific to the pylibcudf package

Comments

@JigaoLuo
Copy link

JigaoLuo commented Feb 24, 2025

Is your feature request related to a problem? Please describe.:

As a user of the Parquet reader, I find pull request #17594 to be extremely useful. This report metrics are related to filter effectiveness.
I'm wondering if it would be reasonable to first export these C++ metrics to pylibcudf. Having access to these metrics would enhance its functionality in python world with pylibcudf.


Describe the solution you'd like:

I believe the implementation of such a solution could be straightforward, and I'm willing to take on the task if assigned. In python/pylibcudf/pylibcudf/io/types.pyx should be generally:

    @property
    def num_input_row_groups(self):
        return self.metadata.num_input_row_groups
    @property
    def num_row_groups_after_stats_filter(self):
        # std::optional checking
        return self.metadata.num_row_groups_after_stats_filter
    @property
    def num_row_groups_after_bloom_filter(self):
        # std::optional checking
        return self.metadata.num_row_groups_after_bloom_filter

However, there's one important point to note (which I think is also mentioned in the C++ comments): these metric variables are only valid for the Parquet reader. I'm unsure whether it would be necessary to provide additional documentation for pylibcudf users to clarify this limitation:

// The following variables are currently only computed for Parquet reader
size_type num_input_row_groups{0}; //!< Total number of input row groups across all data sources
std::optional<size_type>
num_row_groups_after_stats_filter; //!< Number of remaining row groups after stats filter.
//!< std::nullopt if no filtering done. Currently only
//!< reported by Parquet readers
std::optional<size_type>
num_row_groups_after_bloom_filter; //!< Number of remaining row groups after bloom filter.
//!< std::nullopt if no filtering done. Currently only
//!< reported by Parquet readers

@JigaoLuo JigaoLuo added the feature request New feature or request label Feb 24, 2025
@mroeschke
Copy link
Contributor

Thanks for the report.

Sure, the properties you mentioned would be appropriate for TableWithMetadata in pylibcudf. A pull request would be welcome!

@mroeschke mroeschke added the pylibcudf Issues specific to the pylibcudf package label Feb 24, 2025
@JigaoLuo
Copy link
Author

JigaoLuo commented Feb 26, 2025

@mroeschke Hello! I spent some time working on this yesterday, but I encountered a strange issue when passing a C++ variable to Python .pyx.

I have a simple implementation for converting std::optional<size_t> to an int in python.

  • Under normal circumstances, when filtering, the std::optional<size_t> should hold a valid value, which is the expected behavior. ✅
  • However, during the filtering process, the value received as a Python int is always 0, which doesn't match the size_t value in the C++ code, as I've verified through stack debugging. With Pdb, I am able to go to the .pyx stackframe but could no print any thing:
-> res = func(*args, **kwds)
  /home/jluo/cudf-dev/python/pylibcudf/pylibcudf/tests/io/test_parquet.py(156)test_read_parquet_filters_TODO()
-> print("num_row_groups_after_stats_filter: ", plc_table_w_meta.num_row_groups_after_stats_filter)
> /home/jluo/cudf-dev/types.pyx(432)pylibcudf.io.types.TableWithMetadata.num_row_groups_after_stats_filter.__get__()

I've been unable to figure out where the value is getting lost because there shouldn't be any loss in this conversion. If you have some free time, could you take a look at my commit? Since this bug is present, I haven't submitted a pull request yet.

@mroeschke
Copy link
Contributor

If you have some free time, could you take a look at my commit?

You implementation looks OK so far. I would suggest you open a PR with your changes and the test case where this is always returning 0 as that would be easier to iterate and debug what your are seeing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request pylibcudf Issues specific to the pylibcudf package
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

2 participants