[C++] Support property filter pushdown by utilizing payload file formats #178

Merged: 30 commits, Jul 24, 2023

Conversation

Ziy1-Tan
Contributor

@Ziy1-Tan Ziy1-Tan commented May 29, 2023

This PR targets the C++ SDK and is part of OSPP 2023.
Issue number: #98.
You can find more detail about this feature here

Steps

Proposed changes

Filter pushdown is a performance optimization that prunes unneeded data from a Parquet or ORC file, reducing the amount of data GraphAr scans and reads when a query on the file contains a filter expression. We want to support this feature in the GraphAr C++ Reader SDK.

Forms of pushdown

  • FilterPushDown: keep only the rows that satisfy the condition
  • ProjectPushDown: read only the specified columns
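
To make the two forms concrete, here is a minimal standard-library sketch of what each one does to a table. The `Row`, `Table`, `FilterRows`, `SelectColumns`, and `SampleTable` names and the sample rows are assumptions made up for illustration; they are not GraphAr or Arrow APIs, and the real pushdown happens inside the Parquet/ORC reader rather than on in-memory rows.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins (not GraphAr/Arrow types):
// a row maps column name -> value, a table is a list of rows.
using Row = std::map<std::string, std::string>;
using Table = std::vector<Row>;

// Filter pushdown: keep only rows whose `column` equals `value`.
Table FilterRows(const Table& table, const std::string& column,
                 const std::string& value) {
  Table out;
  for (const Row& row : table) {
    auto it = row.find(column);
    if (it != row.end() && it->second == value) out.push_back(row);
  }
  return out;
}

// Projection pushdown: keep only the requested columns of every row.
Table SelectColumns(const Table& table,
                    const std::vector<std::string>& columns) {
  Table out;
  for (const Row& row : table) {
    Row projected;
    for (const std::string& name : columns) {
      auto it = row.find(name);
      if (it != row.end()) projected.emplace(it->first, it->second);
    }
    out.push_back(std::move(projected));
  }
  return out;
}

// Three made-up rows, for demonstration only.
Table SampleTable() {
  return {
      {{"firstName", "Ada"}, {"lastName", "Lovelace"}, {"gender", "female"}},
      {{"firstName", "Alan"}, {"lastName", "Turing"}, {"gender", "male"}},
      {{"firstName", "Grace"}, {"lastName", "Hopper"}, {"gender", "female"}},
  };
}
```

Calling `FilterRows(SampleTable(), "gender", "female")` and then `SelectColumns(..., {"firstName", "lastName"})` leaves two rows with two columns each. The benefit of pushing both steps into the file reader is that the discarded rows and columns are never materialized.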

Design

We enable pushdown in these ways:

  • Set filter and projection when constructing Reader:
    • ConstructVertexPropertyArrowChunkReader(options)
  • Set filter and projection after construction:
    • reader.Filter(...)
    • reader.Select(...)

Implementation

Pushdown options are wrapped into FilterOptions:

using Filter = std::shared_ptr<Expression>;
using ColumnNames =
    std::optional<std::reference_wrapper<std::vector<std::string>>>;

struct FilterOptions {
  // The row filter to apply to the table.
  Filter filter = nullptr;
  // The columns to include in the table. Select all columns by default.
  ColumnNames columns = std::nullopt;

  FilterOptions() {}
  FilterOptions(Filter filter, ColumnNames columns)
      : filter(filter), columns(columns) {}
};
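
One consequence of this layout: `ColumnNames` holds a `std::reference_wrapper`, so the options refer to the caller's vector rather than copying it, and that vector must outlive the options. The sketch below mirrors the declarations above with an empty placeholder `Expression`; `ProjectedColumnCount` is a hypothetical helper added only to show the behaviour and is not part of this PR.

```cpp
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Empty placeholder for the PR's Expression wrapper (assumption).
struct Expression {};

using Filter = std::shared_ptr<Expression>;
using ColumnNames =
    std::optional<std::reference_wrapper<std::vector<std::string>>>;

struct FilterOptions {
  // The row filter to apply to the table.
  Filter filter = nullptr;
  // The columns to include in the table. Select all columns by default.
  ColumnNames columns = std::nullopt;
};

// Hypothetical helper: number of projected columns, or -1 when no
// projection is set (meaning "all columns").
long ProjectedColumnCount(const FilterOptions& options) {
  if (!options.columns.has_value()) return -1;
  return static_cast<long>(options.columns->get().size());
}
```

Because only a reference is stored, assigning `options.columns = cols;` and later pushing another name onto `cols` changes what the options see; a temporary vector cannot be bound at all.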

Example: select the columns firstName and lastName from the files where gender = female:

TEST_CASE("test_vertex_property_pushdown") {
  // ...
  auto filter = _Equal(_Property("gender"), _Literal("female"));
  std::vector<std::string> expected_cols{"firstName", "lastName"};
  // ...
  // construct pushdown options
  FilterOptions options;
  options.filter = filter;
  options.columns = expected_cols;
  // print reader result
  auto walkReader = [&](VertexPropertyArrowChunkReader& reader){...};

  SECTION("pushdown by helper function") {
    std::cout << "Vertex property pushdown by helper function:\n";
    auto maybe_reader = ConstructVertexPropertyArrowChunkReader(graph_info, label, group, options);
    walkReader(maybe_reader.value());
  }
}

Nested expressions are also supported, e.g. "2012-06-02T04:30:44.526+0000" < creationDate and creationDate = creationDate:

std::string property_name = "creationDate";
std::string value = "2012-06-02T04:30:44.526+0000";
auto expr1 = _LessThan(_Literal(value), _Property(property_name));
auto expr2 = _Equal(_Property(property_name), _Property(property_name));
auto filter = _And(expr1, expr2);

Result:

Vertex property pushdown by helper function:
Chunk: 0,       Nums: 60/100,   Range: (0, 100]
Chunk: 1,       Nums: 48/100,   Range: (100, 200]
Chunk: 2,       Nums: 46/100,   Range: (200, 300]
Chunk: 3,       Nums: 48/100,   Range: (300, 400]
Chunk: 4,       Nums: 51/100,   Range: (400, 500]
Chunk: 5,       Nums: 49/100,   Range: (500, 600]
Chunk: 6,       Nums: 47/100,   Range: (600, 700]
Chunk: 7,       Nums: 49/100,   Range: (700, 800]
Chunk: 8,       Nums: 54/100,   Range: (800, 900]
Chunk: 9,       Nums: 2/100,    Range: (900, 903]
Total Nums: 454/1000
Column Nums: 2
Column Names: `firstName` `lastName`
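
For readers unfamiliar with expression trees, the following standard-library sketch shows how a nested filter like the one above can be composed and evaluated. It is an assumption-laden stand-in, not the wrapper in this PR: the real wrapper produces an arrow::compute::Expression through Evaluate(), whereas here `Operand`, `Predicate`, `Property`, `Literal`, `Equal`, `LessThan`, and `And` are hypothetical names that evaluate directly against a row of strings. Values are compared as strings, which orders timestamps correctly only when they share the same format and offset.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Illustrative row type: property name -> string value.
using Row = std::map<std::string, std::string>;

// A value-producing operand: a property lookup or a literal.
struct Operand {
  bool is_property;
  std::string text;

  // A missing property resolves to the empty string in this sketch.
  std::string Resolve(const Row& row) const {
    if (!is_property) return text;
    auto it = row.find(text);
    return it == row.end() ? std::string() : it->second;
  }
};

Operand Property(std::string name) { return Operand{true, std::move(name)}; }
Operand Literal(std::string value) { return Operand{false, std::move(value)}; }

// Boolean node of the expression tree.
struct Predicate {
  virtual ~Predicate() = default;
  virtual bool Matches(const Row& row) const = 0;
};
using PredicatePtr = std::shared_ptr<Predicate>;

// Binary comparison of two operands (string comparison).
struct Comparison : Predicate {
  enum class Op { kEqual, kLessThan };
  Comparison(Op op, Operand lhs, Operand rhs)
      : op(op), lhs(std::move(lhs)), rhs(std::move(rhs)) {}
  bool Matches(const Row& row) const override {
    const std::string l = lhs.Resolve(row);
    const std::string r = rhs.Resolve(row);
    return op == Op::kEqual ? l == r : l < r;
  }
  Op op;
  Operand lhs;
  Operand rhs;
};

// Logical AND of two child predicates.
struct Conjunction : Predicate {
  Conjunction(PredicatePtr lhs, PredicatePtr rhs)
      : lhs(std::move(lhs)), rhs(std::move(rhs)) {}
  bool Matches(const Row& row) const override {
    return lhs->Matches(row) && rhs->Matches(row);
  }
  PredicatePtr lhs;
  PredicatePtr rhs;
};

PredicatePtr Equal(Operand lhs, Operand rhs) {
  return std::make_shared<Comparison>(Comparison::Op::kEqual, std::move(lhs),
                                      std::move(rhs));
}
PredicatePtr LessThan(Operand lhs, Operand rhs) {
  return std::make_shared<Comparison>(Comparison::Op::kLessThan,
                                      std::move(lhs), std::move(rhs));
}
PredicatePtr And(PredicatePtr lhs, PredicatePtr rhs) {
  return std::make_shared<Conjunction>(std::move(lhs), std::move(rhs));
}
```

With these helpers, `And(LessThan(Literal(value), Property("creationDate")), Equal(Property("creationDate"), Property("creationDate")))` matches a row only when its creationDate sorts after `value`; the second clause is always true and is there only to show nesting.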

Scope

  • Wrap the arrow::compute::Expression
  • VertexPropertyArrowChunkReader
    • GetRange()
    • Filter(filter)
    • Select(col_names)
  • VertexPropertyArrowChunkReader
  • ConstructVertexPropertyArrowChunkReader(..., filter_options)
  • ConstructAdjListPropertyArrowChunkReader(..., filter_options)
  • ConstructVerticesCollection(..., filter_options)
  • ConstructEdgesCollection(..., filter_options)

TBD

  • Better validation for Filter() and Select(), e.g. checking that the property names match
  • Large-scale data test cases

@github-actions

github-actions bot commented May 29, 2023

🎊 PR Preview 282e1d8 has been successfully built and deployed to https://alibaba-graphar-build-pr-178.surge.sh

🤖 By surge-preview

@Ziy1-Tan Ziy1-Tan changed the title [Feat] Support property filter pushdown by utilizing payload file formats [WIP][Feat] Support property filter pushdown by utilizing payload file formats May 29, 2023
@Ziy1-Tan
Contributor Author

PTAL :) @lixueclaire @acezen

@acezen
Contributor

acezen commented May 30, 2023

Good work! Thanks for the change; we will take a look.

@lixueclaire
Contributor

Well done! Thank you for completing this prototype. I agree with you that we could create a wrapper around arrow::compute::Expression to avoid exposing these Arrow methods directly to users.
I have another concern regarding the GetRange() method, which retrieves the range of internal vertex IDs for the current table of VertexPropertyArrowChunkReader. With filters applied, the vertex IDs will no longer be consecutive, which makes the value returned by this function meaningless. Do you have any ideas on this?

@Ziy1-Tan
Contributor Author

Good catch! Let me take a look.

@lixueclaire
Contributor

You could fix the included header files by referring to arrow_chunk_writer.

@Ziy1-Tan Ziy1-Tan force-pushed the pushdown branch 10 times, most recently from d253175 to 6e5370e Compare June 17, 2023 16:17
@Ziy1-Tan Ziy1-Tan marked this pull request as ready for review July 3, 2023 16:18
@Ziy1-Tan Ziy1-Tan force-pushed the pushdown branch 3 times, most recently from 0c8832e to 8911968 Compare July 4, 2023 08:30
@Ziy1-Tan Ziy1-Tan changed the title [WIP][Feat] Support property filter pushdown by utilizing payload file formats [WIP][C++] Support property filter pushdown by utilizing payload file formats Jul 4, 2023
@Ziy1-Tan Ziy1-Tan force-pushed the pushdown branch 2 times, most recently from 9a090dd to 3d6d8d5 Compare July 4, 2023 11:54
@acezen acezen requested review from acezen and lixueclaire July 10, 2023 05:13
virtual ArrowExpression Evaluate() = 0;
};

class ExpressionProperty : public Expression {

It would be better to add some brief descriptions for the classes defined in this file.

Besides, if you think it is necessary, you can add new classes or methods into API references by updating this file

Contributor Author

@Ziy1-Tan Ziy1-Tan Jul 11, 2023

Great tip! Let me take a look.

Signed-off-by: Ziy1-Tan <[email protected]>
Contributor

@lixueclaire lixueclaire left a comment

LGTM! @acezen , do you have further comments on this change?

Contributor

@acezen acezen left a comment

LGTM, thanks for the work.

@acezen
Contributor

acezen commented Jul 18, 2023

Hi @Ziy1-Tan, if you think the PR is ready, you can remove the WIP tag.

@Ziy1-Tan Ziy1-Tan changed the title [WIP][C++] Support property filter pushdown by utilizing payload file formats [C++] Support property filter pushdown by utilizing payload file formats Jul 21, 2023
@Ziy1-Tan
Contributor Author

Good. Let me add more unit tests for the reader.


3 participants