fix(iceberg): Date partition value parse issue#12126
fix(iceberg): Date partition value parse issue#12126nmahadevuni wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Can we not use std::stoi here instead of including a boost header?
In other comments i see that this function is slow. This could be a problem?
Also we wouldn't need a new include and add a new dependency here.
There was a problem hiding this comment.
std::stoi is converting a string like "2022-04-05" to int value 2022, so its not safe and may lead to wrong results.
There was a problem hiding this comment.
std::stoi should work here. https://www.geeksforgeeks.org/convert-string-to-int-in-cpp/
There was a problem hiding this comment.
std::stoi will work for converting int string to int. But it also doesn't throw error if we input a string like "2022-04-05" as it just converts it to 2022.
There was a problem hiding this comment.
Well, 2022-04-05 is parsed by boost into what number? It is not a valid integer in the first place. So either function will not work.
Also you are forgetting that the string is actually the days since epoch like you have in your description. Which means it is not an actual date formatted string. If this was the case you need a date parser here and not a string to int parse function in the first place.
There was a problem hiding this comment.
changed to use std::stoi
aditi-pandit
left a comment
There was a problem hiding this comment.
@nmahadevuni : Thanks for this fix. Have bunch of review comments.
There was a problem hiding this comment.
Why do you have this parameter to the function ? Since it seems like we never test its use really.
There was a problem hiding this comment.
vectorMaker_ is deprecated. Use makeFlatVector API instead.
There was a problem hiding this comment.
Don't think there is a need for this function as its only used in a single place and doesn't really represent anything.
There was a problem hiding this comment.
It is not reasonable to have an empty duckDbSql for this function, as that is the main sql string to verify the plan results with. Since we know the plan we are generating, it might be better to build the duckDBSql in this function itself.
There was a problem hiding this comment.
For this test case, we don't know how to generate the duckDbSql, when we add more test cases, we can enhance this function.
There was a problem hiding this comment.
I added an empty string check.
There was a problem hiding this comment.
In this code a very specific plan with a TableScanNode and partitionfilters, columnfilters is setup. This maps to a quite precise Sql. We could just build it in the logic.
But I'm also fine with the sql passed as a parameter.
There was a problem hiding this comment.
We should remove the default "" value for this parameter.
There was a problem hiding this comment.
In this test case, we generate int vector for date values and create the data file which is ok for velox, but cannot create duckdb table with the same vectors, so using a sql statement to verify.
There was a problem hiding this comment.
Why are you testing this ? Isn't matching results sufficient ?
There was a problem hiding this comment.
not required, removed it.
There was a problem hiding this comment.
Maybe rename this function to assertPartitionKey.
There was a problem hiding this comment.
Want to keep this name generic, since it could be used to test any case.
There was a problem hiding this comment.
This method is building a very specific plan with TableScanNode and IcebergSplits... The assertQuery function name is being used in all the TestBase classes for very generic usage.
There was a problem hiding this comment.
assertPartitionKey also doesn't seem to be the right name, since we can just pass empty partition key. It is just another overloaded assertQuery method.
There was a problem hiding this comment.
please fix this to use C++ cast since we are touching this code.
There was a problem hiding this comment.
Sorry, didn't get it. toDays converts date into daysSinceEpoch, where to use C++ cast?
There was a problem hiding this comment.
The C++ style cast:
static_cast<folly::StringPiece>(partitionValue)
See the example in a line you actually removed (below in the review).
There was a problem hiding this comment.
Added C++ cast in both places.
velox/connectors/hive/SplitReader.h
Outdated
There was a problem hiding this comment.
don't need a const for enums and scalars.
There was a problem hiding this comment.
I made it a const reference.
There was a problem hiding this comment.
Are we missing something? This is just a plain const and not a const reference. Also for enums you don't need a reference. And the argument to the constructor is not a const & either.
There was a problem hiding this comment.
Sorry, I was looking at a different place altogether. You are right. I removed all const references.
There was a problem hiding this comment.
Its better to move the read-only const& parameters before the writable * ones. So lets move tableFormat to the first parameter.
There was a problem hiding this comment.
In this code a very specific plan with a TableScanNode and partitionfilters, columnfilters is setup. This maps to a quite precise Sql. We could just build it in the logic.
But I'm also fine with the sql passed as a parameter.
There was a problem hiding this comment.
We should remove the default "" value for this parameter.
d3564ed to
c7843f4
Compare
|
Thank you @aditi-pandit @majetideepak @czentgr . I have addressed your comments. Please review. |
c7843f4 to
67b025c
Compare
|
@majetideepak @aditi-pandit @czentgr Addressed your comments. Please have a look. |
|
The date values are set in daysSinceEpoch in the FileScanTask returned by the Iceberg API. If we want to change that, that conversion back to string format has to happen in Coordinator. But Presto Java workers do not have any issue with that, and will need to be changed if we change it back to date string. So instead of changing at two places in Java, I fixed it in Velox. testFilters and applyPartitionFilter are static functions, setPartitionValue is the only class member. We will override and do special handling in IcebergSplitReader::setPartitionValue? @majetideepak |
@nmahadevuni Let's add |
67b025c to
ff840fc
Compare
As discussed on slack, added the bool to SplitReader class. Please review @majetideepak |
There was a problem hiding this comment.
writeToFile can be moved here since it is the same data.
I feel it is cleaner to create the plan and call HiveConnectorTestBase::assertQuery here.
This is inline with all the partition tests inside TableScanTest.cpp
There was a problem hiding this comment.
Agree... The current assertQuery function can be a lambda here for reuse.
There was a problem hiding this comment.
+1. Inlining the partition tests here fits better with the rest of the tests in this file like assertPositionalDeletes.
|
@Yuhta can you take a look at this change? Thanks. |
ff840fc to
2f5fa0a
Compare
velox/connectors/hive/TableHandle.h
Outdated
There was a problem hiding this comment.
I think we should follow the HiveTableHandle and add a generic argument like
const std::unordered_map<std::string, std::string>& columnParameters = {}
We can then define struct ColumnParameter similar to struct TableParameter and add an entry for our need. Example: key=partition.date.value.format, value=daysSinceEpoch/ISODateFormat as from our discussion offline.
2f5fa0a to
30f5dad
Compare
|
@majetideepak @Yuhta Thanks for the review. Made the changes as suggested. Please have a look. |
majetideepak
left a comment
There was a problem hiding this comment.
@nmahadevuni the Velox API looks good to me. Found some more comments.
There was a problem hiding this comment.
nit: remove commented code.
|
@Yuhta can you comment on this API? Thanks. |
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @nmahadevuni. Have few questions.
There was a problem hiding this comment.
This test is very simple. The partition keys, filters are the same and there is only one matching row that is selected. Would be good to add rows in the input that don't match the partition key and are filtered out. Also would be good to add a filter for another non-partition key value that matches an input row.
There was a problem hiding this comment.
+1. Inlining the partition tests here fits better with the rest of the tests in this file like assertPositionalDeletes.
There was a problem hiding this comment.
This check is repeated twice. Can we abstract this check as a method of ColumnHandle ?
velox/connectors/hive/TableHandle.h
Outdated
There was a problem hiding this comment.
The serialize and creata methods should be enhanced to also serialize/deserialize the columnParameters_ in the HiveColumnHandle.
There was a problem hiding this comment.
Do we need to change serde and toString methods? The columnParameters_ will be set in Prestissimo code. Even tableParameters_ from HiveTableHandle class is implemented the same way.
There was a problem hiding this comment.
@nmahadevuni : Yes, serde and toString methods should always be updated when we update classes that appear in a Velox plan.
toString is used in printPlanWithStats debugging utilities https://facebookincubator.github.io/velox/develop/debugging/print-plan-with-stats.html and the serde methods have been used in non-Presto use-cases.
tableParameters_ should be handled this way as well. Its a bug if it is not.
velox/connectors/hive/TableHandle.h
Outdated
There was a problem hiding this comment.
toString() should also be enhanced to show the columnParameters_.
Please update unit tests as well.
velox/connectors/hive/TableHandle.h
Outdated
There was a problem hiding this comment.
Please add a comment here indicating that these are used for metadata like column date value format.
If you have a link to how these are populated that will be great as well.
This change would be accompanied by presto-native code to populate these fields in IcebergPrestoToVeloxConnector::toVeloxColumnHandle method. So is there a change in the protocol classes as well to pass these from the Iceberg connector ? Do you the e2e Prestissimo PR ? Would be great to take a look at that to understand the scope of this change.
There was a problem hiding this comment.
Yes. We will need to change IcebergPrestoToVeloxConnector::toVeloxColumnHandle. This change will need to be merged first for me to work on the Prestissimo PR.
There was a problem hiding this comment.
Would be great to see the Prestissimo code at the same time to know this works e2e.
You can change the Velox submodule in the Presto project to pick up the changes from your branch https://git-scm.com/book/en/v2/Git-Tools-Submodules section "Pulling in Upstream Changes from the Submodule Remote"
and try out this code with presto-native-execution test.
See prestodb/presto#24138 for how a PR looks with Velox submodule changes for a branch with local commits.
velox/connectors/hive/TableHandle.h
Outdated
There was a problem hiding this comment.
Let's just make it typed struct as there is no need for serialization for this parameter. Something like this:
struct ColumnParseParameters {
enum PartitionDateValueFormat {
kISO8601,
kDaysSinceEpoch,
} partitionDateValueFormat;
};There was a problem hiding this comment.
Thanks @Yuhta . I have made the changes. Also, the draft changes on Prestissimo side are at nmahadevuni/presto@32b22cb. Please review @majetideepak @aditi-pandit
3804e10 to
f6964f1
Compare
f6964f1 to
27baa3c
Compare
|
Have made the changes. Please review. @Yuhta @majetideepak @aditi-pandit |
27baa3c to
52b81fa
Compare
majetideepak
left a comment
There was a problem hiding this comment.
@nmahadevuni this looks nice! Few comments.
@aditi-pandit can you take another look?
majetideepak
left a comment
There was a problem hiding this comment.
Thanks, @nmahadevuni
I will add the merge label once Aditi approves.
52b81fa to
457e82a
Compare
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @nmahadevuni
|
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@nmahadevuni looks like this unit test failed internally. Could you take a look?
|
|
it's a memory leak detected by ASAN: |
Thank you. Will have a look. |
There was a problem hiding this comment.
This is a leak.
You can use auto plan = PlanBuilder(pool_.get()).startTableScan()....
There was a problem hiding this comment.
PlanBuilder().startTableScan()
457e82a to
b0596f0
Compare
|
Fixed it and ran the leaks tool on Mac to check. @pedroerp @majetideepak |
|
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
fixes prestodb/presto#24371
Iceberg partition values are already in daysSinceEpoch, but in velox we assume its in date form and try to convert as with Hive. Fixed this.