Subfield pruning in Parquet#13271
Conversation
|
What about tests? We’d like to have a and a.b both be structs with s* as scalar members and then do select a, a.s1, a.b, a.b.s2 in different combinations. Also both a and a.b should have nulls.
If this works this seems good, at least this is very compact if the functionality is complete.
From: Maria Basmanova <notifications@github.com>
Sent: Thursday, August 22, 2019 4:21 PM
To: prestodb/presto <presto@noreply.github.com>
Cc: oerling <erling@xs4all.nl>; Review requested <review_requested@noreply.github.com>
Subject: Re: [prestodb/presto] Subfield pruning in Parquet (#13271)
@mbasmanova <https://github.com/mbasmanova> requested review from @prestodb/aria on: #13271 <#13271> Subfield pruning in Parquet.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub <#13271?email_source=notifications&email_token=AKPPPTYKC2CGXH6K6KKPBKLQF4NOJA5CNFSM4IOZQHCKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOTGVOJSI#event-2578113737> , or mute the thread <https://github.com/notifications/unsubscribe-auth/AKPPPT37HOIXSG2D7QW555DQF4NOJANCNFSM4IOZQHCA> . <https://github.com/notifications/beacon/AKPPPT33YRO3J6EAPCDGQ2LQF4NOJA5CNFSM4IOZQHCKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOTGVOJSI.gif>
|
mbasmanova
left a comment
There was a problem hiding this comment.
@zhenxiao Looks nice. It is cool that subfield pruning can be achieved with so few lines of code. Like Orri mentioned, it would be good to add some tests. I don't see integration tests for Parquet in the code base though. I suggest to start by adding TestParquestDistributedQueries that mimics TestHiveDistributedQueries, then add subfield pruning tests there.
Here is a draft:
public class TestParquestDistributedQueries
extends AbstractTestDistributedQueries
{
protected TestParquestDistributedQueries()
{
super(TestParquestDistributedQueries::createQueryRunner);
}
private static QueryRunner createQueryRunner()
throws Exception
{
return HiveQueryRunner.createQueryRunner(getTables(),
ImmutableMap.of("experimental.pushdown-subfields-enabled", "true"),
"sql-standard",
ImmutableMap.of("hive.storage-format", "PARQUET"),
Optional.empty());
}
@Test
public void testSubfieldPruning()
{
getQueryRunner().execute("CREATE TABLE test_subfield_pruning AS " +
"SELECT orderkey, linenumber, shipdate, " +
" CAST(ROW(orderkey, linenumber, ROW(day(shipdate), month(shipdate), year(shipdate))) " +
" AS ROW(orderkey BIGINT, linenumber INTEGER, shipdate ROW(ship_day TINYINT, ship_month TINYINT, ship_year INTEGER))) AS info " +
"FROM lineitem");
try {
assertQuery("SELECT info.orderkey, info.shipdate.ship_month FROM test_subfield_pruning", "SELECT orderkey, month(shipdate) FROM lineitem");
assertQuery("SELECT orderkey FROM test_subfield_pruning WHERE info.shipdate.ship_month % 2 = 0", "SELECT orderkey FROM lineitem WHERE month(shipdate) % 2 = 0");
}
finally {
getQueryRunner().execute("DROP TABLE test_subfield_pruning");
}
}
@Override
protected boolean supportsNotNullColumns()
{
return false;
}
@Override
public void testDelete()
{
// Hive connector currently does not support row-by-row delete
}
@Test
public void testExplainOfCreateTableAs()
{
String query = "CREATE TABLE copy_orders AS SELECT * FROM orders";
MaterializedResult result = computeActual("EXPLAIN " + query);
assertEquals(getOnlyElement(result.getOnlyColumnAsSet()), getExplainPlan(query, LOGICAL));
}
}
There was a problem hiding this comment.
I'd rename paths to types or subfieldTypes because this variable holds a list of types, not paths.
There was a problem hiding this comment.
- drop else after return
- inline
resultvariable
if (types.isEmpty()) {
return new MessageType(subfield.getRootName(), ImmutableList.of());
}
org.apache.parquet.schema.Type type = types.get(types.size() - 1);
for (int i = types.size() - 2; i >= 0; --i) {
GroupType groupType = types.get(i).asGroupType();
type = new MessageType(groupType.getName(), ImmutableList.of(type));
}
return new MessageType(subfield.getRootName(), ImmutableList.of(type));
3a61b8f to
b5476e8
Compare
b5476e8 to
d9f9803
Compare
|
thank you, @oerling @mbasmanova |
mbasmanova
left a comment
There was a problem hiding this comment.
@zhenxiao Looks good to me. Thanks for implementing this optimization.
@mbasmanova @nezihyigitbasi