Skip to content

Support materialized view creation, drop and query#15589

Merged
highker merged 3 commits intoprestodb:masterfrom
gggrace14:mpb
Mar 15, 2021
Merged

Support materialized view creation, drop and query#15589
highker merged 3 commits intoprestodb:masterfrom
gggrace14:mpb

Conversation

@gggrace14
Copy link
Copy Markdown
Contributor

@gggrace14 gggrace14 commented Jan 7, 2021

Copy link
Copy Markdown

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have meetings for this PR. There are a lot details to figure out. Also, we need a lot of tests for this

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make it a generic interface:

alterTable(ConnectorSession session, String databaseName, String tableName, Table newTable)

Then, in this function, we should verify everything else is the same and update the table params only. If we found something else is different (e.g., owner, column, etc), we should fail it. Same, even for table params, we should only touch PRESTO_DEPENDENT_MATERIALIZED_VIEW_LIST.

Add a javadoc to indicate the functionality is very limited

@gggrace14 gggrace14 force-pushed the mpb branch 4 times, most recently from ae2354d to cea67a3 Compare February 2, 2021 06:04
@gggrace14 gggrace14 requested a review from highker February 2, 2021 18:41
@gggrace14 gggrace14 changed the title Support materialized view creation and query Support materialized view creation, drop and query Feb 2, 2021
@gggrace14 gggrace14 force-pushed the mpb branch 3 times, most recently from 4e9afee to 59d3299 Compare February 3, 2021 07:26
Copy link
Copy Markdown

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We need a lot more real tests. TestHiveDistributedQueries is a good place to add.
  • Only half way through the first commit

@gggrace14
Copy link
Copy Markdown
Contributor Author

  • We need a lot more real tests. TestHiveDistributedQueries is a good place to add.
  • Only half way through the first commit

@gggrace14 gggrace14 closed this Feb 4, 2021
@gggrace14 gggrace14 reopened this Feb 4, 2021
@jainxrohit
Copy link
Copy Markdown
Contributor

@gggrace14 can you please rebase these changes?

Copy link
Copy Markdown

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate out the first commit as a separate PR. We probably don't want to merge the second commit at the moment. We will need DROP, REPLACE, SHOW CREATE, and REFRESH logic first.

@highker highker self-requested a review March 2, 2021 03:06
Comment on lines 85 to 87
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these could be local variables.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to address the comments?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, forgot to comment. They're used in multiple Test cases (and initialized in setUp()). So might need to be class fields to be accessed by the Test cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, forgot to reply. They're used by multiple Test cases (and initialized in setUp()). So might need to be class members.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they are used. Please check IntelliJ warnings. They can all be local variables than class members.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got what you mean. Revised. Good to see the IntelliJ highlight.

Copy link
Copy Markdown

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments only

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a TODO here right? This is where the partition stitching happens. Without TODO, it's very easy to miss this part.

Comment on lines 85 to 87
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to address the comments?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? What would be the column mapping looks like?
test_region_nation_join_mv.nationkey = test_customer_base.nationkey
but test_orders_base is with orderstatus as the partition key right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, revised.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this will pass assert? It should fail right. It doesn't look like a legit use case.

Copy link
Copy Markdown
Contributor Author

@gggrace14 gggrace14 Mar 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose for the use case below, the exchange_rate is a small dimension table partitioned by 'country'. If we include country as the mv partition key, a reload of a country will trigger the refresh of impacted revenue_aggregation_mv partitions. If we don't include country as a mv partition key, a reload of country won't affect the past revenue_aggregation_mv data. Need to have this flexibility to handle different customer needs.

CREATE MATERIALIZED VIEW IF NOT EXISTS revenue_aggregation_mv
WITH (partitioned_by = ARRAY['ds'])
AS SELECT
SUM(l.legal_budget * r.to_usd) AS total_legal_budget_usd, country, ds
FROM revenue_aggregation_imp l JOIN exchange_rate r ON l.country = r.country
GROUP BY country, ds

@gggrace14 gggrace14 force-pushed the mpb branch 2 times, most recently from 63a6b44 to 4b48c96 Compare March 8, 2021 23:20
@gggrace14 gggrace14 requested a review from highker March 8, 2021 23:23
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is not addressed. We need to assertQueryFails if we are to insert into a materialized view will fail and delete from a materialized view will fail.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test where a join query will fail? Especially more tests to cover partition column mapping.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right. We shouldn't throw MISSING_TABLE error. We should tell user we cannot alter a materialized view.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, we should throw errors other than MISSING_TABLE

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Comment on lines 333 to 336
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a warning. It's a failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break a line after this

Comment on lines 450 to 454
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -172,15 +173,15 @@ public synchronized List<String> getAllDatabases()
public synchronized void createTable(Table table)
{
TableType tableType = TableType.valueOf(table.getTableType());
Copy link
Copy Markdown
Contributor Author

@gggrace14 gggrace14 Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the class here along with other implementation at this level, org.apache.hadoop.hive.metastore.TableType is being called, while Hive 2.0 doesn't have TableType.MATERIALIZED_VIEW. So I need move the table type casting logic to one level above, ThriftMetastoreUtil.fromMetastoreApiTable() and toMetastoreApiTable().

Copy link
Copy Markdown

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early comments; otherwise LGTM

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a () around !table.getTableType().equals(MANAGED_TABLE) && !table.getTableType().equals(MATERIALIZED_VIEW)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO here to remove the type change after Hive 3.0 upgrade

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mention Facebook. This is open source; we should be neutral. Add a TODO here to remove the type change after Hive 3.0 upgrade

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to change this file anymore

Comment on lines 406 to 428
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental change? We don't have to modify this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where we cast managed_table back to materialized_view for getTable().

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add assertUpdate("DROP TABLE test_customer_base"); and the same for test_orders_base. We wanna clean up the created table after unit test is done.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same drop table test_customer_base_1 here on exist

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, drop test_customer_base_2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same drop test_customer_base_3

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop test_customer_base_4

@highker highker self-assigned this Mar 12, 2021
Add createMaterializedView API to MetadataManager and add CreateMaterializedViewTask
that calls the API to perform MV creation.

HiveMetadata implementation of the API is to add essential MV parameters
and create MV as a standard table. ConnectorMaterializedViewDefinition is the json
structure that contains all essential metadata we will save to Metastore. It will
serialize into the viewOriginalText field of table metadata. Add MV partition.
At the moment, we only support partitioned MV defined on partitioned base tables.
MV must have one partition directly matched (selected) from base tables.

Add MATERIALIZED_VIEW table type.

We don't support alter schema of materialized view at the moment.

We don't support MVs defined across different catalogs, which is enforced in
CreateMaterializedViewTask
Add dropMaterializedView API to MetadataManager, and add DropMaterializedViewTask
that calls the API to perform MV drop.

HiveMetadata implementation of the API is to directly drop the MV table.
@highker highker merged commit fc80960 into prestodb:master Mar 15, 2021
Copy link
Copy Markdown
Contributor

@kaikalur kaikalur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the language spec get reviewed? I had concerns earlier on VIEW vs MATERIALIZED VIEW? Were those addressed?

@highker
Copy link
Copy Markdown

highker commented Mar 16, 2021

Did the language spec get reviewed? I had concerns earlier on VIEW vs MATERIALIZED VIEW? Were those addressed?

Chatted offline with @rongrong once early this year. We are going to model MATERIALIZED VIEW as VIEW with strong consistency. MATERIALIZED VIEW is VIEW with or without backfilled/materialized data. We will have a lot of code in Presto to make sure that. Also, we will have metastore support to guarantee transactions so that base tables and materialized views are always in sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants