Skip to content

Add product tests for Parquet reader#14283

Merged
raunaqmorarka merged 1 commit intotrinodb:masterfrom
raunaqmorarka:pq-pdt
Sep 26, 2022
Merged

Add product tests for Parquet reader#14283
raunaqmorarka merged 1 commit intotrinodb:masterfrom
raunaqmorarka:pq-pdt

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

Description

Runs tpcds and tpch at sf1 and verifies results

Non-technical explanation

Adds tests for parquet reader

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % erro-prone checks

Runs tpcds and tpch at sf1 and verifies results
Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the additional tests overhead?

@raunaqmorarka
Copy link
Copy Markdown
Member Author

what's the additional tests overhead?

It takes ~20min and is run as a separate suite in parallel https://github.com/trinodb/trino/actions/runs/3125999661/jobs/5071210570

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Sep 26, 2022

It takes ~20min

That seems a lot. Do we have corresponding ORC tests? @nineinchnick do we now have a way to limit number of product tests running based on changeset?

@nineinchnick
Copy link
Copy Markdown
Member

Yes, we do limit tests executed in PRs (only, everything runs on master). It depends on connectors configured in the product test environment, so in this case, that'll include trino-hive. If there are changes in any non-plugin module, all tests are always executed.

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Sep 26, 2022

If there are changes in any non-plugin module, all tests are always executed.

So Parquet PT will always be run?

@raunaqmorarka
Copy link
Copy Markdown
Member Author

raunaqmorarka commented Sep 26, 2022

It takes ~20min

That seems a lot. Do we have corresponding ORC tests?

Existing suite-tpcds which runs on ORC files executed in 34m https://github.com/trinodb/trino/actions/runs/3125999661/jobs/5071210469

@nineinchnick
Copy link
Copy Markdown
Member

If there are changes in any non-plugin module, all tests are always executed.

So Parquet PT will always be run?

Yes, unless there are changes only in plugins other than hive, tpch, or tpcds, and not in any of the non-plugin modules (core, lib, etc.).

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Sep 26, 2022

Yes, unless there are changes only in plugins other than hive, tpch, or tpcds, and not in any of the non-plugin modules (core, lib, etc.).

Could we make parquet PTs only run for Hive or Parquet module changes?

@nineinchnick
Copy link
Copy Markdown
Member

Yes, unless there are changes only in plugins other than hive, tpch, or tpcds, and not in any of the non-plugin modules (core, lib, etc.).

Could we make parquet PTs only run for Hive or Parquet module changes?

Not right now, because we can't identify which PTs depend on the Parquet module. Right now, we assume that all plugins configured in a PT environment are used by tests (suites) using it, and we map plugins to modules. We'd have to extend this mapping to non-plugin modules and identify libs that are only used by plugins. I'm afraid we'd fall into dependency hell.

I'm not sure how much work it would take and if is it worth it.

@raunaqmorarka
Copy link
Copy Markdown
Member Author

Yes, unless there are changes only in plugins other than hive, tpch, or tpcds, and not in any of the non-plugin modules (core, lib, etc.).

Could we make parquet PTs only run for Hive or Parquet module changes?

Not right now, because we can't identify which PTs depend on the Parquet module. Right now, we assume that all plugins configured in a PT environment are used by tests (suites) using it, and we map plugins to modules. We'd have to extend this mapping to non-plugin modules and identify libs that are only used by plugins. I'm afraid we'd fall into dependency hell.

I'm not sure how much work it would take and if is it worth it.

IMO not worth it. Even if we could do it, it's possible that parquet reader may produce dictionary blocks or have some other difference in output pages from ORC which could hit different code paths in the engine. So it might be useful to run these tests even if the code change is only in the engine.

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Sep 26, 2022

IMO not worth it. Even if we could do it, it's possible that parquet reader may produce dictionary blocks or have some other difference in output pages from ORC which could hit different code paths in the engine. So it might be useful to run these tests even if the code change is only in the engine.

Fine. @nineinchnick is extra 20mins ok?

@raunaqmorarka raunaqmorarka merged commit f573802 into trinodb:master Sep 26, 2022
@raunaqmorarka raunaqmorarka deleted the pq-pdt branch September 26, 2022 16:04
@github-actions github-actions bot added this to the 398 milestone Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants