[Bug]: BigQueryIO direct read not reading all rows when set --setEnableBundling=true #26354

Closed
1 of 15 tasks
Abacn opened this issue Apr 19, 2023 · 5 comments · Fixed by #28778

Comments

@Abacn
Contributor

Abacn commented Apr 19, 2023

What happened?

The feature was introduced in #25392. When the --setEnableBundling=true pipeline option is set, BigQueryIO direct read returns only a small fraction of the rows of a large table. Reproduced by reading the tpcds_1T.web_sales table.

Number of rows: 720,000,376
--setEnableBundling=true: 44,550,489 rows read
--setEnableBundling=false: 720,000,376 rows read

Reading from the tpcds_1G.web_sales table does not trigger the issue; 18,000 rows are read.
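
To put the tpcds_1T numbers in perspective, here is a minimal sketch that computes how many rows went missing and what fraction was actually read. The row counts are copied from the report above; the helper name `read_shortfall` is hypothetical and is not part of Beam or of the linked demo.

```python
from typing import Tuple

# Row counts copied from the report above (tpcds_1T.web_sales).
TOTAL_ROWS = 720_000_376
ROWS_READ_WITH_BUNDLING = 44_550_489


def read_shortfall(expected: int, actual: int) -> Tuple[int, float]:
    """Return (missing row count, fraction of expected rows actually read).

    Hypothetical helper for illustration only.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return expected - actual, actual / expected


missing, fraction = read_shortfall(TOTAL_ROWS, ROWS_READ_WITH_BUNDLING)
print(f"missing rows: {missing:,}")      # missing rows: 675,449,887
print(f"fraction read: {fraction:.2%}")  # fraction read: 6.19%
```

A comparison like this between the table's reported row count and the count produced by the pipeline could serve as a simple regression check once a fix is available.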

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Contributor Author

Abacn commented Apr 19, 2023

Affects versions >= v2.46.0.

Demo that reproduces this issue: https://github.com/Abacn/beam-demo/tree/bigqueryiotestbranch

@Abacn Abacn self-assigned this Apr 19, 2023
@kennknowles
Member

This seems like a pretty severe issue. Any progress?

@Abacn
Contributor Author

Abacn commented Apr 27, 2023

The feature was introduced in Beam v2.46.0 and is activated only when this currently undocumented pipeline option is set. No production user is using it. @vachan-shetty is working on a fix. Feel free to assign it to yourself.

@Abacn Abacn removed their assignment Apr 27, 2023
@kennknowles
Member

Is this related to #26521 or no?

@Abacn
Contributor Author

Abacn commented May 3, 2023

@kennknowles thanks, I will test whether #26503 fixed the issue.

Update: no, that issue is for write; this one is for read.
