
test: worker.py: Fix worker test flakiness. #2000

Merged

michael-berlin merged 1 commit into vitessio:master from michael-berlin:unflake_worker

Aug 27, 2016

Conversation

@michael-berlin
Contributor

The test failed because a vtworker retry was only triggered for the first shard and not the second one.

That's because the rows are not well balanced across both destination shards when you read them in primary key order. The rows of the first shard come first, then the rest.

This problem was exposed by my recent decoupling of the number of chunks from the number of concurrently processed chunks. Before this fix, only one of the 6 chunks was processed at a time. That means vtworker wasn't writing to the second destination shard yet while the test triggered the reparent.
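To make the failure mode concrete, here is a minimal sketch (with hypothetical names and numbers; the real vtworker chunking logic is more involved) of why a single concurrent chunk starves the second destination shard when chunks are read in primary-key order:

```python
def destination_shard(row_pk, split_point=3000):
    """Rows below the split point belong to shard 0, the rest to shard 1."""
    return 0 if row_pk < split_point else 1

def chunks(row_count=4500, num_chunks=6):
    """Split the primary-key range into equal chunks, in pk order."""
    size = row_count // num_chunks
    return [range(i * size, (i + 1) * size) for i in range(num_chunks)]

def shards_written(num_concurrent):
    """Shards touched while only the first num_concurrent chunks run."""
    active = chunks()[:num_concurrent]
    return {destination_shard(pk) for chunk in active for pk in chunk}

# With 1 concurrent chunk, only shard 0 sees writes early on, so a
# reparent during that window can never trigger a retry on shard 1.
print(shards_written(1))  # -> {0}
print(shards_written(6))  # -> {0, 1}
```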

Other changes:

  • Reduced the number of rows for the reparent test. On my laptop this reduced the clone duration from ~50 seconds to ~10 seconds. That's long enough to issue a reparent in between.
  • Explicitly wait for the destination replicas to catch up, because that may take several seconds. (We write one row per query, which is why catching up takes that long.)
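The "wait for the replicas to catch up" step above can be sketched as a simple polling loop (hypothetical helper; the actual test likely compares row counts or replication positions via the test utilities):

```python
import time

def wait_for_replication_catchup(get_row_count, expected_rows,
                                 timeout_s=60.0, poll_interval_s=0.5):
    """Poll until the replica reports all expected rows, or fail on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if get_row_count() >= expected_rows:
            return
        time.sleep(poll_interval_s)
    raise AssertionError('replica did not catch up within %.0fs' % timeout_s)
```

Polling with an explicit timeout keeps the test deterministic: instead of sleeping a fixed amount (too short on slow hardware, wasteful on fast hardware), the test proceeds as soon as the replica is caught up.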

@alainjobart
Contributor

alainjobart commented Aug 26, 2016

LGTM. Hopefully the reduced duration doesn't introduce flakiness on different hardware.

Approved with PullApprove

@michael-berlin
Contributor Author

> LGTM. Hopefully the reduced duration doesn't introduce flakiness on different hardware.

I've actually increased the number of rows again, to 4500, after confirming that this works well on other machines too.

I also had to add the flag --min_rows_per_chunk 1 to make sure that we use at least two chunks.
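The effect of such a minimum-rows bound can be sketched as follows (hypothetical formula and names; the actual vtworker chunk-count logic may differ, but the point is that a large minimum caps how many chunks a small table can be split into, while a minimum of 1 leaves the requested chunk count intact):

```python
def effective_chunk_count(row_count, requested_chunks, min_rows_per_chunk):
    """Never split into chunks smaller than min_rows_per_chunk rows."""
    max_chunks = max(1, row_count // min_rows_per_chunk)
    return min(requested_chunks, max_chunks)

# With min_rows_per_chunk=1, even a small table still splits into the
# requested number of chunks, so the test always gets at least two.
print(effective_chunk_count(4500, 6, 1))     # -> 6
print(effective_chunk_count(4500, 6, 1000))  # -> 4
print(effective_chunk_count(4500, 6, 10000)) # -> 1
```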
