
test: worker.py: Fix worker test flakiness. #2000

Merged

michael-berlin merged 1 commit into vitessio:master from michael-berlin:unflake_worker

Aug 27, 2016

Conversation

@michael-berlin
Contributor

The test failed because a vtworker retry was only triggered for the first shard and not the second one.

That's because the rows are not well balanced across both destination shards when you read them in primary key order. The rows of the first shard come first, then the rest.

This problem was exposed by my recent decoupling of the number of chunks from the number of concurrently processed chunks. Before this fix, only one of the 6 chunks was processed at a time. That means vtworker wasn't writing to the second destination shard yet while the test triggered the reparent.
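To make the failure mode concrete, here is a minimal sketch (with hypothetical names and numbers; the real vtworker chunking logic is more involved) of why a single concurrent chunk starves the second destination shard when chunks are read in primary-key order:

```python
def destination_shard(row_pk, split_point=3000):
    """Rows below the split point belong to shard 0, the rest to shard 1."""
    return 0 if row_pk < split_point else 1

def chunks(row_count=4500, num_chunks=6):
    """Split the primary-key range into equal chunks, in pk order."""
    size = row_count // num_chunks
    return [range(i * size, (i + 1) * size) for i in range(num_chunks)]

def shards_written(num_concurrent):
    """Shards touched while only the first num_concurrent chunks run."""
    active = chunks()[:num_concurrent]
    return {destination_shard(pk) for chunk in active for pk in chunk}

# With 1 concurrent chunk, only shard 0 sees writes early on, so a
# reparent during that window can never trigger a retry on shard 1.
print(shards_written(1))  # -> {0}
print(shards_written(6))  # -> {0, 1}
```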

Other changes:

  • Reduced the number of rows for the reparent test. On my laptop this reduced the clone duration from ~50 seconds to ~10 seconds. That's long enough to issue a reparent in between.
  • Explicitly wait for the destination replicas to catch up, because that may take several seconds. (We write one row per query, which is why catching up takes that long.)
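The "wait for the replicas to catch up" step above can be sketched as a simple polling loop (hypothetical helper; the actual test likely compares row counts or replication positions via the test utilities):

```python
import time

def wait_for_replication_catchup(get_row_count, expected_rows,
                                 timeout_s=60.0, poll_interval_s=0.5):
    """Poll until the replica reports all expected rows, or fail on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if get_row_count() >= expected_rows:
            return
        time.sleep(poll_interval_s)
    raise AssertionError('replica did not catch up within %.0fs' % timeout_s)
```

Polling with an explicit timeout keeps the test deterministic: instead of sleeping a fixed amount (too short on slow hardware, wasteful on fast hardware), the test proceeds as soon as the replica is caught up.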

@alainjobart
Contributor

alainjobart commented Aug 26, 2016

LGTM. Hopefully the reduced duration doesn't introduce flakiness on different hardware.

Approved with PullApprove

@michael-berlin
Contributor Author

> LGTM. Hopefully the reduced duration doesn't introduce flakiness on different hardware.

I've actually increased the number of rows again, to 4500, after confirming that this works well on other machines too.

I also had to add the flag --min_rows_per_chunk 1 to make sure that we use at least two chunks.
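The effect of such a minimum-rows bound can be sketched as follows (hypothetical formula and names; the actual vtworker chunk-count logic may differ, but the point is that a large minimum caps how many chunks a small table can be split into, while a minimum of 1 leaves the requested chunk count intact):

```python
def effective_chunk_count(row_count, requested_chunks, min_rows_per_chunk):
    """Never split into chunks smaller than min_rows_per_chunk rows."""
    max_chunks = max(1, row_count // min_rows_per_chunk)
    return min(requested_chunks, max_chunks)

# With min_rows_per_chunk=1, even a small table still splits into the
# requested number of chunks, so the test always gets at least two.
print(effective_chunk_count(4500, 6, 1))     # -> 6
print(effective_chunk_count(4500, 6, 1000))  # -> 4
print(effective_chunk_count(4500, 6, 10000)) # -> 1
```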
