Cannot dump heavily skewed data in parallel #75
Comments
The original reproduction case:
Proposed implementation:
Scenario 1: evenly distributed data in the range `[0, 3333]` with 3 workers, where every number in the range is present.
Scenario 2: heavily skewed data, `[1, 10000]` ∪ `[12345678]`, with 10 workers.
Notice how in scenario 2 the algorithm narrows the range down exponentially to the uniform part and then starts to use all 10 threads.
Can we close it now?
Fixed on TiDB 3.0+ by
Example:
Currently Dumpling splits the range using only the min and max values, ignoring the actual data distribution. With skewed data, one thread ends up dumping all the small IDs while the rest dump only zero or one row each.
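To make the imbalance concrete, here is a minimal sketch of equal-width splitting applied to the skewed data set from this issue (`[1, 10000]` plus the single ID `12345678`, 10 workers). The function names `split_by_min_max` and `rows_per_chunk` are hypothetical and are not Dumpling's actual code; the table is simulated with a sorted in-memory list.

```python
from bisect import bisect_left, bisect_right

def split_by_min_max(lo, hi, workers):
    """Split the inclusive ID range [lo, hi] into equal-width chunks.

    Hypothetical helper: only min and max are used, the data is never inspected.
    """
    width = (hi - lo + 1 + workers - 1) // workers  # ceiling division
    chunks = []
    start = lo
    while start <= hi:
        end = min(start + width - 1, hi)
        chunks.append((start, end))
        start = end + 1
    return chunks

def rows_per_chunk(sorted_ids, chunks):
    """Count how many IDs fall into each inclusive chunk."""
    return [
        bisect_right(sorted_ids, end) - bisect_left(sorted_ids, start)
        for start, end in chunks
    ]

# Skewed data from the issue: a dense block plus one far-away outlier.
ids = list(range(1, 10001)) + [12345678]
chunks = split_by_min_max(ids[0], ids[-1], 10)
counts = rows_per_chunk(ids, chunks)
print(counts)  # [10000, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

The first chunk receives all 10000 dense rows, the last chunk receives the outlier, and the eight chunks in between are empty.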
We could either introduce mydumper's bisection algorithm, or use a work-stealing algorithm for the completed threads to "steal" the unprocessed range from the working threads.
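As a rough illustration of the bisection idea, the sketch below keeps halving a range until it holds at most `max_rows` rows and discards empty ranges. This is an assumption about the general technique, not a reproduction of mydumper's implementation; `count_rows` and `bisect_chunks` are hypothetical names, and the row count is simulated on a sorted list where a real tool would need a count or estimate from the database.

```python
from bisect import bisect_left, bisect_right

def count_rows(sorted_ids, lo, hi):
    """Simulated row count for the inclusive range [lo, hi] (stand-in for a DB query)."""
    return bisect_right(sorted_ids, hi) - bisect_left(sorted_ids, lo)

def bisect_chunks(sorted_ids, lo, hi, max_rows):
    """Return ascending inclusive ranges, each holding between 1 and max_rows rows.

    Assumes IDs are unique, so a single-ID range never exceeds one row.
    """
    stack = [(lo, hi)]
    chunks = []
    while stack:
        a, b = stack.pop()
        n = count_rows(sorted_ids, a, b)
        if n == 0:
            continue  # nothing to dump here, drop the range
        if n <= max_rows or a == b:
            chunks.append((a, b))
            continue
        mid = (a + b) // 2
        # Push the right half first so the left half is processed first.
        stack.append((mid + 1, b))
        stack.append((a, mid))
    return chunks

ids = list(range(1, 10001)) + [12345678]
balanced = bisect_chunks(ids, ids[0], ids[-1], max_rows=1000)
sizes = [count_rows(ids, a, b) for a, b in balanced]
print(len(balanced), max(sizes))
```

Because every resulting chunk is non-empty and bounded in size, the chunks can be handed to workers from a shared queue and all threads stay busy. The trade-off is the extra count probes: the recursion depth grows with the logarithm of the ID range width, not with the number of rows.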