
[BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction #4388

Merged: 6 commits into dmlc:master on Apr 26, 2019

Conversation

@sperlingxx (Contributor) commented Apr 20, 2019

Fixes issue #4387 by replacing the zipped RDDs with caching of the original data in a closure.

Corresponding unit tests have been added.

closes #4387

closes #4307

@CodingCat (Member)

Thanks, yeah, this is a bug which is not easy to fix.

This PR actually falls back to the previous problem with memory footprint; you can check f368d0d#diff-a435450e9c28607f848ccf3246944a44

Let me think about the right way to fix it. If we go with sorting, we need to do significant perf benchmarking before merging.

@CodingCat changed the title from "[jvm-packages] fix #4387" to "[jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction" on Apr 20, 2019
@sperlingxx (Contributor, Author)

@CodingCat
Thank you for replying so quickly.
Actually, this solution is quite like v0.81's implementation.
I have a couple of ideas to improve it (maybe not good ones); a rough sketch of idea 1 follows this list:

  1. Do prediction in a mini-batch style.
    Maybe we can add a param like batch size?
  2. Add params idColumn and appendColumns.
    We filter out the other columns to reduce the memory overhead of caching the original data.
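A minimal sketch of what idea 1 might look like, assuming a hypothetical batchSize parameter and a placeholder predictBatch function (neither is part of the existing xgboost4j-spark API):

import org.apache.spark.sql.Row

// Hypothetical sketch of idea (1): predict in mini-batches instead of
// materializing the whole partition. `batchSize` and `predictBatch` are
// placeholder names, not actual xgboost4j-spark API.
def predictInMiniBatches(rows: Iterator[Row],
                         batchSize: Int,
                         predictBatch: Seq[Row] => Iterator[Row]): Iterator[Row] =
  // grouped() is lazy, so only one batch of rows is held in memory per task.
  rows.grouped(batchSize).flatMap(predictBatch)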

@CodingCat changed the title to "[BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction" on Apr 22, 2019
@hcho3 mentioned this pull request on Apr 22, 2019
@CodingCat (Member)

@sperlingxx can you elaborate more on the second approach?

@CodingCat (Member)

At the same time, I am benchmarking what happens if we sortWithinPartitions beforehand.

@CodingCat (Member) commented Apr 22, 2019

I tested with a prototype that sorts each partition beforehand:

without sorting, we need 6+ minutes to finish the prediction over 120G of input;

with sorting, it grows to 12+ minutes to finish the same task.
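For reference, sorting within partitions in Spark is a one-liner; this sketch assumes a stable key column (here called rowId, a hypothetical name) exists to sort on:

import org.apache.spark.sql.DataFrame

// Sketch of the "sort each partition beforehand" prototype: sort every
// partition by an assumed stable key column before prediction. The extra
// sort is what roughly doubles the prediction time in the numbers above.
def sortPartitions(df: DataFrame): DataFrame =
  df.sortWithinPartitions("rowId")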

@CodingCat (Member)

I am experimenting with several potential solutions and have found more problems in our implementation; I will update soon.

@sperlingxx (Contributor, Author) commented Apr 23, 2019

@CodingCat
For the second approach: we keep as few columns as we can before the batched data fetching (transformInternal), so we cache less of the original DataFrame (inputRDD).

Something like:

dataset.toDF().select($(appendColumns).map(col): _*)

And I wonder whether splitting the prediction task into mini-batches is everything we need?
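A slightly fuller sketch of that second idea, assuming hypothetical idColumn and appendColumns params (these are not existing xgboost4j-spark params):

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col

// Hypothetical sketch of idea (2): keep only the id column, the columns the
// caller wants carried through to the output, and the features column, so
// much less of the original DataFrame has to be cached during prediction.
def pruneForPrediction(dataset: Dataset[_],
                       idColumn: String,
                       appendColumns: Seq[String],
                       featuresColName: String): DataFrame = {
  val keep = (idColumn +: appendColumns :+ featuresColName).distinct
  dataset.toDF().select(keep.map(col): _*)
}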

@CodingCat (Member) commented Apr 23, 2019

OK, so I essentially tried three approaches to resolve the issue, and found more problems in XGBoost along the way.

  1. approaches

I tried sorting the DataFrame before feeding it to transformInternal(), duplicating the dataset like the implementation here, and mini-batching.

  2. benchmark of different approaches

I trained a model on an internal dataset with 1.5B rows and around 20 features, then loaded the model to predict over the training dataset in a separate Spark application.

To scale the test, I manually duplicated the dataset in the Spark application; the benchmark results only count the time spent on the prediction stage.

used code: https://github.com/CodingCat/xgboost4j-spark-scalability/blob/master/src/main/scala/me/codingcat/xgboost4j/PureXGBoostPredictor.scala

  3. benchmark results

[benchmark results image]

Resources I used: --num-executors 100 --executor-memory 14g --executor-cores 8

  4. problems here

Booster's prediction method is, and has to be, a synchronized method:

private synchronized float[][] predict(DMatrix data,

I think that in the C++ layer there is some sharing among different boosters in the same process (I didn't get enough time to debug it). If we make the method non-synchronized, we hit a lot of double-free errors in the native layer on the prediction code path (I have tried saving the booster and loading it back to create a new booster, and even just using the broadcast booster with a non-synchronized version of the method).

Because of this synchronization, we create more context switches when using the mini-batch approach; that's why we see the results above (mini-batch is a bit slower when scalability is not the issue).
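To make the contention point concrete, here is an illustrative sketch (not the PR's actual transformInternal code) of why smaller batches mean more serialized predict calls against the shared broadcast booster:

import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
import org.apache.spark.broadcast.Broadcast

// Illustrative sketch only: tasks on the same executor share the broadcast
// Booster, whose prediction path is synchronized, so every call queues on
// the same lock. Smaller mini-batches mean more predict calls per partition,
// hence more blocking and context switches.
def predictAllBatches(boosterBc: Broadcast[Booster],
                      batches: Iterator[DMatrix]): Iterator[Array[Array[Float]]] =
  batches.map(dm => boosterBc.value.predict(dm))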

@CodingCat (Member)

The actions here:

  1. I will submit a PR to @sperlingxx's branch soon.

  2. Fix the sharing issue in the Booster after this is merged (not blocking, though).

@hcho3 (Collaborator) commented Apr 23, 2019

@CodingCat BTW, you have permission to directly modify all PRs as a maintainer

@CodingCat (Member)

@hcho3 how do I modify it with a bunch of changes?

@hcho3 (Collaborator) commented Apr 23, 2019

@CodingCat You can get a local clone of this PR by running

git clone --recursive https://github.com/sperlingxx/xgboost -b hot_fix_spark_estimator

Then create a commit with a bunch of changes. You should have permission to run git push origin hot_fix_spark_estimator

@CodingCat (Member)

I see... thx

@sperlingxx (Contributor, Author)

@CodingCat
I'm not sure whether the implementation here will work in a mini-batch way with the whole pipeline, so I rewrote the classification part in a purely lazy (iterator) style. I hope it is helpful :)

@CodingCat (Member)

@sperlingxx can you explain how your implementation is a purely lazy (iterator) style and why my suggested implementation is not?

private var batchCnt = 0

private val batchIterImpl = rowIterator.grouped(
XGBoostClassificationModel.PREDICTION_BATCH_SIZE).flatMap { batchRow =>
(Member)

batchRow is a Seq[Row] instead of an Iterator here, so it is not lazily evaluated and stays in memory until this batch is finished

(Member)

but we should think about the memory footprint in this place, as the "grouped iterator" content has been put in memory twice

Rabit.init(rabitEnv.asJava)
}

val features = batchRow.map(row => row.getAs[Vector]($(featuresCol)))
(Member)

Regarding the memory footprint, you have put two Seqs in memory: one Seq[Row] and one Seq[Vector].

You can compare with my implementation; it keeps only a Seq[Row], thanks to iterator.duplicate().

(Contributor, Author)

Oh, that's an unnecessary footprint. Maybe it can be replaced by:

val features = batchRow.iterator.map(row => row.getAs[Vector]($(featuresCol)))
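For context, a sketch of how the batch loop looks with that change: only batchRow (one Seq[Row] per mini-batch) is materialized, and the feature vectors are derived lazily from it. How the feature iterator is then turned into a DMatrix is elided here, since that part of the PR's code is not shown in this thread; batchSize and featuresColName stand in for the model's params.

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Sketch of the batch loop after the suggested fix: one Seq[Row] per batch,
// features derived lazily from it, no second Seq[Vector] in memory.
def batchedFeatures(rowIterator: Iterator[Row],
                    batchSize: Int,
                    featuresColName: String): Iterator[Iterator[Vector]] =
  rowIterator.grouped(batchSize).map { batchRow =>
    batchRow.iterator.map(row => row.getAs[Vector](featuresColName))
  }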

@CodingCat (Member)

This is a significant change regarding performance, and we need to be very careful about correctness as well.

Can you also use an internal dataset for evaluation?

@CodingCat (Member) commented Apr 24, 2019

OK, here are my latest benchmark results; it looks like the current implementation is slower than the diff I pushed yesterday.

[benchmark results image]

My theory is still that it's the synchronized method in Booster.

@sperlingxx (Contributor, Author) commented Apr 25, 2019


Thanks for benchmarking!

I'm a little confused about the context-switch cost caused by synchronized prediction. Is it because multiple Spark tasks run concurrently on each executor and share the same Booster instance? And since they share the same booster handle, the method has to be marked synchronized?

@CodingCat (Member)

Yes, because we are using a broadcast booster, which is a singleton per executor; as for why we use the broadcast booster, you can check my previous comments.

@CodingCat (Member)

I left more comments there; @sperlingxx would you please move forward with the PR.

My suggestion is: use your approach for the next version, and look at how to resolve the shared properties among boosters after that.

val trainingDM = new DMatrix(Classification.train.iterator)
val testDM = new DMatrix(Classification.test.iterator)
val trainingDF = buildDataFrame(Classification.train)
val testDF = buildDataFrame(Classification.test)
val randSortedTestDF = buildDataFrameWithRandSort(Classification.test)
(Member)

let's separate them into two tests to highlight the rand-sorted version and the normal version of the test

(Contributor, Author)

okay!

@@ -25,11 +25,12 @@ import org.scalatest.FunSuite

class XGBoostRegressorSuite extends FunSuite with PerTest {

test("XGBoost-Spark XGBoostRegressor ouput should match XGBoost4j: regression") {
test("XGBoost-Spark XGBoostRegressor output should match XGBoost4j: regression") {
(Member)

make the test name consistent with the classifier part

@@ -47,7 +47,7 @@
* the prediction of these DMatrices will become faster than not-cached data.
* @throws XGBoostError native error
*/
Booster(Map<String, Object> params, DMatrix[] cacheMats) throws XGBoostError {
public Booster(Map<String, Object> params, DMatrix[] cacheMats) throws XGBoostError {
(Member)

this might not be necessary, maybe my bad

<<<<<<< HEAD
=======

>>>>>>> regressor impl
(Member)

my bad

@CodingCat (Member)

LGTM, thanks, will merge after CI is happy

@CodingCat merged commit 2d875ec into dmlc:master on Apr 26, 2019
@sperlingxx deleted the hot_fix_spark_estimator branch on April 27, 2019
raydouglass pushed a commit to rapidsai/xgboost that referenced this pull request Jun 26, 2019
dmlc#4388

Hot Fix info

Author:     Xu Xiao <[email protected]>
AuthorDate: Sat Apr 27 02:09:20 2019 +0800
Commit:     Nan Zhu <[email protected]>
CommitDate: Fri Apr 26 11:09:20 2019 -0700

    [BLOCKING][jvm-packages] fix non-deterministic order within a partition  (in the case of an upstream shuffle) on prediction  (dmlc#4388)

    * [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBooostModel.transformInternal

    * apply minibatch in prediction

    * an iterator-compatible minibatch prediction

    * regressor impl

    * continuous working on mini-batch prediction of xgboost4j-spark

    * Update Booster.java
The lock bot locked this conversation as resolved and limited it to collaborators on Jul 26, 2019
Successfully merging this pull request may close these issues.

[jvm-packages] bug of XGBoostModel.transformInternal
[jvm-packages] xgboost4j-spark Prediction Optimization