[BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction #4388
Conversation
…XGBooostModel.transformInternal
Thanks, yeah, this is a bug which is not easy to fix. This PR actually falls back to the previous problem about memory footprint; you can check f368d0d#diff-a435450e9c28607f848ccf3246944a44. Let me think about what the right way to fix it is. If we go with sorting, we need to do significant perf benchmarking before merging.
@CodingCat
@sperlingxx can you elaborate more on the second approach?
At the same time, I am benchmarking what happens if we sortWithinPartitions beforehand.
Without sorting, we need 6+ minutes to finish the prediction over 120G of input.
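The sortWithinPartitions idea above can be modeled without Spark. This is a hypothetical sketch (the names `Row` and `sortWithinPartitions` are stand-ins, not the project's API): each inner Seq represents one partition whose row order after an upstream shuffle is not deterministic, and sorting every partition independently by a stable key restores a deterministic order without a full global shuffle.

```scala
// Hypothetical model: each inner Seq stands for one Spark partition whose
// arrival order after an upstream shuffle can vary between runs. Sorting
// each partition by a stable key (the row id) makes the order deterministic
// without moving data across partitions.
object SortWithinPartitionsModel {
  case class Row(id: Long, feature: Double)

  def sortWithinPartitions(partitions: Seq[Seq[Row]]): Seq[Seq[Row]] =
    partitions.map(_.sortBy(_.id))

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Row(2, 0.5), Row(0, 0.1), Row(1, 0.3)), // order varies run to run
      Seq(Row(4, 0.9), Row(3, 0.7))
    )
    val sorted = sortWithinPartitions(partitions)
    println(sorted.map(_.map(_.id))) // List(List(0, 1, 2), List(3, 4))
  }
}
```

The per-partition sort is what the benchmark above is measuring the cost of, relative to the unsorted 6+ minute baseline.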
I am experimenting with several potential solutions and finding more problems in our implementation; will update soon.
@CodingCat something like:
And I think maybe splitting the prediction task into mini-batches is all we need?
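The mini-batch idea can be sketched in miniature. This is a hypothetical illustration (`PredictionBatchSize` and `predictBatch` are stand-ins for the PR's `PREDICTION_BATCH_SIZE` constant and the booster call): feeding the partition's row iterator to the model in fixed-size groups via `grouped` + `flatMap` keeps the outer traversal lazy, so only one batch is resident at a time.

```scala
// Hypothetical sketch of mini-batch prediction over a partition iterator.
object MiniBatchPredict {
  val PredictionBatchSize = 4 // stand-in for PREDICTION_BATCH_SIZE

  // stand-in for a call into the booster on one mini-batch
  def predictBatch(batch: Seq[Double]): Seq[Double] = batch.map(_ * 2.0)

  // grouped(n) yields batches lazily; flatMap stitches results back into
  // one lazy iterator, so at most one batch is materialized at a time
  def predict(rows: Iterator[Double]): Iterator[Double] =
    rows.grouped(PredictionBatchSize).flatMap(predictBatch)

  def main(args: Array[String]): Unit = {
    val out = predict(Iterator.tabulate(6)(_.toDouble)).toList
    println(out) // List(0.0, 2.0, 4.0, 6.0, 8.0, 10.0)
  }
}
```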
Ok, so I essentially tried three approaches to resolve the issue, and found more problems in XGBoost.
I tried sorting the DataFrame before feeding it to transformInternal(), duplicating the dataset like the implementation here, and miniBatch.
I trained a model on an internal dataset with 1.5b rows and around 20 features, then loaded the model to predict over the training dataset in a separate Spark application. To scale the test, I manually duplicated the dataset in the Spark application; the benchmark results only count the time spent in the prediction stage.
resources I used
The booster's prediction method is, and has to be, a synchronized method: xgboost/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java, Line 290 in bbe0dbd.
I think in the C++ layer there is some sharing among different boosters in the same process (I didn't have enough time to debug it). If we make it non-synchronized, we hit a lot of double-free errors in the native layer on the prediction code path (I have tried saving the booster and loading it back to create a new booster, and even just using the broadcasted booster with a non-synchronized version of the method) because of this.
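Why sharing a native handle forces serialization can be shown with a small, purely hypothetical model (`NativeHandle` and `SharedBooster` are stand-ins, not XGBoost's API): every task on the executor calls into the same handle, so the call must be guarded by one lock, and concurrent tasks queue up behind it, which is the context-switch cost discussed below.

```scala
// Hypothetical model of a shared, lock-guarded native handle.
object SynchronizedBoosterSketch {
  final class NativeHandle {
    private var calls = 0
    // not thread-safe by itself, like a raw native prediction call
    def rawPredict(n: Int): Int = { calls += 1; n * 2 }
    def callCount: Int = calls
  }

  final class SharedBooster(handle: NativeHandle) {
    // one lock per booster: all tasks sharing the broadcast booster
    // serialize here instead of corrupting the shared native state
    def predict(n: Int): Int = this.synchronized { handle.rawPredict(n) }
  }

  def main(args: Array[String]): Unit = {
    val handle  = new NativeHandle()
    val booster = new SharedBooster(handle)
    // eight "tasks" contending on one booster, as on one executor
    val threads = (1 to 8).map(i => new Thread(() => booster.predict(i)))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(handle.callCount) // 8
  }
}
```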
the actions here
@CodingCat BTW, you have permission to directly modify all PRs as a maintainer
@hcho3 how do I modify it with a bunch of changes?
@CodingCat You can get a local clone of this PR by running
Then create a commit with a bunch of changes. You should have permission to run
I see... thx
@CodingCat
@sperlingxx can you explain how your implementation is a
private var batchCnt = 0

private val batchIterImpl = rowIterator.grouped(
  XGBoostClassificationModel.PREDICTION_BATCH_SIZE).flatMap { batchRow =>
batchRow is a Seq[Row] instead of an Iterator here, so it's not lazily evaluated and stays in memory until this batch is finished.
But we should think about the memory footprint in this place, as the "grouped iterator" content has been put in memory twice.
Rabit.init(rabitEnv.asJava)
}
val features = batchRow.map(row => row.getAs[Vector]($(featuresCol))) |
Regarding the memory footprint, you have put two Seqs in memory: one Seq[Row] and one Seq[Vector].
You can compare with my implementation: it only keeps a Seq[Row], thanks to iterator.duplicate().
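The `Iterator.duplicate` trick referenced here can be shown in isolation (the pair data is illustrative): duplicating one iterator yields two cursors over the same stream, so rows can be consumed once for feature extraction and once for output assembly without building a second Seq; Scala buffers only the elements between the two cursors.

```scala
// Sketch of Iterator.duplicate: two independent cursors over one stream.
object DuplicateIteratorSketch {
  def main(args: Array[String]): Unit = {
    val rows = Iterator("a" -> 1, "b" -> 2, "c" -> 3)
    // after duplicate, use only the two returned iterators, not `rows`
    val (forFeatures, forOutput) = rows.duplicate
    val features = forFeatures.map(_._2).toList // consume cursor 1
    val labels   = forOutput.map(_._1).toList   // consume cursor 2
    println(features) // List(1, 2, 3)
    println(labels)   // List(a, b, c)
  }
}
```

Note the trade-off being debated: the buffer between the two cursors is itself memory, which is why the reviewer counts it as "only a Seq[Row]" rather than zero overhead.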
Oh, it's an unnecessary footprint. Maybe it can be replaced by
val features = batchRow.iterator.map(row => row.getAs[Vector]($(featuresCol)))
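The suggested fix in miniature, with illustrative pair data standing in for `Row` and `featuresCol`: mapping a Seq materializes a second collection, while mapping its iterator stays lazy and holds no extra Seq alongside the batch.

```scala
// Strict Seq.map vs lazy iterator.map over the same batch.
object LazyFeatureExtraction {
  def main(args: Array[String]): Unit = {
    val batchRow: Seq[(String, Double)] = Seq("r1" -> 0.1, "r2" -> 0.2)

    val strict: Seq[Double] = batchRow.map(_._2)               // second Seq in memory
    val lazyIt: Iterator[Double] = batchRow.iterator.map(_._2) // no extra Seq

    println(strict)        // List(0.1, 0.2)
    println(lazyIt.toList) // List(0.1, 0.2)
  }
}
```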
...ackages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala
This is a significant change regarding performance, and we need to be very careful about correctness as well. Can you also use an internal dataset for evaluation?
Thanks for benchmarking! I'm a little confused about the context-switch cost caused by synchronized prediction. Is it because there are multiple Spark tasks running on each executor concurrently, and they share the same Booster instance? What's more, they share the same booster handle, so the method has to be decorated with synchronized?
Yes, because we are using a broadcast booster, which is a singleton per executor. Regarding why we use a broadcasted booster, you can check my previous comments.
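The "singleton per executor" point can be modeled with a `lazy val` (purely illustrative; the real mechanism is a Spark broadcast variable deserialized once per executor JVM): every task thread reuses the one instance, which is exactly why they all contend on the same synchronized predict.

```scala
// Hypothetical model: one shared instance per JVM, reused by all task threads.
object ExecutorSingletonModel {
  @volatile private var instantiations = 0

  // lazy val initializes once per JVM, like a broadcast value per executor
  lazy val booster: String => Int = {
    instantiations += 1
    (s: String) => s.length
  }

  def main(args: Array[String]): Unit = {
    // four "tasks" on one "executor", all touching the shared instance
    val threads = (1 to 4).map(_ => new Thread(() => booster("row")))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(instantiations) // 1: every task thread shared the same instance
  }
}
```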
I left more comments there. @sperlingxx, would you please move forward with the PR? My suggestion is: use your way for the next version, and look at how to resolve the shared properties among boosters after that.
val trainingDM = new DMatrix(Classification.train.iterator)
val testDM = new DMatrix(Classification.test.iterator)
val trainingDF = buildDataFrame(Classification.train)
val testDF = buildDataFrame(Classification.test)
val randSortedTestDF = buildDataFrameWithRandSort(Classification.test)
Let's separate them into two tests, to highlight the rand-sorted version and the normal version of the test.
okay!
@@ -25,11 +25,12 @@ import org.scalatest.FunSuite

 class XGBoostRegressorSuite extends FunSuite with PerTest {

-  test("XGBoost-Spark XGBoostRegressor ouput should match XGBoost4j: regression") {
+  test("XGBoost-Spark XGBoostRegressor output should match XGBoost4j: regression") {
make the test name consistent with classifier part
@@ -47,7 +47,7 @@
    * the prediction of these DMatrices will become faster than not-cached data.
    * @throws XGBoostError native error
    */
-  Booster(Map<String, Object> params, DMatrix[] cacheMats) throws XGBoostError {
+  public Booster(Map<String, Object> params, DMatrix[] cacheMats) throws XGBoostError {
this might not be necessary, maybe my bad
<<<<<<< HEAD
=======
>>>>>>> regressor impl
my bad
LGTM, thanks, will merge after CI is happy
dmlc#4388 Hot Fix info

Author:     Xu Xiao <[email protected]>
AuthorDate: Sat Apr 27 02:09:20 2019 +0800
Commit:     Nan Zhu <[email protected]>
CommitDate: Fri Apr 26 11:09:20 2019 -0700

[BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction (dmlc#4388)

* [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBooostModel.transformInternal
* apply minibatch in prediction
* an iterator-compatible minibatch prediction
* regressor impl
* continuous working on mini-batch prediction of xgboost4j-spark
* Update Booster.java
Fixes issue #4387 by replacing zipped RDDs with caching the original data in the closure.
Corresponding unit tests have been added.
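The bug and the fix can be modeled in plain Scala (all names here are illustrative, not the PR's code): if predictions and original rows come from two separate passes over a shuffled upstream, each pass may see a different in-partition order, and zipping them misaligns rows; deriving both from a single pass keeps them aligned by construction.

```scala
// Minimal model of zip misalignment vs single-pass alignment.
object ZipMisalignmentModel {
  case class Row(id: Int, feature: Double) // feature encodes id / 10 for checking

  def main(args: Array[String]): Unit = {
    val pass1 = Seq(Row(1, 0.1), Row(2, 0.2), Row(3, 0.3))
    val pass2 = Seq(Row(2, 0.2), Row(3, 0.3), Row(1, 0.1)) // same data, other order

    // buggy pattern: features from pass1 zipped against ids from pass2
    val zipped = pass1.map(_.feature).zip(pass2.map(_.id))
    println(zipped.exists { case (f, id) => (f * 10).round.toInt != id }) // true: misaligned

    // fixed pattern: one pass emits (id, prediction) together, always aligned
    val aligned = pass1.map(r => (r.id, r.feature))
    println(aligned.forall { case (id, f) => (f * 10).round.toInt == id }) // true: aligned
  }
}
```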
closes #4387
closes #4307