[jvm-packages] xgboost4j-spark Prediction Optimization #4307
Comments
@mingyang thanks for reporting the issue and the analysis. The reason we didn't use any per-row approach (e.g. a UDF, or appending to Row) is that per-instance prediction with XGBoost is very slow due to the overhead of creating a DMatrix, etc. I don't have a workaround better than the second one you already mentioned for now. I am thinking about providing some built-in tools that overlap with Spark ML in functionality (e.g. cross-validation) but offer better performance, with special care for the characteristics of XGBoost. What you mentioned here could be a very good use case for those tools.
OK, this makes sense to me now. @CodingCat, how about this workaround, then? In pseudo-Scala code:
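(A hedged reconstruction of the idea; `booster` and `toLabeledPoint` are hypothetical stand-ins for a trained model and a Row-to-LabeledPoint converter.)

```scala
import ml.dmlc.xgboost4j.scala.DMatrix
import org.apache.spark.sql.Row

// Sketch (hypothetical `booster` and `toLabeledPoint`): build ONE DMatrix per
// partition instead of one per row, so the DMatrix-creation overhead is paid
// once per partition.
val withPreds = df.rdd.mapPartitions { rows =>
  val buffered = rows.toArray                        // keep the partition in memory
  val dmat = new DMatrix(buffered.iterator.map(toLabeledPoint))
  val preds = booster.predict(dmat)                  // Array[Array[Float]], one entry per row
  buffered.iterator.zip(preds.iterator).map { case (row, p) =>
    Row.fromSeq(row.toSeq :+ p(0))                   // append the prediction to the Row
  }
}
```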
Please forgive my syntax errors, since I don't normally write Scala code, but I hope the idea is clear. If the memory footprint of keeping the whole partition in memory is too high (which I don't think is a real issue, since Spark's default partition size when reading from HDFS is only 128MB), then we can do multiple mini-batches (e.g. every 100 or 1000 rows) within each partition, as sketched below. This way we don't keep another copy of the whole partition in memory, only one mini-batch.
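(The mini-batch variant, again a sketch with the same hypothetical helpers and a hypothetical batch size of 1000:)

```scala
// Same idea, but only one mini-batch is materialized at a time.
val withPreds = df.rdd.mapPartitions { rows =>
  rows.grouped(1000).flatMap { batch =>
    val dmat = new DMatrix(batch.iterator.map(toLabeledPoint))
    batch.iterator.zip(booster.predict(dmat).iterator).map { case (row, p) =>
      Row.fromSeq(row.toSeq :+ p(0))
    }
  }
}
```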
We do experience some issues with loading everything into memory (you can check #4033), and actually in your code, the iterator […]
Pardon my Scala snippet; it was only meant to illustrate the main idea. If we weren't using […]. So, is it possible to leave this tradeoff to the users by providing either 1) two prediction functions (e.g. […]), or 2) […]?
We intentionally avoid using too much memory in XGBoost itself to improve scalability, and since you have to use more memory within something like […], I think persisting before training will not bring additional overhead here.
Motivation
Recently, I've been testing whether xgboost4j-spark could be a viable solution in production. One problem surfaced when multiple models were applied to the same dataset in sequence.
In an industrial setting, it's very likely that a practitioner builds multiple models on the same dataset, with the same features but different labels for different tasks. For example, one might want to predict the demographic attributes of her users (gender, age, marital status, etc.) from the same set of base features.
Ideal Situation
Once multiple models have been trained on the labeled dataset, one wants to make predictions on a (potentially) much bigger unlabeled dataset. In pseudo code, it looks like this:
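(A sketch with placeholder names `models` and `unlabeledDF`:)

```scala
// Apply every trained model to the same DataFrame, in sequence.
val scored = models.foldLeft(unlabeledDF) { (df, model) =>
  model.transform(df) // each model appends its prediction column(s)
}
```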
Ideally, this only loops through the dataset ONCE while applying all models.
Current Observations
In a toy setup
I've tried applying 4 models, using xgboost4j-spark and Spark logistic regression (LR), to a small dataset and monitored the IO. What's far from ideal is that when applying multiple xgboost4j-spark models, the number of input records grows exponentially with the number of models applied.
In other words, if there are N models, Spark will read the dataset 2^N times. By contrast, native Spark LR reads the dataset only once, no matter how many models are applied.
Even when applying just one model, the current implementation reads 2X the dataset.
In a more realistic setup
This is where I first noticed the problem.
Suspected Reasons
In its implementation, xgboost4j-spark uses the RDD interface with the `zipPartitions` method, while Spark ML models use Spark SQL UDFs to do prediction. This could make a huge difference in the execution plans.
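To illustrate the suspicion, here is a hedged sketch of the pattern (not the actual xgboost4j-spark source; `predictPartition` is a hypothetical batch-prediction function): when the prediction RDD is derived from the input RDD and then zipped back onto it, both branches re-evaluate the input's lineage unless it is cached.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Sketch of the suspected pattern (hypothetical `predictPartition`).
def transform(input: RDD[Row]): RDD[Row] = {
  // Branch 1: predictions derived from `input`.
  val preds: RDD[Float] = input.mapPartitions(predictPartition)
  // Branch 2: zip the predictions back onto the SAME `input`.
  // Both branches recompute `input`'s lineage unless it is cached, so one
  // transform reads the source twice, and N chained transforms read it 2^N times.
  input.zipPartitions(preds) { (rows, ps) =>
    rows.zip(ps).map { case (row, p) => Row.fromSeq(row.toSeq :+ p) }
  }
}
```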
Workaround
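In line with the persisting discussed in the comments above, one mitigation is to cache each intermediate DataFrame so that both zip branches read cached blocks instead of re-running the whole lineage from source. A minimal sketch, with the same placeholder names as before:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Persist each input before transform, so the zip inside every transform
// hits the cache instead of re-evaluating the whole lineage from source.
val scored: DataFrame = models.foldLeft(unlabeledDF) { (df, model) =>
  model.transform(df.persist(StorageLevel.MEMORY_AND_DISK))
}
```

This trades memory/disk for IO: the 2^N source reads become a single scan plus cache hits.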
Suggested Solutions
1. Do prediction through Spark SQL UDFs (i.e. `spark.sql.functions.udf`; see the sketch below), OR
2. Append predictions to `spark.sql.Row`s to avoid zipping
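For reference, solution 1 might look like the sketch below (`booster` is again a hypothetical trained model, shipped to executors via the closure; note the maintainers' caveat above that a one-row DMatrix per record makes per-instance prediction slow):

```scala
import ml.dmlc.xgboost4j.scala.DMatrix
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// One execution plan, no zipping: the input is read once no matter how many
// prediction columns are added. The cost is a one-row DMatrix per record.
val predictUdf = udf { features: Vector =>
  val dense = features.toArray.map(_.toFloat)
  booster.predict(new DMatrix(dense, 1, dense.length))(0)(0)
}
val scored = df.withColumn("prediction", predictUdf(col("features")))
```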