[DISCUSSION] Integration with PySpark #1698

CodingCat · 2016-10-25T13:48:42Z

I just noticed that there are some requests for integration with PySpark http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

I also received some emails from the users discussing the same topic

I would like to initialize a discussion here on whether/when we shall start this work

@tqchen @terrytangyuan

terrytangyuan · 2016-10-25T15:58:11Z

@CodingCat Do you know how big is PySpark community? Most people just use Scala API. It seems like a lot of things need to be re-implemented in Python - please correct me if I am wrong.

CodingCat · 2016-10-25T17:05:41Z

I think PySpark is pretty prevalent in the community of Data Scientists, that means in the scenario of quick prototyping, etc. I heard about many cases that data scientists use pySpark to analyze the large volumes of data.

On the other side, most of production-level scenarios are based on Scala API (I only know a single case that people are using PySpark in large scale production)

terrytangyuan · 2016-10-25T17:18:25Z

Yeah I just feel like the current python API should be able to handle most prototyping needs. I personally care more about Spark when we want more production-ready stuff. Perhaps we should leave the discussion here so people can discuss their needs. In the meantime, it would be great if you can provide some details on approaches/estimates/steps for the integration.

CodingCat · 2016-11-02T23:52:20Z

just noticed some discussions in the community

http://apache-spark-developers-list.1001551.n3.nabble.com/Blocked-PySpark-changes-td19712.html

it seems that the development of PySpark is lagging behind...as the downstream library, I vote to hold on to dedicate to PySpark integration.....

terrytangyuan · 2016-11-03T02:18:42Z

Yeah it's also hard to debug into any issues you encountered (at least when I was trying it last year)...

ckljohn · 2017-02-06T08:37:20Z

In the roadmap (#873), it said Distributed python has been implemented. Does it mean that xgboost can run on a hadoop cluster with python? (I'm not meaning pyspark)

tqchen · 2017-02-06T18:31:58Z

yes, see the example posted in the link

yiming-chen · 2017-02-10T03:48:59Z

what is the difference of running xgboost on hadoop cluster with python vs. running xgboost on hadoop cluster with scala api? Are there major performance differences?
I think there are still a lot of people using pyspark for production model as well.

CodingCat · 2017-02-10T05:18:16Z

@yiming-chen the goal of xgboost4j-spark is to unify ETL and model training in the same pipeline

the question comes down to what language users use when doing ETL? based on my observation and experience, 95% users are building their ETL system with scala

berch · 2017-03-28T19:50:16Z

@CodingCat I don't know where you got your 95% stat from, but PySpark is definitely widely used in my experience. For example, we are trying to integrate Airflow to schedule the job for our pipeline and Python would be suitable in that situation.

CodingCat · 2017-03-28T21:24:44Z

@berch PySpark is widely used by you and you are going to integrate with airflow...is it relevant with what I said?

zherebetskyy · 2017-05-03T22:34:35Z

@CodingCat @tqchen Data Science community will definitely benefit from XGboost been implemented in PySpark, because:

In general, Python is Missing -lgomp when installing #3 language as of March 2017 with 10.2% of popularity (vs 1.8% for Scala)
http://redmonk.com/sogrady/2017/03/17/language-rankings-1-17/
https://jobsquery.it/stats/language/group;
In terms of performance PySpark vs Scala, I would assume, it does not matter that much because it's almost all Scala under the hood for Spark, right?
I know, at least, 3 AI startups that use PySpark for production and, more generally, Python is much more popular for Data Scientists and in demand on job market (
http://r4stats.com/articles/popularity/)
My conclusion: XGboost implemented in PySpark will have largest impact for DataScience among all other implementations.
(P.S. once stable, make sure, Cloudera implements it)

CodingCat · 2017-05-04T14:23:43Z

feel free to send a PR , you will find the cost

CodingCat · 2017-05-04T14:52:25Z

to avoid coming back to thread time and time again, I will close the discussion with the conclusion that

I personally will not vote to continue this effort to integrate PySpark (for now)
anyone else is more than welcomed to contribute on this, however, we need to at least consider the following things:
- do not introduce another python package
- backward compatibility to the current python API when implement integration
- handle the lagging behind features of pyspark's ML

haiy · 2018-01-11T08:24:34Z

So we can't use pyspark to load XGBoost-spark model? @CodingCat

wpopielarski · 2018-04-12T13:10:03Z

so, actually scala XGBoost can be more less painfully wrapped in PySpark JavaEstimator API. I played a little and have such prototype:

from pyspark.ml.wrapper import JavaEstimator, JavaModel
from pyspark.ml.param.shared import *
from pyspark.ml.util import *
from pyspark.context import SparkContext

class XGBoost(JavaEstimator, JavaMLWritable, JavaMLReadable, HasRegParam, HasElasticNetParam):

    def __init__(self, paramMap = {}):
        super(XGBoost, self).__init__()
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        self._java_obj = self._new_java_obj(
            "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap = {}):
        return self._set(paramMap)

    def _create_model(self, javaTrainingData):
        return JavaModel(javaTrainingData)

I think it still needs some work but I was able to run Xgboost in PySpark with it.

sratakon · 2018-04-14T17:25:19Z

Wieslaw thanks for sharing the code snippet on XGBoost PySpark wrapper. Can you share the code on invoking the XGBoost class with the appropriate parameters?

Thanks

AakashBasuRZT · 2018-05-16T10:21:34Z

@wpopielarski it is an awesome job you did. Can you please share the code on invoking the XGBoost with the parameters needed? That would be a great help!

wpopielarski · 2018-05-23T10:48:45Z

this is something like:

        from app.xgboost import XGBoost
        xgboost_params = {
            "eta"  : 0.023,
            "max_depth" : 10,
            "min_child_weight" : 0.3,
            "subsample" : 0.7,
            "colsample_bytree" : 0.82,
            "colsample_bylevel" : 0.9,
            "base_score" : base_score,
            "eval_metric" : "auc",
            "seed" : 49,
            "silent" : 1,
            "objective" : "binary:logistic",
            "round" : 10,
            "nWorkers" : 2,
            "useExternalMemory" : True
        }
        xgboost_estimator = XGBoost.XGBoost(xgboost_params)
...
        model = xgboost_estimator.fit(data)

thesuperzapper · 2018-05-25T03:30:15Z

I am getting close to doing a PR with proper PySpark support.

AakashBasuRZT · 2018-05-25T05:25:39Z

@thesuperzapper , that's great!

How long do you think it would take to wrap it up? Do share insights while progressing.

Thanks!

haiy · 2018-06-11T12:33:22Z

hi, I write a simple version with ParamGridBuilder in case anyone interested, it's really easy to customize it.

1 create a package dir mkdir -p ml/dmlc/xgboost4j/scala in any valid PYTHONPATH dir.
2 copy code below to ml/dmlc/xgboost4j/scala/spark.py

from pyspark.ml.classification import JavaClassificationModel, JavaMLWritable, JavaMLReadable, TypeConverters, Param, \
    Params, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol, SparkContext
from pyspark.ml.wrapper import JavaModel, JavaWrapper, JavaEstimator


class XGBParams(Params):
    '''

    '''
    eta = Param(Params._dummy(), "eta",
                "step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinks the feature weights to make the boosting process more conservative",
                typeConverter=TypeConverters.toFloat)
    max_depth = Param(Params._dummy(), "max_depth",
                      "maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting. 0 indicates no limit, limit is required for depth-wise grow policy.range: [0,∞]",
                      typeConverter=TypeConverters.toInt)
    min_child_weight = Param(Params._dummy(), "min_child_weight",
                             "minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will berange: [0,∞]",
                             typeConverter=TypeConverters.toFloat)
    max_delta_step = Param(Params._dummy(), "max_delta_step",
                           "Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Set it to value of 1-10 might help control the update.",
                           typeConverter=TypeConverters.toInt)
    subsample = Param(Params._dummy(), "subsample",
                      "subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.",
                      typeConverter=TypeConverters.toFloat)
    colsample_bytree = Param(Params._dummy(), "colsample_bytree",
                             "subsample ratio of columns when constructing each tree",
                             typeConverter=TypeConverters.toFloat)
    colsample_bylevel = Param(Params._dummy(), "colsample_bylevel",
                              "subsample ratio of columns for each split, in each level.",
                              typeConverter=TypeConverters.toFloat)
    max_leaves = Param(Params._dummy(), "max_leaves",
                       "Maximum number of nodes to be added. Only relevant for the ‘lossguide’ grow policy.",
                       typeConverter=TypeConverters.toInt)

    def __init__(self):
        super(XGBParams, self).__init__()

class XGBoostClassifier(JavaEstimator, JavaMLWritable, JavaMLReadable, XGBParams,
                        HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol):
    def __init__(self, paramMap={}):
        super(XGBoostClassifier, self).__init__()
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap={}):
        return self._set(paramMap)

    def _create_model(self, java_model):
        return XGBoostClassificationModel(java_model)


class XGBoostClassificationModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaMLReadable):

    def getBooster(self):
        return self._call_java("booster")

    def saveBooster(self, save_path):
        jxgb = JavaWrapper(self.getBooster())
        jxgb._call_java("saveModel", save_path)

3 play it as a normal pyspark model!

thesuperzapper · 2018-06-11T21:23:07Z

@AakashBasuRZT @haiy, we are now working on this properly in Issue #3370, with PR #3376 providing initial support.

sagnik-rzt · 2018-06-18T12:12:16Z

@haiy could you show me a snippet of code that fits the classifier on some arbitary dataset? I have followed points 1 and 2 which you have outlined, but I am not able to understand your 3rd point.

haiy · 2018-06-19T03:34:11Z

@sagnik-rzt check this sample

sagnik-rzt · 2018-06-19T07:14:11Z

@haiy I'm trying to run this:

import pyspark
import pandas as pd
from dmlc.xgboost4j.scala.spark import XGBoostClassifier
from sklearn.utils import shuffle

sc = pyspark.SparkContext('local[2]')
spark = pyspark.sql.SparkSession(sc)
df = pd.DataFrame({'x1': range(10), 'x2': [10] * 10, 'y': shuffle([0 for i in range(5)] + [1 for i in range(5)])})
sdf = spark.createDataFrame(df)
X = sdf.select(['x1', 'x2'])
Y = sdf.select(['y'])
print(X.show(5))

params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
xgb_model = XGBoostClassifier(params)

and it is giving this exception :

Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/auto_ML/pyspark_xgboost.py", line 20, in <module>
    xgb_model = XGBoostClassifier(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 47, in __init__
    self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/auto_ML/pyspark_xgboost.py", line 20, in <module>
    xgb_model = XGBoostClassifier(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 47, in __init__
    self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception ignored in: <bound method JavaParams.__del__ of XGBoostClassifier_4f9eb5d1388e9e1424a4>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 105, in __del__
    SparkContext._active_spark_context._gateway.detach(self._java_obj)
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1897, in detach
    java_object._detach()
AttributeError: 'NoneType' object has no attribute '_detach'

Environment:
Python 3.6
Spark 2.3
Scala 2.11

wpopielarski · 2018-06-19T08:21:17Z

@sagnik-rzt
not sure but do you add xgboost-spark.jar with deps on spark classpath?

sagnik-rzt · 2018-06-19T09:58:38Z

@wpopielarski Hey no I haven't done that. Any idea where I can find that jar file?

haiy · 2018-06-19T11:06:42Z

@sagnik-rzt hi, ~~please download the jar here, it's the official spark4j fat jar.~~ . Sorrry, I just found my jar was built based on mac, just try to build it as @wpopielarski suggest. And put it in the spark deps dir, like $SPARK_HOME/jars .

wpopielarski · 2018-06-19T13:05:49Z

need to build it your own :), with maven and profile assembly which builds fat jar in jvm-packages/xgboost-spark/target or so.

wpopielarski · 2018-06-20T07:58:05Z

@sagnik-rzt not sure what you are going to do but to build fat jar for your OS just clone dmlc xgboost github project, cd to jvm-packages and run mvn with assemby profile. I don't have faintest idea how to write gradle build file.

sagnik-rzt · 2018-06-29T12:30:04Z

Okay so I have built a fat jar with dependencies and then copy-pasted it to $SPARK_HOME/jars.
However, the same exception still pertains:

Traceback (most recent call last):
  File "/home/sagnikb/PycharmProjects/xgboost/test_import.py", line 21, in <module>
    clf = xgb(params)
  File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 48, in __init__
    self._java_obj = self._new_java_obj("dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

wpopielarski · 2018-06-30T08:25:10Z

sorry, but you run it on cluster, locally from some IDE project? If you are using spark-submit it is better to add deps to --jars switch 2018-06-29 14:30 GMT+02:00 sagnik-rzt <[email protected]>:

…

Okay so I have built a fat jar with dependencies and then copy-pasted it to $SPARK_HOME/jars. However, the same exception still pertains: Traceback (most recent call last): File "/home/sagnikb/PycharmProjects/xgboost/test_import.py", line 21, in <module> clf = xgb(params) File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 48, in __init__ self._java_obj = self._new_java_obj("dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap) File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj return java_obj(*java_args) TypeError: 'JavaPackage' object is not callable — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1698 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALEzS3JmKjO0AZ6JMzcixwCce0_3zRM0ks5uBh3ogaJpZM4KgAY_> .

thesuperzapper · 2018-07-04T01:45:08Z

I am currently working on rebasing #3376 to the new spark branch, In the mean time, a few people have asked for how to use the current code on XGBoost-0.72.

Here is a zip file with the pyspark code for XGBoost-0.72.
Download: sparkxgb.zip

All you need to do is:

Add the normal Scala XGBoost jars and dependencies to your job. (e.g. using --jars or the spark.jars config).
Once the job has started, run this in python: (Or any local location which every executor can see)

sc.addPyFile("hdfs:///XXXX/XXXX/XXXX/sparkxgb.zip")

Test with the following code. (Assuming you have moved sample_binary_classification_data.txt to a reachable location, it's normally in $SPARK_HOME/data/mllib/sample_binary_classification_data.txt)

from sparkxgb import XGBoostEstimator

# Load Data
dataPath = "sample_binary_classification_data.txt"
dataDF = spark.read.format("libsvm").load(dataPath)

# Split into Train/Test
trainDF, testDF = dataDF.randomSplit([0.8, 0.2], seed=1000)

# Define and train model
xgboost = XGBoostEstimator(
    # General Params
    nworkers=1, nthread=1, checkpointInterval=-1, checkpoint_path="",
    use_external_memory=False, silent=0, missing=float("nan"),
    
    # Column Params
    featuresCol="features", labelCol="label", predictionCol="prediction", 
    weightCol="weight", baseMarginCol="baseMargin", 
    
    # Booster Params
    booster="gbtree", base_score=0.5, objective="binary:logistic", eval_metric="error", 
    num_class=2, num_round=2, seed=None,
    
    # Tree Booster Params
    eta=0.3, gamma=0.0, max_depth=6, min_child_weight=1.0, max_delta_step=0.0, subsample=1.0,
    colsample_bytree=1.0, colsample_bylevel=1.0, reg_lambda=0.0, alpha=0.0, tree_method="auto",
    sketch_eps=0.03, scale_pos_weight=1.0, grow_policy='depthwise', max_bin=256,
    
    # Dart Booster Params
    sample_type="uniform", normalize_type="tree", rate_drop=0.0, skip_drop=0.0,
    
    # Linear Booster Params
    lambda_bias=0.0
)
xgboost_model = xgboost.fit(trainDF)

# Transform test set
xgboost_model.transform(testDF).show()

# Write model/classifier
xgboost.write().overwrite().save("xgboost_class_test")
xgboost_model.write().overwrite().save("xgboost_class_test.model")

Note:

This will only work for Spark 2.2+
Pipelines and ParamGridBuilder are kind of supported, use the modified pipeline object with from sparkxgb.pipeline import XGBoostPipeline,XGBoostPipelineModel as you would the normal objects.
You have to use float("+inf") for missing values rather than float("nan") for null values to be treated correctly due to an error in XGboost-0.72.
You cannot load back untrained model objects, (See [jvm-packages] ML Pipelines Load/Save and train error #3035)
This API will change with a full release of pyspark support.

BogdanCojocar · 2018-07-08T10:42:25Z

@thesuperzapper I'm trying to test this with pyspark on the jupyter notebook.

My system:
python 3.6.1
xgboost 0.72
spark 2.2.0
java 1.8
scala 2.12

When I'm trying to load the XGBoostEstimator I get:

Exception in thread "Thread-19" java.lang.NoClassDefFoundError: ml/dmlc/xgboost4j/scala/EvalTrait
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
	at java.lang.Class.privateGetPublicMethods(Class.java:2902)
	at java.lang.Class.getMethods(Class.java:1615)
	at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:272)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.EvalTrait
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 12 more

Is this a bug or am I missing some requirements?

thesuperzapper · 2018-07-08T11:26:22Z

@BogdanCojocar It seems like your missing the xgboost library.

You need both these jars for xgboost to work properly:

You can download the required jars from those maven links.

BogdanCojocar · 2018-07-08T12:47:44Z

Thanks @thesuperzapper. Works fine. Great job with this integration to pyspark!

ericwang915 · 2018-07-11T04:24:37Z

Any suggestion on how to save the trained model to booster for loading in python module?

thesuperzapper · 2018-07-11T07:30:53Z

@ericwang915 Typically to get a model which inter-operates with other XGBoost libraries, you would use the .booster.saveModel("XXX/XXX") method on the model object, where XXX is a local (non HDFS) path on the Spark driver. If you use other saving method you get an error (See: #2480)

However, I forgot to add a method to call the save function in that version of the wrapper, I will do this tomorrow if I get time. (I live in NZ... so timezones)

ericwang915 · 2018-07-11T08:30:34Z

Thank you. By the way, during the training process, there is no log showing the evaluation metrics and boosting round even though the silent is set as 1.

ccdtzccdtz · 2018-08-16T22:04:08Z

@thesuperzapper thanks for the instruction. I was able to follow your instruction to train/save xgboost model in pyspark. Any idea on how to access other xgboost model function like (scala)getFeatureScore()?

thesuperzapper · 2018-08-20T08:27:37Z

@ccdtzccdtz currently I am rewiring the pyspark wrapper since 0.8 had massive changes to the Spark API, when finished, I aim to have feature parity with the Spark Scala API.

I did not expose the native booster method in my initial pyspark wrapper, but if you use the Spark Scala API, you can call xgboost_model_object.nativeBooster.getFeatureScore( and use it like normal.

nitinkak001 · 2018-08-29T19:10:11Z

I have seen XGBoost on pyspark failing consistently if it is run 2 or more times. I am running it on the same dataset with the same code. First time it succeeds but the second time and subsequently it fails. I am using XGBoost 0.72 on Spark 2.3. I have to restart the pyspark shell to run the job successfully again.

I use xgboost.trainWithDataFrame for training purposes.

Has anyone seen this issue?

sagnik-rzt · 2018-09-04T12:03:16Z

Hi @thesuperzapper
What you prescribed works for me on a single worker node.
However, when I try to run pyspark xgboost on a using more than one worker (3 in this case), the executors become idle and shut down after a while.
This is the code I'm trying to run on the Titanic dataset (which is a small dataset):

from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = SparkSession\
        .builder\
        .appName("PySpark XGBOOST Titanic")\
        .getOrCreate()

#spark.sparkContext.addPyFile("../sparkxgb.zip")

from automl.sparkxgb import XGBoostEstimator

schema = StructType(
  [StructField("PassengerId", DoubleType()),
    StructField("Survival", DoubleType()),
    StructField("Pclass", DoubleType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", DoubleType()),
    StructField("SibSp", DoubleType()),
    StructField("Parch", DoubleType()),
    StructField("Ticket", StringType()),
    StructField("Fare", DoubleType()),
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
  ])

df_raw = spark\
  .read\
  .option("header", "true")\
  .schema(schema)\
  .csv("titanic.csv")


df = df_raw.na.fill(0)

sexIndexer = StringIndexer() \
    .setInputCol("Sex") \
    .setOutputCol("SexIndex") \
    .setHandleInvalid("keep")

cabinIndexer = StringIndexer() \
    .setInputCol("Cabin") \
    .setOutputCol("CabinIndex") \
    .setHandleInvalid("keep")

embarkedIndexer = StringIndexer() \
    .setInputCol("Embarked") \
    .setOutputCol("EmbarkedIndex") \
    .setHandleInvalid("keep")

vectorAssembler  = VectorAssembler()\
  .setInputCols(["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare", "CabinIndex", "EmbarkedIndex"])\
  .setOutputCol("features")

xgboost = XGBoostEstimator(nworkers=2,
    featuresCol="features",
    labelCol="Survival",
    predictionCol="prediction"
)

pipeline = Pipeline().setStages([sexIndexer, cabinIndexer, embarkedIndexer, vectorAssembler, xgboost])
trainDF, testDF = df.randomSplit([0.8, 0.2], seed=24)

model  =pipeline.fit(trainDF)
print(trainDF.schema)

This is the stack trace:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.5, DMLC_TRACKER_PORT=9093, DMLC_NUM_WORKER=3}2018-09-04 08:52:55 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.43: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.2018-09-04 08:52:55 ERROR AsyncEventQueue:91 - Interrupted while posting to TaskFailedListener. Removing that listener.java.lang.InterruptedException: ExecutorLost during XGBoost Training: ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)

The executor is stuck at:
org.apache.spark.RDD.foreachPartition(RDD.scala:927) ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:348)

Environment: Python 3.5.4, Spark Version 2.3.1, Xgboost 0.72

nitinkak001 · 2018-09-04T18:44:27Z

Can you share your xgboost and spark configurations? How many workers(xgboost workers), spark executors, cores etc.

…

-Nitin

On Tue, Sep 4, 2018 at 5:03 AM sagnik-rzt ***@***.***> wrote: Hi @thesuperzapper <https://github.com/thesuperzapper> What you prescribed works for me on a single worker node. However, when I try to run pyspark xgboost on a using more than one worker, the executors become idle and shut down after a while. This is the stack trace: ''' Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.5, DMLC_TRACKER_PORT=9093, DMLC_NUM_WORKER=3}2018-09-04 08:52:55 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.43: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.2018-09-04 08:52:55 ERROR AsyncEventQueue:91 - Interrupted while posting to TaskFailedListener. Removing that listener.java.lang.InterruptedException: ExecutorLost during XGBoost Training: ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) ''' — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1698 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJY-XklxsZB_FE7ZoAarV_fqw8D3JqxWks5uXmwqgaJpZM4KgAY_> .

thesuperzapper · 2018-09-05T11:19:43Z

@sagnik-rzt I am surprised that works at all, as that pyspark wrapper only support XGboost 0.72, we are still working on the 0.8 one.

vectosaurus · 2018-09-06T07:46:26Z

@thesuperzapper, based on the version you provided, I redid some parts to support xgboost 0.80.
However, I am ending up with an py4j.protocol.Py4JError: ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier does not exist in the JVM error. I have provided a full description here. Could you take a look?

All the codes are placed here.

thesuperzapper · 2018-09-06T11:01:20Z

There is quite a lot more changes needed than you have made to make it work with 0.8.

The main reason I haven't just pushed out a 0.8 version, is because I really don't want to make an xgboost specific pipeline object like I have in the 0.72, I am working on a way to hopefully have the pyspark xgboost object work with the default pipeline persistence.

vectosaurus · 2018-09-07T10:28:03Z

@thesuperzapper, while using the codes for 0.72, I used the XGBoostEstimator object directly that is without using the XGBoostPipeline object. And while doing so, I noticed that the training/fitting doesn't get distributed across the workers on the cluster. Is it necessary to use the XGBoostPipeline for distribution across the workers?

If that's not the case, do you know why the training doesn't get distributed across the workers?

Update
I tried the training by setting XGBoostEstimator as a stage in XGBoostPipeline but the issue persists. Training doesn't get distributed across workers when I run it on the cluster while it does for other pyspark supported models.

Have you observed this behavior? How do I tackle it?

thesuperzapper · 2018-09-09T22:17:03Z

I have mostly re-coded the wrapper for XGBoost 0.8, but as my work cluster is still on 2.2, I cant test it easily in distributed mode, as my Dockerized Spark 2.3 cluster cant even train Scala XGBoost distributed models without getting shuffle location missing issues.

I think the issues @sagnik-rzt and others are experiencing, are related to your cluster config or some deeper issue with Spark-Scala XGBoost.

Are you able to train a model in Spark-Scala XGBoost?

vectosaurus · 2018-09-10T06:30:11Z

Thanks @thesuperzapper, I thought shuffle locations were handled internally, that is, it would be taken care of independent of the cluster config. But I found this stackoverflow post so will be implementing those suggestions.

Also, could you share your 0.8 version, if it is ready? I can test for distribution on my cluster. It has spark 2.3.1 and python 3.5.

anaveenan · 2018-09-12T00:37:28Z

after saving the model and loading getting the following error

IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'

can you please help with this . thanks

import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark.sparkContext.addPyFile("sparkxgb.zip")
from sparkxgb import XGBoostEstimator
schema = StructType(
  [StructField("PassengerId", DoubleType()),
    StructField("Survival", DoubleType()),
    StructField("Pclass", DoubleType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", DoubleType()),
    StructField("SibSp", DoubleType()),
    StructField("Parch", DoubleType()),
    StructField("Ticket", StringType()),
    StructField("Fare", DoubleType()),
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
  ])

df_raw = spark\
  .read\
  .option("header", "true")\
  .schema(schema)\
  .csv("train.csv")

 df = df_raw.na.fill(0)

 sexIndexer = StringIndexer()\
  .setInputCol("Sex")\
  .setOutputCol("SexIndex")\
  .setHandleInvalid("keep")
    
cabinIndexer = StringIndexer()\
  .setInputCol("Cabin")\
  .setOutputCol("CabinIndex")\
  .setHandleInvalid("keep")
    
embarkedIndexer = StringIndexer()\
  .setInputCol("Embarked")\
  .setOutputCol("EmbarkedIndex")\
  .setHandleInvalid("keep")

vectorAssembler = VectorAssembler()\
  .setInputCols(["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare", "CabinIndex", "EmbarkedIndex"])\
  .setOutputCol("features")
xgboost = XGBoostEstimator(
    featuresCol="features", 
    labelCol="Survival", 
    predictionCol="prediction"
)

pipeline = Pipeline().setStages([sexIndexer, cabinIndexer, embarkedIndexer, vectorAssembler, xgboost])
model = pipeline.fit(df)
model.transform(df).select(col("PassengerId"), col("prediction")).show()

model.save("model_xgboost")
loadedModel = Pipeline.load("model_xgboost")


IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'


#predict2 = loadedModel.transform(df)

Tried the following option

from pyspark.ml import PipelineModel
#model.save("model_xgboost")
loadedModel = PipelineModel.load("model_xgboost")

Getting the following errror

No module named ml.dmlc.xgboost4j.scala.spark

thesuperzapper · 2018-09-12T12:26:46Z

DEV DOWNLOAD LINK: sparkxgb.zip

This version will work with XGBoost-0.8, but please dont use it for anything other than testing, or contributing to this thread, as stuff will change.
(Also note: I have removed all the backports for Spark 2.2, so this only supports Spark 2.3)

The main issue I am aware of with that version, is that classification models wont load back after being saved, giving the error: TypeError: 'JavaPackage' object is not callable. However, strangely XGBoostPipelineModel works just fine with an XGBoost classification stage. This leads me to think is an issue on my end, can someone verify if reading classification models works for them?

Regardless, I am attempting to properly implement DefaultParamsWritable, which would remove the need for the dedicated XGBoostPipeline, which will be way easier to maintain long term, so the read/write issue should become irrelevant anyway. (This also might allow persistence on CrossValidator to work)

CodingCat added type: python discussion labels Oct 25, 2016

CodingCat changed the title ~~[DISCUSS] Integration with PySpark~~ [DISCUSSION] Integration with PySpark Oct 25, 2016

CodingCat closed this as completed May 4, 2017

thesuperzapper mentioned this issue Sep 30, 2018

[jvm-packages] PySpark Support Checklist #3370

Closed

5 tasks

BogdanCojocar mentioned this issue Oct 15, 2018

Issues with XGBoostEstimator cell BogdanCojocar/medium-articles#1

Closed

lock bot locked as resolved and limited conversation to collaborators Dec 11, 2018

[DISCUSSION] Integration with PySpark #1698

[DISCUSSION] Integration with PySpark #1698

Comments

CodingCat commented Oct 25, 2016

terrytangyuan commented Oct 25, 2016

CodingCat commented Oct 25, 2016

terrytangyuan commented Oct 25, 2016

CodingCat commented Nov 2, 2016

terrytangyuan commented Nov 3, 2016

ckljohn commented Feb 6, 2017

tqchen commented Feb 6, 2017

yiming-chen commented Feb 10, 2017

CodingCat commented Feb 10, 2017

berch commented Mar 28, 2017

CodingCat commented Mar 28, 2017

zherebetskyy commented May 3, 2017 • edited Loading

CodingCat commented May 4, 2017

CodingCat commented May 4, 2017

haiy commented Jan 11, 2018 • edited Loading

wpopielarski commented Apr 12, 2018

sratakon commented Apr 14, 2018

AakashBasuRZT commented May 16, 2018

wpopielarski commented May 23, 2018

thesuperzapper commented May 25, 2018

AakashBasuRZT commented May 25, 2018

haiy commented Jun 11, 2018

thesuperzapper commented Jun 11, 2018 • edited Loading

sagnik-rzt commented Jun 18, 2018 • edited Loading

haiy commented Jun 19, 2018

sagnik-rzt commented Jun 19, 2018 • edited Loading

wpopielarski commented Jun 19, 2018

sagnik-rzt commented Jun 19, 2018

haiy commented Jun 19, 2018 • edited Loading

wpopielarski commented Jun 19, 2018

wpopielarski commented Jun 20, 2018

sagnik-rzt commented Jun 29, 2018

wpopielarski commented Jun 30, 2018 via email

thesuperzapper commented Jul 4, 2018 • edited Loading

BogdanCojocar commented Jul 8, 2018 • edited Loading

thesuperzapper commented Jul 8, 2018

BogdanCojocar commented Jul 8, 2018

ericwang915 commented Jul 11, 2018

thesuperzapper commented Jul 11, 2018

ericwang915 commented Jul 11, 2018

ccdtzccdtz commented Aug 16, 2018

thesuperzapper commented Aug 20, 2018

nitinkak001 commented Aug 29, 2018 • edited Loading

sagnik-rzt commented Sep 4, 2018 • edited Loading

nitinkak001 commented Sep 4, 2018 via email

thesuperzapper commented Sep 5, 2018

vectosaurus commented Sep 6, 2018

thesuperzapper commented Sep 6, 2018

vectosaurus commented Sep 7, 2018

thesuperzapper commented Sep 9, 2018

vectosaurus commented Sep 10, 2018 • edited Loading

anaveenan commented Sep 12, 2018 • edited Loading

thesuperzapper commented Sep 12, 2018

zherebetskyy commented May 3, 2017 •

edited

Loading

haiy commented Jan 11, 2018 •

edited

Loading

thesuperzapper commented Jun 11, 2018 •

edited

Loading

sagnik-rzt commented Jun 18, 2018 •

edited

Loading

sagnik-rzt commented Jun 19, 2018 •

edited

Loading

haiy commented Jun 19, 2018 •

edited

Loading

thesuperzapper commented Jul 4, 2018 •

edited

Loading

BogdanCojocar commented Jul 8, 2018 •

edited

Loading

nitinkak001 commented Aug 29, 2018 •

edited

Loading

sagnik-rzt commented Sep 4, 2018 •

edited

Loading

vectosaurus commented Sep 10, 2018 •

edited

Loading

anaveenan commented Sep 12, 2018 •

edited

Loading