This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

Add training and test functions to integrate the native XGBoost library #281

Merged: 43 commits merged from maropu:XgboostIntegration into myui:master on Sep 6, 2016

Conversation

@maropu (Contributor) commented Apr 27, 2016

I'm working on integrating XGBoost.
As a first step, each UDTF training worker simply outputs a single XGBoost model, and a testing phase then computes prediction values for each built model and averages them.

-- `train_xgboost` outputs (string model_id, byte[] model); each UDTF task
-- creates an XGBoost model with a unique ID.
CREATE TABLE xgboost_models AS
  SELECT train_xgboost(add_bias(features), label)
    FROM training_data;

-- prediction
CREATE TABLE test_data_with_id AS
  SELECT rowid() AS rowid, features
    FROM test_data;

SELECT rowid, AVG(predicted) AS predicted
  FROM (
      SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        FROM xgboost_models CROSS JOIN test_data_with_id
    ) t
  GROUP BY rowid;

<version>0.5</version>
<scope>system</scope>
<systemPath>${basedir}/lib/xgboost4j-0.5-jar-with-dependencies.jar</systemPath>
</dependency>
@maropu (Contributor, Author) commented

Currently, xgboost4j is not available in the Maven repository. I'll ask the author later why the library has not been uploaded there.

@maropu (Contributor, Author) commented Apr 27, 2016

Another obvious problem is that the native xgboost library depends on several shared libraries:

$ otool -L libxgboost4j.dylib 
libxgboost4j.dylib:
        jvm-packages/lib/libxgboost4j.so (compatibility version 0.0.0, current version 0.0.0)
        /usr/local/lib/gcc/4.9/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.20.0)
        /usr/local/lib/gcc/4.9/libgomp.1.dylib (compatibility version 2.0.0, current version 2.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)
        /usr/local/lib/gcc/4.9/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)

Also, xgboost4j does not bundle a native library for each platform by default; that is, you need to compile it for your own platform. This could make it harder to deploy xgboost4j in a large heterogeneous computing cluster. To mitigate these portability and deployment issues, we might need to fix them directly in the original xgboost4j code. Otherwise, we need to fork it and fix them there.
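
A minimal sketch of how the right per-platform binary could be chosen at runtime by OS and architecture; the class name and the /lib/<os>-<arch>/ resource layout are assumptions for illustration, not the actual code in this PR:

public final class XGBoostNativeResolver {
    private XGBoostNativeResolver() {}

    // Hypothetical resource layout inside the jar: /lib/<os>-<arch>/<library file>
    public static String bundledResourcePath() {
        final String os = System.getProperty("os.name").toLowerCase();
        final String arch = System.getProperty("os.arch").toLowerCase();
        final String libName;
        if (os.contains("mac")) {
            libName = "libxgboost4j.dylib";
        } else if (os.contains("windows")) {
            libName = "xgboost4j.dll";
        } else {
            libName = "libxgboost4j.so";
        }
        final String osDir = os.contains("mac") ? "osx" : os.contains("windows") ? "windows" : "linux";
        return "/lib/" + osDir + "-" + arch + "/" + libName;
    }
}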

@maropu (Contributor, Author) commented Apr 27, 2016

I think snappy-java is a good precedent for these kinds of issues: it bundles per-platform native binaries inside its jar and extracts and loads the right one at runtime.
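
For reference, a minimal sketch of that snappy-java-style self-extraction approach, assuming a hypothetical loader class and that the platform-specific binary is bundled on the classpath (the resource path would come from a resolver like the sketch above):

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public final class BundledNativeLoader {
    private BundledNativeLoader() {}

    /** Extracts a native library bundled in the jar to a temp file and loads it. */
    public static synchronized void load(final String resourcePath) throws IOException {
        try (InputStream in = BundledNativeLoader.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Native library not found on the classpath: " + resourcePath);
            }
            final File tmp = File.createTempFile("xgboost4j-native", ".lib");
            tmp.deleteOnExit();
            // Copy the bundled binary into a temporary file and load it by absolute path.
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.getAbsolutePath());
        }
    }
}

A production loader would also need to guard against concurrent extraction and reuse of the extracted file across class loaders.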

@maropu maropu mentioned this pull request Apr 27, 2016
@maropu (Contributor, Author) commented Apr 27, 2016

I asked the xgboost4j developers about this here.

@myui (Owner) commented Apr 27, 2016

It would be better to create our own xgboost4j-native package that includes the native libraries if they are not willing to do it themselves.

@maropu (Contributor, Author) commented Apr 27, 2016

okay

@@ -359,6 +359,21 @@ public static double getAsConstDouble(@Nonnull final ObjectInspector numberOI)
+ TypeInfoUtils.getTypeInfoFromObjectInspector(numberOI));
}

@SuppressWarnings("unchecked")
@myui (Owner) commented Apr 27, 2016

Please move this to Primitives; it is not related to Hive. Also, I think asFloat is not necessary, while asFloatArray is useful.

@maropu (Contributor, Author) replied
okay, I'll fix it.
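
For context, a rough sketch of what such an asFloatArray helper in a Hive-independent Primitives utility might look like; the exact signature and placement in this PR may differ:

import java.util.List;

public final class Primitives {
    private Primitives() {}

    /** Converts a list of numbers to a primitive float array (sketch; actual API may differ). */
    public static float[] asFloatArray(final List<? extends Number> values) {
        if (values == null) {
            return null;
        }
        final float[] result = new float[values.size()];
        for (int i = 0; i < result.length; i++) {
            final Number v = values.get(i);
            result[i] = (v == null) ? Float.NaN : v.floatValue();
        }
        return result;
    }
}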

@maropu maropu force-pushed the XgboostIntegration branch 2 times, most recently from 84b4c80 to 6461121 on April 29, 2016 02:05
@maropu (Contributor, Author) commented May 1, 2016

I implemented an XGBoost multiclass classifier:

-- `train_multiclass_xgboost_classifier` outputs (string model_id, byte[] model); each UDTF task
-- creates an XGBoost model with a unique ID.
CREATE TABLE xgboost_models AS
  SELECT train_multiclass_xgboost_classifier(add_bias(features), label)
    FROM training_data;

-- prediction
CREATE TABLE test_data_with_id AS
  SELECT rowid() AS rowid, features
    FROM test_data;

SELECT rowid, max_row(avg_prob_per_label, label)
  FROM (
      SELECT rowid, label, AVG(probability) AS avg_prob_per_label
        FROM (
            SELECT xgboost_multiclass_predict(rowid, features, model_id, model) AS (rowid, label, probability)
              FROM xgboost_models CROSS JOIN test_data_with_id
          ) t1
        GROUP BY rowid, label
    ) t2
  GROUP BY rowid;

@maropu (Contributor, Author) commented May 1, 2016

I also implemented an XGBoost binary classifier:

-- `train_xgboost_classifier` outputs (string model_id, byte[] model); each UDTF task
-- creates an XGBoost model with a unique ID.
CREATE TABLE xgboost_models AS
  SELECT train_xgboost_classifier(add_bias(features), label)
    FROM training_data;

-- prediction
CREATE TABLE test_data_with_id AS
  SELECT rowid() AS rowid, features
    FROM test_data;

SELECT rowid, CAST((CASE WHEN AVG(predicted) >= 0.5 THEN 1.0 ELSE 0.0 END) AS FLOAT) AS predicted
  FROM (
      SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        FROM xgboost_models CROSS JOIN test_data_with_id
    ) t
  GROUP BY rowid;

@maropu maropu changed the title from "[WIP] Add training and test functions to integrate the native XGBoost library" to "Add training and test functions to integrate the native XGBoost library" on May 1, 2016
@myui myui added this to the v0.4 milestone May 2, 2016
@myui myui self-assigned this May 2, 2016
@coveralls commented May 10, 2016

Coverage decreased (-0.8%) to 32.884% when pulling bfe447f on maropu:XgboostIntegration into 8f880ab on myui:master.

@myui (Owner) commented May 10, 2016

See also the discussion in #251.

@coveralls commented Sep 1, 2016

Coverage decreased (-0.7%) to 35.171% when pulling 4f5706c on maropu:XgboostIntegration into 01b6434 on myui:master.

@coveralls commented Sep 1, 2016

Coverage decreased (-0.7%) to 35.171% when pulling b57f2d9 on maropu:XgboostIntegration into 01b6434 on myui:master.

@maropu (Contributor, Author) commented Sep 1, 2016

Here is how to use the XGBoost functions with Spark DataFrames:

// Load libsvm-formatted training data 
val trainDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.train")

// Load test data
val testDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.test")
  .withColumn("rowid", rowid()).cache

// Set XGBoost options here
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32")

// Build models with XGBoost and write them out to persistent storage
trainDf.train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write.format(xgboost).save("xgboost_models")

// Load models from the storage
val model = sqlContext.sparkSession.read.format(xgboost).load("xgboost_models")

// Do prediction
val predict = model.join(testDf).xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid").avg()
  .toDF("rowid", "predicted")

// Evaluate the mean absolute error of the predictions
val result = predict.join(testDf, predict("rowid") === testDf("rowid"), "INNER")
result.select(avg(abs($"predicted" - $"label"))).show

@coveralls commented Sep 5, 2016

Changes Unknown when pulling 73d8090 on maropu:XgboostIntegration into myui:master.

@maropu (Contributor, Author) commented Sep 5, 2016

Okay, I removed the dependencies on libgcc and libstdc++.
Could you check this again?

@@ -9,13 +9,12 @@
 <relativePath>../../pom.xml</relativePath>
 </parent>

-<artifactId>hivemall-spark</artifactId>
 <name>Hivemall on Spark</name>
+<artifactId>hivemall-spark_2.11</artifactId>
@myui (Owner) commented

Why 2.11? hivemall-spark-v1 and hivemall-spark-v2 might be better artifact IDs.

@maropu (Contributor, Author) replied Sep 5, 2016

This means the jar supports Spark 2.0 compiled with Scala 2.11. The naming rule follows the Spark jars (e.g., spark-core_2.11).

@coveralls commented Sep 5, 2016

Changes Unknown when pulling 0ad666f on maropu:XgboostIntegration into myui:master.

@maropu maropu force-pushed the XgboostIntegration branch 2 times, most recently from 377d61c to 4c0e33c on September 5, 2016 12:13
@coveralls commented

Changes Unknown when pulling 4c0e33c on maropu:XgboostIntegration into myui:master.

@maropu maropu force-pushed the XgboostIntegration branch 3 times, most recently from e0cb90d to 0ad666f on September 5, 2016 22:55
@coveralls commented Sep 5, 2016

Changes Unknown when pulling a8f4cf2 on maropu:XgboostIntegration into myui:master.

@coveralls commented

Changes Unknown when pulling 3296a82 on maropu:XgboostIntegration into myui:master.

@coveralls commented

Changes Unknown when pulling 826b390 on maropu:XgboostIntegration into myui:master.

@coveralls commented Sep 6, 2016

Changes Unknown when pulling e6889dc on maropu:XgboostIntegration into myui:master.

@myui myui merged commit 4f9037e into myui:master Sep 6, 2016
@maropu (Contributor, Author) commented Sep 6, 2016

FYI: if you get errors from the xgboost binary bundled in the jar, you can compile the binary for your own platform: mvn -Pcompile-xgboost clean package

@myui (Owner) commented Sep 6, 2016

@maropu LGTM. Merged. Well done.
