This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

Add training and test functions to integrate the native XGBoost library #281

Merged
merged 43 commits into from
Sep 6, 2016
8882b74
Add training & test functions by using the native XGBoost library
maropu Apr 26, 2016
fe3f535
Add xgboost4j.jar in core/lib
maropu Apr 28, 2016
14fa18f
Add NativeLibLoader to load a custom-compiled xgboost library
maropu Apr 29, 2016
3f46afa
Update .gitignore
maropu Apr 29, 2016
ffdd020
Add a script to make a custom-built xgboost binary
maropu Apr 29, 2016
5d393a5
Use HadoopUtils.getTaskId() to generate unique ids
maropu Apr 29, 2016
f5f8c47
Remove unnecessary functions in HiveUtils
maropu Apr 29, 2016
f8e1779
Add entries in define-all.hive
maropu Apr 29, 2016
31b8ca6
Add XGBoostUDTF to provide common functionality for XGBoost
maropu Apr 30, 2016
b145bf4
Add XGBoostBinaryClassifierUDTF for binary classification
maropu Apr 30, 2016
2a6e03b
Add a `hivemall.xgboost.lib` property for loading user-defined native…
maropu Apr 30, 2016
83e2a8d
Add XGBoostMulticlassClassifierUDTF for multiclass classification
maropu May 1, 2016
29e7a46
Fix bugs in define-all-as-permanent.hive
maropu May 10, 2016
b016ad3
Rename an illegal file name
maropu May 10, 2016
dff1765
Support XGBoost functions on DataFrame/Spark
maropu Sep 1, 2016
56a1269
Update the XGBoost library
maropu Sep 1, 2016
25ab0ec
Remove system scope in core/pom.xml
maropu Sep 1, 2016
7c345cb
Update import-packages.spark
maropu Sep 1, 2016
9f55332
Add tests for train_xgboost_regr and train_xgboost_classifier
maropu Sep 1, 2016
84c92ae
Update bin/build_xgboost.sh
maropu Sep 1, 2016
6eb4def
Add tests for train_xgboost_multiclass_classifier
maropu Sep 1, 2016
79c4479
Update .travis.yml
maropu Sep 1, 2016
3414fb9
Move xgboost functions into a xgboost submodule
maropu Sep 2, 2016
1352a40
Update bin/build_xgboost.sh
maropu Sep 2, 2016
4e85fe5
Apply review comments
maropu Sep 2, 2016
60f65f9
Add -q options in .travis.yml
maropu Sep 2, 2016
1754a81
Add a xgboost binary for Linux/x86_64
maropu Sep 2, 2016
2c5e969
Fix bugs in spark/*/pom.xml
maropu Sep 2, 2016
6be073e
Add notations for XGBoost functions in HivemallOps
maropu Sep 2, 2016
7f98d12
Update .travis.yml to reduce # of executed tests
maropu Sep 3, 2016
fc9210f
Brush up exception handling
maropu Sep 4, 2016
308ba87
Update compilation options for xgboost
maropu Sep 4, 2016
29b5883
Add more tests for xgboost
maropu Sep 4, 2016
5a043ac
Remove unnecessary dependencies in pom.xml
maropu Sep 4, 2016
2a2c0d0
Build a jar for xgboost with portable binaries
maropu Sep 5, 2016
67b03d2
Remove static-links for libgcc and libstdc++
maropu Sep 5, 2016
73d8090
Add an option to enable static links in bin/build_xgboost.sh
maropu Sep 5, 2016
0ad666f
Move the property of scala.version into topdir/pom.xml
maropu Sep 5, 2016
7bf055f
Fix bugs in bin/build_xgboost.sh
maropu Sep 5, 2016
9b9d440
Fix version numbers for xgboost
maropu Sep 5, 2016
a8f4cf2
Add activeByDefault in pom.xml
maropu Sep 5, 2016
826b390
Fix version numbers for spark modules
maropu Sep 5, 2016
e6889dc
Add a profile to compile xgboost
maropu Sep 6, 2016

2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ scalastyle-output.xml
scalastyle.txt
derby.log
spark/bin/zinc-*
*.dylib
*.so
8 changes: 6 additions & 2 deletions .travis.yml
@@ -29,12 +29,16 @@ branches:
- master
- develop

before_install:
- mvn validate -Pxgboost

notifications:
email: false

script:
- mvn test -Pspark-1.6
- mvn test -Pspark-2.0
- mvn -q test -Pspark-2.0
# test the spark-1.6 module only in this second run
- mvn -q test -Pspark-1.6 -Dtest=org.apache.spark.*
Owner
I think it should be tested on both Spark 1.6 and Spark 2.0.
Before this change was committed, the tests ran successfully. Is this change required?

Contributor Author
Yes, I think so.
IIUC, the first command tests the Hivemall core modules (e.g., core, nlp, and mixserv) and the spark-2.0 module.
The second command tests only the spark-1.6 module, because the Hivemall core modules have already been tested by the first one.
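
A minimal sketch of the two-pass split described in this reply, assuming the profile ids and the `-Dtest` filter shown in the `.travis.yml` change. The helper name `ci_test_cmd` is hypothetical and only prints the commands rather than running Maven:

```shell
#!/bin/bash
set -eu

# Hypothetical helper (not part of this PR): print the Maven command for one CI pass.
#   $1 = Maven profile id, $2 = optional -Dtest filter
ci_test_cmd() {
  local profile="$1"
  local filter="${2:-}"
  local cmd="mvn -q test -P${profile}"
  if [ -n "$filter" ]; then
    cmd="$cmd -Dtest=${filter}"
  fi
  echo "$cmd"
}

# Pass 1: the core modules plus the spark-2.0 module
ci_test_cmd spark-2.0
# Pass 2: restricted to the Spark test classes, since core was covered in pass 1
ci_test_cmd spark-1.6 'org.apache.spark.*'
```

The filter is passed in single quotes so the shell does not expand `*` before Maven sees it.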


after_success:
- mvn clean cobertura:cobertura coveralls:report
64 changes: 64 additions & 0 deletions bin/build_xgboost.sh
@@ -0,0 +1,64 @@
#!/bin/bash

# Hivemall: Hive scalable Machine Learning Library
#
# Copyright (C) 2015 Makoto YUI
# Copyright (C) 2013-2015 National Institute of Advanced Industrial Science and Technology (AIST)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -eu
set -o pipefail

# Target commit hash value
XGBOOST_HASHVAL='85443403310e90bd8a90a1f817841520838b4ac7'

# Move to a top directory
if [ "${HIVEMALL_HOME:-}" == "" ]; then
  if [ -e ../bin/${0##*/} ]; then
    HIVEMALL_HOME=".."
  elif [ -e ./bin/${0##*/} ]; then
    HIVEMALL_HOME="."
  else
    echo "env HIVEMALL_HOME not defined"
    exit 1
  fi
fi

cd "$HIVEMALL_HOME"; HIVEMALL_HOME="$(pwd)"  # keep an absolute path for the later cds

# Final output dir for a custom-compiled xgboost binary
HIVEMALL_LIB_DIR="$HIVEMALL_HOME/xgboost/src/main/resources/lib/"
rm -rf $HIVEMALL_LIB_DIR >> /dev/null
mkdir -p $HIVEMALL_LIB_DIR

# Move to an output directory
XGBOOST_OUT="$HIVEMALL_HOME/target/xgboost-$XGBOOST_HASHVAL"
rm -rf $XGBOOST_OUT >> /dev/null
mkdir -p $XGBOOST_OUT
cd $XGBOOST_OUT

# Fetch xgboost sources
git clone --progress https://github.com/maropu/xgboost.git
cd xgboost
git checkout $XGBOOST_HASHVAL

# Resolve dependent sources
git submodule init
git submodule update

# Copy a built binary to the output
cd jvm-packages
ENABLE_STATIC_LINKS=1 ./create_jni.sh
cp ./lib/libxgboost4j.* "$HIVEMALL_LIB_DIR"
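
The script runs under `set -eu` and changes directory several times before the final `cp`, so the value of the top directory matters. A minimal sketch of one way to default an unset variable and resolve it to an absolute path is shown below. `resolve_home` and `HIVEMALL_HOME_DEMO` are hypothetical names used only for illustration, and the sketch assumes the script lives in `<top>/bin`:

```shell
#!/bin/bash
set -eu
set -o pipefail

# Hypothetical helper (not part of this PR): resolve the top directory
# to an absolute path.
#   $1 = candidate value (may be empty), $2 = directory containing the script
resolve_home() {
  local candidate="${1:-}"
  local script_dir="$2"
  if [ -z "$candidate" ]; then
    # Assumption: the script sits in <top>/bin, so the parent is the top directory
    candidate="$script_dir/.."
  fi
  # Run cd in a subshell so the caller's working directory is untouched
  (cd "$candidate" && pwd)
}

# Demo on a throwaway directory tree
demo_top="$(mktemp -d)"
mkdir -p "$demo_top/bin"
# ${VAR:-} expands to an empty string instead of aborting under `set -u`
resolved="$(resolve_home "${HIVEMALL_HOME_DEMO:-}" "$demo_top/bin")"
echo "$resolved"
rm -rf "$demo_top"
```

With an absolute path, later relative `cd` calls cannot change what the destination of the copy refers to.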

37 changes: 31 additions & 6 deletions pom.xml
@@ -43,6 +43,7 @@
<modules>
<module>core</module>
<module>nlp</module>
<module>xgboost</module>
<module>mixserv</module>
</modules>

@@ -52,6 +53,7 @@
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<protobuf.version>2.5.0</protobuf.version>
<protoc.path>${env.PROTOC_PATH}</protoc.path>
<scala.version>2.11.8</scala.version>
</properties>

<repositories>
@@ -70,25 +72,48 @@

<profiles>
<profile>
<id>spark-1.6</id>
<id>spark-2.0</id>
<modules>
<module>spark/spark-1.6</module>
<module>spark/spark-2.0</module>
<module>spark/spark-common</module>
</modules>
<properties>
<spark.version>1.6.1</spark.version>
<spark.version>2.0.0</spark.version>
</properties>
</profile>
<profile>
<id>spark-2.0</id>
<id>spark-1.6</id>
<modules>
<module>spark/spark-2.0</module>
<module>spark/spark-1.6</module>
<module>spark/spark-common</module>
</modules>
<properties>
<spark.version>2.0.0</spark.version>
<spark.version>1.6.1</spark.version>
</properties>
</profile>
<profile>
<id>compile-xgboost</id>
<build>
<plugins>
<plugin>
<artifactId>exec-maven-plugin</artifactId>
<groupId>org.codehaus.mojo</groupId>
<executions>
<execution>
<id>native</id>
<phase>generate-sources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>./bin/build_xgboost.sh</executable>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
<profile>
<id>doclint-java8-disable</id>
<activation>
18 changes: 18 additions & 0 deletions resources/ddl/define-all-as-permanent.hive
@@ -575,3 +575,21 @@ CREATE FUNCTION rf_ensemble as 'hivemall.smile.tools.RandomForestEnsembleUDAF' U
DROP FUNCTION IF EXISTS guess_attribute_types;
CREATE FUNCTION guess_attribute_types as 'hivemall.smile.tools.GuessAttributesUDF' USING JAR '${hivemall_jar}';

------------------------------
-- XGBoost related features --
------------------------------

DROP FUNCTION train_xgboost_regr;
CREATE FUNCTION train_xgboost_regr AS 'hivemall.xgboost.regression.XGBoostRegressionUDTF' USING JAR '${hivemall_jar}';

DROP FUNCTION train_xgboost_classifier;
CREATE FUNCTION train_xgboost_classifier AS 'hivemall.xgboost.classification.XGBoostBinaryClassifierUDTF' USING JAR '${hivemall_jar}';

DROP FUNCTION train_multiclass_xgboost_classifier;
CREATE FUNCTION train_multiclass_xgboost_classifier AS 'hivemall.xgboost.classification.XGBoostMulticlassClassifierUDTF' USING JAR '${hivemall_jar}';

DROP FUNCTION xgboost_predict;
CREATE FUNCTION xgboost_predict AS 'hivemall.xgboost.tools.XGBoostPredictUDTF' USING JAR '${hivemall_jar}';

DROP FUNCTION xgboost_multiclass_predict;
CREATE FUNCTION xgboost_multiclass_predict AS 'hivemall.xgboost.tools.XGBoostMulticlassPredictUDTF' USING JAR '${hivemall_jar}';
19 changes: 19 additions & 0 deletions resources/ddl/define-all.hive
@@ -571,6 +571,25 @@ create temporary function rf_ensemble as 'hivemall.smile.tools.RandomForestEnsem
drop temporary function guess_attribute_types;
create temporary function guess_attribute_types as 'hivemall.smile.tools.GuessAttributesUDF';

------------------------------
-- XGBoost related features --
------------------------------

drop temporary function train_xgboost_regr;
create temporary function train_xgboost_regr as 'hivemall.xgboost.regression.XGBoostRegressionUDTF';

drop temporary function train_xgboost_classifier;
create temporary function train_xgboost_classifier as 'hivemall.xgboost.classification.XGBoostBinaryClassifierUDTF';

drop temporary function train_multiclass_xgboost_classifier;
create temporary function train_multiclass_xgboost_classifier as 'hivemall.xgboost.classification.XGBoostMulticlassClassifierUDTF';

drop temporary function xgboost_predict;
create temporary function xgboost_predict as 'hivemall.xgboost.tools.XGBoostPredictUDTF';

drop temporary function xgboost_multiclass_predict;
create temporary function xgboost_multiclass_predict as 'hivemall.xgboost.tools.XGBoostMulticlassPredictUDTF';

--------------------------------------------------------------------------------------------------
-- macros available from hive 0.12.0
-- see https://issues.apache.org/jira/browse/HIVE-2655
3 changes: 1 addition & 2 deletions resources/ddl/import-packages.spark
@@ -9,8 +9,7 @@ import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.HivemallOps._
import org.apache.spark.sql.hive.HivemallUtils
import org.apache.spark.sql.hive.XGBoostOptions
// Needed for implicit conversions
import org.apache.spark.sql.hive.HivemallUtils._
import sqlContext.implicits._

val ft2vec = HivemallUtils.funcVectorizer()
38 changes: 20 additions & 18 deletions spark/spark-1.6/pom.xml
@@ -9,16 +9,10 @@
<relativePath>../../pom.xml</relativePath>
</parent>

<artifactId>hivemall-spark</artifactId>
<name>Hivemall on Spark</name>
<artifactId>hivemall-spark-${spark.version}_${scala.version}</artifactId>
<name>Hivemall on Spark 1.6</name>
<packaging>jar</packaging>

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>1.6.1</spark.version>
<scala.version>2.11.8</scala.version>
</properties>

<dependencies>
<!-- hivemall dependencies -->
<dependency>
@@ -27,18 +21,13 @@
<version>${project.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>io.github.myui</groupId>
<artifactId>hivemall-mixserv</artifactId>
<version>${project.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>io.github.myui</groupId>
<artifactId>hivemall-spark-common</artifactId>
<version>${project.version}</version>
<scope>compile</scope>
</dependency>

<!-- other third-party dependencies -->
<dependency>
<groupId>org.scala-lang</groupId>
@@ -64,6 +53,12 @@
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
@@ -76,7 +71,14 @@
<version>1.8</version>
<scope>compile</scope>
</dependency>

<!-- test dependencies -->
<dependency>
<groupId>io.github.myui</groupId>
<artifactId>hivemall-mixserv</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.xerial</groupId>
<artifactId>xerial-core</artifactId>
@@ -126,17 +128,17 @@
</jvmArgs>
</configuration>
</plugin>
<!-- hivemall-spark-xx.jar -->
<!-- hivemall-spark_2.11-xx.jar -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.5</version>
<configuration>
<finalName>${project.artifactId}-v1.6-${project.version}</finalName>
<finalName>${project.artifactId}-${project.version}</finalName>
<outputDirectory>${project.parent.build.directory}</outputDirectory>
</configuration>
</plugin>
<!-- hivemall-spark-xx-with-dependencies.jar including minimum dependencies -->
<!-- hivemall-spark_2.11-xx-with-dependencies.jar including minimum dependencies -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
@@ -149,7 +151,7 @@
<goal>shade</goal>
</goals>
<configuration>
<finalName>${project.artifactId}-v1.6-${project.version}-with-dependencies</finalName>
<finalName>${project.artifactId}-${project.version}-with-dependencies</finalName>
<outputDirectory>${project.parent.build.directory}</outputDirectory>
<minimizeJar>false</minimizeJar>
<createDependencyReducedPom>false</createDependencyReducedPom>