A library for Positive-Unlabeled Learning for Apache Spark MLlib (ml package)
Original Positive-Unlabeled learning algorithm, first proposed in:
Liu, B., Dai, Y., Li, X. L., Lee, W. S., & Yu, P. S. (2002). Partially supervised classification of text documents. In ICML 2002, Proceedings of the nineteenth international conference on machine learning (pp. 387–394).
Modified Positive-Unlabeled learning algorithm: the main idea is to gradually refine the set of positive examples. The pseudocode was taken from:
Fusilier, D. H., Montes-y-Gómez, M., Rosso, P., & Cabrera, R. G. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443.
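The iterative refinement idea can be illustrated with a toy loop in plain Scala. This is only a sketch of a generic two-step PU scheme, not the code of pu4spark or the exact procedure from either paper. It assumes that every unlabeled instance starts as a candidate negative and that the candidates which still score below a threshold are kept for the next round. The centroid-based scorer, the object name `PuLoopSketch` and all parameter names are hypothetical stand-ins for a real classifier.

```scala
object PuLoopSketch {
  // Toy scorer in [0, 1]: values near 1 mean "closer to the positive centroid".
  def score(x: Double, posMean: Double, negMean: Double): Double = {
    val dPos = math.abs(x - posMean)
    val dNeg = math.abs(x - negMean)
    if (dPos + dNeg == 0.0) 0.5 else dNeg / (dPos + dNeg)
  }

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Returns (candidate negatives after refinement, score of every unlabeled instance).
  def run(positives: Seq[Double], unlabeled: Seq[Double],
          threshold: Double, maxIters: Int): (Seq[Double], Seq[Double]) = {
    require(positives.nonEmpty && unlabeled.nonEmpty, "both sets must be non-empty")
    val posMean = mean(positives)
    var negatives = unlabeled // assumption: start by treating all unlabeled data as negative
    var iter = 0
    var changed = true
    while (iter < maxIters && changed) {
      val negMean = mean(negatives)
      // Keep only the candidates that the current model still scores as negative.
      val refined = negatives.filter(x => score(x, posMean, negMean) < threshold)
      // Stop when the set no longer shrinks or would become empty.
      changed = refined.nonEmpty && refined.size < negatives.size
      if (changed) negatives = refined
      iter += 1
    }
    val negMean = mean(negatives)
    (negatives, unlabeled.map(x => score(x, posMean, negMean)))
  }
}
```

Calling `PuLoopSketch.run(Seq(4.0, 5.0, 6.0), Seq(0.0, 1.0, 2.0, 4.5, 5.5), 0.5, 10)` returns the remaining candidate negatives together with a score for each unlabeled value. In a real setting the scorer would be replaced by a trained classifier such as logistic regression.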
Requires Spark 1.5+.
Spark 2+ has not been tested, but it should work if you replace SparkContext with SparkSession and mllib.linalg.Vector with ml.linalg.Vector.
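For Spark 2+, the substitution mentioned above might look like the following sketch. It has not been tested against pu4spark and only shows the standard Spark 2 entry point and vector type. The object name, application name, local master and toy rows are assumptions made for illustration; the column names match the usage example in this README.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

object Spark2Setup {
  // Builds a tiny DataFrame with a binary label and an ml.linalg.Vector feature column.
  def toyFrame(spark: SparkSession): DataFrame = {
    val positive: Vector = Vectors.dense(0.5, 1.5) // ml.linalg.Vector, not mllib.linalg.Vector
    val unlabeled: Vector = Vectors.dense(2.0, 0.1)
    spark.createDataFrame(Seq((1, positive), (0, unlabeled)))
      .toDF("category", "srcFeatures")
  }

  def main(args: Array[String]): Unit = {
    // SparkSession replaces the SparkContext / SQLContext pair used with Spark 1.x.
    val spark = SparkSession.builder()
      .appName("pu4spark-spark2-sketch") // hypothetical application name
      .master("local[*]")                // local mode, for experimentation only
      .getOrCreate()
    toyFrame(spark).printSchema()
    spark.stop()
  }
}
```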
The library is published to Maven Central and JCenter. Add the following lines, depending on your build system.

Gradle:

    compile 'ru.ispras:pu4spark:0.3'

Maven:

    <dependency>
        <groupId>ru.ispras</groupId>
        <artifactId>pu4spark</artifactId>
        <version>0.3</version>
    </dependency>

sbt:

    libraryDependencies += "ru.ispras" % "pu4spark" % "0.3"
Build the library with Gradle:

    ./gradlew jar

Example usage:

    val inputLabelName = "category"
    val srcFeaturesName = "srcFeatures"
    val outputLabel = "outputLabel"
    val puLearnerConfig = TraditionalPULearnerConfig(0.05, 1, LogisticRegressionConfig())
    val puLearner = puLearnerConfig.build()
    // preparedDf must contain at least the following columns:
    //   a binary label marking positive vs. unlabeled instances (inputLabelName)
    //   the features assembled into a single vector column (srcFeaturesName)
    val preparedDf = ...
    val weightedDF = puLearner.weight(preparedDf, inputLabelName, srcFeaturesName, outputLabel)

The returned DataFrame contains a probability estimate for each instance in the column outputLabel.
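One way to use the estimates is to rank the unlabeled instances by probability and inspect the most likely positives. The sketch below uses only standard Spark SQL calls on a hypothetical stand-in for the returned DataFrame. It assumes that the probability column holds plain Double values and that unlabeled rows are encoded as 0; both are assumptions, so check the actual schema first. It uses the Spark 2 SparkSession API; on Spark 1.x an SQLContext would take its place.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RankUnlabeled {
  // Keeps unlabeled rows (assumed label 0) and returns the n with the highest estimate.
  def topCandidates(weighted: DataFrame, labelCol: String, probCol: String, n: Int): DataFrame =
    weighted
      .filter(col(labelCol) === 0)
      .orderBy(col(probCol).desc)
      .limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rank-unlabeled-sketch") // hypothetical application name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // Toy values standing in for the output of puLearner.weight(...); not real results.
    val weightedDF = Seq((1, 0.97), (0, 0.81), (0, 0.12), (0, 0.55))
      .toDF("category", "outputLabel")
    topCandidates(weightedDF, "category", "outputLabel", 2).show()
    spark.stop()
  }
}
```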
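Features can be assembled into one column by using VectorAssembler:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      // keep only feature columns here: rowName is a non-feature column,
      // and the label column should be excluded as well
      .setInputCols(df.columns.filter(c => c != rowName))
      .setOutputCol(srcFeaturesName)
    val pipeline = new Pipeline().setStages(Array(assembler))
    val preparedDf = pipeline.fit(df).transform(df)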