1 change: 1 addition & 0 deletions dev/.gitignore
@@ -1 +1,2 @@
pep8*.py
pycodestyle*.py
6 changes: 3 additions & 3 deletions dev/appveyor-guide.md
@@ -1,6 +1,6 @@
# AppVeyor Guides

Currently, SparkR on Windows is being tested with [AppVeyor](https://ci.appveyor.com). This page describes how to set up AppVeyor with Spark, how to run the build, check the status and stop the build via this tool. There is the documenation for AppVeyor [here](https://www.appveyor.com/docs). Please refer this for full details.
Currently, SparkR on Windows is being tested with [AppVeyor](https://ci.appveyor.com). This page describes how to set up AppVeyor with Spark, how to run the build, check the status and stop the build via this tool. There is the documentation for AppVeyor [here](https://www.appveyor.com/docs). Please refer this for full details.


### Setting up AppVeyor
@@ -45,7 +45,7 @@ Currently, SparkR on Windows is being tested with [AppVeyor](https://ci.appveyor

<img width="144" alt="2016-08-30 12 16 35" src="https://cloud.githubusercontent.com/assets/6477701/18075026/3ee57bc6-6eac-11e6-826e-5dd09aeb0e7c.png">

- Since we will use Github here, click the "GITHUB" button and then click "Authorize Github" so that AppVeyor can access to the Github logs (e.g. commits).
- Since we will use Github here, click the "GITHUB" button and then click "Authorize Github" so that AppVeyor can access the Github logs (e.g. commits).

<img width="517" alt="2016-09-04 11 10 22" src="https://cloud.githubusercontent.com/assets/6477701/18228819/9a4d5722-7299-11e6-900c-c5ff6b0450b1.png">

@@ -87,7 +87,7 @@ Currently, SparkR on Windows is being tested with [AppVeyor](https://ci.appveyor

<img width="176" alt="2016-08-30 12 29 41" src="https://cloud.githubusercontent.com/assets/6477701/18075336/de618b52-6eae-11e6-8f01-e4ce48963087.png">

- If the build is running, "CANCEL BUILD" buttom appears. Click this button top cancel the current build.
- If the build is running, "CANCEL BUILD" button appears. Click this button to cancel the current build.

<img width="158" alt="2016-08-30 1 11 13" src="https://cloud.githubusercontent.com/assets/6477701/18075806/4de68564-6eb3-11e6-855b-ee22918767f9.png">

12 changes: 6 additions & 6 deletions dev/lint-python
@@ -34,8 +34,8 @@ python -B -m compileall -q -l $PATHS_TO_CHECK > "$PYCODESTYLE_REPORT_PATH"
compile_status="${PIPESTATUS[0]}"

# Get pycodestyle at runtime so that we don't rely on it being installed on the build server.
#+ See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
# Updated to latest official version for pep8. pep8 is formally renamed to pycodestyle.
# See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
# Updated to the latest official version of pep8. pep8 is formally renamed to pycodestyle.
PYCODESTYLE_VERSION="2.3.1"
PYCODESTYLE_SCRIPT_PATH="$SPARK_ROOT_DIR/dev/pycodestyle-$PYCODESTYLE_VERSION.py"
PYCODESTYLE_SCRIPT_REMOTE_PATH="https://raw.githubusercontent.com/PyCQA/pycodestyle/$PYCODESTYLE_VERSION/pycodestyle.py"
@@ -60,9 +60,9 @@ export "PYLINT_HOME=$PYTHONPATH"
export "PATH=$PYTHONPATH:$PATH"

# There is no need to write this output to a file
#+ first, but we do so so that the check status can
#+ be output before the report, like with the
#+ scalastyle and RAT checks.
# first, but we do so so that the check status can
# be output before the report, like with the
# scalastyle and RAT checks.
python "$PYCODESTYLE_SCRIPT_PATH" --config=dev/tox.ini $PATHS_TO_CHECK >> "$PYCODESTYLE_REPORT_PATH"
pycodestyle_status="${PIPESTATUS[0]}"

@@ -73,7 +73,7 @@ else
fi

if [ "$lint_status" -ne 0 ]; then
echo "PYCODESTYLE checks failed."
echo "pycodestyle checks failed."
cat "$PYCODESTYLE_REPORT_PATH"
rm "$PYCODESTYLE_REPORT_PATH"
exit "$lint_status"
4 changes: 2 additions & 2 deletions dev/run-pip-tests
@@ -25,10 +25,10 @@ shopt -s nullglob
FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
cd "$FWDIR"

echo "Constucting virtual env for testing"
echo "Constructing virtual env for testing"
VIRTUALENV_BASE=$(mktemp -d)

# Clean up the virtual env enviroment used if we created one.
# Clean up the virtual env environment used if we created one.
function delete_virtualenv() {
echo "Cleaning up temporary directory - $VIRTUALENV_BASE"
rm -rf "$VIRTUALENV_BASE"
2 changes: 1 addition & 1 deletion dev/run-tests-jenkins
@@ -20,7 +20,7 @@
# Wrapper script that runs the Spark tests then reports QA results
# to github via its API.
# Environment variables are populated by the code here:
#+ https://github.com/jenkinsci/ghprb-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ghprb/GhprbTrigger.java#L139
# https://github.com/jenkinsci/ghprb-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ghprb/GhprbTrigger.java#L139

FWDIR="$( cd "$( dirname "$0" )/.." && pwd )"
cd "$FWDIR"
8 changes: 4 additions & 4 deletions dev/sparktestsupport/modules.py
@@ -25,10 +25,10 @@
@total_ordering
class Module(object):
"""
A module is the basic abstraction in our test runner script. Each module consists of a set of
source files, a set of test commands, and a set of dependencies on other modules. We use modules
to define a dependency graph that lets determine which tests to run based on which files have
changed.
A module is the basic abstraction in our test runner script. Each module consists of a set
of source files, a set of test commands, and a set of dependencies on other modules. We use
modules to define a dependency graph that let us determine which tests to run based on which
files have changed.
"""

def __init__(self, name, dependencies, source_file_regexes, build_profile_flags=(), environ={},
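For illustration, a hypothetical dependency graph built from this constructor (module names and regexes are invented for the sketch, not the real definitions in `modules.py`):

```python
core = Module(name="core", dependencies=[], source_file_regexes=["core/"])
sql = Module(name="sql", dependencies=[core], source_file_regexes=["sql/"])
# A change under sql/ selects the sql module's tests plus those of every
# module that transitively depends on it (here: ml).
ml = Module(name="ml", dependencies=[sql], source_file_regexes=["ml/"])
```
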
6 changes: 3 additions & 3 deletions dev/sparktestsupport/toposort.py
@@ -43,8 +43,8 @@ def toposort(data):
"""Dependencies are expressed as a dictionary whose keys are items
and whose values are a set of dependent items. Output is a list of
sets in topological order. The first set consists of items with no
dependences, each subsequent set consists of items that depend upon
items in the preceeding sets.
dependencies, each subsequent set consists of items that depend upon
items in the preceding sets.
"""

# Special case empty input.
@@ -59,7 +59,7 @@ def toposort(data):
v.discard(k)
# Find all items that don't depend on anything.
extra_items_in_deps = _reduce(set.union, data.values()) - set(data.keys())
# Add empty dependences where needed.
# Add empty dependencies where needed.
data.update({item: set() for item in extra_items_in_deps})
while True:
ordered = set(item for item, dep in data.items() if len(dep) == 0)
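As context for these docstring fixes, a minimal usage sketch of the helper (assuming `dev/` is on `sys.path`; the module names are illustrative):

```python
from sparktestsupport.toposort import toposort

# Keys depend on the items in their value sets.
deps = {
    "pyspark-sql": {"pyspark-core"},
    "pyspark-mllib": {"pyspark-sql"},
    "pyspark-ml": {"pyspark-sql", "pyspark-mllib"},
}
# Each emitted set depends only on items in the preceding sets:
# [{'pyspark-core'}, {'pyspark-sql'}, {'pyspark-mllib'}, {'pyspark-ml'}]
print(list(toposort(deps)))
```
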
4 changes: 2 additions & 2 deletions dev/tests/pr_merge_ability.sh
@@ -23,9 +23,9 @@
# found at dev/run-tests-jenkins.
#
# Arg1: The Github Pull Request Actual Commit
#+ known as `ghprbActualCommit` in `run-tests-jenkins`
# known as `ghprbActualCommit` in `run-tests-jenkins`
# Arg2: The SHA1 hash
#+ known as `sha1` in `run-tests-jenkins`
# known as `sha1` in `run-tests-jenkins`
#

ghprbActualCommit="$1"
4 changes: 2 additions & 2 deletions dev/tests/pr_public_classes.sh
@@ -23,15 +23,15 @@
# found at dev/run-tests-jenkins.
#
# Arg1: The Github Pull Request Actual Commit
#+ known as `ghprbActualCommit` in `run-tests-jenkins`
# known as `ghprbActualCommit` in `run-tests-jenkins`

ghprbActualCommit="$1"

# $ghprbActualCommit is an automatic merge commit generated by GitHub; its parents are some Spark
# master commit and the tip of the pull request branch.

# By diffing$ghprbActualCommit^...$ghprbActualCommit and filtering to examine the diffs of only
# non-test files, we can gets us changes introduced in the PR and not anything else added to master
# non-test files, we can get changes introduced in the PR and not anything else added to master
# since the PR was branched.

# Handle differences between GNU and BSD sed
21 changes: 19 additions & 2 deletions docs/ml-guide.md
@@ -108,7 +108,13 @@ and the migration guide below will explain all changes between releases.

### Breaking changes

There are no breaking changes.
* The class and trait hierarchy for logistic regression model summaries was changed to be cleaner
and better accommodate the addition of the multi-class summary. This is a breaking change for user
code that casts a `LogisticRegressionTrainingSummary` to a
`BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary`
method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail
(_note_ this is an `Experimental` API). This _does not_ affect the Python `summary` method, which
will still work correctly for both multinomial and binary cases.
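For Python users, a minimal sketch of the unaffected path (it assumes `lr_model` is a fitted `LogisticRegressionModel`):

```python
# The Python accessor is unchanged by the Scala-side hierarchy change.
training_summary = lr_model.summary
print(training_summary.objectiveHistory)  # available for binary and multinomial fits
```
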

### Deprecations and changes of behavior

@@ -123,8 +129,19 @@ new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
**Changes of behavior**

* [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial). In 2.2 and
The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In `2.2` and
earlier versions, the level of parallelism was set to the default threadpool size in Scala.
* [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156):
The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than
`1`. This will cause training results to be different between `2.3` and earlier versions.
* [SPARK-21681](https://issues.apache.org/jira/browse/SPARK-21681):
Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients
when some features had zero variance.
* [SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957):
Tree algorithms now use mid-points for split values. This may change results from model training.
* [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657):
Fixed an issue where the features generated by `RFormula` without an intercept were inconsistent
with the output in R. This may change results from model training in this scenario.
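On the `OneVsRest` change, a hedged PySpark sketch of pinning the parallelism explicitly (assuming the `parallelism` param is exposed in this release):

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest

lr = LogisticRegression(maxIter=10)
# The default is now 1 (serial); set parallelism explicitly if you relied on
# the old threadpool-sized default.
ovr = OneVsRest(classifier=lr, parallelism=4)
```
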

## Previous Spark versions

10 changes: 9 additions & 1 deletion python/pyspark/sql/session.py
@@ -213,7 +213,12 @@ def __init__(self, sparkContext, jsparkSession=None):
self._jsc = self._sc._jsc
self._jvm = self._sc._jvm
if jsparkSession is None:
jsparkSession = self._jvm.SparkSession(self._jsc.sc())
if self._jvm.SparkSession.getDefaultSession().isDefined() \
and not self._jvm.SparkSession.getDefaultSession().get() \
.sparkContext().isStopped():
jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
else:
jsparkSession = self._jvm.SparkSession(self._jsc.sc())
self._jsparkSession = jsparkSession
self._jwrapped = self._jsparkSession.sqlContext()
self._wrapped = SQLContext(self._sc, self, self._jwrapped)
@@ -225,6 +230,7 @@ def __init__(self, sparkContext, jsparkSession=None):
if SparkSession._instantiatedSession is None \
or SparkSession._instantiatedSession._sc._jsc is None:
SparkSession._instantiatedSession = self
self._jvm.SparkSession.setDefaultSession(self._jsparkSession)

def _repr_html_(self):
return """
@@ -759,6 +765,8 @@ def stop(self):
"""Stop the underlying :class:`SparkContext`.
"""
self._sc.stop()
# We should clean the default session up. See SPARK-23228.
self._jvm.SparkSession.clearDefaultSession()
SparkSession._instantiatedSession = None

@since(2.0)
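In effect, a minimal sketch mirroring the new tests below (local mode assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
# Constructing the Python session now reuses a live JVM default session when
# one exists, and registers itself as the JVM default otherwise.
assert spark._jvm.SparkSession.getDefaultSession().isDefined()

spark.stop()
# stop() clears the JVM default (SPARK-23228), so a stopped session can no
# longer be picked up by a later constructor call.
assert spark._jvm.SparkSession.getDefaultSession().isEmpty()
```
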
28 changes: 27 additions & 1 deletion python/pyspark/sql/tests.py
@@ -69,7 +69,7 @@
from pyspark.sql.types import _array_signed_int_typecode_ctype_mappings, _array_type_mappings
from pyspark.sql.types import _array_unsigned_int_typecode_ctype_mappings
from pyspark.sql.types import _merge_type
from pyspark.tests import QuietTest, ReusedPySparkTestCase, SparkSubmitTests
from pyspark.tests import QuietTest, ReusedPySparkTestCase, PySparkTestCase, SparkSubmitTests
from pyspark.sql.functions import UserDefinedFunction, sha2, lit
from pyspark.sql.window import Window
from pyspark.sql.utils import AnalysisException, ParseException, IllegalArgumentException
@@ -2925,6 +2925,32 @@ def test_sparksession_with_stopped_sparkcontext(self):
sc.stop()


class SparkSessionTests(PySparkTestCase):

# This test is separate because it's closely related with session's start and stop.
# See SPARK-23228.
def test_set_jvm_default_session(self):
spark = SparkSession.builder.getOrCreate()
try:
self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined())
finally:
spark.stop()
self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isEmpty())

def test_jvm_default_session_already_set(self):
# Here, we assume there is the default session already set in JVM.
jsession = self.sc._jvm.SparkSession(self.sc._jsc.sc())
self.sc._jvm.SparkSession.setDefaultSession(jsession)

spark = SparkSession.builder.getOrCreate()
try:
self.assertTrue(spark._jvm.SparkSession.getDefaultSession().isDefined())
# The session should be the same with the exiting one.
self.assertTrue(jsession.equals(spark._jvm.SparkSession.getDefaultSession().get()))
finally:
spark.stop()


class UDFInitializationTests(unittest.TestCase):
def tearDown(self):
if SparkSession._instantiatedSession is not None:
@@ -46,18 +46,27 @@ object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
}

plan.transform {
case Except(left, right) if isEligible(left, right) =>
Distinct(Filter(Not(transformCondition(left, skipProject(right))), left))
case e @ Except(left, right) if isEligible(left, right) =>
val newCondition = transformCondition(left, skipProject(right))
newCondition.map { c =>
Distinct(Filter(Not(c), left))
}.getOrElse {
e
}
}
}

private def transformCondition(left: LogicalPlan, right: LogicalPlan): Expression = {
private def transformCondition(left: LogicalPlan, right: LogicalPlan): Option[Expression] = {
val filterCondition =
InferFiltersFromConstraints(combineFilters(right)).asInstanceOf[Filter].condition

val attributeNameMap: Map[String, Attribute] = left.output.map(x => (x.name, x)).toMap

filterCondition.transform { case a : AttributeReference => attributeNameMap(a.name) }
if (filterCondition.references.forall(r => attributeNameMap.contains(r.name))) {
Some(filterCondition.transform { case a: AttributeReference => attributeNameMap(a.name) })
} else {
None
}
}

// TODO: This can be further extended in the future.
@@ -660,12 +660,13 @@ object SQLConf {
val WHOLESTAGE_HUGE_METHOD_LIMIT = buildConf("spark.sql.codegen.hugeMethodLimit")
.internal()
.doc("The maximum bytecode size of a single compiled Java function generated by whole-stage " +
"codegen. When the compiled function exceeds this threshold, " +
"the whole-stage codegen is deactivated for this subtree of the current query plan. " +
s"The default value is ${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} and " +
"this is a limit in the OpenJDK JVM implementation.")
"codegen. When the compiled function exceeds this threshold, the whole-stage codegen is " +
"deactivated for this subtree of the current query plan. The default value is 65535, which " +
"is the largest bytecode size possible for a valid Java method. When running on HotSpot, " +
s"it may be preferable to set the value to ${CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT} " +
"to match HotSpot's implementation.")
.intConf
.createWithDefault(CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT)
.createWithDefault(65535)

val WHOLESTAGE_SPLIT_CONSUME_FUNC_BY_OPERATOR =
buildConf("spark.sql.codegen.splitConsumeFuncByOperator")
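A sketch of tuning this conf at runtime, assuming (per the doc text above) that `CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT` is HotSpot's 8000-bytecode limit:

```python
# Opt back into the HotSpot-friendly threshold; 65535 stays the default.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 8000)
```
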
@@ -168,6 +168,21 @@ class ReplaceOperatorSuite extends PlanTest {
comparePlans(optimized, correctAnswer)
}

test("replace Except with Filter when only right filter can be applied to the left") {
val table = LocalRelation(Seq('a.int, 'b.int))
val left = table.where('b < 1).select('a).as("left")
val right = table.where('b < 3).select('a).as("right")

val query = Except(left, right)
val optimized = Optimize.execute(query.analyze)

val correctAnswer =
Aggregate(left.output, right.output,
Join(left, right, LeftAnti, Option($"left.a" <=> $"right.a"))).analyze

comparePlans(optimized, correctAnswer)
}

test("replace Distinct with Aggregate") {
val input = LocalRelation('a.int, 'b.int)

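A rough DataFrame-level analogue of the new test case (PySpark; the exact optimized plan text varies by version):

```python
df = spark.createDataFrame([(1, 0), (2, 2)], ["a", "b"])
left = df.where("b < 1").select("a")
right = df.where("b < 3").select("a")
# The right side's filter references `b`, which is absent from the left's
# output, so the rule now backs off to the default left-anti-join rewrite
# instead of failing on the unresolved attribute.
left.subtract(right).explain(True)
```
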
@@ -146,9 +146,7 @@ public byte[] getBinary(int ordinal) {
@Override
public CalendarInterval getInterval(int ordinal) {
if (columns[ordinal].isNullAt(rowId)) return null;
final int months = columns[ordinal].getChild(0).getInt(rowId);
final long microseconds = columns[ordinal].getChild(1).getLong(rowId);
return new CalendarInterval(months, microseconds);
return columns[ordinal].getInterval(rowId);
}

@Override
@@ -28,7 +28,7 @@
import org.apache.spark.unsafe.types.UTF8String;

/**
* A column vector backed by Apache Arrow. Currently time interval type and map type are not
* A column vector backed by Apache Arrow. Currently calendar interval type and map type are not
* supported.
*/
@InterfaceStability.Evolving
@@ -20,6 +20,7 @@
import org.apache.spark.sql.catalyst.util.MapData;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.Decimal;
import org.apache.spark.unsafe.types.CalendarInterval;
import org.apache.spark.unsafe.types.UTF8String;

/**
@@ -195,6 +196,7 @@ public double[] getDoubles(int rowId, int count) {
* struct field.
*/
public final ColumnarRow getStruct(int rowId) {
if (isNullAt(rowId)) return null;
return new ColumnarRow(this, rowId);
}

@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
public abstract byte[] getBinary(int rowId);

/**
* Returns the ordinal's child column vector.
* Returns the calendar interval type value for rowId.
*
* In Spark, calendar interval type value is basically an integer value representing the number of
* months in this interval, and a long value representing the number of microseconds in this
* interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
* and `microseconds`.
*
* To support interval type, implementations must implement {@link #getChild(int)} and define 2
* child vectors: the first child vector is an int type vector, containing all the month values of
* all the interval values in this vector. The second child vector is a long type vector,
* containing all the microsecond values of all the interval values in this vector.
*/
public final CalendarInterval getInterval(int rowId) {
if (isNullAt(rowId)) return null;
final int months = getChild(0).getInt(rowId);
final long microseconds = getChild(1).getLong(rowId);
return new CalendarInterval(months, microseconds);
}

/**
* @return child [[ColumnVector]] at the given ordinal.
*/
public abstract ColumnVector getChild(int ordinal);
protected abstract ColumnVector getChild(int ordinal);

/**
* Data type for this column.