
Conversation

@andrewor14 (Contributor) commented Apr 19, 2016

## What changes were proposed in this pull request?

In Spark 2.0 we will have a new entry point for users known as the SparkSession. This class will handle the lazy initialization of the Hive metastore if the user runs commands that require interaction with the metastore (e.g. CREATE TABLE). With this, we can remove the HiveContext, which is an odd API to be exposed to Spark users.

This patch doesn't fully remove HiveContext but does most of the work. A follow-up patch will actually delete the file itself. I've left all the variable naming and any further refactor for later.
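
For illustration, a minimal sketch of the new entry point, using the builder API that Spark 2.0 ultimately shipped (method names here are based on the final 2.0 API, not necessarily this intermediate patch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport() // the Hive metastore is only initialized lazily, on first use
  .getOrCreate()

// Running a metastore command like this triggers the lazy initialization:
spark.sql("CREATE TABLE src (key INT, value STRING)")
```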

## How was this patch tested?

Jenkins.

@andrewor14 andrewor14 changed the title [SPARK-14720] Remove HiveContext (step 1) [SPARK-14720][SPARK-13643][WIP] Remove HiveContext (step 1) Apr 19, 2016
```scala
  def setConf(props: Properties): Unit = sessionState.setConf(props)

  /** Set the given Spark SQL configuration property. */
  private[sql] def setConf[T](entry: ConfigEntry[T], value: T): Unit = conf.setConf(entry, value)
```
Contributor commented:

Seems we also need to change this?

@SparkQA commented Apr 19, 2016

Test build #56166 has finished for PR 12485 at commit ce1214d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56170 has finished for PR 12485 at commit 6019541.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56174 has finished for PR 12485 at commit 75d1115.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 5 commits April 19, 2016 10:04
  • Previously we still tried to load HiveContext even if the user explicitly specified an "in-memory" catalog implementation. Now it will load a SQLContext in this case (see the sketch after this list).
  • It was failing because we were passing a subclass of SparkContext into SparkSession, and the reflection was using the wrong class to get the constructor. This is now fixed with ClassTags.
  • Avoid some unnecessary casts.
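
A minimal sketch of the catalog dispatch described in the first commit note; the conf key matches what Spark 2.0 eventually used, but the helper itself is hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical helper mirroring the dispatch described above.
def createContext(sc: SparkContext): SQLContext =
  sc.getConf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive" =>
      // Load HiveContext reflectively so sql/core has no compile-time Hive dependency.
      Class.forName("org.apache.spark.sql.hive.HiveContext")
        .getConstructor(classOf[SparkContext])
        .newInstance(sc)
        .asInstanceOf[SQLContext]
    case _ =>
      new SQLContext(sc) // "in-memory" catalog: plain SQLContext
  }
```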
@SparkQA commented Apr 19, 2016

Test build #56254 has finished for PR 12485 at commit bc35206.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
  private def reflect[T, Arg <: AnyRef](
      className: String,
      ctorArg: Arg)(implicit ctorArgTag: ClassTag[Arg]): T = {
```
Contributor commented:

You don't need a class tag. You can call ctorArg.getClass().

@andrewor14 (Author) replied Apr 19, 2016:

didn't work because there are places where we pass subclasses of Arg in here (see SQLExecutionSuite)
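
A sketch of the distinction under discussion; the method body is illustrative, but the signature matches the snippet above:

```scala
import scala.reflect.ClassTag

def reflect[T, Arg <: AnyRef](
    className: String,
    ctorArg: Arg)(implicit ctorArgTag: ClassTag[Arg]): T = {
  // ctorArgTag.runtimeClass is the *declared* Arg type, so a constructor
  // taking the parent type is still found when ctorArg is a subclass instance.
  // ctorArg.getClass() would return the runtime (subclass) type and miss it.
  val ctor = Class.forName(className).getDeclaredConstructor(ctorArgTag.runtimeClass)
  ctor.newInstance(ctorArg).asInstanceOf[T]
}
```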

Andrew Or added 4 commits April 19, 2016 14:33
The problem was that we weren't using the right QueryExecution
when we called TestHive.sessionState.executePlan. We were using
HiveQueryExecution instead of the custom one that we created
in TestHiveContext.

This turned out to be very difficult to fix due to the tight
coupling of QueryExecution within TestHiveContext. I had to
refactor this code significantly, extracting the nested logic
piece by piece.
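
A toy, self-contained illustration of the fix the commit note describes; all names below are stand-ins for QueryExecution, HiveQueryExecution, and TestHiveContext's custom subclass:

```scala
// Toy classes standing in for QueryExecution and its TestHive subclass.
class QueryExecution(val plan: String) {
  def executedPlan: String = s"exec($plan)"
}
class TestQueryExecution(plan: String) extends QueryExecution(plan) {
  override def executedPlan: String = s"test-exec($plan)"
}

class SessionState {
  // the factory the test session state must be able to override
  def executePlan(plan: String): QueryExecution = new QueryExecution(plan)
}
class TestSessionState extends SessionState {
  override def executePlan(plan: String): QueryExecution = new TestQueryExecution(plan)
}

// Without the override, sessionState.executePlan hands back the base
// QueryExecution, which is exactly the bug described above.
```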
@SparkQA commented Apr 19, 2016

Test build #56276 has finished for PR 12485 at commit b3d23fa.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait PartitionCoalescer
    • class PartitionGroup(val prefLoc: Option[String] = None)
    • trait ObjectProducer extends LogicalPlan
    • trait ObjectConsumer extends UnaryNode
    • case class DeserializeToObject(
    • case class SerializeFromObject(
    • case class AppendColumnsWithObject(

@SparkQA commented Apr 19, 2016

Test build #56273 has finished for PR 12485 at commit 303f991.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56278 has finished for PR 12485 at commit ddc752a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 2 commits April 19, 2016 18:06
The problem was that we were getting everything from
executionHive's hiveconf and setting that in metadataHive,
overriding the value of `hive.metastore.warehouse.dir`,
which we customize in TestHive. This resulted in a bunch
of "Table src does not exist" errors from Hive.
@SparkQA commented Apr 20, 2016

Test build #56303 has finished for PR 12485 at commit 8bf1236.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56305 has finished for PR 12485 at commit 74b105e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Timer(val iteration: Int)
    • case class Case(name: String, fn: Timer => Unit)
    • trait ScalaReflection
    • trait CheckAnalysis extends PredicateHelper
    • abstract class SubqueryExpression extends Expression
    • abstract class PredicateSubquery extends SubqueryExpression with Unevaluable with Predicate
    • case class InSubQuery(value: Expression, query: LogicalPlan) extends PredicateSubquery
    • case class Exists(query: LogicalPlan) extends PredicateSubquery
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
    • abstract class OutputWriterFactory extends Serializable
    • abstract class OutputWriter
    • case class HadoopFsRelation(
    • trait FileFormat
    • case class Partition(values: InternalRow, files: Seq[FileStatus])
    • trait FileCatalog
    • class HDFSFileCatalog(
    • case class FakeFileStatus(

@rxin (Contributor) commented Apr 20, 2016

Hey @andrewor14 - I took a quick look at this. Shouldn't we move the SessionState stuff into its own PR, since that would be easier to get in and also easier to review?

yhuai and others added 3 commits April 20, 2016 09:30
It may take time to track all places where we only use SQLContext.
So, let's change the catalog conf's default value to in-memory.
In the constructor of HiveContext, we will set this conf to hive.
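
A sketch of the conf entry this describes, in the SQLConfigBuilder style of the 2.0-era codebase; the exact builder calls are assumptions based on the code that eventually shipped:

```scala
// Inside org.apache.spark.sql.internal.SQLConf, where SQLConfigBuilder is in scope.
val CATALOG_IMPLEMENTATION = SQLConfigBuilder("spark.sql.catalogImplementation")
  .internal()
  .stringConf
  .checkValues(Set("hive", "in-memory"))
  .createWithDefault("in-memory") // HiveContext's constructor flips this to "hive"
```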
@andrewor14 andrewor14 changed the title [SPARK-14720][SPARK-13643][WIP] Remove HiveContext (step 1) [SPARK-14720][SPARK-13643] Remove HiveContext (step 1) Apr 20, 2016
@SparkQA commented Apr 20, 2016

Test build #56374 has finished for PR 12485 at commit 9422128.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56382 has finished for PR 12485 at commit 6e3c366.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 20, 2016
…nState and Create a SparkSession class

## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps with removing HiveContext (a toy sketch of this split follows the commit message).
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

## How was this patch tested?
Existing tests

This PR is trying to fix test failures of apache#12485.

Author: Andrew Or <[email protected]>
Author: Yin Huai <[email protected]>

Closes apache#12522 from yhuai/spark-session.
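
As a toy illustration of the split described in change 1 above (only the two class names mirror the PR; the members are invented):

```scala
// Session-scoped state lives in a SessionState hierarchy rather than on the
// context class itself; Hive specifics go in a subclass.
class SessionState {
  def catalogImplementation: String = "in-memory"
}

class HiveSessionState extends SessionState {
  // Hive-specific behavior formerly on HiveContext lives here
  override def catalogImplementation: String = "hive"
}
```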
@andrewor14 andrewor14 closed this Apr 20, 2016
@andrewor14 andrewor14 deleted the spark-session branch April 20, 2016 20:51
asfgit pushed a commit that referenced this pull request Apr 25, 2016
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <[email protected]>

Closes #12585 from andrewor14/delete-hive-context.
