
Conversation

@andrewor14 (Contributor) commented Apr 19, 2016

## What changes were proposed in this pull request?

In Spark 2.0 we will have a new entry point for users known as the SparkSession. This class will handle the lazy initialization of the Hive metastore if the user runs commands that require interaction with the metastore (e.g. CREATE TABLE). With this, we can remove the HiveContext, which is an odd API to be exposed to Spark users.

This patch doesn't fully remove HiveContext but does most of the work. A follow-up patch will actually delete the file itself. I've left all the variable naming and any further refactor for later.
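
For illustration, a minimal sketch of the new entry point, using the builder API that Spark 2.0 ultimately shipped (method names here are based on the final 2.0 API, not necessarily this intermediate patch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport() // the Hive metastore is only initialized lazily, on first use
  .getOrCreate()

// Running a metastore command like this triggers the lazy initialization:
spark.sql("CREATE TABLE src (key INT, value STRING)")
```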

## How was this patch tested?

Jenkins.

@andrewor14 andrewor14 changed the title [SPARK-14720] Remove HiveContext (step 1) [SPARK-14720][SPARK-13643][WIP] Remove HiveContext (step 1) Apr 19, 2016
```scala
  def setConf(props: Properties): Unit = sessionState.setConf(props)

  /** Set the given Spark SQL configuration property. */
  private[sql] def setConf[T](entry: ConfigEntry[T], value: T): Unit = conf.setConf(entry, value)
```
Contributor commented:

Seems we also need to change this?

@SparkQA commented Apr 19, 2016

Test build #56166 has finished for PR 12485 at commit ce1214d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56170 has finished for PR 12485 at commit 6019541.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56174 has finished for PR 12485 at commit 75d1115.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 5 commits April 19, 2016 10:04
  • Previously we still tried to load HiveContext even if the user explicitly specified an "in-memory" catalog implementation. Now it will load a SQLContext in this case (see the sketch after this list).
  • It was failing because we were passing a subclass of SparkContext into SparkSession, and the reflection was using the wrong class to get the constructor. This is now fixed with ClassTags.
  • Avoid some unnecessary casts.
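
A minimal sketch of the catalog dispatch described in the first commit note; the conf key matches what Spark 2.0 eventually used, but the helper itself is hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical helper mirroring the dispatch described above.
def createContext(sc: SparkContext): SQLContext =
  sc.getConf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive" =>
      // Load HiveContext reflectively so sql/core has no compile-time Hive dependency.
      Class.forName("org.apache.spark.sql.hive.HiveContext")
        .getConstructor(classOf[SparkContext])
        .newInstance(sc)
        .asInstanceOf[SQLContext]
    case _ =>
      new SQLContext(sc) // "in-memory" catalog: plain SQLContext
  }
```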
@SparkQA commented Apr 19, 2016

Test build #56254 has finished for PR 12485 at commit bc35206.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
  private def reflect[T, Arg <: AnyRef](
      className: String,
      ctorArg: Arg)(implicit ctorArgTag: ClassTag[Arg]): T = {
```
Contributor commented:

You don't need a class tag. You can call ctorArg.getClass().

@andrewor14 (Author) replied Apr 19, 2016:

didn't work because there are places where we pass subclasses of Arg in here (see SQLExecutionSuite)
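
A sketch of the distinction under discussion; the method body is illustrative, but the signature matches the snippet above:

```scala
import scala.reflect.ClassTag

def reflect[T, Arg <: AnyRef](
    className: String,
    ctorArg: Arg)(implicit ctorArgTag: ClassTag[Arg]): T = {
  // ctorArgTag.runtimeClass is the *declared* Arg type, so a constructor
  // taking the parent type is still found when ctorArg is a subclass instance.
  // ctorArg.getClass() would return the runtime (subclass) type and miss it.
  val ctor = Class.forName(className).getDeclaredConstructor(ctorArgTag.runtimeClass)
  ctor.newInstance(ctorArg).asInstanceOf[T]
}
```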

Andrew Or added 4 commits April 19, 2016 14:33
The problem was that we weren't using the right QueryExecution
when we called TestHive.sessionState.executePlan. We were using
HiveQueryExecution instead of the custom one that we created
in TestHiveContext.

This turned out to be very difficult to fix due to the tight
coupling of QueryExecution within TestHiveContext. I had to
refactor this code significantly, extracting the nested logic
piece by piece.
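
A toy, self-contained illustration of the fix the commit note describes; all names below are stand-ins for QueryExecution, HiveQueryExecution, and TestHiveContext's custom subclass:

```scala
// Toy classes standing in for QueryExecution and its TestHive subclass.
class QueryExecution(val plan: String) {
  def executedPlan: String = s"exec($plan)"
}
class TestQueryExecution(plan: String) extends QueryExecution(plan) {
  override def executedPlan: String = s"test-exec($plan)"
}

class SessionState {
  // the factory the test session state must be able to override
  def executePlan(plan: String): QueryExecution = new QueryExecution(plan)
}
class TestSessionState extends SessionState {
  override def executePlan(plan: String): QueryExecution = new TestQueryExecution(plan)
}

// Without the override, sessionState.executePlan hands back the base
// QueryExecution, which is exactly the bug described above.
```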
@SparkQA commented Apr 19, 2016

Test build #56276 has finished for PR 12485 at commit b3d23fa.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait PartitionCoalescer
    • class PartitionGroup(val prefLoc: Option[String] = None)
    • trait ObjectProducer extends LogicalPlan
    • trait ObjectConsumer extends UnaryNode
    • case class DeserializeToObject(
    • case class SerializeFromObject(
    • case class AppendColumnsWithObject(

@SparkQA commented Apr 19, 2016

Test build #56273 has finished for PR 12485 at commit 303f991.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 19, 2016

Test build #56278 has finished for PR 12485 at commit ddc752a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 2 commits April 19, 2016 18:06
The problem was that we were getting everything from
executionHive's hiveconf and setting that in metadataHive,
overriding the value of `hive.metastore.warehouse.dir`,
which we customize in TestHive. This resulted in a bunch
of "Table src does not exist" errors from Hive.
@SparkQA commented Apr 20, 2016

Test build #56303 has finished for PR 12485 at commit 8bf1236.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56305 has finished for PR 12485 at commit 74b105e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Timer(val iteration: Int)
    • case class Case(name: String, fn: Timer => Unit)
    • trait ScalaReflection
    • trait CheckAnalysis extends PredicateHelper
    • abstract class SubqueryExpression extends Expression
    • abstract class PredicateSubquery extends SubqueryExpression with Unevaluable with Predicate
    • case class InSubQuery(value: Expression, query: LogicalPlan) extends PredicateSubquery
    • case class Exists(query: LogicalPlan) extends PredicateSubquery
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
    • abstract class OutputWriterFactory extends Serializable
    • abstract class OutputWriter
    • case class HadoopFsRelation(
    • trait FileFormat
    • case class Partition(values: InternalRow, files: Seq[FileStatus])
    • trait FileCatalog
    • class HDFSFileCatalog(
    • case class FakeFileStatus(

@rxin (Contributor) commented Apr 20, 2016

Hey @andrewor14 - I took a quick look at this. Shouldn't we move the SessionState stuff into its own PR, since that would be easier to get in and also easier to review?

yhuai and others added 3 commits April 20, 2016 09:30
It may take time to track all places where we only use SQLContext.
So, let's change the catalog conf's default value to in-memory.
In the constructor of HiveContext, we will set this conf to hive.
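
A sketch of the conf entry this describes, in the SQLConfigBuilder style of the 2.0-era codebase; the exact builder calls are assumptions based on the code that eventually shipped:

```scala
// Inside org.apache.spark.sql.internal.SQLConf, where SQLConfigBuilder is in scope.
val CATALOG_IMPLEMENTATION = SQLConfigBuilder("spark.sql.catalogImplementation")
  .internal()
  .stringConf
  .checkValues(Set("hive", "in-memory"))
  .createWithDefault("in-memory") // HiveContext's constructor flips this to "hive"
```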
@andrewor14 andrewor14 changed the title [SPARK-14720][SPARK-13643][WIP] Remove HiveContext (step 1) [SPARK-14720][SPARK-13643] Remove HiveContext (step 1) Apr 20, 2016
@SparkQA commented Apr 20, 2016

Test build #56374 has finished for PR 12485 at commit 9422128.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56382 has finished for PR 12485 at commit 6e3c366.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 20, 2016
…nState and Create a SparkSession class

## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps with removing HiveContext (a toy sketch of this split follows the commit message).
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

## How was this patch tested?
Existing tests

This PR is trying to fix test failures of apache#12485.

Author: Andrew Or <[email protected]>
Author: Yin Huai <[email protected]>

Closes apache#12522 from yhuai/spark-session.
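
As a toy illustration of the split described in change 1 above (only the two class names mirror the PR; the members are invented):

```scala
// Session-scoped state lives in a SessionState hierarchy rather than on the
// context class itself; Hive specifics go in a subclass.
class SessionState {
  def catalogImplementation: String = "in-memory"
}

class HiveSessionState extends SessionState {
  // Hive-specific behavior formerly on HiveContext lives here
  override def catalogImplementation: String = "hive"
}
```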
@andrewor14 andrewor14 closed this Apr 20, 2016
@andrewor14 andrewor14 deleted the spark-session branch April 20, 2016 20:51
asfgit pushed a commit that referenced this pull request Apr 25, 2016
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <[email protected]>

Closes #12585 from andrewor14/delete-hive-context.
