[SPARK-19633][SS] FileSource read from FileSink #16987
lw-lin wants to merge 3 commits into apache:master
Conversation
Test build #73124 has started for PR 16987 at commit

Jenkins retest this please
Test build #73126 has finished for PR 16987 at commit
this is to keep track of the file name for later checking
Thanks for working on this; however, I'm not sure we want to go with this approach. In Spark 2.2, I think we should consider deprecating the manifest files and instead use deterministic file names to get exactly-once semantics.
Using deterministic file names sounds great. Thanks! I'm closing this.

I spoke too soon, sorry! Thinking about it more, the deterministic-filename solution is not great, as the number of partitions could change for several reasons. Given that, would you mind reopening this? /cc @zsxwing do you have time to review?
Reopening :-)
Test build #73374 has finished for PR 16987 at commit
retest this please
Test build #73382 has finished for PR 16987 at commit
zsxwing left a comment:
Overall looks good. Could you rewrite the tests to use real streaming queries rather than modifying the log manually? It's better to have two queries: one writing to FileSink, the other reading from the same folder using FileSource.
I guess sourceHasMetadata is generated here because of hasMetadata. Could you move hasMetadata to object FileStreamSink? Then you can do it inside FileStreamSource.
Yea hasMetadata was the reason! Now it lives in object FileStreamSink :-D
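For context, a minimal sketch of what moving the check into the companion object could look like. The exact signature and the `_spark_metadata` directory name are assumptions based on this thread, not the merged diff:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FileStreamSink {
  // Subdirectory the file sink writes its metadata log to (assumed name)
  val metadataDir = "_spark_metadata"

  /**
   * Returns true if `path` is a single, non-glob directory that contains a
   * file sink metadata log, i.e. it was written by a FileStreamSink.
   */
  def hasMetadata(path: Seq[String], hadoopConf: Configuration): Boolean = {
    path match {
      case Seq(singlePath) =>
        val hdfsPath = new Path(singlePath)
        val fs = hdfsPath.getFileSystem(hadoopConf)
        fs.exists(new Path(hdfsPath, metadataDir))
      case _ =>
        false // globs / multiple paths are never treated as sink outputs
    }
  }
}
```

With the check in the companion object, both FileStreamSource (to decide how to list files) and DataSource (to decide how to resolve a batch read) can share it.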
Force-pushed b66d2cc to d31cb76
Test build #73492 has finished for PR 16987 at commit
Rebased to master and tests updated. @zsxwing would you take another look when you've got a minute?
/**
 * If the source has a metadata log indicating which files should be read, then we should use it.
 * We figure out whether there exists some metadata log only when user gives a non-glob path.
 */
Just found one corner case: if the query writing the files has not yet started, the current folder will contain no files even though it's the output folder of a file sink. I think we should keep calling sourceHasMetadata until the folder is non-empty.
Actually, why not just change sourceHasMetadata to a method? sparkSession.sessionState.newHadoopConf() seems expensive but we can save it into a field.
Ah, thanks! I was about to change it to a method which stops detecting once we know for sure whether to use a MetadataLogFileIndex or an InMemoryFileIndex, and remembers this information. Will push an update soon.
And add a dedicated test case, of course.
 *
 * None means we don't know at the moment
 * Some(true) means we know for sure the source DOES have metadata
 * Some(false) means we know for sure the source DOES NOT have metadata
(Some notes here since the changes are not trivial.)

Here we're using sourceHasMetadata to indicate whether we know for sure the source has metadata, as stated in the source file comments:
- None means we don't know at the moment
- Some(true) means we know for sure the source DOES have metadata
- Some(false) means we know for sure the source DOES NOT have metadata
// Note if `sourceHasMetadata` holds, then `qualifiedBasePath` is guaranteed to be a
// non-glob path
new MetadataLogFileIndex(sparkSession, qualifiedBasePath)
Then, based on sourceHasMetadata's value, we can choose which FileIndex should be used. As shown below, the None case requires the most care.
seems like sourceHasMetadata match { case ... } is more appropriate here
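Roughly, the shape being suggested here is a three-way match over the flag (a sketch reconstructed from the comments in this thread, not the exact diff; the helper names are assumptions):

```scala
// Tri-state flag: None = undecided, Some(true) = sink output, Some(false) = plain dir
sourceHasMetadata match {
  case None =>
    // Undecided: probe the path. If a sink metadata log exists, remember
    // Some(true); if plain files exist without a log, remember Some(false);
    // if the folder is still empty, stay at None and probe again next batch.
    ???
  case Some(true) =>
    // Read exactly the files recorded in the sink's metadata log
    new MetadataLogFileIndex(sparkSession, qualifiedBasePath)
  case Some(false) =>
    // Plain directory: fall back to listing its contents
    allFilesUsingInMemoryFileIndex()
}
```

The match makes the three states explicit at the call site, instead of spreading if/else checks across the method.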
val sources = query.get.logicalPlan.collect {
  case StreamingExecutionRelation(source, _) if source.isInstanceOf[FileStreamSource] =>
    source.asInstanceOf[FileStreamSource]
}
this common logic is extracted out
Test build #73578 has finished for PR 16987 at commit
zsxwing left a comment:
Looks good overall. Left some style comments.
/** Execute arbitrary code */
case class Execute(val func: StreamExecution => Any) extends StreamAction {
How about just making this extend AssertOnQuery, to avoid adding a new case clause to testStream, which is already pretty long?
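One way to follow this suggestion is a thin factory that delegates to AssertOnQuery and always passes (a sketch; StreamTest's actual helper shape in the merged code may differ):

```scala
/**
 * Execute arbitrary code against the running StreamExecution without adding a
 * new case clause to testStream: reuse the existing AssertOnQuery machinery
 * and always return true so the action never fails on its own.
 */
object Execute {
  def apply(func: StreamExecution => Any): AssertOnQuery =
    AssertOnQuery(query => { func(query); true }, "Execute")
}
```

Since testStream already knows how to run AssertOnQuery actions, no new pattern-match arm is needed.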
withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
  withTempDirs { case (dir, tmp) =>
    // q1 is a streaming query that reads from memory and writes to text files
    val q1_source = MemoryStream[String]
nit: please don't use _ in a variable name.
test("read data from outputs of another streaming query") {
  withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
    withTempDirs { case (dir, tmp) =>
tmp is not used. Why not just name them as (outputDir, checkpointDir)? Same for other tests.
val q1_source = MemoryStream[String]
val q1_checkpointDir = new File(dir, "q1_checkpointDir").getCanonicalPath
val q1_outputDir = new File(dir, "q1_outputDir")
assert(q1_outputDir.mkdir()) // prepare the output dir for q2 to read
nit: just put the comment after the statement, separated by one space. The current format is hard to maintain because it requires aligning comments. Same for other comments.
testStream(q2)(
  AssertOnQuery { q2 =>
    val fileSource = getSourcesFromStreamingQuery(q2).head
    fileSource.sourceHasMetadata === None // q1 has not started yet, verify that q2
nit: put the comment above this line. Same for other comments:

// q1 has not started yet, verify that q2 doesn't know whether q1 has metadata
fileSource.sourceHasMetadata === None
fileSource.sourceHasMetadata === Some(true) // q1 has started, verify that q2 knows q1 has
                                            // metadata by now
},
CheckAnswer("keep2"), // answer should be correct
nit: `// answer should be correct` is obvious. Don't add such comments.
// doesn't know whether q1 has metadata
},
Execute { _ =>
  q1 = q1_write.start(q1_outputDir.getCanonicalPath) // start q1 !!!
nit: `// start q1 !!!` is obvious. Don't add such comments.
q2ProcessAllAvailable(),
CheckAnswer("keep2", "keep3", "keep4"),

// stop q1 manually
nit: `// stop q1 manually` is obvious. Don't add such comments.
test("read partitioned data from outputs of another streaming query") {
This test seems unnecessary. It will pass even if the source doesn't use the partition information.
In the long term, we should write the partition information to the file sink log, then we can read it in the file source. However, it's out of scope. If you have time, you can think about it and submit a new PR after this one.
Test removed. Let me think about this write-partition-information thing :)
Thanks!
allFiles = allFilesUsingInMemoryFileIndex()
if (allFiles.isEmpty) {
  // we still cannot decide
  sourceHasMetadata match {
Simply switched to sourceHasMetadata match { case ... case ... case ... }; the actual diff is quite small.
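Put together, the undecided branch handles the empty-folder corner case noted earlier. This is a sketch reconstructed from the comments in this thread; the helper names (allFilesUsingMetadataLogFileIndex, allFilesUsingInMemoryFileIndex) are assumptions:

```scala
// Inside the file-listing path of FileStreamSource, while sourceHasMetadata is None:
if (FileStreamSink.hasMetadata(Seq(path), hadoopConf)) {
  // A sink metadata log exists: from now on, trust only committed files
  sourceHasMetadata = Some(true)
  allFiles = allFilesUsingMetadataLogFileIndex()
} else {
  allFiles = allFilesUsingInMemoryFileIndex()
  if (allFiles.isEmpty) {
    // Folder is empty: the writing query may not have started yet, so we
    // still cannot decide; leave sourceHasMetadata as None and retry later
  } else {
    // Plain files but no metadata log: definitely not a file sink output
    sourceHasMetadata = Some(false)
  }
}
```

Once the flag settles on Some(true) or Some(false), later batches skip the probing entirely.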
Test build #73653 has finished for PR 16987 at commit
LGTM. Thanks! Merging to master.
What changes were proposed in this pull request?

Right now the file source always uses InMemoryFileIndex to scan files from a given path. But when reading the outputs of another streaming query, the file source should use MetadataLogFileIndex to list files from the sink log. This patch adds this support: the source now chooses between MetadataLogFileIndex and InMemoryFileIndex depending on whether the path is the output of a file sink.

How was this patch tested?

Two newly added tests.
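As a usage illustration of the feature, two queries chained through a directory (a sketch using the public DataStreamReader/Writer API; the paths and the rate source input are placeholders, not from this PR):

```scala
// q1: a streaming query writing text files; the file sink also maintains a
// _spark_metadata log recording which files each batch committed
val q1 = spark.readStream
  .format("rate").load()
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("text")
  .option("checkpointLocation", "/tmp/q1-checkpoint")
  .start("/tmp/q1-output")

// q2: a second streaming query reading the same folder; with this patch the
// source detects the sink log and uses MetadataLogFileIndex, so it sees only
// the files q1 actually committed (not partial or uncommitted files)
val q2 = spark.readStream
  .format("text")
  .load("/tmp/q1-output")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/q2-checkpoint")
  .start()
```

Without the sink-log detection, q2 would list the directory directly and could pick up files from batches that were never committed.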