[SPARK-19633][SS] FileSource read from FileSink #16987
lw-lin wants to merge 3 commits into apache:master
Conversation
Test build #73124 has started for PR 16987 at commit

Jenkins retest this please
Test build #73126 has finished for PR 16987 at commit
this is to keep track of the file name for later checking
Thanks for working on this; however, I'm not sure we want to go with this approach. In Spark 2.2, I think we should consider deprecating the manifest files and instead use deterministic file names to get exactly-once semantics.
Using deterministic file names sounds great. Thanks! I'm closing this.

I spoke too soon, sorry! Thinking about it more, the deterministic-filename solution is not great, as the number of partitions could change for several reasons. Given that, would you mind reopening this? /cc @zsxwing do you have time to review?
Reopening :-)
Test build #73374 has finished for PR 16987 at commit
retest this please
Test build #73382 has finished for PR 16987 at commit
zsxwing left a comment:
Overall looks good. Could you rewrite the tests to use real streaming queries rather than modifying the log manually? It's better to have two queries: one writing to FileSink, the other reading from the same folder using FileSource.
I guess sourceHasMetadata is generated here because of hasMetadata. Could you move hasMetadata to object FileStreamSink? Then you can do it inside FileStreamSource.
Yea hasMetadata was the reason! Now it lives in object FileStreamSink :-D
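For context, a minimal sketch of what moving the check into the companion object could look like. The exact signature and the `_spark_metadata` directory name are assumptions based on this thread, not the merged diff:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FileStreamSink {
  // Subdirectory the file sink writes its metadata log to (assumed name)
  val metadataDir = "_spark_metadata"

  /**
   * Returns true if `path` is a single, non-glob directory that contains a
   * file sink metadata log, i.e. it was written by a FileStreamSink.
   */
  def hasMetadata(path: Seq[String], hadoopConf: Configuration): Boolean = {
    path match {
      case Seq(singlePath) =>
        val hdfsPath = new Path(singlePath)
        val fs = hdfsPath.getFileSystem(hadoopConf)
        fs.exists(new Path(hdfsPath, metadataDir))
      case _ =>
        false // globs / multiple paths are never treated as sink outputs
    }
  }
}
```

With the check in the companion object, both FileStreamSource (to decide how to list files) and DataSource (to decide how to resolve a batch read) can share it.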
Force-pushed b66d2cc to d31cb76
Test build #73492 has finished for PR 16987 at commit
Rebased to master and tests updated. @zsxwing would you take another look when you've got a minute?
/**
 * If the source has a metadata log indicating which files should be read, then we should use it.
 * We figure out whether there exists some metadata log only when user gives a non-glob path.
 */
Just found one corner case: if the query writing the files has not yet started, the current folder will contain no files even though it's the output folder of a file sink. I think we should keep calling sourceHasMetadata until the folder is non-empty.
Actually, why not just change sourceHasMetadata to a method? sparkSession.sessionState.newHadoopConf() seems expensive but we can save it into a field.
Ah, thanks! I was about to change it to a method which stops detecting once we know for sure whether to use a MetadataLogFileIndex or an InMemoryFileIndex, and remembers this information. Will push an update soon.
And add a dedicated test case, of course.
 *
 * None means we don't know at the moment
 * Some(true) means we know for sure the source DOES have metadata
 * Some(false) means we know for sure the source DOES NOT have metadata
(Some notes here since the changes are not trivial.)

Here we're using sourceHasMetadata to indicate whether we know for sure the source has metadata, as stated in the source file comments:
- None means we don't know at the moment
- Some(true) means we know for sure the source DOES have metadata
- Some(false) means we know for sure the source DOES NOT have metadata
// Note if `sourceHasMetadata` holds, then `qualifiedBasePath` is guaranteed to be a
// non-glob path
new MetadataLogFileIndex(sparkSession, qualifiedBasePath)
Then, based on sourceHasMetadata's value, we can choose which FileIndex should be used. As shown below, the None case requires the most care.
seems like sourceHasMetadata match { case ... } is more appropriate here
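Roughly, the shape being suggested here is a three-way match over the flag (a sketch reconstructed from the comments in this thread, not the exact diff; the helper names are assumptions):

```scala
// Tri-state flag: None = undecided, Some(true) = sink output, Some(false) = plain dir
sourceHasMetadata match {
  case None =>
    // Undecided: probe the path. If a sink metadata log exists, remember
    // Some(true); if plain files exist without a log, remember Some(false);
    // if the folder is still empty, stay at None and probe again next batch.
    ???
  case Some(true) =>
    // Read exactly the files recorded in the sink's metadata log
    new MetadataLogFileIndex(sparkSession, qualifiedBasePath)
  case Some(false) =>
    // Plain directory: fall back to listing its contents
    allFilesUsingInMemoryFileIndex()
}
```

The match makes the three states explicit at the call site, instead of spreading if/else checks across the method.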
val sources = query.get.logicalPlan.collect {
  case StreamingExecutionRelation(source, _) if source.isInstanceOf[FileStreamSource] =>
    source.asInstanceOf[FileStreamSource]
}
this common logic is extracted out
Test build #73578 has finished for PR 16987 at commit
zsxwing left a comment:
Looks good overall. Left some style comments.
/** Execute arbitrary code */
case class Execute(val func: StreamExecution => Any) extends StreamAction {
How about just making this extend AssertOnQuery, to avoid adding a new case clause to testStream, which is already pretty long?
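One way to follow this suggestion is a thin factory that delegates to AssertOnQuery and always passes (a sketch; StreamTest's actual helper shape in the merged code may differ):

```scala
/**
 * Execute arbitrary code against the running StreamExecution without adding a
 * new case clause to testStream: reuse the existing AssertOnQuery machinery
 * and always return true so the action never fails on its own.
 */
object Execute {
  def apply(func: StreamExecution => Any): AssertOnQuery =
    AssertOnQuery(query => { func(query); true }, "Execute")
}
```

Since testStream already knows how to run AssertOnQuery actions, no new pattern-match arm is needed.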
withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
  withTempDirs { case (dir, tmp) =>
    // q1 is a streaming query that reads from memory and writes to text files
    val q1_source = MemoryStream[String]
nit: please don't use _ in a variable name.
test("read data from outputs of another streaming query") {
  withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
    withTempDirs { case (dir, tmp) =>
tmp is not used. Why not just name them as (outputDir, checkpointDir)? Same for other tests.
val q1_source = MemoryStream[String]
val q1_checkpointDir = new File(dir, "q1_checkpointDir").getCanonicalPath
val q1_outputDir = new File(dir, "q1_outputDir")
assert(q1_outputDir.mkdir()) // prepare the output dir for q2 to read
nit: just put the comment after the statement, separated by one space. The current format is hard to maintain because it requires aligning comments. Same for other comments.
testStream(q2)(
  AssertOnQuery { q2 =>
    val fileSource = getSourcesFromStreamingQuery(q2).head
    fileSource.sourceHasMetadata === None // q1 has not started yet, verify that q2
nit: put the comment above this line. Same for other comments:

// q1 has not started yet, verify that q2 doesn't know whether q1 has metadata
fileSource.sourceHasMetadata === None
fileSource.sourceHasMetadata === Some(true) // q1 has started, verify that q2 knows q1 has
                                            // metadata by now
},
CheckAnswer("keep2"), // answer should be correct
nit: `// answer should be correct` is obvious. Don't add such comments.
// doesn't know whether q1 has metadata
},
Execute { _ =>
  q1 = q1_write.start(q1_outputDir.getCanonicalPath) // start q1 !!!
nit: `// start q1 !!!` is obvious. Don't add such comments.
q2ProcessAllAvailable(),
CheckAnswer("keep2", "keep3", "keep4"),

// stop q1 manually
nit: `// stop q1 manually` is obvious. Don't add such comments.
test("read partitioned data from outputs of another streaming query") {
This test seems unnecessary. It will pass even if the source doesn't use the partition information.
In the long term, we should write the partition information to the file sink log, then we can read it in the file source. However, it's out of scope. If you have time, you can think about it and submit a new PR after this one.
Test removed. Let me think about this write-partition-information thing :)
Thanks!
allFiles = allFilesUsingInMemoryFileIndex()
if (allFiles.isEmpty) {
  // we still cannot decide
  sourceHasMetadata match {
Simply switched to sourceHasMetadata match { case ... case ... case ... }; the actual diff is quite small.
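Put together, the undecided branch handles the empty-folder corner case noted earlier. This is a sketch reconstructed from the comments in this thread; the helper names (allFilesUsingMetadataLogFileIndex, allFilesUsingInMemoryFileIndex) are assumptions:

```scala
// Inside the file-listing path of FileStreamSource, while sourceHasMetadata is None:
if (FileStreamSink.hasMetadata(Seq(path), hadoopConf)) {
  // A sink metadata log exists: from now on, trust only committed files
  sourceHasMetadata = Some(true)
  allFiles = allFilesUsingMetadataLogFileIndex()
} else {
  allFiles = allFilesUsingInMemoryFileIndex()
  if (allFiles.isEmpty) {
    // Folder is empty: the writing query may not have started yet, so we
    // still cannot decide; leave sourceHasMetadata as None and retry later
  } else {
    // Plain files but no metadata log: definitely not a file sink output
    sourceHasMetadata = Some(false)
  }
}
```

Once the flag settles on Some(true) or Some(false), later batches skip the probing entirely.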
Test build #73653 has finished for PR 16987 at commit
LGTM. Thanks! Merging to master.
What changes were proposed in this pull request?

Right now the file source always uses InMemoryFileIndex to scan files from a given path. But when reading the outputs of another streaming query, the file source should use MetadataLogFileIndex to list files from the sink log. This patch adds this support: the source now chooses between MetadataLogFileIndex and InMemoryFileIndex depending on whether the path is the output of a file sink.

How was this patch tested?

Two newly added tests.
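As a usage illustration of the feature, two queries chained through a directory (a sketch using the public DataStreamReader/Writer API; the paths and the rate source input are placeholders, not from this PR):

```scala
// q1: a streaming query writing text files; the file sink also maintains a
// _spark_metadata log recording which files each batch committed
val q1 = spark.readStream
  .format("rate").load()
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("text")
  .option("checkpointLocation", "/tmp/q1-checkpoint")
  .start("/tmp/q1-output")

// q2: a second streaming query reading the same folder; with this patch the
// source detects the sink log and uses MetadataLogFileIndex, so it sees only
// the files q1 actually committed (not partial or uncommitted files)
val q2 = spark.readStream
  .format("text")
  .load("/tmp/q1-output")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/q2-checkpoint")
  .start()
```

Without the sink-log detection, q2 would list the directory directly and could pick up files from batches that were never committed.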