Conversation

@zsxwing zsxwing commented Apr 15, 2016

What changes were proposed in this pull request?

This PR adds a special log for FileStreamSink for two purposes:

  • Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
  • Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.

FileStreamSinkLog uses a new log format instead of the Java serialization format. It writes one log file for each batch: the first line of the log file is the version number, followed by multiple JSON lines, each of which is the JSON representation of a FileLog.
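
For illustration, a per-batch log file could look like the following (a hypothetical example: the "v1" marker and the field values are assumptions, while the field names mirror the FileLog fields):

v1
{"path":"/output/part-00000","size":512,"action":"add"}
{"path":"/output/part-00001","size":1024,"action":"add"}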

FileStreamSinkLog compacts log files into a single big file every "spark.sql.sink.file.log.compactLen" batches. When compacting, it reads all history logs and merges them with the new batch; it also drops the entries for files that have been deleted (as marked by FileLog.action). When a reader uses allLogs to list all files, the method returns only the visible files (deleted files are dropped).
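
As a minimal sketch of the batch-numbering rule this implies (the helper name isCompactionBatch also appears in the tests further down; this exact body is an assumption rather than necessarily the merged code):

def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean = {
  // Batches compactInterval - 1, 2 * compactInterval - 1, ... are compaction
  // batches: each writes one compacted file covering all history up to it.
  (batchId + 1) % compactInterval == 0
}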

How was this patch tested?

FileStreamSinkLogSuite

@zsxwing zsxwing changed the title Add a file sink log to support versioning and compaction [SPARK-14678][SQL]Add a file sink log to support versioning and compaction Apr 15, 2016

SparkQA commented Apr 16, 2016

Test build #55977 has finished for PR 12435 at commit 29e3088.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FileLog(path: String, size: Long, action: String)
    • class FileStreamSinkLog(sqlContext: SQLContext, path: String)

zsxwing commented Apr 18, 2016

cc @marmbrus @tdas

import org.apache.spark.sql.internal.SQLConf

/**
 * @param path the file path

Contributor:

nit: Add basic doc string on what this class represents.

tdas commented Apr 18, 2016

overall looks quite good. just a few nits on naming and docs.

zsxwing commented Apr 18, 2016

@tdas FYI, I changed FileStressSuite's numRecords to 1000000 and it passed locally as well.

SparkQA commented Apr 19, 2016

Test build #56150 has finished for PR 12435 at commit 48d7fbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SinkFileStatus(path: String, size: Long, action: String)

if (isCompactionBatch(batchId, 3)) {
  // Since batchId is a compaction batch, the batch log file should contain all logs
  assert(sinkLog.get(batchId).getOrElse(Nil) === (0 to batchId).map {
    id => SinkFileStatus("/a/b/" + id, 100L, FileStreamSinkLog.ADD_ACTION)
  })
}

Contributor:

nit: this and line 147 can be deduped
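
One way to dedupe them, as a hypothetical sketch (the helper name expectedStatuses is invented here):

def expectedStatuses(batchId: Long): Seq[SinkFileStatus] =
  (0L to batchId).map { id =>
    SinkFileStatus("/a/b/" + id, 100L, FileStreamSinkLog.ADD_ACTION)
  }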

 * Returns all files except the deleted ones.
 */
-def allLogs(): Array[FileLog] = {
+def allFiles(): Array[SinkFileStatus] = {

Contributor:

Can you make this allSinkFile() so that it's not ambiguous with the log files?

SparkQA commented Apr 19, 2016

Test build #56197 has finished for PR 12435 at commit e8c14d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * should set a reasonable `fileCleanupDelayMs`. We will wait until then so that the compaction
 * file is guaranteed to be visible for all readers.
 */
private val fileCleanupDelayMs = sqlContext.getConf(SQLConf.FILE_SINK_LOG_CLEANUP_DELAY)

Contributor (@steveloughran):

all AWS S3 endpoints now implement create consistency: if a new object is created, then a GET made directly on it will return that object.

What can take time to appear is the aggregate file in an ls of the parent "directory", which is really a wildcard match on the path. If the processes can determine the final name of the compaction file, they can look for that file directly (getFileStatus() should suffice; open() is even better). If the compact file isn't found, they can fall back to the non-aggregate files. All that should be required is that the aggregate file is fully written (with a close() at the end of the output operation which doesn't discard any raised exception) before the original files are deleted. Adding a minor delay is a low-harm feature, but a direct check for the aggregate file is something which should be done first.
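
For example, a direct probe with the standard Hadoop FileSystem API might look like the sketch below (an illustration of the suggestion above; the helper name and the fallback behaviour are assumptions, not code from this PR):

import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileSystem, Path}

def compactionFileVisible(fs: FileSystem, compactFile: Path): Boolean = {
  try {
    // getFileStatus() issues a direct request against the known path, so it
    // does not depend on an eventually consistent directory listing.
    fs.getFileStatus(compactFile)
    true
  } catch {
    case _: FileNotFoundException => false // fall back to the per-batch files
  }
}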

Member Author (@zsxwing):

@steveloughran thanks for pointing that out. I updated the code. It now tries to access the next compaction/aggregate file directly. However, a cleanup delay is still helpful to avoid a livelock.

SparkQA commented Apr 19, 2016

Test build #56242 has finished for PR 12435 at commit c6a10e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val FILE_SINK_LOG_CLEANUP_DELAY =
  SQLConfigBuilder("spark.sql.streaming.fileSink.log.cleanupDelay")
    .internal()
    .doc("How long in milliseconds a file is guaranteed to be visible for all readers.")

Contributor:

Why do we need this? I thought the plan was to use optimistic concurrency control (i.e. just retry if there is a FileNotFoundException).

Member Author (@zsxwing):

> Why do we need this? I thought the plan was to use optimistic concurrency control (i.e. just retry if there is a FileNotFoundException).

See my comments here: https://github.com/apache/spark/pull/12435/files#diff-e529f046ee04b9926e8dd88e131134e5R61

Contributor (@steveloughran):

Ignore S3; look at S3N in Hadoop 2.4. Sadly, it doesn't either; I didn't fix that till 2.5 & HADOOP-9361/HADOOP-9597. Hadoop 2.4 s3n is broken in other ways; look at HADOOP-10457.

To summarise: don't use s3n in Hadoop 2.4; it was the first update to a later Jets3t library and was under-tested. 2.5 fixed it, and 2.6.0 added s3a, though that's not ready for use in 2.7.

Best to do a check for existence up front (getFileStatus()), which works everywhere.

SparkQA commented Apr 20, 2016

Test build #56289 has finished for PR 12435 at commit 7eec0c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SinkFileStatus(

SparkQA commented Apr 20, 2016

Test build #56380 has finished for PR 12435 at commit e2cd25c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor):

Thanks, merging to master!

@asfgit asfgit closed this in 7bc9485 Apr 20, 2016
  throw new IllegalStateException("Incomplete log file")
}
val version = lines(0)
if (version != VERSION) {

Contributor:

Should this be 'version > VERSION'?

Member Author (@zsxwing):

> Should this be 'version > VERSION'?

It doesn't matter now. This is the first version. We will update the logic here when we add a new format in the future.
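
For reference, a forward-compatible version of that check could look like the sketch below (an illustration only: it assumes the version line carries an integer after a "v" prefix, and the constant name VERSION_NUMBER is invented here):

val logVersion = lines(0).stripPrefix("v").toInt
if (logVersion > VERSION_NUMBER) {
  // Written by a newer Spark version; this reader cannot safely interpret it.
  throw new IllegalStateException(
    s"Unsupported log file version v$logVersion; max supported is v$VERSION_NUMBER")
}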

@zsxwing zsxwing deleted the sink-log branch April 20, 2016 21:52

@steveloughran (Contributor):

@zsxwing if you have q's about the quirks of s3* APIs and endpoints, feel free to email me direct, stevel @ hortonworks.
