Commit 70b606f
[SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size
### What changes were proposed in this pull request?

This PR makes the input buffer size configurable (as an internal configuration). This is mainly to work around the regression in uniVocity/univocity-parsers#449. This is particularly useful for SQL workloads that require rewriting the `CREATE TABLE` with options.

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it is an internal-only option.

### How was this patch tested?

Manually tested by modifying the unit test added in #31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79..705f38dbfbd 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -2456,6 +2456,7 @@ abstract class CSVSuite
   test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
     val bufSize = 128
     val line = "X" * (bufSize - 1) + "| |"
+    spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128)
     withTempPath { path =>
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
```

Closes #32231 from HyukjinKwon/SPARK-35045-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent a74f601 commit 70b606f

File tree: 2 files changed (+11, -0)

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala (1 addition & 0 deletions)

```diff
@@ -212,6 +212,7 @@ class CSVOptions(
   val lineSeparatorInWrite: Option[String] = lineSeparator
 
   val inputBufferSize: Option[Int] = parameters.get("inputBufferSize").map(_.toInt)
+    .orElse(SQLConf.get.getConf(SQLConf.CSV_INPUT_BUFFER_SIZE))
 
   /**
   * The handling method to be used when unescaped quotes are found in the input.
```
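The one-line change above encodes the resolution order described in the PR: the per-read CSV option `inputBufferSize` takes priority, and the session conf `spark.sql.csv.parser.inputBufferSize` is consulted only as a fallback. A minimal sketch of that precedence in plain Scala (no Spark dependency; `parameters` and `sessionConf` are hypothetical stand-ins for CSVOptions' option map and SQLConf):

```scala
// Sketch of the option-vs-conf precedence: the CSV read option wins,
// the session conf is only a fallback, and neither being set yields None
// (so the parser library's default buffer size applies).
object InputBufferSizeResolution {
  def resolve(
      parameters: Map[String, String],
      sessionConf: Map[String, String]): Option[Int] =
    parameters.get("inputBufferSize").map(_.toInt)
      .orElse(sessionConf.get("spark.sql.csv.parser.inputBufferSize").map(_.toInt))

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.sql.csv.parser.inputBufferSize" -> "128")
    println(resolve(Map("inputBufferSize" -> "1024"), conf)) // option wins: Some(1024)
    println(resolve(Map.empty, conf))                        // conf fallback: Some(128)
    println(resolve(Map.empty, Map.empty))                   // neither set: None
  }
}
```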

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (10 additions & 0 deletions)

```diff
@@ -2453,6 +2453,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val CSV_INPUT_BUFFER_SIZE = buildConf("spark.sql.csv.parser.inputBufferSize")
+    .internal()
+    .doc("If it is set, it configures the buffer size of CSV input during parsing. " +
+      "It is the same as inputBufferSize option in CSV which has a higher priority. " +
+      "Note that this is a workaround for the parsing library's regression, and this " +
+      "configuration is internal and supposed to be removed in the near future.")
+    .version("3.0.3")
+    .intConf
+    .createOptional
+
   val REPL_EAGER_EVAL_ENABLED = buildConf("spark.sql.repl.eagerEval.enabled")
     .doc("Enables eager evaluation or not. When true, the top K rows of Dataset will be " +
       "displayed if and only if the REPL supports the eager evaluation. Currently, the " +
```
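For the SQL workloads the description mentions, where rewriting every `CREATE TABLE ... OPTIONS` clause is impractical, this conf could presumably be set once at the session level instead. A hypothetical usage sketch (the value 128 is illustrative, matching the modified test above):

```sql
-- Set the internal fallback conf for the whole session rather than adding
-- an inputBufferSize OPTION to each CSV table definition.
SET spark.sql.csv.parser.inputBufferSize=128;
```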
