
[SPARK-53183][SQL] Use Java Files.readString instead of o.a.s.sql.catalyst.util.fileToString #51911

Closed
dongjoon-hyun wants to merge 1 commit into apache:master from dongjoon-hyun:SPARK-53183

Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Aug 7, 2025

What changes were proposed in this pull request?

This PR aims to use the Java 11+ java.nio.file.Files.readString instead of o.a.s.sql.catalyst.util.fileToString. In other words, this PR removes Spark's fileToString method from the Spark code base.

Why are the changes needed?

Files.readString has existed since Java 11, so we no longer need to maintain the fileToString method. Note that Apache Spark always uses the default value of encoding, UTF-8.

def fileToString(file: File, encoding: Charset = UTF_8): String = {
  val inStream = new FileInputStream(file)
  try {
    new String(ByteStreams.toByteArray(inStream), encoding)
  } finally {
    inStream.close()
  }
}
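A minimal, self-contained sketch of the replacement call (the temp file and its contents are illustrative, not part of the PR):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringExample {
    public static void main(String[] args) throws IOException {
        // Illustrative temp file; the PR itself only rewrites call sites.
        Path path = Files.createTempFile("spark-53183-", ".txt");
        Files.writeString(path, "hello, spark");

        // Files.readString (Java 11+) reads the whole file and decodes it as
        // UTF-8 by default, matching fileToString's default Charset of UTF_8.
        String content = Files.readString(path);
        System.out.println(content); // prints "hello, spark"

        Files.delete(path);
    }
}
```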

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

BEFORE

$ git grep fileToString | wc -l
      22

AFTER

$ git grep fileToString | wc -l
       0

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 7, 2025
@dongjoon-hyun dongjoon-hyun force-pushed the SPARK-53183 branch 2 times, most recently from 0a1c44a to e13c675 on August 7, 2025 at 22:02
@dongjoon-hyun
Member Author

Hi, @zhengruifeng . Could you review this method-replacement PR if you have some time, please?

@dongjoon-hyun
Member Author

Could you review this PR when you have some time, @yaooqinn ?

Member

@yaooqinn yaooqinn left a comment

LGTM

@dongjoon-hyun
Member Author

Thank you so much. Roughly, on the first run, the Java version is faster.

$ bin/spark-shell --driver-memory 12G
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-preview1
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 21.0.8)
Type in expressions to have them evaluated.
Type :help for more information.
25/08/07 19:59:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1754621949360).
Spark session available as 'spark'.

scala> spark.time(org.apache.spark.sql.catalyst.util.fileToString(new java.io.File("/tmp/1G.bin")).length)
Time taken: 523 ms
val res0: Int = 1073741824
$ bin/spark-shell --driver-memory 12G
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-preview1
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 21.0.8)
Type in expressions to have them evaluated.
Type :help for more information.
25/08/07 19:59:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1754621942077).
Spark session available as 'spark'.

scala> spark.time(java.nio.file.Files.readString(java.nio.file.Path.of("/tmp/1G.bin")).length)
Time taken: 339 ms
val res0: Int = 1073741824
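The same comparison can be sketched in plain Java, without a Spark shell. Since fileToString depends on Guava's ByteStreams, the stream-and-decode path is approximated here with InputStream.readAllBytes; the 10 MB temp file is an assumption for a quick local run, and timings will vary by machine:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadBenchmark {
    // Mirrors the removed fileToString: read all bytes via a stream, then decode.
    static String streamToString(Path path) throws IOException {
        try (FileInputStream in = new FileInputStream(path.toFile())) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("bench-", ".txt");
        Files.writeString(path, "x".repeat(10_000_000)); // 10 MB ASCII sample

        long t0 = System.nanoTime();
        int a = streamToString(path).length();
        long t1 = System.nanoTime();
        int b = Files.readString(path).length();
        long t2 = System.nanoTime();

        System.out.println("stream: " + a + " chars in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("readString: " + b + " chars in " + (t2 - t1) / 1_000_000 + " ms");
        Files.delete(path);
    }
}
```

Both paths should report the same character count; only the timings differ.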

@dongjoon-hyun
Member Author

The test failure is a known flaky one. Merged to master.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-53183 branch August 8, 2025 03:01
@zhengruifeng
Contributor

Late LGTM

@dongjoon-hyun
Member Author

Thank you, @zhengruifeng .

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 22, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 23, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 23, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 24, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 25, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 25, 2025
baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 29, 2025
…reading in GlutenSQLQueryTestSuite

see apache/spark#51911 which removes Spark's fileToString method from Spark code base.
baibaichen added a commit to apache/incubator-gluten that referenced this pull request Dec 31, 2025
* [Scala 2.13][IntelliJ] Remove suppression for lint-multiarg-infix warnings in pom.xml

see apache/spark#43332

* [Scala 2.13][IntelliJ] Suppress warning for `ContentFile::path`

* [Scala 2.13][IntelliJ] Suppress warning for ContextAwareIterator initialization

* [Scala 2.13][IntelliJ] Refactor to use Symbol for column references to fix compilation error in Scala 2.13 with IntelliJ compiler: symbol literal is deprecated; use Symbol("i")

* [Fix] Replace deprecated fileToString with Files.readString for file reading in GlutenSQLQueryTestSuite

see apache/spark#51911 which removes Spark's fileToString method from Spark code base.

* [Scala 2.13][IntelliJ] Update the Java compiler release version from 8 to `${java.version}` in the Scala 2.13 profiler to align it with `maven.compiler.target`

* [Refactor] Replace usage of `Symbol` with `col` for column references to align with Spark API best practices

---------

Co-authored-by: Chang chen <chenchang@apache.com>
QCLyu pushed a commit to QCLyu/incubator-gluten that referenced this pull request Jan 8, 2026