[SPARK-53732][SQL] Remember TimeTravelSpec in DataSourceV2Relation #52599
Conversation
```diff
@@ -229,6 +229,12 @@ object AnalysisContext {
     set(context)
     try f finally { set(originContext) }
   }

+  private[sql] def withAnalysisContext[A](context: AnalysisContext)(f: => A): A = {
```
Needed for testing below.
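A minimal sketch of the kind of test this hook enables; the resolution call and the context construction are assumptions for illustration, not the actual test code:

```scala
// Install a custom AnalysisContext for the duration of the block; the previous
// context is restored afterwards (per the try/finally above).
val context = AnalysisContext.get // in a real test: a copy with modified fields
val plan = AnalysisContext.withAnalysisContext(context) {
  relationResolution.resolveRelation(unresolvedRelation) // hypothetical call under test
}
```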
```diff
@@ -189,6 +191,7 @@ class RelationResolution(override val catalogManager: CatalogManager)

       case table =>
         if (isStreaming) {
+          assert(timeTravelSpec.isEmpty, "time travel is not allowed in streaming")
```
It should be impossible to reach this line with a valid time travel spec. Just a sanity check.
```diff
-case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec
-case class AsOfVersion(version: String) extends TimeTravelSpec
+case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec {
+  override def toString: String = s"TIMESTAMP AS OF $timestamp"
```
Needed for proper simpleString implementation in DataSourceV2Relation. See tests below.
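Presumably the version-based spec gets the symmetric treatment; this sketch is inferred from the SQL time travel syntax rather than copied from the diff:

```scala
case class AsOfVersion(version: String) extends TimeTravelSpec {
  // mirror AsOfTimestamp so plan strings read like the originating SQL clause
  override def toString: String = s"VERSION AS OF $version"
}
```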
```diff
@@ -65,7 +66,12 @@ abstract class DataSourceV2RelationBase(
   override def skipSchemaResolution: Boolean = table.supports(TableCapability.ACCEPT_ANY_SCHEMA)

   override def simpleString(maxFields: Int): String = {
-    s"RelationV2${truncatedString(output, "[", ", ", "]", maxFields)} $name"
+    val outputString = truncatedString(output, "[", ", ", "]", maxFields)
```
Covered with tests.
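One plausible continuation of the hunk, assuming the relation now carries a `timeTravelSpec: Option[TimeTravelSpec]`; the exact formatting is a guess:

```scala
override def simpleString(maxFields: Int): String = {
  val outputString = truncatedString(output, "[", ", ", "]", maxFields)
  // append the remembered spec so time travel relations are distinguishable in plans
  timeTravelSpec match {
    case Some(spec) => s"RelationV2$outputString $name $spec"
    case None => s"RelationV2$outputString $name"
  }
}
```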
```diff
@@ -66,6 +66,19 @@ class BasicInMemoryTableCatalog extends TableCatalog {
     }
   }

+  def pinTable(ident: Identifier, version: String): Unit = {
```
Used in time travel tests.
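A sketch of how such a test might look; the table name, identifier, and expected rows are illustrative:

```scala
// Pin the current state under version "v1", mutate the table, then travel back.
sql("CREATE TABLE testcat.tab (id INT)")
sql("INSERT INTO testcat.tab VALUES (1)")
catalog.pinTable(Identifier.of(Array(), "tab"), "v1") // catalog: BasicInMemoryTableCatalog
sql("INSERT INTO testcat.tab VALUES (2)")
checkAnswer(sql("SELECT * FROM testcat.tab VERSION AS OF 'v1'"), Row(1))
```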
```diff
@@ -211,15 +218,29 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
       plan: LogicalPlan,
       cascade: Boolean,
       blocking: Boolean): Unit = {
-    uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
+    EliminateSubqueryAliases(plan) match {
```
Added this branch to avoid changes in behavior. Some connectors (like Iceberg) use these methods for their custom commands, so I wanted to stay on the safe side and keep the old behavior for these calls. That is, if any of these methods are called with a DataSourceV2Relation without a time travel spec, we will invalidate all cache entries (including time travel) like before.
So when r.timeTravelSpec.isEmpty, plan1.sameResult(plan2) won't work with DSv2? Could you add a comment in the code?
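The requested comment could spell the branching out roughly like this; a sketch of the behavior described above, not the merged code (`refersToSameTable` is a hypothetical predicate):

```scala
EliminateSubqueryAliases(plan) match {
  case r: DataSourceV2Relation if r.timeTravelSpec.isEmpty =>
    // Legacy semantics: a plain DSv2 relation (as passed by connectors such as
    // Iceberg) drops every cache entry for the table, including time travel
    // entries that _.sameResult(plan) would no longer match.
    uncacheByCondition(spark, cached => refersToSameTable(cached, r), cascade, blocking)
  case _ =>
    uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
}
```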
```scala
  /**
   * Re-caches all cache entries that reference the given table name.
   */
  def recacheByTableName(
```
I will need this in subsequent PRs.
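An illustrative call site for those follow-ups; the parameter list is assumed from the truncated signature above:

```scala
// After a command commits, refresh every cached plan that reads the table.
spark.sharedState.cacheManager.recacheByTableName(spark, Seq("testcat", "ns", "tab"))
```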
Thank you for pinging me, @aokolnychyi.
```diff
@@ -748,6 +748,13 @@ class SessionCatalog(
     getRawLocalOrGlobalTempView(toNameParts(name)).map(getTempViewPlan)
   }

+  /**
```
An overloaded version that takes name parts directly instead of a TableIdentifier.
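So the overload presumably looks like this; the return type is assumed from the `TableIdentifier` variant above:

```scala
def getLocalOrGlobalTempView(nameParts: Seq[String]): Option[LogicalPlan] = {
  // same lookup as the TableIdentifier variant, minus the name conversion
  getRawLocalOrGlobalTempView(nameParts).map(getTempViewPlan)
}
```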
```diff
@@ -810,20 +811,12 @@ class Catalog(sparkSession: SparkSession) extends catalog.Catalog {
    * @since 2.0.0
    */
   override def uncacheTable(tableName: String): Unit = {
-    // We first try to parse `tableName` to see if it is 2 part name. If so, then in HMS we check
```
I needed to migrate this logic to uncaching by name. I feel like the try/catch was redundant. Keep in mind that `spark.table(tableName)` internally calls `parseMultipartIdentifier`, so this is no different from the old implementation. I see that `getLocalOrGlobalTempView` already handles multi-part names correctly.
It would be great to have another pair of eyes on this one, though.
```diff
@@ -214,12 +221,33 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
     uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
   }

-  def uncacheTableOrView(spark: SparkSession, name: Seq[String], cascade: Boolean): Unit = {
+  private[sql] def uncacheRelation(
```
This is used in UncacheTableExec. I kept it private as we may migrate that node to use ResolvedIdentifier instead of a proper relation. That said, I'd do that in a separate PR.
We probably need to keep it forever, as UNCACHE TABLE works for temp views as well, which we need to uncache by an arbitrary plan.
But this is very specific to UNCACHE TABLE; shall we move the pattern match into UncacheTableExec?
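Roughly what that suggestion would look like; a sketch, with the `uncacheRelation` signature assumed from the hunk above:

```scala
case class UncacheTableExec(relation: LogicalPlan, cascade: Boolean) extends LeafV2CommandExec {
  override def run(): Seq[InternalRow] = {
    relation match {
      // DSv2 relations go through the spec-aware path; anything else
      // (e.g. a temp view plan) keeps uncaching by arbitrary plan.
      case r: DataSourceV2Relation =>
        session.sharedState.cacheManager.uncacheRelation(session, r, cascade)
      case _ =>
        session.sharedState.cacheManager.uncacheQuery(session, relation, cascade)
    }
    Nil
  }
  override def output: Seq[Attribute] = Nil
}
```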
```scala
// use CommandUtils for all tables (including with 3+ part names)
val nameParts = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)
sessionCatalog.getLocalOrGlobalTempView(nameParts).map(uncacheView).getOrElse {
  CommandUtils.uncacheTableOrView(sparkSession, tableName)
```
nit: we can pass the parsed name parts.
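That is, something like the following; a sketch assuming `CommandUtils.uncacheTableOrView` has (or gains) a name-parts overload:

```scala
sessionCatalog.getLocalOrGlobalTempView(nameParts).map(uncacheView).getOrElse {
  CommandUtils.uncacheTableOrView(sparkSession, nameParts) // reuse the parsed name
}
```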
```scala
}

// recaches all cache entries (including time travel) for the given table
```
I can understand this for REFRESH TABLE and REPLACE TABLE, where we need to fully refresh all caches of the given table in case of dramatic table changes. But other DDL commands just move the state of the table forward, same as DML commands, so why do we treat them differently?
```scala
// create first unresolved relation with a plan ID
val unresolved1 = UnresolvedRelation(Seq("testcat", "tab"))
val planId1 = 12345L
unresolved1.setTagValue(LogicalPlan.PLAN_ID_TAG, planId1)
```
why is PLAN_ID_TAG related to this PR?
I modified that piece of code, so I wanted to make sure this works too. I made the test time-travel specific.
```scala
// use CommandUtils for all tables (including with 3+ part names)
val nameParts = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)
sessionCatalog.getLocalOrGlobalTempView(nameParts).map(uncacheView).getOrElse {
  sparkSession.table(tableName) // throws an error if the table doesn't exist
```
This line is here to preserve the original behavior.
```scala
val tableProperties = CatalogV2Util.convertTableProperties(tableSpec)
```

```scala
override protected def run(): Seq[InternalRow] = {
  if (catalog.tableExists(ident)) {
    val table = catalog.loadTable(ident)
```
The load seems redundant now. Thoughts?
yea we can remove it now
```scala
assert(cacheManager.lookupCachedData(spark, v2Relation).isDefined)

// uncache using DataSourceV2Relation directly
sql(s"UNCACHE TABLE $t")
```
I can't connect the dots between this SQL UNCACHE TABLE and "uncache using DataSourceV2Relation directly".
BTW, this test case looks almost the same as the next test case, "uncache DSv2 table using SQL".
Sorry, I meant to test CommandUtils.uncacheTable() with DataSourceV2Relation. Updated.
```scala
  isDistributionStrictlyRequired,
  numRowsPerSplit)

copiedTable.dataMap.synchronized {
```
nit: does this help? We don't return copiedTable yet, so there's no chance this code conflicts with another access of copiedTable's dataMap.
Also curious if we can just use super.copy() and set fields instead of constructing InMemoryTable?
You are right, we should sync on the dataMap of the table that's being copied.
I am not sure I got the idea of using super.copy(), as this isn't a case class, and if it were, we would have to deep clone all fields to ensure the copy is independent.
I updated the synchronization piece.
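The updated shape, per the thread above; a simplified sketch of the InMemoryTable internals:

```scala
// Lock the *source* table's dataMap while copying: copiedTable is not published
// yet, so only the map being read needs protection from concurrent writers.
dataMap.synchronized {
  copiedTable.dataMap ++= dataMap
}
```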
thanks, merging to master!
### What changes were proposed in this pull request?

This PR adds `TimeTravelSpec` to `DataSourceV2Relation` when the relation is created by time traveling.

### Why are the changes needed?

These changes are needed for subsequent PRs where I will modify Spark to reload certain tables to ensure consistent version scanning and DELETE, UPDATE, and MERGE isolation. Without this change, Spark loses track of whether a relation points to the current version of the table or to an earlier one. As an engine, Spark must be aware of whether a relation is the result of time traveling or points to the current table version/snapshot.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR comes with tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52599 from aokolnychyi/spark-53732-v2.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
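For illustration, with a catalog that supports versioning, the remembered spec now surfaces in plan strings (the printed output below is illustrative):

```scala
val df = spark.sql("SELECT * FROM testcat.tab VERSION AS OF '1'")
println(df.queryExecution.optimizedPlan.simpleString(10))
// e.g. RelationV2[id#0] testcat.tab VERSION AS OF 1
```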
Referenced by a downstream Gluten change that adapts to this PR as part of Spark 4.1 support:

| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml<br>shims/pom.xml<br>shims/spark41/pom.xml<br>shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider<br>shims/spark41/.../spark41/Spark41Shims.scala<br>shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec | shims/spark41/.../v2/AbstractBatchScanExec.scala<br>shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test | gluten-substrait/.../ExpressionConverter.scala<br>gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to QueryExecution.createSparkPlan interface change | gluten-substrait/.../GlutenImplicits.scala<br>shims/spark*/.../QueryExecutionShim.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to DataSourceV2Relation interface change | backends-velox/.../ArrowConvertorRule.scala<br>shims/spark*/.../v2/DataSourceV2RelationShim.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Using new interface of ParquetFooterReader | backends-velox/.../ParquetMetadataUtils.scala<br>gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala<br>shims/spark*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'` | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support OneRowRelationExec plan Spark 4.1 | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude split test in VeloxStringFunctionsSuite | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Only Run ArrowEvalPythonExecSuite tests up to Spark 4.0, we need update ci python to 3.10 | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |