
Disable Spark Catalog caching for integration tests #501

Conversation

kevinjqliu
Contributor

Fix #482

While working on #444, I couldn't get the integration tests to pass: using spark.sql to count the number of data files returned the wrong result. Stranger still, Python's Iceberg table state did not match Spark's Iceberg table state.

This PR adds a simple test to verify that the Python Iceberg table's snapshot ID matches the Spark Iceberg table's snapshot ID.
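For context, a rough sketch of what such a check might look like (the fixture names mirror the existing integration tests, but the table identifier, the Spark catalog name `integration`, and the exact calls are illustrative rather than the code added here):

```python
# Hypothetical sketch: write from Python, then compare snapshot IDs through Spark SQL.
# Assumes a "default" namespace exists and Spark's Iceberg catalog is registered as "integration".
import pyarrow as pa
import pytest
from pyspark.sql import SparkSession

from pyiceberg.catalog import Catalog


@pytest.mark.integration
def test_multiple_spark_sql_queries(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table
) -> None:
    identifier = "default.test_multiple_spark_sql_queries"
    tbl = session_catalog.create_table(identifier, schema=arrow_table_with_null.schema)

    # Commit a snapshot from the Python side.
    tbl.append(arrow_table_with_null)
    python_snapshot_id = tbl.current_snapshot().snapshot_id

    # Read the latest snapshot ID back through Spark's snapshots metadata table.
    spark_snapshot_id = (
        spark.sql(
            f"SELECT snapshot_id FROM integration.{identifier}.snapshots ORDER BY committed_at DESC LIMIT 1"
        )
        .collect()[0]
        .snapshot_id
    )

    # With catalog caching enabled, Spark can return a stale snapshot here and this assertion fails.
    assert python_snapshot_id == spark_snapshot_id
```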

The culprit is the spark.sql.catalog.catalog-name.cache-enabled setting, which defaults to true and caches table metadata.
Spark SQL calls then read the cached Iceberg metadata instead of the updated metadata.

spark.sql.catalog.catalog-name.cache-enabled (true or false): whether to enable catalog caching; the default value is true.

https://iceberg.apache.org/docs/latest/spark-configuration/#catalog-configuration
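Concretely, the fix is to turn that setting off in the SparkSession the integration tests use. A minimal sketch, assuming a REST catalog registered under the name `integration` (the catalog name, URI, and other settings are illustrative, not necessarily the exact test configuration):

```python
# Sketch of a Spark session for the integration tests with Iceberg catalog caching disabled.
# The catalog name "integration" and the REST endpoint are assumptions, not the exact config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pyiceberg-integration-tests")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.integration", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.integration.type", "rest")
    .config("spark.sql.catalog.integration.uri", "http://localhost:8181")
    # Disable metadata caching so Spark SQL always sees snapshots committed from Python.
    .config("spark.sql.catalog.integration.cache-enabled", "false")
    .getOrCreate()
)
```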

@kevinjqliu kevinjqliu changed the title Fix issuing multiple spark sql queries in tests Fix issue with running multiple spark sql queries in the same integration test Mar 6, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Fix issue with running multiple spark sql queries in the same integration test Disable Spark Catalog caching for integration tests Mar 6, 2024
@amogh-jahagirdar
Contributor

Thanks @kevinjqliu, I think this change makes sense. I don't think there's ever a reason on the Python side where we want Spark caching enabled. On the Iceberg Java side we already have tests which validate the caching-catalog behavior when it's enabled/disabled, so we don't need to test that through PyIceberg (I think). I've triggered CI; if it passes, I'll go ahead and merge.

@@ -355,6 +355,26 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


@pytest.mark.integration
def test_multiple_spark_sql_queries(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
Contributor

Nit: I think a better name would be test_python_writes_with_spark_snapshot_reads or something more specific than what it currently is. It's more verbose, but I think it captures the goal of the test better.

Contributor Author

Makes sense, added!

@sungwy
Collaborator

sungwy commented Mar 7, 2024

Great idea @kevinjqliu! Thanks for adding this.

@amogh-jahagirdar
Contributor

Sweet, thanks @kevinjqliu! I'm going to go ahead and merge this now.

@amogh-jahagirdar amogh-jahagirdar merged commit e56326d into apache:main Mar 7, 2024
7 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/fix-multiple-spark-sql-queries branch March 7, 2024 01:10