
Disable Spark Catalog caching for integration tests #501

Conversation

kevinjqliu
Contributor

Fix #482

While working on #444, I couldn't get the integration tests to pass: using spark.sql to count the number of data files returned the wrong result. Stranger still, Python's Iceberg table state did not match Spark's Iceberg table state.

This PR adds a simple test to verify that the Python Iceberg table's snapshot ID matches the Spark Iceberg table's snapshot ID.
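For context, a rough sketch of what such a check might look like (the fixture names mirror the existing integration tests, but the table identifier, the Spark catalog name `integration`, and the exact calls are illustrative rather than the code added here):

```python
# Hypothetical sketch: write from Python, then compare snapshot IDs through Spark SQL.
# Assumes a "default" namespace exists and Spark's Iceberg catalog is registered as "integration".
import pyarrow as pa
import pytest
from pyspark.sql import SparkSession

from pyiceberg.catalog import Catalog


@pytest.mark.integration
def test_multiple_spark_sql_queries(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table
) -> None:
    identifier = "default.test_multiple_spark_sql_queries"
    tbl = session_catalog.create_table(identifier, schema=arrow_table_with_null.schema)

    # Commit a snapshot from the Python side.
    tbl.append(arrow_table_with_null)
    python_snapshot_id = tbl.current_snapshot().snapshot_id

    # Read the latest snapshot ID back through Spark's snapshots metadata table.
    spark_snapshot_id = (
        spark.sql(
            f"SELECT snapshot_id FROM integration.{identifier}.snapshots ORDER BY committed_at DESC LIMIT 1"
        )
        .collect()[0]
        .snapshot_id
    )

    # With catalog caching enabled, Spark can return a stale snapshot here and this assertion fails.
    assert python_snapshot_id == spark_snapshot_id
```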

The culprit is the spark.sql.catalog.catalog-name.cache-enabled setting, which defaults to true and caches table metadata.
Spark SQL calls then read the cached Iceberg metadata instead of the updated metadata.

spark.sql.catalog.catalog-name.cache-enabled (true or false): whether to enable catalog caching; the default value is true.

https://iceberg.apache.org/docs/latest/spark-configuration/#catalog-configuration
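Concretely, the fix is to turn that setting off in the SparkSession the integration tests use. A minimal sketch, assuming a REST catalog registered under the name `integration` (the catalog name, URI, and other settings are illustrative, not necessarily the exact test configuration):

```python
# Sketch of a Spark session for the integration tests with Iceberg catalog caching disabled.
# The catalog name "integration" and the REST endpoint are assumptions, not the exact config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pyiceberg-integration-tests")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.integration", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.integration.type", "rest")
    .config("spark.sql.catalog.integration.uri", "http://localhost:8181")
    # Disable metadata caching so Spark SQL always sees snapshots committed from Python.
    .config("spark.sql.catalog.integration.cache-enabled", "false")
    .getOrCreate()
)
```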

@kevinjqliu kevinjqliu changed the title Fix issuing multiple spark sql queries in tests Fix issue with running multiple spark sql queries in the same integration test Mar 6, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Fix issue with running multiple spark sql queries in the same integration test Disable Spark Catalog caching for integration tests Mar 6, 2024
@amogh-jahagirdar
Contributor

Thanks @kevinjqliu, I think this change makes sense. I don't think there's ever a reason on the Python side where we want Spark caching enabled. On the Iceberg Java side we already have tests which validate the caching-catalog behavior when it's enabled/disabled, so we don't need to test that through PyIceberg (I think). I've triggered CI; if it passes, I'll go ahead and merge.

@@ -355,6 +355,26 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


@pytest.mark.integration
def test_multiple_spark_sql_queries(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
Contributor

Nit: I think a better name would be test_python_writes_with_spark_snapshot_reads or something more specific than what it currently is. It's more verbose, but I think it captures the goal of the test better.

Contributor Author

Makes sense, added!

@sungwy
Collaborator

sungwy commented Mar 7, 2024

Great idea @kevinjqliu! Thanks for adding this.

@amogh-jahagirdar
Contributor

Sweet, thanks @kevinjqliu! I'm going to go ahead and merge this now.

@amogh-jahagirdar amogh-jahagirdar merged commit e56326d into apache:main Mar 7, 2024
7 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/fix-multiple-spark-sql-queries branch March 7, 2024 01:10