Conversation

kevinjqliu (Contributor) commented on Sep 21, 2025:

Rationale for this change

Closes #2492

Run pytest against Spark Connect for a more consistent test environment; see the sketch below.

  • a few general cleanup changes
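
For context, the Spark Connect wiring roughly looks like the sketch below (a minimal illustration, not this PR's actual conftest; the fixture name and the sc://localhost:15002 address are assumptions based on Spark Connect's default port):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Attach to the Spark Connect server running in the compose stack
    # instead of spinning up a local JVM; no local Spark install needed.
    return SparkSession.builder.remote("sc://localhost:15002").getOrCreate()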

Are these changes tested?

Are there any user-facing changes?

kevinjqliu force-pushed the kevinjqliu/clean-up-spark branch 2 times, most recently from 39bd08e to 6ac69bc on September 21, 2025 16:17
Comment on lines -40 to -45
# Hive/metastore files
metastore_db/

# Spark/metastore files
spark-warehouse/
derby.log
kevinjqliu (author):

No longer needed since we no longer run Spark and the metastore locally.

  CLEANUP_COMMAND = echo "Keeping containers running for debugging (KEEP_COMPOSE=1)"
else
- CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans 2>/dev/null || true
+ CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans --timeout 0 2>/dev/null || true
kevinjqliu (author):

Don't wait for graceful container shutdown on docker compose down; --timeout 0 makes teardown more responsive.

Comment on lines -36 to -37
- 8888:8888
- 8080:8080
kevinjqliu (author):

Removed port 8888, which was previously used for notebooks, and replaced the Spark master web UI (8080) with the Spark application UI (4040).

Comment on lines +40 to +45
ENV SCALA_VERSION=2.12
ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_${SCALA_VERSION}
ENV ICEBERG_VERSION=1.9.2
ENV PYICEBERG_VERSION=0.10.0
ENV HADOOP_VERSION=3.3.4
ENV AWS_SDK_VERSION=1.12.753
kevinjqliu (author):

Copied over from tests/conftest; these jars were originally downloaded client-side.
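
Roughly, these variables compose into the standard Maven Central coordinates for the Iceberg Spark runtime jar; the Python below only illustrates the naming scheme, it is not code from this PR:

SCALA_VERSION = "2.12"
ICEBERG_SPARK_RUNTIME_VERSION = f"3.5_{SCALA_VERSION}"
ICEBERG_VERSION = "1.9.2"

# -> iceberg-spark-runtime-3.5_2.12-1.9.2.jar
jar = f"iceberg-spark-runtime-{ICEBERG_SPARK_RUNTIME_VERSION}-{ICEBERG_VERSION}.jar"
url = (
    "https://repo1.maven.org/maven2/org/apache/iceberg/"
    f"iceberg-spark-runtime-{ICEBERG_SPARK_RUNTIME_VERSION}/{ICEBERG_VERSION}/{jar}"
)

Baking the jars into the image at build time means each client no longer downloads them on first run.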

Fokko (contributor):

Nice, this is much better 👍

Comment on lines +37 to +43
# Configure Spark's default session catalog (spark_catalog) to use Iceberg backed by the Hive Metastore
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.spark_catalog.uri thrift://hive:9083
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.sql.catalogImplementation hive
spark.sql.warehouse.dir s3a://warehouse/hive/
kevinjqliu (author):

spark_catalog is primarily used by the test_migrate_table test. It calls <catalog>.system.snapshot, which requires spark_catalog:

CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')
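
A minimal sketch of how a test might invoke the procedure (assuming a SparkSession fixture like the one above; the helper name and table identifiers are hypothetical):

from pyspark.sql import SparkSession

def snapshot_table(spark: SparkSession, src_table_identifier: str, dst_table_identifier: str) -> None:
    # Iceberg's snapshot procedure resolves through the session catalog,
    # which is why spark_catalog must be a SparkSessionCatalog here.
    spark.sql(f"CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')")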

Fokko (contributor):

It requires the SparkSessionCatalog 👍

@pytest.mark.integration
def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
-    identifier = f"default.unpartitioned_table_v{format_version}"
+    identifier = f"default.snapshot_properties_v{format_version}"
kevinjqliu (author):

Renamed to avoid a name conflict with another test above:

def test_add_files_to_unpartitioned_table(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
identifier = f"default.unpartitioned_table_v{format_version}"

Fokko (contributor):

I actually prefer to use the test name in the table name:

Suggested change
- identifier = f"default.snapshot_properties_v{format_version}"
+ identifier = f"default.test_add_files_snapshot_properties_v{format_version}"

This way we can relate the two 👍

# ========================

- PYTEST_ARGS ?= -v # Override with e.g. PYTEST_ARGS="-vv --tb=short"
+ PYTEST_ARGS ?= -v -x # Override with e.g. PYTEST_ARGS="-vv --tb=short"
kevinjqliu (author):

-x so the test run exits immediately when interrupted with ctrl-c:

  -x, --exitfirst       exit instantly on first error or failed test.

https://docs.pytest.org/en/6.2.x/reference.html#command-line-flags

kevinjqliu force-pushed the kevinjqliu/clean-up-spark branch from 6ac69bc to 1c1d75e on September 21, 2025 16:25
kevinjqliu requested a review from Fokko on September 21, 2025 17:04
Fokko (contributor) left a review:

I like this a lot! It consolidates a lot of the configuration, thanks for working on this 👍

kevinjqliu merged commit 513295d into apache:main on Sep 22, 2025
10 checks passed
kevinjqliu deleted the kevinjqliu/clean-up-spark branch on September 22, 2025 16:53