infra: use spark connect to run pytests #2491
Conversation
Force-pushed from 39bd08e to 6ac69bc
# Hive/metastore files
metastore_db/

# Spark/metastore files
spark-warehouse/
derby.log
No longer needed since we no longer run Spark and the metastore locally.
CLEANUP_COMMAND = echo "Keeping containers running for debugging (KEEP_COMPOSE=1)"
else
- CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans 2>/dev/null || true
+ CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans --timeout 0 2>/dev/null || true
Don't wait for docker compose down to finish; --timeout 0 makes teardown more responsive.
- 8888:8888
- 8080:8080
Removed port 8888, which was previously used for notebooks, and replaced the Spark master web UI (8080) with the Spark application UI (4040).
ENV SCALA_VERSION=2.12
ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_${SCALA_VERSION}
ENV ICEBERG_VERSION=1.9.2
ENV PYICEBERG_VERSION=0.10.0
ENV HADOOP_VERSION=3.3.4
ENV AWS_SDK_VERSION=1.12.753
Copied over from tests/conftest; these were originally downloaded client-side.
Nice, this is much better 👍
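For context, a hedged sketch of what the old client-side setup may have looked like: the jars were resolved at session start through spark.jars.packages in tests/conftest. The exact Maven coordinates below are an assumption, not the PR's code.

```python
# Hypothetical reconstruction of the old client-side jar download
# (assumed; the real tests/conftest may have differed).
from pyspark.sql import SparkSession

SCALA_VERSION = "2.12"
ICEBERG_VERSION = "1.9.2"
AWS_SDK_VERSION = "1.12.753"

spark = (
    SparkSession.builder
    # Resolving packages here downloads the jars on the client at
    # session start, which the Dockerfile ENV block above now replaces.
    .config(
        "spark.jars.packages",
        f"org.apache.iceberg:iceberg-spark-runtime-3.5_{SCALA_VERSION}:{ICEBERG_VERSION},"
        f"com.amazonaws:aws-java-sdk-bundle:{AWS_SDK_VERSION}",
    )
    .getOrCreate()
)
```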
# Configure Spark's default session catalog (spark_catalog) to use Iceberg backed by the Hive Metastore
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.spark_catalog.uri thrift://hive:9083
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.sql.catalogImplementation hive
spark.sql.warehouse.dir s3a://warehouse/hive/
spark_catalog is primarily used by the test_migrate_table test, which calls <catalog>.system.snapshot; that procedure requires spark_catalog.
CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')
It requires the SparkSessionCatalog 👍
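A minimal sketch of how such a migrate test exercises this procedure, assuming a spark fixture connected to the cluster; the table identifiers are illustrative, not the PR's exact test:

```python
# Illustrative only: table names are placeholders.
src_table_identifier = "default.source_table"
dst_table_identifier = "default.migrated_table"

# The snapshot procedure resolves the source table through Spark's
# session catalog, which is why spark_catalog must be configured as a
# SparkSessionCatalog backed by the Hive Metastore (see the conf above).
spark.sql(
    f"CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')"
)
```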
tests/integration/test_add_files.py (outdated)
@pytest.mark.integration
def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
-    identifier = f"default.unpartitioned_table_v{format_version}"
+    identifier = f"default.snapshot_properties_v{format_version}"
Name conflict with another test above:
iceberg-python/tests/integration/test_add_files.py, lines 160-161 at 6935b41:

def test_add_files_to_unpartitioned_table(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
    identifier = f"default.unpartitioned_table_v{format_version}"
I actually prefer to use the test name in the table name:

- identifier = f"default.snapshot_properties_v{format_version}"
+ identifier = f"default.test_add_files_snapshot_properties_v{format_version}"
This way we can relate the two 👍
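One way to make this convention systematic (a sketch, not part of the PR; the fixture name is made up) is to derive the identifier from the test name via pytest's built-in request fixture:

```python
import pytest


@pytest.fixture
def table_identifier(request: pytest.FixtureRequest) -> str:
    # request.node.name includes any parametrize suffix, e.g.
    # "test_add_files_snapshot_properties[1]"; strip it to get a
    # valid table name that points back to the owning test.
    base = request.node.name.split("[")[0]
    return f"default.{base}"
```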
# ========================

- PYTEST_ARGS ?= -v # Override with e.g. PYTEST_ARGS="-vv --tb=short"
+ PYTEST_ARGS ?= -v -x # Override with e.g. PYTEST_ARGS="-vv --tb=short"
-x makes the run stop at the first failure, so the suite also exits promptly on Ctrl-C:

-x, --exitfirst: exit instantly on first error or failed test.
https://docs.pytest.org/en/6.2.x/reference.html#command-line-flags
Force-pushed from 6ac69bc to 1c1d75e
I like this a lot! It consolidates a lot of the configuration, thanks for working on this 👍
Rationale for this change
Closes #2492
Run pytest using Spark Connect for a more consistent test environment.
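As a rough illustration of the approach (assumed, not the PR's exact conftest), the test session attaches to the Spark Connect server instead of launching Spark and a metastore locally; sc://localhost:15002 is Spark Connect's default endpoint, and the fixture name is illustrative:

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Connect to the Spark Connect server running in docker compose
    # instead of spinning up a local Spark + metastore per test run.
    return SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```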
Are these changes tested?
Are there any user-facing changes?