7 changes: 0 additions & 7 deletions .gitignore
@@ -37,13 +37,6 @@ coverage.xml
bin/
.vscode/

# Hive/metastore files
metastore_db/

# Spark/metastore files
spark-warehouse/
derby.log
Comment on lines -40 to -45
Contributor Author
No longer needed since we no longer run Spark and the metastore locally.


# Python stuff
.mypy_cache/
htmlcov
4 changes: 2 additions & 2 deletions Makefile
@@ -18,7 +18,7 @@
# Configuration Variables
# ========================

PYTEST_ARGS ?= -v # Override with e.g. PYTEST_ARGS="-vv --tb=short"
PYTEST_ARGS ?= -v -x # Override with e.g. PYTEST_ARGS="-vv --tb=short"
Contributor Author
Added -x so the run exits as soon as a test fails (e.g. when it is interrupted with Ctrl-C) instead of continuing:

  -x, --exitfirst       exit instantly on first error or failed test.

https://docs.pytest.org/en/6.2.x/reference.html#command-line-flags
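
A minimal illustration (hypothetical test file, not part of this PR) of how --exitfirst behaves:

# test_exitfirst_demo.py -- hypothetical example to illustrate -x / --exitfirst
def test_first():
    # With `pytest -x`, the session stops as soon as this assertion fails.
    assert 1 + 1 == 3

def test_second():
    # Not executed when -x aborts the run at the failure above.
    assert True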

COVERAGE ?= 0 # Set COVERAGE=1 to enable coverage: make test COVERAGE=1
COVERAGE_FAIL_UNDER ?= 85 # Minimum coverage % to pass: make coverage-report COVERAGE_FAIL_UNDER=70
KEEP_COMPOSE ?= 0 # Set KEEP_COMPOSE=1 to keep containers after integration tests
@@ -37,7 +37,7 @@ endif
ifeq ($(KEEP_COMPOSE),1)
CLEANUP_COMMAND = echo "Keeping containers running for debugging (KEEP_COMPOSE=1)"
else
CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans 2>/dev/null || true
CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans --timeout 0 2>/dev/null || true
Contributor Author
Don't wait for docker compose down to stop containers gracefully; --timeout 0 makes teardown more responsive.

endif

# ============
21 changes: 17 additions & 4 deletions dev/Dockerfile
@@ -36,11 +36,13 @@ ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$
RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME} && mkdir -p /home/iceberg/spark-events
WORKDIR ${SPARK_HOME}

# Remember to also update `tests/conftest`'s spark setting
ENV SPARK_VERSION=3.5.6
ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_2.12
ENV ICEBERG_VERSION=1.9.1
ENV SCALA_VERSION=2.12
ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_${SCALA_VERSION}
ENV ICEBERG_VERSION=1.9.2
ENV PYICEBERG_VERSION=0.10.0
ENV HADOOP_VERSION=3.3.4
ENV AWS_SDK_VERSION=1.12.753
Comment on lines +40 to +45
Contributor Author
Copied over from tests/conftest; these JARs were originally downloaded client-side.
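
For context, a rough sketch of the client-side pattern this replaces (assuming the old tests/conftest resolved these JARs at session start via spark.jars.packages; the exact coordinates and settings here are illustrative):

# Hypothetical sketch of the previous client-side JAR resolution; details may differ from the old conftest.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Maven coordinates fetched at session start-up instead of being baked into the image.
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2,"
        "org.apache.iceberg:iceberg-aws-bundle:1.9.2",
    )
    .getOrCreate()
)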

Contributor
Nice, this is much better 👍


# Try the primary Apache mirror (downloads.apache.org) first, then fall back to the archive
RUN set -eux; \
@@ -59,15 +61,26 @@ RUN set -eux; \
tar xzf "$FILE" --directory /opt/spark --strip-components 1; \
rm -rf "$FILE"

# Download Spark Connect server JAR
RUN curl --retry 5 -s -L https://repo1.maven.org/maven2/org/apache/spark/spark-connect_${SCALA_VERSION}/${SPARK_VERSION}/spark-connect_${SCALA_VERSION}-${SPARK_VERSION}.jar \
-Lo /opt/spark/jars/spark-connect_${SCALA_VERSION}-${SPARK_VERSION}.jar

# Download iceberg spark runtime
RUN curl --retry 5 -s https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-${ICEBERG_SPARK_RUNTIME_VERSION}/${ICEBERG_VERSION}/iceberg-spark-runtime-${ICEBERG_SPARK_RUNTIME_VERSION}-${ICEBERG_VERSION}.jar \
-Lo /opt/spark/jars/iceberg-spark-runtime-${ICEBERG_SPARK_RUNTIME_VERSION}-${ICEBERG_VERSION}.jar


# Download AWS bundle
RUN curl --retry 5 -s https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/${ICEBERG_VERSION}/iceberg-aws-bundle-${ICEBERG_VERSION}.jar \
-Lo /opt/spark/jars/iceberg-aws-bundle-${ICEBERG_VERSION}.jar

# Download hadoop-aws (required for S3 support)
RUN curl --retry 5 -s https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar \
-Lo /opt/spark/jars/hadoop-aws-${HADOOP_VERSION}.jar

# Download AWS SDK bundle
RUN curl --retry 5 -s https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar \
-Lo /opt/spark/jars/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar

COPY spark-defaults.conf /opt/spark/conf
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"

6 changes: 2 additions & 4 deletions dev/docker-compose-integration.yml
@@ -26,15 +26,13 @@ services:
- rest
- hive
- minio
volumes:
- ./warehouse:/home/iceberg/warehouse
environment:
- AWS_ACCESS_KEY_ID=admin
- AWS_SECRET_ACCESS_KEY=password
- AWS_REGION=us-east-1
ports:
- 8888:8888
- 8080:8080
Comment on lines -36 to -37
Contributor Author
Removed port 8888, which was previously used for notebooks, and replaced the Spark master web UI port (8080) with the Spark application UI port (4040).
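
For reference, a hedged sketch of a local client reaching the exposed Spark Connect port (assumes a local pyspark install with Spark Connect support; not part of this PR):

# Hypothetical local client for the container's Spark Connect endpoint on port 15002.
from pyspark.sql import SparkSession

# "sc://" is the Spark Connect scheme; localhost:15002 matches the port mapping above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.sql("SELECT 1").show()  # The running application's UI is served on port 4040.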

- 15002:15002 # Spark Connect
- 4040:4040 # Spark UI
links:
- rest:rest
- hive:hive
4 changes: 1 addition & 3 deletions dev/entrypoint.sh
@@ -18,8 +18,6 @@
# under the License.
#

start-master.sh -p 7077
start-worker.sh spark://spark-iceberg:7077
start-history-server.sh
start-connect-server.sh

tail -f /dev/null
2 changes: 1 addition & 1 deletion dev/provision.py
@@ -50,7 +50,7 @@
"hive",
**{
"type": "hive",
"uri": "http://hive:9083",
"uri": "thrift://hive:9083",
"s3.endpoint": "http://minio:9000",
"s3.access-key-id": "admin",
"s3.secret-access-key": "password",
21 changes: 18 additions & 3 deletions dev/spark-defaults.conf
@@ -16,20 +16,35 @@
#

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# Configure Iceberg REST catalog
spark.sql.catalog.rest org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type rest
spark.sql.catalog.rest.uri http://rest:8181
spark.sql.catalog.rest.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.warehouse s3://warehouse/rest/
spark.sql.catalog.rest.s3.endpoint http://minio:9000
spark.sql.catalog.rest.cache-enabled false

# Configure Iceberg Hive catalog
spark.sql.catalog.hive org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive.type hive
spark.sql.catalog.hive.uri http://hive:9083
spark.sql.catalog.hive.uri thrift://hive:9083
spark.sql.catalog.hive.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.hive.warehouse s3://warehouse/hive/
spark.sql.catalog.hive.s3.endpoint http://minio:9000

# Configure Spark's default session catalog (spark_catalog) to use Iceberg backed by the Hive Metastore
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.spark_catalog.uri thrift://hive:9083
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.sql.catalogImplementation hive
spark.sql.warehouse.dir s3a://warehouse/hive/
Comment on lines +37 to +43
Contributor Author
spark_catalog is primarily used by the test_migrate_table test. It calls <catalog>.system.snapshot which requires spark_catalog

CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')
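
For illustration, roughly how that procedure is issued through Spark SQL (a sketch built around the quoted call; the table identifiers are placeholders):

# Sketch: snapshot an existing Hive table into an Iceberg table via the Spark procedure.
# `spark` is an existing SparkSession; the identifiers below are placeholders.
src_table_identifier = "default.source_table"
dst_table_identifier = "default.destination_table"
spark.sql(
    f"CALL hive.system.snapshot('{src_table_identifier}', 'hive.{dst_table_identifier}')"
)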

Contributor
It requires the SparkSessionCatalog 👍


spark.sql.defaultCatalog rest

# Configure Spark UI and event logging
spark.ui.enabled true
spark.eventLog.enabled true
spark.eventLog.dir /home/iceberg/spark-events
spark.history.fs.logDirectory /home/iceberg/spark-events
spark.sql.catalogImplementation in-memory