Commit caf02ee

infra: prefer downloads.apache.org, fallback to archive.apache.org (#2494)
# Rationale for this change

This PR changes the Spark download URL in the Dockerfile to try `downloads.apache.org/spark` first and then fall back to `https://archive.apache.org/dist`. This should give us both speed and reliability.

`https://archive.apache.org/dist` is very slow; we switched to it because it's more reliable and contains all versions of Spark. `downloads.apache.org/spark` hosts the latest versions of Spark and is typically faster.

Thanks @mccormickt12 for the great idea!

## Are these changes tested?

Yes:

```
make test-integration-rebuild && make test-integration
```

The fallback logic was also tested.

## Are there any user-facing changes?

1 parent 6935b41 commit caf02ee

1 file changed, +16 -3 lines changed

dev/Dockerfile

Lines changed: 16 additions & 3 deletions
@@ -42,9 +42,22 @@ ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_2.12
 ENV ICEBERG_VERSION=1.9.1
 ENV PYICEBERG_VERSION=0.10.0
 
-RUN curl --retry 5 -s -C - https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
- && tar xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
- && rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz
+# Try the primary Apache mirror (downloads.apache.org) first, then fall back to the archive
+RUN set -eux; \
+    FILE=spark-${SPARK_VERSION}-bin-hadoop3.tgz; \
+    URLS="https://downloads.apache.org/spark/spark-${SPARK_VERSION}/${FILE} https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${FILE}"; \
+    for url in $URLS; do \
+        echo "Attempting download: $url"; \
+        if curl --retry 3 --retry-delay 5 -f -s -C - "$url" -o "$FILE"; then \
+            echo "Downloaded from: $url"; \
+            break; \
+        else \
+            echo "Failed to download from: $url"; \
+        fi; \
+    done; \
+    if [ ! -f "$FILE" ]; then echo "Failed to download Spark from all mirrors" >&2; exit 1; fi; \
+    tar xzf "$FILE" --directory /opt/spark --strip-components 1; \
+    rm -rf "$FILE"
 
 # Download iceberg spark runtime
 RUN curl --retry 5 -s https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-${ICEBERG_SPARK_RUNTIME_VERSION}/${ICEBERG_VERSION}/iceberg-spark-runtime-${ICEBERG_SPARK_RUNTIME_VERSION}-${ICEBERG_VERSION}.jar \
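
The commit message says the fallback logic was tested but does not show how. One possible way to exercise the same try-in-order loop outside of a Docker build is to pull it into a small shell function whose fetcher can be swapped for a stub. This is a minimal sketch, not the approach used in the PR: `fetch_first`, `fetch_curl`, and the `FETCH` variable are hypothetical names introduced here for illustration. Only the curl flags are taken from the Dockerfile change.

```shell
#!/bin/sh
# Sketch only: fetch_first, fetch_curl and FETCH are hypothetical names,
# not part of the commit.

# Default fetcher: the same curl flags the Dockerfile change uses.
# $1 = URL, $2 = output file.
fetch_curl() {
    curl --retry 3 --retry-delay 5 -f -s -C - "$1" -o "$2"
}

# fetch_first OUTFILE URL...
# Tries each URL in order and stops at the first one that succeeds.
# Returns 0 on success, 1 if every URL failed.
# Set FETCH to the name of another function to replace curl (e.g. in a test).
fetch_first() {
    out=$1
    shift
    for url in "$@"; do
        echo "Attempting download: $url" >&2
        if "${FETCH:-fetch_curl}" "$url" "$out"; then
            echo "Downloaded from: $url" >&2
            return 0
        fi
        echo "Failed to download from: $url" >&2
    done
    echo "Failed to download from all mirrors" >&2
    return 1
}

# Example call (not executed here; requires network access and SPARK_VERSION):
#   FILE="spark-${SPARK_VERSION}-bin-hadoop3.tgz"
#   fetch_first "$FILE" \
#       "https://downloads.apache.org/spark/spark-${SPARK_VERSION}/${FILE}" \
#       "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${FILE}"
```

Unlike the Dockerfile, which checks for the file with `[ ! -f "$FILE" ]` after the loop, this sketch reports failure through the function's exit status. Pointing `FETCH` at a stub that fails for the first URL and succeeds for the second is enough to confirm that the second mirror is tried without touching the network.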
