regtests/Dockerfile (4 changes: 2 additions & 2 deletions)
@@ -17,14 +17,14 @@
# under the License.
#

-FROM docker.io/apache/spark:3.5.4-python3
+FROM docker.io/apache/spark:3.5.4-java17-python3
Member:
Java 17 WFM.

Member:
Q: Is this Scala 2.13? I'd assume so, because there are separate images that have "scala2.12" in their tag name - but no images with "scala2.13".

Contributor Author:
So this is Scala 2.12. Spark defaults to 2.12 for its images.
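
For anyone who wants to double-check, the bundled Scala version can be read straight from the image; the command below assumes the official image's /opt/spark layout, and the version numbers shown are illustrative rather than captured output:

$ docker run --rm docker.io/apache/spark:3.5.4-java17-python3 /opt/spark/bin/spark-submit --version
Welcome to ... version 3.5.4
Using Scala version 2.12.x, OpenJDK 64-Bit Server VM, 17.0.x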

ARG POLARIS_HOST=polaris
ENV POLARIS_HOST=$POLARIS_HOST
ENV SPARK_HOME=/opt/spark

USER root
RUN apt update
-RUN apt-get install -y diffutils wget curl python3.8-venv
+RUN apt-get install -y diffutils wget curl python3.10-venv
Member:
Would 3.12 or 3.13 work? Those versions still get bugfixes (not just security fixes).

Contributor Author (@MonkeyCanCode), Jan 28, 2025:
It will. Locally I am using Python 3.13. However, if we want to use the official Spark image with a different Python version, we would need to compile Python from source. In my previous PR that reworked the test cases to pytest (paused for now, will pick it up again soon), I used a Python base image and built our own Spark image on top of it (sketched below); in that case we are not locked to whatever the Spark image ships and don't need to compile anything from source, since setting up Spark is just installing software. Both approaches work.

It really comes down to this: if we want to use the official Spark image and avoid compiling from source, we are stuck with the specific Python version it ships (e.g. on CentOS 7, which is also EOL, the default is Python 2 and python3 resolves to 3.8, though a different Python 3 can be set up there via another repo or compiled from source). In this case, the JDK 11 base image used by Spark defaults to Python 3.8 and the JDK 17 one defaults to Python 3.10.
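
To make that alternative concrete, here is a minimal sketch of the Python-base-image route; the base tag, Spark download URL, and package names are assumptions for illustration, not the contents of the paused pytest PR:

# Hypothetical sketch: start from a current Python image and install Spark on top,
# so the Python version is no longer tied to what the official Spark image ships.
FROM docker.io/python:3.13-slim

ENV SPARK_VERSION=3.5.4
ENV SPARK_HOME=/opt/spark

# Spark only needs a JRE plus the extracted distribution; nothing is compiled from source.
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jre-headless curl && \
    curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz \
      | tar -xz -C /opt && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop3 ${SPARK_HOME}

ENV PATH=${SPARK_HOME}/bin:${PATH}

The trade-off is the one described above: full control over the Python (and JDK) versions, at the cost of maintaining our own Spark installation instead of reusing the official image.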

Contributor Author:
A bit more context on those base images: the official Spark JDK 11 image is based on eclipse-temurin:11-jre-focal, which is built on top of ubuntu:20.04, while the official Spark JDK 17 image is based on eclipse-temurin:17.0.3_7-jdk-jammy, which is built on top of ubuntu:22.04. Here is what python3 resolves to on each of them when using apt with the default repos (to get a different version without compiling from source, we could also point to a different branch of the repo, or to a different repo entirely):

$ docker run -it ubuntu:20.04 /bin/bash
root@48dfe9519115:/# apt-cache madison python3
   python3 | 3.8.2-0ubuntu2 | http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

$ docker run -it ubuntu:22.04 /bin/bash
root@1a6950f03ad8:/# apt-cache madison python3
   python3 | 3.10.6-1~22.04.1 | http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
   python3 | 3.10.6-1~22.04 | http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
   python3 | 3.10.4-0ubuntu2 | http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
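
The same check can be run against the Spark images themselves rather than the bare Ubuntu bases; the outputs below are what the apt data above implies, shown for illustration:

$ docker run --rm docker.io/apache/spark:3.5.4-python3 python3 --version
Python 3.8.x
$ docker run --rm docker.io/apache/spark:3.5.4-java17-python3 python3 --version
Python 3.10.x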

Member:
All good. Just a question ;)

However, I'd generally stay away from already EOL'd versions and soon-to-be EOL versions.

Contributor Author:
Understood. If preferred, I can do a PR that uses a base Python image and builds Spark on top. That way we can use the latest versions of both (but we won't be using the official Spark image in that case, since it doesn't offer this kind of support). There is a similar request in Apache Iceberg as well, but their preferred route is to use the official Spark image whenever possible.

Let me know what you think. I can merge this one if there are no other concerns.

Member:
Nah, no need for more effort at this point IMO. It seems like a lot of initial and ongoing maintenance work for a low win. Sticking with the official Spark image is fine for me; I don't see a pressing need to add more burden.

Member:
Thanks for the effort to look into this!

Contributor Author:
Anytime.

RUN mkdir -p /home/spark && \
chown -R spark /home/spark && \
mkdir -p /tmp/polaris-regtests && \