Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

follow the Best practices for writing Dockerfiles :

Always combine RUN apt-get update with apt-get install in the same RUN statement.

Why are the changes needed?

1, to address #42253 (comment)
2, when I attempted to change the apt-get install in #41918, the behavior was confusing. By following the best practices, further changes should work immediately.

Does this PR introduce any user-facing change?

NO, dev-only

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

NO

init 2

init
@github-actions github-actions bot added the BUILD label Sep 7, 2023
@zhengruifeng
Copy link
Contributor Author

cc @Yikun @dongjoon-hyun

@zhengruifeng
Copy link
Contributor Author

@Yikun
Copy link
Member

Yikun commented Sep 7, 2023

LGTM, but I think you should know (or maybe already know):

Always combine RUN apt-get update with apt-get install in the same RUN statement.

This best practices is for to help reduce redundat layer between update and install, but for spark infra, there are a tradeoff between image size download cost and image rebuild cost.

Move frequently modified dependencies to front means: once the Dockerfile head front dependencies are modified, the following content will also be rebuild (due to infra cache invalidated). Finally, it impacts PRs which change Dockerfile head lines or haven't rebase in time CI will cost more time (about 1h).

TL;DR: move stable dependencies to top, keep frequently modified dependencies to end.


RUN Rscript -e "install.packages(c('knitr', 'markdown', 'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow', 'roxygen2', 'xml2'), repos='https://cloud.r-project.org/')"

# See more in SPARK-39959, roxygen2 < 7.2.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So L58 not change is intentional, right?

Copy link
Contributor Author

@zhengruifeng zhengruifeng Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, should be also moved in L27, let me update

@zhengruifeng
Copy link
Contributor Author

LGTM, but I think you should know (or maybe already know):

Always combine RUN apt-get update with apt-get install in the same RUN statement.

This best practices is for to help reduce redundat layer between update and install, but for spark infra, there are a tradeoff between image size download cost and image rebuild cost.

Move frequently modified dependencies to front means: once the Dockerfile head front dependencies are modified, the following content will also be rebuild (due to infra cache invalidated). Finally, it impacts PRs which change Dockerfile head lines or haven't rebase in time CI will cost more time (about 1h).

TL;DR: move stable dependencies to top, keep frequently modified dependencies to end.

thanks for the detailed explaination, since we rarely modify the apt-install (we frequently changes the pip-install), so I think it would be fine.

@zhengruifeng
Copy link
Contributor Author

thanks @Yikun and @HyukjinKwon , merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants