Cleanup output of docker cache generation #16412
Conversation
I think the log is quite useful for debugging, so I'd like to keep it. While output from the different threads is intermingled, it still gives you enough data to track down an issue without having to run it locally first.
LGTM
Marco, would it be possible to merge this? Splitting the PR is reasonable, but CI is very flaky, and getting this in would help alleviate the timeouts. Also, the output from containers built in parallel is not really useful or needed, in my opinion. We would appreciate getting this merged so we can unblock CI soon. Thanks.
@marcoabreu Logging is great. Based on what I understand, this PR still logs enough for one to understand what causes an error, without giving too much cluttered info. Lots of info is great, but since the builds execute in parallel, the output seems to get garbled. What do you reckon?
Keep in mind that this job only runs on master, which means that every job has already passed as a requirement to be merged in the first place. So if this job suddenly fails, we have an inconsistency - which could be flakiness. The logs don't do any harm, and in 99% of cases the build is a no-op since it's fully incremental and no changes have been made (thus the log is super short); the output is only really there if something actually changed, and that's potentially exactly the log you're interested in at that moment. Also, without the log you wouldn't be able to tell at first glance whether the caching is working properly, since the "using cached" output would be missing. Generally, I take this change as fixing something that's not broken.

You constantly bring up CI being flaky as an argument to merge multiple changes in one PR. I have said it on multiple occasions and still stand by the stance that this is not acceptable. Everybody else in this community is able to split up their changes and get them through CI, so you should be able to as well. The time spent trying to convince me could have been used instead to extract the crucial fix (the one-liner that increases the timeout) and retry CI in case it flakes out.
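To illustrate the trade-off being debated here - keeping the logs versus keeping them readable when builds run in parallel - below is a minimal sketch of one way to capture each build's output and only dump it as a single block when a build fails. The helper names and the docker invocation are hypothetical; this is not the actual ci/docker_cache.py implementation or this PR's diff.

```python
# Sketch only: capture per-build output instead of streaming interleaved lines
# from several threads, and print the full log only for failed builds.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_container(tag: str):
    """Run one docker build with all output captured (hypothetical helper)."""
    proc = subprocess.run(
        ["docker", "build", "--cache-from", tag, "-t", tag, "."],
        capture_output=True, text=True)
    return tag, proc.returncode, proc.stdout + proc.stderr

def build_all(tags):
    with ThreadPoolExecutor(max_workers=4) as pool:
        for tag, returncode, output in pool.map(build_container, tags):
            if returncode != 0:
                # Dump the complete log for the failed build as one block,
                # so it is readable even though builds ran in parallel.
                print(f"=== build failed for {tag} ===")
                print(output)
            else:
                # Successful builds stay quiet to keep the CI log short.
                print(f"build succeeded for {tag}")
```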
ci/docker_cache.py (Outdated)
@@ -37,7 +37,7 @@
 DOCKERHUB_LOGIN_NUM_RETRIES = 5
 DOCKERHUB_RETRY_SECONDS = 5
 DOCKER_CACHE_NUM_RETRIES = 3
-DOCKER_CACHE_TIMEOUT_MINS = 15
+DOCKER_CACHE_TIMEOUT_MINS = 30
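For context, a timeout constant in minutes like this is typically passed to whichever call pushes or pulls the cache image. The sketch below is only an assumption about how such a constant could be wired up together with the retry constants above; the _push_cache helper is hypothetical and not the actual code in ci/docker_cache.py.

```python
# Hedged sketch: bound a cache push by DOCKER_CACHE_TIMEOUT_MINS and retry
# a few times before giving up. Not the real docker_cache.py implementation.
import subprocess
import time

DOCKERHUB_RETRY_SECONDS = 5
DOCKER_CACHE_NUM_RETRIES = 3
DOCKER_CACHE_TIMEOUT_MINS = 30

def _push_cache(image_tag: str) -> None:
    for attempt in range(1, DOCKER_CACHE_NUM_RETRIES + 1):
        try:
            subprocess.run(
                ["docker", "push", image_tag],
                check=True,
                timeout=DOCKER_CACHE_TIMEOUT_MINS * 60)  # subprocess timeout is in seconds
            return
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            print(f"push attempt {attempt} for {image_tag} failed: {exc}")
            time.sleep(DOCKERHUB_RETRY_SECONDS)
    raise RuntimeError(f"Could not push cache image {image_tag}")
```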
So does this mean that even though we have a working container that has all of the deps in it, we'll still build a new one from scratch after 15 (or now 30) minutes?
If this is the case, why isn't this number much, much higher? Like in terms of hours or days? I am always frustrated by watching the same intermediate dependencies being built when there doesn't seem to be any change at all.
No, this is the job that publishes the cache. Right now, the job fails because it runs into a timeout. Once the job works again, everything will be back to normal with incremental builds.
Why should we set it much higher? The job is only supposed to run for a few minutes.
Never mind then; I thought this would explain why I see the cache invalidated so often for no apparent reason.
Yeah, the reason for the cache invalidations is that this job has been broken since the 20th of September.
Marco, please check the logs from this job: it has been failing for two months and causing containers to be rebuilt on CI. I'm really surprised that you are making such comments; also, what you are saying doesn't match my observations. I have run this on a separate machine to diagnose the problems and timeouts.
With my quote I'm referring to the majority of this PR, which is the log change. It won't solve the problem but is merely the reason the issue hasn't been resolved yet. With regard to the timeout change, I'm happy to merge it ASAP.
Marco, there is a cost to the project in needlessly discussing minor things. It prevents us from making progress and is not useful. Have a look at this article: https://www.theregister.co.uk/2008/05/30/google_open_source_talk/
dbf5981 to 7dc622a (Compare)
7dc622a to c340488 (Compare)
@marcoabreu Could you have another look at this PR? The changes were separated as you requested. Thanks.
Thanks, but I'd prefer not to suppress the output and not to move forward with this change.
Thanks for your review. So, going forward, will you maintain this infrastructure then?
Description
Clean up the output of the docker cache generation script.
@mseth10