
[CI] Centos docker builds failing during yum install #38832

Closed
tvernum opened this issue Feb 13, 2019 · 16 comments · Fixed by #38956
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@tvernum
Contributor

tvernum commented Feb 13, 2019

I don't know if there's any solution to this, but I didn't want the issue to just get lost in build noise.

We seem to have recurring failures building the docker image on Centos because yum fails to retrieve the mirror list.

Feb 13 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=centos-7&&immutable&&linux&&docker/14/console

05:06:48 Cannot find a valid baseurl for repo: base/7/x86_64
05:06:48 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
05:06:48 12: Timeout on http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')
05:06:49 
05:06:49 The command '/bin/sh -c yum install -y unzip which' returned a non-zero code: 1
05:06:49 

Feb 11 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+periodic/30/console

11:12:05 Cannot find a valid baseurl for repo: base/7/x86_64
11:12:05 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
11:12:05 14: curl#7 - "Failed to connect to 2604:1580:fe02:2::10: Network is unreachable"
11:12:05 The command '/bin/sh -c yum install -y unzip which' returned a non-zero code: 1
11:12:05 
11:12:05 

Feb 10 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/233/console

05:07:21 Cannot find a valid baseurl for repo: base/7/x86_64
05:07:21 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
05:07:21 12: Timeout on http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')
05:07:21 The command '/bin/sh -c yum update -y &&     yum install -y nc unzip wget which &&     yum clean all' returned a non-zero code: 1
05:07:21 
05:07:21 > Task

@tvernum tvernum added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Feb 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@dliappis dliappis self-assigned this Feb 13, 2019
@dliappis
Contributor

dliappis commented Feb 13, 2019

@tvernum thanks for creating this. Ironically enough, it seems that this is happening on CentOS 7 workers attempting to `yum install` inside a container based on centos:7.

I have been unable to reproduce this using a simple reproduction script inside a clean CentOS 7 Vagrant box, even with 20 iterations.

Attempted reproduction script:

$ cat Dockerfile
FROM centos:7 AS builder
ENV PATH /usr/share/elasticsearch/bin:$PATH
ENV JAVA_HOME /opt/jdk-11.0.2
RUN curl --retry 8 -s https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz | tar -C /opt -zxf -
RUN ln -sf /etc/pki/ca-trust/extracted/java/cacerts /opt/jdk-11.0.2/lib/security/cacerts
RUN yum install -y unzip which

$ for i in {1..20}; do docker build -t testimage . && echo ">>>> Test $i successful"; docker rmi testimage; done

Spinning up a worker similar to the one used in the failures above, I observed that the base OS already has certain images present (incl. centos:7), so the above reproduction script was basically a no-op; while this speeds things up, it might be the reason behind some stale state.

@dliappis
Contributor

Looked at this again.

It mostly happens on centos-7 workers, but I also spotted it happening on an Ubuntu kibana-ci worker, devops-ci Ubuntu (Beats) immutable workers and even on an openSUSE Leap 42.3 non-immutable worker, so it doesn't look specifically related to centos-7 workers. The history seems to start on Jan 17 this year; this doesn't chronologically correspond to the packer cache script fix PR in https://github.com/elastic/elasticsearch/pull/38023/files.

Had a brainstorming session with @atorok on some ideas.

Alpar pointed out that the failing jobs were for non-master branches and hence not benefiting from caching. He'll work on a simple PR to extend the caching to older versions.

I looked at our Dockerfile and as an additional action we can:

  1. Remove `yum install -y unzip which` from the stage 0 image -- it hasn't been needed for some time now.
  2. Clean up the yum section in stage 1; we still need `yum update -y`, but I'll add a retry loop with some sleep to help even further (see the sketch below).
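For illustration only, here is a minimal sketch of what such a retry loop with sleep could look like in the Dockerfile; the attempt count, sleep interval and package list are placeholders, not necessarily what the eventual change will use:

```
# Sketch only: retry the yum commands a few times with a pause between attempts.
RUN for attempt in 1 2 3 4 5; do \
      yum update -y && \
      yum install -y nc && \
      yum clean all && \
      exit 0; \
      echo "yum failed on attempt ${attempt}, sleeping before retry"; \
      sleep 10; \
    done; \
    echo "yum failed after 5 attempts"; \
    exit 1
```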

Hopefully with all the above steps combined we will get rid of the noise.

dliappis added a commit to dliappis/elasticsearch that referenced this issue Feb 18, 2019
As the Dockerfile evolved we no longer need certain commands like
`unzip`, `which` and `wget`, allowing us to slightly shrink the images
too.

Relates elastic#38832
@benwtrent
Member

benwtrent commented Feb 21, 2019

Re-occurred in builds:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+periodic/89/console

02:06:40 Cannot find a valid baseurl for repo: base/7/x86_64
02:06:40 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
02:06:40 14: curl#7 - "Failed to connect to 2607:f8f8:700:12::10: Network is unreachable"

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+periodic/89/console

04:06:23 Cannot find a valid baseurl for repo: base/7/x86_64
04:06:23 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
04:06:23 12: Timeout on http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')

@benwtrent benwtrent reopened this Feb 21, 2019
@danielmitterdorfer
Member

@benwtrent can we please add at least the relevant build output here to help with the investigation? Jenkins links are usually invalid after a few weeks. While it is still possible to get the build output, it is much more trouble to go to build-stats and dig for the respective logs.

@benwtrent
Member

@danielmitterdorfer I updated my comment to include the snippet of the timeout failure. I initially did not add it as it did not provide any more information than what was already included in this issue.

@dliappis
Contributor

@atorok this issue seems to persist, e.g. in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/40/console. AFAICS the updated base images (incl. caching support) have been building successfully.

This is a list of failures the past few days:

(screenshot of the recent failing builds omitted)

WDYT, should we be more resilient in all cases by introducing the retry commit: 6f52008?

@alpar-t
Contributor

alpar-t commented Feb 27, 2019

I just checked and we do call the relevant task when building the image: `[7.1.0] [6.7.0] > Task :distribution:docker:pullFixture`. I think we build a different image for the fixture (since we do it through docker compose) than our regular build. I thought that Docker would cache and reuse the layers, but this doesn't seem to be the case.

I'm not against the retries, just looking to understand this a bit better as we also don't seem to be getting all the benefits of the caching.

@alpar-t
Contributor

alpar-t commented Feb 27, 2019

I think it would be better to use `image: "docker.elastic.co/elasticsearch/elasticsearch-oss:8.0.0-SNAPSHOT"` in docker-compose.yml and have the fixture depend on the image build; we can then avoid building two images and should also benefit more from the caching.
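Roughly, that could look like the sketch below; this is illustrative only, and the service name and extra settings are made up rather than taken from the actual docker-compose.yml:

```
version: "2.4"
services:
  elasticsearch:
    # Reference the snapshot image produced by the regular build instead of
    # rebuilding a separate fixture image from a Dockerfile.
    image: "docker.elastic.co/elasticsearch/elasticsearch-oss:8.0.0-SNAPSHOT"
    environment:
      - discovery.type=single-node
    ports:
      - "9200"
```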

@alpar-t
Contributor

alpar-t commented Feb 27, 2019

I checked and `:x-pack:test:smb-fixture:composeUp` is also rebuilt, so something is definitely not right with the caches.

@dliappis
Contributor

Another occurrence has been raised in #40205. The Docker caches didn't seem to be honored.

During a team discussion with @danielmitterdorfer / @rjernst and @mark-vieira we thought it may make sense to bring back the yum retries commit 6f52008 and backport it to 7.x/7.0 and even 6.7.

@rjernst
Member

rjernst commented Mar 20, 2019

yum has a retries setting in yum.conf right? Why don't these retries work?

@alpar-t
Contributor

alpar-t commented Mar 21, 2019

We should do the retries. I looked into it and the caching we do on images won't help us with this as I initially thought, since we copy the Dockerfile (and even if we didn't, the one used for the cache is still from a different checkout). We also have changing dependencies, as we just built the distribution, so we do want to regenerate the image.

@dliappis
Contributor

> yum has a retries setting in yum.conf, right? Why don't these retries work?

Yes, there is a retries setting and by default it's set to 10; however, from what I've seen yum only honors it for failures pertaining to specific package files, not for earlier failures such as pulling the mirrorlist (Timeout on http://mirrorlist.centos.org).
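For reference, the settings in question live in /etc/yum.conf under [main]; the values below are illustrative (retries already defaults to 10), and the comments reflect the behaviour described above:

```
[main]
# Retries for package downloads; does not cover fetching the mirrorlist itself.
retries=10
# Per-connection timeout in seconds.
timeout=30
# Abort a transfer if throughput stays below this many bytes/sec
# (this is where the "Less than 1000 bytes/sec" error above comes from).
minrate=1000
```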

@dliappis
Contributor

I raised #40349 resurrecting the retries.

@dliappis
Contributor

#40349 (retries for yum commands in the Dockerfile with an in-between sleep period) has been merged; I'll close this out now. If the problem surfaces again, feel free to re-open.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020