This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] Address CI failures with docker timeouts (v2) #19890

Merged: 2 commits, Feb 12, 2021
5 changes: 5 additions & 0 deletions ci/safe_docker_run.py
@@ -26,8 +26,10 @@
import atexit
import logging
import os
import random
import signal
import sys
import time
from functools import reduce
from itertools import chain
from typing import Dict, Any
@@ -117,6 +119,9 @@ def run(self, *args, **kwargs) -> int:
        ret = 0
        try:
            # Race condition:
            # add a random sleep to (a) give docker time to flush disk buffer after pulling image
            # and (b) minimize race conditions between jenkins runs on same host
            time.sleep(random.randint(2,10))
Contributor:

How does a random delay help, compared with, say, a fixed wait of 5 seconds?

Contributor Author:

Each Jenkins slave (the Linux CPU nodes, at least) has 2 "slots" that can run jobs in parallel. When 2 jobs using the same Docker image start at the exact same time on these 2 slots, both will attempt to pull the image from ECR and start a container. The idea behind randomizing the delay is that the two containers won't be requested to start at the exact same time.
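
A rough sketch of the reasoning in this reply, assuming both jobs start at the same instant and each draws its delay independently with `random.randint(2, 10)` as in the diff. The helper names `same_delay_probability` and `jittered_delay` are hypothetical and not part of `ci/safe_docker_run.py`. Under these assumptions a fixed wait keeps the two jobs in lockstep, while the whole-second jitter separates them in 8 out of 9 cases, so it reduces rather than eliminates simultaneous starts.

```python
import random
from fractions import Fraction
from itertools import product
from typing import Optional


def same_delay_probability(low: int, high: int) -> Fraction:
    """Exact chance that two independent randint(low, high) draws are equal."""
    values = range(low, high + 1)
    # Enumerate every (job A delay, job B delay) pair; all are equally likely.
    pairs = list(product(values, repeat=2))
    equal = sum(1 for a, b in pairs if a == b)
    return Fraction(equal, len(pairs))


def jittered_delay(low: int = 2, high: int = 10,
                   rng: Optional[random.Random] = None) -> int:
    """Pick a whole-second delay in [low, high], both ends inclusive."""
    source = rng if rng is not None else random
    return source.randint(low, high)
```

Here `same_delay_probability(2, 10)` evaluates to `Fraction(1, 9)`, whereas a fixed 5-second wait, modeled as `same_delay_probability(5, 5)`, evaluates to `1`: both jobs would still hit docker at the same moment, just 5 seconds later.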

            # If the call to docker_client.containers.run is interrupted, it is possible that
            # the container won't be cleaned up. We avoid this by temporarily masking the signals.
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT, signal.SIGTERM})
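
The masking pattern described in the comment above can be sketched in isolation as follows. This is a minimal illustration, not the code from `ci/safe_docker_run.py`: the wrapper name `run_with_signals_blocked` is hypothetical, and restoring the previous mask in a `finally` block is an assumption, since the excerpt does not show how the script unmasks the signals. `signal.pthread_sigmask` is only available on Unix-like platforms.

```python
import signal


def run_with_signals_blocked(critical_section):
    """Call critical_section() with SIGINT and SIGTERM blocked in this thread."""
    blocked = {signal.SIGINT, signal.SIGTERM}
    # SIG_BLOCK adds the signals to the thread's mask and returns the old mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
    try:
        return critical_section()
    finally:
        # Restore the previous mask; signals that arrived meanwhile are
        # delivered once they are unblocked.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

A caller would wrap only the short step that must not be interrupted, for example the call that starts the container, so that Ctrl-C or a termination request is deferred rather than lost.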