Race conditions when caching images sometimes causes cache corruption. #3634

rhpvorderman · 2019-05-28T08:44:17Z

Version of Singularity:

3.1.0

Expected behavior

When two singularity processes pull the same image, some sort of measures are taken that they do not write to the cache at the same time.

Actual behavior

Two singularity process will write to the cache at the same time. Oddly enough this works well in most of the cases. However sometimes we get cache corruption on our cluster. This happens when we start multiple jobs that require the same image simultaneously.

Steps to reproduce behavior

singularity cache clean --all
run singularity shell docker://python:3.7 simultaneously in two different terminals.

EDIT: I realize it is very hard to reproduce behaviour that happens 'sometimes'. I could not find a similar issue so I hope that other people with the same problem manage to find this one.

The text was updated successfully, but these errors were encountered:

WestleyK · 2019-05-28T23:12:55Z

Yeah, I can reproduce something like what you describe. Is this similar to your error?

$ singularity exec docker://ubuntu ls & sleep 2.0s && singularity cache clean --all
[1] 32257
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
FATAL:   Unable to handle docker://ubuntu uri: unable to build: While creating SIF: while creating container: container file creation failed: open /home/westleyk/.singularity/cache/oci-tmp/f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5/ubuntu_latest.sif: no such file or directory

[1]+  Exit 255                singularity exec docker://ubuntu ls

WestleyK · 2019-05-28T23:26:58Z

After some messing around, I seem to have corrupted my cache, is this more like the error message your got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
 2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory

EDIT: this bug is not related this issue, I was not on the master branch 🤦‍♂️

WestleyK · 2019-05-28T23:29:33Z

Btw, my singularity version is:

3.2.0-513.g3c02d0904

tbugfinder · 2019-05-29T05:03:38Z

I'm running two pulls in parallel using:

$ singularity --version
singularity version 3.2.1-1.el7

Parallel Pull:

$ rm -Rf ~/.singularity/cache/ ;  rm -f *.img ; strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10  & strace -ff -o /tmp/singularity/ubuntu1804.strace singularity pull --name ubuntu1804.img docker://ubuntu:18.04
[1] 262982
INFO:    Starting build...
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:89074f19944ee6c68e5da6dea5004e1339e4e8e9c54ea39641ad6e0bc0e4223b
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 1s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 27.89 MiB / 27.89 MiB [====================================================] 2s
Copying blob sha256:6cd3a42e50dfbbe2b8a505f7d3203c07e72aa23ce1bdc94c67221f7e72f9af6c
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 865 B / 865 B [============================================================] 0s
Copying blob sha256:26b902a7bf04aa8d7c02fd742898dab4b6c791b8e363fddc06298191167d5fac
 162 B / 162 B [============================================================] 0s
 164 B / 164 B [============================================================] 0s
Copying config sha256:7c8c583f970820a51dab6e0613761c4f99077d9a22b373a59f47ee2afb247e72
 0 B / 2.36 KiB [--------------------------------------------------------------]Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
Storing signatures
FATAL:   Unable to pull docker://ubuntu:18.10: conveyor failed to get: Error initializing source oci:/home/sigim/.singularity/cache/oci:50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4: no descriptor found for reference "50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4"
INFO:    Creating SIF file...
INFO:    Build complete: ubuntu1804.img
[1]+  Exit 255                strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10

rhpvorderman · 2019-05-29T11:02:58Z

We sometimes get this error when we have cache corruption: FATAL: container creation failed: mount error: can't remount /run/shm: no such file or directory . But maybe that is caused by something else. Never mind, this was not related.

I am glad you were able to reproduce the race-conditions! Thanks!

WestleyK · 2019-05-29T16:29:17Z

After some messing around, I seem to have corrupted my cache, is this more like the error message your got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
2.59 MiB / 2.59 MiB [=======================================================] >100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory

Nevermined this ^^^ problem, I was on a dev branch (not master) 🤦‍♂️ , that issue has nothing to do with cache corrupting.

But, there still is a bug if you clean the cache, while building a container. witch may not be a bug...

tbugfinder · 2019-07-01T18:32:46Z

@WestleyK Especially nextflow uses parallel pulls prior to the workflow execution.

tbugfinder · 2019-08-15T06:18:54Z

Is there any chance for a bug fix?

adamnovak · 2019-10-11T22:39:25Z

I'm also getting issues like this in Toil workflows trying to use Singularity:

Unable to handle docker://devorbitus/ubuntu-bash-jq-curl uri: unable to build: conveyor failed to get: no descriptor found for reference "7f5e6bce78bb52d74e6a0881ec91806d11978cedfd4caa43a6fb71c55350254a"

It seems difficult in practice to prevent all software running as the same user as you from trying to use Singularity to run the same image as you are trying to run. The only workaround I can come up with is always setting your own SINGULARITY_CACHEDIR, at which point you lose the benefit of caching between tasks.

tbugfinder · 2019-10-17T15:54:46Z

I've upgraded to latest version and still end up with the conveyor error:

$ singularity --version
singularity version 3.4.2-1.el7



Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ubuntu-18.10.img docker://ubuntu:18.10 > /dev/null
  status : 255
  message:
    [34mINFO:    Converting OCI blobs to SIF format
    INFO:    Starting build...
    Getting image source signatures
    Copying blob sha256:8a532469799e09ef8e1b56ebe39b87c8b9630c53e86380c13fbf46a09e51170e

     0 B / 25.82 MiB [-------------------------------------------------------------]
     8.88 MiB / 25.82 MiB [===================>------------------------------------]
     15.61 MiB / 25.82 MiB [=================================>---------------------]
     21.16 MiB / 25.82 MiB [=============================================>---------]
     25.82 MiB / 25.82 MiB [====================================================] 0s
    Copying blob sha256:32f4dcec3531395ca50469cbb6cba0d2d4fed1b8b2166c83b25b2f5171c7db62

     0 B / 34.32 KiB [-------------------------------------------------------------]
     34.32 KiB / 34.32 KiB [====================================================] 0s
    Copying blob sha256:230f0701585eb7153c6ba1a9b08f4cfbf6a25d026d7e3b78a47c0965e4c6d60a

     0 B / 868 B [-----------------------------------------------------------------]
     868 B / 868 B [============================================================] 0s
    Copying blob sha256:e01f70622967c0cca68d6a771ae7ff141c59ab979ac98b5184db665a4ace6415

     0 B / 164 B [-----------------------------------------------------------------]
     164 B / 164 B [============================================================] 0s
    Copying config sha256:e4186b579c943dcced1341ccc4b62ee0617614cafc5459733e2f2f7ef708f224

     0 B / 2.42 KiB [--------------------------------------------------------------]
     2.42 KiB / 2.42 KiB [======================================================] 0s
    Writing manifest to image destination
    Storing signatures
    FATAL:   While making image from oci registry: while building SIF from layers: conveyor failed to get: no descriptor found for reference "7d657275047118bb77b052c4c0ae43e8a289ca2879ebfa78a703c93aa8fd686c"

nextflow-io/nextflow#1210 sylabs/singularity#3634

adamnovak · 2020-01-03T17:41:46Z

As a response to https://github.com/sylabs/singularity/issues/4555#issuecomment-570612570, it would be extremely useful for my use case to have some synchronization inside Singularity that depends on atomic globally-consistent rename support, or even that depends on file lock support, on the backing filesystem. The result would be AFAIK no worse in the case where Singularity is running on multiple machines against a filesystem without support for these tools (i.e. you'd still get uncontrolled races and apparently arbitrary failures), but within a single machine with an ext4 home directory (which covers e.g. most cloud VMs) you would get actually-reliable performance.

dtrudg · 2020-01-03T18:48:39Z

@adamnovak - understood. There have been some caching code changes since earlier 3.x versions that I'm not entirely familiar with yet, but I believe we have fewer issues now. We can try and establish the exact points we have problems remaining, and take a look at improvements for the pretty constrained case you give there in the next release cycle. I just don't want to give any promises that we can solve things simply for people who are wanting to share cache directories between multiple users on arbitrary cluster filesystems.

We still recommend, that you singularity pull in a single location, single script etc. into a SIF file before any concurrent execution, and run against that immutable SIF.

* use small genomes to generate examples and stramline input definitions * corrected urls * relaxed allowed target name regex * stingency settings not ensembl specific, moved to main config * refactoring gtf/gff3 fileds def * major re-work of input staging and multitude of related changes * updated repr pep filtering * relaxed req to include supercontigs not just chromosomes * added sequencesToPlace spec to test config * restored core functionality after re-structure * cleanup, comments * added samtools container def * optional faidx process if idx no provided * added data set from non Esembl source * generalised gff3-based pep conversion to Ensembl style, also allows pass-through of already existing records * allowing user-specified chromosome id pattern for block and feature JSON generation * updated and documented test data sets * travis stub * opted for smaller samtools container * hack to handle gz (not bgz) files fro chr lengths * minor * Update README.md * Update .travis.yaml * Update .travis.yaml * Update .travis.yaml * Update .travis.yaml * test profile with local data * travis data download and untar * travis fixes * ubu version for travis * updated dep * for GH actions * docker user change for GH actions * docker groovy test for GHA * docker user * docker grp exists * added go for singularity * added groovy image with ps * reconf * test profile updates - fix for groovy @grab failing with singularity (read only file-system) - fix errorStrategy config * added Singularity install to GH actions * Singularity dependencies @ GH actions * working around https://github.com/sylabs/singularity/issues/3634 * test singularity pull form docker * explicit use of gawk - may matter on alpine * workaround for nextflow-io/nextflow#1210 sylabs/singularity#3634 * leaner fastx container * fastx and reconf * fix path to script, renamed tasks * test wspace path * added missing script, fixed GH actions cmd * ansi-lo on and try docker again * docker workflow test * fix typo * fix typo * fix for permission denied GH actions (?) * fix for groovy grapes in docker * test * test * test * test * another docker test * GH A job.needs experiemnt * GH A tidy * GH A fix indent * GH A fix job * added GH actions CI badge * re-implemented: duplicate emissions if multiple annotations per reference assembly * updated datastes in line with feature dev * another badge ver * fix * added EP datasets * ensure non-empty process out * generalised for different gff3 interpretations * Delete .travis.yaml * Update README.md * Update README.md * At & Bd ref fasta not needed * speeding things up: gawk in jq container and up resources * do not report markers placed outside pseudochromosomes (e.g. on scaffolds) * id pattern match extended to seq placement * redundant-ish * added TOC

dtrudg · 2020-02-06T20:31:24Z

This has surfaced again in #5020 - I'm going to close this issue and we'll pick it up there. We have a plan to move forward on this on that issue.

WestleyK self-assigned this May 28, 2019

WestleyK added the Bug label May 28, 2019

tbugfinder mentioned this issue May 29, 2019

Failed to pull the image from docker registry if the cache dir is on NFS #3579

Closed

tbugfinder mentioned this issue Jul 1, 2019

singularity parallel pull error of docker images nextflow-io/nextflow#1210

Closed

jscook2345 assigned GodloveD Aug 15, 2019

rsuchecki mentioned this issue Sep 3, 2019

Failed parallel singularity pulls csiro-crop-informatics/repset#44

Open

adamnovak mentioned this issue Oct 11, 2019

Error: unable to build: conveyor failed to get: no descriptor found for reference #4555

Closed

rsuchecki referenced this issue in plantinformatics/pretzel-input-generator Jan 3, 2020

working around https://github.com/sylabs/singularity/issues/3634

6313210

rsuchecki referenced this issue in plantinformatics/pretzel-input-generator Jan 3, 2020

workaround for

5db7bac

nextflow-io/nextflow#1210 sylabs/singularity#3634

dtrudg closed this as completed Feb 6, 2020

geektortoise mentioned this issue May 5, 2020

Fetching error when pulling multiple images that have the same image base cytomine/Cytomine-software-router#8

Open

ggabernet mentioned this issue Nov 13, 2020

Issue running rnaseq v2.0 (DSL2) with test profile nf-core/rnaseq#494

Closed

mmore500 mentioned this issue Feb 21, 2021

Rejigger HPCC workflow to prevent intermittent Singularity cache race conditions mmore500/dishtiny#43

Closed

alexlyttle mentioned this issue Aug 28, 2024

Use singularity containers for apptainer profile too nf-core/tools#2357

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race conditions when caching images sometimes causes cache corruption. #3634

Race conditions when caching images sometimes causes cache corruption. #3634

rhpvorderman commented May 28, 2019 •

edited

Loading

WestleyK commented May 28, 2019

WestleyK commented May 28, 2019 •

edited

Loading

WestleyK commented May 28, 2019

tbugfinder commented May 29, 2019

rhpvorderman commented May 29, 2019 •

edited

Loading

WestleyK commented May 29, 2019

tbugfinder commented Jul 1, 2019

tbugfinder commented Aug 15, 2019

adamnovak commented Oct 11, 2019 •

edited

Loading

tbugfinder commented Oct 17, 2019

adamnovak commented Jan 3, 2020

dtrudg commented Jan 3, 2020

dtrudg commented Feb 6, 2020

Race conditions when caching images sometimes causes cache corruption. #3634

Race conditions when caching images sometimes causes cache corruption. #3634

Comments

rhpvorderman commented May 28, 2019 • edited Loading

Version of Singularity:

Expected behavior

Actual behavior

Steps to reproduce behavior

WestleyK commented May 28, 2019

WestleyK commented May 28, 2019 • edited Loading

WestleyK commented May 28, 2019

tbugfinder commented May 29, 2019

rhpvorderman commented May 29, 2019 • edited Loading

WestleyK commented May 29, 2019

tbugfinder commented Jul 1, 2019

tbugfinder commented Aug 15, 2019

adamnovak commented Oct 11, 2019 • edited Loading

tbugfinder commented Oct 17, 2019

adamnovak commented Jan 3, 2020

dtrudg commented Jan 3, 2020

dtrudg commented Feb 6, 2020

rhpvorderman commented May 28, 2019 •

edited

Loading

WestleyK commented May 28, 2019 •

edited

Loading

rhpvorderman commented May 29, 2019 •

edited

Loading

adamnovak commented Oct 11, 2019 •

edited

Loading