
Race conditions when caching images sometimes cause cache corruption #3634

Closed
rhpvorderman opened this issue May 28, 2019 · 13 comments

@rhpvorderman

rhpvorderman commented May 28, 2019

Version of Singularity:

3.1.0

Expected behavior

When two singularity processes pull the same image, some measure should be taken to ensure that they do not write to the cache at the same time.
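
For illustration, the kind of serialization I mean could be approximated today by wrapping pulls in the util-linux flock(1) tool — a minimal sketch, assuming an advisory lock file (the lock path here is arbitrary, not something Singularity actually uses):

# Hypothetical wrapper: only one pull may touch the cache at a time.
# ~/.singularity/cache.lock is an example path, not a real Singularity file.
flock ~/.singularity/cache.lock singularity pull docker://python:3.7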

Actual behavior

Two singularity processes will write to the cache at the same time. Oddly enough, this works in most cases. Sometimes, however, we get cache corruption on our cluster. This happens when we start multiple jobs that require the same image simultaneously.

Steps to reproduce behavior

  1. singularity cache clean --all
  2. run singularity shell docker://python:3.7 simultaneously in two different terminals.

EDIT: I realize it is very hard to reproduce behaviour that happens 'sometimes'. I could not find a similar issue so I hope that other people with the same problem manage to find this one.
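
For anyone trying to reproduce this, a small script that makes the race more likely by starting several identical jobs at once (a reproduction aid only; the loop count is arbitrary, and exec is used instead of shell so the jobs need no terminal):

#!/bin/bash
# Clear the cache, then launch several identical pulls concurrently.
singularity cache clean --all
for i in 1 2 3 4; do
    singularity exec docker://python:3.7 true &
done
wait    # on a bad run, some of the jobs fail with cache errors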

@WestleyK
Contributor

Yeah, I can reproduce something like what you describe. Is this similar to your error?

$ singularity exec docker://ubuntu ls & sleep 2.0s && singularity cache clean --all
[1] 32257
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
FATAL:   Unable to handle docker://ubuntu uri: unable to build: While creating SIF: while creating container: container file creation failed: open /home/westleyk/.singularity/cache/oci-tmp/f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5/ubuntu_latest.sif: no such file or directory

[1]+  Exit 255                singularity exec docker://ubuntu ls

@WestleyK
Contributor

WestleyK commented May 28, 2019

After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
 2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory

EDIT: this bug is not related to this issue; I was not on the master branch 🤦‍♂️

WestleyK self-assigned this May 28, 2019
WestleyK added the Bug label May 28, 2019
@WestleyK
Contributor

Btw, my singularity version is:

3.2.0-513.g3c02d0904

@tbugfinder

I'm running two pulls in parallel using:

$ singularity --version
singularity version 3.2.1-1.el7

Parallel Pull:

$ rm -Rf ~/.singularity/cache/ ;  rm -f *.img ; strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10  & strace -ff -o /tmp/singularity/ubuntu1804.strace singularity pull --name ubuntu1804.img docker://ubuntu:18.04
[1] 262982
INFO:    Starting build...
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:89074f19944ee6c68e5da6dea5004e1339e4e8e9c54ea39641ad6e0bc0e4223b
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 1s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 27.89 MiB / 27.89 MiB [====================================================] 2s
Copying blob sha256:6cd3a42e50dfbbe2b8a505f7d3203c07e72aa23ce1bdc94c67221f7e72f9af6c
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 865 B / 865 B [============================================================] 0s
Copying blob sha256:26b902a7bf04aa8d7c02fd742898dab4b6c791b8e363fddc06298191167d5fac
 162 B / 162 B [============================================================] 0s
 164 B / 164 B [============================================================] 0s
Copying config sha256:7c8c583f970820a51dab6e0613761c4f99077d9a22b373a59f47ee2afb247e72
 0 B / 2.36 KiB [--------------------------------------------------------------]Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
Storing signatures
FATAL:   Unable to pull docker://ubuntu:18.10: conveyor failed to get: Error initializing source oci:/home/sigim/.singularity/cache/oci:50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4: no descriptor found for reference "50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4"
INFO:    Creating SIF file...
INFO:    Build complete: ubuntu1804.img
[1]+  Exit 255                strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10

@rhpvorderman
Author

rhpvorderman commented May 29, 2019

We sometimes get this error when we have cache corruption: FATAL: container creation failed: mount error: can't remount /run/shm: no such file or directory. But maybe that is caused by something else. EDIT: never mind, this was not related.

I am glad you were able to reproduce the race-conditions! Thanks!

@WestleyK
Contributor

After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory

Never mind this ^^^ problem; I was on a dev branch (not master) 🤦‍♂️. That issue has nothing to do with cache corruption.

But there is still a bug if you clean the cache while building a container... which may not actually be a bug.

@tbugfinder

@WestleyK Nextflow in particular runs parallel pulls prior to workflow execution.

@tbugfinder

Is there any chance for a bug fix?

@adamnovak

adamnovak commented Oct 11, 2019

I'm also getting issues like this in Toil workflows trying to use Singularity:

Unable to handle docker://devorbitus/ubuntu-bash-jq-curl uri: unable to build: conveyor failed to get: no descriptor found for reference "7f5e6bce78bb52d74e6a0881ec91806d11978cedfd4caa43a6fb71c55350254a"

It seems difficult in practice to prevent other software running as the same user as you from trying to use Singularity to run the same image you are trying to run. The only workaround I can come up with is always setting your own SINGULARITY_CACHEDIR, at which point you lose the benefit of caching between tasks.
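
For completeness, that workaround looks something like this — a sketch only; the jq invocation is just an example command:

# Give this task a private, throwaway cache so it never races other tasks.
# The cost: nothing is reused, so every task re-downloads its image.
export SINGULARITY_CACHEDIR=$(mktemp -d)
singularity exec docker://devorbitus/ubuntu-bash-jq-curl jq --version
rm -rf "$SINGULARITY_CACHEDIR"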

@tbugfinder

I've upgraded to the latest version and still end up with the conveyor error:

$ singularity --version
singularity version 3.4.2-1.el7



Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ubuntu-18.10.img docker://ubuntu:18.10 > /dev/null
  status : 255
  message:
INFO:    Converting OCI blobs to SIF format
    INFO:    Starting build...
    Getting image source signatures
    Copying blob sha256:8a532469799e09ef8e1b56ebe39b87c8b9630c53e86380c13fbf46a09e51170e

     0 B / 25.82 MiB [-------------------------------------------------------------]
     8.88 MiB / 25.82 MiB [===================>------------------------------------]
     15.61 MiB / 25.82 MiB [=================================>---------------------]
     21.16 MiB / 25.82 MiB [=============================================>---------]
     25.82 MiB / 25.82 MiB [====================================================] 0s
    Copying blob sha256:32f4dcec3531395ca50469cbb6cba0d2d4fed1b8b2166c83b25b2f5171c7db62

     0 B / 34.32 KiB [-------------------------------------------------------------]
     34.32 KiB / 34.32 KiB [====================================================] 0s
    Copying blob sha256:230f0701585eb7153c6ba1a9b08f4cfbf6a25d026d7e3b78a47c0965e4c6d60a

     0 B / 868 B [-----------------------------------------------------------------]
     868 B / 868 B [============================================================] 0s
    Copying blob sha256:e01f70622967c0cca68d6a771ae7ff141c59ab979ac98b5184db665a4ace6415

     0 B / 164 B [-----------------------------------------------------------------]
     164 B / 164 B [============================================================] 0s
    Copying config sha256:e4186b579c943dcced1341ccc4b62ee0617614cafc5459733e2f2f7ef708f224

     0 B / 2.42 KiB [--------------------------------------------------------------]
     2.42 KiB / 2.42 KiB [======================================================] 0s
    Writing manifest to image destination
    Storing signatures
    FATAL:   While making image from oci registry: while building SIF from layers: conveyor failed to get: no descriptor found for reference "7d657275047118bb77b052c4c0ae43e8a289ca2879ebfa78a703c93aa8fd686c"


rsuchecki referenced this issue in plantinformatics/pretzel-input-generator Jan 3, 2020
@adamnovak

As a response to https://github.com/sylabs/singularity/issues/4555#issuecomment-570612570: it would be extremely useful for my use case to have some synchronization inside Singularity that relies on atomic, globally-consistent rename support, or even on file-lock support, in the backing filesystem. The result would, AFAIK, be no worse in the case where Singularity runs on multiple machines against a filesystem without such support (i.e. you would still get uncontrolled races and apparently arbitrary failures), but on a single machine with an ext4 home directory (which covers e.g. most cloud VMs) you would get actually-reliable behavior.
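
To sketch the rename idea in shell terms (a general pattern, not Singularity's actual code; build_sif_into is a hypothetical placeholder for the SIF creation step, and $cachedir is an assumed variable):

# Stage the output under a unique temp name on the same filesystem, then
# rename it into place. rename(2) is atomic, so readers see either no file
# or a complete one - never a partial write.
tmp=$(mktemp "$cachedir/ubuntu_latest.sif.XXXXXX")
build_sif_into "$tmp"                     # hypothetical build step
mv -f "$tmp" "$cachedir/ubuntu_latest.sif"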

@dtrudg
Contributor

dtrudg commented Jan 3, 2020

@adamnovak - understood. There have been some caching code changes since the earlier 3.x versions that I'm not entirely familiar with yet, but I believe we have fewer issues now. We can try to establish the exact points where problems remain, and look at improvements for the fairly constrained case you describe in the next release cycle. I just don't want to promise that we can solve things simply for people who want to share cache directories between multiple users on arbitrary cluster filesystems.

We still recommend that you singularity pull into a SIF file in a single location (a single script, etc.) before any concurrent execution, and run against that immutable SIF.
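
Concretely, something along these lines (the image and job count are arbitrary examples):

# Pull once, up front, into an immutable SIF file...
singularity pull ubuntu.sif docker://ubuntu:18.04

# ...then run any number of concurrent jobs against it; execution does
# not write to the cache, so there is nothing to race on.
for i in $(seq 10); do
    singularity exec ubuntu.sif ls &
done
wait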

rsuchecki referenced this issue in plantinformatics/pretzel-input-generator Jan 8, 2020
@dtrudg
Contributor

dtrudg commented Feb 6, 2020

This has surfaced again in #5020 - I'm going to close this issue and we'll pick it up there; we have a plan there for moving this forward.
