Race conditions when caching images sometimes cause cache corruption. #3634
Comments
Yeah, I can reproduce something like what you describe. Is this similar to your error?
$ singularity exec docker://ubuntu ls & sleep 2.0s && singularity cache clean --all
[1] 32257
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO: Creating SIF file...
FATAL: Unable to handle docker://ubuntu uri: unable to build: While creating SIF: while creating container: container file creation failed: open /home/westleyk/.singularity/cache/oci-tmp/f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5/ubuntu_latest.sif: no such file or directory
[1]+ Exit 255 singularity exec docker://ubuntu ls
After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?
$ singularity pull library://alpine:latest
INFO: Downloading library image
2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL: While pulling library image: while opening cached image: open : no such file or directory
EDIT: this bug is not related to this issue, I was not on the master branch 🤦♂️
Btw, my singularity version is:
I'm running two pulls in parallel using:
Parallel Pull:
I am glad you were able to reproduce the race conditions! Thanks!
Never mind this ^^^ problem, I was on a dev branch (not master) 🤦♂️; that issue has nothing to do with cache corruption. But there is still a bug if you clean the cache while building a container, which may not be a bug...
@WestleyK Nextflow in particular uses parallel pulls prior to workflow execution.
Is there any chance of a bug fix?
I'm also getting issues like this in Toil workflows trying to use Singularity:
It seems difficult in practice to prevent all software running as the same user as you from trying to use Singularity to run the same image as you are trying to run. The only workaround I can come up with is always setting your own SINGULARITY_CACHEDIR, at which point you lose the benefit of caching between tasks.
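The workaround mentioned above could look roughly like this when tasks are launched from Python. This is only a sketch: the helper names `private_cache_env` and `run_with_private_cache` and the directory prefix are hypothetical, and the only thing taken from Singularity itself is the SINGULARITY_CACHEDIR environment variable.

```python
import os
import subprocess
import tempfile


def private_cache_env(base_env=None, parent_dir=None):
    """Return a copy of the environment with a fresh, private cache directory.

    Hypothetical helper: only SINGULARITY_CACHEDIR comes from Singularity.
    """
    env = dict(os.environ if base_env is None else base_env)
    # A brand-new directory per task means no other process shares this cache.
    env["SINGULARITY_CACHEDIR"] = tempfile.mkdtemp(
        prefix="singularity-cache-", dir=parent_dir
    )
    return env


def run_with_private_cache(command, parent_dir=None):
    """Run `command` (e.g. a singularity invocation) with its own cache."""
    env = private_cache_env(parent_dir=parent_dir)
    return subprocess.run(command, env=env, check=False).returncode
```

A task would then call something like `run_with_private_cache(["singularity", "exec", "docker://ubuntu", "ls"])`. As noted above, every task re-downloads and re-converts the image, and this sketch does not remove the temporary cache directory afterwards.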
I've upgraded to the latest version and still end up with the conveyor error:
nextflow-io/nextflow#1210 sylabs/singularity#3634
As a response to https://github.com/sylabs/singularity/issues/4555#issuecomment-570612570, it would be extremely useful for my use case to have some synchronization inside Singularity that depends on atomic globally-consistent rename support, or even that depends on file lock support, on the backing filesystem. The result would be AFAIK no worse in the case where Singularity is running on multiple machines against a filesystem without support for these tools (i.e. you'd still get uncontrolled races and apparently arbitrary failures), but within a single machine with an ext4 home directory (which covers e.g. most cloud VMs) you would get actually-reliable performance.
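To make the request concrete, here is a minimal sketch of the two primitives named above (an advisory file lock plus an atomic rename) applied to a cache entry. It is not how Singularity implements its cache; `get_or_create_cached` and the `produce` callback are hypothetical names, and the sketch assumes a local filesystem such as ext4 where `flock` and same-directory renames behave as expected.

```python
import fcntl
import os
import tempfile


def get_or_create_cached(cache_dir, name, produce):
    """Return the path of cache_dir/name, creating it at most once.

    `produce(path)` is a hypothetical callback that writes the image to `path`.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, name)
    lock_path = final_path + ".lock"

    with open(lock_path, "w") as lock_file:
        # Block until no other process holds the lock for this entry.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(final_path):
                # Write to a temporary file in the same directory so the
                # final rename stays on one filesystem and is atomic.
                fd, tmp_path = tempfile.mkstemp(
                    dir=cache_dir, prefix=name + ".tmp-"
                )
                os.close(fd)
                try:
                    produce(tmp_path)
                    os.replace(tmp_path, final_path)
                except BaseException:
                    # Never leave a half-written file behind.
                    if os.path.exists(tmp_path):
                        os.unlink(tmp_path)
                    raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return final_path
```

With this shape, a second caller either waits for the first to finish and then reuses the entry, or finds a complete file; it never sees a partially written one. It does not protect a build from a concurrent cache clean unless the cleaner takes the same lock.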
@adamnovak - understood. There have been some caching code changes since earlier 3.x versions that I'm not entirely familiar with yet, but I believe we have fewer issues now. We can try to establish exactly where problems remain, and look at improvements for the fairly constrained case you describe in the next release cycle. I just don't want to promise that we can simply solve things for people who want to share cache directories between multiple users on arbitrary cluster filesystems. We still recommend that you
This has surfaced again in #5020 - I'm going to close this issue and we'll pick it up there. We have a plan to move forward on this in that issue.
Version of Singularity:
3.1.0
Expected behavior
When two singularity processes pull the same image, some measure is taken to ensure they do not write to the cache at the same time.
Actual behavior
Two singularity processes will write to the cache at the same time. Oddly enough, this works in most cases. However, we sometimes get cache corruption on our cluster. This happens when we simultaneously start multiple jobs that require the same image.
Steps to reproduce behavior
singularity cache clean --all
singularity shell docker://python:3.7
simultaneously in two different terminals.
EDIT: I realize it is very hard to reproduce behaviour that happens 'sometimes'. I could not find a similar issue, so I hope that other people with the same problem manage to find this one.
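Since the race only shows up sometimes, a small harness that starts several copies of a command at roughly the same moment can help when repeating the steps above. The sketch below is an illustration only: `run_concurrently` is a hypothetical helper and nothing in it is specific to Singularity.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(command, copies=2):
    """Start `copies` instances of `command` at about the same time.

    Returns the list of exit codes, one per instance.
    """
    def run_once(_):
        return subprocess.run(command, check=False).returncode

    # One worker thread per copy, so all instances are launched together.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_once, range(copies)))
```

After cleaning the cache as in the steps above, one could call `run_concurrently(["singularity", "exec", "docker://python:3.7", "true"])` in a loop and look for non-zero exit codes. Using `exec ... true` in place of the interactive `shell` command is an assumption made so the check can run unattended.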