[Feature request] Add option to exclude symlink creation for temporary files (for easy use with AWS S3) #93
Comments
The I don't think EFS makes sense for this, as the
Your unrelated question is actually another reason why I don't really want to implement this. We convert FASTA files into internal formats and, where possible, do this through symlinks so that no additional storage space is used. Some of the intermediate databases that get created are also symlinks to other intermediate databases, so avoiding symlinks is not quite trivial in the current implementation.
There's an AWS service called S3 Express One Zone that is high-performance and low-latency, which makes it good for scratch space (at least better than standard S3, which is very slow when mounted).
Yeah, EBS is great, but it's a bit tricky to use at scale because (from my understanding) you allocate it to a specific EC2 instance, so you can't really run multiple jobs with it.
I saw this parameter and it's great, especially if one were to try to use this with AWS S3 Express One Zone scratch space.
At least from my experience, it depends on the scale. I tried gene predictions on ~200 genomes or so and the temporary directory exceeded 1.5 TB. For instance, I'm running it right now on MicroEuk50, which has 30M proteins, on GCA_900893395.1, and it's currently at 57 GB in temporary space when
If I use
EFS is wildly expensive to run analysis on. Right now, my only affordable option is to use EBS, but that's better suited to one-shot analysis and not as easily reproducible for workflows or use at scale (I need to estimate the storage and memory footprints for all the jobs, then create an EC2 instance with those specs, then use GNU parallel to run the jobs at the same time, hoping it doesn't crash the VM). It would be great if I could deploy jobs and set the temporary directory to AWS S3 Express One Zone. Here is some of the pricing for reference.
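For the storage estimate mentioned above, one option is to measure the temporary directory of a pilot run and extrapolate from it. This is only a rough sketch in Python: the function name `directory_size_bytes` is made up for illustration and is not part of MetaEuk or MMseqs2. It counts regular files only, so databases that exist as symlinks are not counted twice.

```python
import os
import stat


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root without following symlinks.

    Hypothetical helper for illustration; not part of MetaEuk.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat describes the entry itself, so a symlink is seen as a
            # link (and skipped) rather than as the file it points to.
            info = os.lstat(path)
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total
```

Calling it on the temporary directory of one finished genome gives a per-job number that can be multiplied by the number of jobs planned per instance, with some headroom.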
Hmm... yeah, I can see how this will be tricky to adapt. I've had a great experience with MetaEuk so far and would love to continue using it at scale for larger projects. I guess this GitHub issue can serve more as an example of potential limitations of using the tool at scale, to consider during further development, than as a bona fide feature request. Alternatively, have you had any success with miniprot by any chance? I'm seeing it used more and more, but the number of genes I get out is orders of magnitude higher (>100k genes) compared to MetaEuk (with the clustered Microeukaryotic Protein Database I made in the VEBA 2.0 publication, Table 2, for use with MetaEuk). If I knew C I would definitely offer some pull requests, but unfortunately I only know Python at a production level. It's on my ever-growing to-do list, though! Regardless, I appreciate the insight and your time for the responses (also the amazing tool you developed, which has allowed me to do much more robust climate change and public health research).
Hi Josh, I'll let @milot-mirdita continue the discussion about S3, but here are a couple of relevant points:
I really appreciate the insight on this. Right now the database I'm using is clustered (similar to UniRef, where it's clustered at 100%, 90%, and 50%), with the database I'm testing being around 30M proteins. I'm working on some methods to do more targeted, iterative gene predictions with MetaEuk (casting a wide net of markers, then building a smaller, more targeted set from source organisms), but it's early in development and I need to benchmark against ground truths. If I have any developments, I will add them here in case it's helpful.
I'd like to run MetaEuk on hundreds of eukaryotic genome assemblies using AWS, but writing to disk on EFS is extremely expensive. The alternative is to use an S3 bucket for temporary storage. However, this currently isn't possible with MetaEuk because it creates a symlink called "latest" in the temporary directory, which isn't supported on S3 since it's not a traditional file system even when mounted (e.g., you can't create symlinks, remove files, or edit files once they are created, with the latter being possible using the AWS CLI).
Would it be possible to make the temporary files more "S3 friendly" by adding an option to not create any symlinks in the temporary directory (and also not edit or remove files once they are created)?
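To check up front whether a candidate temporary directory (a mounted bucket, EBS, local scratch) can hold a symlink like "latest" at all, a small probe can be run before launching a job. This is a minimal sketch in Python, not anything MetaEuk provides: `supports_symlinks` and the probe file names are hypothetical, and it only tests symlink creation, not the edit/remove behavior also mentioned above.

```python
import os
import uuid


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside directory.

    Hypothetical pre-flight check; not part of MetaEuk.
    """
    token = uuid.uuid4().hex
    target = os.path.join(directory, f".probe_target_{token}")
    link = os.path.join(directory, f".probe_link_{token}")
    try:
        with open(target, "w") as handle:
            handle.write("probe")
        os.symlink(target, link)
        return os.path.islink(link)
    except (OSError, NotImplementedError):
        # Covers missing directories, read-only mounts, and file systems
        # that reject symlink creation.
        return False
    finally:
        # Best-effort cleanup; some mounts may also refuse deletion.
        for path in (link, target):
            try:
                os.remove(path)
            except OSError:
                pass
```

If the probe returns False for a directory, pointing the temporary-directory argument at it would be expected to fail at the first symlink, so the job can be redirected to different scratch space before any compute is spent.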
Also, an unrelated question while I have your attention. If you run the following command:
metaeuk easy-predict ${FASTA} ${DB} ${OUTPUT_DIRECTORY} ${TMP}
where FASTA is a genome assembly FASTA file and DB is an MMseqs2 protein database, does the input FASTA file get converted to an MMseqs2 database in the backend?
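For running the command above over many assemblies, here is a rough Python sketch that builds one `metaeuk easy-predict` call per genome, each with its own output and temporary directory, and runs a bounded number at a time. It uses only the four positional arguments shown in the command; `build_command`, `plan_jobs`, `run_jobs`, and the directory layout are assumptions for illustration, and the scratch root is assumed to be a local file system that supports symlinks.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(fasta, db, output_dir, tmp_dir, executable="metaeuk"):
    """Mirror: metaeuk easy-predict FASTA DB OUTPUT_DIRECTORY TMP."""
    return [executable, "easy-predict",
            str(fasta), str(db), str(output_dir), str(tmp_dir)]


def plan_jobs(fastas, db, output_root, scratch_root):
    """One command per assembly, with per-genome output and tmp paths."""
    jobs = []
    for fasta in fastas:
        name = Path(fasta).stem  # e.g. GCA_900893395.1 from GCA_900893395.1.fasta
        jobs.append(build_command(fasta, db,
                                  Path(output_root) / name,
                                  Path(scratch_root) / name))
    return jobs


def run_jobs(jobs, max_parallel=2):
    """Run the planned commands, at most max_parallel at once."""
    def run_one(command):
        # The last argument is the per-job temporary directory.
        Path(command[-1]).mkdir(parents=True, exist_ok=True)
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, jobs))
```

Keeping `max_parallel` tied to the measured memory and scratch footprint of a single job is the same estimate described for the GNU parallel setup; the sketch does not do that sizing itself.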