Access to Irber et al. sketch collection #5

Open
GeoMicroSoares opened this issue Nov 7, 2022 · 7 comments

@GeoMicroSoares

Good afternoon,

I would like to access the 7.5 TB sketch collection described in Irber et al. (2022). How could this be made possible? I can be reached at andre.rodrigues-soares at uni-due.de.

Thank you in advance.

@ctb
Member

ctb commented Nov 7, 2022

hi @GeoMicroSoares, thanks for asking! we don't have a standard way of making it all available yet, so let me check in with @luizirber to see what he suggests.

@ctb
Member

ctb commented Nov 8, 2022

ok, consulted ;).

tl;dr distributing ~10 TB of data is annoying!

First, if you are "simply" interested in the search itself, mastiff (described here) will let you do the search in realtime!

Second, if you are interested in a subset of the sketches, we can give you direct HTTPS URLs to download them. That's probably fine for 1,000-10,000 sketches, but beyond that we would start to be a bit worried about our Web server 😅 . Drop me an e-mail at [email protected] if you want some of the URLs (we are not posting them publicly just yet).

Third, I might be able to set up a globus endpoint for you to use! Would that meet your needs?

Last but not least, if you sent us physical hard drives we would be able to copy files for you.

Thoughts/preferences?

I have yet to post our catalog anywhere public, will focus on doing that next.

Feel free to swing by our gitter channel to chat about the options, too!
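
For the second option above (a subset of sketches over HTTPS), a throttled downloader along these lines could keep the load on the Web server modest. This is only a sketch under stated assumptions: the URLs are not public, so the function takes whatever list of URLs is provided by e-mail, and the names `fetch_bytes` and `download_sketches`, as well as the one-second default pause, are illustrative choices rather than part of any sourmash tooling.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=60):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_sketches(urls, dest_dir, delay_seconds=1.0,
                      fetch=fetch_bytes, sleep=time.sleep):
    """Download each URL into dest_dir one at a time, pausing between requests.

    Files that already exist are skipped, so an interrupted run can be resumed.
    Returns the list of paths written during this call.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        name = Path(urlparse(url).path).name
        if not name:
            raise ValueError(f"cannot derive a file name from {url!r}")
        target = dest / name
        if target.exists():
            continue  # already downloaded; no request is made
        target.write_bytes(fetch(url))
        written.append(target)
        sleep(delay_seconds)  # hypothetical throttle; tune to what the maintainers suggest
    return written
```

Calling `download_sketches(urls, "sketches")` with a list of URL strings would fetch them sequentially; rerunning the same call picks up where it left off. For anything much beyond the 1,000-10,000 range mentioned above, the other options (Globus or physical drives) are probably the better route.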

@GeoMicroSoares
Author

Hi again! I had no idea mastiff was available! I just ran the ipython notebook on a genome of interest and the results were pretty awesome! Thanks so much for the work you put into making this available!!

A quick question: if I wanted to query a small collection of genomes, what would be the best way to do that? Should I concatenate them all into one fasta and process that file the same way? Would you recommend otherwise?

Thanks again for everything!

@ctb
Member

ctb commented Nov 8, 2022

that's fantastic 😆

in the mastiff binder repo there's a Snakefile, but I haven't tested it recently. It SHOULD be as simple as putting all your genome sequences in sequences/*.fa.gz (one file per genome) and running snakemake -j 2 or something similar. Will verify later!
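
A minimal sketch of that workflow, assuming the Snakefile picks up everything matching sequences/*.fa.gz as described above (this has not been checked against the actual repo). The helper names `find_genomes`, `snakemake_command`, and `run_snakemake` are hypothetical; the only real command involved is `snakemake -j <jobs>`.

```python
import subprocess
from pathlib import Path


def find_genomes(seq_dir="sequences", pattern="*.fa.gz"):
    """Return the genome files the workflow is expected to pick up, sorted by name."""
    return sorted(Path(seq_dir).glob(pattern))


def snakemake_command(jobs=2):
    """Build the command line from the comment above: snakemake -j <jobs>."""
    return ["snakemake", "-j", str(jobs)]


def run_snakemake(seq_dir="sequences", jobs=2, execute=False):
    """Check that at least one genome is present, then optionally launch snakemake.

    With execute=False (the default) nothing is run; the command is only returned.
    """
    genomes = find_genomes(seq_dir)
    if not genomes:
        raise FileNotFoundError(f"no *.fa.gz files found in {str(seq_dir)!r}")
    command = snakemake_command(jobs)
    if execute:
        # Assumes snakemake is installed and the current directory holds the Snakefile.
        subprocess.run(command, check=True)
    return command
```

Checking for input files first avoids launching a workflow that has nothing to do. Keeping one file per genome, rather than concatenating everything into a single FASTA, should also keep the results separable per genome, assuming the Snakefile processes each file independently.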

@GeoMicroSoares
Author

It is! What an awesome resource, thanks again for this!

Just a question: does this tool search the whole of the SRA every time it runs, or does it query a subsample of the SRA that you indexed? If it's a subsample, how often will it be updated in the future?

@ctb
Member

ctb commented Nov 8, 2022

I believe it's up to date with metagenomes as of a few weeks ago. We plan to update it regularly; it shouldn't ever be more than 2-4 months out of date. We're still working out how to communicate exactly what is in there.

@ctb
Member

ctb commented Jan 8, 2023

From @luizirber -

There is a downloader here:

https://github.com/sourmash-bio/sra_search/blob/main/Snakefile#L37

It currently pulls from wort (IPFS); I haven't switched it to load from farm yet.

There is a reorganization of the pipeline here:

sourmash-bio/sra_search#13
