PRJEB49093 #6

Open
lskatz opened this issue Mar 13, 2023 · 3 comments

Comments

@lskatz
Member

lskatz commented Mar 13, 2023

A colleague wrote:

Not sure if it is of interest, but PRJEB49093 has 151 M. tuberculosis isolates with matched Illumina and Nanopore data. Mixture of resistance profiles.
The raw nanopore data and all the metadata files can also be found in the Data Sharing section of our paper https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(22)00301-9/fulltext#seccestitle210

So I will check out that bioproject to see if any genomes are fast to assemble.

@lskatz
Member Author

lskatz commented Mar 13, 2023

After downloading all reads, these are the smallest file sizes after sorting and compressing with -9.

-rw-------. 1 gzu2 users 168M Mar 13 09:33 ERR9030520.fastq.gz
-rw-------. 1 gzu2 users 167M Mar 13 09:26 ERR9030361.fastq.gz
-rw-------. 1 gzu2 users 163M Mar 13 09:36 ERR9030334.fastq.gz
-rw-------. 1 gzu2 users 163M Mar 13 09:35 ERR9030284.fastq.gz
-rw-------. 1 gzu2 users 163M Mar 13 09:17 ERR9030249.fastq.gz
-rw-------. 1 gzu2 users 161M Mar 13 09:30 ERR9030283.fastq.gz
-rw-------. 1 gzu2 users 159M Mar 13 09:17 ERR9030474.fastq.gz
-rw-------. 1 gzu2 users 153M Mar 13 09:08 ERR9030398.fastq.gz
-rw-------. 1 gzu2 users 144M Mar 13 10:20 ERR9030424.fastq.gz
-rw-------. 1 gzu2 users 142M Mar 13 10:18 ERR9030315.fastq.gz
-rw-------. 1 gzu2 users 138M Mar 13 10:03 ERR9030319.fastq.gz
-rw-------. 1 gzu2 users 137M Mar 13 09:22 ERR9030505.fastq.gz
-rw-------. 1 gzu2 users 128M Mar 13 10:04 ERR9030329.fastq.gz
-rw-------. 1 gzu2 users 122M Mar 13 09:56 ERR9030258.fastq.gz
-rw-------. 1 gzu2 users 113M Mar 13 10:14 ERR9030503.fastq.gz
-rw-------. 1 gzu2 users 112M Mar 13 09:06 ERR9030316.fastq.gz

Assembling the smallest one, ERR9030316.fastq.gz, takes this long:

1236.86user 78.76system 5:36.83elapsed 390%CPU (0avgtext+0avgdata 8945592maxresident)k

So I'm not sure whether 5:36 of wall-clock time is acceptable for a toy dataset.
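
For anyone reading the timing line above, here is a rough sketch of how its fields relate to each other. It assumes the line is in the default GNU `time` format, where `maxresident` is reported in kilobytes; the regular expression and the variable names are only illustrative.

```python
import re

# The timing line from the comment above, copied verbatim.
line = ("1236.86user 78.76system 5:36.83elapsed 390%CPU "
        "(0avgtext+0avgdata 8945592maxresident)k")


def parse_elapsed(text):
    """Convert an [h:]mm:ss.ss string into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Illustrative pattern for this one line; not a general parser for `time` output.
pattern = (r"([\d.]+)user ([\d.]+)system ([\d:.]+)elapsed "
           r"(\d+)%CPU .*?(\d+)maxresident")
m = re.match(pattern, line)

user = float(m.group(1))          # CPU seconds spent in user mode
system = float(m.group(2))        # CPU seconds spent in kernel mode
elapsed = parse_elapsed(m.group(3))  # wall-clock seconds
reported_cpu_pct = int(m.group(4))
max_rss_kb = int(m.group(5))      # assumed to be kilobytes (GNU time default)

cpu_seconds = user + system
avg_cores = cpu_seconds / elapsed   # average number of busy cores
peak_gib = max_rss_kb / 1024 ** 2   # kilobytes to GiB

print(f"wall clock: {elapsed:.2f} s")
print(f"CPU time:   {cpu_seconds:.2f} s")
print(f"avg cores:  {avg_cores:.2f}")
print(f"peak RSS:   {peak_gib:.2f} GiB")
```

Read this way, the run took roughly 337 seconds of wall-clock time while keeping just under four cores busy, and it peaked at about 8.5 GiB of memory. The memory footprint may matter as much as the elapsed time when judging whether this works as a toy dataset.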

@mbhall88

Fair. You could always just subsample them? That'd presumably make them faster to assemble?

@lskatz
Member Author

lskatz commented Mar 14, 2023

I don't have subsampling in this repo, but I will consider it for later. I don't necessarily want to add another dependency. Another consideration is that this repo relies on a hashsum, so even if I did subsample, I would have to do it in a deterministic way.
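
To make the determinism requirement concrete, here is a minimal sketch of seeded subsampling that uses only the Python standard library, so it adds no new dependency. It is not how this repo does anything. The function names (`read_fastq`, `subsample`, `sha256_of_records`), the seed value, and the synthetic reads are all hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
import hashlib
import io
import random


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:
            return
        yield record


def subsample(records, k, seed):
    """Reservoir-sample k records; same seed and input order give the same output."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append((i, rec))
        else:
            # rng.random() is the call the random module documents as
            # reproducible for a given seed; scale it to an index in [0, i].
            j = int(rng.random() * (i + 1))
            if j < k:
                reservoir[j] = (i, rec)
    # Restore the original read order so the output is byte-for-byte stable.
    return [rec for _, rec in sorted(reservoir, key=lambda pair: pair[0])]


def sha256_of_records(records):
    """Hash the uncompressed record text, not the .gz bytes."""
    digest = hashlib.sha256()
    for rec in records:
        for line in rec:
            digest.update(line.encode())
    return digest.hexdigest()


# Synthetic reads for illustration only; not data from PRJEB49093.
fake = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(100))

first = subsample(read_fastq(io.StringIO(fake)), k=10, seed=42)
second = subsample(read_fastq(io.StringIO(fake)), k=10, seed=42)

print(sha256_of_records(first) == sha256_of_records(second))
```

For a real file, the handle would come from something like `gzip.open(path, "rt")`. The sketch hashes the uncompressed text because gzip output can differ between runs (the header can carry a timestamp, for example) even when the reads are identical, so a checksum over the `.gz` bytes would need the compression step to be pinned down as well.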
