Spec needs to include the accepted file extensions for the index files #215

Open
yfarjoun opened this issue May 26, 2017 · 7 comments

Comments

@yfarjoun
Contributor

In the wild, both *.bam.bai and *.bai appear as index files for *.bam, and both *.cram.crai and *.crai appear as index files for *.cram. On top of that, .bai (and possibly .cram.bai) files are valid index files for CRAM. Needless to say, this adds significant overhead to programs that need to look for index files, and implementations may disagree when multiple valid index files are found, since each might search in a different order.

I suggest that we include in the specification a naming convention for the index files.

@jkbonfield
Contributor

What are you considering doing? Stating one thing as canonical and changing either htslib or htsjdk so that new files are generated according to the updated specification, while still reading old files by checking both paths?

If so it sounds sensible, but my vote would be to go with the names used by the original author ;-)

@yfarjoun
Contributor Author

ouch.

I was hoping that we could use this issue to nail down the proposal and have the "original authors" chime in. Then I'll be happy to formalize it in a PR.

The next comment will include a table. Feel free to edit it (and note who changed it last).

@yfarjoun
Contributor Author

| File Type | Main File Extension | Index File Extension | Last touch + comments |
|---|---|---|---|
| SAM | .sam | N/A | @yfarjoun |
| BAM | .bam | .bai | @yfarjoun |
| CRAM | .cram | .cram.crai | @yfarjoun (though really??? shouldn't it be .crai?) |

@jkbonfield
Contributor

If we want to nail it down, then IMO we should nail it down to the original filenames supported by both early implementations and revert this picard commit which caused the disparity in the first place!

samtools/htsjdk@7459fba

@jmarshall
Member

Surely this horse bolted many years ago for .bai vs .bam.bai. OTOH there may be hope for a single canonical filename for a CRAI index.

@tfenne
Member

tfenne commented May 31, 2017

I agree with @jmarshall. I think the only reasonable thing to do for BAM is to document that both foo.bai and foo.bam.bai are valid index names for foo.bam, and maybe state a preference going forward, though that may be contentious. I for one read ".bam.bai" as "bam bam index" and dislike it as much as "ATM machine".

@jkbonfield
Contributor

jkbonfield commented May 31, 2017

Agreed we shouldn't be forcing anything and obviously existing software now has to check both filenames as either can exist, but specifying a preference isn't a bad idea. I wouldn't argue for making the same mistake with CRAM though. If both implementations right now look in .cram.crai then leave well alone!
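
For illustration only, the "check both filenames" lookup discussed in this thread could be sketched as below. This is a hypothetical sketch, not code from htslib or htsjdk: the function names `candidate_index_paths` and `find_index` are made up, and the search order (appended extension first, replaced extension second) is an assumption, since the thread has not settled on a preference. It also leaves out the `.bai`-for-CRAM case mentioned in the opening comment.

```python
import os

# Index extension for each data file extension, as discussed in the thread.
INDEX_EXTENSION = {".bam": ".bai", ".cram": ".crai"}


def candidate_index_paths(data_path):
    """Return possible index paths for data_path, in an ASSUMED search order.

    For foo.bam this yields foo.bam.bai, then foo.bai.
    For foo.cram this yields foo.cram.crai, then foo.crai.
    Files with no known index type (e.g. foo.sam) yield an empty list.
    """
    stem, ext = os.path.splitext(data_path)
    index_ext = INDEX_EXTENSION.get(ext.lower())
    if index_ext is None:
        return []
    # Appended form first (foo.bam.bai), replaced form second (foo.bai).
    return [data_path + index_ext, stem + index_ext]


def find_index(data_path, exists=os.path.exists):
    """Return the first candidate index path that exists, or None.

    The exists callable is injectable so the lookup can be tried
    without touching the file system.
    """
    for candidate in candidate_index_paths(data_path):
        if exists(candidate):
            return candidate
    return None
```

With this ordering, a directory containing both foo.bam.bai and foo.bai resolves to foo.bam.bai; an implementation that searched in the opposite order would pick foo.bai instead, which is exactly the disagreement the opening comment describes.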
