Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read group platform unnecessarily strict #171

Closed
drtconway opened this issue May 27, 2023 · 2 comments
Closed

Read group platform unnecessarily strict #171

drtconway opened this issue May 27, 2023 · 2 comments
Labels
duplicate This issue or pull request already exists sam

Comments

@drtconway
Copy link

G'day.

Awesome library, thank you! I'm an experienced programmer, but new to Rust, so I apologise if the following is because of my rust-naivity.

The bam files from our pipeline contain read group headers like the following:

@RG	ID:<deleted>	LB:<deleted>	PL:Illumina	PU:<deleted>	SM:<deleted>

This leads to a parse error claiming that the platform is invalid.

Looking at the code, it looks like it wants "Illumina" to be all upper or all lower. This feels unnecessarily restrictive. Can I suggest two possible fixes:

  1. case fold the platform before matching
  2. introduce an "other" alternative to the platform enum and use it when the platform is well-formed but unknown.

These are not mutually exclusive, of course.

I am happy to do a pull request if you think these are a good idea.

Tom.

@drtconway
Copy link
Author

Stupidly, I didn't notice #111 before posting.

I thought I knew the spec, but obviously had missed that detail.

Still, it would maybe good to have a "tolerant" mode that accepts well-formed input even if it is not compliant, even if the output functions forbid the production of non-compliant output.

@zaeleus zaeleus added sam duplicate This issue or pull request already exists labels May 29, 2023
@zaeleus
Copy link
Owner

zaeleus commented May 29, 2023

Hi @drtconway,

As you saw in #111, the SAM specification has an explicit list of valid read group platform values, which must be in uppercase. After the discussion in samtools/hts-specs#679, one possible resolution was to allow this field to be case-insensitive. This is drafted in samtools/hts-specs#684; however, it shows that the ruling changed from case-insensitive to remaining uppercase and allowing lowercase. This is what noodles currently, preemptively implements. As such, this still doesn't help your case when the platform value is mixed case (disallowed in ecc0037).

A tolerant (aka lenient or silent) mode is pragmatic, but it's detrimental to interoperability. It is an abuse of the file format and allows out-of-spec files to be produced.

I would encourage you to add your feedback to samtools/hts-specs#679. The spec maintainers have yet to formally address this issue. It is very much something that needs to be fixed at the specification level, not in each implementation.

I'm closing this in favor of tracking it in #111, but feel free to continue this discussion in either.

@zaeleus zaeleus closed this as not planned Won't fix, can't repro, duplicate, stale May 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists sam
Projects
None yet
Development

No branches or pull requests

2 participants