Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not read contig and filter header lines on VCF ingest #189

Merged
merged 2 commits into from
Apr 14, 2020

Conversation

karenfeng
Copy link
Collaborator

What changes are proposed in this pull request?

With #173, VCF ingest now fails when merging files with contig lines that are missing the length field. This is resolved in newer versions of the HTSJDK as of samtools/htsjdk#1418.

As we don't need contig and filter header lines for schema inference during VCF ingest, we can skip reading them entirely. This PR adds an option to skip reading non-schema header lines while reading unique header lines.

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

Copy link
Contributor

@henrydavidge henrydavidge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, feel free to merge after addressing

headers.flatMap { header =>
val infoHeaderLines = header.getInfoHeaderLines.asScala
val formatHeaderLines = header.getFormatHeaderLines.asScala
val contigHeaderLines = header.getContigLines.asScala
val filterHeaderLines = header.getFilterLines.asScala
val contigHeaderLines =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you consolidate into a single if else? Generally it's less error-prone to check flags in as few places as possible.

@@ -103,19 +103,23 @@ object VCFHeaderUtils extends GlowLogging {
}

/**
* Find the unique header lines from an RDD of VCF headers.
* Find the unique desired header lines from an RDD of VCF headers.
* If lines of the same class found with the same ID, we pick one unless they are incompatible.
* If there are incompatible lines, [[IllegalArgumentException]] is thrown.
* Incompatible lines are:
* - FORMAT or INFO lines with the same ID but different types or counts
* - contig lines with the same ID but different lengths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a brief comment about the meaning of getNonSchemaHeaderLines.

@codecov
Copy link

codecov bot commented Apr 14, 2020

Codecov Report

Merging #189 into master will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #189      +/-   ##
==========================================
- Coverage   93.67%   93.64%   -0.03%     
==========================================
  Files          86       86              
  Lines        4077     4077              
  Branches      379      365      -14     
==========================================
- Hits         3819     3818       -1     
- Misses        258      259       +1     
Impacted Files Coverage Δ
.../main/scala/io/projectglow/vcf/VCFFileFormat.scala 97.80% <100.00%> (-0.55%) ⬇️
...main/scala/io/projectglow/vcf/VCFHeaderUtils.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28b8068...225a03b. Read the comment docs.

Signed-off-by: Karen Feng <[email protected]>
@karenfeng karenfeng merged commit 505cbf7 into projectglow:master Apr 14, 2020
@karenfeng karenfeng deleted the read-contig-lines-on-demand branch April 14, 2020 18:46
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
…#189)

* Do not read contig and filter header lines on VCF ingest

Signed-off-by: Karen Feng <[email protected]>

* Cleanup

Signed-off-by: Karen Feng <[email protected]>
Signed-off-by: Henry Davidge <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants