Skip to content

Conversation

@henrydavidge
Copy link
Contributor

What changes are proposed in this pull request?

Previously, if a BGEN file contained sampleIds, they were always included in the emitted DataFrame. If a user didn't want them (common when dealing with large cohort sizes), they had to manually drop them.

Now, you can provide the same includeSampleIds option used by the VCF and Plink datasources. If it is false, sample ids are not included in the output.

cc @emaxwell

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

@codecov
Copy link

codecov bot commented Nov 5, 2019

Codecov Report

Merging #59 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage    94.1%   94.11%   +<.01%     
==========================================
  Files          73       74       +1     
  Lines        3560     3566       +6     
  Branches      331      326       -5     
==========================================
+ Hits         3350     3356       +6     
  Misses        210      210
Impacted Files Coverage Δ
.../main/scala/io/projectglow/vcf/VCFFileFormat.scala 97.66% <100%> (-0.06%) ⬇️
...ain/scala/io/projectglow/bgen/BgenFileFormat.scala 98.73% <100%> (-0.08%) ⬇️
...scala/io/projectglow/bgen/BgenSchemaInferrer.scala 100% <100%> (ø) ⬆️
...n/scala/io/projectglow/plink/PlinkFileFormat.scala 100% <100%> (ø) ⬆️
...cala/io/projectglow/common/datasourceOptions.scala 100% <100%> (ø)
...rojectglow/transformers/pipe/PipeTransformer.scala 96.61% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2637b4e...a649844. Read the comment docs.

Copy link
Collaborator

@kianfar77 kianfar77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion. If you disagree let me know and I approve. The rest looks good.

import io.projectglow.common.logging.{HlsMetricDefinitions, HlsTagDefinitions, HlsTagValues, HlsUsageLogging}
import io.projectglow.common.{GlowLogging, WithUtils}
import io.projectglow.sql.util.{ComDatabricksDataSource, SerializableConfiguration}
import io.projectglow.vcf.VCFOption
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to gather all bgen options (including this one) in a BGENOption object like what we did for VCFOption. It will be cleaner and creates separation of VCF and BGEN. Just a thought. Let me know if you disagree and I will approve.

Signed-off-by: Henry D <[email protected]>
Copy link
Collaborator

@kianfar77 kianfar77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

@henrydavidge henrydavidge merged commit b099251 into projectglow:master Nov 6, 2019
@henrydavidge henrydavidge mentioned this pull request Nov 6, 2019
3 tasks
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
henrydavidge added a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* Add BGEN option to explicitly not include sampleIds

Signed-off-by: Henry D <[email protected]>

* organize datasource options

Signed-off-by: Henry D <[email protected]>

* add header

Signed-off-by: Henry D <[email protected]>

Signed-off-by: Henry Davidge <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants