-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rlen calculation for vcf 4.4, 4.5 #1897
base: develop
Are you sure you want to change the base?
Conversation
875ebd2
to
0edf6ff
Compare
vcf.c
Outdated
{ | ||
uint8_t *f = (uint8_t*)v->shared.s, *t = NULL, *e = (uint8_t*)v->shared.s + v->shared.l; | ||
int size, type, id, lenid, endid, svlenid, i, bad; | ||
char **alleles = calloc(1, sizeof(char*) * v->n_allele); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allocation needs to be checked. It could also possibly be made smaller as we only need the alleles for the length of the first, and the presence or absence of <INS>
, which could be stored in rather less space using a bitmap.
It would also be useful to look for <*>/<NON_REF>
, see later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use kbitset to track the <ins> alleles and removed the allocations.
vcf.c
Outdated
if (v->unpacked & BCF_UN_INFO || v->d.shared_dirty & BCF1_DIRTY_INF) { | ||
endinfo = bcf_get_info(h, v, "END"); | ||
svleninfo = version >= VCF44 ? bcf_get_info(h, v, "SVLEN") : NULL; | ||
} else if (f < e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part could be skipped if both endid
and svlenid
are less than zero (so not in the header) as in that case neither will match any of the INFO records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated.
vcf.c
Outdated
for (i = 0; i < v->n_fmt; ++i) { | ||
id = bcf_dec_typed_int1(f, &t); | ||
if (id == lenid) { | ||
if (!(lenfmt = malloc(sizeof(bcf_fmt_t)))) | ||
goto fail; | ||
free_len = 1; | ||
t = bcf_unpack_fmt_core1(f, v->n_sample, lenfmt); | ||
break; //that's all needed | ||
} else { | ||
f = t; | ||
size = bcf_dec_size(f, &t, &type); | ||
t += size * v->n_sample << bcf_type_shift[type]; | ||
} | ||
f = t; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly this could be skipped if lenid < 0
, and also if no <*>
or <NON_REF>
alleles have been seen. This would be a big win for records with lots of genotype data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated.
vcf.c
Outdated
break; | ||
} | ||
//expects only SV will have valid svlen and rest have '.' | ||
if (!strcmp(alleles[i+1], "<INS>")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to do this even if we don't have svleninfo
, as it's possible that we could have an <INS>
without a SVLEN INFO
to go with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
here the check is to avoid svlen value for <ins> rather than setting / using ref.len.
ref. len is anyway in use in rlen calculation later. so this check will be changed to strcmp(<ins>,..) and still kept inside the svlen processing.
Also the strcmp check changed to kbitset check.
vcf.c
Outdated
for (i = 0; i < v->n_info; ++i) { | ||
id = bcf_dec_typed_int1(f, &t); | ||
if (id == endid) { //END | ||
if (!(endinfo = malloc(sizeof(bcf_info_t)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You only ever need one of these, so could it be allocated on the stack, along with a flag to say that it's in use?
Similarly for svleninfo
and lenfmt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated to use one in stack, uses' the existing pointer instead of the flag.
rlen set based on version, position, end, svlen and format len.
depends on PR #1861 (rebased).
recalculates rlen on END and SVLEN info field update, LEN format field update and allele updates.
test updated for validation of rlen calculation.
4.3 uses END and reference length
4.4 uses END, SVLEN and reference length
4.5 uses SVLEN, LEN, reference length
When SVLEN/LEN is not present, it is inferred from END. It is made irrespective of version and no. of alleles, to make it simple and it seems OK.
Cross check of SVLEN with allele is made only for INS, as it is expected to be '.', resulting in 0, for non - SV alleles.
Note:
Version change may necessitate rlen recalculation and needs to trigger it. It is not handled in this PR.
Until it is made, we may have to save as vcf and reload to have rlen calculated appropriately.
Fixes #1820