Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rlen calculation for vcf 4.4, 4.5 #1897

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Mar 18, 2025

rlen set based on version, position, end, svlen and format len.
depends on PR #1861 (rebased).

recalculates rlen on END and SVLEN info field update, LEN format field update and allele updates.
test updated for validation of rlen calculation.
4.3 uses END and reference length
4.4 uses END, SVLEN and reference length
4.5 uses SVLEN, LEN, reference length
When SVLEN/LEN is not present, it is inferred from END. It is made irrespective of version and no. of alleles, to make it simple and it seems OK.
Cross check of SVLEN with allele is made only for INS, as it is expected to be '.', resulting in 0, for non - SV alleles.

Note:
Version change may necessitate rlen recalculation and needs to trigger it. It is not handled in this PR.
Until it is made, we may have to save as vcf and reload to have rlen calculated appropriately.

Fixes #1820

@vasudeva8 vasudeva8 force-pushed the rlen branch 2 times, most recently from 875ebd2 to 0edf6ff Compare March 19, 2025 12:34
@daviesrob daviesrob self-assigned this Mar 20, 2025
vcf.c Outdated
{
uint8_t *f = (uint8_t*)v->shared.s, *t = NULL, *e = (uint8_t*)v->shared.s + v->shared.l;
int size, type, id, lenid, endid, svlenid, i, bad;
char **alleles = calloc(1, sizeof(char*) * v->n_allele);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocation needs to be checked. It could also possibly be made smaller as we only need the alleles for the length of the first, and the presence or absence of <INS>, which could be stored in rather less space using a bitmap.

It would also be useful to look for <*>/<NON_REF>, see later.

Copy link
Contributor Author

@vasudeva8 vasudeva8 Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use kbitset to track the <ins> alleles and removed the allocations.

vcf.c Outdated
if (v->unpacked & BCF_UN_INFO || v->d.shared_dirty & BCF1_DIRTY_INF) {
endinfo = bcf_get_info(h, v, "END");
svleninfo = version >= VCF44 ? bcf_get_info(h, v, "SVLEN") : NULL;
} else if (f < e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part could be skipped if both endid and svlenid are less than zero (so not in the header) as in that case neither will match any of the INFO records.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated.

vcf.c Outdated
Comment on lines 6234 to 6248
for (i = 0; i < v->n_fmt; ++i) {
id = bcf_dec_typed_int1(f, &t);
if (id == lenid) {
if (!(lenfmt = malloc(sizeof(bcf_fmt_t))))
goto fail;
free_len = 1;
t = bcf_unpack_fmt_core1(f, v->n_sample, lenfmt);
break; //that's all needed
} else {
f = t;
size = bcf_dec_size(f, &t, &type);
t += size * v->n_sample << bcf_type_shift[type];
}
f = t;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly this could be skipped if lenid < 0, and also if no <*> or <NON_REF> alleles have been seen. This would be a big win for records with lots of genotype data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated.

vcf.c Outdated
break;
}
//expects only SV will have valid svlen and rest have '.'
if (!strcmp(alleles[i+1], "<INS>")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to do this even if we don't have svleninfo, as it's possible that we could have an <INS> without a SVLEN INFO to go with it.

Copy link
Contributor Author

@vasudeva8 vasudeva8 Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.
here the check is to avoid svlen value for <ins> rather than setting / using ref.len.
ref. len is anyway in use in rlen calculation later. so this check will be changed to strcmp(<ins>,..) and still kept inside the svlen processing.
Also the strcmp check changed to kbitset check.

vcf.c Outdated
for (i = 0; i < v->n_info; ++i) {
id = bcf_dec_typed_int1(f, &t);
if (id == endid) { //END
if (!(endinfo = malloc(sizeof(bcf_info_t))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only ever need one of these, so could it be allocated on the stack, along with a flag to say that it's in use?

Similarly for svleninfo and lenfmt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated to use one in stack, uses' the existing pointer instead of the flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VCF 4.4/4.5 specs: rlen should be computed rather than populated from END INFO field
2 participants