Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement a new validate command #220

Open
wants to merge 49 commits into
base: main
Choose a base branch
from
Open

Conversation

Ayimany
Copy link
Collaborator

@Ayimany Ayimany commented Jul 27, 2023

Resolves #47

Overview

This branch introduces the functionality of a new sub-command meant to validate the structure of .hap files.

Usage and Documentation

It requires no new dependencies.

It can be invoked through:
haptools validate [--sorted/--not-sorted] [--genotypes <filename.pgen>] [--verbosity] <filename.hap>

Implementation Details

The code in this implementation is concentrated in the haptools/val_hapfile.py module.

The classes and functions that make up this module are the following:

HapFileIO (Class)

This class consists of a set of methods made to validate the existence and readability of the provided .hap file. It also cleans up the file prior to reading it's content.

It requires a filename and an optional logger. The filename should be a .hap file.

The main method that should be used to verify if the file exists is HapFileIO::validate_existence. It will check for the following:

  • The file exists
  • The file is a regular file
  • The user has sufficient permissions to read the file

click arguments already check for existence as well.

The method that should be used to filter and read the content within the .hap file is HapFileIO::lines. It takes in a filename as a Path and a boolean sorted to determine if the file should be sorted. It will:

  • Remove empty lines
  • Strip lines which have content of their extra white space
  • Create a Line object per line
  • Attach a line number to each line
  • Remove meta information lines which are not at the header (if unsorted)
  • Sort lines by separating headers and data lines

Sorting actually occurs wether sorted is True or not. What sorted does is remove unnecessary lines based on their positions which would otherwise be necessary if sorted

Line (Class)

Stores information about a line, such as it's number and content for future use.

HapFileValidator (Class)

The validator will have to use a HapFileIO as its main source for reading the file's content. This is done through a method and not the constructor. (Which only accepts a logger)

To load the data into the validator, use HapFileValidator::extract_and_store_content which takes in the HapFileIO to be used and and an optional boolean to determine whether the file should be sorted or not.

All of the following methods in the class test for different aspects of the .hap file.

validate_version_declarations
Determines if the version declaration is present, repeated or invalid

validate_column_additions
Determines if the extra column declarations are correctly formatted and well-formed. If so, they are added to the list of registered extra columns

validate_columns_fulfill_minreqs
Validates if all columns fulfill the minimum requirements

validate_haplotypes
Validates the haplotype row format

validate_repeats
Validates the repeat row format

validate_variants
Validates the variant row format

store_ids
Stores the IDs of each haplotype, repeat and variant for future use. Should be called before any ID validation methods

validate_variant_ids
Validates the ID presence of each variant. Each needs to be unique per haplotype and not collide with chromosome IDs

validate_extra_fields
Makes sure that the added extra fields conform to their addition signature

reorder_extra_fields
Parses the order[H|R|V] lines and reorders the extra fields if they are valid

compare_haps_to_pvar
Compares the variants in the .hap file to those in the .pvar file

is_hapfile_valid (Function)

Performs all of the possible checks available in the HapFileValidator class. Returns a boolean which is True when there are no errors or warnings

Tests

Only one case hasn't been fully tested due to OS limitations and the way they handle their file permissions. I am talking about validating whether the user has enough permissions to read the .hap file.


test_generated_haplotypes
Tests the dummy .hap generated by the haptools test suite

test_with_empty_lines
Tests a .hap with empty lines

test_with_out_of_header_metas_sorted
Test a sorted .hap with meta lines out of the header

test_with_out_of_header_metas_unsorted
Test an unsorted .hap with meta lines out of the header

test_with_10_extras_reordered
Tests a .hap file with 10 extra columns

test_with_unexistent_reorders
Tests a .hap with an order[H|R|V] which mentions a non-existent extra column

test_with_unexistent_fields
Tests a .hap with a data line that is not an H, R or V

test_with_inadequate_version
Tests a .hap with an incorrectly formatted version

test_with_no_version
Tests a .hap with no present version

test_with_multiple_versions
Tests a .hap with several versions present

test_with_inadequate_version_columns
Tests a .hap with a version column of only 2 fields

test_with_invalid_column_addition_column_count
Tests a .hap with an extra column declaration of invalid column count

test_with_invalid_column_addition_types
Tests a .hap with a column addition for a type which is not H, R or V

test_with_invalid_column_addition_data_types
Tests a .hap with a column addition of unrecognized data type (not s, d or .nf)

test_with_insufficient_columns
Tests a .hap with insufficient mandatory columns

test_with_inconvertible_starts
Tests a .hap with start positions that can't be converted to integers

test_with_inconvertible_ends
Tests a .hap with end positions that can't be converted to integers

test_with_inconvertible_starts_var
Tests a .hap with start positions that can't be converted to integers in variants

test_with_inconvertible_ends_var
Tests a .hap with end positions that can't be converted to integers in variants

test_valhap_with_start_after_end
Tests a .hap with the start position placed after the end position

test_is_directory
Tests a validation command with a filename that points to a directory

test_with_variant_id_of_chromosome
Tests a .hap with a variant whose ID is the same as a chromosome ID

test_with_hrid_of_chromosome
Tests a .hap with a haplotype or repeat with the same ID as a chromosome

test_with_unexistent_col_in_order
Tests a .hap with an order[H|R|V] field that references a non-existent extra column name

test_with_unassociated_haplotype
Tests a .hap with a haplotype that does not have at least one matching repeat

test_with_unrecognizable_allele
Tests a .hap with a variant whose allele is not G, C, T or A

test_with_duplicate_ids
Tests a .hap with duplicate IDs for H and R fields

test_with_duplicate_vids_per_haplotype
Tests a .hap with duplicate IDs for variants with the same haplotype association

test_with_excol_of_wrong_type
Tests a .hap with a data line which contains an extra column of d data type but receives s

test_with_multiple_order_defs
Tests a .hap with multiple order[H|R|V] of the same type

test_with_insufficient_excols_in_reorder
Tests a .hap with an order[H|R|V] that does not reference all extra columns

test_with_variant_inexistent_haplotype_id
Tests a .hap with with a variant that references a non-existent haplotype

test_with_missing_variant_in_pvar
Tests a .hap along with a .pvar file which is missing an ID present in the .hap

test_unreadable_hapfile
Passes a non-existent file to the validator


Future work

It would be wise to document the code further on.

Looking towards developing optimizations for this command would be of great help too although we should evaluate how frequently this command is to be used and how big the input files usually are in order to determine the severity of this issue.

Checklist

@Ayimany Ayimany requested a review from aryarm July 27, 2023 00:58
@Ayimany Ayimany self-assigned this Jul 27, 2023
@Ayimany Ayimany linked an issue Jul 27, 2023 that may be closed by this pull request
33 tasks
@Ayimany Ayimany changed the title Implementation of the validate-hapfile command feat: Implementation of the validate-hapfile command Jul 27, 2023
@aryarm
Copy link
Member

aryarm commented Jul 27, 2023

mad props, @Ayimany ! This is a very well written PR. The code is super clean and easy to follow. I'm excited to try it out and will let you know once I finish reviewing

Thanks again for doing this! This will tremendously help many users of haptools

@aryarm aryarm changed the title feat: Implementation of the validate-hapfile command feat: Implementation of the validate command Jul 31, 2023
@aryarm aryarm changed the title feat: Implementation of the validate command feat: implement a new validate command Aug 2, 2023
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for taking this project on, @Ayimany. Your code is super thorough and well-considered.

I left some suggestions. Most of them are small things related to the logging or the tests.

In addition to the suggested changes, it might also be a good idea to write docstrings for all functions and classes. You can look at the sim_phenotype.py module for examples of how to do this. We follow the conventions of numpydoc outlined here:
https://numpydoc.readthedocs.io/en/latest/format.html#documenting-classes

haptools/val_hapfile.py Outdated Show resolved Hide resolved
haptools/__main__.py Outdated Show resolved Hide resolved
docs/commands/validate.rst Outdated Show resolved Hide resolved
docs/commands/validate.rst Outdated Show resolved Hide resolved
docs/commands/validate.rst Outdated Show resolved Hide resolved
haptools/validate.py Outdated Show resolved Hide resolved
tests/test_validate.py Outdated Show resolved Hide resolved
haptools/__main__.py Outdated Show resolved Hide resolved
self.errc += 1

for i in range(
HapFileValidator.KEY_HAPLOTYPE, HapFileValidator.KEY_VARIANT + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add a global variable that defines the line types (as I suggested in the comment above this one), can you also use the length of the variable instead of KEY_VARIANT here? That way, it'll be flexible if we add more line types.

self, var_ids: list[str], underscores_to_semicolons: bool = False
):
ids: set[tuple[str, Line]] = set()
for chrom, dt in self.vrids.items():
Copy link
Member

@aryarm aryarm Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to tell you! In addition to checking whether 'Variant' IDs are in the PVAR file, we should also check that 'Repeat' IDs are in there.

We should probably create another parameter (called --repeat-pvar) to allow the user to specify a PVAR file for repeats

Also, we might need to verify that the files aren't file descriptors? Just in case someone is trying to use process substitution?

docs/commands/validate.rst Outdated Show resolved Hide resolved
@aryarm aryarm assigned aryarm and unassigned Ayimany Oct 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: a validate subcommand to check whether a .hap file is valid
2 participants