Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why are Genotype.getAttributeAsX methods deprecated? #866

Open
lbergelson opened this issue Apr 27, 2017 · 9 comments
Open

Why are Genotype.getAttributeAsX methods deprecated? #866

lbergelson opened this issue Apr 27, 2017 · 9 comments

Comments

@lbergelson
Copy link
Member

There are a number of deprecated methods in Genotype that seem useful. There's no explanation why they're deprecated or what the replacement is.

ex:

    @Deprecated
    public int getAttributeAsInt(String key, int defaultValue) {
        Object x = getExtendedAttribute(key);
        if ( x == null || x == VCFConstants.MISSING_VALUE_v4 ) return defaultValue;
        if ( x instanceof Integer ) return (Integer)x;
        return Integer.valueOf((String)x); // throws an exception if this isn't a string
    }

Is there something wrong with these methods? Is there a replacement somewhere? There are analogous methods in VariantContext that are not deprecated. They were deprecated before being moved into htsjdk so the history is cut off. Does anyone know?

They seem like necessary functionality, maybe we should dedeprecate them? (and possibly refactor so there isn't duplicated code between them and the same methods in CommonInfo

@yfarjoun
Copy link
Contributor

yfarjoun commented Apr 27, 2017 via email

@magicDGS
Copy link
Member

I was wondering the same. What'as about having a CommonInfo field in the Genotype abstract class instead of the map of extended attributes in the FastGenotype? This will allow to use exactly the same methods as in VariantContext...

@tfenne
Copy link
Member

tfenne commented Apr 28, 2017

@yfarjoun I have no idea. I think that in general it would be really nice to make both Genotype and VariantContext a bit more type-safe, whether it's by un-deprecating these methods, or removing these methods and putting something similar in place. I still struggle when dealing with both Genotype and VariantContext to know what I'm going to get back, and have a lot of defensive code around this. E.g. when an attribute is an Int per alt allele I find myself writing defensive code to check for:

String("5")
String[]{"5"}
Integer(5)
Integer[5]

Since there's usually sufficient information in the header to know the types of things a-priori, I'd love to see us refactor to a model where for declared INFO/FORMAT attributes we enforce the correct type during set operations, and type covert to the correct type on parsing. I suppose we'd still need to handle attributes that aren't declared in the header since it happens too frequently in the wild to ignore, but even those we could try and force type-safe access to.

I'm totally willing to chip in on the effort here - if others are on a similar page perhaps we can use this issue to write up what we'd like the interface to Genotype to be and then work towards it there, and have VariantContext follow suit?

@lbergelson
Copy link
Member Author

lbergelson commented Apr 28, 2017

I would love to make this type safe. We have so much confusing defensive code and it always misses cases anyway resulting in bugs.

I have some thoughts about how this could be cleaned up a bit:

We could define a set of Accessor interfaces which would capture the type information.
Then you would get and set using constant defined Accessors. I seems like we could make it very clean and easy to extend, and it would provide type checking on the way in and the way out, as well as making it easy to store complex types and make extracting them uniform.

ex:

 public  interface AttributeAccessor<T> {
        T convert(Object attribute);  //each accessor would provide a method that does whatever conversion it needs
        String getKey();  // we'd keep the underlying map so we need the keys
    }
    //specializations could fill in the convert method allowing new accessors to be defined as lambdas
    public interface IntAccessor extends AttributeAccessor<Integer> {
        @Override
        default Integer convert(Object a) {
            return Integer.parseInt(a.toString()); //gross accessor that handles string/int/etc badly, we'd want to do something better here if we can't rely on sanitized input, or just a cast here if we can.
        }
    }
   
    //example accessor for MIN_DP, easy to define, put these someplace public like VCFConstants
    private static final IntAccessor MIN_DP = () -> "MIN_DP";
    //Genotype/ VariantContext could implement something like this:
    public <T> T getAttribute(AttributeAccessor<T> accessor){
        final Object attribute = getExtendedAttribute(accessor.getKey());
       if (attribute != null){
           return accessor.convert(attribute);
       } else {
           return null;
       }
    }
    
   //and this could go on the builders
    public <T> void setAttribute(AttributeAccessor<T> accessor, T value){
        attributes.put(accessor.getKey(), value);
    }

access is simple:

Integer minDP = genotype.getAttribute(MIN_DP)

This doesn't solve the important "loading from vcf" issue, but we could at least standardize the defensive code in the accessors. It would be best to fix the problem on the loading side as well, since as you say there's header information available.

@lindenb
Copy link
Contributor

lindenb commented Apr 28, 2017

@lbergelson (related idea) can we also use VCFHeaderLineType to define some String converters ?

VCFHeaderLineType.Float ### (T)->Float.parseFloat(T)

@magicDGS
Copy link
Member

I like the AttributeAccessor idea, but in combination with the VCFHeaderLineType. I propose the following.

  • Add a method to parse an Object into teh VCFHeaderLineType as suggested by @lindenb
  • Add the AttributeAccessor abstraction suggested by @lbergelson
  • Add a AttributeAccessor<T> getAccessor() method to VCFFormatHeaderLine (and the same to VCFInfoHeaderLine for VariantContext). This maybe require to add a type param to the format/info lines...

With this, we will be able to use type-safe and header lines for attributes. What do you think?

@lbergelson
Copy link
Member Author

I'm not sure binding the converter to the VCFHeaderLineType is sufficient, you also need to consider the VCFHeaderLineCount to determine if it's a list or a scalar. I.e. integer 1 is just an Integer, but Integer 2 is an int[].

We have to consider the performance impact / where we want validation failures to occur. It seems likely that it would be more efficient to only perform conversion/type validation, when accessing an attribute, since many operations access only a subset of the attributes.

@magicDGS
Copy link
Member

magicDGS commented May 2, 2017

What I meant is that the VCFHeaderLineType could convert a single-element, and implementations of VCFFormatHeaderLine/VCFInfoHeaderLine will be the ones which creates an array if necessary...

@magicDGS
Copy link
Member

magicDGS commented May 2, 2017

In addition, I think that this will be also another interesting point supporting the non-backwards API changes (#520), including the addition of a new variant and "genotype" interface (I dislike the Genotype name, because the VCF it is not always representing genotypes unless one of the attributes is used...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants