Issue in grouped attribute search #49

Open
anantjiit2026 opened this issue Feb 26, 2025 · 5 comments

Comments

@anantjiit2026

I was trying the Search API and ran into this error.
Context for the images:
I made two queries. The first has extra constraints, yet its result count came out higher. I then ran the same search in the query builder, which gave 133 results, the same as the second query.

Image

Image

I am using a conda environment with Python 3.13.2 on Ubuntu 24.04.1 LTS x86_64. I had just created a new conda environment and installed rcsb-api when I got this.

@piehld
Collaborator

piehld commented Feb 26, 2025

Thank you for reporting @anantjiit2026! We apologize for the inconvenience. This is indeed a weird behavior, and it sort of relates to an idiosyncrasy with our API service as well as the Python package.

For most of our search attributes, this type of subquery grouping should work as expected. However, something a little different happens for the rcsb_binding_affinity object (and a handful of other special "nested" attributes). Using this object usually involves both a .type and a .value attribute, just as you are doing. But our search API expects them to be provided as a single grouped pair, separate from the rest of the grouped attributes. This allows those two attributes to be searched in a coupled manner rather than as separate attributes.

...this may sound confusing, so it's probably easier to show you some examples...

First of all, here is what's happening vs. what's desired:

You'll notice that the only difference between the two is that in the Python-generated query all the "AND" attributes are flattened to the same group level, whereas in the properly formed query the rcsb_binding_affinity attributes are nested in their own dedicated group. It is this dedicated grouping of those two attributes (in the latter case) that allows them to be queried in a coupled manner. Otherwise, when they are flattened into the same group as the other attributes (as in the former case), the search API looks for all entries that have an rcsb_binding_affinity.type of EC50 (with any .value) and an rcsb_binding_affinity.value of 2 (with any .type). It's definitely a little strange, I know, but maybe it's a little clearer based on the screenshot of the Advanced Search UI that you shared, since there you can see how those two attributes are actually combined into one line of the selection menu.
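
To make the difference between the two query shapes concrete, here is a minimal sketch that builds both as plain Python dictionaries, with no network calls. The field names (`type`, `service`, `logical_operator`, `nodes`, `parameters`) follow the Search API request format as I understand it, and the helpers `terminal` and `make_group` are hypothetical names used only for illustration; they are not part of rcsb-api.

```python
import json


def terminal(attribute, operator, value=None):
    """Build one terminal (single-attribute) node; helper name is hypothetical."""
    params = {"attribute": attribute, "operator": operator}
    if value is not None:
        params["value"] = value
    return {"type": "terminal", "service": "text", "parameters": params}


def make_group(logical_operator, nodes):
    """Build a group node that combines child nodes; helper name is hypothetical."""
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


affinity_type = terminal("rcsb_binding_affinity.type", "exact_match", "EC50")
affinity_value = terminal("rcsb_binding_affinity.value", "equals", 2.0)
polymer_types = terminal("rcsb_entry_info.selected_polymer_entity_types", "exists")
comp_id = terminal(
    "rcsb_nonpolymer_entity_container_identifiers.nonpolymer_comp_id", "exists"
)

# Flattened: all four attributes are siblings, so .type and .value are matched
# independently of each other.
flattened = make_group("and", [affinity_type, affinity_value, polymer_types, comp_id])

# Nested: .type and .value sit in their own group, so they are matched as a pair.
nested = make_group(
    "and",
    [make_group("and", [affinity_type, affinity_value]), polymer_types, comp_id],
)

print(json.dumps(nested, indent=2))
```

The only structural difference is the extra inner group around the two rcsb_binding_affinity nodes in `nested`.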

All that said, you've highlighted a current limitation of our Python package, as we are currently flattening any groups when possible. So, we are currently looking into implementing a fix for this and will update you when it's available (hopefully in the next week). In the meantime, if you have any follow-up questions about this, please let us know.

@anantjiit2026
Author

Thank you for the detailed explanation. I have two more questions.

Image

  1. Should this query have worked correctly?

Also, I was working with the attributes from the advanced query builder (https://www.rcsb.org/docs/search-and-browse/advanced-search/attribute-details) and translating them to Search API attributes using the "Attribute" entry. With nested attributes, this required a lookup for the value of the nested attribute.

Image
Here I have to look up the value for rcsb_binding_affinity.type.

  2. Is there a preexisting way to do this in code? I can post this second question in another channel, as it may be unrelated to the current discussion.

@piehld
Collaborator

piehld commented Feb 28, 2025

  1. Should this query have worked correctly?

The query in the screenshot you shared should ideally have worked, but due to the current behavior of the Python package, it ends up being automatically flattened into a single group, which is why you are still getting 320 results instead of 133. We are currently working on a fix for that.

Hopefully we should have a fix for the above within the next week or so. But in the meantime, there is a way to force this nesting, though it is much less intuitive:

from rcsbapi.search.search_query import Group
from rcsbapi.search import AttributeQuery

q1 = AttributeQuery("rcsb_binding_affinity.type", "exact_match", "EC50")
q2 = AttributeQuery("rcsb_binding_affinity.value", "equals", 2.0)
q3 = AttributeQuery("rcsb_entry_info.selected_polymer_entity_types", "exists")
q4 = AttributeQuery("rcsb_nonpolymer_entity_container_identifiers.nonpolymer_comp_id", "exists")

q = Group("and", (q1 & q2, *[q3, q4]))
results = list(q())

  2. Is there a preexisting way to do this in code? I can post this second question in another channel, as it may be unrelated to the current discussion.

Nice job noticing the difference in how those nested-type attributes are identified! We don't have existing code to do this for you yet, but we plan to come up with a way to do so in the future. I am curious about what code you're using to do that, though, if you don't mind sharing (here is perfectly fine as long as you're OK with that).

Perhaps related though, we do have code to help search through attributes, e.g.:

from rcsbapi.search import search_attributes as attrs

matching_attrs = attrs.search("rcsb_binding_affinity")

# print out all details for each matching attribute
for attr in matching_attrs:
    print(attr)

# print out just the attribute names
for attr in matching_attrs:
    print(attr.attribute)

Also, you can view our schema directly at: https://search.rcsb.org/rcsbsearch/v2/metadata/schema. There, you can identify which attributes will need to be grouped in pairs based on whether "rcsb_nested_indexing": true, e.g.:

Image

But as I mentioned, we are looking into ways to make this easier and more automated for users (or, at a minimum, to provide a warning message about the need to group those attributes together). These improvements will likely take a bit longer than a week or two though, just to set expectations.
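
As a rough illustration of scanning the schema for that flag, the sketch below walks a JSON-schema-like dictionary and collects the dotted paths of objects marked with "rcsb_nested_indexing": true. The function name `find_nested_attributes` and the `sample_schema` fragment are hypothetical, and the assumption that the flag sits on the attribute's own node (with array element types under "items") may not match the real schema layout exactly.

```python
def find_nested_attributes(schema, prefix=""):
    """Return dotted paths of schema nodes flagged with rcsb_nested_indexing."""
    found = []
    for name, node in schema.get("properties", {}).items():
        if not isinstance(node, dict):
            continue
        path = f"{prefix}.{name}" if prefix else name
        # Arrays describe their element type under "items" (assumed layout).
        target = node.get("items", node) if node.get("type") == "array" else node
        if not isinstance(target, dict):
            continue
        if node.get("rcsb_nested_indexing") or target.get("rcsb_nested_indexing"):
            found.append(path)
        # Recurse into child properties to catch flags at deeper levels.
        found.extend(find_nested_attributes(target, path))
    return found


# Hypothetical, heavily trimmed fragment standing in for the real schema.
sample_schema = {
    "properties": {
        "rcsb_binding_affinity": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "type": {"type": "string"},
                    "value": {"type": "number"},
                },
            },
            "rcsb_nested_indexing": True,
        },
        "rcsb_entry_info": {
            "type": "object",
            "properties": {
                "selected_polymer_entity_types": {"type": "string"},
            },
        },
    }
}

print(find_nested_attributes(sample_schema))
```

To try this against the real schema, the dictionary could be loaded from the schema URL above with any JSON client and passed in place of `sample_schema`.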

@anantjiit2026
Author

Thanks for the clarification and the links. I had been looking for something like the schema file.

For the lookup, I ran a search in the advanced query builder and looked at the corresponding Search API query:

Image

Image

Then I made a local lookup table for the values of the nested attributes:

nested_lookup = {
    "Accession Code(s) - UniProt":"UniProt",
    "Accession Code(s) - GenBank":"GenBank",
    "Accession Code(s) - NORINE":"NORINE",
    "Global Quality Score - pLDDT":"pLDDT",
    "Identifier - Pfam Protein Family":"Pfam",
    "Name - Pfam Protein Family":"Pfam",
    ### out of order for checking
    "Count Per Polymer Entity - Modified chemical component":"modified_monomer",
    "Binding Affinity Value - EC50":"EC50",
    "Component Identifier - Investigated Molecule":"SUBJECT_OF_INVESTIGATION",
    "Component Identifier - Has Covalent Linkage":"Has_Covalent_Linkage",
    "Component Identifier - Has No Covalent Linkage":"Has_No_Covalent_Linkage",
    "Component Identifier - Has Metal Coordination":""
}

If my query attribute has a nested attribute, I just add it with the corresponding value from the table:

# `lookup` maps a query builder label to its Search API attribute details
if 'Nested Attribute' in lookup[attr_op_val.attribute]:
    val_nested = nested_lookup[attr_op_val.attribute]
    if q is None:
        q = AttributeQuery(lookup[attr_op_val.attribute]['Nested Attribute'], "exact_match", val_nested)

attr_op_val.attribute is a query builder attribute like Identifier - Pfam Protein Family
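
For readers following along, here is a small self-contained sketch of the same lookup idea using plain tuples instead of AttributeQuery objects, so it runs without rcsb-api installed. The contents of `lookup` (including the "Attribute" key and the "Entry ID" label) and the function name `to_conditions` are assumptions made for illustration; the actual table is not shown in this thread.

```python
# Hypothetical stand-in for the `lookup` table:
# query builder label -> Search API attribute details.
lookup = {
    "Binding Affinity Value - EC50": {
        "Attribute": "rcsb_binding_affinity.value",
        "Nested Attribute": "rcsb_binding_affinity.type",
    },
    "Entry ID": {"Attribute": "rcsb_entry_container_identifiers.entry_id"},
}

# Value to pair with the nested attribute, keyed by the same label.
nested_lookup = {"Binding Affinity Value - EC50": "EC50"}


def to_conditions(label, operator, value):
    """Return (attribute, operator, value) tuples for one query builder label.

    Labels with a nested attribute yield two tuples that belong together
    in their own group; all other labels yield a single tuple.
    """
    entry = lookup[label]
    conditions = [(entry["Attribute"], operator, value)]
    nested_attr = entry.get("Nested Attribute")
    if nested_attr is not None:
        conditions.insert(0, (nested_attr, "exact_match", nested_lookup[label]))
    return conditions


print(to_conditions("Binding Affinity Value - EC50", "equals", 2.0))
print(to_conditions("Entry ID", "exact_match", "4HHB"))
```

Whenever two tuples come back, the corresponding pair of queries is the part that needs to stay in its own nested group.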

@piehld
Collaborator

piehld commented Mar 12, 2025

Hi @anantjiit2026, thank you for sharing your approach with us. We're glad that strategy works for you for dealing with these edge cases for now. We do hope to implement a more automated "behind-the-scenes" solution for this in the future, but that is looking like a more involved task than we initially hoped.

Nonetheless, we are excited to let you know that we have introduced a much simpler and more intuitive solution for forcing the nested grouping of those special kinds of attributes (addressed in #51, thanks to @ivana-truong!). Now, to handle the particular case you originally mentioned in your issue, you can simply do:

from rcsbapi.search import group
q = group(q1 & q2) & q3 & q4

So, in full, your code would be:

from rcsbapi.search import group
from rcsbapi.search import AttributeQuery

q1 = AttributeQuery("rcsb_binding_affinity.type", "exact_match", "EC50")
q2 = AttributeQuery("rcsb_binding_affinity.value", "equals", 2.0)
q3 = AttributeQuery("rcsb_entry_info.selected_polymer_entity_types", "exists")
q4 = AttributeQuery("rcsb_nonpolymer_entity_container_identifiers.nonpolymer_comp_id", "exists")

q = group(q1 & q2) & q3 & q4
results = list(q())
print("count", len(results))

Please note that before you can do this, you will need to upgrade rcsb-api to the latest version (v1.1.0):

pip install rcsb-api --upgrade

We hope this offers a helpful mechanism to address your issue. Of course, we do hope to work on automating that grouping based on the specific types of attributes requested by the user (so that the user doesn't have to actively think about it and manually group them separately), but that will be a longer-term objective.
