Search API: Retain nesting of queries, Data API: add querying for all structures #50

Merged
merged 14 commits into from
Mar 12, 2025

Conversation

ivana-truong
Collaborator

@ivana-truong ivana-truong commented Mar 1, 2025

Search API

  • Changed the __and__/__or__ methods of SearchQuery and added a group function in response to "Issue in grouped attribute search" #49. A group built with the group function is now preserved while the rest of the query is constructed:
from rcsbapi.search import group

query = group(q1 & q2) & q3 & q4

Data API

  • Added ALL_STRUCTURES to const.py/data.__init__.py. Pass it as the input_ids parameter to query all structures. Currently supported for the entries and chem_comps input_types.
    • IDs are batched into lists of 5,000 and the results are merged into one dictionary object
  • Added progress_bar parameter to .exec. If set to True, a progress bar is shown while the query executes.
  • Added batch_size parameter to .exec. Defaults to 5,000.
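For readers unfamiliar with the batching pattern described above, here is a minimal sketch of splitting a long ID list into batches and merging the per-batch results. It is not the package's implementation: `batches`, `fetch_all`, and `fake_fetch` are hypothetical names, and the response shape (a dictionary mapping a key to a list of records) is an assumption made for illustration.

```python
from typing import Callable, Dict, Iterator, List


def batches(ids: List[str], batch_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fetch_all(
    ids: List[str],
    fetch_batch: Callable[[List[str]], Dict[str, list]],
    batch_size: int = 5000,
) -> Dict[str, list]:
    """Call `fetch_batch` once per batch and merge the returned dictionaries.

    Assumption: every response maps a key to a list of records, so merging
    means concatenating the lists key by key.
    """
    merged: Dict[str, list] = {}
    for batch in batches(ids, batch_size):
        for key, records in fetch_batch(batch).items():
            merged.setdefault(key, []).extend(records)
    return merged


def fake_fetch(batch: List[str]) -> Dict[str, list]:
    """Hypothetical stand-in for a real request: echoes each ID as a record."""
    return {"entries": [{"rcsb_id": item} for item in batch]}


ids = [f"ID{n}" for n in range(12001)]
result = fetch_all(ids, fake_fetch)
print(len(result["entries"]))  # one record per input ID
```

With 12,001 IDs and the default batch size, the stand-in is called three times (5,000, 5,000, and 2,001 IDs) and the caller still receives a single dictionary.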

Misc

  • Updated citation information

Collaborator

@piehld piehld left a comment

Thank you, @ivana-truong! This is looking excellent.

While you're still implementing the workaround for the nested attribute issue based on our conversations, I thought I'd go ahead and leave a couple of comments and suggestions on the functionality for querying all structures.

Collaborator

@piehld piehld left a comment

Thank you @ivana-truong! These are wonderful enhancements, and very nice work on all of them. I just had a couple of suggestions and questions, but overall things look excellent.

Co-authored-by: Dennis Piehl <[email protected]>
@ivana-truong ivana-truong force-pushed the dev-it-tests branch 2 times, most recently from e195430 to 0aa606b Compare March 7, 2025 21:56
@dmyersturnbull
Contributor

dmyersturnbull commented Mar 11, 2025

@ivana-truong @piehld

This is a really nice solution!

I don't know this package well, but would it be possible to flatten immediately before the query is run rather than greedily per &/|? Apologies if I'm not understanding correctly or proposing something you already tried.

If I understand correctly, the problem is that __and__ doesn't know about any additional terms that might come on the right (since & is left-associative). But you need to know all the terms to flatten correctly, which precludes any greedy solution.

As an alternative to group(), you could have & and | build a binary or n-ary parse tree, and only optimize the tree right before the API call. I'm basically proposing moving some of the logic in __and__ and __or__.

Sketch of binary tree solution

# Assumes SearchQuery, TAndOr, and Session are the package's existing names.
# Postponed annotations, so a Query method can return Node:
from __future__ import annotations
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True, slots=True)
class Node(SearchQuery):
    """Implementation as a tree."""
    def exec(self, **kwargs) -> Union[Session, int]:  # arguments elided
        prepped: Node = self._optimize()
        ...  # TODO
    def _optimize(self) -> Node:
        ...  # TODO: Traverse the tree, flatten compatible terms, etc.

@dataclass(frozen=True, slots=True)
class Group(Node):
    operator: TAndOr
    left: Node
    right: Node
    # Omit `keep_nested`.

@dataclass(frozen=True, slots=True)
class Conjunction(Group):
    def _optimize(self) -> Node: ...  # TODO

@dataclass(frozen=True, slots=True)
class Disjunction(Group):
    def _optimize(self) -> Node: ...  # TODO

# Other subclasses

By the way: I really like the use of dataclass. It looks like the subclasses of Terminal plus a few other classes could also be made dataclasses without issue.
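To make the deferred-flattening idea concrete, here is a small self-contained toy, independent of the package. All names (`Node`, `Leaf`, `Branch`, `NAry`, `flatten`) are hypothetical and exist only for illustration. The operators only record a binary tree; adjacent nodes with the same operator are merged in one pass that would run right before the API call.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Base class: & and | only record structure, they never flatten."""

    def __and__(self, other: Node) -> Node:
        return Branch("and", self, other)

    def __or__(self, other: Node) -> Node:
        return Branch("or", self, other)

    def flatten(self) -> Node:
        # Leaves and already-flattened nodes are returned unchanged.
        return self


@dataclass(frozen=True)
class Leaf(Node):
    name: str


@dataclass(frozen=True)
class NAry(Node):
    """Flattened form: one operator applied to any number of children."""

    operator: str
    children: Tuple[Node, ...]


@dataclass(frozen=True)
class Branch(Node):
    """Binary node produced by & or |."""

    operator: str
    left: Node
    right: Node

    def flatten(self) -> Node:
        # Flatten both subtrees first, then absorb children that use
        # the same operator as this node.
        merged: List[Node] = []
        for child in (self.left.flatten(), self.right.flatten()):
            if isinstance(child, NAry) and child.operator == self.operator:
                merged.extend(child.children)
            else:
                merged.append(child)
        return NAry(self.operator, tuple(merged))


a, b, c, d = (Leaf(name) for name in "abcd")
tree = (a & b) & (c | d)   # still a binary tree at this point
flat = tree.flatten()      # one pass, with the whole expression visible
print(flat)
```

Because flattening sees the complete tree, it does not depend on what `__and__` could know at the time each operator was applied.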

@piehld
Collaborator

piehld commented Mar 11, 2025

Thanks for the input and idea @dmyersturnbull! A binary tree would provide a nice data structure for managing this query construction. Will definitely keep this in mind as an option if we encounter other query-building issues in the future.

The specific issue we were addressing here, though, wasn't the package's ability to "flatten correctly". It was flattening exactly as expected given the set of nodes and operators provided (as you phrased it, "greedily"). Rather, the issue was preventing it from flattening when you want a specific set of nodes to stay in the same group, even if that is logically equivalent to the flattened form. This is largely a consequence of relying on () parentheses for grouping instead of a structured type, since parentheses only control the order of operations rather than establishing a group.

So basically this PR provides a way to avoid flattening out a subgroup, so while (q1 & q2) & q3 will still become q1 & q2 & q3, the following will create two groups: group(q1 & q2) & q3.
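The difference can be shown with a toy model. This is a sketch under stated assumptions, not the package's code: `Query`, `Term`, `combine`, and `_operands` are hypothetical, and only the names `group` and `keep_nested` are borrowed from this thread. Here `&` flattens greedily, and `group` marks a subgroup that later operators must leave intact.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Query:
    def __and__(self, other: Query) -> Query:
        return combine("and", self, other)

    def __or__(self, other: Query) -> Query:
        return combine("or", self, other)


@dataclass(frozen=True)
class Term(Query):
    name: str


@dataclass(frozen=True)
class Group(Query):
    operator: str
    nodes: Tuple[Query, ...]
    keep_nested: bool = False  # True means "never merge me into a parent"


def _operands(operator: str, query: Query) -> Tuple[Query, ...]:
    """Return the nodes that `query` contributes to a new group."""
    if (
        isinstance(query, Group)
        and query.operator == operator
        and not query.keep_nested
    ):
        return query.nodes  # greedy flattening of a same-operator group
    return (query,)


def combine(operator: str, left: Query, right: Query) -> Group:
    return Group(operator, _operands(operator, left) + _operands(operator, right))


def group(query: Query) -> Query:
    """Mark a group as protected from flattening; other queries pass through."""
    if isinstance(query, Group):
        return replace(query, keep_nested=True)
    return query


q1, q2, q3 = Term("q1"), Term("q2"), Term("q3")
flattened = (q1 & q2) & q3       # parentheses only affect evaluation order
nested = group(q1 & q2) & q3     # the marked subgroup survives
print(len(flattened.nodes), len(nested.nodes))
```

In this model the parenthesized form ends up as one group of three terms, while the `group` form ends up as a group of two nodes, the first of which is the preserved `q1 & q2` subgroup.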

Collaborator

@piehld piehld left a comment

This looks wonderful @ivana-truong! Thank you again for the excellent work! I just had two minor comments, but once you address those feel free to merge!

@ivana-truong ivana-truong merged commit 9b0d7f2 into staging Mar 12, 2025
6 checks passed

3 participants