Search API: Retain nesting of queries, Data API: add querying for all structures #50

Merged · 14 commits · Mar 12, 2025
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,16 @@
# Changelog
## v1.1.0 (2025-03-12)

- Add `ALL_STRUCTURES` object, allowing Data API queries for all PDB structures and chemical components
- Add `progress_bar` and `batch_size` parameters to the Data API package's `exec` method
- Add `group` function to the Search API package to enforce nested grouping
- Update README with new citation information
- Update search schemas: 1.48.0 -> 1.49.0
- Update data schemas:
- entry schema 9.0.3 -> 9.0.4
- polymer_entity_instance schema 10.0.2 -> 10.0.3
- nonpolymer_entity_instance schema 10.0.0 -> 10.0.1

## v1.0.1 (2025-01-17)

- Add import to `const.py` for compatibility with Python 3.8
7 changes: 7 additions & 0 deletions README.md
@@ -140,6 +140,13 @@ Please cite the ``rcsb-api`` package by URL:

You should also cite the RCSB.org API services this package utilizes:

> Dennis W Piehl, Brinda Vallat, Ivana Truong, Habiba Morsy, Rusham Bhatt,
> Santiago Blaumann, Pratyoy Biswas, Yana Rose, Sebastian Bittrich, Jose M. Duarte,
> Joan Segura, Chunxiao Bi, Douglas Myers-Turnbull, Brian P. Hudson, Christine Zardecki,
> Stephen K. Burley, Rcsb-Api: Python Toolkit for Streamlining Access to RCSB Protein
> Data Bank APIs, Journal of Molecular Biology, 2025.
> DOI: [10.1016/j.jmb.2025.168970](https://doi.org/10.1016/j.jmb.2025.168970)

> Yana Rose, Jose M. Duarte, Robert Lowe, Joan Segura, Chunxiao Bi, Charmi
> Bhikadiya, Li Chen, Alexander S. Rose, Sebastian Bittrich, Stephen K. Burley,
> John D. Westbrook. RCSB Protein Data Bank: Architectural Advances Towards
43 changes: 43 additions & 0 deletions docs/data_api/query_construction.md
@@ -88,6 +88,26 @@ input_ids=["4HHB.A", "4HHB.B"]
input_ids={"instance_ids": ["4HHB.A", "4HHB.B"]}
```

Using a refined list of IDs is generally more efficient and makes results easier to interpret. However, if you would like to request data for all IDs within an `input_type`, you can pass the `ALL_STRUCTURES` variable, which sets `input_ids` to every ID available for the given `input_type` (where supported).

```python
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES

# Using `ALL_STRUCTURES` with `input_type` "entries"
# will use all experimentally determined entry IDs
query = Query(
input_type="entries",
input_ids=ALL_STRUCTURES,
return_data_list=["exptl.method"]
)

# Executing the query with a progress bar
query.exec(progress_bar=True)

print(query.get_response())
```

### return_data_list
These are the data items that you are requesting, referred to as "fields".
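
For instance, a minimal sketch of a simple request (reusing the `exptl.method` field and the `4HHB` entry that appear in the examples above):

```python
from rcsbapi.data import DataQuery as Query

# Request the experimental method for one entry
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl.method"]
)

query.exec()
print(query.get_response())
```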

@@ -154,6 +174,29 @@ print(result_dict)
}
```

### Executing Large Queries
When executing large queries, the package batches the `input_ids` before making requests and merges the responses into a single JSON object. The default batch size is 5,000, but this can be adjusted through the `batch_size` parameter of the `exec` method. To display a progress bar that tracks completed batches, set `progress_bar` to `True`.

```python
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES

query = Query(
input_type="entries",
input_ids=ALL_STRUCTURES,
return_data_list=["exptl.method"]
)

# Executing query with larger batch size
# and progress bar
query.exec(
batch_size=7000,
progress_bar=True
)

print(query.get_response())
```

## Helpful Methods
There are several methods included to make working with query objects easier. These methods can help you refine your queries to request exactly and only what you want, as well as further understand the GraphQL syntax.
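
As one example, a brief sketch using the `get_editor_link()` method (defined in `data_query.py` later in this diff) to inspect the GraphQL generated for a query:

```python
from rcsbapi.data import DataQuery as Query

query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl.method"]
)

# Link that opens the generated GraphQL in the GraphiQL editor,
# useful for learning the underlying query syntax
print(query.get_editor_link())
```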

32 changes: 30 additions & 2 deletions docs/search_api/query_construction.md
@@ -150,6 +150,35 @@ query = q1 & q2
list(query())
```

Some sets of attributes can be grouped so that they are searched together, allowing a more specific search. For example, the attribute `rcsb_chem_comp_related.resource_name` can be set to "DrugBank" or another database and grouped with the attribute `rcsb_chem_comp_related.resource_accession_code`, which searches for an accession code. When grouped, the two attributes are searched together (i.e. the accession code must be associated with the specified database). To identify attributes that can be grouped, check the [schema](http://search.rcsb.org/rcsbsearch/v2/metadata/schema) for attributes with `rcsb_nested_indexing` set to `true`. To specify that two attributes should be searched together, use the `group` function.

```python
from rcsbapi.search import AttributeQuery
from rcsbapi.search import group

q1 = AttributeQuery(
attribute="rcsb_chem_comp_related.resource_name",
operator="exact_match",
value="DrugBank"
)

q2 = AttributeQuery(
attribute="rcsb_chem_comp_related.resource_accession_code",
operator="exact_match",
value="DB01050"
)

q3 = AttributeQuery(
attribute="rcsb_entity_source_organism.scientific_name",
operator="exact_match",
value="Homo sapiens"
)

# Using `group` ensures that `resource_name` and `accession_code` attributes are searched together
query = group(q1 & q2) & q3
list(query())
```

### Sessions
The result of executing a query (either by calling it as a function or using `exec()`) is a
`Session` object. It implements `__iter__`, so it is usually treated as an iterator.
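
For example, a minimal sketch (assuming `query` is a Search API query object constructed as above; iterating the session yields the matching identifiers):

```python
# Iterate over results one identifier at a time
for rcsb_id in query():
    print(rcsb_id)

# Or collect everything at once, as in the earlier examples
results = list(query())
```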
@@ -184,8 +213,7 @@ session.get_query_builder_link()

#### Progress Bar
The `iquery()` `Session` method provides a progress bar indicating the number of API
requests being made. It requires the `tqdm` package be installed to track the
progress of the query interactively.
requests being made.
```python
results = query().iquery()
```
2 changes: 1 addition & 1 deletion rcsbapi/__init__.py
@@ -2,7 +2,7 @@
__author__ = "Dennis Piehl"
__email__ = "[email protected]"
__license__ = "MIT"
__version__ = "1.0.1"
__version__ = "1.1.0"

__path__ = __import__("pkgutil").extend_path(__path__, __name__)

3 changes: 2 additions & 1 deletion rcsbapi/config.py
@@ -18,9 +18,10 @@


class Config:
DATA_API_TIMEOUT: int = 60
API_TIMEOUT: int = 60
SEARCH_API_REQUESTS_PER_SECOND: int = 10
SUPPRESS_AUTOCOMPLETE_WARNING: bool = False
INPUT_ID_LIMIT: int = 5000

def __setattr__(self, name, value):
"""Verify attribute exists when a user tries to set a configuration parameter, and ensure proper typing.
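
Because `Config.__setattr__` validates each assignment, overriding a setting is a plain attribute assignment on the shared `config` instance (imported elsewhere in this diff as `from ..config import config`). A minimal sketch:

```python
from rcsbapi.config import config

# Raise the shared request timeout from the default of 60 seconds
config.API_TIMEOUT = 120

# Threshold above which a warning about large input_id lists is logged
config.INPUT_ID_LIMIT = 10000
```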
5 changes: 5 additions & 0 deletions rcsbapi/const.py
@@ -97,5 +97,10 @@ class Const:
"uniprot": [r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"]
})

INPUT_TYPE_TO_ALL_STRUCTURES_ENDPOINT: MappingProxyType[str, List[str]] = MappingProxyType({
"entries": ["https://data.rcsb.org/rest/v1/holdings/current/entry_ids"],
"chem_comps": ["https://data.rcsb.org/rest/v1/holdings/current/ccd_ids", "https://data.rcsb.org/rest/v1/holdings/current/prd_ids"]
})


const = Const()
25 changes: 24 additions & 1 deletion rcsbapi/data/__init__.py
@@ -1,9 +1,32 @@
"""RCSB PDB Data API"""

from .data_schema import DataSchema

DATA_SCHEMA = DataSchema()

# This is needed because __getattr__ will be called twice on import,
# so ALL_STRUCTURES should be cached to avoid initializing twice
_import_cache: dict = {}


def __getattr__(name: str):
"""Overloading __getattr__ so that when ALL_STRUCTURES is accessed for the first time,
ALL_STRUCTURES object will be built.

Args:
name (str): attribute name
"""
if name == "ALL_STRUCTURES":
if name not in _import_cache:
from .data_query import AllStructures
ALL_STRUCTURES = AllStructures()
_import_cache[name] = ALL_STRUCTURES

return _import_cache[name] # Return cached instance

# keep functionality of original __getattr__
raise AttributeError(f"Module {repr(__name__)} has no attribute {repr(name)}")


from .data_query import DataQuery # noqa:E402

__all__ = ["DataQuery", "DataSchema"]
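
The module-level `__getattr__` above follows the lazy-attribute pattern from PEP 562: the attribute is built on first access rather than at import time, so `import rcsbapi.data` stays fast and no network request is made until `ALL_STRUCTURES` is actually used. A stripped-down sketch of the same idea (a hypothetical standalone module, not part of this package):

```python
# lazy_example.py -- hypothetical module illustrating PEP 562 lazy attributes
_cache: dict = {}


def _expensive_build() -> list:
    # Stand-in for slow work, e.g. fetching ID lists over the network
    return ["4HHB"]


def __getattr__(name: str):
    # Called only when `name` is not found by normal module lookup,
    # so the cost is paid on first access instead of at import time
    if name == "EXPENSIVE":
        if name not in _cache:
            _cache[name] = _expensive_build()
        return _cache[name]
    raise AttributeError(f"Module {__name__!r} has no attribute {name!r}")
```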
93 changes: 62 additions & 31 deletions rcsbapi/data/data_query.py
@@ -3,7 +3,9 @@
import re
import time
from typing import Any, Union, List, Dict, Optional, Tuple
import json
import requests
from tqdm import tqdm
from rcsbapi.data import DATA_SCHEMA
from ..config import config
from ..const import const
@@ -36,14 +38,15 @@ def __init__(
add_rcsb_id (bool, optional): whether to automatically add <input_type>.rcsb_id to queries. Defaults to True.
"""
suppress_autocomplete_warning = config.SUPPRESS_AUTOCOMPLETE_WARNING if config.SUPPRESS_AUTOCOMPLETE_WARNING else suppress_autocomplete_warning
input_id_limit = 200
if isinstance(input_ids, list):
if len(input_ids) > input_id_limit:
logger.warning("More than %d input_ids. For a more readable response, reduce number of ids.", input_id_limit)
if isinstance(input_ids, dict):
for value in input_ids.values():
if len(value) > input_id_limit:
logger.warning("More than %d input_ids. For a more readable response, reduce number of ids.", input_id_limit)

if not isinstance(input_ids, AllStructures):
if isinstance(input_ids, list):
if len(input_ids) > config.INPUT_ID_LIMIT:
logger.warning("More than %d input_ids. Query will be slower to complete.", config.INPUT_ID_LIMIT)
if isinstance(input_ids, dict):
for value in input_ids.values():
if len(value) > config.INPUT_ID_LIMIT:
logger.warning("More than %d input_ids. Query will be slower to complete.", config.INPUT_ID_LIMIT)

self._input_type, self._input_ids = self._process_input_ids(input_type, input_ids)
self._return_data_list = return_data_list
@@ -61,6 +64,7 @@ def __init__(
def _process_input_ids(self, input_type: str, input_ids: Union[List[str], Dict[str, str], Dict[str, List[str]]]) -> Tuple[str, List[str]]:
"""Convert input_type to plural if possible.
Set input_ids to be a list of ids.
If using ALL_STRUCTURES, return the id list corresponding to the input type.

Args:
input_type (str): query input type
@@ -70,6 +74,11 @@
Returns:
Tuple[str, List[str]]: returns a tuple of converted input_type and list of input_ids
"""
# If input_ids is ALL_STRUCTURES, return appropriate list of ids
if isinstance(input_ids, AllStructures):
new_input_ids = input_ids.get_all_ids(input_type)
return (input_type, new_input_ids)

# Convert _input_type to plural if applicable
converted = False
if DATA_SCHEMA._root_dict[input_type][0]["kind"] != "LIST":
@@ -154,39 +163,36 @@ def get_editor_link(self) -> str:
editor_base_link = str(const.DATA_API_ENDPOINT) + "/index.html?query="
return editor_base_link + urllib.parse.quote(self._query)

def exec(self) -> Dict[str, Any]:
def exec(self, batch_size: int = 5000, progress_bar: bool = False) -> Dict[str, Any]:
"""POST a GraphQL query and get response

Args:
batch_size (int, optional): maximum number of input_ids sent per request. Defaults to 5000.
progress_bar (bool, optional): whether to display a progress bar as batches complete. Defaults to False.

Returns:
Dict[str, Any]: JSON object
"""
batch_size = 50
if len(self._input_ids) > batch_size:
batched_ids = self._batch_ids(batch_size)
response_json: Dict[str, Any] = {}
# count = 0
for id_batch in batched_ids:
query = re.sub(r"\[([^]]+)\]", f"{id_batch}".replace("'", '"'), self._query)
part_response = requests.post(
headers={"Content-Type": "application/graphql"},
data=query,
url=const.DATA_API_ENDPOINT,
timeout=config.DATA_API_TIMEOUT
).json()
self._parse_gql_error(part_response)
time.sleep(0.2)
if not response_json:
response_json = part_response
else:
response_json = self._merge_response(response_json, part_response)
batched_ids: Union[List[List[str]], tqdm] = self._batch_ids(batch_size)
else:
response_json = requests.post(
batched_ids = [self._input_ids]
response_json: Dict[str, Any] = {}

if progress_bar is True:
batched_ids = tqdm(batched_ids)

for id_batch in batched_ids:
query = re.sub(r"\[([^]]+)\]", f"{id_batch}".replace("'", '"'), self._query)
part_response = requests.post(
headers={"Content-Type": "application/graphql"},
data=self._query,
data=query,
url=const.DATA_API_ENDPOINT,
timeout=config.DATA_API_TIMEOUT
timeout=config.API_TIMEOUT
).json()
self._parse_gql_error(response_json)
self._parse_gql_error(part_response)
time.sleep(0.2)
if not response_json:
response_json = part_response
else:
response_json = self._merge_response(response_json, part_response)

if "data" in response_json.keys():
query_response = response_json["data"][self._input_type]
if query_response is None:
@@ -242,3 +248,28 @@ def _merge_response(self, merge_into_response: Dict[str, Any], to_merge_response
combined_response = merge_into_response
combined_response["data"][self._input_type] += to_merge_response["data"][self._input_type]
return combined_response


class AllStructures:
def __init__(self):
self.ALL_STRUCTURES = self.reload()

def reload(self) -> Dict[str, List[str]]:
ALL_STRUCTURES = {}
for input_type, endpoints in const.INPUT_TYPE_TO_ALL_STRUCTURES_ENDPOINT.items():
all_ids: List[str] = []
for endpoint in endpoints:
response = requests.get(endpoint, timeout=60)
if response.status_code == 200:
all_ids.extend(json.loads(response.text))
else:
response.raise_for_status()
ALL_STRUCTURES[input_type] = all_ids

return ALL_STRUCTURES

def get_all_ids(self, input_type: str) -> List[str]:
if input_type in self.ALL_STRUCTURES:
return self.ALL_STRUCTURES[input_type]
else:
raise ValueError(f"ALL_STRUCTURES is not yet available for input_type {input_type}")
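
For illustration, a short sketch of how the new class behaves once built (requires network access to the holdings endpoints; users would normally obtain the instance via `from rcsbapi.data import ALL_STRUCTURES` rather than constructing it directly):

```python
from rcsbapi.data import ALL_STRUCTURES

# All current entry IDs, fetched from the holdings endpoints
# when ALL_STRUCTURES is first accessed
entry_ids = ALL_STRUCTURES.get_all_ids("entries")
print(len(entry_ids))

# Input types without a holdings endpoint raise ValueError
# ALL_STRUCTURES.get_all_ids("polymer_entities")  # would raise
```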
2 changes: 1 addition & 1 deletion rcsbapi/data/data_schema.py
@@ -114,7 +114,7 @@ def __init__(self) -> None:
GraphQL schema defining available fields, types, and how they are connected.
"""
self.pdb_url: str = const.DATA_API_ENDPOINT
self.timeout: int = config.DATA_API_TIMEOUT
self.timeout: int = config.API_TIMEOUT
self.schema: Dict = self._fetch_schema()
"""JSON resulting from full introspection of GraphQL query"""

2 changes: 1 addition & 1 deletion rcsbapi/data/resources/assembly.json
@@ -582,7 +582,7 @@
"rcsb_search_group": [
{
"group_name": "ID(s) and Keywords",
"priority_order": 15
"priority_order": 25
}
]
},