Search API: Retain nesting of queries, Data API: add querying for all structures #50

Merged · 14 commits · Mar 12, 2025
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,16 @@
# Changelog
## v1.1.0 (2025-03-12)

- Add `ALL_STRUCTURES` object, allowing Data API queries for all PDB structures and chemical components
- Add `progress_bar` and `batch_size` parameters to the Data API package's `exec` method
- Add `group` function to the Search API package to enforce nested grouping
- Update README with new citation information
- Update search schemas: 1.48.0 -> 1.49.0
- Update data schemas:
- entry schema 9.0.3 -> 9.0.4
- polymer_entity_instance schema 10.0.2 -> 10.0.3
- nonpolymer_entity_instance schema 10.0.0 -> 10.0.1

## v1.0.1 (2025-01-17)

- Add import to `const.py` for compatibility with Python 3.8
7 changes: 7 additions & 0 deletions README.md
@@ -140,6 +140,13 @@ Please cite the ``rcsb-api`` package by URL:

You should also cite the RCSB.org API services this package utilizes:

> Dennis W Piehl, Brinda Vallat, Ivana Truong, Habiba Morsy, Rusham Bhatt,
> Santiago Blaumann, Pratyoy Biswas, Yana Rose, Sebastian Bittrich, Jose M. Duarte,
> Joan Segura, Chunxiao Bi, Douglas Myers-Turnbull, Brian P. Hudson, Christine Zardecki,
> Stephen K. Burley, Rcsb-Api: Python Toolkit for Streamlining Access to RCSB Protein
> Data Bank APIs, Journal of Molecular Biology, 2025.
> DOI: [10.1016/j.jmb.2025.168970](https://doi.org/10.1016/j.jmb.2025.168970)

> Yana Rose, Jose M. Duarte, Robert Lowe, Joan Segura, Chunxiao Bi, Charmi
> Bhikadiya, Li Chen, Alexander S. Rose, Sebastian Bittrich, Stephen K. Burley,
> John D. Westbrook. RCSB Protein Data Bank: Architectural Advances Towards
43 changes: 43 additions & 0 deletions docs/data_api/query_construction.md
@@ -88,6 +88,26 @@ input_ids=["4HHB.A", "4HHB.B"]
input_ids={"instance_ids": ["4HHB.A", "4HHB.B"]}
```

Using a refined list of IDs is generally more efficient and makes results easier to interpret. However, if you would like to request data for all IDs within an `input_type`, you can pass the `ALL_STRUCTURES` variable, which sets `input_ids` to every ID available for the given `input_type` (where supported).

```python
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES

# Using `ALL_STRUCTURES` with `input_type` "entries"
# will use all experimentally determined entry IDs
query = Query(
input_type="entries",
input_ids=ALL_STRUCTURES,
return_data_list=["exptl.method"]
)

# Executing the query with a progress bar
query.exec(progress_bar=True)

print(query.get_response())
```

### return_data_list
These are the data items that you are requesting, referred to as "fields".
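
For instance, a minimal sketch of a simple request (reusing the `exptl.method` field and the `4HHB` entry that appear in the examples above):

```python
from rcsbapi.data import DataQuery as Query

# Request the experimental method for one entry
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl.method"]
)

query.exec()
print(query.get_response())
```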

@@ -154,6 +174,29 @@ print(result_dict)
}
```

### Executing Large Queries
When executing large queries, the package batches the `input_ids` before making requests and merges the responses into a single JSON object. The default batch size is 5,000, but this can be adjusted through the `batch_size` parameter of the `exec` method. To display a progress bar that tracks completed batches, set `progress_bar` to `True`.

```python
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES

query = Query(
input_type="entries",
input_ids=ALL_STRUCTURES,
return_data_list=["exptl.method"]
)

# Executing query with larger batch size
# and progress bar
query.exec(
batch_size=7000,
progress_bar=True
)

print(query.get_response())
```

## Helpful Methods
There are several methods included to make working with query objects easier. These methods can help you refine your queries to request exactly and only what you want, as well as further understand the GraphQL syntax.
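
As one example, a brief sketch using the `get_editor_link()` method (defined in `data_query.py` later in this diff) to inspect the GraphQL generated for a query:

```python
from rcsbapi.data import DataQuery as Query

query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["exptl.method"]
)

# Link that opens the generated GraphQL in the GraphiQL editor,
# useful for learning the underlying query syntax
print(query.get_editor_link())
```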

32 changes: 30 additions & 2 deletions docs/search_api/query_construction.md
@@ -150,6 +150,35 @@ query = q1 & q2
list(query())
```

Some sets of attributes can be grouped so that they are searched together, allowing a more specific search. For example, the attribute `rcsb_chem_comp_related.resource_name` can be set to "DrugBank" or another database and grouped with the attribute `rcsb_chem_comp_related.resource_accession_code`, which searches for an accession code. When grouped, the two attributes are searched together (i.e. the accession code must be associated with the specified database). To identify attributes that can be grouped, check the [schema](http://search.rcsb.org/rcsbsearch/v2/metadata/schema) for attributes with `rcsb_nested_indexing` set to `true`. To specify that two attributes should be searched together, use the `group` function.

```python
from rcsbapi.search import AttributeQuery
from rcsbapi.search import group

q1 = AttributeQuery(
attribute="rcsb_chem_comp_related.resource_name",
operator="exact_match",
value="DrugBank"
)

q2 = AttributeQuery(
attribute="rcsb_chem_comp_related.resource_accession_code",
operator="exact_match",
value="DB01050"
)

q3 = AttributeQuery(
attribute="rcsb_entity_source_organism.scientific_name",
operator="exact_match",
value="Homo sapiens"
)

# Using `group` ensures that `resource_name` and `accession_code` attributes are searched together
query = group(q1 & q2) & q3
list(query())
```

### Sessions
The result of executing a query (either by calling it as a function or using `exec()`) is a
`Session` object. It implements `__iter__`, so it is usually treated as an iterator.
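
For example, a minimal sketch (assuming `query` is a Search API query object constructed as above; iterating the session yields the matching identifiers):

```python
# Iterate over results one identifier at a time
for rcsb_id in query():
    print(rcsb_id)

# Or collect everything at once, as in the earlier examples
results = list(query())
```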
@@ -184,8 +213,7 @@ session.get_query_builder_link()

#### Progress Bar
The `iquery()` `Session` method provides a progress bar indicating the number of API
requests being made. It requires the `tqdm` package be installed to track the
progress of the query interactively.
requests being made.
```python
results = query().iquery()
```
2 changes: 1 addition & 1 deletion rcsbapi/__init__.py
@@ -2,7 +2,7 @@
__author__ = "Dennis Piehl"
__email__ = "[email protected]"
__license__ = "MIT"
__version__ = "1.0.1"
__version__ = "1.1.0"

__path__ = __import__("pkgutil").extend_path(__path__, __name__)

3 changes: 2 additions & 1 deletion rcsbapi/config.py
@@ -18,9 +18,10 @@


class Config:
DATA_API_TIMEOUT: int = 60
API_TIMEOUT: int = 60
SEARCH_API_REQUESTS_PER_SECOND: int = 10
SUPPRESS_AUTOCOMPLETE_WARNING: bool = False
INPUT_ID_LIMIT: int = 5000

def __setattr__(self, name, value):
"""Verify attribute exists when a user tries to set a configuration parameter, and ensure proper typing.
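
Because `Config.__setattr__` validates each assignment, overriding a setting is a plain attribute assignment on the shared `config` instance (imported elsewhere in this diff as `from ..config import config`). A minimal sketch:

```python
from rcsbapi.config import config

# Raise the shared request timeout from the default of 60 seconds
config.API_TIMEOUT = 120

# Threshold above which a warning about large input_id lists is logged
config.INPUT_ID_LIMIT = 10000
```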
5 changes: 5 additions & 0 deletions rcsbapi/const.py
@@ -97,5 +97,10 @@ class Const:
"uniprot": [r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"]
})

INPUT_TYPE_TO_ALL_STRUCTURES_ENDPOINT: MappingProxyType[str, List[str]] = MappingProxyType({
"entries": ["https://data.rcsb.org/rest/v1/holdings/current/entry_ids"],
"chem_comps": ["https://data.rcsb.org/rest/v1/holdings/current/ccd_ids", "https://data.rcsb.org/rest/v1/holdings/current/prd_ids"]
})


const = Const()
25 changes: 24 additions & 1 deletion rcsbapi/data/__init__.py
@@ -1,9 +1,32 @@
"""RCSB PDB Data API"""

from .data_schema import DataSchema

DATA_SCHEMA = DataSchema()

# This is needed because __getattr__ will be called twice on import,
# so ALL_STRUCTURES should be cached to avoid initializing twice
_import_cache: dict = {}


def __getattr__(name: str):
"""Overloading __getattr__ so that when ALL_STRUCTURES is accessed for the first time,
ALL_STRUCTURES object will be built.

Args:
name (str): attribute name
"""
if name == "ALL_STRUCTURES":
if name not in _import_cache:
from .data_query import AllStructures
ALL_STRUCTURES = AllStructures()
_import_cache[name] = ALL_STRUCTURES

return _import_cache[name] # Return cached instance

# keep functionality of original __getattr__
raise AttributeError(f"Module {repr(__name__)} has no attribute {repr(name)}")


from .data_query import DataQuery # noqa:E402

__all__ = ["DataQuery", "DataSchema"]
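
The module-level `__getattr__` above follows the lazy-attribute pattern from PEP 562: the attribute is built on first access rather than at import time, so `import rcsbapi.data` stays fast and no network request is made until `ALL_STRUCTURES` is actually used. A stripped-down sketch of the same idea (a hypothetical standalone module, not part of this package):

```python
# lazy_example.py -- hypothetical module illustrating PEP 562 lazy attributes
_cache: dict = {}


def _expensive_build() -> list:
    # Stand-in for slow work, e.g. fetching ID lists over the network
    return ["4HHB"]


def __getattr__(name: str):
    # Called only when `name` is not found by normal module lookup,
    # so the cost is paid on first access instead of at import time
    if name == "EXPENSIVE":
        if name not in _cache:
            _cache[name] = _expensive_build()
        return _cache[name]
    raise AttributeError(f"Module {__name__!r} has no attribute {name!r}")
```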
93 changes: 62 additions & 31 deletions rcsbapi/data/data_query.py
@@ -3,7 +3,9 @@
import re
import time
from typing import Any, Union, List, Dict, Optional, Tuple
import json
import requests
from tqdm import tqdm
from rcsbapi.data import DATA_SCHEMA
from ..config import config
from ..const import const
@@ -36,14 +38,15 @@ def __init__(
add_rcsb_id (bool, optional): whether to automatically add <input_type>.rcsb_id to queries. Defaults to True.
"""
suppress_autocomplete_warning = config.SUPPRESS_AUTOCOMPLETE_WARNING if config.SUPPRESS_AUTOCOMPLETE_WARNING else suppress_autocomplete_warning
input_id_limit = 200
if isinstance(input_ids, list):
if len(input_ids) > input_id_limit:
logger.warning("More than %d input_ids. For a more readable response, reduce number of ids.", input_id_limit)
if isinstance(input_ids, dict):
for value in input_ids.values():
if len(value) > input_id_limit:
logger.warning("More than %d input_ids. For a more readable response, reduce number of ids.", input_id_limit)

if not isinstance(input_ids, AllStructures):
if isinstance(input_ids, list):
if len(input_ids) > config.INPUT_ID_LIMIT:
logger.warning("More than %d input_ids. Query will be slower to complete.", config.INPUT_ID_LIMIT)
if isinstance(input_ids, dict):
for value in input_ids.values():
if len(value) > config.INPUT_ID_LIMIT:
logger.warning("More than %d input_ids. Query will be slower to complete.", config.INPUT_ID_LIMIT)

self._input_type, self._input_ids = self._process_input_ids(input_type, input_ids)
self._return_data_list = return_data_list
@@ -61,6 +64,7 @@ def __init__(
def _process_input_ids(self, input_type: str, input_ids: Union[List[str], Dict[str, str], Dict[str, List[str]]]) -> Tuple[str, List[str]]:
"""Convert input_type to plural if possible.
Set input_ids to be a list of ids.
If using ALL_STRUCTURES, return the id list corresponding to the input type.

Args:
input_type (str): query input type
@@ -70,6 +74,11 @@
Returns:
Tuple[str, List[str]]: returns a tuple of converted input_type and list of input_ids
"""
# If input_ids is ALL_STRUCTURES, return appropriate list of ids
if isinstance(input_ids, AllStructures):
new_input_ids = input_ids.get_all_ids(input_type)
return (input_type, new_input_ids)

# Convert _input_type to plural if applicable
converted = False
if DATA_SCHEMA._root_dict[input_type][0]["kind"] != "LIST":
@@ -154,39 +163,36 @@ def get_editor_link(self) -> str:
editor_base_link = str(const.DATA_API_ENDPOINT) + "/index.html?query="
return editor_base_link + urllib.parse.quote(self._query)

def exec(self) -> Dict[str, Any]:
def exec(self, batch_size: int = 5000, progress_bar: bool = False) -> Dict[str, Any]:
"""POST a GraphQL query and get response

Args:
batch_size (int, optional): maximum number of input_ids sent per request. Defaults to 5000.
progress_bar (bool, optional): whether to display a progress bar as batches complete. Defaults to False.

Returns:
Dict[str, Any]: JSON object
"""
batch_size = 50
if len(self._input_ids) > batch_size:
batched_ids = self._batch_ids(batch_size)
response_json: Dict[str, Any] = {}
# count = 0
for id_batch in batched_ids:
query = re.sub(r"\[([^]]+)\]", f"{id_batch}".replace("'", '"'), self._query)
part_response = requests.post(
headers={"Content-Type": "application/graphql"},
data=query,
url=const.DATA_API_ENDPOINT,
timeout=config.DATA_API_TIMEOUT
).json()
self._parse_gql_error(part_response)
time.sleep(0.2)
if not response_json:
response_json = part_response
else:
response_json = self._merge_response(response_json, part_response)
batched_ids: Union[List[List[str]], tqdm] = self._batch_ids(batch_size)
else:
response_json = requests.post(
batched_ids = [self._input_ids]
response_json: Dict[str, Any] = {}

if progress_bar is True:
batched_ids = tqdm(batched_ids)

for id_batch in batched_ids:
query = re.sub(r"\[([^]]+)\]", f"{id_batch}".replace("'", '"'), self._query)
part_response = requests.post(
headers={"Content-Type": "application/graphql"},
data=self._query,
data=query,
url=const.DATA_API_ENDPOINT,
timeout=config.DATA_API_TIMEOUT
timeout=config.API_TIMEOUT
).json()
self._parse_gql_error(response_json)
self._parse_gql_error(part_response)
time.sleep(0.2)
if not response_json:
response_json = part_response
else:
response_json = self._merge_response(response_json, part_response)

if "data" in response_json.keys():
query_response = response_json["data"][self._input_type]
if query_response is None:
@@ -242,3 +248,28 @@ def _merge_response(self, merge_into_response: Dict[str, Any], to_merge_response
combined_response = merge_into_response
combined_response["data"][self._input_type] += to_merge_response["data"][self._input_type]
return combined_response


class AllStructures:
def __init__(self):
self.ALL_STRUCTURES = self.reload()

def reload(self) -> Dict[str, List[str]]:
ALL_STRUCTURES = {}
for input_type, endpoints in const.INPUT_TYPE_TO_ALL_STRUCTURES_ENDPOINT.items():
all_ids: List[str] = []
for endpoint in endpoints:
response = requests.get(endpoint, timeout=60)
if response.status_code == 200:
all_ids.extend(json.loads(response.text))
else:
response.raise_for_status()
ALL_STRUCTURES[input_type] = all_ids

return ALL_STRUCTURES

def get_all_ids(self, input_type: str) -> List[str]:
if input_type in self.ALL_STRUCTURES:
return self.ALL_STRUCTURES[input_type]
else:
raise ValueError(f"ALL_STRUCTURES is not yet available for input_type {input_type}")
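
For illustration, a short sketch of how the new class behaves once built (requires network access to the holdings endpoints; users would normally obtain the instance via `from rcsbapi.data import ALL_STRUCTURES` rather than constructing it directly):

```python
from rcsbapi.data import ALL_STRUCTURES

# All current entry IDs, fetched from the holdings endpoints
# when ALL_STRUCTURES is first accessed
entry_ids = ALL_STRUCTURES.get_all_ids("entries")
print(len(entry_ids))

# Input types without a holdings endpoint raise ValueError
# ALL_STRUCTURES.get_all_ids("polymer_entities")  # would raise
```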
2 changes: 1 addition & 1 deletion rcsbapi/data/data_schema.py
@@ -114,7 +114,7 @@ def __init__(self) -> None:
GraphQL schema defining available fields, types, and how they are connected.
"""
self.pdb_url: str = const.DATA_API_ENDPOINT
self.timeout: int = config.DATA_API_TIMEOUT
self.timeout: int = config.API_TIMEOUT
self.schema: Dict = self._fetch_schema()
"""JSON resulting from full introspection of GraphQL query"""

2 changes: 1 addition & 1 deletion rcsbapi/data/resources/assembly.json
@@ -582,7 +582,7 @@
"rcsb_search_group": [
{
"group_name": "ID(s) and Keywords",
"priority_order": 15
"priority_order": 25
}
]
},