-
Notifications
You must be signed in to change notification settings - Fork 38
Implement Validation for Non-Queryable Attributes in Filtering #532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…lds in QueryablesCache
| | `USE_DATETIME_NANOS` | Enables nanosecond precision handling for `datetime` field searches as per the `date_nanos` type. When `False`, it uses 3 millisecond precision as per the type `date`. | `true` | Optional | | ||
| | `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. | None | Optional | | ||
| | `EXCLUDED_FROM_ITEMS` | Specifies fields to exclude from STAC item responses. Supports comma-separated field names and dot notation for nested fields (e.g., `private_data,properties.confidential,assets.internal`). | `None` | Optional | | ||
| | `VALIDATE_QUERYABLES` | Enable validation of query parameters against the collection's queryables. If set to `true`, the API will reject queries containing fields that are not defined in the collection's queryables. | `false` | Optional | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bountx How does this interact with the EXCLUDED_FROM_QUERYABLES environment variable? Should both these env vars be combined into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did miss this in my implementation so I will fix it, but these two variables serve a different purpose: VALIDATE_QUERYABLES should validate on set of all queryables and EXCLUDED_FROM_QUERYABLES can be used to make this set of all queryables smaller.
Their interaction when both used should look something along:
Cache of all queryables -> Removes from this set all excluded queryables -> Validates on that set
If you had something other in mind let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonhealy1 Fixed this interaction to work the way I mentioned beforehand.
README.md
Outdated
| - If a request contains a query parameter or filter field that is not in the list of allowed queryables, the API returns a `400 Bad Request` error with a message indicating the invalid field(s). | ||
| - The cache is automatically refreshed based on the `QUERYABLES_CACHE_TTL` setting. | ||
|
|
||
| This feature helps prevent queries on unindexed fields which could lead to poor performance or unexpected results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be wrong but I think Opensearch automatically indexes everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poor wording on my side. Unindexed fields = nonexistant fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed wording
README.md
Outdated
| | `STAC_INDEX_ASSETS` | Controls if Assets are indexed when added to Elasticsearch/Opensearch. This allows asset fields to be included in search queries. | `false` | Optional | | ||
| | `USE_DATETIME` | Configures the datetime search behavior in SFEOS. When enabled, searches both datetime field and falls back to start_datetime/end_datetime range for items with null datetime. When disabled, searches only by start_datetime/end_datetime range. | `true` | Optional | | ||
| | `USE_DATETIME_NANOS` | Enables nanosecond precision handling for `datetime` field searches as per the `date_nanos` type. When `False`, it uses 3 millisecond precision as per the type `date`. | `true` | Optional | | ||
| | `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. | None | Optional | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a line here explaining the expected behavior if VALIDATE_QUERYABLES is also set.
| get_properties_from_cql2_filter, | ||
| initialize_queryables_cache, | ||
| validate_queryables, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are importing stac_fastapi.sfeos_helpers into stac_fastapi.core. The Core library is a generic abstraction and must not depend on specific implementation helpers. This breaks the modularity of the library.
Query validation is a generic feature that belongs in Core. However, we cannot import it from sfeos_helpers.
- Move to Core Please move queryables.py (the Cache and Validator logic) out of stac_fastapi/sfeos_helpers/ and into stac_fastapi/core/.
Once the code lives in Core, CoreClient can import it safely without creating a dependency on the Elasticsearch/OpenSearch specific package. The Validator should rely strictly on BaseDatabaseLogic interfaces. (Note: You will likely need to add get_queryables_mapping as an abstract method to BaseDatabaseLogic so the validator can call it generically).
- Remove Global State Please remove the global variable (_queryables_cache_instance). Instead, instantiate QueryablesCache directly inside CoreClient.init as self.queryables_cache. This resolves the threading and testing fragility issues inherent in global state.
| ```bash | ||
| VALIDATE_QUERYABLES=true | ||
| QUERYABLES_CACHE_TTL=3600 # Optional, defaults to 3600 seconds (1 hour) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this seem like a long default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what my team suggested at first as the cache query for queryable parameters doesn't seem that costly. I'll rediscuss that with them though in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to 6 hours default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it should be shorter like 30 minutes maybe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see - for our case the queryables changes are pretty rare so 1h-6h updates wouldn't change that much, but I think 30m default would work too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if someone is adding new data to the db, they may get confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the default value to 30 minutes
|
|
||
|
|
||
| _queryables_cache_instance: Optional[QueryablesCache] = None | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. Please remove _queryables_cache_instance
class CoreClient(AsyncBaseCoreClient):
def __init__(self, database: BaseDatabaseLogic, ...):
self.database = database
# Initialize the cache as an instance attribute
# This solves the threading and testing fragility issues
self.queryables_cache = QueryablesCache(database)```
|
|
||
| class TestGlobalFunctions: | ||
| @pytest.fixture(autouse=True) | ||
| def reset_global_cache(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to remove this fixture after the other changes have been made.
|
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Combines the following fixes: - Normalize property names by stripping 'properties.' prefix in CQL2 filter - Enhance field mapping to handle 'properties.' prefix in Elasticsearch queries - Fix nested property traversal and improve error messages - Implement exclusion of fields from queryables based on environment variable - Enhance exclusion logic for queryables to support top-level fields and properties prefix
|
@jonhealy1 After some dev tests it turned out the EXCLUDE_FROM_QUERYABLES didn't interact well with our validation - fixed it. Added CQL2-text support and fixed nested properties traversal for validation. Let me know if you have any questions! |
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just make sure old changelog entries won't be deleted.
|
|
||
| - Added optional `/catalogs` route support to enable federated hierarchical catalog browsing and navigation. [#547](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/547) | ||
| - Added DELETE `/catalogs/{catalog_id}/collections/{collection_id}` endpoint to support removing collections from catalogs. When a collection belongs to multiple catalogs, it removes only the specified catalog from the collection's parent_ids. When a collection belongs to only one catalog, the collection is deleted entirely. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554) | ||
| - Added `parent_ids` internal field to collections to support multi-catalog hierarchies. Collections can now belong to multiple catalogs, with parent catalog IDs stored in this field for efficient querying and management. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bountx PR looks good, however, it looks like these changelog entries are being deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missclicked when merging, I'll add them back
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks :)
Related Issue(s):
Description:
Created query validation to return bad request 400 on requests with parameters not existent in any queryables without unnecessary searching (optional: turned off by default - turned on by environment variable
VALIDATE_QUERYABLES).To do that PR also introduces local caching system (global variable
_queryables_cache_instance: QueryablesCache) that collects a set of queryables parameters from all the collections and updates it hourly by default (set with environment variableQUERYABLES_CACHE_TTLin seconds - default to3600, 1 hour).PR Checklist:
pre-commit run --all-files)make test)