improvement(tools): many new tools, redesigned existing tools, better token management #55
Merged
Commits (12)
5132aba improvement(tools): many new tools, redesigned existing tools, better… (alexsku)
4c91662 unit test fixes (alexsku)
c5cec97 formatting (alexsku)
7338fca tests for all tools, sorting fields sorted out (alexsku)
003d37e feedback (alexsku)
7f3fb34 mentioned default views support (alexsku)
7a2d618 using the ci (alexsku)
83cee6b more tolerant to environment inconsistencies (alexsku)
0b08a20 more error handling (alexsku)
7a84ff4 fixes (alexsku)
54e70bc removed oss tests (alexsku)
9daa7bf fixes (alexsku)
# Changelog

All notable changes to mcp-server-datahub will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2025-11-17

### Added

#### Response Token Budget Management

- **New `TokenCountEstimator` class** for fast token counting using character-based heuristics
- **Automatic result truncation** via `_select_results_within_budget()` to prevent context window issues
- **Configurable token limits**:
  - `TOOL_RESPONSE_TOKEN_LIMIT` environment variable (default: 80,000 tokens)
  - `ENTITY_SCHEMA_TOKEN_BUDGET` environment variable (default: 16,000 tokens per entity)
- **90% safety buffer** to account for token estimation inaccuracies
- Ensures at least one result is always returned
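The budget-selection logic itself is not reproduced in this changelog. A minimal sketch of how such a selector can work, assuming the character-based heuristic described above (the function and parameter names here are illustrative, not the PR's actual API):

```python
from typing import Iterable, Iterator


def estimate_tokens(text: str) -> int:
    # Character-based heuristic used throughout this PR.
    return int(1.3 * len(text) / 4)


def select_results_within_budget(
    results: Iterable[dict], limit: int = 80_000, safety: float = 0.9
) -> Iterator[dict]:
    """Yield results until the (buffered) token budget is exhausted.

    Always yields at least one result, mirroring the changelog's guarantee.
    """
    budget = int(limit * safety)  # 90% safety buffer for estimation error
    used = 0
    for i, result in enumerate(results):
        cost = estimate_tokens(str(result))
        if i > 0 and used + cost > budget:
            break
        used += cost
        yield result
```

The generator shape matches the "generator-based result selection" noted under Performance below: results are consumed lazily, so nothing past the cutoff is ever materialized.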
#### Enhanced Search Capabilities

- **Enhanced Keyword Search**:
  - Supports pagination with the `start` parameter
  - Added `viewUrn` for view-based filtering
  - Added `sortInput` for custom sorting

#### Query Entity Support

- **Native QueryEntity type support** (SQL queries as first-class entities)
- New `query_entity.gql` GraphQL query
- Optimized entity retrieval with a specialized query for QueryEntity types
- Includes the query statement, subjects (datasets/fields), and platform information

#### GraphQL Compatibility

- **Adaptive field detection** for newer GMS versions
- Caching mechanism for GMS version detection
- Graceful fallback when newer fields aren't available
- Support for `#[CLOUD]` and `#[NEWER_GMS]` conditional field markers
- `DISABLE_NEWER_GMS_FIELD_DETECTION` environment variable override
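The `#[CLOUD]` and `#[NEWER_GMS]` markers presumably tag GraphQL lines that are only valid on certain server versions. A hypothetical sketch of how such markers could be stripped before sending a query (the PR's actual marker handling is not shown in this excerpt and may differ):

```python
import re

# Matches a trailing conditional marker such as "#[NEWER_GMS]" on a GraphQL line.
MARKER_RE = re.compile(r"#\[(?P<flag>[A-Z_]+)\]\s*$")


def strip_conditional_fields(gql: str, enabled_flags: set) -> str:
    """Drop lines whose trailing marker names a capability the server lacks."""
    kept = []
    for line in gql.splitlines():
        m = MARKER_RE.search(line)
        if m and m.group("flag") not in enabled_flags:
            continue  # server too old for this field: graceful fallback
        # Remove the marker itself from lines we keep.
        kept.append(MARKER_RE.sub("", line).rstrip())
    return "\n".join(kept)
```

With field detection enabled the flag set would come from the cached GMS version probe; with `DISABLE_NEWER_GMS_FIELD_DETECTION` set, an empty flag set would drop all marked fields.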
#### Schema Field Optimization

- **Smart field prioritization** to stay within token budgets:
  1. Primary key fields (`isPartOfKey=true`)
  2. Partitioning key fields (`isPartitioningKey=true`)
  3. Fields with descriptions
  4. Fields with tags or glossary terms
  5. Alphabetically by field path
- Generator-based approach for memory efficiency
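The five-level priority above maps naturally onto a Python sort key. This is a sketch only; the field-dict shape follows the attribute names listed in the changelog, and the helper name is hypothetical:

```python
def field_priority(field: dict) -> tuple:
    """Sort key implementing the prioritization order from the changelog.

    False sorts before True, so each boolean check is negated to put
    matching fields first.
    """
    return (
        not field.get("isPartOfKey", False),        # 1. primary key fields
        not field.get("isPartitioningKey", False),  # 2. partitioning key fields
        not bool(field.get("description")),         # 3. fields with descriptions
        not (field.get("tags") or field.get("glossaryTerms")),  # 4. tagged fields
        field.get("fieldPath", ""),                 # 5. alphabetical tiebreak
    )


fields = [
    {"fieldPath": "b_col"},
    {"fieldPath": "a_col"},
    {"fieldPath": "id", "isPartOfKey": True},
    {"fieldPath": "dt", "isPartitioningKey": True},
    {"fieldPath": "amount", "description": "Order total"},
]
ranked = sorted(fields, key=field_priority)
```

Truncating from the end of `ranked` then drops undocumented, untagged fields first.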
#### Error Handling & Security

- **Enhanced error logging** with full stack traces in the `async_background` wrapper
  - Logs the function name, args, and kwargs on failures
- **ReDoS protection** in HTML sanitization with bounded regex patterns
- **Query truncation** function (configurable via `QUERY_LENGTH_HARD_LIMIT`, default: 5,000 chars)
#### Default Views Support

- **Automatic default view application** for all search operations
  - Fetches the organization's default global view from DataHub
- **5-minute caching** (configurable via `VIEW_CACHE_TTL_SECONDS`)
- Can be disabled via the `DATAHUB_MCP_DISABLE_DEFAULT_VIEW` environment variable
- Ensures search results respect the organization's data governance policies
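A stdlib sketch of the TTL caching pattern behind the default-view lookup (the PR itself can use `cachetools`; the fetch callable and cache wiring here are illustrative placeholders, not the PR's actual code):

```python
import os
import time

VIEW_CACHE_TTL_SECONDS = int(os.environ.get("VIEW_CACHE_TTL_SECONDS", "300"))

_cache: dict = {}


def get_default_view_urn(fetch):
    """Return the org's default global view URN, cached for the TTL.

    `fetch` stands in for the real DataHub call that retrieves the
    default global view.
    """
    if os.environ.get("DATAHUB_MCP_DISABLE_DEFAULT_VIEW"):
        return None
    now = time.monotonic()
    hit = _cache.get("default_view")
    if hit is not None and now - hit[1] < VIEW_CACHE_TTL_SECONDS:
        return hit[0]  # still fresh: avoid a round trip to DataHub
    urn = fetch()
    _cache["default_view"] = (urn, now)
    return urn
```

Within the TTL window, repeated searches reuse the cached URN instead of re-fetching the view on every call.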
### Dependencies

- **Added** `cachetools>=5.0.0`: for GMS field detection caching
- **Added** `types-cachetools` (dev): type stubs for mypy

### Performance

- **Memory efficiency**: generator-based result selection avoids loading all results into memory
- **Caching**: GMS version detection cached per graph instance
- **Fast token estimation**: character-based heuristic (no tokenizer overhead)
- **Smart truncation**: truncates less important schema fields first

---

## [0.3.11] and earlier

See git history for changes in earlier versions.

---
## Migration Guide

### Environment Variables (New in 0.4.0)

```bash
# Configure token limits (optional)
export TOOL_RESPONSE_TOKEN_LIMIT=80000
export ENTITY_SCHEMA_TOKEN_BUDGET=16000

# Disable newer GMS field detection if needed
export DISABLE_NEWER_GMS_FIELD_DETECTION=true

# Disable default view application (optional)
export DATAHUB_MCP_DISABLE_DEFAULT_VIEW=true
```

### Search Examples (New in 0.4.0)

```python
# Keyword search with filters
result = search(
    query="/q revenue_*",
    filters={"entity_type": ["DATASET"]},
    num_results=10,
)

# Search with view filtering and sorting
result = search(
    query="customer data",
    viewUrn="urn:li:dataHubView:...",
    sortInput={"sortBy": "RELEVANCE", "sortOrder": "DESCENDING"},
    num_results=10,
)
```

---

## Questions or Issues?

- Open an issue: https://github.com/acryldata/mcp-server-datahub/issues
- Documentation: https://docs.datahub.com/docs/features/feature-guides/mcp
src/mcp_server_datahub/_token_estimator.py (new file):

```python
"""Token count estimation utilities for MCP responses.

IMPORTANT: This file is kept in sync between two repositories:
- datahub-integrations-service: src/datahub_integrations/mcp/_token_estimator.py
- mcp-server-datahub: src/mcp_server_datahub/_token_estimator.py

When making changes, ensure both versions remain identical.
"""

from functools import lru_cache
from typing import Union

from loguru import logger


class TokenCountEstimator:
    """Fast token estimation for MCP response budget management.

    Uses character-based heuristics instead of actual tokenization for performance.
    Accuracy is sufficient given the 90% budget buffer used in practice.
    """

    def __init__(self, model: str):
        """Initialize the token estimator.

        Args:
            model: The target model name (for future model-specific optimizations)
        """
        self.model = model

    @staticmethod
    @lru_cache(maxsize=100)
    def estimate_tokens(text: str) -> int:
        """Fast token estimation using a character-based heuristic.

        Uses 1.3 * len(text) / 4, which empirically approximates token counts
        for entity metadata. This approach is preferred over tiktoken because:
        - Faster (no tokenizer overhead)
        - More robust for structured/repetitive content
        - No dependency on tokenizer libraries
        - Accuracy is sufficient with the 90% budget buffer

        Returns:
            Approximate token count
        """
        return int(1.3 * len(text) / 4)

    @staticmethod
    def estimate_dict_tokens(
        obj: Union[dict, list, str, int, float, bool, None],
    ) -> int:
        """Fast approximation of token count for dict/list structures without
        JSON serialization.

        Recursively walks the structure counting characters. Much faster than
        json.dumps + estimate_tokens.

        IMPORTANT: Assumes no circular references in the structure.
        Protected against infinite recursion with MAX_DEPTH=100.

        Args:
            obj: Dict, list, or primitive value (must not contain circular references)

        Returns:
            Approximate token count
        """
        MAX_DEPTH = 100

        def _count_chars(item, depth: int = 0) -> int:
            if depth > MAX_DEPTH:
                logger.error(
                    f"Max depth {MAX_DEPTH} exceeded in structure, stopping recursion"
                )
                return 0

            if item is None:
                return 4  # "null"
            elif isinstance(item, bool):
                return 5  # "true" or "false"
            elif isinstance(item, str):
                # Account for:
                # - Quotes around string values: "value" → +6
                # - Escape characters (\n, \", \\, etc.) → +10% of length
                # Structural chars weighted heavier as they often tokenize separately
                base_length = len(item)
                escape_overhead = int(base_length * 0.1)
                return base_length + 6 + escape_overhead
            elif isinstance(item, (int, float)):
                return 6  # Average number length
            elif isinstance(item, list):
                return sum(_count_chars(elem, depth + 1) for elem in item) + len(item)
            elif isinstance(item, dict):
                total = 0
                for key, value in item.items():
                    # Account for: "key": value, → 2 quotes + colon + space + comma
                    # Structural chars weighted heavier (often separate tokens)
                    total += len(str(key)) + 9
                    total += _count_chars(value, depth + 1)
                return total + len(item)  # Additional padding for structure
            else:
                return 10  # Fallback for other types

        chars = _count_chars(obj, depth=0)
        # Use same formula as estimate_tokens for consistency
        return int(1.3 * chars / 4)
```
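For a feel of the heuristic's scale, here is a self-contained check that mirrors the `estimate_tokens` formula above (copied out of the class so it runs standalone):

```python
def estimate_tokens(text: str) -> int:
    # Same formula as TokenCountEstimator.estimate_tokens: int(1.3 * len / 4).
    return int(1.3 * len(text) / 4)


# Roughly 1 token per ~3 characters:
assert estimate_tokens("hello world") == 3   # 11 chars -> int(3.575)
assert estimate_tokens("x" * 4000) == 1300   # 4,000 chars -> ~1.3k tokens
```

The deliberate overestimate (1.3x a plain chars/4 rule) plus the 90% budget buffer is what lets a crude heuristic stand in for a real tokenizer here.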
Review discussion:

> Do we really want to commit this file? What is the expectation going forward, that this file will be constantly updated on each change? Why not just keep release notes as the place to communicate what we've shipped?

> What are the release notes? Where are they?