Merged
10 changes: 2 additions & 8 deletions .github/workflows/ci.yml
@@ -34,12 +34,6 @@ jobs:
           # Test against longtail (Cloud instance)
           - test_env: "longtail"
             datahub_version: "cloud"
-          # Test against OSS v1.3.0 (latest stable)
-          - test_env: "oss"
-            datahub_version: "v1.3.0"
-          # Test against latest OSS (head)
-          - test_env: "oss"
-            datahub_version: "head"
       fail-fast: false

     steps:
@@ -82,15 +76,15 @@ jobs:
           DATAHUB_GMS_URL: ${{ secrets.LONGTAIL_GMS_URL }}
           DATAHUB_GMS_TOKEN: ${{ secrets.LONGTAIL_GMS_TOKEN }}
         run: |
-          uv run pytest tests/test_mcp_server.py -v
+          uv run pytest tests/test_mcp_integration.py -v

       - name: Run integration tests (OSS)
         if: matrix.test_env == 'oss'
         env:
           DATAHUB_GMS_URL: "http://localhost:8080"
           DATAHUB_GMS_TOKEN: ""
         run: |
-          uv run pytest tests/test_mcp_server.py -v
+          uv run pytest tests/test_mcp_integration.py -v

       # Cleanup
       - name: Cleanup DataHub (OSS only)
122 changes: 122 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,122 @@
# Changelog

All notable changes to mcp-server-datahub will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2025-11-17

### Added

#### Response Token Budget Management
- **New `TokenCountEstimator` class** for fast token counting using character-based heuristics
- **Automatic result truncation** via `_select_results_within_budget()` to prevent context window issues
- **Configurable token limits**:
- `TOOL_RESPONSE_TOKEN_LIMIT` environment variable (default: 80,000 tokens)
- `ENTITY_SCHEMA_TOKEN_BUDGET` environment variable (default: 16,000 tokens per entity)
- **90% safety buffer** to account for token estimation inaccuracies
- Ensures at least one result is always returned
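As a rough illustration of the selection behavior described above (the helper name, signature, and inline estimator below are assumptions for the sketch, not the shipped `_select_results_within_budget` implementation):

```python
from typing import Iterable, Iterator


def select_results_within_budget(
    results: Iterable[dict],
    token_limit: int = 80_000,
    safety_factor: float = 0.9,
) -> Iterator[dict]:
    """Yield results until the buffered token budget is exhausted.

    Mirrors the behavior described above: a 90% safety buffer is applied
    to the configured limit, and at least one result is always returned.
    """

    def estimate_tokens(obj: dict) -> int:
        # Character-based heuristic, same shape as TokenCountEstimator's
        return int(1.3 * len(str(obj)) / 4)

    budget = int(token_limit * safety_factor)
    used = 0
    for i, result in enumerate(results):
        cost = estimate_tokens(result)
        if i > 0 and used + cost > budget:
            break  # budget exhausted; stop yielding further results
        used += cost
        yield result
```

Because it is a generator, results past the cutoff are never materialized.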

#### Enhanced Search Capabilities
- **Enhanced Keyword Search**:
- Supports pagination with `start` parameter
- Added `viewUrn` for view-based filtering
- Added `sortInput` for custom sorting

#### Query Entity Support
- **Native QueryEntity type support** (SQL queries as first-class entities)
- New `query_entity.gql` GraphQL query
- Optimized entity retrieval with specialized query for QueryEntity types
- Includes query statement, subjects (datasets/fields), and platform information

#### GraphQL Compatibility
- **Adaptive field detection** for newer GMS versions
- Caching mechanism for GMS version detection
- Graceful fallback when newer fields aren't available
- Support for `#[CLOUD]` and `#[NEWER_GMS]` conditional field markers
- `DISABLE_NEWER_GMS_FIELD_DETECTION` environment variable override
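The conditional field markers work roughly like this sketch (the function name and exact marker handling are illustrative assumptions; the real logic also consults the cached GMS version detection):

```python
import re


def strip_unsupported_fields(gql: str, *, is_cloud: bool, has_newer_gms: bool) -> str:
    """Drop GraphQL lines tagged with markers the target server can't serve."""
    kept = []
    for line in gql.splitlines():
        if "#[CLOUD]" in line and not is_cloud:
            continue  # field only exists on DataHub Cloud
        if "#[NEWER_GMS]" in line and not has_newer_gms:
            continue  # field only exists on newer GMS versions
        # Remove the marker comment itself before sending the query
        kept.append(re.sub(r"\s*#\[(CLOUD|NEWER_GMS)\]", "", line))
    return "\n".join(kept)
```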

#### Schema Field Optimization
- **Smart field prioritization** to stay within token budgets:
1. Primary key fields (`isPartOfKey=true`)
2. Partitioning key fields (`isPartitioningKey=true`)
3. Fields with descriptions
4. Fields with tags or glossary terms
5. Alphabetically by field path
- Generator-based approach for memory efficiency
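A minimal sketch of this priority ordering as a Python sort key (the helper itself is hypothetical; the dict keys mirror the field properties named above):

```python
def schema_field_priority(field: dict) -> tuple:
    """Sort key implementing the priority order above (lower tuples sort first)."""
    return (
        not field.get("isPartOfKey", False),        # 1. primary key fields first
        not field.get("isPartitioningKey", False),  # 2. then partitioning keys
        not field.get("description"),               # 3. then documented fields
        not (field.get("tags") or field.get("glossaryTerms")),  # 4. tagged fields
        field.get("fieldPath", ""),                 # 5. alphabetical tiebreak
    )

# usage: prioritized = sorted(schema_fields, key=schema_field_priority)
```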

#### Error Handling & Security
- **Enhanced error logging** with full stack traces in `async_background` wrapper
- Logs function name, args, and kwargs on failures
- **ReDoS protection** in HTML sanitization with bounded regex patterns
- **Query truncation** function (configurable via `QUERY_LENGTH_HARD_LIMIT`, default: 5,000 chars)
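Query truncation amounts to a bounded slice; a sketch (the function name and truncation marker are assumptions):

```python
QUERY_LENGTH_HARD_LIMIT = 5_000  # default; configurable via env var in the server


def truncate_query(statement: str, limit: int = QUERY_LENGTH_HARD_LIMIT) -> str:
    """Truncate long SQL statements, marking the cut so readers know."""
    if len(statement) <= limit:
        return statement
    return statement[:limit] + "\n-- [truncated]"
```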

#### Default Views Support
- **Automatic default view application** for all search operations
- Fetches organization's default global view from DataHub
- **5-minute caching** (configurable via `VIEW_CACHE_TTL_SECONDS`)
- Can be disabled via `DATAHUB_MCP_DISABLE_DEFAULT_VIEW` environment variable
- Ensures search results respect organization's data governance policies

### Dependencies

- **Added** `cachetools>=5.0.0`: For GMS field detection caching
- **Added** `types-cachetools` (dev): Type stubs for mypy

### Performance

- **Memory efficiency**: Generator-based result selection avoids loading all results into memory
- **Caching**: GMS version detection cached per graph instance
- **Fast token estimation**: Character-based heuristic (no tokenizer overhead)
- **Smart truncation**: Truncates less important schema fields first

---

## [0.3.11] and earlier

See git history for changes in earlier versions.

---

## Migration Guide

### Environment Variables (New in 0.4.0)

```bash
# Configure token limits (optional)
export TOOL_RESPONSE_TOKEN_LIMIT=80000
export ENTITY_SCHEMA_TOKEN_BUDGET=16000

# Disable newer GMS field detection if needed
export DISABLE_NEWER_GMS_FIELD_DETECTION=true

# Disable default view application (optional)
export DATAHUB_MCP_DISABLE_DEFAULT_VIEW=true
```

### Search Examples (New in 0.4.0)

```python
# Keyword search with filters
result = search(
    query="/q revenue_*",
    filters={"entity_type": ["DATASET"]},
    num_results=10,
)

# Search with view filtering and sorting
result = search(
    query="customer data",
    viewUrn="urn:li:dataHubView:...",
    sortInput={"sortBy": "RELEVANCE", "sortOrder": "DESCENDING"},
    num_results=10,
)
```

---

## Questions or Issues?

- Open an issue: https://github.com/acryldata/mcp-server-datahub/issues
Contributor
Do we really want to commit this file?

What is the expectation going forward, that this file will be constantly updated on each change?

Why not just keep release notes as the place to communicate what we've shipped?

Collaborator Author
what are the release notes? where is that?
- Documentation: https://docs.datahub.com/docs/features/feature-guides/mcp
9 changes: 9 additions & 0 deletions pyproject.toml
@@ -7,6 +7,7 @@ requires-python = ">=3.10"
 dependencies = [
     "acryl-datahub==1.2.0.2",
     "asyncer>=0.0.8",
+    "cachetools>=5.0.0",
     "fastmcp==2.10.5",
     "jmespath~=1.0.1",
     "loguru",
@@ -20,6 +21,7 @@ dev = [
     "mypy>=1.15.0",
     "pytest>=8.3.5",
     "ruff>=0.11.6",
+    "types-cachetools",
     "types-jmespath~=1.0.1",
 ]

@@ -47,6 +49,13 @@ extend-exclude = [
     "src/mcp_server_datahub/_version.py", # Generated by setuptools-scm
 ]

+[tool.mypy]
+# Exclude shared tests that use datahub_integrations imports
+# These work via conftest.py compatibility shim but mypy can't see it
+exclude = [
+    "^tests/test_mcp/",
+]
+
 [tool.uv]
 cache-keys = [{ file = "pyproject.toml" }, { git = true }]
5 changes: 4 additions & 1 deletion src/mcp_server_datahub/__main__.py
@@ -9,10 +9,13 @@

 from mcp_server_datahub._telemetry import TelemetryMiddleware
 from mcp_server_datahub._version import __version__
-from mcp_server_datahub.mcp_server import mcp, with_datahub_client
+from mcp_server_datahub.mcp_server import mcp, register_all_tools, with_datahub_client

 logging.basicConfig(level=logging.INFO)

+# Register tools with OSS-compatible descriptions
+register_all_tools(is_oss=True)
+

 @click.command()
 @click.version_option(version=__version__)
106 changes: 106 additions & 0 deletions src/mcp_server_datahub/_token_estimator.py
@@ -0,0 +1,106 @@
"""Token count estimation utilities for MCP responses.

IMPORTANT: This file is kept in sync between two repositories:
- datahub-integrations-service: src/datahub_integrations/mcp/_token_estimator.py
- mcp-server-datahub: src/mcp_server_datahub/_token_estimator.py

When making changes, ensure both versions remain identical.
"""

from functools import lru_cache
from typing import Union

from loguru import logger


class TokenCountEstimator:
"""Fast token estimation for MCP response budget management.

Uses character-based heuristics instead of actual tokenization for performance.
Accuracy is sufficient given the 90% budget buffer used in practice.
"""

def __init__(self, model: str):
"""Initialize the token estimator.

Args:
model: The target model name (for future model-specific optimizations)
"""
self.model = model

@staticmethod
@lru_cache(maxsize=100)
def estimate_tokens(text: str) -> int:
"""
Fast token estimation using character-based heuristic.

Uses 1.3 * len(text) / 4 which empirically approximates token counts
for entity metadata. This approach is preferred over tiktoken because:
- Faster (no tokenizer overhead)
- More robust for structured/repetitive content
- No dependency on tokenizer libraries
- Accuracy is sufficient with 90% budget buffer

Returns:
Approximate token count
"""

return int(1.3 * len(text) / 4)

@staticmethod
def estimate_dict_tokens(
obj: Union[dict, list, str, int, float, bool, None],
) -> int:
"""
Fast approximation of token count for dict/list structures without JSON serialization.

Recursively walks structure counting characters. Much faster than json.dumps + estimate_tokens.

IMPORTANT: Assumes no circular references in the structure.
Protected against infinite recursion with MAX_DEPTH=100.

Args:
obj: Dict, list, or primitive value (must not contain circular references)

Returns:
Approximate token count
"""
MAX_DEPTH = 100

def _count_chars(item, depth: int = 0) -> int:
if depth > MAX_DEPTH:
logger.error(
f"Max depth {MAX_DEPTH} exceeded in structure, stopping recursion"
)
return 0

if item is None:
return 4 # "null"
elif isinstance(item, bool):
return 5 # "true" or "false"
elif isinstance(item, str):
# Account for:
# - Quotes around string values: "value" → +6
# - Escape characters (\n, \", \\, etc.) → +10% of length
# Structural chars weighted heavier as they often tokenize separately
base_length = len(item)
escape_overhead = int(base_length * 0.1)
return base_length + 6 + escape_overhead
elif isinstance(item, (int, float)):
return 6 # Average number length
elif isinstance(item, list):
return sum(_count_chars(elem, depth + 1) for elem in item) + len(item)
elif isinstance(item, dict):
total = 0
for key, value in item.items():
# Account for: "key": value, → 2 quotes + colon + space + comma
# Structural chars weighted heavier (often separate tokens)
total += len(str(key)) + 9
total += _count_chars(value, depth + 1)
return total + len(item) # Additional padding for structure
else:
return 10 # Fallback for other types

chars = _count_chars(obj, depth=0)
# Use same formula as estimate_tokens for consistency
return int(1.3 * chars / 4)