Merged
10 changes: 2 additions & 8 deletions .github/workflows/ci.yml
@@ -34,12 +34,6 @@ jobs:
           # Test against longtail (Cloud instance)
           - test_env: "longtail"
             datahub_version: "cloud"
-          # Test against OSS v1.3.0 (latest stable)
-          - test_env: "oss"
-            datahub_version: "v1.3.0"
-          # Test against latest OSS (head)
-          - test_env: "oss"
-            datahub_version: "head"
       fail-fast: false

     steps:
@@ -82,15 +76,15 @@ jobs:
           DATAHUB_GMS_URL: ${{ secrets.LONGTAIL_GMS_URL }}
           DATAHUB_GMS_TOKEN: ${{ secrets.LONGTAIL_GMS_TOKEN }}
         run: |
-          uv run pytest tests/test_mcp_server.py -v
+          uv run pytest tests/test_mcp_integration.py -v

       - name: Run integration tests (OSS)
         if: matrix.test_env == 'oss'
         env:
           DATAHUB_GMS_URL: "http://localhost:8080"
           DATAHUB_GMS_TOKEN: ""
         run: |
-          uv run pytest tests/test_mcp_server.py -v
+          uv run pytest tests/test_mcp_integration.py -v

       # Cleanup
       - name: Cleanup DataHub (OSS only)
122 changes: 122 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,122 @@
# Changelog

All notable changes to mcp-server-datahub will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2025-11-17

### Added

#### Response Token Budget Management
- **New `TokenCountEstimator` class** for fast token counting using character-based heuristics
- **Automatic result truncation** via `_select_results_within_budget()` to prevent context window issues
- **Configurable token limits**:
- `TOOL_RESPONSE_TOKEN_LIMIT` environment variable (default: 80,000 tokens)
- `ENTITY_SCHEMA_TOKEN_BUDGET` environment variable (default: 16,000 tokens per entity)
- **90% safety buffer** to account for token estimation inaccuracies
- Ensures at least one result is always returned
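As a rough illustration of the selection behavior described above (the helper name, signature, and inline estimator below are assumptions for the sketch, not the shipped `_select_results_within_budget` implementation):

```python
from typing import Iterable, Iterator


def select_results_within_budget(
    results: Iterable[dict],
    token_limit: int = 80_000,
    safety_factor: float = 0.9,
) -> Iterator[dict]:
    """Yield results until the buffered token budget is exhausted.

    Mirrors the behavior described above: a 90% safety buffer is applied
    to the configured limit, and at least one result is always returned.
    """

    def estimate_tokens(obj: dict) -> int:
        # Character-based heuristic, same shape as TokenCountEstimator's
        return int(1.3 * len(str(obj)) / 4)

    budget = int(token_limit * safety_factor)
    used = 0
    for i, result in enumerate(results):
        cost = estimate_tokens(result)
        if i > 0 and used + cost > budget:
            break  # budget exhausted; stop yielding further results
        used += cost
        yield result
```

Because it is a generator, results past the cutoff are never materialized.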

#### Enhanced Search Capabilities
- **Enhanced Keyword Search**:
- Supports pagination with `start` parameter
- Added `viewUrn` for view-based filtering
- Added `sortInput` for custom sorting

#### Query Entity Support
- **Native QueryEntity type support** (SQL queries as first-class entities)
- New `query_entity.gql` GraphQL query
- Optimized entity retrieval with specialized query for QueryEntity types
- Includes query statement, subjects (datasets/fields), and platform information

#### GraphQL Compatibility
- **Adaptive field detection** for newer GMS versions
- Caching mechanism for GMS version detection
- Graceful fallback when newer fields aren't available
- Support for `#[CLOUD]` and `#[NEWER_GMS]` conditional field markers
- `DISABLE_NEWER_GMS_FIELD_DETECTION` environment variable override
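The conditional field markers work roughly like this sketch (the function name and exact marker handling are illustrative assumptions; the real logic also consults the cached GMS version detection):

```python
import re


def strip_unsupported_fields(gql: str, *, is_cloud: bool, has_newer_gms: bool) -> str:
    """Drop GraphQL lines tagged with markers the target server can't serve."""
    kept = []
    for line in gql.splitlines():
        if "#[CLOUD]" in line and not is_cloud:
            continue  # field only exists on DataHub Cloud
        if "#[NEWER_GMS]" in line and not has_newer_gms:
            continue  # field only exists on newer GMS versions
        # Remove the marker comment itself before sending the query
        kept.append(re.sub(r"\s*#\[(CLOUD|NEWER_GMS)\]", "", line))
    return "\n".join(kept)
```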

#### Schema Field Optimization
- **Smart field prioritization** to stay within token budgets:
1. Primary key fields (`isPartOfKey=true`)
2. Partitioning key fields (`isPartitioningKey=true`)
3. Fields with descriptions
4. Fields with tags or glossary terms
5. Alphabetically by field path
- Generator-based approach for memory efficiency
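A minimal sketch of this priority ordering as a Python sort key (the helper itself is hypothetical; the dict keys mirror the field properties named above):

```python
def schema_field_priority(field: dict) -> tuple:
    """Sort key implementing the priority order above (lower tuples sort first)."""
    return (
        not field.get("isPartOfKey", False),        # 1. primary key fields first
        not field.get("isPartitioningKey", False),  # 2. then partitioning keys
        not field.get("description"),               # 3. then documented fields
        not (field.get("tags") or field.get("glossaryTerms")),  # 4. tagged fields
        field.get("fieldPath", ""),                 # 5. alphabetical tiebreak
    )

# usage: prioritized = sorted(schema_fields, key=schema_field_priority)
```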

#### Error Handling & Security
- **Enhanced error logging** with full stack traces in `async_background` wrapper
- Logs function name, args, and kwargs on failures
- **ReDoS protection** in HTML sanitization with bounded regex patterns
- **Query truncation** function (configurable via `QUERY_LENGTH_HARD_LIMIT`, default: 5,000 chars)
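Query truncation amounts to a bounded slice; a sketch (the function name and truncation marker are assumptions):

```python
QUERY_LENGTH_HARD_LIMIT = 5_000  # default; configurable via env var in the server


def truncate_query(statement: str, limit: int = QUERY_LENGTH_HARD_LIMIT) -> str:
    """Truncate long SQL statements, marking the cut so readers know."""
    if len(statement) <= limit:
        return statement
    return statement[:limit] + "\n-- [truncated]"
```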

#### Default Views Support
- **Automatic default view application** for all search operations
- Fetches organization's default global view from DataHub
- **5-minute caching** (configurable via `VIEW_CACHE_TTL_SECONDS`)
- Can be disabled via `DATAHUB_MCP_DISABLE_DEFAULT_VIEW` environment variable
- Ensures search results respect organization's data governance policies

### Dependencies

- **Added** `cachetools>=5.0.0`: For GMS field detection caching
- **Added** `types-cachetools` (dev): Type stubs for mypy

### Performance

- **Memory efficiency**: Generator-based result selection avoids loading all results into memory
- **Caching**: GMS version detection cached per graph instance
- **Fast token estimation**: Character-based heuristic (no tokenizer overhead)
- **Smart truncation**: Truncates less important schema fields first

---

## [0.3.11] and earlier

See git history for changes in earlier versions.

---

## Migration Guide

### Environment Variables (New in 0.4.0)

```bash
# Configure token limits (optional)
export TOOL_RESPONSE_TOKEN_LIMIT=80000
export ENTITY_SCHEMA_TOKEN_BUDGET=16000

# Disable newer GMS field detection if needed
export DISABLE_NEWER_GMS_FIELD_DETECTION=true

# Disable default view application (optional)
export DATAHUB_MCP_DISABLE_DEFAULT_VIEW=true
```

### Search Examples (New in 0.4.0)

```python
# Keyword search with filters
result = search(
    query="/q revenue_*",
    filters={"entity_type": ["DATASET"]},
    num_results=10,
)

# Search with view filtering and sorting
result = search(
    query="customer data",
    viewUrn="urn:li:dataHubView:...",
    sortInput={"sortBy": "RELEVANCE", "sortOrder": "DESCENDING"},
    num_results=10,
)
```

---

## Questions or Issues?

- Open an issue: https://github.com/acryldata/mcp-server-datahub/issues
Contributor
Do we really want to commit this file?

What is the expectation going forward, that this file will be constantly updated on each change?

Why not just keep release notes as the place to communicate what we've shipped?

Collaborator Author
what are the release notes? where is that?
- Documentation: https://docs.datahub.com/docs/features/feature-guides/mcp
9 changes: 9 additions & 0 deletions pyproject.toml
@@ -7,6 +7,7 @@ requires-python = ">=3.10"
 dependencies = [
     "acryl-datahub==1.2.0.2",
     "asyncer>=0.0.8",
+    "cachetools>=5.0.0",
     "fastmcp==2.10.5",
     "jmespath~=1.0.1",
     "loguru",
@@ -20,6 +21,7 @@ dev = [
     "mypy>=1.15.0",
     "pytest>=8.3.5",
     "ruff>=0.11.6",
+    "types-cachetools",
     "types-jmespath~=1.0.1",
 ]

@@ -47,6 +49,13 @@ extend-exclude = [
     "src/mcp_server_datahub/_version.py", # Generated by setuptools-scm
 ]

+[tool.mypy]
+# Exclude shared tests that use datahub_integrations imports
+# These work via conftest.py compatibility shim but mypy can't see it
+exclude = [
+    "^tests/test_mcp/",
+]
+
 [tool.uv]
 cache-keys = [{ file = "pyproject.toml" }, { git = true }]
5 changes: 4 additions & 1 deletion src/mcp_server_datahub/__main__.py
@@ -9,10 +9,13 @@

 from mcp_server_datahub._telemetry import TelemetryMiddleware
 from mcp_server_datahub._version import __version__
-from mcp_server_datahub.mcp_server import mcp, with_datahub_client
+from mcp_server_datahub.mcp_server import mcp, register_all_tools, with_datahub_client

 logging.basicConfig(level=logging.INFO)

+# Register tools with OSS-compatible descriptions
+register_all_tools(is_oss=True)
+

 @click.command()
 @click.version_option(version=__version__)
106 changes: 106 additions & 0 deletions src/mcp_server_datahub/_token_estimator.py
@@ -0,0 +1,106 @@
"""Token count estimation utilities for MCP responses.

IMPORTANT: This file is kept in sync between two repositories:
- datahub-integrations-service: src/datahub_integrations/mcp/_token_estimator.py
- mcp-server-datahub: src/mcp_server_datahub/_token_estimator.py

When making changes, ensure both versions remain identical.
"""

from functools import lru_cache
from typing import Union

from loguru import logger


class TokenCountEstimator:
"""Fast token estimation for MCP response budget management.

Uses character-based heuristics instead of actual tokenization for performance.
Accuracy is sufficient given the 90% budget buffer used in practice.
"""

def __init__(self, model: str):
"""Initialize the token estimator.

Args:
model: The target model name (for future model-specific optimizations)
"""
self.model = model

@staticmethod
@lru_cache(maxsize=100)
def estimate_tokens(text: str) -> int:
"""
Fast token estimation using character-based heuristic.

Uses 1.3 * len(text) / 4 which empirically approximates token counts
for entity metadata. This approach is preferred over tiktoken because:
- Faster (no tokenizer overhead)
- More robust for structured/repetitive content
- No dependency on tokenizer libraries
- Accuracy is sufficient with 90% budget buffer

Returns:
Approximate token count
"""

return int(1.3 * len(text) / 4)

@staticmethod
def estimate_dict_tokens(
obj: Union[dict, list, str, int, float, bool, None],
) -> int:
"""
Fast approximation of token count for dict/list structures without JSON serialization.

Recursively walks structure counting characters. Much faster than json.dumps + estimate_tokens.

IMPORTANT: Assumes no circular references in the structure.
Protected against infinite recursion with MAX_DEPTH=100.

Args:
obj: Dict, list, or primitive value (must not contain circular references)

Returns:
Approximate token count
"""
MAX_DEPTH = 100

def _count_chars(item, depth: int = 0) -> int:
if depth > MAX_DEPTH:
logger.error(
f"Max depth {MAX_DEPTH} exceeded in structure, stopping recursion"
)
return 0

if item is None:
return 4 # "null"
elif isinstance(item, bool):
return 5 # "true" or "false"
elif isinstance(item, str):
# Account for:
# - Quotes around string values: "value" → +6
# - Escape characters (\n, \", \\, etc.) → +10% of length
# Structural chars weighted heavier as they often tokenize separately
base_length = len(item)
escape_overhead = int(base_length * 0.1)
return base_length + 6 + escape_overhead
elif isinstance(item, (int, float)):
return 6 # Average number length
elif isinstance(item, list):
return sum(_count_chars(elem, depth + 1) for elem in item) + len(item)
elif isinstance(item, dict):
total = 0
for key, value in item.items():
# Account for: "key": value, → 2 quotes + colon + space + comma
# Structural chars weighted heavier (often separate tokens)
total += len(str(key)) + 9
total += _count_chars(value, depth + 1)
return total + len(item) # Additional padding for structure
else:
return 10 # Fallback for other types

chars = _count_chars(obj, depth=0)
# Use same formula as estimate_tokens for consistency
return int(1.3 * chars / 4)