
improvement(tools): many new tools, redesigned existing tools, better token management#55

Merged
alexsku merged 12 commits into main from sa-new-mcp-tools-AI-149
Nov 19, 2025

Conversation

@alexsku
Contributor

@alexsku alexsku commented Nov 18, 2025

Sync from the hosted version. I added CHANGELOG.md to keep track of changes.

Contributor

@mayurinehate mayurinehate left a comment


Mostly looks good.
Minor suggestions for sanity of OSS repo.

CHANGELOG.md Outdated
Comment on lines +55 to +66
#### New Files
- `_token_estimator.py`: Token counting utilities
- `gql/query_entity.gql`: Specialized query for QueryEntity type

### Changed

- **Complete rewrite of `mcp_server.py`** (2,513 lines vs 662 in previous version)
- **GraphQL API migration**: `scrollAcrossEntities` → `searchAcrossEntities`
- Replaced `scrollId` parameter with `start` for pagination
- Added `viewUrn` and `sortInput` parameters
- **Updated `gql/search.gql`**: Modern search API with pagination
- **Updated `gql/entity_details.gql`**: Reformatted `#[CLOUD]` markers for better maintainability
Contributor


I think we should omit the New Files and Changed sections. I don't think they add much value; plus, this is the most basic thing git gives you right out of the box.

Contributor


I'd agree.

CHANGELOG.md Outdated
Comment on lines +82 to +121
#### Tests Synced from Integrations Service
**Added**: Comprehensive test suite synced from internal integrations service (12 test files in `tests/mcp/`):
- Entity retrieval and queries
- Column lineage extraction
- Filter conversion logic
- Schema field operations
- Tag processing
- Lineage path calculations
- Helper function tests

**Not Synced**: `test_mcp_telemetry.py` and `test_mcp_server.py` (service-specific)

**Note**: Tests use `datahub_integrations` imports to remain identical to source. These are reference implementations - adaptation needed to run in OSS.

#### GraphQL Query Changes
- `scrollAcrossEntities` → `searchAcrossEntities`
- `scrollId` parameter removed (use `start` instead)
- New required parameter: `start` (integer offset)
- New optional parameters: `viewUrn`, `sortInput`

**Migration**: See examples in README or documentation.

### Documentation

- Added module docstrings explaining repository sync requirements
- Inline comments documenting the importance of relative imports for cross-repo compatibility

### Compatibility

- **DataHub OSS**: Full compatibility (cloud-specific fields automatically disabled)
- **DataHub Cloud**: Enhanced features when cloud fields available
- **GMS Versions**: Adaptive compatibility with version detection
- **MCP Protocol**: Compatible with MCP 2.0+ (uses FastMCP 2.10.5)

### Internal

- Synced from internal DataHub integrations service (commit: 7077a9ce72)
- Service-specific files (`router.py`, `mcp_telemetry.py`) intentionally not included in open source version

---
Contributor


Let's remove everything from the section #### Tests Synced from Integrations Service onwards, down to the old release sections ## [0.3.11] and earlier. It's mostly repetition and needless.


# Log configuration on startup
if not DISABLE_DEFAULT_VIEW:
    logger.info("Default view application ENABLED (cache TTL: 5 minutes)")
Contributor


This change is not captured in changelog

- queryCountLast30DaysFeature: Number of queries in last 30 days
- rowCountFeature: Table row count
- sizeInBytesFeature: Table size in bytes
- writeCountLast30DaysFeature: Number of writes/updates in last 30 days
Contributor


I suspect these are cloud-only fields. What is the behavior if non-existing fields are passed as sort input ?

Contributor


Good question..

break

# If in OSS repo, create datahub_integrations compatibility shim
if using_oss:
Contributor


I'd suggest creating an OSS-compatible shim in the cloud-only repo rather than adding a cloud-compatible shim in the OSS repo.

Contributor


Or a package-rename copy script would also do. This would also help make lint happy.

Contributor Author


i.e. copy this file to the cloud, right? I will do this.

Contributor


Yes, and do a reverse shim.


I agree with this. Let's keep any cloud logic in the cloud-only repo to avoid future problems.

@mayurinehate mayurinehate self-requested a review November 18, 2025 15:32
Contributor

@mayurinehate mayurinehate left a comment


Please fix failing lints

@mayurinehate mayurinehate self-requested a review November 18, 2025 17:30
@mayurinehate mayurinehate dismissed their stale review November 18, 2025 17:30

unblocking merge


## Questions or Issues?

- Open an issue: https://github.com/acryldata/mcp-server-datahub/issues
Contributor


Do we really want to commit this file?

What is the expectation going forward, that this file will be constantly updated on each change?

Why not just keep release notes as the place to communicate what we've shipped?

Contributor Author


What are the release notes? Where are they?

#[CLOUD] }
#[CLOUD] }
#[CLOUD] }
statsSummary { #[CLOUD]
Contributor


These are all valid fields for OSS servers? Just double checking..

Contributor Author


It does seem to fail.

}
}
}
sqlAssertion {
Contributor


Just to confirm, all of these data models are available in OSS as well right?

Contributor Author


I don't see failures.

    Truncate a SQL query if it exceeds the maximum length.
    """
    return truncate_with_ellipsis(
        query, QUERY_LENGTH_HARD_LIMIT, suffix="... [truncated]"
    )
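For context, `truncate_with_ellipsis` as called above could be as simple as the following sketch (a hypothetical reconstruction, not the repo's actual helper):

```python
def truncate_with_ellipsis(text: str, max_length: int, suffix: str = "...") -> str:
    """Return text unchanged if it fits, otherwise cut it and append a suffix."""
    if len(text) <= max_length:
        return text
    # Reserve room for the suffix so the result never exceeds max_length.
    return text[: max_length - len(suffix)] + suffix
```

The point of the hard limit is to keep tool responses within the model's token budget, which is exactly why the truncation behavior discussed below matters.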
Contributor


Just need to be careful about this for the SQL query generation flow. As Anna pointed out, we were seeing a good bit of truncation.

Contributor Author


We should discuss the truncation issue. I think the issue Anna was seeing was related to an incorrectly passed (or preserved) parameter for the chat type: the code was truncated due to the Slack limitation even though it was run from the web UI.



def _enable_newer_gms_fields(query: str) -> str:
"""
Contributor


This is interesting. I would definitely not expect to see these methods in the mcp_server.py file directly. Doesn't this feel like a good candidate to extract into a unit-tested utility file?

Contributor Author


In the past (until this merge) we were copying only mcp_server between OSS and the fork, hence everything was in mcp_server. Now, since we are copying the entire folder, we can refactor and split mcp_server.


def _enable_cloud_fields(query: str) -> str:
    return query.replace("#[CLOUD]", "")
Contributor


Interesting approach. I had no idea that those gql comments were actually used at runtime to filter out specific fields.


try:
    # Only DataHub Cloud has a frontend base url.
    # Cloud instances typically run newer GMS versions with additional fields.
Contributor


Is there really NO BETTER WAY to detect whether we have a Cloud instance than to do this?

This feels like a fragile check. What if tomorrow we add frontend_base_url to Open Source?


Are there any env vars that we can use that have acryl in the name to detect cloud vs open source?

    query = _enable_newer_gms_fields(query)
    newer_gms_enabled_for_this_query = True
else:
    query = _disable_newer_gms_fields(query)
Contributor


What is a "newer" field vs. an "older" field?

Aren't new fields eventually "old" at some point? When does that happen?

Contributor


This whole thing feels a bit odd to me.

The fact that we have all this query manipulation directly lumped inside a method called execute_graphql just feels overwhelming.

Perhaps you should break this out into a separate function for easier reading, testing, and maintainability.

execute_graphql: ...

    # 1. Resolve the final GraphQL query to execute.
    # This is where the complex server-side multiplexing occurs.
    final_query = _resolve_query(query)


# Retry with newer GMS fields disabled - process both tags again
try:
    fallback_query = original_query
Contributor


I really dislike how all of this complexity is just directly lumped into the execute_graphql method.

Why not extract this into a method named to convey exactly what is happening? E.g. retry_simplified_graphql_query or something.
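A sketch of that extraction (hypothetical names throughout, including the `#[NEWER_GMS]` marker, which merely stands in for however the server actually tags newer-GMS fields):

```python
def _disable_newer_gms_fields(query: str) -> str:
    # Placeholder for the real rewriter: drop fields only newer GMS
    # versions understand (the marker name here is invented).
    return "\n".join(
        line for line in query.splitlines() if "#[NEWER_GMS]" not in line
    )


def retry_simplified_graphql_query(execute, original_query: str, variables: dict):
    """Named fallback path: re-run the query with newer-GMS fields disabled,
    instead of inlining this branch inside execute_graphql."""
    fallback_query = _disable_newer_gms_fields(original_query)
    return execute(fallback_query, variables)
```

A helper with a descriptive name like this can be unit tested in isolation, which is the point of the suggestion.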



def clean_get_entity_response(raw_response: dict) -> dict:
def _sort_fields_by_priority(fields: List[dict]) -> Iterator[dict]:
Contributor


Another obvious utility method that likely should not live directly inside of mcp_server.py.

This file is already getting quite long and difficult to read at 650+ lines.

Contributor


Correction - this file is already getting to 1400+ lines!!!

Contributor


Correction - this file is already getting to 2000+ lines!!!!

@mcp.tool(description="Get an entity by its DataHub URN.")
@async_background
def get_entity(urn: str) -> dict:
def get_entities(urns: List[str] | str) -> List[dict] | dict:
Contributor


Nice glad to see we are adding this.

f"This can happen if the entity has no aspects ingested yet, or if there's a permissions issue."
)

inject_urls_for_urns(client._graph, result, [""])
Contributor


We might need to be careful about this method, especially if it depends on frontend_base_url from the client to build the URL.

def list_schema_fields(
    urn: str,
    keywords: Optional[List[str] | str] = None,
    limit: int = 100,
Contributor


Nitpick: If we are using start, count for search pagination, why not also call these variables "start" and "count" instead of "limit" and "offset"?

    "description": "User's email",  # matches both
    "tags": ["PII"],  # matches neither
}
Score = 4 (email in fieldPath + email in desc + user in fieldPath + user in desc)
Contributor


Just curious, it looks like we have multiple functions in here that compute scores to help sort schema fields. Is there any simple way to unify them?

I know that this method takes the keywords as well, but just thinking aloud..
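One possible unification, with hypothetical names: a single keyword scorer that both sort paths could call, reproducing the worked example in the diff (score 4 for keywords ["email", "user"]):

```python
from typing import List


def keyword_score(field: dict, keywords: List[str]) -> int:
    """Count keyword hits across fieldPath, description, and tags.

    Each keyword scores once per attribute it appears in, matching the
    scoring described in the docstring example above.
    """
    haystacks = [
        field.get("fieldPath", ""),
        field.get("description", ""),
        " ".join(field.get("tags", [])),
    ]
    return sum(
        1
        for kw in keywords
        for text in haystacks
        if kw.lower() in text.lower()
    )
```

With a shared scorer like this, the keyword-aware sort and any priority-based sort differ only in the key function they pass to `sorted`.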

}


def _convert_custom_filter_format(filters_obj: Any) -> Any:
Contributor


Another obvious utility function that doesn't belong in the MCP server file.

return filters_obj


def _search_implementation(
Contributor


Crazy that the method that implements search -- our most popular MCP tool -- can only be found by scrolling to line 1293 of this file.

register_search_tools(mcp)

# Register get_lineage tool
mcp.tool(name="get_lineage", description=get_lineage.__doc__)(
Contributor


On line 2501 we actually see what tools are exposed :) This, to me, is just not acceptable from a code quality point of view.

Contributor

@jjoyce0510 jjoyce0510 left a comment


I find the MCP server file very difficult to review.

I would expect that file to contain basic entry points into the tools that are callable for the MCP server, with delegation to logic defined elsewhere (managers, service classes, helper/utility classes). Currently there is poor separation of concerns and everything is lumped into one 2,000+ line file; I don't really see where this can go from here.

Breaking this apart into more modular components would allow us to more systematically unit test each of the logic-heavy functions for doing things like truncating query responses, remapping queries to remove fields, etc.

As a reader, I want to see progressive disclosure of logical complexity, not everything thrown at me in a vertical pile at once. This is not necessarily blocking the PR immediately, but I think it is in our interest to invest in cleaning up the structure of this codebase.

async def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _R:
    try:
        return await asyncer.asyncify(fn)(*args, **kwargs)

We should eventually move to async functions for tool calls so we don't need to wrap the methods with asyncer. That will help us scale better in the future.
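The suggestion, sketched (illustrative only; the real tools call DataHub APIs): instead of pushing sync tool bodies onto worker threads via `asyncer.asyncify`, define the bodies as native coroutines so the event loop handles concurrency directly.

```python
import asyncio


# Today (simplified): a sync tool body that must be hopped onto a thread.
def get_entity_sync(urn: str) -> dict:
    return {"urn": urn}


# The proposed direction: a native async tool, awaitable with no thread hop.
async def get_entity(urn: str) -> dict:
    # A real implementation would `await` an async HTTP/GraphQL client here;
    # the sleep(0) is just a stand-in yield point.
    await asyncio.sleep(0)
    return {"urn": urn}


async def main() -> dict:
    return await get_entity("urn:li:dataset:example")
```

Native coroutines avoid the per-call thread-pool overhead of the asyncify wrapper, which is the scaling concern raised here.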

logger.info(f"Registering MCP tools (is_oss={is_oss})")

# Choose sorting documentation based on deployment type
if not is_oss:

Do we want to expose this in the open source repo?


@nwadams nwadams left a comment


LGTM from me. I think long term we'll want to refactor the mcp_server.py and tools into separate classes.

@alexsku alexsku force-pushed the sa-new-mcp-tools-AI-149 branch from b778b8d to 7f3fb34 Compare November 19, 2025 01:47
@alexsku alexsku merged commit 8761474 into main Nov 19, 2025
2 checks passed
@alexsku alexsku deleted the sa-new-mcp-tools-AI-149 branch November 19, 2025 02:24