improvement(tools): many new tools, redesigned existing tools, better token management #55
Conversation
mayurinehate left a comment:
Mostly looks good. Minor suggestions for the sanity of the OSS repo.
CHANGELOG.md (outdated):
> #### New Files
> - `_token_estimator.py`: Token counting utilities
> - `gql/query_entity.gql`: Specialized query for QueryEntity type
>
> ### Changed
>
> - **Complete rewrite of `mcp_server.py`** (2,513 lines vs 662 in previous version)
> - **GraphQL API migration**: `scrollAcrossEntities` → `searchAcrossEntities`
>   - Replaced `scrollId` parameter with `start` for pagination
>   - Added `viewUrn` and `sortInput` parameters
> - **Updated `gql/search.gql`**: Modern search API with pagination
> - **Updated `gql/entity_details.gql`**: Reformatted `#[CLOUD]` markers for better maintainability
I think we should omit the New and Changed files sections. I don't think they add much value; plus, this is the most basic thing git gives right out of the box.
CHANGELOG.md (outdated):
> #### Tests Synced from Integrations Service
> **Added**: Comprehensive test suite synced from internal integrations service (12 test files in `tests/mcp/`):
> - Entity retrieval and queries
> - Column lineage extraction
> - Filter conversion logic
> - Schema field operations
> - Tag processing
> - Lineage path calculations
> - Helper function tests
>
> **Not Synced**: `test_mcp_telemetry.py` and `test_mcp_server.py` (service-specific)
>
> **Note**: Tests use `datahub_integrations` imports to remain identical to source. These are reference implementations; adaptation is needed to run in OSS.
>
> #### GraphQL Query Changes
> - `scrollAcrossEntities` → `searchAcrossEntities`
> - `scrollId` parameter removed (use `start` instead)
> - New required parameter: `start` (integer offset)
> - New optional parameters: `viewUrn`, `sortInput`
>
> **Migration**: See examples in README or documentation.
>
> ### Documentation
>
> - Added module docstrings explaining repository sync requirements
> - Inline comments documenting the importance of relative imports for cross-repo compatibility
>
> ### Compatibility
>
> - **DataHub OSS**: Full compatibility (cloud-specific fields automatically disabled)
> - **DataHub Cloud**: Enhanced features when cloud fields available
> - **GMS Versions**: Adaptive compatibility with version detection
> - **MCP Protocol**: Compatible with MCP 2.0+ (uses FastMCP 2.10.5)
>
> ### Internal
>
> - Synced from internal DataHub integrations service (commit: 7077a9ce72)
> - Service-specific files (`router.py`, `mcp_telemetry.py`) intentionally not included in open source version
>
> ---
Let's remove everything from the `#### Tests Synced from Integrations Service` section onwards, up to the old release section `## [0.3.11]` and earlier. It's mostly repetition and needless.
```python
# Log configuration on startup
if not DISABLE_DEFAULT_VIEW:
    logger.info("Default view application ENABLED (cache TTL: 5 minutes)")
```
This change is not captured in the changelog.
src/mcp_server_datahub/mcp_server.py (outdated):
```
- queryCountLast30DaysFeature: Number of queries in last 30 days
- rowCountFeature: Table row count
- sizeInBytesFeature: Table size in bytes
- writeCountLast30DaysFeature: Number of writes/updates in last 30 days
```
I suspect these are cloud-only fields. What is the behavior if non-existent fields are passed as sort input?
```python
break

# If in OSS repo, create datahub_integrations compatibility shim
if using_oss:
```
I'd suggest creating an OSS-compatible shim in the cloud-only repo rather than adding a cloud-compatible shim in the OSS repo.

Or a package-rename copy script would also do. This will also help make lint happy.

i.e., copy this file to the cloud, right? I will do this.

Yes, and do a reverse shim.

I agree with this. Let's keep any cloud logic in the cloud-only repo to avoid future problems.
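The "package rename copy script" idea from this thread could look something like the sketch below: copy a package tree and rewrite imports so the copy is self-consistent. The function name, package names, and regex-based rewrite are all assumptions for illustration; the real script would need to match the repos' actual layouts.

```python
# Hedged sketch of a package-rename copy script: copy src_dir to dst_dir,
# rewriting references to old_pkg (e.g. "datahub_integrations") into
# new_pkg (e.g. "mcp_server_datahub") so lint sees consistent imports.
import re
import shutil
from pathlib import Path


def copy_with_renamed_imports(src_dir: str, dst_dir: str,
                              old_pkg: str, new_pkg: str) -> int:
    """Copy a package, rewriting old_pkg -> new_pkg in all .py files.

    Returns the number of files whose contents were rewritten.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Word boundaries avoid clobbering names that merely contain old_pkg.
    pattern = re.compile(rf"\b{re.escape(old_pkg)}\b")
    rewritten = 0
    for py_file in dst.rglob("*.py"):
        text = py_file.read_text()
        new_text = pattern.sub(new_pkg, text)
        if new_text != text:
            py_file.write_text(new_text)
            rewritten += 1
    return rewritten
```

Run in the cloud repo's sync job, this would keep the OSS package name canonical and make the cloud copy the derived artifact, as suggested above.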
mayurinehate left a comment:
Please fix the failing lints.
> ## Questions or Issues?
>
> - Open an issue: https://github.com/acryldata/mcp-server-datahub/issues
Do we really want to commit this file? What is the expectation going forward, that this file will be constantly updated on each change? Why not just keep release notes as the place to communicate what we've shipped?

What are the release notes? Where is that?
```graphql
#[CLOUD] }
#[CLOUD] }
#[CLOUD] }
statsSummary { #[CLOUD]
```
These are all valid fields for OSS servers? Just double-checking.
```graphql
}
}
}
sqlAssertion {
```
Just to confirm, all of these data models are available in OSS as well, right?

I don't see failures.
```python
    Truncate a SQL query if it exceeds the maximum length.
    """
    return truncate_with_ellipsis(
        query, QUERY_LENGTH_HARD_LIMIT, suffix="... [truncated]"
```
Just need to be careful about this for the SQL query generation flow. As Anna pointed out, we were seeing a good bit of truncation.

We should discuss the truncation issue. I think the issue Anna was seeing was related to an incorrectly passed (or preserved) parameter for the chat type: the code was truncated due to the Slack limitation even though it was run from the web UI.
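For context, a `truncate_with_ellipsis`-style helper like the one called in the diff above typically looks like this minimal sketch. The real implementation in the PR may differ; the signature here is inferred from the call site.

```python
# Hedged sketch of a truncation helper matching the call
# truncate_with_ellipsis(query, QUERY_LENGTH_HARD_LIMIT, suffix="... [truncated]").
def truncate_with_ellipsis(text: str, max_length: int,
                           suffix: str = "...") -> str:
    """Return text unchanged if it fits, else cut it and append suffix.

    The suffix counts against max_length, so the result never exceeds it.
    """
    if len(text) <= max_length:
        return text
    return text[: max_length - len(suffix)] + suffix
```

The subtle point behind the truncation discussion is that the hard limit is applied regardless of where the request came from, which is why a Slack-sized limit leaking into web UI requests would show up as unexpected truncation.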
```python
def _enable_newer_gms_fields(query: str) -> str:
    """
```
This is interesting. I would definitely not expect to see these methods in the mcp_server.py file directly. Doesn't this feel like a good candidate to extract into a unit-tested utility file?

In the past (until this merge) we were copying only mcp_server between OSS and the fork, hence everything was in mcp_server. Now, since we are copying the entire folder, we can refactor and split mcp_server.
```python
def _enable_cloud_fields(query: str) -> str:
    return query.replace("#[CLOUD]", "")
```
Interesting approach. I had no idea that those gql comments were actually used at runtime to filter out specific fields.
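The marker technique works because `#` starts a comment in GraphQL, so a `#[CLOUD]`-prefixed line is inert until the marker is stripped. The enable path is quoted in the diff above; the disable helper below is an assumption about the counterpart, dropping marker-bearing lines entirely so stray trailing markers (like `statsSummary { #[CLOUD]`) don't leave unbalanced braces.

```python
# enable: strip the marker text so the hidden fields become live GraphQL.
def enable_cloud_fields(query: str) -> str:
    return query.replace("#[CLOUD]", "")


# disable (assumed counterpart, not shown in the diff): remove every line
# that carries the marker, whether it appears at line start or line end.
def disable_cloud_fields(query: str) -> str:
    return "\n".join(
        line for line in query.splitlines() if "#[CLOUD]" not in line
    )
```

The appeal of this design is that a single `.gql` file stays valid GraphQL in its disabled state while still carrying the cloud-only superset.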
```python
try:
    # Only DataHub Cloud has a frontend base url.
    # Cloud instances typically run newer GMS versions with additional fields.
```
Is there really NO BETTER WAY to detect whether we have a Cloud instance than to do this? This feels like a fragile check. What if tomorrow we add frontend_base_url to Open Source?

Are there any env vars with acryl in the name that we could use to detect cloud vs. open source?
```python
    query = _enable_newer_gms_fields(query)
    newer_gms_enabled_for_this_query = True
else:
    query = _disable_newer_gms_fields(query)
```
What is a "newer" field vs. an "older" field? Aren't new fields eventually "old" at some point? When does that happen?

This whole thing feels a bit odd to me. The fact that we have all this query manipulation directly lumped inside a method called execute_graphql just feels overwhelming. Perhaps you should break this out into a separate function for easier reading, testing, and maintainability:

```
execute_graphql: ...
    # 1. Resolve the final graphql query to execute
    final_query = _resolve_query(query)  # This is where the complex server-side multiplexing occurs.
```
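The suggested `_resolve_query` extraction could be sketched as below. The `#[CLOUD]` marker handling mirrors the helpers visible in the diff; the `#[NEWER_GMS]` marker name and the overall wiring are assumptions, since the newer-GMS helpers' bodies aren't shown.

```python
# Hedged sketch of a single query resolver, pulling the marker handling
# out of execute_graphql so it can be unit tested in isolation.
def resolve_query(query: str, is_cloud: bool, newer_gms: bool) -> str:
    """Return the final GraphQL text tailored to this server's capabilities."""
    if is_cloud:
        # Strip markers so cloud-only fields become live GraphQL.
        query = query.replace("#[CLOUD]", "")
    else:
        # Drop marker-bearing lines entirely for OSS servers.
        query = "\n".join(
            line for line in query.splitlines() if "#[CLOUD]" not in line
        )
    if newer_gms:
        # Marker name assumed; the diff only shows the function names.
        query = query.replace("#[NEWER_GMS]", "")
    return query
```

With this in place, `execute_graphql` reduces to `result = client.execute(resolve_query(query, is_cloud, newer_gms))` plus error handling, which is far easier to read and test.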
```python
# Retry with newer GMS fields disabled - process both tags again
try:
    fallback_query = original_query
```
I really dislike how all of this complexity is just directly lumped into the execute_graphql method. Why not extract a method that is named to convey exactly what is happening, e.g. retry_simplified_graphql_query or something?
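The extraction suggested here might look like the following sketch. The `execute` and `simplify` callables are injected because the PR's real call sites aren't visible in the diff; the function name comes straight from the reviewer's suggestion.

```python
# Hedged sketch of retry logic extracted from execute_graphql into a
# self-describing function: try the full query, and on failure retry
# once with the newer GMS fields stripped out.
from typing import Any, Callable


def retry_simplified_graphql_query(
    query: str,
    execute: Callable[[str], dict[str, Any]],
    simplify: Callable[[str], str],
) -> dict[str, Any]:
    """Run query; on failure, retry once with a simplified variant."""
    try:
        return execute(query)
    except Exception:
        # Older GMS rejected a newer field; fall back to the stripped query.
        return execute(simplify(query))
```

Besides readability, this shape makes the fallback path trivially unit-testable with a fake `execute` that fails on the first call.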
```python
def clean_get_entity_response(raw_response: dict) -> dict:
    def _sort_fields_by_priority(fields: List[dict]) -> Iterator[dict]:
```
Another obvious utility method that likely should not live directly inside mcp_server.py. This file is already getting quite long and difficult to read at 650+ lines.

Correction: this file is already at 1,400+ lines!

Correction: this file is already at 2,000+ lines!
```diff
 @mcp.tool(description="Get an entity by its DataHub URN.")
 @async_background
-def get_entity(urn: str) -> dict:
+def get_entities(urns: List[str] | str) -> List[dict] | dict:
```
Nice, glad to see we are adding this.
```python
    f"This can happen if the entity has no aspects ingested yet, or if there's a permissions issue."
)

inject_urls_for_urns(client._graph, result, [""])
```
We might need to be careful about this method, especially if it depends on frontend_base_url from the client to build the URL.
```python
def list_schema_fields(
    urn: str,
    keywords: Optional[List[str] | str] = None,
    limit: int = 100,
```
Nitpick: if we are using start/count for search pagination, why not also call these variables "start" and "count" instead of "limit" and "offset"?
```python
    "description": "User's email",  # matches both
    "tags": ["PII"]  # matches neither
}
Score = 4 (email in fieldPath + email in desc + user in fieldPath + user in desc)
```
Just curious: it looks like we have multiple functions in here that compute scores to help sort schema fields. Is there any simple way to unify them? I know that this method takes the keywords as well, but just thinking aloud.
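The scoring scheme described in the quoted docstring (one point per keyword hit per attribute) can be reconstructed as the sketch below. The exact weighting and attribute set in the PR aren't shown, so treat this as an illustrative reading of the `Score = 4` example, not the real implementation.

```python
# Hedged sketch of keyword scoring for schema fields: count one point for
# each case-insensitive (keyword, attribute) substring match across the
# field's path and description.
def score_field(field: dict, keywords: list[str]) -> int:
    """Count keyword hits across fieldPath and description."""
    haystacks = [
        field.get("fieldPath", "").lower(),
        field.get("description", "").lower(),
    ]
    return sum(
        1
        for kw in keywords
        for text in haystacks
        if kw.lower() in text
    )
```

A single parameterized scorer like this, with the attribute list as an argument, would be one way to unify the multiple scoring functions the comment mentions.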
```python
}

def _convert_custom_filter_format(filters_obj: Any) -> Any:
```
Another obvious utility function that doesn't belong in the MCP server file.
```python
return filters_obj

def _search_implementation(
```
Crazy that the method that implements search, our most popular MCP tool, can only be found by scrolling to line 1293 of this file.
src/mcp_server_datahub/mcp_server.py (outdated):
```python
register_search_tools(mcp)

# Register get_lineage tool
mcp.tool(name="get_lineage", description=get_lineage.__doc__)(
```
On line 2501 we actually see what tools are exposed :) This, to me, is just not acceptable from a code quality point of view.
jjoyce0510 left a comment:
I find the MCP server file very difficult to review.
I would expect that file to contain basic entry points into the tools that are callable for the MCP server, with delegation to logic defined elsewhere (managers, service classes, helper/utility classes). Currently there is poor separation of concerns and everything is lumped into one 2k+ line file; I don't really see where this can go from here.
Breaking this apart into more modular components would allow us to more systematically unit test each of the logic-heavy functions for doing things like truncating query responses, remapping queries to remove fields, etc.
As a reader, I want to see progressive disclosure of logical complexity, not everything thrown at me in a vertical pile at once. This is not necessarily blocking the PR immediately, but I think it is in our interest to invest in cleaning up the structure of this codebase.
```diff
 async def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _R:
-    return await asyncer.asyncify(fn)(*args, **kwargs)
+    try:
+        return await asyncer.asyncify(fn)(*args, **kwargs)
```
We should eventually move to async functions for tool calls so we don't need to wrap the methods in an asyncer. It will help us scale better in the future.
```python
logger.info(f"Registering MCP tools (is_oss={is_oss})")

# Choose sorting documentation based on deployment type
if not is_oss:
```
Do we want to expose this in the open source repo?
nwadams left a comment:
LGTM from me. I think long term we'll want to refactor mcp_server.py and the tools into separate classes.
Force-pushed from b778b8d to 7f3fb34.
Sync from the hosted version. I added CHANGELOG.md to keep track of changes.