
feat(backend): add DingTalk document to knowledge base MCP tool#971

Closed
parabala wants to merge 20 commits into wecode-ai:main from parabala:wegent/dingtalk-docs-to-knowledge

Conversation


parabala (Collaborator) commented Apr 14, 2026

Summary

Add support for importing DingTalk documents into Wegent knowledge bases.

Features

MCP Tools

  • get_dingtalk_document_info: Get document metadata from URL
  • add_dingtalk_doc_to_knowledge: Add document with text content
  • add_dingtalk_doc_with_attachment: Add document using attachment

DingTalk Docs Service

  • Document URL parsing and ID extraction
  • Document metadata retrieval
  • Filename generation with naming convention: {title}_{timestamp}.md

DingTalk Docs Skill

  • dingtalk_doc_to_kb tool for complete workflow:
    1. Start sandbox environment
    2. Fetch document info from DingTalk
    3. Download document content
    4. Save as {title}_{timestamp}.md
    5. Upload as attachment
    6. Create knowledge base document

Naming Convention

Documents are saved with the format: {title}_{timestamp}.md

  • Example: 产品需求文档_20260413170933.md
  • Timestamp format: YYYYMMDDHHMMSS

Files Added

  • app/services/dingtalk/__init__.py
  • app/services/dingtalk/docs_service.py
  • app/mcp_server/tools/dingtalk_docs.py
  • init_data/skills/dingtalk-docs/SKILL.md
  • init_data/skills/dingtalk-docs/__init__.py
  • init_data/skills/dingtalk-docs/provider.py

Files Modified

  • app/mcp_server/server.py (import dingtalk_docs module)

Summary by CodeRabbit

  • New Features
    • DingTalk document integration: fetch metadata, download content, and add DingTalk documents into knowledge bases.
    • New "dingtalk-connector" skill: end-to-end import workflow (fetch metadata, download, upload attachment, create knowledge document) with indexing/summary triggers.
    • Knowledge uploads now accept optional source metadata when creating documents.
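The optional source metadata mentioned above amounts to a backward-compatible signature extension; a sketch, where every parameter name other than source_config is a simplified assumption rather than the real orchestrator signature:

```python
from typing import Any, Optional

def create_document_with_content(
    knowledge_base_id: int,
    name: str,
    content: str,
    source_config: Optional[dict[str, Any]] = None,  # new, defaults to None
) -> dict[str, Any]:
    """Persist source metadata only when supplied, so existing
    text/file/attachment callers are unaffected."""
    record: dict[str, Any] = {
        "knowledge_base_id": knowledge_base_id,
        "name": name,
        "content": content,
    }
    if source_config is not None:
        record["source_config"] = source_config
    return record
```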


coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Adds DingTalk document support: registers new MCP tools, implements a DingTalk docs service that calls MCP tools and external endpoints, adds a sandboxed skill/provider to download/upload/import documents into knowledge bases, and extends the knowledge orchestrator to accept and persist source metadata.

Changes

Cohort / File(s) — Summary

  • MCP server & tools — backend/app/mcp_server/server.py, backend/app/mcp_server/tools/dingtalk_docs.py
    Registered dingtalk_docs in the MCP server and added four knowledge tools: get_dingtalk_document_info, download_dingtalk_document, add_dingtalk_doc_to_knowledge, add_dingtalk_doc_with_attachment. The tools handle user resolution and DB session lifecycle, call the DingTalk service/orchestrator, and return structured success/error payloads.
  • DingTalk service — backend/app/services/dingtalk/__init__.py, backend/app/services/dingtalk/docs_service.py
    New DingTalkDocsService (plus singleton dingtalk_docs_service) that parses DingTalk URLs, resolves per-user MCP config, invokes MCP JSON-RPC tools/call, and implements get_document_info and download_document_content with format/extension handling and mapped ValueError/auth errors.
  • Skill / provider (sandbox) — backend/init_data/skills/dingtalk-connector/SKILL.md, backend/init_data/skills/dingtalk-connector/__init__.py, backend/init_data/skills/dingtalk-connector/provider.py
    Added dingtalk-connector skill metadata and a provider exposing DingTalkDocsToolProvider and DingTalkDocToKBTool. The tool runs a sandboxed flow: call MCP tools for metadata/content, save the file in the sandbox, upload the attachment to the backend, then call MCP to create the KB document. Includes a Pydantic input model and error/status emissions.
  • Knowledge orchestrator — backend/app/services/knowledge/orchestrator.py
    Extended the create_document_with_content(...) signature to accept source_config and pass it into created KnowledgeDocumentCreate records (applied across text/file and attachment flows).
  • Init / packaging — backend/app/services/dingtalk/__init__.py, backend/init_data/skills/dingtalk-connector/__init__.py
    Added package initializers that re-export public symbols (dingtalk_docs_service, DingTalkDocsToolProvider).
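The URL parsing and ID extraction in DingTalkDocsService might look roughly like this. The URL patterns below are illustrative guesses, not the actual regexes in docs_service.py:

```python
import re

# Illustrative URL patterns; real DingTalk URL formats may differ.
_DOC_ID_PATTERNS = [
    re.compile(r"alidocs\.dingtalk\.com/i/nodes/([A-Za-z0-9]+)"),
    re.compile(r"[?&]docId=([A-Za-z0-9]+)"),
]

def extract_doc_id(doc_url: str) -> str:
    """Extract a document ID from a DingTalk docs URL, or raise ValueError."""
    for pattern in _DOC_ID_PATTERNS:
        match = pattern.search(doc_url)
        if match:
            return match.group(1)
    raise ValueError(f"Unrecognized DingTalk document URL: {doc_url}")
```

A ValueError here maps naturally onto the structured error payloads the MCP tools return.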

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Tool as DingTalkDocToKBTool
    participant Sandbox as Sandbox Env
    participant MCP as MCP Server (knowledge)
    participant Service as DingTalkDocsService
    participant Backend as Backend (attachments/orchestrator)

    User->>Tool: run(dingtalk_doc_url, knowledge_base_id)
    Tool->>Sandbox: create sandbox / emit running
    Sandbox->>MCP: tools/call get_dingtalk_document_info(doc_url)
    MCP->>Service: dispatch get_document_info
    Service-->>MCP: metadata (doc_id,title,updateTime)
    MCP-->>Sandbox: metadata response
    Sandbox->>MCP: tools/call download_dingtalk_document(doc_url, format=markdown)
    MCP->>Service: dispatch download_document_content
    Service-->>MCP: content + metadata
    MCP-->>Sandbox: document content
    Sandbox->>Backend: upload attachment (multipart)
    Backend-->>Sandbox: attachment_id
    Sandbox->>MCP: tools/call add_dingtalk_doc_with_attachment(kb_id, attachment_id,...)
    MCP->>Backend: create_document_with_content(..., source_config)
    Backend-->>MCP: document created (document_id)
    MCP-->>Sandbox: success response
    Sandbox-->>User: completed result (document_id, attachment_id, filename)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I hop through URLs and curl delight,
I fetch the docs and tuck them tight.
From sandbox burrow to knowledge tree,
Attachments, titles — all hop with me.
A rabbit’s cheer for DingTalk harmony.

🚥 Pre-merge checks: 3 passed
  • Description Check — ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check — ✅ Passed (the title accurately and concisely summarizes the main feature: adding DingTalk document import to the knowledge base via MCP tools)
  • Docstring Coverage — ✅ Passed (83.02%, above the 80.00% threshold)


coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-docs/provider.py (1)

672-685: Centralize filename generation instead of duplicating it here.

This sanitizer now exists in both backend/init_data/skills/dingtalk-docs/provider.py and backend/app/services/dingtalk/docs_service.py. Keeping two copies will drift the next time the naming rules change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 672 - 685,
Duplicate filename-sanitization logic exists in _build_filename (class in
provider.py) and in docs_service; extract the logic into a single shared
function (e.g., sanitize_doc_filename(title: str, modified_time: str) or
build_doc_filename(title, modified_time)) in a common module and have
_build_filename call that function instead of reimplementing the regex/untitled
fallback; update docs_service to import and use the same shared function so both
places reference one implementation and remove the duplicated
regex/strip/untitled code from _build_filename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 143-153: The tool currently treats doc_content as required and
returns an error if missing; update add_dingtalk_doc_to_knowledge (or the
function that contains this snippet) to honor the advertised fallback by
attempting to fetch the document from DingTalk when doc_content is falsy: call
the existing DingTalk fetch helper (e.g., fetch_dingtalk_doc,
fetch_doc_content_from_dingtalk, or dingtalk_client.get_document) using the
provided doc_id/metadata, assign the result to doc_content, and only return the
error if that fetch fails or returns empty; ensure the final returned dict
structure remains the same and includes the fetched content on success.
- Around line 66-70: The current get_dingtalk_document_info handler calls
asyncio.run(...) (invoking dingtalk_docs_service.get_document_info) which fails
inside FastMCP's running event loop; change the handler to be async (declare
async def get_dingtalk_document_info(...)) and await
dingtalk_docs_service.get_document_info(doc_url) directly, or alternatively
replace asyncio.run(...) with anyio.to_thread.run_sync(lambda: asyncio.run(...))
/ anyio.to_thread.run_sync(lambda:
dingtalk_docs_service.get_document_info(doc_url)) to run the blocking call off
the loop; update references to get_dingtalk_document_info and ensure its callers
support async invocation.

In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 143-162: get_document_info currently fabricates metadata; replace
the placeholder body in get_document_info(doc_id, doc_url) with a real DingTalk
API lookup that uses the service's credentials/token (e.g. the client or auth
field on this service) to fetch the document metadata (title, last modified
timestamp, content type and canonical URL) and map those into the return shape
(doc_id, title, modified_time ISO, modified_time_formatted as "%Y%m%d%H%M%S",
content_type, url). Ensure you handle API errors (raise or log and propagate
appropriately) and normalize the timestamp from the API into modified_time and
modified_time_formatted; preserve the same keys so downstream code expecting
title/modified_time/etc. continues to work.
- Around line 191-209: download_document_content currently returns placeholder
markdown instead of fetching real document data; replace the stub in
download_document_content to call DingTalk's export/download APIs using the
existing HTTP client or SDK, pass doc_info["doc_id"] and export_format to
request the exported file, download and decode the returned content, set
file_extension based on export_format, populate "content", "title",
"modified_time", "modified_time_formatted", and "doc_id" with the real values
from the response, and add error handling to raise or log failures instead of
returning fake content (refer to download_document_content, doc_info and
export_format to locate the implementation).

In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 465-499: The _download_document_content() helper currently returns
a hardcoded markdown placeholder instead of the real DingTalk document body;
replace the placeholder logic in _download_document_content (search for function
name _download_document_content and the doc_id extraction block) with a call to
the DingTalk document API using the extracted doc_id (use existing auth/token
helpers in the module or skill to obtain credentials), fetch the actual document
body (convert/format to markdown if needed), handle HTTP errors and empty
responses, and return {"success": True, "content": actual_content} (or
{"success": False, "error": ...} on failures) so imports persist real content
rather than canned text.
- Around line 77-82: The code currently falls back to inserting a
machine-specific path into sys.path to import BaseSandboxTool (the
sys.path.insert(...) branch before importing _base and BaseSandboxTool); remove
that fallback entirely so the module only tries the legitimate import locations
and then raises ImportError if not found. Update the try/except around importing
BaseSandboxTool to eliminate the hard-coded "/workspace/1760905/Wegent/..."
sys.path insertion and ensure the except ImportError simply raises a clear
ImportError mentioning BaseSandboxTool, referencing the import attempt of
BaseSandboxTool from _base.
- Around line 533-546: The current code builds a shell string in curl_cmd and
calls sandbox.commands.run(cmd=curl_cmd), which is vulnerable because
auth_token, file_path and upload_url are interpolated raw; fix it by avoiding a
raw shell string—either construct argv-style input (pass a list/sequence of args
to sandbox.commands.run instead of a single shell string) or reliably escape
each interpolated value using shlex.quote before composing the command; update
the code paths around curl_cmd and the sandbox.commands.run call (and any
helpers like _build_filename if used here) so the values for auth_token,
file_path and upload_url are not interpreted by the shell.
- Around line 605-633: The current approach embeds user-controlled doc_title
into a single-quoted curl command string (curl_cmd) and passes it to
sandbox.commands.run, which allows shell injection if doc_title contains quotes;
instead, avoid constructing a shell command with interpolated JSON—either (a)
perform the HTTP call directly in Python (e.g., use aiohttp/requests) and send
the JSON payload built from payload variable, or (b) write json.dumps(payload)
to a temporary file and invoke sandbox.commands.run with curl using -d
`@/tmp/file` (or pass the JSON via stdin) so no user input is injected into a
shell string; update the code paths that build payload, curl_cmd and the
sandbox.commands.run invocation to use one of these safe approaches and ensure
auth_token and mcp_url are passed as headers/args rather than concatenated into
a shell line.
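Both injection findings above reduce to the same fix pattern; a hedged sketch, where the function and argument names are illustrative rather than taken from provider.py:

```python
import shlex

def build_safe_upload_command(upload_url: str, auth_token: str, payload_path: str) -> str:
    """Compose a curl command with every interpolated value shell-escaped.

    The JSON payload is assumed to be written to payload_path beforehand and
    referenced with -d @file, so user-controlled strings such as document
    titles never pass through the shell.
    """
    return " ".join([
        "curl", "-sS", "-X", "POST",
        "-H", shlex.quote(f"Authorization: Bearer {auth_token}"),
        "-H", shlex.quote("Content-Type: application/json"),
        "-d", shlex.quote(f"@{payload_path}"),
        shlex.quote(upload_url),
    ])
```

Passing an argv list to the sandbox runner, where supported, is safer still, since no shell string is built at all.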

---

Nitpick comments:
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 672-685: Duplicate filename-sanitization logic exists in
_build_filename (class in provider.py) and in docs_service; extract the logic
into a single shared function (e.g., sanitize_doc_filename(title: str,
modified_time: str) or build_doc_filename(title, modified_time)) in a common
module and have _build_filename call that function instead of reimplementing the
regex/untitled fallback; update docs_service to import and use the same shared
function so both places reference one implementation and remove the duplicated
regex/strip/untitled code from _build_filename.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b7eb8315-74f6-465f-bbe1-0c282dba752f

📥 Commits

Reviewing files that changed from the base of the PR and between b3fbc93 and b6986a2.

📒 Files selected for processing (7)
  • backend/app/mcp_server/server.py
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/dingtalk/__init__.py
  • backend/app/services/dingtalk/docs_service.py
  • backend/init_data/skills/dingtalk-docs/SKILL.md
  • backend/init_data/skills/dingtalk-docs/__init__.py
  • backend/init_data/skills/dingtalk-docs/provider.py

Fixes applied:
1. Fix asyncio.run issue in get_dingtalk_document_info
   - Changed to async def and await directly

2. Fix add_dingtalk_doc_to_knowledge to fetch content when missing
   - Added logic to fetch document from DingTalk when doc_content is not provided
   - Falls back to downloading content via dingtalk_docs_service

3. Fix hardcoded path in provider.py import
   - Removed hardcoded /workspace path from sys.path.insert
   - Cleaned up import error handling

4. Fix shell injection vulnerabilities in curl commands
   - Use shlex.quote for command arguments
   - Pass commands as arrays instead of shell strings
   - Write JSON payload to temp file instead of embedding in command

5. Fix duplicate filename sanitization logic
   - Extracted sanitize_filename and build_dingtalk_doc_filename as shared functions
   - Updated docs_service.py to use shared functions
   - Updated provider.py to import from shared module

All Python files pass syntax check and are formatted with black.
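Fix 1 above (replacing asyncio.run with a native coroutine) follows this pattern; the service class here is a stand-in, not the real dingtalk_docs_service:

```python
import asyncio

class _StubDocsService:
    """Stand-in for dingtalk_docs_service, which exposes async methods."""
    async def get_document_info(self, doc_url: str) -> dict:
        return {"doc_id": "demo", "url": doc_url}

dingtalk_docs_service = _StubDocsService()

# Inside FastMCP's already-running event loop, asyncio.run() raises
# RuntimeError; an async handler can simply await the coroutine instead.
async def get_dingtalk_document_info(doc_url: str) -> dict:
    return await dingtalk_docs_service.get_document_info(doc_url)

info = asyncio.run(get_dingtalk_document_info("https://alidocs.dingtalk.com/i/nodes/x"))
```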
coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
backend/app/services/dingtalk/docs_service.py (2)

181-200: ⚠️ Potential issue | 🟠 Major

Placeholder implementation returns synthetic metadata.

get_document_info() fabricates title and modified_time from the URL and local clock instead of calling the DingTalk API. This means downstream filenames and KB records will contain synthetic data rather than actual document metadata.

This has been flagged in a previous review. Ensure this is addressed before the feature is production-ready.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 181 - 200, The
current get_document_info function returns synthetic metadata (title,
modified_time, modified_time_formatted) built from doc_id and local clock;
replace this placeholder by calling the DingTalk document metadata API to fetch
real values (title, lastModified, content type, url) and map them into the
returned dict (fields title, modified_time, modified_time_formatted,
content_type, url, doc_id); preserve a safe fallback path that logs the API
error and only then uses the synthetic values so downstream code (file naming
and KB records) receives real metadata when available and predictable fallback
data when the API fails.

229-247: ⚠️ Potential issue | 🟠 Major

Placeholder implementation returns dummy content.

download_document_content() returns hardcoded placeholder markdown instead of fetching actual document content from DingTalk. Any KB entry created through this path will contain fake content.

This has been flagged in a previous review. Ensure this is addressed before the feature is production-ready.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 229 - 247,
download_document_content currently returns hardcoded markdown; replace the
placeholder logic in download_document_content to call DingTalk's document
export/download APIs using the doc_info and export_format, download the exported
content (handling possible export conversion for "markdown" vs other formats),
parse/validate the response and populate "content", "title", "modified_time",
"modified_time_formatted", "file_extension" and "doc_id" from the real API
response, and add robust error handling and retries (log and raise on
unrecoverable failures) so KB entries contain the actual document content rather
than dummy text.
backend/init_data/skills/dingtalk-docs/provider.py (2)

408-452: ⚠️ Potential issue | 🟠 Major

Placeholder: _get_document_info_from_mcp does not call MCP or DingTalk.

This method extracts the document ID from the URL and fabricates a synthetic title/timestamp rather than fetching real metadata. The workflow cannot retrieve actual document information.

This has been flagged in a previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 408 - 452,
The _get_document_info_from_mcp function currently only parses doc_id from
doc_url and synthesizes metadata; replace the placeholder logic with a real
MCP/DingTalk metadata fetch using the provided sandbox (use the sandbox/tool
client or the MCP helper available in this environment) to call the appropriate
MCP endpoint with the extracted doc_id (or the full doc_url) and return the real
fields: success (bool), doc_id, title, modified_time (ISO),
modified_time_formatted; preserve the existing error handling pattern (return
{"success": False, "error": ...}) and ensure you fall back to an informative
error when the MCP call fails or returns no data so callers of
_get_document_info_from_mcp get real metadata instead of synthetic values.

454-497: ⚠️ Potential issue | 🟠 Major

Placeholder: _download_document_content returns template content.

This method generates placeholder markdown instead of downloading the actual DingTalk document body. KB entries created through this path will contain dummy text.

This has been flagged in a previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 454 - 497,
_download_document_content currently returns placeholder markdown instead of
fetching the real document body; replace the template generation with an actual
fetch from DingTalk via the MCP/client used elsewhere (use the same client or
helper that performs authenticated requests), extract the document ID from
doc_url (doc_id) using the existing patterns, call the appropriate MCP/DingTalk
API endpoint to retrieve the document content (handle JSON/html responses),
convert/clean the response into the expected markdown/plain string, and return
{"success": True, "content": content}; ensure robust error handling around the
network call (catch and log exceptions and return {"success": False, "error":
str(e)}), and reuse existing symbols/functions for HTTP auth/client if present
to avoid duplicating credentials.
🧹 Nitpick comments (8)
backend/app/mcp_server/tools/dingtalk_docs.py (2)

239-246: Consider making this handler async for consistency.

add_dingtalk_doc_with_attachment is defined as a synchronous function (def) while the other two tools are async def. If FastMCP handles both patterns correctly, this is fine, but async consistency would be cleaner and would allow using asyncio.to_thread() for the blocking orchestrator call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 239 - 246, The
function add_dingtalk_doc_with_attachment is synchronous while similar handlers
are async; change its signature to async def
add_dingtalk_doc_with_attachment(...) and update its internals to run the
blocking orchestrator call via asyncio.to_thread (or await an async wrapper) so
the handler is non-blocking and consistent with the other async tools; ensure
any callers or tests that expect the old sync signature are updated to await
add_dingtalk_doc_with_attachment and preserve existing behavior for
trigger_indexing/trigger_summary and return type Dict[str, Any].

180-186: Move datetime import to module level.

The datetime import is performed twice inside the function at runtime. Since it's already used elsewhere in the codebase, move the import to the top of the file for consistency and minor performance improvement.

♻️ Proposed fix

Add to module-level imports:

 import logging
+from datetime import datetime
 from typing import Any, Dict, Optional

Then remove the inline imports at lines 180 and 184:

             logger.warning(
                 f"Invalid modified_time format: {modified_time}, using current time"
             )
-            from datetime import datetime
-
             modified_time = datetime.now().strftime("%Y%m%d%H%M%S")
     else:
-        from datetime import datetime
-
         modified_time = datetime.now().strftime("%Y%m%d%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 180 - 186, Move
the repeated inline imports of datetime to the module level: add a single
module-level line "from datetime import datetime" to the top of the file, then
remove the two inline "from datetime import datetime" statements found near the
assignments to modified_time (the blocks that set modified_time =
datetime.now().strftime("%Y%m%d%H%M%S")). Ensure the existing uses of
modified_time still refer to the now-imported datetime and run tests/lint to
confirm no unused-import warnings.
backend/app/services/dingtalk/docs_service.py (2)

17-17: Unused import: httpx

The httpx module is imported but never used in the current implementation. This is likely intended for future DingTalk API calls.

♻️ Suggested fix

Either remove the unused import until needed:

-import httpx
-
 from app.core.config import settings

Or add a comment indicating it's reserved for future use:

import httpx  # Reserved for DingTalk API calls
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` at line 17, The import httpx
at module-level in docs_service.py is unused; either remove the import statement
entirely or retain it with an explicit comment (e.g., "Reserved for DingTalk API
calls") to indicate intentional future use so linters don't flag it — update the
module-level import for httpx accordingly.

141-147: Log the parse failure before falling through.

The outer except Exception: pass swallows all errors silently. While the fallback to current time is acceptable, logging the exception would aid debugging parse failures.

♻️ Proposed fix
             except ValueError:
                 continue
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Unexpected error parsing modified_time '{modified_time}': {e}")
 
     # Return current time as fallback
     logger.warning(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 141 - 147, The
outer "except Exception: pass" in the block that tries to parse modified_time
silently swallows failures; update that outer handler to log the exception
(including exception details) and context (the modified_time value and attempted
fmt) before falling back to current time—modify the outer except around the
datetime.strptime/strftime loop to call the module/class logger (or existing
logger instance) with a descriptive message and the caught exception information
so parse failures are visible for debugging.
backend/app/services/dingtalk/__init__.py (1)

14-19: Sort __all__ alphabetically for consistency.

Static analysis flags that __all__ is not sorted. Per isort standards, sorting helps maintainability.

♻️ Proposed fix
 __all__ = [
+    "build_dingtalk_doc_filename",
     "DingTalkDocsService",
     "dingtalk_docs_service",
     "sanitize_filename",
-    "build_dingtalk_doc_filename",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/__init__.py` around lines 14 - 19, The __all__
list in this module is not alphabetically sorted; reorder the entries in __all__
so they are lexicographically sorted (e.g., "DingTalkDocsService" before
"build_dingtalk_doc_filename", "dingtalk_docs_service", and "sanitize_filename")
to satisfy static analysis/isort expectations and maintain consistency.
backend/init_data/skills/dingtalk-docs/provider.py (3)

75-77: Use raise ... from None to clarify exception origin.

Per B904, when re-raising in an except block, use raise ... from None to indicate the exception is intentional and not from mishandled exception processing.

♻️ Proposed fix
     else:
-        raise ImportError(
+        raise ImportError(
             "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 75 - 77, The
ImportError being raised for failing to import BaseSandboxTool from
chat_shell.tools.sandbox._base should be re-raised using exception chaining
suppression; modify the raise in the provider module where BaseSandboxTool is
checked so it uses "raise ImportError(... ) from None" to indicate the error is
intentional and not caused by a prior exception.

461-462: Remove redundant import re inside method.

Same issue - re is already imported at module level.

♻️ Proposed fix
         try:
             # Extract doc ID for content generation
-            import re
-
             doc_id = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 461 - 462,
Remove the redundant local import of the regex module inside the method (the
`import re` line shown near the "Extract doc ID for content generation" block)
since `re` is already imported at module level; simply delete that inner import
so the method uses the module-level `re` symbol (locate the local import inside
the provider.py function that extracts the doc ID and remove it).

415-416: Remove redundant import re inside method.

re is already imported at the module level (line 20). The local import is unnecessary.

♻️ Proposed fix
         try:
             # Extract doc ID from URL
-            import re
-
             doc_id = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 415 - 416,
Remove the redundant local "import re" inside the method that extracts the doc
ID from a URL and rely on the module-level "re" import instead; delete the
inline import statement where the comment "Extract doc ID from URL" appears (the
duplicate import inside that function) so the function uses the already-imported
regex module, then run linters/tests to confirm no unused-import warnings
remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 198-207: The call to
knowledge_orchestrator.create_document_with_content is a synchronous, blocking
operation called inside an async handler; wrap this call using asyncio.to_thread
and await it so the blocking DB/file/URL work runs off the event loop (i.e.,
replace the direct call to
knowledge_orchestrator.create_document_with_content(...) with an awaited
asyncio.to_thread invocation of that function, passing the same named arguments:
db, user, knowledge_base_id, name/title, source_type, content/doc_content,
trigger_indexing, trigger_summary).
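The wrapping pattern the comment asks for can be sketched as follows; `create_document_with_content` here is a stand-in with a hypothetical signature, used only to show how `asyncio.to_thread` keeps blocking work off the event loop:

```python
import asyncio
import time


def create_document_with_content(db, title, content):
    """Stand-in for the blocking orchestrator call (hypothetical signature)."""
    time.sleep(0.05)  # simulates blocking DB/file work
    return {"title": title, "size": len(content)}


async def add_doc_handler(db, title, content):
    # Run the blocking call in a worker thread so other tasks on the
    # event loop keep running; to_thread forwards args and kwargs as-is.
    return await asyncio.to_thread(
        create_document_with_content, db, title=title, content=content
    )


result = asyncio.run(add_doc_handler(None, "demo", "hello"))
```

Note that if the blocking call holds a SQLAlchemy session, the session itself must be created inside the worker thread rather than passed across threads.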

In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 624-627: The hardcoded payload_file ("/tmp/mcp_payload.json") can
cause race conditions; change the code that builds/writes payload_file (where
payload_json is created and sandbox.files.write is called) to create a secure
unique temp path using Python's tempfile (e.g., tempfile.mkstemp() or
tempfile.NamedTemporaryFile) or append a uuid, write payload_json to that unique
filename, and ensure you close/remove the temp file after use; update references
to payload_file accordingly so sandbox.files.write uses the generated unique
path.
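A minimal local sketch of the `tempfile.mkstemp` approach (the sandbox in the PR writes via `sandbox.files.write`, so this only illustrates generating a collision-free path and cleaning it up):

```python
import json
import os
import tempfile


def write_payload(payload: dict) -> str:
    """Write an MCP payload to a unique temp file instead of a fixed
    /tmp/mcp_payload.json (helper name is illustrative)."""
    fd, path = tempfile.mkstemp(prefix="mcp_payload_", suffix=".json")
    with os.fdopen(fd, "w") as f:  # adopt the fd so it is closed exactly once
        json.dump(payload, f)
    return path


path = write_payload({"tool": "get_document_info"})
try:
    with open(path) as f:
        data = json.load(f)
finally:
    os.remove(path)  # always remove the temp file, even on read failure
```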

---

Duplicate comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 181-200: The current get_document_info function returns synthetic
metadata (title, modified_time, modified_time_formatted) built from doc_id and
local clock; replace this placeholder by calling the DingTalk document metadata
API to fetch real values (title, lastModified, content type, url) and map them
into the returned dict (fields title, modified_time, modified_time_formatted,
content_type, url, doc_id); preserve a safe fallback path that logs the API
error and only then uses the synthetic values so downstream code (file naming
and KB records) receives real metadata when available and predictable fallback
data when the API fails.
- Around line 229-247: download_document_content currently returns hardcoded
markdown; replace the placeholder logic in download_document_content to call
DingTalk's document export/download APIs using the doc_info and export_format,
download the exported content (handling possible export conversion for
"markdown" vs other formats), parse/validate the response and populate
"content", "title", "modified_time", "modified_time_formatted", "file_extension"
and "doc_id" from the real API response, and add robust error handling and
retries (log and raise on unrecoverable failures) so KB entries contain the
actual document content rather than dummy text.

In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 408-452: The _get_document_info_from_mcp function currently only
parses doc_id from doc_url and synthesizes metadata; replace the placeholder
logic with a real MCP/DingTalk metadata fetch using the provided sandbox (use
the sandbox/tool client or the MCP helper available in this environment) to call
the appropriate MCP endpoint with the extracted doc_id (or the full doc_url) and
return the real fields: success (bool), doc_id, title, modified_time (ISO),
modified_time_formatted; preserve the existing error handling pattern (return
{"success": False, "error": ...}) and ensure you fall back to an informative
error when the MCP call fails or returns no data so callers of
_get_document_info_from_mcp get real metadata instead of synthetic values.
- Around line 454-497: _download_document_content currently returns placeholder
markdown instead of fetching the real document body; replace the template
generation with an actual fetch from DingTalk via the MCP/client used elsewhere
(use the same client or helper that performs authenticated requests), extract
the document ID from doc_url (doc_id) using the existing patterns, call the
appropriate MCP/DingTalk API endpoint to retrieve the document content (handle
JSON/html responses), convert/clean the response into the expected
markdown/plain string, and return {"success": True, "content": content}; ensure
robust error handling around the network call (catch and log exceptions and
return {"success": False, "error": str(e)}), and reuse existing
symbols/functions for HTTP auth/client if present to avoid duplicating
credentials.

---

Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 239-246: The function add_dingtalk_doc_with_attachment is
synchronous while similar handlers are async; change its signature to async def
add_dingtalk_doc_with_attachment(...) and update its internals to run the
blocking orchestrator call via asyncio.to_thread (or await an async wrapper) so
the handler is non-blocking and consistent with the other async tools; ensure
any callers or tests that expect the old sync signature are updated to await
add_dingtalk_doc_with_attachment and preserve existing behavior for
trigger_indexing/trigger_summary and return type Dict[str, Any].
- Around line 180-186: Move the repeated inline imports of datetime to the
module level: add a single module-level line "from datetime import datetime" to
the top of the file, then remove the two inline "from datetime import datetime"
statements found near the assignments to modified_time (the blocks that set
modified_time = datetime.now().strftime("%Y%m%d%H%M%S")). Ensure the existing
uses of modified_time still refer to the now-imported datetime and run
tests/lint to confirm no unused-import warnings.

In `@backend/app/services/dingtalk/__init__.py`:
- Around line 14-19: The __all__ list in this module is not alphabetically
sorted; reorder the entries in __all__ so they are lexicographically sorted
(e.g., "DingTalkDocsService" before "build_dingtalk_doc_filename",
"dingtalk_docs_service", and "sanitize_filename") to satisfy static
analysis/isort expectations and maintain consistency.

In `@backend/app/services/dingtalk/docs_service.py`:
- Line 17: The import httpx at module-level in docs_service.py is unused; either
remove the import statement entirely or retain it with an explicit comment
(e.g., "Reserved for DingTalk API calls") to indicate intentional future use so
linters don't flag it — update the module-level import for httpx accordingly.
- Around line 141-147: The outer "except Exception: pass" in the block that
tries to parse modified_time silently swallows failures; update that outer
handler to log the exception (including exception details) and context (the
modified_time value and attempted fmt) before falling back to current
time—modify the outer except around the datetime.strptime/strftime loop to call
the module/class logger (or existing logger instance) with a descriptive message
and the caught exception information so parse failures are visible for
debugging.
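A sketch of the logged-fallback pattern the comment describes; the accepted formats and helper name are assumptions, not the service's actual code:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def format_modified_time(modified_time: str) -> str:
    """Parse a timestamp in a few assumed formats; log (rather than
    silently swallow) failures before falling back to the current time."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(modified_time, fmt).strftime("%Y%m%d%H%M%S")
        except ValueError:
            continue
    logger.warning(
        "Could not parse modified_time %r with known formats; using now()",
        modified_time,
    )
    return datetime.now().strftime("%Y%m%d%H%M%S")
```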

In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 75-77: The ImportError being raised for failing to import
BaseSandboxTool from chat_shell.tools.sandbox._base should be re-raised using
exception chaining suppression; modify the raise in the provider module where
BaseSandboxTool is checked so it uses "raise ImportError(... ) from None" to
indicate the error is intentional and not caused by a prior exception.
- Around line 461-462: Remove the redundant local import of the regex module
inside the method (the `import re` line shown near the "Extract doc ID for
content generation" block) since `re` is already imported at module level;
simply delete that inner import so the method uses the module-level `re` symbol
(locate the local import inside the provider.py function that extracts the doc
ID and remove it).
- Around line 415-416: Remove the redundant local "import re" inside the method
that extracts the doc ID from a URL and rely on the module-level "re" import
instead; delete the inline import statement where the comment "Extract doc ID
from URL" appears (the duplicate import inside that function) so the function
uses the already-imported regex module, then run linters/tests to confirm no
unused-import warnings remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 815d6cfc-0722-43c9-a13e-c10817cb7e42

📥 Commits

Reviewing files that changed from the base of the PR and between b6986a2 and e567768.

📒 Files selected for processing (4)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/dingtalk/__init__.py
  • backend/app/services/dingtalk/docs_service.py
  • backend/init_data/skills/dingtalk-docs/provider.py

Comment thread backend/app/mcp_server/tools/dingtalk_docs.py
Comment thread backend/init_data/skills/dingtalk-docs/provider.py Outdated
Critical fixes:

1. Remove duplicate _get_user_from_token
   - Import from knowledge.py instead of reimplementing

2. Implement real DingTalk API calls via MCP
   - get_document_info now calls dingtalk_docs MCP 'get_document_info' tool
   - download_document_content now calls dingtalk_docs MCP 'download_document' tool
   - Added _call_dingtalk_mcp_tool helper for MCP communication
   - Added _get_dingtalk_docs_mcp_config to get user MCP settings

3. User preferences integration
   - MCP tools now pass user.preferences to service layer
   - Service reads dingtalk_docs MCP config from user preferences
   - Falls back to placeholder if MCP not configured

The implementation now requires users to:
1. Configure dingtalk_docs MCP in user settings
2. Provide the MCP server URL from DingTalk

Files modified:
- app/mcp_server/tools/dingtalk_docs.py
- app/services/dingtalk/docs_service.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
backend/app/mcp_server/tools/dingtalk_docs.py (1)

140-143: ⚠️ Potential issue | 🟠 Major

Keep the synchronous document-creation phase off the async handler.

add_dingtalk_doc_to_knowledge() is async, but it creates a synchronous SessionLocal() and then calls knowledge_orchestrator.create_document_with_content() directly. That blocks the FastMCP event loop, and the obvious asyncio.to_thread(..., db=db) fix would also be unsafe because the current SQLAlchemy session is not thread-safe. Move the whole DB/orchestrator phase behind a worker-thread helper that opens and closes its own SessionLocal().

Also applies to: 205-214

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 140 - 143, The
async handler add_dingtalk_doc_to_knowledge is creating a synchronous
SessionLocal() and calling knowledge_orchestrator.create_document_with_content
directly which blocks the event loop; refactor by extracting the DB/orchestrator
work (including opening/closing a fresh SessionLocal and calling
_get_user_from_token and knowledge_orchestrator.create_document_with_content)
into a synchronous helper function (e.g.,
sync_create_dingtalk_doc(db_work_helper) that creates its own SessionLocal,
performs the user lookup and document creation, closes the session, and returns
results) and then call that helper from the async function via
asyncio.to_thread(sync_create_dingtalk_doc, ...), ensuring no existing
SQLAlchemy SessionLocal instance is shared across threads. Ensure the same
change is applied to the other occurrence around lines 205-214.
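The thread-safety requirement can be sketched like this; `make_session` stands in for `SessionLocal()` so the example runs without SQLAlchemy, and it records which thread opened the session to show it never crosses the thread boundary:

```python
import asyncio
import threading


def make_session():
    """Stand-in for SessionLocal(); records which thread opened it."""
    return {"thread": threading.get_ident(), "closed": False}


def sync_create_doc(title: str) -> dict:
    # Open and close a fresh session entirely inside the worker thread,
    # so no session object is shared across threads.
    db = make_session()
    try:
        return {"title": title, "session_thread": db["thread"]}
    finally:
        db["closed"] = True


async def handler(title: str) -> dict:
    return await asyncio.to_thread(sync_create_doc, title)


result = asyncio.run(handler("demo"))
```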
backend/app/services/dingtalk/docs_service.py (2)

261-309: ⚠️ Potential issue | 🔴 Critical

Do not fabricate metadata on MCP miss or failure.

The missing-config path, the catch-all error path, and even the Line 286/Line 287 defaults all manufacture a title/timestamp and return them as if DingTalk had confirmed them. That makes get_document_info() report success with fake metadata and reintroduces the earlier bad-filename/bad-record failure mode.

🧭 Patch sketch
         mcp_config = self._get_dingtalk_docs_mcp_config(user_preferences)
         if not mcp_config:
-            logger.warning(
-                "dingtalk_docs MCP not configured, returning placeholder info"
-            )
-            # Fallback to placeholder if MCP not configured
-            now = datetime.now()
-            return {
-                "doc_id": doc_id,
-                "title": f"DingTalkDoc_{doc_id[:8]}",
-                "modified_time": now.isoformat(),
-                "modified_time_formatted": now.strftime("%Y%m%d%H%M%S"),
-                "content_type": "markdown",
-                "url": doc_url,
-            }
+            raise ValueError(
+                "dingtalk_docs MCP not configured. "
+                "Please configure DingTalk Docs MCP in user settings."
+            )
@@
-            title = result.get("title", f"DingTalkDoc_{doc_id[:8]}")
-            modified_time = result.get("modified_time", datetime.now().isoformat())
+            title = result.get("title")
+            modified_time = result.get("modified_time")
+            if not title or not modified_time:
+                raise ValueError(
+                    "MCP response missing required document metadata"
+                )
@@
-        except Exception as e:
-            logger.error(f"Failed to get document info from MCP: {e}")
-            # Fallback to placeholder on error
-            now = datetime.now()
-            return {
-                "doc_id": doc_id,
-                "title": f"DingTalkDoc_{doc_id[:8]}",
-                "modified_time": now.isoformat(),
-                "modified_time_formatted": now.strftime("%Y%m%d%H%M%S"),
-                "content_type": "markdown",
-                "url": doc_url,
-            }
+        except Exception as e:
+            logger.error(f"Failed to get document info from MCP: {e}")
+            raise ValueError(f"Failed to get document info: {e}") from e

95-113: ⚠️ Potential issue | 🟠 Major

Validate the hostname before extracting a document ID.

re.search() scans the full input string, so an unrelated URL or arbitrary text that merely embeds alidocs.dingtalk.com/i/... in a query or fragment will pass validation. Because the original doc_url is then forwarded unchanged to the downstream MCP call, this boundary currently accepts non-DingTalk URLs instead of rejecting them up front.

🔒 Patch sketch
+from urllib.parse import urlparse
+
 def _extract_doc_id_from_url(self, url: str) -> Optional[str]:
     if not url:
         return None
+    parsed = urlparse(url)
+    if parsed.scheme not in {"http", "https"}:
+        return None
+    if parsed.hostname != "alidocs.dingtalk.com":
+        return None
+    path = parsed.path.rstrip("/")

-    node_pattern = r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)"
-    match = re.search(node_pattern, url)
+    match = re.fullmatch(r"/i/nodes/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)

-    docs_pattern = r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)"
-    match = re.search(docs_pattern, url)
+    match = re.fullmatch(r"/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)

-    wiki_pattern = r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)"
-    match = re.search(wiki_pattern, url)
+    match = re.fullmatch(r"/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 95 - 113, The
extraction currently uses re.search over the whole url string (node_pattern,
docs_pattern, wiki_pattern) which can match embedded text in unrelated URLs;
instead parse the url with urllib.parse.urlparse, validate that the parsed
hostname (netloc) is exactly or endswith "alidocs.dingtalk.com" (depending on
allowed subdomains), and then run your regexes against only the parsed path (or
use path-based pattern matching) to extract and return the document id; update
the code that references url, node_pattern, docs_pattern, and wiki_pattern to
perform the hostname check before returning any match and return None if the
hostname is not the expected DingTalk host.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-196: The code builds a timestamped filename via
dingtalk_docs_service.build_filename(title, modified_time) and logs/returns it
but the document creation still calls the create function with name=title and
omits file_extension, so the timestamped "{title}_{timestamp}.md" is never
applied; update the document creation call that currently uses name=title to use
name=filename and set file_extension to ".md" (or the expected extension) so the
built filename is actually used. Locate the variable filename, the
build_filename call, and the create(...) invocation that currently passes
name=title and modify that invocation to pass name=filename and include
file_extension=".md".

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 140-143: The async handler add_dingtalk_doc_to_knowledge is
creating a synchronous SessionLocal() and calling
knowledge_orchestrator.create_document_with_content directly which blocks the
event loop; refactor by extracting the DB/orchestrator work (including
opening/closing a fresh SessionLocal and calling _get_user_from_token and
knowledge_orchestrator.create_document_with_content) into a synchronous helper
function (e.g., sync_create_dingtalk_doc(db_work_helper) that creates its own
SessionLocal, performs the user lookup and document creation, closes the
session, and returns results) and then call that helper from the async function
via asyncio.to_thread(sync_create_dingtalk_doc, ...), ensuring no existing
SQLAlchemy SessionLocal instance is shared across threads. Ensure the same
change is applied to the other occurrence around lines 205-214.

In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 95-113: The extraction currently uses re.search over the whole url
string (node_pattern, docs_pattern, wiki_pattern) which can match embedded text
in unrelated URLs; instead parse the url with urllib.parse.urlparse, validate
that the parsed hostname (netloc) exactly matches or ends with "alidocs.dingtalk.com"
(depending on allowed subdomains), and then run your regexes against only the
parsed path (or use path-based pattern matching) to extract and return the
document id; update the code that references url, node_pattern, docs_pattern,
and wiki_pattern to perform the hostname check before returning any match and
return None if the hostname is not the expected DingTalk host.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 950c5d32-7fca-4851-9f74-8d16eb86ad0b

📥 Commits

Reviewing files that changed from the base of the PR and between e567768 and e34e393.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/dingtalk/docs_service.py

Comment thread backend/app/mcp_server/tools/dingtalk_docs.py Outdated
Critical fixes:

1. Fix _get_document_info_from_mcp() to call real MCP
   - Now calls backend MCP endpoint 'get_dingtalk_document_info'
   - Returns real document metadata from DingTalk
   - Added proper error handling and temp file cleanup

2. Fix _download_document_content_real() to fetch real content
   - New method that calls backend MCP endpoint 'download_document'
   - Returns actual document content from DingTalk
   - Uses unique temp file names with uuid
   - Added proper cleanup

3. Fix blocking call in dingtalk_docs.py
   - Wrapped knowledge_orchestrator.create_document_with_content
   - Now uses asyncio.to_thread() to run off event loop

4. Fix hardcoded temp file path
   - Changed from '/tmp/mcp_payload.json' to '/tmp/mcp_payload_{uuid}.json'
   - Added try/finally blocks to ensure cleanup
   - Applied to all three places using temp files

Files modified:
- app/mcp_server/tools/dingtalk_docs.py
- init_data/skills/dingtalk-docs/provider.py

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/init_data/skills/dingtalk-docs/provider.py (3)

861-905: Consider simplifying the redundant cleanup logic.

The cleanup code at lines 886-893 runs inside the main try block (after sandbox.commands.run), but the same cleanup also exists in the except block (lines 897-904). This works but is redundant since a single finally block would handle both paths.

♻️ Simplified structure
             try:
                 # Build curl command using array style to prevent shell injection
                 import shlex

                 curl_cmd = [
                     "curl",
                     "-s",
                     "-X",
                     "POST",
                     "-H",
                     f"Authorization: Bearer {auth_token}",
                     "-H",
                     "Content-Type: application/json",
                     "-d",
                     f"@{payload_file}",
                     mcp_url,
                 ]

                 result_obj = await sandbox.commands.run(
                     cmd=shlex.join(curl_cmd),
                     cwd="/home/user",
                     timeout=60,
                 )
-
-                # Clean up temp file after use
-                try:
-                    await sandbox.commands.run(
-                        cmd=f"rm -f {payload_file}",
-                        cwd="/home/user",
-                        timeout=10,
-                    )
-                except Exception:
-                    pass
-
-            except Exception as e:
-                # Clean up temp file on error
+            finally:
                 try:
                     await sandbox.commands.run(
                         cmd=f"rm -f {payload_file}",
                         cwd="/home/user",
                         timeout=10,
                     )
                 except Exception:
                     pass
-                raise e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 861 - 905,
The temp-file cleanup for payload_file is duplicated in both the success path
and exception path; refactor the try/except to a try/finally around the
sandbox.commands.run invocation that executes the rm -f {payload_file} cleanup
(called via sandbox.commands.run) in the finally block, remove the duplicate
cleanup in the except branch, and keep error re-raising behavior for exceptions
from sandbox.commands.run; use the existing curl_cmd/shlex.join, auth_token and
mcp_url variables as before to locate the code to change.

74-77: Use raise ... from None to suppress the misleading implicit exception chain.

When re-raising an ImportError within an except clause, use from None to suppress the exception chain and clarify intent.

♻️ Proposed fix
     else:
-        raise ImportError(
+        raise ImportError(
             "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 74 - 77, The
ImportError raised in the else branch that references "Cannot import
BaseSandboxTool from chat_shell.tools.sandbox._base" should be suppressed from
chaining by using "raise ... from None"; locate the else block in provider.py
where the ImportError is raised (the message about BaseSandboxTool and
chat_shell.tools.sandbox._base) and modify the raise to use "from None" so the
exception chain is not preserved.

548-592: Remove dead placeholder code.

This _download_document_content method returns placeholder content and is never called—the workflow at line 303 uses _download_document_content_real instead. Remove this dead code to avoid confusion.

🗑️ Remove dead code
-    async def _download_document_content(self, sandbox: Any, doc_url: str) -> dict:
-        """Download document content from DingTalk.
-
-        For now, this returns placeholder content.
-        In production, this would call the DingTalk API via MCP.
-        """
-        try:
-            # Extract doc ID for content generation
-            import re
-
-            doc_id = None
-            patterns = [
-                r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
-                r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
-            ]
-
-            for pattern in patterns:
-                match = re.search(pattern, doc_url)
-                if match:
-                    doc_id = match.group(1)
-                    break
-
-            # Generate placeholder content
-            content = f"""# DingTalk Document
-
-This document was imported from DingTalk.
-
-**Source URL:** {doc_url}
-**Document ID:** {doc_id or "unknown"}
-**Imported at:** {datetime.now().isoformat()}
-
-## Content
-
-The actual content would be fetched from DingTalk API in production.
-This is a placeholder for the document content.
-
----
-*Imported by Wegent DingTalk Docs Skill*
-"""
-
-            return {"success": True, "content": content}
-
-        except Exception as e:
-            return {"success": False, "error": str(e)}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 548 - 592,
Remove the dead placeholder method _download_document_content (its docstring,
body and exception handling) from the DingTalk docs provider since the workflow
uses _download_document_content_real; also remove any now-unused imports
referenced only by that method (e.g., re, datetime) and ensure there are no
remaining references to _download_document_content elsewhere in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 861-905: The temp-file cleanup for payload_file is duplicated in
both the success path and exception path; refactor the try/except to a
try/finally around the sandbox.commands.run invocation that executes the rm -f
{payload_file} cleanup (called via sandbox.commands.run) in the finally block,
remove the duplicate cleanup in the except branch, and keep error re-raising
behavior for exceptions from sandbox.commands.run; use the existing
curl_cmd/shlex.join, auth_token and mcp_url variables as before to locate the
code to change.
- Around line 74-77: The ImportError raised in the else branch that references
"Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base" should be
suppressed from chaining by using "raise ... from None"; locate the else block
in provider.py where the ImportError is raised (the message about
BaseSandboxTool and chat_shell.tools.sandbox._base) and modify the raise to use
"from None" so the exception chain is not preserved.
- Around line 548-592: Remove the dead placeholder method
_download_document_content (its docstring, body and exception handling) from the
DingTalk docs provider since the workflow uses _download_document_content_real;
also remove any now-unused imports referenced only by that method (e.g., re,
datetime) and ensure there are no remaining references to
_download_document_content elsewhere in the module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 027c1883-d4aa-4e6e-87e0-6a556953a71b

📥 Commits

Reviewing files that changed from the base of the PR and between e34e393 and 93cb40a.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/init_data/skills/dingtalk-docs/provider.py

Add better error handling and user-friendly messages when DingTalk
document authentication fails:

1. Detect HTTP 401/403 errors from DingTalk MCP
2. Check for authentication-related keywords in error messages
3. Provide clear instructions to users:
   - Ensure they have access to the document in DingTalk
   - Check if document is shared or public
   - Verify MCP configuration is correct

This helps users understand why document download failed and
what steps they need to take to fix it.

File modified:
- app/services/dingtalk/docs_service.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
backend/app/services/dingtalk/docs_service.py (2)

191-225: Extract repeated numeric literals into named constants.

The literals at Line 209 (60.0), Line 219/413 (401, 403), and Line 392 (doc_id[:8]) should be constants for maintainability.

As per coding guidelines, "Extract magic numbers to named constants in Python code."

Also applies to: 392-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 191 - 225,
Replace magic numeric literals with descriptive constants: define
DEFAULT_HTTP_TIMEOUT = 60.0 and use it when creating
httpx.AsyncClient(timeout=DEFAULT_HTTP_TIMEOUT); define AUTH_ERROR_CODES = (401,
403) and use it in the HTTPStatusError handling instead of hard-coded 401/403;
define DOC_ID_PREVIEW_LEN = 8 and use it wherever doc_id[:8] is used. Update the
references in the MCP call function (the payload/call block and the except
httpx.HTTPStatusError handler) and in the code that slices doc_id so all
occurrences use these named constants (e.g., in the function that constructs doc
previews or logs). Ensure constants are placed near the module top with
descriptive names and used throughout.
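The constants the prompt names can be sketched directly (the constant names come from the review suggestion; the helpers around them are illustrative):

```python
DEFAULT_HTTP_TIMEOUT = 60.0    # seconds, for httpx.AsyncClient(timeout=...)
AUTH_ERROR_CODES = (401, 403)  # HTTP statuses treated as auth failures
DOC_ID_PREVIEW_LEN = 8         # characters of doc_id used in previews/fallbacks


def doc_id_preview(doc_id: str) -> str:
    """Truncate a doc_id for logs and fallback names."""
    return doc_id[:DOC_ID_PREVIEW_LEN]


def is_auth_error(status_code: int) -> bool:
    """Report whether an HTTP status should be surfaced as an auth failure."""
    return status_code in AUTH_ERROR_CODES
```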

74-76: Drop the no-op constructor or add explicit return type annotation.

Line 74-76 defines an empty __init__ without -> None; either remove it or annotate it explicitly.

As per coding guidelines, "Python code MUST follow PEP 8, Black formatter (line length: 88), and isort standards with type hints required."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 74 - 76, The
empty constructor def __init__(self): in docs_service.py is a no-op and missing
a return type; either remove this method entirely if it adds no behavior, or
annotate it with an explicit -> None return type (i.e., change to def
__init__(self) -> None:) to satisfy type-hinting rules and PEP8/Black
requirements; update the __init__ declaration accordingly in the class that
defines it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 289-337: The code currently masks MCP/config/auth failures by
returning fabricated metadata in get_document_info flow; instead, when
_get_dingtalk_docs_mcp_config(user_preferences) returns falsy, raise a clear
exception (e.g., RuntimeError or a DingtalkConfigError) with context about
missing MCP config rather than returning a placeholder, and in the except block
around _call_dingtalk_mcp_tool re-raise the original exception (or wrap it with
additional context) instead of returning placeholder metadata; update references
in this function to surface errors from _get_dingtalk_docs_mcp_config and
_call_dingtalk_mcp_tool so callers can handle auth/config failures (include the
original exception message when re-raising).
- Around line 343-405: The export_format parameter is not validated before
calling the MCP and is later used directly as file_extension; update the method
handling doc download to validate export_format against allowed values (e.g.,
"markdown", "html", "txt") before calling _call_dingtalk_mcp_tool, map
"markdown" to "md" for file_extension, and raise a clear ValueError for
unsupported formats; locate this logic in the same function (the one that calls
_call_dingtalk_mcp_tool and uses _format_modified_time) and perform
validation/mapping right after extracting doc_id and before building the MCP
arguments.
- Around line 414-422: The two ValueError raises that handle
httpx.HTTPStatusError should chain the original exception (use "raise
ValueError(...) from e") and the multi-line messages that use f-strings without
interpolation should be plain string literals; update the raises referencing the
caught exception variable "e" so both raise ValueError(...) from e and replace
f"... " with regular "..." for the static messages (the ValueError lines that
mention DingTalk authentication and the one that includes HTTP
{e.response.status_code}).

---

Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 191-225: Replace magic numeric literals with descriptive
constants: define DEFAULT_HTTP_TIMEOUT = 60.0 and use it when creating
httpx.AsyncClient(timeout=DEFAULT_HTTP_TIMEOUT); define AUTH_ERROR_CODES = (401,
403) and use it in the HTTPStatusError handling instead of hard-coded 401/403;
define DOC_ID_PREVIEW_LEN = 8 and use it wherever doc_id[:8] is used. Update the
references in the MCP call function (the payload/call block and the except
httpx.HTTPStatusError handler) and in the code that slices doc_id so all
occurrences use these named constants (e.g., in the function that constructs doc
previews or logs). Ensure constants are placed near the module top with
descriptive names and used throughout.
- Around line 74-76: The empty constructor def __init__(self): in
docs_service.py is a no-op and missing a return type; either remove this method
entirely if it adds no behavior, or annotate it with an explicit -> None return
type (i.e., change to def __init__(self) -> None:) to satisfy type-hinting rules
and PEP8/Black requirements; update the __init__ declaration accordingly in the
class that defines it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dcaed037-be49-4cbc-ba8b-ecee32782f09

📥 Commits

Reviewing files that changed from the base of the PR and between 93cb40a and bcc2bc9.

📒 Files selected for processing (1)
  • backend/app/services/dingtalk/docs_service.py

Comment thread backend/app/services/dingtalk/docs_service.py Outdated
Comment on lines +343 to +405
export_format: str = "markdown",
) -> Dict[str, Any]:
"""Download DingTalk document content via MCP.

Args:
doc_url: DingTalk document URL
user_preferences: User's preferences JSON string (for MCP config)
export_format: Export format (markdown, html, txt)

Returns:
Dict containing:
- content: Document content as string
- title: Document title
- modified_time: Modification time
- file_extension: Suggested file extension

Raises:
ValueError: If download fails
"""
doc_id = self._extract_doc_id_from_url(doc_url)
if not doc_id:
raise ValueError(f"Invalid DingTalk document URL: {doc_url}")

logger.info(f"Downloading document content for doc_id: {doc_id}")

# Get MCP config
mcp_config = self._get_dingtalk_docs_mcp_config(user_preferences)
if not mcp_config:
raise ValueError(
"dingtalk_docs MCP not configured. "
"Please configure DingTalk Docs MCP in user settings."
)

# Call MCP tool to download document content
try:
result = await self._call_dingtalk_mcp_tool(
mcp_config,
tool_name="download_document",
arguments={
"doc_id": doc_id,
"url": doc_url,
"format": export_format,
},
)

content = result.get("content", "")
if not content:
raise ValueError("Empty content returned from MCP")

title = result.get("title", f"DingTalkDoc_{doc_id[:8]}")
modified_time = result.get(
"modified_time",
result.get("last_modified", datetime.now().isoformat()),
)

return {
"content": content,
"title": title,
"modified_time": modified_time,
"modified_time_formatted": self._format_modified_time(modified_time),
"file_extension": (
"md" if export_format == "markdown" else export_format
),

⚠️ Potential issue | 🟡 Minor

Validate export_format against supported values before MCP call.

Line 343 accepts any string; unsupported values are forwarded downstream and also used as the file_extension on lines 403-405.

💡 Suggested fix
+SUPPORTED_EXPORT_FORMATS = {"markdown", "html", "txt"}
@@
     async def download_document_content(
         self,
         doc_url: str,
         user_preferences: Optional[str] = None,
         export_format: str = "markdown",
@@
+        if export_format not in SUPPORTED_EXPORT_FORMATS:
+            raise ValueError(
+                f"Unsupported export_format: {export_format}. "
+                f"Supported formats: {', '.join(sorted(SUPPORTED_EXPORT_FORMATS))}"
+            )
+
         doc_id = self._extract_doc_id_from_url(doc_url)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 343 - 405, The
export_format parameter is not validated before calling the MCP and is later
used directly as file_extension; update the method handling doc download to
validate export_format against allowed values (e.g., "markdown", "html", "txt")
before calling _call_dingtalk_mcp_tool, map "markdown" to "md" for
file_extension, and raise a clear ValueError for unsupported formats; locate
this logic in the same function (the one that calls _call_dingtalk_mcp_tool and
uses _format_modified_time) and perform validation/mapping right after
extracting doc_id and before building the MCP arguments.
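As a standalone, runnable sketch of the same validation and mapping (the constant name mirrors the suggested fix, not the merged code):

```python
SUPPORTED_EXPORT_FORMATS = {"markdown", "html", "txt"}


def resolve_file_extension(export_format: str) -> str:
    """Validate export_format before the MCP call and map it to an extension."""
    if export_format not in SUPPORTED_EXPORT_FORMATS:
        raise ValueError(
            f"Unsupported export_format: {export_format}. "
            f"Supported formats: {', '.join(sorted(SUPPORTED_EXPORT_FORMATS))}"
        )
    # "markdown" is the only format whose extension differs from its name.
    return "md" if export_format == "markdown" else export_format
```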

Comment on lines +414 to +422
raise ValueError(
f"DingTalk authentication required. Please ensure:\n"
f"1. You have access to this document in DingTalk\n"
f"2. The document is shared with you or is public\n"
f"3. Your DingTalk MCP configuration is correct"
)
raise ValueError(
f"Failed to download document from DingTalk: HTTP {e.response.status_code}"
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
uv run ruff check backend/app/services/dingtalk/docs_service.py --select F541,B904 -n

Repository: wecode-ai/Wegent

Length of output: 102


🏁 Script executed:

#!/bin/bash
# First, verify the file exists and check line count
wc -l backend/app/services/dingtalk/docs_service.py

# Read lines 410-425 to get full context
sed -n '410,425p' backend/app/services/dingtalk/docs_service.py

Repository: wecode-ai/Wegent

Length of output: 932


🏁 Script executed:

#!/bin/bash
# Read lines 400-430 to see the full try-except context
sed -n '400,430p' backend/app/services/dingtalk/docs_service.py

Repository: wecode-ai/Wegent

Length of output: 1561


Chain exceptions and replace unnecessary f-strings with regular strings.

Lines 415-418 use f-strings without any variable placeholders and should be regular strings. Additionally, both ValueError raises at lines 414 and 420 should chain the httpx.HTTPStatusError exception with from e to preserve the original traceback context:

raise ValueError(...) from e
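A minimal stdlib-only illustration of the chaining pattern (B904); the function and message are made up for the example, not taken from the service:

```python
def parse_status(raw: str) -> int:
    """Parse a status code, re-raising with context while preserving the cause."""
    try:
        return int(raw)
    except ValueError as e:
        # "from e" records the original error as __cause__, so the full
        # traceback survives into logs instead of being replaced.
        raise ValueError(f"Invalid HTTP status: {raw!r}") from e
```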
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 414-419: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[error] 415-415: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 416-416: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 417-417: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 418-418: f-string without any placeholders

Remove extraneous f prefix

(F541)


[warning] 420-422: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 414 - 422, The
two ValueError raises that handle httpx.HTTPStatusError should chain the
original exception (use "raise ValueError(...) from e") and the multi-line
messages that use f-strings without interpolation should be plain string
literals; update the raises referencing the caught exception variable "e" so
both raise ValueError(...) from e and replace f"... " with regular "..." for the
static messages (the ValueError lines that mention DingTalk authentication and
the one that includes HTTP {e.response.status_code}).

- Fix get_document_info to use correct nodeId parameter instead of doc_id/url
- Rewrite download_document_content to follow DingTalk MCP workflow:
  1. Call get_document_info to get contentType and extension
  2. Route to appropriate tool based on document type:
     - adoc: use get_document_content
     - axls: raise error (requires dingtalk_table MCP)
     - able: raise error (requires dingtalk_ai_table MCP)
     - file: use download_file and fetch from downloadUrl
- Add proper error handling for unsupported document types
- Add urlparse validation to _extract_doc_id_from_url for proper hostname checking
- Extract magic numbers to constants (HTTP_TIMEOUT_SECONDS, AUTH_ERROR_CODES, DOC_ID_PREVIEW_LENGTH)
- Add return type annotation to __init__ method
- Remove placeholder fallback in get_document_info, raise proper exceptions instead
- Apply consistent black formatting

Addresses CodeRabbit review feedback for improved security and maintainability.
- Add Accept: application/json header to MCP requests to fix HTTP 406 errors
- Add response body logging for better debugging
- Add specific error handling for 406 Not Acceptable with detailed message

The 406 error typically indicates content negotiation issues between
the client and DingTalk MCP gateway server.
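A stdlib sketch of the header change described above; the endpoint URL and payload shape are placeholders, and the real code uses httpx rather than urllib:

```python
import json
import urllib.request


def build_mcp_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON-RPC POST whose Accept header satisfies the gateway."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Explicit Accept avoids the 406 content-negotiation rejection.
            "Accept": "application/json",
        },
        method="POST",
    )
```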
- Fix get_document_info to use camelCase field names (modifiedTime, contentType)
  as returned by DingTalk MCP instead of snake_case
- Add debug logging to show available fields in MCP response
- Add fallback field names for get_document_content (markdown, text)
  in case content field is empty

The DingTalk MCP returns camelCase field names, but the code was
expecting snake_case names, causing metadata extraction to fail.
DingTalk MCP get_document_info returns different field names than expected:
- 'name' instead of 'title' for document name
- 'updateTime' instead of 'modifiedTime' for modification time
- 'docUrl' for the document URL

Updated the field mapping to match actual DingTalk MCP response format.
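The fallback mapping described in these commits can be sketched as follows; the alias order and output keys are assumptions based on the commit messages:

```python
def extract_doc_metadata(doc_info: dict) -> dict:
    """Map DingTalk MCP camelCase fields, falling back across known aliases."""
    return {
        "title": doc_info.get("name") or doc_info.get("title", ""),
        "modified_time": doc_info.get("updateTime")
        or doc_info.get("modifiedTime", ""),
        "url": doc_info.get("docUrl", ""),
    }
```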
DingTalk MCP returns updateTime as Unix timestamp in milliseconds (13 digits).
Updated _format_modified_time to:
- Detect millisecond timestamps (13 digit numbers)
- Use the original timestamp directly in filename (e.g., 产品需求文档_1775705225000)
- Fall back to YYYYMMDDHHMMSS format for ISO datetime strings

This matches the user's requested filename format.
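The branching described above can be sketched as a standalone function; the name and exact fallback behavior are assumptions based on the commit message:

```python
from datetime import datetime


def format_modified_time(value: str) -> str:
    """Pass 13-digit millisecond epochs through; collapse ISO datetimes."""
    if value.isdigit() and len(value) == 13:
        # Unix epoch in milliseconds, used verbatim in the filename.
        return value
    return datetime.fromisoformat(value).strftime("%Y%m%d%H%M%S")
```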
- Fix doc_info.get to use 'name' instead of 'title' for document title
- Fix doc_info.get to use 'updateTime' instead of 'modifiedTime' for timestamp
- Update build_dingtalk_doc_filename docstring to match actual behavior
  - Example now shows Unix timestamp format: 产品需求文档_1775705225000.md
  - Args description updated to reflect timestamp format

This ensures correct filename generation when downloading documents.
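The naming convention reduces to simple string assembly; this sketch omits any title sanitization the real build_dingtalk_doc_filename may perform:

```python
def build_dingtalk_doc_filename(name: str, update_time: str, ext: str = "md") -> str:
    """Build the {title}_{timestamp}.{ext} filename used for imported docs."""
    return f"{name}_{update_time}.{ext}"
```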
Update modified_time format validation to accept both:
- Unix timestamp in milliseconds (13 digits)
- YYYYMMDDHHMMSS format (14 digits)

Previously only accepted 14-digit format, causing warnings when
DingTalk MCP returns 13-digit Unix timestamps.
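The relaxed check amounts to accepting either digit width; a minimal sketch, assuming the validator only looks at length and digits:

```python
def is_valid_modified_time(value: str) -> bool:
    """Accept 13-digit millisecond epochs or 14-digit YYYYMMDDHHMMSS strings."""
    return value.isdigit() and len(value) in (13, 14)
```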
@parabala

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

✅ Actions performed

Reviews resumed.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/app/services/dingtalk/docs_service.py (1)

507-529: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefixes and chain exceptions.

Lines 513-516 use f-strings without any placeholders. Additionally, the ValueError raises should chain the original exception with from e to preserve traceback context.

🔧 Proposed fix
         except httpx.HTTPStatusError as e:
             logger.error(
                 f"HTTP error from DingTalk MCP: {e.response.status_code} - {e.response.text}"
             )
             if e.response.status_code in AUTH_ERROR_CODES:
                 raise ValueError(
-                    f"DingTalk authentication required. Please ensure:\n"
-                    f"1. You have access to this document in DingTalk\n"
-                    f"2. The document is shared with you or is public\n"
-                    f"3. Your DingTalk MCP configuration is correct"
-                )
+                    "DingTalk authentication required. Please ensure:\n"
+                    "1. You have access to this document in DingTalk\n"
+                    "2. The document is shared with you or is public\n"
+                    "3. Your DingTalk MCP configuration is correct"
+                ) from e
             raise ValueError(
                 f"Failed to download document from DingTalk: HTTP {e.response.status_code}"
-            )
+            ) from e
         except Exception as e:
             logger.error(f"Failed to download document from MCP: {e}")
             error_msg = str(e)
             if "authentication" in error_msg.lower() or "auth" in error_msg.lower():
                 raise ValueError(
                     f"DingTalk authentication required: {error_msg}\n"
-                    f"Please ensure you have access to this document in DingTalk."
-                )
-            raise ValueError(f"Failed to download document: {e}")
+                    "Please ensure you have access to this document in DingTalk."
+                ) from e
+            raise ValueError(f"Failed to download document: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 507 - 529, In the
DingTalk docs service exception handlers (the except httpx.HTTPStatusError and
generic except Exception blocks in docs_service.py), remove the unnecessary
f-string prefixes from string literals (e.g., logger.error and ValueError
messages that have no placeholders) and change each bare "raise ValueError(...)"
to chain the original exception using "raise ValueError(... ) from e" so the
traceback/context is preserved; reference the logger, AUTH_ERROR_CODES, and the
caught exception variable e when making these edits.
backend/app/mcp_server/tools/dingtalk_docs.py (1)

195-217: ⚠️ Potential issue | 🟠 Major

The timestamped filename is not being used in document creation.

The filename is built at line 196 following the {title}_{timestamp}.md convention, but the document is created with name=title (line 212) instead of name=filename. The filename only appears in the response payload (line 223) but the actual document doesn't use it.

🔧 Proposed fix
         result = await asyncio.to_thread(
             knowledge_orchestrator.create_document_with_content,
             db=db,
             user=user,
             knowledge_base_id=knowledge_base_id,
-            name=title,
+            name=filename,
             source_type="text",
             content=doc_content,
+            file_extension=".md",
             trigger_indexing=trigger_indexing,
             trigger_summary=trigger_summary,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 195 - 217, The
created document currently uses name=title instead of the timestamped filename
produced by dingtalk_docs_service.build_filename; update the call to
knowledge_orchestrator.create_document_with_content to pass the generated
filename (variable filename) as the name argument (instead of title) so the
stored document uses the `{title}_{timestamp}.md` name returned in the response;
ensure you reference the existing variables filename and title and the function
knowledge_orchestrator.create_document_with_content when making this change.
🧹 Nitpick comments (1)
backend/app/services/dingtalk/docs_service.py (1)

364-369: The export_format parameter is unused.

The export_format parameter is documented and accepted but never used in the implementation. The actual download format is determined by the document's contentType and extension from the MCP response. Consider either removing the unused parameter or implementing format-based export logic if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 364 - 369, The
download_document_content function currently accepts an export_format parameter
but never uses it; either remove the unused parameter or implement logic to
respect it: update download_document_content to inspect export_format and map it
to the appropriate request/conversion path (e.g., force markdown/pdf/html export
when export_format == "markdown"/"pdf"/"html"), or if conversion isn't
supported, remove export_format from the signature and all callers; reference
the download_document_content function and its export_format parameter when
making the change and ensure any tests/call sites are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 248-262: The except block catching httpx.HTTPStatusError (the
handler using "except httpx.HTTPStatusError as e") currently raises several
ValueError instances without chaining; update each raise (the ones for
AUTH_ERROR_CODES, status 406, and the generic failure in the DingTalk MCP call)
to include exception chaining by appending "from e" so the original
httpx.HTTPStatusError traceback is preserved for debugging while keeping the
existing log calls (logger.error) intact.

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-217: The created document currently uses name=title instead of
the timestamped filename produced by dingtalk_docs_service.build_filename;
update the call to knowledge_orchestrator.create_document_with_content to pass
the generated filename (variable filename) as the name argument (instead of
title) so the stored document uses the `{title}_{timestamp}.md` name returned in
the response; ensure you reference the existing variables filename and title and
the function knowledge_orchestrator.create_document_with_content when making
this change.

In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 507-529: In the DingTalk docs service exception handlers (the
except httpx.HTTPStatusError and generic except Exception blocks in
docs_service.py), remove the unnecessary f-string prefixes from string literals
(e.g., logger.error and ValueError messages that have no placeholders) and
change each bare "raise ValueError(...)" to chain the original exception using
"raise ValueError(... ) from e" so the traceback/context is preserved; reference
the logger, AUTH_ERROR_CODES, and the caught exception variable e when making
these edits.

---

Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 364-369: The download_document_content function currently accepts
an export_format parameter but never uses it; either remove the unused parameter
or implement logic to respect it: update download_document_content to inspect
export_format and map it to the appropriate request/conversion path (e.g., force
markdown/pdf/html export when export_format == "markdown"/"pdf"/"html"), or if
conversion isn't supported, remove export_format from the signature and all
callers; reference the download_document_content function and its export_format
parameter when making the change and ensure any tests/call sites are updated
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f41b824-1638-4c3d-8079-5c55d802c848

📥 Commits

Reviewing files that changed from the base of the PR and between 93cb40a and bd58b05.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/dingtalk/docs_service.py

Comment on lines +248 to +262
except httpx.HTTPStatusError as e:
logger.error(f"HTTP error from DingTalk MCP: {e.response.status_code}")
logger.error(f"Response body: {e.response.text}")
if e.response.status_code in AUTH_ERROR_CODES:
raise ValueError(
"DingTalk authentication failed. Please check your MCP configuration."
)
if e.response.status_code == 406:
raise ValueError(
f"DingTalk MCP request format not acceptable (HTTP 406). "
f"Response: {e.response.text}"
)
raise ValueError(
f"Failed to call DingTalk MCP: HTTP {e.response.status_code}"
)

⚠️ Potential issue | 🟡 Minor

Chain exceptions to preserve traceback context.

The ValueError raises in the exception handler should chain the original exception using from e to preserve the traceback for debugging.

🔧 Proposed fix
         except httpx.HTTPStatusError as e:
             logger.error(f"HTTP error from DingTalk MCP: {e.response.status_code}")
             logger.error(f"Response body: {e.response.text}")
             if e.response.status_code in AUTH_ERROR_CODES:
                 raise ValueError(
                     "DingTalk authentication failed. Please check your MCP configuration."
-                )
+                ) from e
             if e.response.status_code == 406:
                 raise ValueError(
                     f"DingTalk MCP request format not acceptable (HTTP 406). "
                     f"Response: {e.response.text}"
-                )
+                ) from e
             raise ValueError(
                 f"Failed to call DingTalk MCP: HTTP {e.response.status_code}"
-            )
+            ) from e
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 252-254: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 256-259: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 260-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 248 - 262, The
except block catching httpx.HTTPStatusError (the handler using "except
httpx.HTTPStatusError as e") currently raises several ValueError instances
without chaining; update each raise (the ones for AUTH_ERROR_CODES, status 406,
and the generic failure in the DingTalk MCP call) to include exception chaining
by appending "from e" so the original httpx.HTTPStatusError traceback is
preserved for debugging while keeping the existing log calls (logger.error)
intact.


Rename the skill to avoid conflicts with other definitions:
- Folder: dingtalk-docs → dingtalk-connector
- Skill name: dingtalk-docs → dingtalk-connector
- Provider name: dingtalk-docs → dingtalk-connector
- Task type: dingtalk-docs → dingtalk-connector

Updated all references in:
- SKILL.md
- provider.py
- mcp_provider_registry.py
- dingtalk_docs.py
- Test files

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)

195-217: ⚠️ Potential issue | 🟠 Major

The generated {title}_{timestamp}.md name is still never applied.

filename is built and returned, but create_document_with_content() still receives name=title. The text-import path therefore does not actually persist the naming convention this tool advertises.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 195 - 217, The
code builds a filename via dingtalk_docs_service.build_filename (assigned to
filename) but still passes the original title into
knowledge_orchestrator.create_document_with_content (name=title), so the
generated `{title}_{timestamp}.md` never gets persisted; update the call to
create_document_with_content to pass name=filename (and any other
filename-related parameter the function supports) instead of title so the stored
document uses the built filename from filename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 33-41: Add OpenTelemetry tracing decorators to the MCP handlers so
their import path appears in traces: decorate the async handler
get_dingtalk_document_info with `@trace_async` using a descriptive span name
(e.g., "mcp.get_dingtalk_document_info"), the appropriate tracer name (e.g.,
"mcp") and the existing extract_attributes callable; likewise apply `@trace_async`
or `@trace_sync` (matching each function's async/sync nature) to the other
handlers defined in the file around the blocks at 90-104 and 237-249, and ensure
the trace decorator is imported where these functions are defined.

In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 603-618: The download step in _download_document_content_real()
fails to extract IDs for valid DingTalk wiki links even though
_get_document_info_from_mcp() accepts them; update the URL extraction logic used
in _download_document_content_real() (the local variable doc_url and the
patterns list) to include the wiki form accepted upstream (e.g., add a regex for
/wiki/... that captures the document ID) or normalize wiki URLs into the
nodes/docs form before running the existing patterns so that the same doc_id is
produced for wiki links and the "Could not extract document ID" error no longer
occurs.
- Around line 224-232: Add OpenTelemetry tracing to the async entrypoint by
decorating the _arun method with `@trace_async`; e.g., import trace_async and add
a decorator like `@trace_async`("dingtalk.import", "dingtalk-connector", lambda
ctx, args, kwargs: {"dingtalk_doc_url": kwargs.get("dingtalk_doc_url", args[0]
if args else None), "knowledge_base_id": kwargs.get("knowledge_base_id", args[1]
if len(args) > 1 else None)}). Apply this to the async def _arun(...) function
so the whole async flow emits a top-level span and include the import for
trace_async at the top of the module.
- Around line 635-645: The code builds an MCP payload named "download_document"
(payload variable) which is not a registered tool and will fail at runtime;
replace the "name": "download_document" entry with
"add_dingtalk_doc_to_knowledge" and adjust the arguments to match that tool
(pass doc_id/doc_url and any required metadata/format), or alternatively
register/implement a new MCP tool named "download_document" that performs the
standalone download; refer to the existing DingTalk tools
get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment to align argument names and behavior.
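The URL-normalization fix from the first bullet above can be sketched like this; the exact path patterns DingTalk uses are assumptions based on the review comment:

```python
import re
from typing import Optional

# Patterns covering the docs/nodes forms plus the wiki form accepted upstream.
DOC_ID_PATTERNS = [
    r"/docs/([A-Za-z0-9]+)",
    r"/nodes/([A-Za-z0-9]+)",
    r"/wiki/([A-Za-z0-9]+)",
]


def extract_doc_id(doc_url: str) -> Optional[str]:
    """Return the document ID from any supported DingTalk URL shape."""
    for pattern in DOC_ID_PATTERNS:
        match = re.search(pattern, doc_url)
        if match:
            return match.group(1)
    return None
```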

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-217: The code builds a filename via
dingtalk_docs_service.build_filename (assigned to filename) but still passes the
original title into knowledge_orchestrator.create_document_with_content
(name=title), so the generated `{title}_{timestamp}.md` never gets persisted;
update the call to create_document_with_content to pass name=filename (and any
other filename-related parameter the function supports) instead of title so the
stored document uses the built filename from filename.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da173f5d-960b-4536-ad5d-3436cb712069

📥 Commits

Reviewing files that changed from the base of the PR and between bd58b05 and 1128176.

📒 Files selected for processing (7)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/mcp_provider_registry.py
  • backend/init_data/skills/dingtalk-connector/SKILL.md
  • backend/init_data/skills/dingtalk-connector/__init__.py
  • backend/init_data/skills/dingtalk-connector/provider.py
  • backend/tests/services/execution/test_request_builder_mcp.py
  • backend/tests/services/execution/test_request_builder_user_mcp.py
✅ Files skipped from review due to trivial changes (2)
  • backend/init_data/skills/dingtalk-connector/__init__.py
  • backend/init_data/skills/dingtalk-connector/SKILL.md

Comment thread backend/init_data/skills/dingtalk-connector/provider.py
Comment on lines +635 to +645
# Build MCP tool call payload for download_document
payload = {
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "download_document",
"arguments": {
"doc_id": doc_id,
"url": doc_url,
"format": "markdown",
},

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "Searching for MCP tools named download_document:"
rg -n --type=py 'name\s*=\s*"download_document"' backend/app/mcp_server/tools -C2 || true

echo
echo "Searching for knowledge-server DingTalk tools introduced in this PR:"
rg -n --type=py 'name\s*=\s*"get_dingtalk_document_info"|name\s*=\s*"add_dingtalk_doc_to_knowledge"|name\s*=\s*"add_dingtalk_doc_with_attachment"|build_mcp_tools_dict\(server="knowledge"\)' backend/app/mcp_server/tools backend/app/mcp_server/server.py -C1

Repository: wecode-ai/Wegent

Length of output: 1654


🏁 Script executed:

#!/bin/bash
echo "=== Search for download_document across entire codebase ==="
rg -n 'download_document' --type=py . 2>/dev/null | head -20

echo
echo "=== Examine context in provider.py around lines 635-645 ==="
cat -n backend/init_data/skills/dingtalk-connector/provider.py | sed -n '625,650p'

Repository: wecode-ai/Wegent

Length of output: 2095


🏁 Script executed:

#!/bin/bash
echo "=== Check for dynamic tool registration or aliases ==="
rg -n 'download_document|dynamic.*tool|register.*tool' backend/app/mcp_server/ --type=py | grep -i 'dynamic\|alias\|register' | head -10

echo
echo "=== Check add_dingtalk_doc_to_knowledge implementation ==="
sed -n '90,120p' backend/app/mcp_server/tools/dingtalk_docs.py

Repository: wecode-ai/Wegent

Length of output: 2668


download_document MCP tool does not exist and will cause runtime failure.

The code posts name: "download_document" to /mcp/knowledge/sse, but this tool is not registered. The only DingTalk-related knowledge tools available are get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment. The workflow will fail when this code path executes.

Consider using add_dingtalk_doc_to_knowledge (which handles document downloads) or create the missing download_document MCP tool if a standalone download operation is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 635 -
645, The code builds an MCP payload named "download_document" (payload variable)
which is not a registered tool and will fail at runtime; replace the "name":
"download_document" entry with "add_dingtalk_doc_to_knowledge" and adjust the
arguments to match that tool (pass doc_id/doc_url and any required
metadata/format), or alternatively register/implement a new MCP tool named
"download_document" that performs the standalone download; refer to the existing
DingTalk tools get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment to align argument names and behavior.
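
To make the suggested swap concrete, here is a hedged sketch of the replacement payload: the tool name changes from the unregistered download_document to the registered add_dingtalk_doc_to_knowledge, while the argument names (doc_url, knowledge_base_id) are assumptions that must be checked against the tool's actual schema.

```python
import json

# Hypothetical argument names -- verify against the registered signature of
# add_dingtalk_doc_to_knowledge before adopting this shape.
payload = {
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "add_dingtalk_doc_to_knowledge",  # registered, unlike download_document
        "arguments": {
            "doc_url": "https://alidocs.dingtalk.com/i/nodes/example",
            "knowledge_base_id": 1,
        },
    },
}

# Plain JSON-RPC 2.0, so it serializes directly into the POST body.
body = json.dumps(payload)
print(payload["params"]["name"])  # add_dingtalk_doc_to_knowledge
```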

…uments

1. Add source_config parameter to create_document_with_content:
   - Accepts optional source_config dict
   - Passes to KnowledgeDocumentCreate

2. Update add_dingtalk_doc_to_knowledge to:
   - Extract file_extension from DingTalk response (default: md)
   - Build source_config with url, source, and updated_at
   - Pass file_extension and source_config to document creation

source_config format:
{
    "url": "https://alidocs.dingtalk.com/i/nodes/...",
    "source": "dingtalk-connector",
    "updated_at": 1775705225000
}

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/knowledge/orchestrator.py (1)

1093-1109: ⚠️ Potential issue | 🟠 Major

Add tracing decorator to the modified sync orchestrator entrypoint.

create_document_with_content(...) was modified but still has no @trace_sync(...), so this new DingTalk ingestion path remains untraced in OTel.

As per coding guidelines, "Use @trace_sync(span_name, tracer_name, extract_attributes) decorator to trace entire sync functions in OpenTelemetry".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/knowledge/orchestrator.py` around lines 1093 - 1109, The
create_document_with_content sync entrypoint is missing the OpenTelemetry
tracing decorator; add `@trace_sync`(...) above the create_document_with_content
method, importing trace_sync if not already present, and configure span_name
(e.g., "knowledge.create_document_with_content"), tracer_name (matching existing
service tracer), and an extract_attributes callable that captures key attributes
such as knowledge_base_id, user.id, source_type, and trigger_indexing so the
entire ingestion path is traced; ensure the decorator is applied to the method
signature in orchestrator.py (create_document_with_content).
♻️ Duplicate comments (2)
backend/app/mcp_server/tools/dingtalk_docs.py (2)

203-204: ⚠️ Potential issue | 🟠 Major

Timestamped filename is computed but not used in persistence path.

Line 204 builds {title}_{timestamp}.md, but Line 227 still passes name=title, so the actual stored/uploaded name does not follow the advertised convention.

Suggested fix
         # Build filename according to naming convention
         filename = dingtalk_docs_service.build_filename(title, modified_time)
+        name_without_ext, ext = filename.rsplit(".", 1)

@@
         result = await asyncio.to_thread(
             knowledge_orchestrator.create_document_with_content,
             db=db,
             user=user,
             knowledge_base_id=knowledge_base_id,
-            name=title,
+            name=name_without_ext,
             source_type="text",
             content=doc_content,
-            file_extension=file_extension,
+            file_extension=ext,
             trigger_indexing=trigger_indexing,
             trigger_summary=trigger_summary,
             source_config=source_config,
         )

Also applies to: 222-234

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 203 - 204, The
timestamped filename created by dingtalk_docs_service.build_filename(title,
modified_time) is computed into the variable filename but never used; update the
persistence/upload call that currently passes name=title to instead pass
name=filename (i.e., use the filename variable), ensuring the stored/uploaded
document follows the {title}_{timestamp}.md convention; look for usages around
the call that sets filename and the subsequent persist/upload function where
name=title and replace that argument.
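
An illustrative stand-in for dingtalk_docs_service.build_filename (the real implementation is not shown in this thread) that produces the advertised {title}_{timestamp}.md convention and then splits the result into the name/extension pair the suggested diff passes to document creation:

```python
from datetime import datetime

def build_filename(title: str, modified: datetime) -> str:
    # Hypothetical helper; timestamp format from the PR description: YYYYMMDDHHMMSS
    return f"{title}_{modified.strftime('%Y%m%d%H%M%S')}.md"

filename = build_filename("产品需求文档", datetime(2026, 4, 13, 17, 9, 33))
print(filename)  # 产品需求文档_20260413170933.md

# The suggested fix then passes these two parts separately to the orchestrator.
name_without_ext, ext = filename.rsplit(".", 1)
```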

33-41: ⚠️ Potential issue | 🟠 Major

Trace decorators are still missing on all newly added MCP handlers.

get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment should be wrapped with @trace_async / @trace_sync to make this flow observable.

As per coding guidelines, "Use @trace_async(span_name, tracer_name, extract_attributes) decorator to trace entire async functions in OpenTelemetry" and "Use @trace_sync(span_name, tracer_name, extract_attributes) decorator to trace entire sync functions in OpenTelemetry".

Also applies to: 90-104, 254-266

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 33 - 41, The
three newly added MCP handler functions get_dingtalk_document_info,
add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment need
OpenTelemetry decorators; wrap each async handler with `@trace_async`(span_name,
tracer_name, extract_attributes) and any sync handler with `@trace_sync`(...) so
their execution is observable: add `@trace_async` for get_dingtalk_document_info
and for any other async functions (e.g., add_dingtalk_doc_with_attachment if
async), and use `@trace_sync` for synchronous handlers (e.g.,
add_dingtalk_doc_to_knowledge if sync), providing a clear span_name (e.g.,
"mcp.get_dingtalk_document_info"), the appropriate tracer_name, and an
extract_attributes lambda that pulls relevant params (like doc_url, node_id) to
ensure consistent tracing across these handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 317-323: The response currently returns the input attachment_id
which can be stale because the orchestrator
(backend/app/services/knowledge/orchestrator.py) copies the attachment and links
the document to a new attachment; update the return payload in the function that
constructs the response (the block using result, doc_title and attachment_id in
backend/app/mcp_server/tools/dingtalk_docs.py) to return the actual linked
attachment id from the created document (use result.attachment_id) instead of
the original input attachment_id; also ensure result has an attachment_id
attribute before returning (or fall back appropriately).
- Around line 222-234: You’re passing the SQLAlchemy Session object `db` and ORM
`user` into `asyncio.to_thread` which is unsafe; instead either (A) open/close a
new DB session inside the worker thread and pass only primitives, or (B) avoid
moving DB/ORM work into a thread by calling `await
knowledge_orchestrator.create_document_with_content(...)` directly. Concretely:
change the call site that uses
`asyncio.to_thread(knowledge_orchestrator.create_document_with_content, db=db,
user=user, ...)` to pass only primitive IDs/strings (e.g. `user_id=user.id`) and
have `knowledge_orchestrator.create_document_with_content` accept `user_id` and
create its own session inside the thread, or refactor
`create_document_with_content` to be async and call it without `to_thread`; do
not pass `db` or ORM `user` across the thread boundary.

In `@backend/app/services/knowledge/orchestrator.py`:
- Line 1108: The attachment import branch in orchestrator.py drops the
source_config parameter when constructing the KnowledgeDocumentCreate for
source_type="attachment", so metadata is lost; update the attachment flow where
KnowledgeDocumentCreate is built (the block that creates KnowledgeDocumentCreate
in the attachment branch) to accept and pass through the source_config
Optional[Dict[str, Any]] exactly like the non-attachment path (the later
persistence path that already includes source_config), ensuring the constructor
call includes source_config and any downstream calls persist it unchanged.

---

Outside diff comments:
In `@backend/app/services/knowledge/orchestrator.py`:
- Around line 1093-1109: The create_document_with_content sync entrypoint is
missing the OpenTelemetry tracing decorator; add `@trace_sync`(...) above the
create_document_with_content method, importing trace_sync if not already
present, and configure span_name (e.g.,
"knowledge.create_document_with_content"), tracer_name (matching existing
service tracer), and an extract_attributes callable that captures key attributes
such as knowledge_base_id, user.id, source_type, and trigger_indexing so the
entire ingestion path is traced; ensure the decorator is applied to the method
signature in orchestrator.py (create_document_with_content).
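
To make the requested decorator shape concrete, here is a minimal stand-in with the (span_name, tracer_name, extract_attributes) signature the guidelines describe — the real @trace_sync lives in the Wegent codebase and records OpenTelemetry spans rather than appending to a list.

```python
import functools
from typing import Any, Callable

RECORDED_SPANS: list[tuple[str, dict[str, Any]]] = []

def trace_sync(span_name: str, tracer_name: str,
               extract_attributes: Callable[..., dict[str, Any]]):
    """Stand-in for the project's @trace_sync; tracer_name is unused here."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The real decorator would open an OTel span; we just record its shape.
            RECORDED_SPANS.append((span_name, extract_attributes(*args, **kwargs)))
            return func(*args, **kwargs)
        return wrapper
    return decorator

@trace_sync(
    span_name="knowledge.create_document_with_content",
    tracer_name="knowledge",
    extract_attributes=lambda **kw: {
        "knowledge_base_id": kw.get("knowledge_base_id"),
        "source_type": kw.get("source_type"),
    },
)
def create_document_with_content(**kwargs: Any) -> str:
    # Illustrative body; the real method performs the ingestion.
    return f"doc in kb {kwargs['knowledge_base_id']}"

result = create_document_with_content(knowledge_base_id=7, source_type="text")
print(RECORDED_SPANS[0][0])  # knowledge.create_document_with_content
```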

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 203-204: The timestamped filename created by
dingtalk_docs_service.build_filename(title, modified_time) is computed into the
variable filename but never used; update the persistence/upload call that
currently passes name=title to instead pass name=filename (i.e., use the
filename variable), ensuring the stored/uploaded document follows the
{title}_{timestamp}.md convention; look for usages around the call that sets
filename and the subsequent persist/upload function where name=title and replace
that argument.
- Around line 33-41: The three newly added MCP handler functions
get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment need OpenTelemetry decorators; wrap each async
handler with `@trace_async`(span_name, tracer_name, extract_attributes) and any
sync handler with `@trace_sync`(...) so their execution is observable: add
`@trace_async` for get_dingtalk_document_info and for any other async functions
(e.g., add_dingtalk_doc_with_attachment if async), and use `@trace_sync` for
synchronous handlers (e.g., add_dingtalk_doc_to_knowledge if sync), providing a
clear span_name (e.g., "mcp.get_dingtalk_document_info"), the appropriate
tracer_name, and an extract_attributes lambda that pulls relevant params (like
doc_url, node_id) to ensure consistent tracing across these handlers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f9854a9-9fbf-4ad0-916b-52645591f987

📥 Commits

Reviewing files that changed from the base of the PR and between 1128176 and 78fbb84.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/knowledge/orchestrator.py

Comment thread backend/app/mcp_server/tools/dingtalk_docs.py Outdated
Comment on lines +317 to +323
return {
"success": True,
"document_id": result.id,
"document_name": result.name,
"attachment_id": attachment_id,
"message": f"Document '{doc_title}' added to knowledge base successfully",
}

⚠️ Potential issue | 🟡 Minor

Returned attachment_id may be inaccurate for the created document.

This response returns the input attachment_id, but the attachment source path creates a copied attachment in orchestrator (backend/app/services/knowledge/orchestrator.py, Line 1216-1231). Return result.attachment_id to reflect the actual linked attachment.

Suggested fix
         return {
             "success": True,
             "document_id": result.id,
             "document_name": result.name,
-            "attachment_id": attachment_id,
+            "attachment_id": result.attachment_id,
             "message": f"Document '{doc_title}' added to knowledge base successfully",
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 317 - 323, The
response currently returns the input attachment_id which can be stale because
the orchestrator (backend/app/services/knowledge/orchestrator.py) copies the
attachment and links the document to a new attachment; update the return payload
in the function that constructs the response (the block using result, doc_title
and attachment_id in backend/app/mcp_server/tools/dingtalk_docs.py) to return
the actual linked attachment id from the created document (use
result.attachment_id) instead of the original input attachment_id; also ensure
result has an attachment_id attribute before returning (or fall back
appropriately).

Comment thread backend/app/services/knowledge/orchestrator.py
Change filename format from {title}_{timestamp}.md to {title}.{file_extension}
to better align with DingTalk document info response structure.

The file_extension is now obtained directly from get_document_info response,
providing a cleaner and more intuitive filename format.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
backend/app/mcp_server/tools/dingtalk_docs.py (3)

140-142: ⚠️ Potential issue | 🔴 Critical

Do not pass db/user across asyncio.to_thread.

db (SQLAlchemy session) and user (ORM object) are created on the main thread and then used in a worker thread. This is thread-unsafe and can cause runtime/data consistency failures.

Proposed fix (keep blocking work off-loop, but create thread-local DB/ORM objects)
@@
-        result = await asyncio.to_thread(
-            knowledge_orchestrator.create_document_with_content,
-            db=db,
-            user=user,
-            knowledge_base_id=knowledge_base_id,
-            name=title,
-            source_type="text",
-            content=doc_content,
-            file_extension=file_extension,
-            trigger_indexing=trigger_indexing,
-            trigger_summary=trigger_summary,
-            source_config=source_config,
-        )
+        def _create_document_in_thread():
+            thread_db = SessionLocal()
+            try:
+                thread_user = _get_user_from_token(thread_db, token_info)
+                if not thread_user:
+                    raise ValueError("User not found")
+                return knowledge_orchestrator.create_document_with_content(
+                    db=thread_db,
+                    user=thread_user,
+                    knowledge_base_id=knowledge_base_id,
+                    name=title,
+                    source_type="text",
+                    content=doc_content,
+                    file_extension=file_extension,
+                    trigger_indexing=trigger_indexing,
+                    trigger_summary=trigger_summary,
+                    source_config=source_config,
+                )
+            finally:
+                thread_db.close()
+
+        result = await asyncio.to_thread(_create_document_in_thread)
#!/bin/bash
set -euo pipefail

echo "Check to_thread callsite and passed args:"
rg -n -C4 'asyncio\.to_thread\(' backend/app/mcp_server/tools/dingtalk_docs.py

echo
echo "Check orchestrator signature and DB/ORM expectations:"
rg -n -C10 'def create_document_with_content\(' backend/app/services/knowledge/orchestrator.py

Also applies to: 207-219

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 140 - 142, You
are passing a SessionLocal() instance and an ORM user object from the main
thread into asyncio.to_thread, which is thread-unsafe; instead, change the
to_thread callsite so it does not receive db or ORM objects—pass only primitives
(e.g., user id or token_info) and move SessionLocal() creation and the ORM query
(the equivalent of _get_user_from_token) into the worker thread (or open a new
session inside create_document_with_content / the threaded helper). Locate the
asyncio.to_thread usage in dingtalk_docs.py and ensure the worker function
creates its own SessionLocal(), re-queries the User by id (or re-validates
token), and closes the session; apply the same fix pattern to the similar
callsite around the create_document_with_content orchestration mentioned in the
comment.

302-307: ⚠️ Potential issue | 🟡 Minor

Return the created document’s attachment id, not the input id.

The response currently echoes the input attachment_id; this can become inaccurate if the create flow remaps/copies attachments.

Proposed fix
         return {
             "success": True,
             "document_id": result.id,
             "document_name": result.name,
-            "attachment_id": attachment_id,
+            "attachment_id": getattr(result, "attachment_id", attachment_id),
             "message": f"Document '{doc_title}' added to knowledge base successfully",
         }
#!/bin/bash
set -euo pipefail

echo "Inspect attachment handling in orchestrator:"
rg -n -C8 'attachment_id|source_type.*attachment|copy' backend/app/services/knowledge/orchestrator.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 302 - 307, The
response is echoing the input attachment_id instead of the attachment id
returned by the create flow; update the return dict in
backend/app/mcp_server/tools/dingtalk_docs.py to use the created attachment id
from the API result (e.g., use result.attachment_id or result.attachment.id or
result.attachments[0].id as provided by the create call) instead of the input
variable attachment_id, and add a safe fallback (None) if that field is missing
so the returned "attachment_id" reflects the actual stored attachment; adjust
tests or callers if they expect the old input value.
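
A sketch of the response-building fix, assuming the created document object exposes attachment_id (the getattr fallback covers result shapes without that field; CreatedDocument below is a hypothetical stand-in for the orchestrator's return value):

```python
from dataclasses import dataclass

@dataclass
class CreatedDocument:
    # Illustrative stand-in for what create_document_with_content returns.
    id: int
    name: str
    attachment_id: int

def build_response(result, doc_title: str) -> dict:
    return {
        "success": True,
        "document_id": result.id,
        "document_name": result.name,
        # Report the attachment the orchestrator actually linked, not the
        # caller-supplied id; fall back to None if the field is absent.
        "attachment_id": getattr(result, "attachment_id", None),
        "message": f"Document '{doc_title}' added to knowledge base successfully",
    }

resp = build_response(CreatedDocument(id=42, name="demo.md", attachment_id=99), "demo")
print(resp["attachment_id"])  # 99
```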

33-41: ⚠️ Potential issue | 🟠 Major

Add tracing decorators to all three MCP handlers.

These entrypoints are untraced, so the DingTalk import path is not visible in OpenTelemetry.

As per coding guidelines, "Use @trace_async(span_name, tracer_name, extract_attributes) decorator to trace entire async functions in OpenTelemetry" and "Use @trace_sync(span_name, tracer_name, extract_attributes) decorator to trace entire sync functions in OpenTelemetry".

Also applies to: 90-104, 239-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 33 - 41, Add
OpenTelemetry tracing decorators to the three MCP handler entrypoints in this
file: attach `@trace_async` to get_dingtalk_document_info and the two other
handler functions defined around the ranges referenced (lines ~90-104 and
~239-251); for each async handler use `@trace_async` with span_name set to a
descriptive value (e.g., "mcp.get_dingtalk_document_info" for
get_dingtalk_document_info), tracer_name "mcp", and an extract_attributes
callable that extracts the doc_url (or other relevant params) into attributes so
the DingTalk import path and request context are visible in traces; if any
handler is sync use `@trace_sync` with the same naming/attributes convention.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 11-12: The docs in backend/app/mcp_server/tools/dingtalk_docs.py
mention a stale naming convention `{title}_{timestamp}.md` but the code actually
builds filenames as `{title}.{file_extension}`; update the textual
descriptions/comments in all locations (including the occurrences referenced
around lines 11-12, 92-93, 120-122, 264-265) to reflect the current
implementation, i.e., change `{title}_{timestamp}.md` to
`{title}.{file_extension}` (or a wording like "saved as
{title}.{file_extension}") and ensure any examples or prose around functions
that construct filenames (e.g., the filename builder, save/upload helpers) are
consistent with the actual filename format.

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 140-142: You are passing a SessionLocal() instance and an ORM user
object from the main thread into asyncio.to_thread, which is thread-unsafe;
instead, change the to_thread callsite so it does not receive db or ORM
objects—pass only primitives (e.g., user id or token_info) and move
SessionLocal() creation and the ORM query (the equivalent of
_get_user_from_token) into the worker thread (or open a new session inside
create_document_with_content / the threaded helper). Locate the
asyncio.to_thread usage in dingtalk_docs.py and ensure the worker function
creates its own SessionLocal(), re-queries the User by id (or re-validates
token), and closes the session; apply the same fix pattern to the similar
callsite around the create_document_with_content orchestration mentioned in the
comment.
- Around line 302-307: The response is echoing the input attachment_id instead
of the attachment id returned by the create flow; update the return dict in
backend/app/mcp_server/tools/dingtalk_docs.py to use the created attachment id
from the API result (e.g., use result.attachment_id or result.attachment.id or
result.attachments[0].id as provided by the create call) instead of the input
variable attachment_id, and add a safe fallback (None) if that field is missing
so the returned "attachment_id" reflects the actual stored attachment; adjust
tests or callers if they expect the old input value.
- Around line 33-41: Add OpenTelemetry tracing decorators to the three MCP
handler entrypoints in this file: attach `@trace_async` to
get_dingtalk_document_info and the two other handler functions defined around
the ranges referenced (lines ~90-104 and ~239-251); for each async handler use
`@trace_async` with span_name set to a descriptive value (e.g.,
"mcp.get_dingtalk_document_info" for get_dingtalk_document_info), tracer_name
"mcp", and an extract_attributes callable that extracts the doc_url (or other
relevant params) into attributes so the DingTalk import path and request context
are visible in traces; if any handler is sync use `@trace_sync` with the same
naming/attributes convention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a41dae1-1bac-41e6-9f88-456d717f0ae1

📥 Commits

Reviewing files that changed from the base of the PR and between 78fbb84 and 6ea239c.

📒 Files selected for processing (1)
  • backend/app/mcp_server/tools/dingtalk_docs.py

Comment thread backend/app/mcp_server/tools/dingtalk_docs.py Outdated
…up code

- Remove unused download_document method from DingTalkDocsService
- Remove unused imports and constants from docs_service.py
- Remove unused BaseModel import from dingtalk __init__.py
- Clean up duplicate import in dingtalk_docs.py MCP tools
- Update SKILL.md description to reflect actual workflow
- Remove unused _download_document_content method from provider.py

All changes are code cleanup with no functional impact.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (6)
backend/init_data/skills/dingtalk-connector/provider.py (2)

598-614: ⚠️ Potential issue | 🟠 Major

Handle /wiki/ URLs in the download step too.

_get_document_info_from_mcp() already accepts DingTalk wiki links, but _download_document_content_real() only extracts IDs from /nodes/ and /docs/. A valid wiki URL will therefore pass step 2 and fail immediately at step 3 with “Could not extract document ID”.

Minimal fix
             patterns = [
                 r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
                 r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
+                r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
             ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 598 -
614, _download_document_content_real() fails to extract IDs for DingTalk wiki
links even though _get_document_info_from_mcp() accepts them; update the
ID-extraction logic in _download_document_content_real() to include the wiki URL
pattern used elsewhere (e.g., add a regex covering "/wiki/" URLs similar to the
patterns for "/i/nodes/" and "/i/team/.../docs/") so that doc_id is correctly
set for wiki links and the function no longer returns the "Could not extract
document ID" error.
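
The three-pattern extraction the fix describes can be sketched as follows (the exact /wiki/ path shape is an assumption mirrored from the review's minimal fix, not confirmed against DingTalk's URL scheme):

```python
import re
from typing import Optional

# The /wiki/ entry is the addition the review asks for; path shapes are assumed.
DOC_ID_PATTERNS = [
    r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
    r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
    r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
]

def extract_doc_id(url: str) -> Optional[str]:
    """Return the first captured document ID, or None if no pattern matches."""
    for pattern in DOC_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None

print(extract_doc_id("https://alidocs.dingtalk.com/i/team/abc/wiki/XyZ_123"))  # XyZ_123
```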

631-644: ⚠️ Potential issue | 🔴 Critical

download_document is not a registered MCP tool.

The knowledge server registers the decorated tools from backend/app/mcp_server/tools/dingtalk_docs.py, and that file exposes get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment — not download_document. This request will fail as soon as the content-download step runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 631 -
644, The payload is calling an unregistered MCP tool "download_document";
replace it with a registered tool (e.g., "add_dingtalk_doc_with_attachment" or
"get_dingtalk_document_info") or register a new tool. Update the "name" field in
the payload from "download_document" to one of the existing exported tools
(suggest "add_dingtalk_doc_with_attachment" if you intend to upload the
downloaded content into knowledge) and ensure the arguments match that tool's
signature (use doc_id, url and any expected keys), or alternatively
implement/register a new tool named "download_document" in
backend/app/mcp_server/tools/dingtalk_docs.py if that behavior is required.
backend/app/services/dingtalk/docs_service.py (2)

158-172: ⚠️ Potential issue | 🟡 Minor

Chain the original HTTPStatusError when re-raising.

Both except httpx.HTTPStatusError as e blocks discard the original traceback by raising a fresh ValueError without from e, which makes MCP/download failures much harder to debug.

Also applies to: 414-427

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 158 - 172, The
except blocks catching httpx.HTTPStatusError (e) in docs_service.py currently
re-raise new ValueError instances and discard the original exception context;
update each re-raise in the HTTPStatusError handlers (the block shown and the
similar block around lines 414-427) to chain the original exception by using
"raise ValueError(... ) from e" so the original HTTPStatusError traceback is
preserved; ensure you apply this to all branches (AUTH_ERROR_CODES branch, 406
branch, and the generic failure branch) that construct and raise ValueError.
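
The `raise ... from e` pattern the fix asks for looks like this; HTTPStatusError below is a stand-in class so the sketch runs without httpx, but the chaining behavior is identical with the real exception.

```python
class HTTPStatusError(Exception):
    """Stand-in for httpx.HTTPStatusError so the sketch runs without httpx."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def fetch_document_info(doc_id: str) -> dict:
    try:
        raise HTTPStatusError(403)  # simulate an upstream auth failure
    except HTTPStatusError as e:
        # "from e" preserves the original exception as __cause__, so
        # MCP/download failures keep their full traceback.
        raise ValueError(f"DingTalk request failed for {doc_id}") from e

try:
    fetch_document_info("abc123")
except ValueError as err:
    print(err.__cause__.status_code)  # 403
```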

272-277: ⚠️ Potential issue | 🟡 Minor

Reject or honor unsupported export_format values.

export_format is part of the public API here, but the method never validates it or passes it downstream. A caller asking for html or txt currently gets whatever the MCP returns, while file_extension is still inferred from the DingTalk metadata instead of the requested format.

Also applies to: 339-412

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 272 - 277,
download_document_content currently accepts an export_format but never validates
or enforces it; update it (and the comparable methods in the 339-412 range) to
validate export_format against an allowed set (e.g.,
{"markdown","html","txt","pdf"} or the formats supported by the MCP), raise a
clear exception for unsupported values, and ensure the validated format is
propagated to downstream requests and used to derive file_extension (instead of
relying on DingTalk metadata); reference the download_document_content function
name and the downstream call(s) that send export_format/file_extension so you
can locate where to inject validation, mapping, and the error raise.
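
A minimal sketch of the validation-and-mapping step, assuming a format set of markdown/html/txt (the formats the MCP actually supports should be confirmed before hardcoding this table):

```python
# Assumed supported formats, each mapped to the extension it should produce.
SUPPORTED_FORMATS = {"markdown": "md", "html": "html", "txt": "txt"}

def resolve_export_format(export_format: str) -> str:
    """Validate export_format and return the file extension it implies."""
    if export_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported export_format '{export_format}'; "
            f"expected one of {sorted(SUPPORTED_FORMATS)}"
        )
    return SUPPORTED_FORMATS[export_format]

print(resolve_export_format("markdown"))  # md
```

Deriving file_extension from the validated format (rather than from DingTalk metadata) keeps the stored file consistent with what the caller asked for.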
backend/app/mcp_server/tools/dingtalk_docs.py (2)

300-305: ⚠️ Potential issue | 🟡 Minor

Return the attachment actually linked to the new document.

Echoing the input attachment_id here can be stale if the orchestrator copies or rebinds the attachment during document creation. Return result.attachment_id so callers see the attachment that the new knowledge document actually references.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 300 - 305, The
returned payload currently echoes the input attachment_id which can be stale;
update the return dict in the function that builds the response (the block using
variables result, doc_title, attachment_id) to return the attachment actually
linked to the created document by using result.attachment_id for the
"attachment_id" field instead of the input attachment_id so callers receive the
real attachment bound to the new document.

205-217: ⚠️ Potential issue | 🔴 Critical

Keep the SQLAlchemy session and ORM object on the same thread.

db is created at Line 136 and user is loaded from it before the asyncio.to_thread() call, but both are then used inside the worker thread. SQLAlchemy sessions and ORM instances are not thread-safe, so this can break once create_document_with_content() touches the session in that background thread.

Safer pattern
-        result = await asyncio.to_thread(
-            knowledge_orchestrator.create_document_with_content,
-            db=db,
-            user=user,
-            knowledge_base_id=knowledge_base_id,
-            name=title,
-            source_type="text",
-            content=doc_content,
-            file_extension=file_extension,
-            trigger_indexing=trigger_indexing,
-            trigger_summary=trigger_summary,
-            source_config=source_config,
-        )
+        def _create_document() -> Any:
+            thread_db = SessionLocal()
+            try:
+                thread_user = _get_user_from_token(thread_db, token_info)
+                if not thread_user:
+                    raise ValueError("User not found")
+                return knowledge_orchestrator.create_document_with_content(
+                    db=thread_db,
+                    user=thread_user,
+                    knowledge_base_id=knowledge_base_id,
+                    name=title,
+                    source_type="text",
+                    content=doc_content,
+                    file_extension=file_extension,
+                    trigger_indexing=trigger_indexing,
+                    trigger_summary=trigger_summary,
+                    source_config=source_config,
+                )
+            finally:
+                thread_db.close()
+
+        result = await asyncio.to_thread(_create_document)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 205 - 217, The
code passes a SQLAlchemy Session object `db` and an ORM `user` into
`asyncio.to_thread()` where
`knowledge_orchestrator.create_document_with_content` runs; SQLAlchemy
sessions/ORM instances are not thread-safe—either perform all DB/ORM work on the
current async thread (call `create_document_with_content` directly without
`asyncio.to_thread`) or pass only primitive data into the worker and open a new
Session there. Concretely, extract any IDs/primitive fields needed from `user`
(e.g., `user.id`) and other DB-derived values before calling
`asyncio.to_thread`, or move the call out of `asyncio.to_thread` and `await
knowledge_orchestrator.create_document_with_content(db=..., user=...)` on the
original thread; alternatively, modify `create_document_with_content` to accept
primitives and create its own session inside the worker thread.
🧹 Nitpick comments (1)
backend/app/services/dingtalk/docs_service.py (1)

207-211: Trace the new async service entrypoints.

get_document_info() and download_document_content() are the backend entrypoints for this feature, but neither emits a span today, so DingTalk fetch/download failures will be missing from the trace tree.

As per coding guidelines, "Use @trace_async(span_name, tracer_name, extract_attributes) decorator to trace entire async functions in OpenTelemetry".

Also applies to: 272-277

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/dingtalk/docs_service.py` around lines 207 - 211, Add
OpenTelemetry tracing to the async entrypoints by importing and applying the
trace_async decorator: add `@trace_async`("dingtalk.get_document_info",
"dingtalk", extract_attributes=...) above the async def get_document_info(...)
and add `@trace_async`("dingtalk.download_document_content", "dingtalk",
extract_attributes=...) above async def download_document_content(...); ensure
trace_async is imported from the tracing/observability helper used elsewhere in
the repo and have the extract_attributes include relevant params (e.g., doc_url
and user_preferences for get_document_info, doc_url for
download_document_content) so failures appear in the trace tree.
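The decorator shape the guideline describes can be illustrated with a self-contained stub; the real `trace_async` lives in the repo's telemetry helpers and would open an OpenTelemetry span, so everything below is an approximation of the call pattern, not the actual implementation:

```python
import asyncio
import functools


def trace_async(span_name, tracer_name, extract_attributes=None):
    """Stubbed stand-in for the repo's telemetry decorator (illustration only)."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            # A real implementation would start an OpenTelemetry span named
            # span_name under tracer_name; attrs is computed here only to show
            # where the attributes would be extracted and attached.
            attrs = extract_attributes(*args, **kwargs) if extract_attributes else {}
            try:
                return await fn(*args, **kwargs)
            finally:
                pass  # the real decorator would end the span / record exceptions here
        return wrapper
    return decorator


@trace_async(
    "dingtalk.get_document_info",
    "dingtalk",
    extract_attributes=lambda doc_url: {"doc_url": doc_url},
)
async def get_document_info(doc_url: str) -> dict:
    return {"url": doc_url}
```

With the real decorator applied this way, a failing DingTalk fetch would surface as a failed span in the trace tree instead of disappearing silently.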
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 317-320: The code constructs file_path from a user-supplied title
via _build_filename(title, file_extension) which allows path traversal; sanitize
the title first (e.g., strip/replace path separators and disallow .., or use a
safe helper like os.path.basename or a slug/safe-filename function) before
calling _build_filename, then build the path with os.path.join(base_dir,
filename) and validate with os.path.realpath or pathlib.Path.resolve() that the
resulting path is inside the intended base directory; if validation fails,
reject the title or raise an error.

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 300-305: The returned payload currently echoes the input
attachment_id which can be stale; update the return dict in the function that
builds the response (the block using variables result, doc_title, attachment_id)
to return the attachment actually linked to the created document by using
result.attachment_id for the "attachment_id" field instead of the input
attachment_id so callers receive the real attachment bound to the new document.
- Around line 205-217: The code passes a SQLAlchemy Session object `db` and an
ORM `user` into `asyncio.to_thread()` where
`knowledge_orchestrator.create_document_with_content` runs; SQLAlchemy
sessions/ORM instances are not thread-safe—either perform all DB/ORM work on the
current async thread (call `create_document_with_content` directly without
`asyncio.to_thread`) or pass only primitive data into the worker and open a new
Session there. Concretely, extract any IDs/primitive fields needed from `user`
(e.g., `user.id`) and other DB-derived values before calling
`asyncio.to_thread`, or move the call out of `asyncio.to_thread` and `await
knowledge_orchestrator.create_document_with_content(db=..., user=...)` on the
original thread; alternatively, modify `create_document_with_content` to accept
primitives and create its own session inside the worker thread.

In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 158-172: The except blocks catching httpx.HTTPStatusError (e) in
docs_service.py currently re-raise new ValueError instances and discard the
original exception context; update each re-raise in the HTTPStatusError handlers
(the block shown and the similar block around lines 414-427) to chain the
original exception by using "raise ValueError(... ) from e" so the original
HTTPStatusError traceback is preserved; ensure you apply this to all branches
(AUTH_ERROR_CODES branch, 406 branch, and the generic failure branch) that
construct and raise ValueError.
- Around line 272-277: download_document_content currently accepts an
export_format but never validates or enforces it; update it (and the comparable
methods in the 339-412 range) to validate export_format against an allowed set
(e.g., {"markdown","html","txt","pdf"} or the formats supported by the MCP),
raise a clear exception for unsupported values, and ensure the validated format
is propagated to downstream requests and used to derive file_extension (instead
of relying on DingTalk metadata); reference the download_document_content
function name and the downstream call(s) that send export_format/file_extension
so you can locate where to inject validation, mapping, and the error raise.

In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 598-614: _download_document_content_real() fails to extract IDs
for DingTalk wiki links even though _get_document_info_from_mcp() accepts them;
update the ID-extraction logic in _download_document_content_real() to include
the wiki URL pattern used elsewhere (e.g., add a regex covering "/wiki/" URLs
similar to the patterns for "/i/nodes/" and "/i/team/.../docs/") so that doc_id
is correctly set for wiki links and the function no longer returns the "Could
not extract document ID" error.
- Around line 631-644: The payload is calling an unregistered MCP tool
"download_document"; replace it with a registered tool (e.g.,
"add_dingtalk_doc_with_attachment" or "get_dingtalk_document_info") or register
a new tool. Update the "name" field in the payload from "download_document" to
one of the existing exported tools (suggest "add_dingtalk_doc_with_attachment"
if you intend to upload the downloaded content into knowledge) and ensure the
arguments match that tool's signature (use doc_id, url and any expected keys),
or alternatively implement/register a new tool named "download_document" in
backend/app/mcp_server/tools/dingtalk_docs.py if that behavior is required.
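The `raise ... from e` guidance for the `docs_service.py` handlers above can be shown with a minimal, self-contained example; the exception class and messages are illustrative stand-ins, not the service's real code:

```python
class HTTPStatusError(Exception):
    """Stand-in for httpx.HTTPStatusError so the example is self-contained."""


def fetch_document():
    try:
        raise HTTPStatusError("403 Forbidden")
    except HTTPStatusError as e:
        # "from e" preserves the original exception as __cause__, so the
        # underlying HTTP traceback survives in logs instead of being discarded.
        raise ValueError("DingTalk document fetch failed") from e
```

Applying the same `from e` suffix in each of the three branches (auth error, 406, generic failure) keeps the original `HTTPStatusError` chained to every `ValueError` the service raises.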

---

Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 207-211: Add OpenTelemetry tracing to the async entrypoints by
importing and applying the trace_async decorator: add
`@trace_async`("dingtalk.get_document_info", "dingtalk", extract_attributes=...)
above the async def get_document_info(...) and add
`@trace_async`("dingtalk.download_document_content", "dingtalk",
extract_attributes=...) above async def download_document_content(...); ensure
trace_async is imported from the tracing/observability helper used elsewhere in
the repo and have the extract_attributes include relevant params (e.g., doc_url
and user_preferences for get_document_info, doc_url for
download_document_content) so failures appear in the trace tree.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 587d57d3-0dbb-45ad-b12c-c8dae808a360

📥 Commits

Reviewing files that changed from the base of the PR and between 6ea239c and 5b25f86.

📒 Files selected for processing (5)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/dingtalk/__init__.py
  • backend/app/services/dingtalk/docs_service.py
  • backend/init_data/skills/dingtalk-connector/SKILL.md
  • backend/init_data/skills/dingtalk-connector/provider.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/services/dingtalk/__init__.py
  • backend/init_data/skills/dingtalk-connector/SKILL.md

Comment on lines +317 to +320
from app.services.knowledge.orchestrator import _build_filename

filename = _build_filename(title, file_extension)
file_path = f"/home/user/{filename}"
⚠️ Potential issue | 🟠 Major

Sanitize the title before turning it into a sandbox path.

_build_filename() in backend/app/services/knowledge/orchestrator.py:84-91 only appends the extension, so a title like ../../tmp/payload produces /home/user/../../tmp/payload.md here. That lets a user-supplied or remote DingTalk title escape the intended workspace path.

Suggested hardening
+import re
@@
-            filename = _build_filename(title, file_extension)
+            safe_title = re.sub(r"[\\/]+", "_", title).strip().strip(".")
+            if not safe_title:
+                error_msg = "Document title produced an invalid filename"
+                logger.error(f"[DingTalkDocToKBTool] {error_msg}")
+                result = self._format_error(error_message=error_msg)
+                await self._emit_tool_status("failed", error_msg)
+                return result
+
+            filename = _build_filename(safe_title, file_extension)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 317 -
320, The code constructs file_path from a user-supplied title via
_build_filename(title, file_extension) which allows path traversal; sanitize the
title first (e.g., strip/replace path separators and disallow .., or use a safe
helper like os.path.basename or a slug/safe-filename function) before calling
_build_filename, then build the path with os.path.join(base_dir, filename) and
validate with os.path.realpath or pathlib.Path.resolve() that the resulting path
is inside the intended base directory; if validation fails, reject the title or
raise an error.

- Add trace_async/trace_sync decorators to MCP handlers in dingtalk_docs.py
- Add trace_async decorator to dingtalk-connector provider _arun method
- Fix URL extraction in _download_document_content_real to support wiki links
- Add new download_dingtalk_document MCP tool for skill content download
- Fix filename usage in add_dingtalk_doc_to_knowledge to use built filename
- Fix MCP payload tool name from non-existent download_document to download_dingtalk_document
@parabala
Collaborator Author

@coderabbitai resume


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
backend/init_data/skills/dingtalk-connector/provider.py (3)

62-79: Use raise ... from err for better exception chaining.

When re-raising an exception within an except block, use raise ... from err (or raise ... from None) to preserve the exception chain and distinguish the new exception from errors in exception handling.

♻️ Suggested fix
 except ImportError:
     # Try absolute import (for dynamic loading)
     import sys
 
     # Get the package name dynamically
     package_name = __name__.rsplit(".", 1)[0]
     _base_module = sys.modules.get(f"{package_name}._base")
     if _base_module:
         BaseSandboxTool = _base_module.BaseSandboxTool
     else:
-        raise ImportError(
+        raise ImportError(
             "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 62 -
79, The ImportError re-raise in the fallback import path should chain the
original exception; in the except ImportError block that constructs package_name
and looks up _base_module (and ultimately raises ImportError("Cannot import
BaseSandboxTool from chat_shell.tools.sandbox._base")), change the re-raise to
preserve the original error by using "raise ImportError(... ) from err" (or from
the caught exception variable) so the original ImportError is chained to the new
one when BaseSandboxTool cannot be resolved; locate the try/except around the
import of BaseSandboxTool and update the final raise to use exception chaining.

554-598: Remove or deprecate dead placeholder code.

_download_document_content() is never called — the workflow uses _download_document_content_real() instead. Keeping this placeholder creates confusion about which method is the actual implementation.

♻️ Suggested action

Either remove _download_document_content entirely, or rename _download_document_content_real to _download_document_content and delete the placeholder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 554 -
598, The placeholder method _download_document_content is dead code and
conflicts with the real implementation _download_document_content_real; either
remove _download_document_content entirely, or move/rename the real
implementation by renaming _download_document_content_real to
_download_document_content and delete the placeholder, then update any internal
references to use the canonical name so only the real implementation remains;
ensure imports/uses refer to the chosen name and run tests to confirm no
references remain to the removed function.

540-550: Add logging to silent exception handlers.

Multiple try-except-pass blocks silently swallow exceptions during temp file cleanup (lines 548-549, 728-729, 897-898, 908-909). While cleanup failures may be non-critical, logging at DEBUG level helps diagnose sandbox issues.

♻️ Example fix for one instance
             finally:
                 # Clean up temp file
                 try:
                     await sandbox.commands.run(
                         cmd=f"rm -f {payload_file}",
                         cwd="/home/user",
                         timeout=10,
                     )
-                except Exception:
-                    pass
+                except Exception as cleanup_err:
+                    logger.debug(f"[DingTalkDocToKBTool] Temp file cleanup failed: {cleanup_err}")

Also applies to: 720-730, 890-910

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 540 -
550, The cleanup code currently swallows exceptions in multiple
finally/except-pass blocks around sandbox.commands.run (e.g., the rm -f
{payload_file} call and similar temp-file cleanup calls), so add logging at
DEBUG level in those except handlers: create/get a module logger (e.g., logger =
logging.getLogger(__name__)) and replace the bare "except Exception: pass" with
a handler that logs a clear message including the resource name (payload_file or
other temp filename) and the caught exception (use exc_info=True or include
traceback). Apply the same change to all similar blocks that silently pass (the
cleanup calls around sandbox.commands.run and temp file removals referenced in
the review).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 704-711: The current return only propagates "content" from the MCP
tool result, causing callers to default the file extension to "md"; update the
handler (the block that processes the result of download_dingtalk_document / the
tool_result handling in provider.py) to also extract file_extension from
tool_result (e.g., tool_result.get("file_extension")) and include it in the
returned dict (e.g., return {"success": True, "content": content,
"file_extension": file_extension}), and when missing, supply a sensible default
or None so the caller no longer always falls back to ".md".

---

Nitpick comments:
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 62-79: The ImportError re-raise in the fallback import path should
chain the original exception; in the except ImportError block that constructs
package_name and looks up _base_module (and ultimately raises
ImportError("Cannot import BaseSandboxTool from
chat_shell.tools.sandbox._base")), change the re-raise to preserve the original
error by using "raise ImportError(... ) from err" (or from the caught exception
variable) so the original ImportError is chained to the new one when
BaseSandboxTool cannot be resolved; locate the try/except around the import of
BaseSandboxTool and update the final raise to use exception chaining.
- Around line 554-598: The placeholder method _download_document_content is dead
code and conflicts with the real implementation _download_document_content_real;
either remove _download_document_content entirely, or move/rename the real
implementation by renaming _download_document_content_real to
_download_document_content and delete the placeholder, then update any internal
references to use the canonical name so only the real implementation remains;
ensure imports/uses refer to the chosen name and run tests to confirm no
references remain to the removed function.
- Around line 540-550: The cleanup code currently swallows exceptions in
multiple finally/except-pass blocks around sandbox.commands.run (e.g., the rm -f
{payload_file} call and similar temp-file cleanup calls), so add logging at
DEBUG level in those except handlers: create/get a module logger (e.g., logger =
logging.getLogger(__name__)) and replace the bare "except Exception: pass" with
a handler that logs a clear message including the resource name (payload_file or
other temp filename) and the caught exception (use exc_info=True or include
traceback). Apply the same change to all similar blocks that silently pass (the
cleanup calls around sandbox.commands.run and temp file removals referenced in
the review).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fbc39a5f-86e1-4f12-929e-d4badf7a88e0

📥 Commits

Reviewing files that changed from the base of the PR and between 5b25f86 and c534204.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/init_data/skills/dingtalk-connector/provider.py

Comment thread backend/init_data/skills/dingtalk-connector/provider.py
@coderabbitai

coderabbitai Bot commented Apr 15, 2026

✅ Actions performed

Reviews resumed.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
backend/init_data/skills/dingtalk-connector/provider.py (3)

62-79: Use raise ... from None for cleaner exception chaining.

When raising a new ImportError after failing to find the module dynamically, use raise ... from None to suppress the confusing chained exception context.

♻️ Suggested fix
     if _base_module:
         BaseSandboxTool = _base_module.BaseSandboxTool
     else:
-        raise ImportError(
+        raise ImportError(
             "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 62 -
79, The ImportError raised when _base_module is not found should suppress
chained exception context; update the else branch where you raise
ImportError("Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base")
to use "raise ImportError(... ) from None" so the new ImportError is raised
without showing the previous exception context—make this change in the block
that resolves BaseSandboxTool (variables: package_name, _base_module,
BaseSandboxTool).

554-598: Remove unused placeholder method.

_download_document_content returns hardcoded placeholder content and is not called anywhere—the actual implementation is _download_document_content_real (used at line 309). Remove this dead code to avoid confusion.

🗑️ Remove dead code
-    async def _download_document_content(self, sandbox: Any, doc_url: str) -> dict:
-        """Download document content from DingTalk.
-
-        For now, this returns placeholder content.
-        In production, this would call the DingTalk API via MCP.
-        """
-        try:
-            # Extract doc ID for content generation
-            import re
-
-            doc_id = None
-            patterns = [
-                r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
-                r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
-            ]
-
-            for pattern in patterns:
-                match = re.search(pattern, doc_url)
-                if match:
-                    doc_id = match.group(1)
-                    break
-
-            # Generate placeholder content
-            content = f"""# DingTalk Document
-
-This document was imported from DingTalk.
-
-**Source URL:** {doc_url}
-**Document ID:** {doc_id or "unknown"}
-**Imported at:** {datetime.now().isoformat()}
-
-## Content
-
-The actual content would be fetched from DingTalk API in production.
-This is a placeholder for the document content.
-
----
-*Imported by Wegent DingTalk Docs Skill*
-"""
-
-            return {"success": True, "content": content}
-
-        except Exception as e:
-            return {"success": False, "error": str(e)}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 554 -
598, Remove the unused placeholder function _download_document_content from the
DingTalk provider (it returns hardcoded content and is not referenced); locate
and delete the async def _download_document_content(self, sandbox: Any, doc_url:
str) block, ensuring only the real implementation
_download_document_content_real remains and that no other code references the
removed symbol (search for _download_document_content to confirm).

418-553: Extract duplicated URL parsing logic into a helper.

The URL pattern matching to extract doc_id is duplicated in _get_document_info_from_mcp (lines 428-438), _download_document_content (lines 565-574), and _download_document_content_real (lines 609-619). Extract this into a private helper method to reduce duplication and ensure consistent pattern support.

♻️ Example helper extraction
def _extract_doc_id_from_url(self, doc_url: str) -> Optional[str]:
    """Extract document ID from DingTalk URL.
    
    Supports nodes, docs, and wiki URL formats.
    """
    import re
    
    patterns = [
        r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
        r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
        r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
    ]
    
    for pattern in patterns:
        match = re.search(pattern, doc_url)
        if match:
            return match.group(1)
    return None

Then replace the duplicated pattern matching in each method with:

doc_id = self._extract_doc_id_from_url(doc_url)
if not doc_id:
    return {"success": False, "error": f"Could not extract document ID from URL: {doc_url}"}

Also applies to: 599-733

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 418 -
553, The URL extraction regex is duplicated across _get_document_info_from_mcp,
_download_document_content, and _download_document_content_real; add a private
helper method (e.g., _extract_doc_id_from_url(self, doc_url: str) ->
Optional[str]) that contains the three regex patterns and returns the matched
group or None, then replace each local regex block with a call to this helper
and preserve the existing error handling (returning {"success": False, "error":
...} when None); update the three methods to use the helper to ensure consistent
behavior and remove the duplicated imports/usages of re and repeated pattern
lists.
backend/app/mcp_server/tools/dingtalk_docs.py (1)

200-212: Move imports to module level.

_build_filename (line 201) and asyncio (line 212) are imported inside the function. Move these to the top of the file for consistency and to avoid repeated import overhead.

Additionally, _build_filename is a private function (underscore prefix). Consider whether this should be exposed as a public API or if the filename building logic should be duplicated here to avoid depending on private internals that may change.

♻️ Suggested refactor
 import logging
+import asyncio
 from typing import Any, Dict, Optional
 
 from app.db.session import SessionLocal
 from app.mcp_server.auth import TaskTokenInfo
 from app.mcp_server.tools.decorator import build_mcp_tools_dict, mcp_tool
 from app.mcp_server.tools.knowledge import _get_user_from_token
 from app.services.dingtalk.docs_service import dingtalk_docs_service
-from app.services.knowledge.orchestrator import knowledge_orchestrator
+from app.services.knowledge.orchestrator import knowledge_orchestrator, _build_filename
 from shared.telemetry.decorators import trace_async, trace_sync

Then remove the inline imports at lines 201 and 212.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 200 - 212, Move
the inline imports to module level: add "from
app.services.knowledge.orchestrator import _build_filename" and "import asyncio"
at the top of the module and remove the in-function imports around the calls to
_build_filename and asyncio; also evaluate whether to replace usage of the
private helper _build_filename (underscore-prefixed) by promoting it to a public
function (rename/export) or by duplicating the filename-building logic locally
so the module does not depend on a private internal API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 200-212: Move the inline imports to module level: add "from
app.services.knowledge.orchestrator import _build_filename" and "import asyncio"
at the top of the module and remove the in-function imports around the calls to
_build_filename and asyncio; also evaluate whether to replace usage of the
private helper _build_filename (underscore-prefixed) by promoting it to a public
function (rename/export) or by duplicating the filename-building logic locally
so the module does not depend on a private internal API.

In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 62-79: The ImportError raised when _base_module is not found
should suppress chained exception context; update the else branch where you
raise ImportError("Cannot import BaseSandboxTool from
chat_shell.tools.sandbox._base") to use "raise ImportError(... ) from None" so
the new ImportError is raised without showing the previous exception
context—make this change in the block that resolves BaseSandboxTool (variables:
package_name, _base_module, BaseSandboxTool).
- Around line 554-598: Remove the unused placeholder function
_download_document_content from the DingTalk provider (it returns hardcoded
content and is not referenced); locate and delete the async def
_download_document_content(self, sandbox: Any, doc_url: str) block, ensuring
only the real implementation _download_document_content_real remains and that no
other code references the removed symbol (search for _download_document_content
to confirm).
- Around line 418-553: The URL extraction regex is duplicated across
_get_document_info_from_mcp, _download_document_content, and
_download_document_content_real; add a private helper method (e.g.,
_extract_doc_id_from_url(self, doc_url: str) -> Optional[str]) that contains the
three regex patterns and returns the matched group or None, then replace each
local regex block with a call to this helper and preserve the existing error
handling (returning {"success": False, "error": ...} when None); update the
three methods to use the helper to ensure consistent behavior and remove the
duplicated imports/usages of re and repeated pattern lists.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ff6ace5-a948-4e1e-891d-a0f8b4dcba91

📥 Commits

Reviewing files that changed from the base of the PR and between 5b25f86 and c534204.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/init_data/skills/dingtalk-connector/provider.py

Remove asyncio.to_thread() call that was passing SQLAlchemy Session and ORM
User objects across thread boundaries. SQLAlchemy objects are not thread-safe
and should not be passed between threads.

Instead, call create_document_with_content() directly since it's a synchronous
function that executes database operations within the same thread context where
the Session was created.

This aligns with the pattern used in other MCP tools (knowledge.py) which also
call orchestrator methods directly without using to_thread.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)

395-400: ⚠️ Potential issue | 🟡 Minor

Return the actual linked attachment id from the created document.

This response echoes the input attachment_id, which may differ from the attachment actually linked by orchestrator.

Suggested fix
         return {
             "success": True,
             "document_id": result.id,
             "document_name": result.name,
-            "attachment_id": attachment_id,
+            "attachment_id": (
+                result.attachment_id
+                if getattr(result, "attachment_id", None) is not None
+                else attachment_id
+            ),
             "message": f"Document '{doc_title}' added to knowledge base successfully",
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 395 - 400, The
return currently echoes the input attachment_id rather than the actual
attachment linked by the orchestrator; update the return to read the linked
attachment id from the created document object (the result variable) — e.g. use
result.attachment_id or result.linked_attachment.id or
result.linked_attachment_id (whichever exists on the returned object) and fall
back to the original attachment_id if none is present, then return that value in
the "attachment_id" field along with document_id/result.id and
document_name/result.name.
🧹 Nitpick comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)

167-189: Narrow the inner exception handling around DingTalk fetch.

Catching Exception here masks unexpected programming/runtime errors and duplicates outer error handling. Prefer catching expected fetch failures only (e.g., ValueError) and let unknown exceptions bubble to the outer except with stack trace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 167 - 189, The
current broad except in the block calling
dingtalk_docs_service.download_document_content masks unexpected errors; narrow
the handler to only expected fetch failures (e.g., ValueError or a specific
service fetch exception) around the call to
dingtalk_docs_service.download_document_content and related parsing
(doc_download, doc_content, file_extension, modified_time, update_time), log and
return the graceful error only for those expected exceptions, and let other
exceptions propagate (or re-raise) so the outer exception handler can capture
stack traces.
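A sketch of the narrowed handler the nitpick describes; `DingTalkFetchError` and the service stub are hypothetical stand-ins for the real `dingtalk_docs_service` call:

```python
from typing import Any


class DingTalkFetchError(ValueError):
    """Hypothetical expected-failure type for DingTalk fetches."""


async def download_document_content(doc_url: str) -> dict:
    """Stand-in for dingtalk_docs_service.download_document_content."""
    if "bad" in doc_url:
        raise DingTalkFetchError(f"cannot fetch {doc_url}")
    return {"content": "# doc body"}


async def fetch_content(doc_url: str) -> dict:
    try:
        doc_download = await download_document_content(doc_url)
    except DingTalkFetchError as exc:
        # Only expected fetch failures become a graceful error payload;
        # anything else propagates so the outer handler logs a stack trace.
        return {"success": False, "error": str(exc)}
    return {"success": True, "content": doc_download.get("content")}
```

Unexpected exceptions (bugs, type errors) bubble up instead of being flattened into the same error dict, which is the behavior the review asks for.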
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 43-45: The current extract_attributes lambda stores the full
doc_url (which may include sensitive query params) in traces/logs; change it to
redact query strings or emit only a parsed doc_id instead: parse the provided
doc_url inside extract_attributes (and the other extract_attributes occurrences)
to remove the query component (or extract the document identifier) and set the
attribute to either "doc_id": <extracted_id> or "doc_url": <url_without_query>
before returning, so no raw query parameters are logged.
- Around line 178-180: The code branch in download_document_content (or the
caller handling doc_download) incorrectly reads modified_time_formatted, leaving
modified_time unset; change the assignment to use the actual DingTalk key
"modified_time" (e.g., set modified_time = doc_download.get("modified_time")) so
modified_time is populated when download_document_content() returns
modified_time rather than modified_time_formatted.

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 395-400: The return currently echoes the input attachment_id
rather than the actual attachment linked by the orchestrator; update the return
to read the linked attachment id from the created document object (the result
variable) — e.g. use result.attachment_id or result.linked_attachment.id or
result.linked_attachment_id (whichever exists on the returned object) and fall
back to the original attachment_id if none is present, then return that value in
the "attachment_id" field along with document_id/result.id and
document_name/result.name.

---

Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 167-189: The current broad except in the block calling
dingtalk_docs_service.download_document_content masks unexpected errors; narrow
the handler to only expected fetch failures (e.g., ValueError or a specific
service fetch exception) around the call to
dingtalk_docs_service.download_document_content and related parsing
(doc_download, doc_content, file_extension, modified_time, update_time), log and
return the graceful error only for those expected exceptions, and let other
exceptions propagate (or re-raise) so the outer exception handler can capture
stack traces.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 68acdb67-cbb6-462a-9a2b-291a89899ebf

📥 Commits

Reviewing files that changed from the base of the PR and between c534204 and 73dc2c9.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_docs.py
  • backend/app/services/knowledge/orchestrator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/knowledge/orchestrator.py

Comment on lines +43 to +45
extract_attributes=lambda token_info, doc_url: {
"doc_url": doc_url,
},

⚠️ Potential issue | 🟠 Major

Avoid storing full doc_url in traces/logs.

Raw DingTalk URLs can carry sensitive query parameters. Redact query strings (or emit only doc_id) before logging/telemetry attributes.

Suggested fix
+from urllib.parse import urlsplit, urlunsplit
+
+def _safe_url_for_observability(url: str) -> str:
+    parts = urlsplit(url)
+    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

 `@trace_async`(
@@
-    extract_attributes=lambda token_info, doc_url: {
-        "doc_url": doc_url,
+    extract_attributes=lambda token_info, doc_url: {
+        "doc_url": _safe_url_for_observability(doc_url),
     },
 )
@@
-            logger.info(
-                f"[MCP] Content not provided, fetching from DingTalk: {doc_url}"
-            )
+            logger.info(
+                "[MCP] Content not provided, fetching from DingTalk: %s",
+                _safe_url_for_observability(doc_url),
+            )

Also applies to: 111-114, 164-166, 264-266

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 43 - 45, The
current extract_attributes lambda stores the full doc_url (which may include
sensitive query params) in traces/logs; change it to redact query strings or
emit only a parsed doc_id instead: parse the provided doc_url inside
extract_attributes (and the other extract_attributes occurrences) to remove the
query component (or extract the document identifier) and set the attribute to
either "doc_id": <extracted_id> or "doc_url": <url_without_query> before
returning, so no raw query parameters are logged.

Comment on lines +178 to +180
if not modified_time:
modified_time = doc_download.get("modified_time_formatted")
# Get file_extension from DingTalk response

⚠️ Potential issue | 🟡 Minor

Use the actual DingTalk download key for modified time.

download_document_content() returns modified_time, not modified_time_formatted. This branch leaves modified_time unset unexpectedly.

Suggested fix
-                if not modified_time:
-                    modified_time = doc_download.get("modified_time_formatted")
+                if not modified_time:
+                    modified_time = doc_download.get("modified_time")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 178 - 180, The
code branch in download_document_content (or the caller handling doc_download)
incorrectly reads modified_time_formatted, leaving modified_time unset; change
the assignment to use the actual DingTalk key "modified_time" (e.g., set
modified_time = doc_download.get("modified_time")) so modified_time is populated
when download_document_content() returns modified_time rather than
modified_time_formatted.

@parabala
Collaborator Author

close

@parabala parabala closed this Apr 17, 2026