-
Notifications
You must be signed in to change notification settings - Fork 180
[DOC] PPL docs website exporter script #4950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOC] PPL docs website exporter script #4950
Conversation
📝 WalkthroughSummary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughMinor markdown formatting edits across several PPL docs and a substantial rewrite of the docs exporter script to add multi-file export orchestration, link normalization, front-matter generation, and directory roll-up behavior. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Source as Source Markdown Files
participant Exporter as Exporter (orchestrator)
participant Processor as Link Fixer & Content Processor
participant Target as Target Output Directory
Source->>Exporter: Collect and sort markdown files
Exporter->>Exporter: Group files by directory levels\nand compute roll-up targets
Exporter->>Exporter: Compute front-matter (title, parent,\ngrand_parent, nav_order, redirect_from)
Exporter->>Exporter: Read each markdown file content
Exporter->>Processor: process_content(content, current_file_path)
Processor->>Processor: fix_link() and other transforms\n(PPL→SQL fences, copy buttons, anchors)
Processor-->>Exporter: Return processed content
Exporter->>Target: Write processed file with front-matter
Exporter->>Target: Generate index.md for second-level dirs\nand write redirect_from entries if needed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
a47f61d to
e101610
Compare
ad0cdad to
98c15f4
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/docs_exporter/export_to_docs_website.py (1)
165-268: Exporter has a real bug (undefined name) + a likely docs loss bug (overwriting sourceindex.md).
grand_parent_nameis undefined. Even if “shouldn’t happen”, it’s still dead/broken code to remove or fix.- Directory
index.mdgeneration at the end can overwrite anindex.mdthat was exported from the source directory.@@ - else: - # This shouldn't happen after roll-up, but keeping for safety - parent = get_heading_for_dir(rel_path.parent.name) - grand_parent = get_heading_for_dir(rel_path.parts[-3]) - grand_parent = DIR_NAMES_TO_HEADINGS_MAP.get( - grand_parent_name, grand_parent_name.replace("-", " ").title() - ) + else: + # Depth > 2 should be rolled up earlier; fail fast to avoid silent bad nav. + raise ValueError(f"Unexpected rel_path depth after roll-up: {rel_path}") @@ - has_children = ( - is_root_index - or (md_file.parent / md_file.stem).is_dir() - and any((md_file.parent / md_file.stem).glob("*/*.md")) - ) + has_children = is_root_index or ( + (md_file.parent / md_file.stem).is_dir() + and any((md_file.parent / md_file.stem).glob("*/*.md")) + ) @@ - # Generate index.md for each directory (only up to second level) + # Generate index.md for each directory (only up to second level), but don't overwrite a source index.md @@ - target_index = target_dir / dir_path / "index.md" + target_index = target_dir / dir_path / "index.md" + source_index = source_dir / dir_path / "index.md" + if source_index.exists(): + continue
🧹 Nitpick comments (1)
scripts/docs_exporter/export_to_docs_website.py (1)
294-298: Hard-coded relative paths are OK for a local tool; consider CLI args for reuse.
Not blocking, but a--source-dir/--target-dirmakes this script easier to run in CI/local and reduces path assumptions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
docs/user/ppl/cmd/trendline.md(1 hunks)docs/user/ppl/functions/cryptographic.md(3 hunks)docs/user/ppl/functions/datetime.md(1 hunks)docs/user/ppl/functions/json.md(1 hunks)docs/user/ppl/general/datatypes.md(0 hunks)docs/user/ppl/index.md(1 hunks)scripts/docs_exporter/export_to_docs_website.py(2 hunks)
💤 Files with no reviewable changes (1)
- docs/user/ppl/general/datatypes.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: For PPL command PRs, refer docs/dev/ppl-commands.md and verify the PR satisfies the checklist
🪛 LanguageTool
docs/user/ppl/functions/cryptographic.md
[grammar] ~8-~8: Use a hyphen to join words.
Context: ...MD5 digest and returns the value as a 32 character hex string. Argument type: S...
(QB_NEW_EN_HYPHEN)
🪛 Ruff (0.14.8)
scripts/docs_exporter/export_to_docs_website.py
200-200: Loop control variable dir_name not used within loop body
(B007)
241-241: Undefined name grand_parent_name
(F821)
241-241: Undefined name grand_parent_name
(F821)
248-249: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: enforce-label
- GitHub Check: build-linux (21, unit)
- GitHub Check: build-linux (25, integration)
- GitHub Check: build-linux (21, integration)
- GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, unit)
- GitHub Check: build-windows-macos (macos-14, 25, integration)
- GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, unit)
- GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
- GitHub Check: build-windows-macos (macos-14, 21, integration)
- GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
🔇 Additional comments (6)
docs/user/ppl/functions/json.md (1)
55-60: Formatting-only change inJSON_VALIDblock looks safe.
Trailing spaces for hard line breaks are fine as long as your exporter/Jekyll pipeline preserves them.docs/user/ppl/index.md (1)
14-18: Confirm```ppl ignoreis a supported fence info string in your toolchain.
Your exporter will rewrite it, but GitHub/Jekyll renderers can treat nonstandard info strings differently.docs/user/ppl/cmd/trendline.md (1)
8-8: Escaping in Syntax line is appropriate to render literal brackets/angle brackets.docs/user/ppl/functions/cryptographic.md (1)
34-38: SHA1/SHA2 formatting updates look fine.Also applies to: 61-66
docs/user/ppl/functions/datetime.md (1)
3-6: Intro text is currently a Markdown code block due to indentation—verify that’s intended.
If not intended, remove the leading spaces so it renders as normal prose.- All PPL date and time functions use the UTC time zone. Both input and output values are interpreted as UTC. - For instance, an input timestamp literal like '2020-08-26 01:01:01' is assumed to be in UTC, and the now() - function also returns the current date and time in UTC. +All PPL date and time functions use the UTC time zone. Both input and output values are interpreted as UTC. +For instance, an input timestamp literal like '2020-08-26 01:01:01' is assumed to be in UTC, and the now() +function also returns the current date and time in UTC.scripts/docs_exporter/export_to_docs_website.py (1)
144-163: Copy-button regex is plausible; ensure it won’t touch non-target fences and thatcopy.htmlexists.
Also consider excluding image/link rewrites inside code blocks (today it will rewrite inside code fences too).
Signed-off-by: Kyle Hounslow <[email protected]>
98c15f4 to
3354c91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
docs/user/ppl/cmd/trendline.md(1 hunks)docs/user/ppl/functions/cryptographic.md(3 hunks)docs/user/ppl/functions/datetime.md(1 hunks)docs/user/ppl/functions/json.md(1 hunks)docs/user/ppl/general/datatypes.md(0 hunks)docs/user/ppl/index.md(1 hunks)scripts/docs_exporter/export_to_docs_website.py(2 hunks)
💤 Files with no reviewable changes (1)
- docs/user/ppl/general/datatypes.md
✅ Files skipped from review due to trivial changes (1)
- docs/user/ppl/functions/json.md
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/user/ppl/cmd/trendline.md
- docs/user/ppl/functions/datetime.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: For PPL command PRs, refer docs/dev/ppl-commands.md and verify the PR satisfies the checklist
🪛 Ruff (0.14.8)
scripts/docs_exporter/export_to_docs_website.py
245-245: Undefined name grand_parent_name
(F821)
245-245: Undefined name grand_parent_name
(F821)
252-253: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
- GitHub Check: build-windows-macos (macos-14, 21, integration)
- GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
🔇 Additional comments (4)
docs/user/ppl/index.md (1)
14-18: Code fence language tag is compatible with the exporter’s PPL→SQL rewrite.
```ppl ignorestill matches^```ppl\binprocess_content, so this should convert cleanly to```sqlin the exported site.docs/user/ppl/functions/cryptographic.md (2)
7-10: Doc polish applied (incl. “32-character”).
34-38: Formatting updates look consistent across SHA1/SHA2 blocks.Also applies to: 61-66
scripts/docs_exporter/export_to_docs_website.py (1)
59-85: Front-matter YAML escaping +redirect_fromlist now looks safe.
This addresses the YAML-breakage risk from titles/parents containing quotes/backslashes, and emitsredirect_fromas a list.
| def fix_link(match, current_file_path=None): | ||
| """Fix relative links to use site.url prefix, preserving anchor-only (#heading) links. | ||
| Handles: | ||
| - Relative paths with ../ and ./ | ||
| - Same-directory links (file.md) | ||
| - Same-directory links with ./ prefix (./file.md) | ||
| - Subdirectory relative paths (subdir/file.md) | ||
| - Anchor normalization for Jekyll | ||
| - Rollup redirects for deep directory structures | ||
| - Malformed syntax escaping | ||
| """ | ||
| link = match.group(1) | ||
| anchor = match.group(3) or "" | ||
|
|
||
| # Keep anchor-only links unchanged | ||
| if (not link and anchor) or link.startswith("#"): | ||
| return match.group(0) | ||
|
|
||
| # Skip external links | ||
| if link.startswith("http"): | ||
| return match.group(0) | ||
|
|
||
| # Remove .md extension | ||
| link = link.replace(".md", "") | ||
|
|
||
| # Resolve path based on link type | ||
| if ( | ||
| ("/" not in link and "\\" not in link) or link.startswith("./") | ||
| ) and current_file_path: | ||
| # Same directory link | ||
| current_dir = str(current_file_path.parent).replace("\\", "/") | ||
| clean_link = link.lstrip("./") | ||
| resolved_path = ( | ||
| f"{current_dir}/{clean_link}" if current_dir != "." else clean_link | ||
| ) | ||
| elif not link.startswith("../") and current_file_path: | ||
| # Relative path within current directory context | ||
| current_dir = str(current_file_path.parent).replace("\\", "/") | ||
| resolved_path = f"{current_dir}/{link}" if current_dir != "." else link | ||
| else: | ||
| # Normalize relative path and remove leading ../ | ||
| resolved_path = os.path.normpath(link).replace("\\", "/") | ||
| while resolved_path.startswith("../"): | ||
| resolved_path = resolved_path[3:] | ||
|
|
||
| # Clean up malformed paths | ||
| resolved_path = re.sub(r"[,\s]+", "-", resolved_path.strip()) | ||
|
|
||
| # Normalize anchor for Jekyll (remove dots and dashes) | ||
| if anchor: | ||
| anchor = re.sub(r"[.-]", "", anchor.lower()) | ||
|
|
||
| # Add trailing slash for directories (but not with anchors) | ||
| if resolved_path and not resolved_path.endswith((".html", ".htm")) and not anchor: | ||
| resolved_path = resolved_path.rstrip("/") + "/" | ||
|
|
||
| return f"]({{{{site.url}}}}{{{{site.baseurl}}}}/{DOCS_BASE_PATH}/{resolved_path}{anchor})" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid rewriting mailto:/tel: links and image links; don’t global-replace .md.
Today fix_link will rewrite non-http schemes and images (because the regex matches ]( inside image syntax), and link.replace(".md","") can remove .md mid-string.
-def fix_link(match, current_file_path=None):
+def fix_link(match, current_file_path=None):
@@
- link = match.group(1)
+ # Skip image links (the regex matches the `](` part of `` too)
+ if match.start() > 0 and match.string[match.start() - 1] == "!":
+ return match.group(0)
+
+ link = match.group(1)
anchor = match.group(3) or ""
@@
- # Skip external links
- if link.startswith("http"):
+ # Skip absolute URI schemes (mailto:, tel:, ftp:, etc.). Allow only http(s) to pass.
+ if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", link) and not link.startswith(("http://", "https://")):
return match.group(0)
- # Remove .md extension
- link = link.replace(".md", "")
+ # Remove .md extension (suffix only). Note: process_content regex already captures optional `(.md)?`.
+ if link.endswith(".md"):
+ link = link[:-3]If you’d rather skip images at the regex level, you can also change the pattern to use a negative lookbehind: (?<!!)\]\(.
Also applies to: 162-165
🤖 Prompt for AI Agents
In scripts/docs_exporter/export_to_docs_website.py around lines 88 to 146,
fix_link is incorrectly rewriting mailto:/tel: and image links and is naively
doing link.replace(".md","") which can mangle strings; update the function to
(1) skip image links by detecting the image token before the match (e.g. check
whether the character before the match is '!') or change the regex to use a
negative lookbehind so images are not matched, (2) return unchanged for non-http
schemes like mailto:, tel:, ftp: by checking for ":" in the scheme portion (e.g.
if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", link): return original), and (3) only
strip a trailing ".md" suffix (not global replace) by removing the suffix when
link.lower().endswith(".md") so mid-string ".md" is preserved; apply the same
fixes to the logic around lines 162-165.
| # Check if this is the root index.md and has children | ||
| is_root_index = rel_path.name == "index.md" and rel_path.parent == Path(".") | ||
| has_children = ( | ||
| is_root_index | ||
| or (md_file.parent / md_file.stem).is_dir() | ||
| and any((md_file.parent / md_file.stem).glob("*/*.md")) | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parenthesize has_children to make precedence explicit.
Current logic likely behaves as intended, but it’s easy to misread.
- has_children = (
- is_root_index
- or (md_file.parent / md_file.stem).is_dir()
- and any((md_file.parent / md_file.stem).glob("*/*.md"))
- )
+ has_children = (
+ is_root_index
+ or (
+ (md_file.parent / md_file.stem).is_dir()
+ and any((md_file.parent / md_file.stem).glob("*/*.md"))
+ )
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Check if this is the root index.md and has children | |
| is_root_index = rel_path.name == "index.md" and rel_path.parent == Path(".") | |
| has_children = ( | |
| is_root_index | |
| or (md_file.parent / md_file.stem).is_dir() | |
| and any((md_file.parent / md_file.stem).glob("*/*.md")) | |
| ) | |
| # Check if this is the root index.md and has children | |
| is_root_index = rel_path.name == "index.md" and rel_path.parent == Path(".") | |
| has_children = ( | |
| is_root_index | |
| or ( | |
| (md_file.parent / md_file.stem).is_dir() | |
| and any((md_file.parent / md_file.stem).glob("*/*.md")) | |
| ) | |
| ) |
🧰 Tools
🪛 Ruff (0.14.8)
252-253: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
🤖 Prompt for AI Agents
In scripts/docs_exporter/export_to_docs_website.py around lines 248 to 255, the
boolean expression assigned to has_children relies on mixed or/and operator
precedence and is hard to read; change it to make precedence explicit by
parenthesizing the non-root check so it reads: has_children = is_root_index or
((md_file.parent / md_file.stem).is_dir() and any((md_file.parent /
md_file.stem).glob("*/*.md"))). This keeps the same logic but makes the grouping
clear for readers and future maintenance.
Swiddis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, could probably find maintainability nits if I tried but since this is a one-off script we probably don't need to worry about it
| source_dir_ppl = script_dir / "../../docs/user/ppl" | ||
| target_dir_ppl = script_dir / f"../../../documentation-website/_{DOCS_BASE_PATH}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-portable paths, but if i understand right this only runs in CI servers or in a box where we control the platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses pathlib.Path object with / operator (src). Should be portable to all OS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't realize the slash operator carrried to the tailing strings. I thought we'd end up with {script_dir}\../.... TIL
| for example, the following query retrieve firstname and lastname from accounts if age large than 18. | ||
|
|
||
| ``` | ||
| ```ppl ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when should add ppl ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ignore keyword will omit the code block from doctests. For this case, we want to exclude from doctests but still want ppl syntax highlighting applied.
Note: All ppl blocks are converted to sql formatting on export to docs website (sql is the closest format supported for syntax highlighting).
Description
Add an automated exporter script that transforms PPL markdown documentation tree (
docs/user/ppl/**/*.md) into Jekyll-compatible format for the OpenSearch documentation website.Summary of changes
Demo
Screen recording of rendered Jekyll site after exporting PPL docs. No manual changes applied. See the files exported to documentation-website for below demo here
Screen.Recording.2025-12-12.at.3.16.18.PM.mov
Exporter script features
just-the-docstheme on docs website.Related Issues
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.