Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: parse triple backtick strings, discarding the info header #1162

Merged
merged 6 commits into from
Nov 12, 2024

Conversation

sxlijin
Copy link
Collaborator

@sxlijin sxlijin commented Nov 12, 2024

This is a more LLM-reliable way to get codegen outputs

We should add a documentation guide page for this (possibly a great community contribution candidate here), but in the meantime I've drafted up a blog post


Important

Add support for parsing triple backtick strings in JSON parser, including dedenting and discarding info headers.

  • Behavior:
    • Add TripleBacktickString variant to JsonCollection in json_collection.rs to handle triple backtick strings.
    • Update process_token() in json_parse_state.rs to parse triple backtick strings, discarding the info header and dedenting content.
  • Utilities:
    • Move dedent function to dedent.rs in bstd and update its usage in expression.rs and json_collection.rs.
  • Tests:
    • Add tests in test_code.rs to verify parsing of triple backtick strings, including edge cases like nested backticks and dedenting.

This description was created by Ellipsis for 3d5cd23. It will automatically update as commits are pushed.

Copy link

vercel bot commented Nov 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
baml ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 12, 2024 9:59pm

@sxlijin sxlijin requested a review from hellovai November 12, 2024 02:04
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 8748913 in 59 seconds

More details
  • Looked at 820 lines of code in 10 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. engine/baml-lib/schema-ast/src/ast/expression.rs:4
  • Draft comment:
    The dedent function is duplicated here and in engine/bstd/src/dedent.rs. Consider using the dedent function from bstd to avoid code duplication.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment seems to be addressing an issue that has already been resolved in the diff. The dedent function is no longer duplicated because it has been removed from this file and is now being imported from bstd. Therefore, the comment is outdated and not useful.
    I might be missing some context about whether the dedent function in bstd is exactly the same as the one that was removed. However, the fact that the function was removed and replaced with an import strongly suggests that the duplication issue is resolved.
    Even if there are slight differences, the intention to remove duplication is clear from the changes made. The comment is not needed because the action to resolve the issue has already been taken.
    The comment is outdated because the code duplication issue it addresses has already been resolved in the diff. Therefore, the comment should be deleted.

Workflow ID: wflow_iUAafNMCertTBOPr


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

FieldType::class("Test"),
{
"type": "code",
"code": "```\nconst { execSync } = require('child_process');\n```",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought we said to trim + dedent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is triple backticks inside a string, that's why the newlines get preserved!

Copy link
Contributor

@hellovai hellovai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dedent? everything else looks good!!

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on da7415d in 53 seconds

More details
  • Looked at 26 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. engine/baml-lib/jsonish/src/jsonish/parser/fixing_parser/json_parse_state.rs:502
  • Draft comment:
    The logic for checking triple backticks is flawed due to incorrect use of peek. Consider refactoring to correctly advance the iterator when checking for triple backticks.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_rNbUceArEiS7dId6


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

// TODO: this logic is busted. peekable.peek() does not
// advance the iterator (this is easily verified with
// a unit test), but to fix this we need to do a bit of
// refactoring, so for now we'll live with it.
let is_triple_quoted = match next.peek() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for checking triple quotes is flawed due to incorrect use of peek. Consider refactoring to correctly advance the iterator when checking for triple quotes.

@sxlijin sxlijin force-pushed the sam/backticks-parsing branch from da7415d to 289e1d9 Compare November 12, 2024 21:16
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 289e1d9 in 50 seconds

More details
  • Looked at 26 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. engine/baml-lib/jsonish/src/jsonish/parser/fixing_parser/json_parse_state.rs:405
  • Draft comment:
    The logic for checking triple quotes using peek() is incorrect because peek() does not advance the iterator. This can lead to incorrect parsing of triple-quoted strings. Consider refactoring to correctly handle iterator advancement.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. engine/baml-lib/jsonish/src/jsonish/parser/fixing_parser/json_parse_state.rs:502
  • Draft comment:
    The logic for checking triple backticks using peek() is incorrect because peek() does not advance the iterator. This can lead to incorrect parsing of triple-backtick strings. Consider refactoring to correctly handle iterator advancement.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_ghT08XKDv0v7keJ0


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@sxlijin sxlijin enabled auto-merge November 12, 2024 21:30
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 8c56484 in 37 seconds

More details
  • Looked at 107 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. engine/bstd/src/dedent.rs:62
  • Draft comment:
    Consider handling the case where the input string is empty to avoid potential errors or unexpected behavior.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_HIdEVDcFUhmRY8em


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 3d5cd23 in 41 seconds

More details
  • Looked at 162 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_8OAj2ihYvGmbGSUj


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

let (_, tail) = l.split_at(prefix.len());
tail
} else {
""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines that are entirely whitespace should not be replaced with an empty string. This can lead to incorrect results when dedenting, as seen in the test_empty_lines test case. Consider preserving such lines as they are.

Suggested change
""
l

@sxlijin sxlijin added this pull request to the merge queue Nov 12, 2024
Merged via the queue into canary with commit 353b21e Nov 12, 2024
9 of 10 checks passed
@sxlijin sxlijin deleted the sam/backticks-parsing branch November 12, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants