Support arbitrary length code fences in markdown#22937
Conversation
Typing conformance resultsNo changes detected ✅ |
|
|
ntBre
left a comment
There was a problem hiding this comment.
Should we add some new tests? This looks reasonable to me, but I'm curious if @BurntSushi has any thoughts on fancy-regex.
|
|
||
| insta = { workspace = true } | ||
| regex = { workspace = true } | ||
| fancy-regex = "0.17.0" |
There was a problem hiding this comment.
I think we usually add dependencies at the workspace level and then use { workspace = true } in the crates, but I could be wrong.
There was a problem hiding this comment.
I think that's common practice yeah.
|
My only real concern here is using backreferences if we don't really need them, because this does mean that the regex will need to do some kind of backtracking. I grant it's convenient here. The main failure mode is if there are inputs that can cause the regex to catastrophically backtrack. And if so, what those inputs look like. If they're pathological, then maybe we don't care. But it's notoriously difficult to know what specifically will cause catastrophic backtracking. (Hence why the |
|
The backreferences are necessary in order to support arbitrary length code fences of either backticks or tildes, to ensure that we correctly match cases of code fences inside code fences, like something a code block that contains an example of a code block containing a code block (yo dawg): ````markdown
```py
print('hello')
```
````Or a python code block that contains a multiline string with a markdown code block: ````py
print("""
```rust
fn foo() {}
```
""")
````The alternative is replacing the single regex-with-backreference with a line-by-line approach where we check every line looking for a code fence, and then manually build up the code blocks from there. But IMO the performance risk is pretty low given that a python code block that isn't fenced correctly is already an edge case, and IMO better to have less code to maintain. Performance wise, I tested against |
To clarify here, the risk of catastrophic backtracking isn't limited just to what the backreferences are doing. Once backtracking is used, it can generally occur anywhere else in the regex, particularly with repetition operators. It might be fine. Running it on more big test files is probably enough to confirm that it's fine for inputs we care about. |
Format markdown with line-by-line regex parse - Uses basic `regex` crate, so no backtracking or backreferences needed - Supports `~~~` and arbitrary length code fences - Supports `<!-- fmt:off -->` to skip formatting code blocks - Includes test cases from previous PRs, as well as new ones Obviates #22962 and #22937
fancy_regexcrate~~~style code fences as well