-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[eradicate
] ignore # language=
in commented-out-code rule (ERA001)
#14069
Conversation
Do you mind pushing the test cases? I only see the code changes. For reference: https://www.jetbrains.com/help/pycharm/using-language-injections.html |
96f0c10
to
b94339b
Compare
Sorry @charliermarsh , forgot to force-push after I amended the commit with the test cases. |
No prob, thanks for the PR. |
And looks like I need to fix the formatting... Working on it. |
b94339b
to
0d9abfe
Compare
Fixes one common case cited on astral-sh#6019 Should work with the examples from Jetbrains documentation: https://www.jetbrains.com/help/pycharm/using-language-injections.html
0d9abfe
to
58922e4
Compare
I've added some more test cases and fixed the regexp with the additional cases described here. |
|
Thanks and welcome to the project. |
@@ -16,7 +16,7 @@ static CODE_INDICATORS: LazyLock<AhoCorasick> = LazyLock::new(|| { | |||
|
|||
static ALLOWLIST_REGEX: LazyLock<Regex> = LazyLock::new(|| { | |||
Regex::new( | |||
r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))", | |||
r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|language=[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+(?:\s+prefix=\S+)?(?:\s+suffix=\S+)?|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nested quantifier doesn't look so nice: [a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+
. The optional space at the start of the group leads to multiple ways of matching the same text.
I know Rust's regex engine guarantees linear time, but this is nevertheless a bad pattern. Prefer [a-zA-Z][-_.a-zA-Z0-9]*(?: [-_.a-zA-Z0-9]+)*
instead.
Technically, language IDs can be anything, so it's perhaps better to drop the leading character requirement: [-_.a-zA-Z0-9]+(?: [-_.a-zA-Z0-9]+)*
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just use language=
? All language identifiers will start with that, and false positives seem somewhat rare / not a huge deal here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@InSyncWithFoo maybe I was extra cautious about false positives.
Technically, language IDs can be anything,
Good catch! Maybe we can just require a non-space character after the equals sign. This should take care of most false positives anyway, as it's uncommon for Python users to omit spaces around assignment operators.
The only thing that worries me is about detecting a commented-out named parameter on a multi-line method invocation or declaration. But maybe I'm just being overzealous.
def check_spelling(
text,
# language=DEFAULT_LANGUAGE
) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick search shows that matching on language=
alone would lead to many false negatives.
Language IDs are also displayed to the user as part of the UI, so I doubt they would contain non-ASCII characters. I would say limiting to [-_.a-zA-Z0-9]
and spaces is the most balanced heuristic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] it's uncommon for Python users to omit spaces around assignment operators.
Don't forget keyword arguments, which are recommended to be written with no spaces around the equal sign:
# frobnicate(
# language='en' # Could also be `language=EN` with a predefined constant `EN`
# )
I think both this and the false negative you mention are well within the acceptable error margin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So @InSyncWithFoo , will you write the PR with the new regexp and test cases, or should I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out language IDs in comments must be normalized.
This is JSON Lines (one value on each line, comments allowed):
This is pure JSON (one single top-level value, comments disallowed):
jsonlines
must be used even though autocompletion popups use json lines
:
Notably, this rule applies to some languages but not others (though Ruff doesn't need to care about this):
Fixes one common case cited on #6019
Summary
The
commented-out-code
rule (ERA001) fromeradicate
is currently flagging a very common idiom that marks Python strings as another language, to help with syntax highlighting:This PR adds this idiom to the list of allowed exceptions to the rule.
Test Plan
I've added some additional test cases.