Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[eradicate] ignore # language= in commented-out-code rule (ERA001) #14069

Merged
merged 1 commit into from
Nov 3, 2024

Conversation

fabiob
Copy link
Contributor

@fabiob fabiob commented Nov 3, 2024

Fixes one common case cited on #6019

Summary

The commented-out-code rule (ERA001) from eradicate is currently flagging a very common idiom that marks Python strings as another language, to help with syntax highlighting:

image

This PR adds this idiom to the list of allowed exceptions to the rule.

Test Plan

I've added some additional test cases.

@charliermarsh
Copy link
Member

Do you mind pushing the test cases? I only see the code changes.

For reference: https://www.jetbrains.com/help/pycharm/using-language-injections.html

@charliermarsh charliermarsh added the bug Something isn't working label Nov 3, 2024
@fabiob
Copy link
Contributor Author

fabiob commented Nov 3, 2024

Sorry @charliermarsh , forgot to force-push after I amended the commit with the test cases.

@charliermarsh
Copy link
Member

No prob, thanks for the PR.

@fabiob
Copy link
Contributor Author

fabiob commented Nov 3, 2024

And looks like I need to fix the formatting... Working on it.

@fabiob
Copy link
Contributor Author

fabiob commented Nov 3, 2024

For reference: https://www.jetbrains.com/help/pycharm/using-language-injections.html

I've added some more test cases and fixed the regexp with the additional cases described here.

@charliermarsh charliermarsh merged commit 2b73a1c into astral-sh:main Nov 3, 2024
20 checks passed
Copy link
Contributor

github-actions bot commented Nov 3, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

@charliermarsh
Copy link
Member

Thanks and welcome to the project.

@fabiob fabiob deleted the adds-language-to-era001 branch November 3, 2024 22:12
@@ -16,7 +16,7 @@ static CODE_INDICATORS: LazyLock<AhoCorasick> = LazyLock::new(|| {

static ALLOWLIST_REGEX: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(
r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))",
r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|language=[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+(?:\s+prefix=\S+)?(?:\s+suffix=\S+)?|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested quantifier doesn't look so nice: [a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+. The optional space at the start of the group leads to multiple ways of matching the same text.

I know Rust's regex engine guarantees linear time, but this is nevertheless a bad pattern. Prefer [a-zA-Z][-_.a-zA-Z0-9]*(?: [-_.a-zA-Z0-9]+)* instead.

Technically, language IDs can be anything, so it's perhaps better to drop the leading character requirement: [-_.a-zA-Z0-9]+(?: [-_.a-zA-Z0-9]+)*.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just use language=? All language identifiers will start with that, and false positives seem somewhat rare / not a huge deal here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@InSyncWithFoo maybe I was extra cautious about false positives.

Technically, language IDs can be anything,

Good catch! Maybe we can just require a non-space character after the equals sign. This should take care of most false positives anyway, as it's uncommon for Python users to omit spaces around assignment operators.

The only thing that worries me is about detecting a commented-out named parameter on a multi-line method invocation or declaration. But maybe I'm just being overzealous.

def check_spelling(
  text,
  # language=DEFAULT_LANGUAGE
) ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick search shows that matching on language= alone would lead to many false negatives.

Language IDs are also displayed to the user as part of the UI, so I doubt they would contain non-ASCII characters. I would say limiting to [-_.a-zA-Z0-9] and spaces is the most balanced heuristic.

Copy link
Contributor

@InSyncWithFoo InSyncWithFoo Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] it's uncommon for Python users to omit spaces around assignment operators.

Don't forget keyword arguments, which are recommended to be written with no spaces around the equal sign:

# frobnicate(
#     language='en'  # Could also be `language=EN` with a predefined constant `EN`
# )

I think both this and the false negative you mention are well within the acceptable error margin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @InSyncWithFoo , will you write the PR with the new regexp and test cases, or should I?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll handle it.

Copy link
Contributor

@InSyncWithFoo InSyncWithFoo Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out language IDs in comments must be normalized.

This is JSON Lines (one value on each line, comments allowed):

This is pure JSON (one single top-level value, comments disallowed):

jsonlines must be used even though autocompletion popups use json lines:

Notably, this rule applies to some languages but not others (though Ruff doesn't need to care about this):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants