
Add a subcommand to generate dependency graphs #13402

Merged · 11 commits · Sep 20, 2024
Conversation

@charliermarsh (Member) commented Sep 18, 2024

Summary

This PR adds an experimental Ruff subcommand to generate dependency graphs based on module resolution.

A few highlights:

  • You can generate either dependency or dependent graphs via the --direction command-line argument.
  • Like Pants, we also provide an option to identify imports from string literals (--detect-string-imports).
  • Users can also provide additional dependency data via the include-dependencies key under [tool.ruff.import-map]. This map uses file paths as keys, and lists of strings as values. Those strings can be file paths or globs.
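As an illustrative sketch (the file paths and globs here are hypothetical, not taken from the PR), the `include-dependencies` map described above might look like:

```toml
[tool.ruff.import-map]
# Keys are file paths; values are lists of file paths or globs that the
# key file should be treated as depending on.
include-dependencies = { "src/app/tasks.py" = ["src/app/plugins/**/*.py", "scripts/runner.py"] }
```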

The dependency resolution uses the red-knot module resolver, which is intended to be fully spec-compliant, so it's also a chance to exercise the module resolver in a real-world setting.

The CLI is, e.g., ruff graph build ../autobot, which will output a JSON map from file to files it depends on for the autobot project.
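As a hedged illustration (file names are hypothetical, and the output shape is inferred from the description above), the `--direction` option amounts to choosing between a map and its inverse:

```python
# Hypothetical `ruff graph build` output: each file mapped to the files
# it depends on (the "dependencies" direction).
dependencies = {
    "app/main.py": ["app/db.py", "app/utils.py"],
    "app/db.py": ["app/utils.py"],
    "app/utils.py": [],
}

def invert(graph: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn a dependencies map into a dependents map (the other direction)."""
    inverted: dict[str, list[str]] = {file: [] for file in graph}
    for file, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, []).append(file)
    return inverted

dependents = invert(dependencies)
```

Here `app/utils.py`, which imports nothing, ends up with two dependents, while `app/main.py` has none.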

github-actions bot (Contributor) commented Sep 19, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@charliermarsh charliermarsh force-pushed the charlie/import-map branch 4 times, most recently from 9e912c8 to ccde874 Compare September 19, 2024 00:58
@AlexWaygood (Member) left a comment

Couple of comments from skimming!

crates/ruff_graph/src/collector.rs (review comment resolved)
crates/ruff_graph/src/resolver.rs (review comment resolved)
@charliermarsh (Member, Author) commented

This needs tests prior to merging, but I'd like feedback.

@MichaReiser (Member) left a comment

Overall looks good.

I think we're creating too many Db instances, which will hurt performance, and I would recommend using ModuleName and SystemPath over QualifiedName and Path to avoid conversions at the leaf.

crates/ruff/src/args.rs (review comment resolved)
crates/ruff/src/commands/graph.rs (review comment resolved)
Comment on lines 43 to 50:

```rust
// Create a database for each source root.
let databases = package_roots
    .values()
    .filter_map(|package| package.as_deref())
    .filter_map(|package| package.parent())
    .map(Path::to_path_buf)
    .map(|source_root| Ok((source_root.clone(), ModuleDb::from_src_root(source_root)?)))
    .collect::<Result<BTreeMap<_, _>>>()?;
```
Member commented

What's the motivation for creating a new db for each source root?

The caching is local to each database, and creating a new database for each package means that we trade reduced cache efficiency for less congestion. This might be fine if the resolved modules in each package root are distinct, but I suspect that the files contain cross-package imports and many stdlib imports that now need to be resolved multiple times.

The recommended way to achieve parallelism in Salsa is to implement a snapshot method for your Database that creates a shallow clone:

```rust
#[must_use]
pub fn snapshot(&self) -> Self {
    Self {
        workspace: self.workspace,
        storage: self.storage.clone(),
        files: self.files.snapshot(),
        system: Arc::clone(&self.system),
    }
}
```


And you can then create a unique database handle for each task:

```rust
let files = WorkspaceFiles::new(db, self);
let result = Arc::new(std::sync::Mutex::new(Vec::new()));
let inner_result = Arc::clone(&result);

let db = db.snapshot();
let workspace_span = workspace_span.clone();

rayon::scope(move |scope| {
    for file in &files {
        let result = inner_result.clone();
        let db = db.snapshot();
        let workspace_span = workspace_span.clone();

        scope.spawn(move |_| {
            let check_file_span = tracing::debug_span!(parent: &workspace_span, "check_file", file=%file.path(&db));
            let _entered = check_file_span.entered();

            let file_diagnostics = check_file(&db, file);
            result.lock().unwrap().extend(file_diagnostics);
        });
    }
});

Arc::into_inner(result).unwrap().into_inner().unwrap()
```


Edit: I see below that you already did this. I'm still curious why we need multiple Dbs. I think what we do here is too granular. At least if I remember correctly how package roots work, it doesn't just generate a db for the top-level packages but also for all sub-packages. That means we have two databases for a.b.c: one for package a and another for a.b, if c is a module and a and a.b are packages. The result is that we double-resolve modules for a.b, depending on whether they come from the a or the a.b database.

I also worry that relative imports don't work because of that, since I think we might bail out at the src_root.

Can't we generate a single database, or a database for each common ancestor of the packages (which ideally is the src directory)?

If not, can we add some package paths to SearchPathSettings::extra_paths?
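A minimal sketch of the common-ancestor idea (the paths are illustrative, and this is not what the PR implements):

```python
import os

# Independent package roots, as in a Django-style layout where test
# sub-trees are importable on their own.
package_roots = [
    "/repo/django",
    "/repo/django/tests",
    "/repo/django/tests/i18n/sampleproject",
]

# A single database rooted at the deepest directory containing every
# package root would let cross-package and stdlib resolutions share one cache.
src_root = os.path.commonpath(package_roots)
```

Here `src_root` is `/repo/django`, so one database rooted there would cover all three roots.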

Member commented

> If not, can we add some package paths to SearchPathSettings::extra_paths?

This isn't really "correct" since extra_paths take priority over even the stdlib search path in module resolution. But maybe it'll produce the right answer in most cases.

In the real world, if you have multiple independent first-party packages in a workspace (which may or may not depend on each other), you're likely to have them all editably installed into a venv, and red-knot would statically discover all the first-party packages by reading the pth files in the site-packages directory and add them as search paths that way. But maybe that approach doesn't work here.

@charliermarsh (Member, Author) commented Sep 19, 2024

The source roots here are actually just the root folders. So, like, in Django, it's just a small set of top-level folders:

```json
[
  "/Users/crmarsh/workspace/django",
  "/Users/crmarsh/workspace/django/tests",
  "/Users/crmarsh/workspace/django/tests/admin_scripts/custom_templates",
  "/Users/crmarsh/workspace/django/tests/admin_scripts/custom_templates/project_template",
  "/Users/crmarsh/workspace/django/tests/i18n/sampleproject",
  "/Users/crmarsh/workspace/django/tests/migrations/faulty_migrations/namespace",
  "/Users/crmarsh/workspace/django/tests/migrations/test_fake_initial_case_insensitive"
]
```

@charliermarsh (Member, Author) commented

We could use a single database with extra paths -- that's actually the way I did it initially, though Alex said it's not the intended use-case. The downside, I think, is that it would allow those source roots to import files across one another. Maybe that's correct though! I can change it.

@AlexWaygood (Member) commented Sep 19, 2024

> The source roots here are actually just the root folders.

Possibly I'm still misunderstanding, but if everything's importable from the workspace root, doesn't that imply you can resolve all modules and submodules relative to a single search path, src_root, which would point to the workspace root?

Member commented

Alex is obviously the authority when it comes to search paths and I'm probably wrong haha 😆

The intended setup is to use a single database for an application (except VS code where you can have multiple workspaces). That gives the best cache reuse. If that's possible, great. If we aren't hitting any perf issues, that's great too.

@carljm (Contributor) commented Sep 19, 2024

Looking at the roots for Django that were listed, I think those are all independent roots (as in, they aren't all importable from a single source root), but they can import each other. So I think it would be more efficient to use a single db; by using multiple, you are likely repeating work. It's true that the red-knot resolver (because it was built around the more restrictive assumptions of PEP 561 for Python type checkers, rather than a generic "ordered sys.path" abstraction) doesn't obviously accommodate this use case, but I think extra-paths is a totally fine way to handle it, and better than multiple databases. (In fact, I think this is pretty much the sort of use case extra-paths exists to handle.)

> This isn't really "correct" since extra_paths take priority over even the stdlib search path in module resolution.

I don't understand this comment: all first-party paths (both the main source root and the extra paths) should always take priority over the stdlib in module resolution. According to PEP 561 order, the first two items in the resolution order are extra paths and then the first-party root. So if you have multiple "first-party roots", given that they have to go in some resolution order, I don't see how any incorrectness is introduced by putting N-1 of them as extra paths, and whichever one you want last as the main first-party root.

Member commented

> > This isn't really "correct" since extra_paths take priority over even the stdlib search path in module resolution.
>
> I don't understand this comment: all first-party paths (both the main source root and the extra paths) should always take priority over the stdlib in module resolution.

Ugh, you're quite right. Blame it on the holiday haze... yeah, using extra_paths should be fine then, in that case. Sorry for causing unnecessary work @charliermarsh :(

@carljm (Contributor) commented Sep 19, 2024

> In the real world, if you have multiple independent first-party packages in a workspace (which may or may not depend on each other), you're likely to have them all editably installed into a venv, and red-knot would statically discover all the first-party packages by reading the pth files in the site-packages directory and add them as search paths that way. But maybe that approach doesn't work here.

In many real-world cases (including Django), you'll have little sub-trees of Python code that aren't packaged at all or installed via Python packaging tools, but still end up on the runtime sys.path in some more bespoke way. That's the sort of scenario extra-paths is supposed to be for.

crates/ruff_db/src/system/os.rs (review comment resolved)
crates/ruff_graph/src/collector.rs (review comment resolved)
crates/ruff_workspace/src/settings.rs (two review comments resolved)
```
@@ -3306,6 +3313,48 @@ pub struct FormatOptions {
    pub docstring_code_line_length: Option<DocstringCodeLineWidth>,
}

/// Configures Ruff's import map generation.
#[derive(
    Clone, Debug, PartialEq, Eq, Default, Deserialize, Serialize, OptionsMetadata, CombineOptions,
```
Member commented

Should we remove the OptionsMetadata trait for now to exclude the settings from the schema?

crates/ruff_graph/src/resolver.rs (two review comments resolved)
@charliermarsh charliermarsh force-pushed the charlie/import-map branch 3 times, most recently from b4626e1 to 2de42c5 Compare September 19, 2024 22:33
@carljm (Contributor) left a comment

Very cool!


```rust
// Resolve all package roots.
let package_roots = resolver
    .package_roots(
```
@carljm (Contributor) commented Sep 19, 2024

This logic (determining a package root for an arbitrary file by walking up the tree looking for __init__.py) is part of Ruff so it must be fairly battle-tested, but I'm a little surprised if it doesn't often fail on big internal codebases. In my experience, "accidental namespace packages" (because someone just forgot an __init__.py, and imports worked anyway) are pretty common. Maybe a lot of people use linters (like Ruff, perhaps!) that prevent this?

Because of that, I tend to be more on the side of, if you're going to be doing multi-file analysis, you should explicitly configure the package root(s), rather than trying to discover files and packages automatically, which is easily broken by namespace packages and therefore requires explicitly configuring namespace packages.
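The walk being described can be sketched roughly like this (an assumption-laden simplification; Ruff's actual logic handles more cases, such as configured namespace packages):

```python
import tempfile
from pathlib import Path

def package_root(module: Path) -> Path:
    """Walk up from a module to the topmost directory that still contains
    an __init__.py — a rough sketch of package-root discovery."""
    directory = module.parent
    if not (directory / "__init__.py").exists():
        return directory  # top-level module: its own directory is the root
    while (directory.parent / "__init__.py").exists():
        directory = directory.parent
    return directory

# Demo: a regular package `a.b` with a module inside it.
tmp = Path(tempfile.mkdtemp())
(tmp / "a" / "b").mkdir(parents=True)
(tmp / "a" / "__init__.py").touch()
(tmp / "a" / "b" / "__init__.py").touch()
(tmp / "a" / "b" / "mod.py").touch()
root = package_root(tmp / "a" / "b" / "mod.py")
```

The "accidental namespace package" failure mode is visible here: if `a/__init__.py` were missing, the walk would stop at `a/b` and report the wrong root, even though `import a.b.mod` may still work at runtime.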

crates/ruff/src/commands/analyze_graph.rs (review comment resolved)
crates/ruff_graph/src/collector.rs (two review comments resolved)

```rust
fn visit_expr(&mut self, expr: &'ast Expr) {
    if self.string_imports {
        if let Expr::StringLiteral(ast::ExprStringLiteral { value, range: _ }) = expr {
```
@carljm (Contributor) commented Sep 19, 2024

This is more aggressive than I realized -- we don't even look for calls to __import__ or importlib.import_module, we just look for any string that might be a module path.

It seems like this could easily be fooled by cases where an import-path constant is defined in one file (like a Django settings file) but the string is imported and then used in a dynamic import in a different module.

But if it's good enough to match existing tools on the codebases we care about, great!
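A hedged sketch of the kind of heuristic being discussed — treating any string literal that looks like a dotted module path as a potential import. The exact rules in Ruff may differ; requiring at least one dot is an assumption made here:

```python
import re

# Assumption: a dotted path of ASCII identifiers with at least one dot,
# e.g. "django.conf.settings"; bare names like "config" are not flagged.
MODULE_PATH = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)+$")

def looks_like_module_path(literal: str) -> bool:
    """True if a string literal plausibly names a dotted module path."""
    return MODULE_PATH.match(literal) is not None
```

As noted above, this matches any dotted string regardless of how it is used, so false positives (and misses, when the constant is defined far from the dynamic import) are inherent to the approach.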

@charliermarsh charliermarsh force-pushed the charlie/import-map branch 2 times, most recently from 6adc038 to a3914bb Compare September 20, 2024 00:22
@charliermarsh charliermarsh merged commit 4e935f7 into main Sep 20, 2024
19 of 20 checks passed
@charliermarsh charliermarsh deleted the charlie/import-map branch September 20, 2024 01:06
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 7, 2024
## 0.6.9

### Preview features

- Fix codeblock dynamic line length calculation for indented docstring examples ([#13523](astral-sh/ruff#13523))
- \[`refurb`\] Mark `FURB118` fix as unsafe ([#13613](astral-sh/ruff#13613))

### Rule changes

- \[`pydocstyle`\] Don't raise `D208` when last line is non-empty ([#13372](astral-sh/ruff#13372))
- \[`pylint`\] Preserve trivia (i.e. comments) in `PLR5501` autofix ([#13573](astral-sh/ruff#13573))

### Configuration

- \[`pyflakes`\] Add `allow-unused-imports` setting for `unused-import` rule (`F401`) ([#13601](astral-sh/ruff#13601))

### Bug fixes

- Support ruff discovery in pip build environments ([#13591](astral-sh/ruff#13591))
- \[`flake8-bugbear`\] Avoid short circuiting `B017` for multiple context managers ([#13609](astral-sh/ruff#13609))
- \[`pylint`\] Do not offer an invalid fix for `PLR1716` when the comparisons contain parenthesis ([#13527](astral-sh/ruff#13527))
- \[`pyupgrade`\] Fix `UP043` to apply to `collections.abc.Generator` and `collections.abc.AsyncGenerator` ([#13611](astral-sh/ruff#13611))
- \[`refurb`\] Fix handling of slices in tuples for `FURB118`, e.g., `x[:, 1]` ([#13518](astral-sh/ruff#13518))

### Documentation

- Update GitHub Action link to `astral-sh/ruff-action` ([#13551](astral-sh/ruff#13551))

## 0.6.8

### Preview features

- Remove unnecessary parentheses around `match case` clauses ([#13510](astral-sh/ruff#13510))
- Parenthesize overlong `if` guards in `match..case` clauses ([#13513](astral-sh/ruff#13513))
- Detect basic wildcard imports in `ruff analyze graph` ([#13486](astral-sh/ruff#13486))
- \[`pylint`\] Implement `boolean-chained-comparison` (`R1716`) ([#13435](astral-sh/ruff#13435))

### Rule changes

- \[`flake8-simplify`\] Detect `SIM910` when using variadic keyword arguments, i.e., `**kwargs` ([#13503](astral-sh/ruff#13503))
- \[`pyupgrade`\] Avoid false negatives with non-reference shadowed bindings of loop variables (`UP028`) ([#13504](astral-sh/ruff#13504))

### Bug fixes

- Detect tuples bound to variadic positional arguments i.e. `*args` ([#13512](astral-sh/ruff#13512))
- Exit gracefully on broken pipe errors ([#13485](astral-sh/ruff#13485))
- Avoid panic when analyze graph hits broken pipe ([#13484](astral-sh/ruff#13484))

### Performance

- Reuse `BTreeSets` in module resolver ([#13440](astral-sh/ruff#13440))
- Skip traversal for non-compound statements ([#13441](astral-sh/ruff#13441))

## 0.6.7

### Preview features

- Add Python version support to ruff analyze CLI ([#13426](astral-sh/ruff#13426))
- Add `exclude` support to `ruff analyze` ([#13425](astral-sh/ruff#13425))
- Fix parentheses around return type annotations ([#13381](astral-sh/ruff#13381))

### Rule changes

- \[`pydocstyle`\] Fix: Don't autofix if the first line ends in a question mark? (D400) ([#13399](astral-sh/ruff#13399))

### Bug fixes

- Respect `lint.exclude` in ruff check `--add-noqa` ([#13427](astral-sh/ruff#13427))

### Performance

- Avoid tracking module resolver files in Salsa ([#13437](astral-sh/ruff#13437))
- Use `forget` for module resolver database ([#13438](astral-sh/ruff#13438))

## 0.6.6

### Preview features

- \[`refurb`\] Skip `slice-to-remove-prefix-or-suffix` (`FURB188`) when non-trivial slice steps are present ([#13405](astral-sh/ruff#13405))
- Add a subcommand to generate dependency graphs ([#13402](astral-sh/ruff#13402))

### Formatter

- Fix placement of inline parameter comments ([#13379](astral-sh/ruff#13379))

### Server

- Fix off-by one error in the `LineIndex::offset` calculation ([#13407](astral-sh/ruff#13407))

### Bug fixes

- \[`fastapi`\] Respect FastAPI aliases in route definitions ([#13394](astral-sh/ruff#13394))
- \[`pydocstyle`\] Respect word boundaries when detecting function signature in docs ([#13388](astral-sh/ruff#13388))

### Documentation

- Add backlinks to rule overview linter ([#13368](astral-sh/ruff#13368))
- Fix documentation for editor vim plugin ALE ([#13348](astral-sh/ruff#13348))
- Fix rendering of `FURB188` docs ([#13406](astral-sh/ruff#13406))

4 participants