Add a subcommand to generate dependency graphs #13402
Conversation
Couple of comments from skimming!
This needs tests prior to merging, but I would like feedback.
Overall looks good.
I think we're creating too many `Db` instances, which will hurt performance, and I would recommend using `ModuleName` and `SystemPath` over `QualifiedName` and `Path` to avoid conversions at the leaf.
crates/ruff/src/commands/graph.rs
Outdated
```rust
// Create a database for each source root.
let databases = package_roots
    .values()
    .filter_map(|package| package.as_deref())
    .filter_map(|package| package.parent())
    .map(Path::to_path_buf)
    .map(|source_root| Ok((source_root.clone(), ModuleDb::from_src_root(source_root)?)))
    .collect::<Result<BTreeMap<_, _>>>()?;
```
What's the motivation for creating a new db for each source root?
The caching is local to each database, and creating a new database for each package means that we trade reduced congestion for reduced cache efficiency. This might be fine if the resolved modules in each package root are distinct, but I suspect that the files contain cross-package imports and many stdlib imports that now need to be resolved multiple times.
The recommended way to achieve parallelism in Salsa is to implement a `snapshot` method for your `Database` that creates a shallow clone:

```rust
#[must_use]
pub fn snapshot(&self) -> Self {
    Self {
        workspace: self.workspace,
        storage: self.storage.clone(),
        files: self.files.snapshot(),
        system: Arc::clone(&self.system),
    }
}
```
(ruff/crates/red_knot_workspace/src/db.rs, lines 82 to 90 in ecab04e)
And you can then create a unique database handle for each task:
```rust
let files = WorkspaceFiles::new(db, self);
let result = Arc::new(std::sync::Mutex::new(Vec::new()));
let inner_result = Arc::clone(&result);
let db = db.snapshot();
let workspace_span = workspace_span.clone();

rayon::scope(move |scope| {
    for file in &files {
        let result = inner_result.clone();
        let db = db.snapshot();
        let workspace_span = workspace_span.clone();

        scope.spawn(move |_| {
            let check_file_span = tracing::debug_span!(parent: &workspace_span, "check_file", file=%file.path(&db));
            let _entered = check_file_span.entered();

            let file_diagnostics = check_file(&db, file);
            result.lock().unwrap().extend(file_diagnostics);
        });
    }
});

Arc::into_inner(result).unwrap().into_inner().unwrap()
```
(ruff/crates/red_knot_workspace/src/workspace.rs, lines 206 to 220 in 599103c)
Edit: I see below that you already did this. I'm still curious why we need multiple `Db`s. I think what we do here is too granular, at least if I remember correctly how package roots work. I think it doesn't just generate a db for the top-level packages but also for all sub-packages. That means that we have two databases for `a.b.c`: one for package `a` and another for `a.b`, if `c` is a module and `a` and `a.b` are packages. The result is that we double-resolve modules for `a.b`, depending on whether they come from the `a` or the `a.b` database.

I also worry that relative imports don't work because of that, since I think we might bail out at the `src_root`.

Can't we generate a single database, or a database for each common ancestor of the packages (which ideally is the src directory)? If not, can we add some package paths to `SearchPathSettings::extra_paths`?
> If not, can we add some package paths to `SearchPathSettings::extra_paths`?

This isn't really "correct" since `extra_paths` take priority over even the stdlib search path in module resolution. But maybe it'll produce the right answer in most cases.
In the real world, if you have multiple independent first-party packages in a workspace (which may or may not depend on each other), you're likely to have them all editably installed into a venv, and red-knot would statically discover all the first-party packages by reading the `.pth` files in the `site-packages` directory and add them as search paths that way. But maybe that approach doesn't work here.
The source roots here are actually just the root folders. So, like, in Django, it's just a small set of top-level folders:
["/Users/crmarsh/workspace/django", "/Users/crmarsh/workspace/django/tests", "/Users/crmarsh/workspace/django/tests/admin_scripts/custom_templates", "/Users/crmarsh/workspace/django/tests/admin_scripts/custom_templates/project_template", "/Users/crmarsh/workspace/django/tests/i18n/sampleproject", "/Users/crmarsh/workspace/django/tests/migrations/faulty_migrations/namespace", "/Users/crmarsh/workspace/django/tests/migrations/test_fake_initial_case_insensitive"]
We could use a single database with extra paths -- that's actually the way I did it initially, though Alex said it's not the intended use-case. The downside, I think, is that it would allow those source roots to import files across one another. Maybe that's correct though! I can change it.
> The source roots here are actually just the root folders.

Possibly I'm still misunderstanding, but if everything's importable from the workspace root, doesn't that imply you can resolve all modules and submodules relative to a single search path, `src_root`, which would point to the workspace root?
Alex is obviously the authority when it comes to search paths and I'm probably wrong haha 😆
The intended setup is to use a single database for an application (except VS code where you can have multiple workspaces). That gives the best cache reuse. If that's possible, great. If we aren't hitting any perf issues, that's great too.
Looking at the roots for Django that were listed, I think those are all independent roots (as in, they aren't all importable from a single source root), but they can import each other. So I think it would be more efficient to use a single db; by using multiple you are likely repeating work. It's true that the red-knot resolver (because it was built around the more restrictive assumptions of PEP 561 for Python type checkers, rather than a generic "ordered `sys.path`" abstraction) doesn't obviously accommodate this use case, but I think extra-paths is a totally fine way to handle it, and better than multiple databases. (In fact, I think this is pretty much the sort of use-case extra-paths exists to handle.)

> This isn't really "correct" since `extra_paths` take priority over even the stdlib search path in module resolution.

I don't understand this comment: all first-party paths (both the main source root and the extra paths) should always take priority over the stdlib in module resolution. According to PEP 561 order, the first two items in the resolution order are extra paths and then the first-party root. So if you have multiple "first-party roots", given that they have to go in some resolution order, I don't see how any incorrectness is introduced by putting N-1 of them as extra paths, and whichever one you want last as the main first-party root.
> > This isn't really "correct" since `extra_paths` take priority over even the stdlib search path in module resolution.
>
> I don't understand this comment: all first-party paths (both the main source root and the extra paths) should always take priority over the stdlib in module resolution.

Ugh, you're quite right. Blame it on the holiday haze... yeah, using `extra_paths` should be fine in that case. Sorry for causing unnecessary work @charliermarsh :(
> In the real world, if you have multiple independent first-party packages in a workspace (which may or may not depend on each other), you're likely to have them all editably installed into a venv, and red-knot would statically discover all the first-party packages by reading the `.pth` files in the `site-packages` directory and add them as search paths that way. But maybe that approach doesn't work here.

In many real-world cases (including Django), you'll have little sub-trees of Python code that aren't packaged at all or installed via Python packaging tools, but still end up on the runtime `sys.path` in some more bespoke way. That's the sort of scenario `extra-paths` is supposed to be for.
```diff
@@ -3306,6 +3313,48 @@ pub struct FormatOptions {
     pub docstring_code_line_length: Option<DocstringCodeLineWidth>,
 }

+/// Configures Ruff's import map generation.
+#[derive(
+    Clone, Debug, PartialEq, Eq, Default, Deserialize, Serialize, OptionsMetadata, CombineOptions,
```
Should we remove the `OptionsMetadata` trait for now to exclude the settings from the schema?
Very cool!
```rust
// Resolve all package roots.
let package_roots = resolver
    .package_roots(
```
This logic (determining a package root for an arbitrary file by walking up the tree looking for `__init__.py`) is part of Ruff, so it must be fairly battle-tested, but I'm a little surprised if it doesn't often fail on big internal codebases. In my experience, "accidental namespace packages" (because someone just forgot an `__init__.py`, and imports worked anyway) are pretty common. Maybe a lot of people use linters (like Ruff, perhaps!) that prevent this?
Because of that, I tend to be more on the side of: if you're going to be doing multi-file analysis, you should explicitly configure the package root(s), rather than trying to discover files and packages automatically, which is easily broken by namespace packages and therefore requires explicitly configuring namespace packages.
```rust
fn visit_expr(&mut self, expr: &'ast Expr) {
    if self.string_imports {
        if let Expr::StringLiteral(ast::ExprStringLiteral { value, range: _ }) = expr {
```
This is more aggressive than I realized -- we don't even look for calls to `__import__` or `importlib.import_module`; we just look for any string that might be a module path.
It seems like this could easily be fooled by cases where an import-path constant is defined in one file (like a Django settings file) but the string is imported and then used in a dynamic import in a different module.
But if it's good enough to match existing tools on the codebases we care about, great!
Summary
This PR adds an experimental Ruff subcommand to generate dependency graphs based on module resolution.
A few highlights:
- The direction of the graph can be changed via the `--direction` command-line argument.
- Imports can optionally be detected within string literals (enabled via `--detect-string-imports`).
- Users can declare additional dependencies via the `include-dependencies` key under `[tool.ruff.import-map]`. This map uses file paths as keys, and lists of strings as values. Those strings can be file paths or globs.

The dependency resolution uses the red-knot module resolver, which is intended to be fully spec compliant, so it's also a chance to expose the module resolver in a real-world setting.
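Based on the shape described above (an `include-dependencies` map whose keys are file paths and whose values are lists of paths or globs), a configuration might look like the following; the specific file names are purely illustrative:

```toml
[tool.ruff.import-map.include-dependencies]
# Hypothetical example: declare that these files depend on non-Python assets.
"web/views.py" = ["templates/**/*.html"]
"scripts/run.py" = ["config/settings.py"]
```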
The CLI is, e.g., `ruff graph build ../autobot`, which will output a JSON map from each file to the files it depends on for the `autobot` project.
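An illustrative (hypothetical) shape for that JSON output, assuming a map from each file to the list of files it depends on:

```json
{
  "autobot/__init__.py": [],
  "autobot/cli.py": ["autobot/__init__.py", "autobot/runner.py"],
  "autobot/runner.py": ["autobot/__init__.py"]
}
```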