Add unreachable code rule #5384
PR Check Results
Ecosystem: ✅ ecosystem check detected no changes.
I found a couple of false positives with the Bokeh repo, looking into them now.
Impressive work! I have two questions that may well fall under future work:
- Have you thought about how to support e.g. conditional expressions where we also have a conditional data flow?
- I commented on the `BasicBlock` layout. Could you expand on the reason why you chose this specific `BasicBlock` layout? Or, in general, could you document how the basic blocks should be structured?
/// i = 0 # block 0
/// while True: #
What's the reason for creating a connection to block 0 vs creating four blocks:
1. Everything before the while (`i = 0`)
2. While header (`while True:`)
3. While body
4. After the body

And connecting
- 1 and 2 with an unconditional jump
- 2 and 3 with a conditional jump
- 2 and 4 with a conditional jump
- 3 and 2 with an unconditional jump
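To make the two layouts concrete, here is a minimal Python sketch. It is not ruff's Rust implementation: the block labels and the `reachable` helper are hypothetical, and the graphs only model jumps, not the statements themselves. It assumes the loop condition is the constant `True`, so the header never jumps to the "after" block.

```python
def reachable(edges, start):
    """Return the set of blocks reachable from `start` by following `edges`."""
    seen, todo = set(), [start]
    while todo:
        block = todo.pop()
        if block not in seen:
            seen.add(block)
            todo.extend(edges.get(block, ()))
    return seen


# Four-block layout: 1 = before the loop, 2 = header, 3 = body, 4 = after.
# With `while True:` the 2 -> 4 edge is never taken, so it is left out.
four_blocks = {1: [2], 2: [3], 3: [2], 4: []}

# Merged layout: `i = 0` and the loop header share one block. The back edge
# re-enters that merged block, which is harmless because the graph is only
# used to decide reachability, not to execute code.
merged = {"before+header": ["body"], "body": ["before+header"], "after": []}

four_result = reachable(four_blocks, 1)
merged_result = reachable(merged, "before+header")
```

Both layouts leave the "after the body" block unreached, so for this purpose the merged layout loses no information while creating one block fewer.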
For the purposes of detecting unreachable code, your block 1 doesn't contain any control flow (notwithstanding function calls that always raise an exception), thus block 2 is always reached when block 1 is reached. Creating one block instead of two is then simply less work for us later on. Rustc does the same thing, see https://rustc-dev-guide.rust-lang.org/appendix/background.html#what-is-a-control-flow-graph.
Other than the additional block, it's roughly what we create at the moment, see https://github.com/astral-sh/ruff/pull/5384/files#diff-a98f67bee97e7c459fad838467a76c7cbcd7c66beca1bae4826d1fa4054c4e5d (crates/ruff/src/rules/ruff/rules/snapshots/ruff__rules__ruff__rules__unreachable__tests__while.py_12.snap) for a graph of code that is quite similar.
> Impressive work! I've two questions that may as well fall under future work
> * Have you thought about how to support e.g. conditional expressions where we also have a conditional data flow?

You mean, for example, an `if` inside the condition (test) of an `if` statement (`Expr::IfExp`)? No, I didn't have time to look at this yet.

> * I commented on the `BasicBlock` layout. Could you expand on the reason why you chose this specific `BasicBlock` layout? Or in general, could you document how the basic blocks should be structured?

I'm going to assume you mean `BasicBlocks` (plural), as you left a comment on that.
Basically, `BasicBlocks.blocks` is a tree laid out as a vector/array. The first and last blocks are defined as the last and first blocks in the function (i.e. they are switched) because we start processing at the last statement. Why start with the last statement? Because a statement always jumps to the next statement, assuming no control flow diversion, which means the next statement's block has to be part of the existing tree (vector) already so that we can reference it.
In between these two blocks, however, things get a little fuzzy... It's not ideal, but I couldn't really find a reasonable way to make it return a fixed order (in the time I had). Basically, each statement can add an arbitrary number of blocks, as it can have an arbitrary number of statements within it (think the body of a loop or `if` statement). For these "sub-statements" we use the same approach as for the top-level statements, but we reuse the `blocks` vector (for performance reasons). This all works in a fairly straightforward way up to the point where you have recursion, e.g. for loops.
For while loops, the body jumps to the while block itself, unless we see a `break` or `return`. But by default `create_blocks` points to the last block in `blocks` as the next block (unless `after` is set). The `after` argument isn't sufficient in all cases, so we need `change_next_block` to fix up the control flow in some cases. (I added the `after` argument after I had already implemented `change_next_block`, but I couldn't remove it as it was still required in some cases last time I checked.)
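The reverse-order construction described above can be sketched in Python as follows. This is a simplified illustration, not the actual `create_blocks` code: the `build_blocks` helper is hypothetical, it puts exactly one statement in each block, and it only understands straight-line code plus `return`.

```python
def build_blocks(stmts):
    """Build one block per statement, processing the last statement first.

    Each block is a `(stmt, next_index)` pair, where `next_index` is the
    position of the following block in the list, or None to terminate.
    """
    blocks = []
    for stmt in reversed(stmts):
        if stmt.startswith("return") or not blocks:
            next_index = None  # `return`, or the end of the function
        else:
            # The following statement was processed earlier, so its block
            # already exists and can be referenced by index.
            next_index = len(blocks) - 1
        blocks.append((stmt, next_index))
    return blocks


stmts = ["a = 1", "b = 2", "return a + b"]
blocks = build_blocks(stmts)
entry = len(blocks) - 1  # the first statement ends up in the last block

# Follow the jumps from the entry block to recover source order.
order = []
index = entry
while index is not None:
    stmt, index = blocks[index]
    order.append(stmt)
```

Because every block is pushed after the block it jumps to, an index is enough to link them. Loops break this property (the body must point at a block that is created later), which is where a fix-up step such as `change_next_block` comes in.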
Consider special-casing `yield`:
>>> def foo():
...     if False: yield
...     return 42
...
>>> def bar():
...     return 42
...
>>> foo()
<generator object foo at 0x7ffa840a4e00>
>>> bar()
42
IIRC there's one more corner case with nonlocals/locals in the presence of a closure, but I can't type it off the top of my head.
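The same point can be checked programmatically. The sketch below uses the standard-library `inspect.isgeneratorfunction` to show that an unreachable `yield` still changes what the function is, so removing it as "dead code" would change behaviour.

```python
import inspect


def foo():
    if False:
        yield  # never executed, but still makes `foo` a generator function
    return 42


def bar():
    return 42


foo_is_generator = inspect.isgeneratorfunction(foo)
bar_is_generator = inspect.isgeneratorfunction(bar)

# Calling `foo()` returns a generator; the `return 42` only surfaces as the
# value attached to `StopIteration` once the generator is advanced.
try:
    next(foo())
except StopIteration as stop:
    foo_result = stop.value
```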
I've moved |
@MichaReiser @charliermarsh I can't push the branch any more, but here are two more patches:
Move Stmt::Expr to a different match branch:
```patch
From d2042fbd090cd59e77363395a48dc94dd6da0db0 Mon Sep 17 00:00:00 2001
From: Thomas de Zeeuw <[email protected]>
Date: Sun, 2 Jul 2023 17:28:09 +0200
Subject: [PATCH 1/2] Move Stmt::Expr to a different match branch
This still needs to be handled.
---
.../ruff/src/rules/ruff/rules/unreachable.rs | 34 ++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/crates/ruff/src/rules/ruff/rules/unreachable.rs b/crates/ruff/src/rules/ruff/rules/unreachable.rs
index b8959749b..5bc085cf8 100644
--- a/crates/ruff/src/rules/ruff/rules/unreachable.rs
+++ b/crates/ruff/src/rules/ruff/rules/unreachable.rs
@@ -357,7 +357,6 @@ fn create_blocks<'stmt>(
| Stmt::Assign(_)
| Stmt::AugAssign(_)
| Stmt::AnnAssign(_)
- | Stmt::Expr(_)
| Stmt::Break(_)
| Stmt::Continue(_) // NOTE: the next branch gets fixed up in `change_next_block`.
| Stmt::Pass(_) => unconditional_next_block(blocks, after),
@@ -437,6 +436,7 @@ fn create_blocks<'stmt>(
// TODO: currently we don't include the lines before the match
// statement in the block, unlike what we do for other
// statements.
+ after = Some(blocks.len() - 1);
continue;
}
Stmt::Raise(_) => {
@@ -461,6 +461,38 @@ fn create_blocks<'stmt>(
orelse,
}
}
+ Stmt::Expr(stmt) => {
+ match &*stmt.value {
+ Expr::BoolOp(_) |
+ Expr::BinOp(_) |
+ Expr::UnaryOp(_) |
+ Expr::Dict(_) |
+ Expr::Set(_) |
+ Expr::Compare(_) |
+ Expr::Call(_) |
+ Expr::FormattedValue(_) |
+ Expr::JoinedStr(_) |
+ Expr::Constant(_) |
+ Expr::Attribute(_) |
+ Expr::Subscript(_) |
+ Expr::Starred(_) |
+ Expr::Name(_) |
+ Expr::List(_) |
+ Expr::Tuple(_) |
+ Expr::Slice(_) => unconditional_next_block(blocks, after),
+ // TODO: handle these expressions.
+ Expr::NamedExpr(_) |
+ Expr::Lambda(_) |
+ Expr::IfExp(_) |
+ Expr::ListComp(_) |
+ Expr::SetComp(_) |
+ Expr::DictComp(_) |
+ Expr::GeneratorExp(_) |
+ Expr::Await(_) |
+ Expr::Yield(_) |
+ Expr::YieldFrom(_) => unconditional_next_block(blocks, after),
+ }
+ },
// The tough branches are done, here is an easy one.
Stmt::Return(_) => NextBlock::Terminate,
};
--
2.41.0
```
Improve BasicBlocks docs:
````patch
From 41e7813b4d83771689905020c5cd1e0779788445 Mon Sep 17 00:00:00 2001
From: Thomas de Zeeuw <[email protected]>
Date: Sun, 2 Jul 2023 17:29:42 +0200
Subject: [PATCH 2/2] Improve BasicBlocks docs
---
crates/ruff/src/rules/ruff/rules/unreachable.rs | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/crates/ruff/src/rules/ruff/rules/unreachable.rs b/crates/ruff/src/rules/ruff/rules/unreachable.rs
index 5bc085cf8..dcbe33186 100644
--- a/crates/ruff/src/rules/ruff/rules/unreachable.rs
+++ b/crates/ruff/src/rules/ruff/rules/unreachable.rs
@@ -194,9 +194,10 @@ struct BasicBlocks<'stmt> {
/// # Notes
///
/// The order of these block is unspecified. However it's guaranteed that
- /// the last block is the statement in the function and the first block is
- /// the last statement. The block are more or less in reverse order, but it
- /// gets fussy around control flow statements (e.g. `if` statements).
+ /// the last block is the first statement in the function and the first
+ /// block is the last statement. The block are more or less in reverse
+ /// order, but it gets fussy around control flow statements (e.g. `while`
+ /// statements).
///
/// For loop blocks, and similar recurring control flows, the end of the
/// body will point to the loop block again (to create the loop). However an
--
2.41.0
````
This adds a new rule that detects unreachable code, currently limited to function bodies.
How it Works
============
The rule works as follows.
First, we create "basic blocks" from the statements. These basic blocks are zero or more lines of code (statements) for which the code flow is easy to follow. Specifically, all statements in a single block follow each other, with no diversion of the control flow. At the end of the block the control can do one of three things: 1) continue to another code block, 2) based on a condition jump to one of two code blocks, or 3) terminate (return or end of the function).
Second, based on these basic blocks, and the simplified control flow they represent, we can determine what blocks are and aren't reached. We do this by starting with the first block of the function and following the jumps it makes to the next code block, marking them all as reached.
Third, we create a diagnostic for each code block that is not reached by step 2. Currently these diagnostics are quite limited, see below.
Future Work and Limitations
===========================
This commit is only the beginning of this rule; there is much work still left to do.
The diagnostics currently created are quite limited. They only mention the function name and point to the first statement in the basic block. In the future this should be expanded to point to all statements in the basic block. Furthermore, it would be helpful to users to explain *why* a code block is not reached, as currently this might not be clear except for the most basic examples.
We're currently quite limited in how we determine whether a branch is always taken: we only detect the constants `True` and `False`, and only in the most basic conditions (mainly the `if` and `while` statements). This can be expanded to detect more cases where we can statically determine whether or not a branch is taken.
This currently doesn't handle `try` or (`async`) `with` statements.
For match statements we currently set `BasicBlock::stmts` to the entire match statement for each basic block, even though we're only interested in one of its cases. Within match statements, binding to named patterns is currently not handled. Similarly to wildcards, they should be considered to be always taken (assuming no guard is present).
In some cases, for example while constructing a while loop, the block indices don't always exist. Deal with that possibility by simply ignoring that block.
This is often the `after` variable, which wasn't correctly used everywhere. This is now fixed, and a regression test based on a function found in Bokeh is added to test this.
But this commit doesn't actually fix the problem. The problem is that `try` statements aren't handled yet and we simply continue with the next block, which in the case of a `while True` loop creates an infinite loop. Thus the rule triggers on any statements after the while loop, but this is incorrect.
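The shape of code that triggers this false positive can be illustrated with a small Python function. This is a hypothetical stand-in, not the actual Bokeh function: the only exit from the `while True` loop is a `break` inside a `try` body, so an analysis that skips `try` statements sees an infinite loop and wrongly flags the final `return`.

```python
def first_int(values):
    """Return twice the first value that parses as an integer."""
    items = iter(values)
    while True:
        try:
            value = int(next(items))
            break  # the only way out of the loop lives inside the `try`
        except ValueError:
            continue  # not an integer, try the next item
    # Reachable at runtime, but reported as unreachable if the `break`
    # inside the `try` body is never seen by the analysis.
    return value * 2


result = first_int(["x", "3"])
```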
@@ -0,0 +1,241 @@
---
source: crates/ruff/src/rules/ruff/rules/unreachable.rs
The only remaining question is how to get the rule for now.
Summary
This adds a new rule that detects unreachable code, currently limited to function bodies.
How it Works
The rule works as follows.
First, we create "basic blocks" from the statements. These basic blocks are zero or more lines of code (statements) for which the code flow is easy to follow. Specifically, all statements in a single block follow each other, with no diversion of the control flow. At the end of the block the control can do one of three things: 1) continue to another code block, 2) based on a condition jump to one of two code blocks, or 3) terminate (return or end of the function).
Second, based on these basic blocks, and the simplified control flow they represent, we can determine what blocks are and aren't reached. We do this by starting with the first block of the function and following the jumps it makes to the next code block, marking them all as reached.
Third, we create a diagnostic for each code block that is not reached by step 2. Currently these diagnostics are quite limited, see below.
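The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the rule's Rust code: the class names (`Always`, `If`, `Terminate`, `Block`), the string-based condition check and the diagnostic wording are all hypothetical. The blocks are written out by hand for a function `f` whose body is `if False: a()` / `else: b()` followed by `return x`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Always:  # 1) continue to another block
    target: int


@dataclass
class If:  # 2) jump to one of two blocks based on a condition
    condition: str
    then: int
    orelse: int


@dataclass
class Terminate:  # 3) return, or the end of the function
    pass


@dataclass
class Block:
    stmts: list
    next_block: Union[Always, If, Terminate]


def successors(block):
    """Blocks that can follow `block`, folding the constants True/False."""
    nxt = block.next_block
    if isinstance(nxt, Always):
        return [nxt.target]
    if isinstance(nxt, If):
        if nxt.condition == "True":
            return [nxt.then]
        if nxt.condition == "False":
            return [nxt.orelse]
        return [nxt.then, nxt.orelse]
    return []


def unreachable_blocks(blocks, entry):
    """Step 2: mark everything reachable from `entry`, return the rest."""
    reached, todo = set(), [entry]
    while todo:
        index = todo.pop()
        if index not in reached:
            reached.add(index)
            todo.extend(successors(blocks[index]))
    return [i for i in range(len(blocks)) if i not in reached]


# Step 1, done by hand: the first statement is the last block.
blocks = [
    Block(["return x"], Terminate()),                     # 0
    Block(["b()"], Always(0)),                            # 1
    Block(["a()"], Always(0)),                            # 2
    Block(["if False:"], If("False", then=2, orelse=1)),  # 3 (entry)
]
unreachable = unreachable_blocks(blocks, entry=3)

# Step 3: one diagnostic per unreached block, pointing at its first statement.
messages = [f"Unreachable code in `f`: {blocks[i].stmts[0]}" for i in unreachable]
```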
Future Work and Limitations
This commit is only the beginning of this rule; there is much work still left to do.
The diagnostics currently created are quite limited. They only mention the function name and point to the first statement in the basic block. In the future this should be expanded to point to all statements in the basic block. Furthermore, it would be helpful to users to explain why a code block is not reached, as currently this might not be clear except for the most basic examples.
We're currently quite limited in how we determine whether a branch is always taken: we only detect the constants `True` and `False`, and only in the most basic conditions (mainly the `if` and `while` statements). This can be expanded to detect more cases where we can statically determine whether or not a branch is taken.
This currently doesn't handle the `try` or (`async`) `with` statements.
For match statements we currently set `BasicBlock::stmts` to the entire match statement for each basic block, even though we're only interested in one of its cases.
Within match statements, binding to named patterns is currently not handled. Similarly to wildcards, they should be considered to be always taken (assuming no guard is present).
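The constant-condition limitation mentioned above (only the literal `True` and `False` are recognised) can be illustrated with Python's standard `ast` module. This is a sketch of the idea, not how the rule is implemented in Rust; the `constant_branch` helper is hypothetical.

```python
import ast


def constant_branch(test):
    """Return True/False if `test` is the literal True/False, else None."""
    if isinstance(test, ast.Constant) and isinstance(test.value, bool):
        return test.value
    return None  # not statically known by this simple check


source = """
if False:
    pass
while True:
    pass
if x > 1:
    pass
while 1:
    pass
"""

# Every top-level statement in `source` is an `if` or `while`, so each node
# has a `test` attribute holding its condition expression.
results = [constant_branch(node.test) for node in ast.parse(source).body]
```

Note that `while 1:` is also always taken at runtime but is not recognised here, which is the kind of case the text says could be added later.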
False Positive
This has the possibility for false positives, mostly around the non-implementation of `try` and `with` statements. Currently I found one in the Bokeh repo and added it as a (commented-out) test case. I believe time is best spent implementing `try` and `with` instead of working around this issue.
Test Plan
Added new tests, simply run `cargo test`.
I also manually ran this on the Airflow, Bokeh, CPython, FastAPI and Jupyter server repos.