Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unreachable code rule #5384

Merged
merged 12 commits into from
Jul 4, 2023
Merged

Add unreachable code rule #5384

merged 12 commits into from
Jul 4, 2023

Conversation

Thomasdezeeuw
Copy link
Contributor

@Thomasdezeeuw Thomasdezeeuw commented Jun 27, 2023

Summary

This adds a new rule that detect unreachable code, currently limited to function bodies.

How it Works

The rule works as follows.

First we create "basic blocks" from the statements. These basic block are zero or more lines of code (statements) for which the code flow is easy to follow. Specifically all statements in a single block follow each other, no diversion of the control flow. At the end of the block the control can do one of three things: 1) continue to another code block, 2) based on a condition jump to one of two code blocks, or 3) terminate (return or end of the function).

Second, based on these basic blocks, and the simplified control flow they represent, we can determine what blocks are and aren't reached. We do this by starting with the first block of the function and following the jumps it makes to the next code block, marking them all as reached.

Third, we create a diagnostic for each code block that is not reached by step 2. Currently these diagnostics are quite limited, see below.

Future Work and Limitations

This commit is only the beginning of this rule, there is much work still left to do.

The diagnostics created currently is quite limited. It only mentions the function name and points to the fist statement in the basic block. In the future this should be expanded to point to all statements in the basic block. Furthermore it would be helpful to the users to explain why a code block is not reached as currently, except for the most basic example, this might not be clear.

We're currently quite limited on how we determine if a branch is always taken or not, specifically we only detect the constants true and false and only in the most basic condition (mainly the if and while statements). This can be expanded to detect more cases where we can statically determine whether or not a branch is taken.

This currently doesn't handle the try or (async) with statements.

For match statements we currently set BasicBlock::stmts to the entire match statement for each basic block, even though we're only interested in one of its cases.

Within match statements binding to named patterns is currently not handled. Similarly to wildcard they should be considered to be always taken (assuming no guard is present).

False Positive

This has the possibility for false positive mostly around the non-implementation of try and with statements. Currently I found one in the Bokeh repo and added it as a (commented-out) test case. I believe time is best spend implementing try and with instead of working around this issue.

Test Plan

Added new tests, simply run cargo test.

I also manually ran this on airflow the Bokeh, Cpython, FastAPI and Jupyter server repos.

@github-actions
Copy link
Contributor

github-actions bot commented Jun 27, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.01      8.1±0.05ms     5.0 MB/sec    1.00      8.0±0.02ms     5.1 MB/sec
formatter/numpy/ctypeslib.py               1.01   1768.6±5.14µs     9.4 MB/sec    1.00   1745.5±2.57µs     9.5 MB/sec
formatter/numpy/globals.py                 1.02    198.3±0.43µs    14.9 MB/sec    1.00    195.1±0.34µs    15.1 MB/sec
formatter/pydantic/types.py                1.02      3.9±0.02ms     6.6 MB/sec    1.00      3.8±0.01ms     6.7 MB/sec
linter/all-rules/large/dataset.py          1.02     14.1±0.08ms     2.9 MB/sec    1.00     13.8±0.06ms     2.9 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.03      3.5±0.02ms     4.7 MB/sec    1.00      3.4±0.02ms     4.9 MB/sec
linter/all-rules/numpy/globals.py          1.00    436.4±1.49µs     6.8 MB/sec    1.00    435.6±0.39µs     6.8 MB/sec
linter/all-rules/pydantic/types.py         1.02      6.3±0.02ms     4.1 MB/sec    1.00      6.1±0.03ms     4.2 MB/sec
linter/default-rules/large/dataset.py      1.01      7.0±0.02ms     5.8 MB/sec    1.00      6.9±0.02ms     5.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1495.1±2.75µs    11.1 MB/sec    1.00   1499.9±3.07µs    11.1 MB/sec
linter/default-rules/numpy/globals.py      1.00    170.8±3.71µs    17.3 MB/sec    1.00    170.7±0.26µs    17.3 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.1±0.02ms     8.2 MB/sec    1.00      3.1±0.01ms     8.2 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.02      9.6±0.12ms     4.2 MB/sec    1.00      9.4±0.07ms     4.3 MB/sec
formatter/numpy/ctypeslib.py               1.00      2.0±0.02ms     8.3 MB/sec    1.00      2.0±0.02ms     8.3 MB/sec
formatter/numpy/globals.py                 1.00    218.9±2.29µs    13.5 MB/sec    1.02   222.4±14.54µs    13.3 MB/sec
formatter/pydantic/types.py                1.01      4.5±0.04ms     5.7 MB/sec    1.00      4.5±0.04ms     5.7 MB/sec
linter/all-rules/large/dataset.py          1.00     15.4±0.16ms     2.6 MB/sec    1.00     15.5±0.07ms     2.6 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.1±0.04ms     4.0 MB/sec    1.00      4.1±0.03ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.00   429.0±10.25µs     6.9 MB/sec    1.01    433.8±7.69µs     6.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      7.1±0.12ms     3.6 MB/sec    1.01      7.1±0.16ms     3.6 MB/sec
linter/default-rules/large/dataset.py      1.00      8.1±0.03ms     5.0 MB/sec    1.02      8.2±0.04ms     4.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1658.9±9.63µs    10.0 MB/sec    1.01   1676.2±8.92µs     9.9 MB/sec
linter/default-rules/numpy/globals.py      1.00    180.5±1.34µs    16.4 MB/sec    1.00    181.1±1.32µs    16.3 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.6±0.01ms     7.0 MB/sec    1.01      3.7±0.02ms     7.0 MB/sec

@Thomasdezeeuw
Copy link
Contributor Author

I found a couple of false positive with the Bokeh repo, looking into them now.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! I've two questions that may as well fall under future work

  • Have you thought about how to support e.g. conditional expressions where we also have a conditional data flow?

  • I commented on the BasicBlock layout. Could you expand on the reason why you chose this specific BasicBlock layout? Or in general, could you document how the basic blocks should be structured?

crates/ruff/src/rules/ruff/rules/unreachable.rs Outdated Show resolved Hide resolved
Comment on lines +208 to +210
/// i = 0 # block 0
/// while True: #
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for creating a connection to block 0 vs creating four blocks:

  1. Everything before the while i = 0
  2. While header (while True:)
  3. While body
  4. After the body

And connecting

  • 1 and 2 with an unconditional jump
  • 2 and 3 with a conditional jump
  • 2 and 4 with a conditional jump
  • 3 and 2 with an unconditional jump

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purposes of detecting unreachable code, your block 1 doesn't contain any control flow, notwithstanding function calls that always raise an exception, thus block 2 is always reached when block 1 is reached. Creating one block instead of two is then simply less work for us later on. Rustc does the same thing, see https://rustc-dev-guide.rust-lang.org/appendix/background.html#what-is-a-control-flow-graph.

Other then the addition block, it's what we roughly create at the moment, see https://github.com/astral-sh/ruff/pull/5384/files#diff-a98f67bee97e7c459fad838467a76c7cbcd7c66beca1bae4826d1fa4054c4e5d (crates/ruff/src/rules/ruff/rules/snapshots/ruff__rules__ruff__rules__unreachable__tests__while.py_12.snap) for a graph of code that is quite similar.

crates/ruff/src/rules/ruff/rules/unreachable.rs Outdated Show resolved Hide resolved
crates/ruff/src/rules/ruff/rules/unreachable.rs Outdated Show resolved Hide resolved
Copy link
Contributor Author

@Thomasdezeeuw Thomasdezeeuw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! I've two questions that may as well fall under future work

* Have you thought about how to support e.g. conditional expressions where we also have a conditional data flow?

You mean for example an if statement inside of the condition (test) of an if statement (Expr::IfExp)? No, I didn't have time to look at this yet.

* I commented on the `BasicBlock` layout. Could you expand on the reason why you chose this specific `BasicBlock` layout? Or in general, could you document how the basic blocks should be structured?

I'm going to assume you mean BasicBlocks (multiple) as you left a comment on that.

Basically the BasicBlocks.blocks is a tree laid out as a vector/array. The fist and last blocks are defined as the last and first blocks in the function (i.e. they are switched) because we start processing the last statement. Why start with the last statement? Because the statement always jumps to the next statement assuming no control flow diversion, which means they have to be part of the existing tree (vector) to reference them.

In between these two block however things get a little fuzzy... It's not ideal, but I couldn't really find a reasonable way to make it return a fixed order (in the time I had). Basically each statement can add an arbitrary number of blocks as it can have an arbitrary number of statements within it (think the body of a loop or if statement). For these "sub-statements" we use the same approach as the top-level statements, but we reuse the blocks vector (for performance reasons). This all works fairly straight forward up to the point where you have recursion, e.g. for loops.

For while loops the body jumps to the while block itself, unless we see a break or return. But by default create_blocks points to the last block in blocks as the next block (unless after is set). But the after argument isn't sufficient in all cases, so we need change_next_block to fix up the control flow in some cases. (I added the after argument after I already implemented change_next_block, but I couldn't remove it as it's still required in some cases last time I checked)

crates/ruff/src/rules/ruff/rules/unreachable.rs Outdated Show resolved Hide resolved
Comment on lines +208 to +210
/// i = 0 # block 0
/// while True: #
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purposes of detecting unreachable code, your block 1 doesn't contain any control flow, notwithstanding function calls that always raise an exception, thus block 2 is always reached when block 1 is reached. Creating one block instead of two is then simply less work for us later on. Rustc does the same thing, see https://rustc-dev-guide.rust-lang.org/appendix/background.html#what-is-a-control-flow-graph.

Other then the addition block, it's what we roughly create at the moment, see https://github.com/astral-sh/ruff/pull/5384/files#diff-a98f67bee97e7c459fad838467a76c7cbcd7c66beca1bae4826d1fa4054c4e5d (crates/ruff/src/rules/ruff/rules/snapshots/ruff__rules__ruff__rules__unreachable__tests__while.py_12.snap) for a graph of code that is quite similar.

crates/ruff/src/rules/ruff/rules/unreachable.rs Outdated Show resolved Hide resolved
Copy link

@dimaqq dimaqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider special-casing yield

>>> def foo():
...   if False: yield
...   return 42
... 
>>> def bar():
...   return 42
... 
...
>>> foo()
<generator object foo at 0x7ffa840a4e00>
>>> bar()
42

IIRC there's one more corner case with nonlocals/locals in the presence of a closure, but I can't type if off the top of my head.

@Thomasdezeeuw
Copy link
Contributor Author

Consider special-casing yield

>>> def foo():
...   if False: yield
...   return 42
... 
>>> def bar():
...   return 42
... 
...
>>> foo()
<generator object foo at 0x7ffa840a4e00>
>>> bar()
42

IIRC there's one more corner case with nonlocals/locals in the presence of a closure, but I can't type if off the top of my head.

I've moved Stmt::Expr to be handled separately, but it's future work.

@Thomasdezeeuw
Copy link
Contributor Author

@MichaReiser @charliermarsh I can't push the branch any more, but here are two more patches:

Move Stmt::Expr to a different match branch:

From d2042fbd090cd59e77363395a48dc94dd6da0db0 Mon Sep 17 00:00:00 2001
From: Thomas de Zeeuw <[email protected]>
Date: Sun, 2 Jul 2023 17:28:09 +0200
Subject: [PATCH 1/2] Move Stmt::Expr to a different match branch

This still needs to be handled.
---
 .../ruff/src/rules/ruff/rules/unreachable.rs  | 34 ++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/crates/ruff/src/rules/ruff/rules/unreachable.rs b/crates/ruff/src/rules/ruff/rules/unreachable.rs
index b8959749b..5bc085cf8 100644
--- a/crates/ruff/src/rules/ruff/rules/unreachable.rs
+++ b/crates/ruff/src/rules/ruff/rules/unreachable.rs
@@ -357,7 +357,6 @@ fn create_blocks<'stmt>(
             | Stmt::Assign(_)
             | Stmt::AugAssign(_)
             | Stmt::AnnAssign(_)
-            | Stmt::Expr(_)
             | Stmt::Break(_)
             | Stmt::Continue(_) // NOTE: the next branch gets fixed up in `change_next_block`.
             | Stmt::Pass(_) => unconditional_next_block(blocks, after),
@@ -437,6 +436,7 @@ fn create_blocks<'stmt>(
                 // TODO: currently we don't include the lines before the match
                 // statement in the block, unlike what we do for other
                 // statements.
+                after = Some(blocks.len() - 1);
                 continue;
             }
             Stmt::Raise(_) => {
@@ -461,6 +461,38 @@ fn create_blocks<'stmt>(
                     orelse,
                 }
             }
+            Stmt::Expr(stmt) => {
+                match &*stmt.value {
+                    Expr::BoolOp(_) |
+                    Expr::BinOp(_) |
+                    Expr::UnaryOp(_) |
+                    Expr::Dict(_) |
+                    Expr::Set(_) |
+                    Expr::Compare(_) |
+                    Expr::Call(_) |
+                    Expr::FormattedValue(_) |
+                    Expr::JoinedStr(_) |
+                    Expr::Constant(_) |
+                    Expr::Attribute(_) |
+                    Expr::Subscript(_) |
+                    Expr::Starred(_) |
+                    Expr::Name(_) |
+                    Expr::List(_) |
+                    Expr::Tuple(_) |
+                    Expr::Slice(_)  => unconditional_next_block(blocks, after),
+                    // TODO: handle these expressions.
+                    Expr::NamedExpr(_) |
+                    Expr::Lambda(_) |
+                    Expr::IfExp(_) |
+                    Expr::ListComp(_) |
+                    Expr::SetComp(_) |
+                    Expr::DictComp(_) |
+                    Expr::GeneratorExp(_) |
+                    Expr::Await(_) |
+                    Expr::Yield(_) |
+                    Expr::YieldFrom(_) => unconditional_next_block(blocks, after),
+                }
+            },
             // The tough branches are done, here is an easy one.
             Stmt::Return(_) => NextBlock::Terminate,
         };
-- 
2.41.0
``` 

Improve BasicBlocks docs:

````patch
From 41e7813b4d83771689905020c5cd1e0779788445 Mon Sep 17 00:00:00 2001
From: Thomas de Zeeuw <[email protected]>
Date: Sun, 2 Jul 2023 17:29:42 +0200
Subject: [PATCH 2/2] Improve BasicBlocks docs

---
 crates/ruff/src/rules/ruff/rules/unreachable.rs | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/crates/ruff/src/rules/ruff/rules/unreachable.rs b/crates/ruff/src/rules/ruff/rules/unreachable.rs
index 5bc085cf8..dcbe33186 100644
--- a/crates/ruff/src/rules/ruff/rules/unreachable.rs
+++ b/crates/ruff/src/rules/ruff/rules/unreachable.rs
@@ -194,9 +194,10 @@ struct BasicBlocks<'stmt> {
     /// # Notes
     ///
     /// The order of these block is unspecified. However it's guaranteed that
-    /// the last block is the statement in the function and the first block is
-    /// the last statement. The block are more or less in reverse order, but it
-    /// gets fussy around control flow statements (e.g. `if` statements).
+    /// the last block is the first statement in the function and the first
+    /// block is the last statement. The block are more or less in reverse
+    /// order, but it gets fussy around control flow statements (e.g. `while`
+    /// statements).
     ///
     /// For loop blocks, and similar recurring control flows, the end of the
     /// body will point to the loop block again (to create the loop). However an
-- 
2.41.0
``` 

Thomas de Zeeuw and others added 8 commits July 4, 2023 08:38
This adds a new rule that detect unreachable code, currently limited to
function bodies.

How it Works
============

The rule works as follows.

First we create "basic blocks" from the statements. These basic block
are zero or more lines of code (statements) for which the code flow is
easy to follow. Specifically all statements in a single block follow
each other, no diversion of the control flow. At the end of the block
the control can do one of three things: 1) continue to another code
block, 2) based on a condition jump to one of two code blocks, or 3)
terminate (return or end of the function).

Second, based on these basic blocks, and the simplified control flow
they represent, we can determine what blocks are and aren't reached. We
do this by starting with the first block of the function and following
the jumps it makes to the next code block, marking them all as reached.

Third, we create a diagnostic for each code block that is not reached by
step 2. Currently these diagnostics are quite limited, see below.

Future Work and Limitations
===========================

This commit is only the beginning of this rule, there is much work still
left to do.

The diagnostics created currently is quite limited. It only mentions the
function name and points to the fist statement in the basic block. In
the future this should be expanded to point to all statements in the
basic block. Furthermore it would be helpful to the users to explain
*why* a code block is not reached as currently, except for the most
basic example, this might not be clear.

We're currently quite limited on how we determine if a branch is always
taken or not, specifically we only detect the constants true and false
and only in the most basic condition (mainly the `if` and `while`
statements). This can be expanded to detect more cases where we can
statically determine whether or not a branch is taken.

This currently doesn't have `try` or (`async`) `with` statements.

For match statements we currently set `BasicBlock::stmts` to the entire
match statement for each basic block, even though we're only interested
in one of its cases.

Within match statements binding to named patterns is currently not
handled. Similarly to wildcard they should be considered to be always
taken (assuming no guard is present).
In some cases, for example while constructing a while loop, the block
indices don't always exist. Deal with that possibility by simpling
ignoring that block.
This is often the `after` variable, which wasn't correctly used
everywhere. This now fixed and a regression tests based on a function
found in Bokeh is added to test this.
But this commit doesn't actually fix the problem.

The problem is that the try statements aren't handled yet and simple
continue with the next block, which in the case of a `while True` loop
creates an infinite loop. Thus the rule triggers on any statements after
the while loop, but this is incorrect.
@MichaReiser
Copy link
Member

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@@ -0,0 +1,241 @@
---
source: crates/ruff/src/rules/ruff/rules/unreachable.rs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the file extension to .md.snap. This gives us a nice preview if you associate the file extension with markdown in your editor

image

@MichaReiser
Copy link
Member

The only remaining question is how to get the rule for now.

@MichaReiser MichaReiser enabled auto-merge (squash) July 4, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants