Rust Integration: Perform Max Flow in Rust instead of Python #1

ohjuny · 2024-10-18T21:16:09Z

The overarching goal is to improve performance by using Rust instead of Python to perform intense computations, while maintaining the Python API.

We implement Python-Rust interop using pyo3. We use the pathfinding crate for the Rust-side implementation of the Edmonds-Karp max flow algorithm.

Through initial profiling, we have identified the call to nx.maximum_flow to be the overwhelming bottleneck (51-74% of total time spent). This PR replaces the calls to nx.maximum_flow with Rust-side bindings, which initial profiling shows a total speedup of 1.6-2.6x.

Links

Profiling Data
Profiling Infrastructure repo: https://github.com/ohjuny/lowtime-exp

Results

Based on profiling runs of 5 models on A40 and A100 GPUs, we found:

3.83~6.99x speed up on first call to Rust-side max_flow
1.89~3.56x speed up on seconds call to Rust-side max_flow
Total runtime: 1.31~3.46x speedup

Build and Run

Assume you have some project that is using lowtime, and you have a local clone of this lowtime branch.

$ ls
project/    lowtime/

Then run the following to set up a virtual environment with this version of lowtime.

$ cd project/
$ python -m venv .env        # create virtual environment
$ source .env/bin/activate   # activate virtual environment
$ cd ../lowtime/
$ pip install -e .           # (editable) install lowtime
$ cd ../project/
$ #work on project, lowtime is now installed

If you modify lib.rs and want those changes reflected:

$ pwd
lowtime
$ pip uninstall lowtime
$ pip install -e .

Design Decisions

`SparseCapacity` vs `DenseCapacity`

The Rust-side code uses the pathfinding::durected::edmonds_karp::edmonds_karp implementation for max flow. The function takes in an EK type parameter, that determines whether to represent the graph using a SparseCapacity (BTree of Btrees) or DenseCapacity (adjaceny matrix). Due to the nature of computation graphs of parallelized training workloads, a SparseCapacity was more efficient. This was verified through profiling, which showed that SparseCapacity was up to 10x faster than DenseCapacity.

`OrderedFloat<f64>` vs `f32` vs `i64`

The edmonds_karp implementation requires the Zero, Bounded, Signed, Ord, Copy` traits to be implemented by capacity values.

Raw floats in Rust do not implement Ord, so they cannot be used; OrderedFloat does.
OrderedFloat<f32> resulted in incorrect output due to lack of precision.

We considered an implementation where we use integers, and convert to/from floats by multiplying by 1e9.

Using i64 resulted in incorrect output. Furthermore, it resulted in a similar runtime (no significant gain).
u64 cannot be used as it does not implement Signed, which is needed to represent negative residual flows in edmonds_karp.

Therefore OrderedFloat<f64> was chosen to represent capacity.

`pyo3` Data Transfer Reduction with "pimpl" idiom

The _lowtime_rs::PhillipsDessouky constructor returns the object itself, which is very large as it contains all the nodes, edges, and associated capacities. Due to a misconception that pyo3 would have to create a "Python copy" of the Rust object, we experimented with using the pimpl idiom to implement a wrapper PhillipsDessouky class that contained a pointer to an internal _PhillipsDessouky class, which contained the actual data and implementations. However, pyo3 seems to have a clever way of associating Rust objects with Python objects without creating a "Python copy". In fact, profiling the "pimpl version" resulted in slightly slower runtimes, probably due to the added indirection and use of the heap.

Handling Rust-side Errors

Before this PR, the lowtime codebase performs exception handling on the call tonx.maximum_flow, since the nx library is designed to raise well-defined exceptions (eg. nx.NetworkXUnbounded). However, the Rust-side pathfinding library is does not contain well-defined errors; instead, the edmonds_karp function considers the following situations unrecoverable:

panic!s if source and sink nodes are not found in verticies.
unreachable! if there is no flow to cancel in its internal representation of node capacities. One way this is caused is by providing a negative capacity as input (this caused a very hard-to-debug error when some flows were very small negative values due to floating point imprecision).
assert!s many invariants of the internal graph representation (assert! will panic! if the condition is false).

If the Rust-side library use well-defined errors, pyo3 has integration for propagating those errors as Python exceptions. The "recommended way" when using a third-party crate that does not contain well-defined errors is to, if possible, modify the third-party crate itself to do this.

pathfinding chooses to use unrecoverable macros over well-defined errors. Assuming we do not intend on changing the pathfinding codebase itself, we have 2 options:

Use catch_unwindto catch panic!s in Rust. Once caught, we convert the error to a custom Error type that is propagated to Python as an exception, which can be caught and handled in Python.
Do not handle panic!s, letting the program crash gracefully. The panic! and unreachable! macros are intended as graceful exit points for unrecoverable states, so we should not try to treat it like a Python exception.

There is significant debate on this topic; for example, this discussion based on the polars library. In the polars discussion, the debate was closed with:

A Rust panic really is intended to be treated as a (graceful) crash, not as a user-space error. It sounds from the polars thread that the maintainers there are willing to treat panics as bugs and fix them.

one does not need to manually extend error catching conditions when using a cython-wrapped library, as one would have to do when including a pyo3-wrapped library.

The point is that a panic isn't a typical error condition (returning Result is). A Python user isn't expected to interact with panics, the choice to not extend from Exception is deliberate so that panics are not accidentally caught.

Will close this one, if there is significant pressure from the community we can revisit.

We make the same decision here: we do not attempt to catch or handle panic!s and unreachable! crashes. If needed in the future, we can add invariant checks before the call to edmonds_karp.

Profiling Set Up

Through the development process, we use profiling data to determine whether an idea was worth pursuing. Since there were many ideas that depended on each other, there was a need to build a profiling infrastructure fast (even if it is scrappy) before we could start experimenting with lowtime. As a result, the current profiling inftrastructure was built:

Measure times between intervals of interest (eg. the call to find_min_cut).
Log the time.
Run a separate script after lowtime has completed that parses the job.log for profiling-related logs and groups relevant intervals together.

While fast, this approach results in many profiling related logs in the lowtime codebase, and requires a separate script to parse the logs post-lowtime. An ideal solution would be integrated profiling infrastructure integrated into lowtime itself, that can be turned on/off through a command-line argument when running lowtime. However, this PR chooses the "scrappy" option because:

The need for profiling data was blocking progress on the "core" idea (Rust interop).
Building an integrated profiling infrastructure seems like an independent addition that could be its own PR.

We will need to remove the profiling logs at the end of this chain of commits (i.e. when we eventually merge lowtime-rust with main), but we keep them for now as we will need to use the existing profiling infrastructure for future commits in this chain.

…de option to run rust or python versions

lowtime/lowtime_rust/src/lib.rs

lowtime/solver.py

jaywonchung · 2024-10-18T23:16:07Z

Could you also quickly document how to build and run this?

Co-authored-by: Jae-Won Chung <[email protected]>

… pimpl version is actually slower

ohjuny · 2024-11-05T20:44:26Z

Not 100% knowledgeable on Python packaging and distribution, but here's what I tried:

Started with src layout that you linked before. I got confused about how to reconcile the 2 different pyproject.tomls that used different build-backends.
Found an example (deltalake-python) that had mixed Python and Rust. The organization looked clean, and used a single maturin build-backend for both Rust and Python files.
Read through maturin README for more info as well. Main takeaway was that (1) the deltalake-python organization works and (2) it is good convention to prepend _ to the Rust module if the intention is that it is internal.

That's how I ended up with the current structure.

I'm trying to fix the last ci.yaml pyright test, where it seems to say that _lowtime_rust does not exist. When I test locally, just running "pip install lowtime" seems to be sufficient in building both Rust and Python libraries, so I'm not sure why the automated test is failing. Will have to look deeper into this, but let me know if you spot anything.

jaywonchung · 2024-11-06T04:55:47Z

Have a pyi file? lowtime/_lowtime_rs.pyi.

from __future__ import annotations

class PhillipsDessouky:
    node_ids: list[int]
    
    def __init__(
        self,
        node_ids: list[int],
        source_node_id: int,
        sink_node_id: int,
        edges_raw: list[tuple[tuple[int, int], float]]
    ) -> None: ...

    def max_flow(self) -> list[tuple[tuple[int, int], float]]: ...

ohjuny · 2024-11-07T00:59:53Z

Have a pyi file? lowtime/_lowtime_rs.pyi.

from __future__ import annotations

class PhillipsDessouky:
    node_ids: list[int]
    
    def __init__(
        self,
        node_ids: list[int],
        source_node_id: int,
        sink_node_id: int,
        edges_raw: list[tuple[tuple[int, int], float]]
    ) -> None: ...

    def max_flow(self) -> list[tuple[tuple[int, int], float]]: ...

~~DIdn't seem to work :( Will look further into this.~~

Edit: needed to also add an empty py.typed file, documented in the maturin docs. Fun fact: there is an open PR on pyo3 to generate these automatically.

Fixing some other pyright issues now.

jaywonchung

This is absolutely great work, thank you! I've left some suggestions and comments.

.github/workflows/CI.yml

pyproject.toml

src/phillips_dessouky.rs

Cargo.toml

lowtime/.gitignore

lowtime/_lowtime_rs.pyi

Co-authored-by: Jae-Won Chung <[email protected]>

lowtime/_lowtime_rs.pyi

Co-authored-by: Jae-Won Chung <[email protected]>

jaywonchung

LGTM!

One thing that would be nice to have is an automated e2e test for result consistency.

set up PR for naive implementation that copies each iteration

e85d6f8

ohjuny marked this pull request as draft October 18, 2024 21:16

ohjuny self-assigned this Oct 18, 2024

ohjuny added the enhancement New feature or request label Oct 18, 2024

ohjuny added 8 commits October 18, 2024 17:27

set up py03 directories

9bd30b8

add rust dependencies

5af0b31

add networkx dependency for testing

67979ae

rust-side PhillipsDessouky

e4f9b3f

python-side profiling code

743336f

integrate rust-side library with python solver

43f0832

black formatting

4048d9f

remove unnecessary dependency

463dd20

ohjuny requested a review from jaywonchung October 18, 2024 21:56

ohjuny added 4 commits October 18, 2024 18:23

factor out rust interop pre/post-processing

5a7d108

replace second max_flow call with rust

3e07341

black formatting

556940a

readd commented old max flow code for potential future where we provi…

9041daf

…de option to run rust or python versions

jaywonchung reviewed Oct 18, 2024

View reviewed changes

lowtime/lowtime_rust/src/lib.rs Outdated Show resolved Hide resolved

jaywonchung reviewed Oct 18, 2024

View reviewed changes

lowtime/lowtime_rust/src/lib.rs Outdated Show resolved Hide resolved

jaywonchung reviewed Oct 18, 2024

View reviewed changes

lowtime/solver.py Outdated Show resolved Hide resolved

ohjuny and others added 9 commits October 18, 2024 20:51

fix refactoring, need to double check with run

4acad7b

comment second max flow (issue with replacing second, will find out)

af5897c

remove rust-side commented i64 code

cd5c40e

rust-side cleanup

45dabdb

Co-authored-by: Jae-Won Chung <[email protected]>

implement 2nd max flow + attempt pimpl for rust-side. note that using…

93a1a78

… pimpl version is actually slower

remove unused pimpl implementation

1f37443

fix lint errors + rust comments

dbb90d9

rename lowtime_rust to lowtime_rs

2c5c5bc

figuring out how to package rust/pyo3 module within python distribution

7b14211

fix nodeview type annotations

10cc0c7

pyi for pyo3 python module

8ae5b7b

ohjuny added 4 commits November 7, 2024 12:09

py.typed for pyright check

54e97f3

fix pyi for _lowtime_rs

a4bb48d

fix type hint

e4aff75

remove unnecessary comment

b4a814c

ohjuny marked this pull request as ready for review November 7, 2024 18:00

refactor class impl out of lib

f945bff

ohjuny mentioned this pull request Nov 7, 2024

Rust Integration: Full Rust Integration #2

Closed

remove unnecessary pyo3 getters

61c7fcd

jaywonchung reviewed Nov 7, 2024

View reviewed changes

ohjuny and others added 4 commits November 7, 2024 14:20

Update pyproject.toml

18031b7

Co-authored-by: Jae-Won Chung <[email protected]>

remove version in cargo.toml

061f054

fix + reformat pyi

92656ff

add author ohjun

8f421b6

ohjuny commented Nov 8, 2024

View reviewed changes

lowtime/_lowtime_rs.pyi Outdated Show resolved Hide resolved

ohjuny and others added 8 commits November 7, 2024 19:07

black formatting for pyio

7d6fb08

Update src/phillips_dessouky.rs

85ffca7

Co-authored-by: Jae-Won Chung <[email protected]>

Update src/phillips_dessouky.rs

cb7e66f

Co-authored-by: Jae-Won Chung <[email protected]>

remove unused Duration

5a98d92

Update src/phillips_dessouky.rs

376f17f

Co-authored-by: Jae-Won Chung <[email protected]>

remove nested gitignore + hardcode version in configs

e22c3b0

rename conflicting ci

642d4a9

rename generated maturin ci yaml

9c509a6

jaywonchung approved these changes Nov 8, 2024

View reviewed changes

jaywonchung merged commit 3f086b9 into lowtime-rust Nov 8, 2024
17 checks passed

ohjuny mentioned this pull request Nov 9, 2024

Rust Integration: Full Rust Integration #3

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rust Integration: Perform Max Flow in Rust instead of Python #1

Rust Integration: Perform Max Flow in Rust instead of Python #1

ohjuny commented Oct 18, 2024 •

edited

Loading

jaywonchung commented Oct 18, 2024

ohjuny commented Nov 5, 2024 •

edited

Loading

jaywonchung commented Nov 6, 2024 •

edited

Loading

ohjuny commented Nov 7, 2024 •

edited

Loading

jaywonchung left a comment

jaywonchung left a comment

Rust Integration: Perform Max Flow in Rust instead of Python #1

Rust Integration: Perform Max Flow in Rust instead of Python #1

Conversation

ohjuny commented Oct 18, 2024 • edited Loading

Links

Results

Build and Run

Design Decisions

SparseCapacity vs DenseCapacity

OrderedFloat<f64> vs f32 vs i64

pyo3 Data Transfer Reduction with "pimpl" idiom

Handling Rust-side Errors

Profiling Set Up

jaywonchung commented Oct 18, 2024

ohjuny commented Nov 5, 2024 • edited Loading

jaywonchung commented Nov 6, 2024 • edited Loading

ohjuny commented Nov 7, 2024 • edited Loading

jaywonchung left a comment

Choose a reason for hiding this comment

jaywonchung left a comment

Choose a reason for hiding this comment

ohjuny commented Oct 18, 2024 •

edited

Loading

`SparseCapacity` vs `DenseCapacity`

`OrderedFloat<f64>` vs `f32` vs `i64`

`pyo3` Data Transfer Reduction with "pimpl" idiom

ohjuny commented Nov 5, 2024 •

edited

Loading

jaywonchung commented Nov 6, 2024 •

edited

Loading

ohjuny commented Nov 7, 2024 •

edited

Loading