[compiler-v2][optimization] Create and retain temps for each arg #15514

Merged
merged 2 commits into from
Dec 13, 2024

@vineethk vineethk commented Dec 5, 2024

Description

This PR builds on the optimization made in #15445, where we optimized Assign stackless-bytecode instructions. In this PR, when generating stackless bytecode, we create a temporary for each argument (even when the argument is a simple variable) and retain those assignments through dead-store elimination if they represent cross-block def-uses. This way, the file-format bytecode generator knows when to load the cross-block definitions (fixes #15339, by essentially implementing the eager loads optimization in a different way). The flush writes optimization needs additional logic to handle this pipeline.
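
To picture the scheme described above, here is a minimal Python sketch. It is not the compiler's Rust code: the names Assign, Call, lower_call, and simplify are hypothetical, and the retention criterion (the source variable was defined in a different block than the assignment) is only an assumed reading of "cross-block def-use".

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Assign:
    dst: str
    src: str


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple


def lower_call(func, args, fresh):
    # Hypothetical lowering: one fresh temp per argument, even for plain variables.
    temps = [fresh() for _ in args]
    return [Assign(t, a) for t, a in zip(temps, args)] + [Call(func, tuple(temps))]


def simplify(code, block_id, def_block):
    # Assumed criterion: an assignment is a cross-block def-use when its source
    # was defined in a block other than the one containing the assignment.
    subst, out = {}, []
    for ins in code:
        if isinstance(ins, Assign):
            if def_block[ins.src] != block_id:
                out.append(ins)            # retained: marks where to load the value
            else:
                subst[ins.dst] = ins.src   # same block: the copy is propagated away
        else:
            out.append(Call(ins.func, tuple(subst.get(a, a) for a in ins.args)))
    return out


ids = count()
code = lower_call("bar", ["x", "y"], lambda: f"t{next(ids)}")
# Suppose x was defined in block 0 and y in block 1, and the call sits in block 1.
result = simplify(code, block_id=1, def_block={"x": 0, "y": 1})
print(result)
```

In this toy model, the copy of x survives because its definition lives in another block, while the copy of y is folded back into the call. A later code generator could then treat the surviving assignment as the point at which x has to be loaded onto the stack.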

With the optimization in this PR, on the aptos-framework and all its dependencies (std libs):

Note that all changes introduced here should be hidden behind the OPTIMIZE_WAITING_FOR_COMPARE_TESTS experiment (which is turned off by default), so landing this PR should not affect the generated code.

Safety arguments:

  • Assigning a new temp for each arg in a function call should be safe, because we already do this for non-trivial arguments.
  • Not removing certain dead code of the form x = x should be safe.
  • The flush writes optimization is purely a hinting mechanism that advises whether a definition loaded onto the stack should be flushed to a local right away. If it hints wrongly, we would get extra flushes to a local, but the execution semantics would not be affected.

A couple of examples of this optimization working:

This code:

    fun test1(x: u64) {
        bar(x, one(), one(), one(), one(), one());
    }

produces before this PR:

	0: Call one(): u64
	1: Call one(): u64
	2: Call one(): u64
	3: Call one(): u64
	4: Call one(): u64
	5: StLoc[1](loc0: u64)
	6: StLoc[2](loc1: u64)
	7: StLoc[3](loc2: u64)
	8: StLoc[4](loc3: u64)
	9: StLoc[5](loc4: u64)
	10: MoveLoc[0](Arg0: u64)
	11: MoveLoc[5](loc4: u64)
	12: MoveLoc[4](loc3: u64)
	13: MoveLoc[3](loc2: u64)
	14: MoveLoc[2](loc1: u64)
	15: MoveLoc[1](loc0: u64)
	16: Call bar(u64, u64, u64, u64, u64, u64)
	17: Ret

after this PR:

	0: MoveLoc[0](Arg0: u64)
	1: Call one(): u64
	2: Call one(): u64
	3: Call one(): u64
	4: Call one(): u64
	5: Call one(): u64
	6: Call bar(u64, u64, u64, u64, u64, u64)
	7: Ret
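
As a sanity check on the two test1 listings above, the following Python sketch interprets both on a toy stack machine and confirms that bar receives the same arguments in the same order. The interpreter is an illustration only: it models just the four instructions that appear here, and tagging each result of one() with its call order is an assumption made so that argument order is observable.

```python
def run(program, arg0):
    # Toy interpreter for the subset of Move bytecode used in test1.
    local = {0: arg0}
    stack, bar_calls, ones = [], [], 0
    for op, operand in program:
        if op == "Call" and operand == "one":
            ones += 1
            stack.append(("one", ones))        # tag each result with its call order
        elif op == "Call" and operand == "bar":
            bar_calls.append(tuple(stack[-6:]))  # bar takes six arguments
            del stack[-6:]
        elif op == "StLoc":
            local[operand] = stack.pop()
        elif op == "MoveLoc":
            stack.append(local.pop(operand))   # a moved local becomes unavailable
        elif op == "Ret":
            break
    return bar_calls


BEFORE = (
    [("Call", "one")] * 5
    + [("StLoc", i) for i in range(1, 6)]
    + [("MoveLoc", 0)]
    + [("MoveLoc", i) for i in range(5, 0, -1)]
    + [("Call", "bar"), ("Ret", None)]
)
AFTER = [("MoveLoc", 0)] + [("Call", "one")] * 5 + [("Call", "bar"), ("Ret", None)]

print(len(BEFORE), len(AFTER))
print(run(BEFORE, 7) == run(AFTER, 7))
```

The only observable difference in this model is that x is moved onto the stack before the calls to one() rather than after them, which is harmless here because one() takes no arguments and cannot touch the caller's locals.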

This code:

    fun test2(x: u64, y: u64): u64 {
        x + (y * x)
    }

produces before this PR:

	0: MoveLoc[1](Arg1: u64)
	1: CopyLoc[0](Arg0: u64)
	2: Mul
	3: StLoc[1](Arg1: u64)
	4: MoveLoc[0](Arg0: u64)
	5: MoveLoc[1](Arg1: u64)
	6: Add
	7: Ret

after this PR:

	0: CopyLoc[0](Arg0: u64)
	1: MoveLoc[1](Arg1: u64)
	2: MoveLoc[0](Arg0: u64)
	3: Mul
	4: Add
	5: Ret
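
The two test2 listings above can be checked the same way. This Python sketch is a toy interpreter covering only the instructions that occur in the listings; it uses unbounded Python integers, so u64 overflow aborts are not modeled.

```python
def run(program, x, y):
    # Interpret a tiny subset of Move bytecode; locals 0 and 1 hold Arg0 and Arg1.
    local = {0: x, 1: y}
    stack = []
    for op, idx in program:
        if op == "CopyLoc":
            stack.append(local[idx])
        elif op == "MoveLoc":
            stack.append(local.pop(idx))   # a moved local becomes unavailable
        elif op == "StLoc":
            local[idx] = stack.pop()
        elif op in ("Mul", "Add"):
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs * rhs if op == "Mul" else lhs + rhs)
        elif op == "Ret":
            return stack.pop()


BEFORE = [
    ("MoveLoc", 1), ("CopyLoc", 0), ("Mul", None), ("StLoc", 1),
    ("MoveLoc", 0), ("MoveLoc", 1), ("Add", None), ("Ret", None),
]
AFTER = [
    ("CopyLoc", 0), ("MoveLoc", 1), ("MoveLoc", 0),
    ("Mul", None), ("Add", None), ("Ret", None),
]

print(run(BEFORE, 3, 4), run(AFTER, 3, 4))
```

Both sequences leave x + (y * x) on the stack. The newer one avoids the round trip through local 1 by pushing x first and keeping it on the stack while the product is computed.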

How Has This Been Tested?

I've gone through all the test output changes:

  • In all existing tests, we produce fewer or equal instructions, with the exception of minor regressions in two tests (move-compiler-v2/tests/flush-writes/{def_use_03.on.exp, out_of_order_use_03.on.exp}). Filing an issue to follow up on these regressions.
  • Some error messages have become more precise, some have become more abstract (due to the introduction of the additional temporaries).

Key Areas to Review

  • Safety argument, correctness
  • Code retains existing behavior when experiment is off

Type of Change

  • Performance improvement

Which Components or Systems Does This Change Impact?

  • Move Compiler

trunk-io bot commented Dec 5, 2024

⏱️ 1h 39m total CI duration on this PR
| Slowest 15 Jobs | Cumulative Duration | Recent Runs |
|---|---|---|
| forge-compat-test / forge | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 13m | 🟩 |
| rust-move-tests | 12m | 🟩 |
| rust-move-tests | 12m | 🟩 |
| rust-cargo-deny | 9m | 🟩🟩🟩🟩🟩 (+1 more) |
| check-dynamic-deps | 6m | 🟩🟩🟩🟩🟩 (+1 more) |
| check | 3m | 🟩 |
| general-lints | 2m | 🟩🟩🟩🟩🟩 (+1 more) |
| semgrep/ci | 2m | 🟩🟩🟩🟩🟩 (+1 more) |
| file_change_determinator | 1m | 🟩🟩🟩🟩🟩 (+1 more) |
| permission-check | 18s | 🟩🟩🟩🟩🟩 (+1 more) |
| permission-check | 14s | 🟩🟩🟩🟩🟩 (+1 more) |
| check-branch-prefix | 1s | 🟩 |


@vineethk vineethk force-pushed the vk/temps-for-each-args branch 4 times, most recently from f1475b9 to 55e3631 Compare December 6, 2024 12:27
@vineethk vineethk changed the title [compiler-v2] Create temps for each arg [compiler-v2][optimization] Create and retain temps for each arg Dec 6, 2024
@vineethk vineethk force-pushed the vk/temps-for-each-args branch from 55e3631 to 92aaf1d Compare December 6, 2024 13:42
@vineethk vineethk marked this pull request as ready for review December 6, 2024 14:01
@vineethk vineethk enabled auto-merge (squash) December 13, 2024 21:53

✅ Forge suite realistic_env_max_load success on 311694401230605775aa0dc163e53621f2670cc9

two traffics test: inner traffic : committed: 14616.33 txn/s, latency: 2719.18 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5557440
two traffics test : committed: 100.10 txn/s, latency: 1371.94 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 2500 ms), latency samples: 1740
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.605, avg: 1.561", "ConsensusProposalToOrdered: max: 0.324, avg: 0.296", "ConsensusOrderedToCommit: max: 0.318, avg: 0.309", "ConsensusProposalToCommit: max: 0.612, avg: 0.605"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.62s no progress at version 35432 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.56s no progress at version 2229840 (avg 0.56s) [limit 16].
Test Ok

✅ Forge suite framework_upgrade success on 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9

Compatibility test results for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9 (PR)
Upgrade the nodes to version: 311694401230605775aa0dc163e53621f2670cc9
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1681.59 txn/s, submitted: 1683.33 txn/s, failed submission: 1.73 txn/s, expired: 1.73 txn/s, latency: 1990.15 ms, (p50: 2000 ms, p70: 2100, p90: 2700 ms, p99: 3800 ms), latency samples: 135740
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1423.02 txn/s, submitted: 1426.82 txn/s, failed submission: 3.80 txn/s, expired: 3.80 txn/s, latency: 2080.53 ms, (p50: 2100 ms, p70: 2200, p90: 3000 ms, p99: 4100 ms), latency samples: 127200
5. check swarm health
Compatibility test for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9 passed
Upgrade the remaining nodes to version: 311694401230605775aa0dc163e53621f2670cc9
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1428.33 txn/s, submitted: 1433.21 txn/s, failed submission: 4.66 txn/s, expired: 4.87 txn/s, latency: 2106.93 ms, (p50: 2100 ms, p70: 2300, p90: 3000 ms, p99: 4300 ms), latency samples: 128661
Test Ok

✅ Forge suite compat success on 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9

Compatibility test results for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9 (PR)
1. Check liveness of validators at old version: 3c6e693a27339e73520f41030dce8fc9cd504967
compatibility::simple-validator-upgrade::liveness-check : committed: 16414.22 txn/s, latency: 2080.76 ms, (p50: 2100 ms, p70: 2200, p90: 2400 ms, p99: 2800 ms), latency samples: 530400
2. Upgrading first Validator to new version: 311694401230605775aa0dc163e53621f2670cc9
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6391.73 txn/s, latency: 4408.13 ms, (p50: 4900 ms, p70: 5100, p90: 6200 ms, p99: 6300 ms), latency samples: 120780
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6524.84 txn/s, latency: 4960.76 ms, (p50: 5300 ms, p70: 5400, p90: 5500 ms, p99: 5600 ms), latency samples: 222940
3. Upgrading rest of first batch to new version: 311694401230605775aa0dc163e53621f2670cc9
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6634.80 txn/s, latency: 4303.67 ms, (p50: 4800 ms, p70: 5100, p90: 5600 ms, p99: 5700 ms), latency samples: 126600
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6930.88 txn/s, latency: 4800.15 ms, (p50: 5100 ms, p70: 5200, p90: 5400 ms, p99: 5800 ms), latency samples: 233260
4. upgrading second batch to new version: 311694401230605775aa0dc163e53621f2670cc9
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 11082.84 txn/s, latency: 2512.21 ms, (p50: 2800 ms, p70: 2900, p90: 3000 ms, p99: 3100 ms), latency samples: 194660
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 11389.07 txn/s, latency: 2809.47 ms, (p50: 2900 ms, p70: 3000, p90: 3100 ms, p99: 3200 ms), latency samples: 372260
5. check swarm health
Compatibility test for 3c6e693a27339e73520f41030dce8fc9cd504967 ==> 311694401230605775aa0dc163e53621f2670cc9 passed
Test Ok

@vineethk vineethk merged commit da63178 into main Dec 13, 2024
80 of 88 checks passed
@vineethk vineethk deleted the vk/temps-for-each-args branch December 13, 2024 22:50
vineethk added a commit that referenced this pull request Dec 16, 2024
vineethk added a commit that referenced this pull request Dec 16, 2024
* Creating temps for each arg. (#15514)

* [compiler-v2] Enable recent stack-optimizations by default (#15595)
georgemitenkov pushed a commit that referenced this pull request Jan 6, 2025
Successfully merging this pull request may close these issues.

[Feature Request] Implement "eager pushes" optimization