Verilator simulations can be made an order of magnitude faster #2000

Open
2 tasks done
mayyxeng opened this issue Aug 13, 2024 · 3 comments

Comments

@mayyxeng

Background Work

Feature Description

For simple tests where we only need to run an assembly program, we can make Verilator simulations more than an order of magnitude faster by removing some functionality from the test harness.

Motivating Example

On an AMD EPYC 9554 (3.75 GHz) processor, a single-threaded Verilator simulation of a single-core RocketChip runs at about 10 kHz. By stripping down the harness, we could make it run at 270 kHz (also single-threaded on the EPYC), i.e., 27x faster.
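To put the two rates in perspective, here is a minimal sketch of what they mean in wall-clock time. The 100-million-cycle workload is a made-up figure for illustration only; the 10 kHz and 270 kHz rates are the ones quoted above.

```python
def wall_clock_seconds(cycles: int, rate_hz: float) -> float:
    """Wall-clock time needed to simulate `cycles` target cycles at `rate_hz`."""
    return cycles / rate_hz


# Hypothetical test length, chosen only for illustration.
CYCLES = 100_000_000

baseline = wall_clock_seconds(CYCLES, 10e3)    # default harness, about 10 kHz
stripped = wall_clock_seconds(CYCLES, 270e3)   # stripped-down harness, 270 kHz

print(f"baseline: {baseline / 3600:.2f} h")
print(f"stripped: {stripped / 60:.1f} min")
print(f"speedup:  {baseline / stripped:.0f}x")
```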

Here is how I achieved a 27x speedup:

  1. Remove TL monitors (WithoutTLMonitors), as stated in the documentation.
  2. Directly load the program as a hex file into the simulated RAM ($readmemh), see here.
  3. Exclusively handle +verbose in Verilog using a simplified all-Verilog harness.

The last step has the most significant effect. It seems that Verilator really struggles with how verbose printing is handled through $c(...) PLI calls. Even when the simulation is non-verbose, there is a huge performance impact.
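For step 2, the program has to exist as a text hex image before the simulated RAM can load it. The sketch below shows one way such a file could be produced from a raw binary. It is not the script used for the numbers above: the function name, the 8-byte word width, and the little-endian packing are all assumptions that would have to match the actual memory declaration.

```python
def binary_to_readmemh(data: bytes, word_bytes: int = 8, base_word: int = 0) -> str:
    """Format raw bytes as text in the hex-image style Verilog memories load.

    Assumptions (hypothetical, adjust to the real RAM model):
      * each memory word is `word_bytes` wide,
      * bytes are packed little-endian within a word,
      * `base_word` is the index of the first memory word to fill.
    """
    if word_bytes <= 0:
        raise ValueError("word_bytes must be positive")

    # Zero-pad so the image is a whole number of memory words.
    data = data + b"\x00" * ((-len(data)) % word_bytes)

    # An '@<hex>' line sets the word address for the entries that follow.
    lines = [f"@{base_word:x}"]
    for offset in range(0, len(data), word_bytes):
        word = int.from_bytes(data[offset:offset + word_bytes], "little")
        lines.append(f"{word:0{word_bytes * 2}x}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A single 32-bit RISC-V NOP (addi x0, x0, 0) stored in a 4-byte-wide memory.
    print(binary_to_readmemh(bytes([0x13, 0x00, 0x00, 0x00]), word_bytes=4), end="")
```

Emitting one word per line keeps the file easy to diff against a disassembly, and the explicit address line makes the load offset visible in the image itself.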

@jerryz123
Contributor

Thanks for looking into this... that 27x speedup is quite tempting.

For 2), I believe the LOADMEM flag should accomplish the same thing... it offloads loading the program to the C++ DRAM model entirely, and happens instantaneously in simulated time. https://chipyard.readthedocs.io/en/latest/Simulation/Software-RTL-Simulation.html#fast-memory-loading

For 3), I'm struggling to find the significant part of the diff between your Verilog TestDriver and the one we use: https://github.com/chipsalliance/rocket-chip/blob/dev/src/main/resources/vsrc/TestDriver.v

Could you clarify which changes you made to your TestDriver were significant?

@mayyxeng
Author

I am sorry, I've oversimplified the changes.

About 2), you are right, LOADMEM=1 achieves a similar effect, albeit using DRAMSim.
About 3), well, there is much more that's changed. For instance, take a look at this reduced base MinimalSimulationConfig.

To provide a bit of context, I tried to shave off as much unnecessary stuff as possible to run some bare-metal tests on an RTL simulator I was working on. For instance, there is now effectively a single clock domain and no SimDTM (hence no DPI). I believe not all the changes in MinimalSimulationConfig are necessary for performance. I made most of them because the other simulator I was working on was less capable than Verilator, and I just don't know exactly which ones are critical. Perhaps I should try to create a new config with as few changes as possible (compared to mainline Chipyard) and see if I can still get a big speed boost. Would that be interesting for the main branch?

P.S.
I am also passing slightly different arguments to Verilator, e.g., no --split-cfunc as that negatively impacts performance.

@jerryz123
Contributor

I'm more interested in simulation speedup opportunities given the same target configuration. I expect that minimizing the target design, as you have done in MinimalSimulationConfig, would obviously yield some speedups. (Does MinimalSimulationConfig even have a core in it?)

The flags we have chosen for Verilator have not been optimized for performance, and I'd be very interested in learning more about how to tune Verilator. The current --split-cfunc was done because at one point the monolithic verilated C++ file would cause GCC to OOM on some of our memory-limited CI machines, and would also compile much more slowly.
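When comparing flag sets on the same target configuration, it helps to measure the simulation rate the same way for every build. A minimal sketch of such a measurement follows. The helper name is hypothetical, the command passed in is whatever simulator binary a given build produces, and the simulated cycle count is assumed to be known for the test being run.

```python
import subprocess
import sys
import time


def measure_khz(cmd, simulated_cycles):
    """Run one simulation command and return its rate in kHz.

    `cmd` is an argument list for the simulator under test and
    `simulated_cycles` is the number of target cycles that run covers;
    both are supplied by the caller (assumptions, not Chipyard APIs).
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return simulated_cycles / elapsed / 1e3


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real simulator invocation and the cycle count of the test.
    rate = measure_khz([sys.executable, "-c", "pass"], simulated_cycles=1_000_000)
    print(f"{rate:.1f} kHz")
```

The measured time includes process startup, so longer tests give a more representative rate than very short ones.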
