Optimize the performance of circular buffer #4275
base: main
Conversation
…n the hot path, avoiding unnecessary kernel and synchronization
**Greptile Summary**

Optimized circular buffer performance by eliminating GPU-CPU synchronization in the hot path, addressing the performance issue identified in #4274. Key optimizations:

1. Cached `max_length` as a Python int (avoids `.item()` calls)
2. Skipped the initialization check once all batches are initialized
3. Removed an unnecessary `.clone()`
4. Switched to in-place assignment in the actuator
Confidence Score: 4/5
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Actuator as DelayedPDActuator
    participant DelayBuf as DelayBuffer
    participant CircBuf as CircularBuffer
    Note over Actuator,CircBuf: Hot Path (called every physics step)
    Actuator->>DelayBuf: compute(control_action.joint_positions)
    DelayBuf->>CircBuf: append(data)
    alt First time after reset
        CircBuf->>CircBuf: Check _all_initialized flag (false)
        CircBuf->>CircBuf: Check is_first_push = (_num_pushes == 0)
        CircBuf->>CircBuf: Call .any().item() (GPU sync)
        CircBuf->>CircBuf: Initialize buffer if needed
        CircBuf->>CircBuf: Set _all_initialized = true
    else All batches initialized (optimized path)
        CircBuf->>CircBuf: Skip initialization check
        Note over CircBuf: No GPU-CPU sync needed!
    end
    CircBuf->>CircBuf: Increment _num_pushes
    DelayBuf->>CircBuf: __getitem__(time_lags)
    CircBuf-->>DelayBuf: Return delayed data (view)
    DelayBuf-->>Actuator: Return delayed data (no clone)
    Actuator->>Actuator: In-place assign with [:]
    Note over Actuator,CircBuf: Optimizations Applied:<br/>1. Cached max_length as int (avoid .item())<br/>2. Skip initialization check after warmup<br/>3. Removed unnecessary .clone()<br/>4. In-place assignment in actuator
```
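Pulling the diagram's notes together, the optimized `append` hot path could look roughly like the sketch below. This is a reconstruction for illustration, not the actual diff: the class name, the `_pointer` bookkeeping, and the shapes are assumptions, and reset handling is omitted.

```python
import torch

class CircularBufferSketch:
    """Rough reconstruction of the optimized hot path described in the diagram."""

    def __init__(self, max_length: int, batch: int, dim: int, device: str = "cpu"):
        self._max_length = int(max_length)   # cached as a plain int: no .item() per step
        self._buffer = torch.zeros(max_length, batch, dim, device=device)
        self._num_pushes = torch.zeros(batch, dtype=torch.long, device=device)
        self._pointer = -1                   # CPU-side ring write index
        self._all_initialized = False        # CPU flag: skip init check after warmup

    def append(self, data: torch.Tensor) -> None:
        self._pointer = (self._pointer + 1) % self._max_length
        if not self._all_initialized:
            # slow path: runs only until every batch entry has seen its first push
            is_first_push = self._num_pushes == 0
            if is_first_push.any().item():    # the one remaining GPU->CPU sync
                # back-fill the whole history for freshly reset envs
                self._buffer[:, is_first_push] = data[is_first_push]
            else:
                self._all_initialized = True  # from now on: pure GPU ops, no sync
        self._buffer[self._pointer] = data
        self._num_pushes += 1
```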
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Hi, this is a cool optimization, I'm excited to try it, thank you! Two quick qs:
```diff
-control_action.joint_positions = self.positions_delay_buffer.compute(control_action.joint_positions)
-control_action.joint_velocities = self.velocities_delay_buffer.compute(control_action.joint_velocities)
-control_action.joint_efforts = self.efforts_delay_buffer.compute(control_action.joint_efforts)
+control_action.joint_positions[:] = self.positions_delay_buffer.compute(control_action.joint_positions)
```
Is the assignment operation here needed? 🤔
This uses slice assignment to keep the original tensor storage (`control_action.joint_positions`, etc.) and overwrite its contents in place. Without it, `compute()` may return a new tensor, causing the attribute to be rebound to fresh storage and incurring additional allocation or copy overhead.
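For illustration, a minimal standalone sketch of the distinction (hypothetical tensors, not the actuator code): plain `=` rebinds the name to a new tensor, while `[:]` writes through to the existing storage that other references share.

```python
import torch

def compute(x: torch.Tensor) -> torch.Tensor:
    return x + 1  # stand-in for the delay buffer's compute(); returns a new tensor

buf = torch.zeros(4)
alias = buf                # another reference to the same storage

buf = compute(buf)         # rebinds `buf` to a new tensor; `alias` still sees zeros
print(alias)               # tensor([0., 0., 0., 0.])

buf = alias
buf[:] = compute(buf)      # writes in place; every reference observes the update
print(alias)               # tensor([1., 1., 1., 1.])
```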
The delay buffer already returns a copied tensor, since the time lags index the torch tensor internally. The operation here makes another copy of that tensor, which I don't think is needed. It will also overwrite the initial command set into the environment (after action processing), which may consequently affect the next decimation step.
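For context, a quick standalone check that advanced (integer) indexing in PyTorch allocates a new tensor rather than returning a view:

```python
import torch

x = torch.arange(6).reshape(3, 2)
idx = torch.tensor([0, 2])

y = x[idx]        # advanced indexing materializes a new tensor
y[0, 0] = 99      # mutate the result...
print(x[0, 0])    # tensor(0) -- the original is untouched, so y owns its storage
```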
@iansseijelly and @T-K-233 : Let me know your thoughts on the above. Thanks again!
Pinging again. @iansseijelly @T-K-233
Sorry, I missed this comment earlier.
The intention of this line was just to replace the `.clone()` called from the original `compute()` here, to save the cost of one allocation. Indeed, the time-lag index should create a copied tensor, so if the original clone is unnecessary, then so is this one.
Regarding the decimation comment, I don't fully follow it, but the decimation semantics should remain unchanged before and after this PR.
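A rough sketch of the compute path being discussed, under assumed structure (`_history` and `_time_lags` are illustrative stand-ins for the buffer's real attributes):

```python
import torch

class DelayBufferSketch:
    """Assumed structure of the compute path under discussion (illustrative only)."""

    def __init__(self, history: torch.Tensor, time_lags: torch.Tensor):
        self._history = history      # (max_len, batch, dim) circular history
        self._time_lags = time_lags  # (batch,) integer delay per environment

    def compute(self, data: torch.Tensor) -> torch.Tensor:
        # (appending `data` into the circular history is omitted for brevity)
        batch_ids = torch.arange(self._history.shape[1])
        # advanced indexing with per-env time lags gathers into a *new* tensor,
        # so a trailing .clone() would be a second, redundant allocation
        return self._history[self._time_lags, batch_ids]
```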
Awesome, thank you for answering my questions, I really appreciate it!
…verting all boolean conditions
AntoineRichard left a comment
The policies are training well on this branch and they do transfer to main, so the implementation is likely correct.
Code-wise it also looks good. I've suggested some modifications to further improve the performance, but they're not as substantial as removing the cloning step. If we are in a rush we can get this in as is. These will likely get reworked to support Warp anyway.
```python
if self._need_reset:
    if self._buffer is None or (self._num_pushes == 0).any().item():
        raise RuntimeError("Attempting to retrieve data on an empty circular buffer. Please append data first.")
```
Mostly a nit, but I'm not sure about these lines.
If we need a reset, do we need to perform that extra check to raise the exception?
Also, what happens if the `self._need_reset` flag is True but none of the other checks are? Should we still return the buffer?
It looks to me that if `_need_reset` is False, then `append` should have made sure none of the other conditions could be True.
In practice, this is never raised. I wanted to match the semantics of the original code but guard the check with a scalar value to make it cheaper, hence the change. The assumption here is that when `_need_reset` is True, the conditions inside had better be true, because `_need_reset` is a shadow boolean state kept on the CPU.
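Sketched in isolation, the shadow-flag pattern being described looks something like this (illustrative names, not the real class): a plain Python bool mirrors the GPU-side condition so the hot path never syncs, and the expensive `.any().item()` only runs when the cheap flag says it might matter.

```python
import torch

class ShadowFlagSketch:
    """Illustrative sketch of the CPU shadow-flag pattern (not the real class)."""

    def __init__(self, max_len: int, batch: int, dim: int, device: str = "cpu"):
        self._buffer = torch.zeros(max_len, batch, dim, device=device)
        self._num_pushes = torch.zeros(batch, dtype=torch.long, device=device)
        self._need_reset = True  # plain Python bool: checking it never syncs

    def reset(self, batch_ids: torch.Tensor) -> None:
        self._num_pushes[batch_ids] = 0
        self._need_reset = True  # cheap CPU-side record that a sync may be needed

    def get(self) -> torch.Tensor:
        if self._need_reset:  # fast scalar check on the hot path
            # only now pay for the GPU->CPU sync to validate the invariant
            if (self._num_pushes == 0).any().item():
                raise RuntimeError("Attempting to retrieve data on an empty circular buffer.")
            self._need_reset = False
        return self._buffer
```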
```python
self._need_reset = True
if self._buffer is not None:
    # set buffer at batch_id reset indices to 0.0 so that the buffer() getter
    # returns the cleared circular buffer after reset
    self._buffer[:, batch_ids, :] = 0.0
```
Consider replacing it with:

```python
self._buffer[:, batch_ids].zero_()
```
Benchmarking with `BATCH_SIZE=4096`, `NUM_JOINTS=12`, `HISTORY_LENGTH=6`:

| Scenario | Method 1 (`= 0.0`) | Method 2 (`.zero_()`) | Best |
| --- | --- | --- | --- |
| 1 reset (single index) | 20.33 ± 1.51 µs | 19.52 ± 1.70 µs | Method 2 (1.04x) |
| 10 resets | 21.42 ± 1.55 µs | 21.20 ± 1.67 µs | Method 2 (1.01x) |
| 100 resets | 22.24 ± 1.60 µs | 21.64 ± 1.20 µs | Method 2 (1.03x) |
| 500 resets | 21.91 ± 1.03 µs | 21.57 ± 1.10 µs | Method 2 (1.02x) |
| 1000 resets | 22.03 ± 1.30 µs | 21.79 ± 1.16 µs | Method 2 (1.01x) |
| 2048 resets (half) | 22.61 ± 2.16 µs | 22.31 ± 1.34 µs | Method 2 (1.01x) |
| All resets (N) | 23.23 ± 1.37 µs | 23.18 ± 0.51 µs | Method 2 (1.00x) |
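For anyone wanting to reproduce numbers like these, a minimal CUDA-event timing harness along the following lines should work; the shapes match the benchmark header, but the helper itself is a hypothetical stand-in, not the script used above.

```python
import torch

def bench_us(fn, iters=1000, warmup=100):
    """Average time per call in microseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1000.0  # ms -> µs

buf = torch.zeros(6, 4096, 12, device="cuda")  # (HISTORY_LENGTH, BATCH_SIZE, NUM_JOINTS)
batch_ids = torch.randperm(4096, device="cuda")[:100]  # e.g. the "100 resets" scenario

def method_1():
    buf[:, batch_ids, :] = 0.0

def method_2():
    buf[:, batch_ids].zero_()

print(f"= 0.0:    {bench_us(method_1):.2f} µs")
print(f".zero_(): {bench_us(method_2):.2f} µs")
```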
I agree with this. Thanks for the detailed microbenchmark.
```python
is_first_push = self._num_pushes == 0
if is_first_push.any().item():
    self._buffer[:, is_first_push] = data[is_first_push]
```
I would also consider doing:

```python
is_first_push = self._num_pushes == 0
expanded_mask = is_first_push[None, :, None].expand_as(self._buffer)
expanded_data = data[None].expand_as(self._buffer)
torch.where(expanded_mask, expanded_data, self._buffer, out=self._buffer)
```

This should be about 3 times faster than:

```python
is_first_push = self._num_pushes == 0
self._buffer[:, is_first_push] = data[is_first_push]
```

And then it saves you the:

```python
if is_first_push.any().item():
```

which on its own costs about 0.6x the time of the `torch.where` version above.

TL;DR, replace:

```python
is_first_push = self._num_pushes == 0
if is_first_push.any().item():
    self._buffer[:, is_first_push] = data[is_first_push]
```

with:

```python
is_first_push = self._num_pushes == 0
expanded_mask = is_first_push[None, :, None].expand_as(self._buffer)
expanded_data = data[None].expand_as(self._buffer)
torch.where(expanded_mask, expanded_data, self._buffer, out=self._buffer)
```

[INFO] Using python from: /home/antoiner/Documents/IsaacLab-Internal/_isaac_sim/python.sh
Benchmarking with `BATCH_SIZE=4096`, `NUM_JOINTS=12`, `HISTORY_LENGTH=6`:

| Scenario | direct_mask | where_mask | Best |
| --- | --- | --- | --- |
| 1 reset | 94.4 ± 3.6 µs | 25.0 ± 1.1 µs | where_mask |
| 10 resets | 96.8 ± 4.2 µs | 25.3 ± 0.8 µs | where_mask |
| 100 resets | 99.3 ± 4.2 µs | 25.3 ± 1.7 µs | where_mask |
| 500 resets | 99.7 ± 4.5 µs | 25.5 ± 1.5 µs | where_mask |
| 1000 resets | 98.9 ± 3.4 µs | 25.3 ± 1.0 µs | where_mask |
| 2048 resets (half) | 99.8 ± 3.4 µs | 26.0 ± 1.4 µs | where_mask |
| All resets (N) | 100.3 ± 3.7 µs | 26.0 ± 1.7 µs | where_mask |
I also agree with this; it looks like it saves a lot of time.
It probably needs a bit more commenting/documentation on what's happening here, since mask programming is less interpretable.
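In the spirit of that request, here is one way the masked write could be annotated (a standalone sketch with stand-in names matching the discussion's shapes, not the final committed code):

```python
import torch

# Illustrative shapes (matching the benchmark): history x batch x dim
max_length, batch, dim = 6, 4096, 12
buffer = torch.zeros(max_length, batch, dim)        # stands in for self._buffer
num_pushes = torch.randint(0, 3, (batch,))          # stands in for self._num_pushes
data = torch.randn(batch, dim)                      # incoming sample

is_first_push = num_pushes == 0                     # (batch,) bool: envs just reset
# Broadcast the per-env mask and the sample to the full buffer shape.
# expand_as returns zero-copy views, so nothing is allocated here.
expanded_mask = is_first_push[None, :, None].expand_as(buffer)
expanded_data = data[None].expand_as(buffer)
# Elementwise select: freshly reset envs get their whole history filled with the
# first sample; everything else keeps its current contents. out=buffer makes the
# write in-place, with no boolean-index gather and no GPU->CPU sync.
torch.where(expanded_mask, expanded_data, buffer, out=buffer)
```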
Force-pushed from 2ef7fc8 to f3061a4.

Description
This PR addresses issue #4274.
Screenshots
Training throughput before and after the patch, running training on task `Isaac-Velocity-Flat-Spot-v0`.

Checklist

- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to `CONTRIBUTORS.md` or my name already exists there