Optimize the performance of circular buffer #4275
base: main
Conversation
…n the hot path, avoiding unnecessary kernel and synchronization
**Greptile Summary**

Optimized circular buffer performance by eliminating GPU-CPU synchronization in the hot path, addressing the performance issue identified in #4274. Key optimizations:

1. Cached `max_length` as a Python int (avoids `.item()` calls)
2. Skipped the initialization check once all batches are initialized
3. Removed an unnecessary `.clone()`
4. Switched to in-place assignment in the actuator
Confidence Score: 4/5
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Actuator as DelayedPDActuator
    participant DelayBuf as DelayBuffer
    participant CircBuf as CircularBuffer
    Note over Actuator,CircBuf: Hot Path (called every physics step)
    Actuator->>DelayBuf: compute(control_action.joint_positions)
    DelayBuf->>CircBuf: append(data)
    alt First time after reset
        CircBuf->>CircBuf: Check _all_initialized flag (false)
        CircBuf->>CircBuf: Check is_first_push = (_num_pushes == 0)
        CircBuf->>CircBuf: Call .any().item() (GPU sync)
        CircBuf->>CircBuf: Initialize buffer if needed
        CircBuf->>CircBuf: Set _all_initialized = true
    else All batches initialized (optimized path)
        CircBuf->>CircBuf: Skip initialization check
        Note over CircBuf: No GPU-CPU sync needed!
    end
    CircBuf->>CircBuf: Increment _num_pushes
    DelayBuf->>CircBuf: __getitem__(time_lags)
    CircBuf-->>DelayBuf: Return delayed data (view)
    DelayBuf-->>Actuator: Return delayed data (no clone)
    Actuator->>Actuator: In-place assign with [:]
    Note over Actuator,CircBuf: Optimizations Applied:<br/>1. Cached max_length as int (avoid .item())<br/>2. Skip initialization check after warmup<br/>3. Removed unnecessary .clone()<br/>4. In-place assignment in actuator
```
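Pulling the diagram's notes together, the optimized `append` hot path could look roughly like the sketch below. This is a reconstruction for illustration, not the actual diff: the class name, the `_pointer` bookkeeping, and the shapes are assumptions, and reset handling is omitted.

```python
import torch

class CircularBufferSketch:
    """Rough reconstruction of the optimized hot path described in the diagram."""

    def __init__(self, max_length: int, batch: int, dim: int, device: str = "cpu"):
        self._max_length = int(max_length)   # cached as a plain int: no .item() per step
        self._buffer = torch.zeros(max_length, batch, dim, device=device)
        self._num_pushes = torch.zeros(batch, dtype=torch.long, device=device)
        self._pointer = -1                   # CPU-side ring write index
        self._all_initialized = False        # CPU flag: skip init check after warmup

    def append(self, data: torch.Tensor) -> None:
        self._pointer = (self._pointer + 1) % self._max_length
        if not self._all_initialized:
            # slow path: runs only until every batch entry has seen its first push
            is_first_push = self._num_pushes == 0
            if is_first_push.any().item():    # the one remaining GPU->CPU sync
                # back-fill the whole history for freshly reset envs
                self._buffer[:, is_first_push] = data[is_first_push]
            else:
                self._all_initialized = True  # from now on: pure GPU ops, no sync
        self._buffer[self._pointer] = data
        self._num_pushes += 1
```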
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Hi, this is a cool optimization, I'm excited to try it, thank you! Two quick qs:
```diff
-control_action.joint_positions = self.positions_delay_buffer.compute(control_action.joint_positions)
-control_action.joint_velocities = self.velocities_delay_buffer.compute(control_action.joint_velocities)
-control_action.joint_efforts = self.efforts_delay_buffer.compute(control_action.joint_efforts)
+control_action.joint_positions[:] = self.positions_delay_buffer.compute(control_action.joint_positions)
```
Is the assignment operation here needed? 🤔
This uses slice assignment to keep the original tensor storage (`control_action.joint_positions`, etc.) and overwrite its contents in place. Without it, `compute()` may return a new tensor, causing the attribute to be rebound to fresh storage and incurring additional allocation or copy overhead.
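For illustration, a minimal standalone sketch of the distinction (hypothetical tensors, not the actuator code): plain `=` rebinds the name to a new tensor, while `[:]` writes through to the existing storage that other references share.

```python
import torch

def compute(x: torch.Tensor) -> torch.Tensor:
    return x + 1  # stand-in for the delay buffer's compute(); returns a new tensor

buf = torch.zeros(4)
alias = buf                # another reference to the same storage

buf = compute(buf)         # rebinds `buf` to a new tensor; `alias` still sees zeros
print(alias)               # tensor([0., 0., 0., 0.])

buf = alias
buf[:] = compute(buf)      # writes in place; every reference observes the update
print(alias)               # tensor([1., 1., 1., 1.])
```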
The delay buffer already returns a copied tensor, since the time lags index the torch tensor internally. The operation here makes another copy of that tensor, which I don't think is needed. It will also overwrite the initial command set into the environment (after action processing), which may consequently affect the next decimation step.
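For context, a quick standalone check that advanced (integer) indexing in PyTorch allocates a new tensor rather than returning a view:

```python
import torch

x = torch.arange(6).reshape(3, 2)
idx = torch.tensor([0, 2])

y = x[idx]        # advanced indexing materializes a new tensor
y[0, 0] = 99      # mutate the result...
print(x[0, 0])    # tensor(0) -- the original is untouched, so y owns its storage
```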
@iansseijelly and @T-K-233 : Let me know your thoughts on the above. Thanks again!
Pinging again. @iansseijelly @T-K-233
Sorry, I missed this comment earlier.
The intention of this line was just to replace the `.clone()` called from the original `compute()` here, to save the cost of one allocation. Indeed, the time-lag index should create a copied tensor, so if the original clone is unnecessary, then so is this one.
Regarding the decimation comment, I don't fully follow it, but the decimation semantics should remain unchanged before and after this PR.
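A rough sketch of the compute path being discussed, under assumed structure (`_history` and `_time_lags` are illustrative stand-ins for the buffer's real attributes):

```python
import torch

class DelayBufferSketch:
    """Assumed structure of the compute path under discussion (illustrative only)."""

    def __init__(self, history: torch.Tensor, time_lags: torch.Tensor):
        self._history = history      # (max_len, batch, dim) circular history
        self._time_lags = time_lags  # (batch,) integer delay per environment

    def compute(self, data: torch.Tensor) -> torch.Tensor:
        # (appending `data` into the circular history is omitted for brevity)
        batch_ids = torch.arange(self._history.shape[1])
        # advanced indexing with per-env time lags gathers into a *new* tensor,
        # so a trailing .clone() would be a second, redundant allocation
        return self._history[self._time_lags, batch_ids]
```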
Awesome, thank you for answering my questions, I really appreciate it!
…verting all boolean conditions
AntoineRichard left a comment
The policies are training well on this branch and they do transfer to main, so the implementation is likely correct.
Code-wise it also looks good. I've suggested some modifications to further improve the performance, but they're not as substantial as removing the cloning step. If we are in a rush we can get this in as is. These will likely get reworked to support Warp anyway.
```python
if self._need_reset:
    if self._buffer is None or (self._num_pushes == 0).any().item():
        raise RuntimeError("Attempting to retrieve data on an empty circular buffer. Please append data first.")
```
Mostly a nit, but I'm not sure about these lines.
If we need a reset, do we need to perform that extra check to raise the exception?
Also, what happens if the `self._need_reset` flag is True but none of the other checks are? Should we still return the buffer?
It looks to me that if `_need_reset` is False, then `append` should have made sure none of the other conditions could be True.
In practice, this is never raised. I wanted to match the semantics of the original code but guard the check with a scalar value to make it cheaper, hence the change. The assumption here is that when `_need_reset` is True, the conditions inside had better be true, because `_need_reset` is a shadow boolean state kept on the CPU.
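Sketched in isolation, the shadow-flag pattern being described looks something like this (illustrative names, not the real class): a plain Python bool mirrors the GPU-side condition so the hot path never syncs, and the expensive `.any().item()` only runs when the cheap flag says it might matter.

```python
import torch

class ShadowFlagSketch:
    """Illustrative sketch of the CPU shadow-flag pattern (not the real class)."""

    def __init__(self, max_len: int, batch: int, dim: int, device: str = "cpu"):
        self._buffer = torch.zeros(max_len, batch, dim, device=device)
        self._num_pushes = torch.zeros(batch, dtype=torch.long, device=device)
        self._need_reset = True  # plain Python bool: checking it never syncs

    def reset(self, batch_ids: torch.Tensor) -> None:
        self._num_pushes[batch_ids] = 0
        self._need_reset = True  # cheap CPU-side record that a sync may be needed

    def get(self) -> torch.Tensor:
        if self._need_reset:  # fast scalar check on the hot path
            # only now pay for the GPU->CPU sync to validate the invariant
            if (self._num_pushes == 0).any().item():
                raise RuntimeError("Attempting to retrieve data on an empty circular buffer.")
            self._need_reset = False
        return self._buffer
```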
```python
self._need_reset = True
if self._buffer is not None:
    # set buffer at batch_id reset indices to 0.0 so that the buffer() getter
    # returns the cleared circular buffer after reset
    self._buffer[:, batch_ids, :] = 0.0
```
Consider replacing it with:

```python
self._buffer[:, batch_ids].zero_()
```
Benchmarking with `BATCH_SIZE=4096`, `NUM_JOINTS=12`, `HISTORY_LENGTH=6`:

| Scenario | Method 1 (`= 0.0`) | Method 2 (`.zero_()`) | Best |
| --- | --- | --- | --- |
| 1 reset (single index) | 20.33 ± 1.51 µs | 19.52 ± 1.70 µs | Method 2 (1.04x) |
| 10 resets | 21.42 ± 1.55 µs | 21.20 ± 1.67 µs | Method 2 (1.01x) |
| 100 resets | 22.24 ± 1.60 µs | 21.64 ± 1.20 µs | Method 2 (1.03x) |
| 500 resets | 21.91 ± 1.03 µs | 21.57 ± 1.10 µs | Method 2 (1.02x) |
| 1000 resets | 22.03 ± 1.30 µs | 21.79 ± 1.16 µs | Method 2 (1.01x) |
| 2048 resets (half) | 22.61 ± 2.16 µs | 22.31 ± 1.34 µs | Method 2 (1.01x) |
| All resets (N) | 23.23 ± 1.37 µs | 23.18 ± 0.51 µs | Method 2 (1.00x) |
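For anyone wanting to reproduce numbers like these, a minimal CUDA-event timing harness along the following lines should work; the shapes match the benchmark header, but the helper itself is a hypothetical stand-in, not the script used above.

```python
import torch

def bench_us(fn, iters=1000, warmup=100):
    """Average time per call in microseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1000.0  # ms -> µs

buf = torch.zeros(6, 4096, 12, device="cuda")  # (HISTORY_LENGTH, BATCH_SIZE, NUM_JOINTS)
batch_ids = torch.randperm(4096, device="cuda")[:100]  # e.g. the "100 resets" scenario

def method_1():
    buf[:, batch_ids, :] = 0.0

def method_2():
    buf[:, batch_ids].zero_()

print(f"= 0.0:    {bench_us(method_1):.2f} µs")
print(f".zero_(): {bench_us(method_2):.2f} µs")
```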
I agree with this. Thanks for the detailed microbenchmark.
```python
is_first_push = self._num_pushes == 0
if is_first_push.any().item():
    self._buffer[:, is_first_push] = data[is_first_push]
```
I would also consider doing:

```python
is_first_push = self._num_pushes == 0
expanded_mask = is_first_push[None, :, None].expand_as(self._buffer)
expanded_data = data[None].expand_as(self._buffer)
torch.where(expanded_mask, expanded_data, self._buffer, out=self._buffer)
```

This should be about 3 times faster than:

```python
is_first_push = self._num_pushes == 0
self._buffer[:, is_first_push] = data[is_first_push]
```

And then it saves you the:

```python
if is_first_push.any().item():
```

which on its own costs about 0.6x the time of the `torch.where` version above.

TL;DR, replace:

```python
is_first_push = self._num_pushes == 0
if is_first_push.any().item():
    self._buffer[:, is_first_push] = data[is_first_push]
```

with:

```python
is_first_push = self._num_pushes == 0
expanded_mask = is_first_push[None, :, None].expand_as(self._buffer)
expanded_data = data[None].expand_as(self._buffer)
torch.where(expanded_mask, expanded_data, self._buffer, out=self._buffer)
```

[INFO] Using python from: /home/antoiner/Documents/IsaacLab-Internal/_isaac_sim/python.sh
Benchmarking with `BATCH_SIZE=4096`, `NUM_JOINTS=12`, `HISTORY_LENGTH=6`:

| Scenario | direct_mask | where_mask | Best |
| --- | --- | --- | --- |
| 1 reset | 94.4 ± 3.6 µs | 25.0 ± 1.1 µs | where_mask |
| 10 resets | 96.8 ± 4.2 µs | 25.3 ± 0.8 µs | where_mask |
| 100 resets | 99.3 ± 4.2 µs | 25.3 ± 1.7 µs | where_mask |
| 500 resets | 99.7 ± 4.5 µs | 25.5 ± 1.5 µs | where_mask |
| 1000 resets | 98.9 ± 3.4 µs | 25.3 ± 1.0 µs | where_mask |
| 2048 resets (half) | 99.8 ± 3.4 µs | 26.0 ± 1.4 µs | where_mask |
| All resets (N) | 100.3 ± 3.7 µs | 26.0 ± 1.7 µs | where_mask |
I also agree with this; it looks like it saves a lot of time.
It probably needs a bit more commenting/documentation on what's happening here, since mask programming is less interpretable.
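In the spirit of that request, here is one way the masked write could be annotated (a standalone sketch with stand-in names matching the discussion's shapes, not the final committed code):

```python
import torch

# Illustrative shapes (matching the benchmark): history x batch x dim
max_length, batch, dim = 6, 4096, 12
buffer = torch.zeros(max_length, batch, dim)        # stands in for self._buffer
num_pushes = torch.randint(0, 3, (batch,))          # stands in for self._num_pushes
data = torch.randn(batch, dim)                      # incoming sample

is_first_push = num_pushes == 0                     # (batch,) bool: envs just reset
# Broadcast the per-env mask and the sample to the full buffer shape.
# expand_as returns zero-copy views, so nothing is allocated here.
expanded_mask = is_first_push[None, :, None].expand_as(buffer)
expanded_data = data[None].expand_as(buffer)
# Elementwise select: freshly reset envs get their whole history filled with the
# first sample; everything else keeps its current contents. out=buffer makes the
# write in-place, with no boolean-index gather and no GPU->CPU sync.
torch.where(expanded_mask, expanded_data, buffer, out=buffer)
```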
Force-pushed from 2ef7fc8 to f3061a4.

Description
This PR addresses issue #4274.
Screenshots
Training throughput before and after the patch, running training on task `Isaac-Velocity-Flat-Spot-v0`.

Checklist

- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to `CONTRIBUTORS.md` or my name already exists there