New inflation layer with optional OpenMP acceleration #51

Closed
tonynajjar wants to merge 1 commit into main_dexory from openmp-inflation

tonynajjar commented Feb 4, 2026

Benchmark Comparison Summary

Test Environments

Dev Machine (ubuntu@dexory)

  • CPU: 16 cores × 5400 MHz
  • OS: Ubuntu 24.04.2 LTS
  • L1 Data Cache: 48 KiB × 16 = 768 KiB
  • L1 Instruction Cache: 64 KiB × 16 = 1024 KiB
  • L2 Cache: 3072 KiB × 16 = 48 MiB (unified)
  • L3 Cache: 24576 KiB = 24 MiB (shared)
  • Total Cache: ~74 MiB

Robot (arri-74)

  • CPU: 16 cores × 5000 MHz
  • L1 Data Cache: 48 KiB × 8 = 384 KiB
  • L1 Instruction Cache: 32 KiB × 8 = 256 KiB
  • L2 Cache: 1280 KiB × 8 = 10 MiB (unified)
  • L3 Cache: 18432 KiB = 18 MiB (shared)
  • Total Cache: ~28.6 MiB

Performance Comparison (Key Benchmarks)

1000×1000 Grid (1M cells, 50% occupancy, 2m inflation radius)

| Configuration | Dev Time | Dev Throughput | Robot Time | Robot Throughput | vs Old (Dev) | vs Old (Robot) |
|---|---|---|---|---|---|---|
| Old Implementation | 24.1 ms | 41.4 M cells/s | 28.7 ms | 34.9 M cells/s | baseline | baseline |
| New (OpenMP disabled) | 6.89 ms | 145.2 M cells/s | 11.0 ms | 91.1 M cells/s | 3.5× faster | 2.6× faster |
| New (OpenMP enabled) | 2.50 ms | 707.3 M cells/s | 2.35 ms | 482.5 M cells/s | 9.6× faster | 12.2× faster |

2000×2000 Grid (4M cells, 50% occupancy, 2m inflation radius)

| Configuration | Dev Time | Dev Throughput | Robot Time | Robot Throughput | vs Old (Dev) | vs Old (Robot) |
|---|---|---|---|---|---|---|
| Old Implementation | 91.5 ms | 43.7 M cells/s | 105 ms | 38.1 M cells/s | baseline | baseline |
| New (OpenMP disabled) | 30.6 ms | 130.6 M cells/s | 48.9 ms | 81.8 M cells/s | 3.0× faster | 2.1× faster |
| New (OpenMP enabled) | 6.64 ms | 893.5 M cells/s | 9.11 ms | 468.9 M cells/s | 13.8× faster | 11.5× faster |

3333×3333 Grid (11.1M cells, 50% occupancy, 2m inflation radius)

| Configuration | Dev Time | Dev Throughput | Robot Time | Robot Throughput | vs Old (Dev) | vs Old (Robot) |
|---|---|---|---|---|---|---|
| Old Implementation | 311 ms | 35.7 M cells/s | 357 ms | 31.2 M cells/s | baseline | baseline |
| New (OpenMP disabled) | 115 ms | 96.8 M cells/s | 182 ms | 61.2 M cells/s | 2.7× faster | 2.0× faster |
| New (OpenMP enabled) | 20.3 ms | 697.6 M cells/s | 29.9 ms | 395.1 M cells/s | 15.3× faster | 11.9× faster |

4000×4000 Grid (16M cells, 30% occupancy, 1m inflation radius)

| Configuration | Dev Time | Dev Throughput | Robot Time | Robot Throughput | vs Old (Dev) | vs Old (Robot) |
|---|---|---|---|---|---|---|
| Old Implementation | 261 ms | 61.2 M cells/s | 302 ms | 53.1 M cells/s | baseline | baseline |
| New (OpenMP disabled) | 176 ms | 90.9 M cells/s | 268 ms | 59.7 M cells/s | 1.5× faster | 1.1× faster |
| New (OpenMP enabled) | 28.4 ms | 660.7 M cells/s | 46.3 ms | 364.1 M cells/s | 9.2× faster | 6.5× faster |
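
The "vs Old" columns in the four tables above appear to be the old runtime divided by the new runtime. A minimal Python sketch of that arithmetic for the dev machine follows; the dictionary layout and the `speedup` helper are illustrative names, not part of the PR, and the times are copied from the tables.

```python
# Times in ms copied from the key benchmark tables (dev machine).
DEV_TIMES_MS = {
    "1000x1000": {"old": 24.1, "new_off": 6.89, "new_on": 2.50},
    "2000x2000": {"old": 91.5, "new_off": 30.6, "new_on": 6.64},
    "3333x3333": {"old": 311.0, "new_off": 115.0, "new_on": 20.3},
    "4000x4000": {"old": 261.0, "new_off": 176.0, "new_on": 28.4},
}


def speedup(old_ms: float, new_ms: float) -> float:
    """Return how many times faster the new run is than the old one."""
    return old_ms / new_ms


for grid, t in DEV_TIMES_MS.items():
    off = speedup(t["old"], t["new_off"])
    on = speedup(t["old"], t["new_on"])
    print(f"{grid}: OpenMP off {off:.1f}x, OpenMP on {on:.1f}x")
```

Rounded to one decimal, these ratios agree with the dev "vs Old" column in each table.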

Key Findings

1. New Implementation Impact (OpenMP disabled)

  • Dev machine: 1.5-3.5× faster than old implementation
  • Robot: 1.1-2.6× faster than old implementation
  • The single-threaded gain is larger on the dev machine than on the robot in every key benchmark

2. OpenMP Parallelization Impact

  • Dev machine: 2.8-6.2× additional speedup over single-threaded new implementation (from the times in the key benchmarks above)
  • Robot: 4.7-6.1× additional speedup over single-threaded new implementation (from the times in the key benchmarks above)
  • Combined with new implementation: 6.5-15.3× faster than old code
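
One way to derive the additional OpenMP speedup is to divide the OpenMP-disabled time by the OpenMP-enabled time for each grid and take the minimum and maximum. The sketch below does this for the four key benchmark tables only; ranges that include other runs will differ. `TIMES_MS` and `openmp_gain_range` are illustrative names.

```python
# (OpenMP off, OpenMP on) times in ms for the 1000, 2000, 3333 and 4000 grids,
# copied from the key benchmark tables.
TIMES_MS = {
    "dev": [(6.89, 2.50), (30.6, 6.64), (115.0, 20.3), (176.0, 28.4)],
    "robot": [(11.0, 2.35), (48.9, 9.11), (182.0, 29.9), (268.0, 46.3)],
}


def openmp_gain_range(pairs):
    """Return (min, max) of off_ms / on_ms over all benchmark pairs."""
    gains = [off_ms / on_ms for off_ms, on_ms in pairs]
    return min(gains), max(gains)


for machine, pairs in TIMES_MS.items():
    lowest, highest = openmp_gain_range(pairs)
    print(f"{machine}: {lowest:.1f}x to {highest:.1f}x")
```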

3. Grid Size Scaling

  • Old implementation shows poor scaling with grid size (35.7-43.7 M cells/s on dev for the 50% occupancy, 2m radius grids)
  • New implementation (OpenMP disabled) maintains 90-145 M cells/s on dev, 60-91 M cells/s on robot
  • New implementation (OpenMP enabled) maintains 661-893 M cells/s on dev, 364-483 M cells/s on robot
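
For the single-threaded rows, the throughput figures are consistent with cells divided by the listed time. The sketch below assumes exactly that definition; the OpenMP-enabled rows appear to use a different time base (1M cells in 2.50 ms would be 400 M cells/s, not 707.3), so they are left out. The function name is illustrative.

```python
def throughput_mcells_per_s(width: int, height: int, time_ms: float) -> float:
    """Million cells processed per second for a width x height grid."""
    cells = width * height
    return cells / (time_ms / 1000.0) / 1e6


# Dev machine, new implementation with OpenMP disabled (times from the tables).
for size, time_ms in [(1000, 6.89), (2000, 30.6), (3333, 115.0), (4000, 176.0)]:
    rate = throughput_mcells_per_s(size, size, time_ms)
    print(f"{size}x{size}: {rate:.1f} M cells/s")
```

The results land within about 0.2 M cells/s of the tabulated values; the small differences come from the times being rounded to three significant figures.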

4. Occupancy Impact (1500×1500 tests)

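  • New implementation shows consistent performance across 10%, 30%, 50%, 80% occupancy; old implementation slows from 8.16 ms to 75.6 ms on dev
  • At 10% occupancy the old implementation is still faster than the new one without OpenMP (8.16 ms vs 14.7 ms on dev); from 30% upward the new implementation is faster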

5. Inflation Radius Impact

  • Old implementation: significant slowdown with larger radii (89.5→56.7 M cells/s on dev from 0.5m to 10m)
  • New implementation: minimal impact from radius variation

Detailed Results by Parameter

Varying Occupancy (1500×1500 grid, 2m inflation)

| Occupancy | Old (Dev) | New, OpenMP Off (Dev) | New, OpenMP On (Dev) | Old (Robot) | New, OpenMP Off (Robot) | New, OpenMP On (Robot) |
|---|---|---|---|---|---|---|
| 10% | 8.16 ms (275.8 M/s) | 14.7 ms (152.8 M/s) | 4.57 ms (816.7 M/s) | 11.9 ms (189.7 M/s) | 24.0 ms (93.9 M/s) | 5.50 ms (445.2 M/s) |
| 30% | 28.5 ms (79.1 M/s) | 15.3 ms (147.3 M/s) | 4.65 ms (763.5 M/s) | 36.1 ms (62.3 M/s) | 24.2 ms (93.0 M/s) | 5.05 ms (483.0 M/s) |
| 50% | 67.5 ms (33.4 M/s) | 16.1 ms (139.5 M/s) | 5.07 ms (763.0 M/s) | 80.7 ms (27.9 M/s) | 24.8 ms (90.9 M/s) | 5.76 ms (466.2 M/s) |
| 80% | 75.6 ms (29.8 M/s) | 15.6 ms (144.7 M/s) | 4.96 ms (784.4 M/s) | 89.1 ms (25.3 M/s) | 23.4 ms (96.1 M/s) | 5.48 ms (458.2 M/s) |

Key Observation: Old implementation degrades significantly with higher occupancy (8.16→75.6 ms on dev, about 9.3× slower), while new implementation remains stable (14.7-16.1 ms without OpenMP, 4.57-5.07 ms with OpenMP).
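
To see where the two implementations cross over, the old time can be divided by the single-threaded new time at each occupancy. This is a small sketch using the dev-machine times from the occupancy table; `OCCUPANCY_DEV_MS` and `old_over_new` are illustrative names.

```python
# occupancy % -> (old ms, new ms with OpenMP off), dev machine, 1500x1500 grid.
OCCUPANCY_DEV_MS = {
    10: (8.16, 14.7),
    30: (28.5, 15.3),
    50: (67.5, 16.1),
    80: (75.6, 15.6),
}


def old_over_new(old_ms: float, new_ms: float) -> float:
    """Ratio above 1 means the new implementation is faster."""
    return old_ms / new_ms


for occupancy, (old_ms, new_ms) in OCCUPANCY_DEV_MS.items():
    ratio = old_over_new(old_ms, new_ms)
    faster = "new" if ratio > 1.0 else "old"
    print(f"{occupancy}% occupancy: ratio {ratio:.2f} ({faster} is faster)")
```

The ratio is below 1 only at 10% occupancy, where the old implementation has little work to do; at 30% and above the new implementation wins even without OpenMP.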

Varying Inflation Radius (1000×1000 grid, 50% occupancy)

| Radius | Old (Dev) | New, OpenMP Off (Dev) | New, OpenMP On (Dev) | Old (Robot) | New, OpenMP Off (Robot) | New, OpenMP On (Robot) |
|---|---|---|---|---|---|---|
| 0.5m | 11.2 ms (89.5 M/s) | 6.78 ms (147.5 M/s) | 2.45 ms (716.4 M/s) | 14.8 ms (67.6 M/s) | 11.1 ms (90.5 M/s) | 2.55 ms (484.8 M/s) |
| 1.0m | 12.7 ms (78.9 M/s) | 6.88 ms (145.5 M/s) | 2.44 ms (717.3 M/s) | 17.0 ms (58.9 M/s) | 10.9 ms (91.6 M/s) | 2.23 ms (518.3 M/s) |
| 2.0m | 14.6 ms (68.7 M/s) | 6.84 ms (146.2 M/s) | 2.56 ms (677.9 M/s) | 20.1 ms (49.8 M/s) | 11.1 ms (90.4 M/s) | 2.21 ms (508.3 M/s) |
| 3.0m | 15.5 ms (64.6 M/s) | 6.92 ms (144.5 M/s) | 2.46 ms (710.3 M/s) | 21.6 ms (46.4 M/s) | 11.6 ms (85.9 M/s) | 2.47 ms (474.5 M/s) |
| 5.0m | 16.9 ms (60.0 M/s) | 6.96 ms (143.7 M/s) | 2.67 ms (666.2 M/s) | 23.3 ms (43.0 M/s) | 11.2 ms (89.2 M/s) | 2.49 ms (458.6 M/s) |
| 10.0m | 17.6 ms (56.7 M/s) | 6.95 ms (143.8 M/s) | 2.65 ms (657.8 M/s) | 25.5 ms (39.3 M/s) | 11.4 ms (88.1 M/s) | 2.37 ms (472.6 M/s) |

Key Observation: Old implementation takes 57% longer at the largest radius than at the smallest (11.2→17.6 ms on dev), which is a 36% drop in throughput. New implementation without OpenMP shows minimal variation on dev (<3% difference, 6.78-6.96 ms).
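
The same pair of times can be quoted as a runtime increase or as a throughput drop, and the two percentages differ. A short sketch of both calculations for the old implementation on dev (0.5m vs 10.0m radius); the helper names are illustrative.

```python
def runtime_increase_pct(before_ms: float, after_ms: float) -> float:
    """Percent by which the runtime grows from before_ms to after_ms."""
    return (after_ms - before_ms) / before_ms * 100.0


def throughput_drop_pct(before_ms: float, after_ms: float) -> float:
    """Percent by which throughput falls; throughput is proportional to 1/time."""
    return (1.0 - before_ms / after_ms) * 100.0


old_small_radius_ms = 11.2   # 0.5m radius, old implementation, dev
old_large_radius_ms = 17.6   # 10.0m radius, old implementation, dev

increase = runtime_increase_pct(old_small_radius_ms, old_large_radius_ms)
drop = throughput_drop_pct(old_small_radius_ms, old_large_radius_ms)
print(f"runtime +{increase:.0f}%, throughput -{drop:.0f}%")
```

Here the runtime grows by about 57% while throughput falls by about 36%, so the 36% figure corresponds to the throughput view.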

Varying Cost Scale (1000×1000 grid, 50% occupancy, 2m radius)

| Cost Scale | Old (Dev) | New, OpenMP Off (Dev) | New, OpenMP On (Dev) | Old (Robot) | New, OpenMP Off (Robot) | New, OpenMP On (Robot) |
|---|---|---|---|---|---|---|
| 1.0 | 14.3 ms (70.1 M/s) | 6.89 ms (145.1 M/s) | 2.77 ms (651.6 M/s) | 20.1 ms (49.8 M/s) | 11.1 ms (89.9 M/s) | 2.19 ms (503.2 M/s) |
| 3.0 | 14.4 ms (69.5 M/s) | 6.90 ms (144.9 M/s) | 2.63 ms (686.5 M/s) | 20.0 ms (50.1 M/s) | 11.1 ms (90.1 M/s) | 2.30 ms (492.1 M/s) |
| 5.0 | 14.3 ms (70.2 M/s) | 6.89 ms (145.2 M/s) | 2.82 ms (655.3 M/s) | 20.0 ms (50.1 M/s) | 11.0 ms (90.8 M/s) | 2.24 ms (496.9 M/s) |
| 10.0 | 14.4 ms (69.6 M/s) | 6.91 ms (144.7 M/s) | 2.65 ms (660.0 M/s) | 19.9 ms (50.3 M/s) | 11.0 ms (90.6 M/s) | 2.23 ms (523.8 M/s) |

Key Observation: Cost scale factor has negligible impact on performance across all implementations.
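
A plausible reason, assuming the new layer keeps the exponential decay used by the stock Nav2 inflation layer, is that the cost scale only changes the value written to each cell, not how many cells are visited. The Python sketch below mirrors that cost formula; the 0.3m inscribed radius is a hypothetical value chosen for illustration, and the function is not taken from this PR.

```python
import math

LETHAL_OBSTACLE = 254             # Nav2 cost for an occupied cell
INSCRIBED_INFLATED_OBSTACLE = 253 # Nav2 cost inside the inscribed radius


def inflation_cost(distance_m: float, inscribed_radius_m: float,
                   cost_scale: float) -> int:
    """Cost of a cell at distance_m from the nearest obstacle (assumed formula)."""
    if distance_m <= 0.0:
        return LETHAL_OBSTACLE
    if distance_m <= inscribed_radius_m:
        return INSCRIBED_INFLATED_OBSTACLE
    factor = math.exp(-cost_scale * (distance_m - inscribed_radius_m))
    return int((INSCRIBED_INFLATED_OBSTACLE - 1) * factor)


# Same cell, different cost scales: one exp() per lookup either way.
for scale in (1.0, 3.0, 5.0, 10.0):
    print(f"cost_scale={scale}: cost at 1.0m = {inflation_cost(1.0, 0.3, scale)}")
```

Under this assumption the work per cell is identical for every cost scale, which is consistent with the flat timings in the table.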

Recommendations

Use new implementation with OpenMP enabled - Provides 6.5-15.3× speedup over the old implementation in the key benchmarks

✅ Even without OpenMP, new implementation is 1.1-3.5× faster in the key benchmarks (the old implementation is faster only at 10% occupancy)

✅ Performance is more predictable and scales better with grid size

✅ Robot reaches 6.5-12.2× speedup with OpenMP despite lower CPU frequency

✅ New implementation handles varying occupancy and inflation radii efficiently

Performance Summary Chart

Speedup Factor (vs Old Implementation)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Dev Machine (1000×1000):
Old:     ████ 1.0×
New-Off: █████████████ 3.5×
New-On:  ██████████████████████████████████████ 9.6×

Dev Machine (3333×3333):
Old:     ████ 1.0×
New-Off: ███████████ 2.7×
New-On:  ███████████████████████████████████████████████████████████ 15.3×

Robot (1000×1000):
Old:     ████ 1.0×
New-Off: ██████████ 2.6×
New-On:  ████████████████████████████████████████████████ 12.2×

Robot (3333×3333):
Old:     ████ 1.0×
New-Off: ████████ 2.0×
New-On:  ███████████████████████████████████████████████ 11.9×

tonynajjar changed the title from "Openmp inflation" to "OpenMP inflation layer" Feb 4, 2026
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
tonynajjar changed the title from "OpenMP inflation layer" to "New inflation layer with optional OpenMP acceleration" Feb 5, 2026

tonynajjar commented Feb 6, 2026

ros-navigation#5933

tonynajjar closed this Feb 6, 2026
tonynajjar deleted the openmp-inflation branch March 10, 2026 14:33