Benchmark Results Reference
Results are kept up-to-date with each release. For how to reproduce them, see Benchmark Suite and Profiling Guide.
Test Environment
| Item | Value |
|---|---|
| OS | Ubuntu 20.04 (Linux 5.15.0) |
| CPU | Intel Core i7-1185G7 @ 3.00 GHz (4 cores / 8 logical) |
| RAM | 31 GB |
| Python | 3.8.10 |
| Conditions | Headless ( |
Simulation Throughput
Script: make bench-full → benchmark/run_benchmark.py --sweep 100 500 1000
Config: benchmark/configs/general.yaml — collision_check_frequency=null (every step), 50% agents moving
Last measured: 2026-04-08
| Agents | Step Time (ms) | RTF (real-time factor) | Spawn Time | Memory Delta |
|---|---|---|---|---|
| 100 | 2.17 ± 0.10 | 46.1× | 26 ms | −24.8 MB |
| 250 † | 6.45 ± 0.30 | 16× | 65 ms | −19.7 MB |
| 500 | 13.21 ± 0.21 | 7.6× | 137 ms | −15.3 MB |
| 1000 | 29.98 ± 0.39 | 3.3× | 285 ms | −3.2 MB |
| 2000 † | 94.82 ± 5.81 | 1.1× | 731 ms | +29.6 MB |
† Row from 2026-03-12; not included in make bench-full (100/500/1000 only).
Source: benchmark/results/benchmark_sweep_10.0s.json
Memory note: Negative delta below ~1000 agents is a Python GC artifact. Actual per-agent overhead is ~15 KB above 1000 agents (linear).
Scalability: O(n^1.3) — near-linear up to ~500 agents, slightly super-linear above.
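The RTF column and the scaling exponent can both be cross-checked from the table. The sketch below assumes a simulated timestep of 100 ms. That value is inferred from the data (100 / 2.17 ≈ 46.1), not read from general.yaml. `real_time_factor` and `scaling_exponent` are hypothetical helpers, not part of the benchmark suite. The exponent is the least-squares slope of log(step time) against log(agents).

```python
import math
from typing import List, Tuple

# (agents, mean step time in ms) from the Simulation Throughput table.
# The 250 and 2000 rows come from the 2026-03-12 run, the rest from make bench-full.
SWEEP: List[Tuple[int, float]] = [
    (100, 2.17), (250, 6.45), (500, 13.21), (1000, 29.98), (2000, 94.82),
]

SIM_DT_MS = 100.0  # assumption: inferred from the RTF column, not taken from the config


def real_time_factor(step_ms: float, sim_dt_ms: float = SIM_DT_MS) -> float:
    """Simulated time advanced per unit of wall-clock time."""
    return sim_dt_ms / step_ms


def scaling_exponent(points: List[Tuple[int, float]]) -> float:
    """Least-squares slope k of log(step_time) vs log(agents), i.e. step_time ~ n**k."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    for n, t in SWEEP:
        print(f"{n:>5} agents: RTF {real_time_factor(t):.1f}x")
    print(f"exponent over all rows: {scaling_exponent(SWEEP):.2f}")
    print(f"local exponent 1000->2000: {scaling_exponent(SWEEP[3:]):.2f}")
```

On these five rows the fitted exponent comes out a little below 1.3. The local slope between 1000 and 2000 agents is clearly steeper than between 100 and 500. The quoted exponent therefore summarises a trend rather than a fixed law.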
Step Time Component Breakdown
Script: benchmark/profiling/simulation_profiler.py --test builtin --agents 500 --steps 500
| Component | Share | Notes |
|---|---|---|
| Agent Update | ~81% | Dominant cost; much higher for moving than stationary |
| Collision Check | ~18% | Periodic (bursty); minimal cost on non-check steps |
| Monitor Update | ~1% | Near-zero if monitor disabled |
| Step Simulation | 0% | Physics off; up to ~40% with physics on |
Note on variance: Per-step profiler timings show high variance (stdev > mean) because collision checks fire in bursts and some steps include PyBullet warmup or GC pauses. The percentages above are representative of steady-state operation.
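A component that costs nothing on most steps and a lot on a few will have a standard deviation larger than its mean. Per-step means are therefore a poor summary for it. One way to obtain steady-state shares is to drop the warm-up steps and compare summed time per component. The sketch below shows this with synthetic numbers. The helper names and the warm-up trimming are assumptions for illustration, not the profiler's documented method.

```python
import statistics
from typing import Dict, List


def burstiness(values: List[float]) -> float:
    """Coefficient of variation (stdev / mean); above 1 means stdev exceeds the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def component_shares(samples: Dict[str, List[float]], warmup_steps: int = 0) -> Dict[str, float]:
    """Fraction of total time per component, ignoring the first warmup_steps samples."""
    totals = {name: sum(values[warmup_steps:]) for name, values in samples.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


# Synthetic per-step timings in ms (illustrative only, not measured data):
# agent_update is steady apart from a warm-up spike on step 0,
# collision_check only costs time on every 10th step.
SAMPLES: Dict[str, List[float]] = {
    "agent_update": [10.0] + [1.0] * 99,
    "collision_check": [2.0 if i % 10 == 5 else 0.0 for i in range(100)],
}

if __name__ == "__main__":
    print(f"collision burstiness: {burstiness(SAMPLES['collision_check']):.1f}")
    print(component_shares(SAMPLES, warmup_steps=1))
```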
Agent Update Cost
Script: benchmark/profiling/agent_update.py --agents 500
Stationary vs Moving (500 agents)
| State | Total time | Per agent |
|---|---|---|
| Stationary | 0.29 ms | 0.57 μs |
| Moving | 61.27 ms | 122.5 μs |
| Ratio (moving / stationary) | 214× | |
→ Design decision: Benchmarks use 50% moving agents as a representative workload. Moving agent cost dominates: kinematics, trajectory following, and PyBullet API calls only execute for agents with an active goal.
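The stationary fast path amounts to an early return before any motion work. The sketch below is minimal and assumes a hypothetical `SketchAgent` with a 2-D position, an optional goal and straight-line motion. The real Agent API is not shown in this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec2 = Tuple[float, float]


@dataclass
class SketchAgent:
    """Hypothetical stand-in for an agent; not the project's Agent class."""

    position: Vec2
    goal: Optional[Vec2] = None
    speed: float = 1.0  # m/s

    def update(self, dt: float) -> bool:
        """Advance one tick; return True if any motion work was done."""
        if self.goal is None:
            return False  # stationary fast path: no kinematics, no simulator calls
        gx, gy = self.goal
        px, py = self.position
        dx, dy = gx - px, gy - py
        dist = (dx * dx + dy * dy) ** 0.5
        step = self.speed * dt
        if dist <= step:
            # Goal reachable this tick: snap to it and clear the goal
            self.position, self.goal = (gx, gy), None
        else:
            self.position = (px + dx / dist * step, py + dy / dist * step)
        # A real agent would push the new pose to PyBullet here (the costly part).
        return True


def update_all(agents: List[SketchAgent], dt: float) -> int:
    """Update every agent and return how many did motion work."""
    return sum(agent.update(dt) for agent in agents)
```

Because stationary agents return before touching kinematics or the simulator, per-tick cost scales with the number of agents that have a goal rather than the total agent count.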
Motion Mode Comparison (500 agents, all moving)
| Mode | Total time | Per agent | Relative |
|---|---|---|---|
| OMNIDIRECTIONAL | 9.26 ms | 18.5 μs | 1× (baseline) |
| DIFFERENTIAL | 26.66 ms | 53.3 μs | 2.9× slower |
→ Optimisation note (2026-03-12): DIFFERENTIAL was previously ~5× slower (90.6 μs/agent). Replacing scipy Slerp with a precomputed quaternion slerp and removing per-tick numpy allocations reduced the ROTATE phase from 37 μs to 5 μs. See Differential Drive Optimization below.
→ Design decision: Benchmarks default to OMNIDIRECTIONAL. DIFFERENTIAL requires a heading-alignment computation that makes each update ~2.9× as expensive.
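The extra per-tick work in DIFFERENTIAL mode comes from computing and correcting a heading error before the robot can drive. The sketch assumes a simple rotate-then-forward policy matching the ROTATE and FORWARD phases mentioned on this page. The function names, tolerance and velocity caps are hypothetical, and the project's controller may differ in detail. An omnidirectional agent would skip the heading logic and move toward the goal directly.

```python
import math
from typing import Tuple


def wrap_angle(a: float) -> float:
    """Wrap an angle to [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def differential_command(pose: Tuple[float, float, float], goal: Tuple[float, float],
                         max_yaw_rate: float, max_speed: float, dt: float,
                         align_tol: float = 0.05) -> Tuple[float, float]:
    """Return (linear, angular) velocity for one tick of a rotate-then-forward policy."""
    x, y, yaw = pose
    desired = math.atan2(goal[1] - y, goal[0] - x)
    err = wrap_angle(desired - yaw)
    if abs(err) > align_tol:
        # ROTATE phase: turn in place, capped so one tick cannot overshoot the error
        omega = math.copysign(min(max_yaw_rate, abs(err) / dt), err)
        return 0.0, omega
    # FORWARD phase: heading is aligned, so drive straight at the goal
    dist = math.hypot(goal[0] - x, goal[1] - y)
    return min(max_speed, dist / dt), 0.0
```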
Wrapper Layer Overhead
Script: benchmark/profiling/wrapper_overhead.py --n 500 --reps 3
Conditions: 500 objects, process-isolated per layer
| Layer | Spawn time | Update time (get+set pose) | Memory (RSS delta) |
|---|---|---|---|
| Direct PyBullet (baseline) | 63 ms | 2.21 ms | 5.2 MB |
| SimObject | 106 ms (+68%) | 6.80 ms (3.1×) | 9.2 MB |
| SimObjectManager | 207 ms (+229%) | 10.75 ms (4.9×) | 9.3 MB |
| Agent | 237 ms (+276%) | 10.60 ms (4.8×) | 11.5 MB |
| AgentManager | 255 ms (+305%) | 10.20 ms (4.6×) | 11.6 MB |
Linear extrapolation to 10,000 objects (20× the 500-object measurements), checked against production-feasibility thresholds (spawn < 10 s, update < 150 ms/step, memory < +200 MB over the direct-PyBullet baseline):

| Layer | Spawn | Update/step | Memory (over baseline) | Pass? |
|---|---|---|---|---|
| SimObject | 2.1 s | 136 ms | +81 MB | ✅ all pass |
| Agent | 4.7 s | 212 ms | +127 MB | spawn/mem ✅, update ❌ |
→ Design decision: SimObject layer is production-viable at 10,000 scale. Agent layer exceeds the 150 ms/step update threshold at 10,000 (extrapolated 212 ms) but comfortably passes at 5,000 or below (extrapolated ~106 ms). For ≥10,000 agents, consider using SimObject directly or batching update calls.
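The extrapolated figures match two rules: scale each 500-object measurement by 20, and express memory as overhead over the direct-PyBullet baseline (for example, (9.2 - 5.2) MB × 20 ≈ 80 MB). The sketch below reproduces that arithmetic and assumes linear scaling. The Simulation Throughput table suggests linear scaling is optimistic above ~1000 agents. All helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

MEASURED_N = 500  # object count used in the wrapper-overhead benchmark


@dataclass(frozen=True)
class Measurement:
    spawn_ms: float   # total spawn time at MEASURED_N objects
    update_ms: float  # get+set pose time per step at MEASURED_N objects
    rss_mb: float     # RSS delta at MEASURED_N objects


BASELINE = Measurement(63.0, 2.21, 5.2)  # direct PyBullet
LAYERS: Dict[str, Measurement] = {
    "SimObject": Measurement(106.0, 6.80, 9.2),
    "Agent": Measurement(237.0, 10.60, 11.5),
}
THRESHOLDS = {"spawn_s": 10.0, "update_ms": 150.0, "extra_mem_mb": 200.0}


def extrapolate(m: Measurement, n: int) -> Dict[str, float]:
    """Scale a 500-object measurement linearly to n objects."""
    k = n / MEASURED_N
    return {
        "spawn_s": m.spawn_ms * k / 1000.0,
        "update_ms": m.update_ms * k,
        # Memory is compared as overhead over the direct-PyBullet baseline
        "extra_mem_mb": (m.rss_mb - BASELINE.rss_mb) * k,
    }


def passes(m: Measurement, n: int) -> Dict[str, bool]:
    est = extrapolate(m, n)
    return {key: est[key] < limit for key, limit in THRESHOLDS.items()}


def max_objects_within_update_budget(m: Measurement) -> int:
    """Largest object count whose extrapolated update time stays under budget."""
    return int(THRESHOLDS["update_ms"] / m.update_ms * MEASURED_N)
```

By the same arithmetic, the Agent layer reaches the 150 ms/step budget at roughly 7,000 agents.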
Collision Mode Comparison (DISABLED / NORMAL_2D / NORMAL_3D)
Script: benchmark/experiments/collision_mode_comparison.py --agents 500 --iterations 200
| Mode | Step time (mean) | Collision time | vs Disabled |
|---|---|---|---|
| DISABLED | 1.37 ms | — | baseline |
| NORMAL_2D (9 neighbors) | 2.43 ms | 0.21 ms | +77% |
| NORMAL_3D (27 neighbors) | 2.26 ms | 0.39 ms | +64% |
Collision check breakdown (both modes): AABB Filtering accounts for ~97–99% of collision check time. Spatial Hashing and Contact Points together are <3%.
→ Design decision: At this scale NORMAL_2D gives no meaningful overall speedup over NORMAL_3D. The two mean step times are within ~7% of each other, and AABB filtering dominates collision-check time in both modes. Use NORMAL_2D for ground robots where Z-axis neighbors are irrelevant for correctness, not for dramatic performance gains.
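The 9 and 27 neighbor counts in the table follow from which adjacent grid cells are searched around an object's own cell. The sketch below assumes a uniform grid. `cell_of`, `neighbor_cells` and the "2d"/"3d" mode strings are hypothetical, and the project's spatial hash is not shown here. NORMAL_2D keeps the Z offset at zero and searches a 3×3 patch; NORMAL_3D searches the full 3×3×3 block.

```python
from itertools import product
from typing import List, Tuple

Cell = Tuple[int, int, int]


def cell_of(pos: Tuple[float, float, float], cell_size: float) -> Cell:
    """Grid cell containing a point; floor division keeps negative coordinates correct."""
    x, y, z = pos
    return (int(x // cell_size), int(y // cell_size), int(z // cell_size))


def neighbor_cells(cell: Cell, mode: str) -> List[Cell]:
    """Cells to search around `cell`: 3x3 in XY for '2d', the full 3x3x3 block for '3d'."""
    if mode not in ("2d", "3d"):
        raise ValueError(f"unknown mode: {mode!r}")
    cx, cy, cz = cell
    dz_values = (0,) if mode == "2d" else (-1, 0, 1)
    return [
        (cx + dx, cy + dy, cz + dz)
        for dx, dy, dz in product((-1, 0, 1), (-1, 0, 1), dz_values)
    ]
```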
Collision Algorithm Comparison (Spatial Hashing vs Alternatives)
Script: benchmark/experiments/collision_method_comparison.py --agents 100,500 --iterations 100
Conditions: spacing=0.08m (dense), kinematic objects (mass=0)
| Method | 100 agents | 500 agents | Collisions detected | Valid? |
|---|---|---|---|---|
| Spatial Hashing (current) | 8.0 ms | 92.1 ms | ✅ correct | ✅ |
| Brute Force AABB | 8.7 ms | 231.9 ms (2.5×) | ✅ correct | ✅ |
| | 23.0 ms | 370.6 ms (4.0×) | ✅ correct | ✅ |
| | 0.06 ms | 0.24 ms | ❌ 0 — invalid | ❌ |
| | 12.1 ms | 190.3 ms | ❌ 0 — invalid | ❌ |
Key finding: getContactPoints variants return 0 collisions for kinematic objects (mass=0).
PyBullet’s contact point solver only activates between physics-enabled bodies.
Spatial hashing with getClosestPoints is the only valid fast method for kinematic AGV-type robots.
→ Design decision: Spatial hashing (O(N) average) is used because:
It returns correct collisions for kinematic robots, unlike the getContactPoints variants.
It is 2.5–4× faster than the other valid alternatives at 500 agents.
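The sketch below shows a minimal spatial-hash broad phase, assuming 2-D axis-aligned boxes and hypothetical helper names. This variant inserts each box into every grid cell it touches instead of searching neighbor cells. It also uses an exact AABB overlap test as a stand-in for the getClosestPoints narrow phase. It illustrates only the pair-pruning idea behind the O(N) average cost, not the project's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

AABB = Tuple[Tuple[float, float], Tuple[float, float]]  # ((min_x, min_y), (max_x, max_y))


def _cells(box: AABB, size: float) -> Iterator[Tuple[int, int]]:
    """Every grid cell an AABB touches."""
    (x0, y0), (x1, y1) = box
    for cx in range(int(x0 // size), int(x1 // size) + 1):
        for cy in range(int(y0 // size), int(y1 // size) + 1):
            yield (cx, cy)


def overlaps(a: AABB, b: AABB) -> bool:
    return a[0][0] <= b[1][0] and b[0][0] <= a[1][0] and a[0][1] <= b[1][1] and b[0][1] <= a[1][1]


def candidate_pairs(boxes: Dict[int, AABB], cell_size: float) -> Set[Tuple[int, int]]:
    """Broad phase: bucket boxes by cell, then pair only objects that share a cell."""
    grid: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for obj_id, box in boxes.items():
        for cell in _cells(box, cell_size):
            grid[cell].append(obj_id)
    pairs: Set[Tuple[int, int]] = set()
    for ids in grid.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = sorted((ids[i], ids[j]))
                pairs.add((a, b))  # set removes pairs seen in several shared cells
    return pairs


def colliding_pairs(boxes: Dict[int, AABB], cell_size: float) -> List[Tuple[int, int]]:
    # Narrow-phase stand-in: exact AABB overlap (the page's method uses getClosestPoints here)
    return sorted(p for p in candidate_pairs(boxes, cell_size) if overlaps(boxes[p[0]], boxes[p[1]]))
```

Any two overlapping boxes share at least one cell (the cell containing a point of their intersection), so the broad phase never drops a true collision.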
Differential Drive Optimization
Script: benchmark/profiling/differential_breakdown.py
Conditions: 1000 agents, DIFFERENTIAL mode (ROTATE phase)
The differential drive ROTATE phase was the single most expensive per-agent per-tick operation. Profiling identified two root causes (the first two rows below), plus several smaller per-tick costs:
| Operation | Before | After | Speedup |
|---|---|---|---|
| Quaternion slerp (scipy Slerp → | 26.8 μs | 2.3 μs | 11.8× |
| Scalar clip ( | 6.8 μs | 0.15 μs | 44× |
| Dead | ~0.6 μs | 0 μs | eliminated |
| Redundant | ~0.3 μs | 0 μs | eliminated |
| Velocity array reallocation (P2) | ~0.3 μs | 0 μs | in-place |
| FORWARD direction recompute (P3) | ~0.5 μs | 0 μs | cached |
| ROTATE phase total | 36.1 μs | 4.9 μs | 7.3× |
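The original call in the scalar-clip row is not named here. Assuming it was `numpy.clip` applied to a single Python float, the usual replacement is a plain comparison, which avoids numpy's array dispatch for one value. `clip_scalar` is a hypothetical name.

```python
import timeit


def clip_scalar(x: float, lo: float, hi: float) -> float:
    """Clamp one float to [lo, hi] with plain comparisons (no numpy dispatch)."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


if __name__ == "__main__":
    n = 200_000
    t = timeit.timeit("clip_scalar(0.7, -0.5, 0.5)", globals=globals(), number=n)
    print(f"clip_scalar: {t / n * 1e6:.3f} us per call")
    # For comparison, with numpy installed:
    # timeit.timeit("np.clip(0.7, -0.5, 0.5)", setup="import numpy as np", number=n)
```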
End-to-end impact (1000 agents, differential):
| Metric | Before | After | Improvement |
|---|---|---|---|
| Per-agent update | 57 μs | 12.4 μs | 4.6× faster |
| DIFFERENTIAL / OMNIDIRECTIONAL ratio | 5.0× | 2.9× | Closer to parity |
| 500-agent ROTATE frame time | 18.0 ms | 2.5 ms | −15.6 ms/frame |
Why was scipy Slerp slow?
scipy.spatial.transform.Slerp is a generic array-oriented API designed for batch interpolation
(e.g., slerp(np.linspace(0, 1, 10000))). When called with a single scalar t per agent per tick,
the fixed overhead — input normalisation, dtype promotion, Rotation object wrapping — dominates
the actual trigonometry. The replacement _quat_slerp sidesteps this by precomputing
dot/theta/sin(theta) once per waypoint transition and using math.sin/math.cos (C-level scalar
functions) instead of numpy array dispatch.
Absolute timings vary by environment; the ratios above are the meaningful comparison metric.
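The precomputation described under "Why was scipy Slerp slow?" can be sketched as a small callable object. This illustrates the idea behind `_quat_slerp` but is not the project's code. The class name is hypothetical, and quaternions are assumed to be unit length in PyBullet's (x, y, z, w) order. The fallback to normalised linear interpolation for nearly identical orientations is a common safeguard added here.

```python
import math
from typing import Sequence, Tuple

Quat = Tuple[float, float, float, float]  # (x, y, z, w), PyBullet's component order


class PrecomputedSlerp:
    """Scalar slerp between two unit quaternions with per-transition setup hoisted out."""

    def __init__(self, q0: Sequence[float], q1: Sequence[float]) -> None:
        dot = sum(a * b for a, b in zip(q0, q1))
        if dot < 0.0:
            # q and -q are the same rotation; flip one to interpolate along the shorter arc
            q1 = [-c for c in q1]
            dot = -dot
        self.q0 = tuple(q0)
        self.q1 = tuple(q1)
        self.theta = math.acos(min(dot, 1.0))  # clamp guards against rounding just above 1
        self.sin_theta = math.sin(self.theta)

    def __call__(self, t: float) -> Quat:
        if self.sin_theta < 1e-6:
            # Nearly identical orientations: normalised lerp avoids dividing by ~0
            q = [a + t * (b - a) for a, b in zip(self.q0, self.q1)]
            norm = math.sqrt(sum(c * c for c in q))
            return tuple(c / norm for c in q)  # type: ignore[return-value]
        w0 = math.sin((1.0 - t) * self.theta) / self.sin_theta
        w1 = math.sin(t * self.theta) / self.sin_theta
        return tuple(w0 * a + w1 * b for a, b in zip(self.q0, self.q1))  # type: ignore[return-value]
```

The `acos` and dot product run once per waypoint transition. Each tick then needs only two `math.sin` calls and a weighted sum.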
Collision Config-Based Comparison (Physics ON vs OFF)
Script: benchmark/experiments/collision_methods_config_based.py
Conditions: 100 objects, 500 steps
| Config | Physics | Method | Step time | Collision time |
|---|---|---|---|---|
| | OFF | | 1.51 ms | 1.24 ms |
| | ON | | 2.64 ms | 1.41 ms |
| | ON | | 2.00 ms | 1.27 ms |
Physics ON adds ~32–75% step overhead (2.00–2.64 ms vs 1.51 ms) because the physics engine is stepped on every simulation step.
→ Design decision: Default is physics=false (kinematics mode) for maximum throughput.
Use physics=true only when rigid-body dynamics are required.
Source: benchmark/results/collision_methods_config_based.txt
Arm Joint Control Performance
Script: benchmark/profiling/arm_joint_update.py --test scaling
Conditions: arm_robot.urdf (4 revolute joints), fixed-base, JointAction cycling, 100 steps per count
Physics vs Kinematic Scaling
| Arms | Joints | Physics (ms/step) | Kinematic (ms/step) | Ratio (physics / kinematic) |
|---|---|---|---|---|
| 1 | 4 | 0.029 | 0.010 | 2.8× |
| 5 | 20 | 0.082 | 0.056 | 1.5× |
| 10 | 40 | 0.152 | 0.090 | 1.7× |
| 25 | 100 | 0.415 | 0.253 | 1.6× |
| 50 | 200 | 0.886 | 0.552 | 1.6× |
Kinematic mode is consistently faster than physics mode for joint control.
The gap comes from skipping stepSimulation() — kinematic mode uses resetJointState()
with per-step interpolation at URDF velocity limits.
Exact ratios are environment-dependent; the trend (kinematic faster) is consistent.
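The kinematic path can be sketched as a pure per-step interpolation. Each joint moves toward its target by at most its velocity limit times the step length, and the result is written straight into the simulator. The function name is hypothetical, and the limits are assumed to come from the URDF. In PyBullet, each returned value would be applied with `p.resetJointState(body_id, joint_index, value)`.

```python
from typing import List, Sequence


def kinematic_joint_step(current: Sequence[float], target: Sequence[float],
                         max_velocity: Sequence[float], dt: float) -> List[float]:
    """Move each joint toward its target by at most max_velocity * dt (no dynamics)."""
    out: List[float] = []
    for q, q_target, v_max in zip(current, target, max_velocity):
        max_step = v_max * dt
        delta = q_target - q
        if abs(delta) <= max_step:
            out.append(q_target)  # reachable this step: snap exactly to the target
        else:
            out.append(q + max_step if delta > 0 else q - max_step)
    return out
```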
Component Breakdown (10 arms)
Script: benchmark/profiling/arm_joint_update.py --test builtin --arms 10
| Component | Physics | Kinematic |
|---|---|---|
| agent_update | 27.0% (0.050 ms) | 96.7% (0.087 ms) |
| step_simulation | 69.6% (0.129 ms) | 0.2% (0.000 ms) |
| callbacks | 1.0% | 0.4% |
| total | 0.186 ms | 0.090 ms |
In physics mode, stepSimulation() dominates (70%). In kinematic mode, the physics engine
is bypassed entirely — agent_update (joint interpolation + resetJointState) is the sole cost.
Kinematic Joint Cache Optimization
Problem: cProfile showed p.getJointState() consuming ~36% of kinematic update time
(called per-joint per-step to read current positions before interpolating).
Solution: _kinematic_joint_positions cache — joint positions stored in a Python dict,
initialized via batch p.getJointStates() at agent creation, updated after each resetJointState().
get_joint_state() returns cached values for kinematic robots (zero PyBullet calls).
| Metric | Before cache | After cache | Improvement |
|---|---|---|---|
| 50 arms step time | 0.826 ms | 0.552 ms | 1.5× faster |
| | 200 (50 arms × 4 joints) | 0 | eliminated |
→ Design decision: Kinematic joint cache is always active for mass=0.0 URDF robots.
The cache is invisible to callers — get_joint_state() API is unchanged.
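The joint cache can be sketched as follows. `KinematicJointCache` and its method names are hypothetical; the project exposes the cached values through `get_joint_state()`. The `client` argument is anything that provides PyBullet's `getJointStates(bodyUniqueId, jointIndices)` and `resetJointState(bodyUniqueId, jointIndex, targetValue)` calls, such as the `pybullet` module itself. That also lets the sketch run against a fake client without a simulator.

```python
from typing import Any, Dict, Sequence


class KinematicJointCache:
    """Hypothetical per-robot cache of joint positions for mass-0 (kinematic) robots."""

    def __init__(self, client: Any, body_id: int, joint_indices: Sequence[int]) -> None:
        self._client = client
        self._body_id = body_id
        # One batched query at creation; each state tuple starts with the joint position
        states = client.getJointStates(body_id, list(joint_indices))
        self._positions: Dict[int, float] = {j: s[0] for j, s in zip(joint_indices, states)}

    def get_position(self, joint_index: int) -> float:
        """Read from the cache: no simulator call."""
        return self._positions[joint_index]

    def set_position(self, joint_index: int, value: float) -> None:
        """Write through to the simulator, then keep the cache in sync."""
        self._client.resetJointState(self._body_id, joint_index, value)
        self._positions[joint_index] = value
```

After construction, reads never touch the simulator. The write-through in `set_position` keeps the cache correct only if nothing else moves the joints, which holds for kinematic robots that are never stepped by the physics engine.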
Data collected 2026-03-15. Absolute timings and ratios are environment-dependent
(CPU, OS, PyBullet version). The qualitative trends — kinematic faster than physics,
cache eliminating per-step PyBullet calls — are expected to hold across environments.
Re-run benchmark/profiling/arm_joint_update.py to obtain numbers for your setup.
See Also
Benchmark Suite — How to run benchmarks and reproduce these results
Experiment Scripts — Collision algorithm comparison scripts
Profiling Guide — Per-component profiling scripts
Optimization Guide — Parameter recommendations based on these results