Benchmark Results Reference

Results are updated with each release. To reproduce them, see Benchmark Suite and Profiling Guide.

Test Environment

Item	Value
OS	Ubuntu (Linux 6.6 microsoft-standard-WSL2)
CPU	AMD Ryzen AI 7 PRO 350 w/ Radeon 860M (8 cores / 16 logical)
RAM	29 GB
Python	3.12.3
Conditions	Headless (`gui=false`), `physics=false`, `timestep=0.1`, 3 repetitions

Simulation Throughput

Command: make bench-release (runs the core throughput sweep and the mobile control-path comparison)
Config: benchmark/configs/general.yaml, using simple_cube robots, collision_check_frequency=null (every step), 50% of agents moving, the batch controller, and the fleet command interface
Last measured: 2026-07-24

Agents	Step Time (ms)	RTF	Spawn Time	Memory Delta
100	0.70 ± 0.06	143.5×	59 ms	+2.6 MB
250	1.76 ± 0.12	56.8×	140 ms	+6.2 MB
500	4.25 ± 0.30	23.5×	294 ms	+12.5 MB
1000	9.88 ± 0.08	10.1×	604 ms	+25.0 MB
2000	23.25 ± 1.07	4.3×	1241 ms	+48.1 MB

Source: benchmark/results/benchmark_sweep_10.0s.json

The 2026-07-24 sweep was collected on a WSL2 host using the lightweight simple_cube model as the large-scale baseline. The default collision_check_frequency=null setting means collision checks run every simulation step. More detailed robot models have higher spawn and update costs and should be benchmarked separately. Timed simulation runs keep tracemalloc disabled because Python allocation tracing materially distorts step-time and RTF measurements; memory deltas are RSS-based.
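The RTF column in the table above is consistent with dividing the simulated time advanced per step (timestep=0.1 s) by the measured wall-clock step time. A minimal sketch under that assumption follows; the function name `real_time_factor` is illustrative and not part of the benchmark suite.

```python
def real_time_factor(timestep_s: float, step_time_ms: float) -> float:
    """Simulated seconds advanced per wall-clock second."""
    return timestep_s / (step_time_ms / 1000.0)


# Mean step times (ms) copied from the throughput table, keyed by agent count.
step_time_ms = {100: 0.70, 250: 1.76, 500: 4.25, 1000: 9.88, 2000: 23.25}

for agents, ms in step_time_ms.items():
    print(f"{agents:>5} agents: {real_time_factor(0.1, ms):6.1f}x")
```

Small differences from the published column are possible because the table lists rounded step times.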

ROS 2 Bridge Scale Check

Measured 2026-07-25 with Docker, headless simple_cube robots, physics off, timestep=0.1, publish_rate=5 Hz, and target_rtf=0. These are diagnostic single-run values and are not comparable to the core simulation table above.

Mode	Robots	Command result	Max RTF
`fleet`	1000	ack 0.486 s; all moved in 0.104 s	9.37×
`per_robot`	1000	publish 0.217 s; 0/1000 moved in 60 s	0.64×
`per_robot`	100	publish 0.056 s; all moved in 15.033 s	not measured
`hybrid`	1000	fleet ack 7.998 s; per-robot publish 0.220 s	0.30×

The 1000-robot per-robot result shows why topic publication latency must not be treated as end-to-end command latency. See ros2_bridge/PERFORMANCE.md for the bridge-specific methodology and recommendations.

Mobile Control Path Comparison

Script: benchmark/run_benchmark.py --type mobile_control_path --controller per_agent batch --command-interface per_agent fleet --sweep 100 500 1000 --steps 600 --repetitions 3
Source: benchmark/results/mobile_control_path_sweep_600steps.json

This benchmark isolates controller update and command-ingress cost in the Python simulation loop.

Agents	Controller	Command Interface	Setup Time	Step Time	P95 Step
100	per-agent	per-agent	0.0020 s	0.118 ms	0.149 ms
100	batch	fleet	0.0016 s	0.112 ms	0.124 ms
500	per-agent	per-agent	0.0283 s	0.666 ms	0.563 ms
500	batch	fleet	0.0082 s	0.661 ms	0.383 ms
1000	per-agent	per-agent	0.0245 s	1.526 ms	1.592 ms
1000	batch	fleet	0.0162 s	1.290 ms	1.397 ms

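Memory note: The Memory Delta column in the Simulation Throughput table is process RSS and should be treated as environment-dependent, especially on WSL2.

Scalability: Step time in the Simulation Throughput sweep grows as ~O(n^1.2) across the measured 100–2000 robot range on this WSL2 host.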
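One way to estimate such an exponent is a least-squares fit of log(step time) against log(agent count). The sketch below applies that fit to the mean step times from the Simulation Throughput table; whether the published ~1.2 was derived this way is an assumption.

```python
import math


def scaling_exponent(counts, times):
    """Slope of the least-squares line through (log n, log t)."""
    xs = [math.log(n) for n in counts]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


agents = [100, 250, 500, 1000, 2000]
step_ms = [0.70, 1.76, 4.25, 9.88, 23.25]  # mean step times from the table

exponent = scaling_exponent(agents, step_ms)
print(f"step time ~ O(n^{exponent:.2f})")
```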

Step Time Component Breakdown

Conditions: supplemental profiling with 1000 agents, 50% moving, collision_check_frequency=null (collision checks every simulation step), measured over the first 100 steps after warmup.

Component	Mean Time	Share	Notes
Agent / controller update	4.43 ms	~46%	Dominant cost; much higher for moving than stationary
Pose flush	0.98 ms	~10%	Buffered kinematic pose writes to PyBullet
AABB / spatial-grid flush	2.07 ms	~22%	Refreshes moved-object collision state
Collision check	2.07 ms	~22%	AABB filtering dominates within collision
Monitor update	0.04 ms	<1%	Near-zero if monitor GUI is disabled
Step simulation	0 ms	0%	Physics off; up to ~40% with physics on

Profiler note: Long profiling runs can understate moving-agent cost because robots may reach their goals and become stationary. Use a fixed first-window profile when comparing with the benchmark table.
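A fixed first-window profile can be sketched as per-component timers that accumulate only over a fixed number of steps after warmup. The component labels and the callables below are hypothetical stand-ins, not the project's profiler.

```python
import time
from collections import defaultdict


def profile_first_window(components, warmup_steps=10, window_steps=100):
    """Return mean seconds per step for each component over a fixed window.

    `components` maps a label to a zero-argument callable (hypothetical
    stand-ins for the real update, flush, and collision stages).
    """
    totals = defaultdict(float)
    for step in range(warmup_steps + window_steps):
        for name, fn in components.items():
            start = time.perf_counter()
            fn()
            elapsed = time.perf_counter() - start
            if step >= warmup_steps:  # discard warmup samples
                totals[name] += elapsed
    return {name: total / window_steps for name, total in totals.items()}


def shares(means):
    """Convert mean times to percentage shares of the summed step time."""
    whole = sum(means.values())
    return {name: 100.0 * t / whole for name, t in means.items()}


# Dummy workloads so the sketch runs on its own.
demo = {
    "agent_update": lambda: sum(range(2000)),
    "collision_check": lambda: sum(range(1000)),
}
means = profile_first_window(demo, warmup_steps=5, window_steps=50)
print(shares(means))
```

Because the window length is fixed, the measurement ends before most agents can reach their goals, which keeps the moving fraction close to the configured value.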

Agent Update Cost

Script: benchmark/profiling/agent_update.py --agents 500

Stationary vs Moving (500 agents)

State	Total time	Per agent
Stationary	0.29 ms	0.57 μs
Moving	61.27 ms	122.5 μs
Ratio	214×

→ Design decision: Benchmarks use 50% moving agents as a representative workload. Moving agent cost dominates: kinematics, trajectory following, and PyBullet API calls only execute for agents with an active goal.

Motion Mode Comparison (500 agents, all moving)

Mode	Total time	Per agent	Relative
OMNIDIRECTIONAL	9.26 ms	18.5 μs	1× (baseline)
DIFFERENTIAL	26.66 ms	53.3 μs	2.9× slower

→ Optimisation note (2026-03-12): DIFFERENTIAL was previously ~5× slower (90.6 μs/agent). Replacing scipy Slerp with a precomputed quaternion slerp and removing per-tick numpy allocations reduced the ROTATE phase from 36.1 μs to 4.9 μs. See Differential Drive Optimization below.

→ Design decision: Benchmarks default to OMNIDIRECTIONAL. DIFFERENTIAL requires heading alignment computation which adds ~2.9× update cost.

Wrapper Layer Overhead

Script: benchmark/profiling/wrapper_overhead.py --n 500 --reps 3
Conditions: 500 objects, process-isolated per layer

Layer	Spawn time	Update time (get+set pose)	Memory (RSS delta)
Direct PyBullet (baseline)	63 ms	2.21 ms	5.2 MB
SimObject	106 ms (+68%)	6.80 ms (3.1×)	9.2 MB
SimObjectManager	207 ms (+229%)	10.75 ms (4.9×)	9.3 MB
Agent	237 ms (+276%)	10.60 ms (4.8×)	11.5 MB
AgentManager	255 ms (+305%)	10.20 ms (4.6×)	11.6 MB

Extrapolated to 10,000 objects (production feasibility, thresholds: spawn < 10 s, update < 150 ms/step, memory < +200 MB):

Layer	Spawn	Update/step	Memory	Pass?
SimObject	2.1 s	136 ms	+81 MB	✅ all pass
Agent	4.7 s	212 ms	+127 MB	spawn/mem ✅, update ❌

→ Design decision: SimObject layer is production-viable at 10,000 scale. Agent layer exceeds the 150 ms/step update threshold at 10,000 (extrapolated 212 ms), but comfortably passes at 5,000 or below. For ≥10,000 agents, consider using SimObject directly or batching update calls.
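The spawn and update figures in the extrapolation table are consistent with linear scaling from the 500-object measurements. The sketch below reproduces that arithmetic under the linear-scaling assumption; the function and dictionary names are illustrative, and the memory column is left out because the source does not state how it was derived.

```python
# Measured at 500 objects (from the Wrapper Layer Overhead table).
measured_500 = {
    "SimObject": {"spawn_ms": 106.0, "update_ms": 6.80},
    "Agent": {"spawn_ms": 237.0, "update_ms": 10.60},
}

# Feasibility thresholds quoted in the text.
SPAWN_LIMIT_S = 10.0
UPDATE_LIMIT_MS = 150.0


def extrapolate(layer, target_count, measured_count=500):
    """Linearly scale spawn and update cost to `target_count` objects."""
    scale = target_count / measured_count
    m = measured_500[layer]
    spawn_s = m["spawn_ms"] * scale / 1000.0
    update_ms = m["update_ms"] * scale
    return {
        "spawn_s": spawn_s,
        "update_ms": update_ms,
        "spawn_ok": spawn_s < SPAWN_LIMIT_S,
        "update_ok": update_ms < UPDATE_LIMIT_MS,
    }


for layer in measured_500:
    for count in (5_000, 10_000):
        print(layer, count, extrapolate(layer, count))
```

Linear scaling is optimistic if per-object cost grows with object count, so measured runs at the target scale remain the better check.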

Collision Mode Comparison (DISABLED / NORMAL_2D / NORMAL_3D)

Script: benchmark/experiments/collision_mode_comparison.py --agents 500 --iterations 200

Mode	Step time (mean)	Collision time	vs Disabled
DISABLED	1.37 ms	—	baseline
NORMAL_2D (9 neighbors)	2.26 ms	0.21 ms	+64%
NORMAL_3D (27 neighbors)	2.43 ms	0.39 ms	+77%

Collision check breakdown (both modes): AABB Filtering accounts for ~97–99% of collision check time. Spatial Hashing and Contact Points together are <3%.

→ Design decision: NORMAL_2D yields only modest speedup (7%) over NORMAL_3D at this scale because AABB filtering dominates in both. Use NORMAL_2D for ground robots where Z-axis neighbors are irrelevant for correctness, not for dramatic performance gains.

Collision Algorithm Comparison (Spatial Hashing vs Alternatives)

Script: benchmark/experiments/collision_method_comparison.py --agents 100,500 --iterations 100
Conditions: spacing=0.08m (dense), kinematic objects (mass=0)

Method	100 agents	500 agents	Collisions detected	Valid?
Spatial Hashing (current)	8.0 ms	92.1 ms	✅ correct	✅
Brute Force AABB	8.7 ms	231.9 ms (2.5×)	✅ correct	✅
`getClosestPoints` all-pairs	23.0 ms	370.6 ms (4.0×)	✅ correct	✅
`getContactPoints()` no-args	0.06 ms	0.24 ms	❌ 0 — invalid	❌
`getContactPoints(A,B)` pairwise	12.1 ms	190.3 ms	❌ 0 — invalid	❌

Key finding: getContactPoints variants return 0 collisions for kinematic objects (mass=0). PyBullet’s contact point solver only activates between physics-enabled bodies. Spatial hashing with getClosestPoints is the only valid fast method for kinematic AGV-type robots.

→ Design decision: Spatial hashing (O(N) average) is used because:

It is valid for kinematic robots, unlike the getContactPoints variants.
It is 2.5–4× faster than the valid alternatives at 500 agents.
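The broad phase of a spatial-hash collision check can be sketched as follows. This is an illustration of the general technique, not the project's implementation: it assumes objects are hashed by AABB centre and that no object is larger than one grid cell. The surviving pairs would then go to a narrow-phase query such as getClosestPoints, which is not shown.

```python
from collections import defaultdict
from itertools import product


def _overlap(a, b):
    """Axis-aligned box overlap test on all three axes."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))


def candidate_pairs(aabbs, cell_size, use_3d=False):
    """Return index pairs (i, j), i < j, whose AABBs overlap.

    `aabbs` is a list of ((min_x, min_y, min_z), (max_x, max_y, max_z)).
    Assumes every object fits inside one cell, so overlapping objects
    always sit in the same or adjacent cells.
    """
    grid = defaultdict(list)
    cells = []
    for i, (lo, hi) in enumerate(aabbs):
        centre = [(a + b) / 2.0 for a, b in zip(lo, hi)]
        cell = tuple(int(c // cell_size) for c in centre)
        if not use_3d:
            cell = (cell[0], cell[1], 0)  # collapse Z for ground robots
        cells.append(cell)
        grid[cell].append(i)

    z_offsets = (-1, 0, 1) if use_3d else (0,)  # 27 vs 9 neighbour cells
    pairs = set()
    for i, (cx, cy, cz) in enumerate(cells):
        for dx, dy, dz in product((-1, 0, 1), (-1, 0, 1), z_offsets):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if j > i and _overlap(aabbs[i], aabbs[j]):
                    pairs.add((i, j))
    return pairs


boxes = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((0.5, 0.5, 0.0), (1.5, 1.5, 1.0)),  # overlaps box 0
    ((5.0, 5.0, 0.0), (6.0, 6.0, 1.0)),  # far from both
]
print(candidate_pairs(boxes, cell_size=1.0))
```

Each object is compared only against objects in neighbouring cells, which is where the O(N) average cost comes from when objects are spread across the grid.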

Differential Drive Optimization

Script: benchmark/profiling/differential_breakdown.py
Conditions: 1000 agents, DIFFERENTIAL mode (ROTATE phase)

The differential drive ROTATE phase was the single most expensive per-agent per-tick operation. Profiling identified two dominant root causes (scipy Slerp and np.clip, the first two rows below) plus several minor per-tick overheads:

Operation	Before	After	Speedup
Quaternion slerp (scipy Slerp → `_quat_slerp`)	26.8 μs	2.3 μs	11.8×
Scalar clip (`np.clip` → `max/min`)	6.8 μs	0.15 μs	44×
Dead `np.array()` allocations (P0)	~0.6 μs	0 μs	eliminated
Redundant `np.dot` in slerp (P1)	~0.3 μs	0 μs	eliminated
Velocity array reallocation (P2)	~0.3 μs	0 μs	in-place
FORWARD direction recompute (P3)	~0.5 μs	0 μs	cached
ROTATE phase total	36.1 μs	4.9 μs	7.3×

End-to-end impact (1000 agents, differential):

Metric	Before	After	Improvement
Per-agent update	57 μs	12.4 μs	4.6× faster
DIFFERENTIAL / OMNIDIRECTIONAL ratio	5.0×	2.9×	Closer to parity
500-agent ROTATE frame time	18.0 ms	2.5 ms	−15.6 ms/frame

Why was scipy Slerp slow?

scipy.spatial.transform.Slerp is a generic array-oriented API designed for batch interpolation (e.g., slerp(np.linspace(0, 1, 10000))). When called with a single scalar t per agent per tick, the fixed overhead — input normalisation, dtype promotion, Rotation object wrapping — dominates the actual trigonometry. The replacement _quat_slerp sidesteps this by precomputing dot/theta/sin(theta) once per waypoint transition and using math.sin/math.cos (C-level scalar functions) instead of numpy array dispatch.
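The precomputation idea can be sketched as below. This is not the project's `_quat_slerp`; the class name, the (x, y, z, w) component order, and the near-parallel fallback are assumptions made for illustration. The `clamp` helper shows the scalar `max`/`min` replacement for `np.clip`.

```python
import math


def clamp(x, lo, hi):
    """Scalar replacement for np.clip(x, lo, hi)."""
    return max(lo, min(hi, x))


class PrecomputedSlerp:
    """Scalar slerp between two fixed unit quaternions (x, y, z, w)."""

    def __init__(self, q0, q1):
        # Done once per waypoint transition, not once per tick.
        dot = sum(a * b for a, b in zip(q0, q1))
        if dot < 0.0:  # flip one endpoint to take the shorter arc
            q1 = tuple(-c for c in q1)
            dot = -dot
        dot = min(1.0, dot)  # guard against rounding just above 1
        self.q0, self.q1 = tuple(q0), tuple(q1)
        self.theta = math.acos(dot)
        self.sin_theta = math.sin(self.theta)

    def __call__(self, t):
        t = clamp(t, 0.0, 1.0)
        if self.sin_theta < 1e-6:  # nearly identical: linear blend
            w0, w1 = 1.0 - t, t
        else:
            w0 = math.sin((1.0 - t) * self.theta) / self.sin_theta
            w1 = math.sin(t * self.theta) / self.sin_theta
        q = [w0 * a + w1 * b for a, b in zip(self.q0, self.q1)]
        norm = math.sqrt(sum(c * c for c in q))
        return tuple(c / norm for c in q)


# Identity to a 90 degree yaw; the midpoint should be a 45 degree yaw.
identity = (0.0, 0.0, 0.0, 1.0)
yaw_90 = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
slerp = PrecomputedSlerp(identity, yaw_90)
print(slerp(0.5))
```

Per tick, only two `math.sin` calls and a handful of scalar multiplications remain; the `acos` and the endpoint handling are paid once in the constructor.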

Absolute timings vary by environment; the ratios above are the meaningful comparison metric.

Collision Config-Based Comparison (Physics ON vs OFF)

Script: benchmark/experiments/collision_methods_config_based.py
Conditions: 100 objects, 500 steps

Config	Physics	Method	Step time	Collision time
`physics_off_closest.yaml`	OFF	`CLOSEST_POINTS`	1.51 ms	1.24 ms
`physics_on_contact.yaml`	ON	`CONTACT_POINTS`	2.64 ms	1.41 ms
`hybrid.yaml`	ON	`HYBRID`	2.00 ms	1.27 ms

Physics ON adds ~32–75% step overhead relative to physics OFF (2.00–2.64 ms vs 1.51 ms, depending on the collision method) because the physics engine is stepped on every simulation step.

→ Design decision: Default is physics=false (kinematics mode) for maximum throughput. Use physics=true only when rigid-body dynamics are required.

Source: benchmark/results/collision_methods_config_based.txt

Arm Joint Control Performance

Script: benchmark/profiling/arm_joint_update.py --test scaling
Conditions: arm_robot.urdf (4 revolute joints), fixed-base, JointAction cycling, 100 steps per count

Physics vs Kinematic Scaling

Arms	Joints	Physics (ms/step)	Kinematic (ms/step)	Ratio
1	4	0.029	0.010	2.8×
5	20	0.082	0.056	1.5×
10	40	0.152	0.090	1.7×
25	100	0.415	0.253	1.6×
50	200	0.886	0.552	1.6×

Kinematic mode is consistently faster than physics mode for joint control. The gap comes from skipping stepSimulation() — kinematic mode uses resetJointState() with per-step interpolation at URDF velocity limits. Exact ratios are environment-dependent; the trend (kinematic faster) is consistent.
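The per-step interpolation described above can be sketched as a velocity-limited step toward the target position. The helper name `step_toward` and the numeric values are illustrative; in the simulator the resulting position would be written with resetJointState().

```python
import math


def step_toward(current, target, max_velocity, dt):
    """Move one joint toward `target`, limited to max_velocity * dt per step."""
    max_delta = max_velocity * dt
    delta = target - current
    if abs(delta) <= max_delta:
        return target  # close enough to land exactly on the target
    return current + math.copysign(max_delta, delta)


# One joint moving from 0 rad to 1 rad at a 2 rad/s limit, timestep 0.1 s.
pos, target = 0.0, 1.0
trace = []
while pos != target:
    pos = step_toward(pos, target, max_velocity=2.0, dt=0.1)
    trace.append(pos)
print(trace)
```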

Component Breakdown (10 arms)

Script: benchmark/profiling/arm_joint_update.py --test builtin --arms 10

Component	Physics	Kinematic
agent_update	27.0% (0.050 ms)	96.7% (0.087 ms)
step_simulation	69.6% (0.129 ms)	0.2% (0.000 ms)
callbacks	1.0%	0.4%
total	0.186 ms	0.090 ms

In physics mode, stepSimulation() dominates (~70%). In kinematic mode, the physics step is effectively skipped (0.2%), so agent_update (joint interpolation + resetJointState) accounts for nearly all of the cost (~97%).

Kinematic Joint Cache Optimization

Problem: cProfile showed p.getJointState() consuming ~36% of kinematic update time (called per-joint per-step to read current positions before interpolating).

Solution: _kinematic_joint_positions cache — joint positions stored in a Python dict, initialized via batch p.getJointStates() at agent creation, updated after each resetJointState(). get_joint_state() returns cached values for kinematic robots (zero PyBullet calls).

Metric	Before cache	After cache	Improvement
50 arms step time	0.826 ms	0.552 ms	1.5× faster
`p.getJointState` calls/step	200 (50 arms × 4 joints)	0	eliminated

→ Design decision: Kinematic joint cache is always active for mass=0.0 URDF robots. The cache is invisible to callers — get_joint_state() API is unchanged.
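The caching pattern can be sketched as follows. This is not the project's code: `KinematicJointCache` and `CountingBackend` are hypothetical names, and the backend stands in for PyBullet so that read calls can be counted without a simulator. Per the description above, the real implementation performs the initial read with p.getJointStates() and writes through resetJointState().

```python
class CountingBackend:
    """Hypothetical stand-in for PyBullet that counts read calls."""

    def __init__(self, n_joints):
        self.state = [0.0] * n_joints
        self.reads = 0

    def get_joint_positions(self, body_id, joint_indices):
        self.reads += 1
        return [self.state[j] for j in joint_indices]

    def reset_joint(self, body_id, joint_index, value):
        self.state[joint_index] = value


class KinematicJointCache:
    """Serve joint positions from a dict instead of querying the backend."""

    def __init__(self, backend, body_id, joint_indices):
        self._backend = backend
        self._body_id = body_id
        # One batched read at creation time.
        positions = backend.get_joint_positions(body_id, joint_indices)
        self._positions = dict(zip(joint_indices, positions))

    def get_joint_position(self, joint_index):
        return self._positions[joint_index]  # no backend call

    def set_joint_position(self, joint_index, value):
        self._backend.reset_joint(self._body_id, joint_index, value)
        self._positions[joint_index] = value  # keep the cache in sync


backend = CountingBackend(n_joints=4)
cache = KinematicJointCache(backend, body_id=0, joint_indices=[0, 1, 2, 3])
for _ in range(100):  # 100 steps, 4 joints each
    for j in range(4):
        cache.set_joint_position(j, cache.get_joint_position(j) + 0.01)
print(backend.reads)  # 1: only the initial batched read
```

The cache stays correct only as long as every write goes through the same wrapper, which is why the pattern suits kinematic robots whose joints are never moved by the physics engine.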

Data collected 2026-03-15. Absolute timings and ratios are environment-dependent (CPU, OS, PyBullet version). The qualitative trends — kinematic faster than physics, cache eliminating per-step PyBullet calls — are expected to hold across environments. Re-run benchmark/profiling/arm_joint_update.py to obtain numbers for your setup.