Roadmap
Planned additions and improvements for PyBulletFleet. Items are grouped by category; ordering within a group does not imply priority.
Assets
New robot and infrastructure models:
Physics Mobile Robot — Wheeled robot driven by PyBullet physics (motor torques, friction, contact forces)
Physics Mobile Manipulator — Physics-mode mobile manipulator with motor-driven base and arm
Conveyor / Elevator / Mobile Rack — Warehouse infrastructure entities for material handling scenarios
Features
Snapshot, Event log & Replay (EventBus steps 1–2 done ✅; steps 3–6 pending) — Three-layer architecture sharing a single EventBus as data source. Enables replay, external synchronization (USO integration), and downstream observability (see Observability below).
Layer overview:
Layer
Purpose
Granularity
Lossless?
Replayable?
Snapshot
State checkpoint for seek / restore
Low frequency (e.g. 1 Hz)
Yes
Yes (state restore)
Event log
Causal record of state transitions (input replay, audit)
Per state-change
Yes (must)
Yes (input replay)
Trace
Operational observability (Grafana/Tempo)
Action-level spans
Sampling allowed
No (view only)
Design principle: same source, separate sinks. All three are derived from the EventBus (
docs/design/eventbus/spec.md) — emit once, fan out to multiple subscribers. Snapshot ≠ Event (different schemas, separate files), but share common header keys (sim_time,step,wall_time,run_id) so they can be merged on the time axis.Why trace alone cannot replay: trace is sampled, coarse-grained (Action-level spans), drops non-deterministic inputs (RNG seed, callback returns), and has attribute-size limits unsuitable for
Pathwaypoints / IK results. Replay needs Event log (input) + Snapshot (checkpoint).Output layout:
run_<timestamp>/ snapshots.jsonl # low-frequency full+delta state events.jsonl # lossless state-change events # OTel spans → Tempo (separate, optional, sampled)
Implementation order:
EventBus core +
SimEventdataclass (seedocs/design/eventbus/spec.md)Publish from
Actionstate transitions,add_object/remove_object, collision Enter/ExitJSONL event writer subscriber (lossless input log)
SimulationSnapshot+MultiRobotSimulationCore.snapshot()/restore()(seedocs/design/snapshot-replay/spec.md)replay.py: snapshot + event log → re-executionOTel exporter subscriber (covered in Observability section, independent of replay)
Steps 3–6 are independent once 1–2 are in place.
Behavior tree integration — Create agent behavior from behavior trees
SimObject.from_sdf()→List[SimObject]✅ — Factory method that loads an SDF file viap.loadSDF()and wraps each returned body_id in aSimObject. Collision detection and lifecycle management viaadd_object()are applied automatically. Required for Open-RMF SDF environment loading and official support for pybullet_data SDF models (kiva_shelf, wsg50_gripper, etc.).
Crowd Simulation (lightweight)
Lightweight crowd simulation for NPC pedestrians in RMF demo environments. Provides more realistic pedestrian behavior than the current RandomWalkController (random point within radius) while staying far simpler than Gazebo’s full Menge/ORCA crowd_simulator plugin.
Motivation: RMF demos (airport_terminal) use Gazebo’s crowd_simulator with Menge BFSM + ORCA + nav mesh for 10 pedestrians. PyBulletFleet currently substitutes RandomWalkController which has no obstacle avoidance and no agent-agent interaction. A middle-ground is useful for visual fidelity in demo environments without the complexity of a full crowd simulation framework.
Scope:
Feature |
RandomWalkController (current) |
Proposed |
Gazebo crowd_simulator |
|---|---|---|---|
Obstacle avoidance |
None |
Simple raycast or AABB check |
ORCA + nav mesh |
Agent-agent avoidance |
None |
Separation force (boids-style) |
ORCA mutual |
Goal selection |
Uniform random in radius |
Weighted goal sets (configurable) |
BFSM state machine |
Nav mesh |
Not required |
Not required |
Required |
Config complexity |
2 params |
~5 params (YAML) |
3 XML files |
Computation |
~0 |
O(n·k) neighbor lookup |
O(n²) ORCA |
Approach: A new CrowdController (or enhancement of RandomWalkController) that adds:
Goal sets — Named groups of positions with transition weights (like Menge BFSM but in YAML)
Simple separation — Repulsive force from nearby agents (boids separation rule, not full ORCA)
Static obstacle avoidance — Optional raycast or AABB query to avoid walls (PyBullet
rayTest)
Priority: Low — current RandomWalkController is functional for demo purposes. This is primarily for RMF demo visual fidelity (再現度).
Crowd-as-dynamic-obstacle throughput benchmark
The higher-value use of a crowd is not visual fidelity but turning crowd density into a benchmark variable: measure how much a fleet’s task throughput degrades when robots must contend with pedestrians. Pure visual pedestrians (a human mesh wandering) have little functional value here, because the fleet is RMF-path-driven and does not avoid arbitrary dynamic obstacles — the NPCs are currently just eye-candy. Making them dynamic obstacles the fleet reacts to is what aligns with PyBulletFleet’s strength (throughput/scale benchmarking at hundreds of robots, a regime too heavy for Gazebo).
Three pieces:
Crowd as dynamic obstacles — the lightweight
CrowdControllerabove, with pedestrian positions known to the sim.Robot reaction —
Minimal: “pedestrian within X m ahead → slow/stop” via PyBullet
rayTest/ collision checks. No RMF changes; measures stop-and-wait delay.Full: publish detected pedestrians as
rmf_obstacle_msgs/Obstaclesso RMF reroutes / closes lanes (heavier; integrates withrmf_visualization_obstaclesand the obstacle pipeline).
Task-throughput harness — dispatch N patrol/delivery tasks and measure completion time / completions-per-hour. This is a new metric distinct from the existing RTF/step-time benchmarks (which measure sim performance, not fleet task throughput).
Experiment: sweep crowd density × avoidance{off, stop, reroute} → task throughput, at 100–1000 robots. Reuses existing collision detection, rayTest, EventBus, and the benchmark harness scaffolding.
Priority: Low / research — schedule after the ROSCon minimum (working demos + packaging). Start with the minimal reaction (stop-on-detect, no RMF changes).
Devices
See ros2_bridge/README.md for device enhancements (elevator doors, double-hinge doors, xacro-parameterised URDFs).
RMF Demo Feature Coverage
Gaps between the standard Gazebo rmf_demos and the PyBulletFleet bridge. Core
flows (navigation, doors, lifts, delivery pick/drop, battery/charging,
cleaning coverage, georeferenced maps) are implemented. Ported demos:
office, clinic, hotel, airport_terminal, battle_royale, campus. Outstanding:
Demos not ported — require an RMF mode the bridge doesn’t implement:
office_mock_traffic_lightandtriple_H— both use theEasyTrafficLightadapter (the robot owns navigation; RMF only gates it with go/stop). This inverts the bridge’s full-control model and is a poor fit for the kinematic, RMF-planned sim — see the [TrafficLight discussion]. Relevant only once a co-simulation/robot-proxy layer exists. Run them in Gazebo for now.
RMF tasks/actions — partial (fleet_adapter.execute_action):
dock — treated as a plain navigate (no dedicated dock-path approach; endpoint is reached, only the precise approach trajectory is skipped — negligible in a kinematic sim).
Other custom performable actions — logged and finished, not executed.
Substituted (works, but differs from Gazebo):
Crowd — Menge/ORCA
crowd_simulator→ lightweightRandomWalkController(no avoidance). See Crowd Simulation.Caddy — Gazebo keyboard teleop (
diff_driveplugin) → autonomousPatrolController.
Not modeled by design (kinematic focus):
Robot sensors (lidar/camera), physics dynamics (wheel torque/contact).
rmf_obstacledetection pipeline (also unused in stock Gazebo demos; see the “Crowd-as-dynamic-obstacle throughput benchmark” section above).
Interfaces
External communication layers:
ROS 2 — Topic / service / action bridge for ROS 2 ecosystem integration (see
ros2_bridge/README.md)gRPC — Language-agnostic RPC interface for orchestrators, WMS, and fleet managers
Distributed co-simulation (Robot Proxy layer) — Per-robot proxy processes that translate between simulated robots and real Robot Apps (Nav2, task assigners, BTs). Enables running 100+ unmodified single-robot software stacks against a centralized batched simulator. Sim Central stays single-process and batched; only a thin fixed-schema boundary (
StateSnapshot+CommandBuffer+ Events) crosses the IPC. Shares schema with Snapshot/Replay — sameStateSnapshotdataclass feeds live IPC, replay log, and observability sinks (define schema once, fan out to multiple consumers). Seedocs/design/co-simulation/spec.mdfor full design including layer separation, transport options (shared memory / gRPC / DDS), lockstep vs async sync modes, and 6-phase implementation plan.
Refactoring
MultiRobotSimulationCoreresponsibility decomposition — The core class has grown to ~3200 lines / 66 methods. Extract self-contained subsystems into dedicated classes composed into the core:Subsystem
Target class
Methods to extract
Est. lines
Collision detection (spatial hash, AABB, broad/narrow phase)
CollisionSystemcheck_collisions,filter_aabb_pairs,_update_object_spatial_grid, … (~20 methods)~1000
Visualizer (camera, transparency, keyboard)
VisualizerControllerconfigure_visualizer,setup_camera,_handle_keyboard_events, … (~6 methods)~250
Profiling & memory tracking
SimProfilerrecord_profiling,_print_profiling_summary,_print_memory_profiling_summary, … (~5 methods)~200
Each subsystem is held by composition (
self._collision = CollisionSystem(self)). The core class delegates; public API does not change. Natural stepping stone toward the SimBackend ABC (Long-Term Phase 1).Move
controller.pyintocontrollers/package — Base controller classes (Controller,KinematicController,OmniController,DifferentialController,create_controller,register_controller) live inpybullet_fleet/controller.pywhile higher-level controllers (PatrolController,RandomWalkController) are already inpybullet_fleet/controllers/. Consolidate by moving the base module into the package ascontrollers/base.py(or splitting omni/differential into separate files) and re-exporting fromcontrollers/__init__.py. Preserve thefrom pybullet_fleet.controller import ...path via a compatibility shim or__init__.pyre-export.Remove scipy dependency — Currently only
scipy.spatial.transform.Rotationis used (9 call sites for quat↔euler, quat↔matrix, relative rotation). Replace with PyBullet utilities + lightweight helpers ingeometry.pyto eliminate the ~150 MB transitive dependency. Low priority: no runtime performance impact, only install size.Manual quaternion helpers in
geometry.py— Extend the pattern established bySlerpPrecomp/quat_slerp(avoiding scipy, hand-written scalar math) by adding the following helpers togeometry.py. The goal is to eliminate scipyRotationobject creation overhead on hot paths.Required helper functions:
quat_rotate_vector(q, v)— Rotate vectorvby quaternionq(body→world transform). ReplacesRotation.from_quat(q).apply(v)quat_multiply(q1, q2)— Quaternion product. Replaces(Rotation.from_quat(q1) * Rotation.from_quat(q2)).as_quat()quat_from_rotvec(rotvec)— Quaternion from rotation vector. ReplacesRotation.from_rotvec(rotvec).as_quat()(used in angular velocity → orientation update)
Primary application sites:
OmniVelocityController._apply_velocity()— body→world velocity transform + quaternion update from angular velocitytools.body_to_world_velocity_3d()— same as abovePath._calculate_orientation_for_plane()— rotation matrix → quaternion conversionPath.visualize_waypoints()— quaternion → rotation matrix conversion
Design policy:
Pure Python scalar math (
math.sin/math.cos) or small numpy array opsSame file and style as
SlerpPrecompApply incrementally after profiling confirms bottleneck (YAGNI)
Full scipy removal in a separate PR after all call sites are replaced
GUI / Interactive Tools
Improve p.GUI interactivity for development and debugging.
Long-term plan is to replace p.GUI with Rerun (see Long-Term Phase 1) where many of these capabilities come built-in. These items are the interim layer until that migration lands, and naturally collect under the planned VisualizerController (see Refactoring).
Current state:
sim.pause()/resume()Python API ✅SPACEkeybinding for pause toggle ✅pause/resumeevents on EventBus ✅CameraControllerwith right-drag pan, zoom, top-down view ✅
Planned:
Pause / Step / Speed control panel — In-window UI for pause/resume, single-step, and real-time multiplier. Short-term:
p.addUserDebugParametersliders/buttons. Mid-term: tkinter side panel composed withDataMonitor.SimObject inspector panel — Live list of all
_agents/_sim_objectswithobject_id,name, currentPose, action queue length, and status. Click-to-select with viewport highlight.Camera focus on selection — Selected entity becomes camera target; auto-follow toggle; distance / yaw / pitch sliders. Replaces the current static
setup_camera.Object manipulation — Drag selected
SimObjectin XY plane, edit pose via numeric input, delete via UI (callsremove_object()properly so collision/EventBus stay consistent). Respectspickableflag.Action queue inspector — Per-Agent visualization of pending and active actions (type, status, target). Right-click to cancel an action.
Keyboard shortcut overlay — Help panel listing all bound keys (
SPACE, future shortcuts), surfaced via a?key.
Implementation notes:
All new GUI affordances must be opt-out via
SimulationParams(gui=False)so headless / CI paths stay zero-cost.UI state changes must publish through the EventBus (
gui.select,gui.manipulate,gui.action_cancel) so plugins / recordings see them.Avoid per-step UI redraws — throttle to ≤30 Hz independent of sim FPS.
Performance
Near-term optimizations within the current PyBullet-backed architecture. Goal: reduce per-step cost at 100–1000 agents without a full backend swap (see Long-Term section for that).
Profiling Baseline (measured 2026-04-03)
Benchmark: 500 agents, omnidirectional MoveAction, collision_check_frequency=0 (disabled), physics=False, simple_cube.urdf.
Per-step breakdown (median):
Component |
Time |
% of total |
What |
|---|---|---|---|
Python overhead (TPI + slerp + action queue + set_pose logic) |
11.5 ms |
88% |
CPython for-loop + object dispatch |
|
0.7 ms |
5% |
PyBullet C API |
|
0.5 ms |
4% |
PyBullet C API |
Movement detection |
0.3 ms |
3% |
Pure Python arithmetic |
Total |
13.0 ms |
100% |
FPS 77 |
Key insight: 88% of step time is Pure Python overhead, not C API calls.
Collision at 10 Hz adds only ~0.2 ms at 500 agents — negligible compared to agent_update.
Vectorization micro-benchmark (500 agents):
Operation |
Python for-loop |
NumPy vectorized |
Speedup |
|---|---|---|---|
Position compute ( |
1,150 μs |
5.6 μs |
205× |
TPI-like trapezoidal profile |
(per-agent) |
33 μs |
— |
Per-agent cost |
23 μs/agent |
~0.01 μs/agent |
— |
Two-Phase Step: Decouple Computation from PyBullet C API ✅ (batch controller path)
Current step_once() iterates agents one-by-one, each calling controller.compute() (Python/NumPy) → set_pose_raw() (PyBullet C API + AABB update + spatial grid) interleaved. This prevents vectorization and adds per-agent Python↔C crossing overhead.
Proposed split:
Phase |
What |
Hot path |
|---|---|---|
Phase 1 — Compute |
All controllers compute new poses; no side effects |
Pure Python / NumPy |
Phase 2 — Apply |
Tight loop of |
PyBullet C calls |
Phase 3 — Bookkeep |
Batch AABB refresh ( |
C calls + Python dict |
This requires removing direct pybullet API calls from sim_object.py, agent.py, and controller.py. Instead, these modules produce pose intents (position + orientation tuples), and core_simulation.py flushes them to PyBullet in bulk.
Key changes:
SimObject.set_pose()/set_pose_raw()writes to an internal buffer (cached pose + dirty flag) without callingp.resetBasePositionAndOrientation()Controller.compute()returns(new_pos, new_orn)or writes to agent’s pending pose buffercore_simulation.step_once()collects dirty poses → batchresetBasePositionAndOrientation→ batch AABB updateMovement detection stays pure Python (already cached-pose-based), unaffected
Attached-object propagation runs after Phase 2 using the buffered parent poses
Vectorized Agent Update (NumPy Batch) ✅
For the common “N agents on straight-line TPI paths” case, Phase 1 can be further vectorized:
Store all active agents’
forward_start_pos,forward_direction, and TPI parameters in contiguous(N, 3)NumPy arraysCompute
new_positions = start_positions + directions * ratios[:, np.newaxis]in one vectorized callSlerp batch: pre-compute all
(start_quat, target_quat, t_fraction)and batchquat_slerpFallback: agents with non-standard controllers (velocity mode, custom callbacks) use the existing per-agent path
This is a “BatchController” or “VectorizedOmniController” that sits alongside the existing Controller ABC.
C++ Extensions for Hot-Path Functions
Profile-guided candidates for C++ (via pybind11) or Cython acceleration:
Function |
Current |
Why C++ helps |
|---|---|---|
TwoPointInterpolation |
Pure Python |
Called N× per step; tight numerical loop ideal for native code |
|
Python scalar math in |
N× per step; SIMD-friendly |
Spatial hash broad-phase |
Python dict + set ops in |
Dict overhead at 1000+ objects; Rust/C++ hash map faster |
AABB overlap test |
Python comparisons in |
Tight inner loop; autovectorizable in C++ |
|
PyBullet C API (already native) |
Already fast; not a candidate |
Priority: TPI and slerp first (highest call frequency), then spatial hash (scales with agent count²).
Deferred AABB Update
Currently set_pose() calls p.getAABB() and updates the spatial grid per object, immediately. For kinematic-only mode:
Defer all AABB updates to a single
p.performCollisionDetection()call after Phase 2Batch
p.getAABB()for all moved objects at onceRebuild spatial grid once per step instead of incrementally per-object
This removes N p.getAABB() C-API round-trips from the set_pose hot path.
Summary: Expected Impact (500 agents, measured baseline)
Optimization |
Estimated step time |
Estimated FPS |
Speedup vs current |
Effort |
|---|---|---|---|---|
Current |
13.0 ms |
77 |
1.0× |
— |
NumPy vectorized controller |
~1.2 ms |
~850 |
~11× |
Medium |
+ C++ TPI/slerp extensions |
~1.0 ms |
~1,000 |
~13× |
Medium |
Theoretical floor (C API only) |
0.7 ms |
~1,400 |
~18× |
— |
Scaling by agent count (current → vectorized estimate):
Agents |
Current FPS |
Est. vectorized FPS |
Speedup |
|---|---|---|---|
100 |
225 |
~2,000+ |
~9× |
500 |
77 |
~850 |
~11× |
1,000 |
27 |
~450 |
~17× |
Note: Collision detection (10 Hz spatial hash) adds <1% overhead at 500 agents. C++ spatial hash becomes relevant at 1,000+ agents with higher collision frequencies.
Relation to Long-Term Backend Abstraction: The two-phase split and the “remove direct pybullet calls” refactoring are natural stepping stones toward the SimBackend ABC (Phase 1 of Long-Term). The buffered-pose pattern becomes the write side of
SimBackend.set_positions_batch().
SDF & DAE Support Improvements
PyBulletFleet currently has two self-implemented workarounds for Gazebo ecosystem interop:
resolve_sdf_to_urdf— Hand-rolled SDF→URDF XML converter (PyBullet cannot load SDF directly)DAE defensive fallbacks — Colour extraction, texture symlinks, collision try/except (PyBullet has poor DAE support)
Both are functional but hacky. This section tracks improvements to make them more robust or replace them entirely.
SDF → URDF Conversion: gz sdf -p Replacement
resolve_sdf_to_urdf() in sdf_loader.py is a minimal self-implemented SDF→URDF converter using xml.etree. It covers only <link>, <joint>, <mesh>, and primitive geometries — enough for simple robots (DeliveryRobot) but fragile for complex SDF models.
Why self-implemented? No usable Python SDF→URDF tool exists:
sdformat_urdf— C++ only, no Python bindingspysdf(PyPI) — fails to build on Python 3.12+libsdformat14— C++ template API, not callable via ctypesgz sdf -p model.sdf— converts SDF→URDF via CLI subprocess (see evaluation below)
gz sdf -p evaluation (Gazebo Harmonic / sdformat 14, as of 2026-04):
Self-implemented parser |
|
|
|---|---|---|
Dependencies |
None ( |
|
Coverage |
Minimal (link/joint/mesh only) |
Broader (frames, nested models partial) |
Edge cases |
Fix ourselves immediately |
Wait for upstream fix |
DAE colour injection |
|
Not supported (no colour in output URDF) |
Nested |
Not supported |
Partial support |
Known issues |
Pose hierarchy may drift for deep nesting |
Nested model bugs, some pose frame resolution errors |
Stability |
Tested for our models |
“Best effort” — known regressions between versions |
Verdict: gz sdf -p is usable but not fully stable. For simple robots it works; for complex SDF it has regressions. The dependency cost (~200 MB) is also significant. Current self-implemented parser is the pragmatic choice.
Future action:
Monitor
gz sdf -pstability across Gazebo releasesIf it stabilises + we already have
gz-toolsin Docker for other reasons → switch to subprocess call and deleteresolve_sdf_to_urdf()If we support complex SDF robots (nested models, sensors) → worth the dependency
DAE → OBJ Automatic Conversion
PyBullet has poor DAE (COLLADA) support: textures are not loaded, diffuse colours are ignored, and some meshes fail createCollisionShape. Gazebo Fuel models ship exclusively as DAE.
Proposal: Use assimp export (from assimp-utils, ~2 MB) to convert DAE meshes to OBJ + MTL at download time. OBJ is PyBullet’s best-supported mesh format — colours, textures via MTL, and collision all work reliably.
Changes needed:
docker/Dockerfile.rmf_demos—apt install assimp-utilsdocker/download_fuel_models.py— Post-downloadassimp export *.dae *.objconversion steppybullet_fleet/sdf_loader.py— Prefer.objover.daewhen both exist (mesh_path.replace(".dae", ".obj"))
Simplification benefit: Once all DAE meshes have OBJ counterparts, three defensive workarounds in sdf_loader.py become unnecessary and can be removed:
Current workaround |
Why it exists |
Removed after conversion |
|---|---|---|
|
Fuel DAE has embedded colours; local OBJ does not → different |
OBJ+MTL always carries material info → uniform |
|
DAE references textures by bare filename; PyBullet resolves relative to |
OBJ+MTL uses relative paths that resolve correctly |
|
Some DAE meshes crash PyBullet’s collision shape loader |
OBJ collision shapes load reliably |
The is_fuel flag would be replaced by a mesh-format check (e.g. “does an .obj file exist alongside the .dae?”), making the loader source-agnostic — any DAE model from any provider would be handled the same way.
Multi-material mesh textures (per-material visuals)
createVisualShape(GEOM_MESH) applies a single texture per mesh, so a
multi-material mesh (one OBJ/DAE packing many materials) renders with just one
texture tiled over everything. The campus env is a single 50-material OBJ and
showed one (a person) texture across the whole scene. Current workaround: the
world.force_color config flat-shades such a mesh grey (strips its mtllib
so no texture loads — see _obj_without_materials). Proper fix: split the mesh
by material into per-material sub-visuals (reusing the trimesh <submesh>
extraction path), one texture each. Moderate effort; helps any multi-material
environment.
Priority
Low — SDF parsing, DAE handling, and multi-material textures are all functional enough for demos (with the flat-shade workaround for multi-material meshes). These improvements primarily affect:
Visual fidelity (DAE→OBJ: textures and materials render correctly)
Maintainability (gz sdf -p: delete ~150 lines of hand-rolled XML conversion)
Robustness (both: fewer edge-case failures with complex models)
Environments
Simulation environment assets (warehouse floors, factory layouts, etc.):
pybullet-fleet-environmentspackage — Manage environment assets in a separate repository, installable viapip install pybullet-fleet-environmentsfor on-demand retrieval. Keep PyBulletFleet core lightweight by not bundling meshes.AWS RoboMaker Small Warehouse (MIT-0): DAE→OBJ converted meshes + URDF wrappers
Open-RMF rmf_demos maps (office, hotel, clinic, airport, campus): OBJ mesh export
pybullet_data bundled environments (kiva_shelf, samurai, stadium) wrappers
Original license clearly noted per environment
resolve_environment()API — Name resolution similar toresolve_model()for loading environments. Shows install hints when not installed.
CI / DevOps
GitHub Actions refactoring — Streamlined CI pipeline
Automated performance tracking — Run time / memory benchmarks in CI, auto-update results in documentation, and alert on significant performance regressions
Observability
Operational visibility for long benchmark runs and production deployments.
Shares the EventBus with Snapshot/Replay — same emission point, different sinks.
See docs/design/observability/spec.md for full design.
Structured logging (JSON) — Replace prefix-string
NamedLazyLoggerwithextra={...}dict fields (agent_id,object_id,sim_time,step). Enables Loki/Grafana label queries.sim_timeinjection filter —logging.Filterregistered byMultiRobotSimulationCoreauto-attachessim_timeandstepto every record. Removes per-call boilerplate on hot paths.Prometheus / OTLP metrics — Promote
DataMonitorinternals (pbf_step_duration_seconds,pbf_active_agents,pbf_actions_total{type,status},pbf_collision_pairs) to scrapable metrics. Opt-in to keep default cost zero.OpenTelemetry trace exporter — EventBus subscriber that converts
action.start/action.completeevent pairs into spans. Sampling allowed.causation_idaligned with OTelspan_idso Tempo “Trace to Logs” jumps to the matching JSONL event in Loki.Hot-path safety — All exporters use batch / async processors. Per-step spans are forbidden; Action-level granularity only.
LazyLogger.isEnabledFor()semantics preserved.
Long-Term: Backend Abstraction & Beyond PyBullet
Current architecture is tightly coupled to PyBullet (body_id, per-entity FFI calls). At 1000 agents PyBullet’s per-call overhead dominates step time (88% of 40.9 ms). The following items explore decoupling from PyBullet to unlock 10–100× performance gains while keeping the Python user API unchanged.
Phase 1: SimBackend ABC + Numpy Pure Kinematic Backend
SimBackend ABC — Abstract interface (
set_positions_batch,detect_collisions,load_model,step_physics) thatAgentandSimObjectprogram against instead of rawpybulletcallsNumpyBackend (default) — Contiguous numpy arrays for positions/orientations,
scipy.spatial.cKDTreefor collision. No physics engine dependency. Expected: 1000 agents in ~2–5 ms (RTF 20–50×), 5000+ agents at RTF > 1.0PyBulletBackend (compat) — Wraps existing PyBullet calls behind SimBackend ABC for backward compatibility and physics-mode users
URDF parsing without PyBullet — Use
yourdfpyor similar to parse URDF into internal model data, removing the last hard dependency on PyBullet for kinematic-only modeVisualization decoupling — Replace
p.GUIwith Rerun, Open3D, or RViz for rendering. Backend-agnostic scene display
Phase 2: Native Backend (Rust/C++ via PyO3/pybind11)
Only justified if Phase 1 numpy performance is insufficient (e.g., 5000+ agents at 240 Hz).
Rust kinematic core — Position update, yaw integration, AABB collision in Rust. Exposed to Python via PyO3. Expected: further 3–10× over numpy (RTF 100–300× for 1000 agents)
Batch collision in Rust — Sweep-and-prune or spatial hash for O(n log n) broad-phase, replacing Python KDTree
Zero-copy interop — Share numpy arrays directly with Rust via buffer protocol (no serialization)
Phase 3: GPU Backend (optional)
For 10,000+ agent scenarios or RL training workloads.
MuJoCo MJX backend — JAX-accelerated batch physics on GPU. URDF via MJCF conversion
JAX pure kinematic —
jax.numpydrop-in for NumpyBackend, JIT-compiled, GPU-parallel
Phase 4: ECS Architecture (v2.0 candidate)
Considered when entity diversity explodes beyond what Agent/SimObject OOP can express cleanly (e.g., drones + ground robots + conveyors + elevators + dynamic obstacles all in one scene with distinct component sets).
Component-based entity model — Replace Agent/SimObject inheritance with composable components (Transform, Collision, Kinematics, JointState, ActionQueue, etc.)
System pipeline — Each system (MovementSystem, CollisionSystem, ActionSystem) operates on component arrays. Maps naturally from current Controller ABC → System
Rust ECS runtime — Leverage
bevy_ecsorhecsfor cache-friendly SoA memory layout. Python API via PyO3 for user-facing logicMigration path — Controller ABC → System, EventBus → ECS Events, Registry → ECS resource/archetype queries. Plugin Architecture phases are designed as stepping stones toward this
Note: At this point the project may evolve beyond “PyBulletFleet” in name, as PyBullet would be just one optional backend among several, and the core would no longer depend on it. A rename (e.g., FleetSim, KinematicFleet) may be appropriate.