# Roadmap Planned additions and improvements for PyBulletFleet. Items are grouped by category; ordering within a group does not imply priority. ## Assets New robot and infrastructure models: - **Physics Mobile Robot** — Wheeled robot driven by PyBullet physics (motor torques, friction, contact forces) - **Physics Mobile Manipulator** — Physics-mode mobile manipulator with motor-driven base and arm - **Conveyor / Elevator / Mobile Rack** — Warehouse infrastructure entities for material handling scenarios ## Features - **Snapshot, Event log & Replay** *(EventBus steps 1–2 done ✅; steps 3–6 pending)* — Three-layer architecture sharing a single EventBus as data source. Enables replay, external synchronization ([USO](https://github.com/yuokamoto/Unified-Simulation-Orchestrator) integration), and downstream observability (see [Observability](#observability) below). **Layer overview:** | Layer | Purpose | Granularity | Lossless? | Replayable? | |---|---|---|---|---| | **Snapshot** | State checkpoint for seek / restore | Low frequency (e.g. 1 Hz) | Yes | Yes (state restore) | | **Event log** | Causal record of state transitions (input replay, audit) | Per state-change | Yes (must) | Yes (input replay) | | **Trace** | Operational observability (Grafana/Tempo) | Action-level spans | Sampling allowed | No (view only) | **Design principle: same source, separate sinks.** All three are derived from the EventBus (`docs/design/eventbus/spec.md`) — emit once, fan out to multiple subscribers. Snapshot ≠ Event (different schemas, separate files), but share common header keys (`sim_time`, `step`, `wall_time`, `run_id`) so they can be merged on the time axis. **Why trace alone cannot replay:** trace is sampled, coarse-grained (Action-level spans), drops non-deterministic inputs (RNG seed, callback returns), and has attribute-size limits unsuitable for `Path` waypoints / IK results. Replay needs Event log (input) + Snapshot (checkpoint). **Output layout:** ``` run_/ snapshots.jsonl # low-frequency full+delta state events.jsonl # lossless state-change events # OTel spans → Tempo (separate, optional, sampled) ``` **Implementation order:** 1. EventBus core + `SimEvent` dataclass (see `docs/design/eventbus/spec.md`) 2. Publish from `Action` state transitions, `add_object` / `remove_object`, collision Enter/Exit 3. JSONL event writer subscriber (lossless input log) 4. `SimulationSnapshot` + `MultiRobotSimulationCore.snapshot()` / `restore()` (see `docs/design/snapshot-replay/spec.md`) 5. `replay.py`: snapshot + event log → re-execution 6. OTel exporter subscriber (covered in Observability section, independent of replay) Steps 3–6 are independent once 1–2 are in place. - **Behavior tree integration** — Create agent behavior from behavior trees - **`SimObject.from_sdf()` → `List[SimObject]`** ✅ — Factory method that loads an SDF file via `p.loadSDF()` and wraps each returned body_id in a `SimObject`. Collision detection and lifecycle management via `add_object()` are applied automatically. Required for Open-RMF SDF environment loading and official support for pybullet_data SDF models (kiva_shelf, wsg50_gripper, etc.). ### Crowd Simulation (lightweight) Lightweight crowd simulation for NPC pedestrians in RMF demo environments. Provides more realistic pedestrian behavior than the current `RandomWalkController` (random point within radius) while staying far simpler than Gazebo's full Menge/ORCA `crowd_simulator` plugin. **Motivation:** RMF demos (airport_terminal) use Gazebo's `crowd_simulator` with Menge BFSM + ORCA + nav mesh for 10 pedestrians. PyBulletFleet currently substitutes `RandomWalkController` which has no obstacle avoidance and no agent-agent interaction. A middle-ground is useful for visual fidelity in demo environments without the complexity of a full crowd simulation framework. **Scope:** | Feature | RandomWalkController (current) | Proposed | Gazebo crowd_simulator | |---------|-------------------------------|----------|----------------------| | Obstacle avoidance | None | Simple raycast or AABB check | ORCA + nav mesh | | Agent-agent avoidance | None | Separation force (boids-style) | ORCA mutual | | Goal selection | Uniform random in radius | Weighted goal sets (configurable) | BFSM state machine | | Nav mesh | Not required | Not required | Required | | Config complexity | 2 params | ~5 params (YAML) | 3 XML files | | Computation | ~0 | O(n·k) neighbor lookup | O(n²) ORCA | **Approach:** A new `CrowdController` (or enhancement of `RandomWalkController`) that adds: 1. **Goal sets** — Named groups of positions with transition weights (like Menge BFSM but in YAML) 2. **Simple separation** — Repulsive force from nearby agents (boids separation rule, not full ORCA) 3. **Static obstacle avoidance** — Optional raycast or AABB query to avoid walls (PyBullet `rayTest`) **Priority:** Low — current `RandomWalkController` is functional for demo purposes. This is primarily for RMF demo visual fidelity (再現度). #### Crowd-as-dynamic-obstacle throughput benchmark The higher-value use of a crowd is not visual fidelity but **turning crowd density into a benchmark variable**: measure how much a fleet's *task* throughput degrades when robots must contend with pedestrians. Pure visual pedestrians (a human mesh wandering) have little functional value here, because the fleet is RMF-path-driven and does not avoid arbitrary dynamic obstacles — the NPCs are currently just eye-candy. Making them *dynamic obstacles the fleet reacts to* is what aligns with PyBulletFleet's strength (throughput/scale benchmarking at hundreds of robots, a regime too heavy for Gazebo). **Three pieces:** 1. **Crowd as dynamic obstacles** — the lightweight `CrowdController` above, with pedestrian positions known to the sim. 2. **Robot reaction** — - *Minimal:* "pedestrian within X m ahead → slow/stop" via PyBullet `rayTest` / collision checks. No RMF changes; measures stop-and-wait delay. - *Full:* publish detected pedestrians as `rmf_obstacle_msgs/Obstacles` so RMF reroutes / closes lanes (heavier; integrates with `rmf_visualization_obstacles` and the obstacle pipeline). 3. **Task-throughput harness** — dispatch N patrol/delivery tasks and measure completion time / completions-per-hour. This is a **new metric** distinct from the existing RTF/step-time benchmarks (which measure sim performance, not fleet task throughput). **Experiment:** sweep crowd density × avoidance{off, stop, reroute} → task throughput, at 100–1000 robots. Reuses existing collision detection, `rayTest`, EventBus, and the benchmark harness scaffolding. **Priority:** Low / research — schedule after the ROSCon minimum (working demos + packaging). Start with the minimal reaction (stop-on-detect, no RMF changes). ### Devices See `ros2_bridge/README.md` for device enhancements (elevator doors, double-hinge doors, xacro-parameterised URDFs). ## RMF Demo Feature Coverage Gaps between the standard Gazebo `rmf_demos` and the PyBulletFleet bridge. Core flows (navigation, doors, **lifts**, delivery pick/drop, battery/charging, **cleaning coverage**, **georeferenced maps**) are implemented. Ported demos: office, clinic, hotel, airport_terminal, battle_royale, **campus**. Outstanding: **Demos not ported — require an RMF mode the bridge doesn't implement:** - `office_mock_traffic_light` and `triple_H` — both use the **`EasyTrafficLight`** adapter (the robot owns navigation; RMF only gates it with go/stop). This inverts the bridge's full-control model and is a poor fit for the kinematic, RMF-planned sim — see the [TrafficLight discussion]. Relevant only once a co-simulation/robot-proxy layer exists. Run them in Gazebo for now. **RMF tasks/actions — partial** (`fleet_adapter.execute_action`): - **dock** — treated as a plain navigate (no dedicated dock-path approach; endpoint is reached, only the precise approach trajectory is skipped — negligible in a kinematic sim). - Other custom performable actions — logged and finished, not executed. **Substituted** (works, but differs from Gazebo): - Crowd — Menge/ORCA `crowd_simulator` → lightweight `RandomWalkController` (no avoidance). See [Crowd Simulation](#crowd-simulation-lightweight). - Caddy — Gazebo keyboard teleop (`diff_drive` plugin) → autonomous `PatrolController`. **Not modeled by design** (kinematic focus): - Robot sensors (lidar/camera), physics dynamics (wheel torque/contact). - `rmf_obstacle` detection pipeline (also unused in stock Gazebo demos; see the "Crowd-as-dynamic-obstacle throughput benchmark" section above). ## Interfaces External communication layers: - **ROS 2** — Topic / service / action bridge for ROS 2 ecosystem integration (see `ros2_bridge/README.md`) - **gRPC** — Language-agnostic RPC interface for orchestrators, WMS, and fleet managers - **Distributed co-simulation (Robot Proxy layer)** — Per-robot proxy processes that translate between simulated robots and real Robot Apps (Nav2, task assigners, BTs). Enables running 100+ unmodified single-robot software stacks against a centralized batched simulator. Sim Central stays single-process and batched; only a thin fixed-schema boundary (`StateSnapshot` + `CommandBuffer` + Events) crosses the IPC. **Shares schema with Snapshot/Replay** — same `StateSnapshot` dataclass feeds live IPC, replay log, and observability sinks (define schema once, fan out to multiple consumers). See `docs/design/co-simulation/spec.md` for full design including layer separation, transport options (shared memory / gRPC / DDS), lockstep vs async sync modes, and 6-phase implementation plan. ## Refactoring - **`MultiRobotSimulationCore` responsibility decomposition** — The core class has grown to ~3200 lines / 66 methods. Extract self-contained subsystems into dedicated classes composed into the core: | Subsystem | Target class | Methods to extract | Est. lines | |---|---|---|---| | Collision detection (spatial hash, AABB, broad/narrow phase) | `CollisionSystem` | `check_collisions`, `filter_aabb_pairs`, `_update_object_spatial_grid`, … (~20 methods) | ~1000 | | Visualizer (camera, transparency, keyboard) | `VisualizerController` | `configure_visualizer`, `setup_camera`, `_handle_keyboard_events`, … (~6 methods) | ~250 | | Profiling & memory tracking | `SimProfiler` | `record_profiling`, `_print_profiling_summary`, `_print_memory_profiling_summary`, … (~5 methods) | ~200 | Each subsystem is held by composition (`self._collision = CollisionSystem(self)`). The core class delegates; public API does not change. Natural stepping stone toward the SimBackend ABC (Long-Term Phase 1). - **Move `controller.py` into `controllers/` package** — Base controller classes (`Controller`, `KinematicController`, `OmniController`, `DifferentialController`, `create_controller`, `register_controller`) live in `pybullet_fleet/controller.py` while higher-level controllers (`PatrolController`, `RandomWalkController`) are already in `pybullet_fleet/controllers/`. Consolidate by moving the base module into the package as `controllers/base.py` (or splitting omni/differential into separate files) and re-exporting from `controllers/__init__.py`. Preserve the `from pybullet_fleet.controller import ...` path via a compatibility shim or `__init__.py` re-export. - **Remove scipy dependency** — Currently only `scipy.spatial.transform.Rotation` is used (9 call sites for quat↔euler, quat↔matrix, relative rotation). Replace with PyBullet utilities + lightweight helpers in `geometry.py` to eliminate the ~150 MB transitive dependency. Low priority: no runtime performance impact, only install size. - **Manual quaternion helpers in `geometry.py`** — Extend the pattern established by `SlerpPrecomp` / `quat_slerp` (avoiding scipy, hand-written scalar math) by adding the following helpers to `geometry.py`. The goal is to eliminate scipy `Rotation` object creation overhead on hot paths. Required helper functions: - `quat_rotate_vector(q, v)` — Rotate vector `v` by quaternion `q` (body→world transform). Replaces `Rotation.from_quat(q).apply(v)` - `quat_multiply(q1, q2)` — Quaternion product. Replaces `(Rotation.from_quat(q1) * Rotation.from_quat(q2)).as_quat()` - `quat_from_rotvec(rotvec)` — Quaternion from rotation vector. Replaces `Rotation.from_rotvec(rotvec).as_quat()` (used in angular velocity → orientation update) Primary application sites: - `OmniVelocityController._apply_velocity()` — body→world velocity transform + quaternion update from angular velocity - `tools.body_to_world_velocity_3d()` — same as above - `Path._calculate_orientation_for_plane()` — rotation matrix → quaternion conversion - `Path.visualize_waypoints()` — quaternion → rotation matrix conversion Design policy: - Pure Python scalar math (`math.sin`/`math.cos`) or small numpy array ops - Same file and style as `SlerpPrecomp` - Apply incrementally after profiling confirms bottleneck (YAGNI) - Full scipy removal in a separate PR after all call sites are replaced ## GUI / Interactive Tools Improve `p.GUI` interactivity for development and debugging. Long-term plan is to replace `p.GUI` with Rerun (see [Long-Term Phase 1](#long-term-backend-abstraction--beyond-pybullet)) where many of these capabilities come built-in. These items are the **interim layer** until that migration lands, and naturally collect under the planned `VisualizerController` (see Refactoring). Current state: - `sim.pause()` / `resume()` Python API ✅ - `SPACE` keybinding for pause toggle ✅ - `pause` / `resume` events on EventBus ✅ - `CameraController` with right-drag pan, zoom, top-down view ✅ Planned: - **Pause / Step / Speed control panel** — In-window UI for pause/resume, single-step, and real-time multiplier. Short-term: `p.addUserDebugParameter` sliders/buttons. Mid-term: tkinter side panel composed with `DataMonitor`. - **SimObject inspector panel** — Live list of all `_agents` / `_sim_objects` with `object_id`, `name`, current `Pose`, action queue length, and status. Click-to-select with viewport highlight. - **Camera focus on selection** — Selected entity becomes camera target; auto-follow toggle; distance / yaw / pitch sliders. Replaces the current static `setup_camera`. - **Object manipulation** — Drag selected `SimObject` in XY plane, edit pose via numeric input, delete via UI (calls `remove_object()` properly so collision/EventBus stay consistent). Respects `pickable` flag. - **Action queue inspector** — Per-Agent visualization of pending and active actions (type, status, target). Right-click to cancel an action. - **Keyboard shortcut overlay** — Help panel listing all bound keys (`SPACE`, future shortcuts), surfaced via a `?` key. Implementation notes: - All new GUI affordances must be **opt-out via `SimulationParams(gui=False)`** so headless / CI paths stay zero-cost. - UI state changes must publish through the EventBus (`gui.select`, `gui.manipulate`, `gui.action_cancel`) so plugins / recordings see them. - Avoid per-step UI redraws — throttle to ≤30 Hz independent of sim FPS. ## Performance Near-term optimizations within the current PyBullet-backed architecture. Goal: reduce per-step cost at 100–1000 agents without a full backend swap (see Long-Term section for that). ### Profiling Baseline (measured 2026-04-03) Benchmark: 500 agents, omnidirectional MoveAction, `collision_check_frequency=0` (disabled), `physics=False`, `simple_cube.urdf`. **Per-step breakdown (median):** | Component | Time | % of total | What | |---|---|---|---| | Python overhead (TPI + slerp + action queue + set_pose logic) | 11.5 ms | 88% | CPython for-loop + object dispatch | | `p.resetBasePositionAndOrientation` | 0.7 ms | 5% | PyBullet C API | | `p.getAABB` | 0.5 ms | 4% | PyBullet C API | | Movement detection | 0.3 ms | 3% | Pure Python arithmetic | | **Total** | **13.0 ms** | 100% | **FPS 77** | Key insight: **88% of step time is Pure Python overhead**, not C API calls. Collision at 10 Hz adds only ~0.2 ms at 500 agents — negligible compared to agent_update. **Vectorization micro-benchmark (500 agents):** | Operation | Python for-loop | NumPy vectorized | Speedup | |---|---|---|---| | Position compute (`start + dir × ratio`) | 1,150 μs | 5.6 μs | **205×** | | TPI-like trapezoidal profile | (per-agent) | 33 μs | — | | Per-agent cost | 23 μs/agent | ~0.01 μs/agent | — | ### Two-Phase Step: Decouple Computation from PyBullet C API ✅ (batch controller path) Current `step_once()` iterates agents one-by-one, each calling `controller.compute()` (Python/NumPy) → `set_pose_raw()` (PyBullet C API + AABB update + spatial grid) interleaved. This prevents vectorization and adds per-agent Python↔C crossing overhead. **Proposed split:** | Phase | What | Hot path | |-------|------|----------| | **Phase 1 — Compute** | All controllers compute new poses; no side effects | Pure Python / NumPy | | **Phase 2 — Apply** | Tight loop of `p.resetBasePositionAndOrientation()` only | PyBullet C calls | | **Phase 3 — Bookkeep** | Batch AABB refresh (`p.performCollisionDetection()` + `p.getAABB()`) and spatial grid update | C calls + Python dict | This requires **removing direct `pybullet` API calls from `sim_object.py`, `agent.py`, and `controller.py`**. Instead, these modules produce *pose intents* (position + orientation tuples), and `core_simulation.py` flushes them to PyBullet in bulk. Key changes: - `SimObject.set_pose()` / `set_pose_raw()` writes to an internal buffer (cached pose + dirty flag) without calling `p.resetBasePositionAndOrientation()` - `Controller.compute()` returns `(new_pos, new_orn)` or writes to agent's pending pose buffer - `core_simulation.step_once()` collects dirty poses → batch `resetBasePositionAndOrientation` → batch AABB update - Movement detection stays pure Python (already cached-pose-based), unaffected - Attached-object propagation runs after Phase 2 using the buffered parent poses ### Vectorized Agent Update (NumPy Batch) ✅ For the common "N agents on straight-line TPI paths" case, Phase 1 can be further vectorized: - Store all active agents' `forward_start_pos`, `forward_direction`, and TPI parameters in contiguous `(N, 3)` NumPy arrays - Compute `new_positions = start_positions + directions * ratios[:, np.newaxis]` in one vectorized call - Slerp batch: pre-compute all `(start_quat, target_quat, t_fraction)` and batch `quat_slerp` - Fallback: agents with non-standard controllers (velocity mode, custom callbacks) use the existing per-agent path This is a "BatchController" or "VectorizedOmniController" that sits alongside the existing `Controller` ABC. ### C++ Extensions for Hot-Path Functions Profile-guided candidates for C++ (via pybind11) or Cython acceleration: | Function | Current | Why C++ helps | |----------|---------|---------------| | **TwoPointInterpolation** | Pure Python `math` | Called N× per step; tight numerical loop ideal for native code | | **`quat_slerp` / `quat_slerp_precompute`** | Python scalar math in `geometry.py` | N× per step; SIMD-friendly | | **Spatial hash broad-phase** | Python dict + set ops in `check_collisions()` | Dict overhead at 1000+ objects; Rust/C++ hash map faster | | **AABB overlap test** | Python comparisons in `_aabb_overlap_2d` | Tight inner loop; autovectorizable in C++ | | **`getClosestPoints` narrow-phase** | PyBullet C API (already native) | Already fast; not a candidate | Priority: TPI and slerp first (highest call frequency), then spatial hash (scales with agent count²). ### Deferred AABB Update Currently `set_pose()` calls `p.getAABB()` and updates the spatial grid **per object, immediately**. For kinematic-only mode: - Defer all AABB updates to a single `p.performCollisionDetection()` call after Phase 2 - Batch `p.getAABB()` for all moved objects at once - Rebuild spatial grid once per step instead of incrementally per-object This removes N `p.getAABB()` C-API round-trips from the set_pose hot path. ### Summary: Expected Impact (500 agents, measured baseline) | Optimization | Estimated step time | Estimated FPS | Speedup vs current | Effort | |---|---|---|---|---| | **Current** | 13.0 ms | 77 | 1.0× | — | | **NumPy vectorized controller** | ~1.2 ms | ~850 | **~11×** | Medium | | **+ C++ TPI/slerp extensions** | ~1.0 ms | ~1,000 | **~13×** | Medium | | **Theoretical floor** (C API only) | 0.7 ms | ~1,400 | ~18× | — | Scaling by agent count (current → vectorized estimate): | Agents | Current FPS | Est. vectorized FPS | Speedup | |---|---|---|---| | 100 | 225 | ~2,000+ | ~9× | | 500 | 77 | ~850 | ~11× | | 1,000 | 27 | ~450 | ~17× | > **Note:** Collision detection (10 Hz spatial hash) adds <1% overhead at 500 agents. C++ spatial hash becomes relevant at 1,000+ agents with higher collision frequencies. > **Relation to Long-Term Backend Abstraction:** The two-phase split and the "remove direct pybullet calls" refactoring are natural stepping stones toward the SimBackend ABC (Phase 1 of Long-Term). The buffered-pose pattern becomes the write side of `SimBackend.set_positions_batch()`. ## SDF & DAE Support Improvements PyBulletFleet currently has two self-implemented workarounds for Gazebo ecosystem interop: 1. **`resolve_sdf_to_urdf`** — Hand-rolled SDF→URDF XML converter (PyBullet cannot load SDF directly) 2. **DAE defensive fallbacks** — Colour extraction, texture symlinks, collision try/except (PyBullet has poor DAE support) Both are functional but hacky. This section tracks improvements to make them more robust or replace them entirely. ### SDF → URDF Conversion: `gz sdf -p` Replacement `resolve_sdf_to_urdf()` in `sdf_loader.py` is a minimal self-implemented SDF→URDF converter using `xml.etree`. It covers only ``, ``, ``, and primitive geometries — enough for simple robots (DeliveryRobot) but fragile for complex SDF models. **Why self-implemented?** No usable Python SDF→URDF tool exists: - `sdformat_urdf` — C++ only, no Python bindings - `pysdf` (PyPI) — fails to build on Python 3.12+ - `libsdformat14` — C++ template API, not callable via ctypes - `gz sdf -p model.sdf` — converts SDF→URDF via CLI subprocess (see evaluation below) **`gz sdf -p` evaluation (Gazebo Harmonic / sdformat 14, as of 2026-04):** | | Self-implemented parser | `gz sdf -p` | |---|---|---| | Dependencies | None (`xml.etree` stdlib) | `sdformat14` + `gz-tools` (~200 MB) | | Coverage | Minimal (link/joint/mesh only) | Broader (frames, nested models partial) | | Edge cases | Fix ourselves immediately | Wait for upstream fix | | DAE colour injection | `_extract_dae_diffuse_color` embeds into URDF | Not supported (no colour in output URDF) | | Nested `` | Not supported | Partial support | | Known issues | Pose hierarchy may drift for deep nesting | Nested model bugs, some pose frame resolution errors | | Stability | Tested for our models | "Best effort" — known regressions between versions | **Verdict:** `gz sdf -p` is **usable but not fully stable**. For simple robots it works; for complex SDF it has regressions. The dependency cost (~200 MB) is also significant. **Current self-implemented parser is the pragmatic choice.** **Future action:** - Monitor `gz sdf -p` stability across Gazebo releases - If it stabilises + we already have `gz-tools` in Docker for other reasons → switch to subprocess call and delete `resolve_sdf_to_urdf()` - If we support complex SDF robots (nested models, sensors) → worth the dependency ### DAE → OBJ Automatic Conversion PyBullet has poor DAE (COLLADA) support: textures are not loaded, diffuse colours are ignored, and some meshes fail `createCollisionShape`. Gazebo Fuel models ship exclusively as DAE. **Proposal:** Use `assimp export` (from `assimp-utils`, ~2 MB) to convert DAE meshes to OBJ + MTL at download time. OBJ is PyBullet's best-supported mesh format — colours, textures via MTL, and collision all work reliably. Changes needed: 1. **`docker/Dockerfile.rmf_demos`** — `apt install assimp-utils` 2. **`docker/download_fuel_models.py`** — Post-download `assimp export *.dae *.obj` conversion step 3. **`pybullet_fleet/sdf_loader.py`** — Prefer `.obj` over `.dae` when both exist (`mesh_path.replace(".dae", ".obj")`) **Simplification benefit:** Once all DAE meshes have OBJ counterparts, three defensive workarounds in `sdf_loader.py` become unnecessary and can be removed: | Current workaround | Why it exists | Removed after conversion | |---|---|---| | `is_fuel` flag on `_IncludeInfo` | Fuel DAE has embedded colours; local OBJ does not → different `rgba_color` handling | OBJ+MTL always carries material info → uniform `rgba_color=None` for all models | | `_ensure_texture_symlinks()` | DAE references textures by bare filename; PyBullet resolves relative to `meshes/` not `materials/textures/` | OBJ+MTL uses relative paths that resolve correctly | | `try/except` around `createCollisionShape` in `sim_object.py` | Some DAE meshes crash PyBullet's collision shape loader | OBJ collision shapes load reliably | The `is_fuel` flag would be replaced by a mesh-format check (e.g. "does an `.obj` file exist alongside the `.dae`?"), making the loader source-agnostic — any DAE model from any provider would be handled the same way. ### Multi-material mesh textures (per-material visuals) `createVisualShape(GEOM_MESH)` applies a **single texture per mesh**, so a multi-material mesh (one OBJ/DAE packing many materials) renders with just one texture tiled over everything. The campus env is a single 50-material OBJ and showed one (a person) texture across the whole scene. Current workaround: the `world.force_color` config flat-shades such a mesh grey (strips its `mtllib` so no texture loads — see `_obj_without_materials`). Proper fix: split the mesh by material into per-material sub-visuals (reusing the trimesh `` extraction path), one texture each. Moderate effort; helps any multi-material environment. ### Priority Low — SDF parsing, DAE handling, and multi-material textures are all functional enough for demos (with the flat-shade workaround for multi-material meshes). These improvements primarily affect: - **Visual fidelity** (DAE→OBJ: textures and materials render correctly) - **Maintainability** (gz sdf -p: delete ~150 lines of hand-rolled XML conversion) - **Robustness** (both: fewer edge-case failures with complex models) ## Environments Simulation environment assets (warehouse floors, factory layouts, etc.): - **`pybullet-fleet-environments` package** — Manage environment assets in a separate repository, installable via `pip install pybullet-fleet-environments` for on-demand retrieval. Keep PyBulletFleet core lightweight by not bundling meshes. - AWS RoboMaker Small Warehouse (MIT-0): DAE→OBJ converted meshes + URDF wrappers - Open-RMF rmf_demos maps (office, hotel, clinic, airport, campus): OBJ mesh export - pybullet_data bundled environments (kiva_shelf, samurai, stadium) wrappers - Original license clearly noted per environment - **`resolve_environment()` API** — Name resolution similar to `resolve_model()` for loading environments. Shows install hints when not installed. ## CI / DevOps - **GitHub Actions refactoring** — Streamlined CI pipeline - **Automated performance tracking** — Run time / memory benchmarks in CI, auto-update results in documentation, and alert on significant performance regressions ## Observability Operational visibility for long benchmark runs and production deployments. **Shares the EventBus with Snapshot/Replay** — same emission point, different sinks. See `docs/design/observability/spec.md` for full design. - **Structured logging (JSON)** — Replace prefix-string `NamedLazyLogger` with `extra={...}` dict fields (`agent_id`, `object_id`, `sim_time`, `step`). Enables Loki/Grafana label queries. - **`sim_time` injection filter** — `logging.Filter` registered by `MultiRobotSimulationCore` auto-attaches `sim_time` and `step` to every record. Removes per-call boilerplate on hot paths. - **Prometheus / OTLP metrics** — Promote `DataMonitor` internals (`pbf_step_duration_seconds`, `pbf_active_agents`, `pbf_actions_total{type,status}`, `pbf_collision_pairs`) to scrapable metrics. Opt-in to keep default cost zero. - **OpenTelemetry trace exporter** — EventBus subscriber that converts `action.start` / `action.complete` event pairs into spans. Sampling allowed. `causation_id` aligned with OTel `span_id` so Tempo "Trace to Logs" jumps to the matching JSONL event in Loki. - **Hot-path safety** — All exporters use batch / async processors. Per-step spans are forbidden; Action-level granularity only. `LazyLogger.isEnabledFor()` semantics preserved. ## Long-Term: Backend Abstraction & Beyond PyBullet Current architecture is tightly coupled to PyBullet (`body_id`, per-entity FFI calls). At 1000 agents PyBullet's per-call overhead dominates step time (88% of 40.9 ms). The following items explore decoupling from PyBullet to unlock 10–100× performance gains while keeping the Python user API unchanged. ### Phase 1: SimBackend ABC + Numpy Pure Kinematic Backend - **SimBackend ABC** — Abstract interface (`set_positions_batch`, `detect_collisions`, `load_model`, `step_physics`) that `Agent` and `SimObject` program against instead of raw `pybullet` calls - **NumpyBackend (default)** — Contiguous numpy arrays for positions/orientations, `scipy.spatial.cKDTree` for collision. No physics engine dependency. Expected: 1000 agents in ~2–5 ms (RTF 20–50×), 5000+ agents at RTF > 1.0 - **PyBulletBackend (compat)** — Wraps existing PyBullet calls behind SimBackend ABC for backward compatibility and physics-mode users - **URDF parsing without PyBullet** — Use `yourdfpy` or similar to parse URDF into internal model data, removing the last hard dependency on PyBullet for kinematic-only mode - **Visualization decoupling** — Replace `p.GUI` with Rerun, Open3D, or RViz for rendering. Backend-agnostic scene display ### Phase 2: Native Backend (Rust/C++ via PyO3/pybind11) Only justified if Phase 1 numpy performance is insufficient (e.g., 5000+ agents at 240 Hz). - **Rust kinematic core** — Position update, yaw integration, AABB collision in Rust. Exposed to Python via PyO3. Expected: further 3–10× over numpy (RTF 100–300× for 1000 agents) - **Batch collision in Rust** — Sweep-and-prune or spatial hash for O(n log n) broad-phase, replacing Python KDTree - **Zero-copy interop** — Share numpy arrays directly with Rust via buffer protocol (no serialization) ### Phase 3: GPU Backend (optional) For 10,000+ agent scenarios or RL training workloads. - **MuJoCo MJX backend** — JAX-accelerated batch physics on GPU. URDF via MJCF conversion - **JAX pure kinematic** — `jax.numpy` drop-in for NumpyBackend, JIT-compiled, GPU-parallel ### Phase 4: ECS Architecture (v2.0 candidate) Considered when entity diversity explodes beyond what Agent/SimObject OOP can express cleanly (e.g., drones + ground robots + conveyors + elevators + dynamic obstacles all in one scene with distinct component sets). - **Component-based entity model** — Replace Agent/SimObject inheritance with composable components (Transform, Collision, Kinematics, JointState, ActionQueue, etc.) - **System pipeline** — Each system (MovementSystem, CollisionSystem, ActionSystem) operates on component arrays. Maps naturally from current Controller ABC → System - **Rust ECS runtime** — Leverage `bevy_ecs` or `hecs` for cache-friendly SoA memory layout. Python API via PyO3 for user-facing logic - **Migration path** — Controller ABC → System, EventBus → ECS Events, Registry → ECS resource/archetype queries. Plugin Architecture phases are designed as stepping stones toward this > **Note:** At this point the project may evolve beyond "PyBulletFleet" in name, as PyBullet would be just one optional backend among several, and the core would no longer depend on it. A rename (e.g., *FleetSim*, *KinematicFleet*) may be appropriate.