

Key Innovation: SHAC++ replaces direct simulator differentiation with learned gradient approximations. While SHAC requires both a differentiable transition function F and a differentiable reward function R, SHAC++ approximates these with neural networks Fθ and Rθ, enabling application to non-differentiable environments.
SHAC++ addresses the limitations of SHAC by approximating the gradients that a non-differentiable simulator cannot provide. The key insight is to train networks Fθ and Rθ to approximate the transition and reward functions, respectively, alongside the policy and value networks, so that policy gradients are backpropagated through these learned surrogates rather than through the simulator itself.
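As a concrete illustration of this idea, the sketch below fits surrogate networks Fθ and Rθ by simple regression on transitions collected from the non-differentiable simulator. It is a minimal PyTorch sketch under assumed state/action sizes, layer widths, and an MSE loss; the actual SHAC++ architectures and training objectives may differ.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the actual environments define these.
OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128

def mlp(in_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, out_dim),
    )

# F_theta: (state, action) -> next state;  R_theta: (state, action) -> reward
F_theta = mlp(OBS_DIM + ACT_DIM, OBS_DIM)
R_theta = mlp(OBS_DIM + ACT_DIM, 1)

opt = torch.optim.Adam(list(F_theta.parameters()) + list(R_theta.parameters()), lr=1e-3)

def fit_surrogates(states, actions, next_states, rewards):
    """One regression step on a batch of simulator transitions (s, a, s', r)."""
    sa = torch.cat([states, actions], dim=-1)
    loss = nn.functional.mse_loss(F_theta(sa), next_states) \
         + nn.functional.mse_loss(R_theta(sa).squeeze(-1), rewards)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random placeholder data standing in for real rollouts.
B = 256
fit_surrogates(torch.randn(B, OBS_DIM), torch.randn(B, ACT_DIM),
               torch.randn(B, OBS_DIM), torch.randn(B))
```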
SHAC Limitations
- Requires fully differentiable environments
- Limited to single-agent scenarios
- Unstable gradients in complex dynamics
- Inapplicable to sparse reward functions
SHAC++ Advantages
- Works with non-differentiable simulators
- Extends to multi-agent environments
- Robust gradient approximations
- Handles complex reward structures
🧠 Neural Networks
- Policy Networks: πθ for action selection
- Value Network: Vθ for state evaluation
- Transition Network: Fθ for dynamics approximation
- Reward Network: Rθ for reward prediction
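To make the four roles above concrete, here is a minimal PyTorch sketch of the modules. The layer widths, the tanh-squashed Gaussian policy head, and the learned log-std are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128  # hypothetical sizes

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, out_dim))

class GaussianPolicy(nn.Module):
    """pi_theta: maps a state to a tanh-squashed Gaussian action."""
    def __init__(self):
        super().__init__()
        self.mean = mlp(OBS_DIM, ACT_DIM)
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        return torch.tanh(dist.rsample())  # rsample keeps the action differentiable

pi_theta = GaussianPolicy()                   # action selection
V_theta  = mlp(OBS_DIM, 1)                    # state evaluation
F_theta  = mlp(OBS_DIM + ACT_DIM, OBS_DIM)    # dynamics approximation (s, a) -> s'
R_theta  = mlp(OBS_DIM + ACT_DIM, 1)          # reward prediction (s, a) -> r

obs = torch.randn(8, OBS_DIM)                 # a batch of hypothetical observations
print(pi_theta(obs).shape, V_theta(obs).shape)  # (8, 4) and (8, 1)
```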
⚙️ Training Details
- Optimizer: Adam with learning rate 1e-3
- Episodes: 512 environments × 32 steps
- Early Stopping: triggered once 90% of the maximum reward is reached in 90% of episodes
- Hardware: V100/A100 GPUs, 1-8 hours per run
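These settings might translate into a configuration and stopping rule like the sketch below. The field names are hypothetical, and the early-stopping check assumes the criterion means 90% of the maximum episode reward reached in 90% of episodes.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values taken from the training details above; names are illustrative.
    lr: float = 1e-3          # Adam learning rate
    num_envs: int = 512       # parallel environments per rollout
    horizon: int = 32         # steps collected per environment per update
    reward_frac: float = 0.9  # stop at 90% of the maximum episode reward ...
    episode_frac: float = 0.9 # ... achieved in 90% of episodes

def should_stop(episode_returns, max_return, cfg: TrainConfig) -> bool:
    """Early-stopping rule: enough episodes are close enough to the best return."""
    hits = sum(r >= cfg.reward_frac * max_return for r in episode_returns)
    return hits >= cfg.episode_frac * len(episode_returns)

# Example: 470 of 512 episodes above the threshold -> 470/512 ~ 0.92 >= 0.9, so stop.
print(should_stop([95.0] * 470 + [10.0] * 42, max_return=100.0, cfg=TrainConfig()))
```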
Multi-Agent Extensions
SHAC++ extends naturally to multi-agent settings through:
- Shared Parameters: Policy networks share parameters across agents
- Transformer Architecture: Handles variable agent numbers with positional invariance
- Cooperative Learning: Joint optimization enables emergent coordination
- Scalable Design: Linear complexity with respect to agent count (up to attention limits)
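A minimal sketch of such a shared policy is shown below: a single transformer encoder consumes one token per agent (with no positional encoding, so agent ordering carries no meaning) and a shared head emits one action per agent, for any number of agents. Model width, head count, layer count, and the deterministic tanh output are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, D_MODEL = 16, 4, 64  # hypothetical sizes

class SharedMultiAgentPolicy(nn.Module):
    """One parameter set shared by all agents; works for any number of agents."""
    def __init__(self, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(OBS_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=n_heads,
                                           batch_first=True)
        # No positional encoding is added, so the encoder treats agents as a set.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(D_MODEL, ACT_DIM)

    def forward(self, obs):                  # obs: (batch, num_agents, OBS_DIM)
        tokens = self.embed(obs)             # (batch, num_agents, D_MODEL)
        mixed = self.encoder(tokens)         # attention lets agents coordinate
        return torch.tanh(self.head(mixed))  # one action per agent

policy = SharedMultiAgentPolicy()
actions_3 = policy(torch.randn(8, 3, OBS_DIM))  # 3 agents
actions_7 = policy(torch.randn(8, 7, OBS_DIM))  # 7 agents, same parameters
print(actions_3.shape, actions_7.shape)         # (8, 3, 4) and (8, 7, 4)
```

Because the same parameters process every agent token, adding agents changes only the input shape, not the model.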
The framework trains four neural networks simultaneously: policy networks πθ, value network Vθ, transition network Fθ, and reward network Rθ. This enables gradient-based policy optimization even when the underlying simulator components are non-differentiable.
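Putting the pieces together, the sketch below shows the core policy update: a short-horizon rollout is unrolled entirely through the learned Fθ and Rθ instead of the simulator, with Vθ bootstrapping the return beyond the horizon, so the estimated return is differentiable with respect to the policy parameters. The deterministic policy, fixed discount, and omission of the value and surrogate training steps are simplifications for illustration.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128   # hypothetical sizes
HORIZON, GAMMA = 32, 0.99               # short-horizon length and discount

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, out_dim))

pi_theta = mlp(OBS_DIM, ACT_DIM)             # policy (deterministic here for brevity)
V_theta  = mlp(OBS_DIM, 1)                   # value network for bootstrapping
F_theta  = mlp(OBS_DIM + ACT_DIM, OBS_DIM)   # learned transition surrogate
R_theta  = mlp(OBS_DIM + ACT_DIM, 1)         # learned reward surrogate

policy_opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)

def policy_step(start_states):
    """One gradient step on the policy using an H-step return estimated
    entirely through the learned surrogates (no simulator gradients needed)."""
    # Freeze surrogate weights for this step; gradients still flow *through* them.
    for net in (V_theta, F_theta, R_theta):
        net.requires_grad_(False)
    s, ret, discount = start_states, 0.0, 1.0
    for _ in range(HORIZON):
        a = torch.tanh(pi_theta(s))                     # action from the policy
        sa = torch.cat([s, a], dim=-1)
        ret = ret + discount * R_theta(sa).squeeze(-1)  # differentiable reward
        s = F_theta(sa)                                 # differentiable next state
        discount *= GAMMA
    ret = ret + discount * V_theta(s).squeeze(-1)       # bootstrap beyond the horizon
    loss = -ret.mean()                                  # ascend the estimated return
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    for net in (V_theta, F_theta, R_theta):
        net.requires_grad_(True)
    return loss.item()

policy_step(torch.randn(512, OBS_DIM))  # 512 parallel start states, as above
```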