SHAC++: A Neural Network to Rule All Differentiable Simulators

Francesco Bertolotti1,*, Gianluca Aguzzi2,*, Walter Cazzola3, Mirko Viroli2
1Domyn, Milan, Italy
2University of Bologna, Bologna, Italy
3University of Milano, Milano, Italy
ECAI 2025
*Indicates Equal Contribution

Abstract

Reinforcement learning (RL) algorithms show promise in robotics and multi-agent systems but often suffer from low sample efficiency. While methods like SHAC leverage differentiable simulators to improve efficiency, they are limited to specific settings: they require fully differentiable environments, including transition and reward functions, and have primarily been demonstrated in single-agent scenarios. To overcome these limitations, we introduce SHAC++, a novel framework inspired by SHAC. SHAC++ removes the need for differentiable simulator components by using neural networks to approximate the required gradients, training these networks alongside the standard policy and value networks. This enables the core SHAC approach to be applied in both non-differentiable and multi-agent environments. We evaluate SHAC++ on challenging multi-agent tasks from the VMAS suite, comparing it against SHAC (where applicable) and PPO, a standard algorithm for non-differentiable settings. Our results demonstrate that SHAC++ significantly outperforms PPO in both single- and multi-agent scenarios. Furthermore, in differentiable environments where SHAC operates, SHAC++ achieves comparable performance despite lacking direct access to simulator gradients, thus successfully extending SHAC's benefits to a broader class of problems.

Research Overview

This work introduces SHAC++, extending gradient-based reinforcement learning to non-differentiable and multi-agent environments. Our evaluation covers four key experimental scenarios from the VMAS simulator:

Dispersion

Non-differentiable rewards with complete observations. Agents learn cooperative spreading behavior to maximize coverage while avoiding collisions.

Discovery

Non-differentiable rewards with partial observations. Agents explore and discover targets in an environment with limited sensory information.

Transport

Differentiable rewards with complete observations. Multiple agents coordinate to push a package to a target location, demonstrating emergent cooperation.

Sampling

Differentiable rewards with partial observations. Agents collect samples from the environment through coordinated exploration and collection strategies.

These scenarios, from the VMAS suite, test SHAC++ across varying levels of reward differentiability and observation completeness, demonstrating the framework's versatility and superiority over traditional methods like PPO while maintaining comparable performance to SHAC where applicable.

Key Contributions

  • Novel RL Framework: SHAC++ employs learned gradient approximations, eliminating the need for differentiable simulators while maintaining the benefits of gradient-based optimization.
  • Multi-Agent Extension: First empirical evaluation of SHAC in multi-agent environments, establishing a baseline for differentiable multi-agent reinforcement learning (MARL).
  • Comprehensive Evaluation: Systematic comparison across environments with varying agent counts and differentiability properties using the VMAS simulator.
  • Superior Performance: Experimental evidence demonstrating that SHAC++ matches SHAC's performance in differentiable settings while substantially outperforming PPO in sample efficiency across both single-agent and complex multi-agent scenarios.

Visual Results: Transport Scenario

The Transport scenario demonstrates emergent cooperative behavior where multiple agents coordinate to push a package to the goal location.

SHAC++ agents learning coordinated transport behavior

SHAC++ Method & Technical Implementation

Figure: SHAC and SHAC++ algorithm flows. SHAC requires differentiable environment components, whereas SHAC++ uses neural networks to approximate the gradients.

Key Innovation: SHAC++ replaces direct simulator differentiation with learned gradient approximations. While SHAC requires both a differentiable transition function F and a differentiable reward function R, SHAC++ approximates them with neural networks Fθ and Rθ, enabling application to non-differentiable environments.

SHAC++ addresses the limitations of SHAC by approximating gradients from non-differentiable simulators using neural networks. The key insight is to train networks Fθ and Rθ to approximate the transition and reward functions, respectively, alongside the policy and value networks.
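
To make this concrete, below is a minimal PyTorch-style sketch of the idea (network sizes, variable names, and the deterministic single-agent policy are illustrative assumptions, not the paper's implementation). The surrogates Fθ and Rθ are fit by regression on rollouts collected from the black-box simulator, and the policy loss is a short-horizon return unrolled through those surrogates and bootstrapped with the value network, so policy gradients are available even when the simulator itself provides none.

```python
# Minimal sketch of the SHAC++ training signals (illustrative names and sizes).
import torch
import torch.nn as nn

obs_dim, act_dim, horizon, gamma = 8, 2, 32, 0.99

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
F_net  = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, obs_dim))  # ~ Fθ
R_net  = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))        # ~ Rθ

def model_fit_loss(s, a, r, s_next):
    """Supervised regression of Fθ and Rθ on transitions from the real simulator."""
    sa = torch.cat([s, a], dim=-1)
    return (nn.functional.mse_loss(F_net(sa), s_next)
            + nn.functional.mse_loss(R_net(sa).squeeze(-1), r))

def short_horizon_policy_loss(s0):
    """Differentiable short-horizon return, unrolled through the learned surrogates
    and bootstrapped with the value network (as in SHAC, but without simulator gradients)."""
    s, ret = s0, torch.zeros(s0.shape[0])
    for t in range(horizon):
        a = policy(s)
        sa = torch.cat([s, a], dim=-1)
        ret = ret + (gamma ** t) * R_net(sa).squeeze(-1)
        s = F_net(sa)
    ret = ret + (gamma ** horizon) * value(s).squeeze(-1)
    return -ret.mean()

# During training, model_fit_loss is minimized on real transitions while
# short_horizon_policy_loss updates the policy through the surrogates.
```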

SHAC Limitations

  • Requires fully differentiable environments
  • Limited to single-agent scenarios
  • Unstable gradients in complex dynamics
  • Inapplicable to sparse reward functions

SHAC++ Advantages

  • Works with non-differentiable simulators
  • Extends to multi-agent environments
  • Robust gradient approximations
  • Handles complex reward structures

🧠 Neural Networks

  • Policy Networks: πθ for action selection
  • Value Network: Vθ for state evaluation
  • Transition Network: Fθ for dynamics approximation
  • Reward Network: Rθ for reward prediction

⚙️ Training Details

  • Optimizer: Adam with learning rate 1e-3
  • Episodes: 512 parallel environments × 32 steps each
  • Early Stopping: 90% of maximum reward in 90% of episodes
  • Hardware: V100/A100 GPUs, 1-8 hours per run
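
For reference, a minimal configuration mirroring the settings above might look as follows (field names are illustrative assumptions; only the values come from the setup listed here):

```python
# Illustrative hyperparameter configuration (field names are assumptions).
config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "num_envs": 512,            # parallel environments per rollout
    "rollout_steps": 32,        # steps per environment per rollout
    "early_stop": {             # stop once 90% of episodes reach 90% of max reward
        "reward_fraction": 0.9,
        "episode_fraction": 0.9,
    },
}
```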

Multi-Agent Extensions

SHAC++ extends naturally to multi-agent settings through:

  • Shared Parameters: Policy networks share parameters across agents
  • Transformer Architecture: Handles variable agent numbers with positional invariance
  • Cooperative Learning: Joint optimization enables emergent coordination
  • Scalable Design: Linear complexity with respect to agent count (up to attention limits)

The framework trains four neural networks simultaneously: policy networks πθ, value network Vθ, transition network Fθ, and reward network Rθ. This enables gradient-based policy optimization even when the underlying simulator components are non-differentiable.
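
A minimal sketch of the shared policy is shown below (dimensions and module names are assumptions; the paper uses a 1-layer, single-head transformer, which the sketch mirrors). Because no positional encodings are added, the same parameters process any number of agents, and each agent's output does not depend on agent ordering.

```python
# Hypothetical sketch of a shared, transformer-based multi-agent policy.
import torch
import torch.nn as nn

class SharedAgentPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # 1 layer, single head
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, obs):                 # obs: (batch, n_agents, obs_dim)
        x = self.encoder(self.embed(obs))   # attention mixes information across agents
        return self.head(x)                 # (batch, n_agents, act_dim)

# The same module handles 1, 3, or 5 agents without any changes:
policy = SharedAgentPolicy(obs_dim=8, act_dim=2)
for n_agents in (1, 3, 5):
    actions = policy(torch.randn(16, n_agents, 8))
    assert actions.shape == (16, n_agents, 2)
```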

Research Questions

RQ₁: Neural Gradient Approximation

Can we train a neural network to approximate the gradients of a differentiable simulator?

RQ₂: Comparative Performance

How does our algorithm compare to PPO/MAPPO and SHAC in both single-agent and multi-agent settings?

RQ₃: Scalability Analysis

How does the performance of these algorithms change as the search space increases with more agents?

Experimental Setup

We evaluate SHAC++ on four multi-agent scenarios from the VMAS simulator, selected for its differentiable physics, multi-agent design, and realistic simulation capabilities. Our experiments span 254 runs across different algorithms, architectures, and scenarios.

📍 Scenarios

  • Dispersion: Non-differentiable rewards, complete observations
  • Discovery: Non-differentiable rewards, partial observations
  • Transport: Differentiable rewards, complete observations
  • Sampling: Differentiable rewards, partial observations

🏗️ Architectures

  • MLP: 1-layer for single-agent scenarios
  • Transformer: 1-layer single-head for multi-agent
  • Transition Network: 3-layer Transformer
  • Agent Counts: 1, 3, and 5 agents

Hardware: Experiments were conducted on V100 GPUs (32 GB) and A100 GPUs (40 GB). Each run lasted 1-8 hours and used early stopping criteria and hyperparameter optimization.

Key Results & Findings

✅ RQ₁: Successful Gradient Approximation

SHAC++ successfully approximates simulator gradients using neural networks, achieving comparable performance to SHAC in differentiable environments while enabling application to non-differentiable settings.

📈 Superior to PPO

Single-agent: SHAC++ outperforms PPO across all scenarios with better sample efficiency.
Multi-agent: PPO often fails to converge while SHAC++ succeeds in learning cooperative behaviors.

⚖️ Comparable to SHAC

In differentiable environments (Transport, Sampling), SHAC++ achieves performance comparable to SHAC despite lacking direct simulator access, sometimes even outperforming it.

🔗 Emergent Cooperation

SHAC++ consistently demonstrates emergent cooperative behaviors across all multi-agent scenarios. Analysis of gradient norms reveals generalization events when agents learn to coordinate and avoid collisions.

Scaling Analysis

Performance varies with agent count: Transport and Sampling become easier with more agents, while Dispersion becomes harder. Cooperation emerges consistently across these settings, suggesting scalability to larger agent populations. The current practical limit (around 50 agents) stems from transformer attention complexity and could be addressed with linear attention mechanisms.

Performance Analysis

Comprehensive Results: performance comparison across all scenarios, using a Transformer architecture for multi-agent settings and an MLP for single-agent scenarios. Results show mean ± standard deviation over 3 runs.

🎯 Sample Efficiency

SHAC++ consistently converges faster than PPO across all scenarios, often reaching target performance in fewer than 10,000 episodes, whereas PPO frequently fails to converge.

🤝 Cooperation Emergence

Multi-agent scenarios show clear cooperative behavior development, with gradient norm analysis revealing coordination learning phases.

Paper Access

📄 Read the Paper

Access the full ECAI 2025 paper for complete technical details, proofs, and extended experimental results.

Full Paper (PDF)

💻 Code Implementation

Open-source implementation available on GitHub with full experimental setup and reproducible results.

GitHub Repository

Future Directions

Several promising avenues remain for extending SHAC++ beyond its current capabilities:

🔄 Scalability Improvements

  • Linear attention mechanisms (Performer, Mamba)
  • Scaling beyond 50+ agents
  • Memory-efficient architectures

🎯 Cold-Start Solutions

  • World model pre-training
  • Model Predictive Path Integral (MPPI) integration
  • Human demonstration seeding

🌍 Broader Environments

  • Cooperative AI community tasks
  • Adversarial multi-agent settings
  • Extreme partial observability

🧬 Advanced Techniques

  • Curriculum learning integration
  • Meta-learning for faster adaptation
  • Hierarchical multi-agent coordination

BibTeX

@inproceedings{Bertolotti25,
  title={SHAC++: A Neural Network to Rule All Differentiable Simulators},
  author={Bertolotti, Francesco and Aguzzi, Gianluca and Cazzola, Walter and Viroli, Mirko},
  booktitle={European Conference on Artificial Intelligence (ECAI)},
  year={2025},
  url={https://github.com/f14-bertolotti/shacpp}
}