

Key Innovation: SHAC++ replaces direct simulator differentiation with learned gradient approximations. While SHAC requires both a differentiable transition function F and a differentiable reward function R, SHAC++ approximates these with neural networks Fθ and Rθ, enabling application to non-differentiable environments.
SHAC++ addresses the limitations of SHAC by approximating the gradients that a non-differentiable simulator cannot provide. The key insight is to train networks Fθ and Rθ to approximate the transition and reward functions, respectively, alongside the policy and value networks, so that policy gradients are backpropagated through these learned surrogates rather than through the simulator itself.
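As a concrete illustration of this idea, the sketch below fits surrogate networks Fθ and Rθ by simple regression on transitions collected from the non-differentiable simulator. It is a minimal PyTorch sketch under assumed state/action sizes, layer widths, and an MSE loss; the actual SHAC++ architectures and training objectives may differ.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the actual environments define these.
OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128

def mlp(in_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, out_dim),
    )

# F_theta: (state, action) -> next state;  R_theta: (state, action) -> reward
F_theta = mlp(OBS_DIM + ACT_DIM, OBS_DIM)
R_theta = mlp(OBS_DIM + ACT_DIM, 1)

opt = torch.optim.Adam(list(F_theta.parameters()) + list(R_theta.parameters()), lr=1e-3)

def fit_surrogates(states, actions, next_states, rewards):
    """One regression step on a batch of simulator transitions (s, a, s', r)."""
    sa = torch.cat([states, actions], dim=-1)
    loss = nn.functional.mse_loss(F_theta(sa), next_states) \
         + nn.functional.mse_loss(R_theta(sa).squeeze(-1), rewards)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random placeholder data standing in for real rollouts.
B = 256
fit_surrogates(torch.randn(B, OBS_DIM), torch.randn(B, ACT_DIM),
               torch.randn(B, OBS_DIM), torch.randn(B))
```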
SHAC Limitations
- Requires fully differentiable environments
- Limited to single-agent scenarios
- Unstable gradients in complex dynamics
- Inapplicable to sparse reward functions
SHAC++ Advantages
- Works with non-differentiable simulators
- Extends to multi-agent environments
- Robust gradient approximations
- Handles complex reward structures
🧠 Neural Networks
- Policy Networks: πθ for action selection
- Value Network: Vθ for state evaluation
- Transition Network: Fθ for dynamics approximation
- Reward Network: Rθ for reward prediction
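To make the four roles above concrete, here is a minimal PyTorch sketch of the modules. The layer widths, the tanh-squashed Gaussian policy head, and the learned log-std are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128  # hypothetical sizes

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, out_dim))

class GaussianPolicy(nn.Module):
    """pi_theta: maps a state to a tanh-squashed Gaussian action."""
    def __init__(self):
        super().__init__()
        self.mean = mlp(OBS_DIM, ACT_DIM)
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        return torch.tanh(dist.rsample())  # rsample keeps the action differentiable

pi_theta = GaussianPolicy()                   # action selection
V_theta  = mlp(OBS_DIM, 1)                    # state evaluation
F_theta  = mlp(OBS_DIM + ACT_DIM, OBS_DIM)    # dynamics approximation (s, a) -> s'
R_theta  = mlp(OBS_DIM + ACT_DIM, 1)          # reward prediction (s, a) -> r

obs = torch.randn(8, OBS_DIM)                 # a batch of hypothetical observations
print(pi_theta(obs).shape, V_theta(obs).shape)  # (8, 4) and (8, 1)
```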
⚙️ Training Details
- Optimizer: Adam with learning rate 1e-3
- Episodes: 512 environments × 32 steps
- Early Stopping: triggered once 90% of the maximum reward is reached in 90% of episodes
- Hardware: V100/A100 GPUs, 1-8 hours per run
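These settings might translate into a configuration and stopping rule like the sketch below. The field names are hypothetical, and the early-stopping check assumes the criterion means 90% of the maximum episode reward reached in 90% of episodes.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values taken from the training details above; names are illustrative.
    lr: float = 1e-3          # Adam learning rate
    num_envs: int = 512       # parallel environments per rollout
    horizon: int = 32         # steps collected per environment per update
    reward_frac: float = 0.9  # stop at 90% of the maximum episode reward ...
    episode_frac: float = 0.9 # ... achieved in 90% of episodes

def should_stop(episode_returns, max_return, cfg: TrainConfig) -> bool:
    """Early-stopping rule: enough episodes are close enough to the best return."""
    hits = sum(r >= cfg.reward_frac * max_return for r in episode_returns)
    return hits >= cfg.episode_frac * len(episode_returns)

# Example: 470 of 512 episodes above the threshold -> 470/512 ~ 0.92 >= 0.9, so stop.
print(should_stop([95.0] * 470 + [10.0] * 42, max_return=100.0, cfg=TrainConfig()))
```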
Multi-Agent Extensions
SHAC++ extends naturally to multi-agent settings through:
- Shared Parameters: Policy networks share parameters across agents
- Transformer Architecture: Handles variable agent numbers with positional invariance
- Cooperative Learning: Joint optimization enables emergent coordination
- Scalable Design: Linear complexity with respect to agent count (up to attention limits)
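A minimal sketch of such a shared policy is shown below: a single transformer encoder consumes one token per agent (with no positional encoding, so agent ordering carries no meaning) and a shared head emits one action per agent, for any number of agents. Model width, head count, layer count, and the deterministic tanh output are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, D_MODEL = 16, 4, 64  # hypothetical sizes

class SharedMultiAgentPolicy(nn.Module):
    """One parameter set shared by all agents; works for any number of agents."""
    def __init__(self, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(OBS_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=n_heads,
                                           batch_first=True)
        # No positional encoding is added, so the encoder treats agents as a set.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(D_MODEL, ACT_DIM)

    def forward(self, obs):                  # obs: (batch, num_agents, OBS_DIM)
        tokens = self.embed(obs)             # (batch, num_agents, D_MODEL)
        mixed = self.encoder(tokens)         # attention lets agents coordinate
        return torch.tanh(self.head(mixed))  # one action per agent

policy = SharedMultiAgentPolicy()
actions_3 = policy(torch.randn(8, 3, OBS_DIM))  # 3 agents
actions_7 = policy(torch.randn(8, 7, OBS_DIM))  # 7 agents, same parameters
print(actions_3.shape, actions_7.shape)         # (8, 3, 4) and (8, 7, 4)
```

Because the same parameters process every agent token, adding agents changes only the input shape, not the model.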
The framework trains four neural networks simultaneously: policy networks πθ, value network Vθ, transition network Fθ, and reward network Rθ. This enables gradient-based policy optimization even when the underlying simulator components are non-differentiable.
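Putting the pieces together, the sketch below shows the core policy update: a short-horizon rollout is unrolled entirely through the learned Fθ and Rθ instead of the simulator, with Vθ bootstrapping the return beyond the horizon, so the estimated return is differentiable with respect to the policy parameters. The deterministic policy, fixed discount, and omission of the value and surrogate training steps are simplifications for illustration.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 128   # hypothetical sizes
HORIZON, GAMMA = 32, 0.99               # short-horizon length and discount

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
                         nn.Linear(HIDDEN, out_dim))

pi_theta = mlp(OBS_DIM, ACT_DIM)             # policy (deterministic here for brevity)
V_theta  = mlp(OBS_DIM, 1)                   # value network for bootstrapping
F_theta  = mlp(OBS_DIM + ACT_DIM, OBS_DIM)   # learned transition surrogate
R_theta  = mlp(OBS_DIM + ACT_DIM, 1)         # learned reward surrogate

policy_opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)

def policy_step(start_states):
    """One gradient step on the policy using an H-step return estimated
    entirely through the learned surrogates (no simulator gradients needed)."""
    # Freeze surrogate weights for this step; gradients still flow *through* them.
    for net in (V_theta, F_theta, R_theta):
        net.requires_grad_(False)
    s, ret, discount = start_states, 0.0, 1.0
    for _ in range(HORIZON):
        a = torch.tanh(pi_theta(s))                     # action from the policy
        sa = torch.cat([s, a], dim=-1)
        ret = ret + discount * R_theta(sa).squeeze(-1)  # differentiable reward
        s = F_theta(sa)                                 # differentiable next state
        discount *= GAMMA
    ret = ret + discount * V_theta(s).squeeze(-1)       # bootstrap beyond the horizon
    loss = -ret.mean()                                  # ascend the estimated return
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    for net in (V_theta, F_theta, R_theta):
        net.requires_grad_(True)
    return loss.item()

policy_step(torch.randn(512, OBS_DIM))  # 512 parallel start states, as above
```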