In this post, I introduce a wild idea I came up with a few weeks ago: an outer-loop optimization scheme inspired by the K-Level Policy Gradients paper. In short, each layer’s update should account for how the other layers are being updated. This approach yields noticeable performance improvements, though at a significant computational cost. Since this is an early exploration, there’s still plenty of room for refinement.