Dense backpropagation improves training for sparse mixture-of-experts

A lightweight approximation method that gives the MoE router a dense gradient update.

