StructMoE: augmenting MoEs with hierarchically routed low rank experts
Introducing hierarchical routing and low-rank experts to enhance the efficiency and performance of MoE models.
The traditional approach to scaling Mixture of Experts (MoE) for transformer models has been to increase the total number of experts. While performance improves with more experts, the gains are diminishing, whereas memory scales linearly with the number of experts. We introduce StructMoE, a scaling approach for Mixture of Experts that augments experts with additional dynamic capacity using routed structured matrices, which we refer to as Low Rank Experts (LoRE). At a high level, we introduce hierarchical MoEs in which the first level of routing decides which expert each token should be routed to, and the second level of routing decides which LoRE each token should be routed through. The outputs of the expert and the LoRE are then entangled to provide the final output. This introduces more dynamism into the model, which has empirically been demonstrated to improve model performance. We find that this scaling approach outperforms a standard MoE baseline in terms of loss on a held-out validation set. Thus, we propose it as an effective scaling technique for MoEs compared to the standard approach of adding more experts to the model.
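To make the two routing levels concrete, here is a minimal NumPy sketch of a forward pass. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the expert and LoRE outputs are entangled, so the sketch assumes a gated sum, top-1 routing at both levels, and a separate pool of LoREs per expert. All names (`init_params`, `struct_moe_forward`, `router_experts`, `router_lore`, `lore_a`, `lore_b`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d_model, d_ff, n_experts, n_lore, rank, seed=0):
    """Random parameters for the sketch (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "router_experts": rng.normal(0, s, (d_model, n_experts)),
        "w_in": rng.normal(0, s, (n_experts, d_model, d_ff)),
        "w_out": rng.normal(0, s, (n_experts, d_ff, d_model)),
        # One second-level router and one pool of low-rank factors per expert.
        "router_lore": rng.normal(0, s, (n_experts, d_model, n_lore)),
        "lore_a": rng.normal(0, s, (n_experts, n_lore, d_model, rank)),
        "lore_b": rng.normal(0, s, (n_experts, n_lore, rank, d_model)),
    }


def struct_moe_forward(x, params):
    """x: (tokens, d_model). Returns output and both routing decisions."""
    n_experts = params["router_experts"].shape[1]

    # Level 1: choose one expert per token (top-1 is an assumption).
    p1 = softmax(x @ params["router_experts"])
    expert_idx = p1.argmax(axis=-1)

    out = np.zeros_like(x)
    lore_idx = np.zeros(x.shape[0], dtype=int)
    for e in range(n_experts):
        mask = expert_idx == e
        if not mask.any():
            continue
        xe = x[mask]

        # Standard feed-forward expert.
        h = np.maximum(xe @ params["w_in"][e], 0.0) @ params["w_out"][e]

        # Level 2: choose one LoRE per token within this expert.
        p2 = softmax(xe @ params["router_lore"][e])
        chosen = p2.argmax(axis=-1)
        a = params["lore_a"][e][chosen]  # (tokens_e, d_model, rank)
        b = params["lore_b"][e][chosen]  # (tokens_e, rank, d_model)
        low_rank = np.einsum("td,tdr,tre->te", xe, a, b)

        # Assumed combination: gated sum of expert and LoRE outputs.
        g1 = p1[mask, e][:, None]
        g2 = p2[np.arange(len(chosen)), chosen][:, None]
        out[mask] = g1 * (h + g2 * low_rank)
        lore_idx[mask] = chosen
    return out, expert_idx, lore_idx
```

In this layout each LoRE adds 2 × d_model × rank parameters, against 2 × d_model × d_ff for a full expert, which is the sense in which low-rank experts add capacity at a smaller memory cost than adding another expert.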
Latest publications
Continual pre-training of MoEs: how robust is your router?
A systematic study of Mixture of Experts (MoE) continual pre-training.
NeurIPS
Searching for efficient linear layers over a continuous space of structured matrices
Searching for efficient linear operators with optimal scaling laws leading to the development of the BTT-MoE architecture.
NeurIPS
Dense backpropagation improves training for sparse mixture-of-experts
A lightweight approximation method that gives the MoE router a dense gradient update.
NeurIPS