Architecture
MoE (Mixture of Experts)
In a standard Transformer block, every token passes through a dense feed-forward network in which every parameter is activated. An MoE model replaces that dense layer with several smaller expert networks, and a gating mechanism (or router) decides which expert(s) are best suited to process each token, typically only the top one or two. Because compute scales with the parameters that are actually activated rather than the total, the model can grow its overall parameter count (and thus its knowledge capacity) without a proportional increase in inference cost, leading to faster, more efficient generation than a dense model of the same total size.
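A minimal sketch of a token-level MoE layer in PyTorch may make the mechanism concrete. The class name `MoELayer`, the dimensions, the number of experts, and the top-k value are illustrative assumptions, not any particular model's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative token-level MoE: a router picks the top-k experts per token."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden),
                          nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        tokens = x.reshape(-1, x.size(-1))      # flatten to (num_tokens, d_model)
        logits = self.router(tokens)            # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which (token, slot) pairs were routed to expert e?
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                        # expert e received no tokens
            expert_out = expert(tokens[token_idx])
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert_out
        return out.reshape_as(x)

x = torch.randn(2, 16, 512)                     # (batch, seq_len, d_model)
moe = MoELayer()
print(moe(x).shape)                             # torch.Size([2, 16, 512])
```

In this sketch the total parameter count grows with `num_experts`, but each token only runs through `top_k` experts, so the per-token compute stays close to that of a single dense feed-forward block.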