Mixture of Experts (MoE): The 671B/37B Trick

You read that a model has 671 billion parameters, then watch it answer faster than a 70B one, and wonder how. The answer is a Mixture of Experts (MoE): a neural network that splits its layers into many specialized sub-networks, called experts, and fires only a few of them per token instead of the whole model. That design lets an MoE hold hundreds of billions of parameters while doing the compute of a much smaller one. DeepSeek-V3 holds 671 billion parameters but activates only about 37 billion of them for each token it processes. MoE splits how big a model is from how much of it runs at once.

It sounds paradoxical until you split two numbers that dense models keep welded together: size, and how much of that size runs per token. In a standard dense transformer, every parameter takes part in every prediction, which ties speed to scale. MoE cuts that tie. A learned router picks which experts handle each token. The rest of the model sits idle for that step.

A dense model is a single brain that uses all of itself on every input. An MoE is a building full of specialists with a smart receptionist out front. Each visitor gets sent to the two people best suited to help, while everyone else keeps to their offices. The building can hire hundreds of specialists, yet any one visit still only occupies a couple of them. The roster grows; the cost of any one visit does not.

Dense vs sparse models: the core split

A dense model uses all of its parameters for every input. A sparse model (what MoE produces) uses only a selected subset, which is why MoE models are also called sparse models or sparsely activated transformers. In a transformer-based MoE, the single dense feed-forward (FFN) block inside a layer is replaced by several parallel FFN blocks plus a router that picks among them. Everything else, attention included, usually stays shared.

The payoff is conditional computation. The Switch Transformer paper puts it plainly: the model selects different parameters for each example, producing a sparsely activated network with an outrageous number of parameters but a roughly constant computational cost per token. Capacity scales up; per-token work does not.

The feed-forward layers carry most of a transformer's parameters and most of its per-token compute, so they are the richest target for sparsity. Attention, by contrast, mixes information across the sequence and benefits from staying shared and dense. Swapping the FFN for a routed set of expert FFNs concentrates the savings where the parameters pile up, which is why nearly every large MoE language model follows this pattern.

What does the router (gating network) do in an MoE?

The router, also called the gating network, decides which experts see each token. It is a small learned layer trained jointly with the rest of the model. For an incoming token, it scores every expert, usually with a softmax over a learned weight matrix, and forwards the token to the highest scorers. Their outputs are then combined, weighted by the router's confidence.
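
The scoring step can be sketched in a few lines of plain Python. This is a minimal illustration, not any model's actual router: the names `router_probs` and `gate_weights`, the 3-dimensional token, and the four experts are all made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def router_probs(token, gate_weights):
    """Score every expert for one token: softmax(gate_weights @ token)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    return softmax(logits)

# Hypothetical toy setup: a 3-dimensional token and 4 experts.
token = [1.0, 0.0, -1.0]
gate_weights = [
    [0.5, 0.1, -0.5],   # expert 0 -> logit 1.0
    [-0.2, 0.3, 0.4],   # expert 1 -> logit -0.6
    [0.0, 0.0, 0.0],    # expert 2 -> logit 0.0
    [1.0, -1.0, -1.0],  # expert 3 -> logit 2.0
]

probs = router_probs(token, gate_weights)
# Experts ordered from most to least preferred for this token.
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(ranked)
```

The whole router here is one small matrix, which is why its cost is minor next to the expert blocks it chooses between.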

The router is learned, not hand-coded, so experts specialize over training: each picks up different patterns, and the gate learns to send each token where it will be handled best. The router itself is tiny compared to the experts, so the overhead of choosing is small next to the compute it saves.

The specialization is rarely tidy. Experts do not split neatly into a grammar expert, a math expert, and a code expert. They tend to specialize at a more granular, syntactic level, with the router learning statistical regularities that are hard to name in plain language. What matters in practice is that the routing is consistent enough for each expert to develop its own niche, and varied enough that the work spreads across the whole pool rather than piling onto a favored few.

What are the experts in a Mixture of Experts model?

Each expert is a feed-forward sub-network, the same kind of block that sits in a normal transformer layer. An MoE layer holds many of them in parallel: 8, 256, even thousands, depending on the design. Only the experts the router picks actually run for a given token, so adding more experts grows total capacity without growing the work done per token.
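
A rough parameter count makes the point concrete. The sketch below assumes a plain two-matrix FFN per expert and ignores biases; the dimensions (`d_model=1024`, `d_ff=4096`) and the helper names are hypothetical, chosen only to show the trend.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one score per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

for n in (8, 64, 256):
    total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=n, top_k=2)
    print(f"{n:>3} experts: total={total:,} active={active:,}")
```

Total parameters grow almost linearly with the number of experts, while the active count moves only by the small amount the router itself grows.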

MoE works like a memory store of skills: the model stashes specialized knowledge across many experts, then pulls from only the relevant ones at inference. The original sparsely-gated work pushed this to thousands of expert sub-networks per layer, reporting capacity gains of more than 1000x with only minor losses in computational efficiency.

What is top-k routing in MoE (and why top-2)?

Top-k routing means the router keeps only the k best-scoring experts for each token and ignores the rest. Mixtral 8x7B uses top-2 routing: at every layer, a router network selects two of its eight experts to process the current token. The earlier Switch Transformer went as low as top-1, sending each token to a single expert to cut router computation and communication cost.

The choice of k is a dial. A higher k blends more experts per token, which can improve quality but raises compute. A lower k is cheaper and faster. The token-to-expert pairing also changes across layers and timesteps, so a single sentence may touch many different experts as it flows through the network.
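
Here is a minimal sketch of top-k selection and the weighted blend that follows it. The experts are stand-in scalar functions rather than real FFN blocks, and `top_k_route` and `moe_forward` are illustrative names, not a library API.

```python
def top_k_route(probs, k):
    """Keep the k highest-probability experts and renormalize their weights."""
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

def moe_forward(x, probs, experts, k):
    """Run only the chosen experts and blend their outputs by gate weight."""
    weights = top_k_route(probs, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts: simple scalar functions standing in for FFN blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
probs = [0.1, 0.6, 0.1, 0.2]  # assumed router output for one token

weights = top_k_route(probs, k=2)          # experts 1 and 3 survive
y = moe_forward(3.0, probs, experts, k=2)  # 0.75 * 6 + 0.25 * 9 = 6.75
print(weights, y)
```

Raising `k` to 3 would pull a third expert into the blend and add one more expert call per token, which is the quality-versus-compute dial in miniature.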

The original sparsely-gated layer used a noisy top-k gate: it added tunable random noise to each expert's score before picking the top k, then applied a softmax over just those winners. The noise was not decoration. It helped keep the routing from locking onto the same experts every time, an early answer to the load-balancing problem that later designs refined with dedicated losses. The pattern of scoring all experts, keeping a few, and normalizing over the survivors has stayed stable from that 2017 layer through to modern production models.
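
A simplified version of that noisy gate might look like the following. In the original layer the per-expert noise scale is itself learned; this sketch passes it in as a fixed list (`noise_scale`), and the logits are invented to show three nearly tied experts.

```python
import math
import random

def noisy_top_k_gate(logits, noise_scale, k, rng):
    """Add Gaussian noise to each score, keep the top k, softmax over them."""
    noisy = [z + rng.gauss(0.0, 1.0) * s for z, s in zip(logits, noise_scale)]
    chosen = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = max(noisy[i] for i in chosen)
    exps = {i: math.exp(noisy[i] - peak) for i in chosen}
    total = sum(exps.values())
    # Experts outside the top k get an exact zero gate and never run.
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

rng = random.Random(0)
logits = [2.0, 1.9, 1.8, 0.1]       # three experts are nearly tied
noise_scale = [0.5, 0.5, 0.5, 0.5]  # stand-in for a learned per-expert scale

picks = [0, 0, 0, 0]
for _ in range(1000):
    gates = noisy_top_k_gate(logits, noise_scale, k=2, rng=rng)
    for i, g in enumerate(gates):
        if g > 0.0:
            picks[i] += 1
print(picks)
```

Without noise, experts 0 and 1 would win every single time. With it, expert 2 also gets a share of the traffic, which is the spreading effect the paragraph describes.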

Insight

An MoE is hired by the hundred-billion but paid by the ten-billion. Total parameters are the knowledge it owns; active parameters are the staff it pays per token, and only the second number ever shows up on your inference bill.

Active parameters vs total parameters

This distinction is the heart of MoE economics. Mixtral 8x7B holds 47 billion total parameters but uses only about 13 billion per token, because just two of its eight experts fire at a time while the shared layers always run. Inference therefore costs roughly what a 13B-class model would, not a 47B one.

The gap widens at scale. DeepSeek-V3 holds 671 billion total parameters but activates only 37 billion for each token. The model carries the knowledge of a 671B network while paying the per-token compute of something far smaller. That separation, big in storage and lean in compute, is exactly what dense models cannot offer.

Reading model specs gets easier once you watch for which number a release is quoting. A name like Mixtral 8x7B misleads: only the feed-forward blocks are replicated eight times, while attention and the other layers are shared, so the real total is about 47B, not the 56B the name suggests. The honest comparison is active-to-active. An MoE with 37B active parameters competes, in per-token cost, with a 37B dense model, even if its total dwarfs that. Compare totals and you overstate the running cost; compare actives and you get a true read on speed and price.
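
The two Mixtral figures are enough for a back-of-the-envelope split between shared and expert parameters. The sketch assumes total = shared + 8 × expert and active = shared + 2 × expert, uses the rounded 47B and 13B from the text, and treats each "expert" as one expert slot summed across all layers, so the outputs are rough estimates rather than published numbers.

```python
def split_shared_and_expert(total_b, active_b, num_experts, top_k):
    """Solve total = shared + N*e and active = shared + k*e for (shared, e)."""
    per_expert = (total_b - active_b) / (num_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert

# Rounded figures from the text, in billions of parameters.
shared, per_expert = split_shared_and_expert(47, 13, num_experts=8, top_k=2)
print(f"shared ~{shared:.1f}B, per expert ~{per_expert:.1f}B")

# Rebuild both headline numbers as a consistency check.
print(f"total ~{shared + 8 * per_expert:.0f}B, active ~{shared + 2 * per_expert:.0f}B")
```

Under these assumptions each expert slot comes out well below 7B, which is why eight of them plus the shared layers land near 47B rather than 56B.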

Pro Tip

A quick rule for reading any MoE spec: judge speed and price by active parameters, judge knowledge capacity by total parameters, and judge how aggressive the design is by the ratio between them. Mixtral activates roughly 1 in 4 of its parameters per token (13B of 47B); DeepSeek-V3 activates about 1 in 18 (37B of 671B). A bigger gap means a more aggressively sparse design.

Why are MoE models cheaper to run at scale?

Compute per token in a transformer scales with active parameters, not stored ones. By keeping the active count low while letting the total count balloon, MoE buys more capacity per unit of compute. The Switch Transformer reported a 4x pretraining speedup over the dense T5-XXL model at matched quality, and scaled to a trillion-parameter regime while holding cost roughly constant.

For deployment, this means higher quality for a fixed inference budget, or the same quality for less money. The catch lives in memory: all experts must sit in VRAM even though only a few run per token. MoE trades cheaper compute for a heavier memory footprint, which is why it favors high-throughput, multi-GPU serving over tight, single-card setups.
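
The split between memory and compute can be estimated with two one-line formulas. This is a rough sketch under stated assumptions: 16-bit weights (2 bytes per parameter), the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and no accounting for the KV cache or activations. The dense 13B entry is a hypothetical comparison point.

```python
def weight_memory_gb(total_params_b, bytes_per_param):
    """Memory to hold every weight, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bytes_per_param

def forward_gflops_per_token(active_params_b):
    """Rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2 * active_params_b

# (total, active) in billions of parameters, rounded as in the text.
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
    "dense 13B (hypothetical)": (13, 13),
}
for name, (total_b, active_b) in models.items():
    mem = weight_memory_gb(total_b, bytes_per_param=2)  # assumes 16-bit weights
    flops = forward_gflops_per_token(active_b)
    print(f"{name}: ~{mem:.0f} GB of weights, ~{flops:.0f} GFLOPs per token")
```

On these estimates Mixtral and the dense 13B model do the same work per token, but Mixtral needs several times the memory just to keep its experts resident.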

That tradeoff shapes who should reach for MoE. When you serve many requests across several machines and want the best quality per dollar of compute, sparsity pays off. When you run in a memory-constrained spot, such as a single consumer GPU, a dense model of equal active size can be the smarter pick because it does not force you to hold a giant expert pool in memory. Fine-tuning adds a wrinkle: sparse models can overfit more readily than dense ones on small datasets, though instruction tuning tends to help. MoE is a scaling tool, not a free upgrade for every situation.

Attribute	Dense model	MoE model
Total parameters	All parameters, one pool	Much larger, spread across many experts
Active params per token	100% of the model	A small subset (e.g. 13B of 47B)
Compute per token	Scales with full model size	Scales with active params only
Inference speed	Slower as the model grows	Fast for its size; only top-k experts run
Memory / VRAM footprint	Lower; only one parameter set to load	Higher; all experts must stay resident
Pretraining cost at matched quality	Baseline	Lower; up to ~4x speedup (Switch Transformer)
Example models	Llama-class dense LLMs	Mixtral 8x7B, DeepSeek-V3

What is the load balancing problem in MoE?

Routing has a built-in failure mode. Left alone, the gating network tends to favor a few popular experts, sending most tokens to them while others sit unused. That collapse wastes capacity and creates lopsided batches that are slow to compute, since some experts overflow while others starve.

Engineers fight this with three classic tools. An auxiliary load-balancing loss nudges the router toward even expert usage. Noise added during routing spreads tokens out. An expert capacity limit caps how many tokens each expert accepts per batch, dropping or rerouting the overflow. Tuning these so no expert overloads or idles is one of the genuinely hard parts of building MoE.
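
The auxiliary loss is the easiest of the three to show in code. The sketch below follows the form used in the Switch Transformer paper, alpha × N × Σ f_i × P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The batches and the `alpha` value are illustrative.

```python
def load_balance_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    assignments: expert index each token was dispatched to (top-1 routing)
    router_probs: per-token list of router probabilities over all experts
    """
    tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    frac = [assignments.count(i) / tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / tokens for i in range(num_experts)]
    return alpha * num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Balanced batch: 4 tokens spread evenly over 2 experts.
balanced = load_balance_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, num_experts=2)
# Collapsed batch: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, num_experts=2)
print(balanced, collapsed)
```

The loss is smallest when routing is uniform and grows as traffic and probability mass pile onto the same expert, so adding it to the training objective pushes the router toward even usage.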

Here is what most explainers leave out: the auxiliary loss everyone teaches is now something newer models try to avoid. Push that balancing loss too hard and it fights the main objective, scarring quality to keep the router neat. DeepSeek-V3 ships an auxiliary-loss-free scheme instead. It adds a small bias term to each expert's routing score, nudges that bias up for starved experts and down for swamped ones after each training step (a rule-based adjustment rather than a gradient update), and uses the bias only to pick experts, never to weight their output. Its own report credits this for keeping every token in play with no balancing penalty dragging on accuracy. The standard three-tool story is real, but the field is already routing around its biggest cost.
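
A toy version of bias-based balancing shows the mechanics. It is a loose sketch of the idea, not DeepSeek-V3's implementation: the function names, the step size, and the batch of ten identical tokens are all assumptions. Because every token is identical and k is 1, the load can only shift between experts from step to step here; in real training tokens differ, so the bias evens out load within a batch.

```python
def pick_experts(scores, bias, k):
    """Select top-k by score + bias; output weights use the raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    mass = sum(scores[i] for i in chosen)
    return {i: scores[i] / mass for i in chosen}

def update_bias(bias, load, step=0.1):
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    mean_load = sum(load) / len(load)
    return [b + step if n < mean_load else b - step if n > mean_load else b
            for b, n in zip(bias, load)]

# Hypothetical batch: every token prefers expert 0, so it starts out swamped.
scores = [0.6, 0.3, 0.1]
bias = [0.0, 0.0, 0.0]
history = []
for _ in range(6):
    load = [0, 0, 0]
    for _token in range(10):
        for i in pick_experts(scores, bias, k=1):
            load[i] += 1
    history.append(load)
    bias = update_bias(bias, load)
print(history)
```

The key design point is visible in `pick_experts`: the bias changes which experts are selected, but the blend weights come from the unbiased scores, so balancing never distorts the output.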

Capacity limits create their own dilemma. Set the per-expert cap too low and a hot expert overflows, forcing the system to drop tokens or shunt them past the layer through a residual path, which hurts quality. Set it too high and you waste memory padding batches that never fill. The capacity factor that controls this has no universal value; it depends on the model and the workload. On top of that, MoE training can be less stable than dense training, which is why later work added a router z-loss, a penalty that keeps the gating logits from blowing up. Getting all of this right is the difference between an MoE that delivers on its promise and one that quietly wastes most of its parameters.
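
The capacity trade-off is easy to see with a small dispatch simulation. The sketch uses the usual definition, capacity = (tokens per batch / number of experts) × capacity factor, with a made-up batch of eight tokens in which expert 0 runs hot. The function names are illustrative.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Slots per expert: an even share of the batch, scaled by the factor."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Fill each expert in arrival order; return (accepted, dropped) token ids."""
    load = [0] * num_experts
    accepted, dropped = [], []
    for token_id, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            accepted.append(token_id)
        else:
            dropped.append(token_id)  # would skip the layer via the residual path
    return accepted, dropped

# Hypothetical batch of 8 tokens over 4 experts, with expert 0 running hot.
assignments = [0, 0, 0, 0, 1, 2, 3, 0]
for factor in (1.0, 1.5, 2.5):
    cap = expert_capacity(len(assignments), 4, factor)
    _, dropped = dispatch(assignments, 4, cap)
    print(f"capacity_factor={factor}: {cap} slots per expert, dropped tokens {dropped}")
```

At a factor of 1.0 three of the five tokens bound for expert 0 are dropped. At 2.5 nothing is dropped, but every expert reserves five slots even though three of them receive a single token.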

Pro Tip

A 671B MoE and a 671B dense model behave nothing alike at inference. The dense one runs all 671B per token; the MoE runs only its active slice. When a headline quotes a total, do not assume it predicts speed: only the active count does.

Where did Mixture of Experts come from? A short history

The mixture-of-experts idea predates deep learning. It traces to a 1991 paper, Adaptive Mixtures of Local Experts, by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton, who trained separate networks to each cover part of a task. The modern version arrived in 2017 with the sparsely-gated MoE layer, which placed thousands of feed-forward experts behind a learned softmax gate inside a single network.

Shazeer and colleagues introduced that sparsely-gated layer in Outrageously Large Neural Networks, applying it to language modeling and machine translation. The Switch Transformer then simplified the routing and scaled it to a trillion parameters. Mixtral 8x7B and DeepSeek-V3 brought open-weight, production-grade MoE to a wide audience. Note that reports of GPT-4 using an MoE design are widespread but not officially confirmed by OpenAI.

MoE and selective retrieval: a fair analogy

MoE rests on one idea worth borrowing: pull in only what is relevant instead of paying for everything. The router sends each token to the few experts that fit and leaves the rest dormant. That same instinct, retrieve the slice that matters rather than the whole pile, shows up wherever a large store has to stay useful at speed. MemX, the consumer second-brain app, works this way for your own life. You add documents, photos, scanned IDs and receipts, voice notes, and WhatsApp messages; it reads and indexes them on your phone, and when you ask a question in plain English it searches across everything you saved and answers with a citation back to the exact source. You get the relevant fact, not a wall of files to reread. The mechanisms are different, but the principle of activating a small relevant slice of a big store is the same. MemX runs on Android and on iOS via the App Store, plus WhatsApp, with web coming soon.

Frequently asked questions


01. What is a mixture of experts (MoE) model in simple terms?

It is a neural network split into many specialized sub-networks called experts. A router sends each token to only a few of them, so the model holds huge capacity but runs only a small piece per input. In short: a building full of specialists with a receptionist who only ever calls in two.

02. What is the difference between active and total parameters?

Total parameters are everything the model stores. Active parameters are the ones that actually run for a single token. In Mixtral 8x7B, total is about 47B but active is roughly 13B, because only 2 of 8 experts fire per token.

03. Are MoE models faster than dense models?

For their total size, yes. Compute per token tracks active parameters, not stored ones. An MoE activates only its top-k experts per token, so a 47B MoE can cost about what a 13B dense model costs at inference while holding far more total capacity.

04. What is the main drawback of mixture of experts?

Memory. Every expert must stay loaded in VRAM even though only a few run per token, so the footprint is large. Load balancing also adds complexity, since the router can overuse a few experts and leave others idle without correction.

05. Which real models use a mixture of experts architecture?

Mixtral 8x7B uses 8 experts with top-2 routing. DeepSeek-V3 holds 671B total parameters and activates 37B per token. The Switch Transformer scaled MoE to a trillion parameters. Reports that GPT-4 uses MoE are widely circulated but not officially confirmed.
