You ask a chatbot a question and it answers in under a second from your phone, no datacenter required. Knowledge distillation is the trick that made that possible: it trains a small "student" model to imitate a large "teacher" model, so the student keeps most of the teacher's accuracy at a sliver of the size and cost. The student learns from the teacher's full output, the probability it assigns to every possible answer, not just the single correct label, and that richer signal is what lets a compact model punch far above its weight. DistilBERT, a distilled copy of Google's BERT, proves it: 97% of the language understanding, 60% faster, 40% smaller.
The technique was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in the 2015 paper "Distilling the Knowledge in a Neural Network." Their core insight: a trained network's class probabilities carry more information than the hard label alone, and a smaller network can absorb that information directly.
What are the teacher and student models in distillation?
The teacher is a large, already-trained, high-accuracy model. The student is a smaller model trained to reproduce the teacher's behavior. The student does not learn the original task from scratch in the usual way. Instead, it learns to match what the teacher predicts, which compresses the teacher's learned function into fewer parameters.
Why not just train the small model on the data directly? Because the teacher has already discovered useful structure, and it hands that structure over in a form the student absorbs more easily. Its outputs act as a smarter, denser set of training targets than raw labels. A small model trained on hard labels alone often plateaus below its potential, since each example only ever tells it one thing. Feed it the teacher's full distributions instead and it sees a graded signal on every example, reaching noticeably higher accuracy for its size.
The teacher need not be a single model. Hinton's original work distilled an ensemble, a committee of many models whose combined predictions are accurate but expensive to run, into one compact model that approximated the whole committee. That framing still holds: the teacher is whatever expensive, high-quality source of predictions you want to compress, whether that is one giant network, an ensemble, or a frontier model behind an API.
Soft labels vs hard labels: what's the difference?
A hard label is a single correct answer: this image is a "cat." A soft label is the teacher's full probability distribution across every class: 90% cat, 7% dog, 2% fox, and so on. That distribution is the heart of distillation. The small non-zero probabilities encode what the teacher considers similar to what.
Hinton's team gave this leftover signal a name that stuck: "dark knowledge." When the teacher says a cat image is also slightly dog-like and almost never truck-like, it is quietly revealing how it organizes the world. A hard label throws all of that away and keeps only the winner. Training the student on the full distribution transfers the teacher's sense of how classes relate.
A hard label tells the student the right answer. A soft distribution tells it the right answer AND how the teacher thinks about every wrong one. That is why a small model can learn so much from a big one.
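To make the contrast concrete, here is a minimal sketch in plain Python. The class names follow the cat example above; the 1% assigned to "truck" and the helper names `entropy_bits` and `ranked_runner_ups` are assumptions made for illustration, not values from any real model.

```python
import math

CLASSES = ["cat", "dog", "fox", "truck"]

# Hard label: one-hot, so only the winning class carries any mass.
hard_label = [1.0, 0.0, 0.0, 0.0]

# Soft label: an illustrative teacher distribution (made-up numbers).
soft_label = [0.90, 0.07, 0.02, 0.01]


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(x * math.log2(1.0 / x) for x in p if x > 0)


def ranked_runner_ups(p, names):
    """Non-winning classes, ordered by how much probability they receive."""
    winner = max(range(len(p)), key=lambda i: p[i])
    others = [(names[i], p[i]) for i in range(len(p)) if i != winner]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


print(entropy_bits(hard_label))                  # 0.0: nothing beyond the winner
print(round(entropy_bits(soft_label), 3))        # positive: extra signal per example
print(ranked_runner_ups(soft_label, CLASSES))    # dog first, truck last
```

The ranking of the runner-up classes is the part a one-hot label cannot express: both labels agree on "cat", but only the soft one says that "dog" is a nearer miss than "truck".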
What is temperature in knowledge distillation?
Temperature controls how soft the teacher's output gets. Neural networks turn raw scores, called logits, into probabilities using a softmax function. A confident teacher often produces a spiky distribution: 99.7% on one class, near zero on the rest. That spike hides the dark knowledge, because the relative sizes of the tiny probabilities are what carry the inter-class information.
Temperature is a number T that divides the logits before the softmax. Raising it flattens the distribution so the small probabilities grow large enough to learn from. At temperature 1 you get the normal, sharp output. At a higher temperature you get a smoother spread that exposes how much weight the teacher gives the runner-up classes. The student is trained against these softened targets, and the same temperature is applied to the student during training so the two distributions line up.
After training, the student runs at temperature 1 like any normal model. The high temperature is a training-time device for surfacing the teacher's soft knowledge, not a permanent setting.
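The effect of temperature is easy to see in a few lines of plain Python. This is a sketch under stated assumptions: the logits below are hypothetical scores chosen so the sharp output lands near the 99.7% spike mentioned above, and `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher scores for: cat, dog, fox, truck.
teacher_logits = [9.0, 3.0, 1.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print([round(p, 4) for p in sharp])  # almost all mass on the first class
print([round(p, 4) for p in soft])   # flatter: runner-ups become visible
```

Dividing by a larger temperature never changes which class ranks first; it only shrinks the gaps between logits, so the ordering of the runner-ups survives while their probabilities become large enough to produce a useful gradient.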
Three types of distillation
Knowledge distillation methods fall into three types based on what knowledge gets transferred from teacher to student: response-based, feature-based, and relation-based. They are not mutually exclusive, and production systems often combine them.
Response-based
The student mimics the teacher's final outputs, the logits or softened probabilities described above. This is the original form and what most people mean by distillation. The student only needs to see the teacher's predictions, not its internals.
Feature-based
The student also imitates the teacher's intermediate representations, the activations in hidden layers, not just the final answer. Aligning these internal features gives the student a richer view of how the teacher builds up its understanding layer by layer. It is more demanding, since you need access to and a way to match the teacher's internals.
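One common way to match internals, sketched here as an assumption and not as the only option, is a mean squared error between a teacher hidden vector and a linearly projected student hidden vector. The projection exists because the student's hidden size is usually smaller than the teacher's. All names and numbers below are hypothetical.

```python
def project(vector, weights):
    """Linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def feature_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, projection)
    if len(projected) != len(teacher_hidden):
        raise ValueError("projection must map into the teacher's hidden size")
    squared = [(s - t) ** 2 for s, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(squared)


# Toy activations: a 2-dimensional student layer and a 4-dimensional teacher layer.
student_hidden = [0.5, -1.0]
teacher_hidden = [0.4, -0.9, 0.1, 0.0]

# Hypothetical 4x2 projection; in real training these weights would be learned.
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
    [0.5, 0.25],
]

print(feature_loss(student_hidden, teacher_hidden, projection))
```

In a real setup this term would be added to the output-matching loss, the teacher's activations would be treated as fixed targets, and the choice of which teacher layer pairs with which student layer is a design decision the sketch leaves out.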
Relation-based
The student learns the relationships the teacher captures between examples or between layers, for instance which inputs the teacher treats as similar. Rather than copying a single output or a single layer, the student preserves the structure of how things relate across the teacher's representation.
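A minimal sketch of the idea, assuming cosine similarity between example embeddings as the "relation" being preserved (other relation measures exist; this is one illustrative choice). Because only pairwise similarities are compared, the student and teacher embeddings can have different dimensions. All vectors below are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for a batch of embeddings."""
    return [[cosine(a, b) for b in embeddings] for a in embeddings]


def relation_loss(student_embeddings, teacher_embeddings):
    """Mean squared difference between the two pairwise-similarity matrices."""
    s = similarity_matrix(student_embeddings)
    t = similarity_matrix(teacher_embeddings)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Three inputs. The teacher treats the first two as similar, the third as different.
teacher = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

# Two hypothetical 2-dimensional students embedding the same three inputs.
student_good = [[2.0, 0.0], [1.8, 0.3], [0.0, 1.0]]  # keeps the same grouping
student_bad = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]   # groups the wrong pair

print(relation_loss(student_good, teacher))
print(relation_loss(student_bad, teacher))
```

The first student scores a much lower loss even though none of its coordinates match the teacher's, which is the point: the structure of what is similar to what is preserved, not any individual vector.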
What are real examples of knowledge distillation?
DistilBERT is the textbook case. It is a distilled version of Google's BERT language model. Its authors report that it reduces the size of a BERT model by 40%, runs 60% faster, and retains 97% of BERT's language understanding capabilities. That is the distillation promise made concrete: most of the skill, far less of the cost. Training the student against the larger model during pre-training is why it generalizes across many downstream tasks, not just one.
A more recent example comes from reasoning models. DeepSeek used LLM distillation to transfer the reasoning patterns of its large DeepSeek-R1 model into smaller, dense Qwen and Llama models by fine-tuning them on roughly 800,000 training samples curated with R1, most of them reasoning traces. The result was a family of compact checkpoints at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters that carry over a meaningful share of the large model's reasoning ability. Distillation transfers complex behaviors, not just classification.
Note the DeepSeek method: the student was fine-tuned on text the teacher produced. Distillation does not always require live access to the teacher's logits. Training a student on a teacher's generated outputs is a practical, widely used variant.
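The shape of that variant can be sketched in a few lines. This is not DeepSeek's pipeline: `build_distillation_set`, `toy_teacher`, and the length filter are hypothetical stand-ins. In practice `teacher_generate` would wrap a call to a real teacher model, and the filtering would be far more careful.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    min_chars: int = 1,
) -> List[Dict[str, str]]:
    """Collect (prompt, completion) pairs from a teacher for later fine-tuning."""
    records = []
    for prompt in prompts:
        completion = teacher_generate(prompt).strip()
        # Drop empty or too-short generations instead of training on them.
        if len(completion) >= min_chars:
            records.append({"prompt": prompt, "completion": completion})
    return records


def toy_teacher(prompt: str) -> str:
    """Stand-in for a real teacher model; returns canned reasoning text."""
    canned = {
        "What is 2 + 2?": "Adding 2 and 2 gives 4. Answer: 4",
        "What is 3 * 5?": "Three groups of five make 15. Answer: 15",
    }
    return canned.get(prompt, "")


prompts = ["What is 2 + 2?", "What is 3 * 5?", "A prompt the teacher cannot answer"]
dataset = build_distillation_set(prompts, toy_teacher)

for record in dataset:
    print(record)
```

The resulting records are ordinary supervised fine-tuning data. The student never sees logits, only text, which is why this variant works even when the teacher sits behind an API.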
Every time an AI answers instantly on your phone, you are talking to a student, not the teacher. The giant model trained it, then stayed behind in the datacenter. Distillation is how the lessons of a model too big to fit in your pocket end up running inside it.
Distillation vs quantization vs pruning: what's the difference?
Distillation, quantization, and pruning are the three main model compression techniques, and fine-tuning is often confused with them. The key distinction: distillation trains a separate smaller model, quantization lowers numeric precision of an existing model, pruning deletes parts of an existing model, and fine-tuning adapts a model to a task without shrinking it. The table below lays out the trade-offs.
| Aspect | Distillation | Quantization | Pruning |
|---|---|---|---|
| What changes | A new, smaller student model learns from a large teacher | Weights stored at lower numeric precision, for example 16-bit to 8- or 4-bit | Redundant weights, neurons, or layers are removed from the model |
| Effect on size | Much smaller; the student has fewer parameters by design | Smaller in memory; parameter count is unchanged but each is cheaper to store | Smaller; depends on how aggressively parts are cut |
| Effect on quality | Retains much of the teacher's skill; some loss vs the teacher | Small accuracy loss if done carefully; can degrade at very low precision | Small loss if light; larger loss as more is removed |
| Needs retraining? | Yes; training the student is the whole method | Often no; post-training quantization works, though calibration helps | Often yes; usually fine-tuned after pruning to recover accuracy |
| When to use | You want a permanently smaller, faster model with most of the skill | You want to cut memory and speed up an existing model with minimal change | You suspect the model is over-parameterized and want to trim it |
A point most explainers skip: these are not either/or choices. The common production pipeline is distill first to get a structurally smaller student, then quantize that student, then optionally prune, compounding the savings. Distillation is the only one of the three that creates a new model; the other two modify what you already have.
The cleanest way to separate distillation from quantization: quantization keeps the same model and stores its numbers more cheaply, while distillation builds a brand-new, structurally smaller model that learned from the original. One reduces precision; the other reduces size by retraining. That is why the two are complementary instead of competing.
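A back-of-the-envelope calculation shows how the two savings compound. The numbers are illustrative assumptions: a 110-million-parameter teacher stored at 32 bits, a student with 40% fewer parameters (matching the DistilBERT ratio quoted earlier), and 8-bit quantization of the student. Only weight storage is counted; quantization scale factors, activations, and runtime overhead are ignored.

```python
def model_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


teacher_params = 110_000_000                 # illustrative teacher size
student_params = teacher_params * 60 // 100  # distillation: 40% fewer parameters

teacher_fp32 = model_size_mb(teacher_params, 32)
student_fp32 = model_size_mb(student_params, 32)  # fewer weights, same precision
student_int8 = model_size_mb(student_params, 8)   # same weights, lower precision

print(teacher_fp32)                 # 440.0
print(student_fp32)                 # 264.0
print(student_int8)                 # 66.0
print(student_int8 / teacher_fp32)  # 0.15
```

Distillation changes the first argument (parameter count) and quantization changes the second (bits per weight), so the reductions multiply: 0.6 × 0.25 = 0.15 of the original weight storage in this example.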
Fine-tuning does not aim to shrink anything; it adapts a model's weights to a specific task or domain. You can fine-tune a teacher, fine-tune a student after distillation, or use a teacher's outputs as fine-tuning targets, which is exactly the DeepSeek approach above. Shrinking and specializing are separate goals that often run in sequence.
What distillation does not solve
Distillation makes a model smaller. It does nothing about whether that model knows anything about you. A distilled model that runs instantly on your phone is still a general-purpose model: it has read the public internet, but it has never seen your receipts, your prescriptions, the PDF a clinic emailed you, or the voice note you left yourself last week. Shrinking the weights makes the model fast and cheap to run; it does not give the model a record of your own life. Those are separate problems: model size is one axis, having a private store of your own information is another.
That second problem is where a consumer memory app fits. MemX is an AI second brain for your own stuff. You dump in photos, PDFs, scanned documents, voice notes, and WhatsApp messages, and MemX reads and indexes them on your behalf. Later you ask a plain-English question, like what your deductible was or when a warranty expires, and it answers with a citation back to the original document or note so you can check the source. It runs on Android and as an iOS TestFlight beta, works inside WhatsApp, and keeps your data encrypted, with web access coming soon. A small fast model gives you an instant answer; MemX makes sure the answer is grounded in your own records instead of a guess.
When distillation is worth it
- You have a strong large model and need to deploy something far cheaper without starting accuracy over from zero.
- You are targeting phones, browsers, or edge devices where the teacher simply will not fit.
- You need lower latency or lower serving cost at scale, and a small student meets the quality bar.
- You want to transfer a specialized behavior, such as reasoning style, from a frontier model into an open, self-hostable one.
- You can pair distillation with quantization on the student for compounding size and speed wins.
Distillation is not free. You need a good teacher, training infrastructure, and a careful setup of temperature and loss weighting. The student's loss usually blends two terms: how well it matches the teacher's soft distribution and how well it still predicts the true hard labels, balanced by a weight you tune. Get that balance wrong and the student either ignores the teacher or overfits to it. The payoff, when tuned well, is a compact model that runs where the teacher never could.
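The two-term loss can be sketched in plain Python as follows. The parameter names `alpha` and `temperature`, the default values, and the example logits are assumptions for illustration. Multiplying the soft term by the temperature squared is a widely used convention that keeps its gradients on a scale comparable to the hard-label term; treat it as a tunable choice, not a requirement.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target term (teacher) and a hard-label term (ground truth)."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # Soft term: KL divergence from the student's softened distribution to the teacher's.
    soft_term = sum(
        math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student)
    )

    # Hard term: ordinary cross-entropy against the true label at temperature 1.
    hard_term = -log_softmax(student_logits)[true_class]

    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term


teacher_logits = [4.0, 1.0, -1.0]   # hypothetical teacher scores
close_student = [3.5, 1.2, -0.8]    # roughly agrees with the teacher
far_student = [0.0, 2.0, 1.0]       # disagrees with teacher and label

print(distillation_loss(close_student, teacher_logits, true_class=0))
print(distillation_loss(far_student, teacher_logits, true_class=0))
```

Setting `alpha` near 1 makes the student follow the teacher almost exclusively, while `alpha` near 0 reduces the loss to ordinary supervised training; the balance described above lives between those extremes.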
Here is what most explainers leave out: the student does not always cap out below its teacher. In the "Born-Again Networks" study, Tommaso Furlanello and colleagues trained students with the exact same architecture as their teachers and found that the students beat the teachers they copied, setting then-state-of-the-art error rates on CIFAR-10 and CIFAR-100. So distillation is mainly a tool for cheap deployment of existing capability, but the assumption that a student can only ever approximate its teacher is wrong.
What is knowledge distillation in simple terms?
It is training a small model to copy a big one. The small student learns from the large teacher's full predictions, not just the right answer, so it inherits much of the teacher's skill at a fraction of the size and cost.
Distillation vs quantization: what is the difference?
Quantization keeps the same model and stores its numbers at lower precision to save memory. Distillation trains a brand-new, structurally smaller model that learned from the original. One reduces precision; the other reduces size by retraining. They are often combined.
What is temperature in knowledge distillation?
Temperature is a training-time knob that softens the teacher's output distribution. Raising it flattens the probabilities so the small, informative values for non-winning classes grow large enough for the student to learn from. After training, the student runs at normal temperature.
What are soft labels versus hard labels?
A hard label is one correct class, like "cat." A soft label is the teacher's full probability spread across all classes, such as 90% cat, 7% dog. Soft labels reveal how the teacher relates classes, which teaches the student more per example.
Is knowledge distillation the same as fine-tuning?
No. Fine-tuning adapts a model's weights to a task without shrinking it, while distillation trains a new, smaller model to copy a larger one. They can combine: training a student on a teacher's generated outputs is a form of fine-tuning used for distillation, which is how DeepSeek built its distilled R1 models.
