Strip away the jargon and attention is one move, repeated. It lets each word in a sentence decide which other words to pull meaning from, and it runs that decision as a lookup: every token asks a question, every token advertises a label, and the answer is a weighted blend of what each token actually carries. Those three parts have names: query, key, and value, abbreviated Q, K, and V. They are the core of self-attention, and the rest of this post takes the move apart one piece at a time.
Most attention explainers cheat in one of two ways. The math-first versions throw a softmax formula at you with no picture of what it means. The hand-wavy versions say the model learns to focus on important words and stop there. That is true, but it explains nothing about how. This post sits in the middle: one library analogy that is actually correct, then the real QKV formula, softmax and all, then how the same operation scales up into the context windows and KV caches you keep hearing about.
How attention works: the library lookup analogy
Picture a library. You walk in holding a request: a query, the thing you are looking for. Every book on every shelf has a label on its spine: a key, a short description of what that book is about. You compare your request against each label, find the ones that match best, and pull the matching books off the shelf. The contents of those books are the values, the actual information you take away.
Now run that search for every word in the sentence at once, with each word holding its own request. Each token forms a query, compares it against the keys of all tokens, then collects a blend of their values, weighted by how well the keys matched. The word "it" might issue a query that matches the key of a noun three words earlier, so it pulls in that noun's value and resolves what "it" refers to. Nothing here knows grammar in advance. The matching falls out of numbers the model learned to produce.
Query = the request you are holding. Key = the label on the shelf. Value = the book's contents. Attention is a soft lookup where every word searches every other word at the same time.
Where Q, K, and V actually come from
Q, K, and V are computed on the fly from the token embeddings, not stored in advance. Each token starts as an embedding vector, a list of numbers that encodes its meaning. The model holds three separate learned weight matrices, written W_Q, W_K, and W_V. To get a token's query, you multiply its embedding by W_Q. For its key, multiply by W_K. For its value, multiply by W_V. Three matrix multiplies on the same input, three different outputs.
Those three matrices are the part the model trains. They begin as random numbers and get adjusted through backpropagation until the projections become useful: until a pronoun's query reliably matches its antecedent's key, until a verb's query finds its subject. The same embedding can produce a query that looks for one thing and a key that advertises something else, which is why the split into three roles matters. One vector cannot both ask and answer well at once.
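Here is a minimal sketch of the three projections in plain Python, with no libraries. The embedding and the three matrices are made-up numbers chosen for illustration (a trained model would have learned them), and the helper name `project` is hypothetical.

```python
def project(embedding, weight):
    """Multiply a row vector (the embedding) by a weight matrix."""
    n_out = len(weight[0])
    return [
        sum(embedding[i] * weight[i][j] for i in range(len(embedding)))
        for j in range(n_out)
    ]

# Toy 3-dimensional embedding for one token (illustrative values).
embedding = [1.0, 0.0, 2.0]

# Toy 3x2 matrices standing in for the learned W_Q, W_K, and W_V.
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_V = [[2.0, 0.0], [0.0, 2.0], [0.0, 1.0]]

# Same input, three different matrices, three different outputs.
query = project(embedding, W_Q)
key = project(embedding, W_K)
value = project(embedding, W_V)
print(query, key, value)
```

The same embedding comes out as three different vectors, which is the whole point of keeping the roles separate.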
Why three matrices instead of one
- W_Q shapes how a token searches: what it is trying to find in the rest of the sentence.
- W_K shapes how a token is found: the label it presents so other tokens can match against it.
- W_V shapes what a token hands over once it is matched: the information that flows forward.
- Keeping these separate means asking and answering are learned independently, so a word can advertise one thing and seek another.
The formula, read in plain language
The canonical self-attention formula from the 2017 paper Attention Is All You Need is softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of the key and query vectors. It looks dense, but each piece maps onto the library steps from earlier.
- QK^T: compare every query against every key using a dot product. A bigger dot product means a closer match, the same as a label matching your request more strongly.
- Divide by sqrt(d_k): scale the scores down so they do not blow up. As d_k grows, each dot product sums more terms and its typical size grows with it, so the scaling keeps the scores in a sane range.
- softmax: turn the scores into weights that add up to 1. Now each query has a set of attention weights spread across all the keys, the soft version of picking which books to pull.
- Multiply by V: take a weighted average of the value vectors using those weights. The output for each token is a blend of every token's value, dominated by the ones it matched.
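The four steps above can be sketched in a few lines of plain Python. This is an illustrative version, not how a production library does it: the Q, K, and V rows are toy numbers, the function names are my own, and real implementations use batched tensor operations instead of loops.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One row of QK^T: this query dotted with every key, then scaled.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors using this row of weights.
        outputs.append([
            sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))
        ])
    return outputs, weights

# Three tokens with 2-dimensional queries, keys, and values (toy numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

`weights` comes back as a 3 by 3 grid, one row per query and one column per key, and every row sums to 1.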
It helps to see the shapes. QK^T is not a single number; it is a full grid. For a sentence of n tokens, you get an n by n matrix of scores, one row per query and one column per key. Row 3 of that matrix holds how strongly token 3 matched every token in the sentence, including itself. softmax runs along each row, so every row becomes a set of weights that sum to 1. Multiplying that weight matrix by V then produces one output vector per token, each a blend of all the value vectors. So attention is, mechanically, a square matrix of match scores turned into a square matrix of weights, applied to the values.
Why softmax rather than just dividing each score by the total? Two reasons. softmax exponentiates first, which pushes strong matches further ahead of weak ones, so a clear winner dominates instead of being averaged into the noise. It also keeps every weight positive and bounded between 0 and 1, even when raw dot products go negative. The result is a clean probability-like spread across the keys: a soft choice, not a hard pick of a single book, which is exactly what lets a token draw on several others at graded strengths.
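A quick sketch makes the contrast concrete. The scores are invented for illustration, and `naive_normalize` is a hypothetical name for the "divide by the total" approach the question asks about.

```python
import math

def naive_normalize(scores):
    """Divide each score by the total. Breaks down when scores are negative."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    """Exponentiate first, then normalize."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 1.0, -2.0]  # toy match scores, one of them negative

naive = naive_normalize(scores)
soft = softmax(scores)
print(naive)  # includes a negative "weight" and one above 1
print(soft)   # every weight sits strictly between 0 and 1
```

With these numbers the naive version hands back a negative weight, while softmax gives the strongest match most of the weight and keeps the others small but positive.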
Here is the part most guides skip past in one line. That sqrt(d_k) term is not housekeeping; it often decides whether the model trains at all. Without it, dot products grow with the dimension of the vectors and push softmax into a saturated region where one weight sits near 1 and the rest near 0. In that regime the gradients flowing back during training nearly vanish, so the model learns slowly or not at all. Dividing by sqrt(d_k) keeps the variance of the scores near 1 (assuming the query and key components are roughly independent with zero mean and unit variance) and the gradients healthy. A small constant in the formula carries a large amount of the training stability.
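To see the variance claim with numbers, here is a small simulation sketch. It assumes query and key components are independent draws with mean 0 and variance 1, which is an idealization and not something a trained model guarantees. The function name, seed, and sample count are arbitrary choices for illustration.

```python
import random

def dot_product_variance(d_k, samples=2000, seed=0):
    """Estimate the variance of q . k for random d_k-dimensional vectors."""
    rng = random.Random(seed)
    raw = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(qi * ki for qi, ki in zip(q, k)))
    mean = sum(raw) / samples
    return sum((s - mean) ** 2 for s in raw) / samples

for d_k in (4, 64, 256):
    raw_var = dot_product_variance(d_k)
    # Dividing every score by sqrt(d_k) divides the variance by d_k.
    scaled_var = raw_var / d_k
    print(d_k, round(raw_var, 1), round(scaled_var, 2))
```

Under these assumptions the raw variance tracks d_k, while the scaled variance should stay close to 1 no matter how large d_k gets.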
Memorize one line and you have it: attention output is a weighted average of value vectors, and the weights come from matching queries against keys. The softmax and the sqrt(d_k) scaling are just how raw match scores become clean weights.
Self-attention vs cross-attention
When the queries, keys, and values all come from the same sequence, that is self-attention: a sentence attending to itself so each word can resolve against its neighbors. When the queries come from one sequence while the keys and values come from another, that is cross-attention. A decoder attending to an encoded input it is translating is the standard example. The mechanism is identical. Only the source of Q versus K and V changes. Read the table below row by row and the single difference stands out.
| Aspect | Self-attention | Cross-attention |
|---|---|---|
| Source of Q | The sequence itself | The target or output sequence |
| Source of K and V | The same sequence | A different (source) sequence |
| Typical job | Resolve a sentence against its own words | Align an output against a separate input |
| Formula | softmax(QK^T / sqrt(d_k)) V | softmax(QK^T / sqrt(d_k)) V |
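In code, the difference is only in which sequence you pass in where. The sketch below reuses one `attention` function for both cases. To stay short it feeds the toy vectors in directly as Q, K, and V and skips the learned projections; the sequences and their values are invented for illustration.

```python
import math

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V; returns (outputs, weights)."""
    scale = math.sqrt(len(K[0]))
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([
            sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))
        ])
    return outputs, weights

# Toy vectors standing in for already-projected tokens (illustrative values).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, e.g. an encoded input
target = [[0.5, 0.5], [1.0, -1.0]]             # 2 tokens, e.g. the decoder side

# Self-attention: Q, K, and V all come from the same sequence.
_, self_weights = attention(source, source, source)

# Cross-attention: Q from the target, K and V from the source.
_, cross_weights = attention(target, source, source)

print(len(self_weights), len(self_weights[0]))    # 3 3
print(len(cross_weights), len(cross_weights[0]))  # 2 3
```

The self-attention weight grid is square, while the cross-attention grid has one row per target token and one column per source token.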
Multi-head attention: many lookups at once
One attention computation captures one kind of relationship. Real models run several in parallel, called multi-head attention. The 2017 paper projects Q, K, and V into several lower-dimensional sets, runs scaled dot-product attention on each set independently, then concatenates the results and projects them back with a final output matrix.
A head is nothing more than one independent QKV lookup. Each one can specialize. One head might learn to track grammatical subjects, another to link pronouns to their referents, another to attend to the previous token. Because the heads run in parallel and get concatenated, the model views the same sentence through several different lenses at once and combines what they find. This is why you hear about a model having multiple attention heads per layer: each is an independent QKV lookup stacked side by side, and stacking more of them gives the layer more relationships it can track in a single pass.
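A rough sketch of the split, attend, and concatenate steps follows. It is simplified on purpose: slicing each vector into chunks stands in for the learned per-head projections, and the final output matrix is left out. The helper names and toy numbers are my own.

```python
import math

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for lists of row vectors."""
    scale = math.sqrt(len(K[0]))
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * v[j] for e, v in zip(exps, V))
            for j in range(len(V[0]))
        ])
    return outputs

def split_heads(X, num_heads):
    """Slice each row into num_heads equal chunks: one list of rows per head."""
    size = len(X[0]) // num_heads
    return [[row[h * size:(h + 1) * size] for row in X] for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    heads = zip(
        split_heads(Q, num_heads),
        split_heads(K, num_heads),
        split_heads(V, num_heads),
    )
    # Each head is an independent lookup over its own slice.
    per_head = [attention(q, k, v) for q, k, v in heads]
    # Concatenate the head outputs back into one vector per token.
    return [[x for head in per_head for x in head[t]] for t in range(len(Q))]

# Two tokens with 4-dimensional vectors, split across 2 heads (toy numbers).
Q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
K = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

out = multi_head_attention(Q, K, V, num_heads=2)
print(len(out), len(out[0]))  # 2 4
```

Each head only ever sees its own 2-dimensional slice, so the two heads can weight the tokens differently before their outputs are joined back together.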
How attention relates to context windows and KV caches
Attention is the root that the rest of the LLM vocabulary grows from. The context window is the set of tokens that attention is allowed to look across; a longer window means each query can match keys from further back, at a cost. That cost is concrete: the score matrix is n by n, so doubling the number of tokens roughly quadruples the attention work, which is why long context is expensive rather than free. Generating text one token at a time would otherwise mean recomputing the keys and values for the whole prefix at every step, which is wasteful, so the KV cache stores the key and value tensors from earlier tokens and reuses them instead of recomputing.
Notice the cache stores K and V, not Q. That is not arbitrary. During generation, each new token brings a fresh query, but it needs to compare that query against the keys and values of every token before it. In a causal model, where earlier tokens never attend to later ones, those earlier keys and values never change. Cache them once, reuse them for the rest of that generation. Variants you may have seen, such as grouped-query attention and flash attention, are all optimizations of this same QKV core: fewer key and value copies, or a faster way to compute the same attention output without storing the full score matrix.
- Context window: how many tokens attention may attend across at once.
- KV cache: stores the K and V tensors so generation does not recompute attention from scratch each step.
- Multi-head attention: several QKV lookups in parallel, concatenated.
- Flash attention and grouped-query attention: efficiency variants over the same QKV operation.
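The caching idea can be sketched as a pair of growing lists. `KVCache` here is a hypothetical minimal class, not any library's API, and the per-token query, key, and value vectors are toy numbers standing in for projected embeddings. The sketch checks that attending over the cache gives the same result as rebuilding the keys and values from scratch at every step.

```python
import math

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]

class KVCache:
    """Hypothetical minimal cache: two growing lists of past keys and values."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the new token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) triples, one per generated token.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the key and value lists from scratch at every step.
recomputed = [
    attend(
        tokens[t][0],
        [k for _, k, _ in tokens[: t + 1]],
        [v for _, _, v in tokens[: t + 1]],
    )
    for t in range(len(tokens))
]
print(cached_outputs == recomputed)  # True
```

The query is used once and thrown away, which is why there is nothing worth caching on the Q side.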
Attention inside a model, memory outside it
Attention is powerful but bounded: it can only reach tokens that are inside the current context window. Once a conversation scrolls past that limit, or once the information lives in your own files rather than in the prompt, attention cannot see it. That gap is what an external memory layer fills. MemX is a consumer AI memory app that sits over your own documents, photos, and notes across Android, iOS, and WhatsApp, retrieving the right pieces and feeding them into the context so the model's attention has something relevant to attend to.
MemX is private by architecture: per-user keys, encryption at rest, and an on-device first pass before anything leaves your phone. It does not change how attention works inside the model; it changes what reaches the context window in the first place, which is often the real bottleneck once a task outgrows a single prompt.
What is the attention mechanism in simple terms?
Attention lets each word in a sentence pull meaning from other words. Every word issues a query, every word offers a key, and the output is a weighted blend of each word's value, with weights set by how well queries match keys.
What do Q, K, and V stand for in attention?
Query, key, and value. Each is made by multiplying a token's embedding by a separate learned matrix: W_Q for the query, W_K for the key, W_V for the value. The query searches, the key is searched against, and the value is what gets passed forward.
What is the formula for self-attention?
softmax(QK^T / sqrt(d_k)) V, from the 2017 paper Attention Is All You Need. It scores every query against every key, scales by the square root of the key dimension, softmaxes the scores into weights, then takes a weighted average of the value vectors.
Why is attention divided by the square root of d_k?
Without scaling, dot products grow as the vector dimension d_k grows and push softmax into a saturated region where gradients vanish during training. Dividing by sqrt(d_k) keeps the score variance near 1, so learning stays stable. d_k is the dimension of the key and query vectors.
What is the difference between self-attention and cross-attention?
In self-attention the query, key, and value all come from the same sequence, so a sentence attends to itself. In cross-attention the query comes from one sequence while the keys and values come from another, such as a decoder attending to an encoded input. The formula is identical; only the source of Q versus K and V changes.
