Self-attention is a function. It takes a sequence of tokens, decides how much each token should attend to every other token, and rewrites each token’s representation accordingly.
Here’s exactly what it computes. The running example: “The cat sat on the mat” — six tokens, small enough to show every matrix.
Queries, Keys, Values
Every token starts as a vector — an embedding. For this example, each token has a 4-dimensional embedding. In real transformers these are hundreds or thousands of dimensions, but the math is identical.
Each embedding gets projected into three separate vectors through learned weight matrices $W_Q$, $W_K$, and $W_V$:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where $X$ is the $6 \times 4$ matrix of token embeddings, one row per token.
These three projections serve distinct roles:
- Query ($Q$): what a token is looking for
- Key ($K$): what a token advertises about itself
- Value ($V$): what a token contributes when selected
The query-key interaction determines how much attention to pay. The value determines what information gets transmitted. This separation is the key design choice: the relevance signal (key) is decoupled from the information payload (value).
Concretely, each element of the $Q$ matrix is a dot product between a token’s embedding row and a column of $W_Q$. Different weight matrices yield different projections — that’s what makes $Q$, $K$, and $V$ different views of the same input.
Click any token below to see its Q, K, V vectors highlighted. Hover a cell to see the multiplication that produced it:
Each row is one token. Each column is one dimension. All six tokens are projected simultaneously — Q, K, and V are just the embedding matrix multiplied by three different weight matrices.
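In code, the projection step is three matrix multiplications. Here is a minimal NumPy sketch; the random embeddings and weights are stand-ins for learned values, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4  # embedding dimension from the running example

# Stand-in embeddings: one 4-dim row per token of "The cat sat on the mat"
X = rng.normal(size=(6, d_model))

# Learned in a real model; random stand-ins here
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = X @ W_Q  # what each token is looking for
K = X @ W_K  # what each token advertises
V = X @ W_V  # what each token contributes

print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```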
The Score
Now we compute how much each token attends to every other. The full operation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Three steps:
1. Dot product. For each pair of tokens $(i, j)$, compute $q_i \cdot k_j$. A high dot product means token $i$’s query aligns with token $j$’s key — they’re compatible. This gives a $6 \times 6$ matrix of raw scores.
2. Scale. Divide every score by $\sqrt{d_k}$ (here $\sqrt{4} = 2$). Without this, large dot products push softmax into saturation, where gradients vanish. A practical fix that makes training stable.
3. Softmax. Apply softmax row-wise. Each row becomes a probability distribution — six non-negative values that sum to 1. Row $i$ tells you how token $i$ distributes its attention across all six positions.
The result is a $6 \times 6$ attention matrix. Each cell is a weight between 0 and 1. Brighter means stronger attention.
Click any cell in the heatmap below to see the exact computation — dot product, scaling, softmax. Click a row label to step through the full pipeline for that token:
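The same three steps in NumPy. This sketch assumes Q and K produced by a projection like the one above, and writes the softmax out by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4  # key dimension, so the scale factor is sqrt(4) = 2

# Stand-ins for the projected queries and keys from the previous step
Q = rng.normal(size=(6, d_k))
K = rng.normal(size=(6, d_k))

# Step 1: dot products -- a 6 x 6 matrix of raw scores
scores = Q @ K.T

# Step 2: scale, so softmax stays out of its saturated regions
scores /= np.sqrt(d_k)

# Step 3: row-wise softmax (subtracting the row max is standard numerical hygiene)
scores -= scores.max(axis=1, keepdims=True)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(A.shape)        # (6, 6)
print(A.sum(axis=1))  # every row sums to 1
```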
The Output
The attention weights determine how to blend information. Each token’s new representation is a weighted sum of all value vectors:

$$\text{output}_i = \sum_{j=1}^{6} A_{ij}\, v_j$$

where $A_{ij}$ is the attention weight from token $i$ to token $j$.
If token $i$ attends strongly to token $j$, then $v_j$ contributes heavily to the output at position $i$. Information flows from attended tokens to the attending token. The output is the same shape as the input — six vectors of dimension 4 — but each vector now encodes contextual information from across the entire sequence.
Click any row in the heatmap above to see the weighted value sum and the resulting output vector.
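In matrix form, the blend is a single multiplication of the attention matrix with the value matrix. A sketch with stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in attention weights (rows normalized to sum to 1) and value vectors
A = rng.random(size=(6, 6))
A /= A.sum(axis=1, keepdims=True)
V = rng.normal(size=(6, 4))

# Row i of the output is the A[i]-weighted sum of all six value rows
output = A @ V
print(output.shape)  # (6, 4): same shape as the input embeddings
```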
This is one attention head with one set of weight matrices. Real transformers run multiple heads in parallel — each with different $W_Q$, $W_K$, $W_V$ — so different heads learn to attend to different relationships (syntactic, semantic, positional). The outputs are concatenated and projected down. Stack layers, and each layer refines representations using everything the previous layer learned.
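To make the parallel-heads idea concrete, here is a hypothetical two-head sketch; the head count, dimensions, and output projection `W_O` are illustrative assumptions, not values from this article:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_head(X, W_Q, W_K, W_V):
    """One head: project, score, combine."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return A @ V

d_model, n_heads = 4, 2           # illustrative choices
d_head = d_model // n_heads

X = rng.normal(size=(6, d_model))

# One (W_Q, W_K, W_V) triple per head, plus a hypothetical output projection
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))

# Run the heads in parallel, concatenate their outputs, project back down
out = np.concatenate([attention_head(X, *w) for w in heads], axis=1) @ W_O
print(out.shape)  # (6, 4)
```

Each head sees the same input but, with its own weights, computes its own attention pattern; the final projection mixes the heads back into a single representation per token.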
But every head, in every layer, performs exactly this: project, score, combine.