
Every Word Is Watching

Self-attention is a function. It takes a sequence of tokens, decides how much each token should attend to every other token, and rewrites each token’s representation accordingly.

Here’s exactly what it computes. The running example: “The cat sat on the mat” — six tokens, small enough to show every matrix.


Queries, Keys, Values

Every token starts as a vector — an embedding. For this example, each token has a 4-dimensional embedding. In real transformers these are hundreds or thousands of dimensions, but the math is identical.

Each embedding gets projected into three separate vectors through learned weight matrices $W_Q$, $W_K$, and $W_V$:

$$Q = XW_Q \qquad K = XW_K \qquad V = XW_V$$

These three projections serve distinct roles:

  • Query ($Q_i$): what token $i$ is looking for
  • Key ($K_j$): what token $j$ advertises about itself
  • Value ($V_j$): what token $j$ contributes when selected

The query-key interaction determines how much attention to pay. The value determines what information gets transmitted. This separation is the key design choice: the relevance signal (key) is decoupled from the information payload (value).

Concretely, each element of the Q matrix is a dot product between a token’s embedding row and a column of $W_Q$. Different weight matrices yield different projections — that’s what makes Q, K, and V different views of the same input.

Click any token below to see its Q, K, V vectors highlighted. Hover a cell to see the multiplication that produced it.

Each row is one token. Each column is one dimension. All six tokens are projected simultaneously — the matrices Q, K, V are just $X$ multiplied by three different weight matrices.
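Here’s what that looks like in NumPy. The embeddings and weight matrices below are random placeholders standing in for learned parameters, purely to make the shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 4-d embeddings for the six tokens of "The cat sat on the mat".
# (Real embeddings come from a learned lookup table; these are random.)
X = rng.normal(size=(6, 4))        # one row per token

# Learned projection matrices (random placeholders here).
W_Q = rng.normal(size=(4, 4))
W_K = rng.normal(size=(4, 4))
W_V = rng.normal(size=(4, 4))

# One matmul per view: all six tokens are projected at once.
Q = X @ W_Q                        # (6, 4)
K = X @ W_K                        # (6, 4)
V = X @ W_V                        # (6, 4)
```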


The Score

Now we compute how much each token attends to every other. The full operation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Three steps:

1. Dot product. For each pair $(i, j)$, compute $Q_i \cdot K_j$. A high dot product means token $i$’s query aligns with token $j$’s key — they’re compatible. This gives a $6 \times 6$ matrix of raw scores.

2. Scale. Divide every score by $\sqrt{d_k}$ (here $\sqrt{4} = 2$). Without this, large dot products push softmax into saturation where gradients vanish. A practical fix that makes training stable.

3. Softmax. Apply softmax row-wise. Each row becomes a probability distribution — six non-negative values that sum to 1. Row $i$ tells you how token $i$ distributes its attention across all six positions.
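Continuing the sketch above (same placeholder Q, K, V), the three steps are one matmul, one divide, and a row-wise softmax:

```python
d_k = Q.shape[-1]                          # 4, so sqrt(d_k) = 2

scores = Q @ K.T                           # step 1: (6, 6) raw dot products
scaled = scores / np.sqrt(d_k)             # step 2: scale

# Step 3: row-wise softmax. Subtracting the row max is the standard
# numerical-stability trick; it does not change the result.
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
A = exp / exp.sum(axis=-1, keepdims=True)  # (6, 6) attention weights

assert np.allclose(A.sum(axis=-1), 1.0)    # each row is a distribution
```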

The result is a $6 \times 6$ attention matrix. Each cell is a weight between 0 and 1. Brighter means stronger attention.

Click any cell in the heatmap below to see the exact computation — dot product, scaling, softmax. Click a row label to step through the full pipeline for that token.


The Output

The attention weights determine how to blend information. Each token’s new representation is a weighted sum of all value vectors:

$$\text{output}_i = \sum_j \alpha_{ij}\, V_j$$

where $\alpha_{ij}$ is the attention weight from token $i$ to token $j$.

If token $i$ attends strongly to token $j$, then $V_j$ contributes heavily to the output at position $i$. Information flows from attended tokens to the attending token. The output is the same shape as the input — six vectors of dimension $d_k$ — but each vector now encodes contextual information from across the entire sequence.
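In code, the blend for all six tokens is a single matrix product:

```python
# output[i] = sum_j A[i, j] * V[j], computed for all i at once.
output = A @ V                     # (6, 4): same shape as the input X

# Spot-check row 0 against the explicit weighted sum.
assert np.allclose(output[0], sum(A[0, j] * V[j] for j in range(6)))
```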

Click any row in the heatmap above to see the weighted value sum and the resulting output vector.


This is one attention head with one set of weight matrices. Real transformers run multiple heads in parallel — each with different $W_Q$, $W_K$, $W_V$ — so different heads learn to attend to different relationships (syntactic, semantic, positional). The head outputs are concatenated and passed through a final output projection back to the model dimension. Stack layers, and each layer refines representations using everything the previous layer learned.

But every head, in every layer, performs exactly this: project, score, combine.
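As a rough sketch with the same placeholder weights, and toy head sizes chosen only for illustration: each head runs the routine above with its own matrices, then a final projection $W_O$ maps the concatenated heads back to the model dimension:

```python
def attention_head(X, W_Q, W_K, W_V):
    # Project, score, combine: the same three steps as above.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scaled = Q @ K.T / np.sqrt(K.shape[-1])
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    return (exp / exp.sum(axis=-1, keepdims=True)) @ V

n_heads, d_head = 2, 2              # two heads of dimension 2 (toy sizes)
heads = [
    attention_head(X,
                   rng.normal(size=(4, d_head)),
                   rng.normal(size=(4, d_head)),
                   rng.normal(size=(4, d_head)))
    for _ in range(n_heads)
]

W_O = rng.normal(size=(n_heads * d_head, 4))  # output projection
out = np.concatenate(heads, axis=-1) @ W_O    # back to (6, 4)
```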

Thanks for reading.

Say hi on LinkedIn
