Module 6 Assignment: Attention mechanism analysis

Theme

Attention and transformers

Exercises

  1. Explain query, key, value, and attention weights using a concrete sequence example.

  2. Run the starter attention computation and inspect the attention matrix.

  3. Describe how masking would change the computation for autoregressive generation.

  4. Compare self-attention with recurrence for parallelism and long-range dependencies.
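For exercise 4, the contrast in parallelism can be made concrete with a small sketch. It is an illustration under stated assumptions, not a reference solution: the input `x`, the weight matrices `W_x` and `W_h`, and the tanh update are hypothetical choices, and the attention half uses `x` directly as queries, keys, and values with no learned projections.

```python
import torch

torch.manual_seed(0)
T, d = 4, 8                    # sequence length and feature size (illustrative)
x = torch.randn(1, T, d)       # hypothetical input sequence

# Recurrence: step t needs the hidden state from step t-1,
# so the loop over time has to run sequentially.
W_x = torch.randn(d, d) * 0.1  # hypothetical input-to-hidden weights
W_h = torch.randn(d, d) * 0.1  # hypothetical hidden-to-hidden weights
h = torch.zeros(1, d)
rnn_states = []
for t in range(T):
    h = torch.tanh(x[:, t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_out = torch.stack(rnn_states, dim=1)       # (1, T, d)

# Self-attention: every pair of positions is scored in one matrix product,
# so position 0 and position T-1 interact directly in a single step.
scores = x @ x.transpose(-2, -1) / d ** 0.5    # (1, T, T)
weights = torch.softmax(scores, dim=-1)
attn_out = weights @ x                         # (1, T, d)

print("rnn_out shape:", tuple(rnn_out.shape))
print("attn_out shape:", tuple(attn_out.shape))
print("sequential steps over time: recurrence =", T, "| self-attention = 1")
```

The trade-off to discuss in the memo is that the `scores` tensor grows as T by T, while the recurrent state stays at a fixed size regardless of sequence length.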

Submission

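Submit a 600-900 word technical memo plus any code, plots, or shape traces needed to support your claims. Use the starter cell as a minimal reproducible experiment, then make at least one meaningful modification, such as adding a causal mask, changing the sequence length or feature dimension, or removing the scaling factor, and report what changed.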
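One possible way to produce a shape trace for the memo is sketched below. The `trace` helper is a hypothetical convenience function, not part of PyTorch or of the starter cell; it prints a label and shape and returns the tensor unchanged, so it can wrap each step of the starter computation.

```python
import torch
import torch.nn.functional as F


def trace(name, tensor):
    """Print a tensor's label and shape, then return the tensor unchanged."""
    print(f"{name:<8} {tuple(tensor.shape)}")
    return tensor


torch.manual_seed(6)
Q = trace("Q", torch.randn(1, 4, 8))
K = trace("K", torch.randn(1, 4, 8))
V = trace("V", torch.randn(1, 4, 8))

# Swapping the last two axes of K lines up the feature dimensions for the product.
K_t = trace("K^T", K.transpose(-2, -1))
scores = trace("scores", Q @ K_t / (Q.size(-1) ** 0.5))
weights = trace("weights", F.softmax(scores, dim=-1))
context = trace("context", weights @ V)
```

The trace shows the feature dimension (8) contracting away in `Q @ K^T`, leaving a 4 by 4 matrix of query-key scores, and reappearing once the weights are applied to `V`.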

Rubric

  • Correct use of module vocabulary and notation

  • Clear connection between design choices and data/problem structure

  • Evidence from the starter experiment or your own extension

  • Concise reflection on limitations, failure modes, or next steps

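Starter cell:

import torch
import torch.nn.functional as F

torch.manual_seed(6)  # fixed seed so the printed weights are reproducible

# Random queries, keys, and values: (batch=1, sequence length=4, feature dim=8)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)

# Scaled dot-product scores: one row per query, one column per key
scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
# Softmax over the key axis, so each row of weights sums to 1
weights = F.softmax(scores, dim=-1)
# Each context vector is a weighted average of the value vectors
context = weights @ V

print("attention weights shape:", tuple(weights.shape))
print(weights[0].round(decimals=3))
print("context shape:", tuple(context.shape))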

Output:

attention weights shape: (1, 4, 4)
tensor([[0.0160, 0.3920, 0.5170, 0.0760],
        [0.3150, 0.1920, 0.0810, 0.4120],
        [0.4670, 0.1910, 0.1510, 0.1910],
        [0.1360, 0.6080, 0.1970, 0.0590]])
context shape: (1, 4, 8)
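As one example of a modification relevant to exercise 3, the sketch below adds a causal mask to the starter computation. It is a minimal illustration, assuming the common convention that position i may attend only to positions 0 through i; the names `future`, `masked_scores`, `causal_weights`, and `causal_context` are introduced here and do not appear in the starter cell.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(6)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)
scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)

T = scores.size(-1)
# True strictly above the diagonal marks "future" keys for each query row.
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Setting masked scores to -inf gives them zero weight after softmax.
# The (T, T) mask broadcasts over the batch dimension.
masked_scores = scores.masked_fill(future, float("-inf"))
causal_weights = F.softmax(masked_scores, dim=-1)
causal_context = causal_weights @ V

print(causal_weights[0].round(decimals=3))
print("rows sum to 1:", torch.allclose(causal_weights.sum(dim=-1), torch.ones(1, T)))
```

The resulting matrix is lower triangular: the first query can attend only to itself, so its weight is exactly 1 and its context vector equals the first value vector. The mask is applied before the softmax rather than after, so each row is renormalized over the positions that remain visible.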

Reflection prompts

  • What changed when you modified the starter experiment?

  • Which result surprised you, and what diagnostic would you run next?

  • What assumption would you document before handing this model to another practitioner?
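For the diagnostic prompt, one candidate (among many) is the entropy of each row of the attention matrix, which summarizes how sharply a query concentrates on a few keys. The sketch below is an assumption about what might be useful, not a required analysis; it recomputes the starter weights and reports entropy in nats alongside the uniform upper bound log(T).

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(6)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
weights = F.softmax(scores, dim=-1)

# Shannon entropy of each query's distribution over keys, in nats.
# clamp_min guards against log(0) if a weight underflows to zero.
entropy = -(weights * torch.log(weights.clamp_min(1e-12))).sum(dim=-1)

# A uniform row over T keys attains the maximum entropy, log(T).
max_entropy = math.log(weights.size(-1))

print("entropy per query:", entropy[0].round(decimals=3))
print("uniform upper bound:", round(max_entropy, 3))
```

Rows with entropy near the upper bound spread attention almost evenly, while rows near zero attend to essentially one key. Comparing these values before and after a modification gives a compact piece of evidence for the memo.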