Module 6 Assignment: Attention Mechanism Analysis
Theme
Attention and transformers
Exercises
Explain query, key, value, and attention weights using a concrete sequence example.
Run the starter attention computation and inspect the attention matrix.
Describe how masking would change the computation for autoregressive generation.
Compare self-attention with recurrence for parallelism and long-range dependencies.
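For the last exercise, the contrast between recurrence and self-attention can be made concrete with a small sketch. This is only an illustration under stated assumptions: the weight matrices `W_x` and `W_h` are hypothetical random parameters, the recurrent cell is a plain tanh RNN, and the attention uses the input `x` directly as queries, keys, and values with no learned projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, dim = 1, 4, 8
x = torch.randn(batch, seq_len, dim)

# Recurrence: each hidden state depends on the previous one,
# so the loop over time steps has to run in order.
W_x = torch.randn(dim, dim) / dim ** 0.5  # hypothetical input weights
W_h = torch.randn(dim, dim) / dim ** 0.5  # hypothetical recurrent weights
h = torch.zeros(batch, dim)
rnn_states = []
for t in range(seq_len):
    h = torch.tanh(x[:, t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_out = torch.stack(rnn_states, dim=1)  # (batch, seq_len, dim)

# Self-attention: all pairwise scores come from one matrix product,
# so every position is processed at the same time.
# Simplification: x serves as Q, K, and V without learned projections.
scores = x @ x.transpose(-2, -1) / dim ** 0.5  # (batch, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
attn_out = weights @ x  # (batch, seq_len, dim)

print("rnn_out shape:", tuple(rnn_out.shape))
print("attn_out shape:", tuple(attn_out.shape))
```

In the recurrent version, information from position 0 reaches position 3 only after passing through three successive updates of `h`. In the attention version, the same dependency is a single entry, `weights[0, 3, 0]`, at the cost of a score matrix that grows quadratically with sequence length.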
Submission
Submit a 600-900 word technical memo, plus any code, plots, or shape traces needed to support your claims. Treat the starter cell as a minimal reproducible experiment, then make at least one meaningful modification to it.
Rubric
Correct use of module vocabulary and notation
Clear connection between design choices and data/problem structure
Evidence from the starter experiment or your own extension
Concise reflection on limitations, failure modes, or next steps
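import torch
import torch.nn.functional as F

torch.manual_seed(6)  # fixed seed so the run is reproducible

# Random stand-ins for projected queries, keys, and values: (batch=1, seq_len=4, d_k=8)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)

# Scaled dot-product scores, shape (1, 4, 4); dividing by sqrt(d_k) keeps them in a moderate range
scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
# Softmax over the key dimension, so each query's weights sum to 1
weights = F.softmax(scores, dim=-1)
# Each output row is a weighted sum of the value vectors, shape (1, 4, 8)
context = weights @ V

print("attention weights shape:", tuple(weights.shape))
print(weights[0].round(decimals=3))
print("context shape:", tuple(context.shape))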

attention weights shape: (1, 4, 4)
tensor([[0.0160, 0.3920, 0.5170, 0.0760],
        [0.3150, 0.1920, 0.0810, 0.4120],
        [0.4670, 0.1910, 0.1510, 0.1910],
        [0.1360, 0.6080, 0.1970, 0.0590]])
context shape: (1, 4, 8)
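
One possible modification of the starter cell is to add a causal mask, which is how attention is usually restricted for autoregressive generation. The sketch below is one common way to do this, not the only one: it reuses the starter's seed and tensor shapes, and the names `causal_mask`, `masked_scores`, `masked_weights`, and `masked_context` are introduced here for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(6)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)

scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)

# Lower-triangular boolean mask: query position i may attend to key positions j <= i.
seq_len = Q.size(-2)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Future positions are set to -inf so that softmax gives them zero weight.
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
masked_weights = F.softmax(masked_scores, dim=-1)
masked_context = masked_weights @ V

print("masked weights shape:", tuple(masked_weights.shape))
print(masked_weights[0])
```

The shapes are unchanged, but every entry above the diagonal of the attention matrix becomes zero and each row is renormalized over the positions that remain. The first query can only attend to itself, so its weight row is `[1, 0, 0, 0]` and its context vector equals `V[0, 0]`.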
Reflection prompts
What changed when you modified the starter experiment?
Which result surprised you, and what diagnostic would you run next?
What assumption would you document before handing this model to another practitioner?