Deep Dive: How Transformers Work (Mathematically) & How RLHF Fine-Tunes Models Like ChatGPT
1. How Transformers Work: The Math Behind LLMs
The transformer architecture (Vaswani et al., 2017) revolutionized AI by enabling models like GPT to process language in parallel (unlike older RNNs). Here's the mathematical breakdown:
Key Components of a Transformer
(1) Input Embeddings & Positional Encoding
- Each word/token is converted to a d-dimensional vector (e.g., d = 768 in GPT-2, up to d = 12,288 in the largest GPT-3).
- Since transformers don't process words sequentially, positional encodings are added to give order information:

  PE(pos, 2i) = sin(pos / 10000^(2i/d))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

  where pos = position in the sentence and i = dimension index.
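A minimal NumPy sketch of the sinusoidal encoding above (the function name and toy shapes are illustrative, not taken from any particular library; assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cos
    return pe

# Usage: add to the token embeddings, e.g. embeddings + positional_encoding(128, 768)
```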
(2) Self-Attention Mechanism
The core innovation that lets words "pay attention" to other relevant words.
For each input vector x, transformers compute:
- Queries (Q), Keys (K), and Values (V) matrices:

  Q = xW_Q, K = xW_K, V = xW_V

  (where W_Q, W_K, W_V are learned weight matrices)
- Attention scores (how much each word attends to others):

  Attention(Q, K, V) = softmax(QK^T / √d_k) V
  - QK^T computes pairwise word relevance
  - √d_k (the square root of the key dimension d_k) scales the scores for stable gradients
  - Softmax converts the scaled scores into probabilities
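A minimal NumPy sketch of single-head scaled dot-product attention, with random matrices standing in for the learned weights (shapes are toy values, not from any real model):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_Q, W_K, W_V):
    """x: (seq_len, d_model) embeddings -> (seq_len, d_v) attention outputs."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V     # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # pairwise relevance, scaled by √d_k
    weights = softmax(scores)               # each row is a probability distribution
    return weights @ V                      # weighted mixture of value vectors

# Toy usage: 5 tokens, d_model = 8, d_k = d_v = 4, random stand-in weights
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(x, W_Q, W_K, W_V).shape)   # (5, 4)
```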
(3) Multi-Head Attention
- Runs multiple attention heads in parallel (e.g., 12 heads in GPT-2, 96 in the largest GPT-3), each learning different relationships:
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

  where each head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
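A sketch of multi-head attention under the same toy assumptions (two heads, random weights standing in for learned projections):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(x, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head; W_O: output projection."""
    outs = [attention(x @ W_Q, x @ W_K, x @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O   # concat heads, project back to d_model

# Toy usage: 5 tokens, d_model = 8, 2 heads of size 4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
print(multi_head_attention(x, heads, W_O).shape)   # (5, 8)
```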
(4) Feed-Forward Network
After attention, each position passes through:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
(where W_1, W_2, b_1, b_2 are learned parameters)
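The same feed-forward computation as a NumPy sketch (toy shapes; in a real model the weights are learned, not random):

```python
import numpy as np

def feed_forward(x, W_1, b_1, W_2, b_2):
    """Position-wise FFN: the same two-layer MLP applied to every position."""
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2   # ReLU, then project back

# Toy usage: d_model = 8, d_ff = 32 (real models typically use d_ff ≈ 4 * d_model)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_1, b_1 = rng.normal(size=(8, 32)), np.zeros(32)
W_2, b_2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward(x, W_1, b_1, W_2, b_2).shape)        # (5, 8)
```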
(5) Layer Normalization & Residual Connections
Critical for training deep networks:
LayerNorm(x + Sublayer(x))
- Residual connections help gradients flow
- LayerNorm stabilizes activations
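A sketch of the post-norm pattern LayerNorm(x + Sublayer(x)); the learned gain and bias of a real LayerNorm are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm transformer pattern: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Toy usage with a stand-in sublayer; in a real block the sublayer would be
# the attention or feed-forward function from the sketches above.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = residual_block(x, lambda h: h * 0.5)
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))  # ≈ 0 and ≈ 1 per position
```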
2. How RLHF Fine-Tunes Models Like ChatGPT
Reinforcement Learning from Human Feedback (RLHF) is what makes ChatGPT helpful and aligned. Here's how it works:
Step 1: Supervised Fine-Tuning (SFT)
- Take a pre-trained LLM (e.g., a GPT-4 base model)
- Fine-tune it on high-quality human demonstrations:

  Loss = -Σ log P(y|x)   (maximize the likelihood of the human responses)
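A toy illustration of the SFT objective: the mean negative log-likelihood of the tokens in a human demonstration, assuming we already have the model's probability for each target token (the numbers below are made up):

```python
import numpy as np

def sft_loss(token_probs: np.ndarray) -> float:
    """token_probs: probability the model assigned to each target token in the
    human demonstration, shape (num_tokens,). Returns the mean NLL."""
    return float(-np.mean(np.log(token_probs + 1e-12)))

# e.g., the model assigned these probabilities to the demonstration's tokens
probs = np.array([0.9, 0.7, 0.95, 0.6])
print(sft_loss(probs))   # lower is better; 0 would mean every token predicted with probability 1
```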
Step 2: Reward Modeling
- Collect comparison data where humans rank multiple model responses:

  Preferred: "The capital of France is Paris."
  Dispreferred: "France's capital? Maybe London?"

- Train a reward model (RM) to predict human preferences:

  Loss = -log(σ(r(x, y_w) - r(x, y_l)))

  - r(x, y) = reward for prompt x and response y
  - y_w = preferred response, y_l = dispreferred response
  - σ = sigmoid function
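A minimal sketch of that pairwise loss, assuming the reward model has already produced scalar scores for the two responses:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_preferred: float, r_dispreferred: float) -> float:
    """Pairwise preference loss: -log σ(r(x, y_w) - r(x, y_l))."""
    return float(-np.log(sigmoid(r_preferred - r_dispreferred) + 1e-12))

# If the RM scores the preferred answer higher, the loss is small:
print(reward_model_loss(2.0, -1.0))   # ≈ 0.05
print(reward_model_loss(-1.0, 2.0))   # ≈ 3.05 — the RM is penalized for ranking wrongly
```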
Step 3: Reinforcement Learning (PPO)
Use Proximal Policy Optimization to align the LLM with human preferences:
- Generate responses from the current policy π
- Compute rewards for them using the RM: r(x, y)
- Optimize with a KL penalty to prevent over-optimization:

  Objective = E[r(x,y) - β KL(π||π_SFT)]

  - β controls how much we penalize deviation from the SFT model
  - The policy is updated via gradient ascent (a simplified sketch follows below)
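A heavily simplified sketch of the KL-regularized objective. This is not PPO proper (no clipping, no value baseline, no token-level rollouts); it is plain gradient ascent on E[r(x,y)] - β KL(π||π_SFT) for a single toy prompt with four candidate responses and made-up rewards:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setup: 4 candidate responses to one prompt, scored by the reward model
logits_sft = np.array([1.0, 0.5, 0.0, -0.5])   # frozen SFT policy (reference)
pi_sft = softmax(logits_sft)
rewards = np.array([0.1, 2.0, -1.0, 0.0])      # r(x, y) from the reward model
beta = 1.0                                      # KL penalty strength
lr = 0.5

logits = logits_sft.copy()                      # trainable policy starts at the SFT policy
for _ in range(500):
    pi = softmax(logits)
    # Objective J = E_pi[r(x, y)] - beta * KL(pi || pi_sft)
    g = rewards - beta * np.log(pi / pi_sft)    # per-response score including KL term
    grad = pi * (g - np.dot(pi, g))             # gradient of J w.r.t. the softmax logits
    logits += lr * grad                         # gradient ascent

print(np.round(pi_sft, 3))          # SFT reference distribution
print(np.round(softmax(logits), 3)) # mass shifts toward the reward-2.0 response,
                                    # but the KL penalty keeps it anchored near pi_sft
```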
Why RLHF Matters
- Turns raw next-word prediction into helpful dialogue
- Suppresses harmful/untruthful outputs
- Enables nuanced behavior like admitting uncertainty
Key Mathematical Insights
| Concept | Key Equation | Purpose |
|---|---|---|
| Self-Attention | `softmax(QK^T/√d_k)V` | Compute word relationships |
| PPO Objective | `E[r(x,y) - β KL(π‖π_SFT)]` | Align model with human preferences |
| Reward Modeling | `-log(σ(r(x,y_w) - r(x,y_l)))` | Learn human preference rankings |
Current Frontiers
- Mixture of Experts: only activate parts of the model for each input (GPT-4 is rumored to use this)
- Speculative Decoding: faster generation by "drafting" multiple tokens with a small model and verifying them with the large one
- Constitutional AI: an alternative to RLHF that uses model self-critique