Deep Dive: How Transformers Work (Mathematically) & How RLHF Fine-Tunes Models Like ChatGPT

1. How Transformers Work: The Math Behind LLMs

The transformer architecture (Vaswani et al., 2017) revolutionized AI by enabling models like GPT to process language in parallel (unlike older RNNs). Here's the mathematical breakdown:

Key Components of a Transformer

(1) Input Embeddings & Positional Encoding

Each token is first mapped to a dense embedding vector. Because self-attention itself is order-agnostic, a positional encoding (sinusoidal in the original paper, learned in GPT-style models) is added so the model knows where each token sits in the sequence.

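As a concrete reference, here is a minimal NumPy sketch of the sinusoidal encoding from the original paper; the function name, sequence length, and model dimension below are illustrative choices, not part of any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2): the even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions get cosine
    return pe

# Token embeddings (random here, purely for illustration) plus positional information
embeddings = np.random.randn(128, 512)
x = embeddings + sinusoidal_positional_encoding(seq_len=128, d_model=512)
```
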
(2) Self-Attention Mechanism

Self-attention is the core innovation that lets each word "pay attention" to the other words most relevant to it.

For each input vector x, transformers compute:

  1. Queries (Q), Keys (K), and Values (V) matrices:

    Q = xW_Q, K = xW_K, V = xW_V

    (where W_Q, W_K, W_V are learned weight matrices)

  2. Attention Scores (how much each word attends to others; a code sketch follows this list):

    Attention(Q,K,V) = softmax(QK^T/√d_k)V
    • QK^T computes pairwise word relevance

    • Dividing by √d_k (the square root of the key dimension) keeps the scores at a scale that gives stable gradients

    • Softmax converts scores to probabilities
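
To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the function names, toy dimensions, and random weights are all illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V      # project the inputs into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise relevance, scaled for stable gradients
    weights = softmax(scores, axis=-1)       # each row becomes a probability distribution
    return weights @ V                       # weighted sum of value vectors

# Toy example: 4 tokens, model and head dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x, W_Q, W_K, W_V)   # shape (4, 8)
```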

(3) Multi-Head Attention

Instead of attending once, the model projects Q, K, and V h times with different learned matrices, runs attention on each "head" in parallel, and concatenates the results:

MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W_O

This lets different heads specialize in different relationships (for example, local syntax versus long-range references).

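Continuing the attention sketch above (this reuses `scaled_dot_product_attention`, `x`, and `rng` from that block), a hypothetical multi-head wrapper might look like this:

```python
def multi_head_attention(x, W_Qs, W_Ks, W_Vs, W_O):
    # One scaled dot-product attention pass per head, each with its own projection matrices
    heads = [scaled_dot_product_attention(x, W_Q, W_K, W_V)
             for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs)]
    return np.concatenate(heads, axis=-1) @ W_O    # concatenate the heads, then project

# Two heads of dimension 4 for a model dimension of 8
W_Qs = [rng.normal(size=(8, 4)) for _ in range(2)]
W_Ks = [rng.normal(size=(8, 4)) for _ in range(2)]
W_Vs = [rng.normal(size=(8, 4)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
out = multi_head_attention(x, W_Qs, W_Ks, W_Vs, W_O)   # shape (4, 8)
```
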
(4) Feed-Forward Network

After attention, each position passes through:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

(where W_1, W_2 are learned matrices)
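
A minimal NumPy sketch of this position-wise feed-forward layer; the toy dimensions are illustrative (the original paper uses a hidden layer four times wider than the model dimension).

```python
import numpy as np

def feed_forward(x, W_1, b_1, W_2, b_2):
    hidden = np.maximum(0.0, x @ W_1 + b_1)   # ReLU: max(0, xW_1 + b_1)
    return hidden @ W_2 + b_2                  # project back down to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                          # hidden layer 4x wider, as in the paper
W_1, b_1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(rng.normal(size=(4, d_model)), W_1, b_1, W_2, b_2)   # shape (4, d_model)
```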

(5) Layer Normalization & Residual Connections

Critical for training deep networks:

LayerNorm(x + Sublayer(x))
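
A minimal sketch of layer normalization wrapped around a residual connection, assuming `sublayer` is one of the functions sketched above; this is the post-norm ordering written in the original paper (many GPT-style models use pre-norm instead).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to zero mean / unit variance, then rescale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # LayerNorm(x + Sublayer(x)): the residual path keeps gradients flowing in deep stacks
    return layer_norm(x + sublayer(x), gamma, beta)

# Example: wrap the feed-forward sketch from above around a residual connection
# out = residual_block(x, lambda h: feed_forward(h, W_1, b_1, W_2, b_2),
#                      gamma=np.ones(8), beta=np.zeros(8))
```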

2. How RLHF Fine-Tunes Models Like ChatGPT

Reinforcement Learning from Human Feedback (RLHF) is what makes ChatGPT helpful and aligned. Here's how it works:

Step 1: Supervised Fine-Tuning (SFT)

A pretrained base model is fine-tuned on human-written demonstrations (prompt-response pairs) using the same next-token-prediction loss as pretraining, so it learns the format and tone of helpful answers before any reinforcement learning happens.

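Conceptually the SFT objective is ordinary next-token cross-entropy on the demonstration responses; here is a minimal NumPy sketch with made-up logits and target tokens.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the correct next token under the model's softmax."""
    logits = logits - logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))       # model scores for 5 response positions, vocab of 100
targets = rng.integers(0, 100, size=5)   # the human-written demonstration tokens
loss = next_token_loss(logits, targets)
```
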
Step 2: Reward Modeling

  1. Collect comparison data where humans rank multiple model responses:

    Preferred: "The capital of France is Paris."
    Dispreferred: "France's capital? Maybe London?"
  2. Train a reward model (RM) to predict human preferences (a code sketch follows this list):

    Loss = -log(σ(r(x,y_w) - r(x,y_l)))
    • r(x,y) = reward for prompt x and response y

    • y_w = preferred response, y_l = dispreferred

    • σ = sigmoid function
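
This loss is easy to write out directly; below is a minimal NumPy sketch with scalar rewards standing in for the reward model's outputs.

```python
import numpy as np

def reward_model_loss(r_preferred, r_dispreferred):
    # -log σ(r(x, y_w) - r(x, y_l)): pushes the preferred response's reward above the dispreferred one's
    margin = r_preferred - r_dispreferred
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(reward_model_loss(2.0, -1.0))   # small loss: the RM already separates the pair correctly
print(reward_model_loss(-1.0, 2.0))   # large loss: the RM ranks the responses the wrong way
```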

Step 3: Reinforcement Learning (PPO)

Use Proximal Policy Optimization to align the LLM with human preferences:

  1. Generate responses from current policy π

  2. Compute rewards using RM: r(x,y)

  3. Optimize with a KL penalty to prevent over-optimization (sketched in code after this list):

    Objective = E[r(x,y) - β KL(π||π_SFT)]
    • β controls how strongly we penalize deviation from the SFT model

    • The policy is updated via gradient ascent on this objective
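
A minimal sketch of the KL-penalized reward this objective maximizes, approximating the KL term from per-token log-probabilities as common RLHF implementations do; the variable names are illustrative, and the full PPO machinery (advantages, clipping, a value function) is omitted.

```python
import numpy as np

def kl_penalized_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.1):
    # Empirical estimate of KL(π || π_SFT) on this response: sum of per-token log-prob gaps
    kl_estimate = np.sum(logprobs_policy - logprobs_sft)
    # r(x, y) - β·KL: a high reward-model score is good, drifting far from the SFT model is penalized
    return rm_score - beta * kl_estimate

rng = np.random.default_rng(0)
logprobs_policy = rng.normal(loc=-2.0, scale=0.5, size=20)   # log π(token | context) for 20 tokens
logprobs_sft = rng.normal(loc=-2.2, scale=0.5, size=20)      # log π_SFT(token | context), same tokens
shaped_reward = kl_penalized_reward(rm_score=1.3,
                                    logprobs_policy=logprobs_policy,
                                    logprobs_sft=logprobs_sft)
```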

Why RLHF Matters

Pretraining only teaches a model to predict the next token. RLHF is the step that turns that raw predictor into an assistant that follows instructions, declines harmful requests, and responds in the tone users expect.

Key Mathematical Insights

| Concept | Key Equation | Purpose |
| --- | --- | --- |
| Self-Attention | softmax(QK^T/√d_k)V | Compute word relationships |
| PPO Objective | E[r(x,y) - β KL(π‖π_SFT)] | Align model with human preferences |
| Reward Modeling | -log(σ(r(x,y_w) - r(x,y_l))) | Learn human preference rankings |

Current Frontiers

Research continues on making RLHF cheaper and more reliable, for example by replacing the PPO stage with simpler preference-optimization objectives and by reducing how much human labeling the reward model needs.