Deep Dive: How Transformers Work (Mathematically) & How RLHF Fine-Tunes Models Like ChatGPT

1. How Transformers Work: The Math Behind LLMs

The transformer architecture (Vaswani et al., 2017) revolutionized AI by enabling models like GPT to process language in parallel (unlike older RNNs). Here's the mathematical breakdown:

Key Components of a Transformer

(1) Input Embeddings & Positional Encoding

Each token is first mapped to a dense embedding vector. Because self-attention itself is order-agnostic, a positional encoding (sinusoidal in the original paper, learned in GPT-style models) is added so the model knows where each token sits in the sequence.

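As a concrete reference, here is a minimal NumPy sketch of the sinusoidal encoding from the original paper; the function name, sequence length, and model dimension below are illustrative choices, not part of any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2): the even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions get cosine
    return pe

# Token embeddings (random here, purely for illustration) plus positional information
embeddings = np.random.randn(128, 512)
x = embeddings + sinusoidal_positional_encoding(seq_len=128, d_model=512)
```
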
(2) Self-Attention Mechanism

Self-attention is the core innovation that lets each word "pay attention" to the other words most relevant to it.

For each input vector x, transformers compute:

  1. Queries (Q), Keys (K), and Values (V) matrices:

    Q = xW_Q, K = xW_K, V = xW_V

    (where W_Q, W_K, W_V are learned weight matrices)

  2. Attention Scores (how much each word attends to others; a code sketch follows this list):

    Attention(Q,K,V) = softmax(QK^T/√d_k)V
    • QK^T computes pairwise word relevance

    • Dividing by √d_k (the square root of the key dimension) keeps the scores at a scale that gives stable gradients

    • Softmax converts scores to probabilities
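
To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the function names, toy dimensions, and random weights are all illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V      # project the inputs into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise relevance, scaled for stable gradients
    weights = softmax(scores, axis=-1)       # each row becomes a probability distribution
    return weights @ V                       # weighted sum of value vectors

# Toy example: 4 tokens, model and head dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x, W_Q, W_K, W_V)   # shape (4, 8)
```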

(3) Multi-Head Attention

Instead of attending once, the model projects Q, K, and V h times with different learned matrices, runs attention on each "head" in parallel, and concatenates the results:

MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W_O

This lets different heads specialize in different relationships (for example, local syntax versus long-range references).

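Continuing the attention sketch above (this reuses `scaled_dot_product_attention`, `x`, and `rng` from that block), a hypothetical multi-head wrapper might look like this:

```python
def multi_head_attention(x, W_Qs, W_Ks, W_Vs, W_O):
    # One scaled dot-product attention pass per head, each with its own projection matrices
    heads = [scaled_dot_product_attention(x, W_Q, W_K, W_V)
             for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs)]
    return np.concatenate(heads, axis=-1) @ W_O    # concatenate the heads, then project

# Two heads of dimension 4 for a model dimension of 8
W_Qs = [rng.normal(size=(8, 4)) for _ in range(2)]
W_Ks = [rng.normal(size=(8, 4)) for _ in range(2)]
W_Vs = [rng.normal(size=(8, 4)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
out = multi_head_attention(x, W_Qs, W_Ks, W_Vs, W_O)   # shape (4, 8)
```
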
(4) Feed-Forward Network

After attention, each position passes through:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

(where W_1, W_2 are learned matrices)
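
A minimal NumPy sketch of this position-wise feed-forward layer; the toy dimensions are illustrative (the original paper uses a hidden layer four times wider than the model dimension).

```python
import numpy as np

def feed_forward(x, W_1, b_1, W_2, b_2):
    hidden = np.maximum(0.0, x @ W_1 + b_1)   # ReLU: max(0, xW_1 + b_1)
    return hidden @ W_2 + b_2                  # project back down to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                          # hidden layer 4x wider, as in the paper
W_1, b_1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(rng.normal(size=(4, d_model)), W_1, b_1, W_2, b_2)   # shape (4, d_model)
```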

(5) Layer Normalization & Residual Connections

Critical for training deep networks:

LayerNorm(x + Sublayer(x))
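
A minimal sketch of layer normalization wrapped around a residual connection, assuming `sublayer` is one of the functions sketched above; this is the post-norm ordering written in the original paper (many GPT-style models use pre-norm instead).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to zero mean / unit variance, then rescale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    # LayerNorm(x + Sublayer(x)): the residual path keeps gradients flowing in deep stacks
    return layer_norm(x + sublayer(x), gamma, beta)

# Example: wrap the feed-forward sketch from above around a residual connection
# out = residual_block(x, lambda h: feed_forward(h, W_1, b_1, W_2, b_2),
#                      gamma=np.ones(8), beta=np.zeros(8))
```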

2. How RLHF Fine-Tunes Models Like ChatGPT

Reinforcement Learning from Human Feedback (RLHF) is what makes ChatGPT helpful and aligned. Here's how it works:

Step 1: Supervised Fine-Tuning (SFT)

A pretrained base model is fine-tuned on human-written demonstrations (prompt-response pairs) using the same next-token-prediction loss as pretraining, so it learns the format and tone of helpful answers before any reinforcement learning happens.

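Conceptually the SFT objective is ordinary next-token cross-entropy on the demonstration responses; here is a minimal NumPy sketch with made-up logits and target tokens.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the correct next token under the model's softmax."""
    logits = logits - logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))       # model scores for 5 response positions, vocab of 100
targets = rng.integers(0, 100, size=5)   # the human-written demonstration tokens
loss = next_token_loss(logits, targets)
```
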
Step 2: Reward Modeling

  1. Collect comparison data where humans rank multiple model responses:

    Preferred: "The capital of France is Paris."
    Dispreferred: "France's capital? Maybe London?"
  2. Train a reward model (RM) to predict human preferences (a code sketch follows this list):

    Loss = -log(σ(r(x,y_w) - r(x,y_l)))
    • r(x,y) = reward for prompt x and response y

    • y_w = preferred response, y_l = dispreferred

    • σ = sigmoid function
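
This loss is easy to write out directly; below is a minimal NumPy sketch with scalar rewards standing in for the reward model's outputs.

```python
import numpy as np

def reward_model_loss(r_preferred, r_dispreferred):
    # -log σ(r(x, y_w) - r(x, y_l)): pushes the preferred response's reward above the dispreferred one's
    margin = r_preferred - r_dispreferred
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(reward_model_loss(2.0, -1.0))   # small loss: the RM already separates the pair correctly
print(reward_model_loss(-1.0, 2.0))   # large loss: the RM ranks the responses the wrong way
```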

Step 3: Reinforcement Learning (PPO)

Use Proximal Policy Optimization to align the LLM with human preferences:

  1. Generate responses from current policy π

  2. Compute rewards using RM: r(x,y)

  3. Optimize with a KL penalty to prevent over-optimization (sketched in code after this list):

    Objective = E[r(x,y) - β KL(π||π_SFT)]
    • β controls how strongly we penalize deviation from the SFT model

    • The policy is updated via gradient ascent on this objective
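
A minimal sketch of the KL-penalized reward this objective maximizes, approximating the KL term from per-token log-probabilities as common RLHF implementations do; the variable names are illustrative, and the full PPO machinery (advantages, clipping, a value function) is omitted.

```python
import numpy as np

def kl_penalized_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.1):
    # Empirical estimate of KL(π || π_SFT) on this response: sum of per-token log-prob gaps
    kl_estimate = np.sum(logprobs_policy - logprobs_sft)
    # r(x, y) - β·KL: a high reward-model score is good, drifting far from the SFT model is penalized
    return rm_score - beta * kl_estimate

rng = np.random.default_rng(0)
logprobs_policy = rng.normal(loc=-2.0, scale=0.5, size=20)   # log π(token | context) for 20 tokens
logprobs_sft = rng.normal(loc=-2.2, scale=0.5, size=20)      # log π_SFT(token | context), same tokens
shaped_reward = kl_penalized_reward(rm_score=1.3,
                                    logprobs_policy=logprobs_policy,
                                    logprobs_sft=logprobs_sft)
```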

Why RLHF Matters

Pretraining only teaches a model to predict the next token. RLHF is the step that turns that raw predictor into an assistant that follows instructions, declines harmful requests, and responds in the tone users expect.

Key Mathematical Insights

| Concept | Key Equation | Purpose |
| --- | --- | --- |
| Self-Attention | softmax(QK^T/√d_k)V | Compute word relationships |
| PPO Objective | E[r(x,y) - β KL(π‖π_SFT)] | Align model with human preferences |
| Reward Modeling | -log(σ(r(x,y_w) - r(x,y_l))) | Learn human preference rankings |

Current Frontiers

Research continues on making RLHF cheaper and more reliable, for example by replacing the PPO stage with simpler preference-optimization objectives and by reducing how much human labeling the reward model needs.