Formula Deep-Dive: Softmax Function

The Formula

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Where:
  • \(z_i\) = raw score (logit) for class \(i\)
  • \(K\) = total number of classes
  • \(e\) = Euler's number (~2.718)

Why Do We Need This?

The Problem

Your neural network outputs raw scores (logits) for each class: [2.3, 1.5, 0.8] for "cat, dog, bird". These are unbounded numbers. How do you turn them into probabilities that sum to 1?

❌ Bad Idea: Just Normalize

Scores: [2.3, 1.5, 0.8] → Sum = 4.6

Divide: [2.3/4.6, 1.5/4.6, 0.8/4.6] = [0.50, 0.33, 0.17]

Problem: What if scores are negative? [-1, 2, -3] → Sum = -2, and dividing gives nonsense "probabilities" like [0.5, -1, 1.5] 😱

What we need: a function that (see the sketch after this list):
  • Converts any real numbers → valid probabilities (0 to 1)
  • Output probabilities sum to exactly 1
  • Preserves order (highest score → highest probability)
  • Is differentiable (for backpropagation)
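
To make these requirements concrete, here is a minimal sketch that implements the formula exactly as written above (NumPy is an assumption; the text doesn't name a library):

```python
import numpy as np

def softmax(z):
    """Turn raw scores (logits) into a probability distribution."""
    e = np.exp(z)          # step 1: exponentiate (always positive)
    return e / e.sum()     # step 2: normalize so the outputs sum to 1

probs = softmax(np.array([2.3, 1.5, 0.8]))
print(probs)               # ≈ [0.598 0.269 0.133]
print(probs.sum())         # 1.0
```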

How Softmax Works

Step 1: Exponentiate Everything

Why \(e^z\)? Because:

  • \(e^x > 0\) for ALL \(x\) (negative inputs become small positive numbers)
  • \(e^x\) is smooth and differentiable
  • \(e^x\) grows fast → amplifies differences

Example: [2.3, 1.5, 0.8]

$$e^{2.3} \approx 9.97, \quad e^{1.5} \approx 4.48, \quad e^{0.8} \approx 2.23$$

Step 2: Normalize

Sum: \(9.97 + 4.48 + 2.23 = 16.68\)

Divide each by the sum:

$$\left[\frac{9.97}{16.68}, \frac{4.48}{16.68}, \frac{2.23}{16.68}\right] = [0.598, 0.269, 0.134]$$

Sum = 1.001 ≈ 1 ✓ (rounding error)

Step 3: Check Properties

  • ✓ All values between 0 and 1
  • ✓ Sum to 1
  • ✓ Highest score (2.3) → highest probability (0.598)
  • ✓ Differentiable everywhere
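
The first three checks can be confirmed mechanically; a quick sketch (again assuming NumPy):

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])
p = np.exp(z) / np.exp(z).sum()

print(np.all((p > 0) & (p < 1)))      # True -- every value lies in (0, 1)
print(np.isclose(p.sum(), 1.0))       # True -- probabilities sum to 1
print(np.argmax(p) == np.argmax(z))   # True -- order is preserved
```

Differentiability is a property of the formula itself (exponentiation and division are smooth), so there is nothing to check numerically.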

The "Soft" Part

Compare to argmax (hard maximum):

Argmax (Hard)

Scores: [2.3, 1.5, 0.8]

Output: [1, 0, 0] (one-hot)

Problem: The output is piecewise constant, so the gradient is zero almost everywhere (and undefined at ties). Nothing useful to backpropagate.

Softmax (Soft)

Scores: [2.3, 1.5, 0.8]

Output: [0.598, 0.269, 0.134]

Benefit: Smooth approximation! Works with gradient descent.

Key Insight: Softmax is a "soft" version of argmax. It mostly picks the winner but gives non-zero probabilities to others.
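
A small sketch of the contrast (NumPy assumed):

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])

# Hard maximum: one-hot winner, flat (zero-gradient) almost everywhere
hard = np.zeros_like(z)
hard[np.argmax(z)] = 1.0
print(hard)                           # [1. 0. 0.]

# Soft maximum: smooth, keeps non-zero mass on the losers
soft = np.exp(z) / np.exp(z).sum()
print(soft)                           # ≈ [0.598 0.269 0.133]
```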

Temperature: Controlling Confidence

The full formula includes temperature \(T\):

$$\text{softmax}(z_i, T) = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
| Temperature | Effect | Example output for [2.3, 1.5, 0.8] |
|---|---|---|
| \(T = 1\) | Normal softmax | [0.598, 0.269, 0.134] |
| \(T \to 0\) | Approaches argmax (confident) | [1.000, 0.000, 0.000] |
| \(T \to \infty\) | Approaches uniform (uncertain) | [0.333, 0.333, 0.333] |
| \(T = 0.5\) | More "peaky" (confident) | [0.799, 0.161, 0.040] |
| \(T = 2\) | More "flat" (uncertain) | [0.467, 0.313, 0.220] |
Use cases:
  • Low \(T\) → confident predictions (e.g., deployment)
  • High \(T\) → exploratory predictions (e.g., creative text generation)
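
To reproduce the table above, here is a temperature-scaled sketch (NumPy assumed; softmax_t is a made-up helper name, not a library function):

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax with temperature: T < 1 sharpens, T > 1 flattens."""
    e = np.exp(z / T)
    return e / e.sum()

z = np.array([2.3, 1.5, 0.8])
for T in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(T, np.round(softmax_t(z, T), 3))
# T=0.5 and T=2 reproduce the "peaky" and "flat" rows of the table above
```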

Concrete Example

Scenario: Image Classification (Cat, Dog, Bird)

Raw scores from neural network: [3.2, 1.3, 0.2]

Step 1: Exponentiate

  • \(e^{3.2} = 24.53\)
  • \(e^{1.3} = 3.67\)
  • \(e^{0.2} = 1.22\)

Step 2: Sum: \(24.53 + 3.67 + 1.22 = 29.42\)

Step 3: Normalize

  • Cat: \(24.53 / 29.42 = 0.834\) (83.4%)
  • Dog: \(3.67 / 29.42 = 0.125\) (12.5%)
  • Bird: \(1.22 / 29.42 = 0.041\) (4.1%)

Model is 83% confident it's a cat!
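
If SciPy happens to be available (an assumption), scipy.special.softmax reproduces the hand calculation in one call:

```python
import numpy as np
from scipy.special import softmax   # available since SciPy 1.2

logits = np.array([3.2, 1.3, 0.2])  # cat, dog, bird
print(np.round(softmax(logits), 3)) # [0.834 0.125 0.042] -- matches the hand calculation up to rounding
```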

Why Exponential? Why Not Just Divide?

Let's see what happens without exponentiation:

Without \(e^x\)

Scores: [5, 4, 1]

Simple normalize: [0.5, 0.4, 0.1]

Difference between top two: 0.1

With \(e^x\)

Scores: [5, 4, 1]

Softmax: [0.721, 0.265, 0.013]

Difference between top two: 0.456

Key Insight: Exponentiation amplifies differences. A score of 5 is way more confident than 4! Softmax captures that.
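
Side by side in code (NumPy assumed):

```python
import numpy as np

z = np.array([5.0, 4.0, 1.0])

plain = z / z.sum()                   # naive normalization
soft  = np.exp(z) / np.exp(z).sum()   # softmax

print(np.round(plain, 3))             # [0.5   0.4   0.1  ]
print(np.round(soft, 3))              # [0.721 0.265 0.013]
```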

Derivative (For Backpropagation)

The gradient of softmax is elegant but tricky:

$$\frac{\partial \text{softmax}(z_i)}{\partial z_j} = \begin{cases} \text{softmax}(z_i)(1 - \text{softmax}(z_i)) & \text{if } i = j \\ -\text{softmax}(z_i) \cdot \text{softmax}(z_j) & \text{if } i \neq j \end{cases}$$
Why it matters: When softmax is combined with cross-entropy loss, the gradient of the loss with respect to the logits simplifies beautifully to \(\hat{y} - y\) (predicted probabilities minus the one-hot truth). This is why softmax + cross-entropy is the standard pairing for classification!
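
The simplification can be checked numerically. A sketch (NumPy assumed; the logits and one-hot target are chosen for illustration) comparing the analytic gradient \(\hat{y} - y\) against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))   # y is a one-hot target

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])                # true class: index 0

analytic = softmax(z) - y                    # the claimed gradient, y_hat - y

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.round(analytic, 6))                 # ≈ [-0.341  0.242  0.099]
print(np.round(numeric, 6))                  # matches the analytic gradient
```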

Common Gotchas

1. Numerical Stability

❌ Large \(z_i\) → \(e^{z_i}\) overflows (infinity)

✅ Subtract max before exponentiating:

$$\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_{j} e^{z_j - \max(z)}}$$

This is mathematically equivalent but numerically stable!
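
A quick demonstration of the failure and the fix (NumPy assumed):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # exp(1000) overflows to inf
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))     # largest exponent is now 0: no overflow
    return e / e.sum()

z = np.array([1000.0, 999.0, 998.0])

print(naive_softmax(z))                  # [nan nan nan] plus an overflow warning
print(np.round(stable_softmax(z), 3))    # [0.665 0.245 0.09] -- same as softmax([2, 1, 0])
```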

2. Don't Double-Apply!

❌ softmax(softmax(logits)) — applying softmax twice squashes the distribution toward uniform (see the sketch below).

Many frameworks provide a cross-entropy loss that applies softmax internally and therefore expects raw logits (for example, PyTorch's nn.CrossEntropyLoss or TensorFlow's tf.nn.softmax_cross_entropy_with_logits), so don't add your own softmax in front of it.
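
A sketch of what double-applying does (NumPy assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.3, 1.5, 0.8])

once  = softmax(logits)            # ≈ [0.598 0.269 0.133]  <- correct
twice = softmax(softmax(logits))   # ≈ [0.426 0.306 0.268]  <- squashed toward uniform
print(np.round(once, 3), np.round(twice, 3))
```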

3. Binary Classification?

For 2 classes, use sigmoid instead: it is simpler and mathematically equivalent to a 2-class softmax (shown below).

Reserve softmax for \(K \geq 3\) classes.
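
The equivalence is one line of algebra: for two logits \(z_1, z_2\), divide the numerator and denominator of the softmax by \(e^{z_1}\):

$$\frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \sigma(z_1 - z_2)$$

where \(\sigma\) is the sigmoid function, so a single sigmoid over the score difference carries the same information as a 2-class softmax.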

Try It Yourself

Exercise 1: Calculate manually

Logits: [1.0, 2.0, 3.0] → Softmax probabilities = ?

(Hint: \(e^1 \approx 2.72\), \(e^2 \approx 7.39\), \(e^3 \approx 20.09\))

Exercise 2: Temperature experiment

For logits [2, 1, 0], calculate softmax at \(T=0.1\), \(T=1\), and \(T=10\). What pattern do you see?

Exercise 3: Prove stability

Show that \(\text{softmax}(z) = \text{softmax}(z - c)\) for any constant \(c\)

(This is why subtracting max works!)