Formula Deep-Dive: Softmax Function

The Formula

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Where:
  • \(z_i\) = raw score (logit) for class \(i\)
  • \(K\) = total number of classes
  • \(e\) = Euler's number (~2.718)

Why Do We Need This?

The Problem

Your neural network outputs raw scores (logits) for each class: [2.3, 1.5, 0.8] for "cat, dog, bird". These are unbounded numbers. How do you turn them into probabilities that sum to 1?

❌ Bad Idea: Just Normalize

Scores: [2.3, 1.5, 0.8] → Sum = 4.6

Divide: [2.3/4.6, 1.5/4.6, 0.8/4.6] = [0.50, 0.33, 0.17]

Problem: What if scores are negative? [-1, 2, -3] → Sum = -2, and dividing gives nonsense "probabilities" like [0.5, -1, 1.5] 😱

What we need: a function that (see the sketch after this list):
  • Converts any real numbers → valid probabilities (0 to 1)
  • Output probabilities sum to exactly 1
  • Preserves order (highest score → highest probability)
  • Is differentiable (for backpropagation)
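
To make these requirements concrete, here is a minimal sketch that implements the formula exactly as written above (NumPy is an assumption; the text doesn't name a library):

```python
import numpy as np

def softmax(z):
    """Turn raw scores (logits) into a probability distribution."""
    e = np.exp(z)          # step 1: exponentiate (always positive)
    return e / e.sum()     # step 2: normalize so the outputs sum to 1

probs = softmax(np.array([2.3, 1.5, 0.8]))
print(probs)               # ≈ [0.598 0.269 0.133]
print(probs.sum())         # 1.0
```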

How Softmax Works

Step 1: Exponentiate Everything

Why \(e^z\)? Because:

  • \(e^x > 0\) for ALL \(x\) (negative inputs become small positive numbers)
  • \(e^x\) is smooth and differentiable
  • \(e^x\) grows fast → amplifies differences

Example: [2.3, 1.5, 0.8]

$$e^{2.3} \approx 9.97, \quad e^{1.5} \approx 4.48, \quad e^{0.8} \approx 2.23$$

Step 2: Normalize

Sum: \(9.97 + 4.48 + 2.23 = 16.68\)

Divide each by the sum:

$$\left[\frac{9.97}{16.68}, \frac{4.48}{16.68}, \frac{2.23}{16.68}\right] = [0.598, 0.269, 0.134]$$

Sum = 1.001 ≈ 1 ✓ (rounding error)

Step 3: Check Properties

  • ✓ All values between 0 and 1
  • ✓ Sum to 1
  • ✓ Highest score (2.3) → highest probability (0.598)
  • ✓ Differentiable everywhere
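
The first three checks can be confirmed mechanically; a quick sketch (again assuming NumPy):

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])
p = np.exp(z) / np.exp(z).sum()

print(np.all((p > 0) & (p < 1)))      # True -- every value lies in (0, 1)
print(np.isclose(p.sum(), 1.0))       # True -- probabilities sum to 1
print(np.argmax(p) == np.argmax(z))   # True -- order is preserved
```

Differentiability is a property of the formula itself (exponentiation and division are smooth), so there is nothing to check numerically.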

The "Soft" Part

Compare to argmax (hard maximum):

Argmax (Hard)

Scores: [2.3, 1.5, 0.8]

Output: [1, 0, 0] (one-hot)

Problem: The output is piecewise constant, so the gradient is zero almost everywhere (and undefined at ties). Nothing useful to backpropagate.

Softmax (Soft)

Scores: [2.3, 1.5, 0.8]

Output: [0.598, 0.269, 0.134]

Benefit: Smooth approximation! Works with gradient descent.

Key Insight: Softmax is a "soft" version of argmax. It mostly picks the winner but gives non-zero probabilities to others.
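
A small sketch of the contrast (NumPy assumed):

```python
import numpy as np

z = np.array([2.3, 1.5, 0.8])

# Hard maximum: one-hot winner, flat (zero-gradient) almost everywhere
hard = np.zeros_like(z)
hard[np.argmax(z)] = 1.0
print(hard)                           # [1. 0. 0.]

# Soft maximum: smooth, keeps non-zero mass on the losers
soft = np.exp(z) / np.exp(z).sum()
print(soft)                           # ≈ [0.598 0.269 0.133]
```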

Temperature: Controlling Confidence

The full formula includes temperature \(T\):

$$\text{softmax}(z_i, T) = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
| Temperature | Effect | Example output for [2.3, 1.5, 0.8] |
|---|---|---|
| \(T = 1\) | Normal softmax | [0.598, 0.269, 0.134] |
| \(T \to 0\) | Approaches argmax (confident) | [1.000, 0.000, 0.000] |
| \(T \to \infty\) | Approaches uniform (uncertain) | [0.333, 0.333, 0.333] |
| \(T = 0.5\) | More "peaky" (confident) | [0.799, 0.161, 0.040] |
| \(T = 2\) | More "flat" (uncertain) | [0.467, 0.313, 0.220] |
Use cases:
  • Low \(T\) → confident predictions (e.g., deployment)
  • High \(T\) → exploratory predictions (e.g., creative text generation)
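
To reproduce the table above, here is a temperature-scaled sketch (NumPy assumed; softmax_t is a made-up helper name, not a library function):

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax with temperature: T < 1 sharpens, T > 1 flattens."""
    e = np.exp(z / T)
    return e / e.sum()

z = np.array([2.3, 1.5, 0.8])
for T in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(T, np.round(softmax_t(z, T), 3))
# T=0.5 and T=2 reproduce the "peaky" and "flat" rows of the table above
```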

Concrete Example

Scenario: Image Classification (Cat, Dog, Bird)

Raw scores from neural network: [3.2, 1.3, 0.2]

Step 1: Exponentiate

  • \(e^{3.2} = 24.53\)
  • \(e^{1.3} = 3.67\)
  • \(e^{0.2} = 1.22\)

Step 2: Sum: \(24.53 + 3.67 + 1.22 = 29.42\)

Step 3: Normalize

  • Cat: \(24.53 / 29.42 = 0.834\) (83.4%)
  • Dog: \(3.67 / 29.42 = 0.125\) (12.5%)
  • Bird: \(1.22 / 29.42 = 0.041\) (4.1%)

Model is 83% confident it's a cat!
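
If SciPy happens to be available (an assumption), scipy.special.softmax reproduces the hand calculation in one call:

```python
import numpy as np
from scipy.special import softmax   # available since SciPy 1.2

logits = np.array([3.2, 1.3, 0.2])  # cat, dog, bird
print(np.round(softmax(logits), 3)) # [0.834 0.125 0.042] -- matches the hand calculation up to rounding
```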

Why Exponential? Why Not Just Divide?

Let's see what happens without exponentiation:

Without \(e^x\)

Scores: [5, 4, 1]

Simple normalize: [0.5, 0.4, 0.1]

Difference between top two: 0.1

With \(e^x\)

Scores: [5, 4, 1]

Softmax: [0.721, 0.265, 0.013]

Difference between top two: 0.456

Key Insight: Exponentiation amplifies differences. A score of 5 is way more confident than 4! Softmax captures that.
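
Side by side in code (NumPy assumed):

```python
import numpy as np

z = np.array([5.0, 4.0, 1.0])

plain = z / z.sum()                   # naive normalization
soft  = np.exp(z) / np.exp(z).sum()   # softmax

print(np.round(plain, 3))             # [0.5   0.4   0.1  ]
print(np.round(soft, 3))              # [0.721 0.265 0.013]
```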

Derivative (For Backpropagation)

The gradient of softmax is elegant but tricky:

$$\frac{\partial \text{softmax}(z_i)}{\partial z_j} = \begin{cases} \text{softmax}(z_i)(1 - \text{softmax}(z_i)) & \text{if } i = j \\ -\text{softmax}(z_i) \cdot \text{softmax}(z_j) & \text{if } i \neq j \end{cases}$$
Why it matters: When softmax is combined with cross-entropy loss, the gradient of the loss with respect to the logits simplifies beautifully to \(\hat{y} - y\) (predicted probabilities minus the one-hot truth). This is why softmax + cross-entropy is the standard pairing for classification!
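
The simplification can be checked numerically. A sketch (NumPy assumed; the logits and one-hot target are chosen for illustration) comparing the analytic gradient \(\hat{y} - y\) against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))   # y is a one-hot target

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])                # true class: index 0

analytic = softmax(z) - y                    # the claimed gradient, y_hat - y

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.round(analytic, 6))                 # ≈ [-0.341  0.242  0.099]
print(np.round(numeric, 6))                  # matches the analytic gradient
```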

Common Gotchas

1. Numerical Stability

❌ Large \(z_i\) → \(e^{z_i}\) overflows (infinity)

✅ Subtract max before exponentiating:

$$\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_{j} e^{z_j - \max(z)}}$$

This is mathematically equivalent but numerically stable!
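
A quick demonstration of the failure and the fix (NumPy assumed):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # exp(1000) overflows to inf
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))     # largest exponent is now 0: no overflow
    return e / e.sum()

z = np.array([1000.0, 999.0, 998.0])

print(naive_softmax(z))                  # [nan nan nan] plus an overflow warning
print(np.round(stable_softmax(z), 3))    # [0.665 0.245 0.09] -- same as softmax([2, 1, 0])
```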

2. Don't Double-Apply!

❌ softmax(softmax(logits)) — applying softmax twice squashes the distribution toward uniform (see the sketch below).

Many frameworks provide a cross-entropy loss that applies softmax internally and therefore expects raw logits (for example, PyTorch's nn.CrossEntropyLoss or TensorFlow's tf.nn.softmax_cross_entropy_with_logits), so don't add your own softmax in front of it.
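
A sketch of what double-applying does (NumPy assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.3, 1.5, 0.8])

once  = softmax(logits)            # ≈ [0.598 0.269 0.133]  <- correct
twice = softmax(softmax(logits))   # ≈ [0.426 0.306 0.268]  <- squashed toward uniform
print(np.round(once, 3), np.round(twice, 3))
```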

3. Binary Classification?

For 2 classes, use sigmoid instead: it is simpler and mathematically equivalent to a 2-class softmax (shown below).

Reserve softmax for \(K \geq 3\) classes.
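
The equivalence is one line of algebra: for two logits \(z_1, z_2\), divide the numerator and denominator of the softmax by \(e^{z_1}\):

$$\frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \sigma(z_1 - z_2)$$

where \(\sigma\) is the sigmoid function, so a single sigmoid over the score difference carries the same information as a 2-class softmax.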

Try It Yourself

Exercise 1: Calculate manually

Logits: [1.0, 2.0, 3.0] → Softmax probabilities = ?

(Hint: \(e^1 \approx 2.72\), \(e^2 \approx 7.39\), \(e^3 \approx 20.09\))

Exercise 2: Temperature experiment

For logits [2, 1, 0], calculate softmax at \(T=0.1\), \(T=1\), and \(T=10\). What pattern do you see?

Exercise 3: Prove stability

Show that \(\text{softmax}(z) = \text{softmax}(z - c)\) for any constant \(c\)

(This is why subtracting max works!)