Formula Deep-Dive: Binary Cross-Entropy Loss

The Formula

$$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
Where:
  • \(y_i\) = actual label (0 or 1)
  • \(\hat{y}_i\) = predicted probability (between 0 and 1)
  • \(n\) = number of samples
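
In code, the formula is a one-liner. Here's a minimal NumPy sketch (the function name and the toy batch are just ours for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Mean of -(y*log(y_hat) + (1-y)*log(1-y_hat)) over the batch
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))  # ~0.228
```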

Why Do We Need This?

The Problem

You're training a binary classifier (cat vs. dog). Your model outputs a probability: 0.8 means "80% sure it's a cat". How do you measure how wrong you are?

❌ Bad Idea: Use Squared Error \((y - \hat{y})^2\)

  • Actual label: \(y = 1\) (cat)
  • Prediction: \(\hat{y} = 0.1\) (only 10% confident)
  • Error: \((1 - 0.1)^2 = 0.81\)

But if \(\hat{y} = 0.9\):

  • Error: \((1 - 0.9)^2 = 0.01\)

The Problem: The gradient of squared error with respect to \(\hat{y}\) is \(2(\hat{y} - y)\). When you're very wrong (0.1 vs 1.0), its magnitude is 1.8; when you're close (0.9 vs 1.0), it's 0.2. Worse, that magnitude can never exceed 2, no matter how confidently wrong the model is. When you're very wrong, you should learn FAST! But squared error's learning signal saturates instead of growing (see the short check below).
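
Here's a small illustration of that saturation, with made-up predictions for a positive example (\(y = 1\)):

```python
import numpy as np

y_hat = np.array([0.5, 0.1, 0.01, 0.001])   # increasingly wrong (true label is 1)
loss = (1 - y_hat) ** 2                      # squared error
grad = -2 * (1 - y_hat)                      # d(loss)/d(y_hat): magnitude never exceeds 2

for p, l, g in zip(y_hat, loss, grad):
    print(f"y_hat={p:<6} loss={l:.3f} gradient={g:.3f}")
```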

Where Cross-Entropy Comes From

Step 1: Maximum Likelihood

Your model outputs probability \(\hat{y}\). The likelihood of the true label \(y\) is:

  • If \(y = 1\): likelihood = \(\hat{y}\)
  • If \(y = 0\): likelihood = \(1 - \hat{y}\)

We can write both cases as:

$$P(y|\hat{y}) = \hat{y}^y \cdot (1-\hat{y})^{(1-y)}$$

Check it:

  • If \(y=1\): \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\) ✓
  • If \(y=0\): \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\) ✓
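
The same check in Python, with an arbitrary prediction of 0.75:

```python
def likelihood(y, y_hat):
    # y_hat**y * (1 - y_hat)**(1 - y) picks out the right branch
    return y_hat**y * (1 - y_hat)**(1 - y)

print(likelihood(1, 0.75))   # 0.75 -> equals y_hat when y = 1
print(likelihood(0, 0.75))   # 0.25 -> equals 1 - y_hat when y = 0
```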

Step 2: Maximize Log-Likelihood

For all \(n\) samples (assumed independent), the total likelihood is:

$$\prod_{i=1}^{n} \hat{y}_i^{y_i} \cdot (1-\hat{y}_i)^{(1-y_i)}$$

Products of many probabilities are awkward to work with and tend to underflow. Take the log (it's monotonic, so maximizing the log is the same as maximizing the original):

$$\log L = \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
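
Taking the log isn't just algebraic convenience: the raw product of many likelihoods underflows to zero in floating point, while the sum of logs stays finite. A quick demonstration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)               # random 0/1 labels
y_hat = rng.uniform(0.05, 0.95, size=2000)      # random predicted probabilities

likelihoods = y_hat**y * (1 - y_hat)**(1 - y)   # per-sample P(y | y_hat)
print(np.prod(likelihoods))        # 0.0 -- the product underflows
print(np.log(likelihoods).sum())   # a large negative, but perfectly finite, number
```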

Step 3: Turn Into a Loss

We want to minimize loss (not maximize). Flip the sign and average:

$$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

That's cross-entropy loss!
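
As a sanity check, the formula we just derived should agree with a library implementation. Here's a comparison against PyTorch's binary_cross_entropy (assuming PyTorch is the framework in use; toy numbers):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_hat = torch.tensor([0.9, 0.2, 0.7, 0.4])

manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
library = F.binary_cross_entropy(y_hat, y)   # expects probabilities, averages by default

print(manual.item(), library.item())         # the two values agree (~0.299)
```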

Why It Works Better

For one sample where \(y=1\):

$$L = -\log(\hat{y})$$

Taking the derivative with respect to \(\hat{y}\):

$$\frac{\partial L}{\partial \hat{y}} = -\frac{1}{\hat{y}}$$

| Prediction \(\hat{y}\) | Squared-error gradient (w.r.t. \(\hat{y}\)) | Cross-entropy gradient |
| --- | --- | --- |
| 0.1 (very wrong) | -1.8 | -10.0 |
| 0.5 (unsure) | -1.0 | -2.0 |
| 0.9 (almost right) | -0.2 | -1.1 |

The magic: When you're very wrong (0.1), the cross-entropy gradient is massive (-10), more than five times what squared error gives you, and it grows without bound as \(\hat{y} \to 0\). You learn FAST. When you're close (0.9), the gradient is smaller. You fine-tune.
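
You can reproduce the table with a few lines of NumPy; the finite-difference estimate (step size is an arbitrary choice) confirms the \(-1/\hat{y}\) gradient:

```python
import numpy as np

eps = 1e-6
for y_hat in (0.1, 0.5, 0.9):
    mse_grad = -2 * (1 - y_hat)       # d/dy_hat of (1 - y_hat)^2
    ce_grad = -1 / y_hat              # d/dy_hat of -log(y_hat)
    ce_numeric = (-np.log(y_hat + eps) + np.log(y_hat - eps)) / (2 * eps)
    print(f"y_hat={y_hat}: squared-error grad {mse_grad:+.2f}  "
          f"cross-entropy grad {ce_grad:+.2f} (finite diff {ce_numeric:+.2f})")
```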

Concrete Example

Scenario: Cat Classifier

  • Image 1: Cat (\(y=1\)), model says 0.9 → \(L = -\log(0.9) = 0.105\)
  • Image 2: Cat (\(y=1\)), model says 0.1 → \(L = -\log(0.1) = 2.303\)

Image 2 contributes 22× more to the loss! The model will focus on fixing that mistake.
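
The arithmetic, if you want to run it yourself (natural log throughout):

```python
import math

confident_right = -math.log(0.9)   # Image 1: cat, model says 0.9
confident_wrong = -math.log(0.1)   # Image 2: cat, model says 0.1

print(round(confident_right, 3), round(confident_wrong, 3))
print(round(confident_wrong / confident_right, 1))   # ratio, roughly 22
```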

The Intuition

Cross-entropy measures surprise.

  • 90% confident it's a cat AND it IS a cat → low surprise, low loss
  • 90% confident it's NOT a cat BUT it IS → high surprise, high loss

Borrowed from information theory: "How many bits (or nats, since we use the natural log) do I need to encode this outcome given my possibly-wrong probability distribution?"

Why "Cross" Entropy?

True distribution: \(p\) (real labels: 0 or 1)
Predicted distribution: \(q\) (model's probabilities)

Cross-entropy between \(p\) and \(q\):

$$H(p, q) = -\sum p(x) \log q(x)$$

In our case:

  • \(p(y=1) = y_i\) and \(p(y=0) = 1-y_i\) (one-hot)
  • \(q(y=1) = \hat{y}_i\) and \(q(y=0) = 1-\hat{y}_i\)

Plug in → you get our formula!
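
Here's that specialization spelled out with a toy label and prediction, confirming the general \(H(p, q)\) and the binary formula give the same number:

```python
import numpy as np

y, y_hat = 1, 0.7                    # true label and predicted probability (toy values)
p = np.array([1 - y, y])             # "true" distribution over {0, 1} (one-hot)
q = np.array([1 - y_hat, y_hat])     # model's distribution over {0, 1}

general = -np.sum(p * np.log(q))                              # H(p, q)
binary = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # our formula
print(general, binary)               # identical (~0.357)
```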

Common Gotchas

1. Don't use it with raw logits!

❌ loss = binary_cross_entropy(logits, labels)

✅ loss = binary_cross_entropy(sigmoid(logits), labels)

Better still, use binary_cross_entropy_with_logits, which applies the sigmoid internally and is numerically stable.
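
In PyTorch terms (again assuming that's the framework), the two safe options agree:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (made-up numbers)
labels = torch.tensor([1.0, 0.0, 1.0])

fused = F.binary_cross_entropy_with_logits(logits, labels)   # sigmoid applied internally
manual = F.binary_cross_entropy(torch.sigmoid(logits), labels)

print(fused.item(), manual.item())   # same value; the fused version is preferred
```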

2. Log(0) = -∞!

If \(\hat{y} = 0\) or \(\hat{y} = 1\) exactly, one of the log terms becomes \(\log(0) = -\infty\) and the loss blows up (inf or NaN).

Clip predictions: clip(y_pred, 1e-7, 1-1e-7)
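
A minimal NumPy illustration of why the clip matters; the 1e-7 epsilon matches the snippet above:

```python
import numpy as np

y, y_hat = 1.0, 0.0                  # true cat, model is certain it isn't
print(-np.log(y_hat))                # inf (plus a divide-by-zero warning)

eps = 1e-7
y_hat_clipped = np.clip(y_hat, eps, 1 - eps)
print(-np.log(y_hat_clipped))        # ~16.1: large, but finite and usable
```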

3. Multi-class is different

Categorical cross-entropy pairs a softmax over \(K\) classes with the negative log-probability of the true class; the binary formula above is the \(K = 2\) special case (see the sketch below).
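
A rough sketch of the multi-class version, with made-up logits, just to show its shape: softmax over the classes, then the negative log-probability of the true class:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(true_class, logits):
    # negative log-probability the model assigns to the correct class
    return -np.log(softmax(logits)[true_class])

logits = np.array([1.2, -0.3, 0.5])              # one sample, three classes
print(categorical_cross_entropy(0, logits))      # ~0.54
```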

Try It Yourself

Exercise 1: Calculate manually

  • \(y = 1\), \(\hat{y} = 0.7\) → \(L = \) ?
  • \(y = 0\), \(\hat{y} = 0.3\) → \(L = \) ?
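
A tiny helper to check your answers after working them out by hand (natural log, as everywhere above):

```python
import math

def bce_single(y, y_hat):
    # per-sample binary cross-entropy
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(bce_single(1, 0.7), 3))
print(round(bce_single(0, 0.3), 3))
```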

Exercise 2: Think about it

Why does the formula have two terms if only one "fires" at a time?

Exercise 3: Derive it

Find the gradient \(\frac{\partial L}{\partial \hat{y}}\) for the case when \(y=0\)