# The Case Against Local Realism

Realism is the principle that the world exists independent of any observer.  If a tree falls in a forest with no one around to hear it, it still makes a sound.  Locality is the principle that an occurrence can only directly affect its immediate surroundings.  Its effects can eventually propagate far and wide, but this propagation has a speed limit, specifically the speed of light.  Local realism is the logical conjunction of locality and realism.

Locality and realism are both very appealing and intuitive properties for a physical theory to have.  The problem is that local realism has been empirically disproven, over and over again.  This is pretty disconcerting, but the first rule of science is ideas are tested by experiment.  No way around it.  This post explains what the Bell test experiments are, and why the results of those experiments are incompatible with local realism.  This explanation does not rely on an understanding of quantum mechanics, or actually on basically any physics at all.  (It does use a bit of math though.)

## Setup

A device of some kind — let’s call it a particle generator — periodically emits a pair of particles that quickly zoom off in opposite directions.  For the argument we want to make, it doesn’t matter what the particle generator is, or what exactly the particles are.  Just two particles, zooming away from each other, that we have reason to believe may be related to each other in some way.  The particles head towards two distant locations, say A and B.  At each of A and B there is another device — let’s call these detectors.  When the particle gets to the detector, it goes through it and interacts with it in some way, and the detector records something about how it was affected by the particle.  This process is often called a measurement.  Again, the details don’t matter for the argument we need to make.  The detector records something that has something to do with the particle; that’s all we need to know.  (Note that in discussions of quantum mechanics, the word “measure” is often used in a confusing way.  In what follows, “measure” means simply that the particle passed through the detector and the detector recorded something.)

While the particles are en route, the detector at A is pseudorandomly set to one of two configurations, called a and a’.  The one at B is simultaneously and independently pseudorandomly set to b or b’.  These detector settings happen in time to affect their own measurement, but too late and too far away to affect the other detector’s measurement of the corresponding particle (assuming locality).  There are two possibilities for the recorded measurement, called + and -.

In the experiment, the particle generator generates many pairs of particles, and we look at the relative frequencies of A and B getting different combinations of + and -, conditioned on the settings of a or a’, b or b’.  In 1964, John Bell proved Bell’s Theorem, which states that if local realism is true, some inequalities have to hold between these relative frequencies, and that in some situations, quantum mechanics predicts that these inequalities will be violated.  For the case against local realism, we don’t care about the predictions of quantum mechanics.  We only care that under local realism, the inequalities must hold, and that empirically, they do not.

## The CHSH Inequality

### Statement

Many of the Bell test experiments demonstrate the violation of a variant of Bell’s original inequality called the CHSH inequality (named after the initials of its developers).  The inequality says that, assuming local realism is correct, there is a relationship that must hold on the frequencies with which the two detectors record the same measurement as each other, both + or both -, between the four combinations of detector settings, (a, b), (a, b’), (a’, b), and (a’, b’).  We define $E(a, b)$ to be, looking only at the measurements where the detectors were in the states a and b, the probability that the measurements are the same, minus the probability of them being different.  So if, whenever the detectors were set to a and b, they always recorded either both + or both -, $E(a, b) = 1$.  If (a, b) always leads to +- or -+, $E(a, b) = -1$.  CHSH says:

$E(a, b) + E(a', b) + E(a, b') - E(a', b') \le 2$

Defining $M(a, b)$ as the probability of mismatch (+- or -+) given settings a and b, the probability of a match is $1 - M(a, b)$, and so $E(a, b) = (1 - M(a, b)) - M(a, b) = 1 - 2M(a, b)$. (The difference between probabilities and proportions of results is discussed below in Statistics.) CHSH can therefore be rewritten as

$1 + 1 + 1 - 1 - 2(M(a, b) + M(a', b) + M(a, b') - M(a', b')) \le 2$,

which is equivalent to saying that

$M(a, b) + M(a', b) + M(a, b') \ge M(a', b')$.

### Assumption of Local Realism

We’re going to prove that, assuming local realism, the CHSH inequality must hold.  The assumptions of locality and realism come into play separately.  In the proof, we consider a pair of particles heading towards the two detectors.  Realism comes into play when we consider the probabilities of a detector recording + for a particle, under both possible settings of the detector.  Realism asserts that the particle exists, and either measurement could theoretically be performed on it, so it makes sense to talk about both detector settings for the same particle, even though only one is actually applied.  This point of view specifically is called counterfactual definiteness: experiments that were not performed can still be reasoned about.

The assumption of locality comes into play when we look at the simultaneous measurement of one pair of particles by the two detectors.   Because of the timing and distance between the detectors, locality says that there is no way one measurement can affect the other one.  This means that the simultaneous recordings of + or – by the two detectors must be independent of each other.  Below, we use the fact that when two outcomes are independent, the probability of both happening is the product of the probabilities of the two outcomes.

### Proof

We first prove that the CHSH inequality holds for a specific pair of particles, and then later for the full space of particle pairs that might be emitted by the particle generator.  Imagine a pair of particles moving towards the two detectors.  For the particle going towards A, there must be some probability of + if the detector is set to a, and some probability of + if the detector is set to a’.  We call these probabilities p and p’. And let the probabilities of b and b’ recording + for the particle going towards B be q and q’, respectively.

Even though the two particles may be related to each other in some way, once they are spatially separated from each other, measurements on the two particles must be independent.  $M(a, b)$ is the probability of +-, $p(1-q)$, plus the probability of -+, $q(1-p)$.  Similarly, $M(a', b) = p'(1-q) + q(1-p')$, and so on.  This means that we can rewrite $M(a, b) + M(a', b) + M(a, b') \ge M(a', b')$ as:

$p(1-q) + q(1-p) + p'(1-q) + q(1-p') + p(1-q') + q'(1-p) \ge p'(1-q') + q'(1-p')$

It turns out that this statement is true whenever $0 \le p, p', q, q' \le 1$.  This is a purely mathematical statement, and it is proven in the appendix.  Next, we consider that the particle generator may emit different pairs of particles at different times.  The value of $M(a, b)$ for the total distribution of particle pairs is the expected value, over the space of particle pairs, of $M(a, b)$ for each pair of particles.  Since CHSH is true for each pair of particles, the inequality must also hold for the total distribution of particle pairs.  This follows from the fact that if X is always greater than or equal to Y, the expected value of X must be greater than or equal to the expected value of Y.

We have proven that $M(a, b) + M(a', b) + M(a, b') \ge M(a', b')$.  As discussed above, this is equivalent to the original CHSH inequality: $E(a, b) + E(a', b) + E(a, b') - E(a', b') \le 2$.

### Symmetrical Results

Since the detector settings were named arbitrarily, there was nothing special about (a’, b’) that enabled us to prove that $E(a, b) + E(a', b) + E(a, b') - E(a', b') \le 2$.  So, it’s also true that $E(a, b) + E(a', b) + E(a', b') - E(a, b') \le 2$, or either of the other combinations.  Also, + and – were named arbitrarily.  Interchanging them at one detector only has the effect of interchanging matches with mismatches, which negates all the E’s.  This means we also know that $-E(a, b) - E(a', b) - E(a, b') + E(a', b') \le 2$, or equivalently, $E(a, b) + E(a', b) + E(a, b') - E(a', b') \ge -2$.

These symmetrically equivalent versions are often combined.  For example, the paper Rowe et al. uses the statement $|E(a, b) - E(a, b')| + |E(a', b) + E(a', b')| \le 2$.  The implications of this, for different signs inside the absolute values, are all symmetrically equivalent versions of the CHSH inequality proved above.

## Empirical Results

There have been many experiments performed of the form described above.  These experiments have consistently found that the inequality being tested (such as CHSH) is empirically violated, demonstrating that local realism is impossible.  The rest of this section discusses a couple of caveats about these experiments.

### Statistics

Above, we defined $E(a, b)$ in terms of the probabilities of the different measurement results.  Although CHSH is sometimes stated in terms of the counts of the different measurement results, all that can be proven is the inequality on the probabilities.  The corresponding inequality on the counts is only guaranteed to be true on average, in the long-run.

When the experiments are run and the counts violate the CHSH inequality, it’s always still theoretically possible that the underlying probabilities satisfy the inequality, and the difference between the two was simply caused by luck.  However, generally, sufficiently many particle pairs are measured that the probability of that happening is extremely unlikely.  That low probability is reported in the papers as the p-value, which in some cases is as low as $10^{-16}$.  When something is that unlikely, we consider it to definitely not have happened, without reservation.  If you’re still looking for a way out, I suggest looking for more loopholes.

### Loopholes

What if not all of the pairs of particles were detected, and the ones that were detected were biased in a way that effected the relative probabilities of the measurement outcomes?  What if one detector was set early enough to affect the other measurement before it happened?  What if information from a common source affected the particle generation and the detector settings?  These what-ifs, and a few others, are the Bell test loopholes.  I won’t go into detail about all the possible loopholes here, but recent experiments have increasingly claimed to close all the loopholes.

I think it’s hard to say with absolute certainty that all possible loopholes are closed, but it seems to me that at this point the evidence is overwhelming that CHSH has been repeatedly violated in a way that is inconsistent with local realism.

## Conclusion

If local realism is wrong, what are we left with?  One answer is to look to the interpretations of quantum mechanics for guidance.  The most common one, the Copenhagen interpretation, sacrifices both locality and realism.  The violation of locality occurs in the assumption of wavefunction collapse, which happens instantaneously over large distances.  As for realism, the Copenhagen interpretation treats “measurement” as a first-class concept, but asserts that the property being measured didn’t have a value until the measurement happened.  This is certainly strange, but perhaps not as overtly anti-realist as, say, the “consciousness causes collapse” interpretation.

In any case, the rejection of local realism doesn’t require one to accept the Copenhagen interpretation of quantum mechanics.  There are many other interpretations, although none of them have gained widespread acceptance.  Nevertheless, there is a lot of philosophical space between the observer-dependent Copenhagen interpretation and the sense of local realism that is ruled out by the argument above.  Perhaps a middle ground can be found: an interpretation that feels philosophically coherent (no pun intended) without adhering strictly to the requirements of local realism.

## Appendix

We’ll prove that for all $0 \le p, q, p', q' \le 1$,

$p(1-q) + q(1-p) + p'(1-q) + q(1-p') + p(1-q') + q'(1-p) \ge p'(1-q') + q'(1-p')$

Note that all six terms on the left are always nonnegative.  We look at cases.  Case 1: $p \ge p'$ and $q \ge q'$.  Then LHS terms $p(1-q')$ and $q(1-p')$ are $\ge$ RHS terms $p'(1-q')$ and $q'(1-p')$ respectively.  Case 2: $p \le p'$ and $q \le q'$.  Then LHS terms $p'(1-q)$ and $q'(1-p)$ are $\ge p'(1-q')$ and $q'(1-p')$ respectively.

Case 3 is that $p \le p'$ and $q \ge q'$.  For this we use three of the LHS terms: $q(1-p)$, $p'(1-q)$, and $p(1-q')$.  First,

\begin{aligned} q(1-p) &= q'(1-p') + (q-q')(1-p') + q'(p'-p) + (q-q')(p'-p)\\ & \ge q'(1-p') + (q-q')(p'-p)\end{aligned}.

We also know that $p'(1-q) \ge (p'-p)(1-q)$ and $p(1-q') \ge p(1-q')$.  Adding these three inequalities gives:

\begin{aligned} \textrm{LHS} & \ge q'(1-p') + (q-q')(p'-p) + (1-q)(p'-p) + p(1-q')\\ & = q'(1-p') + (1-q')(p'-p) + (1-q')p \\ &= q'(1-p') + p'(1-q') \\ &= \textrm{RHS} \end{aligned}

Case 4 is $p \ge p'$ and $q \le q'$.  The proof is the same as in case 3, but with all instances of p and q interchanged (and p’ and q’).

# Understanding Lagrange Multipliers

Lagrange multipliers are a method for locally minimizing or maximizing a function, subject to one or more constraints. For example, find the values of $x$ and $y$ that make $x^2 + y^2$ as small as possible, while satisfying the constraint $xy = 1$. What follows is an explanation of how to use Lagrange multipliers and why they work.  The explanation (especially of why they work) is much simpler with only one constraint, so we’ll start with an intuitive argument in that case and then move on to a slightly more rigorous argument in the case with multiple constraints.  Both cases require some familiarity with partial derivatives and vectors.

### One Constraint

Say we want to minimize a function $f(x, y)$, subject to the constraint $g(x, y) = 0$.  (For now we’ll minimize a function of two variables, but the method extends naturally to functions of more variables, and the procedure for maximization is exactly the same.)  For the example above, $f(x, y) = x^2 + y^2$ and $g(x, y) = xy - 1$ (becasue $xy - 1 = 0$ is equivalent to $xy = 1$).  To solve the problem, we create a new function, $L(x, y, \lambda)$, called the Lagrangian.  It is defined as $L(x, y, \lambda) = f(x, y) - \lambda g(x, y)$, where $\lambda$ is a new variable called the Lagrange multiplier.  (Some people use a $+$ instead of a $-$, but it doesn’t change anything except for the sign of $\lambda$.) Then, we take the partial derivatives of $L$ with respect to the original variables, $x$ and $y$, and also $\lambda$.  We set all those partials equal to zero, and solve for $x$ and $y$. For the example given above, we have:

$\begin{array}{rcl} L(x, y, \lambda) & = & x^2 + y^2 - \lambda (xy - 1) \\ \partial L / \partial x & = & 2x - \lambda y \\ \partial L / \partial y & = & 2y - \lambda x \\ \partial L / \partial \lambda & = & -(xy - 1) \end{array}$

Setting all the partials equal to zero and solving, we get that $\lambda = 2$ and $(x, y)$ is either $(1, 1)$ or $(-1, -1)$.  Indeed, these points are local minima of $f(x, y)$, subject to the constraint that $g(x, y) = 0$.  (In this case they are also the global minima, although in general that is not guaranteed.)

Why do Lagrange multipliers work?  It is tempting to think that we are finding a constrained minimum of $f$ by finding an unconstrained minimum of $L$; after all, we are setting all the partials of $L$ equal to zero.  But this is wrong.  The solution $(x, y, \lambda)$ turns out to actually never be a local minimum or maximum for $L$; it’s always a saddle point for $L$.  So why do Lagrange multipliers work?  The answer has to do with level curves and gradients.

A level curve of a function is the set of points for which the function equals a constant value.  (In three dimensions they’re called level surfaces, and more generally, level sets.)  In the picture above, the red circles are some of the level curves of $f(x, y) = x^2 + y^2$ (larger circles correspond to higher values of $f$).  The blue curve is the set of points that satisfy the constraint $g(x, y) = 0$, making it a level curve of $g(x, y)$.

Constrained minimization seeks the point on the blue curve that is the lowest in the red valley: the valley where each red curve represents a different height. It turns out that this point will always be one where the red and blue level curves are tangent to each other, such as the black point in the picture above.  To see why, it helps to look at a point where the curves are not tangent, such as the orange point.  Since the red and blue level curves aren’t tangent to each other there, they cross each other.  Since they cross, moving one direction along the blue curve moves up in the red valley, and moving the other direction moves down.  In other words, there is a local change to $(x, y)$ that preserves the constraint $g(x, y) = 0$ and causes the objective function $f(x, y)$ to decrease, and the opposite change causes it to increase.  There’s no way it can be a constrained minimum or maximum. Since the minimum can’t be at a point where the level curves cross, it has to be at a point where they are tangent.  There, as you move along the curve of $g(x, y) = 0$, $f(x, y)$ holds constant to a first order approximation, so it could be a minimum or a maximum.  (It still might not be either, but at least it could be.)

How do we express mathematically the statement that the level curves of $f$ and $g$ are tangent?  This is where the gradients come in.  The gradient of a function $f$ is written as $\nabla f$ and defined as a vector where each component is the partial derivative of $f$ with respect to the corresponding variable: $\nabla f = (\partial f / \partial x, \partial f / \partial y)$.  The gradient has a property that will be very useful to us: the gradient of $f$ is always perpendicular to the level curve of $f$ at a given $(x, y)$. To see why this is true, say that $(\Delta x, \Delta y)$ is a small amount by which we wish to change $(x, y)$.  How will it change $f(x, y)$?  To first order, the answer is that:

$f(x + \Delta x, y + \Delta y) \approx f(x, y) + \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y$

If $(\Delta x, \Delta y)$ is in the direction of the level curve, we will have $f(x + \Delta x, y + \Delta y) \approx f(x, y)$, which means that $\frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y} \Delta y = 0$.  In other words, the dot product of $\nabla f$ and $(\Delta x, \Delta y)$ is zero.  Since $(\Delta x, \Delta y)$ is in the direction of the level curve, this means that the level curve and the gradient are perpendicular to each other.

Since we want the level curves of $f$ and $g$ to be tangent to each other, and each function’s gradient is perpendicular to its level curve, we want the gradients to point in the same direction (or in opposite directions).  Two vectors point in the same direction if one is a scalar multiple of the other.  This scalar will turn out to be the Lagrange multiplier, so we’ll call it $\lambda$: $\nabla f = \lambda \nabla g$.  This vector equation expands into one equation for each of $x$ and $y$.  Adding the constraint itself to our set of equations, we get:

$\boxed{\begin{array}{rcl}\frac{\partial f}{\partial x}&=&\lambda \frac{\partial g}{\partial x}\\ \frac{\partial f}{\partial y}&=&\lambda \frac{\partial g}{\partial y} \\ g(x, y) & = & 0 \end{array}}$

The method of Lagrange multipliers is to produce these equations and solve them.  However, it is usually described in terms of the Lagrangian function $L(x, y, \lambda) = f(x, y) - \lambda g(x, y)$.  Setting the partial derivatives of the Lagrangian with respect to $x$, $y$, and $\lambda$ equal to zero reproduces exactly the boxed equations above. For instance, $\partial L / \partial x = \partial f / \partial x - \lambda \partial g / \partial x$.  Setting this equal to zero gives you the first equation in the box.  For the last equation, $\partial L / \partial \lambda = -g(x, y)$.  If you prefer, you don’t need to worry about the Lagrangian at all; you just need to find a point $(x, y)$ and a scalar $\lambda$, such that $\nabla f = \lambda \nabla g$ and the constraint is satisfied.  Such points will be your candidates for the constrained minimum (or maximum).

To summarize, constrained minima and maxima occur at points where the level curve of the constraint function $g$ is tangent to the level curve of the objective function $f$.  This happens when $\nabla f = \lambda \nabla g$, giving us the equations in the box above.  The Lagrangian function $L$ unifies those equations together in terms of the partial derivatives of one function.

### Multiple Constraints

Now we’ll look at the general case, where we are minimizing a function of $n$ variables, $f(x_1, ..., x_n)$, and we have $k$ constraints: $g_i(x_1, ..., x_n) = 0$, where $i$ goes from $1$ to $k$.  The Lagrangian $L$ is defined as above, except with a different $\lambda$ for each constraint:

$L(x_1, ..., x_n, \lambda_1, ..., \lambda_k) = f(x_1, ..., x_n) - \sum_{j=1}^k \lambda_j g_j(x_1, ..., x_n)$

(We’ll use $j$ for an index into the $k$ constraints and $i$ for an index into the $n$ variables $x_i$.)  The constrained minimum occurs at a point where all $n+k$ of the partial derivatives of $L$ are zero.

We’ll use a relatively trivial example that nevertheless illustrates the method: we are trying to minimize $x_1^2 + x_2^2 + x_3^2$ subject to two constraints: $x_1 = 1$ and $x_2 = 1$.  The Lagrangian is $x_1^2 + x_2^2 + x_3^2 - \lambda_1 (x_1 - 1) - \lambda_2 (x_2 - 1)$, and its partial derivatives are:

$\begin{array}{rcl} \partial L / \partial x_1 &=& 2 x_1 - \lambda_1 \\ \partial L / \partial x_2 &=& 2 x_2 - \lambda_2 \\ \partial L / \partial x_3 &=& 2 x_3 \\ \partial L / \partial \lambda_1 &=& -(x_1 - 1) \\ \partial L / \partial \lambda_2 &=& -(x_2 - 1) \end{array}$

Setting them all equal to zero yields the solution: $(x_1, x_2, x_3) = (1, 1, 0)$ and $\lambda_1 = \lambda_2 = 2$.

In the one constraint case, the Lagrange multiplier equations are equivalent to the vector equation $\nabla f = \lambda \nabla g$.  With multiple constraints, the corresponding vector equation is:

$\nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j$

This equivalence follows from the fact that component $i$ of the above equation says that $\partial f / \partial x_i = \sum_{j=1}^k \lambda_j \partial g_j / \partial x_i$, while $\partial L / \partial x_i = \partial f / \partial x_i - \sum_{j=1}^k \lambda_j \partial g_j / \partial x_i$, so setting the partial $\partial L / \partial x_i$ equal to zero for each $i$ from $1$ to $n$ is equivalent to the vector equation above.  Note that setting the partials $\partial L / \partial \lambda_j$ equal to zero simply enforces the constraints $g_j(x_1, ..., x_n) = 0$.

In the one constraint case, it is useful to find points where $\nabla f = \lambda \nabla g$ because that implies that the level curves of $f$ are tangent to the level curves of $g$.  In the multiple constraints case, it’s not so simple.  The level surfaces of the individual $g_i$ are not necessarily tangent to the level surface of $f$ at the constrained minimum.  (In the example above, the constraint plane $x_1 = 1$ intersects the level surface of $f$, the sphere $x_1^2 + x_2^2 + x_3^2 = 2$, at the optimal point, $(1, 1, 0)$.)  Instead, it turns out that the intersection of all the level surfaces of the $g_i$ is tangent to the level surface of $f$.

To understand this better, let’s look directly at the effect of a small perturbation to $x$ on both $f$ and the constraint functions $g_j$.  Writing $x = (x_1, ..., x_n)$ and letting $\Delta x = (\Delta x_1, ..., \Delta x_n)$, we have:

$f(x + \Delta x) - f(x) \approx \sum_{i=1}^n \Delta x_i \frac{\partial f}{\partial x_i}$

If there were a change to $x$, $\Delta x$, that kept all of the constraint functions $g_j$ the same (to first order) and caused non-zero change to $f(x)$, then $x$ could not possibly be a constrained local minimum, because moving in some direction along the intersection of the level curves of all the $g_j$ would cause $f$ to decrease.  So the condition we want is:

1. For every possible $\Delta x$ such that $\nabla g_j(x) \perp \Delta x$ for every $j$ from $1$ to $k$, $\nabla f(x) \perp \Delta x\textrm{.}$

For each $j$ such that $\nabla g_j(x) \perp \Delta x$, $\Delta x$ is in a direction tangent to the level surface of $g_j$.  If it is true for every $j$, $\Delta x$ is in a direction tangent to the intersection of the level surfaces of the all the $g_j$.  Condition 1 says that any such $\Delta x$ is also tangent to the level surface of $f$.  On the other hand, the condition enforced by Lagrange multipliers is:

2. There exist $\lambda_1$ through $\lambda_k$ such that $\nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j$.

Happily, it turns out that conditions 1 and 2 are equivalent to each other.

Theorem: Conditions 1 and 2 are equivalent to each other.

Proof:  (Why do a rigorous proof now when we’ve been waving our hands all day?  Sometimes the most intuitive argument for something is a proof; perhaps this is such a case.)  First we’ll show that 2 implies 1.  Assume that $\nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j$, and that $\nabla g_j(x) \perp \Delta x$ for every $j$ from $1$ to $k$.  Since $\nabla g_j(x) \perp \Delta x$, $\nabla g_j(x) \cdot \Delta x = 0$.  Therefore $\nabla f \cdot \Delta x = (\sum_{j = 1}^k \lambda_j \nabla g_j) \cdot \Delta x = \sum_{j = 1}^k \lambda_j \nabla g_j \cdot \Delta x = 0$.  This implies that $\nabla f \perp \Delta x$.

Next we’ll show that 1 implies 2, or equivalently, that not 2 implies not 1.  The negation of 2 says that $\nabla f$ is not in the span of the $k$ vectors $\nabla g_j$.  Let’s denote by $z$ the projection of $\nabla f$ into the span of the $\nabla g_j$, and set $\Delta x = \nabla f - z$.  We will show that $\Delta x$ is a counterexample establishing the negation of condition 1.

We know that $\nabla f \neq z$, so $\Delta x \neq 0$, and by the definition of projection, $\Delta x \perp g_j$ for each $j$.  Additionally, $\nabla f \cdot \Delta x = (z + \Delta x) \cdot \Delta x = z \cdot \Delta x + \Delta x \cdot \Delta x$,  Since $z$ is in the span of the $g_j$, all of which are perpendicular to $\Delta x$, $z \perp \Delta x$ as well.  Adding $z \cdot \Delta x = 0$ and $\Delta x \cdot \Delta x \neq 0$, we get $(z + \Delta x) \cdot \Delta x \neq 0$.  That is, $\nabla f \cdot \Delta x \neq 0$, so $\nabla f$ is not perpendicular to $\Delta x$.  Since $\Delta x$ is perpendicular to all of the $\nabla g_j$, but not to $\nabla f$, condition 1 is violated.  We assumed the negation of condition 2, so this shows that not 2 implies not 1, or equivalently, that 1 implies 2.  $\square$

Now we’ve seen that Lagrange multipliers work by finding an $x$ and a set of $\lambda_j$ such that $\nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j$.  This equation implies that there is no local change to $x$ that, to first order, preserves all the constraints but changes the objective function $f$.  Therefore the values of $x$ that satisfy the equation are candidates for a constrained local minimum or maximum.  Solving this vector equation is equivalent to setting the partial derivatives of the Lagrangian function to zero.

I hope that sheds some light on how and why Lagrange multipliers work, and makes them seem a bit less like a magic trick.

# Knight and Bishop Checkmate on a Large Board

The bishop and knight checkmate against a lone king occurs extremely rarely in normal chess games.  (I have never encountered it.)  But I see it as an interesting case study on the abilities of the pieces.  Are the knight, bishop, and king strong enough to corral the opposing king into the corner?  Indeed they are, and the crux of the most common method is a procedure called a “W-maneuver”.

Playing through the W-maneuver has a very tenuous feel to it, with the defending king on the brink of escape at one point.  This makes me want to kick its tires a little, to see if the bishop and knight checkmate is “almost impossible”.  One way to do this is to see if the mate can be forced on a larger board, say 10×10.  What about 100×100?  Someone asked exactly this question on chess.stackexchange, and I used an endgame tablebase program I wrote to answer it.

# Which Has More Volume, a Dodecahedron or Icosahedron?

### Platonic Solids

Which has more volume, a dodecahedron or an icosahedron, both having the same edge length?  (The same question can be asked of the cube and octahedron, and the following discussion applies just as well to them.) It’s tempting to think that the icosahedron is bigger, because it has more faces (20 to the dodecahedron’s 12).  I think it is also natural to think that they are very close in size.  In fact, the dodecahedron has about 3.5 times more volume.

People sometimes think that math is about manipulating strings of symbols according to formal rules, but really it’s about figuring out which intuitions are right and why others are wrong. Before we figure out some good intuitions for the relative volumes of polyhedra, let’s look at the numbers for the Platonic solids, assuming an edge length of 1.  (If the edge length is actually $a$, multiply the given volume by $a^3$.)

Name Faces Edges Vertices Volume
Tetrahedron 4 6 4 0.118
Octahedron 8 12 6 0.471
Cube 6 12 8 1.000
Icosahedron 20 30 12 2.182
Dodecahedron 12 30 20 7.663

One reason we might think the icosahedron and dodecahedron are roughly the same volume is that we might picture them with the same diameter. Since they are both pretty spherical, they would have roughly the same volume in that case.  For example, see Wikipedia’s stock photos of the two solids:

However, when the two shapes are the same volume, the dodecahedron’s edges are about 34% shorter. If we expand the dodecahedron to make its edges the same length as the icosahedron’s, that increases the volume by a factor of $(1/0.66)^3 \approx 3.5$.

In fact, this turns out to be a pretty good way to think about volumes of polyhedra in general. Instead of asking what the relative volumes are for the same edge length, ask what the relative edge lengths are for the same volume. Smaller edge length for the same volume means larger volume for the same edge length.

So how can we get a handle on relative edge lengths for the same volume? One useful intuition is to picture two polyhedra that are the same volume as same-sized spheres. Which one will have shorter edges? Well, the vertices are pretty evenly spaced over the surface, so perhaps the one with more vertices will have more closely packed vertices, and therefore shorter edges. This suggests that holding the edge length constant, the solids with more vertices will tend to have larger volumes.

Looking back to the table above, this appears to be fairly accurate. Holding edge length constant, the number of vertices seems to correspond pretty well with volume, certainly better than the number of faces or edges.

Incidentally, why does the number of faces suggest the wrong volume ordering? This is because different faces are different sizes. Sure, the dodecahedron only has 12 faces, but they are pentagons — much bigger than the icosahedron’s 20 triangular faces. A regular pentagon has about four times as much area as an equilateral triangle with the same edge length. So the 12 pentagons are worth about 48 equilateral triangles. Now we also have a strong face-based intuition that points in the right direction: the dodecahedron is significantly larger than the icosahedron.

### Archimedean Solids

For Platonic solids, the ordering by number of vertices is the same as the volume ordering.  Let’s take a look at a more general class of solids.  Archimedean solids have faces that are not all the same, but the faces are still regular polygons, and there is the same configuration of faces at each vertex.  For example a cuboctahedron has two squares and two triangles meeting at each vertex. Here are the 13 Archimedean solids in volume order.

Name Faces Edges Vertices Volume
Cuboctahedron 14 24 12 2.357
Truncated tetrahedron 8 18 12 2.711
Snub cube 38 60 24 7.889
Rhombicuboctahedron 26 48 24 8.714
Truncated octahedron 14 36 24 11.314
Truncated cube 14 36 24 13.600
Icosidodecahedron 32 60 30 13.836
Snub dodecahedron 92 150 60 37.617
Rhombicosidodecahedron 62 120 60 41.615
Truncated cuboctahedron 26 72 48 41.799
Truncated icosahedron 32 90 60 55.288
Truncated dodecahedron 32 90 60 85.040
Truncated icosidodecahedron 62 180 120 206.803

Well it’s not perfect, but vertices still correspond to volumes far better than faces or edges.  Let’s look at the relationship between number of vertices and volume a little more closely.

As you can see, the volume does correspond pretty well with the number of vertices, but there are some glaring problems.  Most notably, look at the four with 60 vertices; there is over a factor of two between the volumes of the biggest (truncated dodecahedron) and smallest (snub dodecahedron).  Above we speculated that for the same volume, more vertices would be more densely packed, leading to shorter edges.  Why does the truncated dodecahedron have shorter edges than the snub dodecahedron for the same volume and same number of vertices?  Let’s take a look.

The snub dodecahedron (left) has 80 triangles and 12 pentagons, and the vertices look pretty well spaced out.  The truncated dodecahedron (right) has 20 triangles and 12 decagons.  The vertices are not well spaced out at all; they’re basically all lined up to make those big empty decagons.  This explains why the truncated dodecahedron has so much volume for a given edge length and number of vertices.  For a given volume, the vertices are very unevenly spaced, making the edges short.

### Four-Dimensional Platonic Solids

Finally, do the intuitions we developed above for three-dimensional solids carry over to four-dimensional solids? Absolutely. Six data points isn’t much, but here are the volumes of the four-dimensional Platonic solids (4-d volumes in this case, so multiply by $a^4$ if the side length is $a$).

Name Cells Faces Edges Vertices Volume
5-cell (simplex) 5 10 10 5 0.023
16-cell (orthoplex) 16 32 24 8 0.167
8-cell (hypercube) 8 24 32 16 1.000
24-cell 24 96 96 24 2.000
600-cell 600 1200 720 120 26.475
120-cell 120 720 1200 600 787.857

Looks good to me. The volumes are in the same order as the number of vertices, and the relatively large volume increases correspond to the relatively large increases in the number of vertices.