Understanding Lagrange Multipliers

Lagrange multipliers are a method for locally minimizing or maximizing a function, subject to one or more constraints. For example, find the values of x and y that make x^2 + y^2 as small as possible, while satisfying the constraint xy = 1. What follows is an explanation of how to use Lagrange multipliers and why they work.  The explanation (especially of why they work) is much simpler with only one constraint, so we’ll start with an intuitive argument in that case and then move on to a slightly more rigorous argument in the case with multiple constraints.  Both cases require some familiarity with partial derivatives and vectors.

One Constraint

Say we want to minimize a function f(x, y), subject to the constraint g(x, y) = 0.  (For now we’ll minimize a function of two variables, but the method extends naturally to functions of more variables, and the procedure for maximization is exactly the same.)  For the example above, f(x, y) = x^2 + y^2 and g(x, y) = xy - 1 (becasue xy - 1 = 0 is equivalent to xy = 1).  To solve the problem, we create a new function, L(x, y, \lambda), called the Lagrangian.  It is defined as L(x, y, \lambda) = f(x, y) - \lambda g(x, y), where \lambda is a new variable called the Lagrange multiplier.  (Some people use a + instead of a -, but it doesn’t change anything except for the sign of \lambda.) Then, we take the partial derivatives of L with respect to the original variables, x and y, and also \lambda.  We set all those partials equal to zero, and solve for x and y. For the example given above, we have:

\begin{array}{rcl} L(x, y, \lambda) & = & x^2 + y^2 - \lambda (xy - 1) \\ \partial L / \partial x & = & 2x - \lambda y \\ \partial L / \partial y & = & 2y - \lambda x \\ \partial L / \partial \lambda & = & -(xy - 1) \end{array}

Setting all the partials equal to zero and solving, we get that \lambda = 2 and (x, y) is either (1, 1) or (-1, -1).  Indeed, these points are local minima of f(x, y), subject to the constraint that g(x, y) = 0.  (In this case they are also the global minima, although in general that is not guaranteed.)

Why do Lagrange multipliers work?  It is tempting to think that we are finding a constrained minimum of f by finding an unconstrained minimum of L; after all, we are setting all the partials of L equal to zero.  But this is wrong.  The solution (x, y, \lambda) turns out to actually never be a local minimum or maximum for L; it’s always a saddle point for L.  So why do Lagrange multipliers work?  The answer has to do with level curves and gradients. Lagrange

A level curve of a function is the set of points for which the function equals a constant value.  (In three dimensions they’re called level surfaces, and more generally, level sets.)  In the picture above, the red circles are some of the level curves of f(x, y) = x^2 + y^2 (larger circles correspond to higher values of f).  The blue curve is the set of points that satisfy the constraint g(x, y) = 0, making it a level curve of g(x, y).

Constrained minimization seeks the point on the blue curve that is the lowest in the red valley: the valley where each red curve represents a different height. It turns out that this point will always be one where the red and blue level curves are tangent to each other, such as the black point in the picture above.  To see why, it helps to look at a point where the curves are not tangent, such as the orange point.  Since the red and blue level curves aren’t tangent to each other there, they cross each other.  Since they cross, moving one direction along the blue curve moves up in the red valley, and moving the other direction moves down.  In other words, there is a local change to (x, y) that preserves the constraint g(x, y) = 0 and causes the objective function f(x, y) to decrease, and the opposite change causes it to increase.  There’s no way it can be a constrained minimum or maximum. Since the minimum can’t be at a point where the level curves cross, it has to be at a point where they are tangent.  There, as you move along the curve of g(x, y) = 0, f(x, y) holds constant to a first order approximation, so it could be a minimum or a maximum.  (It still might not be either, but at least it could be.)

How do we express mathematically the statement that the level curves of f and g are tangent?  This is where the gradients come in.  The gradient of a function f is written as \nabla f and defined as a vector where each component is the partial derivative of f with respect to the corresponding variable: \nabla f = (\partial f / \partial x, \partial f / \partial y).  The gradient has a property that will be very useful to us: the gradient of f is always perpendicular to the level curve of f at a given (x, y). To see why this is true, say that (\Delta x, \Delta y) is a small amount by which we wish to change (x, y).  How will it change f(x, y)?  To first order, the answer is that:

f(x + \Delta x, y + \Delta y) \approx f(x, y) + \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y

If (\Delta x, \Delta y) is in the direction of the level curve, we will have f(x + \Delta x, y + \Delta y) \approx f(x, y), which means that \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y} \Delta y = 0 .  In other words, the dot product of \nabla f and (\Delta x, \Delta y) is zero.  Since (\Delta x, \Delta y) is in the direction of the level curve, this means that the level curve and the gradient are perpendicular to each other.

Since we want the level curves of f and g to be tangent to each other, and each function’s gradient is perpendicular to its level curve, we want the gradients to point in the same direction (or in opposite directions).  Two vectors point in the same direction if one is a scalar multiple of the other.  This scalar will turn out to be the Lagrange multiplier, so we’ll call it \lambda: \nabla f = \lambda \nabla g.  This vector equation expands into one equation for each of x and y.  Adding the constraint itself to our set of equations, we get:

\boxed{\begin{array}{rcl}\frac{\partial f}{\partial x}&=&\lambda \frac{\partial g}{\partial x}\\ \frac{\partial f}{\partial y}&=&\lambda \frac{\partial g}{\partial y} \\ g(x, y) & = & 0 \end{array}}

The method of Lagrange multipliers is to produce these equations and solve them.  However, it is usually described in terms of the Lagrangian function L(x, y, \lambda) = f(x, y) - \lambda g(x, y).  Setting the partial derivatives of the Lagrangian with respect to x, y, and \lambda equal to zero reproduces exactly the boxed equations above. For instance, \partial L / \partial x = \partial f / \partial x - \lambda \partial g / \partial x.  Setting this equal to zero gives you the first equation in the box.  For the last equation, \partial L / \partial \lambda = -g(x, y).  If you prefer, you don’t need to worry about the Lagrangian at all; you just need to find a point (x, y) and a scalar \lambda, such that \nabla f = \lambda \nabla g and the constraint is satisfied.  Such points will be your candidates for the constrained minimum (or maximum).

To summarize, constrained minima and maxima occur at points where the level curve of the constraint function g is tangent to the level curve of the objective function f.  This happens when \nabla f = \lambda \nabla g, giving us the equations in the box above.  The Lagrangian function L unifies those equations together in terms of the partial derivatives of one function.

Multiple Constraints

Now we’ll look at the general case, where we are minimizing a function of n variables, f(x_1, ..., x_n), and we have k constraints: g_i(x_1, ..., x_n) = 0, where i goes from 1 to k.  The Lagrangian L is defined as above, except with a different \lambda for each constraint:

L(x_1, ..., x_n, \lambda_1, ..., \lambda_k) = f(x_1, ..., x_n) - \sum_{j=1}^k \lambda_j g_j(x_1, ..., x_n)

(We’ll use j for an index into the k constraints and i for an index into the n variables x_i.)  The constrained minimum occurs at a point where all n+k of the partial derivatives of L are zero.

We’ll use a relatively trivial example that nevertheless illustrates the method: we are trying to minimize x_1^2 + x_2^2 + x_3^2 subject to two constraints: x_1 = 1 and x_2 = 1.  The Lagrangian is x_1^2 + x_2^2 + x_3^2 - \lambda_1 (x_1 - 1) - \lambda_2 (x_2 - 1), and its partial derivatives are:

\begin{array}{rcl} \partial L / \partial x_1 &=& 2 x_1 - \lambda_1 \\ \partial L / \partial x_2 &=& 2 x_2 - \lambda_2 \\ \partial L / \partial x_3 &=& 2 x_3 \\ \partial L / \partial \lambda_1 &=& -(x_1 - 1) \\ \partial L / \partial \lambda_2 &=& -(x_2 - 1) \end{array}

Setting them all equal to zero yields the solution: (x_1, x_2, x_3) = (1, 1, 0) and \lambda_1 = \lambda_2 = 2.

In the one constraint case, the Lagrange multiplier equations are equivalent to the vector equation \nabla f = \lambda \nabla g.  With multiple constraints, the corresponding vector equation is:

\nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j

This equivalence follows from the fact that component i of the above equation says that \partial f / \partial x_i = \sum_{j=1}^k \lambda_j \partial g_j / \partial x_i, while \partial L / \partial x_i = \partial f / \partial x_i - \sum_{j=1}^k \lambda_j \partial g_j / \partial x_i, so setting the partial \partial L / \partial x_i equal to zero for each i from 1 to n is equivalent to the vector equation above.  Note that setting the partials \partial L / \partial \lambda_j equal to zero simply enforces the constraints g_j(x_1, ..., x_n) = 0.

In the one constraint case, it is useful to find points where \nabla f = \lambda \nabla g because that implies that the level curves of f are tangent to the level curves of g.  In the multiple constraints case, it’s not so simple.  The level surfaces of the individual g_i are not necessarily tangent to the level surface of f at the constrained minimum.  (In the example above, the constraint plane x_1 = 1 intersects the level surface of f, the sphere x_1^2 + x_2^2 + x_3^2 = 2, at the optimal point, (1, 1, 0).)  Instead, it turns out that the intersection of all the level surfaces of the g_i is tangent to the level surface of f.

To understand this better, let’s look directly at the effect of a small perturbation to x on both f and the constraint functions g_j.  Writing x = (x_1, ..., x_n) and letting \Delta x = (\Delta x_1, ..., \Delta x_n), we have:

f(x + \Delta x) - f(x) \approx \sum_{i=1}^n \Delta x_i \frac{\partial f}{\partial x_i}

If there were a change to x, \Delta x, that kept all of the constraint functions g_j the same (to first order) and caused non-zero change to f(x), then x could not possibly be a constrained local minimum, because moving in some direction along the intersection of the level curves of all the g_j would cause f to decrease.  So the condition we want is:

1. For every possible \Delta x such that \nabla g_j(x) \perp \Delta x for every j from 1 to k, \nabla f(x) \perp \Delta x\textrm{.}

For each j such that \nabla g_j(x) \perp \Delta x, \Delta x is in a direction tangent to the level surface of g_j.  If it is true for every j, \Delta x is in a direction tangent to the intersection of the level surfaces of the all the g_j.  Condition 1 says that any such \Delta x is also tangent to the level surface of f.  On the other hand, the condition enforced by Lagrange multipliers is:

2. There exist \lambda_1 through \lambda_k such that \nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j.

Happily, it turns out that conditions 1 and 2 are equivalent to each other.

Theorem: Conditions 1 and 2 are equivalent to each other.

Proof:  (Why do a rigorous proof now when we’ve been waving our hands all day?  Sometimes the most intuitive argument for something is a proof; perhaps this is such a case.)  First we’ll show that 2 implies 1.  Assume that \nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j, and that \nabla g_j(x) \perp \Delta x for every j from 1 to k.  Since \nabla g_j(x) \perp \Delta x, \nabla g_j(x) \cdot \Delta x = 0.  Therefore \nabla f \cdot \Delta x = (\sum_{j = 1}^k \lambda_j \nabla g_j) \cdot \Delta x = \sum_{j = 1}^k \lambda_j \nabla g_j \cdot \Delta x = 0.  This implies that \nabla f \perp \Delta x.

Next we’ll show that 1 implies 2, or equivalently, that not 2 implies not 1.  The negation of 2 says that \nabla f is not in the span of the k vectors \nabla g_j.  Let’s denote by z the projection of \nabla f into the span of the \nabla g_j, and set \Delta x = \nabla f - z.  We will show that \Delta x is a counterexample establishing the negation of condition 1.

We know that \nabla f \neq z, so \Delta x \neq 0, and by the definition of projection, \Delta x \perp g_j for each j.  Additionally, \nabla f \cdot \Delta x = (z + \Delta x) \cdot \Delta x = z \cdot \Delta x + \Delta x \cdot \Delta x,  Since z is in the span of the g_j, all of which are perpendicular to \Delta x, z \perp \Delta x as well.  Adding z \cdot \Delta x = 0 and \Delta x \cdot \Delta x \neq 0, we get (z + \Delta x) \cdot \Delta x \neq 0.  That is, \nabla f \cdot \Delta x \neq 0, so \nabla f is not perpendicular to \Delta x.  Since \Delta x is perpendicular to all of the \nabla g_j, but not to \nabla f, condition 1 is violated.  We assumed the negation of condition 2, so this shows that not 2 implies not 1, or equivalently, that 1 implies 2.  \square

Now we’ve seen that Lagrange multipliers work by finding an x and a set of \lambda_j such that \nabla f = \sum_{j = 1}^k \lambda_j \nabla g_j.  This equation implies that there is no local change to x that, to first order, preserves all the constraints but changes the objective function f.  Therefore the values of x that satisfy the equation are candidates for a constrained local minimum or maximum.  Solving this vector equation is equivalent to setting the partial derivatives of the Lagrangian function to zero.

I hope that sheds some light on how and why Lagrange multipliers work, and makes them seem a bit less like a magic trick.


Knight and Bishop Checkmate on a Large Board

The bishop and knight checkmate against a lone king occurs extremely rarely in normal chess games.  (I have never encountered it.)  But I see it as an interesting case study on the abilities of the pieces.  Are the knight, bishop, and king strong enough to corral the opposing king into the corner?  Indeed they are, and the crux of the most common method is a procedure called a “W-maneuver”.

Playing through the W-maneuver has a very tenuous feel to it, with the defending king on the brink of escape at one point.  This makes me want to kick its tires a little, to see if the bishop and knight checkmate is “almost impossible”.  One way to do this is to see if the mate can be forced on a larger board, say 10×10.  What about 100×100?  Someone asked exactly this question on chess.stackexchange, and I used an endgame tablebase program I wrote to answer it.

Which Has More Volume, a Dodecahedron or Icosahedron?

Platonic Solids

Which has more volume, a dodecahedron or an icosahedron, both having the same edge length?  (The same question can be asked of the cube and octahedron, and the following discussion applies just as well to them.) It’s tempting to think that the icosahedron is bigger, because it has more faces (20 to the dodecahedron’s 12).  I think it is also natural to think that they are very close in size.  In fact, the dodecahedron has about 3.5 times more volume.

People sometimes think that math is about manipulating strings of symbols according to formal rules, but really it’s about figuring out which intuitions are right and why others are wrong. Before we figure out some good intuitions for the relative volumes of polyhedra, let’s look at the numbers for the Platonic solids, assuming an edge length of 1.  (If the edge length is actually a, multiply the given volume by a^3.)

Name Faces Edges Vertices Volume
Tetrahedron 4 6 4 0.118
Octahedron 8 12 6 0.471
Cube 6 12 8 1.000
Icosahedron 20 30 12 2.182
Dodecahedron 12 30 20 7.663

One reason we might think the icosahedron and dodecahedron are roughly the same volume is that we might picture them with the same diameter. Since they are both pretty spherical, they would have roughly the same volume in that case.  For example, see Wikipedia’s stock photos of the two solids:

Icosahedron          Dodecahedron

However, when the two shapes are the same volume, the dodecahedron’s edges are about 34% shorter. If we expand the dodecahedron to make its edges the same length as the icosahedron’s, that increases the volume by a factor of (1/0.66)^3 \approx 3.5.

In fact, this turns out to be a pretty good way to think about volumes of polyhedra in general. Instead of asking what the relative volumes are for the same edge length, ask what the relative edge lengths are for the same volume. Smaller edge length for the same volume means larger volume for the same edge length.

So how can we get a handle on relative edge lengths for the same volume? One useful intuition is to picture two polyhedra that are the same volume as same-sized spheres. Which one will have shorter edges? Well, the vertices are pretty evenly spaced over the surface, so perhaps the one with more vertices will have more closely packed vertices, and therefore shorter edges. This suggests that holding the edge length constant, the solids with more vertices will tend to have larger volumes.

Looking back to the table above, this appears to be fairly accurate. Holding edge length constant, the number of vertices seems to correspond pretty well with volume, certainly better than the number of faces or edges.

Incidentally, why does the number of faces suggest the wrong volume ordering? This is because different faces are different sizes. Sure, the dodecahedron only has 12 faces, but they are pentagons — much bigger than the icosahedron’s 20 triangular faces. A regular pentagon has about four times as much area as an equilateral triangle with the same edge length. So the 12 pentagons are worth about 48 equilateral triangles. Now we also have a strong face-based intuition that points in the right direction: the dodecahedron is significantly larger than the icosahedron.

Archimedean Solids

For Platonic solids, the ordering by number of vertices is the same as the volume ordering.  Let’s take a look at a more general class of solids.  Archimedean solids have faces that are not all the same, but the faces are still regular polygons, and there is the same configuration of faces at each vertex.  For example a cuboctahedron has two squares and two triangles meeting at each vertex. Here are the 13 Archimedean solids in volume order.

Name Faces Edges Vertices Volume
Cuboctahedron 14 24 12 2.357
Truncated tetrahedron 8 18 12 2.711
Snub cube 38 60 24 7.889
Rhombicuboctahedron 26 48 24 8.714
Truncated octahedron 14 36 24 11.314
Truncated cube 14 36 24 13.600
Icosidodecahedron 32 60 30 13.836
Snub dodecahedron 92 150 60 37.617
Rhombicosidodecahedron 62 120 60 41.615
Truncated cuboctahedron 26 72 48 41.799
Truncated icosahedron 32 90 60 55.288
Truncated dodecahedron 32 90 60 85.040
Truncated icosidodecahedron 62 180 120 206.803

Well it’s not perfect, but vertices still correspond to volumes far better than faces or edges.  Let’s look at the relationship between number of vertices and volume a little more closely.


As you can see, the volume does correspond pretty well with the number of vertices, but there are some glaring problems.  Most notably, look at the four with 60 vertices; there is over a factor of two between the volumes of the biggest (truncated dodecahedron) and smallest (snub dodecahedron).  Above we speculated that for the same volume, more vertices would be more densely packed, leading to shorter edges.  Why does the truncated dodecahedron have shorter edges than the snub dodecahedron for the same volume and same number of vertices?  Let’s take a look.

Snubdodecahedronccw                Truncateddodecahedron

The snub dodecahedron (left) has 80 triangles and 12 pentagons, and the vertices look pretty well spaced out.  The truncated dodecahedron (right) has 20 triangles and 12 decagons.  The vertices are not well spaced out at all; they’re basically all lined up to make those big empty decagons.  This explains why the truncated dodecahedron has so much volume for a given edge length and number of vertices.  For a given volume, the vertices are very unevenly spaced, making the edges short.

Four-Dimensional Platonic Solids

Finally, do the intuitions we developed above for three-dimensional solids carry over to four-dimensional solids? Absolutely. Six data points isn’t much, but here are the volumes of the four-dimensional Platonic solids (4-d volumes in this case, so multiply by a^4 if the side length is a).

Name Cells Faces Edges Vertices Volume
5-cell (simplex) 5 10 10 5 0.023
16-cell (orthoplex) 16 32 24 8 0.167
8-cell (hypercube) 8 24 32 16 1.000
24-cell 24 96 96 24 2.000
600-cell 600 1200 720 120 26.475
120-cell 120 720 1200 600 787.857

Looks good to me. The volumes are in the same order as the number of vertices, and the relatively large volume increases correspond to the relatively large increases in the number of vertices.