
Extrema

In this part of the course we work on the following skills:

  • Locating and classifying the extrema of scalar fields.
  • Applying the method of Lagrange multipliers to optimize quantities subject to constraints.

See also the exercises associated to this part of the course.

In the previous chapter we introduced various notions of differentials for higher dimensional functions (scalar fields, vector fields, paths, etc.). This part of the course is devoted to searching for extrema (minima / maxima) in various different scenarios. This extends what we already know for functions on $\mathbb{R}$ and we will find that in higher dimensions more possibilities and subtleties exist.

Extrema (minima / maxima / saddle)

Let $S \subseteq \mathbb{R}^n$ be open, let $f : S \to \mathbb{R}$ be a scalar field and let $a \in S$.

Definition

If $f(a) \le f(x)$ (resp. $f(a) \ge f(x)$) for all $x \in S$, then $f(a)$ is said to be the absolute minimum (resp. maximum) of $f$.

Definition

If $f(a) \le f(x)$ (resp. $f(a) \ge f(x)$) for all $x \in B(a,r)$ for some $r > 0$, then $f(a)$ is said to be a relative minimum (resp. maximum) of $f$.

Collectively we call these points the extrema of the scalar field. In the case of a scalar field defined on $\mathbb{R}^2$ we can visualize the scalar field as a 3D plot like the figure. Here we see the extrema as the "flat" places. We sometimes use global as a synonym of absolute and local as a synonym of relative.

Bumps
Graph of $f(x,y) = x\,e^{-x^2-y^2} + \tfrac{1}{4}e^{-y^3/10}$.

To proceed it is convenient to connect the extrema with the behaviour of the gradient of the scalar field.

Theorem

If $f : S \to \mathbb{R}$ is differentiable and has a relative minimum or maximum at $a$, then $\nabla f(a) = 0$.

Proof

Suppose $f$ has a relative minimum at $a$ (or consider $-f$). For any unit vector $v$ let $g(u) = f(a + uv)$. We know that $g : \mathbb{R} \to \mathbb{R}$ has a relative minimum at $u = 0$, so $g'(0) = 0$. This means that the directional derivative $D_v f(a) = 0$ for every $v$. Consequently $\nabla f(a) = 0$.

Graph of an inflection
$\nabla f(a) = 0$ doesn't imply a minimum or maximum at $a$, even in $\mathbb{R}$, as seen with the function $f(x) = x^3$. In higher dimensions even more is possible.

Observe that, here and in the subsequent text, we can always consider the case of $f : \mathbb{R} \to \mathbb{R}$, i.e., the case of $\mathbb{R}^n$ where $n = 1$. Everything still holds and reduces to the arguments and formulae previously developed for functions of one variable.

Definition (stationary point)

If $\nabla f(a) = 0$ then $a$ is called a stationary point.

Graph of a bowl shaped function
If $f(x,y) = x^2 + y^2$ then $\nabla f(x,y) = (2x, 2y)$ and $\nabla f(0,0) = (0,0)$. The point $(0,0)$ is an absolute minimum for $f$.

As we see in the inflection example, the converse of the above theorem fails in the sense that a stationary point might not be a minimum or a maximum. This motivates the following.

Definition (saddle point)

If $\nabla f(a) = 0$ and $a$ is neither a minimum nor a maximum then $a$ is said to be a saddle point.

The quintessential saddle has the shape seen in the graph. However it might also resemble an inflection in 1D, or be more complicated, given the possibilities available in higher dimensions.

Graph of a saddle
If $f(x,y) = x^2 - y^2$ then $\nabla f(x,y) = (2x, -2y)$ and $\nabla f(0,0) = (0,0)$. The point $(0,0)$ is a saddle point for $f$.

Hessian matrix

To proceed it is useful to develop the idea of a second order Taylor expansion in this higher dimensional setting. In particular this will allow us to identify the local behaviour close to stationary points. The main object for doing this is the Hessian matrix. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be twice differentiable and use the notation $f(x,y)$. The Hessian matrix at $a \in \mathbb{R}^2$ is defined as

$$H_f(a) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2}(a) & \frac{\partial^2 f}{\partial x \partial y}(a) \\ \frac{\partial^2 f}{\partial y \partial x}(a) & \frac{\partial^2 f}{\partial y^2}(a) \end{pmatrix}.$$

Observe that the Hessian matrix $H_f(a)$ is a symmetric matrix since we know that

$$\frac{\partial^2 f}{\partial x \partial y}(a) = \frac{\partial^2 f}{\partial y \partial x}(a)$$

for twice differentiable functions.

The Hessian matrix is defined analogously in any dimension.

Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable. The Hessian matrix at $a \in \mathbb{R}^n$ is defined as

$$H_f(a) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2}(a) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(a) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(a) & \frac{\partial^2 f}{\partial x_2^2}(a) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(a) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(a) & \frac{\partial^2 f}{\partial x_n \partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(a) \end{pmatrix}.$$

Observe that the Hessian matrix is a real symmetric matrix in any dimension. If $f : \mathbb{R} \to \mathbb{R}$ then $H_f(a)$ is a $1 \times 1$ matrix and coincides with the second derivative of $f$. In this sense what we know about extrema in $\mathbb{R}$ is just a special case of everything we do here.

As an example, let $f(x,y) = x^2 - y^2$ (figure). The gradient and the Hessian are respectively

$$\nabla f(x,y) = \begin{pmatrix} \frac{\partial f}{\partial x}(x,y) \\ \frac{\partial f}{\partial y}(x,y) \end{pmatrix} = \begin{pmatrix} 2x \\ -2y \end{pmatrix}, \qquad H_f(x,y) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2}(x,y) & \frac{\partial^2 f}{\partial x \partial y}(x,y) \\ \frac{\partial^2 f}{\partial y \partial x}(x,y) & \frac{\partial^2 f}{\partial y^2}(x,y) \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}.$$

The point $(0,0)$ is a stationary point since $\nabla f(0,0) = (0,0)$. In this example $H_f$ does not depend on $(x,y)$ but in general we can expect dependence and so it gives a different matrix at different points $(x,y)$.
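The same computation can be reproduced symbolically. The following is a minimal sketch (not part of the original notes) assuming SymPy is available; it computes the gradient and Hessian of $f(x,y) = x^2 - y^2$ and checks that $(0,0)$ is a stationary point.

```python
# Sketch assuming SymPy; reproduces the gradient and Hessian computed above.
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 - y**2

grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])  # gradient as a column vector
H = sp.hessian(f, (x, y))                         # Hessian matrix

print(grad)                     # Matrix([[2*x], [-2*y]])
print(H)                        # Matrix([[2, 0], [0, -2]])
print(grad.subs({x: 0, y: 0}))  # Matrix([[0], [0]]): (0,0) is a stationary point
```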

Theorem

If $v = (v_1, \dots, v_n)$ then,

$$v \, H_f(a) \, v^T = \sum_{j,k=1}^{n} \frac{\partial^2 f}{\partial x_j \partial x_k}(a)\, v_j v_k \in \mathbb{R}.$$
Proof

Multiplying the matrices we calculate that

$$v \, H_f(a) \, v^T = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} \begin{pmatrix} \partial_1 \partial_1 f(a) & \cdots & \partial_1 \partial_n f(a) \\ \vdots & \ddots & \vdots \\ \partial_n \partial_1 f(a) & \cdots & \partial_n \partial_n f(a) \end{pmatrix} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \sum_{j,k=1}^{n} \partial_j \partial_k f(a)\, v_j v_k$$

as required.

Second order Taylor formula for scalar fields

First let's recall the first order Taylor approximation we saw before. If f is differentiable at a then

$$f(x) \approx f(a) + \nabla f(a) \cdot (x - a).$$

If $a$ is a stationary point then this only tells us that $f(x) \approx f(a)$, so a natural next question is to search for slightly more detailed information.

Theorem (second order Taylor for scalar fields)

Let f be a scalar field twice differentiable on B(a,r). Then, for x close to a,

$$f(x) \approx f(a) + \nabla f(a) \cdot (x - a) + \tfrac{1}{2}\,(x - a)\, H_f(a)\, (x - a)^T$$

in the sense that the error is $o(\lVert x - a \rVert^2)$.

Proof

Let $v = x - a$ and let $g(u) = f(a + uv)$. The Taylor expansion of $g$ tells us that $g(1) = g(0) + g'(0) + \tfrac{1}{2} g''(c)$ for some $c \in (0,1)$. Since $g(u) = f(a_1 + u v_1, \dots, a_n + u v_n)$, by the chain rule,

$$g'(u) = \sum_{j=1}^{n} \partial_j f(a_1 + u v_1, \dots, a_n + u v_n)\, v_j = \nabla f(a + uv) \cdot v, \qquad g''(u) = \sum_{j,k=1}^{n} \partial_j \partial_k f(a_1 + u v_1, \dots, a_n + u v_n)\, v_j v_k = v^T\, H_f(a + uv)\, v.$$

Consequently $f(a+v) = f(a) + \nabla f(a) \cdot v + \tfrac{1}{2}\, v^T H_f(a + cv)\, v$. We define the "error" in the approximation as $\epsilon(v) = \tfrac{1}{2}\, v^T \big(H_f(a + cv) - H_f(a)\big) v$ and estimate that

$$|\epsilon(v)| \le \sum_{j,k=1}^{n} |v_j v_k|\, \big|\partial_j \partial_k f(a + cv) - \partial_j \partial_k f(a)\big|.$$

Since $|v_j v_k| \le \lVert v \rVert^2$ we observe that $|\epsilon(v)| / \lVert v \rVert^2 \to 0$ as $\lVert v \rVert \to 0$, as required.
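As a quick sanity check of the statement (not part of the original notes), one can verify numerically that the error of the second order approximation decays faster than $\lVert x - a \rVert^2$. The sketch below assumes NumPy and uses the illustrative choice $f(x,y) = \sin(x)\,e^{y}$ at $a = (0,0)$, where $\nabla f(a) = (1, 0)$ and $H_f(a) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

```python
# Numerical check (illustrative choices): the ratio |error| / ||v||^2 tends to 0.
import numpy as np

def f(p):
    return np.sin(p[0]) * np.exp(p[1])

grad_a = np.array([1.0, 0.0])          # gradient of f at (0, 0)
H_a = np.array([[0.0, 1.0],
                [1.0, 0.0]])           # Hessian of f at (0, 0)

for t in [1e-1, 1e-2, 1e-3]:
    v = t * np.array([0.6, 0.8])       # ||v|| = t
    taylor = f(np.zeros(2)) + grad_a @ v + 0.5 * v @ H_a @ v
    error = abs(f(v) - taylor)
    print(t, error / np.dot(v, v))     # decreasing ratio: error is o(||v||^2)
```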

Classifying stationary points

In order to classify the stationary points we will take advantage of the Hessian matrix and therefore we need to first understand the following fact about real symmetric matrices.

Theorem

Let $A$ be a real symmetric matrix and let $Q(v) = v^T A v$. Then,

  • $Q(v) > 0$ for all $v \neq 0$ $\iff$ all eigenvalues of $A$ are positive,
  • $Q(v) < 0$ for all $v \neq 0$ $\iff$ all eigenvalues of $A$ are negative.
Proof

Since $A$ is symmetric it can be diagonalised by an orthogonal matrix $B$ ($B^T = B^{-1}$): the matrix $D = B^T A B$ is diagonal and has the eigenvalues of $A$ on the diagonal. Writing $A = B D B^T$, this means that $Q(v) = v^T B D B^T v = w^T D w$ where $w = B^T v$. Consequently $Q(v) = \sum_j \lambda_j w_j^2$. Observe that, if all $\lambda_j > 0$ then $\sum_j \lambda_j w_j^2 > 0$ whenever $w \neq 0$, and $w \neq 0$ whenever $v \neq 0$ since $B$ is invertible.

In order to prove the other direction in the "if and only if" statement, observe that $Q(B u_k) = \lambda_k$, where $u_k$ denotes the $k$-th standard basis vector (so $B u_k$ is the $k$-th column of $B$, an eigenvector of $A$). This means that, if $Q(v) > 0$ for all $v \neq 0$, then $\lambda_k > 0$ for all $k$.
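The theorem can be illustrated numerically (a hedged sketch, not part of the original notes, assuming NumPy): for a symmetric matrix with positive eigenvalues, $Q(v) = v^T A v$ is positive for every sampled $v \neq 0$.

```python
# Illustration: positive eigenvalues of a symmetric matrix give Q(v) > 0.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # real symmetric matrix

print(np.linalg.eigvalsh(A))          # eigenvalues (5 +/- sqrt(5))/2, both positive

rng = np.random.default_rng(0)
for _ in range(5):
    v = rng.standard_normal(2)        # a random nonzero vector
    print(v @ A @ v > 0)              # True in every case
```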

Theorem (classifying stationary points)

Let $f$ be a scalar field twice differentiable on $B(a,r)$. Suppose $\nabla f(a) = 0$ and consider the eigenvalues of $H_f(a)$. Then,

  • All eigenvalues are positive $\implies$ relative minimum at $a$,
  • All eigenvalues are negative $\implies$ relative maximum at $a$,
  • Some positive, some negative $\implies$ $a$ is a saddle point.

Proof

Let $Q(v) = v^T H_f(a) v$, let $w = B^T v$ be as in the previous proof, and let $\Lambda := \min_j \lambda_j$, which is positive in the first case. Observe that $\lVert w \rVert = \lVert v \rVert$ and that $Q(v) = \sum_j \lambda_j w_j^2 \ge \Lambda \sum_j w_j^2 = \Lambda \lVert v \rVert^2$. By the second order Taylor formula we then have

$$f(a+v) - f(a) = \tfrac{1}{2}\, v^T H_f(a)\, v + \epsilon(v) \ge \left(\tfrac{\Lambda}{2} - \tfrac{|\epsilon(v)|}{\lVert v \rVert^2}\right) \lVert v \rVert^2.$$

Since $|\epsilon(v)| / \lVert v \rVert^2 \to 0$ as $\lVert v \rVert \to 0$, we have $|\epsilon(v)| / \lVert v \rVert^2 < \tfrac{\Lambda}{2}$ when $\lVert v \rVert$ is small, and so $f(a+v) > f(a)$ for all sufficiently small $v \neq 0$. The argument is analogous for the second part. For the final part consider $v_j$, the eigenvector for $\lambda_j$, and apply the argument of the first or second part along that direction.
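The classification recipe translates directly into a short computation. The following sketch (assuming SymPy; the function $f(x,y) = x^3 - 3x + y^2$ is an illustrative choice, not from the notes) finds the stationary points and classifies them via the eigenvalues of the Hessian.

```python
# Sketch: solve grad f = 0, then classify via the eigenvalues of the Hessian.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 - 3*x + y**2

grad = [sp.diff(f, v) for v in (x, y)]
H = sp.hessian(f, (x, y))

for pt in sp.solve(grad, (x, y), dict=True):
    eigs = list(H.subs(pt).eigenvals().keys())
    if all(e > 0 for e in eigs):
        kind = "relative minimum"
    elif all(e < 0 for e in eigs):
        kind = "relative maximum"
    else:
        kind = "saddle point"        # (a zero eigenvalue would be inconclusive)
    print(pt, eigs, kind)
# {x: -1, y: 0} with eigenvalues {-6, 2}: saddle point
# {x: 1, y: 0}  with eigenvalues {6, 2}:  relative minimum
```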

Attaining extreme values

Here we explore the extreme value theorem for continuous scalar fields. The argument will be in two parts: firstly we show that continuity implies boundedness; secondly we show that boundedness implies that the maximum and minimum are attained. We use the following notation for interval / rectangle / cuboid / tesseract, etc. If $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ then we consider the $n$-dimensional closed Cartesian product

$$[a,b] = [a_1, b_1] \times \cdots \times [a_n, b_n].$$

We call this set a rectangle (independent of the dimension). As a first step it is convenient to know that all sequences in our setting have convergent subsequences.

Theorem

If $\{x_n\}_n$ is a sequence in $[a,b]$ then there exists a convergent subsequence $\{x_{n_j}\}_j$.

Proof

In order to prove the theorem we construct the subsequence. Firstly we divide $[a,b]$ into sub-rectangles of half the size of the original. We then choose a sub-rectangle which contains infinitely many elements of the sequence and choose the first of these elements to be part of the subsequence. We repeat this process, again dividing the chosen sub-rectangle in half and choosing the next element of the subsequence. Repeating indefinitely gives the full subsequence; since the diameters of the nested sub-rectangles tend to zero, the subsequence is Cauchy and therefore converges to a point of $[a,b]$.

Theorem

Suppose that $f$ is a scalar field continuous at every point in the closed rectangle $[a,b]$. Then $f$ is bounded on $[a,b]$ in the sense that there exists $C > 0$ such that $|f(x)| \le C$ for all $x \in [a,b]$.

Proof

Suppose the contrary: for every $n \in \mathbb{N}$ there exists $x_n \in [a,b]$ such that $|f(x_n)| > n$. The Bolzano-Weierstrass theorem (the previous result) means that there exists a subsequence $\{x_{n_j}\}_j$ which converges to some $x \in [a,b]$. Continuity of $f$ means that $f(x_{n_j})$ converges to $f(x)$, which is impossible since $|f(x_{n_j})| > n_j \to \infty$. This is a contradiction and hence the theorem is proved.

We can now use the above result on the boundedness in order to show that the extreme values are actually obtained.

Theorem

Suppose that $f$ is a scalar field continuous at every point in the closed rectangle $[a,b]$. Then there exist points $x, y \in [a,b]$ such that

$$f(x) = \inf f \quad \text{and} \quad f(y) = \sup f.$$
Proof

By the boundedness theorem $\sup f$ is finite and so there exists a sequence $\{x_n\}_n$ in $[a,b]$ such that $f(x_n)$ converges to $\sup f$. The Bolzano-Weierstrass theorem implies that there exists a subsequence $\{x_{n_j}\}_j$ which converges to some $y \in [a,b]$. By continuity $f(x_{n_j}) \to f(y) = \sup f$. The argument for the infimum is analogous.

Extrema with constraints (Lagrange multipliers)

We now consider a slightly different problem to the one earlier in this chapter. There we wished to find the extrema of a given scalar field. Here the general problem is to minimise or maximise a given scalar field $f(x,y)$ under the constraint $g(x,y) = 0$. Subsequently we will also consider the same problem in higher dimensions. For the graphical representation we draw the constraint curve and also various level sets of the function whose extrema we seek. The graphical representation suggests to us that at the "touching point" the gradient vectors are parallel. In other words, $\nabla f = \lambda \nabla g$ for some $\lambda \in \mathbb{R}$. The implementation of this idea is the method of Lagrange multipliers.

Suppose that a differentiable scalar field $f(x,y)$ has a relative minimum or maximum when it is subject to the constraint $g(x,y) = 0$. Then there exists a scalar $\lambda$ such that, at the extremum point,

$$\nabla f = \lambda \nabla g.$$
Visualization of Lagrange multiplier method
Searching for extrema of $f$ under the constraint $g = 0$.
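The two-dimensional method can be carried out symbolically: solve the system $\nabla f = \lambda \nabla g$ together with $g = 0$. The sketch below assumes SymPy and uses the illustrative choices $f(x,y) = x + y$ and $g(x,y) = x^2 + y^2 - 1$ (not from the notes).

```python
# Sketch of the 2D Lagrange multiplier method: grad f = lambda * grad g, g = 0.
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y                          # quantity to optimise (illustrative)
g = x**2 + y**2 - 1                # constraint g(x, y) = 0: the unit circle

equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]

for s in sp.solve(equations, (x, y, lam), dict=True):
    print(s, " f =", f.subs(s))
# constrained maximum  sqrt(2) at ( 1/sqrt(2),  1/sqrt(2))
# constrained minimum -sqrt(2) at (-1/sqrt(2), -1/sqrt(2))
```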

In three dimensions a similar result holds. Suppose that a differentiable scalar field $f(x,y,z)$ has a relative minimum or maximum when it is subject to the constraints

$$g_1(x,y,z) = 0, \qquad g_2(x,y,z) = 0$$

and the gradients $\nabla g_k$ are linearly independent. Then there exist scalars $\lambda_1$, $\lambda_2$ such that, at the extremum point,

$$\nabla f = \lambda_1 \nabla g_1 + \lambda_2 \nabla g_2.$$

In higher dimensions and possibly with additional constraints we have the following general theorem.

Theorem (Lagrange multipliers)

Suppose that a differentiable scalar field $f(x_1, \dots, x_n)$ has a relative extremum when it is subject to the $m$ constraints

$$g_1(x_1, \dots, x_n) = 0, \quad \dots, \quad g_m(x_1, \dots, x_n) = 0,$$

where $m < n$, and the gradients $\nabla g_k$ are all linearly independent. Then there exist $m$ scalars $\lambda_1, \dots, \lambda_m$ such that, at each extremum point,

$$\nabla f = \lambda_1 \nabla g_1 + \cdots + \lambda_m \nabla g_m.$$
The Lagrange multiplier method is often stated and far less often proved.

Since the proof is rather involved we will follow this tradition here. See, for example, Chapter 14 of "A First Course in Real Analysis" (2012) by Protter & Morrey for a complete proof and further discussion.
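Before turning to the idea of the proof, here is a hedged sketch (assuming SymPy) of the general statement in the case $n = 3$, $m = 2$: the function $f(x,y,z) = x$ and the constraints $g_1 = x^2 + y^2 + z^2 - 1$, $g_2 = z$ are illustrative choices, not taken from the notes.

```python
# Sketch for n = 3, m = 2: grad f = l1 * grad g1 + l2 * grad g2, plus constraints.
import sympy as sp

x, y, z, l1, l2 = sp.symbols('x y z lambda1 lambda2', real=True)
f = x                                # quantity to optimise (illustrative)
g1 = x**2 + y**2 + z**2 - 1          # first constraint: the unit sphere
g2 = z                               # second constraint: the plane z = 0

def grad(h):
    return sp.Matrix([sp.diff(h, v) for v in (x, y, z)])

equations = list(grad(f) - l1 * grad(g1) - l2 * grad(g2)) + [g1, g2]

for s in sp.solve(equations, (x, y, z, l1, l2), dict=True):
    print(s, " f =", f.subs(s))
# constrained maximum f = 1 at (1, 0, 0); constrained minimum f = -1 at (-1, 0, 0)
```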

Idea of proof

Let us consider a particular case of the method when $n = 3$ and $m = 2$. More precisely we consider the following problem: find the maxima and minima of $f(x,y,z)$ along the curve $C$ defined as

$$g_1(x,y,z) = 0, \qquad g_2(x,y,z) = 0$$

where $g_1$, $g_2$ are differentiable functions. In this particular case we will prove the Lagrange multiplier method. Suppose that $a$ is some point on the curve. Let $\alpha(t)$ denote a path which lies in the curve $C$ in the sense that $\alpha(t) \in C$ for all $t \in (-1,1)$, with $\alpha'(t) \neq 0$ and $\alpha(0) = a$. If $a$ is a local minimum for $f$ restricted to $C$ it means that $f(\alpha(t)) \ge f(\alpha(0))$ for all $t \in (-\delta, \delta)$ for some $\delta > 0$. In words, moving away from $a$ along the curve $C$ doesn't cause $f$ to decrease. Let $h(t) = f(\alpha(t))$ and observe that $h : \mathbb{R} \to \mathbb{R}$, so we know how to find its extrema. In particular we know that $h'(0) = 0$. By the chain rule $h'(t) = \nabla f(\alpha(t)) \cdot \alpha'(t)$ and so

$$\nabla f(a) \cdot \alpha'(0) = 0.$$

Since we know that $g_1(\alpha(t)) = 0$ and $g_2(\alpha(t)) = 0$, again by the chain rule,

$$\nabla g_1(a) \cdot \alpha'(0) = 0, \qquad \nabla g_2(a) \cdot \alpha'(0) = 0.$$

To proceed it is convenient to isolate the following result of linear algebra.

Consider $w, u_1, u_2 \in \mathbb{R}^3$ and let $V = \{v : u_k \cdot v = 0, \ k = 1, 2\}$. If $w \cdot v = 0$ for all $v \in V$ then $w = \lambda_1 u_1 + \lambda_2 u_2$ for some $\lambda_1, \lambda_2 \in \mathbb{R}$.

In order to prove this we write $w = \lambda_1 u_1 + \lambda_2 u_2 + v_0$ where $v_0 \in V$, because $u_1, u_2$ together with $V$ must span $\mathbb{R}^3$. Since $v_0 \in V$ and, by assumption, $w \cdot v_0 = 0$,

$$0 = w \cdot v_0 = (\lambda_1 u_1 + \lambda_2 u_2 + v_0) \cdot v_0 = v_0 \cdot v_0 = \lVert v_0 \rVert^2.$$

This means that $v_0 = 0$ and so $w = \lambda_1 u_1 + \lambda_2 u_2$.

The above statement holds in any dimension with any number of vectors, with the analogous proof. Applying this lemma with $w = \nabla f(a)$, $u_1 = \nabla g_1(a)$ and $u_2 = \nabla g_2(a)$ recovers exactly the Lagrange multiplier method in this setting.