Optimization for Economics, a Visual Approach
Mike Carr
Contents

0 Calculus Reference
  0.1 Graphs of Functions
  0.2 Limits and Derivatives
  0.3 Multivariable Functions
1 Unconstrained Optimization
  1.1 Single-Variable Optimization
  1.2 Concavity
  1.3 Multivariable Optimization
2 Constrained Optimization
  2.1 Equality Constraints
  2.2 Inequality Constraints
  2.3 The Kuhn-Tucker Conditions
3 Comparative Statics
  3.1 The Implicit Function Theorem
  3.2 The Envelope Theorem
4 Sufficient Conditions
  4.1 The Extreme Value Theorem
  4.2 The Bordered Hessian
  4.3 Separation
  4.4 Concave Programming
  4.5 Quasiconcavity
Note to the Reader
For the last several years I have taught a course in optimization at Emory University for seniors
majoring in mathematics and economics. While mathematics and economics is the standard preparation
for a doctoral program in economics, the vast majority of students in this course are bound for careers
in finance. One of the challenges in designing this course has been balancing the interests and needs
of students on both tracks. These notes have evolved from my own teaching materials. Their goal is
to present methods for optimization and the reasoning behind them. This reasoning makes students
resilient to the myriad modifications that these tools endure at the hands of economists.
As an outsider to economics education, I was impressed to see the extent to which economics has
embraced visual methods of instruction. More than that of any other field whose foundation lies in
mathematics, the standard economics curriculum provides students not only with descriptions and formulas, but with
diagrams to depict core principles. Microeconomics students are inundated with supply curves, budget
lines, indifference curves, and graphs in marginal space. This is certainly a lifeline for visual learners,
but I suspect it produces deeper understanding for all.
As a geometer, my first inclination is to lean heavily upon visual reasoning when presenting the
methods of optimization. In our efforts to find a textbook for our admittedly niche curriculum, we have
not found an economics-focused text that takes this approach. I have made these notes available in the
hope that this visual viewpoint can become an effective supplement to existing techniques. As with
the graphical arguments in undergraduate microeconomics, I think it can bring a complete and rigorous
understanding of optimization methods to a broader set of learners.
The audience for this text is anyone who wants to understand the mathematical methods for finding
maximizers in economic theory. The best-prepared reader will have mastered the techniques of differ-
entiation, including partial derivatives. They will also be familiar with the foundations of mathematical
logic: set notation, functions, and methods of proof like contradiction and the contrapositive. Finally,
they will need a basic proficiency in vector and matrix operations: sums, products and determinants. A
competent high-school treatment may suffice for this requirement.
Anyone who sets out to teach or learn the methods of optimization from this text should be aware
of its limitations. It does not contain the economic applications that my economist colleagues present
each semester. These must be provided in order to create genuine enthusiasm for the material. This is
a deficiency that I would like to rectify eventually, if there is some consensus about which examples to
include.
We have used one widely-available source of economic examples: Optimization for Economic Theory
by Avinash Dixit. This excellent book has a rigorous, traditional treatment of most of the material here
and good explanations of some advanced applications. Dixit’s book has proved too difficult for all but the
strongest undergraduates to read independently, but the examples are compelling and comprehensible
with some extra exposition.
Finally I would like to acknowledge my economist colleagues Blake Allison and Teddy Kim who were
essential partners in developing this course. Three students, Alexia Witthaus, Jacob Sugarmann and
Aarya Aamer, carefully read my early drafts and asked many questions. Their feedback allowed me to
identify the most opaque exposition (and clarify it, I hope). There is still a long road of revision and
improvement ahead for this document. I would appreciate any feedback you have.
Mike Carr
May 2023
Definition 0.3
A monomial is a function of the form f(x) = x^n, where n is an integer greater than 0.

For n ≥ 2 the graph y = x^n curves upward over the positive values of x.
Greater values of n give lower values of f(x) when 0 < x < 1 but higher values when x > 1.
For even values of n the graph is symmetric across the y-axis, curving up when x is negative.
For odd values of n the graph curves down when x is negative. It is anti-symmetric across the line x = 0.
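These claims are easy to check numerically. The following Python sketch (my own illustration, not part of the text) verifies the ordering of powers on either side of x = 1 and the symmetry claims:

```python
# Quick numerical check of the monomial claims above (illustrative only).
def monomial(n):
    return lambda x: x ** n

# On (0, 1) a higher power gives smaller values; for x > 1 it gives larger values.
for small, large in [(2, 4), (3, 5)]:
    f, g = monomial(small), monomial(large)
    assert g(0.5) < f(0.5)
    assert g(2.0) > f(2.0)

# Even powers are symmetric across the y-axis; odd powers are anti-symmetric.
assert monomial(4)(-1.5) == monomial(4)(1.5)
assert monomial(3)(-1.5) == -monomial(3)(1.5)
```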
Figure 1: Graphs of even-powered monomials
Figure 2: Graphs of odd-powered monomials
Monomials of Negative Power
Monomials of negative power have the form f(x) = x^(−n). They are also commonly written
f(x) = 1/x^n.

The graph y = 1/x^n has a vertical asymptote at x = 0.
The graph approaches the x-axis, y = 0, as x gets large.
For even values of n, the graph is above the x-axis.
For odd values of n, the graph is above the x-axis for positive x and below it for negative x.
A greater choice of n makes the function approach the x-axis more quickly.
Definition 0.5
An exponential function has the form f(x) = a^x, where a is a number greater than 0.
a is called the base of the exponential function.

The graph y = a^x passes through (0, 1).
If a > 1 then f(x) increases quickly as x takes on positive values. Greater values of a give a
steeper increase. f(x) approaches 0 as x goes to −∞. Greater values of a give a faster approach.
The graph does not touch or cross the x-axis.
If a < 1, then the above is reversed.
e is a commonly used base. e is approximately 2.718.
Figure 6: Graphs of y = 2^x, y = e^x and y = 3^x
Definition 0.6
A logarithmic function has the form f(x) = log_a x, where the base a is a number greater than 1.
log_a x is the number b such that a^b = x. The natural logarithm is the logarithm with base e. It is
denoted f(x) = ln x.

a^b can never be 0 or less. The domain of f(x) = log_a x is (0, ∞).
As x approaches 0, log_a x goes to −∞.
y = log_a x has an x-intercept at (1, 0).
Example
The function f(x) = (3x^2 − 17x + 2)^(1/4) − e^x / log_5 x is the difference of two functions. The first is a composition
of a root function and a polynomial (both continuous on their domains). The second is a quotient
of an exponential and a logarithm (both continuous on their domains). Thus f(x) is the difference of
two continuous functions and is continuous on its domain.
Remark
Just about any function we can write using algebraic expressions is continuous on its domain. This does
not mean it is continuous everywhere. f(x) = 1/x is not continuous at x = 0, for example.
0.2.3 The Intermediate Value Theorem
One early intuition for continuity is that the graph of a continuous function can be drawn without
any breaks. There are many ways to formalize this idea. One of the most important is the following
theorem.
Theorem 0.11 [The Intermediate Value Theorem]
If f is a continuous function on [a, b] and K is a number between f(a) and f (b), then there is some
number c between a and b such that f(c) = K.
Intuitively, a continuous graph cannot get from one side of the line y = K to the other without
intersecting y = K. Notice that this theorem does not say exactly where this intersection must occur,
only that it must occur somewhere in the interval (a, b). It also does not rule out the possibility of more
than one such c existing.
Example
Show that f(x) = e^x − 3x has a root between 0 and 1.
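The argument runs as follows: f(0) = 1 > 0 while f(1) = e − 3 < 0, so by the Intermediate Value Theorem f has a root in (0, 1). A bisection search (a numerical sketch of my own, not the text's method) locates it by repeatedly halving an interval that keeps the sign change:

```python
import math

def f(x):
    return math.exp(x) - 3 * x

# The IVT hypothesis: a sign change on [0, 1].
assert f(0) > 0 and f(1) < 0

# Bisection: keep the half-interval on which the sign change survives.
a, b = 0.0, 1.0
for _ in range(50):
    c = (a + b) / 2
    if f(a) * f(c) <= 0:
        b = c
    else:
        a = c

assert abs(f((a + b) / 2)) < 1e-9  # the midpoint is (numerically) a root
```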
0.3 Multivariable Functions
Goals:
1 Compute partial derivatives.
2 Recognize continuous multivariable functions.
3 Apply the chain rule to compositions of multivariable functions.
0.3.1 Multi-Variable Functions
Most interesting phenomena are not described by a single variable. We will need to develop methods
for optimizing multivariable functions. There are many ways to denote multivariable domains and the
functions on them. This is how we will denote them.
Notation
An n-vector is an ordered set of n numbers called components. For instance a⃗ = (5, −√17, 12.31)
is a 3-vector.

We add a vector arrow, as in a⃗, to indicate that a is a vector.
The components of a vector can be denoted abstractly by subscripts: x⃗ = (x_1, x_2, . . . , x_n). The
x_i do not have arrows because they are numbers, not vectors.
n-space, the set of all n-dimensional vectors, is denoted R^n.
0⃗ is the zero vector: (0, 0, 0, . . . , 0). The dimension should be clear from context.
The vectors e⃗_1, e⃗_2, . . . , e⃗_n are the standard basis vectors of R^n. e⃗_i has 1 in the ith component
and 0 in the others. For example e⃗_2 = (0, 1, 0, . . . , 0).
An n-variable function f(x_1, x_2, . . . , x_n) can be written f(x⃗).
Remark
Using a common letter with an index variable,
x_1, x_2, x_3, . . .
is a good choice for a large or an unknown number of variables. We will use this notation when developing
the theory of multivariable optimization. In many economics problems, there is a fixed, small number
of variables. In these problems, it is more convenient to use different letters for each variable, to avoid
keeping track of subscripts:
x, y, z, . . .
Even better, we should try to choose descriptive variable names like q for quantity or p for price.
We visualize functions with their graphs. The height of the graph over a point in the domain
represents the value of the function at that point. This allows us to detect visually where the function
is large or small, increasing or decreasing.
Definition 0.13
Given an n-variable function f(x⃗), the graph of f is the set of points (x_1, x_2, . . . , x_n, y) in R^(n+1) that
satisfy y = f(x_1, x_2, . . . , x_n).

Figure 12: The graph y = √(36 − 4x_1^2 − x_2^2) and the height showing f(1, 4).
Remark
In general, a graph of the form y = f(x_1, x_2) will be hard to visualize. For more than three variables,
this visualization becomes impossible. So why bother? Graphs are a useful visual aid to the reasoning
behind our methods. As we progress, it is useful to have a prototypical two-variable graph in your
head. You can apply our methods to that graph, whether or not you have an algebraic expression to go
with it.
0.3.2 Partial Derivatives
Our optimization tools rely on the ability to measure rates of change. For a function of multiple
variables, there are many rates of change, because there are many ways in which the input variables can
change. The simplest are those where only one variable is changing while the others remain fixed.
Definition 0.14
The partial derivative of f with respect to x_i is a function of x⃗. It measures the rate of change of f
at x⃗ as x_i increases but the other coordinates remain constant. The formula is
lim_{h→0} [f(x⃗ + he⃗_i) − f(x⃗)] / h

Here are two different notations for the partial derivative.
1 ∂f/∂x_i (x⃗) (Leibniz)
2 f_{x_i}(x⃗) (Lagrange)
Each has advantages, so we will use both. When it will not cause confusion, we can shorten Lagrange's
notation from f_{x_i}(x⃗) to f_i(x⃗).
In the two-variable case, we can interpret f_1(x⃗) as the slope of the tangent line to the graph
y = f(x_1, x_2) in the x_1-direction. Higher-dimensional partial derivatives are also slopes, but are harder
to visualize.
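As a concrete illustration (mine, not the text's), a partial derivative can be approximated by a difference quotient that perturbs only the ith coordinate. The function below is a hypothetical example chosen so the exact partials are known:

```python
def partial(f, i, x, h=1e-6):
    """Central-difference approximation of the partial derivative f_i at x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

# Hypothetical example: f(x1, x2) = x1**2 * x2, with f_1 = 2*x1*x2 and f_2 = x1**2.
f = lambda x: x[0] ** 2 * x[1]
assert abs(partial(f, 0, [3.0, 5.0]) - 30.0) < 1e-4   # f_1(3, 5) = 30
assert abs(partial(f, 1, [3.0, 5.0]) - 9.0) < 1e-4    # f_2(3, 5) = 9
```

Only the ith coordinate moves in the quotient; the others are held fixed, exactly as in the definition.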
0.3.4
Multivariable Limits and Continuity
Definition 0.15
A multivariable function f(x⃗) is continuous at a⃗, if
lim_{x⃗→a⃗} f(x⃗) = f(a⃗)
In order to verify this limit, we must check that f(x) can be made arbitrarily close to f(a) by
restricting to a sufficiently small neighborhood of a. This neighborhood allows for travel in infinitely
many directions from a, rather than just forwards and backwards like a one-variable limit. This makes
multivariable limits difficult to compute rigorously and multivariable continuity difficult to verify directly.
Fortunately, we can use the same approach we used for single-variable functions.
Variant of Theorem 0.9
The following multivariable functions are continuous on their domains
1 Constant functions
2 Linear functions
3 Polynomials
4 Roots
5 Exponential functions
6 Logarithms
7 Trigonometric functions
Variant of Theorem 0.10
If f(x⃗) and g(x⃗) are continuous on their domains, and c is a constant, then the following are also
continuous on their domains
1 f(x⃗) + g(x⃗)
2 f(x⃗) − g(x⃗)
3 f(x⃗)g(x⃗)
4 f(x⃗)/g(x⃗) (note that any x⃗ where g(x⃗) = 0 is not in the domain)
5 f(x⃗)^g(x⃗) as long as f(x⃗) > 0
6 f(g(x⃗)) where f(x) is a one-variable function
Multivariable continuity becomes important when discussing derivatives. Partial derivatives do not
use multivariable limits. They use a limit as a single variable h goes to 0. For this reason, we are not
guaranteed that partial derivatives reliably model the shape of a function.
Example
Consider the function
f(x_1, x_2) = 0 if x_2 ≤ 0, and f(x_1, x_2) = x_1 if x_2 > 0.
This function is 0 when x_1 = 0 or x_2 = 0. Thus the partial derivatives at (0, 0) are
f_1(0, 0) = 0    f_2(0, 0) = 0
If we increase x_1 while holding x_2 constant or increase x_2 while holding x_1 constant, then the function
stays constant at 0. This does not reflect the fact that if we increase both x_1 and x_2 at (0, 0), the
function will have a positive slope.
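This behavior can be checked directly. The sketch below (my own) evaluates the one-variable difference quotients at (0, 0), which are both 0, and then a diagonal difference quotient, which is not:

```python
# The piecewise function from the example: 0 when x2 <= 0, x1 when x2 > 0.
def f(x1, x2):
    return 0.0 if x2 <= 0 else x1

h = 1e-6
# Both one-variable difference quotients at (0, 0) vanish ...
assert (f(h, 0.0) - f(0.0, 0.0)) / h == 0.0   # f_1(0, 0) = 0
assert (f(0.0, h) - f(0.0, 0.0)) / h == 0.0   # f_2(0, 0) = 0

# ... yet along the diagonal direction (1, 1)/sqrt(2) the slope is about 1/sqrt(2).
diag_slope = (f(h, h) - f(0.0, 0.0)) / (h * 2 ** 0.5)
assert diag_slope > 0.7
```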
Many theorems rely on a function behaving consistently with its partial derivatives no matter which
direction we travel. The following property will usually serve that purpose.
Definition 0.16
A function f(x⃗) is continuously differentiable, if all the partial derivative functions f_i(x⃗) are continuous
functions. If instead they are all continuous at a point a⃗, we say f(x⃗) is continuously differentiable at
a⃗.
0.3.5 The Multivariable Chain Rule
How do we model the change of a multivariable function when more than one input variable is changing?
We write each input variable as a function of a parameter. For instance, if x_1 and x_2 are both changing
we can write each as a function of a parameter t. We can combine these to define a vector function:
x⃗(t) = (x_1(t), x_2(t)).
f(x⃗(t)) is a composition of functions. If we have defined x_1(t) and x_2(t) to correctly model the change
we want in x_1 and x_2, then the derivative of f(x⃗(t)) will tell us how f is changing as well. Notice
f(x⃗(t)) is a single-variable function. The value of t determines its value completely. The multivariable
chain rule computes its derivative using x⃗(t) and the gradient of f(x⃗).
Definition 0.17
Given a function f(x⃗), the gradient vector of f at x⃗ is
∇f(x⃗) = (f_1(x⃗), f_2(x⃗), . . . , f_n(x⃗))
Theorem 0.18 [The Multivariable Chain Rule]
Suppose f(x⃗) is a continuously differentiable, n-variable function. If x⃗(t) is differentiable, then the
derivative of the composition f(x⃗(t)) with respect to t is
df/dt (t) = ∇f(x⃗(t)) · x⃗′(t)
or
df/dt (t) = Σ_{i=1}^{n} f_i(x⃗(t)) x_i′(t)
We should avoid using the notations f′(t) and f_t(t) for derivatives of compositions. Instead, we use
Leibniz notation. This makes the variable of differentiation clear without implying that we are computing
a partial derivative.
Example
Let f(x, y) be a continuously differentiable function. Let x and y be defined by differentiable functions
g(t) and h(t) respectively. The chain rule states that
d/dt f(g(t), h(t)) = f_x(g(t), h(t)) g′(t) + f_y(g(t), h(t)) h′(t).
We do not write f_g or f_h in this example. f is defined as a function of x and y, so those are the only
partial derivatives it has.
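To see the formula in action, here is a numerical check with hypothetical choices f(x, y) = x²y, g(t) = sin t and h(t) = t³ (my own example, not from the text); the chain-rule value agrees with a direct difference quotient of the composition:

```python
import math

# Hypothetical concrete choices: f(x, y) = x**2 * y, x = g(t) = sin(t), y = h(t) = t**3.
f  = lambda x, y: x ** 2 * y
fx = lambda x, y: 2 * x * y        # partial of f with respect to x
fy = lambda x, y: x ** 2           # partial of f with respect to y
g, gp = math.sin, math.cos         # g and its derivative g'
h  = lambda t: t ** 3
hp = lambda t: 3 * t ** 2          # h'

t = 0.7
chain = fx(g(t), h(t)) * gp(t) + fy(g(t), h(t)) * hp(t)

# Compare with a difference quotient of the composition itself.
eps = 1e-6
direct = (f(g(t + eps), h(t + eps)) - f(g(t - eps), h(t - eps))) / (2 * eps)
assert abs(chain - direct) < 1e-5
```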
Some applications use one of the variables of f as the parameter. The simplest example gives an
alternate formulation for the partial derivative.
Example
Let f(x_1, x_2) be a continuously differentiable function. Let x_1 be the identity function of itself and let
x_2 be a constant function of x_1.
x_1 = x_1,  dx_1/dx_1 = 1
x_2 = c,  dc/dx_1 = 0
The rate of change with this parameterization should reflect that x_1 is changing and x_2 is not. That is
exactly what the chain rule tells us.
d/dx_1 f(x_1, c) = f_{x_1}(x_1, c) dx_1/dx_1 + f_{x_2}(x_1, c) dc/dx_1
= f_{x_1}(x_1, c)(1) + f_{x_2}(x_1, c)(0)
= f_{x_1}(x_1, c)
We will build upon this formulation when we compute comparative statics.
Finally, we should note that the chain rule applies when the x_i are multivariable functions of a vector
t⃗. In this case, f(x⃗(t⃗)) is a function of t⃗ and thus we can compute its partial derivatives.
0.3.7 Alternate Notations and the Chain Rule
Generalization of Theorem 0.18
Suppose f(x⃗) is a continuously differentiable, n-variable function. If the x_i(t⃗) are differentiable functions of
the variables t_j, then the partial derivative of the composition f(x⃗(t⃗)) with respect to t_k is
∂f/∂t_k (t⃗) = ∇f(x⃗(t⃗)) · ∂x⃗/∂t_k (t⃗)
or
∂f/∂t_k (t⃗) = Σ_{i=1}^{n} f_i(x⃗(t⃗)) ∂x_i/∂t_k (t⃗)
This generalization follows immediately from treating each t_j except t_k as a constant.
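A numerical sketch (my own, with hypothetical choices of f and the x_i) of the generalized rule for n = 2 variables and two parameters, differentiating with respect to t_1:

```python
# Hypothetical example: x1 = t1*t2, x2 = t1 + t2, and f(x1, x2) = x1 * x2**2.
f  = lambda x1, x2: x1 * x2 ** 2
f1 = lambda x1, x2: x2 ** 2        # f_1, the partial with respect to x1
f2 = lambda x1, x2: 2 * x1 * x2    # f_2

t1, t2 = 0.5, 1.5
x1, x2 = t1 * t2, t1 + t2          # the inner functions evaluated at (t1, t2)

# dx1/dt1 = t2 and dx2/dt1 = 1, so the generalized chain rule gives:
chain = f1(x1, x2) * t2 + f2(x1, x2) * 1

# Compare with a difference quotient in t1 alone (t2 held fixed as a constant).
eps = 1e-6
comp = lambda s: f(s * t2, s + t2)
direct = (comp(t1 + eps) - comp(t1 - eps)) / (2 * eps)
assert abs(chain - direct) < 1e-5
```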
0.3.8 Proving the Multivariable Chain Rule
The proof of the multivariable chain rule uses the same tools as the single variable chain rule (and
product rule). However, multivariable limits are much more difficult to verify than single variable limits.
To check that lim_{x⃗→a⃗} f(x⃗) = L, we have to consider values of x⃗ in every direction from a⃗, not just forward
or backwards along a line. We will sketch the proof for the case where f is a two-variable function.
Even the sketch is quite technical. It contains no arguments that are important enough to commit to
memory.
Proof
We apply the definition of a derivative
df/dt = lim_{h→0} [f(x⃗(t + h)) − f(x⃗(t))] / h
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t))] / h
We break up this limit into a sum of two limits by adding and subtracting a term and regrouping the
result (assuming the limit of each summand exists)
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t + h)) + f(x_1(t), x_2(t + h)) − f(x_1(t), x_2(t))] / h
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t + h))] / h
+ lim_{h→0} [f(x_1(t), x_2(t + h)) − f(x_1(t), x_2(t))] / h
Next we multiply each limit by 1, represented as a quotient of an expression divided by itself.
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t + h))] / h · [x_1(t + h) − x_1(t)] / [x_1(t + h) − x_1(t)]
+ lim_{h→0} [f(x_1(t), x_2(t + h)) − f(x_1(t), x_2(t))] / h · [x_2(t + h) − x_2(t)] / [x_2(t + h) − x_2(t)]
Naturally [x_i(t + h) − x_i(t)] / [x_i(t + h) − x_i(t)] evaluates to 0/0 when h = 0. This is not a problem for a limit. However, if it
also evaluates to 0/0 at other values of h, no matter how small a neighborhood of h = 0 we choose,
then another approach is needed. In this case, the entire term will limit to 0, but we omit the formal
argument from this sketch. Instead we assume all is well, and we reorganize each product by swapping
denominators.
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t + h))] / [x_1(t + h) − x_1(t)] · [x_1(t + h) − x_1(t)] / h
+ lim_{h→0} [f(x_1(t), x_2(t + h)) − f(x_1(t), x_2(t))] / [x_2(t + h) − x_2(t)] · [x_2(t + h) − x_2(t)] / h
We break up the result as a product of limits (assuming the limit of each factor exists)
= lim_{h→0} [f(x_1(t + h), x_2(t + h)) − f(x_1(t), x_2(t + h))] / [x_1(t + h) − x_1(t)] · lim_{h→0} [x_1(t + h) − x_1(t)] / h
+ lim_{h→0} [f(x_1(t), x_2(t + h)) − f(x_1(t), x_2(t))] / [x_2(t + h) − x_2(t)] · lim_{h→0} [x_2(t + h) − x_2(t)] / h
The second factor of each product now looks like a derivative. To make the first factors look more like
derivatives, we let j = x_1(t + h) − x_1(t) and k = x_2(t + h) − x_2(t). These quantities go to 0 as h → 0.
Our limits can be rewritten as
= lim_{(h,j)→(0,0)} [f(x_1(t) + j, x_2(t + h)) − f(x_1(t), x_2(t + h))] / j · lim_{h→0} [x_1(t + h) − x_1(t)] / h
+ lim_{(h,k)→(0,0)} [f(x_1(t), x_2(t) + k) − f(x_1(t), x_2(t))] / k · lim_{h→0} [x_2(t + h) − x_2(t)] / h
At this point we have four limits, each of which looks like the definition of a derivative. We can
replace each one with its derivative notation. In general, we cannot evaluate a multivariable limit by
handling one variable at a time, but the fact that the partial derivatives are continuous allows us to do
so here. The details of this kind of argument are covered in an analysis course.
= lim_{h→0} f_1(x_1(t), x_2(t + h)) x_1′(t) + lim_{h→0} f_2(x_1(t), x_2(t)) x_2′(t)
= f_1(x_1(t), x_2(t)) x_1′(t) + f_2(x_1(t), x_2(t)) x_2′(t)
This is the dot product
= ∇f(x⃗(t)) · x⃗′(t)
One can adapt this proof to a higher dimension by breaking the limit into more summands. The
theorem we gave is even more general, because it applies to an n-variable function. To prove that version,
we would use a proof by induction.
1.1 Single-Variable Optimization
Goals:
1 Know the definition of a local or global maximizer.
2 Apply the first- and second-order conditions to calculate maximizers.
3 Distinguish between necessary and sufficient conditions and recognize the role of each in optimiza-
tion.
4 Understand the role of the derivative in proving the first- and second-order conditions.
Some of the most important methods in calculus are those that identify maximizers and minimizers
of functions. This section gives precise theorems to describe those methods. We will also examine
the distinct but complementary roles played by necessary conditions and sufficient conditions. Finally,
we will give a reasonably compact formal basis to prove the theorems of this section and support the
theorems in the sections that follow.
1.1.1 The First-Order Condition
Given a function, we are interested in what inputs of that function will produce the largest or
smallest values of that function. These inputs are called maximizers and minimizers. In order to identify
maximizers and minimizers, we need to have a rigorous definition that we can verify.
Definition 1.1
Suppose a lies in the domain of a function f(x).
1 a is a maximizer of f if f(a) ≥ f(x) for all other x in the domain of f. In this case f(a) is called
the maximum of f.
2 a is a local maximizer of f if f(a) ≥ f(x) for all other x in some neighborhood of a. It is a
strict local maximizer if f (a) > f(x) instead. In either case f(a) is called a local maximum of
f.
3 a is a minimizer of f if f(a) ≤ f(x) for all other x in the domain of f. In this case f(a) is called
the minimum of f.
4 a is a local minimizer of f if f(a) ≤ f(x) for all other x in some neighborhood of a. It is a
strict local minimizer if f(a) < f(x) instead. In either case f(a) is called a local minimum of f.
smaller values for x < a. Notice that this behavior does not need to persist for all x. We cannot say
how long y = f(x) will stay close to its tangent line. It may be that the neighborhood where it does is
quite small.
We can make the same argument for functions with a negative derivative, except the behavior is
backwards. Aside from switching the direction of some inequalities, the proof is identical. Rather than
treat this result as its own lemma, we present it as a variant.
Variant of Lemma 1.2
If f′(a) < 0, then for all x in some neighborhood of a,
1 f(x) < f(a) if x > a
2 f(x) > f(a) if x < a.
The existence of nearby x such that f(x) > f(a) is inconsistent with the definition of a local
maximizer. Lemma 1.2 and its variant guarantee such x, so:
If f′(a) > 0, then a is not a local maximizer.
If f′(a) < 0, then a is not a local maximizer.
We convert these statements to their contrapositives. If a is a local maximizer, then f′(a) is neither
positive nor negative. This gives us the following condition.
Theorem 1.3 [The First-Order Condition (FOC)]
Let a be a local maximizer or local minimizer of f(x). Either f′(a) does not exist or f′(a) = 0.
Definition 1.4
The values of x that satisfy the first-order condition are called critical points.
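For a concrete illustration (my own example, not from the text), f(x) = x³ − 3x is differentiable everywhere, so its critical points are exactly the zeros of f′(x) = 3x² − 3, namely x = ±1. A simple sign-change scan finds them:

```python
# Illustration (my own example): f(x) = x**3 - 3*x has f'(x) = 3*x**2 - 3,
# which exists everywhere, so critical points are exactly the zeros of f'.
def fprime(x):
    return 3 * x ** 2 - 3

# Scan a grid for zeros and sign changes of f'; each brackets a critical point.
critical = []
xs = [i / 100 for i in range(-300, 301)]
for a, b in zip(xs, xs[1:]):
    if fprime(a) == 0:
        critical.append(a)
    elif fprime(a) * fprime(b) < 0:
        critical.append((a + b) / 2)

# The two critical points x = -1 and x = 1 are found.
assert any(abs(c + 1.0) < 0.01 for c in critical)
assert any(abs(c - 1.0) < 0.01 for c in critical)
```

The FOC only flags these points as candidates; it does not say which, if either, is a maximizer.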
1.1.3 The Second-Order Condition
As the previous example shows, the first-order condition is limited in its conclusion. Knowing the
value of f′(a) is not enough to give us certainty about the shape of the graph near a. For that we need
to know how the first derivative is changing at a. The change in the first derivative is measured by
the second derivative. The second derivative function, denoted f″(x), is the derivative of the function
f′(x). The sign of the second derivative allows us to classify some critical points.
Theorem 1.5 [The Second-Order Condition (SOC)]
If f′(a) = 0 and f″(a) < 0, then a is a strict local maximizer of f.

Figure 1.3: A neighborhood where a = 2 is the maximizer of f(x)
The intuition behind this result relies on the fact that f″(a) is the derivative of f′(x) at a. If
f″(a) < 0, then f′(x) takes on larger values (than 0) to the left of a and smaller values (than 0) to the
right. Traveling left to right, the function increases until it reaches a. After passing a, it decreases.
Naturally, we have the following variant.
Naturally, we have the following variant.
Variant of Theorem 1.5
If f′(a) = 0 and f″(a) > 0, then a is a strict local minimizer of f.
Remark
This is sometimes called the local second-order condition, since it gives information about local
maximizers.
We cannot conclude anything, if f″(a) = 0. a may be a maximizer, a minimizer or neither.
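To illustrate (with my own examples, not from the text): for f(x) = x³ − 3x, the SOC and its variant classify the two critical points x = ±1, while f(x) = x⁴ shows the inconclusive case f″(a) = 0:

```python
# Illustration (my own example): f(x) = x**3 - 3*x has critical points x = -1 and x = 1.
# Its second derivative f''(x) = 6*x classifies them via the SOC and its variant.
fpp = lambda x: 6 * x

assert fpp(-1.0) < 0   # f''(-1) < 0: x = -1 is a strict local maximizer
assert fpp(1.0) > 0    # f''(1) > 0: x = 1 is a strict local minimizer

# When f''(a) = 0 the SOC is silent: x**4 has a minimizer at 0, -x**4 a maximizer,
# and both have second derivative 0 there.
g4 = lambda x: 12 * x ** 2   # second derivative of x**4
assert g4(0.0) == 0
```

The second-order condition thus sorts the candidates produced by the FOC, except where f″ vanishes.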
1.1.5 The Global Second-Order Condition
Remark
Unlike the (local) second-order condition, this theorem requires that the second derivative is
negative everywhere, not just at x*.
In return for a stronger requirement, we obtain a much stronger conclusion.
Economists traditionally use x* to denote a global maximizer. Thus we will use x* to denote a
known maximizer or any point that will imminently be identified as a maximizer.
Figure 1.4: A function and its derivative near a maximizer
Unlike in the local case, we can make some use of a zero second derivative here.
Variants of Theorem 1.6
1 If f′(x*) = 0 and f″(x) > 0 for all x, then x* is the only critical point of f and is the global
minimizer of f.
2 If f′(x*) = 0 and f″(x) ≤ 0 for all x, then x* is a global maximizer of f (not necessarily unique).
3 If f′(x*) = 0 and f″(x) ≥ 0 for all x, then x* is a global minimizer of f (not necessarily unique).
1.1.6 Sufficient and Necessary Conditions
This means that the first-order condition can only rule out values that are not maximizers. We
cannot conclude that a point is a maximizer just by showing that it satisfies the first-order condition.
The second-order condition is a sufficient condition. If f′(a) = 0 and f″(a) < 0 then a must be a
local maximizer. It is not a necessary condition: there may be local maximizers that do not satisfy the
second-order condition.
Example
Let f(x) = −x^4. The local maximizer a = 0 does not satisfy the SOC, because f″(0) = 0.
The global second-order condition is also a sufficient condition and not a necessary one. For example,
a function can have a global maximizer without satisfying it.
Figure 1.5: The graph of f(x) = 2/(x^2 − 2x + 2), which has a global maximizer and unique critical point at
x* = 1, but does not satisfy the GSOC.
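These two facts about the pictured function can be verified numerically (an illustration of my own):

```python
# The function from Figure 1.5: f(x) = 2 / (x**2 - 2*x + 2) = 2 / ((x - 1)**2 + 1).
f = lambda x: 2 / (x ** 2 - 2 * x + 2)

# x* = 1 is the global maximizer: f(1) = 2, since the denominator is minimized there.
assert f(1.0) == 2.0
assert all(f(i / 10) <= 2.0 for i in range(-100, 101))

# Yet the GSOC hypothesis fails: f'' is positive far from x* (the tails are convex).
h = 1e-4
second = (f(3 + h) - 2 * f(3) + f(3 - h)) / h ** 2   # central second difference at x = 3
assert second > 0
```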
Remark
Abstractly, there is no difference between a sufficient condition and a necessary one. There are two
ways to view the statement "If P, then Q":
P is a sufficient condition for Q.
Q is a necessary condition for P.
However, we like to think of conditions as statements that we use to test for what we really care about.
So while, abstractly, being a local maximizer is a sufficient condition for f′(a) = 0, this does not reflect
the way we use the first-order condition in practice.
Generally we will want to have both necessary and sufficient conditions to test for properties that we care about.
With the tools we have so far, we can determine that certain points are local or global maximizers. We
can determine that many points are not local maximizers. Still, there may be points that satisfy the first-
order condition but not the second-order condition. We cannot tell whether these are local maximizers
or not.
The best condition would be a condition that is both necessary and sufficient. No such tool exists for
general optimization, so we are on the lookout for additional conditions to apply when the ones we have
are inconclusive. Coming up with new conditions is usually hard work, but we can obtain one easily, if
we exploit the relationship between maximizers and minimizers.
A sufficient condition for a minimizer can be turned into a necessary condition for a maximizer. If
f′(a) = 0 and f″(a) > 0 then by a variant of Theorem 1.5, a is a strict local minimizer. Thus it cannot
be a maximizer. On the other hand, if f″(a) exists but f′(a) ≠ 0, then a is still not a local maximizer.
This leaves only the following possibilities for the second derivative at a local maximizer a.
Theorem 1.8
If a is a local maximizer of f, then f″(a) ≤ 0 or f″(a) does not exist.
1.1.7 Proving the First-Order Condition
In general, we expect a positive derivative to mean that greater values of x produce greater values
of f(x). The lemma at the heart of the first-order condition described what we can conclude when a
derivative is positive at one point.
The derivative is a limit, so any formal argument needs to start there. Proofs about limits can require
extensive computations and creative problem solving. Fortunately, we will only need the following lemma
about limits.
Lemma 1.9
If lim_{x→a} f(x) = L and L > 0, then there is a neighborhood of a on which f is positive.
Proof
Since L is positive, L/2 is also positive. By definition of a limit there is some neighborhood of a in
which f(x) is within L/2 of L. We can express that distance with an absolute value.
|f(x) − L| < L/2
−L/2 < f(x) − L < L/2
L/2 < f(x) < 3L/2
Since L/2 is positive, so is f(x).
Variant of Lemma 1.9
If lim_{x→a} f(x) = L and L < 0, then there is a neighborhood of a on which f is negative.
We are now in a position to prove Lemma 1.2.
Lemma 1.2
If f′(a) > 0, then for all x in some neighborhood of a,
1 f(x) > f(a) if x > a
2 f(x) < f(a) if x < a.
Proof
Since f′(a) = lim_{h→0} [f(a + h) − f(a)] / h > 0, Lemma 1.9 guarantees that there is a neighborhood of h
values near 0 where [f(a + h) − f(a)] / h > 0. Let x = a + h. In the corresponding neighborhood of x values
around a, we can make the following computations.
1 If x > a then h > 0 and
[f(a + h) − f(a)] / h > 0
[f(x) − f(a)] / h > 0
f(x) − f(a) > 0
f(x) > f(a)
2 If x < a then h < 0 and
[f(a + h) − f(a)] / h > 0
[f(x) − f(a)] / h > 0
f(x) − f(a) < 0
f(x) < f(a)
Remark
This proof uses an earlier lemma. If you are carefully reading this proof to understand the argument,
you may need to look back at the lemma and think about how it is being used here.
Visually, the argument of this proof is that the secant lines from a have positive slope in some
neighborhood of a.
1.1.8 Proving the Second-Order Conditions
In order to prove the second-order condition, we need a stronger conclusion than Lemma 1.2. Calculus
teaches that a function with positive derivatives is increasing, while a function with negative derivatives
is decreasing. In order to make a rigorous argument, we should know formal definitions of increasing
and decreasing.
Definition 1.10
Let f(x) be a function, and let I be an interval in its domain.
f(x) is increasing on I if for any numbers a < b in I, f(a) < f(b).
f(x) is decreasing on I if for any numbers a < b in I, f(a) > f(b).
Variants of Lemma 1.11
1 If f′(x) < 0 on an interval I, then f(x) is decreasing on I.
2 If f′(x) ≥ 0 on an interval I, then f(x) is non-decreasing on I, meaning that if a < b, f(a) ≤ f(b).
3 If f′(x) ≤ 0 on an interval I, then f(x) is non-increasing on I, meaning that if a < b, f(a) ≥ f(b).
Moving forward, we will see strict and non-strict variants so frequently that when we don’t, it is
worth pondering why not.
With these lemmas in hand, we are ready to prove the second-order condition.
Theorem 1.5
If f′(a) = 0 and f′′(a) < 0, then a is a strict local maximizer of f.
Arguments about the second derivative usually rely on the fact that f′′(x) is the derivative of f′(x). We will apply Lemma 1.9, letting f′(x) play the role of the original function with f′′(x) as its derivative. We can thus use the sign of f′′(a) to compare the values of f′(x) and f′(a). Here is the formal argument.
Proof
We suppose f′(a) = 0 and f′′(a) < 0. f′′(x) is the derivative of f′(x), and it is negative at a. A variant of Lemma 1.9 applies. There is some neighborhood I of a where
if x > a, then f′(x) < f′(a) = 0
if x < a, then f′(x) > f′(a) = 0.
The second inequality shows, by Lemma 1.11, that f(x) is increasing to the left of a in I. By definition of increasing this means that f(x) < f(a) for all x < a in I. Similarly, the first inequality shows that f(x) is decreasing to the right of a in I. By definition of decreasing this means that f(x) < f(a) for all x > a in I. Thus f(a) > f(x) for all other x in I. This satisfies the definition of a strict local maximizer, so we conclude that a is a strict local maximizer of f.
The proof of the global second-order condition follows the same logic. The only difference is that we can apply Lemma 1.11 instead of Lemma 1.9 to get a global statement about the first derivative on either side of x∗. Here is the full text for the sake of completeness.
1.2.1 Convex Sets
only checked segments between boundary points, the segment between c and d would be sufficient to indicate that D is not convex. We summarize this argument in the following theorem.
Figure 1.12: A segment that leaves a nonconvex region
Theorem 1.13
A region D is convex if for all a and b in the boundary of D and all t in [0, 1],
(1 − t)a + tb ∈ D
We visually identify convex polygons as polygons where every corner “points outward.” However,
the corners of the polygon are the only places that point outward. The rest of the boundary is flat,
pointing neither inward nor outward. With non-polygonal regions, it is possible to have a boundary that
points outward everywhere, not just at a few corners. This will be a useful property to keep track of, so
we have a name for such regions.
Definition 1.14
A region D is strictly convex, if the segment, not including the endpoints, between any two points a and b in D entirely lies in the interior of D.
Figure 1.13: Two convex regions, one strictly convex and the other not strictly convex
Remark
Every strictly convex region is also convex, but some convex regions are not strictly convex. We say
that strict convexity is a stronger condition and convexity is a weaker condition.
We are often interested in the intersection of two regions. For instance, if one region is the set of points satisfying one condition, while a second region is the set of points satisfying a second condition, then their intersection is the set of points satisfying both conditions. Convexity behaves well in these situations.
Theorem 1.15
If D₁ and D₂ are convex regions, then D₁ ∩ D₂ is convex.
The proof is a good exercise.
1.2.2
Our next step is to use convexity to describe the shape of functions or, more precisely, their graphs.
Most students encounter convex and concave functions in calculus, though they are sometimes called
concave up and concave down. With a rigorous definition of a convex region, we are now in position to
define what it means to be a convex function.
Definition 1.16
Let f(x) be a function whose domain is convex. We say f(x) is convex, if y ≥ f(x), the region above
its graph, is convex. It is concave if y ≤ f(x), the region below its graph, is convex.
Theorem 1.19
A function is concave, if and only if for all a and b in its domain and any 0 ≤ t ≤ 1 we have
(1 − t)f(a) + tf(b) ≤ f((1 − t)a + tb)
where the left side is the height of the secant and the right side is the height of the graph y = f(x(t)).
Remarks
This is an “if and only if” condition. That means it is both necessary and sufficient for establishing
concavity.
Only a function with a convex domain can meet this condition. Otherwise we cannot always evaluate f((1 − t)a + tb), because (1 − t)a + tb may lie outside the domain.
The theorem follows directly from the definitions we have developed so far. Here is a representation
of our reasoning.
(1 − t)f(a) + tf(b) ≤ f((1 − t)a + tb) for all a, b, and t
⇔ the secant between any a and b lies under the graph y = f(x)
⇔ the region below y = f(x) is convex
⇔ f is concave
Applying the same reasoning to the definitions of convexity and strict concavity/convexity gives the
following variants. Note that we cannot demand a strict inequality at t = 0 or t = 1 because the secant
will always intersect the graph y = f(x) at a and b.
f((1 − t)a + tb) = 5 − ((1 − t)a + tb)²                    (definition of f)
= 5 − (1 − t)²a² − 2t(1 − t)ab − t²b²                      (distribute)
≥ 5 − (1 − t)²a² − t(1 − t)(a² + b²) − t²b²                (AM-GM)
= 5 − (1 − t)²a² − t(1 − t)a² − t(1 − t)b² − t²b²          (distribute)
= 5 − (1 − t)(t + 1 − t)a² − t(1 − t + t)b²                (factor)
= 5 − (1 − t)a² − tb²                                      (simplify)
= 5(1 − t) − (1 − t)a² + 5t − tb²                          (break up 5)
= (1 − t)(5 − a²) + t(5 − b²)                              (factor)
= (1 − t)f(a) + tf(b)                                      (definition of f)
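The algebra above can also be spot-checked numerically. The sketch below (with arbitrarily chosen sample points, and a small tolerance for floating-point rounding) evaluates both sides of the inequality for f(x) = 5 − x².

```python
# Numeric spot-check (illustration, not proof) of the concavity
# inequality (1-t)f(a) + t f(b) <= f((1-t)a + t b) for f(x) = 5 - x^2.

def f(x):
    return 5 - x ** 2

for a, b in [(-3.0, 2.0), (0.0, 4.0), (-1.5, -0.5)]:  # arbitrary sample pairs
    for k in range(11):
        t = k / 10
        secant = (1 - t) * f(a) + t * f(b)   # height of the secant
        graph = f((1 - t) * a + t * b)       # height of the graph
        assert secant <= graph + 1e-12       # the inequality just derived
```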
The inequality condition is difficult to verify, even for this simple function. It is much worse for a function involving a square root or an exponential. On the other hand, the inequality condition can be convenient for abstract results.
Theorem 1.20
If f(x) and g(x) are both concave, then so is their sum: h(x) = f(x) + g(x)
Proof
Let a and b be in the domain of h. Since the domain of h is the intersection of the domains of f and g, h has a convex domain (Theorem 1.15). We know from the concavity of f and g that for all t in [0, 1]:
(1 − t)f(a) + tf(b) ≤ f((1 − t)a + tb)
(1 − t)g(a) + tg(b) ≤ g((1 − t)a + tb)
We now verify that the inequality holds for h:
(1 − t)h(a) + th(b) = (1 − t)f(a) + (1 − t)g(a) + tf(b) + tg(b)
≤ f((1 − t)a + tb) + g((1 − t)a + tb)
= h((1 − t)a + tb)
We conclude that h is concave.
1.2.4 Verifying Concavity with an Inequality
Main Idea
The inequality condition is sometimes useful at a theoretical level, but is painful to check for any but
the simplest specific functions. We will want a condition that is easier to check.
1.2.5 Concavity and Tangent Lines
We sought out the idea of concavity because it applies to functions of many variables. Some results
about concave functions are related to their derivatives. We will develop these tools in the setting
that is most comfortable and intuitive: one-variable functions. Once we have presented the theory of
optimization for multivariable functions, we will revisit and generalize these results.
Derivatives tell us about tangent lines, and tangent lines have an interesting relationship to convex
sets. If you draw a nonconvex set, you will be able to find a tangent line that crosses through the set.
A convex set, on the other hand, lies entirely on one side of any of its tangent lines. We will not try
to prove this for general sets, but we will prove it for the strictly convex region below the graph of a
strictly concave function.
Lemma 1.21
If f(x) is a strictly concave function that is differentiable at a, then the rest of the graph y = f(x) lies
below its tangent line at (a, f(a)).
Figure 1.19: The graph of a strictly concave function and some of its tangent lines
This lemma describes the relationship between y = f(x) and a single tangent line. It can apply to
a function that is not differentiable at other points in its domain. If the function is differentiable on its
entire domain, we can produce a necessary and sufficient condition on the tangent lines.
Theorem 1.22
Let f(x) be a differentiable function on a convex domain. f(x) is strictly concave, if and only if the graph y = f(x) lies below all of its tangent lines (except at the point of tangency).
The proofs of these results are somewhat technical. We will provide them after we have discussed
applications.
1.2.6 Concavity Conditions for a Maximizer
Our main use of concavity (for now) is to produce sufficient conditions for maximizers. The following
corollary follows from Lemma 1.21.
Corollary 1.23
If f(x) is a strictly concave function and f′(x∗) = 0, then x∗ is the unique global maximizer of f.
Proof
The assumption that f′(x∗) = 0 tells us that f is differentiable at x∗, and the tangent line at x∗ is horizontal. By Lemma 1.21, the rest of the graph lies below this line. We conclude that x∗ is the unique global maximizer.
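A quick numeric sketch of the corollary, using the strictly concave f(x) = 5 − x², whose derivative f′(x) = −2x vanishes only at x∗ = 0 (the sampling grid is an arbitrary choice, and sampling only suggests, never proves, global maximality):

```python
# Sketch of Corollary 1.23 for f(x) = 5 - x^2: the unique critical point
# x* = 0 beats every other sampled point.

def f(x):
    return 5 - x ** 2

x_star = 0.0                      # the only root of f'(x) = -2x
for k in range(-50, 51):          # sample grid from -5 to 5 in steps of 0.1
    x = k / 10
    if x != x_star:
        assert f(x) < f(x_star)   # strictly smaller everywhere else sampled
```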
Figure 1.20: The graph of y = f(x), which lies below the tangent line at x∗.
Like in the global second-order condition, we can also argue that x∗ is the only critical point. If there were another, then it would also be the unique global maximizer, which is nonsense.
1.2.7 A Second Derivative Test for Concavity
We would still like to have a better way to determine when a function is concave. In calculus, we
learn that the sign of the second derivative determines the concavity of a function. We are now in a
position to formally argue this result.
Theorem 1.24
If f is a function on a convex domain D ⊆ R and f′′(x) < 0 for all x in D, then f(x) is strictly concave.
This means that a negative second derivative is a sufficient condition for strict concavity. Is it
necessary? No, consider the following.
Example
Consider f(x) = −x⁴. This function is strictly concave, but f′′(0) = 0, so the second derivative is not always negative.
Putting this with Corollary 1.23 gives a string of conditions. Each is sufficient and neither is necessary,
so the implications can only be followed from left to right.
f′′(x) < 0 for all x  ⇒  f strictly concave  ⇒  any x∗ with f′(x∗) = 0 is a maximizer
We could claim that Corollary 1.23 and Theorem 1.24 have combined to give an alternate argument
for the global second-order condition. However, proving Theorem 1.24 without the global second-order
condition would require us to redo much of the work that went into that theorem. Instead we will use
the global second-order condition to keep our proof of Theorem 1.24 brief.
Proof
Our goal is to apply Theorem 1.22. Let a be a point in the domain of f. Since f′′(x) exists, f′(a) exists. Let y = ℓ(x) be the tangent line to y = f(x) at a. Let g(x) = f(x) − ℓ(x). We know the following about g:
g′′(x) = f′′(x) − ℓ′′(x) < 0 for all x, since ℓ′′(x) = 0.
g′(a) = f′(a) − ℓ′(a) = 0, since the derivative is the slope of the tangent line.
g(a) = 0, since ℓ(a) = f(a).
g and a satisfy the conditions of the global second-order condition. Thus a is the unique maximizer of g, meaning g(x) < g(a) = 0 for all x ≠ a. Thus y = f(x) lies below y = ℓ(x). This reasoning holds for the tangent line at any value a. Theorem 1.22 tells us that f is strictly concave.
1.2.9 Proving the Relationship Between Concavity and Tangent Lines
Proof
Pick any b ≠ a on the graph y = f(x). The tangent line to y = f(x) at (a, f(a)) has equation y = f(a) + f′(a)(x − a). We will first show that (b, f(b)) does not lie above the tangent line by showing that f(b) ≤ f(a) + f′(a)(b − a).
Consider the case that b > a. Denote the secant from (a, f(a)) to (b, f(b)) by the equation y = s(x)
and let g(x) = s(x) − f(x). The region below y = f(x) is convex, so y = s(x) is below y = f(x)
for any x between a and b. Their difference, g(x), is less than or equal to 0 on this interval. By the
contrapositive of Lemma 1.2 we know g′(a) cannot be greater than 0.
s′(a) is the slope of the secant: [f(b) − f(a)]/(b − a). We substitute this into g′(a) ≤ 0 and solve.
g′(a) ≤ 0
s′(a) − f′(a) ≤ 0                                (sum rule)
[f(b) − f(a)]/(b − a) − f′(a) ≤ 0                (substitute)
f(b) − f(a) − f′(a)(b − a) ≤ 0                   (b − a > 0)
f(b) ≤ f(a) + f′(a)(b − a)
We conclude that (b, f(b)) does not lie above the tangent line.
Now we must also show it does not lie on the tangent line. This is easiest with a contradiction
argument. Suppose some (b, f(b)) lies on the tangent line at a. Then consider any c between a and b. (c, f(c)) also does not lie above the tangent line by the argument we gave for b. However, the secant from (a, f(a)) to (b, f(b)) is part of the tangent line, and (c, f(c)) lies above this by the definition of strict concavity. This is a contradiction. Thus there is no (b, f(b)) that lies on the tangent line.
The case where b < a can be proved with a similar argument. We conclude that (b, f(b)) lies below
the tangent line to y = f(x) at (a, f(a)).
Figure 1.22: A tangent line and a secant to a concave function
Theorem 1.22 is an "if and only if" statement, so it requires two arguments to prove. We must show that if f is strictly concave, then its graph lies below its tangent lines. We must also show that if its graph lies below its tangent lines, then it is strictly concave. Fortunately, Lemma 1.21 has already shown that if f(x) is strictly concave, then y = f(x) lies below the tangent line at a.
Proof
Suppose we know f is strictly concave. For each a in the domain of f, Lemma 1.21 states that y = f(x) lies below its tangent line at a. Since this holds for all a, y = f(x) lies below all of its tangent lines.
Now suppose that we know y = f (x) lies below all its tangent lines. We will verify that the region
under y = f (x) is strictly convex. Let a and b be any points in the domain of f . Let c be any point
between them. We know that (a, f(a)) and (b, f(b)) both lie below the tangent line to y = f(x) at
(c, f(c)). Thus the tangent line lies above the entire secant between (a, f(a)) and (b, f(b)). (c, f(c))
is on this tangent line, so (c, f(c)) is above the secant as well.
Since this argument holds for any such c, we conclude that the entire graph between a and b lies
above the secant. Since this is true for any choice of a and b, we conclude that the region below
y = f(x) is strictly convex and f (x) is a strictly concave function.
Figure 1.23: A secant with both endpoints below the tangent line at (c, f(c))
1.2.10 Non-Strict Concavity and Convexity
We have primarily worked with strictly concave functions, because they give the strongest results
about maximizers. The properties of non-strict concavity and convexity give similar results. The results
for convexity are obtained by flipping the directions of inequalities. They are probably not worth memorizing. On the other hand, economists frequently need to work with non-strictly concave functions. It
is worth understanding how utility maximization works for these.
Graphs of non-strictly concave functions can intersect their own tangent lines. For instance, a linear
function is concave, and its graph is identical to its tangent line. While the graph of a concave function
can intersect the tangent lines, it cannot go above them.
Variants of Lemma 1.21
Let f(x) be a function on a convex domain.
1 If f (x) is a concave function that is differentiable at a, then the graph y = f(x) has no points
above its tangent line at (a, f(a)).
2 If f(x) is a strictly convex function that is differentiable at a, then the rest of the graph y = f (x)
lies above its tangent line at (a, f(a)).
3 If f(x) is a convex function that is differentiable at a, then the graph y = f(x) has no points
below its tangent line at (a, f(a)).
The proof for non-strict concavity is identical to the proof for strict concavity, with the contradiction
argument omitted. The necessary and sufficient conditions are unsurprising.
Variants of Theorem 1.22
Let f(x) be a differentiable function on a convex domain.
1 f(x) is concave if and only if the graph y = f(x) has no points above any of its tangent lines.
2 f(x) is strictly convex if and only if the graph y = f(x) lies above each of its tangent lines (except
at the point of tangency).
3 f(x) is convex if and only if the graph y = f(x) has no points below any of its tangent lines.
These results can also be extended to well-behaved non-differentiable functions. For instance, what-
ever “tangent lines” you might reasonably add to the function in Figure 1.16 will not stray below the
graph. Rigorously defining tangent lines in these situations would require a detailed argument involving
limits, so we will not pursue it here.
The variants of Lemma 1.21 give rise to their own corollaries about maximizers and minimizers.
Variants of Corollary 1.23
1 If f(x) is a concave function and f′(x∗) = 0, then x∗ is a global maximizer (but other points may be as well).
2 If f(x) is a strictly convex function and f′(x∗) = 0, then x∗ is the unique global minimizer of f.
3 If f(x) is a convex function and f′(x∗) = 0, then x∗ is a global minimizer (but other points may be as well).
Figure 1.24: A global but non-unique maximizer of a not strictly concave function
Finally, we can use the Variants of Theorem 1.22 to produce second-derivative tests for the other
forms of concavity and convexity.
Variants of Theorem 1.24
1 If f is a function on a convex domain D ⊂ R and f′′(x) ≤ 0 for all x in D, then f(x) is concave.
2 If f is a function on a convex domain D ⊂ R and f′′(x) > 0 for all x in D, then f(x) is strictly convex.
3 If f is a function on a convex domain D ⊂ R and f′′(x) ≥ 0 for all x in D, then f(x) is convex.
In the non-strict case, the condition is necessary as well as sufficient.
Corollary 1.25
A twice differentiable function f(x) on a convex domain D is concave, if and only if f′′(x) ≤ 0 for all x in D.
The sufficiency of f′′(x) ≤ 0 is found in the preceding variant. We establish necessity with a contrapositive argument. It should look familiar to anyone who read our proof of Theorem 1.24. Notice that it is making a local rather than a global argument.
Proof
Suppose f′′(x) ≤ 0 for all x. A variant of Theorem 1.24 tells us that f is concave.
On the other hand, suppose f′′(a) > 0 for some a. First note that since f′′(x) exists, f is a differentiable function. Let y = ℓ(x) be the tangent line to y = f(x) at a. Consider g(x) = f(x) − ℓ(x). We know the following about g:
g′′(a) = f′′(a) − ℓ′′(a) > 0, since ℓ′′(x) = 0.
g′(a) = f′(a) − ℓ′(a) = 0, since the derivative is the slope of the tangent line.
g(a) = 0, since f(a) = ℓ(a).
Thus by a variant of the (local) second-order condition, Theorem 1.5, a is a strict local minimizer of g, meaning g(x) > g(a) = 0 for all x ≠ a in a neighborhood of a. Thus y = f(x) lies above y = ℓ(x) in this neighborhood. By a variant of Theorem 1.22, f is not concave.
The contrapositive of this argument is that if f is concave, then there is no a such that f′′(a) > 0. Equivalently, f′′(x) ≤ 0 for all x.
This result means that we not only have a way to show that f(x) is concave, but also that it is not concave. If f′′(a) > 0 for any a, then f is not a concave function.
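This "not concave" test can be sketched with a finite-difference approximation of the second derivative. The helper name, step size, and example function f(x) = x³ are all ad hoc choices.

```python
# Finite-difference sketch: if some second-difference quotient is clearly
# positive, the function cannot be concave.  Here f(x) = x^3 has
# f''(1) = 6 > 0, so f is not concave.

def second_diff(f, a, h=1e-4):
    # central second-difference approximation of f''(a)
    return (f(a + h) - 2 * f(a) + f(a - h)) / h ** 2

f = lambda x: x ** 3
approx = second_diff(f, 1.0)   # true value is f''(1) = 6
assert approx > 1              # clearly positive, so f is not concave
```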
1.2.11
The most important definitions and results from this section were
The definition of a (strictly) convex region (Definitions 1.12 and 1.14)
The definition of a (strictly) concave function (Definitions 1.16 and 1.17)
The inequality condition for concave functions (Theorem 1.19)
The tangent line condition for strictly concave functions (Theorem 1.22)
The sufficient condition for a maximizer using strict concavity (Corollary 1.23)
The second derivative tests for strictly concave and concave functions (Theorem 1.24 and Corollary
1.25)
Here is a summary of which conditions and statements from this section imply which others. Some of these conditions only apply to differentiable, or even twice differentiable, functions. Each strict condition on the left implies its non-strict counterpart on the right.
(1 − t)f(a) + tf(b) < f((1 − t)a + tb)   →   (1 − t)f(a) + tf(b) ≤ f((1 − t)a + tb)
f strictly concave   →   f concave
y = f(x) below tangent lines   →   y = f(x) not above tangent lines
f′′(x) < 0   →   f′′(x) ≤ 0
critical points are unique maximizers   →   critical points are maximizers
Figure 1.25: Conditions and applications of concave functions
The value of df/dt is realized geometrically as the slope of the tangent line to the graph of the composition function: y = f(x(t)).
Using compositions to find maximizers has two main consequences. The first is helpful. The second
creates difficulties.
1 f(x(t)) is a one-variable function. No matter the dimension of x, we can graph y = f(x(t)) in
the ty-plane and look for maximizers there.
2 One path will not cover the whole domain of f(x). We have to study many different paths x(t),
if we want to compare a potential maximizer to every other point in the domain.
We could study all paths in the domain of f, but that would be needlessly difficult. Instead we
restrict ourselves to paths that are lines. We will express these lines using the following notation.
Notation
Every line has a direction vector v = (v₁, v₂, . . . , vₙ). A line through a in the direction of v has the equation
x(t) = a + tv
In a convex domain, any two values f(a) and f(b) can be attained as values of the composition f(a + tv), where v = b − a. If we want to compare these values (for instance to argue that a is or is not a maximizer), we compare f(a + 0v) and f(a + 1v).
Lemma 1.26
Suppose f (x) has a convex domain. A point a is a (local or global) maximizer, if and only if 0 is a
(local or global) maximizer of f(a + tv) for all direction vectors v.
This lemma tells us that a necessary condition for 0 to be a maximizer of f(a + tv) for all v is
a necessary condition for a to be a maximizer of f(x). Similarly, a sufficient condition for 0 to be a
maximizer of f(a + tv) for all v is a sufficient condition for a to be a maximizer of f(x).
The compositions f(a + tv) are realized as cross sections of the graph y = f(x) above the line
x(t) = a + tv.
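As a small sketch (the function, point, and direction are illustrative choices), restricting a two-variable function to a line produces an ordinary one-variable function of t:

```python
# Restricting f to the line a + t*v gives a one-variable composition g(t).

def f(x1, x2):
    return -(x1 ** 2) - x2 ** 2   # strictly concave, maximized at (0, 0)

a = (0.0, 0.0)                     # candidate maximizer
v = (1.0, 2.0)                     # one arbitrary direction

def g(t):                          # the composition f(a + t v) = -5 t^2
    return f(a[0] + t * v[0], a[1] + t * v[1])

assert g(0) == 0.0
for t in [-1.0, -0.5, 0.5, 1.0]:
    assert g(t) < g(0)             # t = 0 maximizes this cross section
```

Lemma 1.26 says that a maximizes f exactly when t = 0 maximizes every such cross section, one for each direction v.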
Remark
If a lies at the boundary of the domain, then it may not be possible to travel in all directions v from a and still evaluate f(a + tv). Fortunately, the first-order condition still holds. If ∇f(a) exists, then the partial derivatives fᵢ(a) exist. Thus f(a + heᵢ) exists for h near 0. That means we can at least travel in the standard basis directions from a without immediately leaving the domain of f(x). At a local maximizer, the derivatives in these directions must be 0.
1.3.3 Computing Critical Points
What does the multivariable first-order condition say about f(x₁, x₂) = x₁⁴ − 4x₁x₂ + x₂⁴?
Solution
We compute the gradient vector by taking the partial derivatives with respect to x₁ and x₂.
∇f(x) = (f₁(x), f₂(x))
= (4x₁³ − 4x₂, −4x₁ + 4x₂³)
This is never undefined, so we solve for where it is (0, 0).
4x₁³ − 4x₂ = 0        −4x₁ + 4x₂³ = 0
x₁³ = x₂                      (solve for x₂)
−4x₁ + 4x₁⁹ = 0               (substitute)
4x₁(x₁⁸ − 1) = 0              (factor)
x₁ = 0 or ±1
Substituting back into x₁³ = x₂ lets us solve for the corresponding values of x₂. We obtain the critical points (0, 0), (1, 1), and (−1, −1). We conclude that no point other than (0, 0), (1, 1), and (−1, −1) can be a local maximizer or a local minimizer. We cannot conclude whether each of these points is a local maximizer, a local minimizer, or neither.
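The computation can be double-checked by plugging the claimed critical points back into the gradient (the helper name grad is ours):

```python
# Verify the critical points of f(x1, x2) = x1^4 - 4*x1*x2 + x2^4 by
# evaluating the gradient at each one.

def grad(x1, x2):
    return (4 * x1 ** 3 - 4 * x2, -4 * x1 + 4 * x2 ** 3)

for point in [(0, 0), (1, 1), (-1, -1)]:
    assert grad(*point) == (0, 0)    # gradient vanishes at each candidate
assert grad(1, 0) != (0, 0)          # a non-critical point, for contrast
```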
Point Form and Column Form
When we want to emphasize that a vector x represents a point in Rⁿ, we write it as x = (x₁, x₂, . . . , xₙ). If we want to do matrix multiplication, we can write it as a column vector:
x = [x₁ x₂ ··· xₙ]ᵀ
The expression
d²/dt² f(a + tv) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ fᵢⱼ(a + tv) vⱼvᵢ
that we obtained can be written as a matrix product:
[v₁ v₂ ··· vₙ] ·
⎡ f₁₁(a + tv) f₁₂(a + tv) ··· f₁ₙ(a + tv) ⎤
⎢ f₂₁(a + tv) f₂₂(a + tv) ··· f₂ₙ(a + tv) ⎥
⎢      ⋮           ⋮        ⋱       ⋮     ⎥
⎣ fₙ₁(a + tv) fₙ₂(a + tv) ··· fₙₙ(a + tv) ⎦
· [v₁ v₂ ··· vₙ]ᵀ
You may want to convince yourself that this algebra is correct with a 2 or 3 dimensional example.
What do we make of the individual factors in this product? The vectors on either end of this expression
are just v and its transpose (v flipped sideways). The matrix in the middle seems important. We should
have a name for it.
Definition 1.28
Given a function f(x) of n variables, the Hessian is an n × n matrix function of x whose entries are the second partial derivatives of f at x. Its formula is
Hf(x) =
⎡ f₁₁(x) f₁₂(x) ··· f₁ₙ(x) ⎤
⎢ f₂₁(x) f₂₂(x) ··· f₂ₙ(x) ⎥
⎢    ⋮      ⋮     ⋱     ⋮  ⎥
⎣ fₙ₁(x) fₙ₂(x) ··· fₙₙ(x) ⎦
Notice that for well-behaved functions, fᵢⱼ = fⱼᵢ. This means the Hessian will be a symmetric matrix.
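The Hessian can also be approximated numerically, which makes the symmetry visible in practice. The sketch below uses central differences; the helper name hessian, the step size, and the example function (the one from Section 1.3.3) are our own choices.

```python
# Finite-difference sketch of Definition 1.28: build the Hessian of
# f(x1, x2) = x1^4 - 4*x1*x2 + x2^4 numerically and observe symmetry.
# The exact Hessian at (1, 1) is [[12, -4], [-4, 12]].

def f(x):
    return x[0] ** 4 - 4 * x[0] * x[1] + x[1] ** 4

def hessian(f, x, h=1e-3):
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

H = hessian(f, [1.0, 1.0])
assert abs(H[0][1] - H[1][0]) < 1e-9   # symmetric, as the remark predicts
assert abs(H[0][0] - 12) < 1e-2        # close to the exact entry 12
assert abs(H[0][1] + 4) < 1e-2         # close to the exact entry -4
```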
1.3.6 Negative Definite Matrices
Our Hessian notation allows us to write d²/dt² f(a + tv) = vᵀ Hf(a + tv) v. The following vocabulary captures exactly which Hessian matrices will produce a negative (or positive) second derivative in all directions.
Definition 1.29
A symmetric n × n matrix M is negative definite, if for all nonzero n-vectors v,
vᵀMv < 0.
It is positive definite, if for all nonzero n-vectors v,
vᵀMv > 0.
Remark
We may think, based on positive and negative numbers, that most matrices are either positive definite or negative definite. In fact, a randomly chosen matrix is most likely to be neither. For most matrices, vᵀMv is positive for some v and negative for others.
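A tiny sketch of this remark, with an arbitrarily chosen symmetric matrix and the helper name quad_form ours:

```python
# For a "typical" symmetric matrix, v^T M v takes both signs.

M = [[1, 3],
     [3, 1]]                      # symmetric, but indefinite

def quad_form(M, v):
    return sum(M[i][j] * v[i] * v[j] for i in range(2) for j in range(2))

assert quad_form(M, (1, 1)) > 0   # 1 + 3 + 3 + 1 = 8
assert quad_form(M, (1, -1)) < 0  # 1 - 3 - 3 + 1 = -4
```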
Connecting this back to our original goal, we can apply the definition of negative definite to describe the sign of the second derivatives of the compositions f(a + tv). For a critical point a, we can draw the following conclusions.
If Hf(a) is negative definite, then d²f/dt²(0) < 0 for all v. Since df/dt(0) = 0, t = 0 satisfies the second-order condition for all v. We conclude a is a strict local maximizer of f.
If Hf(x) is negative definite for all x, then d²f/dt²(t) < 0 for all t and all v. Since df/dt(0) = 0, t = 0 satisfies the global second-order condition for all v. We conclude a is the unique global maximizer of f.
Without an efficient way to test whether a matrix is negative definite, we have only managed to restate the problem. We can say that we are checking whether Hf(a) is negative definite, or we can say that we are checking whether
Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ fᵢⱼ(a) vⱼvᵢ < 0 for all v₁, v₂, . . . , vₙ.
The computations are the same, and they are daunting.
Fortunately, there is an easier way to show that a matrix is negative or positive definite.
Theorem 1.30
A symmetric n × n matrix M is negative definite if the upper left square minors Mᵢ have determinants that satisfy the alternating condition:
(−1)ⁱ |Mᵢ| > 0 for all 1 ≤ i ≤ n
Example
For a 4 × 4 matrix, the Mᵢ are the upper left 1 × 1, 2 × 2, 3 × 3, and 4 × 4 blocks, and to apply Theorem 1.30 we would check
|M₁| < 0, |M₂| > 0, |M₃| < 0, |M₄| > 0.
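Theorem 1.30's check can be sketched in a few lines. The helper names det and is_negative_definite are ours, and cofactor expansion is only sensible for small matrices.

```python
# Alternating-minor test for negative definiteness (Theorem 1.30).

def det(M):
    # cofactor expansion along the first row (fine for small matrices)
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

def is_negative_definite(M):
    n = len(M)
    for i in range(1, n + 1):
        leading = [row[:i] for row in M[:i]]   # upper left i x i minor M_i
        if (-1) ** i * det(leading) <= 0:      # needs (-1)^i |M_i| > 0
            return False
    return True

assert is_negative_definite([[-2, 1], [1, -2]])      # minors: -2, then 3
assert not is_negative_definite([[2, 0], [0, 2]])    # positive definite instead
```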
Variant of Theorem 1.30
A symmetric n × n matrix M is positive definite if the upper left square minors have positive determinants.
1.3.7 The Multivariable Second-Order Conditions
Now that we have a check for negative definite and positive definite matrices, it is worth stating
the results of our investigation as theorems. These are our most important sufficient conditions for a
maximizer of a multi-variable function.
Theorem 1.31 [The Multivariable Second-Order Condition]
Given a function f(x) with a in its domain. If
∇f(a) = 0
Hf(a) is negative definite
then a is a strict local maximizer of f.
Theorem 1.32 [The Multivariable Global Second-Order Condition]
Suppose f is a function with a convex domain D and x∗ is in D. If
∇f(x∗) = 0
Hf(x) is negative definite for all x ∈ D
then x∗ is the only critical point of f and the unique global maximizer.
We can summarize the reasoning behind the global second-order condition as follows:
Hf(x) is negative definite for all x  ⇒  d²/dt² f(x∗ + tv) < 0 for all directions v
∇f(x∗) = 0  ⇒  df/dt(0) = 0 along all lines x∗ + tv
Together, these give that 0 is the global maximizer of f(x∗ + tv) for all directions v, and therefore x∗ is the global maximizer of f(x).
These theorems have variants for minimizers.
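As a sketch, we can classify the critical points of the earlier example f(x₁, x₂) = x₁⁴ − 4x₁x₂ + x₂⁴, whose exact Hessian is [[12x₁², −4], [−4, 12x₂²]], using the minor test and its minimizer variant. The helper names are ours.

```python
# Classify the critical points (0,0), (1,1), (-1,-1) of
# f(x1, x2) = x1^4 - 4*x1*x2 + x2^4 with the second-order conditions.

def hessian(x1, x2):
    return [[12 * x1 ** 2, -4], [-4, 12 * x2 ** 2]]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def classify(M):
    d1, d2 = M[0][0], det2(M)               # leading principal minors
    if d1 > 0 and d2 > 0:
        return "strict local minimizer"     # positive definite Hessian
    if d1 < 0 and d2 > 0:
        return "strict local maximizer"     # negative definite Hessian
    return "test inconclusive"

assert classify(hessian(1, 1)) == "strict local minimizer"
assert classify(hessian(-1, -1)) == "strict local minimizer"
# at (0,0) the Hessian [[0,-4],[-4,0]] is indefinite, so the test fails
assert classify(hessian(0, 0)) == "test inconclusive"
```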
1.3.9 The Hessian and Concavity
Lemma 1.35
A function f(x) on a convex domain is concave, if and only if f(a + tv) is concave for every a in the
domain of f and every direction vector v.
f(a + tv) is concave for all lines a + tv
⇔ y = f(a + tv) lies above its secants (t, s(t)) for all lines a + tv   (Def 1.16)
⇔ y = f(x) lies above the secants x(t) = a + tv, y(t) = s(t) for all lines a + tv   (Cor 1.34)
⇔ f(x) is concave   (Def 1.16)
Variants of Lemma 1.35
Suppose f(x) is a function on a convex domain
1 f(x) is strictly concave if and only if f(x(t)) is strictly concave for every line x(t) in the domain
of f.
2 f(x) is convex if and only if f (x(t)) is convex for every line x(t) in the domain of f.
3 f(x) is strictly convex if and only if f(x(t)) is strictly convex for every line x(t) in the domain of
f.
The result of this lemma is that any test for concavity or convexity of the single-variable compositions
f(a + tv) becomes a test for the concavity of f(x) itself. If every composition passes, then f does.
If at least one composition fails, then so does f. Our natural next step is to take our theorems from
single-variable concavity and apply them to these compositions.
Theorem 1.22 showed that the graphs of single-variable strictly concave functions lie below their tangent
lines. Tangent lines to a composition have the same heights as the tangent lines of the multivariable
graph. We can formalize this relationship with a lemma.
Lemma 1.36
If the line y = ℓ(t) is tangent to y = f(a + tv), then x(t) = a + tv, y(t) = ℓ(t) is a tangent line to y = f(x).
We can also use compositions to update our theorems about second derivatives and concavity.
Theorem 1.39 [Multivariable Version of Theorem 1.24]
If f is a function on a convex domain D ⊂ Rⁿ and d²/dt² f(a + tv) < 0 for all lines a + tv in D, then f(x) is strictly concave.
Corollary 1.40 [Multivariable Version of Corollary 1.25]
A twice differentiable function f(x) on a convex domain D is concave, if and only if d²/dt² f(a + tv) ≤ 0 on all lines a + tv in D.
As we have seen, the sign of these second derivatives depends on the Hessian matrix. This gives us
our most useful computational test for the concavity of a multivariable function.
Theorem 1.41
Let f(x) be a twice-differentiable function on a convex domain D. If Hf (x) is negative definite for all
x in D, then f is strictly concave.
This means that concavity plays the same role in multivariable optimization as in single-variable
optimization. For some functions, we can identify a maximizer by concavity even though they do not
satisfy the second-order condition. On the other hand, the most convenient way to identify concave
functions is still the second derivatives.
1.3.10 Negative Semi-Definite Matrices
Changing the condition f′′(x) < 0 to f′′(x) ≤ 0 was a natural approach to generalizing Theorem 1.24. As a bonus, we obtained Corollary 1.25, which is both a necessary and sufficient condition for concavity. The same is possible for multivariable functions, but the condition on the Hessian is surprisingly complicated.
We might expect the multivariable analogue to be that (−1)^i |M_i| ≥ 0 for each i. Consider a function like f(x_1, x_2, x_3) = x_3^2. In the x_3 direction this function is a parabola, which is strictly convex. Unfortunately, its Hessian passes our expected test for concavity.

Hf(x_1, x_2, x_3) =
[ 0  0  0 ]
[ 0  0  0 ]
[ 0  0  2 ]

This has |M_i| = 0 for each i. The determinant cannot detect the curvature of later variables if the earlier variables produce a row of zeroes in Hf(x). Our expected test is not a correct test. To avoid this problem, we must consider different orders of variables. Which orders? All of them.
Theorem 1.42
Let f(x) be a twice-differentiable function on a convex domain D ⊆ R^n. Let σ be any reordering of the coordinates of R^n, and let σM = Hf(σx).
1 f is concave if and only if (−1)^i |σM_i| ≥ 0 for all σ and all 1 ≤ i ≤ n.
If so, then v^T Hf(x) v ≤ 0 for all nonzero v. We say Hf(x) is negative semidefinite.
2 f is convex if and only if |σM_i| ≥ 0 for all σ and all 1 ≤ i ≤ n.
If so, then v^T Hf(x) v ≥ 0 for all nonzero v. We say Hf(x) is positive semidefinite.
To give a sense of the amount of computation required, we will consider an example. Suppose

Hf(x) =
[ 0  2  5 ]
[ 2 −6  0 ]
[ 5  0 −3 ]

We check whether the leading principal minors satisfy the alternating-sign conditions: |M_1| = |0| ≤ 0, |M_2| = det[ 0 2 ; 2 −6 ] ≥ 0, and |M_3| = det Hf(x) ≤ 0.
But then we must do the same for all possible reorderings of x_1, x_2 and x_3, meaning we must check the same three minors for each of

[ 0  2  5 ; 2 −6  0 ; 5  0 −3 ]   the original Hf(x)
[ −6  2  0 ; 2  0  5 ; 0  5 −3 ]   σ switches x_1 and x_2
[ −3  0  5 ; 0 −6  2 ; 5  2  0 ]   σ switches x_1 and x_3
[ 0  5  2 ; 5 −3  0 ; 2  0 −6 ]   σ switches x_2 and x_3
[ −6  0  2 ; 0 −3  5 ; 2  5  0 ]   σ : (x_1, x_2, x_3) ↦ (x_2, x_3, x_1)
[ −3  5  0 ; 5  0  2 ; 0  2 −6 ]   σ : (x_1, x_2, x_3) ↦ (x_3, x_1, x_2)
This is a significant amount of work, though linear algebra knowledge lets us avoid some. For
instance, all the |σM
3
| are equal. As the dimension increases, there are more symmetries to exploit.
With these shortcuts, the main driver in complexity is computing the n×n determinant, not the number
of re-ordered smaller determinants. Checking whether a matrix is negative semidefinite takes about twice
as many operations as checking that it is negative definite.
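As a sketch of the amount of work this test describes, the brute-force version over all n! variable orders can be written out directly (function names are ours; the loop ignores the symmetry shortcuts mentioned above):

```python
# Sketch of the reordered-minor test for negative semidefiniteness:
# check (-1)^i |σM_i| >= 0 for every reordering σ of the variables.
from itertools import permutations

def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def is_negative_semidefinite(H):
    n = len(H)
    for sigma in permutations(range(n)):
        P = [[H[a][b] for b in sigma] for a in sigma]  # reordered Hessian
        for i in range(1, n + 1):
            if (-1) ** i * det([row[:i] for row in P[:i]]) < 0:
                return False
    return True

# The Hessian of f(x1, x2, x3) = x3^2: every |M_i| is 0 in the original
# order, but reorderings that put x3 first expose the convex direction.
print(is_negative_semidefinite([[0, 0, 0], [0, 0, 0], [0, 0, 2]]))  # False

# The worked example above already fails at |M_2| = -4 in the original order.
print(is_negative_semidefinite([[0, 2, 5], [2, -6, 0], [5, 0, -3]]))  # False
```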
1.3.11
The most important definitions and results from this section were
The multivariable first-order condition (Theorem 1.27)
The Hessian matrix (Definition 1.28)
Positive and negative definite (Definition 1.29)
The determinant test for a negative definite matrix (Theorem 1.30)
The multivariable global second-order condition (Theorem 1.32)
The Hessian test for strict concavity (Theorem 1.41)
Here is a summary of which conditions and statements from this section imply which others.
Figure 1.32: Relationships between the conditions of multivariable optimization
Figure 2.2: The graph y = f (x) over a curve S
Figure 2.3: The graph y = f (x) over a finite set S
Constraints arise naturally in economics. Here are two familiar examples recast in the abstract form
that we just introduced.
Example
What is the maximum utility one can attain given a budget constraint?
max_{x ∈ S} u(x)

where S is the set of purchases one can afford.
2.1.2 Equality Constraints and Level Sets
The first type of constraint we will examine is an equality constraint.
Definition 2.3
An equality constraint is a constraint of the form g(x) = 0. With such a constraint we are solving

max_{x ∈ D} f(x) subject to g(x) = 0    or    max_{x ∈ S} f(x) where S = {x ∈ D : g(x) = 0}
To contrast g with the objective function f, we call it a constraint function.
Remark
We can rewrite any equation in the variables of x to have the form g(x) = 0. For example:

x_2 = 16/x_1  ⟹  16 − x_1 x_2 = 0
An equality constraint is an equation, a condition that a point can satisfy or fail to satisfy. In order
to maximize f among such points, we want to be able to visualize the entire set of points that satisfy
the constraint. Mathematics has a term for this set.
Definition 2.4
Given a function g(x), a level set of g is the set of points that, for some number c, satisfy g(x) = c.
Example
Given the function g(x_1, x_2) = x_1 x_2, here are four different level sets of g:

{(x_1, x_2) : x_1 x_2 = 9}    {(x_1, x_2) : x_1 x_2 = 17.32}    {(x_1, x_2) : x_1 x_2 = 0}    {(x_1, x_2) : x_1 x_2 = −4}
Remark
Commonly, people refer to a level set by its defining equation, rather than using set notation. Thus “the level set x_1^2 + x_2^2 = 16” refers to the set {(x_1, x_2) : x_1^2 + x_2^2 = 16}.
Level sets have several applications in economics you may be familiar with:
Example
If we are choosing to buy quantities x_1 and x_2 of goods priced at p_1 and p_2, then the bundles we can buy by spending our income I exactly form the level set (often called a budget line):

p_1 x_1 + p_2 x_2 = I
Example
Given a utility function u(x_1, x_2), the level set u(x_1, x_2) = c is called an indifference curve. The chooser will be equally happy with any point on the curve.
For a two-variable function, a level set is the intersection of the graph y = g(x_1, x_2) with a horizontal plane y = c. These are often curves, so we call them level curves.
Figure 2.4: The level curves x_1 x_2 = c
Notice that x_1 x_2 = c is not always a curve. When c = 0, the level set is the union of the x_1- and x_2-axes. At the origin, this is two curves (lines) intersecting.
For a three-variable function, the level sets are called level surfaces because for most values of c,
they are 2-dimensional.
Figure 2.5: The level surfaces x_1^2 + x_2^2 + x_3^2 = c
Notice x_1^2 + x_2^2 + x_3^2 = 0 is not a surface; it is a point. x_1^2 + x_2^2 + x_3^2 = −5 is the empty set.
It can be difficult to determine the shape and structure of a level set. A powerful mathematical
theorem can tell us when to expect good behavior.
Corollary 2.5 [Corollary to the Implicit Function Theorem]
Let g(x) be a continuously differentiable function at a. If a lies on the level set g(x) = c and ∇g(a) ≠ 0, then the level set g(x) = c is an (n − 1)-dimensional shape in some neighborhood of a. Specifically, it is the graph of a differentiable function of n − 1 of the variables of R^n.
This means that for smooth functions we can generally expect the level set to be a space of one dimension lower than the domain, except at the occasional points where ∇g(x) = 0. We can see this in our previous examples.
For g(x_1, x_2) = x_1 x_2 the nonempty level sets are curves, except for x_1 x_2 = 0. Even x_1 x_2 = 0 looks like a curve in any neighborhood that does not include (0, 0).
For g(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 the nonempty level sets are surfaces (spheres), except for x_1^2 + x_2^2 + x_3^2 = 0.
We will give an argument for this corollary once we have seen the implicit function theorem in chapter 3.
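A quick numerical sweep (a sketch; the grid and the helper names are our own) confirms that the gradients in the two examples vanish only at the exceptional points:

```python
# Check where the hand-computed gradients of the two example functions
# vanish, on a small integer grid (the grid is an illustrative choice).

def grad_g(x1, x2):          # g(x1, x2) = x1 * x2  =>  grad = (x2, x1)
    return (x2, x1)

def grad_h(x1, x2, x3):      # h = x1^2 + x2^2 + x3^2  =>  grad = (2x1, 2x2, 2x3)
    return (2 * x1, 2 * x2, 2 * x3)

pts = range(-3, 4)
bad_g = [(a, b) for a in pts for b in pts if grad_g(a, b) == (0, 0)]
bad_h = [(a, b, c) for a in pts for b in pts for c in pts
         if grad_h(a, b, c) == (0, 0, 0)]
print(bad_g)  # [(0, 0)] -- only the crossing point of x1*x2 = 0
print(bad_h)  # [(0, 0, 0)] -- only the degenerate "sphere" of radius 0
```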
2.1.3 Finding the Maximizer and Minimizer Graphically
The level set g(x) = 0 is obviously important to a constrained optimization problem. We can also use level sets to understand the shape of the objective function. Level sets have the advantage that they live in R^n, not R^{n+1}, making them easier to draw than the graph y = f(x).
Consider the objective function f(x_1, x_2) = x_1^2 + x_2^2 subject to x_1^2 − x_2 − 4 = 0.
a What are the level curves of f(x_1, x_2)?
b How can we use a diagram of the level curves of f to argue that (1, −3) is not a local maximizer or minimizer?
c Where do the local maximizer(s) and local minimizer(s) of f on x_1^2 − x_2 − 4 = 0 appear to lie?
d Are any of the local maximizers and minimizers also global maximizers or minimizers?
Solution
a The level sets have the form x_1^2 + x_2^2 = c. These are circles centered at the origin, with higher values the farther from the origin we go.
b (1, −3) lies on the level set x_1^2 + x_2^2 = 10. We can see from a sketch that the curve x_1^2 − x_2 − 4 = 0 passes through this level set from higher values to lower values. Thus (1, −3) is not a local maximizer or minimizer.
c (0, −4) appears to be a local maximizer, because x_1^2 − x_2 − 4 = 0 meets the level curve x_1^2 + x_2^2 = 16 and travels back to smaller-valued level sets in both directions. There also appear to be local minimizers in the third and fourth quadrants, where x_1^2 − x_2 − 4 = 0 touches a smallest level set and then travels back to larger-valued level sets.
d There is no global maximizer. As we travel upwards along x_1^2 − x_2 − 4 = 0, we cross larger and larger valued level sets. There is no upper bound to the values of f(x_1, x_2) that we can attain on the constraint. On the other hand, the local minimizers are global minimizers. The only smaller-valued level curves lie inside the circle they touch, and the constraint does not enter that region.
Figure 2.6: Level curves of x_1^2 + x_2^2 and the level curve x_1^2 − x_2 − 4 = 0
Figure 2.7: The graph of y = x_1^2 + x_2^2 over the constraint x_1^2 − x_2 − 4 = 0
When the level set g(x) = 0 crosses a level set f(x) = c at a, this usually means that it intersects
Theorem 2.7
Let f(x) and g(x) be continuously differentiable functions. If a is a local maximizer or minimizer of
f(x) subject to the constraint g(x) = 0, then either
1 there is some number λ such that (a, λ) satisfies the first-order condition of L or
2 ∇g(a) = 0.
We call the points that satisfy these conditions critical points or stationary points of the constrained optimization problem. The value of λ is called a Lagrange multiplier.
Remark
Notice that this theorem is a necessary condition for a maximizer or minimizer, not a sufficient one.
2.1.5 Using the Lagrangian
We expect that the objective function f(x_1, x_2) = x_1^2 + x_2^2 has a minimizer on the constraint x_1^2 − x_2 − 4 = 0 somewhere in the fourth quadrant. Find it.
Solution
Letting g(x_1, x_2) = x_1^2 − x_2 − 4, we have the Lagrangian:

L(x_1, x_2, λ) = x_1^2 + x_2^2 + λ(x_1^2 − x_2 − 4)

∇g(x_1, x_2) = (2x_1, −1). This is never 0, so any minimizer must satisfy the first-order condition of L.
We set the partial derivatives equal to 0 and solve.

∂L/∂x_1 = 0:  2x_1 + 2x_1 λ = 0, so 2x_1(1 + λ) = 0
∂L/∂x_2 = 0:  2x_2 − λ = 0
∂L/∂λ = 0:  x_1^2 − x_2 − 4 = 0

If x_1 = 0, then 0^2 − x_2 − 4 = 0, so x_2 = −4.
If λ = −1, then 2x_2 + 1 = 0, so x_2 = −1/2, and x_1^2 + 1/2 − 4 = 0, so x_1 = ±√(7/2).

The only possible local maximizers and minimizers are (0, −4), (√(7/2), −1/2) and (−√(7/2), −1/2). Since we are looking for a minimizer in the fourth quadrant, it must be (√(7/2), −1/2).
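As a sanity check (a sketch; the multiplier value λ = −8 for (0, −4) is back-solved here, not stated above), we can confirm that all three candidates satisfy the first-order condition:

```python
import math

# Verify that the three candidate points satisfy the first-order
# condition of L(x1, x2, lam) = x1^2 + x2^2 + lam*(x1^2 - x2 - 4).

def foc(x1, x2, lam):
    return (2 * x1 + 2 * x1 * lam,   # dL/dx1
            2 * x2 - lam,            # dL/dx2
            x1 ** 2 - x2 - 4)        # dL/dlam (the constraint)

candidates = [(0.0, -4.0, -8.0),                 # lam = -8 back-solved
              (math.sqrt(7 / 2), -0.5, -1.0),
              (-math.sqrt(7 / 2), -0.5, -1.0)]
for c in candidates:
    print(all(abs(v) < 1e-12 for v in foc(*c)))  # True for each candidate
```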
2.1.6 Proving the First-Order Condition of the Lagrangian
The partial derivatives of the Lagrangian are

∂L/∂x_i = f_i(x) + λ g_i(x)
∂L/∂λ = g(x)

Any solution (a, λ) to the first-order condition on L satisfies
1 f_i(a) = −λ g_i(a) for all i, so ∇f(a) = −λ∇g(a).
2 g(a) = 0, so a lies on the constraint.
The necessity of 2 is easy to explain. If a does not lie on the constraint, then it cannot be a
maximizer or minimizer subject to that constraint.
In order to understand 1, we need to investigate the relationship between gradient vectors and level sets. As with multivariable unconstrained optimization, we use compositions over paths to study maximizers on constraints. We need to alter our approach in three important ways.
Definition 2.8
A vector v is normal to a set at a if v is orthogonal to all tangent vectors of that set at a. To avoid having to mention it as a separate case, we consider 0 to be orthogonal to every vector.
Lemma 2.9
If a is a local maximizer or local minimizer of a continuously differentiable function f(x) subject to
g(x) = 0, then ∇f(a) is normal to the level set g(x) = 0.
This gives us a good characterization of the gradient of the objective function at a local maximizer
or minimizer. We can now visually rule out a potential maximizer, if its gradient does not point in the
right direction. We now need to connect this to our computational tool. The Lagrangian compares ∇f
to ∇g. What can we say about ∇g?
Again, we consider a path x(t) in the level set g(x) = 0. This time we study the composition g(x(t)).
If you think carefully about what we have done, you will notice that this is a silly composition. We can
evaluate this composition at any value of t, and we will always get 0 (make sure you see why).
We can now apply the chain rule to g(x(t)), the zero function. This time the derivative is zero
because we are differentiating a constant function.
dg/dt(t_0) = ∇g(x(t_0)) · x′(t_0)
0 = ∇g(a) · x′(t_0)

We conclude that ∇g(x(t)) is orthogonal to x′(t) for all paths x(t) in the level set g(x) = 0.
Remark
This argument did not rely on a being any special point on the level set. Nor did it matter that the level
set had value 0. Any level set g(x) = c will have this relationship to the gradient, since the composition
will still be a constant function.
The result of this argument is one of the most important geometric facts about gradient vectors.
We will use it over and over again.
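The orthogonality fact can be checked numerically on a concrete level set (a sketch; the particular level set and path are our own choices). The path x(t) = (t, 9/t) stays in the level set x_1 x_2 = 9, and ∇g = (x_2, x_1):

```python
# Check that grad g is orthogonal to path velocities inside a level set.
# Path x(t) = (t, 9/t) stays in x1*x2 = 9; its velocity is (1, -9/t^2).

def dot_along_path(t):
    x1, x2 = t, 9 / t
    grad = (x2, x1)               # gradient of g(x1, x2) = x1*x2
    vel = (1.0, -9 / t ** 2)      # velocity of the path at time t
    return grad[0] * vel[0] + grad[1] * vel[1]

print([dot_along_path(t) for t in (0.5, 1.0, 2.0, 3.0)])  # [0.0, 0.0, 0.0, 0.0]
```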
Example
If x_1 denotes a physical quantity, then it is often appropriate to assume x_1 ≥ 0.
Inequalities like these define the familiar shape of the budget set.
Figure 2.13: The region constrained by I − p_1 x_1 − p_2 x_2 ≥ 0, x_1 ≥ 0 and x_2 ≥ 0
The set of points that satisfies an inequality constraint generally lies on one side of the level set
g(x) = 0. Graphically, it is the subset of the domain over which the graph y = g(x) is above the plane
y = 0. We therefore adopt the following terminology.
Definition 2.13
Given a multivariate function f(x), an upper level set of f is the set of points that, for some number c, satisfy f(x) ≥ c. The points that satisfy f(x) ≤ c form a lower level set.
The first-order condition of the Lagrangian requires that ∇f (a) = −λ∇g(a). ∇g(a) points into the
upper level set g(x) ≥ 0. If λ is negative then ∇f (a) points into this set as well.
Lemma 2.16
Suppose that (a, λ) satisfies the first-order condition of the Lagrangian of f subject to g(x) = 0. If a is a local maximizer of f(x) subject to g(x) ≥ 0, then either
1 λ ≥ 0 or
2 ∇g(a) = 0
Understanding the intuition behind this result is probably more important than knowing a formal
proof. Still, here is a proof. Like the argument we have already given, it is a proof of the contrapositive.
Proof
If ∇g(a) = 0 then the lemma is satisfied. We will consider the case where ∇g(a) ≠ 0 and λ < 0. We will show that a is not a local maximizer.
Consider the line through a in the direction of ∇g(a). Its equation is x(t) = a + t∇g(a) and x′(0) = ∇g(a). By Lemma 2.14, since ∇g(a) · x′(0) = ∇g(a) · ∇g(a) > 0, this line travels into the upper level set g(x) ≥ 0 at t = 0.
What happens to values of f along this line? Consider the composition f(a + t∇g(a)). Its derivative at t = 0 is

df/dt(0) = ∇f(a) · ∇g(a) = (−λ∇g(a)) · ∇g(a) = −λ(∇g(a) · ∇g(a)) > 0

By Lemma 1.2, there is a neighborhood of 0 where for t > 0, f(x(t)) > f(x(0)) = f(a). We have seen that these points lie in the upper level set, so a cannot be a maximizer of f(x) subject to g(x) ≥ 0. Since a cannot be a maximizer when λ < 0, we conclude that if a is a maximizer, then λ ≥ 0.
Remark
Showing that a is not a maximizer required us to compare it to points in the upper level set g(x) ≥ 0. This reasoning does not work when maximizing subject to an equality constraint. The line we used does not lie in g(x) = 0. We can thus put no restriction on λ when maximizing subject to g(x) = 0. λ can be positive, negative, or zero at a maximizer.
2.2.5 Conditions for a Maximizer Subject to an Inequality Constraint
With complementary slackness in hand, we can write a necessary condition for a maximizer on an
inequality constraint. This theorem is written using notation 2 , but we could rewrite it with one of the
other notations.
Theorem 2.18
Let f(x) and g(x) be continuously differentiable functions. If a is a local maximizer of f(x) subject to the constraint g(x) ≥ 0, then one of the following must be true
1 there is some number λ such that (a, λ) satisfies
∂L/∂x_i(a, λ) = 0 for all i
g(a) ≥ 0, λ ≥ 0, and λg(a) = 0.
2 g(a) = 0 and ∇g(a) = 0.
2.2.6 Optimization Subject to an Inequality Constraint
Find the maximizer of f(x_1, x_2) = 10 − x_1^2 − x_2^2 subject to 3x_1 + 4x_2 ≥ 25.
Solution
First we write our constraint in the correct form: g(x_1, x_2) = 3x_1 + 4x_2 − 25 ≥ 0. Note that ∇g(x_1, x_2) is never 0, so any critical point must satisfy our complementary slackness conditions. The Lagrangian is

L(x_1, x_2, λ) = 10 − x_1^2 − x_2^2 + λ(3x_1 + 4x_2 − 25)

The conditions are

∂L/∂x_1 = 0    ∂L/∂x_2 = 0
λ ≥ 0    3x_1 + 4x_2 − 25 ≥ 0    λ(3x_1 + 4x_2 − 25) = 0
2.3 The Kuhn-Tucker Conditions
Goals:
1 Solve for maximizers on multiple constraints using the Kuhn-Tucker conditions
2 Recognize what each case of the Kuhn-Tucker conditions is checking
2.3.1 Multiple Equality Constraints
In economics, it is easy to imagine a firm is constrained by more than one equation:
A budget for capital and labor
Availability of labor or other inputs
A government regulation
You can’t purchase or produce a negative amount
What if we want to maximize an n-variable function f(x) subject to more than one equality constraint?

g_1(x) = g_2(x) = ··· = g_m(x) = 0

The feasible set is the intersection of the level sets g_j(x) = 0. We have seen that a single level set is usually one dimension less than the space it lies in. The intersection of m level sets will usually be an (n − m)-dimensional shape.
Notation
The g_j are different functions, not partial derivatives of g. Whenever we have multiple g's around, we will need to write our partial derivatives using ∂ notation, for instance

∂/∂x_2 g_3(x)    not    g_{32}(x)
Figure 2.19: Two level sets intersecting at a pair of (one-dimensional) curves
Figure 2.20: Three level sets intersecting at a pair of (zero-dimensional) points
Like with a single constraint, there are exceptions. Also like a single constraint, these exceptions rely on the inadequacy of the gradient vector(s). The following definition comes from linear algebra. A corollary of the multivariable implicit function theorem characterizes the intersection of the level sets.
Definition 2.19
Vectors v_1, v_2, . . . , v_m are linearly dependent if there are constants k_j, not all 0, such that

∑_{j=1}^{m} k_j v_j = 0

If no such k_j exist, then they are linearly independent.
Equivalently, vectors are linearly dependent if one of them is a linear combination of the others.
Corollary 2.20
Let g_1(x), g_2(x), . . . , g_m(x) be continuously differentiable functions at a. If a lies on the level sets g_j(x) = c_j and the ∇g_j(a) are linearly independent, then the intersection of the level sets g_j(x) = c_j is an (n − m)-dimensional shape in some neighborhood of a. Specifically, it is the graph of m differentiable functions on R^{n−m}.
Remark
Any set containing the zero vector is linearly dependent. Thus this corollary agrees with our criterion for a single constraint: if the vector ∇g(a) = 0, then it is linearly dependent.
Two vectors are linearly dependent if they are parallel. With parallel gradients, the intersection of
m level sets can have fewer than (n − m) dimensions.
Figure 2.21: Two level sets with parallel gradients intersecting at a point
Parallel gradients can also occur when the level sets overlap; in this case the intersection can have more than (n − m) dimensions.
Figure 2.22: Two (overlapping) level sets whose intersection is a plane
Three vectors are linearly dependent if they are coplanar. Coplanar gradient vectors can also lead to
intersections of unexpected dimension.
Figure 2.23: Three level sets with coplanar gradients intersecting on a line
Main Idea
Each additional equality constraint reduces the dimension of the feasible set by 1, unless its gradient
vector is a linear combination of the existing gradient vectors.
2.3.2 The Lagrangian for Multiple Equality Constraints
For multiple constraints, we write a Lagrangian that uses all of the constraints. We need a different
variable λ for each one. The first-order condition of this Lagrangian is a necessary condition for a local
maximizer or local minimizer.
Definition 2.21
Given the problem of maximizing an n-variable objective function f(x) subject to constraints g_j(x) = 0 for 1 ≤ j ≤ m, the Lagrangian is a function of the n components of x and the m variables λ = (λ_1, . . . , λ_m):

L(x, λ) = f(x) + ∑_{j=1}^{m} λ_j g_j(x)
Theorem 2.22
If a is a local maximizer or local minimizer of f(x) subject to constraints g_j(x) = 0, all continuously differentiable at a, then either
1 (a, λ) satisfies the first-order condition of L for some λ, or
2 the ∇g_j(a) are linearly dependent.
Notation
When developing the theory, we write our constraint functions and our Lagrange multipliers using a common letter and an index variable:

g_1(x), g_2(x), . . .    λ_1, λ_2, . . .

This makes it easier to generalize to any number of constraints, because we can use Σ notation in the Lagrangian. For a small number of constraints, using an index is inconvenient. Economists often use different letters for each constraint and multiplier:

g(x), h(x), . . .    λ, µ, . . .
2.3.3 A Multi-Constraint Optimization
Find the maximizer of f(x_1, x_2, x_3) = 3x_2 on the constraints x_1^2 + x_3^2 − 50 = 0 and x_1 + x_2 + x_3 = 0.
Solution
The Lagrangian is

L(x_1, x_2, x_3, λ_1, λ_2) = 3x_2 + λ_1(x_1^2 + x_3^2 − 50) + λ_2(x_1 + x_2 + x_3)
The first-order condition is
∂L/∂x_1 = 0:  2λ_1 x_1 + λ_2 = 0
∂L/∂x_2 = 0:  3 + λ_2 = 0
∂L/∂x_3 = 0:  2λ_1 x_3 + λ_2 = 0
∂L/∂λ_1 = 0:  x_1^2 + x_3^2 − 50 = 0
∂L/∂λ_2 = 0:  x_1 + x_2 + x_3 = 0

From the second equation, λ_2 = −3. Then x_1 = 3/(2λ_1) and x_3 = 3/(2λ_1), so x_1 = x_3. Substituting into the first constraint, x_1^2 + x_1^2 − 50 = 0, so x_1 = ±5 and x_3 = ±5. Then ±5 + x_2 ± 5 = 0, so x_2 = ∓10.

The only candidates are (±5, ∓10, ±5). The value f(x_1, x_2, x_3) = 3x_2 is larger when x_2 = 10. We conclude that no point besides (−5, 10, −5) can be the maximizer of f on these constraints. Because this is not a sufficient condition, we cannot be sure that this is the maximizer. It may be that no maximizer exists at all.
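We can verify numerically (a sketch; the multiplier value λ_1 = −3/10 is back-solved here, not stated above) that the surviving candidate satisfies every first-order equation:

```python
# Check that (-5, 10, -5) satisfies the full first-order condition of
# L = 3*x2 + lam1*(x1^2 + x3^2 - 50) + lam2*(x1 + x2 + x3).

x1, x2, x3 = -5, 10, -5
lam2 = -3.0                   # from 3 + lam2 = 0
lam1 = -lam2 / (2 * x1)       # back-solved from 2*lam1*x1 + lam2 = 0
conditions = [
    2 * lam1 * x1 + lam2,     # dL/dx1
    3 + lam2,                 # dL/dx2
    2 * lam1 * x3 + lam2,     # dL/dx3
    x1 ** 2 + x3 ** 2 - 50,   # first constraint
    x1 + x2 + x3,             # second constraint
]
print(all(abs(v) < 1e-9 for v in conditions))  # True
```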
We can visualize the maximizer of f(x_1, x_2, x_3) = 3x_2 by looking for the point in the level sets that is farthest in the x_2 direction. It appears that such a point exists at (−5, 10, −5).
Figure 2.24: The gradients of x_1^2 + x_3^2 − 50, x_1 + x_2 + x_3 = 0, and f(x) = 3x_2 at the maximizer
2.3.4 The Reasoning for the Multi-Constraint Lagrangian
The partial derivatives of the Lagrangian are

L_{x_i} = ∂f/∂x_i (x) + ∑_{j=1}^{m} λ_j ∂g_j/∂x_i (x)
L_{λ_j} = g_j(x)

Any solution (a, λ) to the first-order condition on L satisfies

f_{x_i}(a) = −∑_{j=1}^{m} λ_j ∂g_j/∂x_i (a) for all i, so ∇f(a) = −∑ λ_j ∇g_j(a).
g_j(a) = 0 for all j, so a lies on all the constraints.
We can examine the relationship between the gradients using the same procedure we used for a single constraint. We take a path x(t) through a at t_0. We assume that x(t) lies in the feasible set. This means it lies in all of the level sets g_j(x) = 0.
If a is a local maximizer or minimizer of f(x) on the feasible set, then it must be a local maximizer or minimizer of the composition f(x(t)) for all paths x(t) in the feasible set. This means the t-value corresponding to a (we will call it t_0) must satisfy the first-order condition of f(x(t)). The derivative calculation should be familiar:

0 = df/dt(t_0)
0 = ∇f(x(t_0)) · x′(t_0)
0 = ∇f(a) · x′(t_0)
Thus ∇f(a) is normal to the feasible set.
When we had a single constraint, we argued that there were two opposite normal directions to the feasible set, and ∇g(a) pointed in one of them. This relied on the fact that our feasible set was a level set and had dimension (n − 1). If we assume that the ∇g_j(a) are linearly independent, then the feasible set has dimension (n − m) at a.
There is an entire m-dimensional space of vectors normal to the feasible set at a. If a is a maximizer, ∇f(a) must be a vector in this space.
Figure 2.26: Two gradients spanning the normal plane of the feasible set
2.3.5 An Optimization with Dependent Gradients
When the gradients of the g_j(x) are linearly dependent, the feasible set may not have the expected dimension of (n − m). Even if it does, the gradient vectors need not span the normal space. In this case, calculus arguments cannot rule out a point from being a maximizer. Here is an example where insisting upon the first-order condition of the Lagrangian would incorrectly rule out a maximizer.
Consider the smooth function f(x_1, x_2) = x_1^2 + x_2^2 and the constraints x_2 + 2 = 0 and x_2 − (x_1 − 3)^2 + 2 = 0. If we attempt to apply the first-order condition we obtain the following.
The constraints intersect only at (3, −2).
At (3, −2) the gradients are ∇g_1(3, −2) = (0, 1), ∇g_2(3, −2) = (0, 1), and ∇f(3, −2) = (6, −4).
∇f cannot be written as a linear combination of ∇g_1 and ∇g_2.
Even though it doesn't satisfy the first-order condition, (3, −2) must be the maximizer (and minimizer), since it is the only point that satisfies both constraints.
Figure 2.27: (3, −2) is a maximizer, but ∇f cannot be written λ_1∇g_1 + λ_2∇g_2.
If we count dimensions, the feasible set is a 0-dimensional subspace of R^2. Its normal space is 2 − 0 = 2-dimensional, which suggests that ∇f(3, −2) can be any vector in R^2. This sounds bizarre but actually makes sense. (3, −2) is the entire feasible set. It must be a maximizer, no matter what ∇f(3, −2) is. The exception for dependent gradients in Theorem 2.22 exists to avoid ruling out a maximizer in a situation like this.
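A two-line check of the gradient computations at (3, −2) (the helper functions are our own):

```python
# Hand-computed gradients of the example's objective and constraints.

def grad_f(x1, x2):   # f = x1^2 + x2^2
    return (2 * x1, 2 * x2)

def grad_g1(x1, x2):  # g1 = x2 + 2
    return (0, 1)

def grad_g2(x1, x2):  # g2 = x2 - (x1 - 3)^2 + 2
    return (-2 * (x1 - 3), 1)

print(grad_g1(3, -2), grad_g2(3, -2))  # (0, 1) (0, 1): linearly dependent
print(grad_f(3, -2))   # (6, -4): no combination of (0, 1) has first entry 6
```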
2.3.6 The Kuhn-Tucker Conditions
The Kuhn-Tucker conditions are a robust set of necessary conditions for constrained optimizations
with inequality constraints, potentially in addition to some number of equality constraints. They combine
all of the ideas we have developed in this chapter. We will use the same Lagrangian that we would for
equality constraints.
A Lagrangian For Multiple Equality and Inequality Constraints
Given an objective function f(x) and constraints of the forms g_j(x) ≥ 0 or g_j(x) = 0, the Lagrangian is

L(x, λ) = f(x) + ∑_{j=1}^{m} λ_j g_j(x).

Like with a single inequality constraint, complementary slackness will remove the λ_j g_j(x) when g_j(x) does not bind.
Theorem 2.23 [The Kuhn-Tucker Conditions]
Given the objective function f(x) and constraints of the forms g_j(x) ≥ 0 or g_j(x) = 0, at any local maximizer a one of the following must be true
1 There is some vector λ such that (a, λ) satisfies the Kuhn-Tucker conditions:
For each variable x_i,  L_{x_i}(a, λ) = 0
For each equality constraint function g_j,  g_j(a) = 0
For each inequality constraint function g_j,  g_j(a) ≥ 0 and λ_j ≥ 0 and λ_j g_j(a) = 0
2 The binding ∇g_j(a) are linearly dependent
2.3.7 Solving the Kuhn-Tucker Conditions
According to Kuhn-Tucker, what points (x_1, x_2) could be maximizers of f(x_1, x_2) = x_1 x_2^2 given the constraints

12 − x_1 − 4x_2 ≥ 0    9 − x_1 − x_2 ≥ 0
Solution
The Lagrangian is L(x_1, x_2, λ_1, λ_2) = x_1 x_2^2 + λ_1(12 − x_1 − 4x_2) + λ_2(9 − x_1 − x_2). The Kuhn-Tucker conditions are

∂L/∂x_1 = x_2^2 − λ_1 − λ_2 = 0
∂L/∂x_2 = 2x_1 x_2 − 4λ_1 − λ_2 = 0
12 − x_1 − 4x_2 ≥ 0    9 − x_1 − x_2 ≥ 0
λ_1 ≥ 0    λ_2 ≥ 0
λ_1(12 − x_1 − 4x_2) = 0    λ_2(9 − x_1 − x_2) = 0

The final equations carry our complementary slackness conditions. There are two ways to choose which factor is 0 for each, giving us 2 × 2 = 4 cases to check.
1 λ_1 = 0, λ_2 = 0

x_2^2 − λ_1 − λ_2 = 0 gives x_2^2 = 0, so x_2 = 0, and 2x_1 x_2 − 4λ_1 − λ_2 = 0 gives 2x_1 x_2 = 0, which then holds automatically.
Check: 12 − x_1 − 4x_2 ≥ 0 gives 12 − x_1 ≥ 0, so x_1 ≤ 12; and 9 − x_1 − x_2 ≥ 0 gives 9 − x_1 ≥ 0, so x_1 ≤ 9.
So (k, 0, 0, 0) satisfies the Kuhn-Tucker conditions for k ≤ 9.
2 λ_1 = 0, 9 − x_1 − x_2 = 0

The stationarity conditions become x_2^2 − λ_2 = 0 and 2x_1 x_2 − λ_2 = 0, so x_2^2 = 2x_1 x_2, giving x_2(x_2 − 2x_1) = 0.
x_2 = 0 was covered in case 1. Otherwise x_2 = 2x_1, and 9 − x_1 − 2x_1 = 0, so x_1 = 3 and x_2 = 6. Then 6^2 − λ_2 = 0, so λ_2 = 36.
Check: 12 − x_1 − 4x_2 ≥ 0 requires 12 − 3 − 4(6) = −15 ≥ 0, which fails.
(3, 6, 0, 36) does not satisfy the inequalities.
3 12 − x_1 − 4x_2 = 0, λ_2 = 0

The stationarity conditions become x_2^2 − λ_1 = 0 and 2x_1 x_2 − 4λ_1 = 0. Then 4x_2^2 = 4λ_1 and 2x_1 x_2 = 4λ_1, so 4x_2^2 = 2x_1 x_2, giving 2x_2(2x_2 − x_1) = 0.
x_2 = 0 was covered in case 1. Otherwise x_1 = 2x_2, and 12 − 2x_2 − 4x_2 = 0, so x_2 = 2 and x_1 = 4. Then 2^2 − λ_1 = 0, so λ_1 = 4.
Check: 9 − x_1 − x_2 ≥ 0 requires 9 − 4 − 2 = 3 ≥ 0, and λ_1 = 4 ≥ 0. Both hold.
(4, 2, 4, 0) satisfies the Kuhn-Tucker conditions.
4 12 − x_1 − 4x_2 = 0, 9 − x_1 − x_2 = 0

Subtracting the constraints, (12 − x_1 − 4x_2) − (9 − x_1 − x_2) = 0 gives 3 − 3x_2 = 0, so x_2 = 1, and then 9 − x_1 − 1 = 0, so x_1 = 8.
The stationarity conditions become 1 − λ_1 − λ_2 = 0 and 16 − 4λ_1 − λ_2 = 0. Subtracting, −15 + 3λ_1 = 0, so λ_1 = 5, and 16 − 4(5) − λ_2 = 0, so λ_2 = −4.
Check: λ_1 = 5 ≥ 0 holds, but λ_2 = −4 ≥ 0 fails.
(8, 1, 5, −4) does not satisfy the Kuhn-Tucker conditions.
The Kuhn-Tucker conditions are necessary. No point except (4, 2) or (k, 0) for k ≤ 9 can be a maximizer.
By evaluating, we see f(4, 2) = 16 and f(k, 0) = 0. So only (4, 2) could be a maximizer. We cannot conclude that it is a maximizer until we learn a relevant sufficient condition.
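The case analysis above can be re-checked mechanically (a sketch; the helper name is ours) by testing each candidate against every Kuhn-Tucker condition:

```python
# Re-check candidates (x1, x2, lam1, lam2) against all Kuhn-Tucker
# conditions for f = x1*x2^2 with constraints 12 - x1 - 4*x2 >= 0
# and 9 - x1 - x2 >= 0.

def kt_satisfied(x1, x2, l1, l2, tol=1e-9):
    return (abs(x2 ** 2 - l1 - l2) < tol              # dL/dx1 = 0
            and abs(2 * x1 * x2 - 4 * l1 - l2) < tol  # dL/dx2 = 0
            and 12 - x1 - 4 * x2 >= -tol              # feasibility
            and 9 - x1 - x2 >= -tol
            and l1 >= -tol and l2 >= -tol             # sign conditions
            and abs(l1 * (12 - x1 - 4 * x2)) < tol    # complementary
            and abs(l2 * (9 - x1 - x2)) < tol)        # slackness

print(kt_satisfied(4, 2, 4, 0))                              # True
print(all(kt_satisfied(k, 0, 0, 0) for k in range(-3, 10)))  # True
print(kt_satisfied(3, 6, 0, 36))   # False: 12 - 3 - 24 < 0
print(kt_satisfied(8, 1, 5, -4))   # False: lam2 < 0
```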
Strategy
In general, our method for solving the Kuhn-Tucker conditions is:
Pick an equality from a factor of each complementary slackness condition.
Solve the resulting system of equations.
Discard any solutions that violate the remaining inequalities.
Repeat for a different choice of equalities.
With m inequality constraints, we will need to repeat for all 2^m combinations of equalities.
2.3.8 Visualizing the Kuhn-Tucker Conditions
Each choice of equalities mutates the Kuhn-Tucker conditions into the necessary conditions from Theorem 2.22, where the binding g_j(x) ≥ 0 are treated as equality constraints. The following diagram shows how each choice corresponds to a different piece of the feasible set, where different constraints bind.
The four cases are:
g_1(x) ≥ 0 and λ_1 = 0; g_2(x) ≥ 0 and λ_2 = 0:  L(x, λ) = f(x), so ∇f(x) = 0
g_1(x) = 0 and λ_1 ≥ 0; g_2(x) ≥ 0 and λ_2 = 0:  L(x, λ) = f(x) + λ_1 g_1(x), so ∇f(x) = −λ_1∇g_1(x)
g_1(x) ≥ 0 and λ_1 = 0; g_2(x) = 0 and λ_2 ≥ 0:  L(x, λ) = f(x) + λ_2 g_2(x), so ∇f(x) = −λ_2∇g_2(x)
g_1(x) = 0 and λ_1 ≥ 0; g_2(x) = 0 and λ_2 ≥ 0:  L(x, λ) = f(x) + λ_1 g_1(x) + λ_2 g_2(x), so ∇f(x) = −λ_1∇g_1(x) − λ_2∇g_2(x)
Figure 2.28: The four regions covered by two pairs of inequalities with complementary slackness
In the case of multiple binding inequality constraints, the condition that λ_j ≥ 0 forces ∇f(a) to lie in the cone made by the −∇g_j(x).
Figure 2.29: The cone where ∇f must lie if both constraints bind at a maximizer
2.3.9 Proving the Necessity of the Kuhn-Tucker Conditions
To justify the necessity of the Kuhn-Tucker conditions, we can list each condition and describe why
it must apply.
1 A given a will satisfy some binding equations g_j(x) = 0.
2 For the constraints that do not bind, setting λ_j = 0 turns L(x, λ) into the Lagrangian for f with only the binding equality constraints. If a is a local maximizer, it must satisfy Theorem 2.22, specifically ∂L/∂x_i(a) = 0.
3 For each inequality constraint, a maximizer must satisfy g_j(a) ≥ 0 to be feasible.
4 Finally, if every path from a into the feasible region decreases f, then all λ_j ≥ 0.
We have not given a convincing argument for the last statement yet. A formal proof is just below, but it may be more illuminating to convince yourself graphically. Try drawing a few ∇f that lie outside the cone between −∇g_1 and −∇g_2 in the previous figure. For each one, you should be able to identify a vector x′(t) that points into the feasible region but makes an acute angle with ∇f.
Here is a formal proof that λ_j ≥ 0 at a maximizer.
Proof
Pick any inequality constraint g_k(x) ≥ 0. If g_k(x) ≥ 0 is not binding at a, then λ_k = 0 and we are done.
We will therefore consider the case where g_k(a) ≥ 0 is binding and show λ_k ≥ 0. Let S be the intersection of all the binding constraints except g_k(x) = 0. Since the gradients of the binding constraints are linearly independent, we can conclude:

- The gradients of the binding constraints other than ∇g_k(a) span the normal space of S at a.
- ∇g_k(a) is not a linear combination of these gradients, so it does not lie in the normal space of S at a.
There must be a path x(t) in S through a such that x′(t_0) is not orthogonal to ∇g_k(a). We can pick such a path so that ∇g_k(a) · x′(t_0) > 0. If the first path we try produces a negative dot product, just traverse x(t) backwards instead. First we show that, in some neighborhood of t_0, x(t) lies in the feasible region for all t > t_0. We check that it satisfies each constraint.
- a lies in the interior of each upper level set g_j(x) ≥ 0 for each nonbinding g_j. x(t) will travel for some distance before leaving the upper level set.
- Since x(t) was chosen to lie in S, it lies in the level set g_j(x) = 0 for all binding g_j except g_k.
- Since ∇g_k(a) · x′(t_0) > 0, x(t) must travel into the upper level set g_k(x) ≥ 0.
Thus in some neighborhood of t_0, x(t) lies in the feasible set for t > t_0. Since a is a local maximizer, f(x(t_0)) ≥ f(x(t)) for t > t_0 in this neighborhood. Thus df/dt (t_0) ≤ 0. We use the chain rule to examine this inequality.

df/dt (t_0) ≤ 0

∇f(a) · x′(t_0) ≤ 0

−∑_{j=1}^{m} λ_j ∇g_j(a) · x′(t_0) ≤ 0
We pause to examine the terms of this summation. Most are 0. Here is the reasoning.

- For nonbinding g_j(x), λ_j = 0.
- For binding g_j(x) except j = k, we have ∇g_j(a) · x′(t_0) = 0, since ∇g_j(a) is a normal vector of S and x(t) lies in S.
- Finally, ∇g_k(a) · x′(t_0) > 0 by our choice of x(t).
We apply these to our inequality.

−∑_{j=1}^{m} λ_j ∇g_j(a) · x′(t_0) ≤ 0

−λ_k ∇g_k(a) · x′(t_0) ≤ 0

λ_k ≥ 0
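The sign result can be checked on a small worked instance. The sketch below is my own illustration, not from the text: it maximizes f(x_1, x_2) = x_1 + x_2 subject to g_1 = 1 − x_1 ≥ 0 and g_2 = 1 − x_2 ≥ 0, and enumerates all 2^m binding patterns. Stationarity of the Lagrangian forces λ_1 = λ_2 = 1, so only the both-binding pattern is consistent, and the surviving multipliers are nonnegative as the proof guarantees.

```python
from itertools import product

# Toy check of the Kuhn-Tucker machinery (my own example, not the text's):
# maximize f(x1, x2) = x1 + x2 subject to g1 = 1 - x1 >= 0, g2 = 1 - x2 >= 0.
# Stationarity of L = f + λ1 g1 + λ2 g2 reads:  1 - λ1 = 0  and  1 - λ2 = 0.

consistent = []
for bind1, bind2 in product([True, False], repeat=2):  # all 2^m binding patterns
    lam1 = 1.0 if bind1 else 0.0  # a nonbinding constraint forces λ = 0;
    lam2 = 1.0 if bind2 else 0.0  # a binding one may take the stationarity value 1
    if 1 - lam1 != 0 or 1 - lam2 != 0:
        continue                  # stationarity fails: discard this pattern
    x = (1.0, 1.0)                # both constraints bind, so x1 = x2 = 1
    consistent.append((x, lam1, lam2))

# Only the both-binding pattern survives, and its multipliers are >= 0,
# matching the sign result just proved.
print(consistent)  # [((1.0, 1.0), 1.0, 1.0)]
```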
2.3.10
Kuhn-Tucker with Non-Negativity Constraints
If we demand that x_i ≥ 0 for each i, then we have added n new constraints to our Lagrangian. We will use µ_i for the Lagrange multipliers of the x_i.

L(x, λ, µ) = f(x) + ∑_{j=1}^{m} λ_j g_j(x) + ∑_{i=1}^{n} µ_i x_i
This is unwieldy. If we are clever, we can do better.
For 1 ≤ k ≤ n, the inequality conditions for µ_k are

∂L/∂µ_k = x_k ≥ 0 and µ_k ≥ 0
If the constraint x_k ≥ 0 is not binding, then µ_k = 0. The µ_k x_k term is 0 in L and its partial derivatives. We can remove it, but we will still need to verify that x_k ≥ 0 is satisfied.
If the constraint is binding, then we can still remove the µ_k x_k term from L.

- The term goes to 0 anyway in L_{x_i} for i ≠ k.
- In L_{x_k}, there is a single +µ_k term. L_{x_k} is supposed to be 0, whereas µ_k ≥ 0. We can replicate this effect by removing µ_k x_k from L, but requiring that the remaining terms of L_{x_k} have a sum less than or equal to 0.
For each k, we can remove the variable µ_k and the µ_k x_k term from our Lagrangian in exchange for some new conditions.
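The trade just described can be checked numerically. In this sketch (my own example, not the text's), f(x) = 10 − (x + 3)^2 is maximized subject to x ≥ 0 and no other constraints; after dropping µ and its term, the conditions reduce to f′(x) ≤ 0, x ≥ 0, and x · f′(x) = 0, which only the corner solution x* = 0 satisfies.

```python
# Maximize f(x) = 10 - (x + 3)**2 subject to x >= 0 (a toy example of mine).
# With the µx term removed, the remaining conditions are:
#   f'(x) <= 0,   x >= 0,   x * f'(x) = 0.

def fprime(x):
    return -2 * (x + 3)

candidates = []
for x in [0.0, 1.0, -3.0]:   # the corner, an interior point, the unconstrained max
    ok = fprime(x) <= 0 and x >= 0 and x * fprime(x) == 0
    if ok:
        candidates.append(x)

# Only the corner x = 0 passes: f'(0) = -6 <= 0, and the product condition
# holds because x itself is 0.  The unconstrained maximizer -3 is infeasible.
print(candidates)  # [0.0]
```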
Example

From the point of view of a producer, the choice variables might be

- q: the quantity of a product they produce
- p: the price they sell it for

The exogenous parameters may include

- w, r: the price of labor or capital
- t, s: a tax or subsidy imposed by the government
- some aspect of the demand function
Remark
The choice variables are the variables whose value is chosen by the agent who wants to maximize the function.

In economics, we assume that choosers are rational and well-informed. We assume they will learn the value of the parameters. After that, they will pick the values of the choice variables that maximize their objective function. Comparative statics asks how the outcome changes as the value of a parameter changes. We will present the tools to compute two types of comparative statics:
1 How does the optimal value of a choice variable change as a parameter changes?
2 How does the value of the objective function change as a parameter changes?
Notation

Given an objective function f(x, α), with choice variable x and parameter α, we expect that different values of α will lead the chooser to pick different values of x. The chooser's optimal choice is a function of α that we write x*(α).

For each α, x*(α) must satisfy the first-order condition:

f_x(x*(α), α) = 0

Without an expression for x*(α), trying to understand its rate of change raises a sequence of questions.

1 What is dx*(α)/dα ?
3.1.4
The Implicit Function Theorem
The implicit function theorem tells us when a point on the graph of an implicit equation has a
neighborhood that is identical to the graph of an explicit function. The basic version takes an implicit
equation in two variables and writes a function that expresses one (the dependent variable) in terms of
the other (the independent variable).
Theorem 3.3 [The Implicit Function Theorem]

Suppose F(x, y) is a continuously differentiable function and (a, b) is a point on F(x, y) = c. If F_y(a, b) ≠ 0, then there exists a differentiable function f(x) such that y = f(x) and F(x, y) = c describe the same graph in some neighborhood of (a, b).
The theorem does not tell us what the function f(x) is, only that it exists. Even so, we can express
its derivatives in terms of F .
Corollary 3.4

The derivative of f with respect to x at a is given by

f′(a) = − F_x(a, b) / F_y(a, b)
The derivation of this formula is famous and not too difficult. To compute the derivative of f, we
parameterize a path in the graph y = f(x). Unlike our previous parametrizations, we will use x as the
parameter. Differentiating x with respect to x is most palatable with Leibniz notation.
x = x    dx/dx = 1
y = f(x)    dy/dx = f′(x)
The points (x, f(x)) lie in F(x, y) = c near (a, b). Thus the composition F(x, f(x)) is the constant function c. Differentiating F(x, f(x)) = c with respect to the parameter x produces an equation that contains f′(x). We solve this equation to obtain an expression for f′(x).
F(x, f(x)) = c    (y = f(x) lies in F(x, y) = c)

dF(x, f(x))/dx = 0    (derivative of a constant is 0)

F_x(x, f(x)) dx/dx + F_y(x, f(x)) df(x)/dx = 0    (chain rule)

F_x(x, f(x))(1) + F_y(x, f(x)) f′(x) = 0    (evaluate derivative of x)

f′(x) = − F_x(x, f(x)) / F_y(x, f(x))    (solve for f′(x))

f′(a) = − F_x(a, b) / F_y(a, b)    (evaluate at x = a)
The implicit function theorem guarantees that f exists and is differentiable in a neighborhood of a. Since we don't know how big this neighborhood is, x = a is the only point at which we can be sure f′(x) exists.

We often apply this formula before checking whether the implicit function theorem applies. Assuming F is continuously differentiable, the theorem will only fail when F_y(a, b) = 0. Conveniently, this formula will be undefined in that case.
We can also determine the derivative of f geometrically. Near (a, b), the graph y = f(x) is the level set F(x, y) = c. We know that ∇F(a, b) = (F_x(a, b), F_y(a, b)) is normal to the level set.

The gradient has a slope of F_y(a, b)/F_x(a, b). The tangent line, which is perpendicular, has the negative reciprocal slope: −F_x(a, b)/F_y(a, b). The slope of the tangent line is also the derivative f′(a).
Figure 3.3: The gradient of F and the tangent line whose slope is f′(a)
3.1.5
Applying the Implicit Function Theorem
If x and y satisfy x^3 − 2xy^2 + 3y = −19, show that y can be written as a function f(x) of x near (2, 3) and compute f′(2).
Solution
We may first want to check that (2, 3) satisfies the implicit equation.

(2)^3 − 2(2)(3)^2 + 3(3) = −19
In order to apply the implicit function theorem, we need to know that F(x, y) = x^3 − 2xy^2 + 3y has continuous partial derivatives. It is a polynomial, so it does. Finally, we need to show F_y(2, 3) ≠ 0.

F_y(x, y) = −4xy + 3
F_y(2, 3) = −4(2)(3) + 3
F_y(2, 3) = −21 ≠ 0
Therefore, by the implicit function theorem, there is a function f(x) such that y = f(x) and x^3 − 2xy^2 + 3y = −19 describe the same graph near (2, 3). The derivative of f at x = 2 is given by Corollary 3.4.

f′(2) = − F_x(2, 3) / F_y(2, 3) = − (3(2)^2 − 2(3)^2) / (−4(2)(3) + 3) = − (−6)/(−21) = −2/7
Figure 3.4: The graph of x^3 − 2xy^2 + 3y = −19 and its tangent line at (2, 3)
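As a numeric sanity check, here is a short sketch (my own, not part of the text). It recomputes f′(2) two ways: from the formula in Corollary 3.4 with exact fractions, and by re-solving the implicit equation for y near x = 2 with Newton's method and taking a finite-difference slope. The level value c is computed as F(2, 3) rather than hard-coded.

```python
from fractions import Fraction

# F(x, y) = x^3 - 2xy^2 + 3y, with level value c = F(2, 3)
def F(x, y):  return x**3 - 2*x*y**2 + 3*y
def Fx(x, y): return 3*x**2 - 2*y**2
def Fy(x, y): return -4*x*y + 3

c = F(2, 3)
slope = Fraction(-Fx(2, 3), Fy(2, 3))   # -(-6)/(-21) = -2/7
assert slope == Fraction(-2, 7)

def solve_y(x, y0=3.0, steps=50):
    """Newton's method on y ↦ F(x, y) - c, starting near y = 3."""
    y = y0
    for _ in range(steps):
        y -= (F(x, y) - c) / Fy(x, y)
    return y

h = 1e-6
numeric = (solve_y(2 + h) - solve_y(2 - h)) / (2 * h)
assert abs(numeric - float(slope)) < 1e-5
print(float(slope))  # -0.2857142857142857
```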
3.1.6
The Derivative of the Optimal Choice
In comparative statics, we are interested in the function x*(α), which is a solution to the equation f_x(x, α) = 0. We apply the implicit function theorem where:

- α takes the role of the independent variable "x".
- x takes the role of the dependent variable "y".
- f_x(x, α) takes the role of the two-variable function "F".
- (b, a) is a point on the graph f_x(x, α) = 0.
- The derivatives of F are second derivatives of f.

The implicit function theorem requires that the derivative with respect to the dependent variable is nonzero: f_xx(b, a) ≠ 0. It concludes there is a differentiable function x*(α) such that x = x*(α) matches the graph of f_x(x, α) = 0 in a neighborhood of (b, a).
Corollary 3.5

Given a function f(x, α), suppose that

1 a is a value of α and b = x*(a)
2 x*(α) satisfies f_x(x*(α), α) = 0 near (b, a)
3 f(x, α) has continuous second derivatives near (b, a)
4 f_xx(b, a) ≠ 0

Then

dx*(a)/dα = − f_xα(b, a) / f_xx(b, a)
This computes the derivative at a point. If, in some interval of α values, every point (x*(a), a) satisfies these conditions, then we can extend this to a derivative function for x*(α).

dx*(α)/dα = − f_xα(x*(α), α) / f_xx(x*(α), α)
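A quick concrete check (a toy example of mine, not from the text): for f(x, α) = αx − x^2, the first-order condition α − 2x = 0 gives x*(α) = α/2 explicitly, and the formula −f_xα/f_xx reproduces its derivative 1/2.

```python
# f(x, α) = αx - x², so f_x = α - 2x, f_xα = 1, f_xx = -2 (a toy example of mine).
f_x_alpha = 1.0
f_xx = -2.0

formula = -f_x_alpha / f_xx          # Corollary 3.5's right-hand side: 1/2

def x_star(alpha):                   # solving the FOC α - 2x = 0 directly
    return alpha / 2

h = 1e-6
numeric = (x_star(1 + h) - x_star(1 - h)) / (2 * h)
assert abs(formula - 0.5) < 1e-12
assert abs(numeric - formula) < 1e-9
print(formula)  # 0.5
```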
Remark

When working with comparative statics, we can reasonably assume the requirements of the implicit function theorem.

1 If an optimal choice exists, then it must satisfy the first-order condition.
2 It makes sense to work with smooth functions when modeling empirical data.
3 At a maximizer, f_xx is likely (though not required) to be negative. Alternatively, assuming f_xx < 0 on the entire domain guarantees that x*(α) is actually the global maximizer for each α.
3.1.7
The Construction of f(x)
Here we will give a construction of the function y = f(x) that matches the graph of F (x, y) = c
near a point (a, b). The same argument works for more variables, but the pictures are harder to draw.
Constructing f(x) requires the following tools:
1 Lemma 1.11: If f′(x) > 0 then f is increasing.

2 Definition of continuity: The values of F can be kept arbitrarily close to F(a, b) by restricting to points sufficiently close to (a, b).

3 The intermediate value theorem: If f(x) is continuous and f(a^−) < c < f(a^+) then there is a value k between a^− and a^+ such that f(k) = c.

The implicit function theorem requires that F_y(a, b) ≠ 0. There are two cases to consider, but the arguments are analogous. We will consider the case F_y(a, b) > 0.
1 F is continuously differentiable, so there is a neighborhood of (a, b) where F_y(x, y) > 0.

2 Within this neighborhood, we can travel h units in the y-direction from (a, b). Let F(a, b + h) = c^+ and F(a, b − h) = c^−.

3 Since F_y(x, y) > 0, we have c^+ > c and c^− < c.

4 Since F is continuous, there is a neighborhood of (a, b + h) where F(x, y) > c. There is a neighborhood of (a, b − h) where F(x, y) < c.

5 We consider segments from (x, b + h) to (x, b − h), with one endpoint in each neighborhood.

6 Apply the intermediate value theorem: since F(x, b − h) < c < F(x, b + h), there is a k between b − h and b + h such that F(x, k) = c.

7 Since F_y(x, y) > 0, F is increasing along the segment, so it cannot take the value c more than once. The k in the previous step is unique.

8 Repeat this for every segment of the form (x, b + h) to (x, b − h), and define f(x) = k.

(The accompanying panels show the neighborhood where F_y > 0, the points (a, b ± h), a segment from (x, b + h) to (x, b − h) crossing the level set at (x, k), and the resulting graph y = f(x).)
3.1.8
The Implicit Function Theorem with More Variables
We can also apply the implicit function theorem to an implicit equation of n > 2 variables. In this
case, one dependent variable can be expressed as a function of n − 1 independent variables.
Theorem 3.6 [The Multivariable Implicit Function Theorem]

Suppose F(x, y) is a continuously differentiable function and F(a, b) = c. If F_y(a, b) ≠ 0, then there exists a differentiable function f(x) such that y = f(x) and F(x, y) = c describe the same graph in some neighborhood of (a, b).
Since f (x) is an n − 1 variable function, the derivatives we can compute are the partial derivatives.
The formula for these derivatives is analogous to the single-variable version.
Corollary 3.7

For each variable x_k, the partial derivative of f with respect to x_k at a is given by

f_{x_k}(a) = − F_{x_k}(a, b) / F_y(a, b)
We can now justify our earlier characterization of level sets. At the time we made no mention
of dependent and independent variables. This lack of distinction actually makes the implicit function
theorem easier to apply.
Remark
There is nothing special about the letter y, nor the fact that it is the last variable of F . The variable
“y” in the implicit function theorem can apply to any variable of an implicit equation, so long as the
partial derivative with respect to that variable is not zero.
Corollary 3.8

Given a function f(x, α), suppose that

1 a is a value of α and b = x*(a).
2 x*(α) satisfies f_x(x*(α), α) = 0 near (b, a)
3 f(x, α) has continuous second derivatives near (b, a)
4 f_xx(b, a) ≠ 0

Then

∂x*(a)/∂α_k = − f_{xα_k}(b, a) / f_xx(b, a).
We can justify the multivariable implicit function theorem and Corollary 3.7 using arguments similar to the single-variable versions. The construction of f is the same for both versions, except that the vector x replaces x. The computation of the partial derivatives of f requires more adaptation.
Since f_{x_k}(x) is a partial derivative, we treat x_k as a parameter. The other x_i are constants, held equal to the corresponding components of a.

x_i = a_i if i ≠ k, x_k if i = k    dx_i/dx_k = 0 if i ≠ k, 1 if i = k

y = f(x)    dy/dx_k = f_{x_k}(x)

The strategy is the same. We differentiate F(x, f(x)) with respect to x_k, solve for f_{x_k}(x), and evaluate at x_k = a_k.
F(x, f(x)) = c    ((x, f(x)) lies in F(x, y) = c)

dF(x, f(x))/dx_k = 0    (derivative of a constant is 0)

∑_{i=1}^{n} F_{x_i}(x, f(x)) dx_i/dx_k + F_y(x, f(x)) f_{x_k}(x) = 0    (chain rule)

F_{x_k}(x, f(x)) dx_k/dx_k + F_y(x, f(x)) f_{x_k}(x) = 0    (dx_i/dx_k = 0 for i ≠ k)

F_{x_k}(x, f(x))(1) + F_y(x, f(x)) f_{x_k}(x) = 0    (evaluate derivative of x_k)

f_{x_k}(x) = − F_{x_k}(x, f(x)) / F_y(x, f(x))    (solve for f_{x_k}(x))

f_{x_k}(a) = − F_{x_k}(a, b) / F_y(a, b)    (evaluate at x = a)
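As a numerical check of Corollary 3.7 (a toy example of mine, not the text's): take F(x_1, x_2, y) = x_1 y + x_2 y^2 with c = F(1, 1, 1) = 2. The formula gives f_{x_1}(1, 1) = −F_{x_1}/F_y = −1/3, which matches a finite-difference slope computed by re-solving for y.

```python
# F(x1, x2, y) = x1*y + x2*y**2, level c = F(1, 1, 1) = 2  (toy example of mine)
def F(x1, x2, y):   return x1*y + x2*y**2
def Fx1(x1, x2, y): return y
def Fy(x1, x2, y):  return x1 + 2*x2*y

c = F(1, 1, 1)
formula = -Fx1(1, 1, 1) / Fy(1, 1, 1)       # -1/3, from Corollary 3.7

def solve_y(x1, x2, y0=1.0, steps=50):
    """Newton's method on y ↦ F(x1, x2, y) - c, holding x2 fixed."""
    y = y0
    for _ in range(steps):
        y -= (F(x1, x2, y) - c) / Fy(x1, x2, y)
    return y

h = 1e-6
numeric = (solve_y(1 + h, 1) - solve_y(1 - h, 1)) / (2 * h)  # ∂f/∂x1 at (1, 1)
assert abs(formula + 1/3) < 1e-12
assert abs(numeric - formula) < 1e-5
print(formula)  # -0.3333333333333333
```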
3.1.9
The Implicit Function Theorem for Multiple Equations
We have seen two instances previously where the solution to multiple implicit equations was relevant.

1 The feasible set of multiple equality constraints is the intersection of multiple level sets.
2 Critical points of a multivariable function satisfy f_{x_i}(x) = 0 for each i.

A graph of the form y = f(x) in R^{n+1} will have dimension n. In general, each equation we wish to satisfy lowers the dimension of our space of solutions by 1. If we want to express an intersection of level sets, y = f(x) will not have the right dimension.

The way to handle this loss of dimension in an explicit function is to increase the number of dependent variables. Specifically, if we have n variables x_i and m variables y_j, then the graph of a family of functions y_j = f_j(x) will have dimension n in R^{n+m}.
The most general version of the implicit function theorem states when a family of implicit equations
can be expressed as a family of explicit functions instead.
Notation

Given a family of functions

F(x, y) = (F_1(x, y), F_2(x, y), . . . , F_m(x, y))

the derivative of F with respect to y_j is the family of functions

∂F/∂y_j (x, y) = (∂F_1/∂y_j (x, y), ∂F_2/∂y_j (x, y), . . . , ∂F_m/∂y_j (x, y))

Note that we are using subscripts here to indicate different components of the vector F, not as partial derivatives.
Theorem 3.9 [The Implicit Function Theorem for Multiple Dependent Variables]

Suppose y is an m-vector and F(x, y) is a family of m continuously differentiable functions such that F(a, b) = (c_1, . . . , c_m). If the vectors ∂F/∂y_j (a, b) are linearly independent, then there exists a family of differentiable functions (f_j(x)) such that the equations y_j = f_j(x) describe the same graph as F(x, y) = (c_1, . . . , c_m) in some neighborhood of (a, b).
We can use the chain rule to solve for the partial derivatives ∂f_j/∂x_k (a), but the derivative of any equation F_j(x, y) = c_j with respect to x_k will contain the derivatives of all the f_j. To compute the derivative we want, we need to differentiate all of the implicit equations and solve a system of equations. Here is the simplest example.
Example

Consider two implicit equations F_1(x, y_1, y_2) = c_1 and F_2(x, y_1, y_2) = c_2. Assuming ∂F/∂y_1 and ∂F/∂y_2 are linearly independent, the implicit function theorem guarantees differentiable explicit functions y_1 = f_1(x) and y_2 = f_2(x). Differentiating the original implicit equations with respect to x gives

∂F_1/∂x dx/dx + ∂F_1/∂y_1 f_1′(x) + ∂F_1/∂y_2 f_2′(x) = 0

∂F_2/∂x dx/dx + ∂F_2/∂y_1 f_1′(x) + ∂F_2/∂y_2 f_2′(x) = 0

We can use this system to solve for f_1′(x) and f_2′(x).
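Here is a small worked instance of that system (my own, not from the text): F_1 = y_1 + y_2 − x and F_2 = y_1 − y_2 − x^2 with c_1 = c_2 = 0. The explicit solutions are y_1 = (x + x^2)/2 and y_2 = (x − x^2)/2, and solving the differentiated 2×2 system reproduces their derivatives.

```python
# F1(x, y1, y2) = y1 + y2 - x = 0,  F2(x, y1, y2) = y1 - y2 - x**2 = 0
# (a toy system of mine).  Differentiating with respect to x gives:
#   -1  + f1'(x) + f2'(x) = 0
#   -2x + f1'(x) - f2'(x) = 0
# i.e. the linear system  [[1, 1], [1, -1]] @ [f1', f2'] = [1, 2x].

def derivatives(x):
    a, b, c, d = 1.0, 1.0, 1.0, -1.0      # coefficient matrix entries
    r1, r2 = 1.0, 2.0 * x                 # right-hand side
    det = a * d - b * c                   # = -2, nonzero
    f1p = (r1 * d - b * r2) / det         # Cramer's rule, column 1 replaced
    f2p = (a * r2 - r1 * c) / det         # Cramer's rule, column 2 replaced
    return f1p, f2p

# Compare with the explicit solutions y1 = (x + x²)/2, y2 = (x - x²)/2.
x = 1.0
f1p, f2p = derivatives(x)
assert abs(f1p - (1 + 2*x) / 2) < 1e-12   # 1.5
assert abs(f2p - (1 - 2*x) / 2) < 1e-12   # -0.5
print(f1p, f2p)
```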
3.1.10
The Derivative of Optimal Choice with Multiple Choice Variables
Suppose your utility is a function of two choice variables and one exogenous parameter.

u(x_1, x_2, α)

Your optimal choices x_1^*(α) and x_2^*(α) satisfy the following implicit equations:

u_1(x_1^*(α), x_2^*(α), α) = 0
u_2(x_1^*(α), x_2^*(α), α) = 0.

By the implicit function theorem, we can write x_1^*(α) and x_2^*(α) as differentiable explicit functions of α if their derivatives with respect to x_1 and x_2 are linearly independent. Here are those derivatives.
∂(u_1, u_2)/∂x_1 (x_1^*(α), x_2^*(α), α) = (u_11(x_1^*(α), x_2^*(α), α), u_21(x_1^*(α), x_2^*(α), α))

∂(u_1, u_2)/∂x_2 (x_1^*(α), x_2^*(α), α) = (u_12(x_1^*(α), x_2^*(α), α), u_22(x_1^*(α), x_2^*(α), α))
These are the columns of the second upper left square minor of the Hessian of u. Columns of a matrix are independent if the matrix has a nonzero determinant.

|Hu(x_1^*(α), x_2^*(α), α)_2| = det[ u_11  u_12 ; u_21  u_22 ] ≠ 0

where each u_ij is evaluated at (x_1^*(α), x_2^*(α), α).
Assuming this holds, we can use the chain rule to differentiate both implicit equations with respect to α. We obtain

u_11(x_1^*(α), x_2^*(α), α) dx_1^*(α)/dα + u_12(x_1^*(α), x_2^*(α), α) dx_2^*(α)/dα + u_1α(x_1^*(α), x_2^*(α), α) dα/dα = 0

u_21(x_1^*(α), x_2^*(α), α) dx_1^*(α)/dα + u_22(x_1^*(α), x_2^*(α), α) dx_2^*(α)/dα + u_2α(x_1^*(α), x_2^*(α), α) dα/dα = 0
In a concrete example we could solve for dx_1^*(α)/dα and dx_2^*(α)/dα directly using algebra. If that seems difficult or the problem is abstract, we can borrow an approach from linear algebra. We write this system of equations as a matrix product.
[ u_11  u_12 ] [ dx_1^*(α)/dα ]   [ −u_1α ]
[ u_21  u_22 ] [ dx_2^*(α)/dα ] = [ −u_2α ]

where each u_ij and u_iα is evaluated at (x_1^*(α), x_2^*(α), α).
Cramer’s rule writes the solution to a matrix equation as a ratio of determinants. Again evaluating each u_ij and u_iα at (x_1^*(α), x_2^*(α), α):

dx_1^*(α)/dα = det[ −u_1α  u_12 ; −u_2α  u_22 ] / det[ u_11  u_12 ; u_21  u_22 ]

dx_2^*(α)/dα = det[ u_11  −u_1α ; u_21  −u_2α ] / det[ u_11  u_12 ; u_21  u_22 ]
Notice the denominator in each formula is |Hu(x_1^*(α), x_2^*(α), α)_2|. We replace the corresponding column with the right side of the equation to obtain the numerator.

Cramer’s rule can apply to any number of variables, so we can extend this procedure to more than two choice variables. We can also handle multiple parameters. This method would compute partial derivatives of x_i^*(α) with respect to some α_j.
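The Cramer's rule procedure can be carried out on a concrete utility. In this sketch (my own example, not from the text), u(x_1, x_2, α) = αx_1 − x_1^2 + x_1x_2 − x_2^2, whose first-order conditions solve to x_1^*(α) = 2α/3 and x_2^*(α) = α/3, so the derivatives produced by Cramer's rule should be 2/3 and 1/3.

```python
# u(x1, x2, α) = α*x1 - x1**2 + x1*x2 - x2**2  (a toy utility of mine).
# FOCs: u1 = α - 2x1 + x2 = 0,  u2 = x1 - 2x2 = 0, so x1* = 2α/3, x2* = α/3.
u11, u12, u21, u22 = -2.0, 1.0, 1.0, -2.0   # second derivatives (constant here)
u1a, u2a = 1.0, 0.0                          # u_1α, u_2α

def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

H = det2(u11, u12, u21, u22)                 # Hessian minor, = 3 (nonzero)
dx1 = det2(-u1a, u12, -u2a, u22) / H         # Cramer: replace column 1
dx2 = det2(u11, -u1a, u21, -u2a) / H         # Cramer: replace column 2

assert abs(dx1 - 2/3) < 1e-12                # matches d(2α/3)/dα
assert abs(dx2 - 1/3) < 1e-12                # matches d(α/3)/dα
print(dx1, dx2)
```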
Main Ideas

- The optimal choice of multiple variables satisfies the multivariable first-order condition.
- The implicit function theorem applies to the optimal choice variables if the Hessian minor corresponding to the choice variables has a nonzero determinant. If the optimal choice satisfies the second-order condition, this is automatic.
- We can differentiate the multivariable first-order condition with respect to a parameter and obtain a linear system of equations in the derivatives of the choice variables.
- Cramer’s rule is a reliable way to solve such systems.
3.2
The Envelope Theorem
Goals:
1 Use the envelope theorem to compute the derivative of value functions.
3.2.1
Definition 3.10

Suppose we have

- f(x, α), a differentiable function
- x, a choice variable
- α, a parameter
- x*(α), the x that maximizes f(x, α) for a given α.

Then the outcome that will occur for each α is the value function

V(α) = f(x*(α), α)
Vocabulary

In the case that f has an economic meaning, we sometimes use the term indirect, along with the ∗ notation, to refer to the value function of f.

- If u(x, α) is a utility function then u*(α) = u(x*(α), α) is the indirect utility function.
- If π(x, α) is a profit function then π*(α) = π(x*(α), α) is the indirect profit function.
x*(α) is a solution to f_x(x, α) = 0. Assuming f is continuously differentiable, the implicit function theorem guarantees that x*(α) is a differentiable function. This means that V is a differentiable function. If we want to understand how a change in α affects the value of V, we want to compute V′(α).
V′(α) computes how a change in a parameter affects an outcome, assuming the agents involved make the optimal choices. This derivative can answer a variety of questions in economics.
Example
How will increasing a tax impact company profits?
Will increasing a subsidy increase consumer well-being?
In the case that we have an expression for f, computing V′(α) is not difficult:

1 Solve for x*(α)
2 Substitute x*(α) into f(x, α) to obtain an expression for V(α)
3 Differentiate V(α)

But is there a better way?
But is there a better way?
3.2.2
The Envelope Theorem
The envelope theorem gives us an alternative method. We compute V′(α) by parametrizing the values of α and x* in a neighborhood where x*(α) is differentiable. We use α as the parameter.

α = α    dα/dα = 1
x = x*(α)

We can apply the chain rule to V(α) = f(x*(α), α). We use the fact that x*(α) satisfies the first-order condition for all α. Specifically, we have f_x(x*(α), α) = 0.
V′(α) = df(x*(α), α)/dα

= f_x(x*(α), α) dx*(α)/dα + f_α(x*(α), α) dα/dα    (chain rule)

= (0) dx*(α)/dα + f_α(x*(α), α)(1)    (FOC)

= f_α(x*(α), α)
Remark

This computation requires us to know that x*(α) is a differentiable function. Here are two ways to verify this.

1 For a concrete function, compute x*(α). Verify directly that it is differentiable.
2 In more abstract settings, apply the implicit function theorem. Check that
   - f has continuous second derivatives
   - f_xx(x*(α), α) ≠ 0 for all α.
Theorem 3.11 [The Envelope Theorem, Single-Variable]

Suppose

- f(x, α) is a differentiable function, and
- x*(α), the x that maximizes f for each α, is a differentiable function.

The following two derivatives are equal:

V′(α) = f_α(x*(α), α)

The left side is the derivative of the value function; the right side is a partial derivative of the original function.
The envelope theorem allows us to compute a partial derivative of f instead of the total derivative of V. In some sense, the envelope theorem is saying that the change in x* does not matter. This makes sense, because at a maximizer, we cannot increase the value of the function by changing x.

This is an interesting insight into the behavior of value functions, but all we have done is traded one derivative for another. We still need to compute x*(α) to evaluate the partial derivative. It is natural to ask: does the envelope theorem save us any work in practice?

Compare the following methods of computing V′(α):
Without the envelope theorem
1 Compute x*(α)
2 Substitute into f(x, α) to get V(α)
3 Differentiate V(α)

With the envelope theorem
1 Compute x*(α)
2 Partially differentiate f(x, α)
3 Substitute x*(α)
Remark
In concrete situations, the first method gives us a more complicated function to differentiate. In abstract
situations, the first method may be impossible or give us an answer in a less useful form.
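Both routes can be compared on a concrete case. In this sketch (my own example, not from the text), f(x, α) = αx − x^2 with x*(α) = α/2; a numerical derivative of V agrees with the envelope-theorem shortcut f_α(x*(α), α) = x*(α).

```python
# f(x, α) = αx - x² with x*(α) = α/2  (a toy example of mine).

def x_star(alpha):
    return alpha / 2

def V(alpha):                         # method 1: substitute, then differentiate
    x = x_star(alpha)
    return alpha * x - x**2           # V(α) = α²/4

def f_alpha(x, alpha):                # method 2: ∂f/∂α = x, then substitute x*
    return x

alpha, h = 3.0, 1e-6
v_prime = (V(alpha + h) - V(alpha - h)) / (2 * h)       # numerical V'(α)
envelope = f_alpha(x_star(alpha), alpha)                # f_α(x*(α), α) = α/2

assert abs(v_prime - envelope) < 1e-6                   # both equal 1.5
print(v_prime, envelope)
```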
The envelope theorem can also be justified visually. If we pick a specific a, we can compare two functions:

1 The value function: V(α) = f(x*(α), α), which uses the best x for each α.
2 The stubborn function: V_0(α) = f(x*(a), α), which sticks with the best x for a, even if α changes.
The stubborn function has the following properties:

- Since its x-coordinate is constant, the derivative of V_0 is equal to the partial derivative f_α.
- V_0(a) = V(a)
- For any other α, x*(a) cannot be a better choice than x*(α). Thus V_0(α) ≤ V(α).

The graph y = V_0(α) meets y = V(α) at a but does not go above it. They must be tangent, which means their derivatives are equal.

V′(a) = V_0′(a) = f_α(x*(a), a)
Figure 3.6: The graphs of the value function and stubborn function and their tangent line
The envelope theorem gets its name from the fact that y = V(α) envelops all of the stubborn functions y = V_0(α) generated by choosing different values of a.
3.2.3
Generalizations of the Envelope Theorem
There are several ways to generalize the envelope theorem. First we can consider a function of more choice variables or more parameters.

1 If f is a function of n choice variables, then each choice variable has an optimal value that depends on α. We obtain a family of functions x_i^*(α). We can write them as a vector x*(α). The value function is

V(α) = f(x*(α), α)

2 If f is a function of m parameters α_j, we can express the parameters as a vector α. The optimal choice of each choice variable is a function of all the α_j. The value function is also a function of all the α_j.

V(α) = f(x*(α), α)
Since the x_i^*(α) satisfy the family of equations f_{x_i}(x, α) = 0, we need the multi-equation version of the implicit function theorem to justify their existence and differentiability.

The value function is a function of multiple parameters, so the envelope theorem computes its partial derivatives.
Theorem 3.12 [The Envelope Theorem, Multivariable]

Suppose

- f(x, α) is a differentiable function
- x*(α), the x that maximizes f for each α, is a differentiable function

For any region of R^m where x*(α) is differentiable and any coordinate α_k of α we have:

V_{α_k}(α) = f_{α_k}(x*(α), α)
We can again show this with a parametrization. Since we need the partial derivative, we treat α_k as a parameter. The other α_j are constants, held equal to the corresponding components of some fixed a. This parametrization will not cover the entire domain of α, only those α that lie in the α_k-direction from a.

α_j = a_j if j ≠ k, α_k if j = k    dα_j/dα_k = 0 if j ≠ k, 1 if j = k

x = x*(α)
In general, x_i^* is a multivariable function of α. In this parametrization, x_i is a function only of the parameter α_k. By the first-order condition, x*(α) satisfies f_{x_i}(x*(α), α) = 0 for each x_i. We can apply the chain rule to V(α) = f(x*(α), α) to compute V_{α_k}(α).
V_{α_k}(α) = df(x*(α), α)/dα_k

= ∑_{i=1}^{n} f_{x_i}(x*(α), α) dx_i^*(α)/dα_k + ∑_{j=1}^{m} f_{α_j}(x*(α), α) dα_j/dα_k

= ∑_{i=1}^{n} (0) dx_i^*(α)/dα_k + ∑_{j≠k} f_{α_j}(x*(α), α)(0) + f_{α_k}(x*(α), α)(1)

= f_{α_k}(x*(α), α)
This is valid for any α that lies on our path through a in the α_k-direction. We can apply this reasoning to any a though, so the computation holds for the entire domain of x*(α).

Our final generalization of the envelope theorem assumes that x*(α) is the optimal choice given an equality constraint, g(x, α) = 0.
Theorem 3.13 [The Envelope Theorem, Constrained]

Suppose

- f(x, α) is a differentiable objective function
- g(x, α) is a differentiable constraint function
- x*(α), the x that maximizes f subject to g(x, α) = 0 for each α, is a differentiable function
- λ*(α) is the value of λ that solves the first-order conditions of L along with x*(α) and α.

For any region of R^m where x*(α) and λ*(α) are differentiable and any coordinate α_k of α we have:

V_{α_k}(α) = L_{α_k}(x*(α), α, λ*(α))
Proving this version requires us to use the fact that g(x*(α), α) = 0 for all α. This means that

L(x*(α), α, λ*(α)) = f(x*(α), α) + λ*(α) g(x*(α), α)
= f(x*(α), α) + λ*(α)(0)
= V(α)
x*(α) does not necessarily satisfy the first-order condition of f, but for a fixed α it does satisfy the first-order conditions of L. Specifically:

L_{x_i}(x*(α), α, λ*(α)) = 0
L_λ(x*(α), α, λ*(α)) = 0
We can use the same parametrization as the unconstrained case, with the understanding that x*(α) now describes the maximizer of the constrained optimization and λ*(α) the corresponding λ value.
V_{α_k}(α) = df(x*(α), α)/dα_k = dL(x*(α), α, λ*(α))/dα_k

= ∑_{i=1}^{n} L_{x_i}(x*(α), α, λ*(α)) dx_i^*(α)/dα_k + ∑_{j=1}^{m} L_{α_j}(x*(α), α, λ*(α)) dα_j/dα_k + L_λ(x*(α), α, λ*(α)) dλ*(α)/dα_k

= ∑_{i=1}^{n} (0) dx_i^*(α)/dα_k + ∑_{j≠k} L_{α_j}(x*(α), α, λ*(α))(0) + L_{α_k}(x*(α), α, λ*(α))(1) + (0) dλ*(α)/dα_k

= L_{α_k}(x*(α), α, λ*(α))
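The constrained version can also be checked on a concrete case. In this sketch (my own example, not the text's), f(x_1, x_2) = x_1x_2 is maximized subject to g(x, α) = α − x_1 − x_2 = 0, which gives x_1^*(α) = x_2^*(α) = λ*(α) = α/2; a numerical derivative of V agrees with L_α = λ*(α).

```python
# Maximize f(x1, x2) = x1*x2 subject to g(x, α) = α - x1 - x2 = 0 (my toy example).
# FOCs of L = f + λg:  x2 - λ = 0,  x1 - λ = 0,  α - x1 - x2 = 0,
# so x1*(α) = x2*(α) = λ*(α) = α/2.

def V(alpha):                          # value function computed the long way
    x = alpha / 2
    return x * x                       # V(α) = α²/4

def L_alpha(alpha):                    # L_α = λ*(α), since ∂g/∂α = 1
    return alpha / 2

alpha, h = 4.0, 1e-6
v_prime = (V(alpha + h) - V(alpha - h)) / (2 * h)
assert abs(v_prime - L_alpha(alpha)) < 1e-6    # both equal 2.0
print(v_prime, L_alpha(alpha))
```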
Further Generalizations
We can extend this reasoning to multiple equality constraints without any trouble.
We can also extend the envelope theorem to inequality constraints, where we set λ*(α) = 0 for any α such that the constraint is not binding at the maximizer x*(α). If x* and λ* are differentiable functions of α in this setting, we can still conclude that

V_{α_k}(α) = L_{α_k}(x*(α), α, λ*(α))

However, we expect x*(α) will not be differentiable at the transition between the binding and nonbinding cases.
Proving the inequality case is identical to the equality case, except that now

L_λ(x(t), α(t), λ(t)) λ′(t) = 0

because the first factor is 0 when the constraint is binding and the second factor is 0 when it is nonbinding.
Remark

The generalizations of the envelope theorem require that x*(α) is a differentiable function. Like in the single-variable case, it usually makes sense to solve for x*(α) and directly verify that it is differentiable. If we need to use the implicit function theorem instead, we need the multivariable version. The requirement for that is

|Hf_x(x, α)| ≠ 0 or |HL_{x,λ}(x, α)| ≠ 0

depending on whether there is a constraint.
For one-variable functions, we can use the fact that a union of finitely many closed intervals (or isolated points) is closed. If the intervals each have finite length, then the union is also bounded.

Figure 4.1: A finite union of closed intervals
The same holds for curves. A curve must include its endpoints, or have no endpoints, to be closed.

Figure 4.2: A closed curve

Figure 4.3: The graph x_1^2 + x_2^2 = 9 is closed.
Generally, a region defined by a strict inequality will not contain its boundary points and thus will
not be closed.
Figure 4.4: x_1^2 + x_2^2 ≤ 9 is closed.

Figure 4.5: x_1^2 + x_2^2 < 9 is not closed.
If multiple inequalities are involved and relevant, they must all be nonstrict in order to avoid removing
boundary points. An interesting case is the removal of a single interior point. If we exclude that point
from S, then that point becomes a boundary point. Any neighborhood of it contains points in S and a
point not in S.
Theorem 4.4
1 The intersection of any number of closed sets is closed.
2 The union of any finite number of closed sets is closed.
This theorem is specifically useful for feasible sets defined by multiple constraints. Under our
formulation of constrained optimization, the feasible set is an intersection of level sets and upper level
sets. As long as the constraint functions are continuous, the feasible set will be closed.
The extreme value theorem is a standard result in analysis. While we will not prove it, we can at
least demonstrate that each hypothesis is necessary.
Example
Consider f(x) = x² on the region S = {x : x ≥ 0}. f(x) is continuous and S is closed but not bounded. f(x) grows without bound in S and has no maximum.
Figure 4.10: The graph of y = x² over [0, ∞)
Example
Consider
f(x) = x² if x < 2, and f(x) = 0 if x ≥ 2
on the region S = {x : 0 ≤ x ≤ 2}. S is closed and bounded, but f(x) is not continuous. f(x) approaches a value of 4 but never reaches it. There is no maximizer a. For any a < 2, there is a b closer to 2 with f(b) > f(a).
4.1.2 Applying the Extreme Value Theorem
Find the maximizer(s) of f(x₁, x₂) = x₁² + x₁²x₂ + 2x₂² subject to 8 − x₁² − x₂² ≥ 0.
Figure 4.13: The feasible set
Solution
Apply the necessary conditions for maximizers subject to an inequality. The Lagrangian is
L(x₁, x₂, λ) = x₁² + x₁²x₂ + 2x₂² + λ(8 − x₁² − x₂²)
The conditions are
L_{x₁}(x₁, x₂, λ) = 2x₁ + 2x₁x₂ − 2λx₁ = 0
L_{x₂}(x₁, x₂, λ) = x₁² + 4x₂ − 2λx₂ = 0
8 − x₁² − x₂² ≥ 0, λ ≥ 0 and λ(8 − x₁² − x₂²) = 0
We solve each case of the complementary slackness.
1 Set λ = 0
2x₁ + 2x₁x₂ = 0 and x₁² + 4x₂ = 0
2x₁(1 + x₂) = 0
If x₁ = 0: 0² + 4x₂ = 0, so x₂ = 0.
If x₂ = −1: x₁² + 4(−1) = 0, so x₁ = ±2.
Check that (0, 0, 0) and (±2, −1, 0) satisfy 8 − x₁² − x₂² ≥ 0. They do.
2 Set 8 − x₁² − x₂² = 0. We need to solve
2x₁ + 2x₁x₂ − 2λx₁ = 0,  x₁² + 4x₂ − 2λx₂ = 0,  8 − x₁² − x₂² = 0
One good approach is to factor:
2x₁ + 2x₁x₂ − 2λx₁ = 0
2x₁(1 + x₂ − λ) = 0
x₁ = 0 or λ = 1 + x₂
We treat these two cases separately.
a If x₁ = 0, then
8 − (0)² − x₂² = 0, so x₂ = ±2√2.
Substituting, (0)² + 4(±2√2) − 2λ(±2√2) = 0, so ±8√2 = ±4√2 λ and λ = 2.
b If λ = 1 + x₂, then
x₁² + 4x₂ − 2(1 + x₂)x₂ = 0
x₁² = 2x₂² − 2x₂
Substituting into the constraint,
8 − (2x₂² − 2x₂) − x₂² = 0
3x₂² − 2x₂ − 8 = 0
(3x₂ + 4)(x₂ − 2) = 0
If x₂ = 2: λ = 1 + 2 = 3 and x₁² = 8 − (2)² = 4, so x₁ = ±2.
If x₂ = −4/3: λ = 1 − 4/3 = −1/3 and x₁² = 8 − (4/3)² = 56/9, so x₁ = ±(2√14)/3.
We verify that (0, ±2√2, 2), (±2, 2, 3) and (±(2√14)/3, −4/3, −1/3) satisfy λ ≥ 0. The third one does not.
Now we can apply the extreme value theorem. The function f(x₁, x₂) is continuous. The feasible set is the upper level set of a continuous function: 8 − x₁² − x₂² ≥ 0, so it is closed. The feasible set is a disk, so it is bounded. The extreme value theorem tells us a maximizer must exist.
Only a point that satisfies our necessary condition can be that maximizer. To determine which one,
we evaluate f(x₁, x₂) at each.
f(0, 0) = (0)² + (0)²(0) + 2(0)² = 0
f(±2, −1) = (±2)² + (±2)²(−1) + 2(−1)² = 2
f(0, ±2√2) = (0)² + (0)²(±2√2) + 2(±2√2)² = 16
f(±2, 2) = (±2)² + (±2)²(2) + 2(2)² = 20
Because they produce the greatest values among the candidates, the maximizers are (2, 2) and (−2, 2).
Figure 4.14: The maximizers of y = x₁² + x₁²x₂ + 2x₂² subject to 8 − x₁² − x₂² ≥ 0
Main Ideas
The algebraic expressions tell us when the objective function is continuous and the feasible set is
closed.
Draw the feasible set to decide whether it is bounded.
If the EVT applies, we can evaluate f at all of the points that passed our necessary conditions.
The one that attains the greatest value is the maximizer.
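This final step is mechanical enough to sketch in Python; the candidate list below is exactly the set of points that survived the necessary conditions in the worked example.

```python
# Evaluate the objective from the worked example,
# f(x1, x2) = x1**2 + x1**2 * x2 + 2 * x2**2,
# at the candidates that passed the necessary conditions.

def f(x1, x2):
    return x1**2 + x1**2 * x2 + 2 * x2**2

candidates = [(0, 0), (2, -1), (-2, -1),
              (0, 2 * 2**0.5), (0, -2 * 2**0.5),
              (2, 2), (-2, 2)]

values = {pt: f(*pt) for pt in candidates}
best = max(values.values())
maximizers = [pt for pt, v in values.items() if v == best]
print(best)                # 20
print(sorted(maximizers))  # [(-2, 2), (2, 2)]
```

The extreme value theorem guarantees one of these candidates is the maximizer, so the greatest value identifies it.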
4.2 The Bordered Hessian
Goals:
1 Use the bordered Hessian to identify local maximizers of a function subject to a constraint.
4.2.1 The Bordered Hessian
We used the Hessian matrix to recognize maximizers and minimizers in unconstrained optimization.
This worked because the Hessian computed the second derivative over a straight line a+ tv. An equality
constraint does not usually contain straight lines. The test for the second derivative will need to take
account of the shape of the level set g(x) = 0.
Definition 4.5
Given a constrained optimization problem with Lagrangian L, the matrix HL(λ, x) is called the bordered
Hessian of the constrained optimization problem.
Example
The 2-variable bordered Hessian has the form
HL(λ, x₁, x₂) =
| L_λλ  L_λ1  L_λ2 |
| L_λ1  L_11  L_12 |
| L_λ2  L_12  L_22 |
=
| 0    g₁           g₂          |
| g₁   f₁₁ + λg₁₁   f₁₂ + λg₁₂  |
| g₂   f₂₁ + λg₂₁   f₂₂ + λg₂₂  |
Why is this called “bordered?”
The bottom right 2 × 2 minor looks like a Hessian.
It is bordered to the left and above by ∇g.
Notice that we have switched the order of variables in our Lagrangian. This is common when writing
the bordered Hessian. Placing the border on the top allows us to write our condition for a local maximizer
in a familiar way. If we instead prioritized consistency, we could keep the λ last and modify our condition.
Theorem 4.6
Let f(x) be an n-variable function and g(x) = 0 be a constraint. If (ℓ,a) satisfies the first-order
condition of the Lagrangian and the upper left square minors of HL(ℓ,a) satisfy
(−1)^i |M_i| < 0 for 2 ≤ i ≤ n + 1,
then a is a strict local maximizer of f among points on the constraint.
Remark
We do not test |M₁|, since M₁ = [0].
We generally do not need to worry about M₂ either, since
|M₂| = | 0   g₁          |
       | g₁  f₁₁ + λg₁₁  |
     = −(g₁)²
Notice the inequality is reversed from the unconstrained second-order condition.
Variant of Theorem 4.6
Let f(x) be an n-variable function and g(x) = 0 be a constraint. If (ℓ,a) satisfies the first-order
condition of the Lagrangian and the upper left square minors of HL(ℓ,a) satisfy
|M_i| < 0 for 2 ≤ i ≤ n + 1,
then a is a strict local minimizer of f among points on the constraint.
Remark
We might hope to quickly extend this to a global condition, but unfortunately, HL only tells us about
the second derivatives at a critical point. Deriving a condition that works for all points on the constraint
is possible but more complicated.
4.2.2 Using the Bordered Hessian
Let f(x₁, x₂) = x₁² + x₂² on the domain
D = {(x₁, x₂) : x₁ > 0, x₂ > 0}
Find the critical point of f on the constraint x₁⁴ + x₂⁴ = 2. Classify it as a local maximizer or local minimizer of f (or neither) on the constraint.
Solution
The Lagrangian is
L(x₁, x₂, λ) = x₁² + x₂² + λ(2 − x₁⁴ − x₂⁴)
Here are the first-order conditions. We can solve them, using the fact that x₁ > 0 and x₂ > 0.
2x₁ − 4λx₁³ = 0,  2x₂ − 4λx₂³ = 0,  2 − x₁⁴ − x₂⁴ = 0
x₁² = 1/(2λ) and x₂² = 1/(2λ), so x₁² = x₂² and x₁ = x₂.
2 − x₁⁴ − x₁⁴ = 0, so 2 = 2x₁⁴ and x₁ = 1, x₂ = 1.
Then 1 = 1/(2λ), so λ = 1/2.
The critical point is (1, 1, 1/2). Switching the order of the variables, we compute
HL(λ, x₁, x₂) =
| 0       −4x₁³         −4x₂³        |
| −4x₁³   2 − 12λx₁²    0            |
| −4x₂³   0             2 − 12λx₂²   |
HL(1/2, 1, 1) =
| 0    −4   −4 |
| −4   −4    0 |
| −4    0   −4 |
The determinants of the upper left square minors are
|M₂| = (0)(−4) − (−4)(−4) = −16
|M₃| = 0 + 4 · det[−4, 0; −4, −4] − 4 · det[−4, −4; −4, 0] = 4(16) − 4(−16) = 128
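We can double-check these determinants numerically. The sketch below rebuilds HL at the critical point (λ, x₁, x₂) = (1/2, 1, 1) and computes the two minors by cofactor expansion in plain Python.

```python
# Check the bordered Hessian minors from the example.
lam, x1, x2 = 0.5, 1.0, 1.0

HL = [
    [0.0,        -4 * x1**3,            -4 * x2**3],
    [-4 * x1**3, 2 - 12 * lam * x1**2,   0.0],
    [-4 * x2**3, 0.0,                    2 - 12 * lam * x2**2],
]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    # cofactor expansion along the first row
    return (m[0][0] * det2([[m[1][1], m[1][2]], [m[2][1], m[2][2]]])
          - m[0][1] * det2([[m[1][0], m[1][2]], [m[2][0], m[2][2]]])
          + m[0][2] * det2([[m[1][0], m[1][1]], [m[2][0], m[2][1]]]))

M2 = det2([[HL[0][0], HL[0][1]], [HL[1][0], HL[1][1]]])
M3 = det3(HL)
print(M2, M3)  # -16.0 128.0
```

Since (−1)²|M₂| < 0 and (−1)³|M₃| < 0, the signs match the pattern in Theorem 4.6.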
4.3 Separation
Goals:
1 Use normal vectors to describe half-spaces of a hyperplane.
2 Apply optimization by separation arguments.
4.3.1 Upper Level Sets and Optimization
Recall our necessary conditions for a maximizer of f(x) subject to g(x) ≥ 0. If the constraint is
binding, we learned to check that ∇f(a) is parallel to ∇g(a), meaning the level sets are parallel. We
also check that λ ≥ 0, meaning that ∇f(a) points away from the feasible set. However, these checks
are not sufficient. We will construct an example that passes these conditions but isn’t a maximizer.
Recall the following definition.
Definition 2.13
The upper level sets of a function f(x) with domain D are the sets
{x ∈ D | f(x) ≥ c} for some number c
The lower level sets are
{x ∈ D | f(x) ≤ c} for some number c
The following characterization will be important to our arguments.
Lemma 2.14
Suppose ∇f(a) ≠ 0, and x(t) is a path that passes through a in the level set f(x) = c at t₀.
If x′(t₀) makes an acute angle with ∇f(a), then x(t) travels into the upper level set f(x) ≥ c.
If x′(t₀) makes an obtuse angle with ∇f(a), then x(t) travels into the lower level set f(x) ≤ c.
The analogous object in R³ is a plane h. It has a normal vector and divides R³ into two half-spaces h⁺ and h⁻. The sign of v · (x − a) tests which half-space x lies in. Specifically, x lies on the plane if v · (x − a) = 0.
Figure 4.20: A plane, its normal vector v, and a vector on it
This reasoning works in any dimension to define a set of points whose displacement from a known
point a is orthogonal to a normal vector v.
Definition 4.9
In R², v · (x − a) = 0 defines a line.
In R³, v · (x − a) = 0 defines a plane.
In Rⁿ, v · (x − a) = 0 defines a hyperplane, an (n − 1)-dimensional affine subspace.
When dimension is general or ambiguous, we use the term hyperplane as a catch-all term. We can
rewrite our dot product lemma to reflect the n-dimensional case.
Variant of Lemma 4.8
Suppose h is a hyperplane with normal vector v and a is a point on h. For any point x,
v · (x − a) > 0 if x lies in h⁺,
v · (x − a) = 0 if x lies on h,
v · (x − a) < 0 if x lies in h⁻.
Figure 4.21: An upper level set of f and the feasible set intersecting multiple times along their separating line
We can modify this theorem by stipulating that x* is either
1 the only point of the upper level set on h, or
2 the only feasible point on h.
In this case we can conclude that x* is the unique maximizer.
If we want to avoid talking directly about the hyperplane h, we can think of v ·x as a function of x.
Its level sets are the hyperplanes v ·x = k. The value of v ·x increases as we travel in the direction of v.
Figure 4.22: Some level sets of v · x that intersect an upper level set of f and a feasible set
Our separation argument can be rephrased to require that the upper level set meets only higher values of v · x while the feasible set meets only lower values of v · x. We can verify this algebraically. Suppose x is a point in the upper level set of f(x). Lemma 4.8 states that x does not lie in h⁻ if
v · (x − x*) ≥ 0
v · x ≥ v · x*
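The half-space test is just a sign check on a dot product. A minimal Python sketch (the sample hyperplane below is our own illustration, not one from the text):

```python
# Side-of-hyperplane test via the dot product, as in Lemma 4.8:
# the sign of v . (x - a) tells which half-space x lies in.

def side(v, a, x):
    s = sum(vi * (xi - ai) for vi, ai, xi in zip(v, a, x))
    if s > 0:
        return "h+"
    if s < 0:
        return "h-"
    return "on h"

# a line in R^2 through a = (1, 1) with normal vector v = (1, 1)
v, a = (1.0, 1.0), (1.0, 1.0)
print(side(v, a, (2.0, 2.0)))  # h+
print(side(v, a, (0.0, 0.0)))  # h-
print(side(v, a, (2.0, 0.0)))  # on h
```

The same function works in any dimension, since only the dot product is involved.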
4.4 Concave Programming
Goals:
1 Understand the geometric consequences of concavity on constrained optimization.
2 Apply sufficient conditions for maximizers of concave functions with concave constraints.
4.4.1
When is a separation argument possible? When does the tangent hyperplane separate the upper level set from the feasible set? Concavity is one way to guarantee this. There is a rich theory and toolset for optimization that is specific to concave functions. This is the field of concave programming. Here is our main result.
Theorem 4.12
Given an objective function f(x) and constraints g_j(x) ≥ 0, suppose (x*, λ*) satisfies the Kuhn-Tucker conditions:
L_{x_i}(x*, λ*) = 0 for all i
g_j(x*) ≥ 0 and λ*_j ≥ 0 with complementary slackness for each j
If f(x) and the g_j(x) are all concave functions, then x* maximizes f(x), subject to the constraints.
It is worth understanding the full argument of this theorem that follows, but the essential ideas are:
1 The upper level sets of a concave function are convex sets. This applies not only to f(x) ≥ c but also to the feasible set, which is an intersection of the upper level sets g_j(x) ≥ 0.
2 Using this convexity, we can show that the tangent hyperplane to f(x) = c separates the upper level set f(x) ≥ c from the feasible set.
4.4.2 Applying the Kuhn-Tucker Sufficient Condition
Solution
Increasing x₁ seems to be the most important factor in increasing f, but larger x₂ helps too. We should draw and examine the region S. The set x₂ + x₁² − 5 ≥ 0 appears to be nonconvex. On the other hand, based on our diagram, that inequality appears not to bind. We will instead maximize f over the region
T = {(x₁, x₂) : 25 − 7x₁ − x₂ ≥ 0, x₂ ≥ 0, and 25 − x₁² − x₂² ≥ 0}
S is a subset of T, so a maximizer in T that lies in S is a maximizer in S. This modified problem has the following Lagrangian:
L(x₁, x₂, λ₁, λ₂, λ₃) = 4x₁ + x₂ + λ₁(25 − 7x₁ − x₂) + λ₂x₂ + λ₃(25 − x₁² − x₂²).
The Kuhn-Tucker conditions are
L_{x₁} = 4 − 7λ₁ − 2λ₃x₁ = 0
L_{x₂} = 1 − λ₁ + λ₂ − 2λ₃x₂ = 0
25 − 7x₁ − x₂ ≥ 0,  λ₁ ≥ 0,  λ₁(25 − 7x₁ − x₂) = 0
x₂ ≥ 0,  λ₂ ≥ 0,  λ₂x₂ = 0
25 − x₁² − x₂² ≥ 0,  λ₃ ≥ 0,  λ₃(25 − x₁² − x₂²) = 0
We might guess that 25 − 7x₁ − x₂ = 0 and 25 − x₁² − x₂² = 0 are binding at the maximizer and x₂ ≥ 0 is not. Based on that guess, we first consider the case where λ₂ = 0. Next we use the binding constraints to solve for x₁ and x₂.
25 − 7x₁ − x₂ = 0 and 25 − x₁² − x₂² = 0
25 − 7x₁ = x₂
25 − x₁² − (25 − 7x₁)² = 0
25 − x₁² − 49x₁² + 350x₁ − 625 = 0
−50x₁² + 350x₁ − 600 = 0
−50(x₁ − 3)(x₁ − 4) = 0
x₁ = 3 or 4
25 − 7(3) = 4 = x₂ or 25 − 7(4) = −3 = x₂
We can look ahead and see that x₂ = −3 will not satisfy our inequalities, so we set (x₁, x₂) = (3, 4). We use the remaining equations to solve for λ₁ and λ₃.
4 − 7λ₁ − 2λ₃x₁ = 0 and 1 − λ₁ − 2λ₃x₂ = 0
4 − 7λ₁ − 6λ₃ = 0 and 1 − λ₁ − 8λ₃ = 0
λ₁ = 1 − 8λ₃
4 − 7(1 − 8λ₃) − 6λ₃ = 0
4 − 7 + 56λ₃ − 6λ₃ = 0
50λ₃ = 3
λ₃ = 3/50
λ₁ = 1 − 24/50 = 13/25
Our solution is (3, 4, 13/25, 0, 3/50). We check, and it satisfies the remaining inequalities.
λ₁ = 13/25 ≥ 0
x₂ = 4 ≥ 0
λ₃ = 3/50 ≥ 0
There are 7 other cases of complementary slackness to check, but we can avoid them. Our sufficiency
theorem applies to (3, 4). We only need to check for concavity of the relevant functions.
4x₁ + x₂ is concave because it is linear.
25 − 7x₁ − x₂ is concave because it is linear.
x₂ is concave because it is linear.
25 − x₁² − x₂² is strictly concave because its Hessian is
| −2   0 |
|  0  −2 |
which is negative definite for all (x₁, x₂).
We ignored the constraint x₂ + x₁² − 5 ≥ 0.
Based on these checks, the theorem applies, and (3, 4) must be a maximizer of f on T. Since (3, 4) satisfies
x₂ + x₁² − 5 = 4 + 9 − 5 ≥ 0
it lies in S as well. Since S ⊆ T, we conclude (3, 4) is also a maximizer of f in S. The maximum value is f(3, 4) = 4(3) + (4) = 16.
Main Idea
The most common methods to check concavity are
linear functions are concave
functions with negative definite Hessians are strictly concave
If the functions are concave, and we guess the right combination of binding constraints, then we only
need to check the Kuhn-Tucker conditions for that case.
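It is also easy to verify the candidate (3, 4, 13/25, 0, 3/50) by plugging it into each Kuhn-Tucker condition numerically; the sketch below mirrors the conditions listed in the solution above (the tolerances are our own choices).

```python
# Plug (x1, x2, l1, l2, l3) = (3, 4, 13/25, 0, 3/50) into the
# Kuhn-Tucker conditions from the worked example and check each one.
x1, x2 = 3.0, 4.0
l1, l2, l3 = 13 / 25, 0.0, 3 / 50

g1 = 25 - 7 * x1 - x2        # linear constraint
g2 = x2                      # sign constraint
g3 = 25 - x1**2 - x2**2      # disk constraint

checks = [
    abs(4 - 7 * l1 - 2 * l3 * x1) < 1e-12,   # L_x1 = 0
    abs(1 - l1 + l2 - 2 * l3 * x2) < 1e-12,  # L_x2 = 0
    g1 >= 0 and l1 >= 0 and abs(l1 * g1) < 1e-12,   # complementary slackness
    g2 >= 0 and l2 >= 0 and abs(l2 * g2) < 1e-12,
    g3 >= 0 and l3 >= 0 and abs(l3 * g3) < 1e-12,
]
print(all(checks))  # True
```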
Remark
The underlying separation argument was between the convex upper level set f(x₁, x₂) ≥ 16 and the convex set
T = {(x₁, x₂) : 25 − 7x₁ − x₂ ≥ 0, x₂ ≥ 0 and 25 − x₁² − x₂² ≥ 0}
Since (3, 4) is a maximizer on T , it is a maximizer on S.
Figure 4.26: A convex set T containing S, an upper level set, and a separating line
The reasoning of this example also suggests the following variant.
Variant of Theorem 4.12
Given an objective function f(x) and constraints g_j(x) ≥ 0, suppose (x*, λ*) satisfies the Kuhn-Tucker conditions.
If f(x) and the binding g_j(x) are all concave, then x* maximizes f(x), subject to the constraints.
4.4.3 Proving Kuhn-Tucker Sufficiency
This will be an extensive argument with many parts, but there are two reasons to give it our attention.
1 Some of the lemmas along the way are useful or interesting in their own right.
2 There are many ways to modify this argument for different circumstances. If you understand the
original argument, you can understand or even generate these variants.
We now know that if f is concave, then its upper level sets are convex. For a separation argument,
we also would like to know they do not cross their tangent hyperplane. Fortunately, this is the case.
Lemma 4.14
Let
f(x) be a continuously differentiable function,
a be a point such that f(a) = c and ∇f(a) ≠ 0,
h be the tangent hyperplane to f(x) = c at a.
If the upper level set f(x) ≥ c is convex, then it does not intersect h⁻.
Proof
Suppose b is any point in the upper level set f(x) ≥ c. We want to show that b does not lie in h⁻. Since the upper level set is convex, the entire segment
x(t) = (1 − t)a + tb = a + t(b − a),  0 ≤ t ≤ 1
must lie within it. This specifically requires that the direction vector x′(0) = b − a points into the upper level set at a. Lemma 2.14 tells us that if ∇f(a) · x′(0) < 0, then x(t) travels into the lower level set. Thus ∇f(a) · (b − a) ≥ 0. By Lemma 4.8, b does not lie in h⁻.
Figure 4.28: The vector b − a in a convex upper level set making an acute angle with ∇f(a)
Figure 4.29: The vector b − a leaving an upper level set and making an obtuse angle with ∇f(a)
Now we turn our attention to the constraint functions. Lemma 4.13 applies as well to each g_j as it does to f. If each function g_j(x) is concave, then each upper level set g_j(x) ≥ 0 is convex. With some very familiar looking conditions on ∇g_j(a), we can ensure that the feasible set stays on one side of the tangent hyperplane to f(x) = c.
Corollary 4.15
Let f(a) = c and ∇f(a) ≠ 0. Let h be the tangent hyperplane to f(x) = c at a. If
1 the upper level sets g_j(x) ≥ 0 are convex,
2 ∇f(a) = −Σ_j λ_j ∇g_j(a) for some numbers λ_j,
3 for each j, λ_j ≥ 0 if g_j(a) = 0 and λ_j = 0 otherwise,
then the intersection of the upper level sets g_j(x) ≥ 0 does not intersect h⁺.
The geometry is easiest to visualize if we consider a single constraint g(x) ≥ 0. In this case, ∇f(a) = −λ∇g(a). For any b in g(x) ≥ 0, the vector b − a makes an acute angle with ∇g(a). Thus it makes an obtuse angle with ∇f(a), meaning b lies in h⁻.
Figure 4.30: The vector b − a in a convex feasible set making an obtuse angle with ∇f(a)
For multiple constraints, angles are too difficult to discern. The dot product provides a cleaner
argument.
Proof
Let b be a point in the intersection of the upper level sets. For each g_j(x) that is binding at a, we can apply Lemma 4.14 to the convex upper level set g_j(x) ≥ 0. As that proof argued, we can conclude that ∇g_j(a) · (b − a) ≥ 0.
By Lemma 4.8, the sign of ∇f(a) · (b − a) will tell us whether b lies in h⁺. We can substitute as
follows:
∇f(a) · (b − a) = (−Σ_j λ_j ∇g_j(a)) · (b − a) = −Σ_j λ_j ∇g_j(a) · (b − a)
We can determine the sign of each term of the summation.
1 For nonbinding g_j, λ_j = 0.
2 For binding g_j, λ_j ≥ 0 and ∇g_j(a) · (b − a) ≥ 0.
We conclude that ∇f(a) · (b − a) ≤ 0. Thus b does not lie in h⁺.
We now have the ingredients to prove Theorem 4.12. There are two cases of complementary slackness
to consider when proving this theorem.
1 If all λ*_j = 0, then ∇f(x*) = 0. Since f is concave, a variant of Corollary 1.23 tells us that x* is an unconstrained maximizer of f. Corollary 2.2 tells us that since it lies in the feasible set, it is also a maximizer there.
2 If some λ*_j > 0, then we can put together the results of this section to conclude the upper level set of f and the feasible set lie on opposite sides of h. This is the requirement for optimization by separation established in Theorem 4.10, so x* is a maximizer.
Here is the chain of reasoning for each case.
If some binding λ*_j > 0:
f(x) is concave ⟹ f(x) ≥ c is convex (Lem 4.13) ⟹ f(x) ≥ c does not intersect h⁻ (Lem 4.14)
g_j(x) are concave ⟹ g_j(x) ≥ 0 are convex (Lem 4.13); together with ∇f(x*) = −Σ λ*_j ∇g_j(x*), the feasible set does not intersect h⁺ (Cor 4.15)
Therefore x* is a max in the feasible set (Thm 4.10).
If all λ*_j = 0:
∇f(x*) = 0 ⟹ x* is a max (Cor 1.38) ⟹ x* is a max in the feasible set (Cor 2.2)
4.4.4 Variants of Kuhn-Tucker Sufficiency
In a separation argument, the upper level set and the feasible set may meet at many points in h. For example, we could have an entire line segment of intersection, and every point on that segment would satisfy the Kuhn-Tucker conditions.
Figure 4.31: An upper level set of f and the feasible set intersecting along a separating line
This pathology only exists if both sets contain multiple points on the separating hyperplane. If one of the sets is strictly convex, this will not happen. We can achieve this with strict concavity. Each lemma we used has a variant for strict concavity.
Variant of Lemma 4.13
If f(x) is a strictly concave function then the upper level set f(x) ≥ c is strictly convex.
Variant of Lemma 4.14
Suppose we have
a continuously differentiable function f(x),
a point a such that f(a) = c and ∇f(a) ≠ 0,
the tangent hyperplane to f(x) = c at a, denoted h.
If the upper level set f(x) ≥ c is strictly convex, then it lies entirely within h⁺, except for the point a.
Variant of Corollary 4.15
Let f(a) = c and ∇f(a) ≠ 0. Let h be the tangent hyperplane to f(x) = c at a. If
1 the upper level sets g_j(x) ≥ 0 are strictly convex,
2 ∇f(a) = −Σ_j λ_j ∇g_j(a) for some numbers λ_j,
3 for each j, λ_j ≥ 0 if g_j(a) = 0 and λ_j = 0 otherwise,
then the intersection of the upper level sets g_j(x) ≥ 0 lies entirely within h⁻ except the point a.
We can use these lemmas to guarantee a unique maximizer. We can either keep the upper level set
or the feasible set from having multiple points on h.
Theorem 4.12 for a Unique Maximizer
Given an objective function f(x) and constraints g_j(x) ≥ 0, suppose (x*, λ*) satisfies the Kuhn-Tucker conditions. If f(x) and the binding g_j(x) are concave, and additionally either
1 f(x) is strictly concave, or
2 at least one constraint binds and the binding g_j(x) are strictly concave,
then x* is the unique maximizer of f(x), subject to the constraints.
Another avenue of modification is to include equality constraints. One method is to treat the equality constraint as an inequality constraint. The level set is a subset of the upper level set. By Corollary 2.2, a maximizer over an inequality constraint that happens to bind is also a maximizer over the equality constraint. This requires the constraint function to be concave and its λ_j to be positive.
Alternately, if the equality constraint is linear, then its level set is a hyperplane, which is convex. Thus the feasible set is still convex. Corollary 4.15 still holds regardless of the sign of λ_j, because ∇g_j(a) · (b − a) = 0. We can formalize this possibility with the following variant.
Theorem 4.12 with Equality Constraints
Given an objective function f(x) and constraints of the forms g_j(x) ≥ 0 and g_j(x) = 0, suppose (x*, λ*) satisfies the Kuhn-Tucker conditions.
If f(x) and the binding g_j(x) are all concave, and the equality constraints are linear, then x* maximizes f(x), given the constraints.
This section contains just a few possibilities. There are other ways to modify our sufficiency theorems
that allow them to apply in more situations or to draw stronger conclusions.
4.4.5 Slater's Condition
Concavity has produced a convenient sufficient condition for constrained optimization. It can also help simplify our necessary conditions. Recall that the Kuhn-Tucker conditions are a necessary condition, but they have an exception. Theorem 2.23 states that a maximizer a must either satisfy the Kuhn-Tucker conditions or have binding ∇g_j(a) that are linearly dependent.
Checking for linearly dependent gradient vectors is difficult, especially if we do not know where to
look for them. Slater’s condition allows us to skip that check in some situations.
Slater's condition requires that the objective and constraint functions are concave. It also requires the feasible set to have an interior, rather than collapsing to a lower dimensional set. Formally, it demands the existence of a point b in the interior. This point is not special. If the feasible region has an interior, then you can identify infinitely many points inside it by looking at a diagram.
Theorem 4.16 [Slater's Condition]
Suppose that the functions f(x) and g_j(x) satisfy Slater's condition:
1 f(x) and the g_j(x) are all concave functions
2 there is at least one point b in the interior of the feasible set, meaning g_j(b) > 0 for all j
If a is a local maximizer of f(x) subject to g_j(x) ≥ 0, then a satisfies the Kuhn-Tucker conditions.
Remark
Slater's condition checks for concavity, just like Theorem 4.12 (our sufficient condition). This means that if f(x) and g_j(x) satisfy Slater's condition, then the Kuhn-Tucker conditions are both necessary and sufficient for a maximizer.
Slater's condition doesn't prevent the gradient vectors of the binding g_j from being linearly dependent. Instead it argues that even if the gradient vectors are linearly dependent at a, the Kuhn-Tucker conditions must still be satisfied in order for a to be a maximizer.
Slater’s condition can be strengthened to handle equality constraints in the case that those are linear.
Variant of Theorem 4.16
Suppose that we are maximizing f(x) subject to constraints of the form g_j(x) ≥ 0 or g_j(x) = 0. Suppose further that
1 f(x) and the inequality constraint functions g_j(x) are concave
2 the equality constraint functions g_j(x) are linear
3 there is at least one point b in the relative interior of the feasible set, meaning
g_j(b) > 0 for all inequality constraints
g_j(b) = 0 for all equality constraints
If a is a local maximizer, then it satisfies the Kuhn-Tucker conditions.
4.4.6 The Separating Hyperplane Theorem
Throughout this section, we have used tangent hyperplanes to separate convex sets. It is possible to
make these arguments without separation, however. Here is a short alternative proof of Theorem 4.12.
Proof
x* satisfies the first-order condition of a stubborn Lagrangian, in which λ is held constant at λ*.
L*(x) = f(x) + Σ_j λ*_j g_j(x)
Since λ*_j ≥ 0, L* is a sum of concave functions. By Theorem 1.20, L* is concave. Thus x* is a maximizer of L*. x* also satisfies λ*_j g_j(x*) = 0 by complementary slackness. If we compare x* and any other feasible point a, we have
L*(x*) ≥ L*(a)
f(x*) + Σ_j λ*_j g_j(x*) ≥ f(a) + Σ_j λ*_j g_j(a)
f(x*) + 0 ≥ f(a) + Σ_j λ*_j g_j(a), where each λ*_j ≥ 0 and g_j(a) ≥ 0
f(x*) ≥ f(a)
Since this holds for any feasible a, we can conclude that x* is a maximizer of f(x) subject to g_j(x) ≥ 0.
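We can test this chain of inequalities numerically on the example from Section 4.4.2, sampling feasible points of T on a grid (the grid spacing is our own choice, not from the text).

```python
# Test the "stubborn Lagrangian" inequality on the worked example:
# f = 4*x1 + x2, lambda* = (13/25, 0, 3/50), x* = (3, 4).
# For every feasible a we expect L*(x*) >= L*(a) and f(x*) >= f(a).

def f(x1, x2):
    return 4 * x1 + x2

def g(x1, x2):
    return (25 - 7 * x1 - x2, x2, 25 - x1**2 - x2**2)

lams = (13 / 25, 0.0, 3 / 50)
xs = (3.0, 4.0)

def L_star(x1, x2):
    # Lagrangian with the multipliers held fixed at lambda*
    return f(x1, x2) + sum(l * gj for l, gj in zip(lams, g(x1, x2)))

# grid of feasible points in T
feasible = [(a / 10, b / 10)
            for a in range(-50, 51) for b in range(0, 51)
            if all(gj >= 0 for gj in g(a / 10, b / 10))]

ok = all(L_star(*xs) >= L_star(a1, a2) and f(*xs) >= f(a1, a2)
         for a1, a2 in feasible)
print(ok)  # True
```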
This argument does not establish a separating hyperplane, even though we know one exists from our
longer proof. It turns out that in any successful maximization argument for a concave function over a
convex feasible set, the hyperplane is there, whether we use it or not. The following famous theorem
guarantees that.
Theorem 4.17 [The Separating Hyperplane Theorem]
If S and U are two convex sets, at least one has a non-empty interior, and they share no interior points in common, then there is a vector v and a number k such that
1 for all x in U, v · x ≥ k
2 for all x in S, v · x ≤ k
Remark
This theorem tells us that if a convex upper level set U and a convex feasible set S meet only at their
boundaries, then the hyperplane v · x = k separates them. It does not tell us what v is or how to
construct it.
As you might infer from this remark, the applications of the separating hyperplane theorem tend to
be abstract. The proof requires ideas from analysis.
4.4.7
The most important definitions and results from this section were
Upper level sets of concave functions are convex (Lemma 4.13)
Convex upper level sets lie in one half space of their tangent hyperplanes (Lemma 4.14)
Kuhn-Tucker sufficiency for concave functions (Theorem 4.12)
Slater’s condition (Theorem 4.16)
This is also a good time to summarize the sufficient conditions we have covered. They differ in when
they can be applied and what conclusion they draw.
4.5.2 Quasiconcave Functions
Proof
There are two directions of argument that we need to make.
1 Suppose that f(x) is quasiconcave. Let a and b be in the upper level set f(x) ≥ c. By definition, we know that f(a) ≥ c and f(b) ≥ c. The points on the segment from a to b are parametrized by
(1 − t)a + tb,  0 ≤ t ≤ 1
If we evaluate f along these points we get
f((1 − t)a + tb) ≥ min{f(a), f(b)}  (f is quasiconcave)
                ≥ c  (a and b lie in the upper level set)
Thus the segment lies in the upper level set f(x) ≥ c. Since this holds for any a and b in any upper level set, we conclude that every upper level set of f(x) is convex.
2 Now suppose that every upper level set of f(x) is convex. Let a and b be any points in the domain. Let c = min{f(a), f(b)}, so a and b both lie in the upper level set f(x) ≥ c. Since the upper level set is convex, the segment between them
(1 − t)a + tb,  0 ≤ t ≤ 1
lies in this set. Thus for 0 ≤ t ≤ 1,
f((1 − t)a + tb) ≥ c = min{f(a), f(b)}
Since this holds for all a and b, we conclude f(x) is quasiconcave.
4.5.3 Quasiconcave Programming
Quasiconcave programming describes the methods we use to solve constrained optimization when
the objective functions and constraint functions are quasiconcave. Most methods are analogs of results
for concavity, but because quasiconcavity is a weaker condition, they often need additional conditions
or draw weaker conclusions. We begin with a result for unconstrained optimization.
Theorem 4.22
If x* is a strict local maximizer of a quasiconcave function f(x), then x* is the unique global maximizer of f(x).
Figure 4.38: A segment, a neighborhood in which x* is a maximizer, and a point a that lies in both
Proof
Let b be any other point in the domain of f(x). x* is the unique maximizer in some neighborhood, and the segment from x* to b travels through this neighborhood. Let a ≠ x* be a point that lies both on the segment and in the neighborhood.
f(a) ≥ min{f(x*), f(b)}  (f(x) is quasiconcave)
f(x*) > min{f(x*), f(b)}  (f(x*) > f(a))
f(x*) > f(b)  (f(x*) is not less than itself)
Since this holds for any b, we conclude that x* is the unique maximizer of f(x).
Remark
For a quasiconcave function, knowing a is a critical point is not enough to conclude that a is a maximizer. For instance f(x) = x³ is increasing, so it is quasiconcave. 0 is a critical point but not a maximizer.
This means that our Kuhn-Tucker sufficiency theorem cannot translate directly to quasiconcavity, because there is no way to cover the ∇f(x*) = 0 case. The simplest workaround is to write a theorem that only applies when ∇f(x*) ≠ 0.
Theorem 4.23
Given an objective function f(x) and constraints g_j(x) ≥ 0, suppose ∇f(x*) ≠ 0 and (x*, λ*) satisfies the Kuhn-Tucker conditions. If f(x) and the g_j(x) are all quasiconcave, then x* maximizes f(x), subject to the constraints.
We can use almost the same argument that we used for the λ_j > 0 case with concave functions. Only the first steps need to change. Once we establish that the upper level sets are convex, the rest of the proof is identical to Theorem 4.12.
With some binding λ*_j > 0:
f(x) is quasiconcave ⟹ f(x) ≥ c is convex (Lem 4.21) ⟹ f(x) ≥ c does not intersect h⁻ (Lem 4.14)
g_j(x) are quasiconcave ⟹ g_j(x) ≥ 0 are convex (Lem 4.21); together with ∇f(x*) = −Σ λ*_j ∇g_j(x*), the feasible set does not intersect h⁺ (Cor 4.15)
Therefore x* is a max in the feasible set (Thm 4.10).
4.5.4 Positive Transformations and Quasiconcavity
The condition of quasiconcavity is ordinal rather than cardinal in nature. We care that certain points attain greater values than others, but not how much greater. Contrast this to concavity, where f((1 − t)a + tb) has to be at least t(f(b) − f(a)) greater than f(a).
For this reason, quasiconcavity is a relevant property of utility functions. The values of a utility function reflect no inherent information beyond relative preferences. We can rescale a quasiconcave function f(x) without affecting whether f(a) > f(b). The following definition and theorem formalize this flexibility.
Definition 4.24
A function g(x) is a positive transformation of a function f(x) if there is an increasing function p(x) such that g(x) = p(f(x)).
208
Example
Consider the function f(x₁, x₂) = x₁x₂ on the domain ℝ²₊ = {(x₁, x₂) : x₁ > 0, x₂ > 0}.
p(x) = ln x is an increasing function, so
g(x₁, x₂) = p(f(x₁, x₂)) = ln x₁ + ln x₂
is a positive transformation of f.
Positive transformation is a symmetric relation. If p(x) is increasing, then so is p⁻¹(x). That means that if g(x) = p(f(x)) is a positive transformation of f(x), then f(x) = p⁻¹(g(x)) is also a positive transformation of g(x).
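This symmetry is easy to check numerically. Here is a minimal sketch using p(x) = ln x and its inverse p⁻¹(x) = eˣ (the sample points are arbitrary choices):

```python
import math

def f(x1, x2):
    return x1 * x2

def g(x1, x2):
    # positive transformation of f via the increasing function p(x) = ln x
    return math.log(f(x1, x2))

samples = [(1, 2), (0.5, 0.5), (3, 4), (2, 2)]

# p is increasing, so f and g rank any two points the same way.
for a in samples:
    for b in samples:
        assert (f(*a) > f(*b)) == (g(*a) > g(*b))

# p^(-1)(x) = e^x is also increasing, and recovers f from g.
assert all(abs(math.exp(g(*a)) - f(*a)) < 1e-9 for a in samples)
print("symmetry verified")
```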
Theorem 4.25
Let f (x) be a function, and let g(x) be a positive transformation of f(x). f(x) is quasiconcave if and
only if g(x) is quasiconcave.
Proof
Since p(x) is an increasing function, larger values of f(x) correspond to larger values of g(x). This means that
f((1 − t)a + tb) ≥ f(a) if and only if g((1 − t)a + tb) ≥ g(a),
and f((1 − t)a + tb) ≥ f(b) if and only if g((1 − t)a + tb) ≥ g(b).
Putting these together gives
f((1 − t)a + tb) ≥ min{f(a), f(b)} if and only if g((1 − t)a + tb) ≥ min{g(a), g(b)}.
If f(x) is quasiconcave, then this inequality is satisfied for all a and b in the domain and t in [0, 1]. This means that g(x) is also quasiconcave. If f(x) is not quasiconcave, then this inequality is not satisfied for some a, b and t. That means that g(x) is also not quasiconcave.
We will use this theorem to verify that positive transformations of concave functions are quasiconcave.
Here is an example.
Example
Consider the function f(x₁, x₂) = x₁x₂ on the domain ℝ²₊ = {(x₁, x₂) : x₁ > 0, x₂ > 0}.
We can reason that f(x₁, x₂) is quasiconcave as follows:
1. g(x₁, x₂) = ln x₁ + ln x₂ is a positive transformation of f(x₁, x₂).
2. Apply Theorem 1.41. g(x₁, x₂) is concave, because its Hessian is negative definite:
   Hg(x₁, x₂) = [ −1/x₁²   0      ]
                [ 0        −1/x₂² ]
3. Apply Theorem 4.19. Since g(x₁, x₂) is concave, it is also quasiconcave.
4. Apply Theorem 4.25. Since g(x₁, x₂) is quasiconcave, f(x₁, x₂) is quasiconcave.
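The conclusion of this example can be sanity-checked numerically. A sketch in Python (the sample counts, seed, and t values are arbitrary choices):

```python
import random

def f(x1, x2):
    return x1 * x2

random.seed(0)
pts = [(random.uniform(0.1, 5.0), random.uniform(0.1, 5.0)) for _ in range(100)]

def seg(a, b, t):
    # the point (1 - t)a + t*b on the segment from a to b
    return ((1 - t) * a[0] + t * b[0], (1 - t) * a[1] + t * b[1])

# Quasiconcavity: f on the segment is at least the minimum of the endpoint values.
quasi = all(
    f(*seg(a, b, t)) >= min(f(*a), f(*b)) - 1e-9
    for a in pts for b in pts for t in (0.25, 0.5, 0.75)
)

# Concavity would require f(seg) >= the average of the endpoint values;
# along the diagonal a = (1, 1), b = (3, 3) this fails, since f(2, 2) = 4 < 5.
a, b = (1.0, 1.0), (3.0, 3.0)
concave_here = f(*seg(a, b, 0.5)) >= 0.5 * (f(*a) + f(*b))
print(quasi, concave_here)   # True False
```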
Remark
Notice this reasoning does not apply to concavity. g(x₁, x₂) = ln x₁ + ln x₂ is concave, but f(x₁, x₂) = x₁x₂ is not concave.
To use this method, we must find a function g(x) that is a positive transformation of f(x) and is
also concave. If we do not have an obvious candidate, random guessing is not practical. We would like
a more straightforward procedure.
4.5.5 The Bordered Hessian and Quasiconcavity
We can verify quasiconcavity by direct computation. We will derive that method here.
Recall Lemma 4.14. It showed that convex upper level sets lie in the positive half-space of their
tangent hyperplanes. Another way to say this is that a is the maximizer of f(x) among points on the
tangent hyperplane to its level set.
Figure 4.39: A maximizer of f(x) on the tangent line to its level set
Figure 4.40: A local minimizer of f(x) on the tangent line to its level set
We can prove a reverse version of this lemma with some modifications.
1 We generalize the tangent hyperplane to “points that satisfy ∇f (a) · (x −a) = 0.”
2 We require that each point is a strict local maximizer subject to that constraint.
If this holds for all a, we can conclude that f(x) is quasiconcave.
Lemma 4.26
Suppose f(x) is a continuously differentiable function on a convex domain. If every a in the domain of
f(x) is a strict local maximizer of f(x) subject to ∇f(a) · (x −a) = 0, then f(x) is quasiconcave.
The proof shows that f(x) satisfies the inequality that defines quasiconcavity.
Proof
Let a and b be points in the domain of f(x). In order to show f(x) is quasiconcave, we need to evaluate f(x) along the segment
x(t) = (1 − t)a + tb,  0 ≤ t ≤ 1.
Consider the composition f(x(t)). By the extreme value theorem, this has a minimizer on [0, 1]. Let 0 < t₀ < 1. We claim that t₀ is not a minimizer of f(x(t)). There are two cases:
1. If df(x(t₀))/dt ≠ 0, then t₀ does not satisfy the first-order condition, so it is not a minimizer.
2. If df(x(t₀))/dt = 0, then by the chain rule
∇f(x(t₀)) · x′(t₀) = 0
∇f(x(t₀)) · (b − a) = 0
For any t, the vector x(t) − x(t₀) is parallel to b − a. Thus every x(t) satisfies ∇f(x(t₀)) · (x(t) − x(t₀)) = 0. By the hypothesis of this lemma, x(t₀) is a strict local maximizer subject to ∇f(x(t₀)) · (x − x(t₀)) = 0. We can conclude that t₀ is not a minimizer of f(x(t)).
Since there is a minimizer of f(x(t)) on [0, 1], and it cannot occur for 0 < t < 1, the minimizer must be t = 0 or t = 1. That means that
f((1 − t)a + tb) ≥ min{f(a), f(b)} for all 0 ≤ t ≤ 1.
Since this holds for any a and b, we conclude f(x) is quasiconcave.
Figure 4.41: A critical point of the composition f(x(t)), its gradient, and its upper level set
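The key step of this proof, that a quasiconcave function attains its minimum over a segment at an endpoint, can be observed numerically. A sketch for f(x₁, x₂) = x₁x₂ (the endpoints and step count are arbitrary choices):

```python
def f(x1, x2):
    return x1 * x2

a, b = (0.5, 4.0), (3.0, 1.0)
ts = [t / 1000 for t in range(1001)]
values = [
    f((1 - t) * a[0] + t * b[0], (1 - t) * a[1] + t * b[1])
    for t in ts
]

# No interior point of the segment is a minimizer of f(x(t)):
# the minimum over [0, 1] is attained at t = 0 or t = 1.
print(min(values) == min(values[0], values[-1]))   # True
```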
According to this lemma, we now have a new way to verify quasiconcavity. We must check that every point a is a strict local maximizer of f(x) subject to ∇f(a) · (x − a) = 0. We use different methods at different points a. Which method we use depends on ∇f(a).
1. If ∇f(a) = 0 then every point x satisfies the constraint, making it meaningless. We can check that a is an unconstrained strict local maximizer by checking the determinants of the minors of Hf(a) (Theorem 1.31).
2. If ∇f(a) ≠ 0 then the Lagrangian of f(x) subject to ∇f(a) · (x − a) = 0 is
L(λ, x) = f(x) + λ(∇f(a) · (x − a)) = f(x) + λ Σⁿᵢ₌₁ fᵢ(a)(xᵢ − aᵢ).
We can check the determinants of the minors of HL(λ, a) (Theorem 4.6).
The Hessian of
L(λ, x) = f(x) + λ Σⁿᵢ₌₁ fᵢ(a)(xᵢ − aᵢ)
has a memorable form.
Example
For n = 2, the bordered Hessian is
HL(λ, x) = [ 0      f₁(a)   f₂(a)  ]
           [ f₁(a)  f₁₁(x)  f₁₂(x) ]
           [ f₂(a)  f₂₁(x)  f₂₂(x) ]
Evaluating at x = a gives
HL(λ, a) = [ 0      f₁(a)   f₂(a)  ]
           [ f₁(a)  f₁₁(a)  f₁₂(a) ]
           [ f₂(a)  f₂₁(a)  f₂₂(a) ]
Due to its importance in this test, we call HL(λ, x) the bordered Hessian of f. We denote it BHf(x), since it does not depend on λ.
Theorem 4.27
Suppose f(x) is a continuously differentiable function on a convex domain. If for each x in the domain either
1. Hf(x) satisfies (−1)ⁱ|Mᵢ| > 0 for 1 ≤ i ≤ n, or
2. BHf(x) satisfies (−1)ⁱ|Mᵢ| < 0 for 2 ≤ i ≤ n + 1,
then f(x) is a quasiconcave function.
Here is a diagram of the steps of the proof.
- Either Hf(a) satisfies the alternating condition (local second-order condition), or BHf(a) satisfies the alternating condition (Thm 4.6).
- In either case, a is a strict local maximizer subject to ∇f(a) · (x − a) = 0.
- By Lem 4.26, f(x) is quasiconcave.
The requirement that a be a strict local maximizer of f (x) subject to ∇f(a) ·(x −a) = 0 is actually
stronger than we need. We could make a valid argument only requiring a nonstrict local maximizer.
This would add some extra complexity to the proof of Lemma 4.26. In exchange we would gain the
ability to use the test for a negative semidefinite matrix instead of negative definite (Theorem 1.42).
That test is unpleasant to perform, so such a result may not be worth the effort.
4.5.6 Verifying Quasiconcavity Using the Bordered Hessian
Let f(x₁, x₂) = x₁x₂ on the domain ℝ²₊ = {(x₁, x₂) : x₁ > 0, x₂ > 0}. Show that f(x₁, x₂) is quasiconcave.
Solution
We need to check the Hessian where ∇f(x₁, x₂) = (x₂, x₁) = (0, 0). This only occurs at (0, 0), which is not in the domain. Everywhere else we can check the bordered Hessian.
BHf(x) = [ 0      f₁(x)   f₂(x)  ]   [ 0    x₂   x₁ ]
         [ f₁(x)  f₁₁(x)  f₁₂(x) ] = [ x₂   0    1  ]
         [ f₂(x)  f₂₁(x)  f₂₂(x) ]   [ x₁   1    0  ]
The minor determinants we need to check are
|M₂| = | 0    x₂ | = −x₂² < 0
       | x₂   0  |

|M₃| = | 0    x₂   x₁ |
       | x₂   0    1  | = 0 − x₂(x₂·0 − 1·x₁) + x₁(x₂·1 − 0·x₁) = 2x₁x₂ > 0
       | x₁   1    0  |
For all x₁, x₂ > 0, this satisfies (−1)ⁱ|Mᵢ| < 0 for 2 ≤ i ≤ 3. We conclude that f(x₁, x₂) = x₁x₂ is quasiconcave on ℝ²₊.
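The determinant computations above are easy to replicate in code. A minimal sketch with a hand-rolled determinant, checking the alternating condition at a few sample points (the points themselves are arbitrary choices):

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    # cofactor expansion along the first row
    return (m[0][0] * det2([row[1:] for row in m[1:]])
            - m[0][1] * det2([[row[0], row[2]] for row in m[1:]])
            + m[0][2] * det2([row[:2] for row in m[1:]]))

def BHf(x1, x2):
    # bordered Hessian of f(x1, x2) = x1*x2:
    # f1 = x2, f2 = x1, f11 = f22 = 0, f12 = f21 = 1
    return [[0.0, x2, x1],
            [x2, 0.0, 1.0],
            [x1, 1.0, 0.0]]

for x1, x2 in [(1.0, 1.0), (0.5, 3.0), (2.0, 4.0)]:
    bh = BHf(x1, x2)
    m2 = det2([row[:2] for row in bh[:2]])   # |M2| = -x2^2
    m3 = det3(bh)                            # |M3| = 2*x1*x2
    # alternating condition: (-1)^i |Mi| < 0 for i = 2, 3
    print(m2 < 0 and -m3 < 0)                # True at each sample point
```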
4.5.7 Strict Quasiconcavity
Like with strict concavity, strict quasiconcavity is defined by taking the inequality condition of quasiconcavity and making the inequalities strict.
Definition 4.28
A function f(x) is strictly quasiconcave if for any distinct points a and b in the domain of f(x),
f((1 − t)a + tb) > min{f(a), f(b)} for all 0 < t < 1.
It is strictly quasiconvex if
f((1 − t)a + tb) < max{f(a), f(b)} for all 0 < t < 1.
We can see from this definition that it is stronger than quasiconcavity. Every strictly quasiconcave
function is quasiconcave.
The condition on the upper level sets is more complicated.
Variant of Lemma 4.21
f(x) is a strictly quasiconcave function if and only if
1. every upper level set f(x) ≥ c is a strictly convex set, and
2. there are no neighborhoods on which f(x) is constant.
A function with a thick level set will fail condition 2.
Figure 4.42: A thick level set of f(x)
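Failure of condition 2 is easy to exhibit numerically. Here is a sketch using a made-up function whose level set at 0 is the thick interval [−1, 1]:

```python
def f(x):
    # constant (equal to 0) on the thick level set [-1, 1], decreasing outside it
    return -max(abs(x) - 1.0, 0.0)

a, b, t = -0.5, 0.5, 0.5
mid = (1 - t) * a + t * b   # = 0.0, inside the flat region

print(f(mid) >= min(f(a), f(b)))   # True: the weak inequality holds
print(f(mid) > min(f(a), f(b)))    # False: strictness fails, so f is
                                   # quasiconcave but not strictly quasiconcave
```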
As we might expect, strict quasiconcavity is a weaker condition than strict concavity.
Variant of Theorem 4.19
If a function f(x) is strictly concave, then it is also strictly quasiconcave.
We can summarize the relationships between the different forms of (quasi)concavity in the following diagram.
f(x) is strictly concave ⇒ f(x) is concave
        ⇓                        ⇓
f(x) is strictly quasiconcave ⇒ f(x) is quasiconcave
Like with strict concavity, strict quasiconcavity can ensure that the Kuhn-Tucker conditions generate a unique maximizer.
Variant of Theorem 4.23
Given an objective function f(x) and constraints gⱼ(x) ≥ 0, suppose ∇f(x∗) ≠ 0 and (x∗, λ∗) satisfies the Kuhn-Tucker conditions. If f(x) and the binding gⱼ(x) are quasiconcave, and additionally either
1. f(x) is strictly quasiconcave, or
2. the binding gⱼ(x) are strictly quasiconcave,
then x∗ is the unique maximizer of f(x), subject to the constraints.
>