
I have for some time been trawling through the Internet looking for an aesthetic proof of Taylor’s theorem.

By which I mean this: there are plenty of proofs that introduce some arbitrary construct, with no mention of where this beast came from, and then let you logically hack away line by line until the thing is proved. But this kind of proof is ugly. A beautiful proof should rise naturally from the ground.

I’ve seen one proof claiming to do it from the fundamental theorem of calculus. It looked messy.

I’ve seen several attempts to use integration by parts repeatedly. But surely it would be tidier to do this without bringing in all of that extra machinery.

The nicest two approaches seem to involve using the mean value theorem and Rolle's theorem, but I can't find a lucid presentation of either approach.

Maybe my brain is unusually stupid, and the approaches on Wikipedia etc are perfectly good enough for everyone else.

Does anyone have a crystal clear understanding of this phenomenon? Or a web-link to such an understanding?

*EDIT*: Eventually a Cambridge mathematician explained it to me in a way that I could understand, and I have written up the proof here. To my mind it is the most instructional proof I have encountered, yet putting it as an answer received mostly downvotes. It seems strange to me that no one else seems to concur. But it should be up to the keenest mathematical minds to choose which answer should be accepted. It shouldn’t be up to me. Therefore I will bow to the wisdom of the community, and accept the currently most-upvoted answer. I have learned from Machine Learning that a “Committee of Experts” outperforms any one expert, and I am certainly no expert.


Here is an approach that seems rather natural, based on applying the fundamental theorem of calculus successively to $f$, $f'$, $f''$, etc.:

Notice that
$$f(x)=f(a)+\int_a^x f'(t_1)\,dt_1$$
and in general
$$f^{(k)}(t_k)=f^{(k)}(a)+\int_a^{t_k} f^{(k+1)}(t_{k+1})\,dt_{k+1}.$$

By induction, then, one proves
$$f(x)=P_n(x)+R_n(x),$$
where $P_n$ is the Taylor polynomial
$$P_n(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k$$
and the remainder $R_n$ is represented by nested integrals as
$$R_n(x)=\int_a^x\int_a^{t_1}\cdots\int_a^{t_n} f^{(n+1)}(t_{n+1})\,dt_{n+1}\cdots dt_1.$$

We can establish the Lagrange form of the remainder by applying the intermediate and extreme value theorems, using simple comparisons as follows. Consider the case $x>a$ first. Let $m$ be the minimum value of $f^{(n+1)}$ on $[a,x]$, and $M$ the maximum value. Then since
$$m\le f^{(n+1)}(t_{n+1})\le M$$
for all $t_{n+1}$ in $[a,x]$, after repeated integrations one finds
$$m\,\frac{(x-a)^{n+1}}{(n+1)!}\;\le\;R_n(x)\;\le\;M\,\frac{(x-a)^{n+1}}{(n+1)!}.$$

But now, notice that the function
$$t\mapsto f^{(n+1)}(t)\,\frac{(x-a)^{n+1}}{(n+1)!}$$
attains the extreme values
$$m\,\frac{(x-a)^{n+1}}{(n+1)!}\qquad\text{and}\qquad M\,\frac{(x-a)^{n+1}}{(n+1)!}$$
at some points in $[a,x]$. By the intermediate value theorem, there must be some point $c$ between these two points (so $c\in[a,x]$) such that
$$R_n(x)=f^{(n+1)}(c)\,\frac{(x-a)^{n+1}}{(n+1)!}.$$

This is the Lagrange form of the remainder. If $x<a$ and $n$ is odd, the same proof works. If $x<a$ and $n$ is even, then $(x-a)^{n+1}<0$ and the same proof works after reversing some inequalities.

One can motivate this whole approach in a couple of different ways. E.g., one can argue that $\frac{(x-a)^n}{n!}$ becomes small for large $n$, so the remainders will become small if the derivatives of $f$ stay bounded, say.

Or, one can reason loosely as follows: $f(x)\approx f(a)$ for $x$ near $a$. Ask, what is the remainder exactly? Apply the fundamental theorem as above, then approximate the first remainder using the approximation $f'(t_1)\approx f'(a)$. Repeating, one produces the Taylor polynomials by the pattern of the argument above.
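As a quick sanity check on the formulas above, here is a small Python sketch (my addition, not part of the answer; it assumes SciPy is available) that evaluates the nested-integral remainder recursively for $f=\exp$, $a=0$, $n=2$, where every derivative of $f$ is again $\exp$:

```python
import math
from scipy.integrate import quad

a, x, n = 0.0, 1.0, 2   # expand exp about a = 0

def taylor_poly(x):
    # P_n(x) = sum_{k=0}^n f^(k)(a) (x-a)^k / k!, and f^(k) = exp for f = exp
    return sum(math.exp(a) * (x - a)**k / math.factorial(k) for k in range(n + 1))

def nested_remainder(upper, depth):
    # R_n(x) is n+1 nested integrals of f^(n+1) = exp; `depth` integrals remain
    if depth == 0:
        return math.exp(upper)
    return quad(lambda t: nested_remainder(t, depth - 1), a, upper)[0]

print(taylor_poly(x) + nested_remainder(x, n + 1))   # ≈ 2.718281828...
print(math.exp(x))                                   # agrees with f(x)
```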


The clearest proof one can find, in my opinion, is the following. Note it is just a generalized mean value theorem!

THM Let $f,g$ be functions defined on a closed interval $[a,b]$ that admit finite $n$-th derivatives on $(a,b)$ and continuous $(n-1)$-th derivatives on $[a,b]$. Suppose $c\in[a,b]$. Then for each $x\ne c$ in $[a,b]$ there exists $\xi$ in the segment joining $x$ and $c$ such that
$$\left[f(x)-\sum_{k=0}^{n-1}\frac{f^{(k)}(c)}{k!}\,(x-c)^k\right]g^{(n)}(\xi)=\left[g(x)-\sum_{k=0}^{n-1}\frac{g^{(k)}(c)}{k!}\,(x-c)^k\right]f^{(n)}(\xi).$$

PROOF For simplicity assume $c<b$ and $x>c$. Keep $x$ fixed and consider
$$F(t)=\sum_{k=0}^{n-1}\frac{f^{(k)}(t)}{k!}\,(x-t)^k,\qquad G(t)=\sum_{k=0}^{n-1}\frac{g^{(k)}(t)}{k!}\,(x-t)^k$$
for each $t\in[c,x]$. Then $F,G$ are continuous on $[c,x]$ and admit finite derivatives on $(c,x)$. By the (Cauchy) mean value theorem we may write
$$F'(\xi)\,\big[G(x)-G(c)\big]=G'(\xi)\,\big[F(x)-F(c)\big]$$
for some $\xi\in(c,x)$. This gives that
$$F'(\xi)\,\big[g(x)-G(c)\big]=G'(\xi)\,\big[f(x)-F(c)\big]$$
since $F(x)=f(x)$ and $G(x)=g(x)$. But we see, by cancelling terms with opposite signs, that
$$F'(t)=\frac{f^{(n)}(t)}{(n-1)!}\,(x-t)^{n-1},\qquad G'(t)=\frac{g^{(n)}(t)}{(n-1)!}\,(x-t)^{n-1},$$
which gives the desired formula when plugging in $t=\xi$. $\blacksquare$

COR We get Taylor's theorem with $g(x)=(x-c)^n$, namely, for some $\xi$ between $c$ and $x$ we have
$$f(x)-\sum_{k=0}^{n-1}\frac{f^{(k)}(c)}{k!}\,(x-c)^k=\frac{f^{(n)}(\xi)}{n!}\,(x-c)^n,$$
or
$$f(x)=\sum_{k=0}^{n-1}\frac{f^{(k)}(c)}{k!}\,(x-c)^k+\frac{f^{(n)}(\xi)}{n!}\,(x-c)^n.$$

Note that if $n=1$, this is precisely the mean value theorem.
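A small numerical illustration of COR (my own sketch, not from the answer; it assumes SciPy): for $f=\sin$, $c=0$, $n=4$, locate a $\xi$ between $c$ and $x$ realizing the remainder, using $f^{(4)}=\sin$:

```python
import math
from scipy.optimize import brentq

c, x, n = 0.0, 1.2, 4
taylor = x - x**3 / 6                     # degree-3 Taylor polynomial of sin at c = 0
remainder = math.sin(x) - taylor

# Solve f^(4)(xi) (x-c)^4 / 4! = remainder for xi; a root exists by the corollary.
g = lambda t: math.sin(t) * (x - c)**n / math.factorial(n) - remainder
xi = brentq(g, c, x)
print(xi, g(xi))                          # xi lies in (0, 1.2), and g(xi) ≈ 0
```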


The following proof is in Bartle's Elements of Real Analysis. Its goal is to exploit Rolle's Theorem, as the more elementary version of the Mean Value Theorem does. To this end, it incorporates a clever use of the product rule.

So, suppose that $f$ denotes a function on $[a,b]$ such that $f$ is $n$-times continuously differentiable on $[a,b]$ and such that $f^{(n)}$ is differentiable on $(a,b)$, i.e. $f^{(n+1)}$ exists there. For every $\alpha$ and $\beta$ in $[a,b]$, distinct from each other, we show there is a point $c$ strictly between $\alpha$ and $\beta$ such that
$$f(\beta)=\sum_{k=0}^{n}\frac{f^{(k)}(\alpha)}{k!}\,(\beta-\alpha)^k+\frac{f^{(n+1)}(c)}{(n+1)!}\,(\beta-\alpha)^{n+1}.$$

To prove this, let $M$ denote the real number which satisfies
$$f(\beta)=\sum_{k=0}^{n}\frac{f^{(k)}(\alpha)}{k!}\,(\beta-\alpha)^k+\frac{M}{(n+1)!}\,(\beta-\alpha)^{n+1}.$$

And now define the function $\varphi$ on $[a,b]$ by
$$\varphi(x)=f(\beta)-\sum_{k=0}^{n}\frac{f^{(k)}(x)}{k!}\,(\beta-x)^k-\frac{M}{(n+1)!}\,(\beta-x)^{n+1}.$$

We clearly have that $\varphi(\beta)=0$ and, by the definition of $M$, we have $\varphi(\alpha)=0$. Thus, Rolle's Theorem implies there is a $c$ strictly between $\alpha$ and $\beta$ such that
$$\varphi'(c)=0.$$

This is where the clever use of the product rule comes in. For when we use the definition of $\varphi$ and differentiate at $c$, we obtain a telescoping series which, upon simplification, leaves us with
$$\varphi'(c)=-\frac{f^{(n+1)}(c)}{n!}\,(\beta-c)^n+\frac{M}{n!}\,(\beta-c)^n=0.$$

This shows that $M=f^{(n+1)}(c)$, as desired.
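Here is a small numerical illustration of the proof's moving parts (my sketch, assuming SciPy), with $f=\exp$, $\alpha=0$, $\beta=1$, $n=2$: the auxiliary $\varphi$ vanishes at both endpoints, and at the Rolle point $c$ we recover $M=f'''(c)$:

```python
import math
from scipy.optimize import brentq

alpha, beta, n = 0.0, 1.0, 2              # f = exp, so f^(k) = exp for every k

# M is pinned down by matching f(beta) in the defining equation above:
taylor = sum(math.exp(alpha) * (beta - alpha)**k / math.factorial(k) for k in range(n + 1))
M = (math.exp(beta) - taylor) * math.factorial(n + 1) / (beta - alpha)**(n + 1)

def phi(x):
    s = sum(math.exp(x) * (beta - x)**k / math.factorial(k) for k in range(n + 1))
    return math.exp(beta) - s - M * (beta - x)**(n + 1) / math.factorial(n + 1)

dphi = lambda x, h=1e-6: (phi(x + h) - phi(x - h)) / (2 * h)   # numerical phi'

print(phi(alpha), phi(beta))     # both vanish (up to rounding)
c = brentq(dphi, 0.01, 0.99)     # Rolle point, bracketed safely inside (alpha, beta)
print(M, math.exp(c))            # M = f'''(c) = exp(c), as the proof shows
```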

Let us try to approximate a function $f$ by a polynomial in such a way that they coincide closely at the origin. To achieve this, we will require the same value, the same slope, the same curvature and the same higher order derivatives at $x=0$.

WLOG we use a cubic polynomial, and we start from
$$f(x)=a+bx+cx^2+dx^3+E(x),$$
where $E(x)$ is an error term.

Imposing our conditions, we need as many equations as there are unknown coefficients:
$$f(0)=a+E(0),\quad f'(0)=b+E'(0),\quad f''(0)=2c+E''(0),\quad f'''(0)=6d+E'''(0).$$

Lastly,
$$f''''(x)=E''''(x).$$

To achieve a small error, we ensure $E(0)=E'(0)=E''(0)=E'''(0)=0$, and set $a=f(0)$, $b=f'(0)$, $c=\dfrac{f''(0)}{2}$, $d=\dfrac{f'''(0)}{6}$. This gives us the Taylor coefficients. We now have to bound the error term.

Assuming that $|f''''(t)|\le M$ in the range $[0,x]$, by integration
$$|E'''(t)|\le Mt,\qquad|E''(t)|\le\frac{Mt^2}{2},\qquad|E'(t)|\le\frac{Mt^3}{6},\qquad|E(t)|\le\frac{Mt^4}{24}.$$

To summarize, for $t\in[0,x]$,
$$f(t)=f(0)+f'(0)\,t+\frac{f''(0)}{2}\,t^2+\frac{f'''(0)}{6}\,t^3+E(t),$$
where $|E(t)|\le\dfrac{Mt^4}{24}$.
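A quick Python check (my addition) of the cubic fit and the error bound for $f=\sin$ on $[0,1]$, where $M=1$ bounds $|f''''|=|\sin|$:

```python
import math

a, b, c, d = 0.0, 1.0, 0.0, -1.0 / 6.0   # f(0), f'(0), f''(0)/2, f'''(0)/6 for sin
M = 1.0                                   # |f''''| = |sin| <= 1

for x in (0.25, 0.5, 1.0):
    cubic = a + b * x + c * x**2 + d * x**3
    err, bound = abs(math.sin(x) - cubic), M * x**4 / 24
    print(f"x={x}: |E(x)|={err:.2e} <= {bound:.2e}: {err <= bound}")
```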

My personal favorite is the proof which uses L'Hôpital's rule. It is without a doubt one of the lightest proofs for it, and in my own view one of the more elegant. The proof below is quoted straight out of the related Wikipedia page:

Let:
$$h_k(x)=\begin{cases}\dfrac{f(x)-P(x)}{(x-a)^k}&x\ne a\\ 0&x=a\end{cases}$$

where, as in the statement of Taylor's theorem,
$$P(x)=f(a)+f'(a)\,(x-a)+\frac{f''(a)}{2!}\,(x-a)^2+\cdots+\frac{f^{(k)}(a)}{k!}\,(x-a)^k.$$

It is sufficient to show that $\lim_{x\to a}h_k(x)=0$. The proof here is based on repeated application of L'Hôpital's rule.

Note that, for each $j=0,1,\dots,k-1$, $f^{(j)}(a)=P^{(j)}(a)$.

Hence each of the first $k-1$ derivatives of the numerator in $h_k$ vanishes at $x=a$, and the same is true of the denominator. Also, since the condition that the function $f$ be $k$ times differentiable at a point requires differentiability up to order $k-1$ in a neighborhood of said point (this is true, because differentiability requires a function to be defined in a whole neighborhood of a point), the numerator and its $k-2$ derivatives are differentiable in a neighborhood of $a$. Clearly, the denominator also satisfies said condition, and additionally, doesn't vanish unless $x=a$, therefore all conditions necessary for L'Hôpital's rule are fulfilled, and its use is justified. So

\begin{align} \lim_{x\to a} \frac{f(x) - P(x)}{(x-a)^k} &= \lim_{x\to a} \frac{\frac{d}{dx}(f(x) - P(x))}{\frac{d}{dx}(x-a)^k} = \cdots = \lim_{x\to a} \frac{\frac{d^{k-1}}{dx^{k-1}}(f(x) - P(x))}{\frac{d^{k-1}}{dx^{k-1}}(x-a)^k}\\ &=\frac{1}{k!}\lim_{x\to a} \frac{f^{(k-1)}(x) - P^{(k-1)}(x)}{x-a}\\ &=\frac{1}{k!}(f^{(k)}(a) - f^{(k)}(a)) = 0 \end{align}

where the second-to-last equality follows by the definition of the derivative at $x=a$.
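As a numerical companion (mine, not part of the Wikipedia proof), one can watch $h_k(x)$ vanish; with $f=\exp$, $a=0$, $k=3$ the ratio behaves like $x/(k+1)!$:

```python
import math

a, k = 0.0, 3
P = lambda x: sum(x**j / math.factorial(j) for j in range(k + 1))  # degree-k Taylor poly of exp

for x in (0.1, 0.03, 0.01, 0.003):
    print(x, (math.exp(x) - P(x)) / (x - a)**k)   # tends to 0, roughly like x/24
```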


This is the best proof I’ve seen:

https://arxiv.org/abs/0801.1271v2


It’s all about smoothness of the functions.

A continuous function is such that it can be accurately approximated by a constant in the neighborhood of a point:
$$f(x)=f(a)+r_0(x),$$
where $r_0$ is a "remainder" function, which tends to zero at $a$.

A smooth function is such that it is differentiable, and its derivatives are continuous. (The more derivatives, the smoother.) For the sake of the example, consider the third order:
$$f'''(x)=f'''(a)+r_3(x).$$

Then integrating from $a$ to $x$ three times,
$$f(x)=f(a)+f'(a)\,(x-a)+\frac{f''(a)}{2}\,(x-a)^2+\frac{f'''(a)}{6}\,(x-a)^3+R_3(x).$$

In the above, the successive remainders are antiderivatives of each other, and one can show that the final one, $R_3$, belongs to $o\big((x-a)^3\big)$.
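To see the $o\big((x-a)^3\big)$ claim concretely, here is a tiny Python check (my own), with $f=\cos$ and $a=0$: the ratio of the remainder to $(x-a)^3$ shrinks as $x\to a$:

```python
import math

a = 0.0
T3 = lambda x: 1 - x**2 / 2   # expansion of cos about 0 up to third order (odd terms vanish)

for x in (0.5, 0.1, 0.02):
    print(x, (math.cos(x) - T3(x)) / (x - a)**3)   # ratio ~ x/24, tending to 0
```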

No integration-by-parts, no l’Hôpital, some telescoping

By the product rule and the chain rule, if the nominated derivatives exist,
$$\frac{d}{dt}\left[-\sum_{k=1}^{n}\frac{f^{(k)}(t)}{k!}\,(x-t)^k\right]=\sum_{k=1}^{n}\left[\frac{f^{(k)}(t)}{(k-1)!}\,(x-t)^{k-1}-\frac{f^{(k+1)}(t)}{k!}\,(x-t)^k\right].\tag{$*$}$$

In the sum on the right, the left-hand term for $k=2,3,\dots,n$ cancels with the right-hand term for $k=1,2,\dots,n-1$ respectively, unless $n=1$, in which case there is only one pair of terms and no cancellation. In either case, the only remaining terms are the left-hand term for $k=1$ and the right-hand term for $k=n$. Solving for the latter gives
$$\frac{f^{(n+1)}(t)}{n!}\,(x-t)^n=f'(t)+\frac{d}{dt}\sum_{k=1}^{n}\frac{f^{(k)}(t)}{k!}\,(x-t)^k.$$

Then, integrating w.r.t. $t$ from $a$ to $x$ (the summation vanishes at $t=x$), adding $f(a)$ to both sides, and rearranging, we have QED:
$$f(x)=f(a)+\sum_{k=1}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\int_a^x\frac{f^{(n+1)}(t)}{n!}\,(x-t)^n\,dt.\tag{1}$$

This is Taylor's theorem with the remainder term, the last term in (1), in integral form. (The indexed summation on the right of (1) is the inspiration for the summation on the left of ($*$), which yields a telescoping series when differentiated w.r.t. the variable with the minus sign in front; compare the answer by user123641.)

In (1), the term $f(a)$ may be taken under the $\sum$ sign simply by extending the range of $k$ down to $0$:
$$f(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\int_a^x\frac{f^{(n+1)}(t)}{n!}\,(x-t)^n\,dt.\tag{2}$$

This result has been obtained for $n\ge 1$, and if $n=0$ it reduces to the Fundamental Theorem of Calculus; thus it is established for $n\ge 0$.

If $f^{(n+1)}$ is continuous on the interval of integration, the remainder term in (1) or (2) may be converted from integral form to Lagrange form as follows (cf. Venkata Karthik Bandaru's answer). Because the factor $(x-t)^n$ does not change sign on the interval, the integral is between the values that it takes if we replace the factor $f^{(n+1)}(t)$ by its minimum and its maximum on the interval (where "between" is interpreted inclusively). The minimum or maximum may then be taken outside the integration, and the remaining integral evaluated as
$$\int_a^x\frac{(x-t)^n}{n!}\,dt=\frac{(x-a)^{n+1}}{(n+1)!}.$$

Thus the remainder is between
$$\Big(\min f^{(n+1)}\Big)\,\frac{(x-a)^{n+1}}{(n+1)!}$$

and
$$\Big(\max f^{(n+1)}\Big)\,\frac{(x-a)^{n+1}}{(n+1)!}$$

(inclusive), where the minimum or maximum is taken over the interval of integration, and the other factor is independent of $t$. Hence, by the continuity of $f^{(n+1)}$, there exists a real $c$ on that interval such that the remainder is exactly
$$\frac{f^{(n+1)}(c)}{(n+1)!}\,(x-a)^{n+1}.$$
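A short Python check (my sketch, assuming SciPy) of the integral form (2) for $f=\exp$, where every derivative is again $\exp$:

```python
import math
from scipy.integrate import quad

a, x, n = 0.0, 1.0, 4
poly = sum(math.exp(a) * (x - a)**k / math.factorial(k) for k in range(n + 1))
rem, _ = quad(lambda t: math.exp(t) * (x - t)**n / math.factorial(n), a, x)
print(poly + rem, math.exp(x))   # both ≈ 2.718281828...
```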

The following isn’t a rigorous proof, but I think it’s “aesthetic”, and “rise[s] naturally from the ground”, as the original question asked for.

In searching for intuition for Taylor Series, I’ve developed a perspective involving Pascal’s Triangle, which arises from recursively applied Riemann Sum approximations to the function.

I found @Bob Pego’s answer really helpful and it’s how I started developing this.

The end result involves coefficients based on rows of Pascal’s Triangle, and the sequence of approximations (sequence of rows) looks like this

[Figure: "Pascal" approximations for sin(x)]

And they're much less efficient approximations than the plain finite Taylor polynomials:

[Figure: Taylor approximations for sin(x)]

I'll explain the derivation, but the essence of it is that the recursive Riemann Sum procedure produces binomial coefficients — rows of Pascal's Triangle — which are also simplex numbers. Simplex numbers converge to factorial fractions of hypercubes: the $n$th triangle number approaches $\frac{n^2}{2!}$, the $n$th tetrahedral number approaches $\frac{n^3}{3!}$, and so on.

A regular Riemann Sum approximation of $f(x)$ of "resolution" 4 would be
$$f(x)\approx f(0)+\tfrac{x}{4}\,f'(0)+\tfrac{x}{4}\,f'\!\left(\tfrac{x}{4}\right)+\tfrac{x}{4}\,f'\!\left(\tfrac{2x}{4}\right)+\tfrac{x}{4}\,f'\!\left(\tfrac{3x}{4}\right).$$

After each discrete step, we update the slope by setting it to the true slope of the function — what the 1st derivative is at that point we’ve stepped to along x. This is the idea of a Riemann Sum.

But since we're interested in Taylor Series (about 0) here, let's pretend that we can't update to $f'\!\left(\tfrac{x}{4}\right)$ directly, and can only use the values of all derivatives evaluated at 0, not at $\tfrac{x}{4}$ or anywhere else.

So instead of updating to the actual slope, we'll use a recursive approximation to get an approximate slope update. We can now recurse and approximate each of the terms that have a non-0 $x$ value. For example,
$$f'\!\left(\tfrac{2x}{4}\right)\approx f'\!\left(\tfrac{x}{4}\right)+\tfrac{x}{4}\,f''\!\left(\tfrac{x}{4}\right).$$

There are still some terms with derivatives of $f$ evaluated elsewhere than 0, so we recursively approximate those terms too, until all terms are derivatives of $f$ evaluated at 0.

For resolution 4, you'll end up with
$$f(x)\approx f(0)+4\,\frac{x}{4}\,f'(0)+6\,\frac{x^2}{4^2}\,f''(0)+4\,\frac{x^3}{4^3}\,f'''(0)+\frac{x^4}{4^4}\,f''''(0).$$

Note the appearance of the Pascal row $1,4,6,4,1$.

In general, for resolution $n$, that will be
$$f(x)\approx\sum_{k=0}^{n}\binom{n}{k}\frac{x^k}{n^k}\,f^{(k)}(0).$$

But I prefer to focus on the simplex perspective. Equivalently, that's
$$f(x)\approx\sum_{k=0}^{n}\mathrm{simplex}^{(k)}_{\,n-k+1}\,\frac{x^k}{n^k}\,f^{(k)}(0),$$

where $\mathrm{simplex}^{(k)}_{\,i}=\binom{(k-1)+i}{k}$ is the $i$-th $k$-simplex number, and e.g. $\mathrm{simplex}^{(4)}_{\,2}=\mathrm{penta}_2$ is the 2nd pentatope number, like if we index the simplex numbers from 1 to infinity. A few examples:

$$\color{blue}{\mathrm{tetra}}_{\color{red}{2}}=\binom{(\color{blue}{3}-1)+\color{red}{2}}{\color{blue}{3}}=4,\qquad\color{blue}{\mathrm{tetra}}_{\color{red}{5}}=\binom{(\color{blue}{3}-1)+\color{red}{5}}{\color{blue}{3}}=35$$

etc.

Check the Pascal’s Triangle wikipedia page if you’re not following that.

Simplex numbers approaching factorial fractions of hypercubes:
$$\lim_{i\to\infty}\frac{\mathrm{tri}_i}{i^2}=\frac{1}{2!}$$

and
$$\lim_{i\to\infty}\frac{\mathrm{tetra}_i}{i^3}=\frac{1}{3!}.$$

Taking $n$ to $\infty$ corresponds to increasing the "resolution" of your Riemann Sum, and approaching continuous integration, thus approaching the Taylor Series.

Just like this “triangle”

0
00
000
0000

is a low-resolution version of an actual right isosceles triangle.

This may seem really roundabout given the concise alternative of the binomial coefficient notation, but I think simplexes are a nice way to visualize the "lagged" effect of higher order derivatives. If you begin traveling with a constant acceleration of 1, then after 1 unit of time, your displacement will be the area of the right triangle in a unit square, $\frac{1}{2!}=\frac{1}{2}$. If you begin traveling with a constant jerk of 1, then after 1 unit of time, your displacement will be the volume of a tetrahedron in the corner of a unit cube, $\frac{1}{3!}=\frac{1}{6}$.
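If it helps, here is a short Python sketch (mine, not the answerer's) of the resolution-$n$ "Pascal" approximation $\sum_k\binom{n}{k}(x/n)^k f^{(k)}(0)$ described above, compared with the plain Taylor polynomial for $f=\sin$:

```python
import math

d_sin0 = lambda k: (0, 1, 0, -1)[k % 4]   # k-th derivative of sin evaluated at 0

def pascal_approx(x, n):
    # resolution-n recursive Riemann sum: sum_k C(n,k) (x/n)^k f^(k)(0)
    return sum(math.comb(n, k) * (x / n)**k * d_sin0(k) for k in range(n + 1))

def taylor_poly(x, n):
    return sum(x**k / math.factorial(k) * d_sin0(k) for k in range(n + 1))

x = 2.0
print("sin(x)       =", math.sin(x))
print("Taylor(n=9)  =", taylor_poly(x, 9))
for n in (4, 16, 64, 256):
    print(f"Pascal(n={n}) =", pascal_approx(x, n))
# The Pascal sums converge to sin(x), but far more slowly than the Taylor
# polynomial does, since C(n,k)/n^k -> 1/k! only as n -> infinity.
```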

There is also a natural and well-known proof using integration by parts.

Let $f$ be a function on an open interval $I$, and $a,b\in I$. The goal is to relate $f(b)$ to $f(a)$ and the derivatives $f^{(k)}(a)$'s.

Using integration by parts on $f(b)=f(a)+\int_a^b f'(t)\,dt$ will make higher derivative terms appear.
One thought is to write $\int_a^b f'(t)\,dt=\big[f'(t)\,t\big]_a^b-\int_a^b f^{(2)}(t)\,t\,dt$, but $f'(b)\,b$ appears here.

To avoid this, we can instead do $\int_a^b f'(t)\,dt=\big[f'(t)\,(t-b)\big]_a^b-\int_a^b f^{(2)}(t)\,(t-b)\,dt$.
So continuing this way,
\scriptstyle{\begin{align} f(b) &= f(a) + \int_{a}^{b} f'(t) dt \\ &= f(a) + f'(t) (t-b) \bigr|_{a}^{b} - \int_{a}^{b} f^{(2)}(t) (t-b) dt \\ &= f(a) + f'(a) (b-a) - \left(f^{(2)}(t) \frac{(t-b)^2}{2} \Bigr|_{a}^{b} - \int_{a}^{b} f^{(3)}(t) \frac{(t-b)^2}{2} dt \right) \\ &= f(a) + f'(a) (b-a) + \frac{f^{(2)}(a)}{2} (b-a)^2 + \int_{a}^{b} f^{(3)}(t) \frac{(t-b)^2}{2} dt \\ &\vdots \\ &= f(a) + f'(a) (b-a) + \frac{f^{(2)}(a)}{2!} (b-a)^2 + \ldots + \frac{f^{(n-1)}(a)}{(n-1)!} (b-a)^{n-1} + (-1)^{n-1} \int_{a}^{b} f^{(n)}(t) \frac{(t-b)^{n-1}}{(n-1)!}dt, \end{align}}%

the remainder term being
$$R_n=(-1)^{n-1}\int_a^b f^{(n)}(t)\,\frac{(t-b)^{n-1}}{(n-1)!}\,dt=\int_a^b f^{(n)}(t)\,\frac{(b-t)^{n-1}}{(n-1)!}\,dt.$$

Like in Bob Pego's answer, this can be expressed as $\frac{f^{(n)}(c)}{n!}\,(b-a)^n$ where $c\in[a,b]$:
For convenience say $m=\min_{[a,b]}f^{(n)}$ and $M=\max_{[a,b]}f^{(n)}$. Now the remainder is between $m\int_a^b\frac{(b-t)^{n-1}}{(n-1)!}\,dt$ and $M\int_a^b\frac{(b-t)^{n-1}}{(n-1)!}\,dt$. That is, $R_n$ is between $m\,\frac{(b-a)^n}{n!}$ and $M\,\frac{(b-a)^n}{n!}$. Hence, by the intermediate value theorem, $R_n$ is $\frac{f^{(n)}(c)}{n!}\,(b-a)^n$ for some $c\in[a,b]$, as needed.
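A numerical companion (my sketch, assuming SciPy): compute $R_n$ by quadrature and then locate a $c\in[a,b]$ with $R_n=\frac{f^{(n)}(c)}{n!}(b-a)^n$, for $f=\exp$, $a=0$, $b=1$, $n=3$:

```python
import math
from scipy.integrate import quad
from scipy.optimize import brentq

a, b, n = 0.0, 1.0, 3                      # f = exp, so f^(n) = exp
R, _ = quad(lambda t: math.exp(t) * (-1)**(n - 1) * (t - b)**(n - 1)
            / math.factorial(n - 1), a, b)
c = brentq(lambda t: math.exp(t) * (b - a)**n / math.factorial(n) - R, a, b)
print(R, c, math.exp(c) * (b - a)**n / math.factorial(n))   # R matches the Lagrange form
```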

Here is a nice summary and proof from Stewart's Calculus:

http://www.stewartcalculus.com/data/CALCULUS%20Early%20Transcendentals/upfiles/Formulas4RemainderTaylorSeries5ET.pdf

Regarding the initial answer to the posted question (which is as straightforward an approach to a proof of Taylor's Theorem as possible), I find the following the easiest way to explain how the last term on the RHS of the equation (the nested integrals) approaches 0 as the number of iterations $n$ becomes arbitrarily large:

There are two cases: (1) $f(x)$ is finitely differentiable, or (2) $f(x)$ is infinitely differentiable.

(1) If $f(x)$ is finitely differentiable, then there exists a value of $n$ such that all derivatives of order $n+1$ or greater are $0$. The innermost integral of the nested integral is then $0$, which renders the entire nested integral equal to $0$, and thus gives us the aforementioned Taylor polynomial of finite order $n$ with no remainder.

(2) If $f(x)$ is infinitely differentiable, then, as the number of iterations $n$ approaches infinity, because we require by the definition of the nested integrals that $a<t_n<t_{n-1}<t_{n-2}<\dots<t_2<t_1<x$, we see that $t_n\to a$ as $n\to\infty$. As a result, we have (as is true in case (1)) that the innermost integral of the nested integral approaches $0$, thus giving us a remainder term of $0$ in the limit, and hence resulting in the infinite series expression for the Taylor series of the function $f(x)$.

Authors of most books will not be so kind to illustrate a proof in this manner, though. It’s upsetting, I know.

First, we have:
$$f(x)=f(a)+\int_a^x f'(t)\,dt.$$

Second, from this follows:
$$f(x)=f(a)+f'(a)\,(x-a)+\int_a^x f''(t)\,(x-t)\,dt.$$

In general, arguing by induction, it also follows that, for $n\ge 0$:
$$f(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\int_a^x f^{(n+1)}(t)\,\frac{(x-t)^n}{n!}\,dt.$$

Thus,
$$f(x)-\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k=\int_a^x f^{(n+1)}(t)\,\frac{(x-t)^n}{n!}\,dt.$$

Third, defining the weighting function
$$w_n(t)=\frac{(x-t)^n}{n!},\qquad\text{with}\qquad\int_a^x w_n(t)\,dt=\frac{(x-a)^{n+1}}{(n+1)!},$$

it follows that, since $w_n$ does not change sign between $a$ and $x$,
$$\int_a^x f^{(n+1)}(t)\,w_n(t)\,dt=f^{(n+1)}(\xi)\,\frac{(x-a)^{n+1}}{(n+1)!}\qquad\text{for some }\xi\text{ between }a\text{ and }x.$$

Thus, upon substitution:
$$f(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x-a)^{n+1}.$$

This requires a suitable assumption to be made on $f^{(n+1)}$ over the range of integration, and thus on $f^{(n+1)}(t)$ as $t$ ranges between $a$ and $x$ - not the least of which being that $f^{(n+1)}$ exists and be continuous over that range.

This states that if $f'$ has the required property for $t$ between $a$ and $x$, then $f(x)-f(a)$ has $x-a$ as a factor; if $f''$ also has that property, then $f(x)-f(a)-f'(a)(x-a)$ has $(x-a)^2$ as a factor; and so on; with the multiples of the factors being continuous functions. Then the polynomial case generalizes the theorem that $p(x)-p(a)$ has $x-a$ as a factor, for polynomials $p$ in $x$.

This also applies to multiple points. You can try to write out the corresponding expressions for two points, for instance $a$ and $b$, and try to figure out what the weighting function should be. The relevant extension of Taylor's Theorem to multiple points has no name that I am aware of; but it reflects the correct use of Taylor's Theorem - which is curve-sculpting, a.k.a. smooth-interpolation.

Example:
For $f(x)=\sin x$, since $\big|f^{(n+1)}\big|\le 1$ everywhere and $\int_a^x w_n(t)\,dt=\frac{(x-a)^{n+1}}{(n+1)!}$, then
$$\left|\sin x-\sum_{k=0}^{n}\frac{\sin^{(k)}(a)}{k!}\,(x-a)^k\right|\le\frac{|x-a|^{n+1}}{(n+1)!}.$$

The remainder term is actually quite small over $|x-a|\le 1$ - and even outside that interval; but also:
$$\frac{|x-a|^{n+1}}{(n+1)!}\longrightarrow 0$$

over all of $\mathbb{R}$, for $n\to\infty$.
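A quick Python check (my addition) of the displayed bound, with $a=0$:

```python
import math

def T(x, n):   # Taylor polynomial of sin about 0, degree n
    return sum((0, 1, 0, -1)[k % 4] * x**k / math.factorial(k) for k in range(n + 1))

x = 2.0
for n in (3, 5, 9):
    print(n, abs(math.sin(x) - T(x, n)), x**(n + 1) / math.factorial(n + 1))
# the actual remainder always sits below |x-a|^(n+1)/(n+1)!
```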

Based on Edwards’ approach in “Advanced Calculus of Several Variables”, given here because it’s a different flavor from what has been shown.

Given $a\in I$ and a differentiable (enough times) $f$, we want to express $f$ at, say, $x=b$, by a Taylor polynomial of order $n-1$ in $(b-a)$ plus another monomial of order $n$ in $(b-a)$:
$$f(b)=\underbrace{\sum_{k=0}^{n-1}\frac{f^{(k)}(a)}{k!}\,(b-a)^k}_{P(b)}+C\,(b-a)^n.\tag{1}$$

Our Taylor polynomial $P$ does not depend on the nearby point $b$, but the power's coefficient $C$ does, being implicitly determined by (1) and by $a$ and $b$, and since we'll need to consider functions on the whole interval $[a,b]$, we shall use $b$ instead of $x$, to indicate that the end-point is arbitrary but fixed, and let $x$ denote the free variable.

So, once $a$ and $b$ are given, we start by setting $C$ so that
$$f(b)=P(b)+C\,(b-a)^n\tag{2}$$
i.e. we let $C$ be a constant determined by both $a$ and $b$, and stress it here that this is the final result, aside from the important fact that $C$ is not set in its proper, final form yet.

[ we could indicate that $C=\dfrac{f(b)-P(b)}{(b-a)^n}$, but this involves $f(b)$, so it's not what we want and hence we don't mention it ]

We notice that $f$ and $P$ both have the same value and first $n-1$ derivatives at $a$, and so their difference $f-P$ and the first $n-1$ derivatives of it vanish at $x=a$. Let's call this the central property of $f-P$, which we'll make use of repeatedly below.

Then, we see that having (2) satisfied can also be written as
$$\phi(x)=f(x)-P(x)-C\,(x-a)^n$$
being zero at $x=b$, i.e. as $\phi(b)=0$. As observed in brackets above, the equation $\phi(b)=0$ does not provide a convenient expression for $C$, but it allows us to use Rolle to get a point $x_1\in(a,b)$ where $\phi'(x_1)=0$, since $\phi(a)=0$ is obvious (central property plus clearly zero term).

This may not feel very helpful, until we write
$$\phi'(x)=f'(x)-P'(x)-n\,C\,(x-a)^{n-1}$$
and notice that the undesired term $n\,C\,(x-a)^{n-1}$, when the above is evaluated at $x_1$, while it still does not vanish, it no longer has degree $n$ in $(x-a)$ but $n-1$. So this equation is still unusable to determine $C$, but the unwanted term left from $C\,(x-a)^n$ looks one degree better!

Now add the fact that $f'-P'$ vanishes at $x=a$ (the central property) and that $n\,C\,(x-a)^{n-1}$ also vanishes there because it still has a positive power of $(x-a)$ present. So, the derivative $\phi'$ vanishes at both $a$ and $x_1$ and we can apply Rolle again on $[a,x_1]$, to get a point $x_2$ with $\phi''(x_2)=0$, and reduce the polynomial order of what's left from the undesired term one more time… and again, hopefully repeatedly, until this undesired term is gone and only $C$ and a derivative of $f$ remain.

We perform a few steps explicitly, to illustrate how the above insinuation works, and close with the final expression.

The second Rolle step, as laid out above, gives
$$\phi''(x)=f''(x)-P''(x)-n(n-1)\,C\,(x-a)^{n-2}$$
vanishing at some $x_2\in(a,x_1)$, aside from $x=a$, and the third Rolle step gives
$$\phi'''(x)=f'''(x)-P'''(x)-n(n-1)(n-2)\,C\,(x-a)^{n-3}$$
vanishing at some $x_3\in(a,x_2)$, aside from $x=a$.

When the $n$-th derivative is taken, nothing is left from $P$ (it has degree $n-1$) and only a constant from the 3rd term, allowing $C$ to be determined from the last Rolle application, i.e.
$$\phi^{(n)}(x)=f^{(n)}(x)-n!\,C$$
will vanish at some $x_n=\xi$, where $\xi\in(a,x_{n-1})\subset(a,b)$,
which gives the desired expression for $C$, which no longer contains $f(b)$:
$$C=\frac{f^{(n)}(\xi)}{n!},$$
and putting this back in (2) we get the theorem in std form (1):
$$f(b)=\sum_{k=0}^{n-1}\frac{f^{(k)}(a)}{k!}\,(b-a)^k+\frac{f^{(n)}(\xi)}{n!}\,(b-a)^n.$$
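To make the conclusion concrete, here is a tiny Python sketch (my own, assuming SciPy) checking that $C$ defined by (2) equals $f^{(n)}(\xi)/n!$ for some $\xi\in(a,b)$, with $f=\exp$, $n=4$:

```python
import math
from scipy.optimize import brentq

a, b, n = 0.0, 2.0, 4
f = math.exp                                  # every derivative of exp is exp
P = lambda x: sum(f(a) * (x - a)**k / math.factorial(k) for k in range(n))  # order n-1
C = (f(b) - P(b)) / (b - a)**n                # C as pinned down by (2)
xi = brentq(lambda t: f(t) - math.factorial(n) * C, a, b)   # solve f^(n)(xi) = n! C
print(C, f(xi) / math.factorial(n))           # the two values agree
```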

[This is from Rosenlicht’s Analysis book]

Let $I$ be an open interval and $f:I\to\mathbb{R}$ be an $(n+1)$-times differentiable function. Let the variables $x,t$ range over $I$.

Temporarily fixing $t$, the unique polynomial in $x$ of degree $\le n$ whose derivatives at $t$ (from $0$th derivative to $n$th derivative) agree with the derivatives of $f$ at $t$ is
$$P_t(x)=\sum_{k=0}^{n}\frac{f^{(k)}(t)}{k!}\,(x-t)^k.$$

So we can consider the bivariate remainder defined by
$$R(x,t)=f(x)-P_t(x)=f(x)-\sum_{k=0}^{n}\frac{f^{(k)}(t)}{k!}\,(x-t)^k.$$

Now we can fix $x$ and compare the average rates of change of $t\mapsto R(x,t)$ and $t\mapsto(x-t)^{n+1}$ as $t$ varies from some point $a$ to $x$.

By the Generalised MVT, we have
$$\frac{R(x,a)-R(x,x)}{(x-a)^{n+1}-(x-x)^{n+1}}=\frac{\partial_t R(x,t)\big|_{t=c}}{-(n+1)\,(x-c)^n}$$

for some $c$ strictly between $a$ and $x$.

Note the derivative telescopes:
$$\partial_t R(x,t)=-\frac{f^{(n+1)}(t)}{n!}\,(x-t)^n.$$

So, on simplifying (using $R(x,x)=0$), we see
$$\frac{R(x,a)}{(x-a)^{n+1}}=\frac{f^{(n+1)}(c)}{(n+1)!},$$

that is,
$$f(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\frac{f^{(n+1)}(c)}{(n+1)!}\,(x-a)^{n+1}$$
for some $c$ strictly between $a$ and $x$, as needed.
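The telescoping derivative is easy to confirm symbolically; here is a SymPy sketch (mine, not Rosenlicht's) for a generic $f$ and $n=3$:

```python
import sympy as sp

x, t = sp.symbols('x t')
f = sp.Function('f')
n = 3
R = f(x) - sum(f(t).diff(t, k) * (x - t)**k / sp.factorial(k) for k in range(n + 1))
lhs = R.diff(t)                                          # d/dt of the remainder
rhs = -f(t).diff(t, n + 1) * (x - t)**n / sp.factorial(n)
print(sp.simplify(lhs - rhs))                            # prints 0
```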

Let $f$ be infinitely differentiable (we'll weaken this hypothesis later) on an open interval $I$ containing $a$ (so $f$ is $C^\infty(I)$ for now, for simplicity).

Let's try to approximate $f$, over $I$, with a polynomial of degree $n-1$:
$$f(x)=\underbrace{a_0+a_1(x-a)+\dots+a_{n-1}(x-a)^{n-1}}_{p(x)}+E(x).$$

We didn't yet fix our approximating polynomial $p$. We'll first fix it by picking some intuitively plausible coefficients $a_0,\dots,a_{n-1}$, and then study the resulting error function $E=f-p$.



Fixing an approximation: Intuitively, we want our approximation to be such that $E$ is "as flat and close to the $0$-function on $I$ as possible". So we can try to make
$$E(a)=E'(a)=\dots=E^{(n-1)}(a)=0.$$
These are $n$ constraints, to fix $n$ coefficients.
Since $E^{(k)}(a)=f^{(k)}(a)-k!\,a_k$, setting $a_0=f(a)$, $a_1=\dfrac{f^{(1)}(a)}{1!}$, $\dots$, $a_{n-1}=\dfrac{f^{(n-1)}(a)}{(n-1)!}$ would do the job.

On the resulting error function: Fix $x\ne a$ in $I$ and consider
$$g(t)=E(t)-\left(\frac{t-a}{x-a}\right)^{\!n}E(x).$$
Then $g(a)=g(x)=0$ gives (by Rolle's theorem) $g'(c_1)=0$ for some $c_1$ strictly between $a$ and $x$. Now $g'(a)=g'(c_1)=0$ gives $g''(c_2)=0$ for some $c_2$ strictly between $a$ and $c_1$. Now $g''(a)=g''(c_2)=0$ gives $g'''(c_3)=0$ for some $c_3$, and so on. At the end, we get $g^{(n)}(c)=0$ for some $c$ strictly between $a$ and $x$, that is
$$E^{(n)}(c)-\frac{n!}{(x-a)^n}\,E(x)=0,$$
i.e. $E(x)=\dfrac{E^{(n)}(c)}{n!}\,(x-a)^n$.



Finally, substituting the explicit value $E^{(n)}=f^{(n)}-p^{(n)}=f^{(n)}$ (as $p$ has degree $n-1$) gives
$$f(x)=\sum_{k=0}^{n-1}\frac{f^{(k)}(a)}{k!}\,(x-a)^k+\frac{f^{(n)}(c)}{n!}\,(x-a)^n,$$

as needed.

[Looking back at the proof, we could have taken "$f$ is $n$ times differentiable on $I$" instead of "$f$ is infinitely differentiable" to begin with. Also, the same idea works for $x<a$ too.]
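A short SymPy check (my own) of the two ingredients: the error $E$ vanishes to order $n-1$ at $a$, and $E(x)/(x-a)^n\to f^{(n)}(a)/n!$ as $x\to a$, consistent with the Lagrange form; here $f=\exp$, $a=0$, $n=5$:

```python
import sympy as sp

x = sp.Symbol('x')
f, a, n = sp.exp(x), 0, 5
p = sum(f.diff(x, k).subs(x, a) / sp.factorial(k) * (x - a)**k for k in range(n))
E = f - p
print([sp.simplify(E.diff(x, k).subs(x, a)) for k in range(n)])  # n zeros
print(sp.limit(E / (x - a)**n, x, a))                            # 1/120 = f^(5)(0)/5!
```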