02.05.03 · analysis / multivariable-differentiation

Chain rule for multi-variable functions

shipped · 3 tiers · Lean: partial

Anchor (Master): Apostol *Calculus* Vol. 2 Ch. 8 §8.18–8.21; Dieudonné *Foundations of Modern Analysis* Ch. VIII; Cartan *Calcul Différentiel*; Faà di Bruno 1855 *Sullo sviluppo delle funzioni* (Annali di Scienze Matematiche e Fisiche 6); Itô 1944 *Stochastic Integral* (Proc. Imp. Acad. Tokyo 20)

Intuition [Beginner]

A composition of functions chains one operation after another. The inner function takes an input and produces an intermediate output. The outer function takes that intermediate output and produces a final output. In one variable, the rate of change of the chained operation equals the rate of change of the outer function (at the intermediate output) times the rate of change of the inner function (at the original input). Picture two gears coupled together: a small turn of the input shaft turns the intermediate shaft by some ratio, and that turn of the intermediate shaft turns the output shaft by a second ratio, so the overall response is the product of the two ratios.

The multi-variable version keeps the same idea, with a small upgrade. The "rate of change" of a function from $n$ -space to $m$ -space is no longer a single number — it is a rectangular table of numbers called the Jacobian matrix, with one row per output coordinate and one column per input coordinate. Each entry records how one output coordinate responds to a small change in one input coordinate. The chain rule says: the Jacobian matrix of the composition is the matrix product of the outer Jacobian (evaluated at the intermediate output) and the inner Jacobian.

A picture worth carrying: nesting boxes. The inner box converts inputs into intermediates; the outer box converts intermediates into outputs. The chain rule says you can sense how the outermost output reacts to the innermost input by multiplying the two response tables together, in the correct order.

Visual [Beginner]

A three-panel diagram. The leftmost panel shows a number line carrying the input variable $t$ . An arrow labelled $g$ leads to the middle panel, which shows the plane $R^{2}$ with a circle traced by the point $(cos t, sin t)$ . A second arrow labelled $f$ leads to the rightmost panel, another number line, where the entire circle has been collapsed to the single value $1$ because the squared distance from the origin to any point on the unit circle equals $1$ .

The two short arrows above the panels carry the Jacobians of the two maps. The long bottom arrow expresses the chain rule: the Jacobian of the composition is the matrix product of the outer Jacobian and the inner Jacobian. The visual signature: the chain rule is multiplication of derivative-tables, with the outer table evaluated at the intermediate point.

Worked example [Beginner]

Let the inner map be $g (t) = (cos t, sin t)$ , a path tracing out the unit circle in the plane. Let the outer map be $f (x, y) = x^{2} + y^{2}$ , the squared distance from the origin. The composition is $(f \circ g) (t) = cos^{2} t + sin^{2} t = 1$ for every $t$ . So the composed function is constant; its rate of change is $0$ .

Check this two ways. Direct computation: the composed function equals $1$ everywhere, so its derivative at every point equals $0$ . Chain-rule computation: the inner Jacobian is $g^{'} (t) = (- sin t, cos t)$ , a $2 \times 1$ column. The outer Jacobian is the gradient row $(2 x, 2 y)$ evaluated at the intermediate point $(cos t, sin t)$ , which gives $(2 cos t, 2 sin t)$ . The matrix product is $(2 cos t) (- sin t) + (2 sin t) (cos t) = - 2 sin t cos t + 2 sin t cos t = 0$ . The two computations agree.

What this tells us: the chain rule reproduces the direct calculation, but it works without needing to first carry out the composition. The chain-rule machine takes the two Jacobian tables and multiplies — that is enough.
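The check above can also be run numerically. The following is a minimal sketch, not part of the original unit: the function names are illustrative, and the central-difference estimate is an assumed stand-in for "the direct computation".

```python
import math

def g(t):
    """Inner map: a path around the unit circle."""
    return (math.cos(t), math.sin(t))

def f(x, y):
    """Outer map: squared distance from the origin."""
    return x * x + y * y

def chain_rule_derivative(t):
    """Gradient row (2x, 2y) at g(t) times the column g'(t) = (-sin t, cos t)."""
    x, y = g(t)
    outer_row = (2 * x, 2 * y)                 # outer Jacobian at the intermediate point
    inner_col = (-math.sin(t), math.cos(t))    # inner Jacobian at the input
    return outer_row[0] * inner_col[0] + outer_row[1] * inner_col[1]

def finite_difference_derivative(t, step=1e-6):
    """Central-difference estimate of (f o g)'(t), for comparison only."""
    return (f(*g(t + step)) - f(*g(t - step))) / (2 * step)

for t in (0.0, 0.7, 2.5):
    print(t, chain_rule_derivative(t), finite_difference_derivative(t))
```

Both columns should be zero up to rounding at every sampled $t$ , which is the agreement the worked example establishes by hand.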

Check your understanding [Beginner]

Formal definition [Intermediate+]

Let $U \subseteq R^{n}$ and $V \subseteq R^{m}$ be open sets, let $g : U \to R^{m}$ with $g (U) \subseteq V$ , and let $f : V \to R^{p}$ . The map $g$ is differentiable at $a \in U$ when a linear map $D g (a) : R^{n} \to R^{m}$ exists with $$ g(a + h) = g(a) + Dg(a) h + \rho_g(h), \qquad \lim_{h \to 0} \frac{|\rho_g(h)|}{|h|} = 0. $$ The map $D g (a)$ is the (Fréchet) derivative of $g$ at $a$ . In the standard bases of $R^{n}$ and $R^{m}$ , the matrix of $D g (a)$ is the Jacobian matrix $J_{g} (a) = [\partial g_{i} / \partial x_{j} (a)]$ , an $m \times n$ array whose $(i, j)$ -entry is the partial derivative of the $i$ -th component of $g$ with respect to the $j$ -th variable, evaluated at $a$ . Analogous notation applies to $f$ at $b = g (a) \in V$ , where $D f (b) : R^{m} \to R^{p}$ has Jacobian matrix $J_{f} (b)$ of shape $p \times m$ .

The composition rule states: if $g$ is differentiable at $a$ and $f$ is differentiable at $b = g (a)$ , then $f \circ g$ is differentiable at $a$ with $$ D(f \circ g)(a) = Df(g(a)) \circ Dg(a). $$ In Jacobian-matrix form, $J_{f \circ g} (a) = J_{f} (g (a)) \cdot J_{g} (a)$ , an ordinary matrix product of shapes $p \times m$ and $m \times n$ , producing a $p \times n$ matrix. Following Apostol ^{[Apostol Ch. 8 §8.18–8.21]}.
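To see the shapes $p \times m$ , $m \times n$ , and $p \times n$ in action, here is a small numerical sketch. The maps `g` and `f`, the base point `a`, and the finite-difference helper `jacobian` are all assumptions chosen for illustration (with $n = 2$ , $m = 3$ , $p = 2$ ); finite differences only approximate the Jacobians.

```python
import math

def jacobian(func, point, step=1e-6):
    """Central-difference Jacobian of func at point, returned as a list of rows."""
    n_out = len(func(point))
    rows = [[0.0] * len(point) for _ in range(n_out)]
    for j in range(len(point)):
        plus = list(point)
        plus[j] += step
        minus = list(point)
        minus[j] -= step
        f_plus, f_minus = func(plus), func(minus)
        for i in range(n_out):
            rows[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return rows

def matmul(left, right):
    """Plain matrix product of two lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))] for i in range(len(left))]

def g(p):
    """Illustrative inner map R^2 -> R^3."""
    x, y = p
    return [x * y, x + y, math.sin(x)]

def f(q):
    """Illustrative outer map R^3 -> R^2."""
    u, v, w = q
    return [u * v, v + w * w]

def compose(p):
    return f(g(p))

a = [0.5, -1.2]
product = matmul(jacobian(f, g(a)), jacobian(g, a))   # (2 x 3)(3 x 2) -> 2 x 2
direct = jacobian(compose, a)                          # Jacobian of f o g at a
max_gap = max(abs(product[i][j] - direct[i][j]) for i in range(2) for j in range(2))
print(product)
print(direct)
print(max_gap)
```

Note that the outer Jacobian is taken at `g(a)` and the inner one at `a`; the two printed matrices should agree to within finite-difference error.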

A few equivalent restatements are worth recording.

Component form. Writing $f = (f_{1}, \dots, f_{p})$ and $g = (g_{1}, \dots, g_{m})$ , the $(i, j)$ -entry of $J_{f \circ g} (a)$ is the sum $\sum_{k = 1}^{m} \partial f_{i} / \partial y_{k} (g (a)) \cdot \partial g_{k} / \partial x_{j} (a)$ , the standard scalar-form chain rule of multivariable calculus.
Curve form. For a differentiable curve $γ : I \to V$ and a differentiable scalar $f : V \to R$ , the derivative of $f \circ γ$ at $t \in I$ equals the inner product of the gradient $\nabla f (γ (t))$ with the tangent vector $γ^{'} (t)$ . This is the $p = 1$ specialisation of the general statement.
Order convention. Composition is read right-to-left: $g$ acts first, then $f$ . The matrix product $J_{f} \cdot J_{g}$ inherits the same right-to-left convention. Reversing the order gives a product that is in general either undefined or a different, incorrect matrix.
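A worked instance of the component form, added here as an illustration (the polar-coordinate map and the symbols $r$ , $θ$ are assumptions of the example, not part of the statements above): take $g (r, θ) = (r cos θ, r sin θ)$ and a differentiable scalar $f (x, y)$ , so $n = m = 2$ and $p = 1$ . With the partials of $f$ evaluated at the intermediate point $g (r, θ)$ , the Jacobian product reads $$ \begin{pmatrix} \dfrac{\partial (f \circ g)}{\partial r} & \dfrac{\partial (f \circ g)}{\partial \theta} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{pmatrix} \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix}, $$ which unpacks entry by entry to $$ \frac{\partial (f \circ g)}{\partial r} = \frac{\partial f}{\partial x} \cos\theta + \frac{\partial f}{\partial y} \sin\theta, \qquad \frac{\partial (f \circ g)}{\partial \theta} = -r \sin\theta \, \frac{\partial f}{\partial x} + r \cos\theta \, \frac{\partial f}{\partial y}. $$ For $f (x, y) = x^{2} + y^{2}$ these give $2 r$ and $0$ , matching the direct computation $(f \circ g) (r, θ) = r^{2}$ .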

Counterexamples to common slips

Differentiability is stronger than the existence of partial derivatives. A map can have every partial derivative at $a$ and still fail to be differentiable at $a$ , in which case the chain rule does not apply. The classical witness is $g (x, y) = x y / (x^{2} + y^{2})$ extended by $g (0, 0) = 0$ : the partials $\partial g / \partial x (0, 0)$ and $\partial g / \partial y (0, 0)$ both equal $0$ , yet $g$ is not even continuous at the origin and therefore not differentiable there.
Continuity of the partials is the standard sufficient condition. If every $\partial g_{i} / \partial x_{j}$ exists and is continuous on $U$ , then $g$ is differentiable on $U$ . This is the practical hypothesis under which one normally invokes the chain rule.
Matrix order matters. The product is $J_{f} (g (a)) \cdot J_{g} (a)$ , in that order. The swapped product $J_{g} (a) \cdot J_{f} (g (a))$ is not even defined unless $n = p$ , and when it is defined it has shape $m \times m$ where the chain rule wants a $p \times n$ matrix, with entries unrelated to the correct ones.
The inner derivative is evaluated at the input $a$ ; the outer derivative is evaluated at the intermediate point $g (a)$ . A reader who evaluates both at $a$ uses the wrong base point for $J_{f}$ (which need not even be defined at $a$ when $n \neq m$ ) and recovers a numerically wrong derivative.
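The first slip in this list can be observed directly. The sketch below is an illustration under stated assumptions (the step sizes and helper names are invented for the example): it evaluates the classical witness $x y / (x^{2} + y^{2})$ along the axes, where the difference quotients vanish, and along the diagonal $y = x$ , where the values stay at $1/2$ .

```python
def g(x, y):
    """xy / (x^2 + y^2), extended by the value 0 at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x * x + y * y)

def partial_x_at_origin(step):
    """Difference quotient for the x-partial at (0, 0)."""
    return (g(step, 0.0) - g(0.0, 0.0)) / step

def partial_y_at_origin(step):
    """Difference quotient for the y-partial at (0, 0)."""
    return (g(0.0, step) - g(0.0, 0.0)) / step

steps = [10.0 ** -k for k in range(1, 8)]
partials = [(partial_x_at_origin(s), partial_y_at_origin(s)) for s in steps]
diagonal_values = [g(s, s) for s in steps]   # approach the origin along y = x

print(partials)          # every difference quotient is zero
print(diagonal_values)   # the values do not tend to g(0, 0) = 0
```

Both partials exist and equal $0$ , yet the function does not tend to its value at the origin along the diagonal, so it is not continuous there and the chain rule cannot be invoked at that point.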

Key theorem with proof [Intermediate+]

Theorem (multi-variable chain rule). Let $U \subseteq R^{n}$ and $V \subseteq R^{m}$ be open. Let $g : U \to R^{m}$ be differentiable at $a \in U$ with $g (U) \subseteq V$ , and let $f : V \to R^{p}$ be differentiable at $b = g (a)$ . Then $f \circ g$ is differentiable at $a$ and $$ D(f \circ g)(a) = Df(b) \circ Dg(a). $$

Proof. Set $A = D g (a) : R^{n} \to R^{m}$ and $B = D f (b) : R^{m} \to R^{p}$ . Define the two remainders $$ \rho_g(h) = g(a + h) - g(a) - A h, \qquad \rho_f(k) = f(b + k) - f(b) - B k, $$ defined for $h$ small enough that $a + h \in U$ and for $k$ small enough that $b + k \in V$ . By differentiability of $g$ at $a$ and of $f$ at $b$ , $$ \lim_{h \to 0} \frac{|\rho_g(h)|}{|h|} = 0, \qquad \lim_{k \to 0} \frac{|\rho_f(k)|}{|k|} = 0. $$

Fix $h \in R^{n}$ small. Set $k = k (h) = g (a + h) - g (a) = A h + ρ_{g} (h)$ , the intermediate displacement produced by $g$ . Compute the composition: \begin{align} (f \circ g)(a + h) &= f(g(a + h)) = f(b + k) \\ &= f(b) + B k + \rho_f(k) \\ &= f(b) + B(A h + \rho_g(h)) + \rho_f(k) \\ &= f(g(a)) + (B A) h + B \rho_g(h) + \rho_f(k). \end{align}

The candidate linear approximation at $a$ is $B A = D f (b) \circ D g (a)$ . The remainder of $f \circ g$ at $a$ is therefore $$ R(h) = B \rho_g(h) + \rho_f(k(h)). $$ To establish differentiability of $f \circ g$ at $a$ with derivative $B A$ , the requirement is $∥ R (h) ∥/∥ h ∥ \to 0$ as $h \to 0$ .

For the first term, by linearity of $B$ and the operator-norm bound, $$ |B \rho_g(h)| \leq |B|_{\mathrm{op}} \cdot |\rho_g(h)|. $$ Since $∥ ρ_{g} (h) ∥/∥ h ∥ \to 0$ as $h \to 0$ , the ratio $∥ B ρ_{g} (h) ∥/∥ h ∥$ tends to $0$ as well.

For the second term, two estimates combine. The first bounds $∥ k (h) ∥$ in terms of $∥ h ∥$ . From $k = A h + ρ_{g} (h)$ and the triangle inequality, $$ |k(h)| \leq |A|_{\mathrm{op}} |h| + |\rho_g(h)| \leq (|A|_{\mathrm{op}} + 1) |h| $$ for all $h$ sufficiently small that $∥ ρ_{g} (h) ∥ \leq ∥ h ∥$ , which holds eventually because $∥ ρ_{g} (h) ∥/∥ h ∥ \to 0$ . Set $C = ∥ A ∥_{op} + 1$ , so $∥ k (h) ∥ \leq C ∥ h ∥$ on a small ball about $0$ . Note also that $k (h) \to 0$ as $h \to 0$ because $g$ is continuous at $a$ (differentiability implies continuity), so $g (a + h) \to g (a) = b$ .

The second estimate is the little-$o$ property of $ρ_{f}$ . Given $ε > 0$ , choose $η > 0$ so that $∥ k ∥ < η$ implies $∥ ρ_{f} (k) ∥ \leq ε ∥ k ∥/ C$ (a non-strict bound, so that the case $k = 0$ , where both sides vanish, is covered). Choose $δ > 0$ so that $∥ h ∥ < δ$ forces both $∥ k (h) ∥ < η$ and the bound $∥ k (h) ∥ \leq C ∥ h ∥$ . Then $0 < ∥ h ∥ < δ$ implies $$ |\rho_f(k(h))| \leq \frac{\varepsilon |k(h)|}{C} \leq \frac{\varepsilon \cdot C |h|}{C} = \varepsilon |h|, $$ hence $∥ ρ_{f} (k (h)) ∥/∥ h ∥ \leq ε$ . Since $ε$ was arbitrary, $∥ ρ_{f} (k (h)) ∥/∥ h ∥ \to 0$ as $h \to 0$ .

Combining the two bounds, $$ \frac{|R(h)|}{|h|} \leq \frac{|B|_{\mathrm{op}} \cdot |\rho_g(h)|}{|h|} + \frac{|\rho_f(k(h))|}{|h|} \xrightarrow[h \to 0]{} 0. $$ The candidate linear map $B A$ satisfies the differentiability defining condition for $f \circ g$ at $a$ . Uniqueness of the derivative (the linear map approximating $f \circ g$ at $a$ to first order is determined by its values on a basis through partial derivatives along coordinate directions) identifies $D (f \circ g) (a) = B A = D f (b) \circ D g (a)$ . $□$
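The conclusion $∥ R (h) ∥/∥ h ∥ \to 0$ can be watched numerically. The sketch below is illustrative only: the maps `g` and `f`, the base point `a`, and the direction of approach are assumptions chosen so that $A$ , $B$ , and $B A$ can be written down by hand.

```python
import math

def g(x, y):
    """Illustrative inner map R^2 -> R^2."""
    return (x * x - y, x * y)

def f(u, v):
    """Illustrative outer map R^2 -> R."""
    return u * v + v * v

a = (1.0, 2.0)
A = [[2 * a[0], -1.0], [a[1], a[0]]]     # Dg(a): rows are gradients of the components of g
b = g(*a)                                # intermediate point b = g(a)
B = [b[1], b[0] + 2 * b[1]]              # Df(b) as a row: (v, u + 2v) at b
BA = [B[0] * A[0][j] + B[1] * A[1][j] for j in range(2)]   # candidate derivative of f o g

def remainder_ratio(h):
    """|R(h)| / |h| with R(h) = (f o g)(a + h) - (f o g)(a) - (BA) h."""
    value = f(*g(a[0] + h[0], a[1] + h[1]))
    linear = f(*b) + BA[0] * h[0] + BA[1] * h[1]
    return abs(value - linear) / math.hypot(h[0], h[1])

direction = (0.6, 0.8)                   # a fixed unit direction of approach
scales = (1e-1, 1e-2, 1e-3, 1e-4)
ratios = [remainder_ratio((s * direction[0], s * direction[1])) for s in scales]
print(BA)
print(ratios)
```

For smooth maps such as these the ratio shrinks roughly in proportion to $∥ h ∥$ ; the proof only needs it to tend to $0$ . A single direction of approach is of course evidence, not a proof.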

Bridge. The composition rule is the structural backbone of differential calculus on Euclidean spaces, and four neighbouring frames lock into it. First, the proof reduces to the multi-variable limit and continuity of 02.05.01: continuity of $g$ at $a$ is what makes $k (h) \to 0$ , which is what unlocks the $ρ_{f}$ estimate, and the path-independence requirement of 02.05.01 is what makes the linear-approximation language well-defined regardless of how $h$ approaches $0$ . Second, the rule connects to the implicit and inverse function theorems through a single corollary: if $D g (a)$ is invertible and $g$ has a local inverse $g^{- 1}$ that is differentiable at $g (a)$ , the chain rule applied to $g \circ g^{- 1} = id$ forces $D g^{- 1} (g (a)) = D g (a)^{- 1}$ , so the derivative of the inverse equals the inverse of the derivative, the key identity behind local invertibility. Third, the rule generalises to higher derivatives through the Faà di Bruno formula, in which the $n$ -th derivative of $f \circ g$ is a sum over set partitions of $\{1, \dots, n\}$ with combinatorial coefficients; the bare chain rule is the case $n = 1$ with a single one-block partition. Fourth, the rule is the foundational reason that calculus has a local-to-global theory: it pushes derivatives through coordinate changes, through diffeomorphism reparametrisations, through pullbacks of differential forms, and through the local-coordinate patching that defines manifolds. Read together, the four bridges identify the chain rule as the load-bearing functoriality that lets the differential calculus of $R^{n}$ extend coherently to curves, surfaces, manifolds, bundles, and beyond.

Exercises [Intermediate+]

Exercise 5 (medium, short-answer).

Prove the directional-derivative form of the chain rule: if $f : R^{n} \to R$ is differentiable at $a$ and $γ : I \to R^{n}$ is a differentiable curve with $γ (t_{0}) = a$ , then $(f \circ γ)^{'} (t_{0}) = \nabla f (a) \cdot γ^{'} (t_{0})$ , the dot product of the gradient at $a$ with the velocity vector at $t_{0}$ .

Hint

Specialise the matrix-form chain rule to $n$ -inputs / $1$ -output for $f$ and $1$ -input / $n$ -outputs for $γ$ .

Answer

Apply the general chain rule to the composition $f \circ γ$ : $J_{f \circ γ} (t_{0}) = J_{f} (γ (t_{0})) \cdot J_{γ} (t_{0})$ . $J_{f} (γ (t_{0}))$ is the $1 \times n$ row whose entries are the partial derivatives $\partial f / \partial x_{i}$ at $a = γ (t_{0})$ — exactly the gradient row $\nabla f (a)$ . $J_{γ} (t_{0})$ is the $n \times 1$ column whose entries are $γ_{i}^{'} (t_{0})$ — the velocity vector $γ^{'} (t_{0})$ . The row-by-column product is the scalar $\sum_{i} \partial f / \partial x_{i} (a) \cdot γ_{i}^{'} (t_{0}) = \nabla f (a) \cdot γ^{'} (t_{0})$ . Rubric: full credit for the specialisation and the inner-product identification.

Exercise 6 (hard, short-answer).

Prove the derivative-of-the-inverse identity: if $g : U \to V$ is a $C^{1}$ bijection between open subsets of $R^{n}$ , if $D g (a)$ is invertible, and if $g^{- 1}$ is differentiable at $b = g (a)$ , then $D g^{- 1} (b) = D g (a)^{- 1}$ .

Hint

Apply the chain rule to the identity $g^{- 1} \circ g = id_{U}$ .

Answer

The identity map on $U$ has derivative the identity linear map $I_{n} : R^{n} \to R^{n}$ at every point. Apply the chain rule to the composition $g^{- 1} \circ g$ , which equals the identity on $U$ : $$ I_n = D(g^{-1} \circ g)(a) = Dg^{-1}(g(a)) \circ Dg(a) = Dg^{-1}(b) \circ Dg(a). $$ Since $D g (a)$ is invertible, compose on the right with $D g (a)^{- 1}$ : $$ Dg^{-1}(b) = I_n \circ Dg(a)^{-1} = Dg(a)^{-1}. $$ In Jacobian-matrix form, $J_{g^{- 1}} (b) = J_{g} (a)^{- 1}$ . The identity is the prelude to the inverse function theorem 02.05.04, which produces the differentiability of $g^{- 1}$ from invertibility of $D g (a)$ as its conclusion rather than its hypothesis. Rubric: full credit for the chain-rule expansion and the inversion.
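A concrete check of the identity $J_{g^{- 1}} (b) = J_{g} (a)^{- 1}$ , sketched under assumptions: the polar-coordinate map $g (r, θ) = (r cos θ, r sin θ)$ on $r > 0$ , $- π < θ < π$ is chosen here as an example, with inverse $(x, y) \mapsto (\sqrt{x^{2} + y^{2}}, atan2 (y, x))$ , and both Jacobians are entered by hand.

```python
import math

def jacobian_g(r, theta):
    """Jacobian of g(r, theta) = (r cos theta, r sin theta)."""
    return [[math.cos(theta), -r * math.sin(theta)],
            [math.sin(theta), r * math.cos(theta)]]

def jacobian_g_inverse(x, y):
    """Jacobian of the inverse map (x, y) -> (sqrt(x^2 + y^2), atan2(y, x))."""
    r2 = x * x + y * y
    r = math.sqrt(r2)
    return [[x / r, y / r],
            [-y / r2, x / r2]]

def matmul2(left, right):
    """Product of two 2 x 2 matrices given as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

r, theta = 2.0, 0.9                       # the point a
x, y = r * math.cos(theta), r * math.sin(theta)   # the point b = g(a)
product = matmul2(jacobian_g_inverse(x, y), jacobian_g(r, theta))
print(product)   # should be the 2 x 2 identity up to rounding
```

The product of the inverse's Jacobian at $b$ with the Jacobian of $g$ at $a$ comes out as the identity matrix, which is the matrix form of $D g^{- 1} (b) \circ D g (a) = I_{n}$ .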

Exercise 7 (hard, short-answer).

Let $f : R^{n} \to R^{n}$ be $C^{1}$ and assume $det J_{f} (a) \neq 0$ . Use the chain rule to prove that, for any $C^{1}$ curve $γ$ with $γ (0) = a$ and $γ^{'} (0) \neq 0$ , the image curve $f \circ γ$ has nonzero velocity at $f (a)$ . (This is the local invertibility witness powering the inverse function theorem.)

Hint

The chain rule gives $(f \circ γ)^{'} (0) = J_{f} (a) \cdot γ^{'} (0)$ . Use invertibility of $J_{f} (a)$ .

Answer

By the chain rule, $(f \circ γ)^{'} (0) = J_{f} (γ (0)) \cdot γ^{'} (0) = J_{f} (a) \cdot γ^{'} (0)$ . The condition $det J_{f} (a) \neq 0$ means $J_{f} (a)$ is invertible, hence injective as a linear map. Since $γ^{'} (0) \neq 0$ is a nonzero input vector and $J_{f} (a)$ is injective, the output $J_{f} (a) \cdot γ^{'} (0)$ is nonzero. So $(f \circ γ)^{'} (0) \neq 0$ , and the image curve $f \circ γ$ has nonzero velocity at $f (a)$ . The argument shows that nondegenerate tangent directions at $a$ map to nondegenerate tangent directions at $f (a)$ , the linear-algebraic content of the regular-value theorem for self-maps of $R^{n}$ . Rubric: full credit for the chain-rule expansion plus the injectivity argument.

Lean formalization [Intermediate+]

lean_status: partial — Mathlib provides the multi-variable chain rule in Fréchet-derivative form through HasFDerivAt.comp and fderiv.comp, together with the continuous-linear-map composition on EuclideanSpace ℝ (Fin n). The Jacobian-matrix interpretation comes through ContinuousLinearMap.toMatrix paired with the standard basis on EuclideanSpace. The textbook-style packaging in Apostol notation and the Faà di Bruno higher-order chain rule under one named result is the Codex-facing gap.


The companion module at Codex.Analysis.MultiVariable.ChainRule re-exports these statements and records the unification gap.

Advanced results [Master]

Banach-space chain rule. Let $X$ , $Y$ , $Z$ be Banach spaces, $U \subseteq X$ and $V \subseteq Y$ be open, and let $g : U \to Y$ be Fréchet-differentiable at $a$ with $g (U) \subseteq V$ and $f : V \to Z$ be Fréchet-differentiable at $b = g (a)$ . Then $f \circ g$ is Fréchet-differentiable at $a$ with $D (f \circ g) (a) = D f (b) \circ D g (a)$ . The proof transcribes the Euclidean argument with operator norms in place of the matrix operator norm; completeness of $X$ , $Y$ , $Z$ is not used directly in the chain rule itself, which holds for normed spaces in general, but is invoked in downstream constructions (inverse function theorem, ODE existence) that depend on the chain rule ^{[Dieudonné Ch. VIII]}.

Pushforward on tangent vectors. Let $ϕ : M \to N$ be a smooth map of smooth manifolds. The differential at $p \in M$ , $ϕ_{*, p} : T_{p} M \to T_{ϕ (p)} N$ , is the linear map on tangent spaces induced by $ϕ$ . The composition $ψ \circ ϕ$ has differential $(ψ \circ ϕ)_{*, p} = ψ_{*, ϕ (p)} \circ ϕ_{*, p}$ . The Euclidean chain rule of this unit is the specialisation when $M = R^{n}$ , $N = R^{m}$ in standard coordinates. Functoriality of the tangent functor on the category of smooth manifolds is precisely this statement.

Pullback on differential forms. For a smooth map $ϕ : M \to N$ and a differential form $ω$ on $N$ , the pullback $ϕ^{*} ω$ satisfies $(ψ \circ ϕ)^{*} = ϕ^{*} \circ ψ^{*}$ . The reversal of order from pushforward to pullback reflects the contravariance of the form bundle. The chain rule of this unit is the $0$ -form / function-pullback specialisation $(ψ \circ ϕ)^{*} f = f \circ ψ \circ ϕ$ , paired with the $1$ -form pullback formula $ϕ^{*} (df) = d (f \circ ϕ)$ , which itself encodes the chain rule.

Faà di Bruno formula. Let $f, g : R \to R$ be $n$ -times differentiable at the appropriate points. Then $$ \frac{d^n}{dx^n} (f \circ g)(x) = \sum_{\pi \in \mathrm{Part}(n)} f^{(|\pi|)}(g(x)) \prod_{B \in \pi} g^{(|B|)}(x), $$ where the sum runs over all set partitions $π$ of $\{1, \dots, n\}$ , $∣ π ∣$ is the number of blocks of $π$ , and the product is over the blocks $B$ of $π$ ^{[Faà di Bruno 1855]}. The bare chain rule is the $n = 1$ case with the single-block partition. The multivariable generalisation replaces real-valued $f^{(k)}$ with $k$ -multilinear maps and sums over compositions indexed by partition-and-flag data; Cartan's notation packages the construction efficiently.
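The set-partition sum can be evaluated mechanically for small $n$ . The following sketch is an illustration under assumptions: it fixes $f = exp$ and $g (x) = x^{2}$ (chosen because every derivative is known in closed form), and the helper names are invented for the example.

```python
import math

def set_partitions(items):
    """Yield every set partition of the list items, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a new singleton block
        yield [[first]] + partition

def g_derivative(order, x):
    """Derivatives of the assumed inner function g(x) = x^2 (order 0 is g itself)."""
    return {0: x * x, 1: 2 * x, 2: 2.0}.get(order, 0.0)

def f_derivative(order, y):
    """Every derivative of the assumed outer function f = exp is exp."""
    return math.exp(y)

def faa_di_bruno(n, x):
    """n-th derivative of f(g(x)) as a sum over set partitions of {1, ..., n}."""
    total = 0.0
    for partition in set_partitions(list(range(1, n + 1))):
        term = f_derivative(len(partition), g_derivative(0, x))
        for block in partition:
            term *= g_derivative(len(block), x)
        total += term
    return total

x = 0.5
print(faa_di_bruno(1, x), 2 * x * math.exp(x * x))                     # chain rule
print(faa_di_bruno(3, x), (8 * x ** 3 + 12 * x) * math.exp(x * x))     # third derivative
```

For $n = 3$ the five partitions contribute $0$ (one block), three copies of $4 x e^{x^{2}}$ (two blocks), and $8 x^{3} e^{x^{2}}$ (three blocks), which sums to the closed form $(8 x^{3} + 12 x) e^{x^{2}}$ printed alongside.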

Itô formula. Let $X_{t}$ be a continuous semimartingale with quadratic-variation process $[X, X]_{t}$ , and let $f \in C^{2} (R)$ . Then $$ f(X_t) = f(X_0) + \int_0^t f'(X_s) \, dX_s + \frac{1}{2} \int_0^t f''(X_s) \, d[X, X]_s. $$ The second integral is the stochastic correction term, absent from the deterministic chain rule; it is present whenever $X$ has nonzero quadratic variation, as Brownian motion does ^{[Itô 1944]}. The formula extends to vector-valued semimartingales and $C^{2}$ functions of several variables, with the correction term involving the Hessian against the matrix-valued quadratic variation. Itô's discovery in 1944 is the foundation of stochastic calculus.
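A discrete-time sketch makes the correction term visible. Everything specific here is an assumption for illustration: $X$ is taken to be a simulated Brownian path $W$ with $W_{0} = 0$ on $[0, 1]$ , $f (x) = x^{2}$ , the integrals are replaced by left-endpoint sums, and the seed and step count are arbitrary.

```python
import math
import random

def simulate_brownian_increments(n_steps, horizon, seed=0):
    """Independent N(0, dt) increments of a simulated Brownian path."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

def ito_terms_for_square(increments):
    """For f(x) = x^2 and W_0 = 0, return
    (f(W_T) - f(W_0), sum of f'(W_k) dW_k, half the sum of f''(W_k) dW_k^2)."""
    w = 0.0
    stochastic_integral = 0.0
    correction = 0.0
    for dw in increments:
        stochastic_integral += 2.0 * w * dw     # f'(W_k) times the increment, left endpoint
        correction += 0.5 * 2.0 * dw * dw       # (1/2) f''(W_k) times the squared increment
        w += dw
    return w * w, stochastic_integral, correction

horizon = 1.0
increments = simulate_brownian_increments(20000, horizon)
lhs, integral, correction = ito_terms_for_square(increments)
print(lhs, integral + correction)   # equal up to rounding
print(correction, horizon)          # the correction is close to [W, W]_T = T
```

For $f (x) = x^{2}$ the discrete identity $W_{k + 1}^{2} - W_{k}^{2} = 2 W_{k} ΔW_{k} + (ΔW_{k})^{2}$ is exact, so the first pair agrees to rounding; the sum of squared increments stays near $T$ instead of vanishing, which is the term the deterministic chain rule lacks.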

Synthesis. Five observations organise the unit. First, the chain rule reduces to four ingredients: the linear-approximation definition of the derivative, the operator-norm bound $∥ B ρ_{g} (h) ∥ \leq ∥ B ∥ \cdot ∥ ρ_{g} (h) ∥$ , the continuity of $g$ at $a$ that produces $k (h) \to 0$ , and the triangle inequality used to bound $∥ k (h) ∥$ in terms of $∥ h ∥$ . The four ingredients are precisely the basic machinery of the multi-variable limit and continuity unit 02.05.01, reorganised. Second, the rule supports both the Fréchet-derivative coordinate-free form and the Jacobian-matrix coordinate form simultaneously; the matrix product is the standard-basis representation of the operator composition. Third, the rule generalises smoothly to Banach spaces, with no change in the proof beyond the swap of matrix norms for operator norms; the Banach-space chain rule is the abstract platform from which the implicit function theorem, the Picard-Lindelöf theorem for ODEs, and Newton's method on Banach spaces all descend. Fourth, the rule is the foundational functoriality of differentiation: the tangent functor on the category of smooth manifolds, the pullback contravariantly on the de Rham complex, and the pushforward covariantly on tangent vectors all are the chain rule, dressed in categorical clothing. Fifth, the rule has a stochastic refinement — the Itô formula — in which the quadratic-variation correction term measures the failure of the classical chain rule for paths with nonzero quadratic variation; the correction term vanishes for processes of bounded variation, recovering the classical statement.

Full proof set [Master]

Multi-variable chain rule. Proved in §"Key theorem with proof" above by the linear-approximation argument with the two remainders $ρ_{g}$ and $ρ_{f}$ and the operator-norm bound on the outer linear map.

Derivative of the inverse. Proved as Exercise 6 by applying the chain rule to $g^{- 1} \circ g = id_{U}$ .

Directional-derivative form. Proved as Exercise 5 by specialising the general matrix form to a $1 \times n$ row times an $n \times 1$ column.

Banach-space chain rule. Statement above. The Euclidean proof transcribes verbatim with $∥ \cdot ∥_{op}$ now the operator norm on bounded linear maps between Banach spaces and with $ρ_{g}$ , $ρ_{f}$ defined by the same $o (∥ h ∥)$ condition. The key inequality $∥ B ρ_{g} (h) ∥ \leq ∥ B ∥_{op} ∥ ρ_{g} (h) ∥$ holds for bounded operators on normed spaces, and the rest of the proof uses only the triangle inequality and the $o (∥ h ∥)$ definitions. $□$

Faà di Bruno (sketch). Statement above. Proof by induction on $n$ . The base case $n = 1$ is the bare chain rule. The induction step differentiates the formula for $n - 1$ derivatives once more and reorganises the sum over partitions: each way of extending a partition of $\{1, \dots, n - 1\}$ to a partition of $\{1, \dots, n\}$ (placing $n$ in a new singleton block or adding it to an existing block) corresponds to a term in the derivative of the previous-stage summand. The combinatorial bookkeeping is the content of the formula; the analytic content is the chain rule applied $n$ times. Full proof in Faà di Bruno ^{[Faà di Bruno 1855]} and modern accounts via exponential generating functions for set-partition statistics. $□$

Tangent-functor functoriality. Statement above (pushforward on tangent vectors). The differential $ϕ_{*, p}$ at $p$ is defined via the chain rule applied to test functions: $ϕ_{*, p} (v) (f) = v (f \circ ϕ)$ for a tangent vector $v \in T_{p} M$ and a smooth function $f$ near $ϕ (p)$ . The composition identity $(ψ \circ ϕ)_{*, p} (v) (f) = v (f \circ ψ \circ ϕ)$ unpacks by associativity of composition. Applying the chain rule in local coordinates on $M$ , $N$ , and the target $R$ shows the linear-map identity $(ψ \circ ϕ)_{*, p} = ψ_{*, ϕ (p)} \circ ϕ_{*, p}$ . $□$

Itô formula (sketch). Statement above. The proof discretises $X_{t}$ on a partition $0 = t_{0} < t_{1} < \dots < t_{N} = t$ and expands $f (X_{t_{k + 1}}) - f (X_{t_{k}})$ to second order via Taylor's theorem. The first-order terms $f^{'} (X_{t_{k}}) (X_{t_{k + 1}} - X_{t_{k}})$ converge in probability to $\int_{0}^{t} f^{'} (X_{s}) d X_{s}$ . The second-order terms $\frac{1}{2} f^{''} (X_{t_{k}}) (X_{t_{k + 1}} - X_{t_{k}})^{2}$ converge to $\frac{1}{2} \int_{0}^{t} f^{''} (X_{s}) d [X, X]_{s}$ , the correction term, by the definition of quadratic variation. Higher-order Taylor terms are negligible because the partition mesh shrinks. Full proof in Itô ^{[Itô 1944]} and modern stochastic-calculus texts. $□$

Connections [Master]

Multi-variable limit and continuity 02.05.01 — the chain rule's proof rests on continuity of $g$ at $a$ , the linear-approximation form of differentiability, and the operator-norm bound, all of which are the machinery the limit-and-continuity unit assembles. Without the path-independence requirement of 02.05.01, the linear-approximation language is not well-posed, and the chain rule loses its meaning.

Partial derivative and the differential (pending unit 02.05.02) — the chain rule's statement names the Fréchet derivative $D (f \circ g) (a)$ , whose existence requires the differentiability concept developed in the partial-derivative unit. The Jacobian-matrix form is the standard-basis representation of the operator composition. The chain rule is the principal computational theorem of the partial-derivative framework.

Implicit and inverse function theorems (pending unit 02.05.04) — the inverse function theorem produces a $C^{1}$ inverse $g^{- 1}$ from invertibility of $D g (a)$ ; the chain-rule identity $D g^{- 1} (b) = D g (a)^{- 1}$ (Exercise 6) is the bridge between the existence statement and the derivative formula. The implicit function theorem is a corollary of the inverse function theorem and inherits the chain-rule identity for its derivative formulas.

Smooth manifold 03.02.01 — the chain rule is the foundational reason that local-coordinate transition maps preserve the differential structure: a smooth atlas demands that overlapping charts compose to give smooth coordinate changes, and the chain rule lets one chart's partial derivatives translate to another. The tangent functor on the category of smooth manifolds is the chain rule, packaged categorically.

Differential forms and exterior derivative 03.04.04 — the pullback of a differential form $ω$ under a smooth map $ϕ$ satisfies $(ψ \circ ϕ)^{*} = ϕ^{*} \circ ψ^{*}$ , the contravariant functoriality of the form complex; specialisation to $0$ -forms (functions) gives the function-pullback chain rule. The identity $ϕ^{*} (d ω) = d (ϕ^{*} ω)$ — the exterior derivative commutes with pullback — is the chain rule in another guise.

Stokes's theorem and de Rham cohomology [03.04.05–06] — the change-of-variables formula for multi-variable integration, whose differential-forms version is $\int_{M} ϕ^{*} ω = \int_{ϕ (M)} ω$ for an orientation-preserving diffeomorphism $ϕ$ , has as its Jacobian-correction factor $∣ det D ϕ ∣$ , the determinant of precisely the chain-rule Jacobian of this unit.

Ordinary differential equations (pending chapter 02.06) — the existence theorem for solutions to $\dot{x} = F (x)$ with $F : R^{n} \to R^{n}$ smooth, the Picard-Lindelöf theorem, uses Banach fixed-point on a function space; differentiability of the flow with respect to the initial condition is computed via the chain rule, with the variational equation $\dot{Φ} = D F (x (t)) Φ$ governing the derivative of the flow.
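The variational equation can be seen as the chain rule applied step by step. In the sketch below, which rests on assumptions made for illustration (a scalar logistic right-hand side $F (x) = x (1 - x)$ , forward-Euler time stepping, and invented helper names), each Euler step $x \mapsto x + Δt \, F (x)$ has derivative $1 + Δt \, F^{'} (x)$ , and the derivative of the composed steps is the product of these factors.

```python
def F(x):
    """Assumed right-hand side of the ODE x' = F(x): logistic growth."""
    return x * (1.0 - x)

def dF(x):
    """Derivative of F, the scalar stand-in for DF(x)."""
    return 1.0 - 2.0 * x

def euler_flow_with_derivative(x0, t_end, n_steps):
    """Euler-integrate x' = F(x) and Phi' = dF(x) Phi together, with Phi(0) = 1."""
    dt = t_end / n_steps
    x, phi = x0, 1.0
    for _ in range(n_steps):
        # chain rule for one Euler step: multiply by the step's derivative 1 + dt * dF(x)
        phi = (1.0 + dt * dF(x)) * phi
        x = x + dt * F(x)
    return x, phi

def euler_flow(x0, t_end, n_steps):
    """The discrete flow map alone, used for the finite-difference comparison."""
    return euler_flow_with_derivative(x0, t_end, n_steps)[0]

x0, t_end, n_steps, eps = 0.2, 1.0, 1000, 1e-6
_, phi = euler_flow_with_derivative(x0, t_end, n_steps)
finite_difference = (euler_flow(x0 + eps, t_end, n_steps)
                     - euler_flow(x0 - eps, t_end, n_steps)) / (2 * eps)
print(phi, finite_difference)
```

The accumulated product `phi` matches the finite-difference sensitivity of the discrete flow to its initial condition; both approximate the continuous-time derivative of the flow up to the Euler discretisation error.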

Historical & philosophical context [Master]

Gottfried Wilhelm Leibniz introduced the chain rule in single-variable form in the 1684 Acta Eruditorum paper Nova methodus pro maximis et minimis ^{[Leibniz 1684]}, with the original $d y / d x \cdot d x / d t$ notation that survives in modern textbooks; the differential symbols $d x$ , $d y$ were Leibniz's invention and the chain rule was their first major computational payoff. Cauchy and Lagrange in the early nineteenth century gave rigorous proofs in the framework of single-variable analysis, with Cauchy's 1821 Cours d'analyse recording the $ε$ - $δ$ version.

The multi-variable version emerged through nineteenth-century pedagogical practice (Riemann's lectures, Jacobi's work on functional determinants, the implicit-function-theorem tradition of Dini) and was given its modern coordinate-free formulation by Élie Cartan around 1900 with the intrinsic differential $df$ , separated from its matrix representation. Apostol's 1969 Calculus Vol. 2 ^{[Apostol Ch. 8 §8.18–8.21]} packaged the Cartan-Dieudonné framing for an honours undergraduate audience, with the Jacobian-matrix form as the standard-basis incarnation. Francesco Faà di Bruno's 1855 paper ^{[Faà di Bruno 1855]} in the Annali di Scienze Matematiche e Fisiche gave the higher-order chain rule with set-partition coefficients; the combinatorial content was rediscovered independently several times before being attributed correctly. Kiyoshi Itô's 1944 Stochastic Integral ^{[Itô 1944]} in the Proceedings of the Imperial Academy of Tokyo extended the chain rule to stochastic processes with nonzero quadratic variation, introducing the correction term that defines Itô calculus and establishing Itô's foundational role in modern probability theory.

Bibliography [Master]


Prerequisites

02.05.01

Tier anchors

beginner: 3Blue1Brown style 'nested-function gears' framing; Strogatz informal kinematic 'rate-times-rate' picture
intermediate: Apostol *Calculus* Vol. 2 Ch. 8 §8.18–8.21; Rudin *Principles of Mathematical Analysis* Ch. 9; Spivak *Calculus on Manifolds* Ch. 2
master: Apostol *Calculus* Vol. 2 Ch. 8 §8.18–8.21; Dieudonné *Foundations of Modern Analysis* Ch. VIII; Cartan *Calcul Différentiel*; Faà di Bruno 1855 *Sullo sviluppo delle funzioni* (Annali di Scienze Matematiche e Fisiche 6); Itô 1944 *Stochastic Integral* (Proc. Imp. Acad. Tokyo 20)

References

Apostol — Calculus Vol. 2 · Ch. 8 §8.18–8.21, the composition rule for multi-variable differentiation and the Jacobian-matrix product
Rudin — Principles of Mathematical Analysis · Ch. 9, the chain rule for differentiable maps between Euclidean spaces
Spivak — Calculus on Manifolds · Ch. 2 Theorem 2-2, the chain rule with the linear-approximation proof
Dieudonné — Foundations of Modern Analysis · Ch. VIII §8.2, the chain rule in Banach spaces (Fréchet-derivative form)
Cartan — Calcul Différentiel · Ch. I, intrinsic differential and composition rule
Faà di Bruno 1855 — Sullo sviluppo delle funzioni · Annali di Scienze Matematiche e Fisiche 6, the higher-order chain rule with set-partition coefficients
Itô 1944 — Stochastic Integral · Proceedings of the Imperial Academy of Tokyo 20, the stochastic chain rule with the quadratic-variation correction
Leibniz 1684 — Nova methodus pro maximis et minimis · Acta Eruditorum, the original single-variable composition rule for differentials

Lean module

Codex.Analysis.MultiVariable.ChainRule

Mathlib gap

Mathlib provides the multi-variable chain rule in Fréchet-derivative
form through `HasFDerivAt.comp` and `fderiv.comp`, together with the
continuous-linear-map composition on `EuclideanSpace ℝ (Fin n)`. The
Jacobian-matrix interpretation comes through `LinearMap.toMatrix` and
`ContinuousLinearMap.toMatrix` paired with the standard basis on
`EuclideanSpace`. What is not packaged in Mathlib is a single
textbook-style namespace that names the composition rule
$D(f \circ g)(a) = Df(g(a)) \circ Dg(a)$ in Apostol notation, exhibits
the Jacobian-matrix product $J_{f \circ g}(a) = J_f(g(a)) \cdot J_g(a)$
as a corollary, and records the Faà di Bruno higher-order chain rule
with its set-partition combinatorics under the same namespace. The
Codex module collects these into the textbook presentation and records
the unification gap.

Reviewer

TBD

Estimated time

beginner: 15m
intermediate: 35m
master: 65m