Machine Learning Prelude – Fundamental Mathematics

Li Wei

May 18, 2026 · 9 min read

Advanced Mathematics

Derivative

Concept of Derivative

The derivative is a central concept in calculus. The derivative of a function at a point is the instantaneous rate of change of the function at that point, which equals the slope of the tangent line there. In essence, the derivative is a local linear approximation of the function, obtained through a limit.

When the independent variable receives an increment Δx at a point x, the function value changes by Δy = f(x + Δx) − f(x). If the ratio Δy/Δx has a limit as Δx → 0, that limit is the derivative at x, denoted by f′(x), dy/dx, or df/dx.

For example, in kinematics the derivative of an object’s displacement with respect to time is the object’s instantaneous velocity: v = ds/dt.
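
To make the limit concrete, the following minimal sketch evaluates the ratio Δy/Δx for shrinking increments. The displacement function s(t) = 5t² and the helper name difference_quotient are illustrative assumptions, not part of the original text.

```python
def difference_quotient(f, x, dx):
    """Return Δy/Δx = (f(x + dx) - f(x)) / dx for a nonzero increment dx."""
    return (f(x + dx) - f(x)) / dx


def s(t):
    """Hypothetical displacement s(t) = 5 t^2, chosen only for illustration."""
    return 5.0 * t ** 2


t0 = 2.0
exact_velocity = 10.0 * t0  # ds/dt = 10 t by the power rule

# As dx shrinks, the ratio approaches the instantaneous velocity at t0.
for dx in (1e-1, 1e-2, 1e-3, 1e-4):
    estimate = difference_quotient(s, t0, dx)
    error = abs(estimate - exact_velocity)
    print(f"dx={dx:g}  ratio={estimate:.6f}  error={error:.6f}")
```

The error shrinks roughly in proportion to dx, which is the behavior the limit definition predicts for a smooth function.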

Derivatives of Basic Functions

Explanation                               Formula
Derivative of a constant                  (c)′ = 0
Derivative of a power function            (xⁿ)′ = n·xⁿ⁻¹
Derivative of the exponential function    (eˣ)′ = eˣ
Derivative of the logarithmic function    (ln x)′ = 1/x
Derivative of trigonometric functions     (sin x)′ = cos x, (cos x)′ = −sin x, (tan x)′ = sec² x, …

Rules for Differentiation

Explanation      Formula
Sum rule         (f + g)′ = f′ + g′
Product rule     (fg)′ = f′g + fg′
Quotient rule    (f/g)′ = (f′g − fg′)/g²
Chain rule       (f∘g)′ = (f′∘g)·g′

For instance, to find the derivative of f(x) = (x³ + 2x) · eˣ at x = 0, apply the product rule: f′(x) = (3x² + 2) · eˣ + (x³ + 2x) · eˣ, so f′(0) = 2.
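
As a sanity check on this example, the sketch below compares a product-rule derivative with a central-difference estimate at x = 0. The helper names (f, f_prime, central_difference) and the step size h are illustrative choices, not something the original text prescribes.

```python
import math


def f(x):
    return (x ** 3 + 2 * x) * math.exp(x)


def f_prime(x):
    # Product rule (uv)' = u'v + uv' with u = x^3 + 2x and v = e^x.
    return (3 * x ** 2 + 2) * math.exp(x) + (x ** 3 + 2 * x) * math.exp(x)


def central_difference(g, x, h=1e-5):
    """Symmetric difference quotient, a common numerical derivative estimate."""
    return (g(x + h) - g(x - h)) / (2 * h)


print("product rule:      ", f_prime(0.0))
print("central difference:", central_difference(f, 0.0))
```

The two printed values should agree to many decimal places; a large gap would point to an algebra mistake in the hand-derived formula.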

Using Derivatives to Find Extrema

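Points where the derivative equals zero are called stationary points (a kind of critical point) and are the candidates for extrema. At such a point the function may attain a local maximum, a local minimum, or neither. To decide which, examine the sign of the derivative on either side of the point: a change from positive to negative indicates a local maximum, a change from negative to positive indicates a local minimum, and no sign change means the point is not an extremum.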

For example, f(x) = x³ has f′(0) = 0, but x = 0 is neither a maximum nor a minimum.
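
The sign check around a point where the derivative vanishes can be sketched as follows. The function name classify_by_sign_change and the probe distance eps are assumptions made for illustration; the check only samples one point on each side, so it is a heuristic rather than a proof.

```python
def classify_by_sign_change(f_prime, c, eps=1e-3):
    """First-derivative test at a point c where f'(c) = 0.

    Samples f' slightly to the left and right of c and reports
    what the sign pattern suggests.
    """
    left = f_prime(c - eps)
    right = f_prime(c + eps)
    if left > 0 and right < 0:
        return "local maximum"
    if left < 0 and right > 0:
        return "local minimum"
    return "neither"


# f(x) = x^3 has f'(x) = 3x^2, which is positive on both sides of 0.
print(classify_by_sign_change(lambda x: 3 * x ** 2, 0.0))
# f(x) = x^2 has f'(x) = 2x, which goes from negative to positive.
print(classify_by_sign_change(lambda x: 2 * x, 0.0))
# f(x) = -x^2 has f'(x) = -2x, which goes from positive to negative.
print(classify_by_sign_change(lambda x: -2 * x, 0.0))
```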

Second‑Order Derivative

Concept of Second‑Order Derivative

In calculus, the second‑order derivative of a function is the derivative of its first derivative. Roughly speaking, the second derivative describes how quickly the rate of change itself is changing. For instance, the second derivative of position with respect to time is the instantaneous acceleration, i.e., the rate of change of velocity: a = d²s/dt². For a function of one variable, the second derivative is usually denoted by f″(x), d²f/dx², or d²y/dx².

Relationship Between the Second Derivative and Concavity

The second derivative indicates the direction and degree of curvature of the graph. If the second derivative is positive on an interval, the function is concave upward (also called convex). Conversely, if it is negative, the function is concave downward (also called concave).

If the second derivative changes sign at a point, the graph switches from concave upward to concave downward or vice versa; such a point is called a point of inflection. When the second derivative is continuous, it equals zero at every inflection point, but the converse does not hold. For example, f(x) = x⁴ has f″(0) = 0, yet f″(x) = 12x² never changes sign, so f has no inflection point on ℝ.

The relationship between concavity and extrema helps decide the nature of a critical point:

  • If f′(c) = 0, f″(c) < 0, then f attains a local maximum at c.
  • If f′(c) = 0, f″(c) > 0, then f attains a local minimum at c.
  • If f′(c) = 0 and f″(c) = 0, the test is inconclusive: c may be an inflection point, a local maximum, or a local minimum.
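
The three cases above translate directly into a small decision function. This is a minimal sketch: the name second_derivative_test and the tolerance tol are hypothetical, and the caller is assumed to supply the values f′(c) and f″(c) computed by hand.

```python
def second_derivative_test(fp_c, fpp_c, tol=1e-12):
    """Classify a point c from the values f'(c) and f''(c)."""
    if abs(fp_c) > tol:
        return "not a stationary point"
    if fpp_c < -tol:
        return "local maximum"
    if fpp_c > tol:
        return "local minimum"
    return "inconclusive"


# f(x) = x^3 - 3x: f'(x) = 3x^2 - 3 and f''(x) = 6x.
for c in (-1.0, 1.0):
    print(c, second_derivative_test(3 * c ** 2 - 3, 6 * c))

# f(x) = x^4: f'(0) = 0 and f''(0) = 0, so the test cannot decide.
print(0.0, second_derivative_test(0.0, 0.0))
```

For f(x) = x⁴ the point x = 0 is in fact a minimum, which shows why the inconclusive case needs a further check such as the sign of f′ on either side.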

Partial Derivatives and Gradient

Partial Derivative

When a function has several independent variables, e.g.

\[ f(x_1, x_2, \dots, x_n), \]

we can treat all variables except one as constants and differentiate with respect to the chosen variable.

For example, treating y as a constant, the function f(x, y) = x²y can be regarded as a function of x: g(x) = x²y. With y fixed, the derivative with respect to x is

\[ \frac{\partial f}{\partial x}=2xy. \]

This derivative is called a partial derivative, generally denoted

\[ \frac{\partial f}{\partial x_i}. \]

More generally, for a multivariable function f at point a the partial derivative with respect to x_i is defined as

\[ \frac{\partial f}{\partial x_i}(a)=\lim_{h\to0}\frac{f(a_1,\dots,a_i+h,\dots,a_n)-f(a)}{h}. \]
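
The definition suggests a direct numerical estimate: perturb only the i-th coordinate and hold the others fixed. The sketch below uses a symmetric (central) variant of the difference quotient for better accuracy; the helper name partial_derivative and the step h are illustrative assumptions.

```python
def partial_derivative(f, point, i, h=1e-6):
    """Central-difference estimate of the partial derivative of f
    with respect to its i-th argument at `point`."""
    forward = list(point)
    backward = list(point)
    forward[i] += h   # move only coordinate i forward
    backward[i] -= h  # and backward; all other coordinates stay fixed
    return (f(*forward) - f(*backward)) / (2 * h)


def f(x, y):
    return x ** 2 * y


a = (3.0, 2.0)
print("df/dx:", partial_derivative(f, a, 0), "analytic 2xy =", 2 * a[0] * a[1])
print("df/dy:", partial_derivative(f, a, 1), "analytic x^2 =", a[0] ** 2)
```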

Directional Derivative

A partial derivative is the rate of change of a multivariable function along one coordinate axis. If we choose an arbitrary direction \(\mathbf{u}\) (a unit vector), the rate of change of a bivariate function f at point \(\mathbf{a}\) in that direction is defined by the limit

\[ D_{\mathbf{u}}f(\mathbf{a})=\lim_{h\to0}\frac{f(\mathbf{a}+h\mathbf{u})-f(\mathbf{a})}{h}. \]

Here \(h\mathbf{u}\) is a small displacement along \(\mathbf{u}\); in components, \(h\mathbf{u}=h(u_1,u_2)\) with \(u_1^2+u_2^2=1\).

If f is differentiable at \(\mathbf{a}\), the total differential lets us write this as

\[ D_{\mathbf{u}}f(\mathbf{a})=f_x(\mathbf{a})\,u_1+f_y(\mathbf{a})\,u_2, \]

where \(f_x\) and \(f_y\) are the partial derivatives at \(\mathbf{a}\), and \(u_1, u_2\) are the direction cosines of \(\mathbf{u}\). The quantity \(D_{\mathbf{u}}f\) is called the directional derivative of f in the direction \(\mathbf{u}\).
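
The agreement between the limit definition and the partial-derivative formula can be checked numerically. The sketch below reuses f(x, y) = x²y; the point (1, 2), the 30° direction, and the function names are assumptions chosen for illustration.

```python
import math


def f(x, y):
    return x ** 2 * y


def grad_f(x, y):
    # Hand-computed partial derivatives: f_x = 2xy, f_y = x^2.
    return (2 * x * y, x ** 2)


def directional_derivative_limit(func, a, u, h=1e-6):
    """Difference quotient (f(a + h u) - f(a)) / h for a small h."""
    return (func(a[0] + h * u[0], a[1] + h * u[1]) - func(a[0], a[1])) / h


def directional_derivative_formula(grad, a, u):
    """f_x(a) u_1 + f_y(a) u_2, valid when f is differentiable at a."""
    fx, fy = grad(a[0], a[1])
    return fx * u[0] + fy * u[1]


a = (1.0, 2.0)
angle = math.radians(30.0)
u = (math.cos(angle), math.sin(angle))  # unit vector: u_1^2 + u_2^2 = 1

print("limit estimate:", directional_derivative_limit(f, a, u))
print("formula:       ", directional_derivative_formula(grad_f, a, u))
```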

Gradient

For a multivariable function f, if at point \(\mathbf{a}\) the partial derivatives with respect to each variable exist, they form a vector

\[ \nabla f(\mathbf{a})=\bigl(f_{x_1}(\mathbf{a}),\,f_{x_2}(\mathbf{a}),\dots,f_{x_n}(\mathbf{a})\bigr). \]

This vector is the gradient of f at \(\mathbf{a}\), denoted \(\nabla f\) or \(\operatorname{grad} f\).

For example, for \(f(x,y)=x^2y\) the gradient at \((1,2)\) is

\[ \nabla f(1,2)=(2xy,\,x^2)\big|_{(1,2)}=(4,1). \]

The gradient points in the direction of the greatest increase of the function; its magnitude gives the maximal rate of increase.
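
One way to see this claim is to scan many unit directions and record where the directional derivative is largest. The sketch below does so for f(x, y) = x²y at (1, 2); the grid of 3600 directions and all variable names are illustrative assumptions.

```python
import math


def grad_f(x, y):
    # Gradient of f(x, y) = x^2 y, computed by hand.
    return (2 * x * y, x ** 2)


g = grad_f(1.0, 2.0)
g_norm = math.hypot(g[0], g[1])       # magnitude of the gradient
g_angle = math.atan2(g[1], g[0])      # direction of the gradient

best_rate = -math.inf
best_angle = None
steps = 3600
for k in range(steps):
    theta = 2 * math.pi * k / steps
    u = (math.cos(theta), math.sin(theta))
    rate = g[0] * u[0] + g[1] * u[1]  # directional derivative along u
    if rate > best_rate:
        best_rate, best_angle = rate, theta

print("gradient:", g)
print("largest directional derivative:", best_rate, "vs |grad f| =", g_norm)
print("best direction (rad):", best_angle, "vs gradient direction:", g_angle)
```

Up to the resolution of the grid, the best direction coincides with the gradient direction and the best rate matches the gradient's magnitude.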

Linear Algebra

Scalars and Vectors

Concepts of Scalars and Vectors

Scalar: a single number having only magnitude.
Vector: an ordered list of scalars having both magnitude and direction.

  • Row vector: \([\,a_1\;a_2\;\dots\;a_n\,]\)
  • Column vector: \(\begin{bmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{bmatrix}\)

Vector Operations

  • Transpose: a column vector becomes a row vector and vice versa.
  • Addition: add corresponding components.
  • Scalar multiplication: multiply each component by the scalar.
  • Inner product (dot product): sum of the products of corresponding components; the result is a scalar.
  • Angle between two vectors: defined by \(\cos\theta = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\;\|\mathbf{v}\|}\).
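
The operations listed above map onto NumPy as in the following minimal sketch. The vectors u and v are arbitrary illustrative values.

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

column = u.reshape(-1, 1)   # shape (2, 1): a column vector
row = column.T              # shape (1, 2): its transpose, a row vector

total = u + v               # component-wise addition
scaled = 3 * u              # scalar multiplication
dot = np.dot(u, v)          # inner product, a scalar

# Angle from cos(theta) = u.v / (|u| |v|)
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
theta_deg = np.degrees(np.arccos(cos_theta))

print(column.shape, row.shape)
print(total, scaled, dot)
print("angle in degrees:", theta_deg)
```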

Vector Norms

A norm assigns a “length” to a vector.

  • L⁰ “norm” (actually the count of non‑zero entries)

    Example: for \(\mathbf{x} = (1,0,3)\), \(\|\mathbf{x}\|_0 = 2\).

  • L¹ norm (Manhattan norm)

    \(\|\mathbf{x}\|_1 = \sum_i |x_i|\).

    Example: \(\mathbf{x} = (1,-2,3)\) ⇒ \(\|\mathbf{x}\|_1 = 6\).

  • L² norm (Euclidean norm)

    \(\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}\).

    Example: \(\mathbf{x} = (1,2,2)\) ⇒ \(\|\mathbf{x}\|_2 = 3\).

  • Lᵖ norm

    \(\|\mathbf{x}\|_p = \bigl(\sum_i |x_i|^p\bigr)^{1/p}\).

In NumPy, numpy.linalg.norm conveniently computes vector norms.
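
A short sketch of numpy.linalg.norm on the example vectors from this section follows. The ord argument selects the norm; the choice p = 3 for the last call is an arbitrary illustration.

```python
import numpy as np

x0 = np.array([1, 0, 3])
x1 = np.array([1, -2, 3])
x2 = np.array([1, 2, 2])

l0 = np.linalg.norm(x0, ord=0)   # number of non-zero entries
l1 = np.linalg.norm(x1, ord=1)   # sum of absolute values
l2 = np.linalg.norm(x2)          # Euclidean norm (the default for vectors)
l3 = np.linalg.norm(x1, ord=3)   # general p-norm with p = 3

print(l0, l1, l2, l3)

# The p-norm agrees with the formula (sum |x_i|^p)^(1/p).
manual_l3 = np.sum(np.abs(x1) ** 3) ** (1 / 3)
print(manual_l3)
```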

Matrices and Tensors

Concept of a Matrix

An \(m\times n\) matrix is a rectangular array with m rows and n columns. The set of all real \(m\times n\) matrices is denoted \(\mathbb{R}^{m\times n}\).

  • Square matrix: number of rows equals number of columns.
  • Diagonal matrix: all off‑diagonal entries are zero.
  • Identity matrix: a diagonal matrix whose diagonal entries are all 1.

Matrix Multiplication

Matrix multiplication is defined only when the number of columns of the left matrix equals the number of rows of the right matrix. If \(A\) is \(m\times p\) and \(B\) is \(p\times n\), their product \(C = AB\) is an \(m\times n\) matrix with entries

\[ c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}. \]

Example:

\[ A = \begin{bmatrix}1&2\\3&4\end{bmatrix},\qquad B = \begin{bmatrix}0&1\\1&0\end{bmatrix},\qquad AB = \begin{bmatrix}2&1\\4&3\end{bmatrix}. \]

Multiplying by the identity leaves a matrix unchanged:

\[ AI = A,\qquad IA = A. \]
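
The entry formula can be written out as explicit loops and compared with NumPy's @ operator. The helper matmul_by_definition is a hypothetical name used only for this sketch; in practice the built-in operator is the better choice.

```python
import numpy as np


def matmul_by_definition(A, B):
    """Compute C = AB entry by entry from c_ij = sum_k a_ik * b_kj."""
    m, p = A.shape
    p_b, n = B.shape
    if p != p_b:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(p))
    return C


A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
I = np.eye(2, dtype=int)

print(matmul_by_definition(A, B))
print(A @ B)          # NumPy's matrix product
print(A @ I, I @ A)   # multiplying by the identity returns A
```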

Properties of Matrix Multiplication

Matrix multiplication is associative and distributes over addition on both sides, but it is not commutative in general:

  • Associativity: \((AB)C = A(BC)\).
  • Left distributive: \(A(B+C) = AB + AC\).
  • Right distributive: \((A+B)C = AC + BC\).
  • Non‑commutative: generally \(AB \neq BA\).
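
These properties can be spot-checked on small integer matrices, where the arithmetic is exact. The matrices below are arbitrary illustrative values; a check on one example is of course not a proof.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [1, 3]])

associative = np.array_equal((A @ B) @ C, A @ (B @ C))
left_distributive = np.array_equal(A @ (B + C), A @ B + A @ C)
right_distributive = np.array_equal((A + B) @ C, A @ C + B @ C)
commutes = np.array_equal(A @ B, B @ A)

print("associative:", associative)
print("left distributive:", left_distributive)
print("right distributive:", right_distributive)
print("AB == BA:", commutes)
```

For this pair, AB swaps the columns of A while BA swaps its rows, so the two products differ.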

Matrix Transpose

The transpose of an \(m\times n\) matrix \(A\) is the \(n\times m\) matrix \(A^{\mathsf T}\) whose \((i,j)\) entry equals the \((j,i)\) entry of \(A\).

\[ (A^{\mathsf T})_{ij}=A_{ji}. \]

Example:

\[ A = \begin{bmatrix}1&2\\3&4\\5&6\end{bmatrix},\qquad A^{\mathsf T}= \begin{bmatrix}1&3&5\\2&4&6\end{bmatrix}. \]

Properties:

\[ (A^{\mathsf T})^{\mathsf T}=A,\quad (A+B)^{\mathsf T}=A^{\mathsf T}+B^{\mathsf T},\quad (AB)^{\mathsf T}=B^{\mathsf T}A^{\mathsf T}. \]
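
A brief NumPy sketch of the transpose and its properties, using the 3×2 example matrix from this section. The matrices B and M are hypothetical values introduced so that the sum and the product are defined.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])   # 3 x 2
B = np.ones((3, 2), dtype=int)           # same shape as A, for the sum rule
M = np.array([[0, 1], [1, 0]])           # 2 x 2, so that A @ M is defined

print(A.T)                                # 2 x 3 transpose
print(np.array_equal(A.T.T, A))                   # (A^T)^T = A
print(np.array_equal((A + B).T, A.T + B.T))       # (A + B)^T = A^T + B^T
print(np.array_equal((A @ M).T, M.T @ A.T))       # (AM)^T = M^T A^T
```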

Inverse of a Matrix

For a square matrix \(A\), if there exists a matrix \(A^{-1}\) such that \(AA^{-1}=A^{-1}A=I\), then \(A^{-1}\) is called the inverse of \(A\).

Example:

\[ A = \begin{bmatrix}2&1\\5&3\end{bmatrix},\qquad A^{-1}= \begin{bmatrix}3&-1\\-5&2\end{bmatrix}. \]
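
The example can be verified with numpy.linalg.inv. Because the inverse is computed in floating point, the sketch compares with np.allclose rather than exact equality.

```python
import numpy as np

A = np.array([[2.0, 1.0], [5.0, 3.0]])
A_inv = np.linalg.inv(A)

print(A_inv)
# Both products should reproduce the identity, up to rounding error.
print(np.allclose(A @ A_inv, np.eye(2)))
print(np.allclose(A_inv @ A, np.eye(2)))
```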

Other Matrix Operations

  • Vectorization: stacking the columns of a matrix \(A\) into a single column vector, denoted \(\operatorname{vec}(A)\).

    Example:

    \[ A=\begin{bmatrix}a&b\\c&d\end{bmatrix}\;\Longrightarrow\; \operatorname{vec}(A)=\begin{bmatrix}a\\c\\b\\d\end{bmatrix}. \]

    Row‑wise vectorization (sometimes called “row‑vec”) stacks rows instead of columns.

  • Matrix inner product: for matrices \(A\) and \(B\) of the same size, \(\langle A,B\rangle = \sum_{i,j} a_{ij}b_{ij}\); the result is a scalar.

  • Hadamard product: element‑wise multiplication, denoted \(A\circ B\); the result has the same dimensions as \(A\) and \(B\).

  • Kronecker product: denoted \(A\otimes B\); each element \(a_{ij}\) of \(A\) is multiplied by the entire matrix \(B\), producing a block matrix. The Kronecker product is also called the tensor product or direct product.
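
The four operations above have direct NumPy counterparts, sketched below on two small illustrative matrices. Column-wise stacking corresponds to Fortran ("F") order in NumPy, and row-wise stacking to the default C order.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

vec_A = A.flatten(order="F")   # stack columns: [1, 3, 2, 4]
rowvec_A = A.flatten()         # stack rows:    [1, 2, 3, 4]

inner = np.sum(A * B)          # matrix inner product, a scalar
hadamard = A * B               # element-wise (Hadamard) product
kron = np.kron(A, B)           # Kronecker product, a 4 x 4 block matrix

print(vec_A, rowvec_A)
print(inner)
print(hadamard)
print(kron)
```

Each 2×2 block of the Kronecker product is one entry of A times the whole of B; for instance the top-right block equals 2·B.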

Tensor

A tensor is a multidimensional array, generalizing scalars (0‑D), vectors (1‑D), and matrices (2‑D) to n dimensions.

Example: a 3‑D tensor \(\mathcal{T}\in\mathbb{R}^{p\times q\times r}\).
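
In NumPy, the number of dimensions of an array is available as ndim, which gives a quick way to see the progression from scalar to tensor. The sizes p, q, r below are arbitrary illustrative values.

```python
import numpy as np

scalar = np.array(5.0)               # 0-D
vector = np.array([1.0, 2.0, 3.0])   # 1-D
matrix = np.zeros((2, 3))            # 2-D

p, q, r = 2, 3, 4                    # hypothetical sizes
tensor = np.zeros((p, q, r))         # 3-D tensor in R^(p x q x r)

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```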

Matrix Calculus

Matrix calculus means differentiating a scalar‑ or vector‑valued function with respect to a vector or matrix variable, then arranging the resulting partial derivatives into a vector or matrix rather than handling them one scalar at a time.

To be precise, we first standardize the notation for variables (independent arguments) and functions:

  • Notation for variables

    • Real‑valued vector variable
      Let \(\mathbf{x} = [x_1,\dots,x_n]\) be a row vector. The superscript \(^{\mathsf T}\) denotes transpose, so \(\mathbf{x}^{\mathsf T}\) is an \(n\)-dimensional column vector; working with column vectors is the usual convention in linear algebra for keeping matrix dimensions consistent.

    • Real‑valued matrix variable
      (Further definitions continue in the original text.)


Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.
