Machine Learning Prelude – Fundamental Mathematics
Li Wei
Advanced Mathematics
Derivative
Concept of Derivative
The derivative is a core concept in calculus. The derivative of a function at a point is the instantaneous rate of change of the function at that point, which is also the slope of the tangent line there. In essence, the derivative is a local linear approximation of the function, obtained by taking a limit.
Suppose the independent variable receives an increment Δx at a point x, producing an increment Δy = f(x + Δx) − f(x) in the function value. If the ratio Δy/Δx has a limit as Δx → 0, that limit is the derivative of f at x, denoted f′(x), dy/dx, or df/dx:
[ f'(x)=\lim_{\Delta x\to0}\frac{f(x+\Delta x)-f(x)}{\Delta x}. ]
For example, in kinematics the derivative of an object’s displacement with respect to time is the object’s instantaneous velocity: v = ds/dt.
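The limit definition can be tried numerically. The sketch below is only an illustration: the free-fall displacement s(t) = 4.9t² and the helper name `difference_quotient` are assumptions made for this example, not part of the original text. It evaluates the ratio Δs/Δt for shrinking increments.

```python
def difference_quotient(f, x, dx):
    """Ratio of the function increment to the argument increment."""
    return (f(x + dx) - f(x)) / dx


def displacement(t):
    # Hypothetical displacement s(t) = 4.9 * t**2 (metres, t in seconds)
    return 4.9 * t ** 2


# As dt shrinks, the ratio approaches the exact velocity ds/dt = 9.8 * t
for dt in (1e-1, 1e-3, 1e-5):
    print(dt, difference_quotient(displacement, 2.0, dt))
```

For this particular s(t) the ratio equals 19.6 + 4.9·Δt, so the printed values move toward the instantaneous velocity 19.6 at t = 2.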
Derivatives of Basic Functions
| Explanation | Formula | Example |
|---|---|---|
| Derivative of a constant | (c)' = 0 | |
| Derivative of a power function | (xⁿ)' = n xⁿ⁻¹ | |
| Derivative of the exponential function | (eˣ)' = eˣ | |
| Derivative of the logarithmic function | (ln x)' = 1/x | |
| Derivative of trigonometric functions | (sin x)' = cos x, (cos x)' = –sin x, (tan x)' = sec² x, … | |
Rules for Differentiation
| Explanation | Formula |
|---|---|
| Sum rule | (f + g)′ = f′ + g′ |
| Product rule | (fg)′ = f′g + fg′ |
| Quotient rule | (f/g)′ = (f′g – fg′)/g² |
| Chain rule | (f∘g)′ = (f′∘g)·g′ |
For instance, find the derivative of f(x) = (x³ + 2x) · eˣ at x = 0.
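One way to work this exercise is to apply the product rule with the power-function and exponential derivatives from the tables above:

```latex
\begin{aligned}
f'(x) &= (x^3 + 2x)'\,e^x + (x^3 + 2x)\,(e^x)' \\
      &= (3x^2 + 2)\,e^x + (x^3 + 2x)\,e^x \\
      &= (x^3 + 3x^2 + 2x + 2)\,e^x, \\
f'(0) &= (0 + 0 + 0 + 2)\cdot e^{0} = 2.
\end{aligned}
```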
Using Derivatives to Find Extrema
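Points where the derivative equals zero are called critical points (or candidate extrema). At such a point the function may attain a local maximum, a local minimum, or neither. To decide which, examine the sign of the derivative on either side of the point: a change from positive to negative marks a local maximum, a change from negative to positive marks a local minimum, and no change of sign means the point is not an extremum.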
For example, f(x) = x³ has f′(0) = 0, but x = 0 is neither a maximum nor a minimum.
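The sign check can be sketched numerically. This is a rough illustration rather than a robust classifier: the helper names, the central-difference approximation, and the step sizes are all assumptions chosen for the example.

```python
def derivative(f, x, dx=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)


def classify_by_sign_change(f, c, step=1e-3):
    """Compare the sign of f' slightly to the left and right of c."""
    left = derivative(f, c - step)
    right = derivative(f, c + step)
    if left > 0 and right < 0:
        return "local maximum"
    if left < 0 and right > 0:
        return "local minimum"
    return "neither"  # no sign change detected around c


print(classify_by_sign_change(lambda x: x ** 3, 0.0))   # neither
print(classify_by_sign_change(lambda x: x ** 2, 0.0))   # local minimum
print(classify_by_sign_change(lambda x: -x ** 2, 0.0))  # local maximum
```

For x³ the derivative 3x² is positive on both sides of 0, which is why the critical point is not an extremum.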
Second‑Order Derivative
Concept of Second‑Order Derivative
In calculus, the second‑order derivative of a function is the derivative of its first derivative. Roughly speaking, the second derivative describes how quickly the rate of change itself is changing. For instance, the second derivative of position with respect to time is the instantaneous acceleration, i.e., the rate of change of velocity: a = d²s/dt². The second derivative is usually denoted by f″(x), d²f/dx², or ∂²f/∂x².
Relationship Between the Second Derivative and Concavity
The second derivative indicates the direction and degree of curvature of the graph. If the second derivative is positive on an interval, the function is concave upward (also called convex). Conversely, if it is negative, the function is concave downward (also called concave).
If the second derivative changes sign at a point, the graph switches from concave upward to concave downward or vice versa. Such a point is a point of inflection. When the second derivative is continuous, the second derivative at an inflection point is zero, but the converse is not always true. For example, f(x) = x⁴ has f″(0) = 0, yet it has no inflection point on ℝ.
The relationship between concavity and extrema helps decide the nature of a critical point:
- If f′(c) = 0, f″(c) < 0, then f attains a local maximum at c.
- If f′(c) = 0, f″(c) > 0, then f attains a local minimum at c.
- If f′(c) = 0, f″(c) = 0, the test is inconclusive: c may be an inflection point, a local maximum, or a local minimum.
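The three cases above can be illustrated with a small numerical sketch. It assumes the caller already knows that f′(c) = 0; the function names, the finite-difference formula for f″, and the tolerance `tol` are choices made for this example only.

```python
def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


def second_derivative_test(f, c, tol=1e-6):
    """Classify a critical point c (f'(c) = 0 is assumed, not checked)."""
    curvature = second_derivative(f, c)
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    return "inconclusive"  # f''(c) is numerically zero


print(second_derivative_test(lambda x: x ** 2, 0.0))   # local minimum
print(second_derivative_test(lambda x: -x ** 2, 0.0))  # local maximum
print(second_derivative_test(lambda x: x ** 4, 0.0))   # inconclusive
print(second_derivative_test(lambda x: x ** 3, 0.0))   # inconclusive
```

The last two calls show why the third case needs further inspection: x⁴ has a minimum at 0 while x³ has an inflection point there, yet both have f″(0) = 0.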
Partial Derivatives and Gradient
Partial Derivative
When a function has several independent variables, e.g.
[ f(x_1, x_2, \dots, x_n), ]
we can treat all variables except one as constants and differentiate with respect to the chosen variable.
For example, treating y as a constant, the function f(x, y) = x²y can be regarded as a function of x: g(x) = x²y. With y fixed, the derivative with respect to x is
[ \frac{\partial f}{\partial x}=2xy. ]
This derivative is called a partial derivative, generally denoted
[ \frac{\partial f}{\partial x_i}. ]
More generally, for a multivariable function f at point a the partial derivative with respect to x_i is defined as
[ \frac{\partial f}{\partial x_i}(a)=\lim_{h\to0}\frac{f(a_1,\dots,a_i+h,\dots,a_n)-f(a)}{h}. ]
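The limit in this definition can be approximated by using a small but finite h. The following sketch does so for the earlier example f(x, y) = x²y; the helper `partial_derivative` and the step size are illustrative assumptions.

```python
def partial_derivative(f, point, i, h=1e-6):
    """Forward-difference estimate of the partial derivative along x_i."""
    shifted = list(point)
    shifted[i] += h  # change only the i-th variable, hold the others fixed
    return (f(*shifted) - f(*point)) / h


def f(x, y):
    return x ** 2 * y


a = (1.0, 2.0)
print(partial_derivative(f, a, 0))  # close to 2xy = 4 at (1, 2)
print(partial_derivative(f, a, 1))  # close to x^2 = 1 at (1, 2)
```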
Directional Derivative
A partial derivative is the rate of change of a multivariable function along one coordinate axis. If we choose an arbitrary direction (\mathbf{u}) (a unit vector), the rate of change of a bivariate function f at point (\mathbf{a}) in that direction is defined by the limit
[ D_{\mathbf{u}}f(\mathbf{a})=\lim_{h\to0}\frac{f(\mathbf{a}+h\mathbf{u})-f(\mathbf{a})}{h}. ]
Here (h\mathbf{u}) is a small displacement along (\mathbf{u}); the components satisfy (h\mathbf{u}=h(u_1,u_2)) with (u_1^2+u_2^2=1).
When f is differentiable at (\mathbf{a}), the total differential lets us write this as
[ D_{\mathbf{u}}f(\mathbf{a})=f_x(\mathbf{a})\,u_1+f_y(\mathbf{a})\,u_2, ]
where (f_x) and (f_y) are the partial derivatives at (\mathbf{a}), and (u_1, u_2) are the direction cosines of (\mathbf{u}). The quantity (D_{\mathbf{u}}f) is called the directional derivative of f in the direction (\mathbf{u}).
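As a quick sanity check, the limit definition and the formula in terms of partial derivatives can be compared numerically. The function f(x, y) = x²y, the point (1, 2), and the 45° direction are choices made for this sketch.

```python
import math


def f(x, y):
    return x ** 2 * y


def directional_derivative(f, a, u, h=1e-6):
    """Approximate the limit definition with a small finite step h."""
    (ax, ay), (ux, uy) = a, u
    return (f(ax + h * ux, ay + h * uy) - f(ax, ay)) / h


a = (1.0, 2.0)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))  # unit vector at 45 degrees

from_limit = directional_derivative(f, a, u)
# Formula f_x * u1 + f_y * u2, with f_x = 2xy and f_y = x^2 computed by hand
from_formula = 2 * a[0] * a[1] * u[0] + a[0] ** 2 * u[1]

print(from_limit, from_formula)  # both close to 5 / sqrt(2)
```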
Gradient
For a multivariable function f, if at point (\mathbf{a}) the partial derivatives with respect to each variable exist, they form a vector
[ \nabla f(\mathbf{a})=\bigl(f_{x_1}(\mathbf{a}),\,f_{x_2}(\mathbf{a}),\dots,f_{x_n}(\mathbf{a})\bigr). ]
This vector is the gradient of f at (\mathbf{a}), denoted (\nabla f) or (\operatorname{grad} f).
For example, for (f(x,y)=x^2y) the gradient at ((1,2)) is
[ \nabla f(1,2)=(2xy,\,x^2)\big|_{(1,2)}=(4,1). ]
The gradient points in the direction of the greatest increase of the function; its magnitude gives the maximal rate of increase.
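The claim about the steepest direction can be checked numerically for f(x, y) = x²y at (1, 2). The sketch below is illustrative: it estimates the gradient by central differences, then scans unit directions in 1° steps (an arbitrary resolution) and compares the largest directional derivative with the gradient's magnitude.

```python
import math


def gradient(f, point, h=1e-6):
    """Central-difference estimate of every partial derivative."""
    grad = []
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(*forward) - f(*backward)) / (2 * h))
    return grad


def f(x, y):
    return x ** 2 * y


grad = gradient(f, (1.0, 2.0))  # close to [4, 1]
magnitude = math.hypot(*grad)   # close to sqrt(17)

# Directional derivative grad . u for unit vectors u = (cos t, sin t)
angles = [2 * math.pi * k / 360 for k in range(360)]
best_rate = max(grad[0] * math.cos(t) + grad[1] * math.sin(t) for t in angles)

print(grad, magnitude, best_rate)
```

No scanned direction exceeds the gradient's magnitude, and the best one comes within rounding distance of it.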
Linear Algebra
Scalars and Vectors
Concepts of Scalars and Vectors
Scalar: a single number having only magnitude.
Vector: an ordered list of scalars having both magnitude and direction.
- Row vector: ([\,a_1\;a_2\;\dots\;a_n\,])
- Column vector: (\begin{bmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{bmatrix})
Vector Operations
- Transpose: a column vector becomes a row vector and vice versa.
- Addition: add corresponding components.
- Scalar multiplication: multiply each component by the scalar.
- Inner product (dot product): sum of the products of corresponding components; the result is a scalar.
- Angle between two vectors: defined by (\cos\theta = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\;\|\mathbf{v}\|}).
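The operations listed above can be written in a few lines of plain Python. This is a minimal sketch that represents vectors as lists; the function names are illustrative, and a row or column orientation (and hence the transpose) is not modelled.

```python
import math


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(c, u):
    return [c * a for a in u]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    cos_theta = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain
    return math.acos(max(-1.0, min(1.0, cos_theta)))


u, v = [1.0, 0.0], [1.0, 1.0]
print(add(u, v))                   # [2.0, 1.0]
print(scale(3, v))                 # [3.0, 3.0]
print(dot(u, v))                   # 1.0
print(math.degrees(angle(u, v)))   # approximately 45
```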
Vector Norms
A norm assigns a “length” to a vector.
L⁰ “norm” (actually the count of non‑zero entries)
Example: for (\mathbf{x} = (1,0,3)), (\|\mathbf{x}\|_0 = 2).
L¹ norm (Manhattan norm)
(\|\mathbf{x}\|_1 = \sum_i |x_i|).
Example: (\mathbf{x} = (1,-2,3)) ⇒ (\|\mathbf{x}\|_1 = 6).
L² norm (Euclidean norm)
(\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}).
Example: (\mathbf{x} = (1,2,2)) ⇒ (\|\mathbf{x}\|_2 = 3).
Lᵖ norm
(\|\mathbf{x}\|_p = \bigl(\sum_i |x_i|^p\bigr)^{1/p}).
In NumPy, numpy.linalg.norm conveniently computes vector norms.
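A short sketch of `numpy.linalg.norm` applied to the example vectors from this section; the `ord` argument selects which norm is computed, and the choice p = 3 at the end is arbitrary.

```python
import numpy as np

# ord=0 counts the non-zero entries of a vector
l0 = np.linalg.norm(np.array([1.0, 0.0, 3.0]), ord=0)

# ord=1 sums the absolute values
l1 = np.linalg.norm(np.array([1.0, -2.0, 3.0]), ord=1)

# For a 1-D array the default (ord=None) is the Euclidean norm
l2 = np.linalg.norm(np.array([1.0, 2.0, 2.0]))

# A general p-norm, here with p = 3
l3 = np.linalg.norm(np.array([1.0, 2.0, 2.0]), ord=3)

print(l0, l1, l2, l3)  # 2.0, 6.0, 3.0 and the cube root of 17
```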
Matrices and Tensors
Concept of a Matrix
An (m\times n) matrix is a rectangular array with m rows and n columns. The set of all real (m\times n) matrices is denoted (\mathbb{R}^{m\times n}).
- Square matrix: number of rows equals number of columns.
- Diagonal matrix: all off‑diagonal entries are zero.
- Identity matrix: a diagonal matrix whose diagonal entries are all 1.
Matrix Multiplication
Matrix multiplication is defined only when the number of columns of the left matrix equals the number of rows of the right matrix. If (A) is (m\times p) and (B) is (p\times n), their product (C = AB) is an (m\times n) matrix with entries
[ c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}. ]
Example:
[ A = \begin{bmatrix}1&2\\3&4\end{bmatrix},\qquad B = \begin{bmatrix}0&1\\1&0\end{bmatrix},\qquad AB = \begin{bmatrix}2&1\\4&3\end{bmatrix}. ]
Multiplying by the identity leaves a matrix unchanged:
[ AI = A,\qquad IA = A. ]
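The entry formula for the product translates directly into nested loops. The sketch below stores matrices as lists of rows (an assumption of this example, as is the name `matmul`) and reproduces the 2×2 example and the identity property.

```python
def matmul(A, B):
    """Product of an m x p matrix A and a p x n matrix B (lists of rows)."""
    m, p, n = len(A), len(B), len(B[0])
    if any(len(row) != p for row in A):
        raise ValueError("columns of A must equal rows of B")
    # c_ij is the sum over k of a_ik * b_kj
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
I = [[1, 0], [0, 1]]

print(matmul(A, B))                       # [[2, 1], [4, 3]]
print(matmul(A, I) == A == matmul(I, A))  # True
```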
Properties of Matrix Multiplication
Matrix multiplication is associative and distributes over addition on both sides, but it is not commutative in general:
- Associativity: ((AB)C = A(BC)).
- Left distributive: (A(B+C) = AB + AC).
- Right distributive: ((A+B)C = AC + BC).
- Non‑commutative: generally (AB \neq BA).
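These properties can be spot-checked on concrete matrices. The example below uses NumPy's `@` operator with three small integer matrices chosen arbitrarily for illustration; a check on one example does not prove the general statements, but it does show a pair for which AB and BA differ.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [1, 3]])

associative = np.array_equal((A @ B) @ C, A @ (B @ C))
left_distributive = np.array_equal(A @ (B + C), A @ B + A @ C)
right_distributive = np.array_equal((A + B) @ C, A @ C + B @ C)
commutes = np.array_equal(A @ B, B @ A)

print(associative, left_distributive, right_distributive, commutes)
# True True True False
```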
Matrix Transpose
The transpose of an (m\times n) matrix (A) is the (n\times m) matrix (A^{\mathsf T}) whose ((i,j)) entry equals the ((j,i)) entry of (A).
[ (A^{\mathsf T})_{ij}=A_{ji}. ]
Example:
[ A = \begin{bmatrix}1&2\\3&4\\5&6\end{bmatrix},\qquad A^{\mathsf T}= \begin{bmatrix}1&3&5\\2&4&6\end{bmatrix}. ]
Properties:
[ (A^{\mathsf T})^{\mathsf T}=A,\quad (A+B)^{\mathsf T}=A^{\mathsf T}+B^{\mathsf T},\quad (AB)^{\mathsf T}=B^{\mathsf T}A^{\mathsf T}. ]
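The transpose and its three properties can be tried in NumPy, where `.T` gives the transpose. A is the 3×2 example from the text; the matrices B and D are arbitrary choices for this sketch.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])  # 3 x 2 example from the text
D = np.array([[1, 0], [0, 1], [2, 2]])  # another 3 x 2 matrix
B = np.array([[0, 1], [1, 0]])          # 2 x 2, so that A @ B is defined

print(A.T)  # [[1 3 5], [2 4 6]]

double_transpose_ok = np.array_equal(A.T.T, A)
sum_rule_ok = np.array_equal((A + D).T, A.T + D.T)
# Note the reversed order of the factors on the right-hand side
product_rule_ok = np.array_equal((A @ B).T, B.T @ A.T)

print(double_transpose_ok, sum_rule_ok, product_rule_ok)  # True True True
```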
Inverse of a Matrix
For a square matrix (A), if there exists a matrix (A^{-1}) such that (AA^{-1}=A^{-1}A=I), then (A^{-1}) is called the inverse of (A).
Example:
[ A = \begin{bmatrix}2&1\\5&3\end{bmatrix},\qquad A^{-1}= \begin{bmatrix}3&-1\\-5&2\end{bmatrix}. ]
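The example can be verified with `numpy.linalg.inv`. Because the inverse is computed in floating point, the sketch compares with `np.allclose` rather than exact equality.

```python
import numpy as np

A = np.array([[2.0, 1.0], [5.0, 3.0]])
A_inv = np.linalg.inv(A)

print(A_inv)  # numerically close to [[3, -1], [-5, 2]]

matches_text = np.allclose(A_inv, [[3, -1], [-5, 2]])
# Both products should give the 2 x 2 identity matrix
is_two_sided = (np.allclose(A @ A_inv, np.eye(2))
                and np.allclose(A_inv @ A, np.eye(2)))

print(matches_text, is_two_sided)  # True True
```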
Other Matrix Operations
Vectorization: stacking the columns of a matrix (A) into a single column vector, denoted (\operatorname{vec}(A)).
Example:
[ A=\begin{bmatrix}a&b\\c&d\end{bmatrix}\;\Longrightarrow\; \operatorname{vec}(A)=\begin{bmatrix}a\\c\\b\\d\end{bmatrix}. ]
Row‑wise vectorization (sometimes called “row‑vec”) stacks rows instead of columns.
Matrix inner product: for matrices (A) and (B) of the same size, (\langle A,B\rangle = \sum_{i,j} a_{ij}b_{ij}); the result is a scalar.
Hadamard product: element‑wise multiplication, denoted (A\circ B); the result has the same dimensions as (A) and (B).
Kronecker product: denoted (A\otimes B); each element (a_{ij}) of (A) is multiplied by the entire matrix (B), producing a block matrix. The Kronecker product is also called the tensor product or direct product.
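One possible NumPy rendering of the operations in this subsection, using two arbitrary 2×2 matrices. Column-wise vectorization is expressed here through `flatten(order="F")`, which is one of several ways to obtain it.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

vec_A = A.flatten(order="F")   # stack columns: [1 3 2 4]
row_vec_A = A.flatten()        # stack rows:    [1 2 3 4]

inner = np.sum(A * B)          # matrix inner product: 0 + 10 + 18 + 28 = 56
hadamard = A * B               # element-wise product, same shape as A and B
kron = np.kron(A, B)           # 4 x 4 block matrix [[1*B, 2*B], [3*B, 4*B]]

print(vec_A, row_vec_A, inner)
print(hadamard)
print(kron)
```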
Tensor
A tensor is a multidimensional array, generalizing scalars (0‑D), vectors (1‑D), and matrices (2‑D) to n dimensions.
Example: a 3‑D tensor ( \mathcal{T}\in\mathbb{R}^{p\times q\times r}).
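In NumPy, the number of dimensions of an array is available as `ndim` and its size along each dimension as `shape`. The sizes 2, 3, and 4 below are arbitrary stand-ins for p, q, and r.

```python
import numpy as np

scalar = np.array(5.0)               # 0-D
vector = np.array([1.0, 2.0, 3.0])   # 1-D
matrix = np.zeros((2, 3))            # 2-D
tensor = np.zeros((2, 3, 4))         # 3-D, shape (p, q, r) = (2, 3, 4)

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, arr.ndim, arr.shape)
```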
Matrix Calculus
Matrix calculus essentially means differentiating a scalar‑ or vector‑valued function with respect to a vector or matrix variable, then arranging the resulting partial derivatives into a vector or matrix of matching layout instead of handling them one at a time.
To be precise, we first standardize the notation for variables (independent arguments) and functions:
Notation for variables
Real‑valued vector variable
Let (\mathbf{x} = [x_1,\dots,x_n]^{\mathsf T}) be an (n)-dimensional column vector. The superscript (^{\mathsf T}) denotes the transpose, which turns the row vector ([x_1,\dots,x_n]) into a column vector; treating vectors as columns is the usual convention in linear algebra for keeping matrix dimensions consistent.
Real‑valued matrix variable
(Further definitions continue in the original text.)
Originally written by Li Wei (李唯) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.