8. Deep Learning
Li Wei
Chapter 1 Overview of Deep Learning
Deep learning, a branch of machine learning, focuses on using multi‑layer neural networks (deep neural networks) to model and solve problems.
The human brain contains many interconnected neurons. When the brain processes information, these neurons interact via electrical signals and chemicals, transmitting information across different brain regions. Neural networks mimic this biological phenomenon with artificial neurons—software modules called nodes—that communicate and transmit information through numerical computation.
Characteristics of Deep Learning
- Uses multi‑layer neural networks that can automatically extract hierarchical features from data.
- Suited for unstructured data such as images, audio, and text.
- Relies on large amounts of data and computational resources; training can be time‑consuming.
- Models are complex and often regarded as “black boxes,” offering limited interpretability.
Application Scenarios
- Financial services: algorithmic trading, loan‑risk assessment, fraud detection, credit and portfolio management.
- Media & entertainment: personalized recommendations for e‑commerce and streaming services.
- Customer service: chatbots, virtual assistants, and inbound service portals that use speech recognition.
- Healthcare: medical imaging analysis and digitized health records to improve diagnosis and deliver more precise, efficient care.
- Industrial automation: safety monitoring in factories/warehouses, quality control, and predictive maintenance.
- Autonomous vehicles: researchers train models to detect stop signs, traffic lights, crosswalks, pedestrians, etc.
- Aerospace & defense: large‑scale surveillance to detect objects, identify areas of interest, and verify safe/unsafe zones for troops.
- Law enforcement: speech recognition, computer vision, and natural language processing (NLP) help analyze massive datasets, saving time and resources.
Chapter 2 Fundamentals of Neural Networks
Components of a Neural Network
Basic Concepts and Structure
In biology, a neuron receives electro‑chemical pulses from other neurons via its dendrites. When the summed input exceeds a threshold, the neuron fires an action potential that travels down the axon to the synapse, triggering the release of neurotransmitter‑filled vesicles. Those chemicals then diffuse to neighboring neurons.
An Artificial Neural Network (ANN), often simply called a Neural Network (NN), is a computational model that imitates the structure and function of biological neural networks. In most cases, an ANN can adapt its internal structure based on external information, making it an adaptive system—in plain terms, it can learn.
In an ANN, each artificial neuron typically computes a weighted sum of multiple inputs, passes the result through an activation function, and outputs the transformed value.
Stacking many neurons creates a multi‑layer network:
- The leftmost column of neurons receives the raw inputs → input layer.
- The rightmost column produces the final outputs → output layer.
- All layers in between are collectively called hidden (intermediate) layers.
Neurons in adjacent layers are fully connected (each neuron in the next layer connects to every neuron in the previous layer), and each connection carries a weight.
Information propagates layer‑by‑layer—commonly referred to as forward propagation—where the output of one layer becomes the input of the next.
Perceptron Recap
Recall the perceptron from the machine‑learning section.
A perceptron is a binary classifier that receives multiple input signals and outputs a single binary signal (0 or 1).
Below is an example of a perceptron with two inputs:
[input] → ○ (neuron) → [output]
Each input is multiplied by a weight. The neuron sums the weighted inputs; if the sum exceeds a threshold θ, it outputs 1 (the neuron is “activated”), otherwise it outputs 0.
Moving the threshold to the other side of the inequality as a bias b = −θ means the weighted sum plus bias is compared against 0, which is why the threshold here is set to 0.
Each input has its own weight, controlling the importance of that signal; larger weights mean higher importance. The bias adjusts how easily the neuron fires.
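As a minimal sketch of the rule just described, the following two-input perceptron fires when the weighted sum plus bias is greater than 0. The function name `perceptron` and the AND-gate weights and bias are hand-picked assumptions for illustration; they do not come from the original text.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum of inputs plus bias exceeds 0, else 0."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical parameters chosen by hand so the perceptron behaves like an AND gate.
w_and = np.array([0.5, 0.5])
b_and = -0.7

and_outputs = [perceptron(np.array(x), w_and, b_and)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

Making the bias less negative (for example -0.2) would let a single active input fire the neuron, which is one way to see how the bias controls how easily the neuron activates.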
Introducing Activation Functions
Equation (2.1) actually performs two steps:
- Compute a weighted sum of inputs plus bias.
- Compare that sum to a threshold to decide whether to output 0 or 1.
If we define a function
f(x) = { 1 if x > 0 else 0 }
then (2.1) can be simplified to
output = f(weighted_sum + bias)
We often denote the weighted sum plus bias as z.
The function that maps z to the output is called an activation function because it “activates” the neuron.
Activation Functions
Role
Activation functions are the bridge between perceptrons and full neural networks. Without them, a network would be equivalent to a single linear transformation, no matter how many layers you stack—essentially a “no‑hidden‑layer” network. Therefore, activation functions must be non‑linear, introducing the capacity to learn and represent complex, non‑linear relationships.
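The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses arbitrary random weights (all names and shapes are assumptions for the demonstration) and compares two linear layers with the single equivalent layer obtained by multiplying the weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked layers with no activation function in between.
two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

linear_layers_collapse = np.allclose(two_layer, one_layer)
print(linear_layers_collapse)  # True

# Inserting a non-linearity (ReLU here) between the layers breaks this equivalence
# in general, which is what gives the extra layer its expressive power.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
```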
Common Activation Functions
Binary Step
In the perceptron, the step function is the simplest activation: it sets a threshold and switches the output between 0 and 1. Its derivative is 0 everywhere except at the threshold, where it is undefined, so it provides no gradient for learning and is unsuitable for modern networks.
Definition
step(x) = 1 if x > 0 else 0
Explanation
It acts like a hard switch; the output jumps abruptly when the input crosses the boundary. Because gradients are zero almost everywhere, back‑propagation cannot update parameters, making training impossible for deep models.
Code
def step(x):
    # Same convention as f(x) above: fire only when x is strictly positive
    return 1.0 if x > 0 else 0.0
Sigmoid (Logistic)
A smooth, differentiable function that maps any real input to the interval (0, 1).
Definition
σ(x) = 1 / (1 + exp(-x))
Derivative
σ'(x) = σ(x) * (1 - σ(x))
Usage
Commonly used in binary‑classification output layers. Because it involves exponentials, it is computationally heavier than ReLU.
Notes
- Compresses inputs outside [-6, 6] to near‑constant values, potentially losing information.
- Outputs are always positive, which can bias subsequent layers.
- Derivative ranges between 0 and 0.25; gradients can vanish, especially for large‑magnitude inputs, leading to the saturation problem in deep networks.
Code
import numpy as np

def sigmoid(x):
    # Works element-wise on scalars and NumPy arrays
    return 1 / (1 + np.exp(-x))
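To make the notes on saturation concrete, here is a small sketch that evaluates the sigmoid and its derivative at -6, 0, and 6 and compares the analytic derivative with a central-difference estimate. The helper name `sigmoid_grad` and the step size `h` are assumptions for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([-6.0, 0.0, 6.0])
values = sigmoid(xs)
grads = sigmoid_grad(xs)

# Central-difference approximation as an independent check of the formula.
h = 1e-5
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)

print(values)  # close to 0, exactly 0.5, close to 1
print(grads)   # peaks at 0.25 for x = 0 and is below 0.003 at x = -6 and x = 6
print(np.allclose(grads, numeric, atol=1e-8))
```

The tiny gradients at the two ends are what the text calls saturation: multiplying several such factors together during back-propagation shrinks the signal quickly.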
Tanh (Hyperbolic Tangent)
Maps inputs to the interval (‑1, 1) and is symmetric around the origin.
Definition
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Relation to Sigmoid
tanh(x) = 2 * sigmoid(2x) - 1
Derivative
tanh'(x) = 1 - tanh(x)^2
Notes
Zero‑centered output often yields more balanced gradient updates than sigmoid. However, it still suffers from saturation for large inputs.
Code
def tanh(x):
    return np.tanh(x)
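The relation to sigmoid and the derivative formula can both be verified numerically. This is a quick check on a handful of sample points; the variable names and the step size `h` are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.linspace(-3, 3, 13)

# tanh(x) = 2 * sigmoid(2x) - 1
identity_holds = np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1)

# tanh'(x) = 1 - tanh(x)^2, compared with a central-difference estimate
grad_analytic = 1 - np.tanh(xs) ** 2
h = 1e-5
grad_numeric = (np.tanh(xs + h) - np.tanh(xs - h)) / (2 * h)
derivative_holds = np.allclose(grad_analytic, grad_numeric, atol=1e-8)

print(identity_holds, derivative_holds)  # True True
```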
ReLU (Rectified Linear Unit)
Outputs 0 for negative inputs and passes positive inputs unchanged.
Definition
ReLU(x) = max(0, x)
Notes
- Simple and computationally cheap; gradients are 1 for positive inputs, mitigating the vanishing‑gradient problem.
- Produces sparse activations (many zeros), which can improve efficiency.
- Neurons that stay in the negative region become “dead” (zero gradient).
Leaky ReLU introduces a small slope (e.g., 0.01) for negative inputs to keep gradients alive.
Code
def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
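The "dead neuron" note is easier to see from the gradients. The sketch below defines the derivatives of ReLU and Leaky ReLU (the helper names `relu_grad` and `leaky_relu_grad` are assumptions, and the value at exactly x = 0 is set by convention to the negative-side slope).

```python
import numpy as np

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
g_relu = relu_grad(z)         # 0 for the two negative inputs, 1 for the positive ones
g_leaky = leaky_relu_grad(z)  # 0.01 for the two negative inputs, 1 for the positive ones
print(g_relu)
print(g_leaky)
```

A neuron whose pre-activation is negative for every training example receives a ReLU gradient of exactly 0 and stops updating, whereas the Leaky ReLU gradient stays at `alpha`, so a small learning signal still gets through.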
Softmax
Transforms a vector of real numbers into a probability distribution (sums to 1). It generalizes the sigmoid to multi‑class problems.
Definition
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Stability trick – subtract the maximum value before exponentiation to avoid overflow; the result is unchanged.
Derivative (Jacobian)
∂softmax_i/∂z_j = softmax_i * (δ_ij - softmax_j)
Notes
Amplifies larger inputs, making the highest‑scoring class dominate the probability mass.
Code
def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
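The following sketch illustrates two points from this section: subtracting the maximum leaves the result unchanged while avoiding overflow, and the Jacobian formula can be built directly from the softmax output. The helper `softmax_jacobian` and the sample logits are assumptions for the example, and the Jacobian helper handles a single 1-D vector only.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # stability shift
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_jacobian(z):
    """J[i, j] = s_i * (delta_ij - s_j) for a 1-D input vector z."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

# np.exp(1000.0) overflows to inf in float64, so a naive softmax would return nan here.
large_logits = np.array([1000.0, 1001.0, 1002.0])
small_logits = np.array([0.0, 1.0, 2.0])   # same logits shifted by -1000

probs_large = softmax(large_logits)
probs_small = softmax(small_logits)
J = softmax_jacobian(small_logits)

print(np.allclose(probs_large, probs_small))  # True: shifting all logits changes nothing
print(probs_small.sum())
print(J.sum(axis=0))  # each column sums to ~0 because the outputs always sum to 1
```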
Other Common Activations
- Identity (linear)
- Leaky ReLU, PReLU, RReLU, ELU, Swish (SiLU), Softplus, etc.
Choosing an Activation Function
Hidden layers
- Prefer ReLU; try Leaky ReLU if ReLU performs poorly.
- Avoid sigmoid in hidden layers, since its gradients tend to vanish as layers are stacked.
- Tanh is zero‑centered and can be useful for shallow networks.
Output layer
- Binary classification → Sigmoid.
- Multi‑class classification → Softmax.
- Regression → Identity.
Simple Implementation of a Neural Network
A deep neural network consists of multiple layers and is often called a model. The model receives raw inputs (features), produces outputs (predictions), and holds a set of parameters (weights and biases). Each layer takes the output of the previous layer, applies its own weights, biases, and activation, and passes the result forward.
Three‑Layer Example
Consider a network with:
- Input layer (layer 0): 2 neurons
- Hidden layer 1 (layer 1): 3 neurons
- Hidden layer 2 (layer 2): 2 neurons
- Output layer (layer 3): 2 neurons
The diagram below (omitted) shows the connections. In practice each layer also has biases, and the weighted sums are passed through activation functions.
Signal flow from input to layer 1
Weights are denoted with a superscript indicating the destination layer, e.g., w^{(1)}_{ij} is the weight from input neuron j to hidden neuron i in layer 1. Biases have a single subscript because there is one bias per neuron.
Using equations (2.4) and (2.5) we obtain:
z_i^{(1)} = Σ_j w^{(1)}_{ij} * x_j + b_i^{(1)}
a_i^{(1)} = activation(z_i^{(1)})
In matrix form:
Z^{(1)} = X · W^{(1)} + b^{(1)}
A^{(1)} = activation(Z^{(1)})
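Since we have 2 inputs and 3 neurons in layer 1, W^{(1)} is a 2 × 3 matrix. Its entry in row j, column i holds w^{(1)}_{ij}, so the product X · W^{(1)} reproduces the element-wise sums above.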
Layer 1 → Layer 2
Similar computation; W^{(2)} is a 3 × 2 matrix.
Layer 2 → Output (layer 3)
W^{(3)} is a 2 × 2 matrix; the output activation may differ (often identity for regression).
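The three steps above can be traced with concrete numbers. In this sketch every weight, bias, and input value is made up for illustration, and weights are stored so that `W1[j, i]` holds the weight from input j to neuron i, which is the layout that makes `x @ W1` match the element-wise sum.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative values only; they are not taken from the text.
x = np.array([1.0, 0.5])                  # 2 inputs
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])          # shape (2, 3): row = input, column = neuron
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([[0.1, 0.4],
               [0.2, 0.5],
               [0.3, 0.6]])               # shape (3, 2)
b2 = np.array([0.1, 0.2])
W3 = np.array([[0.1, 0.3],
               [0.2, 0.4]])               # shape (2, 2)
b3 = np.array([0.1, 0.2])

z1 = x @ W1 + b1          # approximately [0.3, 0.7, 1.1]
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
y = a2 @ W3 + b3          # identity activation on the output layer

# Element-wise version of z_i = sum_j w_ij * x_j + b_i, with w_ij stored at W1[j, i].
z1_loop = np.array([sum(W1[j, i] * x[j] for j in range(2)) + b1[i]
                    for i in range(3)])

print(z1.shape, z2.shape, y.shape)  # (3,) (2,) (2,)
```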
Code Skeleton
All parameters (weights W and biases b) can be stored in a dictionary, params. Define two functions:
- init_params(): initializes the parameters (weights as 2‑D arrays, biases as 1‑D arrays).
- forward(): performs forward propagation, converting inputs to outputs.
Hidden layers use sigmoid; the output layer uses identity.
import numpy as np

def init_params(layers):
    # layers = [input_dim, hidden1, hidden2, output_dim]
    params = {}
    for i in range(1, len(layers)):
        # Small random weights, shape (previous layer size, current layer size)
        params[f'W{i}'] = np.random.randn(layers[i-1], layers[i]) * 0.01
        params[f'b{i}'] = np.zeros(layers[i])
    return params

def forward(x, params):
    # Relies on sigmoid() as defined in the Sigmoid section
    n_layers = len(params) // 2  # one W and one b per layer
    a = x
    for i in range(1, n_layers + 1):
        z = a @ params[f'W{i}'] + params[f'b{i}']
        a = sigmoid(z) if i < n_layers else z  # identity on final layer
    return a
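A short usage sketch for the skeleton, applied to the 2-3-2-2 network from the three-layer example. The functions are repeated so the block runs on its own; the seed and the sample inputs are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def init_params(layers):
    params = {}
    for i in range(1, len(layers)):
        params[f'W{i}'] = np.random.randn(layers[i - 1], layers[i]) * 0.01
        params[f'b{i}'] = np.zeros(layers[i])
    return params

def forward(x, params):
    n_layers = len(params) // 2
    a = x
    for i in range(1, n_layers + 1):
        z = a @ params[f'W{i}'] + params[f'b{i}']
        a = sigmoid(z) if i < n_layers else z
    return a

np.random.seed(0)                       # arbitrary seed, for repeatability only
params = init_params([2, 3, 2, 2])
shapes = {name: value.shape for name, value in params.items()}
print(shapes)

single = forward(np.array([1.0, 0.5]), params)   # one sample -> shape (2,)
batch = forward(np.random.rand(5, 2), params)    # five samples -> shape (5, 2)
print(single.shape, batch.shape)
```

Because the weights are stored as (inputs, outputs) matrices and the biases broadcast across rows, the same `forward` handles a single sample and a batch without modification.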
Application Example: Handwritten Digit Recognition
Using the Digit Recognizer dataset (Kaggle competition), each row in train.csv contains a label (0‑9) followed by 784 pixel values (28 × 28 grayscale image). The task is to build a network that, given the 784‑dimensional input, predicts the digit—i.e., performs inference.
Network architecture:
- Input layer: 784 neurons
- Hidden layer 1: 50 neurons
- Hidden layer 2: 100 neurons
- Output layer: 10 neurons (one per digit)
Assume the network has already been trained; we load the saved parameters and run forward propagation.
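# load_parameters is not defined in this section; it is assumed to return a dict with keys W1, b1, ...
params = load_parameters('digit_model.npz')

def predict(img):
    # The index of the largest output score is the predicted digit
    return np.argmax(forward(img, params))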
Batch‑processing optimization can be added (code omitted).
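One possible shape for the batch-processing step is sketched below. It is not the omitted original code: `predict_batch`, the batch size of 100, and the randomly initialized stand-in parameters and images are all assumptions, used only so the example runs without the trained model file.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, params):
    n_layers = len(params) // 2
    a = x
    for i in range(1, n_layers + 1):
        z = a @ params[f'W{i}'] + params[f'b{i}']
        a = sigmoid(z) if i < n_layers else z
    return a

def predict_batch(images, params, batch_size=100):
    """Predict digits for an (N, 784) array, processing batch_size rows at a time."""
    preds = []
    for start in range(0, len(images), batch_size):
        scores = forward(images[start:start + batch_size], params)  # (batch, 10)
        preds.append(np.argmax(scores, axis=1))                     # best class per row
    return np.concatenate(preds)

# Stand-in parameters with the 784-50-100-10 architecture; a real run would load
# trained values instead of random ones.
rng = np.random.default_rng(0)
layers = [784, 50, 100, 10]
params = {}
for i in range(1, len(layers)):
    params[f'W{i}'] = rng.normal(size=(layers[i - 1], layers[i])) * 0.01
    params[f'b{i}'] = np.zeros(layers[i])

images = rng.random((250, 784))          # fake pixel data in [0, 1)
preds = predict_batch(images, params)
print(preds.shape)  # (250,)
```

Processing many images per matrix multiplication lets NumPy do the work in optimized native code, which is typically much faster than calling `forward` once per image.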
Chapter 3 Learning in Neural Networks
The key property of neural networks is their ability to learn from data—automatically adjusting weights to minimize a chosen loss function.
Unlike traditional machine‑learning methods that require handcrafted features (e.g., SIFT, HOG), deep learning can operate directly on raw data.
Loss Functions
A loss function quantifies how far the network’s predictions are from the true targets; training aims to minimize this value.
Common Losses
Mean Squared Error (MSE) – also called L2 loss
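\[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - t_i)^2 \]
- \(y_i\): network output
- \(t_i\): true label
- \(n\): number of output elements being averaged
Sensitive to outliers: because each error is squared, a few large errors dominate the loss and produce correspondingly large gradients.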
Code
import numpy as np

def mse(y, t):
    # y: predictions, t: targets (arrays of the same shape)
    return np.mean((y - t) ** 2)
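A small worked example shows how strongly a single outlier affects MSE. The target and prediction values are made up for illustration.

```python
import numpy as np

def mse(y, t):
    return np.mean((y - t) ** 2)

t = np.array([1.0, 2.0, 3.0, 4.0])
y_close = np.array([1.5, 2.5, 2.5, 3.5])      # every prediction is off by 0.5
y_outlier = np.array([1.5, 2.5, 2.5, 14.0])   # last prediction is off by 10

loss_close = mse(y_close, t)       # (4 * 0.25) / 4 = 0.25
loss_outlier = mse(y_outlier, t)   # (3 * 0.25 + 100) / 4 = 25.1875
print(loss_close, loss_outlier)
```

One bad prediction raises the loss by a factor of about 100 here, because its error enters the sum squared.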
Cross‑Entropy Error (for classification)
\[ \text{CE} = -\sum_{i} t_i \log(y_i) \]
- \(y_i\): predicted probability (softmax output)
- \(t_i\): one‑hot encoded true label
Code
def cross_entropy(y, t):
    # The small constant avoids log(0) when a predicted probability is exactly 0
    return -np.sum(t * np.log(y + 1e-12))
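Because the label is one-hot, only the probability assigned to the true class contributes to the sum. The sketch below compares a prediction that favors the correct class with one that favors a wrong class; the probability vectors are made-up examples.

```python
import numpy as np

def cross_entropy(y, t):
    return -np.sum(t * np.log(y + 1e-12))

t = np.array([0, 0, 1, 0])                     # true class is index 2
y_correct = np.array([0.1, 0.1, 0.7, 0.1])     # 0.7 on the true class
y_wrong = np.array([0.1, 0.7, 0.1, 0.1])       # only 0.1 on the true class

loss_correct = cross_entropy(y_correct, t)     # about -log(0.7), roughly 0.357
loss_wrong = cross_entropy(y_wrong, t)         # about -log(0.1), roughly 2.303
print(loss_correct, loss_wrong)
```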
Losses for Specific Tasks
Binary classification: Binary Cross‑Entropy
\[ L = -[t\log(y) + (1-t)\log(1-y)] \]
Multi‑class classification: Categorical Cross‑Entropy (as above, summed over classes).
Regression:
- MAE (Mean Absolute Error, L1 loss):
\[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - t_i| \]
  More robust to outliers than MSE, but not differentiable where the error is exactly 0.
- MSE (L2 loss): see above.
- Smooth L1: behaves like L2 for small errors and like L1 for large errors, giving a loss that is both smooth near zero and robust to outliers.
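The task-specific losses above might be written as follows. This is a sketch under stated assumptions: the losses are averaged over samples, predictions are clipped by a small `eps` before taking logs, and Smooth L1 uses a switch-over point `beta = 1.0` with the commonly used piecewise form (0.5 * d^2 / beta below beta, d - 0.5 * beta above). None of these choices is specified in the text.

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """Mean of -[t*log(y) + (1-t)*log(1-y)] over all samples."""
    y = np.clip(y, eps, 1 - eps)   # keep log() away from 0
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

def mae(y, t):
    """Mean absolute error (L1 loss)."""
    return np.mean(np.abs(y - t))

def smooth_l1(y, t, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it (beta is an assumed default)."""
    d = np.abs(y - t)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

# Regression example: one small error (0.5) and one large error (3.0).
y_reg = np.array([0.5, 3.0])
t_reg = np.array([0.0, 0.0])
loss_mae = mae(y_reg, t_reg)           # (0.5 + 3.0) / 2 = 1.75
loss_smooth = smooth_l1(y_reg, t_reg)  # (0.125 + 2.5) / 2 = 1.3125

# Binary classification example.
loss_bce = binary_cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
print(loss_mae, loss_smooth, loss_bce)
```

In the regression example the small error is penalized quadratically (0.125) while the large one grows only linearly (2.5), which is the combination of behaviors described for Smooth L1.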
Numerical Differentiation
A smaller loss indicates better‑fitted parameters. (Further content omitted.)
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.