8. Deep Learning

Chapter 1 Overview of Deep Learning

Deep learning, a branch of machine learning, focuses on using multi‑layer neural networks (deep neural networks) to model and solve problems.

The human brain contains many interconnected neurons. When the brain processes information, these neurons interact via electrical signals and chemicals, transmitting information across different brain regions. Neural networks mimic this biological phenomenon with artificial neurons—software modules called nodes—that communicate and transmit information through numerical computation.

Characteristics of Deep Learning

Uses multi‑layer neural networks that can automatically extract hierarchical features from data.
Suited for unstructured data such as images, audio, and text.
Relies on large amounts of data and computational resources; training can be time‑consuming.
Models are complex and often regarded as “black boxes,” offering limited interpretability.

Application Scenarios

Financial services: algorithmic trading, loan‑risk assessment, fraud detection, credit and portfolio management.
Media & entertainment: personalized recommendations for e‑commerce and streaming services.
Customer service: chatbots, virtual assistants, and inbound service portals that use speech recognition.
Healthcare: medical imaging analysis and digitized health records to improve diagnosis and deliver more precise, efficient care.
Industrial automation: safety monitoring in factories/warehouses, quality control, and predictive maintenance.
Autonomous vehicles: researchers train models to detect stop signs, traffic lights, crosswalks, pedestrians, etc.
Aerospace & defense: large‑scale surveillance to detect objects, identify areas of interest, and verify safe/unsafe zones for troops.
Law enforcement: speech recognition, computer vision, and natural language processing (NLP) help analyze massive datasets, saving time and resources.

Chapter 2 Fundamentals of Neural Networks

Components of a Neural Network

Basic Concepts and Structure

In biology, a neuron receives electro‑chemical pulses from other neurons via its dendrites. When the summed input exceeds a threshold, the neuron fires an action potential that travels down the axon to the synapse, triggering the release of neurotransmitter‑filled vesicles. Those chemicals then diffuse to neighboring neurons.

An Artificial Neural Network (ANN), often simply called a Neural Network (NN), is a computational model that imitates the structure and function of biological neural networks. In most cases, an ANN can adapt its internal structure based on external information, making it an adaptive system—in plain terms, it can learn.

In an ANN, each artificial neuron typically computes a weighted sum of multiple inputs, passes the result through an activation function, and outputs the transformed value.

Stacking many neurons creates a multi‑layer network:

The leftmost column of neurons receives the raw inputs → input layer.
The rightmost column produces the final outputs → output layer.
All layers in between are collectively called hidden (intermediate) layers.

Neurons in adjacent layers are fully connected (each neuron in the next layer connects to every neuron in the previous layer), and each connection carries a weight.

Information propagates layer‑by‑layer—commonly referred to as forward propagation—where the output of one layer becomes the input of the next.

Perceptron Recap

Recall the perceptron from the machine‑learning section.

A perceptron is a binary classifier that receives multiple input signals and outputs a single binary signal (0 or 1).

Below is an example of a perceptron with two inputs:

[input] → ○ (neuron) → [output]

Each input is multiplied by a fixed weight. The neuron sums the weighted inputs; if the sum exceeds a threshold, it outputs 1 (the neuron is “activated”).

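Here the threshold is set to 0. In addition to the weights w, we can add a bias b, so the perceptron outputs 1 when w1 * x1 + w2 * x2 + b > 0 and 0 otherwise; this is equation (2.1).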

Each input has its own weight, controlling the importance of that signal; larger weights mean higher importance. The bias adjusts how easily the neuron fires.
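
As a minimal sketch of the two-input perceptron described above, the weights and biases below are hypothetical values picked by hand (they do not come from the original text). With them, the same unit behaves like an AND gate or an OR gate depending only on the bias, which illustrates how the bias controls how easily the neuron fires.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical parameters: equal weights, strongly negative bias -> AND.
w = np.array([0.5, 0.5])
and_outputs = [perceptron(np.array(x), w, b=-0.7) for x in inputs]

# Same weights, bias closer to zero -> the neuron fires more easily (OR).
or_outputs = [perceptron(np.array(x), w, b=-0.2) for x in inputs]

print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
```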

Introducing Activation Functions

Equation (2.1) actually performs two steps:

Compute a weighted sum of inputs plus bias.
Compare that sum to a threshold to decide whether to output 0 or 1.

If we define a function

f(x) = { 1 if x > 0 else 0 }

then (2.1) can be simplified to

output = f(weighted_sum + bias)

We often denote the weighted sum plus bias as z.

The function that maps z to the output is called an activation function because it “activates” the neuron.

Activation Functions

Role

Activation functions are the bridge between perceptrons and full neural networks. Without them, a network would be equivalent to a single linear transformation, no matter how many layers you stack—essentially a “no‑hidden‑layer” network. Therefore, activation functions must be non‑linear, introducing the capacity to learn and represent complex, non‑linear relationships.
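
The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses random, purely illustrative weights (all names and shapes are assumptions): two layers without an activation are reproduced exactly by a single layer, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                    # 4 samples, 3 features

W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two layers with no activation in between.
two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping written as one linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
collapsed = np.allclose(two_layer, one_layer)

# With a non-linearity between the layers, the single layer no longer matches.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
still_linear = np.allclose(with_relu, one_layer)

print(collapsed, still_linear)
```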

Common Activation Functions

Binary Step
In the perceptron, the step function is the simplest activation: it sets a threshold and switches the output between 0 and 1. Its derivative is 0 everywhere except at the threshold, where it is undefined, so it provides no gradient for learning and is unsuitable for modern networks.

Definition

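step(x) = 1 if x > 0 else 0

Explanation
It acts like a hard switch; the output jumps abruptly when the input crosses the boundary. Because gradients are zero almost everywhere, back-propagation cannot update parameters, so gradient-based training does not work for deep models.

Code

import numpy as np

def step(x):
    # Element-wise on scalars and NumPy arrays; fires only when x > 0,
    # matching the perceptron function f defined earlier.
    return np.where(x > 0, 1.0, 0.0)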

Sigmoid (Logistic)
A smooth, differentiable function that maps any real input to the interval (0, 1).

Definition

σ(x) = 1 / (1 + exp(-x))

Derivative

σ'(x) = σ(x) * (1 - σ(x))

Usage
Commonly used in binary‑classification output layers. Because it involves exponentials, it is computationally heavier than ReLU.

Notes

Compresses inputs outside [-6, 6] to near‑constant values, potentially losing information.
Outputs are always positive, which can bias subsequent layers.
Derivative ranges between 0 and 0.25; gradients can vanish, especially for large‑magnitude inputs, leading to the saturation problem in deep networks.

Code

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
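
To make the derivative formula and the saturation note concrete, the following sketch compares σ'(x) = σ(x)(1 - σ(x)) with a central-difference estimate. The helper name sigmoid_grad and the sample points are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid_grad(x)

print(analytic)   # largest value (0.25) at x = 0, close to 0 at x = -10 and x = 10
```

The gradient peaks at 0.25 for x = 0 and is already below 1e-4 at |x| = 10, which is the saturation behaviour described in the notes.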

Tanh (Hyperbolic Tangent)
Maps inputs to the interval (‑1, 1) and is symmetric around the origin.

Definition

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Relation to Sigmoid

tanh(x) = 2 * sigmoid(2x) - 1

Derivative

tanh'(x) = 1 - tanh(x)^2

Notes
Zero‑centered output often yields more balanced gradient updates than sigmoid. However, it still suffers from saturation for large inputs.

Code

def tanh(x):
    return np.tanh(x)
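
The relation to sigmoid and the derivative formula can be verified in a few lines. This is only a numerical check on a handful of sample points, not a proof.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)          # -5, -4, ..., 5

# tanh(x) = 2 * sigmoid(2x) - 1
identity_holds = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# tanh'(x) = 1 - tanh(x)^2: equal to 1 at x = 0, near 0 for large |x|
tanh_grad = 1 - np.tanh(x) ** 2

print(identity_holds)
print(tanh_grad)
```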

ReLU (Rectified Linear Unit)
Outputs 0 for negative inputs and passes positive inputs unchanged.

Definition

ReLU(x) = max(0, x)

Notes

Simple and computationally cheap; gradients are 1 for positive inputs, mitigating the vanishing‑gradient problem.
Produces sparse activations (many zeros), which can improve efficiency.
Neurons that stay in the negative region become “dead” (zero gradient).

Leaky ReLU introduces a small slope (e.g., 0.01) for negative inputs to keep gradients alive.

Code

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
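
The "dead neuron" issue is easier to see from the gradients than from the functions themselves. The sketch below defines two hypothetical helpers, relu_grad and leaky_relu_grad (not part of the original code); the gradient at exactly x = 0 is taken to be the negative-side value, which is a common convention rather than a mathematical necessity.

```python
import numpy as np

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for positive inputs, alpha otherwise, so the gradient never vanishes
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
g_relu = relu_grad(x)
g_leaky = leaky_relu_grad(x)

print(g_relu)    # [0. 0. 1. 1.]
print(g_leaky)   # [0.01 0.01 1.   1.  ]
```

A unit whose inputs are always negative receives a zero gradient under ReLU and stops learning; under Leaky ReLU it still receives a small update.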

Softmax
Transforms a vector of real numbers into a probability distribution (sums to 1). It generalizes the sigmoid to multi‑class problems.

Definition

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Stability trick – subtract the maximum value before exponentiation to avoid overflow; the result is unchanged.

Derivative (Jacobian)

∂softmax_i/∂z_j = softmax_i * (δ_ij - softmax_j)
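
As a rough sketch, the Jacobian formula above translates into one line of NumPy for a single input vector. The function name softmax_jacobian and the sample vector are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = s_i * (delta_ij - s_j)
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)

print(J)
print(J.sum(axis=1))   # each row sums to (numerically) zero
```

Each row sums to zero because the outputs always add up to 1: raising one probability must lower the others by the same total amount.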

Notes
Amplifies larger inputs, making the highest‑scoring class dominate the probability mass.

Code

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
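
To see why the stability trick matters, the sketch below compares a naive softmax with the shifted version on a vector of large scores. The function names softmax_naive and softmax_stable and the inputs are illustrative assumptions.

```python
import numpy as np

def softmax_naive(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_stable(z):
    z = z - np.max(z)               # largest exponent becomes exp(0) = 1
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf, and inf / inf yields nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

stable_large = softmax_stable(large)

print(naive_large)    # [nan nan nan]
print(stable_large)   # same probabilities as softmax_stable(small)
```

Because softmax depends only on differences between scores, `large` and `small` give the same distribution once the maximum is subtracted.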

Other Common Activations

Identity (linear)
Leaky ReLU, PReLU, RReLU, ELU, Swish (SiLU), Softplus, etc.

Choosing an Activation Function

Hidden layers

Prefer ReLU; try Leaky ReLU if ReLU performs poorly.
Avoid sigmoid (gradient vanishing).
Tanh is zero‑centered and can be useful for shallow networks.

Output layer

Binary classification → Sigmoid.
Multi‑class classification → Softmax.
Regression → Identity.

Simple Implementation of a Neural Network

A deep neural network consists of multiple layers and is often called a model. The model receives raw inputs (features), produces outputs (predictions), and holds a set of parameters (weights and biases). Each layer takes the output of the previous layer, applies its own weights, biases, and activation, and passes the result forward.

Three‑Layer Example

Consider a network with:

Input layer (layer 0): 2 neurons
Hidden layer 1 (layer 1): 3 neurons
Hidden layer 2 (layer 2): 2 neurons
Output layer (layer 3): 2 neurons

The diagram below (omitted) shows the connections. In practice each layer also has biases, and the weighted sums are passed through activation functions.

Signal flow from input to layer 1

Weights are denoted with a superscript indicating the destination layer, e.g., w^{(1)}_{ij} is the weight from input neuron j to hidden neuron i in layer 1. Biases have a single subscript because there is one bias per neuron.

Using equations (2.4) and (2.5) we obtain:

z_i^{(1)} = Σ_j w^{(1)}_{ij} * x_j + b_i^{(1)}
a_i^{(1)} = activation(z_i^{(1)})

In matrix form:

Z^{(1)} = X · W^{(1)} + b^{(1)}
A^{(1)} = activation(Z^{(1)})

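Since we have 2 inputs and 3 neurons in layer 1, W^{(1)} is a 2 × 3 matrix. Here X is a row vector of inputs, so the entry in row j, column i of W^{(1)} is the weight w^{(1)}_{ij} from input j to neuron i.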

Layer 1 → Layer 2
Similar computation; W^{(2)} is a 3 × 2 matrix.

Layer 2 → Output (layer 3)
W^{(3)} is a 2 × 2 matrix; the output activation may differ (often identity for regression).
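
The three steps can be traced by hand with concrete numbers. The weights and biases below are hypothetical values chosen only to make the shapes visible; hidden layers use sigmoid and the output layer uses identity.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([1.0, 0.5])                    # layer 0: 2 inputs

# Hypothetical parameters, not taken from a trained model.
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])            # 2 x 3
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([[0.1, 0.4],
               [0.2, 0.5],
               [0.3, 0.6]])                 # 3 x 2
b2 = np.array([0.1, 0.2])
W3 = np.array([[0.1, 0.3],
               [0.2, 0.4]])                 # 2 x 2
b3 = np.array([0.1, 0.2])

a1 = sigmoid(x @ W1 + b1)                   # shape (3,)
a2 = sigmoid(a1 @ W2 + b2)                  # shape (2,)
y = a2 @ W3 + b3                            # shape (2,), identity output

print(a1.shape, a2.shape, y.shape)
print(y)
```

The shapes chain as (2,) → (3,) → (2,) → (2,): the number of columns of each weight matrix sets the width of the next layer.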

Code Skeleton

All parameters (weights W and biases b) can be stored in a dictionary params. Define two functions:

init_params() – initialize parameters (weights as 2-D arrays, biases as 1-D arrays).
forward() – forward propagation, converting inputs to outputs.

Hidden layers use sigmoid, the output layer uses identity.

def init_params(layers):
    # layers = [input_dim, hidden1, hidden2, output_dim]
    params = {}
    for i in range(1, len(layers)):
        # Small random weights of shape (fan_in, fan_out); biases start at zero
        params[f'W{i}'] = np.random.randn(layers[i-1], layers[i]) * 0.01
        params[f'b{i}'] = np.zeros(layers[i])
    return params

def forward(x, params):
    n_layers = len(params) // 2   # one W and one b per layer
    a = x
    for i in range(1, n_layers + 1):
        z = a @ params[f'W{i}'] + params[f'b{i}']
        a = sigmoid(z) if i < n_layers else z   # identity on final layer
    return a

Application Example: Handwritten Digit Recognition

Using the Digit Recognizer dataset (Kaggle competition), each row in train.csv contains a label (0‑9) followed by 784 pixel values (28 × 28 grayscale image). The task is to build a network that, given the 784‑dimensional input, predicts the digit—i.e., performs inference.

Network architecture:

Input layer: 784 neurons
Hidden layer 1: 50 neurons
Hidden layer 2: 100 neurons
Output layer: 10 neurons (one per digit)

Assume the network has already been trained; we load the saved parameters and run forward propagation.

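# load_parameters is a helper that is not shown here; it is assumed to return
# the same {'W1', 'b1', ...} dictionary layout that init_params() builds.
params = load_parameters('digit_model.npz')

def predict(img):
    # img: one image as 784 pixel values; returns the digit with the highest score
    return np.argmax(forward(img, params))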

Batch‑processing optimization can be added (code omitted).
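
One possible shape for batch processing is sketched below. Since the trained parameters and the Kaggle data are not available here, the sketch substitutes random stand-in parameters and fake images, so the predicted labels are meaningless; only the batching mechanics are the point. The batch size of 100 is an arbitrary assumption.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, params):
    n_layers = len(params) // 2
    a = x
    for i in range(1, n_layers + 1):
        z = a @ params[f'W{i}'] + params[f'b{i}']
        a = sigmoid(z) if i < n_layers else z
    return a

# Stand-in parameters for the 784-50-100-10 architecture (NOT a trained model).
rng = np.random.default_rng(0)
sizes = [784, 50, 100, 10]
params = {}
for i in range(1, len(sizes)):
    params[f'W{i}'] = rng.normal(scale=0.01, size=(sizes[i - 1], sizes[i]))
    params[f'b{i}'] = np.zeros(sizes[i])

images = rng.random((250, 784))     # 250 fake images with pixel values in [0, 1)

batch_size = 100
predictions = []
for start in range(0, len(images), batch_size):
    batch = images[start:start + batch_size]        # shape (<=100, 784)
    scores = forward(batch, params)                 # shape (<=100, 10)
    predictions.append(np.argmax(scores, axis=1))   # one label per row
predictions = np.concatenate(predictions)

print(predictions.shape)   # (250,)
```

Note the `axis=1` in `np.argmax`: with a batch, the scores form a 2-D array, and the maximum must be taken per row rather than over the whole array.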

Chapter 3 Learning in Neural Networks

The key property of neural networks is their ability to learn from data—automatically adjusting weights to minimize a chosen loss function.

Unlike traditional machine‑learning methods that require handcrafted features (e.g., SIFT, HOG), deep learning can operate directly on raw data.

Loss Functions

A loss function quantifies how far the network’s predictions are from the true targets; training aims to minimize this value.

Common Losses

Mean Squared Error (MSE) – also called L2 loss

\[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - t_i)^2 \]

y_i: network output
t_i: true label
n: number of output elements (the dimensionality of y)

Because each error is squared, MSE is sensitive to outliers: a few large errors dominate the loss and produce very large gradients.

Code

def mse(y, t):
    return np.mean((y - t) ** 2)

Cross‑Entropy Error (for classification)

\[ \text{CE} = -\sum_{i} t_i \log(y_i) \]

y_i: predicted probability (softmax output)
t_i: one-hot encoded true label

Code

def cross_entropy(y, t):
    return -np.sum(t * np.log(y + 1e-12))
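
With a one-hot label, only the term for the true class survives the sum, so the loss reduces to the negative log of the probability assigned to the correct class. A small worked example, with made-up probability vectors:

```python
import numpy as np

def cross_entropy(y, t):
    # 1e-12 guards against log(0) when a predicted probability is exactly 0
    return -np.sum(t * np.log(y + 1e-12))

t = np.array([0, 0, 1])                   # true class is index 2

confident = np.array([0.1, 0.1, 0.8])     # most mass on the true class
wrong = np.array([0.7, 0.2, 0.1])         # most mass on a wrong class

loss_confident = cross_entropy(confident, t)   # about -log(0.8)
loss_wrong = cross_entropy(wrong, t)           # about -log(0.1)

print(loss_confident, loss_wrong)
```

The loss is small when the true class gets high probability and grows quickly as that probability approaches 0.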

Losses for Specific Tasks

Binary classification: Binary Cross-Entropy

\[ L = -[t\log(y) + (1-t)\log(1-y)] \]

Multi-class classification: Categorical Cross-Entropy (as above, summed over classes).

Regression:

- MAE (Mean Absolute Error, L1 loss)

  \[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - t_i| \]

  Robust to outliers but non-differentiable where the error is 0.
- MSE (L2 loss) – see above.
- Smooth L1 – uses the L2 form for small errors and the L1 form for large errors, giving a loss that is smooth near 0 and robust to outliers.
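
The different outlier behaviour of the three regression losses shows up clearly on a toy example. The data are made up, and the Smooth L1 below uses one common parameterization with a switch point beta = 1.0, which is an assumption since the text does not fix one.

```python
import numpy as np

def mse(y, t):
    return np.mean((y - t) ** 2)

def mae(y, t):
    return np.mean(np.abs(y - t))

def smooth_l1(y, t, beta=1.0):
    # Quadratic for |error| < beta, linear beyond it
    diff = np.abs(y - t)
    return np.mean(np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta))

t = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.5, 2.0, 2.5, 4.0])
y_outlier = np.array([1.5, 2.0, 2.5, 14.0])    # last prediction is off by 10

for name, loss in [("MSE", mse), ("MAE", mae), ("Smooth L1", smooth_l1)]:
    print(name, loss(y_clean, t), loss(y_outlier, t))
# MSE 0.125 25.125
# MAE 0.25 2.75
# Smooth L1 0.0625 2.4375
```

A single outlier multiplies the MSE by about 200, while MAE and Smooth L1 grow only linearly with the size of the error.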

Numerical Differentiation

A smaller loss indicates better‑fitted parameters. (Further content omitted.)
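
Since the details of this section are omitted, the following is only a generic sketch of numerical differentiation, not the author's code. It estimates the gradient of a loss with central differences and takes one gradient-descent step; the function name numerical_gradient, the step size h, the learning rate, and the toy loss are all assumptions.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference estimate of the gradient of f at x (x is a float array)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        original = x.flat[i]
        x.flat[i] = original + h
        f_plus = f(x)
        x.flat[i] = original - h
        f_minus = f(x)
        x.flat[i] = original                    # restore the parameter
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy loss: MSE between the parameters w and a fixed target t.
t = np.array([1.0, 2.0])

def loss(w):
    return np.mean((w - t) ** 2)

w = np.array([3.0, -1.0])
g = numerical_gradient(loss, w)                 # analytic gradient is (w - t) = [2, -3]

w_new = w - 0.1 * g                             # one gradient-descent step
print(g, loss(w), loss(w_new))
```

Moving the parameters a small step against the gradient lowers the loss, which is the basic mechanism behind learning.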

Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.
