7. Machine Learning

Chapter 1 Overview of Machine Learning

Basic Introduction

Machine learning studies algorithms and statistical models that let a computer system improve its performance on a specific task through experience. Given a large amount of training data, a model learns the underlying patterns in the data and can then classify or predict new inputs.

Machine learning is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex optimization, computational complexity, and more.

AI, Machine Learning, and Deep Learning

Artificial intelligence (AI) is a broad area of computer science whose goal is to give machines human‑like “intelligence.” It spans natural language processing (NLP), computer vision, robotics, expert systems, and more. Implementation methods vary widely, ranging from rule‑based systems and symbolic logic to statistical methods and machine learning.

Machine Learning (ML) is a subfield of AI. Its core idea is data‑driven learning: computers automatically discover patterns from data and use those patterns for prediction or decision‑making.

Deep Learning (DL) is a subfield of machine learning that uses multi‑layer neural networks to tackle complex tasks.

Historical Development

Early Exploration: 1950s–1970s
AI research was in a “reasoning” phase; many believed that endowing machines with logical reasoning would yield intelligence.

Turing proposed the Turing Test, a basic benchmark for machine intelligence.
Frank Rosenblatt introduced the perceptron, one of the earliest artificial neural network models for linear classification.
Arthur Lee Samuel coined the term “machine learning” and built a learning algorithm for playing checkers.

Knowledge‑Driven and Expert Systems: 1970s–1980s
As research progressed, it became clear that logic alone could not achieve AI. Some argued that machines needed knowledge. With advancing hardware, AI entered a rule‑centric “knowledge‑driven” era.

Numerous expert systems appeared, e.g., MYCIN for medical diagnosis.
Decision trees were introduced (e.g., the ID3 algorithm).
Statistical learning theory began to take shape, for example through the application of Bayes’ theorem to machine learning.

Data‑Driven and Statistical Learning: 1980s–2000s
Expert systems hit a “knowledge‑engineering bottleneck.” Machine learning emerged as an independent discipline, and a variety of techniques blossomed. Connectionist neural networks continued to develop, but their limitations became apparent. Statistical learning rose to prominence, with support vector machines (SVMs) as a flagship method. The Internet supplied massive data, making statistical approaches the core of machine learning.

Decision‑tree algorithms advanced (e.g., C4.5).
SVMs were proposed as powerful classifiers.
Unsupervised methods matured, such as k‑means clustering.
Ensemble methods like random forests and Boosting (e.g., AdaBoost) improved performance.

The Rise of Deep Learning: 2000s–2010s
Rapid growth in computing power (especially GPUs) and data volume enabled deep learning to flourish. Deep neural networks dominated and broke performance barriers in vision, speech, and text.

Deep Belief Networks (DBN) signaled a revival of deep learning research.
AlexNet won the ImageNet competition, demonstrating the strength of convolutional neural networks (CNNs).
Generative Adversarial Networks (GANs) were introduced for image generation and other generative tasks.
The Transformer architecture, presented in the paper Attention Is All You Need, transformed natural‑language processing.

Large Models and General AI: 2020s
Deep learning entered a large‑scale deployment phase. Massive language models and multimodal models sparked a new wave of progress. Model sizes exploded, and the focus shifted to generality and transferability.

NLP: OpenAI’s GPT series, Google’s BERT.
Multimodal: OpenAI’s CLIP, DeepMind’s Gato.
Self‑supervised learning reduced reliance on manually labeled data.
Combining reinforcement learning with deep learning powered projects like AlphaGo, and deep learning also drove AlphaFold's advances in protein‑structure prediction.

Application Areas

Today, machine‑learning techniques appear across virtually every computer‑science subfield—multimedia, graphics, networking, software engineering, computer architecture, chip design, etc. In computer‑vision and natural‑language processing, machine learning is a primary driver of technological advancement and underpins many interdisciplinary innovations.

Key Terminology

Data set: A collection of records.
Training set: Data used to train a model.
Validation set: Data used to tune hyper‑parameters.
Test set: Data used to evaluate model performance.
Sample: One record describing an event or object.
Feature: A column describing some aspect of the event/object.
Feature vector: All features of a sample expressed as a vector, fed into the model.
Label: The target value in supervised learning.
Model: A learning algorithm together with its trained parameters, used for prediction or classification.
Parameter: Values learned during training (e.g., weights and bias in linear regression).
Hyper‑parameter: User‑specified settings not learned from data (e.g., learning rate, regularization coefficient).
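
The relationship between the data set and its training, validation, and test subsets can be sketched in a few lines. This is only an illustration: the helper name split_dataset, the 60/20/20 ratios, and the toy records are assumptions, not something the text prescribes.

```python
import random

def split_dataset(records, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle records and split them into train/validation/test lists.

    The function name and the default ratios are illustrative assumptions.
    """
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible

    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)

    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

# Each sample is (feature vector, label); the values are made up.
dataset = [([float(i), float(i % 3)], i % 2) for i in range(10)]
train_set, val_set, test_set = split_dataset(dataset)
print(len(train_set), len(val_set), len(test_set))
```

Parameters are then fitted on train_set, hyper‑parameters are compared on val_set, and test_set is touched only once for the final performance estimate.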

Chapter 2 Fundamental Theory

The Three Elements of Machine Learning

A typical machine‑learning approach consists of three components: model, strategy, and algorithm. In other words:

Machine‑learning method = Model + Strategy + Algorithm

Model: A mathematical representation that captures the underlying regularities of the data.
Strategy: The criterion for selecting the best model.
Algorithm: The concrete procedure for finding that optimal model.
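
One way to make the three elements concrete is a one‑feature linear regression. The sketch below is an assumed example rather than the only possible instantiation: the model is a line, the strategy is mean squared error, and the algorithm is batch gradient descent. The function names, learning rate, and data are all made up for illustration.

```python
def predict(w, b, x):
    """Model: a one-feature linear function y = w * x + b."""
    return w * x + b

def mse_loss(w, b, xs, ys):
    """Strategy: mean squared error, the criterion used to compare candidate models."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Algorithm: batch gradient descent searching for the w, b that minimize the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n  # dLoss/dw
        grad_b = 2 * sum(errors) / n                             # dLoss/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4), mse_loss(w, b, xs, ys))
```

Swapping any one element (a different model family, a different loss, or a different optimizer) yields a different machine‑learning method while the other two stay unchanged.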

Classification of Machine‑Learning Methods

There are many kinds of machine‑learning methods, and no single theory covers them all. From different perspectives, we can categorize them as follows:

Usual taxonomy – based on supervision: supervised learning, unsupervised learning, semi‑supervised learning, and reinforcement learning.

Supervised learning: Uses labeled training data so the model learns an input → output mapping and can predict unseen data. Essentially “learn patterns from known answers.”
Unsupervised learning: Uses unlabeled data; the model discovers intrinsic structure, patterns, or distributions (e.g., clustering, dimensionality reduction, anomaly detection). Essentially “find patterns from the data itself.”
Semi‑supervised learning: Combines a small amount of labeled data with a large amount of unlabeled data, leveraging the labels to guide learning of the unlabeled structure. Essentially “leverage few labels to unlock many unlabeled examples.”
Reinforcement learning: An agent interacts with an environment, receiving rewards and adjusting its policy to maximize long‑term cumulative reward. Essentially “learn from trial and error.”

By model type – e.g., probabilistic vs. non‑probabilistic, linear vs. nonlinear, etc.

By learning technique – e.g., Bayesian learning, kernel methods, and so on.

A schematic overview of the various machine‑learning families is shown below:

(figure omitted)

Modeling Workflow

Machine learning is data‑driven; the core is to train a model on data, evaluate and refine it, and finally deploy the mature model for prediction or problem solving. The overall supervised‑learning pipeline is:

Collect data – gather the raw “ingredients” for modeling.
Assemble training + test sets – ensure coverage of real‑world scenarios (different times, populations, contexts) so later stages have representative data.
Data cleaning – improve data quality by filtering out dirty data (errors, duplicates, missing values) or completely unusable records, reducing noise and safeguarding downstream analysis.
Feature engineering – adapt the data to the model. Transform, format, scale, encode, reduce dimensionality, or create new features so the data meet the model’s input requirements (e.g., numeric range, sparsity preferences).
Choose algorithm – match “task + data” to a suitable method. Based on the task type (classification, regression, clustering, etc.) and data characteristics (size, dimensionality, distribution, linear separability), pick an algorithm such as logistic regression, random forest, k‑means, etc.
Model training – let the model “learn the patterns.” Feed the training set into the algorithm, iteratively updating parameters until the model captures decision boundaries, regression curves, etc.
Model evaluation – test “learning effectiveness.” Using a held‑out test set, compute metrics (accuracy, MSE, AUC, …) to see whether the model meets expectations (over‑/under‑fitting? Does accuracy satisfy business needs?).
Model optimization – “polish” performance. If evaluation falls short, improve the model on top of a reasonably good baseline (hyper‑parameter tuning, algorithm swap, more data, better features) to boost generalization and accuracy.
Model deployment – put the model into production. Push the trained and optimized model to a live environment (online service, business system) and monitor it in real time, because data distributions may drift, requiring ongoing maintenance and iteration.
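
The evaluation step in the pipeline above can be sketched with two of the metrics it names. The labels, predictions, training accuracy, and the 0.1 gap threshold below are invented for illustration; real projects would typically rely on a library implementation of these metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (classification)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions on a held-out test set.
test_labels      = [1, 0, 1, 1, 0, 1, 0, 0]
test_predictions = [1, 0, 0, 1, 0, 1, 1, 0]
test_acc = accuracy(test_labels, test_predictions)

# Made-up accuracy on the training set, for comparison.
train_acc = 0.99

# A large gap between training and test accuracy is a common sign of overfitting;
# the 0.1 threshold is an arbitrary illustration, not a standard value.
possibly_overfit = (train_acc - test_acc) > 0.1
print(test_acc, possibly_overfit)
```

If both training and test scores are poor, the model is more likely under‑fitting, which points toward better features or a more expressive algorithm in the optimization step.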

Chapter 3 Feature Engineering

Basic Introduction

Feature engineering is a crucial step in the machine‑learning pipeline. It involves processing, transforming, and constructing raw data to create new features or select effective ones, thereby improving model performance. In short, feature engineering converts raw data into a representation that better captures the problem’s essence, helping the model learn underlying patterns. Good feature engineering can dramatically boost performance; neglecting it often leads to subpar results.

In practice, feature engineering is iterative and highly context‑dependent. It requires extensive data analysis and domain expertise. The appropriate encoding depends on the model type, the relationship between predictors and the target, and the problem at hand. Different data modalities (text, images, etc.) may call for different techniques, making a one‑size‑fits‑all prescription impossible.

Core Topics

Feature Selection

Feature selection picks the most relevant original features for the target variable, discarding redundant, irrelevant, or noisy ones. This reduces model complexity, speeds up training, and mitigates overfitting.

Key traits

Does not create new features or alter data structure.
Retains a subset of the original features.
Often improves model interpretability.

Some features may be critical, others negligible. For example:

A feature with very little variation may be unrelated to the outcome.
Two highly correlated features may carry redundant information.

Thus, feature selection is the most fundamental and common operation in feature engineering.

Categories of feature‑selection methods

Filter methods – evaluate the statistical relationship between each feature and the target, then pick the most relevant ones. Independent of any learning algorithm and fast, making them suitable for an initial screen.
- Variance threshold: remove features whose variance falls below a set threshold.
- Correlation‑based: compute Pearson or Spearman correlation coefficients.
- Chi‑square test: for categorical targets, assess independence.
- Mutual information: measure shared information between feature and target.
- F‑test: for regression, assess linear relationship strength.
Wrapper methods – use a specific model to assess feature importance and search for the best subset based on model performance.
- Recursive Feature Elimination (RFE): iteratively remove the least important features.
- Forward selection: add features step‑wise until performance stops improving.
- Backward elimination: start with all features and remove the least important iteratively.
- Bidirectional search: combine forward and backward steps.
Embedded methods – leverage the model’s own feature‑selection mechanism (e.g., tree‑based importance, L1 regularization) during training, blending advantages of filter and wrapper approaches.
- L1 regularization (Lasso): drives some coefficients to zero, effecting selection.
- Tree‑based importance: e.g., random forest or XGBoost feature importance scores.
- Elastic Net: mixes L1 and L2 penalties.
- Other tree‑model embeddings: use feature‑importance scores from gradient‑boosted trees, etc.

Low‑variance filter – the simplest approach: discard features whose variance is near zero because they provide almost no predictive signal.
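
A low‑variance filter might look like the following with pandas. The column names, values, and the 0.01 threshold are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "tv_spend":    [230.1, 44.5, 17.2, 151.5, 180.8],
    "radio_spend": [37.8, 39.3, 45.9, 41.3, 10.8],
    "region_code": [1, 1, 1, 1, 1],        # constant column: variance is 0
})

threshold = 0.01                            # illustrative cut-off
variances = df.var()                        # sample variance (ddof=1) per column
kept = df.loc[:, variances > threshold]     # keep only columns above the cut-off
print(list(kept.columns))
```

Variance depends on a feature's scale, so a single threshold is only meaningful when features are on comparable scales. scikit‑learn offers the same idea as sklearn.feature_selection.VarianceThreshold.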

Correlation‑based methods – compute correlations between features and the target (or among features) to keep highly related ones and drop redundant ones.

Pearson correlation coefficient – measures linear correlation between two variables, ranging from –1 to 1.

Positive: near 1, feature increases with the target.
Negative: near –1, feature decreases as the target rises.
No correlation: near 0.

Example: a dataset contains advertising spend across channels and sales revenue.

Use pandas.DataFrame.corrwith(other, method="pearson"), where other is the label column, to compute Pearson correlations between each feature and the label.
Use pandas.DataFrame.corr(method="pearson") to obtain the full correlation matrix.
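
The two pandas calls can be tried on a small made‑up advertising table. The channel names and numbers below are assumptions for illustration only.

```python
import pandas as pd

ads = pd.DataFrame({
    "tv":        [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "radio":     [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
    "sales":     [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})

features = ads.drop(columns="sales")
label = ads["sales"]

# Correlation of every feature column with the label (a Series indexed by feature name).
feature_vs_label = features.corrwith(label, method="pearson")

# Full pairwise correlation matrix, including feature-feature correlations.
corr_matrix = ads.corr(method="pearson")

print(feature_vs_label)
print(corr_matrix.round(2))
```

The feature‑versus‑label Series is what a filter method would rank, while the off‑diagonal feature‑feature entries of the matrix help spot redundant pairs.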

Spearman rank correlation – Pearson correlation applied to the ranks of variables, capturing monotonic (not necessarily linear) relationships. Useful for non‑linear or non‑normally distributed data.

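Formula (exact when there are no tied ranks):

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Formula components:

d_i = difference between the ranks of the two variables for sample i.
n = number of samples.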

Spearman values also lie in [–1, 1]:

1: perfect positive monotonic relationship.
–1: perfect negative monotonic relationship.
0: no monotonic relationship.

Example: weekly study hours vs. math exam scores. Rank the values, compute rank differences, then use pandas.DataFrame.corrwith(other, method="spearman"), where other is the score column, to calculate the Spearman coefficient.
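
The study‑hours example can be worked through step by step. The numbers below are invented and contain no ties. For the library check, this sketch uses pandas.DataFrame.corr(method="spearman") in place of corrwith; both compute the same coefficient.

```python
import pandas as pd

study = pd.DataFrame({
    "hours": [2, 4, 5, 7, 9, 11],
    "score": [55, 60, 58, 72, 80, 95],
})

# Step 1: rank each variable (no ties in this made-up data).
ranks = study.rank()

# Step 2: rank differences d and the closed-form coefficient 1 - 6*sum(d^2) / (n(n^2 - 1)).
d = ranks["hours"] - ranks["score"]
n = len(study)
rho_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Step 3: Spearman is Pearson applied to the ranks.
rho_rank_pearson = ranks["hours"].corr(ranks["score"])

# Step 4: the same value straight from pandas.
rho_pandas = study.corr(method="spearman").loc["hours", "score"]

print(rho_formula, rho_rank_pearson, rho_pandas)
```

All three values agree here because the data has no tied ranks; with ties, the rank‑based Pearson computation remains valid while the closed‑form shortcut is only approximate.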

Feature Transformation

Feature transformation applies mathematical or statistical operations to make data better suited to a model’s assumptions about input distribution.

Numerical transformations

Normalization – scale features to a specific range (commonly 0–1). Useful for scale‑sensitive models such as k‑NN or SVM.
Standardization (Z‑score) – subtract the mean μ and divide by the standard deviation σ, yielding a distribution with mean 0 and std 1. Preferred for linear models (logistic/linear regression) when features are roughly Gaussian.
Log transformation – apply log(x) to right‑skewed, strictly positive data (e.g., income, price) to compress large ranges and stabilize variance; log(1 + x) is a common variant when zeros occur.

Polynomial transformation – raise features to higher powers or create interaction terms, expanding the feature space so linear models can capture non‑linear relationships.
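
The numerical transformations above can be sketched with the standard library alone. The income values and the helper name poly2 are assumptions for illustration; in practice a library such as scikit‑learn is the usual route.

```python
import math
import statistics

incomes = [20_000.0, 35_000.0, 50_000.0, 80_000.0, 400_000.0]  # made-up, right-skewed

# Min-max normalization: rescale to the range [0, 1].
lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]

# Z-score standardization: subtract the mean, divide by the (population) standard deviation.
mu = statistics.fmean(incomes)
sigma = statistics.pstdev(incomes)
standardized = [(x - mu) / sigma for x in incomes]

# Log transformation: requires strictly positive inputs.
logged = [math.log(x) for x in incomes]

def poly2(a, b):
    """Degree-2 polynomial expansion of two features: [a, b, a^2, a*b, b^2]."""
    return [a, b, a * a, a * b, b * b]

print(normalized)
print([round(z, 3) for z in standardized])
print([round(v, 3) for v in logged])
print(poly2(2.0, 3.0))
```

Note that lo, hi, mu, and sigma should be computed on the training set only and then reused to transform validation and test data, so that no information from held‑out data leaks into the features.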

Categorical encoding

One‑hot encoding – convert each category into a binary column. Ideal for nominal features with few distinct values (e.g., color, city).
Label encoding – map categories to integers (0, 1, 2, …). Suitable for ordinal features (e.g., low/medium/high) and tree‑based models that can handle integer‑encoded categories.
Target (mean) encoding – replace each category with the mean of the target variable for that category. Helpful for high‑cardinality features that strongly correlate with the target, reducing dimensionality.
Frequency encoding – replace each category with its occurrence frequency in the dataset. Useful for high‑cardinality features when frequency correlates with the target.
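
The four encodings can be compared side by side on a tiny made‑up table. The column names, categories, and the ordering low < medium < high are assumptions for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", "green", "red", "blue", "green", "red"],
    "size":   ["low", "high", "medium", "low", "high", "medium"],
    "bought": [1, 0, 1, 0, 1, 1],          # target variable
})

# One-hot encoding: one binary column per colour.
one_hot = pd.get_dummies(df["color"], prefix="color").astype(int)

# Label (ordinal) encoding with an explicit order.
size_order = {"low": 0, "medium": 1, "high": 2}
size_encoded = df["size"].map(size_order)

# Target (mean) encoding: replace each category with the mean target for that category.
color_target_mean = df.groupby("color")["bought"].mean()
color_target_encoded = df["color"].map(color_target_mean)

# Frequency encoding: replace each category with its relative frequency.
color_freq = df["color"].value_counts(normalize=True)
color_freq_encoded = df["color"].map(color_freq)

print(one_hot)
print(size_encoded.tolist())
print(color_target_encoded.tolist())
print(color_freq_encoded.round(3).tolist())
```

Target encoding computed on the full data set, as done here for brevity, leaks label information into the features. A safer design computes the category means on training folds only and applies them to held‑out rows.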

Feature Construction

Feature construction creates new, more informative features from existing ones by combining, transforming, or aggregating them. This step demands creativity and domain knowledge, as it often yields the biggest performance gains.

Interaction features – combine two or more features (e.g., multiplication, concatenation) to capture non‑linear relationships between them.
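
A minimal sketch of both kinds of interaction, multiplication for numeric features and concatenation for categorical ones, assuming a hypothetical orders table with made‑up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "price":    [10.0, 12.5, 8.0],
    "quantity": [3, 1, 5],
    "city":     ["Paris", "Rome", "Paris"],
    "channel":  ["web", "store", "web"],
})

# Numeric interaction: the product of two features.
df["revenue"] = df["price"] * df["quantity"]

# Categorical interaction: concatenate two categories into one combined category.
df["city_channel"] = df["city"] + "_" + df["channel"]

print(df[["revenue", "city_channel"]])
```

The combined categorical column can then be encoded like any other categorical feature.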

(content truncated)

Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.
