Li Wei
5. Python Data Analysis – NumPy, Pandas, Matplotlib
Introduction to Data Analysis
Overview
Why learn data analysis?
| Feature | Excel | Python (Pandas) |
|---|---|---|
| Data size you can handle | Becomes sluggish beyond ~10,000 rows | Over 1,000,000 rows (automated) |
| Manual vs. code‑driven | Manual operation | One‑click code execution |
| Learning difficulty | Simple | Requires basic programming |
Traditional method: processing data manually with Excel
- Problem: when the dataset exceeds 10,000 rows Excel becomes sluggish; complex calculations require intricate formulas.
- Example: ranking the grades of 1,000 students manually takes about 2 hours.
Python data analysis:
- Advantage: automatically handles million‑row datasets; code is reusable.
- Example: the same task completed with Pandas in 3 minutes.
Full Workflow
Data collection
- Where does the data come from? Company databases | public datasets (e.g., government data) | manual web scraping
Data cleaning (the most important step!)
- Typical issues:
- Missing values (empty cells in Excel)
- Erroneous data (e.g., age entered as 200)
- Inconsistent formats (dates written as “2023年1月1日” and “01/01/2023” together)
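The three issues above can be sketched with Pandas roughly as follows. This is a minimal illustration, not a general recipe: the sample rows, the column names, the 0 to 120 age bound, the median fill, and the helper `parse_date` are all assumptions made for the example.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical sample data that shows all three problems at once.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [23, np.nan, 200, 31],
    "signup": ["2023年1月1日", "01/01/2023", "2023年2月15日", "02/15/2023"],
})

# 1. Missing values: count them per column.
missing_per_column = df.isna().sum()

# 2. Erroneous data: treat ages outside an assumed 0-120 range as missing,
#    then fill every missing age with the median of the remaining ages.
df["age"] = df["age"].where(df["age"].between(0, 120))
df["age"] = df["age"].fillna(df["age"].median())

# 3. Inconsistent formats: try each known date format in turn.
def parse_date(text):
    for fmt in ("%Y年%m月%d日", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return pd.NaT  # leave unparseable dates as missing

df["signup"] = pd.to_datetime(df["signup"].map(parse_date))
print(df)
```

Whether to fill, drop, or flag bad values depends on the dataset; the median fill here is only one option.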
Data analysis
- Common techniques:
- Statistics (mean, max, proportion)
- Group comparisons (e.g., spending differences between male and female users)
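As a small sketch of these techniques in Pandas, the example below computes a mean, a maximum, a proportion, and a per-group average. The `orders` table and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical order records; columns and values are made up.
orders = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "spend": [120.0, 80.0, 200.0, 100.0, 160.0],
})

# Statistics: mean, maximum, and the proportion of orders above 100.
mean_spend = orders["spend"].mean()
max_spend = orders["spend"].max()
share_over_100 = (orders["spend"] > 100).mean()

# Group comparison: average spend per gender.
mean_by_gender = orders.groupby("gender")["spend"].mean()

print(mean_spend, max_spend, share_over_100)
print(mean_by_gender)
```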
Data visualization
- A picture is worth a thousand words: line charts (trends) | bar charts (comparisons) | scatter plots (correlations)
The Data‑Science Toolchain
The core trio:
| Tool | Role | Analogy |
|---|---|---|
| NumPy | High‑performance numerical computing (matrices/vectors) | The “engine” of data |
| Pandas | Tabular data handling (like a sophisticated Excel) | The “scalpel” of data |
| Matplotlib | Data visualization (plotting library) | The “translator” of data |
Typical workflow: NumPy processes numbers → Pandas organizes tables → Matplotlib creates visual output.
Auxiliary Tools
Jupyter Notebook – an interactive programming environment that shows code and results side by side.
- Benefit: ideal for teaching and exploratory analysis; notebooks can mix text, code, and graphics.
Anaconda – a one‑click installer for the whole scientific‑computing stack.
- Includes: Python interpreter + common libraries + environment‑management tools.
Anaconda
What is Anaconda?
Official website: (link omitted)
In short, Anaconda = Python + package & environment manager (conda) + common libraries + integrated tools. It’s perfect for anyone who wants to spin up a data‑science or machine‑learning environment quickly. Think of Anaconda as the car and Python as the engine; installing Anaconda gives you a ready‑to‑drive vehicle without having to assemble the engine yourself.
The conda command manages packages, dependencies, and environments. Compared with the traditional pip tool, conda makes switching between environments easier and overall environment management simpler.
Why choose Anaconda?
- Easy installation: Installing Anaconda is as simple as installing any other application; it comes pre‑loaded with many useful tools, so you don’t have to configure each one manually.
- Package manager: Anaconda includes Conda, which can install, update, and manage packages for multiple languages, not just Python.
- Environment manager: Create and maintain separate Python environments (e.g., Python 2 and Python 3) and switch between them at will—great for avoiding version conflicts across projects.
- Bundled tools & libraries: Comes with essential data‑science packages such as NumPy, Pandas, Matplotlib, SciPy, Scikit‑learn, etc.
- Jupyter Notebook: An interactive computing environment that supports live code, equations, visualizations, and narrative text.
- Spyder IDE: A scientific‑computing‑oriented IDE with code editing, debugging, and data‑visualization features.
- Cross‑platform: Runs on Windows, macOS, and Linux.
- Community support: A large user community offers forums, tutorials, and troubleshooting help.
Core Advantages
- 200+ pre‑installed data‑science packages
- Ready‑to‑use: No need to manually install NumPy, Pandas, etc.
- Complete ecosystem: Includes tools for analysis, machine learning, and visualization.
Comparison: Anaconda vs. native Python + pip
| Dimension | Anaconda | Native Python + pip |
|---|---|---|
| Installation difficulty | ⭐️ One‑click install of everything | ⭐️⭐️⭐️ Install each library manually |
| Dependency management | Conda resolves conflicts automatically | pip may encounter version‑compatibility issues |
| Disk space | ⚠️ Larger (≈3 GB + base packages) | ✅ Minimal (install only what you need, as low as tens of MB) |
| Ideal scenario | Beginners / rapid start‑up for data analysis | Developers who need fine‑grained control over environments |
| Typical use case | Classroom teaching / personal learning | Production servers |
Jupyter
Jupyter is an open‑source interactive computing environment widely used in data science, machine learning, and scientific research. Its main components are Jupyter Notebook and JupyterLab. JupyterLab, the successor to Notebook, offers a modern, feature‑rich interface with multi‑document support, built‑in collaboration tools, and an extensible plugin system—making it a favorite among data scientists and researchers.
Common keyboard shortcuts
| Key combo | Action |
|---|---|
| Esc | Switch from edit mode to command mode |
| A | Insert a new cell above the current one |
| B | Insert a new cell below the current one |
| DD | Delete the current cell |
| M | Change cell type to Markdown |
| Y | Change cell type to Code |
| Ctrl + Enter | Run the current cell |
| Shift + Enter | Run the current cell and move to the next one (a new cell is created if it was the last) |
NumPy – Scientific Computing
Basic Introduction
NumPy is the foundational package for scientific computing in Python. It provides a multi‑dimensional array object, various derived objects (e.g., masked arrays and matrices), and a suite of functions for fast array operations—including mathematics, logic, shape manipulation, sorting, selection, I/O, discrete Fourier transforms, basic linear algebra, basic statistics, random simulation, and more.
Key features of NumPy:
- ndarray: a fast, memory‑efficient multi‑dimensional array supporting vectorized arithmetic and sophisticated broadcasting.
- Universal functions (ufuncs): perform element‑wise operations on whole arrays without explicit Python loops.
- File I/O & memory‑mapped files: tools for reading/writing data to disk.
- Linear algebra, random number generation, Fourier transforms.
- C/Fortran/C++ integration APIs for extending NumPy with compiled code.
Why do we need NumPy?
(Example output omitted)
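One common way to motivate NumPy is to compare an explicit Python loop with a single vectorized expression. The sketch below is an illustrative benchmark, not the original omitted example; the array size and repeat count are arbitrary, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Pure Python: every element is squared by the interpreter, one at a time.
def square_with_loop():
    return [x * x for x in py_list]

# NumPy: one vectorized expression; the loop runs in compiled code.
def square_with_numpy():
    return np_arr * np_arr

loop_seconds = timeit.timeit(square_with_loop, number=3)
numpy_seconds = timeit.timeit(square_with_numpy, number=3)
print(f"Python loop: {loop_seconds:.3f} s")
print(f"NumPy:       {numpy_seconds:.3f} s")
```

Both functions produce the same values; only the place where the loop runs differs.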
ndarray (N‑dimensional Array)
Core Features
- Multidimensionality: supports 0‑D (scalar), 1‑D (vector), 2‑D (matrix), and higher‑dimensional arrays.
- Homogeneity: all elements must share the same data type; mixed‑type input is up‑cast to a common type.
- Efficiency: stored in contiguous memory blocks, enabling fast vectorized operations.
Important Attributes
Core attributes (example arr = np.array([[1, 2], [3, 4]])):
| Attribute | Plain‑language meaning | Example | Output | Typical use |
|---|---|---|---|---|
| shape | Array dimensions (rows, columns, …) | arr.shape | (2, 2) | Inspect or reshape the array |
| ndim | Number of dimensions | arr.ndim | 2 | Distinguish vectors, matrices, tensors |
| size | Total number of elements | arr.size | 4 | Quick element count |
| dtype | Data type of elements | arr.dtype | int64 (or int32) | Ensure consistent calculations (e.g., avoid integer‑division pitfalls) |
- shape – like asking “what does the array look like?”
  - Example: arr = np.array([[1, 2], [3, 4]]) has shape (2, 2), meaning 2 rows × 2 columns.
  - Reshaping: arr.reshape(4, 1) returns the same four elements arranged as a 4‑row × 1‑column array.
- ndim – tells you the “dimensionality” of the array.
  - 1‑D (vector): ndim = 1, e.g., [1, 2, 3].
  - 2‑D (matrix): ndim = 2, e.g., tabular data.
  - 3‑D (tensor): ndim = 3, e.g., RGB image data.
- dtype – the single data type shared by all elements.
  - If any element is a float, the whole array becomes float64 to preserve precision.
  - You can force a type, e.g., np.array([1, 2], dtype=np.float32).
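The core attributes can be checked directly on the example array. A short sketch; the variable names `column`, `mixed`, and `small` are chosen only for this illustration.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

print(arr.shape)  # (2, 2)
print(arr.ndim)   # 2
print(arr.size)   # 4
print(arr.dtype)  # platform default integer, usually int64

# reshape keeps the 4 elements and arranges them as 4 rows x 1 column.
column = arr.reshape(4, 1)
print(column.shape)  # (4, 1)

# A single float in the input up-casts the whole array to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Forcing a dtype at creation time.
small = np.array([1, 2], dtype=np.float32)
print(small.dtype)  # float32
```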
Advanced Attributes (for reference)
| Attribute | Plain meaning | Example code | Output | Use case |
|---|---|---|---|---|
| T (transpose) | Swap rows and columns | arr.T | [[1, 3], [2, 4]] | Matrix algebra (e.g., multiplication) |
| itemsize | Bytes per element | arr.itemsize | 8 (int64 occupies 8 bytes) | Memory‑usage tuning |
| nbytes | Total memory consumption (size * itemsize) | arr.nbytes | 32 (4 elements × 8 bytes) | Monitoring large‑data workloads |
| flags | Memory layout info (e.g., C‑contiguous) | arr.flags | C_CONTIGUOUS : True etc. | Low‑level optimizations |
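A brief sketch of the advanced attributes on the same 2 × 2 array. The dtype is fixed to int64 here (an assumption made so that the byte counts match the table on every platform).

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int64)

transposed = arr.T
print(transposed.tolist())  # [[1, 3], [2, 4]]
print(arr.itemsize)         # 8 bytes per int64 element
print(arr.nbytes)           # 32 = 4 elements x 8 bytes

# flags describes the memory layout. The transpose reuses the same
# memory with swapped strides, so it is no longer C-contiguous.
print(arr.flags.c_contiguous)         # True
print(transposed.flags.c_contiguous)  # False
```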
Creation Methods
| Category | Method | Function | Example | Core action | Example output |
|---|---|---|---|---|---|
| Basic construction: small, hand‑crafted arrays or copies of existing data | From Python structures | np.array() | np.array([[1, 2], [3, 4]]) | Convert list/tuple to ndarray | array([[1, 2], [3, 4]]) |
| | Copy an array | np.copy() | np.copy(arr) | Create a deep copy (no shared memory) | (not shown) |
| Pre‑defined shape filling: quick initialization of fixed‑size arrays (all zeros, all ones, etc.) | All‑zero array | np.zeros() | np.zeros((2, 3)) | Fast zero‑initialization | [[0., 0., 0.], [0., 0., 0.]] |
| | All‑one array | np.ones() | np.ones((3, 2), dtype=int) | Fast one‑initialization | [[1, 1], [1, 1], [1, 1]] |
| | Uninitialized array | np.empty() | np.empty((2, 2)) | Allocate memory without initializing it (values are arbitrary) | (not shown) |
| | Fill with a constant | np.full() | np.full((2, 3), 5) | Populate with a given value | [[5, 5, 5], [5, 5, 5]] |
| Range‑based generation: numeric sequences (useful for time series, coordinate grids, etc.) | Arithmetic progression | np.arange() | np.arange(0, 10, 2) | Generate values with fixed step (exclusive of endpoint) | [0, 2, 4, 6, 8] |
| | Evenly spaced values (inclusive) | np.linspace() | np.linspace(0, 1, 5) | Generate a specified number of points | [0., 0.25, 0.5, 0.75, 1.] |
| | Log‑spaced values | np.logspace() | np.logspace(0, 2, 3, base=10) | Generate logarithmic progression | [1.0, 10.0, 100.0] |
| Special matrices: for linear‑algebra operations | Identity matrix | np.eye() | np.eye(3) | Create an identity matrix (1’s on diagonal) | [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]] |
| | Diagonal matrix | np.diag() | np.diag([1, 2, 3]) | Create a matrix with specified diagonal values | [[1, 0, 0], [0, 2, 0], [0, 0, 3]] |
| Random arrays: simulate experimental data, initialise neural‑network weights, etc. | Uniform random numbers | np.random.rand() | np.random.rand(2, 2) | Generate numbers in [0, 1) uniformly | e.g. [[0.43, 0.89], [0.21, 0.57]] |
| | Normal (Gaussian) random numbers | np.random.randn() | np.random.randn(2, 2) | Standard normal (mean 0, variance 1) | e.g. [[-0.5, 1.2], [0.3, -1.8]] |
| | Random integers | np.random.randint() | np.random.randint(1, 10, (2, 2)) | Generate integers in a given half‑open range | e.g. [[3, 7], [5, 2]] |
| Advanced constructors: handle unstructured data (files, strings) or generate complex arrays via functions | From strings | np.array() | np.array(['a', 'bc']) | Convert a list of strings to a string array | array(['a', 'bc'], dtype=…) |
Converting from Python data structures
The most direct way to obtain an ndarray is to feed a Python list, tuple, etc., into np.array.
np.array(object, dtype=None)
Notes
- When mixed types are present, NumPy up‑casts to the highest‑priority type (e.g., int + float → float, number + string → string).
- The nesting depth of a list determines ndim (the number of dimensions).
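Both notes can be verified with a few calls to np.array. A minimal sketch; the variable names are illustrative only.

```python
import numpy as np

# Up-casting: the result dtype is the "widest" type present in the input.
ints = np.array([1, 2, 3])        # all integers -> integer dtype
floats = np.array([1, 2.5, 3])    # int + float  -> float64
strings = np.array([1, "a"])      # number + string -> Unicode string dtype

print(ints.dtype.kind)     # 'i'
print(floats.dtype)        # float64
print(strings.dtype.kind)  # 'U'
print(strings.tolist())    # ['1', 'a']

# Nesting depth determines ndim; tuples work the same way as lists.
scalar = np.array(5)                    # no nesting  -> 0-D
vector = np.array((1, 2, 3))            # one level   -> 1-D
matrix = np.array([[1, 2], [3, 4]])     # two levels  -> 2-D
print(scalar.ndim, vector.ndim, matrix.ndim)  # 0 1 2

# The dtype argument overrides the inferred type.
forced = np.array([1, 2, 3], dtype=float)
print(forced.dtype)  # float64
```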
Pre‑defined Shape Filling
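Quickly create arrays of a fixed shape, often used as placeholders or to initialise weight matrices. For np.zeros, np.ones, and np.empty the default dtype is float64; specify dtype=int or another type if needed. np.full instead infers its dtype from fill_value unless dtype is given.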
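A short sketch of the default dtype and of overriding it. The shapes are arbitrary.

```python
import numpy as np

zeros = np.zeros((2, 3))               # float64 unless told otherwise
ones_int = np.ones((3, 2), dtype=int)  # request integers explicitly

print(zeros.dtype)        # float64
print(zeros.tolist())     # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(ones_int.tolist())  # [[1, 1], [1, 1], [1, 1]]
```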
Uninitialized Arrays (np.empty) – np.empty(shape, dtype=float)
Creates a new array with the given shape and dtype without initializing its entries (values are whatever happens to be in memory). Useful for performance‑critical code, but you must fill the array yourself; otherwise unpredictable values may cause bugs. Remember, np.empty does not guarantee zeros—it merely allocates memory.
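The allocate-then-fill pattern described above might look like this. A minimal sketch; the array length and the squares being stored are arbitrary choices.

```python
import numpy as np

n = 5
squares = np.empty(n, dtype=float)  # contents are arbitrary at this point

# Fill every slot before reading anything from the array.
for i in range(n):
    squares[i] = i ** 2

print(squares.tolist())  # [0.0, 1.0, 4.0, 9.0, 16.0]
```

If some slots might never be written, np.zeros is the safer choice.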
Repeating‑value Arrays (np.full, np.zeros_like, np.ones_like, np.empty_like) – np.full(shape, fill_value, dtype=None)
- np.full(): returns a new array of the given shape in which every element equals fill_value.
- zeros_like(): returns a new array of zeros with the same shape and dtype as the input.
- ones_like(): returns a new array of ones with the same shape and dtype as the input.
- empty_like(): returns an uninitialized array with the same shape and dtype as the input.
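A quick sketch of these functions. The `template` array and its int32 dtype are arbitrary; they only serve to show that shape and dtype are inherited.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

fives = np.full((2, 3), 5)              # constant fill with an explicit shape
zeros = np.zeros_like(template)         # same shape and dtype, all zeros
ones = np.ones_like(template)           # same shape and dtype, all ones
scratch = np.empty_like(template)       # same shape and dtype, arbitrary values

print(fives.tolist())            # [[5, 5, 5], [5, 5, 5]]
print(zeros.shape, zeros.dtype)  # (2, 3) int32
print(ones.tolist())             # [[1, 1, 1], [1, 1, 1]]
```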
Range‑based Generation
- Arithmetic sequence: np.arange(start, stop, step) returns a 1‑D array of evenly spaced values in the half‑open interval [start, stop); the stop value itself is excluded.
- Linspace (inclusive): np.linspace(start, stop, num=50) returns num evenly spaced points between start and stop, including the endpoint.
- Logspace: np.logspace(start, stop, num=50, base=10) returns num values spaced evenly on a log scale, from base**start to base**stop.
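The three generators side by side, using the same arguments as the creation table. A minimal sketch.

```python
import numpy as np

steps = np.arange(0, 10, 2)             # fixed step, stop (10) excluded
points = np.linspace(0, 1, 5)           # fixed count, stop (1) included
powers = np.logspace(0, 2, 3, base=10)  # 10**0, 10**1, 10**2

print(steps.tolist())   # [0, 2, 4, 6, 8]
print(points.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(powers)
```

As a rule of thumb, arange fits when the step matters and linspace when the number of points matters.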
Special Matrices – Basics
A matrix is a rectangular array of numbers, fundamental to linear algebra. It can represent data, systems of equations, or linear transformations.
What is a matrix?
A matrix consists of rows and columns, e.g.:
\[ \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \]
- Shape: 2 × 3 (2 rows, 3 columns).
- Entry: each individual number, e.g., \(a_{12}\) is the entry in row 1, column 2.
Uses of matrices
- Representing linear systems: \(A\mathbf{x} = \mathbf{b}\), where \(A\) is the coefficient matrix, \(\mathbf{x}\) the vector of unknowns, and \(\mathbf{b}\) the constants vector.
- Describing linear transformations: e.g., a 2 × 2 matrix can rotate points in the plane by an angle \(\theta\).
- Data representation (e.g., machine learning): rows = samples (e.g., images), columns = features (e.g., pixel values).
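The first two uses can be sketched in NumPy. The coefficients of the linear system and the 90° rotation angle are invented for this example; np.linalg.solve is used as one standard way to solve a square system.

```python
import numpy as np

# Linear system (coefficients made up for illustration):
#   2x +  y = 5
#    x + 3y = 10
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.linalg.solve(A, b)
print(x)  # approximately [1. 3.]

# Linear transformation: rotate the point (1, 0) by theta = 90 degrees.
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])
rotated = R @ np.array([1.0, 0.0])
print(np.round(rotated, 6))  # the point ends up at about (0, 1)
```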
Basic matrix operations
- Addition: element‑wise addition (requires identical shapes).
- Scalar multiplication: each element multiplied by a scalar.
- Matrix multiplication: each entry of the product is the dot product of a row of the first matrix with a column of the second (not element‑wise). Matrix multiplication is not commutative: in general \(AB \neq BA\).
- Transpose: swap rows and columns.
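The four operations map onto NumPy operators as sketched below. The matrices `A` and `B` are arbitrary examples.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

added = A + B        # element-wise addition (shapes must match)
scaled = 2 * A       # scalar multiplication
product_ab = A @ B   # matrix multiplication (rows dot columns)
product_ba = B @ A   # reversed order gives a different result here
elementwise = A * B  # NOT matrix multiplication: multiplies entry by entry
transposed = A.T     # swap rows and columns

print(added.tolist())       # [[1, 3], [4, 4]]
print(scaled.tolist())      # [[2, 4], [6, 8]]
print(product_ab.tolist())  # [[2, 1], [4, 3]]
print(product_ba.tolist())  # [[3, 4], [1, 2]]
print(elementwise.tolist()) # [[0, 2], [3, 0]]
print(transposed.tolist())  # [[1, 3], [2, 4]]
```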
Special Matrices
- Identity matrix (np.eye / np.identity)
- Diagonal matrix (np.diag)
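A brief sketch of both constructors. The matrix `A` is an arbitrary example used to show that multiplying by the identity leaves a matrix unchanged.

```python
import numpy as np

identity = np.eye(3)        # 3 x 3 identity, float64
same = np.identity(3)       # equivalent for square matrices
rectangular = np.eye(2, 3)  # np.eye also accepts a non-square shape

diagonal = np.diag([1, 2, 3])   # 1-D input -> matrix with that diagonal
extracted = np.diag(diagonal)   # 2-D input -> its diagonal as a 1-D array

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
unchanged = A @ identity        # the identity is neutral under multiplication

print(rectangular.tolist())  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(diagonal.tolist())     # [[1, 0, 0], [0, 2, 0], [0, 0, 3]]
print(extracted.tolist())    # [1, 2, 3]
```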
Random Arrays
- Uniform distribution: np.random.rand → array of the specified shape with values drawn uniformly from \([0, 1)\).
- Normal distribution: np.random.randn → array drawn from the standard normal distribution (mean 0, standard deviation 1).
- Random integers: np.random.randint → array of integers in the half‑open interval \([low, high)\).
- Custom range: np.random.uniform(low, high, size) → similar to rand, but draws from a user‑specified range.
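The four generators can be tried as follows. A minimal sketch: the seed, shapes, and ranges are arbitrary, and the actual values differ between seeds and NumPy versions, so none are shown.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same numbers

uniform01 = np.random.rand(2, 2)               # floats in [0, 1)
normal = np.random.randn(2, 2)                 # standard normal samples
integers = np.random.randint(1, 10, (2, 2))    # integers in [1, 10)
uniform_range = np.random.uniform(5, 10, (2, 2))  # floats in [5, 10)

print(uniform01)
print(normal)
print(integers)
print(uniform_range)
```

NumPy's documentation recommends the newer np.random.default_rng() generator for new code; the legacy functions are used here because they are the ones named above.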
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.