5. Python Data Analysis – NumPy, Pandas, Matplotlib

Introduction to Data Analysis

Overview

Why learn data analysis?

Feature	Excel	Python (Pandas)
Data size you can handle	Comfortable up to ~10,000 rows (hard limit of 1,048,576 rows per sheet)	Over 1,000,000 rows
Manual vs. code-driven	Manual operation	Automated: rerun the script with one command
Learning difficulty	Simple (no programming required)	Moderate (requires basic programming)

Traditional method: processing data manually with Excel

Problem: when the dataset exceeds 10,000 rows Excel becomes sluggish; complex calculations require intricate formulas.
Example: ranking the grades of 1,000 students manually takes about 2 hours.

Python data analysis:

Advantage: automatically handles million‑row datasets; code is reusable.
Example: the same task completed with Pandas in 3 minutes.
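
As a rough illustration of the ranking task above, here is a minimal sketch in Pandas. The student names and scores are made-up sample data, and the column names `student`, `score`, and `rank` are assumptions chosen for this example; a real class list would typically be loaded from a file instead.

```python
import pandas as pd

# Hypothetical sample data; a real class list would be loaded from a CSV or Excel file.
grades = pd.DataFrame({
    "student": ["Ann", "Bob", "Cara", "Dan", "Eve"],
    "score": [88, 95, 72, 95, 60],
})

# ascending=False gives the highest score rank 1;
# method="min" lets tied students share the best available rank.
grades["rank"] = grades["score"].rank(ascending=False, method="min").astype(int)

# Sort by rank, breaking ties alphabetically so the output is stable.
ranked = grades.sort_values(["rank", "student"]).reset_index(drop=True)
print(ranked)
```

The same few lines work unchanged whether the table has five rows or a million, which is the reuse advantage described above.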

Full Workflow

Data collection
- Where does the data come from? Company databases | public datasets (e.g., government data) | manual web scraping
Data cleaning (the most important step!)
- Typical issues:
  - Missing values (empty cells in Excel)
  - Erroneous data (e.g., age entered as 200)
  - Inconsistent formats (dates written as “2023年1月1日” and “01/01/2023” together)
Data analysis
- Common techniques:
  - Statistics (mean, max, proportion)
  - Group comparisons (e.g., spending differences between male and female users)
Data visualization
- A picture is worth a thousand words: line charts (trends) | bar charts (comparisons) | scatter plots (correlations)
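
The cleaning and analysis steps above can be sketched as follows. The records are invented for illustration, the 0 to 120 age range is an assumed plausibility rule, and only two of the listed issues (missing values and erroneous ages) are handled; date normalization is left out to keep the sketch short.

```python
import pandas as pd

# Hypothetical raw records containing missing values and an impossible age.
raw = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age": [25, 200, 31, None, 40, 35],
    "spending": [120.0, 80.0, None, 60.0, 100.0, 90.0],
})

# Cleaning step 1: drop rows that contain any missing value.
clean = raw.dropna()

# Cleaning step 2: keep only ages inside an assumed plausible range.
clean = clean[clean["age"].between(0, 120)]

# Analysis: compare average spending between groups.
mean_spending = clean.groupby("gender")["spending"].mean()
print(mean_spending)
```

Dropping rows is the simplest policy; depending on the dataset, filling missing values (for example with `fillna`) may preserve more data.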

The Data‑Science Toolchain

The core trio:

Tool	Role	Analogy
NumPy	High‑performance numerical computing (matrices/vectors)	The “engine” of data
Pandas	Tabular data handling (like a sophisticated Excel)	The “scalpel” of data
Matplotlib	Data visualization (plotting library)	The “translator” of data

Typical workflow: NumPy processes numbers → Pandas organizes tables → Matplotlib creates visual output.
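
A compact sketch of that three-stage workflow is shown below. The monthly sales figures are invented sample data, and the `Agg` backend is selected only so the example also runs on machines without a display; in a notebook the chart would normally be shown inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: hold the raw numbers (hypothetical monthly sales).
months = np.arange(1, 7)
sales = np.array([10.0, 12.5, 11.0, 15.0, 17.5, 16.0])

# Pandas: organize the numbers as a table and derive a new column.
df = pd.DataFrame({"month": months, "sales": sales})
df["growth"] = df["sales"].diff()  # change from the previous month

# Matplotlib: draw a line chart of the trend.
fig, ax = plt.subplots()
ax.plot(df["month"], df["sales"], marker="o")
ax.set_xlabel("month")
ax.set_ylabel("sales")
ax.set_title("Monthly sales (sample data)")

# Save to an in-memory buffer; pass a file name to savefig to write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```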

Auxiliary Tools

Jupyter Notebook – an interactive programming environment that shows code and results side by side.
- Benefit: ideal for teaching and exploratory analysis; notebooks can mix text, code, and graphics.
Anaconda – a one‑click installer for the whole scientific‑computing stack.
- Includes: Python interpreter + common libraries + environment‑management tools.

Anaconda

What is Anaconda?

Official website: (link omitted)

In short, Anaconda = Python + package & environment manager (conda) + common libraries + integrated tools. It’s perfect for anyone who wants to spin up a data‑science or machine‑learning environment quickly. Think of Anaconda as the car and Python as the engine; installing Anaconda gives you a ready‑to‑drive vehicle without having to assemble the engine yourself.

The conda command manages packages, dependencies, and environments. Compared with the traditional pip tool, conda makes switching between environments easier and overall environment management simpler.

Why choose Anaconda?

Easy installation: Installing Anaconda is as simple as installing any other application; it comes pre‑loaded with many useful tools, so you don’t have to configure each one manually.
Package manager: Anaconda includes Conda, which can install, update, and manage packages for multiple languages, not just Python.
Environment manager: Create and maintain separate Python environments (e.g., Python 2 and Python 3) and switch between them at will—great for avoiding version conflicts across projects.
Bundled tools & libraries: Comes with essential data‑science packages such as NumPy, Pandas, Matplotlib, SciPy, Scikit‑learn, etc.
Jupyter Notebook: An interactive computing environment that supports live code, equations, visualizations, and narrative text.
Spyder IDE: A scientific‑computing‑oriented IDE with code editing, debugging, and data‑visualization features.
Cross‑platform: Runs on Windows, macOS, and Linux.
Community support: A large user community offers forums, tutorials, and troubleshooting help.

Core Advantages

200+ pre‑installed data‑science packages
Ready‑to‑use: No need to manually install NumPy, Pandas, etc.
Complete ecosystem: Includes tools for analysis, machine learning, and visualization.

Comparison: Anaconda vs. native Python + pip

Dimension	Anaconda	Native Python + pip
Installation difficulty	⭐️ One‑click install of everything	⭐️⭐️⭐️ Install each library manually
Dependency management	Conda resolves conflicts automatically	pip may encounter version‑compatibility issues
Disk space	⚠️ Larger (≈3 GB + base packages)	✅ Minimal (install only what you need, as low as tens of MB)
Ideal scenario	Beginners / rapid start‑up for data analysis	Developers who need fine‑grained control over environments
Typical use case	Classroom teaching / personal learning	Production servers

Jupyter

Jupyter is an open‑source interactive computing environment widely used in data science, machine learning, and scientific research. Its main components are Jupyter Notebook and JupyterLab. JupyterLab, the successor to Notebook, offers a modern, feature‑rich interface with multi‑document support, built‑in collaboration tools, and an extensible plugin system—making it a favorite among data scientists and researchers.

Common keyboard shortcuts

Key combo	Action
`Esc`	Switch from edit mode to command mode
`A`	Insert a new cell above the current one
`B`	Insert a new cell below the current one
`D`, `D`	Delete the current cell (press `D` twice in command mode)
`M`	Change cell type to Markdown
`Y`	Change cell type to Code
`Ctrl + Enter`	Run the current cell
`Shift + Enter`	Run the current cell and move to the next one (a new cell is created if the current cell is the last)

NumPy – Scientific Computing

Basic Introduction

NumPy is the foundational package for scientific computing in Python. It provides a multi‑dimensional array object, various derived objects (e.g., masked arrays and matrices), and a suite of functions for fast array operations—including mathematics, logic, shape manipulation, sorting, selection, I/O, discrete Fourier transforms, basic linear algebra, basic statistics, random simulation, and more.

Key features of NumPy:

ndarray: a fast, memory‑efficient multi‑dimensional array supporting vectorized arithmetic and sophisticated broadcasting.
Universal functions (ufuncs): perform element‑wise operations on whole arrays without explicit Python loops.
File I/O & memory‑mapped files: tools for reading/writing data to disk.
Linear algebra, random number generation, Fourier transforms.
C/Fortran/C++ integration APIs for extending NumPy with compiled code.

Why do we need NumPy?

(Example output omitted)
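
One common way to motivate NumPy is to compare a plain Python loop with the equivalent vectorized operation. The sketch below is an illustration written for this section, not the omitted original example; the array size is arbitrary and the measured times depend on the machine, so no timings are quoted.

```python
import time

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)  # explicit 64-bit type avoids overflow when squaring

# Pure Python: square every element with an explicit loop.
start = time.perf_counter()
squares_list = [x * x for x in py_list]
list_seconds = time.perf_counter() - start

# NumPy: one vectorized expression, with the loop running in compiled code.
start = time.perf_counter()
squares_array = np_array * np_array
array_seconds = time.perf_counter() - start

print(f"list comprehension: {list_seconds:.4f} s")
print(f"NumPy vectorized:   {array_seconds:.4f} s")
```

On typical hardware the vectorized version is considerably faster, and the gap tends to widen as the data grows.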

ndarray (N‑dimensional Array)

Core Features

Multidimensionality: supports 0‑D (scalar), 1‑D (vector), 2‑D (matrix), and higher‑dimensional arrays.
Homogeneity: all elements must share the same data type; mixed‑type input is up‑cast to a common type.
Efficiency: stored in contiguous memory blocks, enabling fast vectorized operations.

Important Attributes

Core attributes (example arr = np.array([[1, 2], [3, 4]])):

Attribute	Plain‑language meaning	Example	Output	Typical use
`shape`	Array dimensions (rows, columns, …)	`arr.shape`	`(2, 2)`	Inspect or reshape the array
`ndim`	Number of dimensions	`arr.ndim`	`2`	Distinguish vectors, matrices, tensors
`size`	Total number of elements	`arr.size`	`4`	Quick element count
`dtype`	Data type of elements	`arr.dtype`	`int64` (or `int32`)	Ensure consistent calculations (e.g., avoid integer‑division pitfalls)

shape – like asking “what does the array look like?”
- Example: arr = np.array([[1, 2], [3, 4]]) has shape (2, 2), meaning 2 rows × 2 columns.
- Reshaping: arr.reshape(4, 1) returns a 4-row × 1-column array containing the same four elements.
ndim – tells you the “dimensionality” of the space.
- 1‑D (vector): ndim = 1, e.g., [1, 2, 3].
- 2‑D (matrix): ndim = 2, e.g., tabular data.
- 3‑D (tensor): ndim = 3, e.g., RGB image data.
dtype – guarantees all elements share the same type.
- If any element is a float, the whole array becomes float64 to preserve precision.
- You can force a type, e.g., np.array([1, 2], dtype=np.float32).
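
The core attributes can be checked directly on the example array. This is a small sketch; the variable names `column`, `mixed`, and `forced` are introduced here only for illustration.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

print(arr.shape)  # (2, 2): 2 rows x 2 columns
print(arr.ndim)   # 2: a matrix
print(arr.size)   # 4: total number of elements
print(arr.dtype)  # an integer type; the exact width depends on the platform

# reshape returns an array with a new shape over the same elements.
column = arr.reshape(4, 1)

# One float in the input up-casts the whole array to float64.
mixed = np.array([1, 2.5])

# The dtype can also be forced explicitly.
forced = np.array([1, 2], dtype=np.float32)
```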

Advanced Attributes (for reference)

Attribute	Plain meaning	Example code	Output	Use case
`T` (transpose)	Swap rows and columns	`arr.T`	`[[1, 3], [2, 4]]`	Matrix algebra (e.g., multiplication)
`itemsize`	Bytes per element	`arr.itemsize`	`8` (int64 occupies 8 bytes)	Memory‑usage tuning
`nbytes`	Total memory consumption (`size * itemsize`)	`arr.nbytes`	`32` (4 elements × 8 bytes)	Monitoring large‑data workloads
`flags`	Memory layout info (e.g., C‑contiguous)	`arr.flags`	`C_CONTIGUOUS : True` etc.	Low‑level optimizations
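
The advanced attributes can be inspected the same way. In this sketch the dtype is fixed to `np.int64` so that `itemsize` and `nbytes` match the table on every platform; that explicit dtype is an assumption added for reproducibility.

```python
import numpy as np

# Fix the dtype so the byte counts below do not depend on the platform default.
arr = np.array([[1, 2], [3, 4]], dtype=np.int64)

transposed = arr.T            # rows and columns swapped
bytes_per_item = arr.itemsize  # 8 bytes for int64
total_bytes = arr.nbytes       # size * itemsize = 4 * 8

# flags describes the memory layout; a freshly created array is C-contiguous.
is_c_contiguous = arr.flags["C_CONTIGUOUS"]

print(transposed)
print(bytes_per_item, total_bytes, is_c_contiguous)
```

Note that `arr.T` is a view on the same memory rather than a copy, so modifying it also changes `arr`.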

Creation Methods

Category	Purpose	Function and example call	Core action	Example output
Basic construction – for small, hand-crafted arrays or copying existing data	From Python structures	`np.array()` `np.array([[1, 2], [3, 4]])`	Convert list/tuple to `ndarray`	`array([[1, 2], [3, 4]])`
	Copy an array	`np.copy()` `np.copy(arr)`	Create an independent copy (no shared memory)	-
Pre-defined shape filling – quick initialization of fixed-size arrays (all zeros, all ones, etc.)	All-zero array	`np.zeros()` `np.zeros((2, 3))`	Fast zero-initialization	`[[0., 0., 0.], [0., 0., 0.]]`
	All-one array	`np.ones()` `np.ones((3, 2))`	Fast one-initialization	`[[1., 1.], [1., 1.], [1., 1.]]`
	Uninitialized array	`np.empty()` `np.empty((2, 2))`	Allocate memory only (values are arbitrary)	-
	Fill with a constant	`np.full()` `np.full((2, 3), 5)`	Populate with a given value	`[[5, 5, 5], [5, 5, 5]]`
Range-based generation – create numeric sequences (useful for time series, coordinate grids, etc.)	Arithmetic progression	`np.arange()` `np.arange(0, 10, 2)`	Generate values with a fixed step (endpoint excluded)	`[0, 2, 4, 6, 8]`
	Evenly spaced values (endpoint included)	`np.linspace()` `np.linspace(0, 1, 5)`	Generate a specified number of points	`[0., 0.25, 0.5, 0.75, 1.]`
	Log-spaced values	`np.logspace()` `np.logspace(0, 2, 3, base=10)`	Generate a logarithmic progression	`[1.0, 10.0, 100.0]`
Special matrices – for linear-algebra operations	Identity matrix	`np.eye()` `np.eye(3)`	Create an identity matrix (1's on the diagonal)	`[[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]`
	Diagonal matrix	`np.diag()` `np.diag([1, 2, 3])`	Create a matrix with the specified diagonal values	`[[1, 0, 0], [0, 2, 0], [0, 0, 3]]`
Random arrays – simulate experimental data, initialise neural-network weights, etc.	Uniform random numbers	`np.random.rand()` `np.random.rand(2, 2)`	Generate numbers uniformly in [0, 1)	`[[0.43, 0.89], [0.21, 0.57]]` (values vary per run)
	Normal (Gaussian) random numbers	`np.random.randn()` `np.random.randn(2, 2)`	Standard normal (mean 0, variance 1)	`[[-0.5, 1.2], [0.3, -1.8]]` (values vary per run)
	Random integers	`np.random.randint()` `np.random.randint(1, 10, (2, 2))`	Generate integers in a given range (upper bound excluded)	`[[3, 7], [5, 2]]` (values vary per run)
Advanced constructors – handle unstructured data (files, strings) or generate complex arrays via functions	From strings	`np.array()` `np.array(['a', 'bc'])`	Convert a list of strings to a string array, e.g., `array(['a', 'bc'], dtype=…)`	-

Converting from Python data structures
The most direct way to obtain an ndarray is to feed a Python list, tuple, etc., into np.array.

np.array(object, dtype=None)

Notes

When mixed types are present, NumPy up‑casts to the highest‑priority type (e.g., int + float → float, number + string → string).
The nesting depth of a list determines ndim (the number of dimensions).
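
The two notes above can be verified with a short sketch. The variable names are illustrative only.

```python
import numpy as np

from_list = np.array([1, 2, 3])              # list -> 1-D array
from_tuple = np.array((1.5, 2.5))            # tuples work the same way
nested = np.array([[1, 2, 3], [4, 5, 6]])    # two levels of nesting -> 2-D array

# Mixed types are up-cast to a common type.
int_and_float = np.array([1, 2.0])   # int + float -> float64
num_and_str = np.array([1, "a"])     # number + string -> string

print(from_list.ndim, nested.ndim)
print(int_and_float.dtype, num_and_str.dtype)
```

Because the number in `num_and_str` is stored as the string `'1'`, arithmetic on that array no longer works; mixing numbers and strings is usually a sign that the data needs cleaning first.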

Pre‑defined Shape Filling

Quickly create arrays of a fixed shape, often used as placeholders or to initialise weight matrices. The default dtype is float64; specify dtype=int or another type if needed.

Uninitialized Arrays (`np.empty`) – `np.empty(shape, dtype=float)`

Creates a new array with the given shape and dtype without initializing its entries (values are whatever happens to be in memory). Useful for performance‑critical code, but you must fill the array yourself; otherwise unpredictable values may cause bugs. Remember, np.empty does not guarantee zeros—it merely allocates memory.
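
A minimal sketch of the shape-filling constructors follows, including the fill-it-yourself step that `np.empty` requires. The value 7.0 written into the scratch array is arbitrary.

```python
import numpy as np

zeros = np.zeros((2, 3))                # float64 by default
ones_int = np.ones((3, 2), dtype=int)   # request an integer dtype explicitly
fives = np.full((2, 3), 5)              # every element equals the fill value

# np.empty only allocates memory, so its initial contents are arbitrary.
scratch = np.empty((2, 2))
scratch[:] = 7.0  # overwrite every element before reading the array

print(zeros.dtype, ones_int.dtype)
print(fives)
print(scratch)
```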

Repeating-value Arrays (`np.full`, `np.zeros_like`, `np.ones_like`, `np.empty_like`) – `np.full(shape, fill_value, dtype)`

full(): returns a new array of the given shape in which every element equals fill_value.
zeros_like(): returns a new array of zeros with the same shape and dtype as the input.
ones_like(): returns a new array of ones with the same shape and dtype as the input.
empty_like(): returns an uninitialized array with the same shape and dtype as the input.
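
The `*_like` functions can be sketched as below. The `template` array and its `int32` dtype are assumptions picked to show that both shape and dtype are inherited.

```python
import numpy as np

# A template whose shape and dtype the new arrays will copy.
template = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

zeros = np.zeros_like(template)      # same shape and dtype, all zeros
ones = np.ones_like(template)        # same shape and dtype, all ones
scratch = np.empty_like(template)    # same shape and dtype, contents arbitrary

print(zeros)
print(ones)
print(scratch.shape, scratch.dtype)
```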

Range‑based Generation

Arithmetic sequence: np.arange(start, stop, step) returns a 1-D array of evenly spaced values in the half-open interval [start, stop); stop itself is excluded.
Linspace (inclusive): np.linspace(start, stop, num=50) returns a specified number of evenly spaced points between start and stop (including the endpoint).
Logspace: np.logspace(start, stop, num=50, base=10) returns numbers spaced evenly on a log scale.
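
The difference between the three generators is easiest to see side by side. This sketch reuses the argument values from the creation table.

```python
import numpy as np

# arange: fixed step, stop value excluded.
evens = np.arange(0, 10, 2)

# linspace: fixed number of points, stop value included.
quarters = np.linspace(0, 1, 5)

# logspace: exponents run from 0 to 2, so the values are 10**0, 10**1, 10**2.
powers = np.logspace(0, 2, 3, base=10)

print(evens)
print(quarters)
print(powers)
```

A rule of thumb: reach for `arange` when the step matters and for `linspace` when the number of points matters.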

Special Matrices – Basics

A matrix is a rectangular array of numbers, fundamental to linear algebra. It can represent data, systems of equations, or linear transformations.

What is a matrix?
A matrix consists of rows and columns, e.g.:
```
[ [a11, a12, a13],
  [a21, a22, a23] ]
```
- Shape: 2 × 3 (2 rows, 3 columns).
- Entry: each element \(a_{ij}\) sits in row \(i\) and column \(j\); for example, \(a_{12}\) is the entry in row 1, column 2.
Uses of matrices
- Representing linear systems:
  \[ A\mathbf{x} = \mathbf{b} \] where \(A\) is the coefficient matrix, \(\mathbf{x}\) the vector of unknowns, and \(\mathbf{b}\) the vector of constants.
- Describing linear transformations: e.g., a 2 × 2 matrix can rotate points in the plane by an angle \(\theta\).
- Data representation (e.g., machine learning): rows = samples (e.g., images), columns = features (e.g., pixel values).
Basic matrix operations
- Addition: element‑wise addition (requires identical shapes).
- Scalar multiplication: each element multiplied by a scalar.
- Matrix multiplication: each entry of the result is the dot product of a row of the first matrix with a column of the second (not element-wise). Note that matrix multiplication is not commutative (in general \(AB \neq BA\)).
- Transpose: swap rows and columns.
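
The operations and uses listed above can be tried out in NumPy. In this sketch the matrices `A` and `B`, the right-hand side `b`, and the 90-degree rotation are all arbitrary example values.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

added = A + B          # element-wise addition (shapes must match)
scaled = 2 * A         # scalar multiplication
product_ab = A @ B     # matrix multiplication
product_ba = B @ A     # a different result: multiplication is not commutative
elementwise = A * B    # element-wise product, not matrix multiplication
transposed = A.T       # rows and columns swapped

# Linear system A x = b, solved for x.
b = np.array([5, 11])
x = np.linalg.solve(A, b)

# Linear transformation: rotate the point (1, 0) by 90 degrees counter-clockwise.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
rotated = R @ np.array([1.0, 0.0])

print(product_ab)
print(product_ba)
print(x)
print(np.round(rotated, 6))
```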

Special Matrices

Identity matrix (np.eye / np.identity)
Diagonal matrix (np.diag)
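
A short sketch of both constructors follows. It also shows that `np.diag` works in the other direction, extracting the diagonal when given a 2-D array.

```python
import numpy as np

eye3 = np.eye(3)            # 3 x 3 identity matrix (float64)
ident3 = np.identity(3)     # same result for the square case

diag_matrix = np.diag([1, 2, 3])   # 1-D input -> matrix with that diagonal
diagonal = np.diag(diag_matrix)    # 2-D input -> the diagonal as a 1-D array

# Multiplying by the identity matrix leaves a matrix unchanged.
M = np.array([[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]])
unchanged = M @ eye3

print(eye3)
print(diag_matrix)
print(diagonal)
```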

Random Arrays

Uniform distribution: np.random.rand(d0, d1, ...) → array of the given shape with values drawn uniformly from [0, 1).
Normal distribution: np.random.randn(d0, d1, ...) → array of the given shape drawn from the standard normal distribution (mean 0, standard deviation 1).
Random integers: np.random.randint(low, high, size) → array of integers in the half-open interval [low, high).
Uniform over a custom range: np.random.uniform(low, high, size) → like rand, but with values drawn from [low, high).
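
The four generators can be sketched as follows. The seed value 42 and the range 5.0 to 10.0 are arbitrary choices; seeding is added only so that repeated runs produce the same numbers.

```python
import numpy as np

np.random.seed(42)  # fix the seed so repeated runs give the same numbers

uniform01 = np.random.rand(2, 2)                 # uniform on [0, 1)
normal = np.random.randn(2, 2)                   # standard normal
integers = np.random.randint(1, 10, (2, 2))      # integers from 1 to 9 (10 excluded)
custom = np.random.uniform(5.0, 10.0, (2, 2))    # uniform on [5, 10)

print(uniform01)
print(normal)
print(integers)
print(custom)
```

These are NumPy's legacy global-state functions, matching the ones named in this section. Recent NumPy versions also provide `np.random.default_rng()`, which returns a generator object with similar methods.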

Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.
