
5. Python 数据分析-Numpy、Pandas、Matplotlib


Li Wei

February 21, 2026 · 12 min read

5. Python Data Analysis – NumPy, Pandas, Matplotlib

Introduction to Data Analysis

Overview

Why learn data analysis?

Feature | Excel | Python (Pandas)
Data size you can handle | Comfortable up to ~10,000 rows | Over 1,000,000 rows (automation)
Manual vs. code-driven | Manual operation | One-click code execution
Learning difficulty | Simple | Simple (requires basic programming)

Traditional method: processing data manually with Excel

  • Problem: when the dataset exceeds 10,000 rows Excel becomes sluggish; complex calculations require intricate formulas.
  • Example: ranking the grades of 1,000 students manually takes about 2 hours.

Python data analysis:

  • Advantage: automatically handles million‑row datasets; code is reusable.
  • Example: the same task completed with Pandas in 3 minutes.
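
As a rough illustration of the Pandas side of that comparison, the sketch below ranks 1,000 students by score. The data is randomly generated and the column names (student_id, score, rank) are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 students with random integer scores from 0 to 100.
rng = np.random.default_rng(seed=0)
grades = pd.DataFrame({
    "student_id": np.arange(1, 1001),
    "score": rng.integers(0, 101, size=1000),
})

# Rank from highest to lowest score; tied scores share the best (lowest) rank.
grades["rank"] = grades["score"].rank(ascending=False, method="min").astype(int)

# Sort by rank (then by id for a stable order) and show the top five students.
ranked = grades.sort_values(["rank", "student_id"]).reset_index(drop=True)
print(ranked.head())
```

The same few lines work unchanged whether the table has a thousand rows or a million, which is the reuse advantage described above.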

Full Workflow

  • Data collection

    • Where does the data come from? Company databases | public datasets (e.g., government data) | manual web scraping
  • Data cleaning (the most important step!)

    • Typical issues:
      • Missing values (empty cells in Excel)
      • Erroneous data (e.g., age entered as 200)
      • Inconsistent formats (dates written as “2023年1月1日” and “01/01/2023” together)
  • Data analysis

    • Common techniques:
      • Statistics (mean, max, proportion)
      • Group comparisons (e.g., spending differences between male and female users)
  • Data visualization

    • A picture is worth a thousand words: line charts (trends) | bar charts (comparisons) | scatter plots (correlations)
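
The three cleaning issues listed above can be sketched on a tiny, made-up table. The column names, the 0 to 120 age limit, the median fill, and the month/day reading of "01/02/2023" are all assumptions for illustration, not rules from the original article.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical raw data that shows all three issues at once.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [25, np.nan, 200, 31],
    "signup": ["2023年1月1日", "01/02/2023", "2023年3月5日", "04/06/2023"],
})

def parse_date(text):
    """Try each known format in turn; return NaT if none of them matches."""
    for fmt in ("%Y年%m月%d日", "%m/%d/%Y"):  # month/day order is an assumption
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return pd.NaT

clean = raw.copy()
# 1. Erroneous data: treat ages outside 0-120 as missing.
clean["age"] = clean["age"].where(clean["age"].between(0, 120))
# 2. Missing values: fill the gaps with the median of the valid ages.
clean["age"] = clean["age"].fillna(clean["age"].median())
# 3. Inconsistent formats: convert every date string to a real datetime.
clean["signup"] = pd.to_datetime(clean["signup"].map(parse_date))

print(clean)
```

Whether to fill, drop, or flag bad values depends on the dataset; the median fill here is only one reasonable choice.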

The Data‑Science Toolchain

The core trio:

Tool | Role | Analogy
NumPy | High-performance numerical computing (matrices/vectors) | The "engine" of data
Pandas | Tabular data handling (like a sophisticated Excel) | The "scalpel" of data
Matplotlib | Data visualization (plotting library) | The "translator" of data

Typical workflow: NumPy processes numbers → Pandas organizes tables → Matplotlib creates visual output.
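
A minimal sketch of that three-step flow is shown here. The sales figures are randomly generated and the names (sales, month) are hypothetical; the Agg backend is selected only so the script also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: generate the numbers (hypothetical monthly sales figures).
rng = np.random.default_rng(seed=42)
sales = rng.integers(100, 200, size=6)

# Pandas: organise them into a labelled table and summarise.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales": sales,
})
mean_sales = df["sales"].mean()

# Matplotlib: turn the table into a line chart with the mean as a reference.
fig, ax = plt.subplots()
ax.plot(df["month"], df["sales"], marker="o")
ax.axhline(mean_sales, linestyle="--", label="mean")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.legend()

# Render the chart to an in-memory PNG; fig.savefig("sales.png") writes a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```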

Auxiliary Tools

  • Jupyter Notebook – an interactive programming environment that shows code and results side by side.

    • Benefit: ideal for teaching and exploratory analysis; notebooks can mix text, code, and graphics.
  • Anaconda – a one‑click installer for the whole scientific‑computing stack.

    • Includes: Python interpreter + common libraries + environment‑management tools.

Anaconda

What is Anaconda?

Official website: (link omitted)

In short, Anaconda = Python + package & environment manager (conda) + common libraries + integrated tools. It’s perfect for anyone who wants to spin up a data‑science or machine‑learning environment quickly. Think of Anaconda as the car and Python as the engine; installing Anaconda gives you a ready‑to‑drive vehicle without having to assemble the engine yourself.

The conda command manages packages, dependencies, and environments. pip, by contrast, only installs packages and leaves environment creation to separate tools such as venv, so conda makes creating and switching between environments simpler.

Why choose Anaconda?

  • Easy installation: Installing Anaconda is as simple as installing any other application; it comes pre‑loaded with many useful tools, so you don’t have to configure each one manually.
  • Package manager: Anaconda includes Conda, which can install, update, and manage packages for multiple languages, not just Python.
  • Environment manager: Create and maintain separate Python environments (e.g., Python 2 and Python 3) and switch between them at will—great for avoiding version conflicts across projects.
  • Bundled tools & libraries: Comes with essential data‑science packages such as NumPy, Pandas, Matplotlib, SciPy, Scikit‑learn, etc.
  • Jupyter Notebook: An interactive computing environment that supports live code, equations, visualizations, and narrative text.
  • Spyder IDE: A scientific‑computing‑oriented IDE with code editing, debugging, and data‑visualization features.
  • Cross‑platform: Runs on Windows, macOS, and Linux.
  • Community support: A large user community offers forums, tutorials, and troubleshooting help.

Core Advantages

  • 200+ pre‑installed data‑science packages
  • Ready‑to‑use: No need to manually install NumPy, Pandas, etc.
  • Complete ecosystem: Includes tools for analysis, machine learning, and visualization.

Comparison: Anaconda vs. native Python + pip

Dimension | Anaconda | Native Python + pip
Installation difficulty | ⭐️ One-click install of everything | ⭐️⭐️⭐️ Install each library manually
Dependency management | Conda resolves conflicts automatically | pip may encounter version-compatibility issues
Disk space | ⚠️ Larger (≈3 GB + base packages) | ✅ Minimal (install only what you need, as low as tens of MB)
Ideal scenario | Beginners / rapid start-up for data analysis | Developers who need fine-grained control over environments
Typical use case | Classroom teaching / personal learning | Production servers

Jupyter

Jupyter is an open‑source interactive computing environment widely used in data science, machine learning, and scientific research. Its main components are Jupyter Notebook and JupyterLab. JupyterLab, the successor to Notebook, offers a modern, feature‑rich interface with multi‑document support, built‑in collaboration tools, and an extensible plugin system—making it a favorite among data scientists and researchers.

Common keyboard shortcuts

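Key combo | Action
Esc | Switch from edit mode to command mode
A | Insert a new cell above the current one
B | Insert a new cell below the current one
D, D (press D twice) | Delete the current cell
M | Change cell type to Markdown
Y | Change cell type to Code
Ctrl + Enter | Run the current cell and stay on it
Shift + Enter | Run the current cell and move to the next one (a new cell is created if it was the last)

The single-letter shortcuts (A, B, D D, M, Y) only work in command mode, so press Esc first.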

NumPy – Scientific Computing

Basic Introduction

NumPy is the foundational package for scientific computing in Python. It provides a multi‑dimensional array object, various derived objects (e.g., masked arrays and matrices), and a suite of functions for fast array operations—including mathematics, logic, shape manipulation, sorting, selection, I/O, discrete Fourier transforms, basic linear algebra, basic statistics, random simulation, and more.

Key features of NumPy:

  • ndarray: a fast, memory‑efficient multi‑dimensional array supporting vectorized arithmetic and sophisticated broadcasting.
  • Universal functions (ufuncs): perform element‑wise operations on whole arrays without explicit Python loops.
  • File I/O & memory‑mapped files: tools for reading/writing data to disk.
  • Linear algebra, random number generation, Fourier transforms.
  • C/Fortran/C++ integration APIs for extending NumPy with compiled code.
Why do we need NumPy?

(Example output omitted)
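
One common way to motivate NumPy is to run the same calculation once as a pure-Python loop and once as a vectorized array expression. The sketch below is an illustration written for this purpose, not the article's omitted example; the function names are hypothetical and the timings depend on the machine, so none are quoted.

```python
import timeit

import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.int64)

def squares_loop():
    """Square every element with a pure-Python list comprehension."""
    return [v * v for v in values]

def squares_numpy():
    """Square every element with a single vectorized NumPy expression."""
    return arr * arr

# Both approaches compute exactly the same numbers.
same = np.array_equal(np.array(squares_loop()), squares_numpy())
print("same result:", same)

# Time three runs of each; the exact figures vary from machine to machine.
loop_seconds = timeit.timeit(squares_loop, number=3)
numpy_seconds = timeit.timeit(squares_numpy, number=3)
print(f"loop: {loop_seconds:.3f}s  numpy: {numpy_seconds:.3f}s")
```

On typical hardware the vectorized version finishes much sooner, because the loop runs in compiled code over a contiguous block of memory.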

ndarray (N‑dimensional Array)

Core Features
  • Multidimensionality: supports 0‑D (scalar), 1‑D (vector), 2‑D (matrix), and higher‑dimensional arrays.
  • Homogeneity: all elements must share the same data type; mixed‑type input is up‑cast to a common type.
  • Efficiency: stored in contiguous memory blocks, enabling fast vectorized operations.
Important Attributes

Core attributes (example arr = np.array([[1, 2], [3, 4]])):

Attribute | Plain-language meaning | Example | Output | Typical use
shape | Array dimensions (rows, columns, …) | arr.shape | (2, 2) | Inspect or reshape the array
ndim | Number of dimensions | arr.ndim | 2 | Distinguish vectors, matrices, tensors
size | Total number of elements | arr.size | 4 | Quick element count
dtype | Data type of elements | arr.dtype | int64 (or int32, depending on the platform) | Ensure consistent calculations (e.g., avoid integer-division pitfalls)

  • shape – like asking “what does the array look like?”

    • Example: arr = np.array([[1, 2], [3, 4]]) has shape (2, 2), meaning 2 rows × 2 columns.
    • Reshaping: arr.reshape(4, 1) returns the same four elements arranged as a 4-row × 1-column array.
  • ndim – tells you the “dimensionality” of the space.

    • 1‑D (vector): ndim = 1, e.g., [1, 2, 3].
    • 2‑D (matrix): ndim = 2, e.g., tabular data.
    • 3‑D (tensor): ndim = 3, e.g., RGB image data.
  • dtype – the single data type shared by every element of the array.

    • If any element is a float, the whole array becomes float64 to preserve precision.
    • You can force a type, e.g., np.array([1, 2], dtype=np.float32).
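
The core attributes can be checked directly in a notebook cell. This short sketch uses the same arr as the table above; the variable names column, mixed, and forced are only illustrative.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

print(arr.shape)  # (2, 2)
print(arr.ndim)   # 2
print(arr.size)   # 4
print(arr.dtype)  # int64 on most platforms, int32 on some

# reshape keeps the same four elements, arranged as 4 rows x 1 column.
column = arr.reshape(4, 1)
print(column.shape)  # (4, 1)

# A single float among the inputs up-casts the whole array to float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# An explicit dtype overrides the inferred one.
forced = np.array([1, 2], dtype=np.float32)
print(forced.dtype)  # float32
```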
Advanced Attributes (for reference)

Attribute | Plain meaning | Example code | Output | Use case
T (transpose) | Swap rows and columns | arr.T | [[1, 3], [2, 4]] | Matrix algebra (e.g., multiplication)
itemsize | Bytes per element | arr.itemsize | 8 (int64 occupies 8 bytes) | Memory-usage tuning
nbytes | Total memory consumption (size * itemsize) | arr.nbytes | 32 (4 elements × 8 bytes) | Monitoring large-data workloads
flags | Memory layout info (e.g., C-contiguous) | arr.flags | C_CONTIGUOUS : True etc. | Low-level optimizations
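
The advanced attributes can be inspected the same way. In this sketch the dtype is pinned to int64 (an assumption made here) so that the byte counts match the table on every platform.

```python
import numpy as np

# dtype is fixed to int64 so the byte counts do not depend on the platform.
arr = np.array([[1, 2], [3, 4]], dtype=np.int64)

print(arr.T)         # rows and columns swapped: [[1 3], [2 4]]
print(arr.itemsize)  # 8
print(arr.nbytes)    # 32
print(arr.nbytes == arr.size * arr.itemsize)  # True

# A freshly created array is laid out row by row (C order).
print(arr.flags["C_CONTIGUOUS"])    # True
# The transpose is a view with swapped strides, so it is no longer C-contiguous.
print(arr.T.flags["C_CONTIGUOUS"])  # False
```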
Creation Methods

Purpose | Function | Example | Core action | Example output

Basic construction: for small, hand-crafted arrays or copying existing data
From Python structures | np.array() | np.array([[1, 2], [3, 4]]) | Convert list/tuple to ndarray | array([[1, 2], [3, 4]])
Copy an array | np.copy() | np.copy(arr) | Create an independent copy (no shared memory) | (not shown)

Pre-defined shape filling: quick initialization of fixed-size arrays (all zeros, all ones, etc.)
All-zero array | np.zeros() | np.zeros((2, 3)) | Fast zero-initialization | [[0., 0., 0.], [0., 0., 0.]]
All-one array | np.ones() | np.ones((3, 2), dtype=int) | Fast one-initialization | [[1, 1], [1, 1], [1, 1]]
Uninitialized array | np.empty() | np.empty((2, 2)) | Allocate memory only (contents are arbitrary) | (not shown)
Fill with a constant | np.full() | np.full((2, 3), 5) | Populate with a given value | [[5, 5, 5], [5, 5, 5]]

Range-based generation: create numeric sequences (useful for time series, coordinate grids, etc.)
Arithmetic progression | np.arange() | np.arange(0, 10, 2) | Generate values with fixed step (exclusive of endpoint) | [0, 2, 4, 6, 8]
Evenly spaced values (inclusive) | np.linspace() | np.linspace(0, 1, 5) | Generate a specified number of points | [0., 0.25, 0.5, 0.75, 1.]
Log-spaced values | np.logspace() | np.logspace(0, 2, 3, base=10) | Generate logarithmic progression | [1.0, 10.0, 100.0]

Special matrices: for linear-algebra operations
Identity matrix | np.eye() | np.eye(3) | Create an identity matrix (1's on the diagonal) | [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]
Diagonal matrix | np.diag() | np.diag([1, 2, 3]) | Create a matrix with specified diagonal values | [[1, 0, 0], [0, 2, 0], [0, 0, 3]]

Random arrays: simulate experimental data, initialise neural-network weights, etc. (example outputs vary from run to run)
Uniform random numbers | np.random.rand() | np.random.rand(2, 2) | Generate numbers in [0, 1) uniformly | [[0.43, 0.89], [0.21, 0.57]]
Normal (Gaussian) random numbers | np.random.randn() | np.random.randn(2, 2) | Standard normal (mean 0, variance 1) | [[-0.5, 1.2], [0.3, -1.8]]
Random integers | np.random.randint() | np.random.randint(1, 10, (2, 2)) | Generate integers in [1, 10) | [[3, 7], [5, 2]]

Advanced constructors: handle unstructured data (files, strings) or generate complex arrays via functions
From strings | np.array() | np.array(['a', 'bc']) | Convert a list of strings to a fixed-width string array | array(['a', 'bc'], dtype=…)

Converting from Python data structures
The most direct way to obtain an ndarray is to feed a Python list, tuple, etc., into np.array.

np.array(object, dtype=None)

Notes

  • When mixed types are present, NumPy up‑casts to the highest‑priority type (e.g., int + float → float, number + string → string).
  • The nesting depth of a list determines ndim (the number of dimensions).
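
Both notes are easy to verify. The sketch below is a small illustration with made-up inputs: one array mixing a number with a string, then arrays of increasing nesting depth.

```python
import numpy as np

# number + string -> every element becomes a Unicode string.
numbers_and_text = np.array([1, "two"])
print(numbers_and_text)             # ['1' 'two']
print(numbers_and_text.dtype.kind)  # U (Unicode string)

# Each extra level of nesting adds one dimension.
print(np.array(5).ndim)                         # 0
print(np.array([1, 2, 3]).ndim)                 # 1
print(np.array([[1, 2], [3, 4]]).ndim)          # 2
print(np.array([[[1], [2]], [[3], [4]]]).ndim)  # 3

# Tuples are accepted in the same way as lists.
from_tuple = np.array((1, 2, 3))
print(from_tuple.shape)  # (3,)
```

Silent conversion to strings is a frequent source of bugs when loading messy data, so checking dtype right after creating an array is a cheap safeguard.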

Pre‑defined Shape Filling

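Quickly create arrays of a fixed shape, often used as placeholders or to initialise weight matrices. For np.zeros, np.ones, and np.empty the default dtype is float64; specify dtype=int or another type if needed. np.full infers its dtype from fill_value unless you pass dtype explicitly.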

Uninitialized Arrays (np.empty) – np.empty(shape, dtype=float)

Creates a new array with the given shape and dtype without initializing its entries (values are whatever happens to be in memory). Useful for performance‑critical code, but you must fill the array yourself; otherwise unpredictable values may cause bugs. Remember, np.empty does not guarantee zeros—it merely allocates memory.
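
A minimal sketch of the safe pattern for np.empty: allocate once, then write every element before reading any of them. The fill rule used here (row i holds multiples of i + 1) is arbitrary and only for illustration.

```python
import numpy as np

# Allocate a 3x4 float64 buffer; its contents are arbitrary until written.
result = np.empty((3, 4))

# Fill every row before reading anything back.
for i in range(result.shape[0]):
    result[i, :] = np.arange(4) * (i + 1)

print(result)
# [[0. 1. 2. 3.]
#  [0. 2. 4. 6.]
#  [0. 3. 6. 9.]]
```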

Repeating-value and Shape-matching Arrays – np.full(shape, fill_value, dtype=None)

  • full(): returns a new array of the given shape in which every element equals fill_value.
  • zeros_like(): returns a new array of zeros with the same shape and dtype as the input.
  • ones_like(): returns a new array of ones with the same shape and dtype as the input.
  • empty_like(): returns an uninitialized array with the same shape and dtype as the input.
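
A brief sketch of np.full alongside the *_like helpers. The template array and its int32 dtype are assumptions chosen to make the dtype inheritance visible.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

filled = np.full((2, 3), 5)        # every element is 5
zeros = np.zeros_like(template)    # same shape and dtype as template
ones = np.ones_like(template)
scratch = np.empty_like(template)  # contents are arbitrary until assigned

print(filled)                    # [[5 5 5], [5 5 5]]
print(zeros.shape, zeros.dtype)  # (2, 3) int32
print(ones)                      # [[1 1 1], [1 1 1]]

# Always fill an empty_like array before using it.
scratch[:] = template * 2
print(scratch)                   # [[2 4 6], [8 10 12]]
```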

Range‑based Generation

  • Arithmetic sequence: np.arange(start, stop, step) returns a 1-D array of values spaced step apart in the half-open interval [start, stop); stop itself is excluded.
  • Linspace (inclusive): np.linspace(start, stop, num=50) returns num evenly spaced points between start and stop, including the endpoint by default.
  • Logspace: np.logspace(start, stop, num=50, base=10) returns num points spaced evenly on a log scale, running from base**start to base**stop.
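
The three generators can be compared side by side. This sketch reuses the arguments from the creation-methods table and adds two extra checks that are not in the original: the endpoint=False option and the relationship between logspace and linspace.

```python
import numpy as np

steps = np.arange(0, 10, 2)            # 0, 2, 4, 6, 8 (10 is excluded)
points = np.linspace(0, 1, 5)          # 0, 0.25, 0.5, 0.75, 1 (1 is included)
decades = np.logspace(0, 2, 3, base=10)  # 1, 10, 100
print(steps, points, decades)

# endpoint=False makes linspace behave like arange: the stop value is left out.
open_points = np.linspace(0, 1, 5, endpoint=False)  # 0, 0.2, 0.4, 0.6, 0.8
print(open_points)

# logspace is the base raised to evenly spaced exponents.
print(np.allclose(decades, 10 ** np.linspace(0, 2, 3)))  # True
```

A practical rule of thumb: use arange when the step matters and linspace when the number of points matters, since a fractional step in arange can produce an unexpected element count due to floating-point rounding.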

Special Matrices – Basics

A matrix is a rectangular array of numbers, fundamental to linear algebra. It can represent data, systems of equations, or linear transformations.

  • What is a matrix?
    A matrix consists of rows and columns, e.g.:

    [ [a11, a12, a13],
      [a21, a22, a23] ]
    
    • Shape: 2 × 3 (2 rows, 3 columns).
    • Entry: each number in the matrix, e.g., a12 is the entry in row 1, column 2.
  • Uses of matrices

    • Representing linear systems:
      \[ A\mathbf{x} = \mathbf{b} \] where \(A\) is the coefficient matrix, \(\mathbf{x}\) the vector of unknowns, and \(\mathbf{b}\) the constants vector.
    • Describing linear transformations: e.g., a 2 × 2 matrix can rotate points in the plane by an angle \(\theta\).
    • Data representation (e.g., machine learning): rows = samples (e.g., images), columns = features (e.g., pixel values).
  • Basic matrix operations

    • Addition: element‑wise addition (requires identical shapes).
    • Scalar multiplication: each element multiplied by a scalar.
    • Matrix multiplication: each entry of the product is the dot product of a row of the first matrix with a column of the second (not element-wise). Matrix multiplication is not commutative: in general \(AB \neq BA\).
    • Transpose: swap rows and columns.
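
The uses and operations described above map onto NumPy as follows. The matrices A and B, the right-hand side b, and the 90-degree rotation angle are small made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

print(A + B)  # element-wise addition:  [[1 3], [4 4]]
print(2 * A)  # scalar multiplication:  [[2 4], [6 8]]
print(A @ B)  # matrix product:         [[2 1], [4 3]]
print(B @ A)  # different result:       [[3 4], [1 2]]
print(A * B)  # element-wise product, NOT matrix multiplication: [[0 2], [3 0]]
print(A.T)    # transpose:              [[1 3], [2 4]]

# Linear system Ax = b: x + 2y = 5 and 3x + 4y = 11 give x = 1, y = 2.
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(x)

# Linear transformation: rotate the point (1, 0) by theta = 90 degrees.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = R @ np.array([1.0, 0.0])
print(np.round(rotated, 6))  # approximately (0, 1)
```

Note the difference between @ (matrix product) and * (element-wise product); mixing them up is one of the most common mistakes when moving from linear-algebra notation to NumPy.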
Special Matrices
  • Identity matrix (np.eye / np.identity)
  • Diagonal matrix (np.diag)
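
A short sketch of the special-matrix constructors. The names I3, rect, D, and M are illustrative only.

```python
import numpy as np

# np.eye and np.identity both build a square identity matrix (float64 by default).
I3 = np.eye(3)
print(np.array_equal(I3, np.identity(3)))  # True

# np.eye also accepts a non-square shape and a diagonal offset k.
rect = np.eye(2, 3, k=1)
print(rect)
# [[0. 1. 0.]
#  [0. 0. 1.]]

# np.diag builds a diagonal matrix from a 1-D sequence ...
D = np.diag([1, 2, 3])
print(D)
# ... and extracts the diagonal when given a 2-D array.
print(np.diag(D))  # [1 2 3]

# Multiplying by the identity leaves a matrix unchanged.
M = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(np.array_equal(I3 @ M, M))  # True
```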

Random Arrays

  • Uniform distribution: np.random.rand(d0, d1, …) → array of the given shape with values drawn uniformly from [0, 1).
  • Normal distribution: np.random.randn(d0, d1, …) → array of the given shape drawn from the standard normal distribution (mean 0, std 1).
  • Random integers: np.random.randint(low, high, size) → array of integers in the half-open interval [low, high).
  • Custom range: np.random.uniform(low, high, size) → like rand, but drawn uniformly from a user-specified range [low, high).
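
The functions above can be tried with a fixed seed so that reruns give the same numbers. The seed value and the [-1, 1) range are arbitrary choices for this sketch; the last two lines show the newer Generator interface (np.random.default_rng), which the article does not cover but which NumPy provides for the same purpose.

```python
import numpy as np

np.random.seed(0)  # fix the seed so reruns produce the same numbers

uniform01 = np.random.rand(2, 2)                    # floats in [0, 1)
normal = np.random.randn(2, 2)                      # standard normal samples
integers = np.random.randint(1, 10, (2, 2))         # integers in [1, 10)
ranged = np.random.uniform(-1.0, 1.0, size=(2, 2))  # floats in [-1, 1)

print(uniform01, normal, integers, ranged, sep="\n\n")

# Generator-based alternative with its own independent seed.
rng = np.random.default_rng(seed=0)
modern = rng.uniform(-1.0, 1.0, size=(2, 2))
```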

Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.
