9. Natural Language Processing
Li Wei
Chapter 1: Introduction to NLP
Overview
Natural Language Processing (NLP) is a major branch of artificial intelligence. “Natural language” refers to the languages people use in daily life (e.g., Chinese, English). The goal of NLP is to enable computers to understand or use these languages.
Common Tasks
NLP comprises many typical tasks, which can be grouped into the following categories.
Text Classification
Assign a label or category to an entire piece of text.
Typical applications: sentiment analysis (detecting whether a review is positive or negative), spam detection, news topic classification, etc.
Sequence Labeling
Tag each word or character in a piece of text.
Typical applications: named‑entity recognition (identifying person names, place names, phone numbers, etc.).
Text Generation
Generate new natural‑language text based on existing content.
Typical applications: automatic writing, summarization, smart replies, dialogue systems, etc.
Information Extraction
Extract structured information from raw text.
Typical applications: given a passage and a question, extract the answer from the passage.
Text Transformation
Convert text from one form to another.
Typical applications: machine translation, summarization, etc.
Historical Evolution of Techniques
Rule‑Based Systems
From the 1950s to the early 1980s, NLP relied mainly on manually crafted linguistic rules written by linguists and programmers. Representative systems include early machine‑translation projects (e.g., the Georgetown‑IBM experiment) and the ELIZA chatbot. These systems performed well in narrow domains but lacked generality, scalability, and the ability to handle linguistic complexity.
Examples
Georgetown‑IBM experiment – Conducted in 1954 by Georgetown University and IBM, it demonstrated fully automatic translation of more than 60 Russian sentences into English.
ELIZA – Developed in 1966 by Joseph Weizenbaum, ELIZA simulated a psychotherapist and pioneered “conversation‑style” interaction, becoming one of the world’s first chatbots.
Statistical Methods
In the 1990s, increased computing power and larger corpora made statistical approaches mainstream. By modeling probabilities over massive text data, systems could “learn” language patterns. Typical methods include n‑gram models, Hidden Markov Models (HMM), and Maximum Entropy models. This era marked the shift from expert‑driven to data‑driven NLP.
Typical method: N‑gram model
An n‑gram model estimates the probability of a word given the preceding n−1 words. It was one of the earliest language‑modeling techniques in NLP.
- Bigram (2‑gram) assumes each word depends only on the previous word.
- Trigram (3‑gram) considers the two preceding words.
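The bigram case can be sketched with simple counting. This is a minimal illustration using maximum‑likelihood estimates without smoothing; the toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions made for the example, not part of the original text.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over already-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # assumed boundary markers
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    """P(word | prev) = count(prev, word) / count(prev); 0.0 if prev is unseen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

# Toy corpus, invented for illustration.
corpus = [
    ["i", "like", "nlp"],
    ["i", "like", "tea"],
    ["i", "study", "nlp"],
]
context_counts, bigram_counts = train_bigram(corpus)
print(bigram_prob("i", "like", context_counts, bigram_counts))    # 2 of 3 -> about 0.667
print(bigram_prob("like", "nlp", context_counts, bigram_counts))  # 1 of 2 -> 0.5
```

Any pair never seen in training gets probability 0 here, which is why practical n‑gram models usually add some form of smoothing.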
Machine‑Learning Era
In the 2000s, traditional machine‑learning algorithms such as logistic regression, Support Vector Machines (SVM), decision trees, and Conditional Random Fields (CRF) were applied to tasks like named‑entity recognition and text classification. Feature engineering became crucial; researchers manually designed many features to boost performance. Models grew more sophisticated and generalized better.
Example: Text classification with a bag‑of‑words model and logistic regression
The bag‑of‑words (BoW) model represents a document by word frequencies. It is simple but ignores word order. Consider two opposite reviews:
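- Review A: “服务很好但味道差劲” (“The service is great, but the food tastes terrible”)
- Review B: “味道很好但服务差劲” (“The food tastes great, but the service is terrible”)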
After tokenization:
- A → [“服务”, “很”, “好”, “但”, “味道”, “差劲”]
- B → [“味道”, “很”, “好”, “但”, “服务”, “差劲”]
Both produce identical BoW vectors, losing the crucial ordering information. To address this, researchers introduced n‑grams, which treat adjacent n words as a unit, preserving some sequence information. Using trigrams (3‑grams), the two reviews become:
- A → [“服务很好”, “很好但”, “好但味道”, “但味道差劲”]
- B → [“味道很好”, “很好但”, “好但服务”, “但服务差劲”]
Now their feature vectors differ.
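The comparison above can be checked in a few lines. This sketch takes the token lists from the example as given and uses `collections.Counter` as a stand‑in for a feature vector; the helper name `ngrams` is an assumption for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n adjacent tokens into one feature string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review_a = ["服务", "很", "好", "但", "味道", "差劲"]
review_b = ["味道", "很", "好", "但", "服务", "差劲"]

# Bag-of-words: only token frequencies, no order.
bow_a, bow_b = Counter(review_a), Counter(review_b)
# Trigram features: adjacent triples keep some local order.
tri_a, tri_b = Counter(ngrams(review_a, 3)), Counter(ngrams(review_b, 3))

print(bow_a == bow_b)       # True: the two reviews are indistinguishable
print(tri_a == tri_b)       # False: trigram features tell them apart
print(ngrams(review_a, 3))  # ['服务很好', '很好但', '好但味道', '但味道差劲']
```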
Deep‑Learning Era
Since the mid‑2010s, deep learning has surged in NLP. Neural architectures such as RNN, LSTM, and GRU replaced handcrafted features, automatically learning semantic representations from massive data. The introduction of the Transformer architecture dramatically improved language understanding and generation, leading to pre‑trained models (e.g., GPT, BERT) and transfer learning. NLP has become more universal and powerful.
- RNN (Recurrent Neural Network)
- LSTM (Long Short‑Term Memory)
- GRU (Gated Recurrent Unit)
- Transformer
Chapter 2: Text Representation
Overview
Text representation converts natural language into numerical forms that computers can process. It is the foundational step for almost all NLP tasks.
Early methods like the bag‑of‑words model encode an entire document as a single vector. These approaches are simple and computationally cheap but struggle to capture word order and contextual meaning. Modern NLP therefore adopts richer, more expressive representations to model linguistic structure and semantics more effectively.
The first step is usually tokenization and vocabulary construction, as illustrated below:
- Tokenization splits raw text into the smallest meaningful units (tokens).
- Vocabulary is the set of tokens the model knows, each assigned a unique ID with bidirectional mapping.
During training or inference, the model tokenizes the input, maps each token to its ID via the vocabulary, and feeds the IDs into an embedding layer, which converts them into low‑dimensional dense vectors (word embeddings).
In text‑generation tasks, the model’s output layer produces a probability distribution over the vocabulary for the next token. In the simplest (greedy) decoding scheme, the ID with the highest probability is selected, mapped back to its token through the vocabulary, and appended to the generated text.
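The pipeline described above (vocabulary, ID lookup, embedding lookup, and picking the most probable next token) can be sketched in plain Python. Everything here is a hypothetical illustration: the special tokens `<PAD>` and `<UNK>`, the function names, the random embedding table, and the made‑up probability vector are assumptions, and a real model would learn the embeddings and compute the probabilities.

```python
import random

def build_vocab(token_lists, specials=("<PAD>", "<UNK>")):
    """Assign each distinct token a unique ID and keep both mapping directions."""
    token_to_id = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in token_to_id:
                token_to_id[tok] = len(token_to_id)
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token

def encode(tokens, token_to_id):
    """Map tokens to IDs; unknown tokens fall back to <UNK>."""
    unk = token_to_id["<UNK>"]
    return [token_to_id.get(tok, unk) for tok in tokens]

def greedy_pick(probs, id_to_token):
    """Return the token whose ID has the highest probability."""
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return id_to_token[best_id]

token_to_id, id_to_token = build_vocab([["i", "like", "nlp"], ["i", "like", "tea"]])
ids = encode(["i", "like", "coffee"], token_to_id)
print(ids)  # [2, 3, 1]: "coffee" is not in the vocabulary, so it maps to <UNK>

# Stand-in for an embedding layer: one random 4-dimensional vector per ID.
random.seed(0)
embedding_table = [[random.uniform(-1, 1) for _ in range(4)] for _ in token_to_id]
vectors = [embedding_table[i] for i in ids]

# Made-up output distribution over the 6 vocabulary entries.
next_token_probs = [0.0, 0.05, 0.1, 0.15, 0.6, 0.1]
print(greedy_pick(next_token_probs, id_to_token))  # nlp
```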
Tokenization
Different languages have different tokenization strategies due to structural differences. This section covers common approaches for English and Chinese.
English Tokenization
Based on granularity, tokenization can be word‑level, character‑level, or subword‑level.
Word‑level
Splits text at spaces and punctuation, which is the most intuitive method. However, it suffers from OOV (Out‑Of‑Vocabulary) problems: new slang, proper nouns, compounds, or misspellings that are absent from the pre‑built vocabulary are replaced by a special token (e.g., <UNK>), causing loss of meaning.
Character‑level
Treats each character (letter, digit, punctuation, even spaces) as a token. The vocabulary is tiny and covers virtually everything, eliminating OOV issues. The downside is that single characters carry little semantic information, forcing the model to rely on longer contexts, which significantly raises modeling difficulty and training cost and leads to longer input sequences.
Subword‑level
Lies between word and character tokenization. Words are broken into smaller units: subwords such as roots, prefixes, suffixes, or frequent fragments. Subword tokenization mitigates the OOV problem and retains more semantic structure than pure characters. Even if a whole word is unseen, it can be represented by known subwords, avoiding replacement by <UNK>.
Common subword algorithms include BPE (Byte Pair Encoding), WordPiece, and Unigram Language Model. BPE is the earliest widely used method (explained later).
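The trade‑off between word‑level and character‑level tokenization can be seen on a single sentence. The sketch below is a simplified illustration: the regular expression, the tiny vocabulary, and the sample sentence are assumptions chosen for the example, not a production tokenizer.

```python
import re

def word_tokenize(text):
    """Split into runs of letters/digits/apostrophes, or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def char_tokenize(text):
    """Every character, including spaces, becomes a token."""
    return list(text)

def replace_oov(tokens, vocab, unk="<UNK>"):
    """Replace tokens missing from the vocabulary with the unknown marker."""
    return [tok if tok in vocab else unk for tok in tokens]

vocab = {"the", "model", "is", "great", "."}  # hypothetical pre-built vocabulary
text = "the model is gr8."

words = word_tokenize(text)
print(words)                                  # ['the', 'model', 'is', 'gr8', '.']
print(replace_oov(words, vocab))              # ['the', 'model', 'is', '<UNK>', '.']
print(len(words), len(char_tokenize(text)))   # 5 17
```

The word‑level output loses "gr8" entirely, while the character‑level output keeps every symbol at the cost of a sequence more than three times as long.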
Chinese Tokenization
Although Chinese differs greatly from English, we can still categorize its tokenization by granularity.
Character‑level
Splits text into individual Chinese characters, each treated as a token. Because characters often carry meaning, this approach is naturally viable and more “semantically friendly” than English character tokenization.
Word‑level
Segments Chinese text into complete words, aligning better with human reading habits. Since Chinese lacks spaces, word‑level tokenization relies on dictionaries, rules, or statistical models to locate word boundaries.
Subword‑level
Even though Chinese lacks explicit prefixes/suffixes, subword algorithms (e.g., BPE) can still be applied. They treat characters as base units and learn frequent character combinations (e.g., “自然”, “语言”, “处理”) to build a subword vocabulary, requiring no manual dictionary. Major Chinese large models (e.g., Tongyi Qianwen, DeepSeek) use subword tokenization.
Difference between “subword‑level” and “word‑level” tokenization in Chinese
- Unit: word‑level → complete words (自然语言处理 → 自然语言 / 处理); subword‑level → smaller fragments (自然 / 语言 / 处理, or even finer).
- Dictionary dependence: word‑level usually needs a dictionary, rules, or annotated models; subword‑level is data‑driven (e.g., BPE/WordPiece) and learns frequent fragments.
- OOV handling: word‑level may mis‑segment or miss new terms; subword‑level can split unseen words into known subwords, offering better robustness.
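To make the dictionary dependence of word‑level segmentation concrete, here is a sketch of forward maximum matching, one simple dictionary‑based strategy. It is named here only as an illustration and is not claimed to be the method used by any particular tool; the dictionary contents and the `max_len` parameter are assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

dictionary = {"自然语言", "处理", "自然", "语言"}  # hypothetical word list

print(forward_max_match("自然语言处理", dictionary))    # ['自然语言', '处理']
# "大模型" is not in the dictionary, so it falls apart into single characters.
print(forward_max_match("大模型处理语言", dictionary))  # ['大', '模', '型', '处理', '语言']
```

The second call shows the weakness noted in the list above: a term missing from the dictionary is not recognized as a unit, whereas a subword vocabulary learned from data could still cover it with known fragments.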
Tokenization Process
In NLP, a vocabulary can be built from words, characters, or subwords, and the choice of unit involves a trade‑off:
- Word‑level tokenization leads to a huge vocabulary and many OOVs.
- Character‑level yields a tiny vocabulary but very long sequences with overly fine granularity.
Subword tokenization strikes a balance: frequent whole words stay as single tokens, while rare words are broken into shorter pieces, keeping the vocabulary size manageable while still being able to represent virtually any new word.
BPE (Byte Pair Encoding) is a data‑driven algorithm that learns subword merge rules from a corpus. Originally designed for data compression, it was adapted for machine translation by Sennrich et al. (Subword NMT) and later became a common tokenizer foundation for models such as GPT‑2 and RoBERTa.
Core idea
- Training phase – Start by splitting every word in the corpus into individual characters, forming the initial vocabulary. Then repeatedly count adjacent symbol pairs, merge the most frequent pair into a new subword token, and add it to the vocabulary. Continue until the vocabulary reaches a predefined size.
- Tokenization phase – BPE applies the learned merge rules to new text: split the text into the smallest units (characters or bytes), then apply the merge operations in the order they were learned until no more merges are possible. The final output is a sequence of subwords.
Example
Corpus contains four words: low ×5, lower ×2, newest ×6, widest ×3. After adding the end‑of‑word marker (written here as </w>) and splitting:
- low → l o w </w>
- lower → l o w e r </w>
- newest → n e w e s t </w>
- widest → w i d e s t </w>
Vocabulary construction
Count adjacent pairs, weighted by word frequency. For example, (l, o) and (o, w) each appear 7 times; (w, e) appears 8 times; (e, s), (s, t), and (t, </w>) each appear 9 times. When ties occur, implementation‑specific rules decide; here we assume e + s → es is merged first. After replacement: newest → n e w es t </w>, widest → w i d es t </w>; the other two words remain unchanged. The vocabulary grows from the initial 11 types to 12 with the addition of es.
Re‑count on the new strings. The next most frequent pair is (es, t) (6 + 3 = 9, tied with (t, </w>)). Assume we merge es + t → est, yielding newest → n e w est </w>, widest → w i d est </w>. The vocabulary grows to 13 types.
Continue similarly; merges such as est + </w> and later l + o are performed, each time adding the new token to the vocabulary. Frequently occurring suffixes tend to be merged earlier, while rarer fragments are merged later.
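The walkthrough above can be reproduced with a short training loop. This is a minimal sketch rather than the implementation used by any specific tokenizer library: the `</w>` spelling of the end‑of‑word marker, the function names, and the tie‑breaking rule (the first pair encountered wins) are assumptions chosen so that the merges match the example.

```python
from collections import Counter

END = "</w>"  # assumed spelling of the end-of-word marker

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from word frequencies."""
    corpus = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties: first pair seen wins
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
        merges.append(best)
    return merges

def encode_word(word, merges):
    """Tokenize a word by applying the merges in the order they were learned."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(word_freqs, num_merges=4)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
print(encode_word("lowest", merges))  # ['lo', 'w', 'est</w>']
```

Note that "lowest" never appears in the corpus, yet it is still encoded from learned pieces instead of being mapped to an unknown token.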
Tokenization Tools
Overview
Chinese tokenization tools fall into two broad categories:
- Traditional dictionary/model‑based methods that segment at the word level. Examples include jieba, HanLP, etc., widely used in classic NLP pipelines.
- Subword‑model‑based methods (e.g., BPE) that learn frequent character combinations from data. Examples include Hugging Face Tokenizers, SentencePiece, and tiktoken, commonly employed in large pre‑trained language models.
jieba Tokenizer
Overview – jieba is a popular open‑source Chinese tokenizer known for its simple API, flexible modes, and extensible dictionaries, still valuable for many traditional NLP tasks.
Installation – pip install jieba
Tokenization Modes – jieba offers several modes to suit different needs.
- Precise mode (default) – Attempts the most accurate segmentation, suitable for text analysis. Use jieba.cut (generator) or jieba.lcut (list) to obtain results.
- Full mode – Scans the sentence for all possible word candidates. Use jieba.cut or jieba.lcut with the cut_all=True argument.
- Search engine mode – Further splits long words identified in precise mode, optimized for search indexing. Use jieba.cut_for_search or jieba.lcut_for_search.
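The three modes can be compared side by side with a small helper. This is a sketch that assumes jieba is installed; the wrapper name `segment_all_modes` and the sample sentence are illustrative choices, and the exact segmentation depends on the jieba version and dictionary in use, so no output is shown.

```python
def segment_all_modes(text):
    """Return jieba's list output for precise, full, and search-engine modes."""
    import jieba  # third-party package; imported here so the sketch loads without it

    return {
        "precise": jieba.lcut(text),                # default mode
        "full": jieba.lcut(text, cut_all=True),     # all candidate words
        "search": jieba.lcut_for_search(text),      # extra splits of long words
    }

# Example call (requires `pip install jieba`):
# for mode, tokens in segment_all_modes("我来到北京清华大学").items():
#     print(mode, tokens)
```

Full mode and search‑engine mode may return overlapping candidates, so only the precise‑mode tokens are expected to concatenate back into the original sentence.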
Custom Dictionary – jieba allows users to load a custom dictionary to recognize terms absent from the default lexicon, enhancing domain‑specific coverage.
The custom dictionary format is one entry per line, with up to three fields separated by spaces: the word, its frequency (optional; a higher frequency gives the word higher segmentation priority), and a part‑of‑speech tag (optional; it does not affect segmentation).
Load a dictionary with jieba.load_userdict(file_name); you can also modify it dynamically using jieba.add_word(word, freq=None, tag=None) and jieba.del_word(word).
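To make the file format concrete, the sketch below parses a few dictionary lines into (word, frequency, tag) tuples. The entries are hypothetical, and the parser is a simplified illustration of the format described above, not jieba's own loading code.

```python
def parse_userdict_line(line):
    """Parse 'word [freq] [tag]' into a (word, freq, tag) tuple, or None if blank."""
    parts = line.split()
    if not parts:
        return None
    word, freq, tag = parts[0], None, None
    for field in parts[1:]:
        if field.isdigit() and freq is None:
            freq = int(field)   # optional frequency
        else:
            tag = field         # optional part-of-speech tag
    return word, freq, tag

# Hypothetical dictionary content: word only needs to be present.
sample_userdict = """云计算 5 n
大语言模型 10
凯特琳 nz
"""

entries = [parse_userdict_line(line) for line in sample_userdict.splitlines()]
for entry in entries:
    print(entry)

# With jieba installed, a file holding these lines would be loaded with
# jieba.load_userdict(file_name), as described above.
```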
Word Representations
Overview
After tokenization, text becomes a sequence of tokens (words, subwords, or characters). Since these symbols are not directly computable, they must be transformed into numerical vectors—a process called word representation.
Representations have evolved from sparse one-hot encodings to dense semantic embeddings, and more recently to contextualized embeddings. Different methods vary greatly in expressive power, semantic modeling, and adaptability to context.
One‑Hot Encoding
The earliest representation maps each vocabulary item to a sparse vector whose length equals the vocabulary size; the position corresponding to the word is 1, all others are 0.
While simple and intuitive, one‑hot vectors cannot capture semantic relationships between words, and the dimensionality explodes as the vocabulary grows, making computation inefficient. Consequently, one-hot representations are rarely used directly in modern NLP tasks.
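A small sketch shows both properties: each vector is as long as the vocabulary, and any two different words have a dot product of zero, so no similarity is expressed. The four‑word vocabulary and the helper names are assumptions for illustration.

```python
def one_hot(token, token_to_id):
    """Vector of vocabulary length with a single 1 at the token's ID."""
    vec = [0] * len(token_to_id)
    vec[token_to_id[token]] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["cat", "dog", "car", "apple"]  # toy vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}

cat, dog, car = (one_hot(t, token_to_id) for t in ("cat", "dog", "car"))
print(cat)            # [1, 0, 0, 0]
print(dot(cat, dog))  # 0
print(dot(cat, car))  # 0: "cat" is no closer to "dog" than to "car"
```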
Semantic Word Embeddings
One-hot encodings cannot reflect semantic similarity. To address this, researchers introduced the Word2Vec model, which learns dense, meaning‑rich vectors from large corpora. In this continuous space, words with similar meanings lie close together.
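"Close together" is commonly measured with cosine similarity. The sketch below uses hand‑picked three‑dimensional vectors purely for illustration; they are not learned by Word2Vec, and real embeddings typically have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-picked toy vectors for illustration only; not trained embeddings.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(sim_cat_dog > sim_cat_car)  # True: "cat" sits closer to "dog" than to "car"
```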
Word2Vec Overview – (content truncated)
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.