2. Prompt Engineering

Large language models (LLMs) are fundamentally prediction engines. They take a sequence of text as input and, based on their training data, predict the next most likely token. The model predicts one token at a time, appends it to the input sequence, and then predicts the next token. Each token prediction depends on the preceding context and the patterns learned during training.
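The predict-append-repeat loop described above can be sketched with a toy model. Everything here is illustrative: `TOY_MODEL` is a hand-written probability table standing in for a real network, and `predict_next` and `generate` are hypothetical names.

```python
# Toy stand-in for an LLM: a hand-written table of next-token probabilities.
# A real model computes this distribution with a neural network.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the most likely next token for the sequence so far.

    This toy only looks at the last token; an LLM conditions on all of them.
    """
    distribution = TOY_MODEL[tokens[-1]]
    return max(distribution, key=distribution.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Predict one token at a time and append it to the running sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the prediction becomes part of the input
    return tokens


print(generate(["the"]))  # ['the', 'cat', 'sat']
```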

When you write a prompt, you are essentially trying to steer the model toward producing the correct token sequence. The goal of prompt engineering is to craft high‑quality prompts that drive the model to generate accurate outputs. This usually involves iterative debugging, such as adjusting prompt length, language style, and structure to fit the task objective. In natural language processing (NLP) and LLM applications, a prompt is the input on which the model bases its response or prediction.

Prompts can be used for many tasks, including text summarization, information extraction, question answering, text classification, language or code translation, code generation, code documentation generation, reasoning, and more.

The first step in prompt engineering is selecting a model. A prompt tuned for one model (e.g., Gemini in Vertex AI, GPT, Claude, open‑source Gemma or LLaMA) may not work as well on another, so prompts usually need to be optimized for the specific model you use.

In addition to the prompt content, you also need to tune various model configuration parameters.

LLM Output Configuration

Once you have chosen a model, you need to understand its configuration options. Most LLMs expose a rich set of parameters that control how output is generated. Effective prompt engineering depends on setting these parameters sensibly.

Output Length

Output length is the number of tokens the model generates in its response. More tokens mean higher computational cost, slower response time, and higher expense.

Shortening the output length does not automatically make the model “more concise”; it simply stops generation once the length limit is reached. If you want short content, you must also guide the model via the prompt to finish early.
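A minimal sketch of why a length limit truncates rather than condenses. The function name `generate_with_limit` and the pre-tokenized response are assumptions made for illustration; a real API applies the cap inside the decoding loop.

```python
def generate_with_limit(full_response_tokens, max_output_tokens):
    """Simulate a hard output limit: emit tokens until the cap, then stop."""
    emitted = []
    for token in full_response_tokens:
        if len(emitted) >= max_output_tokens:
            break  # cut off mid-sentence; nothing is rephrased or summarized
        emitted.append(token)
    return emitted


response = "The capital of France is Paris , a city on the Seine .".split()
print(" ".join(generate_with_limit(response, 5)))  # The capital of France is
```

The answer itself ("Paris") is lost at a limit of 5, which is why the prompt also has to ask for a short reply.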

In some prompting techniques, limiting output length is especially important—for example, in the ReAct [9] method, where the model can keep generating redundant text after the desired answer.

In AI, traditional approaches treat reasoning and acting as two separate processes: reasoning focuses on thinking, while acting focuses on executing a solution. The ReAct method innovatively fuses reasoning and acting into a single, dynamic loop—think, act, observe, and adapt—mirroring how humans solve problems. ReAct stands for “Reasoning and Acting.”
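The think, act, observe loop can be sketched as follows. This is not the reference ReAct implementation: `scripted_model` is a hypothetical stand-in for an LLM, `calculator` is a toy tool, and the `Action: tool[argument]` text format is an assumption of this sketch. The `max_steps` cap plays the same role as an output limit, keeping the loop from running on after the answer is found.

```python
def calculator(expression):
    """Toy tool: evaluate a simple 'a op b' arithmetic expression."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]


def scripted_model(history):
    """Hypothetical stand-in for an LLM: picks the next step from the history."""
    observations = [entry for entry in history if entry.startswith("Observation:")]
    if not observations:
        return "Thought: I need to compute the total.\nAction: calculator[12 * 4]"
    result = observations[-1].split(": ")[1]
    return f"Thought: I have the result.\nFinal Answer: {result}"


def react_loop(question, model, tools, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):  # cap the loop so it cannot run forever
        step = model(history)  # think, and choose an action
        history.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        action = step.split("Action:")[1].strip()
        tool_name, argument = action.rstrip("]").split("[", 1)
        observation = tools[tool_name](argument)  # act
        history.append(f"Observation: {observation}")  # observe
    return None  # no answer within the step budget


print(react_loop("What is 12 times 4?", scripted_model, {"calculator": calculator}))
```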

Sampling Controls

LLMs do not predict a single deterministic token; they produce a probability distribution over all possible next tokens and then sample from it. Key sampling parameters include temperature, top‑K, and top‑P.

Temperature

Temperature controls the randomness of token selection. The allowed range depends on the model and API (commonly 0 to 1 or 0 to 2). Lower values make output more deterministic; higher values increase diversity and randomness. A temperature of 0 corresponds to greedy decoding, which always picks the highest‑probability token.

Note that when multiple tokens share the highest probability, even a temperature of 0 can yield different results depending on the implementation’s tie‑breaking strategy.

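Gemini's temperature control can be understood through the softmax function used in machine learning: in the standard formulation, the model's raw scores (logits) are divided by the temperature before softmax is applied. A low temperature sharpens the distribution toward the top token, giving more deterministic output; a high temperature flattens it, admitting more candidates, which is useful for creative tasks.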

📌 Softmax function – A widely used activation function that converts raw scores (logits) into a probability distribution. It exponentiates each input, normalizes by the sum of all exponentiated values, and yields values between 0 and 1 that sum to 1. This makes it ideal for multi‑class classification (e.g., image recognition, NLP, recommendation systems). After softmax, the output can be interpreted as the probability of each class (e.g., “tree,” “house,” or “car” in an image).
Cross‑entropy (also called log loss) measures the performance of a classification model whose output is a probability value between 0 and 1. Larger differences between predicted probability and true label produce higher loss; a perfect model has a loss of 0.
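To make the link between temperature and softmax concrete, here is a small sketch of the standard temperature-scaled softmax. The logits are made-up values, and the function name is hypothetical; how any particular model applies temperature internally is not stated in this article.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative raw scores for three tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share shrinking as the temperature rises, while the probabilities always sum to 1.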

Top‑K and Top‑P

Top‑K and Top‑P (also called nucleus sampling) limit the “candidate token pool” during sampling:

Top‑K: Sample from the K most probable tokens. Larger K yields richer output; smaller K yields more conservative, fact‑based output. K = 1 is equivalent to greedy decoding.
Top‑P: Sample from the smallest set of tokens whose cumulative probability exceeds P. P ranges from 0 (very deterministic) to 1 (consider all tokens). The smaller the P, the smaller the candidate pool. At the extreme, it collapses to “use only the highest‑probability token,” essentially greedy search.

Example: Suppose the model’s next‑token probabilities are:
A = 0.40, B = 0.30, C = 0.15, D = 0.10, E = 0.05.
Cumulative probabilities: A = 0.40, A+B = 0.70, A+B+C = 0.85, …
With Top‑P = 0.8, we stop once the cumulative sum reaches or exceeds 0.8. A+B only reaches 0.70, so C is also needed, and the candidate set is {A, B, C} (cumulative 0.85).
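The two filters can be sketched on the example distribution. Function names are hypothetical. The top‑P filter follows the definition given above (the smallest set whose cumulative probability reaches P), so with P = 0.8 it keeps A, B, and C; some implementations differ on whether the token that crosses the threshold is included.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 before sampling."""
    total = sum(probs.values())
    return {token: prob / total for token, prob in probs.items()}


probs = {"A": 0.40, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.05}
print(sorted(top_k_filter(probs, 2)))    # ['A', 'B']
print(sorted(top_p_filter(probs, 0.8)))  # ['A', 'B', 'C']
print(sorted(top_p_filter(probs, 0.0)))  # ['A']
```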

In practice, try either method (or both) to see which works better for your task.

Putting It All Together

Temperature, top‑K, top‑P, and the output token limit interact to determine the style and quality of the output. If a platform offers all three sampling controls (temperature, top‑K, and top‑P), think of them as gatekeepers:

Top‑K and Top‑P first filter the candidate pool.
Temperature then modulates the final selection within that pool.
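A sketch of the two-stage process just described: filter first, then let temperature shape the final draw. The ordering follows this article; real inference libraries may apply temperature to the logits before filtering. The function name and the probabilities are illustrative, and raising a probability to the power 1/T is used here as the probability-space equivalent of dividing logits by T.

```python
import random


def sample_next_token(probs, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Filter candidates with top-K and top-P, then sample using temperature."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Stage 1a: top-K keeps only the K most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Stage 1b: top-P keeps the smallest prefix whose cumulative mass reaches P.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Stage 2: temperature. Zero means greedy: take the top surviving token.
    if temperature == 0:
        return kept[0][0]
    tokens = [token for token, _ in kept]
    weights = [prob ** (1.0 / temperature) for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"A": 0.40, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.05}
print(sample_next_token(probs, temperature=0))             # always 'A'
print(sample_next_token(probs, temperature=5.0, top_k=1))  # always 'A'
print(sample_next_token(probs, temperature=0.9, top_k=3, top_p=0.95))
```

The first two calls show how an extreme setting on one control makes the others irrelevant.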

This is like crafting a cocktail: you want layers of flavor without the drink becoming chaotic. If the platform offers only one of top‑K or top‑P, the process is the same with a single filter. If temperature is not available, the next token is drawn at random from whatever passes the filters.

Extreme settings break the balance:

temperature = 0 → the model always picks the highest‑probability token (no nuance).
temperature ≫ 1 (generally into the 10s) → temperature stops mattering; the model samples almost uniformly from whatever tokens pass the top‑K and top‑P filters.
top‑K = 1 → only one token passes, rendering temperature and top‑P irrelevant.
top‑K equal to the entire vocabulary → every non‑zero‑probability token passes, nullifying the filter.
top‑P ≈ 0 → only the top token survives; top‑P → 1 includes almost every token.

Typical recommendations:

Desired style	temperature	top‑P	top‑K
Moderate creativity	0.2	0.95	30
High creativity	0.9	0.99	40
Low creativity	0.1	0.9	20
Precise tasks (e.g., math)	0	any	any

Be aware of the “repetition loop bug”: the model may get stuck repeating the same words or phrases. Low temperature can cause the model to cling to a high‑probability path that loops; high temperature can randomly bring the model back to a previous state, also causing loops. Balancing temperature, top‑K, and top‑P—much like a sound engineer adjusting frequencies—helps avoid endless repetition.

Summary

temperature, top‑K, top‑P, and token count influence each other. Parameter choices should align with your specific goal and an understanding of how the model combines them:

If temperature = 0, top‑K and top‑P are ignored.
If top‑K = 1, temperature and top‑P are ignored.
If top‑P ≈ 0, the model usually selects only the highest‑probability token, ignoring other settings.

Suggested defaults for general use:

Moderate creativity: temperature = 0.2, top‑P = 0.95, top‑K = 30
High creativity: temperature = 0.9, top‑P = 0.99, top‑K = 40
Low creativity: temperature = 0.1, top‑P = 0.9, top‑K = 20
Precise tasks (e.g., math): temperature = 0; top‑P and top‑K can stay at their defaults, since they have no effect

Prompting Techniques

Large language models (LLMs) are trained on massive datasets and fine‑tuned with instructions, giving them the ability to understand prompts and generate responses. However, LLMs are not perfect; output quality heavily depends on the clarity and precision of the input prompt. The clearer the prompt, the more accurately the model can predict the next text. Mastering a suite of prompting techniques—designed around how LLMs are trained and how they work—can dramatically increase the likelihood of obtaining relevant, high‑quality results.

Having grasped the basics of prompt engineering, let’s dive into the most important and commonly used techniques.

General Prompting / Zero‑Shot

Zero‑shot prompting is the most straightforward form: you provide only a description of the task, possibly with a brief starter text, but no concrete examples (hence “zero‑shot”). The input can be a question, the beginning of a story, or a set of instructions.

Use cases: tasks the model has already learned during pre‑training, relatively simple or common tasks.
Key point: the prompt must be clear and the task description precise.

The zero‑shot prompts for classifying movie reviews were recorded in a table in Vertex AI Studio, a useful habit for tracking and iterating on prompt work. For classification tasks, a low temperature helps produce stable, consistent predictions because creativity isn't needed. A review that pairs words like "disturbing" and "masterpiece" sends mixed signals, which makes it a good test of the model's judgment.
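A zero‑shot classification prompt might be assembled like this. The template wording, the sample review, and the request fields (`temperature`, `max_output_tokens`) are assumptions for illustration, not the exact prompt or API used by the author.

```python
# Task description only: no worked examples are included (zero-shot).
ZERO_SHOT_TEMPLATE = (
    "Classify the movie review as POSITIVE, NEUTRAL or NEGATIVE.\n"
    "Review: {review}\n"
    "Sentiment:"
)


def build_zero_shot_prompt(review):
    """Fill the review into the fixed task description."""
    return ZERO_SHOT_TEMPLATE.format(review=review.strip())


# Hypothetical request settings: low temperature for stable labels,
# and a small token limit because the answer is a single word.
request = {
    "prompt": build_zero_shot_prompt("A disturbing film, and a masterpiece."),
    "temperature": 0.1,
    "max_output_tokens": 5,
}
print(request["prompt"])
```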

When zero‑shot performance is insufficient or you need the model to follow a specific format or style, you move to prompting techniques that include examples.

One‑Shot & Few‑Shot Prompting

Providing examples in the prompt is an effective way to convey the intended task and output format.

One‑shot: give a single complete example (input + desired output). The model imitates this pattern for new, similar inputs.
Few‑shot: provide multiple examples (typically 3–5 or more, depending on task complexity and model capacity). Showing a series of input‑output pairs helps the model learn the pattern more robustly.

Key considerations:

Example quality: must be accurate and representative; even a tiny error can mislead the model.
Example diversity: cover different scenarios, especially edge cases, to improve robustness.
Number of examples: balance enough information with the model’s context‑length limits.
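A few‑shot prompt is the same task description with input and output pairs placed before the new input. The sketch below uses a hypothetical helper and made-up reviews; passing a single pair gives a one‑shot prompt.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and the new input to complete."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")  # blank line between examples
    parts.append(f"Input: {new_input}")
    parts.append("Output:")  # the model continues from here
    return "\n".join(parts)


# Made-up examples covering each label, including a lukewarm edge case.
examples = [
    ("I loved every minute of it.", "POSITIVE"),
    ("It was fine, nothing special.", "NEUTRAL"),
    ("A complete waste of two hours.", "NEGATIVE"),
]
prompt = build_few_shot_prompt(
    "Classify the movie review as POSITIVE, NEUTRAL or NEGATIVE.",
    examples,
    "Beautiful to look at, but the plot goes nowhere.",
)
print(prompt)
```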

The table below shows a few‑shot prompt that guides the model to generate or classify specific text types. (We continue using the same Gemini‑Pro configuration, only raising the token limit to accommodate longer responses.)

System, Contextual, and Role Prompting

These three prompt types all steer LLM behavior, but each focuses on a different aspect. They can also be combined.

System Prompt: sets global rules, core tasks, or overall constraints—think of it as installing an “operating system” for the model.
Role Prompt: assigns a specific identity or persona (e.g., “act as an experienced travel advisor”).
Contextual Prompt: supplies immediate, task‑specific background information that may change during an interaction.
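Because the three types can be combined, one way to keep them separate in code is to pass each as its own field and join them at the end. The function, the section labels, and the example text are all hypothetical; chat-style APIs often have a dedicated system field instead.

```python
def compose_prompt(task, system=None, role=None, context=None):
    """Join optional system, role, and context sections ahead of the task."""
    sections = []
    if system:
        sections.append(f"System: {system}")    # global rules and constraints
    if role:
        sections.append(f"Role: {role}")        # persona to adopt
    if context:
        sections.append(f"Context: {context}")  # background for this request
    sections.append(f"Task: {task}")
    return "\n\n".join(sections)


prompt = compose_prompt(
    task="Suggest three places to visit.",
    system="Answer in a numbered list and keep a respectful tone.",
    role="Act as an experienced travel advisor.",
    context="I am in Amsterdam for one day and only like museums.",
)
print(prompt)
```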

System Prompting

Purpose: define the model’s macro‑behavior, core function, or safety/style guidelines (e.g., “You are a translation assistant,” “Output must be valid JSON,” “Maintain a respectful tone”).
Benefit: helps the model produce output that conforms to a required structure or policy. Constraining the output format can also limit hallucinations, and explicit guidelines help keep responses safe and on tone.
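When a system prompt demands JSON, it is worth checking that the reply actually obeys. A minimal sketch, assuming a hypothetical schema with a single `sentiment` field; the system prompt wording is illustrative.

```python
import json

SYSTEM_PROMPT = (
    "Classify the movie review as POSITIVE, NEUTRAL or NEGATIVE. "
    'Return only valid JSON of the form {"sentiment": "<LABEL>"}.'
)
ALLOWED_LABELS = {"POSITIVE", "NEUTRAL", "NEGATIVE"}


def parse_sentiment(raw_response):
    """Return the label if the response obeys the schema, otherwise None."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # the model added prose or produced broken JSON
    if not isinstance(data, dict):
        return None
    label = data.get("sentiment")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None


print(parse_sentiment('{"sentiment": "NEGATIVE"}'))     # NEGATIVE
print(parse_sentiment("Sure! The review is negative."))  # None
```

A `None` result is a signal to retry or to tighten the system prompt.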

The table below demonstrates how a system prompt can force the model to output a specific structure (like a list or JSON) even when temperature is high.

Role Prompting

Purpose: give the model a persona (e.g., “pretend you are a seasoned travel consultant,” “simulate a kindergarten teacher”).
Benefit: the model’s responses adopt the knowledge base, tone, and style appropriate to that role. You can request styles such as confrontational, descriptive, direct, formal, humorous, influential, informal, inspirational, persuasive, etc.

The table below shows a travel‑advisor role with a humorous and inspirational tone. Switching the role (e.g., to a geography teacher) yields a completely different response.

Contextual Prompting

Purpose: provide background details, constraints, or nuances that are directly relevant to the current request.
Benefit: improves relevance and personalization, making the interaction smoother and more efficient.

The table below illustrates how adding context helps the model generate more precise answers.

Step‑Back Prompting

Core idea: before tackling a specific, complex question, first ask the LLM to consider a broader or more foundational question related to the problem. Then feed the model’s answer (or reasoning) back into the prompt for the concrete question.
Mechanism: this “step back” activates a wider knowledge base and reasoning pathways, encouraging deeper, more systematic thinking rather than jumping straight to details.
Benefits:
- Improves accuracy and depth for complex queries.
- Encourages critical thinking and creative knowledge application.
- May reduce bias from over‑focusing on narrow details.
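The two-stage flow can be sketched as a small function. `model` is any callable taking a prompt and a temperature; `fake_model` below is a hypothetical stub with canned replies so the example runs without an API. The temperature of 1.0 for the broad question follows the text; the 0.2 for the final question is an assumption of this sketch.

```python
def step_back(model, task, step_back_question):
    """Ask a broader question first, then feed its answer into the real task."""
    # Stage 1: the general question, with a high temperature for variety.
    background = model(step_back_question, temperature=1.0)
    # Stage 2: the concrete task, grounded in the stage 1 answer.
    final_prompt = f"Context:\n{background}\n\nTask: {task}"
    answer = model(final_prompt, temperature=0.2)
    return final_prompt, answer


calls = []


def fake_model(prompt, temperature):
    """Stub LLM: records each call and returns canned text."""
    calls.append((prompt, temperature))
    if len(calls) == 1:
        return "1. Abandoned military base\n2. Underwater research facility"
    return "(the model's storyline would appear here)"


final_prompt, answer = step_back(
    fake_model,
    task="Write a one-paragraph storyline for a new game level.",
    step_back_question="What settings make a game level challenging and engaging?",
)
print(final_prompt)
```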

The example below shows a naïve prompt that yields vague or generic answers. By first prompting the model to brainstorm related themes (with temperature = 1 for creativity) and then feeding those themes back into the original prompt, we obtain a more specific and interesting game‑plot idea.

Chain‑of‑Thought (CoT)

CoT was introduced by Wei et al. (2022) [13] and involves prompting the model to generate intermediate reasoning steps before delivering the final answer. It works well when combined with few‑shot prompting for tasks that require deliberation.

Core idea: explicitly ask the LLM to produce a series of reasoning steps, mimicking human problem‑solving.
Use cases: logical reasoning, calculations, multi‑step planning (e.g., math problems, complex Q&A, code generation planning).
Benefits:
- Accuracy: breaking down a problem reduces guesswork.
- Explainability: the “thought process” is visible, making it easier to spot errors.
- Robustness: CoT prompts tend to behave more consistently when you move between model versions than prompts without reasoning steps.
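A CoT prompt usually needs two pieces: an instruction to reason step by step, and a way to pull the final answer out of the reasoning. The sketch below is one possible shape; the `Answer:` line convention and both function names are assumptions, and `sample_response` is hand-written for the example, not real model output.

```python
import re


def build_cot_prompt(question):
    """Ask for visible reasoning steps followed by a clearly marked answer."""
    return (
        f"Q: {question}\n"
        "Think through the problem step by step, then give the result on a "
        "final line in the form 'Answer: <value>'.\n"
        "A:"
    )


def extract_answer(response):
    """Return the text after the last 'Answer:' line, or None if it is missing."""
    matches = re.findall(r"^Answer:\s*(.+)$", response, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


question = (
    "When I was 3 years old, my partner was 3 times my age. "
    "Now I am 20 years old. How old is my partner?"
)
# Hand-written illustration of what a step-by-step reply could look like.
sample_response = (
    "When I was 3, my partner was 3 * 3 = 9 years old.\n"
    "The age gap is 9 - 3 = 6 years.\n"
    "Now I am 20, so my partner is 20 + 6 = 26.\n"
    "Answer: 26"
)
print(build_cot_prompt(question))
print(extract_answer(sample_response))  # 26
```

Keeping the reasoning in the reply is what makes errors easy to spot: each intermediate line can be checked on its own.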

(Content continues with additional techniques, examples, and best‑practice tables.)

Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.