Prefill vs Decode: The Two Phases of LLM Inference

Understand the difference between prefill and decode in LLM inference, and why prefill is the better fit for batching.

Jun 21, 2026 Updated Jun 21, 2026 1 min read

What is prefill

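Prefill is the phase in which the model processes the input prompt. The system reads all of the context tokens in a single pass, computes the keys and values that attention needs at every layer, and writes them to the KV Cache.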
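To make the cache-filling step concrete, here is a minimal sketch in plain Python. It assumes a single attention layer and a single head, and the names `matvec`, `prefill`, `w_k`, and `w_v` are hypothetical, chosen only for illustration. A real model repeats this for every layer and head and runs it as batched matrix multiplications on an accelerator.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (one row per output dimension) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def prefill(prompt: List[Vector], w_k: Matrix, w_v: Matrix) -> Tuple[List[Vector], List[Vector]]:
    """Project every prompt token to a key and a value in one pass.

    Assumption: one layer, one head. The returned lists play the role
    of the KV Cache for that layer.
    """
    keys = [matvec(w_k, x) for x in prompt]
    values = [matvec(w_v, x) for x in prompt]
    return keys, values


# Toy inputs: three prompt tokens with 2-dimensional hidden states.
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability
w_v = [[2.0, 0.0], [0.0, 2.0]]  # scales every component by 2

keys, values = prefill(prompt, w_k, w_v)
print(len(keys), len(values))  # one cached key and value per prompt token
```

Because every prompt token is known up front, none of these projections depends on a previously generated token, which is why the work can be done in one pass.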

Decode is the autoregressive generation phase. The model generates one new token per step, reusing the existing KV Cache and appending only the new token's key/value.
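A single decode step can be sketched as follows. This is an illustrative toy, not any framework's implementation: it assumes one layer and one head, and it assumes the query, key, and value for the new token (`q`, `k_new`, `v_new`, all hypothetical names) have already been produced by the model's projections.

```python
import math
from typing import List

Vector = List[float]


def decode_step(q: Vector, k_new: Vector, v_new: Vector,
                keys: List[Vector], values: List[Vector]) -> Vector:
    """Append the new token's key/value to the cache, then attend over the cache."""
    # Only the new token's key and value are computed and stored;
    # every earlier entry is reused as-is.
    keys.append(k_new)
    values.append(v_new)

    # Scaled dot-product scores between the new query and every cached key.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    # Weighted average of the cached values.
    dim = len(v_new)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


# Cache left behind by a two-token prompt.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# A zero query scores every key equally, so the output is the mean of the values.
out = decode_step([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], keys, values)
print(len(keys), out)
```

Note that the step touches every entry in `keys` and `values`, even though it adds only one.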

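Every decode step has to read the historical KV Cache. The longer the context, the more data is read, so memory bandwidth and scheduling policy have a more pronounced effect on TPOT (time per output token).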
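A rough back-of-the-envelope estimate shows how the read volume grows with context length. The sketch below assumes standard full attention over every cached token, with no sliding window, cache quantization, or sharing between sequences. The model shape in the example (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration picked for round numbers.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int) -> int:
    """Bytes of KV Cache stored for one token of one sequence."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_read_per_decode_step(context_len: int, bytes_per_token: int) -> int:
    """Bytes of KV Cache one decode step reads for one sequence.

    Assumption: the step attends over all context_len cached tokens.
    """
    return context_len * bytes_per_token


# Hypothetical model shape, 16-bit cache entries.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2
)
print(per_token)  # 524288 bytes, i.e. 512 KiB per token

for context_len in (1024, 4096):
    print(context_len, kv_bytes_read_per_decode_step(context_len, per_token))
# At 4096 tokens the step reads 2147483648 bytes (2 GiB) to produce one token.
```

The read volume is linear in context length while each step still emits a single token, which is why decode tends to be limited by memory bandwidth and not by arithmetic.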

Stage	Main pressure	Typical metric
Prefill	Compute throughput	TTFT
Decode	Memory bandwidth and scheduling	TPOT
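The two metrics in the table can be computed from per-token timestamps. The sketch below uses one common convention: TTFT is the delay from request arrival to the first output token, and TPOT is the average gap between consecutive output tokens after the first. Definitions vary between benchmarking tools, so treat the function names `ttft` and `tpot` and their exact formulas as assumptions.

```python
from typing import List


def ttft(request_time: float, token_times: List[float]) -> float:
    """Time to first token: dominated by prefill."""
    if not token_times:
        raise ValueError("no output tokens were recorded")
    return token_times[0] - request_time


def tpot(token_times: List[float]) -> float:
    """Time per output token: average gap between tokens during decode."""
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Timestamps in seconds: request sent at t=0, four tokens received.
request_time = 0.0
token_times = [0.5, 0.75, 1.0, 1.25]

print(ttft(request_time, token_times))  # 0.5
print(tpot(token_times))                # 0.25
```

Excluding the first token from TPOT keeps prefill time out of the decode measurement, so the two numbers track the two phases separately.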

def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Average throughput in tokens per second; elapsed_seconds must be non-zero."""
    return total_tokens / elapsed_seconds