
LLM Inference Lab

In Progress

A long-term learning and experiment lab for LLM inference systems.

Python, PyTorch, vLLM, SGLang, LMCache, Docker, Prometheus, Grafana

#llm-inference #kv-cache #vllm #sglang #benchmark

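Background

LLM inference is shifting from simply calling a model to a systems engineering problem: GPU memory management, cache reuse, scheduling policy, throughput-latency trade-offs, and cost optimization.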

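Goals

Systematically study KV Cache, Prefix Cache, and KV offloading.
Reproduce key mechanisms from vLLM, SGLang, and LMCache.
Establish a repeatable benchmark methodology.
Publish technical notes, experiment reports, and engineering summaries.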
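
As a starting point for the Prefix Cache topic, the sketch below models block-level prefix reuse: a prompt is split into fixed-size token blocks, each block is keyed by a hash that chains in its parent's key, and a request reuses the longest run of leading blocks already cached. This is a toy model under stated assumptions. The `ToyPrefixCache` class, the block size of 4, and the SHA-256 chaining are all illustrative choices, not the implementation used by vLLM, SGLang, or LMCache.

```python
import hashlib
from typing import List, Set


class ToyPrefixCache:
    """Toy block-level prefix cache (illustrative only, not a real library's design)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.cached: Set[str] = set()

    def _block_keys(self, tokens: List[int]) -> List[str]:
        # Only full blocks are keyed; a trailing partial block is ignored.
        keys: List[str] = []
        parent = ""
        full = len(tokens) - len(tokens) % self.block_size
        for i in range(0, full, self.block_size):
            block = tokens[i:i + self.block_size]
            # Chaining the parent key makes each key depend on the whole prefix.
            payload = parent + "|" + ",".join(map(str, block))
            parent = hashlib.sha256(payload.encode()).hexdigest()
            keys.append(parent)
        return keys

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading prompt tokens hit the cache, then cache all full blocks."""
        keys = self._block_keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self.cached:
                break
            hit_blocks += 1
        self.cached.update(keys)
        return hit_blocks * self.block_size
```

With two requests that share an 8-token prefix, the first call reports 0 reused tokens and the second reports 8. The ratio of reused tokens to prompt tokens is one possible definition of hit rate for the experiments listed further down.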

Tech Stack

Python, PyTorch, vLLM, SGLang, LMCache, Docker, Prometheus, Grafana.

Current Progress

Learning topics and experiment directions have been decided.
Notes on prefill/decode, KV Cache memory, and the Prefix Cache benchmark are being organized.
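
For the KV Cache memory notes, a back-of-the-envelope estimate can be written in a few lines. The sketch assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. It ignores paging overhead, fragmentation, and quantized caches. The function name `kv_cache_bytes` and the example model shape are hypothetical.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes per element assumes fp16/bf16 storage
) -> int:
    """Estimate KV cache size in bytes for a dense attention stack."""
    # The leading factor of 2 accounts for storing both K and V.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


# Hypothetical model shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token / 1024**2, "MiB per token")  # 0.5 MiB per token
print(total / 1024**3, "GiB for 4096 tokens")  # 2.0 GiB for 4096 tokens
```

Models that use grouped-query attention have fewer KV heads than query heads, so `num_kv_heads` should be the KV head count, not the attention head count.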

Experiment Log

Experiment	Metric	Status
Prefill vs decode latency	TTFT / TPOT	Planned
KV Cache memory estimate	GPU memory	Planned
Prefix Cache hit rate	TTFT	Planned
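
The TTFT and TPOT metrics in the table can be derived from per-token arrival timestamps. The sketch below uses one common definition: TTFT is the delay from request start to the first output token, and TPOT is the average gap between subsequent output tokens. The function name `ttft_and_tpot` and its inputs are assumptions for illustration; a real harness would collect the timestamps from a streaming client.

```python
from typing import List, Optional, Tuple


def ttft_and_tpot(
    request_start: float, token_times: List[float]
) -> Tuple[float, Optional[float]]:
    """Compute (TTFT, TPOT) in seconds from token arrival timestamps."""
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        # TPOT is undefined when only one token was generated.
        return ttft, None
    # Average decode time per token, excluding the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Synthetic timestamps in seconds, not measured data.
print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25]))  # (0.5, 0.25)
```

Timestamps should come from a monotonic clock such as `time.perf_counter()` so that wall-clock adjustments do not distort the intervals.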

Related Notes

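Next Steps

Build a basic benchmark harness.
Record experiment conditions such as model, hardware, concurrency, and prompt distribution.
Turn experiment results into reproducible reports.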
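
One way to keep experiment conditions attached to every result is to serialize them alongside the measurements. The sketch below is a minimal illustration, assuming a flat record with the four conditions named above. The `ExperimentConditions` class, its field names, and the example values are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConditions:
    """Conditions that should accompany every benchmark result."""
    model: str
    hardware: str
    concurrency: int
    prompt_distribution: str


def to_json_line(conditions: ExperimentConditions) -> str:
    # sort_keys gives a stable field order, which keeps diffs between runs readable.
    return json.dumps(asdict(conditions), sort_keys=True)


# Placeholder values, not a real experiment.
example = ExperimentConditions(
    model="example-model",
    hardware="example-gpu",
    concurrency=8,
    prompt_distribution="fixed-512-tokens",
)
print(to_json_line(example))
```

Writing one JSON line per run makes it straightforward to append results to a log file and load them later for a report.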