KV Offloading Usage Guide

This guide covers configuration of the OffloadingConnector, which extends the prefix cache by offloading completed KV blocks to slower but larger tiers (CPU host memory, plus optional secondary tiers) as they are produced. Hits in the offload tiers are promoted back to GPU on demand. Transfers between GPU and CPU use DMA (cudaMemcpyAsync) and run asynchronously alongside model computation, so offloading adds minimal CPU- and GPU-core overhead.

Note

The OffloadingConnector currently supports CUDA, ROCm, and XPU only.

Overview

Two specs are available, selected by the spec_name key in kv_connector_extra_config:

CPUOffloadingSpec (default): single CPU tier. Completed GPU blocks are copied into pinned host memory.
TieringOffloadingSpec: multi-tier. A CPU primary tier plus one or more secondary tiers.

Only the CPU primary tier has direct GPU access. Secondary tiers cannot read from or write to GPU memory; all GPU↔secondary transfers are staged through the CPU primary tier.

flowchart LR
    GPU <--> CPU["CPU primary tier"]
    CPU <--> S0["Secondary tier 0"]
    CPU <--> S1["Secondary tier 1"]
    CPU <--> SN["..."]
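
The staging rule in the diagram can be sketched as a small path planner. This is a toy model rather than vLLM code: the tier labels ("gpu", "cpu", "secondary0") and the `transfer_path` helper are hypothetical names used only for illustration.

```python
def transfer_path(src: str, dst: str) -> list[str]:
    """Return the hops a block takes between two tiers (illustrative only).

    Assumption: the only links are the ones drawn in the diagram, i.e.
    every link has the CPU primary tier on one end.
    """
    if src == dst:
        return [src]
    if "cpu" in (src, dst):
        # GPU <-> CPU and CPU <-> secondary are direct links.
        return [src, dst]
    # Anything else (e.g. GPU <-> secondary) is staged through the CPU tier.
    return [src, "cpu", dst]


print(transfer_path("gpu", "secondary0"))  # store path to a secondary tier
print(transfer_path("secondary1", "gpu"))  # load path back to the GPU
```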

Single-Tier Setup (CPU Only)

vllm serve <model> \
  --kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "block_size": 64,
      "cpu_bytes_to_use": 1000000000
    }
  }'

Multi-Tier Setup

Set spec_name to "TieringOffloadingSpec" and supply a secondary_tiers list. Each entry is a dict with a required type key plus tier-specific fields. The list is ordered: tier 0 is consulted before tier 1, and so on. See Secondary Tiers for tier-specific keys.

vllm serve <model> \
  --kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "spec_name": "TieringOffloadingSpec",
      "cpu_bytes_to_use": 10737418240,
      "block_size": 16,
      "eviction_policy": "lru",
      "secondary_tiers": [
        {
          "type": "fs",
          "root_dir": "/mnt/kv_cache",
          "n_read_threads": 32,
          "n_write_threads": 16
        }
      ]
    }
  }'
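
Hand-writing nested JSON inside shell quotes is easy to get wrong, so it can help to generate the argument. The sketch below builds the same structure as the commands above with the standard library. The `build_kv_transfer_config` helper and its parameters are hypothetical, and it only emits keys that appear in this guide.

```python
import json
import shlex
from typing import Optional

GIB = 1024 ** 3  # bytes per GiB


def build_kv_transfer_config(cpu_gib: int, block_size: int,
                             fs_root: Optional[str] = None) -> str:
    """Return the JSON string for --kv-transfer-config (illustrative helper)."""
    extra = {
        "cpu_bytes_to_use": cpu_gib * GIB,  # total across all workers
        "block_size": block_size,
    }
    if fs_root is not None:
        # Adding a filesystem secondary tier requires the tiering spec.
        extra["spec_name"] = "TieringOffloadingSpec"
        extra["secondary_tiers"] = [{"type": "fs", "root_dir": fs_root}]
    config = {
        "kv_connector": "OffloadingConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": extra,
    }
    return json.dumps(config)


config_json = build_kv_transfer_config(cpu_gib=10, block_size=16,
                                       fs_root="/mnt/kv_cache")
# shlex.quote makes the JSON safe to paste into a POSIX shell.
print("vllm serve <model> --kv-transfer-config " + shlex.quote(config_json))
```

Here 10 GiB yields the same `cpu_bytes_to_use` value (10737418240) as the multi-tier example.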

`kv_connector_extra_config` Reference

Key	Required	Default	Scope	Notes
`spec_name`	no	`CPUOffloadingSpec`	both	Set to `TieringOffloadingSpec` for multi-tier.
`cpu_bytes_to_use`	yes	—	both	Total bytes of host memory reserved for the CPU tier across all workers (not per-worker).
`block_size`	no	GPU block size	both	Offloaded block size in tokens; must be a multiple of the GPU block size.
`eviction_policy`	no	`lru`	both	Primary tier policy: `lru` or `arc`.
`store_threshold`	no	`0`	single-tier	Min lookups before a block is offloaded. Values ≥ 2 are rejected by `TieringOffloadingSpec`.
`max_tracker_size`	no	`64000`	single-tier	Max entries in the lookup tracker.
`secondary_tiers`	no	`[]`	multi-tier	List of secondary tier configs (see below).
`offload_prompt_only`	no	`true`	both	If `true`, only prompt (prefill) blocks are offloaded; decode blocks are skipped.
`spec_module_path`	no	—	both	Python import path for a custom `OffloadingSpec` not in the built-in registry. Required only when `spec_name` is not built-in (advanced).
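
As a rough pre-flight check, the defaults and constraints in the table can be encoded in a few lines. This is a minimal sketch that mirrors the table only. The `validate_extra_config` function is hypothetical, and vLLM's own validation may be stricter or differ in detail.

```python
def validate_extra_config(cfg: dict, gpu_block_size: int) -> dict:
    """Fill in the table's defaults and check its stated constraints."""
    if "cpu_bytes_to_use" not in cfg:
        raise ValueError("cpu_bytes_to_use is required")
    resolved = {
        "spec_name": "CPUOffloadingSpec",
        "block_size": gpu_block_size,  # default: same as the GPU block size
        "eviction_policy": "lru",
        "store_threshold": 0,
        "max_tracker_size": 64000,
        "secondary_tiers": [],
        "offload_prompt_only": True,
    }
    resolved.update(cfg)
    if resolved["block_size"] % gpu_block_size != 0:
        raise ValueError("block_size must be a multiple of the GPU block size")
    if resolved["eviction_policy"] not in ("lru", "arc"):
        raise ValueError("eviction_policy must be 'lru' or 'arc'")
    if (resolved["spec_name"] == "TieringOffloadingSpec"
            and resolved["store_threshold"] >= 2):
        raise ValueError("store_threshold >= 2 is rejected by TieringOffloadingSpec")
    return resolved


resolved = validate_extra_config(
    {"cpu_bytes_to_use": 1_000_000_000, "block_size": 64}, gpu_block_size=16
)
print(resolved["spec_name"], resolved["eviction_policy"])
```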

Secondary Tiers

Each entry in secondary_tiers is a dict with a required type field plus tier-specific fields.

Filesystem (FS)

The filesystem tier (type: "fs") writes blocks to a directory on local storage.

Key	Required	Default	Notes
`type`	yes	—	Must be `fs`.
`root_dir`	yes	—	Base directory; vLLM creates subdirectories beneath it (see On-Disk Layout).
`n_read_threads`	no	`16`	Read-priority I/O threads (load path).
`n_write_threads`	no	`16`	Write-priority I/O threads (store path).

Each thread group prefers its own queue but pulls from the other when its own queue is empty, so a read-heavy or write-heavy burst is served by both groups instead of leaving one group idle. Size the combined thread count to your storage's effective concurrency.
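
The queue preference described above can be illustrated with two plain queues. This is a single-threaded sketch of the selection rule only, not the connector's actual scheduler. The `next_task` helper and the task labels are hypothetical.

```python
import queue


def next_task(primary: queue.Queue, fallback: queue.Queue):
    """Take from the preferred queue; use the other one only if it is empty."""
    for q in (primary, fallback):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None  # both queues are empty


reads: queue.Queue = queue.Queue()
writes: queue.Queue = queue.Queue()
reads.put("load-0")
for i in range(3):
    writes.put(f"store-{i}")

# A read-priority worker drains its own queue first, then helps with writes.
read_worker_order = [next_task(reads, writes) for _ in range(4)]
print(read_worker_order)
```

In a write-heavy burst like this one, the read-priority worker handles its single load and then picks up the pending stores.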

On-Disk Layout

Under root_dir, vLLM creates a subdirectory <model>_<digest>, where <model> is the model name with / replaced by _ (so HuggingFace IDs like meta-llama/Llama-3-8B don't nest), and <digest> is a short SHA256 prefix derived from the run configuration (model, block size, parallelism, dtype, etc.). Runs with the same configuration share the same subdirectory; runs with different configurations live side-by-side under the same root_dir without colliding.

Blocks are stored in per-rank sibling directories named <model>_<digest>_r<rank>. Inside each one, blocks are sharded across hash-prefix subdirectories to limit directory fan-out:

<root_dir>/
  <model>_<digest>/
    config.json
  <model>_<digest>_r<rank>/
    <hhh>/                    # first 3 hex chars of the block hash
      <hh>_g<group_idx>/      # next 2 hex chars + KV cache group index
        <hash_hex>.bin        # full block hash (in hex)

config.json records the run configuration (block size, number of KV groups, etc.) and is written on first start. Each rank writes blocks under its own _r<rank> sibling directory, so multiple ranks can safely share the same root_dir.
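
To make the layout concrete, the sketch below assembles the path of one block file from its parts. The `block_path` helper is hypothetical, and `digest` is taken as an opaque input because this guide does not specify how it is computed beyond being a short SHA256 prefix of the run configuration.

```python
from pathlib import PurePosixPath


def block_path(root_dir: str, model: str, digest: str, rank: int,
               group_idx: int, block_hash: bytes) -> PurePosixPath:
    """Build <root_dir>/<model>_<digest>_r<rank>/<hhh>/<hh>_g<group_idx>/<hash_hex>.bin."""
    hash_hex = block_hash.hex()
    run_dir = model.replace("/", "_") + "_" + digest  # "/" in model IDs becomes "_"
    return (
        PurePosixPath(root_dir)
        / f"{run_dir}_r{rank}"               # per-rank sibling directory
        / hash_hex[:3]                       # first 3 hex chars of the block hash
        / f"{hash_hex[3:5]}_g{group_idx}"    # next 2 hex chars + KV cache group index
        / f"{hash_hex}.bin"                  # full block hash in hex
    )


example = block_path(
    "/mnt/kv_cache", "meta-llama/Llama-3-8B", "abc123",  # "abc123" is a placeholder digest
    rank=0, group_idx=0, block_hash=bytes.fromhex("0123456789abcdef"),
)
print(example)
```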

To enable KV cache sharing between multiple vLLM instances using the same root_dir (e.g., via a shared PVC), the PYTHONHASHSEED environment variable must be set to the same fixed value (e.g., "0") on every instance. Without this, each process initializes NONE_HASH (the chain-hash seed for block content hashes) with random bytes, producing different block filenames for identical token content.

PYTHONHASHSEED=0 vllm serve ...
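
The effect of the seed on filenames can be illustrated with a toy chain hash. This is a simplified model under stated assumptions, not vLLM's hashing code. The `root_hash` and `block_hashes` functions, the use of SHA256, and the way tokens are serialized are all choices made for this illustration.

```python
import hashlib
import os


def root_hash(env: dict) -> bytes:
    """Toy stand-in for the chain-hash seed: fixed if PYTHONHASHSEED is set, else random."""
    seed = env.get("PYTHONHASHSEED")
    if seed is None:
        return os.urandom(32)  # differs in every process
    return hashlib.sha256(seed.encode()).digest()


def block_hashes(root: bytes, token_blocks: list) -> list:
    """Chain each block's hash onto its parent, starting from the root."""
    hashes, parent = [], root
    for tokens in token_blocks:
        parent = hashlib.sha256(parent + repr(tokens).encode()).digest()
        hashes.append(parent.hex())
    return hashes


blocks = [[1, 2, 3, 4], [5, 6, 7, 8]]
instance_a = block_hashes(root_hash({"PYTHONHASHSEED": "0"}), blocks)
instance_b = block_hashes(root_hash({"PYTHONHASHSEED": "0"}), blocks)
unseeded = block_hashes(root_hash({}), blocks)

print(instance_a == instance_b)  # same seed: identical hashes, so shared files
print(instance_a == unseeded)    # random root: every hash in the chain differs
```

Because each hash depends on its parent, a different root changes every filename in the chain, not just the first.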

Tuning Tips

cpu_bytes_to_use: a bigger CPU tier means fewer trips to slower secondary tiers and a higher hit rate. The value is total across all workers, not per-worker. Leave headroom for the rest of the host workload.
For single-tier (CPU-only) setups, set cpu_bytes_to_use larger than the aggregate GPU KV cache. Because offloading is immediate, a smaller CPU tier just mirrors what the GPU already holds and yields no additional cache hits.
block_size: larger offloaded blocks reduce per-block bookkeeping overhead but make lookups coarser, since hits are counted in whole offloaded blocks. Must be a multiple of the GPU block size.
FS thread counts: tune n_read_threads and n_write_threads to the parallelism your storage can sustain. Reads are latency-sensitive on the prefill path, so prefer more read threads when prefill hit rates are high.
Sharing root_dir across runs: runs with the same model, block_size, parallelism layout, and dtype share files under the same <digest> subdirectory. Changing any of these produces a new subdirectory; old ones are orphaned but harmless. Delete them to reclaim disk.
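
The sizing advice for cpu_bytes_to_use can be turned into a quick back-of-the-envelope check. This is a minimal sketch. The `plan_cpu_tier` helper and the example figures (256 GiB free, 64 GiB headroom, 4 workers with 40 GiB of GPU KV cache each) are made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB


def plan_cpu_tier(host_free_bytes: int, headroom_bytes: int,
                  gpu_kv_bytes_per_worker: int, num_workers: int) -> dict:
    """Pick a total cpu_bytes_to_use and compare it with the aggregate GPU KV cache."""
    budget = host_free_bytes - headroom_bytes
    if budget <= 0:
        raise ValueError("no host memory left after reserving headroom")
    aggregate_gpu_kv = gpu_kv_bytes_per_worker * num_workers
    return {
        "cpu_bytes_to_use": budget,  # total across all workers, not per-worker
        "aggregate_gpu_kv_bytes": aggregate_gpu_kv,
        # For CPU-only setups the tier should exceed the aggregate GPU KV cache.
        "exceeds_gpu_kv": budget > aggregate_gpu_kv,
    }


plan = plan_cpu_tier(
    host_free_bytes=256 * GIB,
    headroom_bytes=64 * GIB,
    gpu_kv_bytes_per_worker=40 * GIB,
    num_workers=4,
)
print(plan["cpu_bytes_to_use"] // GIB, "GiB CPU tier;",
      "exceeds GPU KV cache:", plan["exceeds_gpu_kv"])
```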