vllm.v1.kv_offload.spec
CanonicalKVCacheRef dataclass
A per-layer (or per-group-of-layers) reference that identifies a specific CanonicalKVCacheTensor by index and records the un-padded page size used by that layer.
Source code in vllm/v1/kv_offload/spec.py
CanonicalKVCacheTensor dataclass
A canonicalized KV cache tensor whose first dimension is num_blocks.
For attention backends where the raw tensor has num_blocks at a non-leading physical dimension (e.g. FlashAttention's (2, num_blocks, ...) layout), the tensor is split so that each resulting CanonicalKVCacheTensor starts with (num_blocks, ...).
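The split described above can be illustrated with a minimal sketch. It uses a flat `memoryview` as a stand-in for a raw KV cache tensor laid out as (2, num_blocks, page_bytes); the helper names `split_kv_major` and `block`, and the sizes, are hypothetical and are not part of vLLM.

```python
# Hypothetical sizes chosen for illustration only.
NUM_BLOCKS = 4
PAGE_BYTES = 8


def split_kv_major(raw: memoryview, num_blocks: int, page_bytes: int) -> list[memoryview]:
    """Split a flat (2, num_blocks, page_bytes) buffer into two zero-copy
    views, each laid out as (num_blocks, page_bytes)."""
    plane = num_blocks * page_bytes
    assert len(raw) == 2 * plane, "buffer does not match the assumed layout"
    return [raw[i * plane:(i + 1) * plane] for i in range(2)]


def block(view: memoryview, idx: int, page_bytes: int) -> memoryview:
    """Return the bytes of block `idx` from a view whose leading dimension is num_blocks."""
    return view[idx * page_bytes:(idx + 1) * page_bytes]


# Stand-in for the raw tensor: the first plane is followed by the second plane.
raw = memoryview(bytearray(2 * NUM_BLOCKS * PAGE_BYTES))
first_view, second_view = split_kv_major(raw, NUM_BLOCKS, PAGE_BYTES)

# Writing through a split view updates the underlying buffer (no copy was made).
block(second_view, 1, PAGE_BYTES)[:] = bytes([7]) * PAGE_BYTES
```

After the split, each view can be indexed by block number alone, which is the property the canonical form is meant to provide.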
CanonicalKVCaches dataclass
Canonicalized block-level representation of the KV caches.
Composed of:
- A list of unique KV cache data tensors, each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
- Per-group data references to those tensors, i.e. how each KV cache group maps to the tensors.
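The relationship between the tensor list and the per-group references can be sketched with plain dataclasses. Everything below is a stand-in: the class names, field names (`rows`, `tensor_idx`, `unpadded_page_bytes`, `group_refs`) and the helper `pages_for_block` are assumptions made for illustration and may not match the actual definitions in spec.py.

```python
from dataclasses import dataclass


@dataclass
class TensorStandIn:
    # rows[i] holds the bytes of block i; a stand-in for an int8 tensor
    # of shape (num_blocks, page_size_in_bytes).
    rows: list[bytearray]


@dataclass
class RefStandIn:
    tensor_idx: int            # index into CachesStandIn.tensors
    unpadded_page_bytes: int   # bytes of each row actually used by the layer


@dataclass
class CachesStandIn:
    tensors: list[TensorStandIn]
    group_refs: list[list[RefStandIn]]  # one list of references per KV cache group


def pages_for_block(caches: CachesStandIn, group_idx: int, block_idx: int) -> list[bytes]:
    """Collect the un-padded bytes of one block for every reference in a group."""
    pages = []
    for ref in caches.group_refs[group_idx]:
        row = caches.tensors[ref.tensor_idx].rows[block_idx]
        pages.append(bytes(row[:ref.unpadded_page_bytes]))
    return pages


NUM_BLOCKS, ROW_BYTES = 3, 16
tensors = [
    TensorStandIn([bytearray(ROW_BYTES) for _ in range(NUM_BLOCKS)])
    for _ in range(2)
]
caches = CachesStandIn(
    tensors=tensors,
    group_refs=[
        [RefStandIn(tensor_idx=0, unpadded_page_bytes=16)],  # group 0 uses the full row
        [RefStandIn(tensor_idx=1, unpadded_page_bytes=12)],  # group 1 rows carry 4 bytes of padding
    ],
)
pages = pages_for_block(caches, group_idx=1, block_idx=2)
```

Here the reference, not the tensor, carries the un-padded page size, so a consumer reads only the bytes a given layer actually uses.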
OffloadingSpec
Bases: ABC
Abstract specification for an offloading connector.
get_handlers abstractmethod
get_handlers(
    kv_caches: CanonicalKVCaches,
) -> Iterator[
    tuple[
        type[LoadStoreSpec],
        type[LoadStoreSpec],
        OffloadingHandler,
    ]
]
Get offloading handlers along with their respective src and dst types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| kv_caches | CanonicalKVCaches | Canonicalized KV caches. | required |
Yields:
| Type | Description |
|---|---|
| tuple[type[LoadStoreSpec], type[LoadStoreSpec], OffloadingHandler] | Tuples of (src_type, dst_type, offloading_handler). |
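One way a caller might consume the yielded (src_type, dst_type, offloading_handler) tuples is to index the handlers by transfer direction. The sketch below uses stand-in classes only; `GPUSpecStandIn`, `CPUSpecStandIn`, `HandlerStandIn` and `get_handlers_sketch` are hypothetical names, and the real vLLM types and registration flow may differ.

```python
from collections.abc import Iterator


class LoadStoreSpecStandIn:
    """Stand-in for a LoadStoreSpec-like base class."""


class GPUSpecStandIn(LoadStoreSpecStandIn):
    pass


class CPUSpecStandIn(LoadStoreSpecStandIn):
    pass


class HandlerStandIn:
    """Stand-in for an OffloadingHandler-like object."""

    def __init__(self, name: str) -> None:
        self.name = name


def get_handlers_sketch(
    kv_caches: object,
) -> Iterator[tuple[type[LoadStoreSpecStandIn], type[LoadStoreSpecStandIn], HandlerStandIn]]:
    # One handler per transfer direction; kv_caches is unused in this sketch.
    yield GPUSpecStandIn, CPUSpecStandIn, HandlerStandIn("gpu->cpu")
    yield CPUSpecStandIn, GPUSpecStandIn, HandlerStandIn("cpu->gpu")


# Index the handlers by (src_type, dst_type) so a transfer can look one up directly.
registry = {
    (src, dst): handler
    for src, dst, handler in get_handlers_sketch(kv_caches=None)
}
store_handler = registry[(GPUSpecStandIn, CPUSpecStandIn)]
```

Yielding the types alongside each handler lets the caller route a transfer by its source and destination without inspecting the handler itself.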
get_manager abstractmethod
get_manager() -> OffloadingManager
Get an OffloadingManager that will be used by the scheduler-side offloading connector to track offloaded blocks and manage evictions.
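To make "track offloaded blocks and manage evictions" concrete, here is a minimal sketch of a least-recently-used tracker. It is not the OffloadingManager interface: the class `LRUBlockTrackerSketch`, its methods, and the choice of LRU as the eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUBlockTrackerSketch:
    """Hypothetical tracker of offloaded block hashes with LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Keys are ordered from least to most recently used.
        self._blocks: OrderedDict[str, None] = OrderedDict()

    def contains(self, block_hash: str) -> bool:
        return block_hash in self._blocks

    def touch(self, block_hash: str) -> None:
        """Mark a tracked block as recently used."""
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)

    def store(self, block_hash: str) -> list[str]:
        """Track a block and return the hashes evicted to make room for it."""
        evicted: list[str] = []
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)
            return evicted
        while len(self._blocks) >= self.capacity:
            oldest, _ = self._blocks.popitem(last=False)
            evicted.append(oldest)
        self._blocks[block_hash] = None
        return evicted


tracker = LRUBlockTrackerSketch(capacity=2)
tracker.store("a")
tracker.store("b")
tracker.touch("a")             # "b" is now the least recently used block
evicted = tracker.store("c")   # storing a third block evicts "b"
```

A scheduler-side component could consult such a tracker to decide whether a block is available offloaded and which blocks to drop when space runs out.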