vllm.model_executor.model_loader.reload.layerwise ¶
_get_original_loader ¶
Return the weight loader with any layerwise wrappers removed.
Source code in vllm/model_executor/model_loader/reload/layerwise.py
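To illustrate what removing wrappers can look like, here is a minimal sketch in plain Python. It assumes each wrapper records the loader it wraps via `functools.wraps` (which sets `__wrapped__`); the names `wrap_loader`, `get_original_loader`, and `base_loader` are hypothetical, and the actual vLLM helper may track wrapped loaders differently.

```python
import functools


def wrap_loader(loader):
    """Hypothetical stand-in for a layerwise weight-loader wrapper."""

    @functools.wraps(loader)  # records the wrapped loader as wrapper.__wrapped__
    def wrapper(param, loaded_weight):
        return loader(param, loaded_weight)

    return wrapper


def get_original_loader(loader):
    """Follow __wrapped__ links until the innermost loader is reached."""
    while hasattr(loader, "__wrapped__"):
        loader = loader.__wrapped__
    return loader


def base_loader(param, loaded_weight):
    # Toy loader: "param" is a plain dict here, not a tensor.
    param["data"] = loaded_weight


twice_wrapped = wrap_loader(wrap_loader(base_loader))
original = get_original_loader(twice_wrapped)
```

In this sketch, unwrapping a loader that was never wrapped simply returns it unchanged, so the helper is safe to call on any loader.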
_layerwise_process ¶
_layerwise_process(layer: Module, info: LayerReloadingInfo)
Finalize layer loading after all weights have been buffered.
This function:
1. Materializes the layer onto the target device
2. Loads all buffered weights
3. Runs quantization processing if applicable
4. Copies processed values back to original tensor storage
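The four steps above can be sketched with a toy model that uses Python lists in place of tensors. Everything here is an assumption made for illustration: `ToyLayer`, `toy_layerwise_process`, the `(offset, values)` buffer format, and rounding as a stand-in for quantization are not part of vLLM. The point of the sketch is the last step: results are written into the existing storage in place, so references held elsewhere stay valid.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    # Stand-in for kernel tensor storage that other code already references.
    kernel_weight: list = field(default_factory=lambda: [0.0, 0.0, 0.0])


def toy_layerwise_process(layer, buffered):
    # 1. "Materialize": allocate fresh staging storage of the right size.
    staged = [0.0] * len(layer.kernel_weight)
    # 2. Load every buffered (offset, values) chunk into the staging storage.
    for offset, values in buffered:
        staged[offset:offset + len(values)] = values
    # 3. Processing step; rounding stands in for quantization here.
    processed = [float(round(v)) for v in staged]
    # 4. Copy back in place so the original storage object is preserved.
    layer.kernel_weight[:] = processed


layer = ToyLayer()
storage_before = layer.kernel_weight
toy_layerwise_process(layer, [(0, [1.2, 2.7]), (2, [3.4])])
```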
finalize_layerwise_processing ¶
finalize_layerwise_processing(
model: Module, model_config: ModelConfig
)
Apply processing to any layers which were not layerwise processed during loading. This includes attention layers and layers which have weight elements which are not loaded (due to padding).
This function should be called after initialize_layerwise_reload has been applied, in order to unwrap the layerwise weight loaders.
:param model: model to finalize processing for
:param model_config: config needed for applying processing to attention layers
get_layerwise_info ¶
get_layerwise_info(layer: Module) -> LayerReloadingInfo
Get information related to restoring and layerwise processing. If no previous information exists for the layer, a new entry is constructed.
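The get-or-create behaviour described here can be sketched as a small registry keyed by layer. This is an illustrative assumption, not the vLLM implementation: `ToyInfo`, `ToyLayer`, `_INFO`, and `get_info` are hypothetical names, and using a `weakref.WeakKeyDictionary` (so entries disappear when a layer is garbage collected) is one possible design choice among several.

```python
import weakref
from dataclasses import dataclass, field


@dataclass
class ToyInfo:
    # Hypothetical per-layer bookkeeping; the real LayerReloadingInfo differs.
    buffered_calls: list = field(default_factory=list)


class ToyLayer:
    """Stand-in for a module; plain instances hash by identity."""


# Registry mapping each layer to its info without keeping the layer alive.
_INFO = weakref.WeakKeyDictionary()


def get_info(layer):
    info = _INFO.get(layer)
    if info is None:
        # No previous entry for this layer: construct and store a new one.
        info = ToyInfo()
        _INFO[layer] = info
    return info


layer = ToyLayer()
first = get_info(layer)
second = get_info(layer)
```

Repeated calls for the same layer return the same entry, which is what lets separate wrapped loaders for one layer share buffered state.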
initialize_layerwise_reload ¶
initialize_layerwise_reload(model: Module)
Set up layerwise weight loading with deferred processing.
Must be called after record_metadata_for_reloading. This function:
1. Saves current kernel tensors for later copying
2. Restores layer parameters/buffers from metadata (on meta device)
3. Wraps weight loaders to defer processing until all weights are loaded

When all weights for a layer are loaded, the wrapped loaders will:
1. Materialize the layer onto the target device
2. Load all cached weights
3. Run quantization processing if applicable
4. Copy processed values back to original tensor storage
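The deferral described above can be sketched as a loader wrapper that only buffers incoming weights and fires a callback once the layer is complete. This is a simplified sketch under stated assumptions: `ToyState`, `make_deferred_loader`, and counting loaded elements against an expected total are illustrative choices, not a description of how vLLM detects completion.

```python
class ToyState:
    """Hypothetical per-layer state shared by all of the layer's loaders."""

    def __init__(self, total_numel):
        self.total_numel = total_numel  # elements expected for the whole layer
        self.loaded_numel = 0           # elements buffered so far
        self.buffered = []              # deferred (name, values) load calls
        self.processed = False


def make_deferred_loader(state, on_complete):
    def deferred_loader(name, values):
        # Buffer the call instead of writing into the parameter right away.
        state.buffered.append((name, list(values)))
        state.loaded_numel += len(values)
        # Once every element has arrived, process the layer exactly once.
        if state.loaded_numel >= state.total_numel and not state.processed:
            state.processed = True
            on_complete(state.buffered)

    return deferred_loader


completed = []
state = ToyState(total_numel=5)
loader = make_deferred_loader(state, completed.append)

loader("weight", [1, 2, 3])
pending_after_first = len(completed)
loader("bias", [4, 5])
```

A layer whose weights never all arrive (for example, because some elements are padding) would never trigger the callback in this sketch, which is why a separate finalization pass is useful.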
initialize_online_processing ¶
initialize_online_processing(layer: Module)
Wrap a layer's weight loaders with online processing loaders. Called by either initialize_layerwise_reload or an online quantization scheme. Guards against double wrapping when online quantization and reloading are used together.
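One way to guard against double wrapping is to tag each wrapper and skip loaders that already carry the tag. The sketch below assumes that approach; `make_online_wrapper`, `wrap_layer_loaders`, the `_is_online_wrapper` flag, and the dict of loaders are hypothetical and chosen only to show the idempotent behaviour.

```python
def make_online_wrapper(loader):
    def online_loader(param, loaded_weight):
        # A real wrapper would buffer or defer work; this one just forwards.
        return loader(param, loaded_weight)

    online_loader._is_online_wrapper = True  # tag used to detect wrappers
    online_loader._inner = loader            # keep a handle on the original
    return online_loader


def wrap_layer_loaders(loaders):
    """Wrap each loader in place, skipping any that are already wrapped."""
    for name, loader in list(loaders.items()):
        if getattr(loader, "_is_online_wrapper", False):
            continue
        loaders[name] = make_online_wrapper(loader)


def base_loader(param, loaded_weight):
    param["data"] = loaded_weight


loaders = {"weight": base_loader}
wrap_layer_loaders(loaders)   # e.g. triggered by an online quantization scheme
first_wrapper = loaders["weight"]
wrap_layer_loaders(loaders)   # e.g. triggered again by a reload setup
```

Because the second call finds the tag and leaves the loader alone, the two callers can run in either order without stacking wrappers.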
:param layer: layer whose parameter weight loaders will be wrapped
make_online_process_loader ¶
Create a wrapped weight loader that defers processing.
record_metadata_for_reloading ¶
record_metadata_for_reloading(model: Module)
Record layer metadata needed for later reloading.
Stores parameter and buffer metadata as meta tensors for restoration. Must be called before initialize_layerwise_reload.