vllm.model_executor.models.config
Classes:
- Gemma4Config
- HybridAttentionMambaModelConfig
- LlamaNemotronVLConfig – Config handler for LlamaNemotronVL embedding models.
- MambaModelConfig
- NemotronHForCausalLMConfig
- Qwen3_5ForConditionalGenerationConfig
Gemma4Config
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Configure attention for heterogeneous head dimensions.
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) staticmethod
Configure attention for heterogeneous head dimensions.
Gemma4 uses different head dimensions for sliding-window layers (head_dim) and full-attention layers (global_head_dim). The default FA3 backend on Hopper cannot handle head_dim > 256, which leads to mixed backend selection across layers and numerical divergence.
When FA4 is available, we force it for all layers, giving a uniform kernel path and avoiding the mixed FA3+FA4 penalty. When FA4 is not available, we fall back to Triton.
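The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not vLLM's implementation: the function names, the `fa4_available` flag, and the backend labels are hypothetical, and applying the override only when FA3 cannot cover both layer types is an assumption. The source states only the 256 limit and the FA4-then-Triton preference.

```python
FA3_MAX_HEAD_DIM = 256  # limit quoted in the docstring above


def fa3_covers_all_layers(head_dim: int, global_head_dim: int) -> bool:
    """True if both layer types fit within the FA3 head-dimension limit."""
    return max(head_dim, global_head_dim) <= FA3_MAX_HEAD_DIM


def pick_uniform_backend(fa4_available: bool) -> str:
    """Choose one backend for every layer: FA4 if present, otherwise Triton."""
    return "FA4" if fa4_available else "TRITON"


# Illustrative values only; not taken from a real Gemma4 checkpoint.
head_dim, global_head_dim = 256, 512

# Assumption: the override is only needed when FA3 cannot serve every layer.
if not fa3_covers_all_layers(head_dim, global_head_dim):
    backend = pick_uniform_backend(fa4_available=False)
    print(backend)
```

The point of returning a single label is that every layer shares one kernel path, which is what avoids the mixed-backend divergence.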
HybridAttentionMambaModelConfig
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Perform early validation and setup for hybrid attention/mamba models.
verify_and_update_config(vllm_config) classmethod
Perform early validation and setup for hybrid attention/mamba models.
Block size alignment with mamba page sizes is handled later by Platform.update_block_size_for_backend(), which runs after model layers are constructed and the attention backend is known.
Parameters:
- vllm_config (VllmConfig) – vLLM config
LlamaNemotronVLConfig
Bases: VerifyAndUpdateConfig
Config handler for LlamaNemotronVL embedding models.
MambaModelConfig
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Enable FULL_AND_PIECEWISE CUDA graph mode by default (required to get good performance for mamba layers in V1).
verify_and_update_config(vllm_config) classmethod
Enable FULL_AND_PIECEWISE CUDA graph mode by default (required to get good performance for mamba layers in V1).
Parameters:
- vllm_config (VllmConfig) – vLLM config
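A minimal sketch of what "by default" could mean here, assuming the mode is only filled in when the user has not chosen one. The enum below is a local stand-in defined for this example, and `default_cudagraph_mode` is a hypothetical helper, not a vLLM function.

```python
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    """Local stand-in enum for illustration; not imported from vLLM."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_AND_PIECEWISE = 3


def default_cudagraph_mode(user_mode: Optional[CUDAGraphMode]) -> CUDAGraphMode:
    """Return the user's mode if given, else FULL_AND_PIECEWISE.

    Assumption: an explicit user choice is never overridden.
    """
    if user_mode is None:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return user_mode


print(default_cudagraph_mode(None).name)
print(default_cudagraph_mode(CUDAGraphMode.PIECEWISE).name)
```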
NemotronHForCausalLMConfig
Bases: VerifyAndUpdateConfig
Methods:
- update_mamba_ssm_cache_dtype – Update mamba_ssm_cache_dtype for NemotronH models when it is set to 'auto' (or not explicitly set).
Attributes:
- DEFAULT_MAMBA_SSM_CACHE_DTYPE – Only float32 is known to have no accuracy issues by default.
DEFAULT_MAMBA_SSM_CACHE_DTYPE = 'float32' class-attribute instance-attribute
Only float32 is known to have no accuracy issues by default.
update_mamba_ssm_cache_dtype(*, cache_config, hf_config) classmethod
When mamba_ssm_cache_dtype is set to 'auto' (or not explicitly set), update it for NemotronH models to the value specified in the HF config, or to float32 if the HF config does not specify one.
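The resolution order described above might look like the sketch below. It is an illustration under assumptions: `resolve_ssm_cache_dtype` is a hypothetical helper, and the HF config attribute name passed as `hf_field` is a guess, since the source does not name the field.

```python
from types import SimpleNamespace
from typing import Optional

DEFAULT_MAMBA_SSM_CACHE_DTYPE = "float32"  # value documented above


def resolve_ssm_cache_dtype(
    user_dtype: Optional[str],
    hf_config: object,
    hf_field: str = "mamba_ssm_cache_dtype",  # assumed attribute name
) -> str:
    """Explicit user value wins; otherwise HF config, then float32."""
    if user_dtype not in (None, "auto"):
        return user_dtype
    hf_value = getattr(hf_config, hf_field, None)
    return hf_value if hf_value is not None else DEFAULT_MAMBA_SSM_CACHE_DTYPE


# SimpleNamespace objects stand in for a real HF config here.
with_field = SimpleNamespace(mamba_ssm_cache_dtype="bfloat16")
without_field = SimpleNamespace()

print(resolve_ssm_cache_dtype("auto", with_field))
print(resolve_ssm_cache_dtype(None, without_field))
print(resolve_ssm_cache_dtype("float16", with_field))
```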
Qwen3_5ForConditionalGenerationConfig
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Update mamba_ssm_cache_dtype for Qwen3.5 models when it is set to 'auto' (or not explicitly set).
verify_and_update_config(vllm_config) staticmethod
When mamba_ssm_cache_dtype is set to 'auto' (or not explicitly set), update it for Qwen3.5 models to the value specified in the HF config's mamba_ssm_dtype field. Warn if the user explicitly sets it to a different value.
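A rough sketch of the resolve-or-warn behaviour described above. Only the `mamba_ssm_dtype` field name comes from the source. `resolve_qwen_ssm_dtype` is a hypothetical helper, using `warnings.warn` is a stand-in for whatever logging vLLM actually does, and leaving the value as 'auto' when the HF field is absent is an assumption.

```python
import warnings
from types import SimpleNamespace
from typing import Optional


def resolve_qwen_ssm_dtype(user_dtype: Optional[str], hf_config: object) -> str:
    """Adopt the HF value on 'auto'; warn on a conflicting explicit value."""
    hf_dtype = getattr(hf_config, "mamba_ssm_dtype", None)
    if user_dtype in (None, "auto"):
        # Assumption: with no HF value available, leave the setting as 'auto'.
        return hf_dtype if hf_dtype is not None else "auto"
    if hf_dtype is not None and user_dtype != hf_dtype:
        warnings.warn(
            f"mamba_ssm_cache_dtype={user_dtype!r} overrides the HF config "
            f"value {hf_dtype!r}"
        )
    return user_dtype  # an explicit user choice is kept


# SimpleNamespace stands in for a real HF config here.
hf = SimpleNamespace(mamba_ssm_dtype="float32")
print(resolve_qwen_ssm_dtype("auto", hf))
```

Keeping the user's explicit value and only warning matches the docstring, which describes a warning rather than an error or a silent override.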