`vllm.model_executor.layers.quantization.utils.mxfp4_utils` ¶

Functions:

should_use_cdna4_mx_scale_swizzle –

Whether to use the CDNA4 swizzled scale layout for mxfp4 on gfx950.

`_swizzle_mxfp4(quant_tensor, scale, num_warps=8)` ¶

weight swizzle for mxfp4 moe, used for OAI mxfp4 kernel

Source code in vllm/model_executor/layers/quantization/utils/mxfp4_utils.py

def _swizzle_mxfp4(quant_tensor, scale, num_warps=8):
    """weight swizzle for mxfp4 moe, used for OAI mxfp4 kernel"""
    assert has_triton_kernels()
    import triton_kernels.matmul_ogs_details.opt_flags as opt_flags
    from triton_kernels.numerics import InFlexData
    from triton_kernels.tensor import FP4, convert_layout, wrap_torch_tensor
    from triton_kernels.tensor_details import layout
    from triton_kernels.tensor_details.layout import StridedLayout

    value_layout_opts: dict[str, Any] = {}
    scale_layout_opts: dict[str, Any] = {}

    if (
        current_platform.is_cuda()
        and current_platform.is_device_capability(90)
        and not is_torch_equal_or_newer("2.8.1")
    ):
        logger.warning_once(
            "Mxfp4 on hopper is running on torch < 2.8.1, "
            "this cause swizling to be disabled, which may "
            "cause performance degradation. Please upgrade to torch nightly"
        )
        value_layout = StridedLayout
        scale_layout = StridedLayout
    elif current_platform.is_rocm():
        value_layout = StridedLayout
        if should_use_cdna4_mx_scale_swizzle():
            try:
                # triton < 3.6
                from triton_kernels.tensor_details.layout import GFX950MXScaleLayout

                scale_layout = GFX950MXScaleLayout
            except ImportError:
                # triton >= 3.6
                from triton_kernels.tensor_details.layout import CDNA4MXScaleLayout

                scale_layout = CDNA4MXScaleLayout
        else:
            scale_layout = StridedLayout
    else:
        value_layout, value_layout_opts = layout.make_default_matmul_mxfp4_w_layout(
            mx_axis=1
        )
        scale_layout, scale_layout_opts = (
            layout.make_default_matmul_mxfp4_w_scale_layout(
                mx_axis=1, num_warps=num_warps
            )
        )
    if current_platform.is_cuda():
        if current_platform.is_device_capability(90):
            constraints = {
                "split_k": 1,
            }
            opt_flags.update_opt_flags_constraints(constraints)
        elif current_platform.is_device_capability_family(100):
            constraints = {
                "is_persistent": True,
                "epilogue_subtile": 1,
            }
            opt_flags.update_opt_flags_constraints(constraints)
    # transpose the tensor so that the quantization axis is on dim1
    quant_tensor = quant_tensor.transpose(-2, -1)
    scale = scale.transpose(-2, -1)
    quant_tensor = convert_layout(
        wrap_torch_tensor(quant_tensor, dtype=FP4), value_layout, **value_layout_opts
    )
    scale = convert_layout(wrap_torch_tensor(scale), scale_layout, **scale_layout_opts)
    return quant_tensor, InFlexData(), scale

`should_use_cdna4_mx_scale_swizzle()` ¶

Whether to use the CDNA4 swizzled scale layout for mxfp4 on gfx950.

CDNA4 swizzle requires BLOCK_K%256==0; at TP>=4 the A8W4 dispatch picks BK<256 tiles for the smaller per-rank shapes, so swizzle must be off. Used by both the weight-load swizzle in _swizzle_mxfp4 and the kernel-argument gate in aiter_mxfp4_w4a8_moe; they must agree.

Source code in vllm/model_executor/layers/quantization/utils/mxfp4_utils.py

def should_use_cdna4_mx_scale_swizzle() -> bool:
    """Whether to use the CDNA4 swizzled scale layout for mxfp4 on gfx950.

    CDNA4 swizzle requires BLOCK_K%256==0; at TP>=4 the A8W4 dispatch
    picks BK<256 tiles for the smaller per-rank shapes, so swizzle must
    be off. Used by both the weight-load swizzle in `_swizzle_mxfp4` and
    the kernel-argument gate in `aiter_mxfp4_w4a8_moe`; they must agree.
    """
    from vllm.distributed import get_tensor_model_parallel_world_size
    from vllm.platforms.rocm import on_gfx950

    return on_gfx950() and get_tensor_model_parallel_world_size() <= 2

vllm.model_executor.layers.quantization.utils.mxfp4_utils ¶

_swizzle_mxfp4(quant_tensor, scale, num_warps=8) ¶

should_use_cdna4_mx_scale_swizzle() ¶

`vllm.model_executor.layers.quantization.utils.mxfp4_utils` ¶

`_swizzle_mxfp4(quant_tensor, scale, num_warps=8)` ¶

`should_use_cdna4_mx_scale_swizzle()` ¶