The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with the MMX instruction set extension introduced with the Pentium MMX in 1997, typically define sets of wide registers together with instructions that subdivide these registers into fixed-size lanes and perform a computation on each lane in parallel.
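The lane model described above can be illustrated with a small software sketch. The code below is not how any processor implements it; it simply treats a 64-bit integer as an MMX-style register holding four 16-bit lanes and performs a wrapping add on each lane independently, in the manner of the MMX <code>PADDW</code> instruction. The helper names (`split_lanes`, `join_lanes`, `packed_add`) are hypothetical and chosen for illustration only.

```python
def split_lanes(value, lane_bits, total_bits):
    """Split an integer 'register' into lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, lane_bits)]


def join_lanes(lanes, lane_bits):
    """Reassemble lanes (lowest lane first) into a single integer."""
    value = 0
    for index, lane in enumerate(lanes):
        value |= lane << (index * lane_bits)
    return value


def packed_add(a, b, lane_bits=16, total_bits=64):
    """Add corresponding lanes; each lane wraps around independently."""
    mask = (1 << lane_bits) - 1
    sums = [
        (x + y) & mask
        for x, y in zip(
            split_lanes(a, lane_bits, total_bits),
            split_lanes(b, lane_bits, total_bits),
        )
    ]
    return join_lanes(sums, lane_bits)


a = 0x0001_0002_0003_FFFF
b = 0x0010_0020_0030_0001
result = packed_add(a, b)
print(hex(result))  # the lowest lane wraps to 0 without carrying into the next lane
```

Note that the overflow in the lowest lane (0xFFFF + 1) does not propagate into its neighbour, which is the property that distinguishes a packed add from an ordinary 64-bit add.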
The main SIMD instruction set extensions that have been introduced for x86 are:
These instructions are, unless otherwise noted, available in the following forms:
For many of the instruction mnemonics, <code>(V)</code> is used to indicate that the instruction mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without VEX/EVEX-prefix.
For the instructions in the below table, the following considerations apply unless otherwise noted:
From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical; however, some processors with SSE2 implement integer, FP32 and FP64 execution units as three different execution clusters, where forwarding results from one cluster to another may incur performance penalties. Such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2 - <code>PXOR</code>, <code>XORPS</code>, and <code>XORPD</code> - which are intended for use on integer, FP32, and FP64 data, respectively.)
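The claim that these forms are functionally identical follows from the fact that a bitwise operation never interprets the lanes. The sketch below illustrates this under a simplifying assumption: a 128-bit register is modelled as a plain Python integer, and a single hypothetical `vector_xor` function stands in for all three of <code>PXOR</code>, <code>XORPS</code> and <code>XORPD</code>. It says nothing about the performance differences mentioned above, which cannot be modelled this way.

```python
import struct


def pack_f32x4(values):
    """Pack four FP32 values into a 128-bit integer (lowest lane first)."""
    return int.from_bytes(struct.pack("<4f", *values), "little")


def unpack_f32x4(reg):
    return list(struct.unpack("<4f", reg.to_bytes(16, "little")))


def pack_f64x2(values):
    """Pack two FP64 values into a 128-bit integer (lowest lane first)."""
    return int.from_bytes(struct.pack("<2d", *values), "little")


def unpack_f64x2(reg):
    return list(struct.unpack("<2d", reg.to_bytes(16, "little")))


def vector_xor(a, b):
    """One bitwise XOR serves integer, FP32 and FP64 data alike."""
    return a ^ b


# Sign-bit masks for each lane layout.
SIGN_F32 = int.from_bytes(struct.pack("<4I", *([0x80000000] * 4)), "little")
SIGN_F64 = int.from_bytes(struct.pack("<2Q", *([1 << 63] * 2)), "little")

f32_reg = pack_f32x4([1.5, -2.0, 0.0, 8.25])
f64_reg = pack_f64x2([3.0, -4.5])

negated_f32 = unpack_f32x4(vector_xor(f32_reg, SIGN_F32))  # flips every FP32 sign
negated_f64 = unpack_f64x2(vector_xor(f64_reg, SIGN_F64))  # flips every FP64 sign
cleared = vector_xor(f32_reg, f32_reg)                     # XOR with itself gives zero
print(negated_f32, negated_f64, cleared)
```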
These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:
SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.
This covers instructions/opcodes that are new to AVX and AVX2.
AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions - please see tables above.
Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512.
SIMD instruction set extensions that use the VEX prefix and are not considered part of baseline AVX/AVX2/AVX-512, FMA3/4 or AMX.
Integer, opmask and cryptographic instructions that use the VEX prefix (e.g. the BMI2, CMPccXADD, VAES and SHA512 extensions) are not included.
Floating-point fused multiply-add instructions were introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands: a destination operand and three source operands.
FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions. <br /> The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two. <br /> FMA3 encoding <br /> FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form <code>VEX.66.0F38 <b>xy</b> /r</code> or <code>EVEX.66.0F38 <b>xy</b> /r</code>. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte <code><b>xy</b></code> consists of two nibbles, where the top nibble <code><b>x</b></code> selects the operand ordering (<code>9</code>='132', <code>A</code>='213', <code>B</code>='231') and the bottom nibble <code><b>y</b></code> (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of <code><b>x</b></code> and <code><b>y</b></code> outside the given ranges will result in something that is not an FMA3 instruction.) <br /> At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
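As an illustration of the opcode-byte layout and mnemonic convention described above, the following sketch maps an FMA3 opcode byte and W bit to a mnemonic. The top-nibble and W-bit rules are taken from the text; the assignment of bottom-nibble values 6..F to specific operations is not given in the text and is an assumption here, based on the usual published opcode listings. The function name `fma3_mnemonic` is hypothetical.

```python
# Top nibble -> operand ordering (as described in the text).
ORDERINGS = {0x9: "132", 0xA: "213", 0xB: "231"}

# Bottom nibble -> (operation, is_packed).
# ASSUMPTION: this mapping is not stated in the surrounding text.
OPERATIONS = {
    0x6: ("VFMADDSUB", True),
    0x7: ("VFMSUBADD", True),
    0x8: ("VFMADD", True),
    0x9: ("VFMADD", False),
    0xA: ("VFMSUB", True),
    0xB: ("VFMSUB", False),
    0xC: ("VFNMADD", True),
    0xD: ("VFNMADD", False),
    0xE: ("VFNMSUB", True),
    0xF: ("VFNMSUB", False),
}


def fma3_mnemonic(opcode, w):
    """Return the FMA3 mnemonic for an opcode byte and W bit, or None."""
    ordering = ORDERINGS.get((opcode >> 4) & 0xF)
    operation = OPERATIONS.get(opcode & 0xF)
    if not 0 <= opcode <= 0xFF or ordering is None or operation is None:
        return None  # outside the FMA3 ranges: not an FMA3 instruction
    name, packed = operation
    # P = packed, S = scalar; the final letter gives FP32 (S) or FP64 (D).
    suffix = ("P" if packed else "S") + ("D" if w else "S")
    return name + ordering + suffix


print(fma3_mnemonic(0x98, 0))  # VFMADD132PS
print(fma3_mnemonic(0xB9, 1))  # VFMADD231SD
print(fma3_mnemonic(0x95, 0))  # None (bottom nibble below 6)
```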
For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or a memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding controls. <br /> The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions. These all take the form <code>EVEX.66.MAP6.W0 <b>xy</b> /r</code>, with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024, similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions. These all take the form <code>EVEX.NP.MAP6.W0 <b>xy</b> /r</code>, with the opcode byte again working similarly to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.) <br /> FMA4 encoding <br /> FMA4 instructions are encoded with the VEX prefix, in the form <code>VEX.66.0F3A <b>xx</b> /r ib</code> (no EVEX encodings are defined). The opcode byte <code><b>xx</b></code> uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.
For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
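The VEX.W rule can be sketched in a few lines. The function below is only an illustration: it takes already-decoded fields rather than raw instruction bytes, the name `fma4_source_operands` is hypothetical, and the `reg_prefix` parameter (selecting between xmm and ymm register names) is an assumption introduced here for readability.

```python
def fma4_source_operands(vex_w, rm_operand, imm8, reg_prefix="xmm"):
    """Return (third_operand, fourth_operand) for an FMA4 instruction.

    vex_w      -- the VEX.W bit (0 or 1)
    rm_operand -- text of the operand selected by the ModR/M r/m field
    imm8       -- the trailing 8-bit immediate; bits 7:4 name a register
    reg_prefix -- hypothetical helper parameter: "xmm" or "ymm"
    """
    imm_register = "%s%d" % (reg_prefix, (imm8 >> 4) & 0xF)
    if vex_w == 0:
        # W=0: third operand is r/m, fourth is the register from imm8[7:4].
        return (rm_operand, imm_register)
    # W=1: the two operands are swapped.
    return (imm_register, rm_operand)


print(fma4_source_operands(0, "[rax]", 0x30))  # ('[rax]', 'xmm3')
print(fma4_source_operands(1, "[rax]", 0x30))  # ('xmm3', '[rax]')
```

This swap is what allows the memory operand to appear in either the third or the fourth position even though only the ModR/M r/m field can address memory.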
<br /> Opcode table <br /> The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table, with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:
AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory. Most of the added instructions may also be used with the 256- and 128-bit registers.
This covers instructions that are new to AVX-512's F, BW and DQ subsets.
These AVX-512 subsets also include extended EVEX-encoded forms of a large number of MMX/SSE/AVX instructions - please see tables above.
These instructions all follow a given pattern where:
AVX-512 introduces, in addition to 512-bit vectors, a set of eight opmask registers, named k0,k1,k2...k7. These registers are 64 bits wide in implementations that support AVX512BW and 16 bits wide otherwise. They are mainly used to enable/disable operation on a per-lane basis for most of the AVX-512 vector instructions. They are usually set with vector-compare instructions or instructions that otherwise produce a 1-bit-per-lane result as a natural part of their operation. In addition, AVX-512 defines a set of 55 new instructions to assist manual manipulation of the opmask registers.
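The per-lane enable/disable behaviour can be modelled in software as follows. This is a sketch only: vectors are plain Python lists of 32-bit lane values, the mask is an integer whose bit <i>i</i> controls lane <i>i</i>, and the function names are hypothetical. The distinction between keeping the old destination value in disabled lanes ("merging") and clearing it ("zeroing") is not described in the text above; it is included here as an assumption about how AVX-512 masking is commonly presented.

```python
def compare_greater(a, b):
    """Produce an opmask: bit i is set when a[i] > b[i]."""
    mask = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        if x > y:
            mask |= 1 << lane
    return mask


def masked_add(a, b, dest, mask, zeroing=False):
    """Add lanes of a and b only where the mask bit is set.

    Disabled lanes keep the old destination value (merging) or
    become zero (zeroing). ASSUMPTION: both modes are modelled here
    although only per-lane enable/disable is stated in the text.
    """
    result = []
    for lane, (x, y, old) in enumerate(zip(a, b, dest)):
        if (mask >> lane) & 1:
            result.append((x + y) & 0xFFFFFFFF)  # 32-bit wrapping add
        else:
            result.append(0 if zeroing else old)
    return result


k1 = compare_greater([5, 1, 7, 2], [3, 3, 3, 3])  # lanes 0 and 2 pass
merged = masked_add([1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], k1)
zeroed = masked_add([1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], k1,
                    zeroing=True)
print(bin(k1), merged, zeroed)
```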
These instructions are, for the most part, defined in groups of four, where the four instructions in a group are 8-bit, 16-bit, 32-bit and 64-bit variants of the same basic operation (only the low 8/16/32/64 bits of the registers participate in the given operation and, if a result is written back to a register, all bits except the bottom 8/16/32/64 bits are set to zero). The opmask instructions are all encoded with the VEX prefix (unlike all other AVX-512 instructions, which are encoded with the EVEX prefix).
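The width rule for the four variants in a group can be sketched as below, using bitwise AND and NOT as example operations. Opmask registers are modelled as plain Python integers; the function names `kand` and `knot` and the suffix table are illustrative rather than a definition of the actual instructions.

```python
# Suffix letter -> number of low bits that participate in the operation.
WIDTHS = {"B": 8, "W": 16, "D": 32, "Q": 64}


def kand(suffix, k_a, k_b):
    """AND the low 8/16/32/64 bits; all higher result bits are zero."""
    low_bits = (1 << WIDTHS[suffix]) - 1
    return (k_a & k_b) & low_bits


def knot(suffix, k):
    """Invert the low 8/16/32/64 bits; all higher result bits are zero."""
    low_bits = (1 << WIDTHS[suffix]) - 1
    return ~k & low_bits


k1 = 0xFFFF_FFFF_FFFF_FFFF
k2 = 0x1234_5678_9ABC_DEF0

for suffix in "BWDQ":
    print(suffix, hex(kand(suffix, k1, k2)))
```

The same two inputs therefore give four different results depending only on the width suffix, since the bits above the chosen width are discarded.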
In general, the 16-bit variants of the instructions are introduced by AVX512F (except <code>KADDW</code> and <code>KTESTW</code>), the 8-bit variants by the AVX512DQ extension, and the 32/64-bit variants by the AVX512BW extension.
Most of the instructions follow a very regular encoding pattern where the four instructions in a group have identical encodings except for the VEX.pp and VEX.W fields:
Not all of the opmask instructions fit the pattern above - the remaining ones are:
Vector-register instructions that use opmasks in ways other than just as a result writeback mask.
Intel AMX adds eight new tile-registers, <code>tmm0</code>-<code>tmm7</code>, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a <code>TILECFG</code> register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
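The capacity figures above imply some simple arithmetic, sketched below. The limits (8 tiles, 16 rows, 64 bytes per row) come from the text; the function names and the idea of validating a requested shape against those limits are hypothetical and do not reproduce the actual <code>TILECFG</code> layout.

```python
NUM_TILES = 8           # tmm0-tmm7
MAX_ROWS = 16           # maximum rows per tile
MAX_BYTES_PER_ROW = 64  # maximum bytes per row


def tile_bytes(rows, bytes_per_row):
    """Storage used by one tile of the given shape, after limit checks."""
    if not 0 <= rows <= MAX_ROWS:
        raise ValueError("rows must be between 0 and %d" % MAX_ROWS)
    if not 0 <= bytes_per_row <= MAX_BYTES_PER_ROW:
        raise ValueError("bytes_per_row must be between 0 and %d" % MAX_BYTES_PER_ROW)
    return rows * bytes_per_row


def elements_per_row(bytes_per_row, element_size):
    """How many elements of a given byte size fit in one row."""
    return bytes_per_row // element_size


max_tile = tile_bytes(MAX_ROWS, MAX_BYTES_PER_ROW)  # one full tile
all_tiles = NUM_TILES * max_tile                    # all eight tiles together
print(max_tile, all_tiles)
print(elements_per_row(64, 4), elements_per_row(64, 1))
```

A full tile thus holds 1024 bytes, which corresponds to, for example, a 16×16 matrix of 4-byte elements or a 16×64 matrix of 1-byte elements.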