AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013 and implemented in Intel's Xeon Phi x200 (Knights Landing) and Skylake-X CPUs; this includes the Core-X series (excluding the Core i5-7640X and Core i7-7740X), as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series.

AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.

Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations. The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension, which is included in most AVX-512-capable processors (see "CPUs with AVX-512"), these instructions may also be used on the 128-bit and 256-bit vector sizes.

AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first-generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.

Instruction set

The AVX-512 instruction set consists of several separate sets, each having its own unique CPUID feature bit; however, they are typically grouped by the processor generation that implements them.

; F, CD, ER, PF: Introduced with Xeon Phi x200 (Knights Landing) and Xeon Gold/Platinum (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing.
:* ''AVX-512 Foundation (F)'' – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control; implemented by Knights Landing and Skylake Xeon.
:* ''AVX-512 Conflict Detection Instructions (CD)'' – efficient conflict detection to allow more loops to be vectorized; implemented by Knights Landing and Skylake X.
:* ''AVX-512 Exponential and Reciprocal Instructions (ER)'' – exponential and reciprocal operations designed to help implement transcendental operations; implemented by Knights Landing.
:* ''AVX-512 Prefetch Instructions (PF)'' – new prefetch capabilities; implemented by Knights Landing.
; VL, DQ, BW: Introduced with Skylake X and Cannon Lake.
:* ''AVX-512 Vector Length Extensions (VL)'' – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers.
:* ''AVX-512 Doubleword and Quadword Instructions (DQ)'' – adds new 32-bit and 64-bit AVX-512 instructions.
:* ''AVX-512 Byte and Word Instructions (BW)'' – extends AVX-512 to cover 8-bit and 16-bit integer operations.
; IFMA, VBMI: Introduced with Cannon Lake.
:* ''AVX-512 Integer Fused Multiply Add (IFMA)'' – fused multiply add of integers using 52-bit precision.
:* ''AVX-512 Vector Byte Manipulation Instructions (VBMI)'' – adds vector byte permutation instructions which were not present in AVX-512BW.
; 4VNNIW, 4FMAPS: Introduced with Knights Mill.
:* ''AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW)'' – vector instructions for deep learning, enhanced word, variable precision.
:* ''AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS)'' – vector instructions for deep learning, floating point, single precision.
; VPOPCNTDQ: Vector population count instruction. Introduced with Knights Mill and Ice Lake.
; VNNI, VBMI2, BITALG: Introduced with Ice Lake.
:* ''AVX-512 Vector Neural Network Instructions (VNNI)'' – vector instructions for deep learning.
:* ''AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2)'' – byte/word load, store and concatenation with shift.
:* ''AVX-512 Bit Algorithms (BITALG)'' – byte/word bit manipulation instructions expanding VPOPCNTDQ.
; VP2INTERSECT: Introduced with Tiger Lake.
:* ''AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT)''.
; GFNI, VPCLMULQDQ, VAES: Introduced with Ice Lake.
:* These are not AVX-512 features per se. Together with AVX-512, they enable EVEX encoded versions of GFNI, PCLMULQDQ and AES instructions.

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This led Intel to define a new prefix called EVEX.

Compared to VEX, EVEX adds the following:
* Expanded register encoding allowing 32 512-bit registers.
* 8 new opmask registers for masking most AVX-512 instructions.
* A new scalar memory mode that automatically performs a broadcast.
* Room for explicit rounding control in each instruction.
* A new compressed displacement memory addressing mode.

The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.

SIMD modes

The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit XMM/YMM registers, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which allow access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX and EVEX encoded versions of an instruction ambiguous in the source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte and word support).

Extended registers

The width of the SIMD register file is increased from 256 bits to 512 bits, and expanded from 16 to a total of 32 registers ZMM0–ZMM31. These registers can be addressed as 256-bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when using the EVEX encoded form.

Opmask registers

Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register 'k0' is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, 'k0' is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to the blend instructions. The opmask registers are normally 16 bits wide, but can be up to 64 bits with the AVX-512BW extension. How many of the bits are actually used, though, depends on the vector type of the instructions masked. For the 32-bit single float or double words, 16 bits are used to mask the 16 elements in a 512-bit register. For double float and quad words, at most 8 mask bits are used. The opmask register is the reason why several bitwise instructions which naturally have no element widths had them added in AVX-512. For instance, bitwise AND, OR or 128-bit shuffle now exist in both double-word and quad-word variants with the only difference being in the final masking.
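The difference between the "zero" and "merge" behaviors can be illustrated with a small scalar model. The following Python sketch is not an intrinsic or a real API; the function name `masked_add` and its parameters are hypothetical, and element overflow is ignored for brevity.

```python
def masked_add(dst, a, b, mask, zeroing):
    """Scalar model of an element-wise add controlled by an opmask.

    Bit i of `mask` controls element i. `dst` holds the previous
    contents of the destination register.
    """
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)   # selected: the new result is written
        elif zeroing:
            out.append(0)       # "zero": unselected elements are cleared
        else:
            out.append(old)     # "merge": unselected elements are left untouched
    return out


dst = [100] * 8
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
mask = 0b00001111  # only the four lowest elements are selected

merged = masked_add(dst, a, b, mask, zeroing=False)
zeroed = masked_add(dst, a, b, mask, zeroing=True)
# merged == [11, 12, 13, 14, 100, 100, 100, 100]
# zeroed == [11, 12, 13, 14, 0, 0, 0, 0]
```

With eight elements only the low 8 bits of the mask matter, which mirrors the statement above that the number of mask bits used depends on the element count of the vector type.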

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (Word) versions. With AVX-512DQ, 8-bit (Byte) versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit (Double) and 64-bit (Quad) versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
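As an illustration of how a mask can drive ordinary branches, the sketch below models the flag results of KORTEST in Python. It assumes the commonly documented behavior (the zero flag is set when the OR of the two masks is all zeros, the carry flag when it is all ones); the function name `kortest` and its return shape are hypothetical.

```python
def kortest(k1, k2, width=16):
    """Scalar model of KORTEST: OR two masks and derive (ZF, CF).

    ZF is True when no bit is set in the combined mask,
    CF is True when every one of the `width` bits is set.
    """
    all_ones = (1 << width) - 1
    combined = (k1 | k2) & all_ones
    zero_flag = combined == 0
    carry_flag = combined == all_ones
    return zero_flag, carry_flag


# A vectorized loop could branch on ZF to skip work when no lane matched,
# or on CF to take a fast path when every lane matched.
none_set = kortest(0x0000, 0x0000)
all_set = kortest(0x00FF, 0xFF00)
some_set = kortest(0x0001, 0x0000)
```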

New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or majorly reworked instructions are listed below. These ''foundation'' instructions also include the extensions from AVX-512VL and AVX-512BW since those extensions merely add new versions of these instructions instead of new instructions.

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV. Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for the instructions, one to write to and one to declare regular masking.
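A scalar model may help make the roles of the immediate and the two mask registers concrete. The Python sketch below assumes the predicate numbering commonly documented for the integer compare forms (0 = equal, 1 = less than, 2 = less or equal, 3 = false, 4 = not equal, 5 = not less than, 6 = not less or equal, 7 = true); the names `PREDICATES` and `compare_into_mask` are hypothetical.

```python
import operator

# Assumed immediate encoding for the integer compare-into-mask forms.
PREDICATES = [
    operator.eq,          # 0: equal
    operator.lt,          # 1: less than
    operator.le,          # 2: less than or equal
    lambda x, y: False,   # 3: always false
    operator.ne,          # 4: not equal
    operator.ge,          # 5: not less than
    operator.gt,          # 6: not less than or equal
    lambda x, y: True,    # 7: always true
]


def compare_into_mask(a, b, imm, write_mask):
    """Return a mask with bit i set when element i passes the comparison.

    `write_mask` plays the role of the regular opmask: elements whose
    bit is clear always produce a 0 bit in the result.
    """
    predicate = PREDICATES[imm & 7]
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if (write_mask >> i) & 1 and predicate(x, y):
            result |= 1 << i
    return result


a = [1, 5, 3, 7]
b = [2, 5, 1, 9]
less_than = compare_into_mask(a, b, imm=1, write_mask=0b1111)      # 0b1001
less_than_low = compare_into_mask(a, b, imm=1, write_mask=0b0111)  # 0b0001
```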

Logical set mask

The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on the result values being zero or non-zero. Note that, like the comparison instructions, these take two opmask registers: one as the destination and one as a regular opmask.
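The following Python sketch models this behavior on plain integers. It is an illustration under the assumption that the AND form sets a mask bit when the element-wise AND is non-zero and the NAND form sets it when the AND is zero; the name `logical_set_mask` and the `negate` flag are hypothetical.

```python
def logical_set_mask(a, b, write_mask, negate=False):
    """Scalar model of the logical set mask operations.

    For each element selected by `write_mask`, AND the two inputs and
    set the result bit when the AND is non-zero (negate=False) or
    zero (negate=True). Unselected elements produce a 0 bit.
    """
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if not (write_mask >> i) & 1:
            continue
        nonzero = (x & y) != 0
        if nonzero != negate:
            result |= 1 << i
    return result


a = [0b1100, 0b0011, 0b0000, 0b1000]
b = [0b0100, 0b0100, 0b1111, 0b1000]
and_mask = logical_set_mask(a, b, write_mask=0b1111)                # 0b1001
nand_mask = logical_set_mask(a, b, write_mask=0b1111, negate=True)  # 0b0110
```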

Compress and expand

The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.
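The two operations can be modelled on Python lists as follows. This is a simplified sketch: the real register forms also define what happens to the unused upper elements after a compress, which the hypothetical `compress` helper below ignores by returning only the packed values.

```python
def compress(src, mask):
    """Keep only the elements whose mask bit is set, packed contiguously."""
    return [value for i, value in enumerate(src) if (mask >> i) & 1]


def expand(packed, mask, width, fill=0):
    """Spread consecutive packed values to the positions selected by mask.

    Positions whose mask bit is clear receive `fill` (a stand-in for
    zeroing; a merge variant would keep the old destination value).
    """
    out = [fill] * width
    values = iter(packed)
    for i in range(width):
        if (mask >> i) & 1:
            out[i] = next(values)
    return out


src = [10, 20, 30, 40, 50, 60, 70, 80]
mask = 0b10100110                  # selects elements 1, 2, 5 and 7
packed = compress(src, mask)       # [20, 30, 60, 80]
restored = expand(packed, mask, 8)  # [0, 20, 30, 0, 0, 60, 0, 80]
```

Expanding a compressed vector with the same mask restores the selected elements to their original positions, which is why the two are described as operating in opposite ways.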

Permute

A new set of permute instructions has been added for full two-input permutations. They all take three arguments: two source registers and one index. The result is output by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.
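A two-input permutation can be pictured as a lookup into the concatenation of both sources. The Python sketch below is a scalar model under that assumption: the low bits of each index pick an element and the next bit picks the source. The name `permute2` is hypothetical, and the model returns a new list rather than overwriting an operand.

```python
def permute2(table_a, table_b, indices):
    """Scalar model of a full two-source permute.

    Both tables must have the same power-of-two length n. For each
    index, bit log2(n) selects the table and the low bits select the
    element; higher index bits are ignored.
    """
    n = len(table_a)
    out = []
    for idx in indices:
        source = table_b if idx & n else table_a
        out.append(source[idx & (n - 1)])
    return out


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]
interleaved = permute2(a, b, [0, 4, 1, 5])  # ["a0", "b0", "a1", "b1"]
```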

Bitwise ternary logic

Two new instructions can logically implement all possible bitwise operations between three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated using a lookup of the three corresponding bits in the inputs to select one of the 8 positions in the 8-bit immediate. Since only 8 combinations are possible using three bits, this allows all possible three-input bitwise operations to be performed. These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ. The difference between the doubleword and quadword versions is only the application of the opmask.
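The lookup can be modelled bit by bit in Python. In this sketch the first input supplies the most significant bit of the 3-bit index and the third input the least significant; the function name `ternary_logic` and the `width` parameter are hypothetical, and opmasking is left out.

```python
def ternary_logic(a, b, c, imm8, width=32):
    """Scalar model of a three-input bitwise operation defined by imm8.

    For every bit position, the bits of a, b and c form an index
    0..7 that selects one bit of the immediate as the output bit.
    """
    result = 0
    for bit in range(width):
        index = (
            (((a >> bit) & 1) << 2)
            | (((b >> bit) & 1) << 1)
            | ((c >> bit) & 1)
        )
        result |= ((imm8 >> index) & 1) << bit
    return result


a, b, c = 0b1100_0101, 0b1010_0011, 0b0110_1001
xor3 = ternary_logic(a, b, c, 0x96, width=8)      # three-way XOR
majority = ternary_logic(a, b, c, 0xE8, width=8)  # bitwise majority vote
select = ternary_logic(a, b, c, 0xCA, width=8)    # a ? b : c per bit
```

A convenient way to derive an immediate is to evaluate the desired expression on the constants 0xF0, 0xCC and 0xAA: their bits enumerate all eight input combinations, so the 8-bit result is the immediate itself.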


Conversions

A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.

Floating-point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.

Floating-point arithmetic

This is the second set of new floating-point methods, which includes new scaling and approximate calculation of reciprocal, and reciprocal of square root. The approximate reciprocal instructions guarantee to have at most a relative error of 2⁻¹⁴.

Broadcast

Miscellaneous

New instructions by sets

Conflict detection

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
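A typical case is a loop such as a histogram update, where two elements of the same vector may refer to the same memory location. The Python sketch below is a scalar model of the conflict test under the assumption that each element is compared with all earlier elements of the same vector and receives one result bit per earlier match; the name `conflict` is hypothetical.

```python
def conflict(values):
    """Scalar model of conflict detection within one vector.

    Element i of the result has bit j set (for j < i) when
    values[j] == values[i], i.e. an earlier element uses the same index.
    """
    out = []
    for i, value in enumerate(values):
        bits = 0
        for j in range(i):
            if values[j] == value:
                bits |= 1 << j
        out.append(bits)
    return out


indices = [3, 7, 3, 7, 3]
conflicts = conflict(indices)  # [0b0, 0b0, 0b1, 0b10, 0b101]

# Elements with a zero result have no earlier duplicate and form a
# subset that can be processed together without conflicts.
conflict_free = [i for i, bits in enumerate(conflicts) if bits == 0]
```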

Exponential and reciprocal

AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2⁻²⁸. They also contain two new exponential functions that have a relative error of at most 2⁻²³.

Prefetch

AVX-512 prefetch instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.

4FMAPS and 4VNNIW

The two sets of instructions perform multiple iterations of processing. They are generally only found in Xeon Phi products.

BW, DQ and VBMI

AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and adds byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions which get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).

Two new instructions were added to the mask instruction set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of the mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.

Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, including all the two-input bitwise instructions and the extract/insert integer instructions. Instructions that are completely new are covered below.

Floating-point instructions

Three new floating-point operations are introduced. Since they are not only new to AVX-512, they have both packed/SIMD and scalar versions.

The VFPCLASS instructions test whether the floating-point value is one of eight special floating-point values; the immediate field controls which of the eight values will trigger a bit in the output mask register. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is performed on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the integer part of the source value plus a number of fraction bits specified in the immediate field.

Other instructions

VBMI2

Extend VPCOMPRESS and VPEXPAND with byte and word variants. Shift instructions are new.

VNNI

VNNI stands for Vector Neural Network Instructions. AVX-512 VNNI adds EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors. A later AVX-VNNI extension adds VEX encodings of these instructions, which can only operate on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.
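To give a feel for the kind of operation involved, the Python sketch below models one 32-bit lane of a byte dot-product accumulate. It assumes the semantics commonly documented for the VPDPBUSD form (four unsigned bytes multiplied by four signed bytes, summed into a signed 32-bit accumulator with wraparound rather than saturation); the instruction choice and the name `dot_accumulate_u8_s8` are assumptions, not taken from the text above.

```python
def dot_accumulate_u8_s8(acc, unsigned_bytes, signed_bytes):
    """Scalar model of one 32-bit lane of a byte dot-product accumulate.

    `unsigned_bytes` holds four values in 0..255, `signed_bytes` four
    values in -128..127. The sum of products is added to `acc` and
    wrapped to a signed 32-bit integer.
    """
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    total &= 0xFFFFFFFF                 # keep the low 32 bits
    if total & 0x80000000:              # reinterpret as signed
        total -= 1 << 32
    return total


lane = dot_accumulate_u8_s8(100, [1, 2, 3, 255], [1, -1, 2, -1])
# 100 + (1 - 2 + 6 - 255) = -150
```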

IFMA

VPOPCNTDQ and BITALG

VP2INTERSECT

GFNI

EVEX-encoded Galois field new instructions.

VPCLMULQDQ

VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX-512 CPUs) adds only a VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but they do not extend it to select quadwords from different 128-bit fields (the meaning of the imm8 operand is the same: either the low or the high quadword of the 128-bit field is selected).
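The per-lane behavior can be sketched in Python with arbitrary-precision integers. The model assumes the commonly documented immediate layout, in which bit 0 of imm8 selects the quadword of the first source and bit 4 selects the quadword of the second source; the names `clmul64` and `clmul_lanes` and the tuple representation of a 128-bit lane are hypothetical.

```python
def clmul64(x, y):
    """Carry-less (polynomial over GF(2)) product of two 64-bit values."""
    result = 0
    for bit in range(64):
        if (y >> bit) & 1:
            result ^= x << bit   # XOR instead of add: no carries propagate
    return result


def clmul_lanes(a_lanes, b_lanes, imm8):
    """Apply the same quadword selection and product to every 128-bit lane.

    Each lane is a (low_qword, high_qword) tuple. The selection is made
    within a lane only; quadwords are never taken from another lane.
    """
    select_a = imm8 & 1          # assumed: bit 0 picks low/high of source 1
    select_b = (imm8 >> 4) & 1   # assumed: bit 4 picks low/high of source 2
    return [clmul64(a[select_a], b[select_b]) for a, b in zip(a_lanes, b_lanes)]


a_lanes = [(3, 5), (7, 1)]
b_lanes = [(3, 2), (1, 3)]
low_low = clmul_lanes(a_lanes, b_lanes, 0x00)    # [5, 7]
high_high = clmul_lanes(a_lanes, b_lanes, 0x11)  # [10, 3]
```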

VAES

VEX- and EVEX-encoded AES instructions. The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of input registers. The VEX versions can be used without AVX-512 support.

BF16

AI acceleration instructions operating on the Bfloat16 numbers.

FP16

An extension of the earlier F16C instruction set, adding comprehensive support for the binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single- and double-precision floating-point numbers and also introduce new complex number instructions and conversion instructions. Scalar and packed operations are supported.

Unlike the single- and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Subnormal values are processed at full speed by hardware to facilitate using the full dynamic range of the FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit.

Arithmetic instructions

Complex arithmetic instructions

Approximate reciprocal instructions

Comparison instructions

Conversion instructions

Decomposition instructions

Move instructions

Legacy instructions with EVEX-encoded versions

CPUs with AVX-512

* Intel
** Knights Landing (Xeon Phi x200): AVX-512 F, CD, ER, PF
** Knights Mill (Xeon Phi x205): AVX-512 F, CD, ER, PF, 4FMAPS, 4VNNIW, VPOPCNTDQ
** Skylake-SP: AVX-512 F, CD, VL, DQ, BW
** Cannon Lake: AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
** Cascade Lake: AVX-512 F, CD, VL, DQ, BW, VNNI
** Cooper Lake: AVX-512 F, CD, VL, DQ, BW, VNNI, BF16
** Ice Lake, Rocket Lake: AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES
** Tiger Lake (except Pentium and Celeron, although some reviewers have published CPU-Z screenshots showing a Celeron 6305 with AVX-512 support): AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, VP2INTERSECT
** Alder Lake (disabled by default; see the note below), Sapphire Rapids: AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, BF16, VP2INTERSECT, FP16
* Centaur Technology
** "CNS" core (8c/8t): AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
* AMD
** Zen 4: AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, BF16

: The Alder Lake microprocessors contain two different types of cores, only one of which has the silicon units for the AVX-512 family of instructions. The feature is disabled by default due to lack of support in the Alder Lake E-cores, which are based on the Gracemont microarchitecture. Certain motherboards, with specific combinations of BIOS and microcode revisions, can have the instruction set enabled, so long as all cores that do not support the instructions are disabled. Intel has physically fused off AVX-512 in later revisions of Alder Lake microprocessors.

Performance

Intel Vectorization Advisor (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon and Intel Xeon Phi processors). Along with the traditional hotspots profile, Advisor Recommendations and "seamless" integration of Intel Compiler vectorization diagnostics, Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand, mask utilization.

On some processors, AVX-512 instructions cause frequency throttling even greater than that caused by their predecessors, resulting in a penalty for mixed workloads. The additional downclocking is triggered by the 512-bit width of vectors and depends on the nature of the instructions being executed; using the 128- or 256-bit part of AVX-512 (AVX-512VL) does not trigger it. As a result, gcc and clang default to preferring 256-bit vectors.
