Permute (and Shuffle) instructions, part of

bit manipulation Bit manipulation is the act of algorithmically manipulating bits or other pieces of data shorter than a word. Computer programming tasks that require bit manipulation include low-level device control, error detection and correction algorithms, ...

as well as

vector processing In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ...

, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size. There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows: for i = 0 to length-1 dest = src ndices[i where the scatter variant is: for i = 0 to length-1 dest ndices[i = src[i] Note that unlike in memory-based Gather-scatter (vector addressing), gather-scatter all three of dest, src, and indices are ''registers'' (or parts of registers in the case of bit-level permute), not memory locations. The scatter variant can be seen to "scatter" the source elements ''across'' (into) to the destination, where the "gather" variant is gathering data ''from'' the indexed source elements. Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical permutation because duplicates can occur in the output. A special case of permute is also used in GPU " swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate SIMD lane.

Occurrences of permute instructions

Permute instructions occur in both

scalar processor Scalar processors are a class of computer processors that process only one data item at a time. Typical data items include integers and floating point numbers. Classification A scalar processor is classified as a single instruction, single data ...

s as well as

engines as well as

GPUs A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobil ...

. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in

RISC-V RISC-V (pronounced "risk-five" where five refers to the number of generations of RISC architecture that were developed at the University of California, Berkeley since 1981) is an open standard instruction set architecture (ISA) based on estab ...

vectors, and take Vectors as input for both source elements and source array, and output another Vector. In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked,

SIMD Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it shoul ...

style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result. Some ISAs, particularly for cryptographic applications, even have ''bit-level'' permute operations, such as bdep (bit deposit) in RISC-V bitmanip; in the Power ISA it is known as and has been included for several decades, and is still in the Power ISA v.3.0 B spec. Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include VSHUFF32x4 from

AVX-512 AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 (Knights Landing) and Skylake-X CPUs; t ...

. Permute operations in different forms are surprisingly common, occurring in AltiVec, Power ISA,

PowerPC G4 PowerPC G4 is a designation formerly used by Apple and Eyetech to describe a ''fourth generation'' of 32-bit PowerPC microprocessors. Apple has applied this name to various (though closely related) processor models from Freescale, a former part of ...

, SVE2, vector processors, and

. They are sufficiently important that

LLVM LLVM is a set of compiler and toolchain technologies that can be used to develop a front end for any programming language and a back end for any instruction set architecture. LLVM is designed around a language-independent intermediate repre ...

added the intrinsic and GCC added the intrinsic. GCC's intrinsic matches the functionality of

OpenCL OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-progra ...

's shuffle intrinsics. Note that all of these, mathematically, are not permutations because duplicates can occur in the output.

References

External links

Unofficial convenient list of AVX-512 shuffle instructions

Official Intel AVX-512 ISA Manual

ARM NEON tutorial on data rearrangement
{{Multimedia extensions Binary arithmetic Computer arithmetic

Occurrences of permute instructions

See also

References

External links