In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ''vectors''. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR arithmetic units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s.
History
Early work
Vector processing development began in the early 1960s at Westinghouse in their "Solomon" project. Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array.
In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorised this type of processing as an early form of SIMT.
A computer for operations with functions was presented and developed by Kartsev in 1967.
Supercomputers
The first vector supercomputers were the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively.
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks it could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions; for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called ''vector chaining''. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era.
Other examples followed. Control Data Corporation tried to re-enter the high-end market with its ETA-10 machine, but it sold poorly, and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller.
Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers.
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer, with the PC-compatible computer into which it is plugged serving support functions.
GPU
Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels, and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper, the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably normal: they have their own ALUs, their own register files, their own load/store units and their own independent L1 data caches. Thus, although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This is ''significantly'' more complex and involved than "Packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.
Recent development
Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as the early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC.
Comparison with modern architectures
Most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition, the addition of SIMD cannot by itself qualify a processor ''as'' an actual vector processor, because SIMD is fixed-length and vectors are variable-length. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.
* Pure (fixed) SIMD - also known as "Packed SIMD", SIMD within a Register (SWAR), and Pipelined Processor in Flynn's Taxonomy. Common examples using SIMD with features inspired ''by'' vector processors include Intel x86's MMX, SSE and AVX instructions, AMD's 3DNow! extensions, ARM NEON, Sparc's VIS extension, PowerPC's AltiVec and MIPS' MSA. In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor, which is also SIMD.
* Predicated SIMD - also known as associative processing. Two notable examples which have per-element (lane-based) predication are ARM SVE2 and AVX-512.
* Pure Vectors - as categorised in Duncan's taxonomy - these include the original Cray-1, RISC-V RVV and the SX-Aurora TSUBASA. Although memory-based, the STAR-100 was also a vector processor.
Other CPU designs include some multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word). The Fujitsu FR-V VLIW/''vector processor'' combines both technologies.
Difference between SIMD and vector processor
SIMD instruction sets lack crucial features when compared to vector processor instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.
Where pure (fixed-width, no predication) SIMD is commonly mistakenly claimed to be "vectors" (because SIMD processes data which ''happens'' to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector processors may be observed to have the following features that no SIMD ISA has:
* a way to set the vector length (such as the setvl instruction in RISC-V RVV) or providing a REP (instruction repeating) feature in some form, without limiting repeats to a power of two
* Iteration and reduction over elements ''within'' vectors. RISC-V vectors as of version 0.10 have reduction only, whilst the SX-Aurora and later Cray systems have iteration as well as reduction.
Predicated SIMD (part of Flynn's taxonomy), which provides comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2 and AVX-512, almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions.
SIMD, because it is fixed-width batch processing, is ''unable by design'' to cope with iteration and reduction. This is illustrated further with examples, below.
Additionally, vector processors can be more resource-efficient (use slower hardware, saving power, but still achieving throughput) and have less latency than SIMD, through
vector chaining.
Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.
Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256-bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, is likewise costly. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model, the consequences are that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when ''all four'' LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
A vector processor, by contrast, ''even if it is single-issue'' and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilisation, due to vector chaining, is a key advantage and difference compared to SIMD. SIMD ''by design and definition'' cannot perform chaining except to the entire group of results.
Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this
memory latency has historically become a large impediment to performance; see Memory wall.
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the ''latency'', but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction ''itself'' that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
; Hypothetical RISC machine
; add 10 numbers in a to 10 numbers in b, storing results in c
; assume a, b, and c are memory locations in their respective registers
move $10, count ; count := 10
loop:
load r1, a
load r2, b
add r3, r1, r2 ; r3 := r1 + r2
store r3, c
add a, a, $4 ; move on
add b, b, $4
add c, c, $4
dec count ; decrement
jnez count, loop ; loop back if count is not yet 0
ret
But to a vector processor, this task looks considerably different:
; assume we have vector registers v1-v3
; with size equal or larger than 10
move $10, count ; count = 10
vload v1, a, count
vload v2, b, count
vadd v3, v1, v2
vstore v3, c, count
ret
Note the complete lack of looping in the instructions, because it is the ''hardware'' which has performed 10 sequential operations: effectively the loop count is on an explicit ''per-instruction'' basis.
Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL):
; again assume we have vector registers v1-v3
; with size larger than or equal to 10
setvli $10 # Set vector length VL=10
vload v1, a # 10 loads from a
vload v2, b # 10 loads from b
vadd v3, v1, v2 # 10 adds
vstore v3, c # 10 stores into c
ret
There are several savings inherent in this approach.
# Only three address translations are needed. Depending on the architecture, this can represent a significant savings by itself.
# Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten.
# The code itself is also smaller, which can lead to more efficient memory use, reduction in L1 instruction cache size, reduction in power consumption.
# With the program size being reduced branch prediction has an easier job.
# With the length (equivalent to SIMD width) not being hard-coded into the instruction, not only is the encoding more compact, it's also "future-proof" and allows even embedded processor designs to consider using vectors purely to gain all the other advantages, rather than go for high performance.
Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages.
But more than that, a high performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus complete far faster overall, the limiting factor being the time required to fetch the data from memory.
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes ''other'' instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. (''This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features.'')
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV, the ''efficiency'' of vector ISAs brings other benefits which are compelling even for embedded use-cases.
Vector instructions
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as the vector length.
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like vadd c, a, b, $10). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too.
Interestingly, though, Broadcom included space in all vector operations of the Videocore IV ISA for a REP field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small power of two or sourced from one of the scalar registers.
The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with a special instruction, the significance compared to Videocore IV (and, crucially, as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs, which face exactly the same issue.
Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and use those same masks to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (AVX-512, ARM SVE2) are capable of this kind of selective, per-element ("predicated") processing, and it is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication (MMX, SSE, AltiVec) categorically do not.
Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units.
In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do a pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads".
Vector instruction example
This example starts with an algorithm ("IAXPY"), first showing it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate the difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C:
void iaxpy(size_t n, int a, const int x[], int y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability.
Scalar assembler
The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop:
loop:
load32 r1, x ; load one 32bit data
load32 r2, y
mul32 r1, a, r1 ; r1 := r1 * a
add32 r3, r1, r2 ; r3 := r1 + r2
store32 r3, y
addl x, x, $4 ; x := x + 4
addl y, y, $4
subl n, n, $1 ; n := n - 1
jgz n, loop ; loop back if n > 0
out:
ret
The STAR-like code remains concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information, and twice the latency is incurred due to the extra memory accesses.
; Assume tmp is pre-allocated
vmul tmp, a, x, n ; tmp = a * x
vadd y, y, tmp, n ; y = y + tmp
ret
Pure (non-predicated, packed) SIMD
A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used to copy the scalar argument across a SIMD register:
splatx4 v4, a ; v4 = a,a,a,a
The time taken would be basically the same as for the vector implementation of the same loop described above.
vloop:
load32x4 v1, x
load32x4 v2, y
mul32x4 v1, a, v1 ; v1 := v1 * a
add32x4 v3, v1, v2 ; v3 := v1 + v2
store32x4 v3, y
addl x, x, $16 ; x := x + 16
addl y, y, $16
subl n, n, $4 ; n := n - 4
jgz n, vloop ; go back if n > 0
out:
ret
Note that both x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm ''shall'' only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD in order to cope with non-multiples of the SIMD width, can far exceed the instruction count inside the loop itself. Assuming worst-case that the hardware cannot do misaligned SIMD memory accesses, a real-world algorithm will:
* first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will either involve (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm inner loop.
* perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
* have a cleanup phase which, like the preparatory section, is just as large and just as complex.
Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
This more than ''triples'' the size of the code; in fact, in extreme cases it results in an ''order of magnitude'' increase in instruction count! This can easily be demonstrated by compiling the iaxpy example for AVX-512, using auto-vectorisation options to gcc.
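To make the shape of the problem concrete, here is a minimal C sketch (illustrative, not from the original article; the function name is hypothetical) of what fixed-width SIMD forces on the programmer or compiler: a 4-wide body plus a scalar cleanup tail, with the inner 4-lane loop standing in for one 4-wide SIMD instruction.

#include <stddef.h>

/* Sketch of the fixed-width SIMD pattern for iaxpy (y := a*x + y).
   The inner 4-lane loop stands in for a single 4-wide SIMD instruction. */
void iaxpy_simd4_sketch(size_t n, int a, const int x[], int y[])
{
    size_t i = 0;
    /* main body: whole groups of 4 elements */
    for (; i + 4 <= n; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            y[i + lane] = a * x[i + lane] + y[i + lane];
    /* cleanup: up to 3 leftover elements fall back to scalar code;
       wider SIMD needs a correspondingly longer cascade of tails */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}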
Over time, as the ISA evolves to keep increasing performance, ISA architects end up adding 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why AVX-512 exists in x86.
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.
Vector processors, on the other hand, are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup and no cleanup. Even compared to those SIMD ISAs which have masks (but no setvl instruction), vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
Predicated SIMD
Assuming a hypothetical predicated (mask capable) SIMD ISA, and again assuming that the SIMD instructions can cope with misaligned data, the instruction loop would look like this:
vloop:
  # prepare mask. few ISAs have min though
  min t0, n, $4        ; t0 = min(n, 4)
  shift m, $1, t0      ; m = 1<<t0
  sub m, m, $1         ; m = (1<<t0)-1
  # now do the operation, masked by m bits
  load32x4 v1, x, m
  load32x4 v2, y, m
  mul32x4 v1, a, v1, m ; v1 := v1 * a
  add32x4 v3, v1, v2, m ; v3 := v1 + v2
  store32x4 v3, y, m
  # update x, y and n for next loop
  addl x, x, t0*4
  addl y, y, t0*4
  subl n, n, t0
  jgz n, vloop         ; go back if n > 0
out:
  ret
Here it can be seen that the code is much cleaner but a little complex: at least, however, there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being performed, respectively.
One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or scalar predicated compare.
It is clear how predicated SIMD at least merits the term "vector capable", because it can cope with variable-length vectors by using predicate masks. The final evolving step to a "true" vector ISA, however, is to not have any evidence in the ISA ''at all'' of a SIMD width, leaving that entirely up to the hardware.
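As a rough model of what per-element predication means, consider this C sketch (illustrative only; the function name is hypothetical): each bit of the mask enables or disables one lane of a fixed 4-wide operation.

/* One predicated 4-wide step of iaxpy: mask bit i enables lane i.
   A real predicated SIMD ALU performs all four lanes in parallel. */
void masked_iaxpy4(unsigned mask, int a, const int x[4], int y[4])
{
    for (int lane = 0; lane < 4; lane++)
        if (mask & (1u << lane))   /* per-element predicate test */
            y[lane] = a * x[lane] + y[lane];
}

With mask = (1 << min(n, 4)) - 1, as in the assembly above, the final partial group is handled without any separate cleanup loop.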
Pure (true) vector ISA
For Cray-style vector ISAs such as RVV, an instruction called "setvl" (set vector length) is used. The hardware first defines how many data values it can process in one "vector": this could be either actual registers or it could be an internal loop (the hybrid approach, mentioned above). This maximum amount (the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). Note that, as seen in SX-Aurora and Videocore IV, MVL may be an actual hardware lane quantity ''or a virtual one''. ''(Note: As mentioned in the ARM SVE2 Tutorial, programmers must not make the mistake of assuming a fixed vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset.)''
On calling setvl with the number of outstanding data elements to be processed, setvl is permitted (essentially required) to limit that to the Maximum Vector Length (MVL), and thus returns the ''actual'' number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.
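In C terms, the resulting "vector length agnostic" loop looks like the following sketch, where setvl is modelled as a hypothetical function clamping the request to MVL (MVL and the names are illustrative; real hardware would execute each batch as single vector instructions):

#include <stddef.h>

enum { MVL = 8 };  /* hardware lane count; the program never depends on it */

/* model of the setvl instruction: clamp the request to MVL and return
   the actual vector length the hardware will process this iteration */
static size_t setvl(size_t requested) {
    return requested < MVL ? requested : MVL;
}

void iaxpy_vla_sketch(size_t n, int a, const int x[], int y[])
{
    while (n > 0) {
        size_t vl = setvl(n);            /* VL = min(MVL, n) */
        for (size_t i = 0; i < vl; i++)  /* one vector instruction's worth */
            y[i] = a * x[i] + y[i];
        x += vl;  y += vl;  n -= vl;     /* advance by the actual VL */
    }
}

Note that if n starts at zero, the loop body never executes at all, which corresponds to the "vector length of zero disables all vector instructions" behaviour described later.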
Below is the Cray-style vector assembler for the same SIMD style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:
vloop:
setvl t0, n # VL=t0=min(MVL, n)
vld32 v0, x # load vector x
vld32 v1, y # load vector y
vmadd32 v1, v0, a # v1 += v0 * a
vst32 v1, y # store Y
add y, t0*4 # advance y by VL*4
add x, t0*4 # advance x by VL*4
sub n, t0 # n -= VL (t0)
bnez n, vloop # repeat if n != 0
This is essentially not very different from the SIMD version (processes 4 data elements per loop), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is ''going'' to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
A number of things to note, when comparing against the Predicated SIMD assembly variant:
# The setvl instruction has embedded within it a min instruction
# Where the SIMD variant hard-coded both the width (4) into the creation of the mask ''and'' in the SIMD width (load32x4 etc.), the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
# Setting VL effectively ''creates a hidden predicate mask'' that is automatically applied to the vectors
# Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in 1977).
Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.
Also note that, just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, but that n is decremented by t0 itself. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by the hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes in the realisation that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000; it would be the exact same vector assembler for all of them, ''and there would still be no SIMD cleanup code''. Even compared to the predicate-capable SIMD, it is still more compact, clearer, more elegant and uses less resources.
Not only is it a much more compact program (saving on L1 Cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because Instruction Decode and Issue can sit idle.
Additionally, the number of elements going in to the function can start at zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into no-ops at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.
Vector reduction example
This example starts with an algorithm which involves reduction. Just as with the previous example, it will first be shown in scalar instructions, then SIMD, and finally vector instructions, starting in C:
int sum(size_t n, int a, const int x[]) { // function name illustrative; a is unused in this reduction
    int y = 0;
    for (size_t i = 0; i < n; i++)
        y += x[i];
    return y;
}
Here, an accumulator (y) is used to sum up all the values in the array, x.
Scalar assembler
The scalar version of this would load each of x, add it to y, and loop:
set y, 0 ; y initialised to zero
loop:
load32 r1, x ; load one 32bit data
add32 y, y, r1 ; y := y + r1
addl x, x, $4 ; x := x + 4
subl n, n, $1 ; n := n - 1
jgz n, loop ; loop back if n > 0
out:
ret y ; returns result, y
This is very straightforward. "y" starts at zero, 32-bit integers are loaded one at a time into r1, added to y, and the address of the array "x" is moved on to the next element in the array.
SIMD reduction
This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may not be added to anything other than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:
addl r3, x, $16 ; for 2nd 4 of x
load32x4 v1, x ; first 4 of x
load32x4 v2, r3 ; 2nd 4 of x
add32x4 v1, v2, v1 ; add 2 groups
At this point four adds have been performed:
* - First SIMD ADD: element 0 of first group added to element 0 of second group
* - Second SIMD ADD: element 1 of first group added to element 1 of second group
* - Third SIMD ADD: element 2 of first group added to element 2 of second group
* - Fourth SIMD ADD: element 3 of first group added to element 3 of second group
but with 4-wide SIMD being incapable by design of adding, for example, element 0 to element 1 of the ''same'' register, things go rapidly downhill, just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples can be found online of how to do "Horizontal Sum" with AVX-512.
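As a hedged illustration of the shuffle technique (using the older x86 SSE2 intrinsics here rather than AVX-512, purely for brevity), the following sums exactly eight 32-bit integers, matching the n = 8 assumption above:
 #include <emmintrin.h>  /* SSE2 intrinsics */
 #include <stdint.h>
 
 int32_t sum8_sse2(const int32_t x[8])
 {
     __m128i v1 = _mm_loadu_si128((const __m128i *)&x[0]);  /* first 4 of x */
     __m128i v2 = _mm_loadu_si128((const __m128i *)&x[4]);  /* second 4 of x */
     __m128i s  = _mm_add_epi32(v1, v2);                    /* 4 partial sums */
     /* the "horizontal" part: lanes cannot be added to each other directly,
        so they must be shuffled alongside one another and added again */
     s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
     s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
     return _mm_cvtsi128_si32(s);   /* move lane 0 out to a scalar register */
 }
Note how two extra shuffle/add steps and a final extract are needed purely to move data between lanes: work that a vector ISA's single vredadd32 (below) performs internally.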
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
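A minimal, self-contained C demonstration of this ordering effect, with values contrived so the difference is visible in 32-bit floats:
 #include <stdio.h>
 
 int main(void)
 {
     /* contrived: 1.0f is below the rounding granularity of 1e8f */
     float x[4] = { 1e8f, 1.0f, -1e8f, 1.0f };
 
     /* strict sequential order, as the scalar loop would sum */
     float seq = ((x[0] + x[1]) + x[2]) + x[3];   /* yields 1.0 */
 
     /* tree order, as a parallel (partial-result) reduction might sum */
     float tree = (x[0] + x[1]) + (x[2] + x[3]);  /* yields 0.0 */
 
     printf("sequential=%g tree=%g\n", seq, tree);
     return 0;
 }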
Vector ISA reduction
Vector instruction sets have arithmetic reduction operations ''built-in'' to the ISA. If it is assumed that n is less than or equal to the maximum vector length, only three instructions are required:
setvl t0, n # VL=t0=min(MVL, n)
vld32 v0, x # load vector x
vredadd32 y, v0 # reduce-add into y
The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY").
set y, 0
vloop:
setvl t0, n # VL=t0=min(MVL, n)
vld32 v0, x # load vector x
vredadd32 y, y, v0 # add all x into y
add x, t0*4 # advance x by VL*4
sub n, t0 # n -= VL (t0)
bnez n, vloop # repeat if n != 0
ret y
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where maximum vector length could be only one).
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is critical.
This example again highlights a fundamental difference between true vector processors and those SIMD processors (including most commercial GPUs) which are merely inspired by features of vector processors.
Insights from examples
Compared to any SIMD processor claiming to be a vector processor, the order-of-magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
# From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being successful, regardless of alignment of the start of the vector.
# Whilst from the reduction example it can be seen that, aside from permute instructions, SIMD by definition avoids inter-lane operations entirely (element 0 can only be added to another element 0), vector processors tackle this head-on. What programmers are forced to do in software (using shuffle and other tricks to swap data into the right "lane"), vector processors must do in hardware, automatically.
Overall then there is a choice to either have
# complex software and simplified hardware (SIMD)
# simplified software and complex hardware (vector processors)
These stark differences are what distinguish a vector processor from one that merely has SIMD.
Vector processor features
While many SIMD ISAs borrow from or are inspired by the list below, the typical features of a vector processor are:
* Vector Load and Store – Vector architectures with a register-to-register design (analogous to load–store architectures for scalar processors) have instructions for transferring multiple elements between the memory and the vector registers. Typically, multiple addressing modes are supported. The unit-stride addressing mode is essential; modern vector architectures typically also support arbitrary constant strides, as well as the scatter/gather (also called ''indexed'') addressing mode. Advanced architectures may also include support for ''segment'' loads and stores, and ''fail-first'' variants of the standard vector loads and stores. Segment loads read a vector from memory where each element is a data structure containing multiple members; the members are extracted from each structure element, and each extracted member is placed into a different vector register (a scalar C model of this is sketched after this list).
* Masked Operations – predicate masks allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorised.
* Compress and Expand – usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in AVX-512.
* Register Gather, Scatter (aka permute) – a less restrictive, more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand and, being inherently non-sequential, can interfere with vector chaining. Not to be confused with Gather-scatter Memory Load/Store modes, Gather/scatter vector operations act on the vector registers, and are often termed a permute instruction instead.
* Splat and Extract – useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
* Iota – a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
* Reduction and Iteration – operations that perform mapreduce on a vector (for example, find the one maximum value of an entire vector, or sum all elements). Iteration is of the form x[i] = y[i] + x[i-1], whereas Reduction is of the form x = y[0] + y[1] + … + y[n-1].
* Matrix Multiply support – either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions, although for arbitrary matrix widths that do not fit the exact SIMD size, data repetition techniques are needed, which is wasteful of register file resources. NVidia provides a high-level Matrix CUDA API, although the internal details are not available. The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
* Advanced Math formats – often includes Galois field arithmetic, but can include binary-coded decimal or decimal fixed-point, and support for much larger (arbitrary precision) arithmetic operations by supporting parallel carry-in and carry-out.
* Bit manipulation – including vectorised versions of bit-level permutation operations, bitfield insert and extract, centrifuge operations, population count, and many others.
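As promised above, a scalar C model of a ''segment'' load (all names are illustrative; the loop models what the hardware performs in parallel, one vector register per structure member):
 #include <stddef.h>
 #include <stdint.h>
 
 /* each element in memory is a structure of three members (e.g. RGB) */
 typedef struct { uint8_t r, g, b; } pixel_t;
 
 /* a segment load splits the members into three separate vectors */
 void segment_load_rgb(size_t vl, const pixel_t *mem,
                       uint8_t vr[], uint8_t vg[], uint8_t vb[])
 {
     for (size_t i = 0; i < vl; i++) {  /* hardware does this in parallel */
         vr[i] = mem[i].r;  /* member 0 -> vector register 0 */
         vg[i] = mem[i].g;  /* member 1 -> vector register 1 */
         vb[i] = mem[i].b;  /* member 2 -> vector register 2 */
     }
 }
Segment stores perform the inverse, interleaving the separate vector registers back into the structure layout in memory.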
GPU vector processing features
With many 3D shader applications needing trigonometric operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW), support for the following is typically present in modern GPUs, in addition to those found in vector processors:
* Sub-vectors – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL"). Subvectors are a critical integral part of the Vulkan SPIR-V spec.
* Sub-vector Swizzle – aka "Lane Shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and also saves predicate mask bits. Effectively this is an in-flight mini-permute of the sub-vector; it features heavily in 3D Shader binaries and is sufficiently important as to be part of the Vulkan SPIR-V spec (a short C model is sketched after this list). The Broadcom Videocore IV uses the terminology "Lane rotate" where the rest of the industry uses the term "swizzle".
* Transcendentals – trigonometric operations such as sine, cosine and logarithm obviously feature much more predominantly in 3D than in many demanding HPC workloads. Of interest, however, is that in 3D speed is far more important than accuracy for GPUs, as computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is also explored in the MIPS-3D extension.
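As a minimal illustration (hypothetical C, not any particular GPU ISA), a sub-vector swizzle amounts to an in-flight lane permute of a vec4:
 #include <stdint.h>
 
 /* a vec4 as found in shader languages */
 typedef struct { float x, y, z, w; } vec4;
 
 /* reorder the four lanes of v according to a 4-entry selector, the way
    shader source writes v.wzyx or v.xxyy; sel[i] picks which source
    lane feeds destination lane i */
 static vec4 swizzle(vec4 v, const uint8_t sel[4])
 {
     const float lanes[4] = { v.x, v.y, v.z, v.w };
     vec4 out = { lanes[sel[0]], lanes[sel[1]], lanes[sel[2]], lanes[sel[3]] };
     return out;
 }
 
 /* usage: const uint8_t wzyx[4] = { 3, 2, 1, 0 };
           vec4 r = swizzle(v, wzyx); */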
Fault (or Fail) First
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential Vector Loads. ARM SVE2 has a special register named the "First Fault Register", whereas RVV instead modifies (truncates) the Vector Length (VL).
The basic principle of fault-first loading is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the ''actual'' amount loaded, either to the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient. The important factor is that ''subsequent'' instructions are notified, or may determine, exactly how many Loads actually succeeded, using that quantity to carry out work only on the data that has actually been loaded.
Contrast this situation with SIMD, which has a fixed (inflexible) load width and a fixed data-processing width, and is unable to cope with loads that cross page boundaries; even if it could cope, it would be unable to adapt to what actually succeeded. Paradoxically, if the SIMD program were even to attempt to find out in advance (in each inner loop, every time) what might optimally succeed, those instructions would only serve to hinder performance because they would, by necessity, be part of the critical inner loop.
This begins to hint at the reason why fault-first is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated SIMD. For IBM POWER9 the number of hand-optimised instructions to implement strncpy is in excess of 240.
By contrast, the same strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.
The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values; it could also cause significant numbers of page faults or misaligned faults by similarly crossing over boundaries. In contrast, by allowing the vector architecture the freedom to decide how many elements to load, the first part of a strncpy, if beginning initially on a sub-optimal memory boundary, may return just enough loads such that on ''subsequent'' iterations of the loop the batches of vectorised memory reads are optimally aligned with the underlying caches and virtual memory arrangements. Additionally, the hardware may choose to use the opportunity to end any given loop iteration's memory reads ''exactly'' on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next virtual memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself. (See the ARM SVE2 paper by N. Stevens.)
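To make the idea concrete, here is a pure-C model of the fault-first loop structure. The helper ffirst_span and its page-boundary policy are illustrative assumptions, not any real ISA's semantics:
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
 
 #define PAGE 4096  /* assumed page size, for illustration */
 
 /* models the hardware's freedom: truncate each load so it never
    crosses a (simulated) page boundary, always returning > 0 */
 static size_t ffirst_span(const char *p, size_t want)
 {
     size_t to_page = PAGE - ((uintptr_t)p % PAGE);
     return want < to_page ? want : to_page;
 }
 
 void memcpy_ffirst(char *dst, const char *src, size_t n)
 {
     while (n > 0) {
         size_t vl = ffirst_span(src, n);  /* hardware-chosen element count */
         memcpy(dst, src, vl);             /* the "vector" load and store */
         src += vl;  dst += vl;  n -= vl;  /* advance by what succeeded */
     }
 }
Real hardware additionally reports the truncated length to subsequent instructions (SVE2 via the First Fault Register, RVV by updating VL); this model folds that report into the return value of ffirst_span.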
Performance and speed up
Let ''r'' be the vector speed ratio and ''f'' be the vectorisation ratio. If the time taken for the vector unit to add an array of 64 numbers is 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of operations in a program is 100, out of which only 10 are scalar (after vectorisation), then f = 0.9, i.e., 90% of the work is done by the vector unit. The achievable speed-up then follows: