Multiply–accumulate Operation
   HOME

TheInfoList



OR:

In
computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
, especially
digital signal processing Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are ...
, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator ''a'': :\ a \leftarrow a + ( b \times c ) When done with
floating point In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be ...
numbers, it might be performed with two
rounding Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation. For example, replacing $ with $, the fraction 312/937 with 1/3, or the expression with . Rounding is often done to obta ...
s (typical in many
DSP DSP may refer to: Computing * Digital signal processing, the mathematical manipulation of an information signal * Digital signal processor, a microprocessor designed for digital signal processing * Yamaha DSP-1, a proprietary digital signal ...
s), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC). Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in
combinational logic In automata theory, combinational logic (also referred to as time-independent logic or combinatorial logic) is a type of digital logic which is implemented by Boolean circuits, where the output is a pure function of the present input only. This i ...
followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers.
Percy Ludgate Percy Edwin Ludgate (2 August 1883 – 16 October 1922) was an Irish amateur scientist who designed the second analytical engine (general-purpose Turing-complete computer) in history. Life Ludgate was born on 2 August 1883 in Skibbereen, ...
was the first to conceive a MAC in his Analytical Machine of 1909, and the first to exploit a MAC for division (using multiplication seeded by reciprocal, via the convergent series ). The first modern processors to be equipped with MAC units were
digital signal processor A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on MOS integrated circuit chips. They are widely used in audio si ...
s, but the technique is now also common in general-purpose processors.


In floating-point arithmetic

When done with
integer An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign (−1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the language ...
s, the operation is typically exact (computed modulo some
power of two A power of two is a number of the form where is an integer, that is, the result of exponentiation with number two as the base and integer  as the exponent. In a context where only integers are considered, is restricted to non-negative ...
). However,
floating-point In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can b ...
numbers have only a certain amount of mathematical
precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...
. That is, digital floating-point arithmetic is generally not
associative In mathematics, the associative property is a property of some binary operations, which means that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a valid rule of replacement f ...
or distributive. (See .) Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add).
IEEE 754-2008 The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
specifies that it must be performed with one rounding, yielding a more accurate result.


Fused multiply–add

A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product , round it to ''N'' significant bits, add the result to ''a'', and round back to ''N'' significant bits, a fused multiply–add would compute the entire expression to its full precision before rounding the final result down to ''N'' significant bits. A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products: *
Dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...
*
Matrix multiplication In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the s ...
* Polynomial evaluation (e.g., with
Horner's rule In mathematics and computer science, Horner's method (or Horner's scheme) is an algorithm for polynomial evaluation. Although named after William George Horner, this method is much older, as it has been attributed to Joseph-Louis Lagrange by Horn ...
) *
Newton's method In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valu ...
for evaluating functions (from the inverse function) *
Convolutions In mathematics (in particular, functional analysis), convolution is a mathematical operation on two functions ( and ) that produces a third function (f*g) that expresses how the shape of one is modified by the other. The term ''convolution'' ...
and
artificial neural networks Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected unit ...
* Multiplication in
double-double arithmetic In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision. This 128-bit quadruple precision is desi ...
Fused multiply–add can usually be relied on to give more accurate results. However,
William Kahan William "Velvel" Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist, who received the Turing Award in 1989 for "''his fundamental contributions to numerical analysis''", was named an ACM Fellow in 1994, and inducte ...
has pointed out that it can give problems if used unthinkingly. If is evaluated as (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the term first) using fused multiply–add, then the result may be negative even when due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated. When implemented inside a
microprocessor A microprocessor is a computer processor where the data processing logic and control is included on a single integrated circuit, or a small number of integrated circuits. The microprocessor contains the arithmetic, logic, and control circu ...
, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2''N''-bit adder to compute the sum properly. Another benefit of including this instruction is that it allows an efficient software implementation of
division Division or divider may refer to: Mathematics *Division (mathematics), the inverse of multiplication *Division algorithm, a method for computing the result of mathematical division Military *Division (military), a formation typically consisting ...
(see
division algorithm A division algorithm is an algorithm which, given two integers N and D, computes their quotient and/or remainder, the result of Euclidean division. Some are applied by hand, while others are employed by digital circuit designs and software. Divis ...
) and
square root In mathematics, a square root of a number is a number such that ; in other words, a number whose ''square'' (the result of multiplying the number by itself, or  ⋅ ) is . For example, 4 and −4 are square roots of 16, because . E ...
(see
methods of computing square roots Methods of computing square roots are numerical analysis algorithms for approximating the principal, or non-negative, square root (usually denoted \sqrt, \sqrt /math>, or S^) of a real number. Arithmetically, it means given S, a procedure for fi ...
) operations, thus eliminating the need for dedicated hardware for those operations.


Dot product instruction

Some machines combine multiple fused multiply add operations into a single step, e.g. performing a four-element dot-product on two 128-bit
SIMD Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should ...
registers a0×b0 + a1×b1 + a2×b2 + a3×b3 with single cycle throughput.


Support

The FMA operation is included in
IEEE 754-2008 The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
. The
Digital Equipment Corporation Digital Equipment Corporation (DEC ), using the trademark Digital, was a major American company in the computer industry from the 1960s to the 1990s. The company was co-founded by Ken Olsen and Harlan Anderson in 1957. Olsen was president unt ...
(DEC)
VAX VAX (an acronym for Virtual Address eXtension) is a series of computers featuring a 32-bit instruction set architecture (ISA) and virtual memory that was developed and sold by Digital Equipment Corporation (DEC) in the late 20th century. The V ...
's POLY instruction is used for evaluating polynomials with
Horner's rule In mathematics and computer science, Horner's method (or Horner's scheme) is an algorithm for polynomial evaluation. Although named after William George Horner, this method is much older, as it has been attributed to Joseph-Louis Lagrange by Horn ...
using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977. The 1999 standard of the
C programming language ''The C Programming Language'' (sometimes termed ''K&R'', after its authors' initials) is a computer programming book written by Brian Kernighan and Dennis Ritchie, the latter of whom originally designed and implemented the language, as well as ...
supports the FMA operation through the fma() standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (). The GCC and
Clang Clang is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages, as well as the OpenMP, OpenCL, RenderScript, CUDA, and HIP frameworks. It acts as a drop-in replacement for the GNU Compiler Collection (GCC), ...
C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma, this can be globally controlled by the -ffp-contract command line option. The fused multiply–add operation was introduced as "multiply–add fused" in the IBM
POWER1 The POWER1 is a Integrated circuit, multi-chip Central processing unit, CPU developed and Semiconductor device fabrication, fabricated by IBM that implemented the IBM POWER Instruction Set Architecture, POWER instruction set, instruction set arc ...
(1990) processor, but has been added to numerous other processors since then: * HP
PA-8000 The PA-8000 (PCX-U), code-named ''Onyx'', is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). Hunt 1995 It was a completely new design with no circuitry deriv ...
(1996) and above *
Hitachi () is a Japanese multinational corporation, multinational Conglomerate (company), conglomerate corporation headquartered in Chiyoda, Tokyo, Japan. It is the parent company of the Hitachi Group (''Hitachi Gurūpu'') and had formed part of the Ni ...
SuperH SH-4 (1998) *
SCE SCE is an abbreviation with multiple meanings: Science * Short-channel effect, a secondary effect describing the reduction in threshold voltage Vth in MOSFETs with non-uniformly doped channel regions as the gate length increases * Saturated calomel ...
-
Toshiba , commonly known as Toshiba and stylized as TOSHIBA, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan. Its diversified products and services include power, industrial and social infrastructure system ...
Emotion Engine The Emotion Engine is a central processing unit developed and manufactured by Sony Computer Entertainment and Toshiba for use in the PlayStation 2 video game console. It was also used in early PlayStation 3 models sold in Japan and North Americ ...
(1999) * Intel
Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance computin ...
(2001) * STI
Cell Cell most often refers to: * Cell (biology), the functional basic unit of life Cell may also refer to: Locations * Monastic cell, a small room, hut, or cave in which a religious recluse lives, alternatively the small precursor of a monastery ...
(2006) *
Fujitsu is a Japanese multinational information and communications technology equipment and services corporation, established in 1935 and headquartered in Tokyo. Fujitsu is the world's sixth-largest IT services provider by annual revenue, and the la ...
SPARC64 VI SPARC64 or sparc64 may refer to: * sparc64, an alternative name used by free software projects for the SPARC V9 instruction set architecture * HAL SPARC64, a microprocessor designed by HAL Computer Systems {{Letter-NumberCombDisambig ...
(2007) and above * ( MIPS-compatible)
Loongson Loongson () is the name of a family of general-purpose, MIPS architecture-compatible microprocessors, as well as the name of the Chinese fabless company (Loongson Technology) that develops them. The processors are alternately called Godson proc ...
-2F (2008) * Elbrus-8SV (2018) * x86 processors with FMA3 and/or FMA4 instruction set ** AMD
Bulldozer A bulldozer or dozer (also called a crawler) is a large, motorized machine equipped with a metal blade to the front for pushing material: soil, sand, snow, rubble, or rock during construction work. It travels most commonly on continuous track ...
(2011, FMA4 only) ** AMD Piledriver (2012, FMA3 and FMA4) ** AMD
Steamroller A steamroller (or steam roller) is a form of road roller – a type of heavy construction machinery used for leveling surfaces, such as roads or airfields – that is powered by a steam engine. The leveling/flattening action is achieved through ...
(2014) ** AMD
Excavator Excavators are heavy construction equipment consisting of a boom, dipper (or stick), bucket and cab on a rotating platform known as the "house". The house sits atop an undercarriage with tracks or wheels. They are a natural progression fro ...
(2015) ** AMD
Zen Zen ( zh, t=禪, p=Chán; ja, text= 禅, translit=zen; ko, text=선, translit=Seon; vi, text=Thiền) is a school of Mahayana Buddhism that originated in China during the Tang dynasty, known as the Chan School (''Chánzong'' 禪宗), and ...
(2017, FMA3 only) ** Intel Haswell (2013, FMA3 only) ** Intel Skylake (2015, FMA3 only) * ARM processors with VFPv4 and/or NEONv2: ** ARM Cortex-M4F (2010) **
ARM Cortex-A5 The ARM Cortex-A5 is a 32-bit processor core licensed by ARM Holdings implementing the ARMv7-A architecture announced in 2009. Overview The Cortex-A5 is intended to replace the ARM9 and ARM11 cores for use in low-end devices. The Cortex-A5 offe ...
(2012) **
ARM Cortex-A7 The ARM Cortex-A7 MPCore is a 32-bit microprocessor core licensed by ARM Holdings implementing the ARMv7-A architecture announced in 2011. Overview It has two target applications; firstly as a smaller, simpler, and more power-efficient success ...
(2013) **
ARM Cortex-A15 The ARM Cortex-A15 MPCore is a 32-bit processor core licensed by ARM Holdings implementing the ARMv7-A architecture. It is a multicore processor with out-of-order superscalar pipeline running at up to 2.5 GHz. Overview ARM has claimed ...
(2012) ** Qualcomm Krait (2012) **
Apple A6 The Apple A6 is a 32-bit package on package (PoP) system on a chip (SoC) designed by Apple Inc. that was introduced on September 12, 2012 at the launch of the iPhone 5. Apple states that it is up to twice as fast and has up to twice the graphics ...
(2012) ** All
ARMv8 ARM (stylised in lowercase as arm, formerly an acronym for Advanced RISC Machines and originally Acorn RISC Machine) is a family of reduced instruction set computer (RISC) instruction set architectures for computer processors, configured ...
processors ***
Fujitsu A64FX The A64FX is a 64-bit ARM architecture microprocessor designed by Fujitsu. The processor is replacing the SPARC64 V as Fujitsu's processor for supercomputer applications. It powers the Fugaku supercomputer, the fastest supercomputer in the wor ...
has "Four-operand FMA with Prefix Instruction". * IBM
z/Architecture z/Architecture, initially and briefly called ESA Modal Extensions (ESAME), is IBM's 64-bit complex instruction set computer (CISC) instruction set architecture, implemented by its mainframe computers. IBM introduced its first z/Architecture-b ...
(since 1998) * GPUs and GPGPU boards: ** AMD GPUs (2009) and newer *** TeraScale 2 "Evergreen"-series based ***
Graphics Core Next Graphics Core Next (GCN) is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was lau ...
-based **
Nvidia GPUs This list contains general information about graphics processing units (GPUs) and video cards from Nvidia, based on official specifications. In addition some Nvidia motherboards come with integrated onboard GPUs. Limited/Special/Collectors' Editio ...
(2010) and newer ***
Fermi Enrico Fermi (; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the "architect of the nuclear age" and ...
-based (2010) ***
Kepler Johannes Kepler (; ; 27 December 1571 – 15 November 1630) was a German astronomer, mathematician, astrologer, natural philosopher and writer on music. He is a key figure in the 17th-century Scientific Revolution, best known for his laws o ...
-based (2012) *** Maxwell-based (2014) ***
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Fren ...
-based (2016) *** Volta-based (2017) ** Intel GPUs since
Sandy Bridge Sandy Bridge is the codename for Intel's 32 nm microarchitecture used in the second generation of the Intel Core processors (Core i7, i5, i3). The Sandy Bridge microarchitecture is the successor to Nehalem and Westmere microarchitecture ...
**
Intel MIC Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programm ...
(2012) ** ARM Mali T600 Series (2012) and above * Vector Processors: **
NEC SX-Aurora TSUBASA The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, termed by NEC as a "Vector Engine" (VE). Eight VE cards can be inserted into a v ...
*
RISC-V RISC-V (pronounced "risk-five" where five refers to the number of generations of RISC architecture that were developed at the University of California, Berkeley since 1981) is an open standard instruction set architecture (ISA) based on estab ...
instruction set (2010)


References

{{DEFAULTSORT:Multiply-accumulate operation Computer arithmetic Digital signal processing