Hardware acceleration is the use of

computer hardware Computer hardware includes the physical parts of a computer, such as the computer case, case, central processing unit (CPU), Random-access memory, random access memory (RAM), Computer monitor, monitor, Computer mouse, mouse, Computer keyboard, ...

designed to perform specific functions more efficiently when compared to

software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...

running on a general-purpose

central processing unit A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...

(CPU). Any

transformation Transformation may refer to: Science and mathematics In biology and medicine * Metamorphosis, the biological process of changing physical form after birth or hatching * Malignant transformation, the process of cells becoming cancerous * Trans ...

data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...

that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both. To perform computing tasks more quickly (or better in some other way), generally one can invest time and money in improving the software, improving the hardware, or both. There are various approaches with advantages and disadvantages in terms of decreased latency, increased

throughput Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network. The data that these messages contain may be delivered ov ...

and reduced

energy consumption Energy consumption is the amount of energy used. Biology In the body, energy consumption is part of energy homeostasis. It derived from food energy. Energy consumption in the body is a product of the basal metabolic rate and the physical activity ...

. Typical advantages of focusing on software may include more rapid

development Development or developing may refer to: Arts *Development hell, when a project is stuck in development *Filmmaking, development phase, including finance and budgeting *Development (music), the process thematic material is reshaped * Photograph ...

, lower

non-recurring engineering Non-recurring engineering (NRE) cost refers to the one-time cost to research, design, develop and test a new product or product enhancement. When budgeting for a new product, NRE must be considered to analyze if a new product will be profitable. ...

costs, heightened portability, and ease of updating features or

patch Patch or Patches may refer to: Arts, entertainment and media * Patch Johnson, a fictional character from ''Days of Our Lives'' * Patch (''My Little Pony''), a toy * "Patches" (Dickey Lee song), 1962 * "Patches" (Chairmen of the Board song) ...

ing bugs, at the cost of overhead to compute general operations. Advantages of focusing on hardware may include

speedup In computer architecture, speedup is a number that measures the relative performance of two systems processing the same problem. More technically, it is the improvement in speed of execution of a task executed on two similar architectures with d ...

, reduced

power consumption Electric energy consumption is the form of energy consumption that uses electrical energy. Electric energy consumption is the actual energy demand made on existing electricity supply for transportation, residential, industrial, commercial, and o ...

, lower latency, increased parallelism and

bandwidth Bandwidth commonly refers to: * Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range * Bandwidth (computing), the rate of data transfer, bit rate or thr ...

, and better utilization of area and functional components available on an

integrated circuit An integrated circuit or monolithic integrated circuit (also referred to as an IC, a chip, or a microchip) is a set of electronic circuits on one small flat piece (or "chip") of semiconductor material, usually silicon. Large numbers of tiny ...

; at the cost of lower ability to update designs once etched onto silicon and higher costs of

functional verification In electronic design automation, functional verification is the task of verifying that the logic design conforms to specification. Functional verification attempts to answer the question "Does this proposed design do what is intended?" This is a ...

, and times to market. In the hierarchy of digital computing systems ranging from general-purpose processors to fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by

orders of magnitude An order of magnitude is an approximation of the logarithm of a value relative to some contextually understood reference value, usually 10, interpreted as the base of the logarithm and the representative of values of magnitude one. Logarithmic dis ...

when any given application is implemented higher up that hierarchy. This hierarchy includes general-purpose processors such as CPUs, more specialized processors such as

GPUs A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobil ...

fixed-function Fixed-function is a term canonically used to contrast 3D graphics APIs and earlier GPUs designed prior to the advent of shader-based 3D graphics APIs and GPU architectures. History Historically fixed-function APIs consisted of a set of function ...

implemented on

field-programmable gate array A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturinghence the term '' field-programmable''. The FPGA configuration is generally specified using a hardware d ...

s (FPGAs), and fixed-function implemented on

application-specific integrated circuit An application-specific integrated circuit (ASIC ) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use, such as a chip designed to run in a digital voice recorder or a high-efficie ...

s (ASICs). Hardware acceleration is advantageous for

performance A performance is an act of staging or presenting a play, concert, or other form of entertainment. It is also defined as the action or process of carrying out or accomplishing an action, task, or function. Management science In the work place ...

, and practical when the functions are fixed so updates are not as needed as in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing

control flow In computer science, control flow (or flow of control) is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. The emphasis on explicit control flow distinguishes an ''imper ...

. The disadvantage however, is that in many open source projects, it requires proprietary libraries that not all vendors are keen to distribute or expose, making it difficult to integrate in such projects.

Overview

Integrated circuits can be created to perform arbitrary operations on

analog Analog or analogue may refer to: Computing and electronics * Analog signal, in which information is encoded in a continuous variable ** Analog device, an apparatus that operates on analog signals *** Analog electronics, circuits which use analog ...

and digital signals. Most often in computing, signals are digital and can be interpreted as

binary number A binary number is a number expressed in the base-2 numeral system or binary numeral system, a method of mathematical expression which uses only two symbols: typically "0" (zero) and "1" ( one). The base-2 numeral system is a positional notatio ...

data. Computer hardware and software operate on information in binary representation to perform

computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...

; this is accomplished by calculating

boolean function In mathematics, a Boolean function is a function whose arguments and result assume values from a two-element set (usually , or ). Alternative names are switching function, used especially in older computer science literature, and truth function ( ...

s on the

bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...

s of input and outputting the result to some

output device An output device is any piece of computer hardware equipment which converts information into a human-perceptible form or, historically, into a physical machine-readable form for use with other non-computerized equipment. It can be text, graphics, ...

downstream for storage or further processing.

Computational equivalence of hardware and software

Because all

Turing machines A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...

can run any

computable function Computable functions are the basic objects of study in computability theory. Computable functions are the formalized analogue of the intuitive notion of algorithms, in the sense that a function is computable if there exists an algorithm that can do ...

, it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can be always used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software.

Hardware description language In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, and most commonly, digital logic circuits. A hardware description language e ...

s (HDLs) such as

Verilog Verilog, standardized as IEEE 1364, is a hardware description language (HDL) used to model electronic systems. It is most commonly used in the design and verification of digital circuits at the register-transfer level of abstraction. It is als ...

and

VHDL The VHSIC Hardware Description Language (VHDL) is a hardware description language (HDL) that can model the behavior and structure of digital systems at multiple levels of abstraction, ranging from the system level down to that of logic gates ...

can model the same

semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy Philosophy (f ...

as software and synthesize the design into a

netlist In electronic design, a netlist is a description of the connectivity of an electronic circuit. In its simplest form, a netlist consists of a list of the electronic components in a circuit and a list of the nodes they are connected to. A network ...

that can be programmed to an FPGA or composed into the

logic gate A logic gate is an idealized or physical device implementing a Boolean function, a logical operation performed on one or more binary inputs that produces a single binary output. Depending on the context, the term may refer to an ideal logic gate, ...

s of an ASIC.

Stored-program computers

The vast majority of software-based computing occurs on machines implementing the

von Neumann architecture The von Neumann architecture — also known as the von Neumann model or Princeton architecture — is a computer architecture based on a 1945 description by John von Neumann, and by others, in the ''First Draft of a Report on the EDVAC''. The ...

, collectively known as

stored-program computer A stored-program computer is a computer that stores program instructions in electronically or optically accessible memory. This contrasts with systems that stored the program instructions with plugboards or similar mechanisms. The definition i ...

Computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer program ...

s are stored as data and

executed Capital punishment, also known as the death penalty, is the state-sanctioned practice of deliberately killing a person as a punishment for an actual or supposed crime, usually following an authorized, rule-governed process to conclude that t ...

processor Processor may refer to: Computing Hardware * Processor (computing) **Central processing unit (CPU), the hardware within a computer that executes a program *** Microprocessor, a central processing unit contained on a single integrated circuit (I ...

s. Such processors must fetch and decode instructions, as well as loading data operands from

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered, ...

(as part of the

instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instruction ...

) to execute the instructions constituting the software program. Relying on a common

cache Cache, caching, or caché may refer to: Places United States * Cache, Idaho, an unincorporated community * Cache, Illinois, an unincorporated community * Cache, Oklahoma, a city in Comanche County * Cache, Utah, Cache County, Utah * Cache County ...

for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the

modified Harvard architecture The modified Harvard architecture is a variation of the Harvard computer architecture that, unlike the pure Harvard architecture, allows the contents of the instruction memory to be accessed as data. Most modern computers that are documented as ...

, where instructions and data have separate caches in the memory hierarchy, there is overhead to decoding instruction

opcode In computing, an opcode (abbreviated from operation code, also known as instruction machine code, instruction code, instruction syllable, instruction parcel or opstring) is the portion of a machine language instruction that specifies the operat ...

s and

multiplexing In telecommunications and computer networking, multiplexing (sometimes contracted to muxing) is a method by which multiple analog or digital signals are combined into one signal over a shared medium. The aim is to share a scarce resource - a ...

available

execution unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...

s on a

microprocessor A microprocessor is a computer processor where the data processing logic and control is included on a single integrated circuit, or a small number of integrated circuits. The microprocessor contains the arithmetic, logic, and control circu ...

microcontroller A microcontroller (MCU for ''microcontroller unit'', often also MC, UC, or μC) is a small computer on a single VLSI integrated circuit (IC) chip. A microcontroller contains one or more CPUs (processor cores) along with memory and programmable i ...

, leading to low circuit utilization. Modern processors that provide

simultaneous multithreading Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern process ...

exploit under-utilization of available processor functional units and

instruction level parallelism Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution. Disc ...

between different hardware threads.

Hardware execution units

Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an

and incur those stages' overhead. If needed calculations are specified in a

register transfer level In digital circuit design, register-transfer level (RTL) is a design abstraction which models a synchronous digital circuit in terms of the flow of digital signals (data) between hardware registers, and the logical operations performed on those ...

(RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses. This reclamation saves time, power and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication or memory, as well as increased

input/output In computing, input/output (I/O, or informally io or IO) is the communication between an information processing system, such as a computer, and the outside world, possibly a human or another information processing system. Inputs are the signals ...

capabilities. This comes at the cost of general-purpose utility.

Emerging hardware architectures

Greater RTL customization of hardware designs allows emerging architectures such as

in-memory computing In computer science, in-memory processing is an emerging technology for processing of data stored in an in-memory database. In-memory processing is one method of addressing the performance and power bottlenecks caused by the movement of data be ...

transport triggered architecture In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data i ...

s (TTA) and networks-on-chip (NoC) to further benefit from increased

locality Locality may refer to: * Locality (association), an association of community regeneration organizations in England * Locality (linguistics) * Locality (settlement) * Suburbs and localities (Australia), in which a locality is a geographic subdivis ...

of data to execution context, thereby reducing computing and communication latency between modules and functional units. Custom hardware is limited in parallel processing capability only by the area and

logic block In computing, a logic block or configurable logic block (CLB) is a fundamental building block of field-programmable gate array (FPGA) technology. Logic blocks can be configured by the engineer to provide reconfigurable logic gates. Logic blocks ...

s available on the integrated circuit die. Therefore, hardware is much more free to offer massive parallelism than software on general-purpose processors, offering a possibility of implementing the

parallel random-access machine In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confused ...

(PRAM) model. It is common to build

multicore A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions (such ...

and

manycore Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores (from a few tens of cores to thousands or more). Manycore processors are use ...

processing units out of microprocessor IP core schematics on a single FPGA or ASIC. Similarly, specialized functional units can be composed in parallel as in digital signal processing without being embedded in a processor

IP core In electronic design, a semiconductor intellectual property core (SIP core), IP core, or IP block is a reusable unit of logic, cell, or integrated circuit layout design that is the intellectual property of one party. IP cores can be licensed to ...

. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little

conditional branching In computer science, conditionals (that is, conditional statements, conditional expressions and conditional constructs,) are programming language commands for handling decisions. Specifically, conditionals perform different computations or actio ...

, especially on large amounts of data. This is how

Nvidia Nvidia CorporationOfficially written as NVIDIA and stylized in its logo as VIDIA with the lowercase "n" the same height as the uppercase "VIDIA"; formerly stylized as VIDIA with a large italicized lowercase "n" on products from the mid 1990s to ...

CUDA CUDA (or Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach ca ...

line of GPUs are implemented.

Implementation metrics

As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering the characteristics such as physical hardware dimensions, power consumption and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.

Applications

Examples of hardware acceleration include

bit blit Bit blit (also written BITBLT, BIT BLT, BitBLT, Bit BLT, Bit Blt etc., which stands for ''bit block transfer'') is a data operation commonly used in computer graphics in which several bitmaps are combined into one using a ''boolean function''. The ...

acceleration functionality in graphics processing units (GPUs), use of

memristor A memristor (; a portmanteau of ''memory resistor'') is a non-linear two-terminal electrical component relating electric charge and magnetic flux linkage. It was described and named in 1971 by Leon Chua, completing a theoretical quartet of fu ...

s for accelerating

neural networks A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...

and

regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...

hardware acceleration for spam control in the

server Server may refer to: Computing *Server (computing), a computer program or a device that provides functionality for other programs or devices, called clients Role * Waiting staff, those who work at a restaurant or a bar attending customers and su ...

industry, intended to prevent

regular expression denial of service A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many reg ...

(ReDoS) attacks. The hardware that performs the acceleration may be part of a general-purpose CPU, or a separate unit called a hardware accelerator, though they are usually referred with a more specific term, such as 3D accelerator, or

cryptographic accelerator In computing, a cryptographic accelerator is a co-processor designed specifically to perform computationally intensive cryptographic operations, doing so far more efficiently than the general-purpose CPU. Because many servers' system load consists ...

. Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by

instruction fetch The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instruction ...

(for example moving temporary results to and from a

register file A register file is an array of processor registers in a central processing unit (CPU). Register banking is the method of using a single name to access multiple different physical registers depending on the operating mode. Modern integrated circuit- ...

). Hardware accelerators improve the execution of a specific algorithm by allowing greater concurrency, having specific

datapath A datapath is a collection of functional units such as arithmetic logic units or multipliers that perform data processing operations, registers, and buses. Along with the control unit it composes the central processing unit (CPU). A larger data ...

s for their

temporary variable In computer programming, a temporary variable is a variable with short lifetime, usually to hold data that will soon be discarded, or before it can be placed at a more permanent memory location. Because it is short-lived, it is usually declared ...

s, and reducing the overhead of instruction control in the fetch-decode-execute cycle. Modern processors are multi-core and often feature parallel "single-instruction; multiple data" (

SIMD Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should ...

) units. Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like

motion estimation Motion estimation is the process of determining ''motion vectors'' that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion is in three dimensions b ...

MPEG-2 MPEG-2 (a.k.a. H.222/H.262 as was defined by the ITU) is a standard for "the generic video coding format, coding of moving pictures and associated audio information". It describes a combination of Lossy compression, lossy video compression and ...

Hardware acceleration units by application

References

External links

* {{Digital electronics Application-specific integrated circuits Central processing unit Computer optimization Gate arrays Graphics hardware Articles with example C code