General-purpose computing on graphics processing units (GPGPU, or less often GPGP) is the use of a

graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...

(GPU), which typically handles computation only for

computer graphics Computer graphics deals with generating images and art with the aid of computers. Computer graphics is a core technology in digital photography, film, video games, digital art, cell phone and computer displays, and many specialized applications. ...

, to perform computation in applications traditionally handled by the

central processing unit A central processing unit (CPU), also called a central processor, main processor, or just processor, is the primary Processor (computing), processor in a given computer. Its electronic circuitry executes Instruction (computing), instructions ...

(CPU). The use of multiple

video card A graphics card (also called a video card, display card, graphics accelerator, graphics adapter, VGA card/VGA, video adapter, display adapter, or colloquially GPU) is a computer expansion card that generates a feed of graphics output to a displa ...

s in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. Essentially, a GPGPU

pipeline A pipeline is a system of Pipe (fluid conveyance), pipes for long-distance transportation of a liquid or gas, typically to a market area for consumption. The latest data from 2014 gives a total of slightly less than of pipeline in 120 countries ...

is a kind of parallel processing between one or more GPUs and CPUs that analyzes data as if it were in image or other graphic form. While GPUs operate at lower frequencies, they typically have many times the number of cores. Thus, GPUs can process far more pictures and graphical data per second than a traditional CPU. Migrating data into graphical form and then using the GPU to scan and analyze it can create a large

speedup In computer architecture, speedup is a number that measures the relative performance of two systems processing the same problem. More technically, it is the improvement in speed of execution of a task executed on two similar architectures with ...

. GPGPU pipelines were developed at the beginning of the 21st century for graphics processing (e.g. for better

shader In computer graphics, a shader is a computer program that calculates the appropriate levels of light, darkness, and color during the rendering of a 3D scene—a process known as '' shading''. Shaders have evolved to perform a variety of s ...

s). These pipelines were found to fit

scientific computing Computational science, also known as scientific computing, technical computing or scientific computation (SC), is a division of science, and more specifically the Computer Sciences, which uses advanced computing capabilities to understand and s ...

needs well, and have since been developed in this direction. The best-known GPGPUs are

Nvidia Tesla Nvidia Tesla is the former name for a line of products developed by Nvidia targeted at stream processing or GPGPU, general-purpose graphics processing units (GPGPU), named after pioneering electrical engineer Nikola Tesla. Its products began us ...

that are used for Nvidia DGX, alongside

AMD Instinct AMD Instinct is AMD's brand of data center Graphics processing unit, GPUs. It replaced AMD's AMD FirePro, FirePro S brand in 2016. Compared to the Radeon brand of mainstream consumer/gamer products, the Instinct product line is intended to acce ...

and Intel Gaudi.

History

In principle, any arbitrary

Boolean function In mathematics, a Boolean function is a function whose arguments and result assume values from a two-element set (usually , or ). Alternative names are switching function, used especially in older computer science literature, and truth functi ...

, including addition, multiplication, and other mathematical functions, can be built up from a functionally complete set of logic operators. In 1987,

Conway's Game of Life The Game of Life, also known as Conway's Game of Life or simply Life, is a cellular automaton devised by the British mathematician John Horton Conway in 1970. It is a zero-player game, meaning that its evolution is determined by its initial ...

became one of the first examples of general-purpose computing using an early stream processor called a blitter to invoke a special sequence of logical operations on bit vectors. General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable

s and floating point support on graphics processors. Notably, problems involving

matrices Matrix (: matrices or matrixes) or MATRIX may refer to: Science and mathematics * Matrix (mathematics), a rectangular array of numbers, symbols or expressions * Matrix (logic), part of a formula in prenex normal form * Matrix (biology), the ...

and/or

vector Vector most often refers to: * Euclidean vector, a quantity with a magnitude and a direction * Disease vector, an agent that carries and transmits an infectious pathogen into another living organism Vector may also refer to: Mathematics a ...

s especially two-, three-, or four-dimensional vectors were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs. These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors,

OpenGL OpenGL (Open Graphics Library) is a Language-independent specification, cross-language, cross-platform application programming interface (API) for rendering 2D computer graphics, 2D and 3D computer graphics, 3D vector graphics. The API is typic ...

and

DirectX Microsoft DirectX is a collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with "Direct" ...

. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/ RapidMind, Brook and Accelerator. These were followed by Nvidia's

CUDA In computing, CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated gene ...

, which allowed programmers to ignore the underlying graphical concepts in favor of more common

high-performance computing High-performance computing (HPC) is the use of supercomputers and computer clusters to solve advanced computation problems. Overview HPC integrates systems administration (including network and security knowledge) and parallel programming into ...

concepts. Newer, hardware-vendor-independent offerings include Microsoft's DirectCompute and Apple/Khronos Group's

OpenCL OpenCL (Open Computing Language) is a software framework, framework for writing programs that execute across heterogeneous computing, heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), di ...

. This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form. Mark Harris, the founder of GPGPU.org, claims he coined the term ''GPGPU''.

Implementations

Any language that allows the code running on the CPU to poll a GPU

for return values, can create a GPGPU framework. Programming standards for parallel computing include

(vendor-independent),

OpenACC OpenACC (for ''open accelerators'') is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/ GPU systems. As in OpenMP, the prog ...

OpenMP OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, ...

and OpenHMPP. , OpenCL is the dominant open general-purpose GPU computing language, and is an open standard defined by the

Khronos Group The Khronos Group, Inc. is an open, non-profit, member-driven consortium of 170 organizations developing, publishing and maintaining royalty-free interoperability standards for 3D graphics, virtual reality, augmented reality, parallel computat ...

. OpenCL provides a

cross-platform Within computing, cross-platform software (also called multi-platform software, platform-agnostic software, or platform-independent software) is computer software that is designed to work in several Computing platform, computing platforms. Some ...

GPGPU platform that additionally supports data parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also standardised and implemented

SYCL SYCL (pronounced "sickle") is a higher-level programming model to improve programming productivity on various hardware accelerators. It is a single-source embedded domain-specific language ( eDSL) based on pure C++17. It is a standard develope ...

, a higher-level programming model for

as a single-source domain specific embedded language based on pure C++11. The dominant proprietary framework is

Nvidia Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...

. Nvidia launched CUDA in 2006, a

software development kit A software development kit (SDK) is a collection of software development tools in one installable package. They facilitate the creation of applications by having a compiler, debugger and sometimes a software framework. They are normally specific t ...

(SDK) and

application programming interface An application programming interface (API) is a connection between computers or between computer programs. It is a type of software Interface (computing), interface, offering a service to other pieces of software. A document or standard that des ...

(API) that allows using the programming language C to code algorithms for execution on GeForce 8 series and later GPUs. ROCm, launched in 2016, is AMD's open-source response to CUDA. It is, as of 2022, on par with CUDA with regards to features, and still lacking in consumer support. OpenVIDIA was developed at

University of Toronto The University of Toronto (UToronto or U of T) is a public university, public research university whose main campus is located on the grounds that surround Queen's Park (Toronto), Queen's Park in Toronto, Ontario, Canada. It was founded by ...

between 2003–2005, in collaboration with Nvidia. Altimesh Hybridizer created by Altimesh compiles

Common Intermediate Language Common Intermediate Language (CIL), formerly called Microsoft Intermediate Language (MSIL) or Intermediate Language (IL), is the intermediate language binary instruction set defined within the Common Language Infrastructure (CLI) specification. ...

to CUDA binaries. It supports generics and virtual functions. Debugging and profiling is integrated with

Visual Studio Visual Studio is an integrated development environment (IDE) developed by Microsoft. It is used to develop computer programs including web site, websites, web apps, web services and mobile apps. Visual Studio uses Microsoft software development ...

and Nsight. It is available as a Visual Studio extension on Visual Studio Marketplace.

Microsoft Microsoft Corporation is an American multinational corporation and technology company, technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the company became influential in the History of personal computers#The ear ...

introduced the DirectCompute GPU computing API, released with the DirectX 11 API. ', created by QuantAlea, introduces native GPU computing capabilities for the Microsoft .NET languages F# and C#. Alea GPU also provides a simplified GPU programming model based on GPU parallel-for and parallel aggregate using delegates and automatic memory management.

MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementat ...

supports GPGPU acceleration using the ''Parallel Computing Toolbox'' and ''MATLAB Distributed Computing Server'', and third-party packages like

Jacket A jacket is a garment for the upper body, usually extending below the hips. A jacket typically has sleeves and fastens in the front or slightly on the side. Jackets without sleeves are vests. A jacket is generally lighter, tighter-fitting, and ...

. GPGPU processing is also used to simulate

Newtonian physics Classical mechanics is a physical theory describing the motion of objects such as projectiles, parts of machinery, spacecraft, planets, stars, and galaxies. The development of classical mechanics involved substantial change in the methods ...

physics engine A physics engine is computer software that provides an approximate simulation of certain physical systems, typically classical dynamics, including rigid body dynamics (including collision detection), soft body dynamics, and fluid dynamics. I ...

s, and commercial implementations include Havok Physics, FX and

PhysX PhysX is an Open-source software, open-source Real-time computer graphics, realtime physics engine middleware Software development kit, SDK developed by Nvidia as part of the Nvidia GameWorks software suite. Initially, video games supporting Ph ...

, both of which are typically used for computer and

video game A video game or computer game is an electronic game that involves interaction with a user interface or input device (such as a joystick, game controller, controller, computer keyboard, keyboard, or motion sensing device) to generate visual fe ...

s. C++ Accelerated Massive Parallelism ( C++ AMP) is a library that accelerates execution of C++ code by exploiting the data-parallel hardware on GPUs.

Mobile computers

Due to a trend of increasing power of mobile GPUs, general-purpose programming became available also on the mobile devices running major

mobile operating system A mobile operating system is an operating system used for smartphones, tablets, smartwatches, smartglasses, or other non-laptop personal mobile computing devices. While computers such as laptops are "mobile", the operating systems used on the ...

Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...

Android 4.2 enabled running RenderScript code on the mobile device GPU. Renderscript has since been deprecated in favour of first OpenGL compute shaders and later Vulkan Compute. OpenCL is available on many Android devices, but is not officially supported by Android.

Apple An apple is a round, edible fruit produced by an apple tree (''Malus'' spp.). Fruit trees of the orchard or domestic apple (''Malus domestica''), the most widely grown in the genus, are agriculture, cultivated worldwide. The tree originated ...

introduced the proprietary

Metal A metal () is a material that, when polished or fractured, shows a lustrous appearance, and conducts electrical resistivity and conductivity, electricity and thermal conductivity, heat relatively well. These properties are all associated wit ...

API for

iOS Ios, Io or Nio (, ; ; locally Nios, Νιός) is a Greek island in the Cyclades group in the Aegean Sea. Ios is a hilly island with cliffs down to the sea on most sides. It is situated halfway between Naxos and Santorini. It is about long an ...

applications, able to execute arbitrary code through Apple's GPU compute shaders.

Hardware support

Computer

s are produced by various vendors, such as

AMD Advanced Micro Devices, Inc. (AMD) is an American multinational corporation and technology company headquartered in Santa Clara, California and maintains significant operations in Austin, Texas. AMD is a hardware and fabless company that de ...

. Cards from such vendors differ on implementing data-format support, such as

integer An integer is the number zero (0), a positive natural number (1, 2, 3, ...), or the negation of a positive natural number (−1, −2, −3, ...). The negations or additive inverses of the positive natural numbers are referred to as negative in ...

and

floating-point In computing, floating-point arithmetic (FP) is arithmetic on subsets of real numbers formed by a ''significand'' (a Sign (mathematics), signed sequence of a fixed number of digits in some Radix, base) multiplied by an integer power of that ba ...

formats (32-bit and 64-bit).

introduced a '' Shader Model'' standard, to help rank the various features of graphic cards into a simple Shader Model version number (1.0, 2.0, 3.0, etc.).

Integer numbers

Pre-DirectX 9 video cards only supported paletted or integer color types. Sometimes another alpha value is added, to be used for transparency. Common formats are: * 8 bits per pixel – Sometimes palette mode, where each value is an index in a table with the real color value specified in one of the other formats. Sometimes three bits for red, three bits for green, and two bits for blue. * 16 bits per pixel – Usually the bits are allocated as five bits for red, six bits for green, and five bits for blue. * 24 bits per pixel – There are eight bits for each of red, green, and blue. * 32 bits per pixel – There are eight bits for each of red, green, blue, and

alpha Alpha (uppercase , lowercase ) is the first letter of the Greek alphabet. In the system of Greek numerals, it has a value of one. Alpha is derived from the Phoenician letter ''aleph'' , whose name comes from the West Semitic word for ' ...

Floating-point numbers

For early fixed-function or limited programmability graphics (i.e., up to and including DirectX 8.1-compliant GPUs) this was sufficient because this is also the representation used in displays. This representation does have certain limitations. Given sufficient graphics processing power even graphics programmers would like to use better formats, such as floating point data formats, to obtain effects such as

high-dynamic-range imaging High dynamic range (HDR), also known as wide dynamic range, extended dynamic range, or expanded dynamic range, is a signal with a higher dynamic range than usual. The term is often used in discussing the dynamic ranges of images, videos, audio or ...

. Many GPGPU applications require floating point accuracy, which came with video cards conforming to the DirectX 9 specification. DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. ATI's Radeon R300 series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while

's NV30 series supported both FP16 and FP32; other vendors such as

S3 Graphics S3 Graphics, Ltd. was an American computer graphics company. The company sold the S3 Trio, Trio, S3 ViRGE, ViRGE, S3 Savage, Savage, and S3 Chrome, Chrome series of graphics processors. Struggling against competition from 3dfx Interactive, ATI T ...

and XGI supported a mixture of formats up to FP24. The implementations of floating point on Nvidia GPUs are mostly

IEEE The Institute of Electrical and Electronics Engineers (IEEE) is an American 501(c)(3) organization, 501(c)(3) public charity professional organization for electrical engineering, electronics engineering, and other related disciplines. The IEEE ...

compliant; however, this is not true across all vendors. This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs. Some GPU architectures sacrifice IEEE compliance, while others lack double-precision. Efforts have occurred to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit to offloading the computing onto the GPU in the first place.Double precision on GPUs (Proceedings of ASIM 2005)
: Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating Double Precision (FEM) Simulations with (GPUs). Proceedings of ASIM 2005 18th Symposium on Simulation Technique, 2005.

Vectorization

Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once. For example, if one color is to be modulated by another color , the GPU can produce the resulting color in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional). Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions, termed single instruction, multiple data (

SIMD Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...

), have long been available on CPUs.

GPU vs. CPU

Originally, data was simply passed one-way from a

(CPU) to a

(GPU), then to a

display device A display device is an output device for presentation of information in visual or tactile form (the latter used for example in tactile electronic displays for blind people). When the input information that is supplied has an electrical signa ...

. As time progressed, however, it became valuable for GPUs to store at first simple, then complex structures of data to be passed back to the CPU that analyzed an image, or a set of scientific-data represented as a 2D or 3D format that a video card can understand. Because the GPU has access to every draw operation, it can analyze data in these forms quickly, whereas a CPU must poll every pixel or data element much more slowly, as the speed of access between a CPU and its larger pool of

random-access memory Random-access memory (RAM; ) is a form of Computer memory, electronic computer memory that can be read and changed in any order, typically used to store working Data (computing), data and machine code. A random-access memory device allows ...

(or in an even worse case, a

hard drive A hard disk drive (HDD), hard disk, hard drive, or fixed disk is an electro-mechanical data storage device that stores and retrieves digital data using magnetic storage with one or more rigid rapidly rotating hard disk drive platter, pla ...

) is slower than GPUs and video cards, which typically contain smaller amounts of more expensive memory that is much faster to access. Transferring the portion of the data set to be actively analyzed to that GPU memory in the form of textures or other easily readable GPU forms results in speed increase. The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally back from the GPU to the CPU; generally the data throughput in both directions is ideally high, resulting in a multiplier effect on the speed of a specific high-use

algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...

. GPGPU pipelines may improve efficiency on especially large data sets and/or data containing 2D or 3D imagery. It is used in complex graphics pipelines as well as

; more so in fields with large data sets like genome mapping, or where two- or three-dimensional analysis is useful especially at present

biomolecule A biomolecule or biological molecule is loosely defined as a molecule produced by a living organism and essential to one or more typically biological processes. Biomolecules include large macromolecules such as proteins, carbohydrates, lipids ...

analysis,

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...

study, and other complex

organic chemistry Organic chemistry is a subdiscipline within chemistry involving the science, scientific study of the structure, properties, and reactions of organic compounds and organic matter, organic materials, i.e., matter in its various forms that contain ...

. An example of such applications is NVIDIA software suite for genome analysis. Such pipelines can also vastly improve efficiency in

image processing An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a pr ...

and

computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

, among other fields; as well as parallel processing generally. Some very heavily optimized pipelines have yielded speed increases of several hundred times the original CPU-based pipeline on one high-use task. A simple example would be a GPU program that collects data about average

lighting Lighting or illumination is the deliberate use of light to achieve practical or aesthetic effects. Lighting includes the use of both artificial light sources like lamps and light fixtures, as well as natural illumination by capturing daylight. ...

values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use

edge detection Edge or EDGE may refer to: Technology Computing * Edge computing, a network load-balancing system * Edge device, an entry point to a computer network * Adobe Edge, a graphical development application * Microsoft Edge, a web browser developed b ...

to return both numerical information and a processed image representing outlines to a

program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every

pixel In digital imaging, a pixel (abbreviated px), pel, or picture element is the smallest addressable element in a Raster graphics, raster image, or the smallest addressable element in a dot matrix display device. In most digital display devices, p ...

or other picture element in an image, it can analyze and average it (for the first example) or apply a Sobel edge filter or other

convolution In mathematics (in particular, functional analysis), convolution is a operation (mathematics), mathematical operation on two function (mathematics), functions f and g that produces a third function f*g, as the integral of the product of the two ...

filter (for the second) with much greater speed than a CPU, which typically must access slower

copies of the graphic in question. GPGPU is fundamentally a software concept, not a hardware concept; it is a type of

, not a piece of equipment. Specialized equipment designs may, however, even further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a ''rack''), which adds a third layer many computing units each using many CPUs to correspond to many GPUs. Some

Bitcoin Bitcoin (abbreviation: BTC; Currency symbol, sign: ₿) is the first Decentralized application, decentralized cryptocurrency. Based on a free-market ideology, bitcoin was invented in 2008 when an unknown entity published a white paper under ...

"miners" used such setups for high-quantity processing.

Caches

Historically, CPUs have used hardware-managed caches, but the earlier GPUs only provided software-managed local memories. However, as GPUs are being increasingly used for general-purpose applications, state-of-the-art GPUs are being designed with hardware-managed multi-level caches which have helped the GPUs to move towards mainstream computing. For example, GeForce 200 series GT200 architecture GPUs did not feature an L2 cache, the Fermi GPU has 768 KiB last-level cache, the

Kepler Johannes Kepler (27 December 1571 – 15 November 1630) was a German astronomer, mathematician, astrologer, natural philosopher and writer on music. He is a key figure in the 17th-century Scientific Revolution, best known for his laws of p ...

GPU has 1.5 MiB last-level cache, the

Maxwell Maxwell may refer to: People * Maxwell (surname), including a list of people and fictional characters with the name ** James Clerk Maxwell, mathematician and physicist * Justice Maxwell (disambiguation) * Maxwell baronets, in the Baronetage of N ...

GPU has 2 MiB last-level cache, and the Pascal GPU has 4 MiB last-level cache.

Register file

GPUs have very large register files, which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs are 6 MiB, 14 MiB and 20 MiB, respectively. By comparison, the size of a register file on CPUs is small, typically tens or hundreds of kilobytes.

Energy efficiency

The high performance of GPUs comes at the cost of high power consumption, which under full load is in fact as much power as the rest of the PC system combined. The maximum power consumption of the Pascal series GPU (Tesla P100) was specified to be 250W.

Classical GPGPU

Before CUDA was published in 2007, GPGPU was "classical" and involved repurposing graphics primitives. A standard structure of such was: # Load arrays into textures # Draw a quadrangle # Apply pixel shaders and textures to quadrangle # Read out pixel values in the quadrangle as array More examples are available in part 4 of ''GPU Gems 2''.

Linear algebra

Using GPU for numerical linear algebra began at least in 2001. It had been used for Gauss-Seidel solver, conjugate gradients, etc.

Stream processing

GPUs are designed specifically for graphics and thus are very restrictive in operations and programming. Due to their design, GPUs are only effective for problems that can be solved using

stream processing In computer science, stream processing (also known as event stream processing, data stream processing, or distributed stream processing) is a programming paradigm which views Stream (computing), streams, or sequences of events in time, as the centr ...

and the hardware can only be used in certain ways. The following discussion referring to vertices, fragments and textures concerns mainly the legacy model of GPGPU programming, where graphics APIs (

) were used to perform general-purpose computation. With the introduction of the

(Nvidia, 2007) and

(vendor-independent, 2008) general-purpose computing APIs, in new GPGPU codes it is no longer necessary to map the computation to graphics primitives. The stream processing nature of GPUs remains valid regardless of the APIs used. (See e.g.,) GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors processors that can operate in parallel by running one kernel on many records in a stream at once. A ''stream'' is simply a set of records that require similar computation. Streams provide data parallelism. '' Kernels'' are the functions that are applied to each element in the stream. In the GPUs, ''vertices'' and ''fragments'' are the elements in streams and vertex and fragment shaders are the kernels to be run on them. For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity else the memory access latency will limit computational speedup. Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.

GPU programming concepts

Computational resources

There are a variety of computational resources available on the GPU: * Programmable processors – vertex, primitive, fragment and mainly compute pipelines allow programmer to perform kernel on streams of data * Rasterizer – creates fragments and interpolates per-vertex constants such as texture coordinates and color * Texture unit – read-only memory interface * Framebuffer – write-only memory interface In fact, a program can substitute a write only texture for output instead of the framebuffer. This is done either through Render to Texture (RTT), Render-To-Backbuffer-Copy-To-Texture (RTBCTT), or the more recent stream-out.

Textures as stream

The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on. Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.

Kernels

Compute kernel In computing, a compute kernel is a routine compiled for high throughput accelerators (such as graphics processing units (GPUs), digital signal processors (DSPs) or field-programmable gate arrays (FPGAs)), separate from but used by a main pro ...

s can be thought of as the body of loops. For example, a programmer operating on a grid on the CPU might have code that looks like this: // Input and output grids have 10000 x 10000 or 100 million elements. void transform_10k_by_10k_grid(float in 000010000], float out 000010000]) On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.

Flow control

In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs. Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible. Recent GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such as static branch resolution, pre-computation, predication, loop splitting,Future Chips
"Tutorial on removing branches", 2011 and Z-cullGPGPU survey paper
: John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Tim Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, volume 26, number 1, 2007, pp. 80–113. can be used to achieve branching when hardware support does not exist.

GPU methods

Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.

Reduce

Some computations require calculating a smaller stream (possibly a stream of only one element) from a larger stream. This is called a reduction of the stream. Generally, a reduction can be performed in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.

Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.

Scan

The scan operation, also termed '' prefix sum#Parallel algorithm, parallel prefix sum'', takes in a vector (stream) of data elements and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is 0, a1, a2, a3, ... an ''exclusive scan'' produces the output , a0, a0 + a1, a0 + a1 + a2, ... while an ''inclusive scan'' produces the output 0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...and does not require an identity to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication.

Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects. The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with another gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot. In dedicated

compute kernel In computing, a compute kernel is a routine compiled for high throughput accelerators (such as graphics processing units (GPUs), digital signal processors (DSPs) or field-programmable gate arrays (FPGAs)), separate from but used by a main pro ...

s, scatter can be performed by indexed writes.

Gather

Gather is the reverse of scatter. After scatter reorders elements according to a map, gather can restore the order of the elements according to the map scatter used. In dedicated compute kernels, gather may be performed by indexed reads. In other shaders, it is performed with texture-lookups.

Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using radix sort for integer and floating point data and coarse-grained

merge sort In computer science, merge sort (also commonly spelled as mergesort and as ) is an efficient, general-purpose, and comparison sort, comparison-based sorting algorithm. Most implementations of merge sort are Sorting algorithm#Stability, stable, wh ...

and fine-grained sorting networks for general comparable data.Merrill, Duane. Allocation-oriented Algorithm Design with Application to GPU Computing
Ph.D. dissertation, Department of Computer Science, University of Virginia. Dec. 2011.
, 2013.

Search

The search operation allows the programmer to find a given element within the stream, or possibly find neighbors of a specified element. Mostly the search method used is

binary search In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the m ...

on sorted elements.

Data structures

A variety of data structures can be represented on the GPU: * Dense

arrays An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...

Sparse matrices In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix (mathematics), matrix in which most of the elements are zero. There is no strict definition regarding the proportion of zero-value elements for a matrix ...

( sparse array) static or dynamic * Adaptive structures ( union type)

Applications

The following are some of the areas where GPUs have been used for general purpose computing: * Automatic parallelization * Physical based simulation and

s (usually based on

models) **

, cloth simulation, fluid

incompressible flow In fluid mechanics, or more generally continuum mechanics, incompressible flow is a flow in which the material density does not vary over time. Equivalently, the divergence of an incompressible flow velocity is zero. Under certain conditions, t ...

by solution of

Euler equations (fluid dynamics) In fluid dynamics, the Euler equations are a set of partial differential equations governing adiabatic and inviscid flow. They are named after Leonhard Euler. In particular, they correspond to the Navier–Stokes equations with zero viscosity ...

Navier–Stokes equations The Navier–Stokes equations ( ) are partial differential equations which describe the motion of viscous fluid substances. They were named after French engineer and physicist Claude-Louis Navier and the Irish physicist and mathematician Georg ...

Statistical physics In physics, statistical mechanics is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. Sometimes called statistical physics or statistical thermodynamics, its applicati ...

Ising model The Ising model (or Lenz–Ising model), named after the physicists Ernst Ising and Wilhelm Lenz, is a mathematical models in physics, mathematical model of ferromagnetism in statistical mechanics. The model consists of discrete variables that r ...

* Lattice gauge theory * Segmentation 2D and 3D * Level set methods * CT reconstruction *

Fast Fourier transform A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). A Fourier transform converts a signal from its original domain (often time or space) to a representation in ...

* GPU learning

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

and

data mining Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...

computations, e.g., with software BIDMach * k-nearest neighbor algorithm *

Fuzzy logic Fuzzy logic is a form of many-valued logic in which the truth value of variables may be any real number between 0 and 1. It is employed to handle the concept of partial truth, where the truth value may range between completely true and completely ...

Tone mapping Tone mapping is a technique used in image processing and computer graphics to map one set of colors to another to approximate the appearance of high-dynamic-range (HDR) images in a medium that has a more limited dynamic range. Print-outs, C ...

Audio signal processing Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves—longitudinal waves which travel through air, consisting ...

** Audio and sound effects processing, to use a GPU for

digital signal processing Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are a ...

(DSP) ** Analog signal processing ** Speech processing *

Digital image processing Digital image processing is the use of a digital computer to process digital images through an algorithm. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allo ...

Video processing In electronics engineering, video processing is a particular case of signal processing, in particular image processing, which often employs filter (video), video filters and where the input and output Signal (electrical engineering), signals are vid ...

** Hardware accelerated video decoding and post-processing ***

Motion compensation Motion compensation in computing is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video ...

(mo comp) *** Inverse

discrete cosine transform A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequency, frequencies. The DCT, first proposed by Nasir Ahmed (engineer), Nasir Ahmed in 1972, is a widely ...

(iDCT) *** Variable-length decoding (VLD),

Huffman coding In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by ...

*** Inverse quantization (IQ, not to be confused with

Intelligence Quotient An intelligence quotient (IQ) is a total score derived from a set of standardized tests or subtests designed to assess human intelligence. Originally, IQ was a score obtained by dividing a person's mental age score, obtained by administering ...

) *** In-loop deblocking *** Bitstream processing ( CAVLC/ CABAC) using special purpose hardware for this task because this is a serial task not suitable for regular GPGPU computation ***

Deinterlacing Deinterlacing is the process of converting interlaced video into a non-interlaced or Progressive scan, progressive form. Interlaced video signals are commonly found in analog television, VHS, Laserdisc, digital television (HDTV) when in the 1080 ...

**** Spatial-temporal deinterlacing *** Noise reduction *** Edge enhancement *** Color correction ** Hardware accelerated video encoding and pre-processing * Global illumination ray tracing, photon mapping, radiosity among others, subsurface scattering * Geometric computing constructive solid geometry, distance fields, collision detection, transparency computation, shadow generation * Scientific computing ** Monte Carlo simulation of light propagation **

Weather forecasting Weather forecasting or weather prediction is the application of science and technology forecasting, to predict the conditions of the Earth's atmosphere, atmosphere for a given location and time. People have attempted to predict the weather info ...

** Climate research ** Molecular modeling on GPU ** Quantum mechanical physics **

Astrophysics Astrophysics is a science that employs the methods and principles of physics and chemistry in the study of astronomical objects and phenomena. As one of the founders of the discipline, James Keeler, said, astrophysics "seeks to ascertain the ...

Number theory Number theory is a branch of pure mathematics devoted primarily to the study of the integers and arithmetic functions. Number theorists study prime numbers as well as the properties of mathematical objects constructed from integers (for example ...

Primality test A primality test is an algorithm for determining whether an input number is prime. Among other fields of mathematics, it is used for cryptography. Unlike integer factorization, primality tests do not generally give prime factors, only stating wheth ...

ing and

integer factorization In mathematics, integer factorization is the decomposition of a positive integer into a product of integers. Every positive integer greater than 1 is either the product of two or more integer factors greater than 1, in which case it is a comp ...

Bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

Medical imaging Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to revea ...

* Clinical decision support system (CDSS) *

Computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

Digital signal processing Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are a ...

signal processing Signal processing is an electrical engineering subfield that focuses on analyzing, modifying and synthesizing ''signals'', such as audio signal processing, sound, image processing, images, Scalar potential, potential fields, Seismic tomograph ...

Control engineering Control engineering, also known as control systems engineering and, in some European countries, automation engineering, is an engineering discipline that deals with control systems, applying control theory to design equipment and systems with d ...

Operations research Operations research () (U.S. Air Force Specialty Code: Operations Analysis), often shortened to the initialism OR, is a branch of applied mathematics that deals with the development and application of analytical methods to improve management and ...

** Implementations of: the GPU Tabu Search algorithm solving the Resource Constrained Project Scheduling problem is freely available on GitHub; the GPU algorithm solving the Nurse scheduling problem is freely available on GitHub. *

Neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

s *

Database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...

operations *

Computational Fluid Dynamics Computational fluid dynamics (CFD) is a branch of fluid mechanics that uses numerical analysis and data structures to analyze and solve problems that involve fluid dynamics, fluid flows. Computers are used to perform the calculations required ...

especially using

Lattice Boltzmann methods The lattice Boltzmann methods (LBM), originated from the lattice gas automata (LGA) method (Hardy- Pomeau-Pazzis and Frisch- Hasslacher- Pomeau models), is a class of computational fluid dynamics (CFD) methods for fluid simulation. Instead of s ...

Cryptography Cryptography, or cryptology (from "hidden, secret"; and ''graphein'', "to write", or ''-logy, -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of Adversary (cryptography), ...

and

cryptanalysis Cryptanalysis (from the Greek ''kryptós'', "hidden", and ''analýein'', "to analyze") refers to the process of analyzing information systems in order to understand hidden aspects of the systems. Cryptanalysis is used to breach cryptographic se ...

* Performance modeling: computationally intensive tasks on GPU ** Implementations of: MD6,

Advanced Encryption Standard The Advanced Encryption Standard (AES), also known by its original name Rijndael (), is a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology (NIST) in 2001. AES is a variant ...

(AES),

Data Encryption Standard The Data Encryption Standard (DES ) is a symmetric-key algorithm for the encryption of digital data. Although its short key length of 56 bits makes it too insecure for modern applications, it has been highly influential in the advancement of cryp ...

(DES), RSA,

elliptic curve cryptography Elliptic-curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC allows smaller keys to provide equivalent security, compared to cryptosystems based on modula ...

(ECC) ** Password cracking **

Cryptocurrency A cryptocurrency (colloquially crypto) is a digital currency designed to work through a computer network that is not reliant on any central authority, such as a government or bank, to uphold or maintain it. Individual coin ownership record ...

transactions processing ("mining") ( Bitcoin mining) *

Electronic design automation Electronic design automation (EDA), also referred to as electronic computer-aided design (ECAD), is a category of software tools for designing Electronics, electronic systems such as integrated circuits and printed circuit boards. The tools wo ...

Antivirus software Antivirus software (abbreviated to AV software), also known as anti-malware, is a computer program used to prevent, detect, and remove malware. Antivirus software was originally developed to detect and remove computer viruses, hence the name ...

Intrusion detection An intrusion detection system (IDS) is a device or software application that monitors a network or systems for malicious activity or policy violations. Any intrusion activity or violation is typically either reported to an administrator or collec ...

Regular Expression Matching on Graphics Hardware for Intrusion Detection
. Giorgos Vasiliadis et al., Regular Expression Matching on Graphics Hardware for Intrusion Detection. In proceedings of RAID 2009. * Increase computing power for

distributed computing Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers. The components of a distributed system commu ...

projects like

SETI@home SETI@home ("SETI at home") is a project of the Berkeley SETI Research Center to analyze radio signals with the aim of Search for extraterrestrial intelligence, searching for signs of extraterrestrial intelligence. Until March 2020, it was run ...

Einstein@home Einstein@Home is a volunteer computing project that searches for signals from spinning neutron stars in data from gravitational-wave detectors, from large radio telescopes, and from a gamma-ray telescope. Neutron stars are detected by their puls ...

Bioinformatics

GPGPU usage in Bioinformatics:

Molecular dynamics

† Expected speedups are highly dependent on system configuration. GPU performance compared against multi-core

x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel, based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088. Th ...

CPU socket. GPU performance benchmarked on GPU supported features and may be a kernel to kernel performance comparison. For details on configuration used, view application website. Speedups as per Nvidia in-house testing or ISV's documentation. ‡ Q= Quadro GPU, T= Tesla GPU. Nvidia recommended GPUs for this application. Check with developer or ISV to obtain certification information.