Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).

Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in progress, and it finally receives an interrupt from the DMA controller (DMAC) when the operation is done. This feature is useful whenever the CPU cannot keep up with the rate of data transfer, or when the CPU needs to perform work while waiting for a relatively slow I/O data transfer.

Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel.

DMA can also be used for "memory to memory" copying or moving of data within memory. DMA can offload expensive memory operations, such as large copies or scatter-gather operations, from the CPU to a dedicated DMA engine. An implementation example is the I/O Acceleration Technology (I/OAT). DMA is of interest in network-on-chip and in-memory computing architectures.


Principles


Third-party

Standard DMA, also called third-party DMA, uses a DMA controller. A DMA controller can generate memory addresses and initiate memory read or write cycles. It contains several hardware registers that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers. Depending on what features the DMA controller provides, these control registers might specify some combination of the source, the destination, the direction of the transfer (reading from the I/O device or writing to the I/O device), the size of the transfer unit, and/or the number of bytes to transfer in one burst.

To carry out an input, output or memory-to-memory operation, the host processor initializes the DMA controller with a count of the number of words to transfer and the memory address to use. The CPU then commands the peripheral device to initiate a data transfer. The DMA controller then provides addresses and read/write control lines to the system memory. Each time a byte of data is ready to be transferred between the peripheral device and memory, the DMA controller increments its internal address register until the full block of data is transferred.
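The register set described above can be pictured with a minimal C sketch of a third-party controller's memory-mapped programming interface. Everything here is hypothetical: the base address, register layout and bit assignments are invented for illustration, and a real controller's datasheet would define its own.

```c
#include <stdint.h>

/* Hypothetical register layout of a simple third-party DMA controller,
 * memory-mapped at DMAC_BASE.  The names and bit positions are illustrative
 * only; a real controller's datasheet defines them. */
#define DMAC_BASE        0x40001000u
#define DMAC_CTRL_START  (1u << 0)   /* write 1 to begin the transfer   */
#define DMAC_STAT_DONE   (1u << 0)   /* set by hardware when finished   */

typedef struct {
    volatile uint32_t src_addr;   /* source memory/bus address            */
    volatile uint32_t dst_addr;   /* destination memory/bus address       */
    volatile uint32_t count;      /* number of transfer units (words)     */
    volatile uint32_t ctrl;       /* control: start, direction, burst size */
    volatile uint32_t status;     /* status: done, error flags            */
} dmac_regs_t;

static dmac_regs_t *const dmac = (dmac_regs_t *)DMAC_BASE;

/* Program the controller for one block transfer and busy-wait for the
 * completion flag.  A real driver would instead enable the controller's
 * completion interrupt and let the CPU do other work, as described above. */
static void dma_copy_block(uint32_t src, uint32_t dst, uint32_t words)
{
    dmac->src_addr = src;
    dmac->dst_addr = dst;
    dmac->count    = words;
    dmac->ctrl     = DMAC_CTRL_START;

    while (!(dmac->status & DMAC_STAT_DONE))
        ;  /* polling only for brevity */
}
```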


Bus mastering

In a bus mastering system, also known as a first-party DMA system, the CPU and peripherals can each be granted control of the memory bus. Where a peripheral can become a bus master, it can directly write to system memory without the involvement of the CPU, providing memory address and control signals as required. Some measures must be provided to put the processor into a hold condition so that bus contention does not occur.


Modes of operation


Burst mode

In ''burst mode'', an entire block of data is transferred in one contiguous sequence. Once the DMA controller is granted access to the system bus by the CPU, it transfers all bytes of data in the data block before releasing control of the system buses back to the CPU, but this renders the CPU inactive for relatively long periods of time. This mode is also called "block transfer mode".


Cycle stealing mode

The ''cycle stealing mode'' is used in systems in which the CPU should not be disabled for the length of time needed for burst transfer modes. In cycle stealing mode, the DMA controller obtains access to the system bus the same way as in burst mode, using the ''BR (Bus Request)'' and ''BG (Bus Grant)'' signals, which are the two signals controlling the interface between the CPU and the DMA controller. However, in cycle stealing mode, after one unit (e.g. one byte) of data is transferred, control of the system bus is released back to the CPU via BG. It is then continually requested again via BR, transferring one unit of data per request, until the entire block of data has been transferred. By continually obtaining and releasing control of the system bus, the DMA controller essentially interleaves instruction and data transfers: the CPU processes an instruction, then the DMA controller transfers one data value, and so on. Data is not transferred as quickly as in burst mode, but the CPU is not idled for as long. Cycle stealing mode is useful for controllers that monitor data in real time.


Transparent mode

Transparent mode takes the most time to transfer a block of data, yet it is also the most efficient mode in terms of overall system performance. In transparent mode, the DMA controller transfers data only when the CPU is performing operations that do not use the system buses. The primary advantage of transparent mode is that the CPU never stops executing its programs and the DMA transfer is free in terms of time, while the disadvantage is that the hardware needs to determine when the CPU is not using the system buses, which can be complex. This is also called "hidden DMA data transfer mode".
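To make the contrast between the three modes concrete, here is a deliberately simplified toy simulation in C, not driver code: each simulated bus cycle is given either to the CPU or to the DMA controller according to the mode's policy. The bus-usage pattern and the per-cycle granularity are invented purely for illustration.

```c
#include <stdio.h>

enum mode { BURST, CYCLE_STEALING, TRANSPARENT };

static void simulate(enum mode m, int block_units)
{
    /* made-up pattern of bus cycles in which the CPU itself needs the bus */
    static const int cpu_needs_bus[] = {1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0};
    const int cycles = (int)(sizeof cpu_needs_bus / sizeof cpu_needs_bus[0]);
    int transferred = 0, cpu_bus_cycles = 0;

    for (int c = 0; c < cycles && transferred < block_units; c++) {
        int dma_gets_bus = 0;
        switch (m) {
        case BURST:          dma_gets_bus = 1; break;                  /* hold the bus until the block is done */
        case CYCLE_STEALING: dma_gets_bus = (c % 2 == 0); break;       /* one unit per BR/BG handshake         */
        case TRANSPARENT:    dma_gets_bus = !cpu_needs_bus[c]; break;  /* only cycles the CPU leaves idle      */
        }
        if (dma_gets_bus)
            transferred++;
        else
            cpu_bus_cycles++;
    }
    printf("mode %d: %d units transferred, CPU kept the bus for %d cycles\n",
           (int)m, transferred, cpu_bus_cycles);
}

int main(void)
{
    simulate(BURST, 5);
    simulate(CYCLE_STEALING, 5);
    simulate(TRANSPARENT, 5);
    return 0;
}
```

Running it shows the trade-off described above: burst mode finishes the block soonest but gives the CPU nothing, cycle stealing alternates, and transparent mode costs the CPU no bus cycles at all but takes the longest.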


Cache coherency

DMA can lead to cache coherency problems. Imagine a CPU equipped with a cache and an external memory that can be accessed directly by devices using DMA. When the CPU accesses location X in the memory, the current value will be stored in the cache. Subsequent operations on X will update the cached copy of X, but not the external memory version of X, assuming a write-back cache. If the cache is not flushed to memory before the next time a device tries to access X, the device will receive a stale value of X. Similarly, if the cached copy of X is not invalidated when a device writes a new value to memory, then the CPU will operate on a stale value of X.

This issue can be addressed in one of two ways in system design. Cache-coherent systems implement a method in hardware called bus snooping, whereby external writes are signaled to the cache controller, which then performs a cache invalidation for DMA writes or a cache flush for DMA reads. Non-coherent systems leave this to software, where the OS must ensure that the cache lines are flushed before an outgoing DMA transfer is started and invalidated before a memory range affected by an incoming DMA transfer is accessed; the OS must also make sure that the memory range is not accessed by any running threads in the meantime. The latter approach introduces some overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line individually. Hybrids also exist, where the secondary L2 cache is coherent while the L1 cache (typically on-CPU) is managed by software.
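The software-managed (non-coherent) case can be sketched as follows. All of the cache_* and dma_* functions here are hypothetical placeholders, since the real primitives are platform-specific; operating systems such as Linux wrap this kind of maintenance behind a DMA-mapping API rather than exposing it directly.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical cache-maintenance and DMA primitives for a non-coherent
 * system; real names and semantics differ per platform. */
extern void cache_clean_range(const void *addr, size_t len);       /* write back dirty lines */
extern void cache_invalidate_range(const void *addr, size_t len);  /* discard cached lines   */
extern void dma_start_to_device(const void *buf, size_t len);      /* outgoing DMA           */
extern void dma_start_from_device(void *buf, size_t len);          /* incoming DMA           */
extern void dma_wait_complete(void);

/* Outgoing transfer: the device must see the CPU's latest data, so dirty
 * cache lines covering the buffer are written back (flushed) first. */
void send_buffer(const void *buf, size_t len)
{
    cache_clean_range(buf, len);
    dma_start_to_device(buf, len);
    dma_wait_complete();
}

/* Incoming transfer: stale cached copies of the buffer must be discarded
 * so that later CPU reads fetch what the device wrote into memory. */
void receive_buffer(void *buf, size_t len)
{
    cache_invalidate_range(buf, len);
    dma_start_from_device(buf, len);
    dma_wait_complete();
    /* On CPUs that may speculatively prefetch, a second invalidate after
     * completion is the conservative choice. */
    cache_invalidate_range(buf, len);
}
```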


Examples


ISA

In the original IBM PC (and the follow-up PC/XT), there was only one Intel 8237 DMA controller, capable of providing four DMA channels (numbered 0–3). These DMA channels performed 8-bit transfers (as the 8237 was an 8-bit device, ideally matched to the PC's i8088 CPU/bus architecture), could only address the first (i8086/8088-standard) megabyte of RAM, and were limited to addressing single 64 kB segments within that space (although the source and destination channels could address different segments). Additionally, the controller could only be used for transfers to, from or between expansion-bus I/O devices, as the 8237 could only perform memory-to-memory transfers using channels 0 and 1, of which channel 0 in the PC (and XT) was dedicated to dynamic memory refresh. This prevented it from being used as a general-purpose "blitter", and consequently block memory moves in the PC, limited by the general PIO speed of the CPU, were very slow.

With the IBM PC/AT, the enhanced AT bus (more familiarly retronymed the ISA, or "Industry Standard Architecture", bus) added a second 8237 DMA controller to provide three additional channels (5–7; channel 4 is used as a cascade to the first 8237). These channels were much needed, as the resource clashes caused by the XT's additional expandability over the original PC had highlighted. The page register was also rewired to address the full 16 MB memory address space of the 80286 CPU. This second controller was integrated in a way that allows it to perform 16-bit transfers when an I/O device is used as the data source and/or destination (it only processes data itself for memory-to-memory transfers; otherwise it simply ''controls'' the data flow between other parts of the 16-bit system, making its own data-bus width relatively immaterial), doubling data throughput when the upper three channels are used. For compatibility, the lower four DMA channels were still limited to 8-bit transfers, and whilst memory-to-memory transfers were now technically possible (channel 0 having been freed from handling DRAM refresh), from a practical standpoint they were of limited value because of the controller's low throughput compared to what the CPU could now achieve (i.e. a 16-bit, more optimised 80286 running at a minimum of 6 MHz, versus an 8-bit controller locked at 4.77 MHz). In both cases the 64 kB segment boundary issue remained, with individual transfers unable to cross segments (instead "wrapping around" to the start of the same segment) even in 16-bit mode, although in practice this was more a problem of programming complexity than of performance, as the continued need for DRAM refresh (however handled) to monopolise the bus approximately every 15 μs prevented the use of large (fast, but uninterruptible) block transfers anyway.

Due to their lagging performance (a 1.6 MB/s maximum 8-bit transfer rate at 5 MHz, but no more than 0.9 MB/s in the PC/XT and 1.6 MB/s for 16-bit transfers in the AT, owing to ISA bus overheads and other interference such as memory-refresh interruptions) and the unavailability of any speed grades that would allow installation of direct replacements operating at speeds higher than the original PC's standard 4.77 MHz clock, these devices have been effectively obsolete since the late 1980s. In particular, the advent of the 80386 and its capacity for 32-bit transfers (and the fact that improvements in address calculation and block moves in Intel CPUs after the 80186 meant that PIO transfers even by the 16-bit-bus 286 and 386SX could still easily outstrip the 8237), as well as the development of evolutions of the "ISA" bus (EISA) or replacements for it (MCA, VLB and PCI) with their own much higher-performance DMA subsystems (up to a maximum of 33 MB/s for EISA, 40 MB/s for MCA, and typically 133 MB/s for VLB/PCI), made the original DMA controllers seem more of a performance millstone than a booster. They were supported only to the extent required by built-in legacy PC hardware on later machines. The pieces of legacy hardware that continued to use ISA DMA after 32-bit expansion buses became common were Sound Blaster cards that needed to maintain full hardware compatibility with the Sound Blaster standard, and Super I/O devices on motherboards that often integrated a built-in floppy disk controller, an IrDA infrared controller when FIR (fast infrared) mode is selected, and an IEEE 1284 parallel port controller when ECP mode is selected. In cases where an original 8237 or a direct compatible was still used, transfers to or from these devices may still be limited to the first 16 MB of main RAM regardless of the system's actual address space or amount of installed memory.

Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then instructs the DMA hardware to begin the transfer. When the transfer is complete, the device interrupts the CPU.

Scatter-gather or vectored I/O DMA allows the transfer of data to and from multiple memory areas in a single DMA transaction. It is equivalent to the chaining together of multiple simple DMA requests. The motivation is to off-load multiple input/output interrupt and data copy tasks from the CPU.

DRQ stands for ''data request''; DACK for ''data acknowledge''. These symbols, seen on hardware schematics of computer systems with DMA functionality, represent electronic signaling lines between the CPU and the DMA controller. Each DMA channel has one request and one acknowledge line. A device that uses DMA must be configured to use both lines of the assigned DMA channel. 16-bit ISA permitted bus mastering.

Standard ISA DMA assignments:

* 0: DRAM refresh (original PC and PC/XT); available on AT-class machines
* 1: user hardware (commonly a sound card's 8-bit DMA)
* 2: floppy disk controller
* 3: hard disk controller (PC/XT), ECP parallel port, or user hardware
* 4: cascade from the second DMA controller to the first
* 5: hard disk controller (PS/2), or 16-bit user hardware
* 6: 16-bit user hardware
* 7: 16-bit user hardware
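As a concrete illustration of the channel-programming sequence described above (set the address and count registers, select the direction, then start), the following C sketch arms channel 2 of the first 8237 using its conventional x86 I/O ports. It assumes an environment with port I/O access, e.g. <sys/io.h> after ioperm() on x86 Linux, or bare metal; in a real driver this would run with interrupts handled properly, and the buffer must sit below 16 MB and within a single 64 kB page.

```c
#include <stdint.h>
#include <sys/io.h>   /* outb(value, port); requires ioperm() on x86 Linux */

/* Conventional I/O ports of the first (8-bit) 8237 controller. */
#define DMA1_MASK      0x0A   /* single-channel mask register       */
#define DMA1_MODE      0x0B   /* mode register                      */
#define DMA1_CLEAR_FF  0x0C   /* clear byte-pointer flip-flop       */
#define DMA1_CH2_ADDR  0x04   /* channel 2 address register         */
#define DMA1_CH2_COUNT 0x05   /* channel 2 count register           */
#define DMA1_CH2_PAGE  0x81   /* channel 2 page register            */

/* Program ISA DMA channel 2 (traditionally the floppy controller) for a
 * single-cycle read-from-memory transfer of `len` bytes starting at the
 * 24-bit physical address `phys`. */
static void isa_dma2_setup(uint32_t phys, uint16_t len)
{
    uint16_t count = len - 1;            /* the 8237 counts N-1 transfers */

    outb(0x04 | 2, DMA1_MASK);           /* mask (disable) channel 2      */
    outb(0x00, DMA1_CLEAR_FF);           /* reset the byte flip-flop      */
    outb(0x48 | 2, DMA1_MODE);           /* single mode, read from memory */

    outb(phys & 0xFF, DMA1_CH2_ADDR);    /* address, low byte then high   */
    outb((phys >> 8) & 0xFF, DMA1_CH2_ADDR);
    outb((phys >> 16) & 0xFF, DMA1_CH2_PAGE);  /* page = address bits 16..23 */

    outb(count & 0xFF, DMA1_CH2_COUNT);  /* count, low byte then high     */
    outb((count >> 8) & 0xFF, DMA1_CH2_COUNT);

    outb(2, DMA1_MASK);                  /* unmask channel 2: armed       */
}
```

The channel then waits for the peripheral (here, the floppy controller) to raise DRQ2; the 8237 performs the transfer and the device raises an interrupt when the operation completes.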


PCI

A PCI architecture has no central DMA controller, unlike ISA. Instead, a PCI device can request control of the bus ("become the bus master") and request to read from and write to system memory. More precisely, a PCI component requests bus ownership from the PCI bus controller (usually a PCI host bridge or PCI-to-PCI bridge), which will arbitrate if several devices request bus ownership simultaneously, since there can only be one bus master at a time. When the component is granted ownership, it issues normal read and write commands on the PCI bus, which will be claimed by the PCI bus controller.

As an example, on an Intel Core-based PC, the southbridge will forward the transactions to the memory controller (which is integrated on the CPU die) using DMI, which will in turn convert them to DDR operations and send them out on the memory bus. As a result, there are quite a number of steps involved in a PCI DMA transfer; however, that poses little problem, since the PCI device or the PCI bus itself is an order of magnitude slower than the rest of the components (see list of device bandwidths).

A modern x86 CPU may use more than 4 GB of memory, either utilizing the native 64-bit mode of an x86-64 CPU or the Physical Address Extension (PAE), a 36-bit addressing mode. In such a case, a device using DMA with a 32-bit address bus is unable to address memory above the 4 GB line. The Double Address Cycle (DAC) mechanism, if implemented on both the PCI bus and the device itself, enables 64-bit DMA addressing. Otherwise, the operating system needs to work around the problem by either using costly double buffers (DOS/Windows nomenclature), also known as bounce buffers (FreeBSD/Linux), or it can use an IOMMU to provide address translation services if one is present.
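The bounce-buffer workaround can be sketched as follows. The helpers virt_to_phys, alloc_low_buffer and device_dma_to are hypothetical placeholders; real operating systems hide this decision inside their DMA-mapping layer and choose a bounce buffer or an IOMMU mapping on behalf of the driver.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical OS-level helpers for illustration only. */
extern uint64_t virt_to_phys(const void *vaddr);               /* physical address of a buffer   */
extern void    *alloc_low_buffer(size_t len);                  /* buffer guaranteed below 4 GB   */
extern void     device_dma_to(uint32_t bus_addr, size_t len);  /* start a 32-bit-address DMA     */

#define DEVICE_DMA_LIMIT ((uint64_t)1 << 32)   /* the device can only address 32 bits */

/* Hand `buf` to a 32-bit DMA-capable device.  If the buffer lies above the
 * 4 GB line, stage it through a bounce buffer in low memory first. */
void dma_to_device(const void *buf, size_t len)
{
    uint64_t phys = virt_to_phys(buf);

    if (phys + len <= DEVICE_DMA_LIMIT) {
        device_dma_to((uint32_t)phys, len);        /* directly reachable by the device */
    } else {
        void *bounce = alloc_low_buffer(len);      /* below the 4 GB line */
        memcpy(bounce, buf, len);                  /* the extra copy is the cost of bouncing */
        device_dma_to((uint32_t)virt_to_phys(bounce), len);
    }
}
```

With an IOMMU present, the else branch disappears: the OS instead maps the high buffer into a device-visible address window, avoiding the copy.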


I/OAT

As an example of a DMA engine incorporated in a general-purpose CPU, some Intel Xeon chipsets include a DMA engine called I/O Acceleration Technology (I/OAT), which can offload memory copying from the main CPU, freeing it to do other work. In 2006, Intel's Linux kernel developer Andrew Grover performed benchmarks using I/OAT to offload network traffic copies and found no more than 10% improvement in CPU utilization with receiving workloads.


DDIO

Further performance-oriented enhancements to the DMA mechanism have been introduced in Intel Xeon E5 processors with the Data Direct I/O (DDIO) feature, which allows the DMA "windows" to reside within CPU caches instead of system RAM. As a result, CPU caches are used as the primary source and destination for I/O, allowing network interface controllers (NICs) to DMA directly into the last-level cache (L3 cache) of local CPUs and avoid costly fetching of the I/O data from system RAM. DDIO thereby reduces the overall I/O processing latency, allows the I/O processing to be performed entirely in-cache, prevents the available RAM bandwidth/latency from becoming a performance bottleneck, and may lower power consumption by allowing the RAM to remain longer in a low-powered state.


AHB

In systems-on-a-chip and embedded systems, the typical system bus infrastructure is a complex on-chip bus such as the AMBA High-performance Bus (AHB). AMBA defines two kinds of AHB components: master and slave. A slave interface is similar to programmed I/O, through which the software (running on an embedded CPU, e.g. an ARM core) can write to or read from I/O registers or (less commonly) local memory blocks inside the device. A master interface can be used by the device to perform DMA transactions to/from system memory without heavily loading the CPU. Therefore, high-bandwidth devices such as network controllers that need to transfer huge amounts of data to/from system memory have two interface adapters to the AHB: a master and a slave interface. This is because on-chip buses like AHB do not support tri-stating the bus or alternating the direction of any line on the bus. Like PCI, no central DMA controller is required since the DMA is bus-mastering, but an arbiter is required when multiple masters are present on the system. Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent scatter-gather operations as programmed by the software.
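Such a scatter-gather engine is typically driven by a linked list of descriptors in memory. The layout below is hypothetical, as is bus_addr_of(); actual engines define their own field order, widths and alignment requirements, but the idea of chaining (address, length) fragments is the same.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory descriptor format for a scatter-gather DMA engine.
 * The engine walks the linked list, transferring each (address, length)
 * fragment in turn, so one programmed transaction covers many buffers. */
struct sg_descriptor {
    uint32_t src_addr;   /* bus address of this source fragment            */
    uint32_t dst_addr;   /* destination bus address                        */
    uint32_t length;     /* bytes in this fragment                         */
    uint32_t next;       /* bus address of the next descriptor, 0 = last   */
};

/* Placeholder for the platform's virtual-to-bus address mapping. */
extern uint32_t bus_addr_of(const void *p);

/* Chain an array of fragments into a descriptor list the engine can follow. */
void build_sg_chain(struct sg_descriptor *desc,
                    const uint32_t *src, const uint32_t *dst,
                    const uint32_t *len, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        desc[i].src_addr = src[i];
        desc[i].dst_addr = dst[i];
        desc[i].length   = len[i];
        desc[i].next     = (i + 1 < n) ? bus_addr_of(&desc[i + 1]) : 0;
    }
    /* The chain's head address would then be written to the engine's
     * descriptor-pointer register and the channel started. */
}
```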


Cell

As an example usage of DMA in a multiprocessor system-on-chip, IBM/Sony/Toshiba's Cell processor incorporates a DMA engine for each of its nine processing elements, including one Power processor element (PPE) and eight synergistic processor elements (SPEs). Since an SPE's load/store instructions can read/write only its own local memory, an SPE depends entirely on DMA to transfer data to and from the main memory and the local memories of other SPEs. Thus DMA acts as a primary means of data transfer among cores inside this CPU (in contrast to cache-coherent CMP architectures such as Intel's cancelled general-purpose GPU, Larrabee). DMA in Cell is fully cache coherent (note, however, that the local stores of the SPEs operated upon by DMA do not act as globally coherent caches in the standard sense). In both the read ("get") and write ("put") directions, a DMA command can transfer either a single block area of up to 16 KB, or a list of 2 to 2048 such blocks. The DMA command is issued by specifying a pair of a local address and a remote address: for example, when an SPE program issues a put DMA command, it specifies an address of its own local memory as the source and a virtual memory address (pointing to either the main memory or the local memory of another SPE) as the target, together with a block size. According to an experiment, the effective peak performance of DMA in Cell (at 3 GHz, under uniform traffic) reaches 200 GB per second.
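A "get" transfer from an SPE's point of view might look like the sketch below. It assumes the SPE-side MFC intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) from the Cell SDK's spu_mfcio.h header; treat it as an illustrative sketch of that interface rather than authoritative SDK usage.

```c
#include <stdint.h>
#include <spu_mfcio.h>   /* SPE-side MFC (DMA) intrinsics from the Cell SDK */

/* Local-store buffer; DMA addresses and sizes must be suitably aligned
 * (16-byte alignment is the usual minimum, 128 bytes is cache-line size). */
static volatile uint8_t buffer[16384] __attribute__((aligned(128)));

/* Pull up to 16 KB from an effective (main-memory) address into the SPE's
 * local store with a "get" DMA command, then wait for it to complete. */
void fetch_block(uint64_t effective_addr, uint32_t size)
{
    const uint32_t tag = 0;            /* tag group used for this transfer   */

    mfc_get(buffer, effective_addr, size, tag, 0, 0);

    mfc_write_tag_mask(1 << tag);      /* select which tag group to wait on  */
    mfc_read_tag_status_all();         /* block until the DMA has finished   */

    /* buffer[] now holds the fetched data; a double-buffered variant would
     * issue the next get on another tag before processing this block. */
}
```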


Pipelining

Processors with scratchpad memory and DMA (such as digital signal processors and the Cell processor) may benefit from software overlapping DMA memory operations with processing, via double buffering or multibuffering. For example, the on-chip memory is split into two buffers; the processor may be operating on data in one, while the DMA engine is loading and storing data in the other. This allows the system to avoid memory latency and exploit burst transfers, at the expense of needing a predictable memory access pattern.
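A minimal double-buffering loop looks like the sketch below. The dma_load and dma_wait functions are hypothetical stand-ins for whatever asynchronous DMA primitives the platform provides; the point is the overlap of the transfer into one buffer with the computation on the other.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 1024   /* elements per on-chip buffer */

/* Hypothetical asynchronous DMA helpers for a scratchpad-based processor. */
extern void dma_load(int32_t *scratch, const int32_t *external, size_t n); /* start transfer */
extern void dma_wait(const int32_t *scratch);                              /* wait for it    */

static void process(int32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2;                 /* stand-in for real computation */
}

/* Double buffering: while the CPU processes the chunk in one scratchpad
 * buffer, the DMA engine fetches the next chunk into the other, hiding
 * memory latency behind computation. */
void process_stream(const int32_t *external, size_t total_chunks)
{
    static int32_t buf[2][CHUNK];     /* two on-chip (scratchpad) buffers */

    dma_load(buf[0], external, CHUNK);                 /* prime buffer 0 */

    for (size_t c = 0; c < total_chunks; c++) {
        int32_t *cur  = buf[c & 1];
        int32_t *next = buf[(c + 1) & 1];

        if (c + 1 < total_chunks)                      /* prefetch the next chunk */
            dma_load(next, external + (c + 1) * CHUNK, CHUNK);

        dma_wait(cur);                                 /* make sure cur has arrived */
        process(cur, CHUNK);
        /* A full version would also DMA the processed chunk back out. */
    }
}
```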



External links


Mastering the DMA and IOMMU APIs, Embedded Linux Conference 2014, San Jose, by Laurent Pinchart