Coherent Accelerator Processor Interface (CAPI), is a high-speed processor expansion bus standard for use in large

data center A data center (American English) or data centre (British English)See spelling differences. is a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunic ...

computers, initially designed to be layered on top of

PCI Express PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe or PCI-e, is a high-speed serial computer expansion bus standard, designed to replace the older PCI, PCI-X and AGP bus standards. It is the common ...

, for directly connecting

central processing unit A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...

s (CPUs) to external accelerators like

graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobi ...

s (GPUs), ASICs,

FPGAs A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturinghence the term '' field-programmable''. The FPGA configuration is generally specified using a hardware de ...

or fast storage. It offers low latency, high speed, direct memory access connectivity between devices of different

instruction set architecture In computer science, an instruction set architecture (ISA), also called computer architecture, is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an ' ...

History

The performance scaling traditionally associated with

Moore's Law Moore's law is the observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years. Moore's law is an observation and projection of a historical trend. Rather than a law of physics, it is an empir ...

—dating back to 1965—began to taper off around 2004, as both Intel's Prescott architecture and IBM's Cell processor pushed toward a 4 GHz operating frequency. Here both projects ran into a thermal scaling wall, whereby heat extraction problems associated with further increases in operating frequency largely outweighed gains from shorter cycle times. Over the decade that followed, few commercial CPU products exceeded 4 GHz, with the majority of performance improvements now coming from incrementally improved microarchitectures, better systems integration, and higher compute density—this largely in the form of packing a larger numbers of independent cores onto the same die, often at the ''expense'' of peak operating frequency (Intel's 24-core Xeon E7-8890 from June 2016 has a base operating frequency of just 2.2 GHz, so as to operate within the constraints of a single-socket 165 W power consumption and cooling budget). Where large performance gains have been realized, it was often associated with increasingly specialized compute units, such as GPU units added to the processor die, or external GPU- or FPGA-based accelerators. In many applications, accelerators struggle with limitations of the interconnect's performance (bandwidth and latency) or with limitations due to the interconnect's architecture (such as lacking memory coherence). Especially in the datacenter, improving the interconnect became paramount in moving toward a heterogeneous architecture in which hardware becomes increasingly tailored to specific compute workloads. CAPI was developed to enable computers to more easily and efficiently attach specialized accelerators. Memory intensive and computation intensive works like

matrix multiplication In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the s ...

s for deep

neural network A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...

s can be offloaded into CAPI-supported platforms. It was designed by IBM for use in its

POWER8 POWER8 is a family of superscalar multi-core microprocessors based on the Power ISA, announced in August 2013 at the Hot Chips conference. The designs are available for licensing under the OpenPOWER Foundation, which is the first time for s ...

based systems which came to market in 2014. At the same time, IBM and several other companies founded the

OpenPOWER Foundation The OpenPOWER Foundation is a collaboration around Power ISA-based products initiated by IBM and announced as the "OpenPOWER Consortium" on August 6, 2013. IBM is opening up technology surrounding their Power Architecture offerings, such as pro ...

to build an ecosystem around

Power Power most often refers to: * Power (physics), meaning "rate of doing work" ** Engine power, the power put out by an engine ** Electric power * Power (social and political), the ability to influence people or events ** Abusive power Power may a ...

based technologies, including CAPI. In October 2016 several OpenPOWER partners formed the ''OpenCAPI Consortium'' together with GPU and CPU designer

AMD Advanced Micro Devices, Inc. (AMD) is an American multinational semiconductor company based in Santa Clara, California, that develops computer processors and related technologies for business and consumer markets. While it initially manufactur ...

and systems designers

Dell EMC Dell EMC (EMC Corporation until 2016) is an American multinational corporation headquartered in Hopkinton, Massachusetts and Round Rock, Texas, United States. Dell EMC sells data storage, information security, virtualization, analytics, cloud c ...

and

Hewlett Packard Enterprise The Hewlett Packard Enterprise Company (HPE) is an American multinational information technology company based in Spring, Texas, United States. HPE was founded on November 1, 2015, in Palo Alto, California, as part of the splitting of the H ...

to spread the technology beyond the scope of OpenPOWER and IBM. On August 1, 2022, OpenCAPI specifications and assets were transferred to the

Compute Express Link Compute Express Link (CXL) is an open standard for high-speed central processing unit (CPU)-to-device and CPU-to-memory connections, designed for high performance data center computers. CXL is built on the PCI Express (PCIe) physical and electr ...

(CXL) Consortium.

Implementation

CAPI

CAPI is implemented as a functional unit inside the CPU, called the Coherent Accelerator Processor Proxy (CAPP) with a corresponding unit on the accelerator called the Power Service Layer (PSL). The CAPP and PSL units acts like a cache directory so the attached device and the CPU can share the same coherent memory space, and the accelerator becomes an Accelerator Function Unit (AFU), a peer to other functional units integrated in the CPU.Reconfigurable Accelerators for Big Data and Cloud – RAW 2016
/ref> Since the CPU and AFU share the same memory space, low latency and high speeds can be achieved since the CPU doesn't have to do memory translations and memory shuffling between the CPU's main memory and the accelerator's memory spaces. An application can make use of the accelerator without specific device drivers as everything is enabled by a general CAPI kernel extension in the host operating system. The CPU and PSL can read and write directly to each other's memories and registers, as demanded by the application.

CAPI

CAPI is layered on top of PCIe Gen 3, using 16 PCIe lanes, and is an additional functionality for the PCIe slots on CAPI enabled systems. Usually there are designated CAPI enabled PCIe slots on such machines. Since there is only one CAPP per POWER8 processor the number of possible CAPI units are determined by the number of POWER8 processors, regardless of how many PCIe slots there are. In certain POWER8 systems, IBM makes use of dual chip modules, thus doubling the CAPI capacity per processor socket. Traditional transactions between a PCIe device and a CPU can take around 20,000 operations, whereas a CAPI attached device will only use around 500, significantly reducing latency, and effectively increasing bandwidth due to decreased operations overhead. The total bandwidth of a CAPI port is determined by the underlying PCIe 3.0 x16 technology, peaking at ca 16 GB/s, bidirectional.Opening Up The Server Bus For Coherent Acceleration
/ref>

CAPI 2

CAPI-2 is an incremental evolution of the technology introduced with IBM POWER9 processor. It runs on top of PCIe Gen 4 that effectively doubles the performance to 32 GB/s. It also introduces some new features like support for DMA and Atomics from the accelerator.

OpenCAPI

The technology behind OpenCAPI is governed by the ''OpenCAPI Consortium'', founded in October 2016 by

Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...

, IBM,

Mellanox Mellanox Technologies Ltd. ( he, מלאנוקס טכנולוגיות בע"מ) was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, softwa ...

and

Micron The micrometre ( international spelling as used by the International Bureau of Weights and Measures; SI symbol: μm) or micrometer (American spelling), also commonly known as a micron, is a unit of length in the International System of Unit ...

together with partners

Nvidia Nvidia CorporationOfficially written as NVIDIA and stylized in its logo as VIDIA with the lowercase "n" the same height as the uppercase "VIDIA"; formerly stylized as VIDIA with a large italicized lowercase "n" on products from the mid 1990s to ...

and

Xilinx Xilinx, Inc. ( ) was an American technology and semiconductor company that primarily supplied programmable logic devices. The company was known for inventing the first commercially viable field-programmable gate array (FPGA) and creating the fi ...

OpenCAPI 3

OpenCAPI, formerly ''New CAPI'' or ''CAPI 3.0'', is not layered on top of PCIe and will therefore not use PCIe slots. In IBM's CPU

POWER9 POWER9 is a family of superscalar, multithreading, multi-core microprocessors produced by IBM, based on the Power ISA. It was announced in August 2016. The POWER9-based processors are being manufactured using a 14 nm FinFET process, in ...

it will use the ''Bluelink 25G'' I/O facility that it shares with NVLink 2.0, peaking at 50 GB/s. OpenCAPI doesn't need the PSL unit (required for CAPI 1 and 2) in the accelerator, as it's not layered on top of PCIe but uses its own transaction protocol.

OpenCAPI 4

Planned for future chip after the General Availability of POWER9.
Slides
_(PDF)
AIX VUG page
has links to slides and video

OMI

OpenCAPI Memory Interface (OMI) is a serial attached

RAM Ram, ram, or RAM may refer to: Animals * A male sheep * Ram cichlid, a freshwater tropical fish People * Ram (given name) * Ram (surname) * Ram (director) (Ramsubramaniam), an Indian Tamil film director * RAM (musician) (born 1974), Dutch * ...

technology based on OpenCAPI, providing

low latency Latency, from a general point of view, is a time delay between the cause and the effect of some physical change in the system being observed. Lag, as it is known in gaming circles, refers to the latency between the input to a simulation and ...

, high bandwidth connection for main memory. OMI uses a controller chip on the memory modules that allows for technology agnostic approach to what is used on the modules, be it

DDR4 Double Data Rate 4 Synchronous Dynamic Random-Access Memory (DDR4 SDRAM) is a type of synchronous dynamic random-access memory with a high bandwidth (" double data rate") interface. Released to the market in 2014, it is a variant of dynamic ra ...

DDR5 Double Data Rate 5 Synchronous Dynamic Random-Access Memory (DDR5 SDRAM) is a type of synchronous dynamic random-access memory. Compared to its predecessor DDR4 SDRAM Double Data Rate 4 Synchronous Dynamic Random-Access Memory (DDR4 SDRAM) ...

, HBM or storage class

non-volatile RAM Non-volatile random-access memory (NVRAM) is random-access memory that retains data without applied power. This is in contrast to dynamic random-access memory (DRAM) and static random-access memory (SRAM), which both maintain data only for as lo ...

. An OMI based CPU can therefore change RAM type by changing the memory modules. A serial connection uses less floorspace for the interface on the CPU die therefore potentially allowing more of them compared to using common DDR memory. OMI is implemented in IBM's

Power10 Power10 is a superscalar, multithreading, multi-core microprocessor family, based on the open source Power ISA, and announced in August 2020 at the Hot Chips conference; systems with Power10 CPUs. Generally available from September 2021 ...

CPU, which has 8 OMI memory controllers on-chip, allowing for 4 TB RAM and 410 GB/s memory bandwidth per processor. These DDIMMs (Differential Dynamic Memory Module) includes a OMI controller and memory buffer, and can address individual memory chips for fault tolerance and redundancy purposes.

Microchip Technology Microchip Technology Inc. is a publicly-listed American corporation that manufactures microcontroller, mixed-signal, analog and Flash-IP integrated circuits. Its products include microcontrollers ( PIC, dsPIC, AVR and SAM), Serial EEPROM ...

manufactures the OMI controller on the DDIMMs. Their SMC 1000 OpenCAPI memory is described as "the next progression in the market adopting serial attached memory."

References

External links

OpenCAPI Consortium

Open Memory Interface (OMI)
{{Computer-bus Peripheral Component Interconnect Serial buses Motherboard expansion slot

History

Implementation

CAPI

CAPI

CAPI 2

OpenCAPI

OpenCAPI 3

OpenCAPI 4

OMI

See also

References

External links