Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler also found use in the GK20A, the GPU component of the Tegra K1 SoC, as well as in the Quadro Kxxx series, the Quadro NVS 510, and Nvidia Tesla computing modules. Kepler was followed by the Maxwell microarchitecture and used alongside Maxwell in the GeForce 700 series and GeForce 800M series. The architecture is named after Johannes Kepler, a German mathematician and key figure in the 17th century scientific revolution.

Overview

Whereas Nvidia's previous architecture, Fermi, was designed to increase performance in compute and tessellation, with Kepler Nvidia focused on efficiency, programmability and performance. The efficiency aim was achieved through a unified GPU clock, simplified static scheduling of instructions and a greater emphasis on performance per watt. Abandoning the shader clock found in Nvidia's previous GPU designs increases efficiency, even though additional cores are needed to reach higher levels of performance. This is not only because the cores are more power-friendly (two Kepler cores use 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the unified GPU clock scheme delivers a 50% reduction in power consumption in that area. The programmability aim was achieved with Kepler's Hyper-Q, Dynamic Parallelism and multiple new Compute Capability 3.x features, which allow higher GPU utilization and simpler code management on GK GPUs and thus more flexibility in programming for Kepler. Finally, for the performance aim, additional execution resources (more CUDA cores, registers and cache) and Kepler's ability to reach a memory clock speed of 7 GHz increase performance compared to previous Nvidia GPUs.

Features

The GK Series GPU contains features from both the older Fermi and newer Kepler generations. Kepler-based members add the following standard features:
* PCI Express 3.0 interface
* DisplayPort 1.2
* HDMI 1.4a 4K x 2K video output
* PureVideo VP5 hardware video acceleration (up to 4K x 2K H.264 decode)
* Hardware H.264 encoding acceleration block (NVENC)
* Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround)
* Next Generation Streaming Multiprocessor (SMX)
* Polymorph-Engine 2.0
* Simplified Instruction Scheduler
* Bindless Textures
* CUDA Compute Capability 3.0 to 3.5
* GPU Boost (upgraded to 2.0 on GK110)
* TXAA support
* Manufactured by TSMC on a 28 nm process
* New shuffle instructions
* Dynamic Parallelism
* Hyper-Q (Hyper-Q's MPI functionality reserved for Tesla only)
* Grid Management Unit
* NVIDIA GPUDirect (GPUDirect's RDMA functionality reserved for Tesla only)

Next Generation Streaming Multiprocessor (SMX)

The Kepler architecture employs a new streaming multiprocessor architecture called "SMX". SMXs are the reason for Kepler's power efficiency, as the whole GPU uses a single unified clock speed. Although the unified clock increases power efficiency (two lower-clocked Kepler CUDA cores consume about 90% of the power of one higher-clocked Fermi CUDA core), additional processing units are needed to execute a whole warp per cycle. Doubling each CUDA array from 16 to 32 cores solves the warp execution problem. The SMX front-end is also doubled, with twice the warp schedulers and dispatch units and a register file doubled to 64K entries, to feed the additional execution units. Because doubling them as well would risk inflating die area, the SMX PolyMorph Engines are instead enhanced to version 2.0, enabling them to process polygons in fewer cycles. There are 192 shaders per SMX. Dedicated FP64 CUDA cores are also used, because the regular Kepler CUDA cores are not FP64-capable, which saves die space. With these SMX improvements, GPU performance and efficiency both increase. With GK110, the 48 KB texture cache is unlocked for compute workloads. In compute workloads the texture cache becomes a read-only data cache, specializing in unaligned memory access workloads. Furthermore, error detection capabilities have been added to make it safer for workloads that rely on ECC. The register count per thread is also raised in GK110, to 255 registers per thread.
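The register figures above imply a simple budget: the more registers each thread uses, the fewer threads fit on one SMX at a time. The following is a rough Python sketch of that arithmetic, not a full occupancy calculator. The 64K-entry register file and the 255-register limit come from the text; the limit of 2048 resident threads per SMX is an assumption taken from the published Compute Capability 3.x limits, and real hardware allocates registers in coarser units than modeled here.

```python
# Toy register-budget estimate for one Kepler SMX (illustrative only).
REGISTER_FILE_ENTRIES = 64 * 1024  # 64K registers per SMX (from the text)
MAX_REGS_PER_THREAD = 255          # GK110 per-thread limit (from the text)
MAX_RESIDENT_THREADS = 2048        # assumption: CC 3.x resident-thread limit
WARP_SIZE = 32


def resident_threads(regs_per_thread: int) -> int:
    """Threads that fit in the register file, rounded down to whole warps."""
    if not 1 <= regs_per_thread <= MAX_REGS_PER_THREAD:
        raise ValueError("regs_per_thread must be between 1 and 255")
    by_registers = REGISTER_FILE_ENTRIES // regs_per_thread
    whole_warps = (by_registers // WARP_SIZE) * WARP_SIZE
    return min(whole_warps, MAX_RESIDENT_THREADS)


if __name__ == "__main__":
    for regs in (32, 64, 255):
        print(regs, "registers per thread ->", resident_threads(regs), "threads")
```

Under these assumptions, a kernel using the full 255 registers per thread leaves room for only a handful of warps, which is why a high per-thread register limit helps register-hungry kernels but does not by itself raise occupancy.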

Simplified Instruction Scheduler

Additional die space and power savings were achieved by removing a complex hardware block that handled the prevention of data hazards.

GPU Boost

GPU Boost is a new feature which is roughly analogous to turbo boosting of a CPU. The GPU is always guaranteed to run at a minimum clock speed, referred to as the "base clock". This clock speed is set to the level which will ensure that the GPU stays within TDP specifications, even at maximum loads. When loads are lower, however, there is room for the clock speed to be increased without exceeding the TDP. In these scenarios, GPU Boost will gradually increase the clock speed in steps, until the GPU reaches a predefined power target (which is 170 W by default). By taking this approach, the GPU will ramp its clock up or down dynamically, so that it is providing the maximum amount of speed possible while remaining within TDP specifications. The power target, as well as the size of the clock increase steps that the GPU will take, are both adjustable via third-party utilities and provide a means of overclocking Kepler-based cards.
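The stepping behaviour described above can be sketched as a small control loop. This is a toy Python model, not Nvidia's algorithm: the linear power models, the 13 MHz step, the 1006 MHz base clock and the 1300 MHz cap are all hypothetical values chosen for illustration, and only the 170 W power target comes from the text.

```python
def boost_clock(base_mhz, step_mhz, power_target_w, power_at, max_mhz):
    """Raise the clock in fixed steps while estimated power stays within target.

    power_at is a caller-supplied (hypothetical) function mapping a clock in
    MHz to estimated board power in watts. The clock never drops below base.
    """
    clock = base_mhz
    while (clock + step_mhz <= max_mhz
           and power_at(clock + step_mhz) <= power_target_w):
        clock += step_mhz
    return clock


def light_load(clock_mhz):
    # Hypothetical: a light workload draws 0.14 W per MHz.
    return 0.14 * clock_mhz


def heavy_load(clock_mhz):
    # Hypothetical: a heavy workload draws 0.168 W per MHz.
    return 0.168 * clock_mhz


if __name__ == "__main__":
    print(boost_clock(1006, 13, 170, light_load, 1300))  # boosts above base
    print(boost_clock(1006, 13, 170, heavy_load, 1300))  # stays at base
```

In this model, raising the power target or changing the step size changes where the loop settles, which mirrors how third-party utilities are described as exposing those two settings.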

Microsoft Direct3D Support

Nvidia Fermi and Kepler GPUs of the GeForce 600 series support the Direct3D 11.0 specification. Nvidia originally stated that the Kepler architecture has full DirectX 11.1 support, which includes the Direct3D 11.1 path. The following "Modern UI" Direct3D 11.1 features, however, are not supported:
* Target-Independent Rasterization (2D rendering only)
* 16xMSAA Rasterization (2D rendering only)
* Orthogonal Line Rendering Mode
* UAV (Unordered Access View) in non-pixel-shader stages

According to Microsoft's definition, Direct3D feature level 11_1 must be supported completely; otherwise the Direct3D 11.1 path cannot be executed. The integrated Direct3D features of the Kepler architecture are the same as those of the GeForce 400 series Fermi architecture.

Next Microsoft Direct3D Support

NVIDIA Kepler GPUs of the GeForce 600/700 series support Direct3D 12 feature level 11_0.

TXAA Support

Exclusive to Kepler GPUs, TXAA is a new anti-aliasing method from Nvidia that is designed for direct implementation into game engines. TXAA is based on the MSAA technique and custom resolve filters. It is designed to address a key problem in games known as shimmering or temporal aliasing. TXAA resolves that by smoothing out the scene in motion, so that in-game scenes are cleared of aliasing and shimmering.

Shuffle Instructions

At a low level, GK110 gains additional instructions and operations to further improve performance. New shuffle instructions allow threads within a warp to share data without going back to memory, making the process much quicker than the previous load/share/store method. Atomic operations are also overhauled, speeding up their execution and adding some FP64 operations that were previously only available for FP32 data.
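To make the data movement concrete, here is a Python emulation of a warp-wide sum built on a shuffle-down exchange. It is only a sketch of the pattern: on real hardware all 32 lanes run in lockstep and the exchange is a single instruction, whereas here each lane's register is a list element. The function names `shfl_down` and `warp_reduce_sum` are illustrative and are not CUDA intrinsics.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Emulate a shuffle-down: lane i reads the register of lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]


def warp_reduce_sum(values):
    """Sum 32 per-lane values in five exchange steps; lane 0 holds the result."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane (32 values)")
    lanes = list(values)
    delta = WARP_SIZE // 2
    while delta > 0:
        received = shfl_down(lanes, delta)
        lanes = [own + other for own, other in zip(lanes, received)]
        delta //= 2
    return lanes[0]


if __name__ == "__main__":
    print(warp_reduce_sum(range(32)))
```

The reduction finishes in log2(32) = 5 exchange steps and never writes intermediate values to shared or global memory, which is the saving the paragraph above attributes to shuffle instructions.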

Hyper-Q

Hyper-Q expands GK110's hardware work queues from 1 to 32. With a single work queue, Fermi could be under-occupied at times because there was not enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. Hyper-Q also maps easily to MPI, a message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, Hyper-Q can be used to improve the efficiency of these algorithms without changing the code itself.
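The effect of false dependencies can be illustrated with a deliberately simplified Python model. It assumes that streams are assigned to hardware queues round-robin, that work sharing a queue runs back to back, and that separate queues always find idle SMX resources to run concurrently. None of these details come from the text, and the kernel durations are made-up units.

```python
def makespan(stream_durations, num_queues):
    """Total time to drain all streams under a toy hardware-queue model.

    stream_durations: one list of kernel durations per independent stream.
    Streams mapped to the same queue are serialized (a false dependency);
    different queues are assumed to run fully in parallel.
    """
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    queue_time = [0] * num_queues
    for index, kernels in enumerate(stream_durations):
        queue_time[index % num_queues] += sum(kernels)
    return max(queue_time)


if __name__ == "__main__":
    streams = [[3, 2], [4], [1, 1, 1], [5]]  # hypothetical kernel durations
    print("1 queue  :", makespan(streams, 1))
    print("32 queues:", makespan(streams, 32))
```

With one queue the four independent streams take as long as their combined work, while with 32 queues the total is bounded by the longest single stream. The model overstates the benefit whenever the GPU is already full, so it is best read as an upper bound on what extra queues can recover.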

Dynamic Parallelism

Dynamic Parallelism is the ability of kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead by having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 can both save time by not having to go back to the CPU and, in the process, free up the CPU to work on other tasks.

Grid Management Unit

Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The CUDA Work Distributor in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

NVIDIA GPUDirect

NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer‐to‐Peer and GPUDirect for Video.

Video decompression/compression

NVDEC

NVENC

NVENC is Nvidia's power-efficient fixed-function encoder, which is able to decode, preprocess, and encode H.264-based content. Its output is limited to H.264, but within that format NVENC supports encoding at resolutions up to 4096x4096. Like Intel's Quick Sync, NVENC is currently exposed through a proprietary API, though Nvidia does have plans to provide NVENC usage through CUDA.

Performance

The theoretical single-precision processing power of a Kepler GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × core clock speed (in GHz). Note that, like the previous generation Fermi, Kepler is not able to benefit from increased processing power by dual-issuing MAD+MUL as Tesla was capable of. The theoretical double-precision processing power of a Kepler GK110/210 GPU is 1/3 of its single-precision performance. This double-precision processing power is, however, only available on professional Quadro, Tesla, and high-end TITAN-branded GeForce cards, while drivers for consumer GeForce cards limit the performance to 1/24 of the single-precision performance. The lower-performance GK10x chips are similarly capped to 1/24 of the single-precision performance.
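The formula and the two double-precision ratios can be worked through in a few lines of Python. The configuration used in the example (15 SMX units of 192 cores each, clocked at 0.875 GHz) is an illustrative assumption rather than a figure quoted in this article, and the function names are made up for the sketch.

```python
def single_precision_gflops(cuda_cores, clock_ghz):
    """Peak FP32 rate: 2 operations per FMA x cores x clock in GHz."""
    return 2 * cuda_cores * clock_ghz


def double_precision_gflops(cuda_cores, clock_ghz, divisor):
    """Peak FP64 rate as a fraction (1/divisor) of the FP32 rate.

    divisor is 3 for uncapped GK110/210 parts and 24 for capped parts.
    """
    return single_precision_gflops(cuda_cores, clock_ghz) / divisor


if __name__ == "__main__":
    cores = 15 * 192   # assumed: 15 SMX units with 192 CUDA cores each
    clock = 0.875      # assumed core clock in GHz
    print("FP32       :", single_precision_gflops(cores, clock))
    print("FP64 (1/3) :", double_precision_gflops(cores, clock, 3))
    print("FP64 (1/24):", double_precision_gflops(cores, clock, 24))
```

For this assumed configuration the same silicon yields 1680 GFLOPS of double precision when uncapped and 210 GFLOPS when limited to 1/24, an eightfold difference that comes entirely from the cap.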

Kepler chips

* GK104
* GK106
* GK107
* GK110
* GK208
* GK210
* GK20A (Tegra K1)
