
A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads of a thread block all run on the same streaming multiprocessor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.

Multiple blocks are combined to form a grid. All the blocks in the same grid contain the same number of threads. The number of threads in a block is limited, but grids can be used for computations that require a large number of thread blocks to operate in parallel and to use all available multiprocessors.
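As an illustration of this block-level cooperation, the following is a minimal sketch of a kernel that sums a 256-element tile of an array inside one block using shared memory and a barrier. The kernel name, the tile size, and the reduction pattern are illustrative assumptions, not part of the original text; the block is assumed to be launched with exactly 256 threads.

// Sketch: each block reduces a 256-element tile of the input into one partial
// sum, using shared memory and __syncthreads(). Assumes blockDim.x == 256.
__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];                  // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             // barrier: the whole tile is loaded

    // Tree reduction inside the block; every step is separated by a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];         // one result per thread block
}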
CUDA is a parallel computing platform and programming model that higher level languages can use to exploit parallelism. In CUDA, the kernel is executed with the aid of threads. The thread is an abstract entity that represents the execution of the kernel. A kernel is a function that is compiled to run on the device (typically a GPU). Multi-threaded applications use many such threads running at the same time to organize parallel computation. Every thread has an index, which is used for calculating memory address locations and also for taking control decisions.
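For concreteness, the following hedged sketch shows a trivial kernel (the kernel name and the printed format are illustrative assumptions) together with the built-in index variables that every thread can read:

#include <cstdio>

// Hypothetical kernel: every thread prints its own coordinates.
__global__ void whoAmI()
{
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main()
{
    whoAmI<<<2, 4>>>();        // 2 blocks, 4 threads each: 8 threads in total
    cudaDeviceSynchronize();   // wait for the kernel (and its printf output)
    return 0;
}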


Dimensions

CUDA operates on a heterogeneous programming model which is used to run host and device application programs. It has an execution model that is similar to OpenCL. In this model, we start executing an application on the host, which is usually a CPU. The device is a throughput-oriented device, i.e., a GPU, which performs parallel computations. Kernel functions are used to do these parallel executions. Once these kernel functions have executed, control is passed back to the host, which resumes serial execution.

As many parallel applications involve multidimensional data, it is convenient to organize thread blocks into 1D, 2D or 3D arrays of threads. The blocks in a grid must be able to be executed independently, as communication or cooperation between blocks in a grid is not possible. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this, in turn, defines the total number of CUDA threads launched. The maximum x, y and z dimensions of a block are 1024, 1024 and 64, and they must satisfy x × y × z ≤ 1024, which is the maximum number of threads per block. Blocks can be organized into one-, two- or three-dimensional grids of up to 2³¹ − 1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively. Unlike the maximum threads per block, there is no blocks-per-grid limit distinct from the maximum grid dimensions.
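A hedged sketch of how these dimensions appear in a launch configuration; the kernel name and problem sizes are assumptions made for illustration:

// Assumed 2D problem size for illustration.
const int width = 2000, height = 1000;

dim3 blockDim(16, 16);                       // 16 × 16 × 1 = 256 threads per block (≤ 1024)
dim3 gridDim((width  + blockDim.x - 1) / blockDim.x,   // enough blocks to cover the data
             (height + blockDim.y - 1) / blockDim.y);

// Launch: gridDim blocks, each containing blockDim threads.
myKernel<<<gridDim, blockDim>>>(/* arguments */);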


Indexing


1D-indexing

Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array. Consider an example in which there is an array of 512 elements. One possible organization is a grid with a single block of 512 threads. Consider an array C of 512 elements that is made of the element-wise multiplication of two arrays A and B, both of 512 elements each. Every thread has an index i, and it performs the multiplication of the ith elements of A and B and then stores the result in the ith element of C. i is calculated using blockIdx (which is 0 in this case, as there is only one block), blockDim (512 in this case, as the block has 512 threads) and threadIdx, which varies from 0 to 511 within each block. The thread index i is calculated by the following formula:

i = blockIdx.x * blockDim.x + threadIdx.x

where blockIdx.x is the block identifier in the x dimension, blockDim.x is the block dimension (threads per block) in the x dimension, and threadIdx.x is the thread identifier in the x dimension.

Thus i will have values ranging from 0 to 511, which covers the entire array. If we want to consider computations for an array that is larger than 1024, we can have multiple blocks with 1024 threads each. Consider an example with 2048 array elements. In this case we have 2 thread blocks with 1024 threads each. The thread identifiers will vary from 0 to 1023 in each block, the block identifier will vary from 0 to 1, and the block dimension will be 1024. Thus the first block will get index values from 0 to 1023 and the second one will get index values from 1024 to 2047. Each thread therefore first calculates the index of the memory it has to access and then proceeds with the calculation. Consider an example in which elements from arrays A and B are added in parallel by using threads and the result is stored in an array C. The corresponding code in a thread is shown below:

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
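A hedged sketch of the host-side code that could launch this kernel for the 2048-element case discussed above; the helper name and the host array parameters are assumptions, and error handling is omitted for brevity:

// Hypothetical host-side helper; h_A, h_B and h_C are host arrays of length n.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 1024;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // 2 blocks when n = 2048
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);               // collect the result
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}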


2D-indexing

In the same way, in more complex grids, the blockId as well as the threadId need to be calculated by each thread depending on the geometry of the grid. Consider a 2-dimensional grid with 2-dimensional blocks. The threadId and the blockId are calculated by the following formulae:

blockId = blockIdx.x + blockIdx.y * gridDim.x;
threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
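As a hedged sketch, a kernel (the name and output array are assumptions) that linearizes a 2D grid of 2D blocks into a flat index using exactly these formulae:

__global__ void flattenIndexKernel(float *out, int total)
{
    // Linear block index within the 2D grid.
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    // Global linear thread index, accounting for the block's 2D layout.
    int threadId = blockId * (blockDim.x * blockDim.y)
                 + (threadIdx.y * blockDim.x) + threadIdx.x;
    if (threadId < total)
        out[threadId] = threadId;   // e.g. write each thread's own flat index
}

// Possible launch: a 4 × 4 grid of 16 × 16 blocks covers 4096 threads.
// dim3 grid(4, 4); dim3 block(16, 16);
// flattenIndexKernel<<<grid, block>>>(d_out, 4096);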


Hardware perspective

Although we have stated the hierarchy of threads, we should note that threads, thread blocks and grids are essentially a programmer's perspective. To get a complete picture of a thread block, it is critical to know it from a hardware perspective. The hardware groups threads that execute the same instruction into warps. Several warps constitute a thread block. Several thread blocks are assigned to a streaming multiprocessor (SM). Several SMs constitute the whole GPU unit (which executes the whole kernel grid).


Streaming multiprocessors

Each GPU architecture (say, Kepler or Fermi) consists of several SMs, or streaming multiprocessors. These are general-purpose processors with a low clock rate target and a small cache. An SM is able to execute several thread blocks in parallel. As soon as one of its thread blocks has completed execution, it takes up the serially next thread block. In general, SMs support instruction-level parallelism but not branch prediction. To achieve this purpose, an SM contains the following:

* Execution cores (single-precision floating-point units, double-precision floating-point units, special function units (SFUs)).
* Caches:
# L1 cache (for reducing memory access latency).
# Shared memory (for data shared between threads).
# Constant cache (for broadcasting of reads from a read-only memory).
# Texture cache (for aggregating bandwidth from texture memory).
* Schedulers for warps (these are for issuing instructions to warps based on particular scheduling policies).
* A substantial number of registers (an SM may be running a large number of active threads at a time, so it must have registers in the thousands).

The hardware schedules thread blocks to an SM. In general, an SM can handle multiple thread blocks at the same time; it may hold up to 8 resident thread blocks in total (the exact limit depends on the compute capability). A thread ID is assigned to a thread by its respective SM. Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time. Hence, to free the memory of a thread block inside the SM, it is critical that the entire set of threads in the block has concluded execution. Each thread block is divided into scheduled units known as warps. These are discussed in detail in the following section. The warp scheduler of the SM decides which warp gets prioritized during issuance of instructions. Some warp prioritization policies are also discussed in the following sections.
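As a hedged illustration, the per-SM resources described above can be inspected through the CUDA runtime; the printed labels are illustrative, while the API call and struct fields are standard:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("SMs (multiprocessors):   %d\n", prop.multiProcessorCount);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:        %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}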


Warps

On the hardware side, a thread block is composed of warps. A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM.

Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes. Thus a new block is not launched on an SM until there is a sufficient number of free registers for all warps of the new block, and until there is enough free shared memory for the new block.

Consider a warp of 32 threads executing an instruction. If one or both of its operands are not ready (e.g. have not yet been fetched from global memory), a process called context switching takes place which transfers control to another warp. When switching away from a particular warp, all the data of that warp remains in the register file so that it can be quickly resumed when its operands become ready. When an instruction has no outstanding data dependencies, that is, both of its operands are ready, the respective warp is considered to be ready for execution. If more than one warp is eligible for execution, the parent SM uses a warp scheduling policy for deciding which warp gets the next fetched instruction. Different policies for scheduling warps that are eligible for execution are discussed below:

# Round robin (RR) - Instructions are fetched in a round-robin manner. RR makes sure that SMs are kept busy and no clock cycles are wasted on memory latencies.
# Least recently fetched (LRF) - In this policy, the warp for which an instruction has not been fetched for the longest time gets priority in the fetching of an instruction.
# Fair (FAIR) - In this policy, the scheduler makes sure all warps are given a 'fair' opportunity in the number of instructions fetched for them. It fetches an instruction for the warp for which the minimum number of instructions have been fetched.
# Thread block-based CAWS (criticality-aware warp scheduling) - The emphasis of this scheduling policy is on improving the execution time of the thread blocks. It allocates more time resources to the warp that will take the longest time to execute. By giving priority to the most critical warp, this policy allows thread blocks to finish faster, so that the resources become available sooner.

Traditional CPU thread context switching requires saving and restoring allocated register values and the program counter to off-chip memory (or cache) and is therefore a much more heavyweight operation than warp context switching. All of a warp's register values (including its program counter) remain in the register file, and the shared memory (and cache) remain in place too, since these are shared between all the warps in the thread block.

In order to take advantage of the warp architecture, programming languages and developers need to understand how to coalesce memory accesses and how to manage control flow divergence. If each thread in a warp takes a different execution path, or if each thread accesses significantly divergent memory, then the benefits of the warp architecture are lost and performance will significantly degrade.
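As a hedged illustration of both pitfalls (the kernel names and access patterns are assumptions, not from the original text), the first kernel below diverges within every warp and strides through memory, while the second keeps each warp on a single path with contiguous accesses:

// Poor pattern: threads of the same warp take different branches, and
// consecutive threads touch memory locations 32 floats apart (uncoalesced).
__global__ void divergentStrided(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;              // strided, non-contiguous access
    if (i % 2 == 0)                    // even/odd split inside every warp
        data[j] = data[j] * 2.0f;
    else
        data[j] = data[j] + 1.0f;
}

// Better pattern: a whole warp follows the same path, and thread k of a warp
// touches element k of a contiguous segment (coalesced access).
__global__ void uniformCoalesced(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // whole warps (except possibly the last) agree
        data[i] = data[i] * 2.0f + 1.0f;
}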


See also

* Parallel computing
* CUDA
* Thread (computing)
* Graphics processing unit

