Parallel Thread Execution (PTX or NVPTX) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's CUDA programming environment. The NVCC compiler translates code written in CUDA, a C++-like language, into PTX instructions (an assembly language represented as ASCII text), and the graphics driver contains a compiler which translates the PTX instructions into executable binary code that can be run on the processing cores of Nvidia GPUs. The GNU Compiler Collection also has basic ability for PTX generation in the context of OpenMP offloading. Inline PTX assembly can be used in CUDA.
Registers
PTX uses an arbitrarily large register set; the output from the compiler is almost pure static single-assignment (SSA) form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form:
.reg .u32 %r<335>; // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer
PTX is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are preceded by a % character and constants are written as literals, e.g.:
shr.u64 %rd14, %rd12, 32; // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14
cvt.u64.u32 %rd142, %r112; // convert an unsigned 32-bit integer to 64-bit
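The two instructions above can be modelled on the host to make their effect concrete. The following is a minimal sketch in Python, not Nvidia's implementation; the function names `shr_u64` and `cvt_u64_u32` are hypothetical helpers chosen to mirror the PTX mnemonics, and the clamping of oversized shift amounts is included on the assumption that it matches the behaviour described in the PTX ISA documentation.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def shr_u64(a: int, b: int) -> int:
    """Model of shr.u64 d, a, b: logical right shift of an unsigned 64-bit value."""
    a &= MASK64          # treat the source as an unsigned 64-bit register
    b &= MASK32          # the shift amount is an unsigned 32-bit value
    if b >= 64:          # assumption: oversized shifts are clamped to the width
        return 0
    return a >> b


def cvt_u64_u32(a: int) -> int:
    """Model of cvt.u64.u32 d, a: zero-extend an unsigned 32-bit value to 64 bits."""
    return a & MASK32    # the upper 32 bits of the result are zero


if __name__ == "__main__":
    rd12 = 0x0123456789ABCDEF
    print(hex(shr_u64(rd12, 32)))        # upper half of rd12
    print(hex(cvt_u64_u32(0xFFFFFFFF)))  # stays 0xffffffff, not sign-extended
```

Shifting a 64-bit register right by 32, as in the example above, is a common way to extract its upper 32-bit half.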
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is
@%p14 bra $label; // branch to $label if predicate register %p14 is true
The setp.cc.type instruction sets a predicate register to the result of comparing two registers of appropriate type. There is also a set instruction, which writes the result to an ordinary register instead: set.le.u32.u64 %r101, %rd12, %rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28. Otherwise %r101 is set to 0x00000000.
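The difference between the two comparison instructions can be illustrated with a small host-side model. This is only a sketch in Python under the semantics stated above; `setp_le_u64` and `set_le_u32_u64` are hypothetical names that mirror the PTX mnemonics, not part of any Nvidia API.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def setp_le_u64(a: int, b: int) -> bool:
    """Model of setp.le.u64 p, a, b: the result is a one-bit predicate."""
    return (a & MASK64) <= (b & MASK64)


def set_le_u32_u64(a: int, b: int) -> int:
    """Model of set.le.u32.u64 d, a, b: the result is an all-ones or all-zeros mask."""
    return 0xFFFFFFFF if setp_le_u64(a, b) else 0x00000000


if __name__ == "__main__":
    rd12, rd28 = 5, 7
    print(setp_le_u64(rd12, rd28))          # predicate form
    print(hex(set_le_u32_u64(rd12, rd28)))  # mask form, all bits set
    print(hex(set_le_u32_u64(rd28, rd12)))  # mask form, all bits clear
```

The mask form is convenient when the comparison result feeds bitwise arithmetic rather than a branch.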
There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.
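A common use of these pseudoregisters is to derive a unique global index for each thread. The sketch below models that arithmetic in Python for the x dimension only; the function names are hypothetical, and the formula (block index times block dimension plus thread index) is the conventional CUDA idiom rather than something mandated by PTX itself.

```python
def global_index_x(tid_x: int, ntid_x: int, ctaid_x: int) -> int:
    """Global x index of a thread: %ctaid.x * %ntid.x + %tid.x."""
    return ctaid_x * ntid_x + tid_x


def enumerate_grid_x(nctaid_x: int, ntid_x: int) -> list:
    """Global indices of every thread in a 1-D grid of nctaid_x blocks."""
    return [
        global_index_x(tid_x, ntid_x, ctaid_x)
        for ctaid_x in range(nctaid_x)   # block index, 0 .. %nctaid.x - 1
        for tid_x in range(ntid_x)       # thread index, 0 .. %ntid.x - 1
    ]


if __name__ == "__main__":
    # Two blocks of four threads each cover indices 0 through 7 exactly once.
    print(enumerate_grid_x(nctaid_x=2, ntid_x=4))
```

In PTX the same computation is typically a single multiply-add on the .x components of the three pseudoregisters.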
State spaces
Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param.
There are eight state spaces:
* .reg: registers
* .sreg: special, read-only, platform-specific registers
* .const: shared, read-only memory
* .global: global memory, shared by all threads
* .local: local memory, private to each thread
* .param: parameters passed to the kernel
* .shared: memory shared between threads in a block
* .tex: global texture memory (deprecated)
Shared memory is declared in the PTX file via lines at the start of the file, of the form:
.shared .align 8 .b8 pbatch_cache[15744]; // define 15,744 bytes, aligned to an 8-byte boundary
Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, which is typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, though GPU Ocelot is no longer actively maintained.
See also
* Standard Portable Intermediate Representation (SPIR)
* CUDA binary (cubin) – a type of fat binary
External links
PTX ISA page on NVIDIA Developer Zone