In
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (
SIMD
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it shoul ...
)
instruction set
In computer science, an instruction set architecture (ISA), also called computer architecture, is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an ' ...
extension to the
x86
x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introd ...
architecture, designed by
Intel
Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 seri ...
and introduced in 1999 in their
Pentium III
The Pentium III (marketed as Intel Pentium III Processor, informally PIII or P3) brand refers to Intel's 32-bit x86 desktop and mobile CPUs based on the sixth-generation P6 microarchitecture introduced on February 28, 1999. The brand's initial p ...
series of
Central processing unit
A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...
s (CPUs) shortly after the appearance of
Advanced Micro Devices
Advanced Micro Devices, Inc. (AMD) is an American multinational semiconductor company based in Santa Clara, California, that develops computer processors and related technologies for business and consumer markets. While it initially manufact ...
(AMD's)
3DNow!. SSE contains 70 new instructions (65 unique mnemonics using 70 encodings), most of which work on
single precision
Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A floating- ...
floating-point
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can b ...
data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are
digital signal processing
Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are ...
and
graphics processing
Computer graphics is a sub-field of computer science which studies methods for digitally synthesizing and manipulating visual content. Although the term often refers to the study of three-dimensional computer graphics, it also encompasses two-di ...
.
Intel's first
IA-32
IA-32 (short for "Intel Architecture, 32-bit", commonly called i386) is the 32-bit version of the x86 instruction set architecture, designed by Intel and first implemented in the 80386 microprocessor in 1985. IA-32 is the first incarnation o ...
SIMD effort was the
MMX instruction set. MMX had two main problems: it re-used existing
x87 floating-point registers making the CPUs unable to work on both floating-point and SIMD data at the same time, and it only worked on
integers
An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign (−1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the language o ...
. SSE floating-point instructions operate on a new independent register set, the XMM registers, and adds a few integer instructions that work on MMX registers.
SSE was subsequently expanded by Intel to
SSE2,
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revis ...
,
SSSE3
Supplemental Streaming SIMD Extensions 3 (SSSE3 or SSE3S) is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.
History
SSSE3 was first introduced with Intel processors based on the Core microarchitectu ...
and
SSE4. Because it supports floating-point math, it had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.
SSE was originally called Katmai New Instructions (KNI),
Katmai being the code name for the first Pentium III core revision. During the Katmai project Intel sought to distinguish it from their earlier product line, particularly their flagship
Pentium II
The Pentium II brand refers to Intel's sixth-generation microarchitecture (" P6") and x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors (27.4 million in the case of the mobile Dixon with 256 KB ...
. It was later renamed Internet Streaming SIMD Extensions (ISSE
), then SSE. AMD eventually added support for SSE instructions, starting with its
Athlon XP
Athlon is the brand name applied to a series of x86-compatible microprocessors designed and manufactured by Advanced Micro Devices (AMD). The original Athlon (now called Athlon Classic) was the first seventh-generation x86 processor and the fi ...
and
Duron
Duron is a line of budget x86-compatible microprocessors manufactured by AMD. Released on June 19, 2000 as a lower-cost offering to complement AMD's then mainstream performance Athlon processor line, it also competed with rival chipmaker Inte ...
(
Morgan core) processors.
Registers
SSE originally added eight new 128-bit registers known as
XMM0
through
XMM7
. The
AMD64
x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mod ...
extensions from AMD (originally called ''x86-64'') added a further eight registers
XMM8
through
XMM15
, and this extension is duplicated in the
Intel 64
x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mo ...
architecture. There is also a new 32-bit control/status register,
MXCSR
. The registers
XMM8
through
XMM15
are accessible only in 64-bit operating mode.
SSE used only a single data type for XMM registers:
* four 32-bit
single-precision
Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A floatin ...
floating-point numbers
SSE2 would later expand the usage of the XMM registers to include:
* two 64-bit
double-precision floating-point numbers or
* two 64-bit integers or
* four 32-bit integers or
* eight 16-bit short integers or
* sixteen 8-bit bytes or characters.
Because these 128-bit registers are additional machine states that the
operating system
An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs.
Time-sharing operating systems schedule tasks for efficient use of the system and may also in ...
must preserve across
task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the
FXSAVE
and
FXRSTOR
instructions, which is the extended pair of instructions that can save all
x86
x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introd ...
and SSE register states at once. This support was quickly added to all major IA-32 operating systems.
The first CPU to support SSE, the
Pentium III
The Pentium III (marketed as Intel Pentium III Processor, informally PIII or P3) brand refers to Intel's 32-bit x86 desktop and mobile CPUs based on the sixth-generation P6 microarchitecture introduced on February 28, 1999. The brand's initial p ...
, shared execution resources between SSE and the
floating-point unit
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can b ...
(FPU).
While a
compiled
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same
clock cycle
In electronics and especially synchronous digital circuits, a clock signal (historically also known as ''logic beat'') oscillates between a high and a low state and is used like a metronome to coordinate actions of digital circuits.
A clock sig ...
. This limitation reduces the effectiveness of
pipelining, but the separate XMM registers do allow SIMD and scalar floating-point operations to be mixed without the performance hit from explicit MMX/floating-point mode switching.
SSE instructions
SSE introduced both
scalar
Scalar may refer to:
*Scalar (mathematics), an element of a field, which is used to define a vector space, usually the field of real numbers
* Scalar (physics), a physical quantity that can be described by a single element of a number field such ...
and
packed
Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.
The CPU in modern computer hardware performs reads and ...
floating-point instructions.
Floating-point instructions
* Memory-to-register/register-to-memory/register-to-register data movement
** Scalar –
MOVSS
** Packed –
MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS, MOVMSKPS
* Arithmetic
** Scalar –
ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
** Packed –
ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
*
Compare
Comparison or comparing is the act of evaluating two or more things by determining the relevant, comparable characteristics of each thing, and then determining which characteristics of each are similar to the other, which are different, and t ...
** Scalar –
CMPSS, COMISS, UCOMISS
** Packed –
CMPPS
* Data shuffle and unpacking
** Packed –
SHUFPS, UNPCKHPS, UNPCKLPS
*
Data-type conversion
** Scalar –
CVTSI2SS, CVTSS2SI, CVTTSS2SI
** Packed –
CVTPI2PS, CVTPS2PI, CVTTPS2PI
*
Bitwise logical operations
** Packed –
ANDPS, ORPS, XORPS, ANDNPS
Integer instructions
* Arithmetic
**
PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
* Data movement
**
PEXTRW, PINSRW
* Other
**
PMOVMSKB, PSHUFW
Other instructions
*
MXCSR
management
**
LDMXCSR, STMXCSR
* Cache and Memory management
**
MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCH0, PREFETCH1, PREFETCH2, PREFETCHNTA, SFENCE
Example
The following simple example demonstrates the advantage of using SSE. Consider an operation like vector addition, which is used very often in computer graphics applications. To add two single precision, four-component vectors together using x86 requires four floating-point addition instructions.
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
This corresponds to four x86 FADD instructions in the object code. On the other hand, as the following pseudo-code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.
movaps xmm0, 1;xmm0 = v1.w , v1.z , v1.y , v1.x
addps xmm0, 2 ;xmm0 = v1.w+v2.w , v1.z+v2.z , v1.y+v2.y , v1.x+v2.x
movaps ec_res xmm0 ;xmm0
Later versions
*
SSE2, Willamette New Instructions (WNI), introduced with the
Pentium 4
Pentium 4 is a series of single-core CPUs for desktops, laptops and entry-level servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 2008. The production of Netburst processors was active from 2000 ...
, is a major enhancement to SSE. SSE2 adds two major features:
double-precision (64-bit) floating-point for all SSE operations, and MMX integer operations on 128-bit XMM registers. In the original SSE instruction set, conversion to and from integers placed the integer data in the 64-bit MMX registers. SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU registers. It offers an
orthogonal set of instructions for dealing with common data types.
*
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revis ...
, also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions. It also allowed addition or multiplication of two numbers that are stored in the same register, which wasn't possible in SSE2 and earlier. This capability, known as horizontal in Intel terminology, was the major addition to the SSE3 instruction set. AMD's
3DNow! extension could do the latter too.
*
SSSE3
Supplemental Streaming SIMD Extensions 3 (SSSE3 or SSE3S) is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.
History
SSSE3 was first introduced with Intel processors based on the Core microarchitectu ...
, Merom New Instructions (MNI), is an upgrade to SSE3, adding 16 new instructions which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions. SSSE3 is often mistaken for SSE4 as this term was used during the development of the Core
microarchitecture.
*
SSE4, Penryn New Instructions (PNI), is another major enhancement, adding a dot product instruction, additional integer instructions, a
popcnt instruction (
Population count: count number of bits set to 1, used extensively e.g. in
cryptography
Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...
), and more.
*
XOP,
FMA4 and
CVT16 are new iterations announced by
AMD
Advanced Micro Devices, Inc. (AMD) is an American multinational semiconductor company based in Santa Clara, California, that develops computer processors and related technologies for business and consumer markets. While it initially manufactur ...
in August 2007
and revised in May 2009.
*
Advanced Vector Extensions
Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bri ...
(AVX), Gesher New Instructions (GNI), is an advanced version of SSE announced by Intel featuring a widened data path from 128 bits to 256 bits and 3-operand instructions (up from 2). Intel released processors in early 2011 with AVX support.
*
AVX2
Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge ...
is an expansion of the AVX instruction set.
*
AVX-512 AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 (Knights Landing) and Skylake-X CPUs; t ...
(3.1 and 3.2) are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture.
Software and hardware issues
With all x86 instruction set extensions, it is up to the
BIOS
In computing, BIOS (, ; Basic Input/Output System, also known as the System BIOS, ROM BIOS, BIOS ROM or PC BIOS) is firmware used to provide runtime services for operating systems and programs and to perform hardware initialization during the ...
, operating system and application programmer to test and detect their existence and proper operation.
* Intel and AMD offer applications to detect what extensions a CPU supports.
* The
CPUID
In the x86 architecture, the CPUID instruction (identified by a CPUID opcode) is a processor supplementary instruction (its name derived from CPU IDentification) allowing software to discover details of the processor. It was introduced by Intel ...
opcode is a processor supplementary instruction (its name derived from CPU IDentification) for the x86 architecture. It was introduced by Intel in 1993 when it introduced the Pentium and SL-Enhanced 486 processors.
User application uptake of the x86 extensions has been slow with even bare minimum baseline
MMX and SSE support (in some cases) being non-existent by applications some 10 years after these extensions became commonly available. Distributed computing has accelerated the use of these extensions in the scientific community—and many scientific applications refuse to run unless the CPU supports SSE2 or SSE3.
The use of multiple revisions of an application to cope with the many different sets of extensions available is the simplest way around the x86 extension optimization problem. Software libraries and some applications have begun to support multiple extension types hinting that full use of available x86 instructions may finally become common some 5 to 15 years after the instructions were initially introduced.
Identifying
The following programs can be used to determine which, if any, versions of SSE are supported on a system
* Intel Processor Identification Utility
*
CPU-Z
CPU-Z is a freeware system profiling and monitoring application for Microsoft Windows and Android that detects the central processing unit, RAM, motherboard chip-set, and other hardware features of a modern personal computer or Android device. ...
– CPU, motherboard, and memory identification utility.
*
lscpu - provided by the util-linux package in most Linux distributions.
References
External links
Intel Intrinsics Guide
{{DEFAULTSORT:Streaming Simd Extensions
SIMD computing
X86 instructions