Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the ''de facto'' standard low-level routines for linear algebra libraries; the routines have bindings for both C ("CBLAS interface") and Fortran ("BLAS interface"). Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations take advantage of special floating-point hardware such as vector registers or SIMD instructions.
It originated as a Fortran library in 1979, and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website. This Fortran library is known as the ''reference implementation'' (sometimes confusingly referred to as ''the'' BLAS library); it is not optimized for speed but is in the public domain.
Most libraries that offer linear algebra routines conform to the BLAS interface, allowing library users to develop programs that are indifferent to the BLAS library being used.
Many BLAS libraries have been developed, targeting various hardware platforms. Examples include cuBLAS (NVIDIA GPU, GPGPU), rocBLAS (AMD GPU), and OpenBLAS. Examples of CPU-based BLAS library branches include: OpenBLAS, BLIS (BLAS-like Library Instantiation Software), Arm Performance Libraries, ATLAS, and Intel Math Kernel Library (iMKL). AMD maintains a fork of BLIS that is optimized for the AMD platform. ATLAS is a portable library that automatically optimizes itself for an arbitrary architecture. iMKL is a freeware and proprietary vendor library optimized for x86 and x86-64 with a performance emphasis on Intel processors.
OpenBLAS is an open-source library that is hand-optimized for many of the popular architectures. The LINPACK benchmarks rely heavily on the BLAS routine gemm for their performance measurements.
Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo, GNU Octave, Mathematica, MATLAB, NumPy, R, Julia, and Lisp-Stat.
Background
With the advent of numerical programming, sophisticated subroutine libraries became useful. These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations. The language of choice was FORTRAN. The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP). These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms. The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy. The library routines could also be more efficient; for example, a library might include a specialized routine for solving a system whose matrix is upper triangular. The libraries would include single-precision and double-precision versions of some algorithms.
Initially, these subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to operating systems). Between 1973 and 1977, several of these kernel operations were identified. These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979. BLAS was used to implement the linear algebra subroutine library LINPACK.
The BLAS abstraction allows customization for high performance. For example, LINPACK is a general-purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.)
Other machine features became available and could also be exploited. Consequently, BLAS was augmented from 1984 to 1986 with level-2 kernel operations that concerned vector-matrix operations. Memory hierarchy was also recognized as something to exploit. Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. In 1987 and 1988, the level 3 BLAS were identified to do matrix-matrix operations. The level 3 BLAS encouraged block-partitioned algorithms. The LAPACK library uses level 3 BLAS.
The original BLAS concerned only densely stored vectors and matrices. Further extensions to BLAS, such as for sparse matrices, have been addressed.
Functionality
BLAS functionality is categorized into three sets of routines called "levels", which correspond both to the chronological order of definition and publication and to the degree of the polynomial in the complexity of the algorithms: Level 1 BLAS operations typically take linear time, O(n), Level 2 operations quadratic time, O(n^2), and Level 3 operations cubic time, O(n^3). Modern BLAS implementations typically provide all three levels.
Level 1
This level consists of all the routines described in the original presentation of BLAS (1979), which defined only ''vector operations'' on strided arrays: dot products, vector norms, a generalized vector addition of the form
:y \leftarrow \alpha x + y
(called "axpy", "a x plus y") and several other operations.
Level 2
This level contains ''matrix-vector operations'' including, among other things, a ''ge''neralized ''m''atrix-''v''ector multiplication (gemv):
:y \leftarrow \alpha Ax + \beta y
as well as a solver for ''x'' in the linear equation
:Tx = y
with ''T'' being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988.
The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."
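A corresponding Level 2 sketch via CBLAS, under the same linking assumption as the Level 1 sketch: gemv applies the update above, and trsv solves the triangular system in place.
<syntaxhighlight lang="c">
/* Level 2 sketch: matrix-vector operations via CBLAS.
   gemv computes y <- alpha*A*x + beta*y; trsv solves T*x = y in place. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* 2x2 upper-triangular matrix in row-major order: [[1 2],[0 3]] */
    double A[] = {1.0, 2.0,
                  0.0, 3.0};
    double x[] = {1.0, 1.0};
    double y[] = {0.0, 0.0};

    /* y <- 1.0*A*x + 0.0*y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2, 1.0, A, 2, x, 1, 0.0, y, 1);
    printf("A*x = [%g %g]\n", y[0], y[1]);

    /* solve A*z = y for z, overwriting y with the solution
       (which recovers the original x) */
    cblas_dtrsv(CblasRowMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                2, A, 2, y, 1);
    printf("solution = [%g %g]\n", y[0], y[1]);
    return 0;
}
</syntaxhighlight>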
Level 3
This level, formally published in 1990, contains ''matrix-matrix operations'', including a "general matrix multiplication" (gemm), of the form
:C \leftarrow \alpha AB + \beta C,
where ''A'' and ''B'' can optionally be transposed or Hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication can be performed by setting ''α'' to one and ''C'' to an all-zeros matrix of the appropriate size.
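A Level 3 sketch of the gemm update via CBLAS, under the same linking assumption as the earlier sketches; with α = 1 and β = 0 it reduces to the plain product C = AB.
<syntaxhighlight lang="c">
/* Level 3 sketch: C <- alpha*A*B + beta*C via cblas_dgemm. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* Row-major 2x2 matrices */
    double A[] = {1.0, 2.0,
                  3.0, 4.0};
    double B[] = {5.0, 6.0,
                  7.0, 8.0};
    double C[] = {0.0, 0.0,
                  0.0, 0.0};

    /* m = n = k = 2; leading dimensions equal the row length */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [[%g %g], [%g %g]]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
</syntaxhighlight>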
Also included in Level 3 are routines for computing
:B \leftarrow \alpha T^{-1}B,
where ''T'' is a triangular matrix, among other functionality.
Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS, and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers. E.g., by decomposing one or both of ''A'', ''B'' into block matrices, gemm can be implemented recursively. This is one of the motivations for including the ''β'' parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case ''β'' = 1, which many implementations optimize for, thereby eliminating one multiplication for each value of ''C''. This decomposition allows for better locality of reference both in space and time of the data used in the product. This, in turn, takes advantage of the cache on the system. For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS.
Highly tuned implementations based on these ideas are part of GotoBLAS, OpenBLAS and BLIS.
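A minimal sketch of this blocked decomposition in plain C, for illustration only (production BLAS kernels are far more elaborate); the block size and loop order are illustrative assumptions, and the accumulation into ''C'' corresponds to the ''β'' = 1 special case discussed above.
<syntaxhighlight lang="c">
/* Sketch of a cache-blocked gemm update C <- A*B + C for square
   row-major n x n matrices (n assumed to be a multiple of BS).
   Each (i,k,j) block triple reuses a BS x BS tile of A, B and C,
   improving locality of reference over the naive triple loop. */
#define BS 64  /* illustrative block size */

void gemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += BS)
        for (int k = 0; k < n; k += BS)
            for (int j = 0; j < n; j += BS)
                /* multiply the (i,k) block of A with the (k,j) block of B,
                   accumulating into the (i,j) block of C */
                for (int ii = i; ii < i + BS; ++ii)
                    for (int kk = k; kk < k + BS; ++kk) {
                        double a = A[ii * n + kk];
                        for (int jj = j; jj < j + BS; ++jj)
                            C[ii * n + jj] += a * B[kk * n + jj];
                    }
}
</syntaxhighlight>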
A common variation of gemm is the gemm3m, which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to the Strassen algorithm first described by Peter Ungar.
Implementations
; Accelerate: Apple's framework for macOS and iOS, which includes tuned versions of BLAS and LAPACK.
; Arm Performance Libraries: Arm Performance Libraries, supporting Arm 64-bit AArch64-based processors, available from Arm.
; ATLAS: Automatically Tuned Linear Algebra Software, an open source implementation of BLAS APIs for C and Fortran 77.
; BLIS: BLAS-like Library Instantiation Software framework for rapid instantiation. Optimized for most modern CPUs. BLIS is a complete refactoring of the GotoBLAS that reduces the amount of code that must be written for a given platform.
; C++ AMP BLAS: The C++ AMP BLAS Library is an open source implementation of BLAS for Microsoft's AMP language extension for Visual C++.
; cuBLAS: Optimized BLAS for NVIDIA-based GPU cards, requiring few additional library calls.
; NVBLAS: Optimized BLAS for NVIDIA-based GPU cards, providing only Level 3 functions, but as a direct drop-in replacement for other BLAS libraries.
; clBLAS: An OpenCL implementation of BLAS by AMD. Part of the AMD Compute Libraries.
; CLBlast: A tuned OpenCL implementation of most of the BLAS API.
; Eigen BLAS: A Fortran 77 and C BLAS library implemented on top of the MPL-licensed Eigen library, supporting x86, x86-64, ARM (NEON), and PowerPC architectures.
; ESSL: IBM's Engineering and Scientific Subroutine Library, supporting the PowerPC architecture under AIX and Linux.
; GotoBLAS: Kazushige Goto's BSD-licensed implementation of BLAS, tuned in particular for Intel Nehalem/Atom, VIA Nano processor, and AMD Opteron.
; GNU Scientific Library: Multi-platform implementation of many numerical routines. Contains a CBLAS interface.
; HP MLIB: HP's Math library supporting IA-64, PA-RISC, x86 and Opteron architectures under HP-UX and Linux.
; Intel MKL: The Intel Math Kernel Library, supporting x86 32-bits and 64-bits, available free from Intel. Includes optimizations for Intel Pentium, Core and Intel Xeon CPUs and Intel Xeon Phi; support for Linux, Windows and macOS.
; MathKeisan: NEC's math library, supporting NEC SX architecture under SUPER-UX, and Itanium under Linux.
; Netlib BLAS: The official reference implementation on
Netlib, written in
Fortran 77.
; Netlib CBLAS: Reference C interface to the BLAS. It is also possible (and popular) to call the Fortran BLAS from C (see the sketch after this list).
; OpenBLAS: Optimized BLAS based on GotoBLAS, supporting x86, x86-64, MIPS and ARM processors.
; PDLIB/SX: NEC's Public Domain Mathematical Library for the NEC SX-4 system.
; rocBLAS: Implementation that runs on AMD GPUs via ROCm.
; SCSL: SGI's Scientific Computing Software Library contains BLAS and LAPACK implementations for SGI's Irix workstations.
; Sun Performance Library: Optimized BLAS and LAPACK for SPARC, Core and AMD64 architectures under Solaris 8, 9, and 10 as well as Linux.
; uBLAS: A generic
C++ template class library providing BLAS functionality. Part of the
Boost library. It provides bindings to many hardware-accelerated libraries in a unifying notation. Moreover, uBLAS focuses on correctness of the algorithms using advanced C++ features.
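As referenced in the Netlib CBLAS entry above, the Fortran BLAS can be called from C without the CBLAS wrapper. A minimal sketch, assuming the common Unix conventions that Fortran symbols carry a trailing underscore and take all arguments by reference (both toolchain-dependent assumptions), and column-major storage as in Fortran:
<syntaxhighlight lang="c">
/* Sketch: calling the Fortran reference BLAS dgemm from C.
   Assumes Unix-style name mangling (trailing underscore) and
   pass-by-reference semantics; link with e.g. -lblas.
   Matrices are stored column-major, as in Fortran. */
#include <stdio.h>

extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void) {
    int n = 2;
    double alpha = 1.0, beta = 0.0;
    /* 2x2 matrices in column-major order */
    double a[] = {1.0, 3.0, 2.0, 4.0};   /* [[1 2],[3 4]] */
    double b[] = {5.0, 7.0, 6.0, 8.0};   /* [[5 6],[7 8]] */
    double c[4];
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    printf("C = [[%g %g], [%g %g]]\n", c[0], c[2], c[1], c[3]);
    return 0;
}
</syntaxhighlight>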
Libraries using BLAS
; Armadillo: Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS/ATLAS and LAPACK. It is sponsored by NICTA (in Australia) and is licensed under a free license.
; LAPACK: LAPACK is a higher-level linear algebra library built upon BLAS. Like BLAS, a reference implementation exists, but many alternatives like libFlame and MKL exist.
; Mir: An LLVM-accelerated generic numerical library for science and machine learning written in D. It provides generic linear algebra subprograms (GLAS). It can be built on a CBLAS implementation.
Similar libraries (not compatible with BLAS)
; Elemental: Elemental is open-source software for distributed-memory dense and sparse-direct linear algebra and optimization.
; HASEM: A C++ template library that can solve linear equations and compute eigenvalues. It is licensed under the BSD License.
; LAMA: The Library for Accelerated Math Applications (LAMA) is a C++ template library for writing numerical solvers targeting various kinds of hardware (e.g. GPUs through CUDA or OpenCL) on distributed memory systems, hiding the hardware-specific programming from the program developer.
; MTL4: The Matrix Template Library version 4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to generic programming.
Sparse BLAS
Several extensions to BLAS for handling sparse matrices have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.
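One of the standardized sparse kernels is the product of a sparse matrix with a dense vector. A minimal sketch over the common compressed sparse row (CSR) layout; the CSR data structure and function name here are illustrative assumptions, not the interface mandated by the Sparse BLAS standard.
<syntaxhighlight lang="c">
/* Sketch: y <- A*x for a sparse matrix A in CSR form.
   row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i, whose
   column indices and values live in col_idx[] and val[]. */
void csr_spmv(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
            sum += val[p] * x[col_idx[p]];
        y[i] = sum;
    }
}
</syntaxhighlight>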
Batched BLAS
The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism such as GPUs. Here, the traditional BLAS functions typically provide good performance for large matrices. However, when computing, e.g., matrix-matrix products of many small matrices with the GEMM routine, those architectures show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017.
Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:
:C[k] \leftarrow \alpha A[k] B[k] + \beta C[k] \quad \forall k
The index ''k'' in square brackets indicates that the operation is performed for all matrices ''k'' in a stack. Often, this operation is implemented for a strided batched memory layout where all matrices are stored concatenated in the arrays ''A'', ''B'' and ''C''.
Batched BLAS functions can be a versatile tool and allow, e.g., a fast implementation of exponential integrators and Magnus integrators that handle long integration periods with many time steps. Here, the matrix exponentiation, the computationally expensive part of the integration, can be implemented in parallel for all time steps by using batched BLAS functions.
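A minimal sketch of these strided batched semantics, written as a plain loop over cblas_dgemm calls; the function name and stride parameters are illustrative assumptions, and a real batched implementation would execute all ''k'' instances in parallel rather than loop over them.
<syntaxhighlight lang="c">
/* Sketch: strided batched GEMM semantics,
   C[k] <- alpha*A[k]*B[k] + beta*C[k], expressed as a loop over
   ordinary cblas_dgemm calls. Matrix k of each operand starts at
   offset k*stride within its array. */
#include <cblas.h>

void dgemm_strided_batched(int m, int n, int k, double alpha,
                           const double *A, int lda, long strideA,
                           const double *B, int ldb, long strideB,
                           double beta, double *C, int ldc, long strideC,
                           int batch_count) {
    for (int i = 0; i < batch_count; ++i) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha,
                    A + i * strideA, lda,
                    B + i * strideB, ldb,
                    beta,
                    C + i * strideC, ldc);
    }
}
</syntaxhighlight>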
See also
* List of numerical libraries
* Math Kernel Library, math library optimized for the Intel architecture; includes BLAS, LAPACK
* Numerical linear algebra, the type of problem BLAS solves
References
Further reading
* J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 14 (1988), pp. 18–32.
* J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 16 (1990), pp. 1–17.
* J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, Algorithm 679: A set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Softw., 16 (1990), pp. 18–28.
; New BLAS
* L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, An Updated Set of Basic Linear Algebra Subprograms (BLAS), ACM Trans. Math. Softw., 28-2 (2002), pp. 135–151.
* J. Dongarra, Basic Linear Algebra Subprograms Technical Forum Standard, International Journal of High Performance Applications and Supercomputing, 16(1) (2002), pp. 1–111, and International Journal of High Performance Applications and Supercomputing, 16(2) (2002), pp. 115–199.
External links
BLAS homepage on Netlib.org
BLAS quick reference guide from LAPACK Users' Guide
One of the original authors of the BLAS discusses its creation in an oral history interview. Charles L. Lawson Oral history interview by Thomas Haigh, 6 and 7 November 2004, San Clemente, California. Society for Industrial and Applied Mathematics, Philadelphia, PA.
In an oral history interview, Jack Dongarra explores the early relationship of BLAS to LINPACK, the creation of higher level BLAS versions for new architectures, and his later work on the ATLAS system to automatically optimize BLAS for particular machines. Jack Dongarra, Oral history interview by Thomas Haigh, 26 April 2005, University of Tennessee, Knoxville TN. Society for Industrial and Applied Mathematics, Philadelphia, PA.
How does BLAS get such extreme performance? Ten naive 1000×1000 matrix multiplications (10^10 floating-point multiply-adds) take 15.77 seconds on a 2.6 GHz processor; a BLAS implementation takes 1.32 seconds.
* An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum
{{Linear algebra}}
Numerical linear algebra
Numerical software
Public-domain software with source code