Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the ''de facto'' standard low-level routines for linear algebra libraries; the routines have bindings for both C ("CBLAS interface") and Fortran ("BLAS interface"). Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

It originated as a Fortran library in 1979 and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website. This Fortran library is known as the ''reference implementation'' (sometimes confusingly referred to as ''the'' BLAS library) and is not optimized for speed but is in the public domain.

Most computing libraries that offer linear algebra routines conform to common BLAS user interface command structures, thus queries to those libraries (and the associated results) are often portable between BLAS library branches, such as cuBLAS (NVIDIA GPU, GPGPU), rocBLAS (AMD GPU, GPGPU), and OpenBLAS. This interoperability is then the basis of functioning homogeneous code implementations between heterogeneous cascades of computing architectures (such as those found in some advanced clustering implementations). Examples of CPU-based BLAS library branches include: OpenBLAS, BLIS (BLAS-like Library Instantiation Software), Arm Performance Libraries, ATLAS, and Intel Math Kernel Library (iMKL). AMD maintains a fork of BLIS that is optimized for the AMD platform, although it is unclear whether integrated ombudsmen resources are present in that particular software-hardware implementation. ATLAS is a portable library that automatically optimizes itself for an arbitrary architecture. iMKL is a freeware and proprietary vendor library optimized for x86 and x86-64 with a performance emphasis on Intel processors. OpenBLAS is an open-source library that is hand-optimized for many of the popular architectures. The LINPACK benchmarks rely heavily on the BLAS routine gemm for their performance measurements.

Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo, GNU Octave, Mathematica, MATLAB, NumPy, R, and Julia.

Background

With the advent of numerical programming, sophisticated subroutine libraries became useful. These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations. The language of choice was FORTRAN. The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP). These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms. The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy. The libraries would also include more efficient specialized routines; for example, a library may include a routine to solve a system whose matrix is upper triangular. The libraries would include single-precision and double-precision versions of some algorithms.

Initially, these subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to operating systems). Between 1973 and 1977, several of these kernel operations were identified. These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979. BLAS was used to implement the linear algebra subroutine library LINPACK.

The BLAS abstraction allows customization for high performance. For example, LINPACK is a general purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.)

Other machine features became available and could also be exploited. Consequently, BLAS was augmented from 1984 to 1986 with level-2 kernel operations that concerned vector-matrix operations. Memory hierarchy was also recognized as something to exploit. Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. In 1987 and 1988, the level 3 BLAS were identified to do matrix-matrix operations. The level 3 BLAS encouraged block-partitioned algorithms. The LAPACK library uses level 3 BLAS.

The original BLAS concerned only densely stored vectors and matrices. Further extensions to BLAS, such as for sparse matrices, have been addressed.
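The contrast between hard-coded loops and kernel calls can be sketched in a few lines. The following is only an illustration in Python, not code from any BLAS library; the function names `dot`, `matmul_hardcoded` and `matmul_with_kernel` are hypothetical, and matrices are assumed to be non-empty row-major lists of lists.

```python
def dot(x, y):
    """Level-1 style kernel: inner product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


def matmul_hardcoded(a, b):
    """Matrix product written with three explicit nested loops."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_with_kernel(a, b):
    """Same product, with the innermost loop delegated to the dot kernel."""
    columns = list(zip(*b))  # columns of b, so each entry is one dot product
    return [[dot(row, col) for col in columns] for row in a]
```

Both functions compute the same result; in the second, only `dot` would need to be replaced to benefit from a machine-specific implementation.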

Functionality

BLAS functionality is categorized into three sets of routines called "levels", which correspond to both the chronological order of definition and publication, as well as the degree of the polynomial in the complexities of algorithms; Level 1 BLAS operations typically take linear time, ''O''(''n''), Level 2 operations quadratic time and Level 3 operations cubic time. Modern BLAS implementations typically provide all three levels.

Level 1

This level consists of all the routines described in the original presentation of BLAS (1979), which defined only ''vector operations'' on strided arrays: dot products, vector norms, a generalized vector addition of the form

: \boldsymbol{y} \leftarrow \alpha \boldsymbol{x} + \boldsymbol{y}

(called "axpy", "a x plus y") and several other operations.
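A minimal Python sketch of the axpy update on strided arrays follows. The argument order (`n`, `alpha`, `x`, `incx`, `y`, `incy`) is modelled on the reference `axpy` routines, but the function itself is a hypothetical illustration and, as a simplifying assumption, accepts only positive strides.

```python
def axpy(n, alpha, x, incx, y, incy):
    """Update y <- alpha * x + y in place over n strided elements.

    x is read at indices 0, incx, 2*incx, ... and y is updated at
    indices 0, incy, 2*incy, ...  Only positive strides are handled.
    """
    if incx <= 0 or incy <= 0:
        raise ValueError("this sketch only handles positive strides")
    for i in range(n):
        y[i * incy] += alpha * x[i * incx]
    return y
```

For example, with `x = [1.0, 2.0, 3.0, 4.0]` and `incx = 2`, only the elements `1.0` and `3.0` take part in the update.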

Level 2

This level contains ''matrix-vector operations'' including, among other things, a ''ge''neralized ''m''atrix-''v''ector multiplication (gemv):

: \boldsymbol{y} \leftarrow \alpha \boldsymbol{A} \boldsymbol{x} + \beta \boldsymbol{y}

as well as a solver for ''x'' in the linear equation

: \boldsymbol{T} \boldsymbol{x} = \boldsymbol{y}

with ''T'' being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988. The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."
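The two Level 2 operations above can be sketched as follows. This is an illustrative Python version, not a BLAS interface: the names `gemv` and `solve_lower_triangular` are chosen for this example, matrices are assumed to be row-major lists of lists, and the solver assumes a lower triangular ''T'' with a nonzero diagonal.

```python
def gemv(alpha, a, x, beta, y):
    """Return alpha * A x + beta * y for a row-major list-of-lists A."""
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch")
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]


def solve_lower_triangular(t, y):
    """Solve T x = y by forward substitution.

    Assumes T is square, lower triangular, with a nonzero diagonal.
    """
    n = len(y)
    x = [0.0] * n
    for i in range(n):
        # Subtract the contribution of the already-computed unknowns.
        partial = sum(t[i][j] * x[j] for j in range(i))
        x[i] = (y[i] - partial) / t[i][i]
    return x
```

Both routines touch each matrix entry a constant number of times, which is the quadratic cost associated with Level 2.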

Level 3

This level, formally published in 1990, contains ''matrix-matrix operations'', including a "general matrix multiplication" (gemm), of the form

: \boldsymbol{C} \leftarrow \alpha \boldsymbol{A} \boldsymbol{B} + \beta \boldsymbol{C},

where ''A'' and ''B'' can optionally be transposed or hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication ''AB'' can be performed by setting α to one and ''C'' to an all-zeros matrix of the appropriate size.

Also included in Level 3 are routines for computing

: \boldsymbol{B} \leftarrow \alpha \boldsymbol{T}^{-1} \boldsymbol{B},

where ''T'' is a triangular matrix, among other functionality.

Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS, and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers. E.g., by decomposing one or both of ''A'', ''B'' into block matrices, gemm can be implemented recursively. This is one of the motivations for including the β parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case β = 1, which many implementations optimize for, thereby eliminating one multiplication for each value of ''C''. This decomposition allows for better locality of reference both in space and time of the data used in the product. This, in turn, takes advantage of the cache on the system. For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS. A highly tuned implementation based on these ideas is part of the GotoBLAS and BLIS.

A common variation of gemm is the gemm3m, which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to the Strassen algorithm first described by Peter Ungar.
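The role of the β parameter in a block decomposition can be illustrated with a small Python sketch. It is an iterative simplification rather than the recursive scheme or any tuned implementation: the shared dimension of ''A'' and ''B'' is cut into panels, the caller's β is applied on the first panel only, and every later panel is accumulated with β = 1. The names `gemm` and `gemm_blocked` and the `block` parameter are assumptions made for this example.

```python
def gemm(alpha, a, b, beta, c):
    """Update C <- alpha * A B + beta * C in place (naive triple loop)."""
    if not a or not b:
        raise ValueError("this sketch expects non-empty matrices")
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = 0.0
            for p in range(len(b)):
                acc += a[i][p] * b[p][j]
            c[i][j] = alpha * acc + beta * c[i][j]
    return c


def gemm_blocked(alpha, a, b, beta, c, block=2):
    """Same update, with the shared dimension split into panels."""
    if not a or not b:
        raise ValueError("this sketch expects non-empty matrices")
    k = len(b)
    for start in range(0, k, block):
        stop = min(start + block, k)
        a_panel = [row[start:stop] for row in a]  # columns start..stop of A
        b_panel = b[start:stop]                   # rows start..stop of B
        # The first panel applies the caller's beta; later panels use
        # beta = 1 so that their products are added to the running result.
        gemm(alpha, a_panel, b_panel, beta if start == 0 else 1.0, c)
    return c
```

In a real implementation the panel sizes would be chosen to fit the cache, and the β = 1 case would skip the multiplication of ''C'' entirely.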

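One standard way to obtain a complex matrix product from three real multiplications and five real additions is sketched below in Python. It illustrates the arithmetic only and is not the gemm3m interface of any library; the helper names (`matmul`, `add`, `complex_matmul_3m`) and the split of each complex matrix into separate real and imaginary parts are assumptions made for this example.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, sign=1):
    """Elementwise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complex_matmul_3m(a_re, a_im, b_re, b_im):
    """Return (C_re, C_im) with C = (A_re + i A_im)(B_re + i B_im)."""
    t1 = matmul(a_re, b_re)                        # multiplication 1
    t2 = matmul(a_im, b_im)                        # multiplication 2
    t3 = matmul(add(a_re, a_im), add(b_re, b_im))  # additions 1-2, multiplication 3
    c_re = add(t1, t2, sign=-1)                    # addition 3
    c_im = add(add(t3, t1, sign=-1), t2, sign=-1)  # additions 4-5
    return c_re, c_im
```

The imaginary part follows from expanding the third product, which equals `t1 + t2` plus the two cross terms that make up the imaginary part.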
Implementations

; Accelerate:

Apple An apple is an edible fruit produced by an apple tree (''Malus domestica''). Apple fruit tree, trees are agriculture, cultivated worldwide and are the most widely grown species in the genus ''Malus''. The tree originated in Central Asia, wh ...

's framework for

macOS macOS (; previously OS X and originally Mac OS X) is a Unix operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers. Within the market of desktop and lapt ...

and

iOS iOS (formerly iPhone OS) is a mobile operating system created and developed by Apple Inc. exclusively for its hardware. It is the operating system that powers many of the company's mobile devices, including the iPhone; the term also includes ...

, which includes tuned versions of

BLAS Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix ...

and

. ; Arm Performance Libraries: Arm Performance Libraries, supporting Arm 64-bit

AArch64 AArch64 or ARM64 is the 64-bit extension of the ARM architecture family. It was first introduced with the Armv8-A architecture. Arm releases a new extension every year. ARMv8.x and ARMv9.x extensions and features Announced in October 2011, AR ...

-based processors, available from

Arm In human anatomy, the arm refers to the upper limb in common usage, although academically the term specifically means the upper arm between the glenohumeral joint (shoulder joint) and the elbow joint. The distal part of the upper limb between the ...

. ; ATLAS:

Automatically Tuned Linear Algebra Software Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of BLAS APIs for C and Fortran77. ATLAS is often recommended as a way to automatically generate an ...

, an

open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...

implementation of BLAS

API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software Interface (computing), interface, offering a service to other pieces of software. A document or standa ...

s for C and Fortran 77. ; BLIS: BLAS-like Library Instantiation Software framework for rapid instantiation. Optimized for most modern CPUs. BLIS is a complete refactoring of the GotoBLAS that reduces the amount of code that must be written for a given platform. ; C++ AMP BLAS: The

C++ AMP C, or c, is the third letter in the Latin alphabet, used in the modern English alphabet, the alphabets of other western European languages and others worldwide. Its name in English is ''cee'' (pronounced ), plural ''cees''. History "C" ...

BLAS Library is an

implementation of BLAS for Microsoft's AMP language extension for Visual C++. ; cuBLAS: Optimized BLAS for NVIDIA based GPU cards, requiring few additional library calls. ; NVBLAS: Optimized BLAS for NVIDIA based GPU cards, providing only Level 3 functions, but as direct drop-in replacement for other BLAS libraries. ; clBLAS: An

OpenCL OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-progra ...

implementation of BLAS by AMD. Part of the AMD Compute Libraries. ; clBLAST: A tuned

implementation of most of the BLAS api. ; Eigen BLAS: A Fortran 77 and C BLAS library implemented on top of the MPL-licensed Eigen library, supporting

x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introd ...

x86-64 x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mod ...

, ARM (NEON), and

PowerPC PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple Inc., App ...

architectures. ; ESSL: IBM's Engineering and Scientific Subroutine Library, supporting the

architecture under

AIX Aix or AIX may refer to: Computing * AIX, a line of IBM computer operating systems *An Alternate Index, for a Virtual Storage Access Method Key Sequenced Data Set *Athens Internet Exchange, a European Internet exchange point Places Belgium ...

and

Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which ...

. ;

's BSD-licensed implementation of BLAS, tuned in particular for

Nehalem/

Atom Every atom is composed of a nucleus and one or more electrons bound to the nucleus. The nucleus is made of one or more protons and a number of neutrons. Only the most common variety of hydrogen has no neutrons. Every solid, liquid, gas, and ...

VIA Via or VIA may refer to the following: Science and technology * MOS Technology 6522, Versatile Interface Adapter * ''Via'' (moth), a genus of moths in the family Noctuidae * Via (electronics), a through-connection * VIA Technologies, a Taiwan ...

Nanoprocessor,

Opteron Opteron is AMD's x86 former server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture (known generically as x86-64 or AMD64). It was released on April 22, 2003, with the ''SledgeHa ...

. ;

GNU Scientific Library The GNU Scientific Library (or GSL) is a software library for numerical computations in applied mathematics and science. The GSL is written in C; wrappers are available for other programming languages. The GSL is part of the GNU Project and is d ...

: Multi-platform implementation of many numerical routines. Contains a CBLAS interface. ; HP MLIB: HP's Math library supporting

IA-64 IA-64 (Intel Itanium architecture) is the instruction set architecture (ISA) of the Itanium family of 64-bit Intel microprocessors. The basic ISA specification originated at Hewlett-Packard (HP), and was subsequently implemented by Intel in coll ...

PA-RISC PA-RISC is an instruction set architecture (ISA) developed by Hewlett-Packard. As the name implies, it is a reduced instruction set computer (RISC) architecture, where the PA stands for Precision Architecture. The design is also referred to as ...

and

architecture under

HP-UX HP-UX (from "Hewlett Packard Unix") is Hewlett Packard Enterprise's proprietary implementation of the Unix operating system, based on Unix System V (initially System III) and first released in 1984. Current versions support HPE Integrity Ser ...

and

. ; Intel MKL: The

Math Kernel Library Intel oneAPI Math Kernel Library (Intel oneMKL; formerly Intel Math Kernel Library or Intel MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, ...

, supporting x86 32-bits and 64-bits, available free from

. Includes optimizations for Intel

Pentium Pentium is a brand used for a series of x86 architecture-compatible microprocessors produced by Intel. The original Pentium processor from which the brand took its name was first released on March 22, 1993. After that, the Pentium II and Pe ...

Core Core or cores may refer to: Science and technology * Core (anatomy), everything except the appendages * Core (manufacturing), used in casting and molding * Core (optical fiber), the signal-carrying portion of an optical fiber * Core, the central ...

and Intel

Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...

CPUs and Intel

Xeon Phi Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programmi ...

; support for

Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for serv ...

and

. ; MathKeisan:

NEC is a Japanese multinational corporation, multinational information technology and electronics corporation, headquartered in Minato, Tokyo. The company was known as the Nippon Electric Company, Limited, before rebranding in 1983 as NEC. It prov ...

's math library, supporting

NEC SX architecture is a Japanese multinational information technology and electronics corporation, headquartered in Minato, Tokyo. The company was known as the Nippon Electric Company, Limited, before rebranding in 1983 as NEC. It provides IT and network soluti ...

under

SUPER-UX SUPER-UX is a version of the Unix operating system from NEC that is used on its SX series of supercomputers. History The initial version of SUPER-UX was based on UNIX System V version 3.1 with features from BSD 4.3. The version for the NEC SX-9 ...

, and

Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance computin ...

under

; Netlib BLAS: The official reference implementation on

Netlib, written in Fortran 77.
; Netlib CBLAS: Reference C interface to the BLAS. It is also possible (and popular) to call the Fortran BLAS from C.
; OpenBLAS: Optimized BLAS based on GotoBLAS, supporting x86, x86-64, MIPS and ARM processors.
; PDLIB/SX: NEC's Public Domain Mathematical Library for the NEC SX-4 system.
; rocBLAS: Implementation that runs on AMD GPUs via ROCm.
; SCSL: SGI's Scientific Computing Software Library contains BLAS and LAPACK implementations for SGI's Irix workstations.
; Sun Performance Library: Optimized BLAS and LAPACK for SPARC and AMD64 architectures under Solaris 8, 9, and 10 as well as Linux.
; uBLAS: A generic C++ template class library providing BLAS functionality. Part of the Boost library. It provides bindings to many hardware-accelerated libraries in a unifying notation. Moreover, uBLAS focuses on correctness of the algorithms using advanced C++ features.
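Calling a Fortran BLAS from a row-major language such as C usually means dealing with Fortran's column-major storage. The following is a minimal pure-Python sketch of one common workaround: a row-major buffer read as column-major is the transpose, so C = AB can be obtained by asking the column-major kernel for Cᵀ = BᵀAᵀ, which amounts to swapping the operands. The names `gemm_colmajor` and `matmul_rowmajor` are hypothetical and only loosely modelled on the GEMM argument list; this is not the code of any library listed above.

```python
def gemm_colmajor(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Compute C <- alpha*A*B + beta*C in place.

    A is m x k, B is k x n, C is m x n, all stored column-major in flat
    lists with leading dimensions lda, ldb, ldc (illustrative kernel only).
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]


def matmul_rowmajor(m, n, k, a, b):
    """Return row-major C = A*B (A is m x k, B is k x n) using only the
    column-major kernel above."""
    c = [0.0] * (m * n)
    # Read as column-major, the row-major buffers hold A^T, B^T and C^T.
    # C^T (n x m) = B^T (n x k) * A^T (k x m), so the operands are swapped.
    gemm_colmajor(n, m, k, 1.0, b, n, a, k, 0.0, c, n)
    return c


# A = [[1, 2, 3], [4, 5, 6]], B = [[7, 8], [9, 10], [11, 12]], both row-major.
product = matmul_rowmajor(2, 2, 3, [1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12])
print(product)  # [58.0, 64.0, 139.0, 154.0]
```

No data is copied or transposed explicitly; only the dimensions and the operand order change, which is why this approach is cheap when mixing the two storage conventions.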

Libraries using BLAS

; Armadillo: Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS/ATLAS and LAPACK. It is sponsored by NICTA (in Australia) and is licensed under a free license.
; LAPACK: LAPACK is a higher-level linear algebra library built upon BLAS. Like BLAS, it has a reference implementation, and many alternatives such as libFlame and MKL exist.
; Mir: An LLVM-accelerated generic numerical library for science and machine learning written in D. It provides generic linear algebra subprograms (GLAS). It can be built on a CBLAS implementation.

Similar libraries (not compatible with BLAS)

; Elemental: Elemental is open-source software for distributed-memory dense and sparse-direct linear algebra and optimization.
; HASEM: A C++ template library that can solve linear equations and compute eigenvalues. It is licensed under the BSD License.
; LAMA: The Library for Accelerated Math Applications (LAMA) is a C++ template library for writing numerical solvers targeting various kinds of hardware (e.g. GPUs through CUDA) on distributed memory systems, hiding the hardware-specific programming from the program developer.
; MTL4: The Matrix Template Library version 4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to generic programming.

Sparse BLAS

Several extensions to BLAS for handling sparse matrices have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.
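To give a sense of what a sparse matrix kernel looks like, here is a minimal pure-Python sketch of a sparse matrix-vector product y ← αAx + βy. It assumes the compressed sparse row (CSR) layout, which is one common storage scheme chosen here for illustration; the function name `csr_matvec` and its arguments are hypothetical and do not follow the interface of the standardized routines.

```python
def csr_matvec(n_rows, row_ptr, col_idx, values, x, alpha=1.0, beta=0.0, y=None):
    """Return alpha*A*x + beta*y for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] is the slice of col_idx/values that holds the
    nonzero entries of row i (assumed layout, for illustration only).
    """
    if y is None:
        y = [0.0] * n_rows
    result = []
    for i in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of row i are visited.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        result.append(alpha * acc + beta * y[i])
    return result


# A = [[4, 0, 0],
#      [0, 0, 2],
#      [1, 0, 3]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 2]
values = [4.0, 2.0, 1.0, 3.0]
print(csr_matvec(3, row_ptr, col_idx, values, [1.0, 2.0, 3.0]))  # [4.0, 6.0, 10.0]
```

The work is proportional to the number of stored nonzeros rather than to the full matrix size, which is the main reason dedicated sparse kernels exist alongside the dense BLAS routines.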

Batched BLAS

The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as GPUs. Here, the traditional BLAS functions typically provide good performance for large matrices. However, when computing e.g. matrix-matrix products of many small matrices with the GEMM routine, those architectures show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017. Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:

:\boldsymbol{C}[k] \leftarrow \alpha \boldsymbol{A}[k] \boldsymbol{B}[k] + \beta \boldsymbol{C}[k] \quad \forall k

The index k in square brackets indicates that the operation is performed for all matrices k in a stack. Often, this operation is implemented for a strided batched memory layout in which all matrices are stored one after another, concatenated in the arrays A, B and C.

Batched BLAS functions can be a versatile tool and allow, for example, a fast implementation of exponential integrators and Magnus integrators that handle long integration periods with many time steps. Here, the matrix exponentiation, the computationally expensive part of the integration, can be implemented in parallel for all time steps by using Batched BLAS functions.
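The strided batched layout described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the specified Batched BLAS interface: the function name `gemm_strided_batched`, its argument order, the row-major storage inside each matrix, and the use of a single α and β for the whole batch are all choices made for this example. A real implementation would process the batch entries in parallel rather than in a sequential loop.

```python
def gemm_strided_batched(m, n, k, alpha, a, stride_a, b, stride_b,
                         beta, c, stride_c, batch):
    """Compute C[q] <- alpha*A[q]*B[q] + beta*C[q] in place for q in 0..batch-1.

    Matrix q of each stack starts at offset q*stride in its flat array and
    is stored row-major (assumed layout, for illustration only).
    """
    for q in range(batch):  # batch entries are independent of each other
        a0, b0, c0 = q * stride_a, q * stride_b, q * stride_c
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a0 + i * k + p] * b[b0 + p * n + j]
                c[c0 + i * n + j] = alpha * acc + beta * c[c0 + i * n + j]


# Two 2x2 products stored back to back (stride 4 in every array).
a = [1.0, 2.0, 3.0, 4.0,   1.0, 0.0, 0.0, 1.0]     # A[0], then A[1] = identity
b = [5.0, 6.0, 7.0, 8.0,   9.0, 10.0, 11.0, 12.0]  # B[0], then B[1]
c = [0.0] * 8
gemm_strided_batched(2, 2, 2, 1.0, a, 4, b, 4, 0.0, c, 4, batch=2)
print(c)  # [19.0, 22.0, 43.0, 50.0, 9.0, 10.0, 11.0, 12.0]
```

Because every matrix sits at a fixed offset from the previous one, a single call describes the whole stack with three base arrays and three strides, which avoids the per-call overhead that dominates when the individual matrices are small.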


External links

BLAS homepage on Netlib.org

from LAPACK Users' Guide

One of the original authors of the BLAS discusses its creation in an oral history interview. Charles L. Lawson Oral history interview by Thomas Haigh, 6 and 7 November 2004, San Clemente, California. Society for Industrial and Applied Mathematics, Philadelphia, PA.

In an oral history interview, Jack Dongarra explores the early relationship of BLAS to LINPACK, the creation of higher level BLAS versions for new architectures, and his later work on the ATLAS system to automatically optimize BLAS for particular machines. Jack Dongarra, Oral history interview by Thomas Haigh, 26 April 2005, University of Tennessee, Knoxville TN. Society for Industrial and Applied Mathematics, Philadelphia, PA
How does BLAS get such extreme performance?
Ten naive 1000×1000 matrix multiplications (10¹⁰ floating point multiply-adds) take 15.77 seconds on a 2.6 GHz processor; a BLAS implementation takes 1.32 seconds.

An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum