In computing, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. The term ''benchmark'' is also commonly used for the elaborately designed benchmarking programs themselves. Benchmarking is usually associated with assessing performance characteristics of computer hardware, for example the floating point operation performance of a CPU, but there are circumstances in which the technique is also applicable to software. Software benchmarks are, for example, run against compilers or database management systems (DBMS). Benchmarks provide a method of comparing the performance of various subsystems across different chip/system architectures. Benchmarking as a part of continuous integration is called continuous benchmarking.
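In its simplest form, benchmarking a piece of code means timing repeated runs of a fixed workload. A minimal sketch in Python using the standard `timeit` module (the workload and repetition counts here are purely illustrative):

```python
import timeit

def workload():
    # illustrative workload: sum a million integers
    return sum(range(1_000_000))

# repeat=5 independent trials, each timing number=10 calls;
# the minimum is the least noise-affected estimate
times = timeit.repeat(workload, number=10, repeat=5)
best = min(times) / 10  # seconds per call
print(f"best: {best:.4f} s per call")
```

Real benchmark suites add warm-up runs, statistical reporting, and controlled environments on top of this basic measure-and-repeat loop.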


Purpose

As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that allowed comparison of different architectures. For example, Pentium 4 processors generally operated at a higher clock frequency than Athlon XP or PowerPC processors, which did not necessarily translate to more computational power; a processor with a slower clock frequency might perform as well as or even better than a processor operating at a higher frequency. See BogoMips and the megahertz myth.

Benchmarks are designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this with specially created programs that impose the workload on the component. Application benchmarks run real-world programs on the system. While application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks are useful for testing individual components, like a hard disk or networking device.

Benchmarks are particularly important in CPU design, giving processor architects the ability to measure and make tradeoffs in microarchitectural decisions. For example, if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application. Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance. Prior to 2000, computer and microprocessor architects used SPEC to do this, although SPEC's Unix-based benchmarks were quite lengthy and thus unwieldy to use intact.

Computer companies are known to configure their systems to give unrealistically high performance on benchmark tests that is not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace it with a faster, mathematically equivalent operation. However, such a transformation was rarely useful outside the benchmark until the mid-1990s, when RISC and VLIW architectures emphasized the importance of compiler technology as it related to performance. Benchmarks are now regularly used by compiler companies to improve not only their own benchmark scores, but real application performance.

CPUs with many execution units (such as superscalar, VLIW, or reconfigurable computing CPUs) typically have slower clock rates than a sequential CPU with one or two execution units when built from transistors that are just as fast. Nevertheless, CPUs with many execution units often complete real-world and benchmark tasks in less time than the supposedly faster high-clock-rate CPU.

Given the large number of benchmarks available, a vendor can usually find at least one benchmark that shows its system outperforming another system; the other systems can be shown to excel with a different benchmark. Software vendors also use benchmarks in their marketing, such as the "benchmark wars" between rival relational database makers in the 1980s and 1990s. Companies commonly report only those benchmarks (or aspects of benchmarks) that show their products in the best light. They have also been known to misrepresent the significance of benchmarks, again to show their products in the best possible light. Ideally, benchmarks should only substitute for real applications if the application is unavailable, or too difficult or costly to port to a specific processor or computer system. If performance is critical, the only benchmark that matters is the target environment's application suite.


Functionality

Features of benchmarking software may include recording or exporting the course of performance to a spreadsheet file, visualization such as drawing line graphs or color-coded tiles, and pausing the process so it can be resumed without starting over. Software can have additional features specific to its purpose; for example, disk benchmarking software may be able to measure the disk speed within a specified range of the disk rather than the full disk, measure random-access read speed and latency, offer a "quick scan" feature that estimates speed from samples of specified intervals and sizes, and allow specifying a data block size, meaning the number of requested bytes per read request.
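A disk benchmark of the kind described, random reads at a fixed block size, can be sketched as follows. Function and parameter names are illustrative, and a naive version like this may largely measure the operating system's page cache rather than the physical disk; real tools bypass caching and target raw devices:

```python
import os
import random
import tempfile
import time

def random_read_benchmark(path, block_size=4096, reads=100):
    """Toy random-access read benchmark: seek to random offsets and time
    fixed-size reads. Returns (throughput in B/s, worst latency in s)."""
    size = os.path.getsize(path)
    latencies = []
    with open(path, "rb", buffering=0) as f:  # unbuffered at the Python level
        for _ in range(reads):
            f.seek(random.randrange(0, max(1, size - block_size)))
            start = time.perf_counter()
            f.read(block_size)
            latencies.append(time.perf_counter() - start)
    return (reads * block_size) / sum(latencies), max(latencies)

# demonstrate on a scratch file (real tools target a device or partition)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))  # 1 MiB of data
throughput, worst = random_read_benchmark(tmp.name, block_size=4096, reads=50)
print(f"{throughput / 1e6:.1f} MB/s, worst latency {worst * 1e3:.3f} ms")
os.unlink(tmp.name)
```

Reporting both throughput and worst-case latency, as this sketch does, mirrors the feature list above: aggregate speed alone hides latency outliers.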


Challenges

Benchmarking is not easy and often involves several iterative rounds in order to arrive at predictable, useful conclusions. Interpretation of benchmarking data is also extraordinarily difficult. Here is a partial list of common challenges:

* Vendors tend to tune their products specifically for industry-standard benchmarks. Norton SysInfo (SI) is particularly easy to tune for, since it is mainly biased toward the speed of multiple operations. Use extreme caution in interpreting such results.
* Some vendors have been accused of "cheating" at benchmarks: designing their systems so that they give much higher benchmark numbers but are not as effective at the actual likely workload.
* Many benchmarks focus entirely on the speed of computational performance, neglecting other important features of a computer system, such as:
** Qualities of service, aside from raw performance. Examples of unmeasured qualities of service include security, availability, reliability, execution integrity, serviceability, and scalability (especially the ability to quickly and nondisruptively add or reallocate capacity). There are often real trade-offs between and among these qualities of service, and all are important in business computing. Transaction Processing Performance Council benchmark specifications partially address these concerns by specifying ACID property tests, database scalability rules, and service level requirements.
** Total cost of ownership (TCO), which benchmarks in general do not measure. Transaction Processing Performance Council benchmark specifications partially address this concern by specifying that a price/performance metric must be reported in addition to a raw performance metric, using a simplified TCO formula. However, the costs are necessarily only partial, and vendors have been known to price specifically (and only) for the benchmark, designing a highly specific "benchmark special" configuration with an artificially low price. Even a tiny deviation from the benchmark package can result in a much higher price in real-world experience.
** Facilities burden (space, power, and cooling). When more power is used, a portable system will have a shorter battery life and require recharging more often. A server that consumes more power and/or space may not fit within existing data center resource constraints, including cooling limitations. There are real trade-offs, as most semiconductors require more power to switch faster. See also performance per watt.
** Code density. In some embedded systems, where memory is a significant cost, better code density can significantly reduce costs.
* Vendor benchmarks tend to ignore requirements for development, test, and disaster recovery computing capacity. Vendors prefer to report only what might be narrowly required for production capacity, to make their initial acquisition price seem as low as possible.
* Benchmarks have trouble adapting to widely distributed servers, particularly those with extra sensitivity to network topologies. The emergence of grid computing, in particular, complicates benchmarking, since some workloads are "grid friendly" while others are not.
* Users can have very different perceptions of performance than benchmarks may suggest. In particular, users appreciate predictability: servers that always meet or exceed service level agreements. Benchmarks tend to emphasize mean scores (the IT perspective) rather than maximum worst-case response times (the real-time computing perspective) or low standard deviations (the user perspective).
* Many server architectures degrade dramatically at high (near 100%) levels of usage ("fall off a cliff"), and benchmarks should (but often do not) take that factor into account. Vendors, in particular, tend to publish server benchmarks at a continuous load of about 80% usage, an unrealistic situation, and do not document what happens to the overall system when demand spikes beyond that level.
* Many benchmarks focus on one application, or even one application tier, to the exclusion of other applications. Most data centers now implement virtualization extensively for a variety of reasons, and benchmarking is still catching up to that reality, in which multiple applications and application tiers run concurrently on consolidated servers.
* There are few (if any) high-quality benchmarks that measure the performance of batch computing, especially high-volume concurrent batch and online computing. Batch computing tends to be much more focused on the predictability of completing long-running tasks correctly before deadlines, such as end of month or end of fiscal year. Many important core business processes are batch-oriented and probably always will be, such as billing.
* Benchmarking institutions often disregard or do not follow basic scientific method. This includes, but is not limited to: small sample size, lack of variable control, and limited repeatability of results.
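The mean-versus-worst-case point in the list above is easy to see in code. The same latency samples can be summarized from the three perspectives the text names; the sample values below are invented for illustration:

```python
import statistics

def latency_summary(samples_ms):
    """Summarize response-time samples three ways: mean (IT perspective),
    worst case (real-time perspective), and spread (user-perceived
    predictability)."""
    return {
        "mean": statistics.mean(samples_ms),
        "worst": max(samples_ms),
        "stdev": statistics.pstdev(samples_ms),
    }

samples = [12.1, 11.8, 12.3, 95.0, 12.0]  # one outlier latency spike
print(latency_summary(samples))
```

A report quoting only the mean of these samples would hide the 95 ms spike entirely, which is precisely why users who care about predictability distrust mean-only benchmark scores.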


Benchmarking principles

There are seven vital characteristics for benchmarks. These key properties are:
# Relevance: benchmarks should measure relatively vital features.
# Representativeness: benchmark performance metrics should be broadly accepted by industry and academia.
# Equity: all systems should be fairly compared.
# Repeatability: benchmark results can be verified.
# Cost-effectiveness: benchmark tests are economical.
# Scalability: benchmark tests should work across systems possessing a range of resources from low to high.
# Transparency: benchmark metrics should be easy to understand.


Types of benchmark

# Real program
#* word processing software
#* tool software of CAD
#* user's application software (e.g. MIS)
#* video games
#* compilers building a large project, for example the Chromium browser or the Linux kernel
# Component benchmark / microbenchmark
#* core routine consisting of a relatively small and specific piece of code
#* measures the performance of a computer's basic components
#* may be used for automatic detection of a computer's hardware parameters, such as number of registers, cache size, and memory latency
# Kernel
#* contains key codes
#* normally abstracted from an actual program
#* popular kernel: Livermore loops
#* LINPACK benchmark (contains basic linear algebra subroutines written in FORTRAN)
#* results are represented in Mflop/s
# Synthetic benchmark
#* procedure for programming a synthetic benchmark:
#** take statistics of all types of operations from many application programs
#** get the proportion of each operation
#** write a program based on the proportions above
#* types of synthetic benchmark:
#** Whetstone
#** Dhrystone
#* These were the first general-purpose industry-standard computer benchmarks. They do not necessarily obtain high scores on modern pipelined computers.
# I/O benchmarks
# Database benchmarks
#* measure the throughput and response times of database management systems (DBMS)
# Parallel benchmarks
#* used on machines with multiple cores and/or processors, or systems consisting of multiple machines
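The three-step synthetic-benchmark procedure listed above (profile operation statistics, compute proportions, generate a program from them) can be sketched as follows. The operation mix here is invented for illustration, standing in for statistics that would really come from profiling application programs:

```python
import random
import time

# steps 1-2: a hypothetical operation mix, as if profiled from real applications
MIX = {"int_add": 0.5, "float_mul": 0.3, "branch": 0.2}

def run_synthetic(total_ops=100_000, seed=0):
    """Step 3: issue operations in the profiled proportions and time them."""
    rng = random.Random(seed)
    ops = rng.choices(list(MIX), weights=list(MIX.values()), k=total_ops)
    a, x = 1, 1.0
    start = time.perf_counter()
    for op in ops:
        if op == "int_add":
            a += 1
        elif op == "float_mul":
            x *= 1.0001
        else:  # "branch": a data-dependent conditional
            if a & 1:
                a += 2
    return total_ops / (time.perf_counter() - start)

print(f"{run_synthetic():.0f} synthetic ops/s")
```

Whetstone and Dhrystone were built on essentially this principle, with operation mixes derived from surveys of scientific and systems code respectively.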


Common benchmarks


Industry standard (audited and verifiable)

* Business Applications Performance Corporation (BAPCo)
* Embedded Microprocessor Benchmark Consortium (EEMBC)
* Standard Performance Evaluation Corporation (SPEC), in particular their SPECint and SPECfp
* Transaction Processing Performance Council (TPC): DBMS benchmarks


Open source benchmarks

* AIM Multiuser Benchmark – composed of a list of tests that can be mixed to create a 'load mix' simulating a specific computer function on any UNIX-type OS
* Bonnie++ – filesystem and hard drive benchmark
* BRL-CAD – cross-platform architecture-agnostic benchmark suite based on multithreaded ray tracing performance; baselined against a VAX-11/780 and used since 1984 for evaluating relative CPU performance, compiler differences, optimization levels, coherency, architecture differences, and operating system differences
* Collective Knowledge – customizable, cross-platform framework to crowdsource benchmarking and optimization of user workloads (such as deep learning) across hardware provided by volunteers
* Coremark – embedded computing benchmark
* DEISA Benchmark Suite – scientific HPC applications benchmark
* Dhrystone – integer arithmetic performance, often reported in DMIPS (Dhrystone millions of instructions per second)
* DiskSpd – command-line tool for storage benchmarking that generates a variety of requests against computer files, partitions or storage devices
* Fhourstones – an integer benchmark
* HINT – designed to measure overall CPU and memory performance
* Iometer – I/O subsystem measurement and characterization tool for single and clustered systems
* IOzone – filesystem benchmark
* LINPACK benchmarks – traditionally used to measure FLOPS
* Livermore loops
* NAS parallel benchmarks
* NBench – synthetic benchmark suite measuring performance of integer arithmetic, memory operations, and floating-point arithmetic
* PAL – a benchmark for realtime physics engines
* PerfKit Benchmarker – a set of benchmarks to measure and compare cloud offerings
* Phoronix Test Suite – open-source cross-platform benchmarking suite for Linux, OpenSolaris, FreeBSD, macOS and Windows; it includes a number of the other benchmarks on this page to simplify execution
* POV-Ray – 3D render
* Tak – a simple benchmark used to test recursion performance
* TATP Benchmark – Telecommunication Application Transaction Processing benchmark
* TPoX – an XML transaction processing benchmark for XML databases
* VUP (VAX unit of performance) – also called VAX MIPS
* Whetstone – floating-point arithmetic performance, often reported in millions of Whetstone instructions per second (MWIPS)


Microsoft Windows benchmarks

* BAPCo: MobileMark, SYSmark, WebMark
* CrystalDiskMark
* Underwriters Laboratories (UL): 3DMark, PCMark
* Heaven Benchmark
* PiFast
* Superposition Benchmark
* Super PI
* SuperPrime
* Whetstone
* Windows System Assessment Tool, included with Windows Vista and later releases, providing an index for consumers to rate their systems easily
* WorldBench (discontinued)


Unusual benchmark

* Will Smith Eating Spaghetti test – an informal test to determine the capabilities of text-to-video models


Others

* AnTuTu – commonly used on phones and ARM-based devices
* Byte Sieve – originally tested language performance, but widely used as a machine benchmark as well
* Creative Computing Benchmark – compares the BASIC programming language on various platforms; introduced in 1983
* Geekbench – a cross-platform benchmark for Windows, Linux, macOS, iOS and Android
* iCOMP – the Intel comparative microprocessor performance index, published by Intel
* Khornerstone
* Novabench – a computer benchmarking utility for Microsoft Windows, macOS, and Linux
* Performance Rating – modeling scheme used by AMD and Cyrix to reflect relative performance, usually compared to competing products
* Rugg/Feldman benchmarks – one of the earliest microcomputer benchmarks, from 1977
* SunSpider – a browser speed test
* UserBenchmark – PC benchmark utility
* VMmark – a virtualization benchmark suite


See also

* Benchmarking (business perspective)
* Figure of merit
* Lossless compression benchmarks
* Performance Counter Monitor
* Test suite – a collection of test cases intended to show that a software program has some specified set of behaviors

