Event Stream Processing
In computer science, stream processing (also known as event stream processing, data stream processing, or distributed stream processing) is a programming paradigm which views data streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration, including floating-point units, graphics processing units, and field-programmable gate arrays.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a ''stream''), a series of operations (''kernel functions'') is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted, in order to minimize the loss in bandwidth associated with external memory interaction. ''Uniform streaming'', where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to initiate a direct memory access (DMA) when dependencies become known. The elimination of manual DMA management reduces software complexity, and the associated elimination of hardware cached I/O reduces the data area that has to be serviced by specialized computational units such as arithmetic logic units.

During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL (Streams and Iteration in a Single Assignment Language).


Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP- or GPU-type applications (such as image, video and digital signal processing) but less so for general-purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, easier, faster and more efficient execution becomes possible. Depending on the context, processor design may be tuned for maximum efficiency or traded off for flexibility.

Stream processing is especially suitable for applications that exhibit three characteristics:

* Compute intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
* Data parallelism, which exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
* Data locality, a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels, as well as intermediate data within kernel functions, can capture this locality directly using the stream processing programming model.

Examples of records within streams include:

* In graphics, each record might be the vertex, normal, and color information for a triangle;
* In image processing, each record might be a single pixel from an image;
* In a video encoder, each record may be 256 pixels forming a macroblock of data; or
* In wireless signal processing, each record could be a sequence of samples received from an antenna.

For each record we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.
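As a concrete illustration of this read-transform-write discipline, the following C++ sketch (not from the original article; the Pixel record and brighten_kernel function are purely illustrative) applies one uniform kernel to every record of an input stream and writes only to a separate output stream:

#include <algorithm>
#include <cstdint>
#include <vector>

// One record of the stream: a single grey-scale pixel.
struct Pixel {
    std::uint8_t value;
};

// Kernel function: a pure mapping from one input record to one output record.
// It never touches memory that is both readable and writable.
static Pixel brighten_kernel(Pixel in) {
    const int brightened = in.value + 40;
    return Pixel{ static_cast<std::uint8_t>(brightened > 255 ? 255 : brightened) };
}

int main() {
    std::vector<Pixel> input(1024, Pixel{100});  // input stream (read-only here)
    std::vector<Pixel> output(input.size());     // output stream (write-only here)

    // Uniform streaming: the same kernel is applied to every element.
    std::transform(input.begin(), input.end(), output.begin(), brighten_kernel);
    return 0;
}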


Code examples

By way of illustration, the following code fragments demonstrate detection of patterns within event streams. The first is an example of processing a data stream using a continuous SQL query (a query that executes forever, processing arriving data based on timestamps and window duration). This code fragment illustrates a JOIN of two data streams, one for stock orders, and one for the resulting stock trades. The query outputs a stream of all Orders matched by a Trade within one second of the Order being placed. The output stream is sorted by timestamp, in this case the timestamp from the Orders stream.

SELECT DataStream
   Orders.TimeStamp, Orders.orderId, Orders.ticker,
   Orders.amount, Trade.amount
FROM Orders
JOIN Trades OVER (RANGE INTERVAL '1' SECOND FOLLOWING)
ON Orders.orderId = Trades.orderId;

Another sample code fragment detects weddings among a flow of external "events" such as church bells ringing, the appearance of a man in a tuxedo or morning suit, a woman in a flowing white gown and rice flying through the air. A "complex" or "composite" event is what one infers from the individual simple events: a wedding is happening.

WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
  Person.Clothes EQUALS "gown" AND
  (Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding


Comparison to prior parallel paradigms

Basic computers started from a sequential execution paradigm. Traditional CPUs are SISD based, which means they conceptually perform only one operation at a time. As the computing needs of the world evolved, the amount of data to be managed increased very quickly. It was obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts were spent on finding alternative ways to perform massive amounts of computations, but the only solution was to exploit some level of parallel execution. The result of those efforts was SIMD, a programming paradigm which allowed applying one instruction to multiple instances of (different) data. Most of the time, SIMD was used in a SWAR environment. By using more complicated structures, one could also have MIMD parallelism.

Although those two paradigms were efficient, real-world implementations were plagued with limitations, from memory alignment problems to synchronization issues and limited parallelism. Only a few SIMD processors survived as stand-alone components; most were embedded in standard CPUs.

Consider a simple program adding up two arrays containing 100 4-component vectors (i.e. 400 numbers in total).


Conventional, sequential paradigm

for (int i = 0; i < 400; i++)
    result[i] = source0[i] + source1[i];

This is the sequential paradigm that is most familiar. Variations do exist (such as inner loops, structures and such), but they ultimately boil down to that construct.


Parallel SIMD paradigm, packed registers (SWAR)

for (int el = 0; el < 100; el++) // for each vector
    vector_sum(result[el], source0[el], source1[el]);

This is actually oversimplified. It assumes the instruction vector_sum works. Although this is what happens with instruction intrinsics, much information is actually not taken into account here, such as the number of vector components and their data format. This is done for clarity. You can see, however, that this method reduces the number of decoded instructions from ''numElements * componentsPerElement'' to ''numElements''. The number of jump instructions is also decreased, as the loop is run fewer times. These gains result from the parallel execution of the four mathematical operations. What happens, however, is that the packed SIMD register holds a certain amount of data, so it is not possible to get more parallelism. The speed-up is somewhat limited by the assumption we made of performing four parallel operations (note that this is common to both AltiVec and SSE).
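For concreteness, a hypothetical vector_sum over 4-component single-precision vectors could be written with SSE intrinsics roughly as follows. This is only a sketch: the article does not specify an instruction set beyond mentioning AltiVec and SSE, and the flat float-array layout assumed here is illustrative.

#include <xmmintrin.h>  // SSE intrinsics

// Adds one 4-component vector from each source and stores the packed result.
static inline void vector_sum(float *result, const float *source0, const float *source1)
{
    __m128 a = _mm_loadu_ps(source0);   // load 4 floats from source0
    __m128 b = _mm_loadu_ps(source1);   // load 4 floats from source1
    __m128 r = _mm_add_ps(a, b);        // one instruction adds all 4 components
    _mm_storeu_ps(result, r);           // store 4 floats to result
}

// The loop above then becomes, assuming flat arrays of 400 floats:
// for (int el = 0; el < 100; el++)
//     vector_sum(&result[el * 4], &source0[el * 4], &source1[el * 4]);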


Parallel Stream paradigm (SIMD/MIMD)

// This is a fictional language for demonstration purposes.
elements = array streamElement([number, number])[100]
kernel = instance streamKernel("@arg0[@iter]")
result = kernel.invoke(elements)
In this paradigm, the whole dataset is defined, rather than each component block being defined separately. Describing the set of data is assumed to be done in the first two rows. After that, the result is inferred from the sources and the kernel. For simplicity, there is a 1:1 mapping between input and output data, but this does not need to be the case. Applied kernels can also be much more complex.

An implementation of this paradigm can "unroll" a loop internally. This allows throughput to scale with chip complexity, easily utilizing hundreds of ALUs. The elimination of complex data patterns makes much of this extra power available.

While stream processing is a branch of SIMD/MIMD processing, they must not be confused. Although SIMD implementations can often work in a "streaming" manner, their performance is not comparable: the model envisions a very different usage pattern which allows far greater performance by itself. It has been noted that when applied to generic processors such as a standard CPU, only a 1.5x speedup can be reached. By contrast, ad-hoc stream processors easily reach over 10x performance, mainly attributed to more efficient memory access and higher levels of parallel processing.

Although there are various degrees of flexibility allowed by the model, stream processors usually impose some limitations on the kernel or stream size. For example, consumer hardware often lacks the ability to perform high-precision math, lacks complex indirection chains or presents lower limits on the number of instructions which can be executed.
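In a modern C++ setting, the same "define the whole dataset, then invoke a kernel over it" idea can be approximated with a parallel algorithm. This is only an analogy to the fictional language above, not part of the original article, and it assumes a C++17 compiler with parallel algorithm support; names such as add_kernel are illustrative.

#include <algorithm>
#include <array>
#include <execution>
#include <vector>

using Vec4 = std::array<float, 4>;  // one 4-component stream element

// The kernel: component-wise addition of two stream elements.
static Vec4 add_kernel(const Vec4 &a, const Vec4 &b) {
    return { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
}

int main() {
    std::vector<Vec4> source0(100), source1(100), result(100);

    // The runtime is free to "unroll" this across as many ALUs as it has;
    // the programmer only states that the kernel applies to every element.
    std::transform(std::execution::par_unseq,
                   source0.begin(), source0.end(),
                   source1.begin(), result.begin(),
                   add_kernel);
    return 0;
}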


Research

Stanford University stream processing projects included the Stanford Real-Time Programmable Shading Project started in 1999. A prototype called Imagine was developed in 2002. A project called Merrimac ran until about 2004. AT&T also researched stream-enhanced processors as graphics processing units rapidly evolved in both speed and functionality. Since these early days, dozens of stream processing languages have been developed, as well as specialized hardware.


Programming model notes

The most immediate challenge in the realm of parallel processing does not lie as much in the type of hardware architecture used, but in how easy it will be to program the system in question in a real-world environment with acceptable performance. Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and DMA scheduling. This in itself is a result of the research at MIT and Stanford in finding an optimal ''layering of tasks'' between programmer, tools and hardware. Programmers beat tools in mapping algorithms to parallel hardware, and tools beat programmers in figuring out the smartest memory allocation schemes, etc. Of particular concern are MIMD designs such as Cell, for which the programmer needs to deal with application partitioning across multiple cores and with process synchronization and load balancing.

A drawback of SIMD programming was the issue of array-of-structures (AoS) versus structure-of-arrays (SoA). Programmers often create representations of entities in memory, for example, the location of a particle in 3D space, the colour of the ball and its size, as below:

// A particle in a three-dimensional space.
struct particle_t {
    float x, y, z;          // position
    unsigned char color[3]; // colour, 8 bits per channel (RGB)
    float size;
    // ... and many other attributes may follow ...
};

When multiple of these structures exist in memory they are placed end to end, creating an array in an ''array of structures'' (AoS) topology. This means that should some algorithm be applied to the location of each particle in turn, it must skip over the memory locations containing the other attributes. If these attributes are not needed, this results in wasteful usage of the CPU cache. Additionally, a SIMD instruction will typically expect the data it operates on to be contiguous in memory; the elements may also need to be aligned. By moving the memory location of the data out of the structure, data can be better organised for efficient access in a stream and for SIMD instructions to operate on. A ''structure of arrays'' (SoA), as shown below, can allow this.

struct particle_t {
    float *x, *y, *z;       // one array per position component
    unsigned char *colorRed, *colorGreen, *colorBlue;
    float *size;
};

Instead of holding the data in the structure, it holds only pointers (memory locations) to the data. A shortcoming is that if multiple attributes of an object are to be operated on, they might now be distant in memory and so result in a cache miss. The aligning and any needed padding lead to increased memory usage. Overall, memory management may be more complicated if structures are added and removed, for example.

For stream processors, the usage of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a ''stride'' between 'consecutive' elements of the same attributes, effectively allowing interleaved data. When the GPU begins the stream processing, it will ''gather'' all the various attributes in a single set of parameters (usually this looks like a structure or a "magic global variable"), perform the operations and ''scatter'' the results to some memory area for later processing (or retrieving).

More modern stream processing frameworks provide a FIFO-like interface to structure data as a literal stream. This abstraction provides a means to specify data dependencies implicitly while enabling the runtime/hardware to take full advantage of that knowledge for efficient computation. One of the simplest and most efficient stream processing modalities to date for C++ is
RaftLib, which enables linking independent compute kernels together as a data flow graph using C++ stream operators. As an example:

#include <raft>
#include <raftio>
#include <cstdlib>
#include <string>

class hi : public raft::kernel
{
public:
    hi() : raft::kernel()
    {
        // single output port carrying std::string tokens
        output.addPort< std::string >( "0" );
    }

    virtual raft::kstatus run()
    {
        // push one token downstream, then tell the runtime to stop
        output[ "0" ].push( std::string( "Hello World\n" ) );
        return( raft::stop );
    }
};

int main( int argc, char **argv )
{
    hi hello;                       // producer kernel
    raft::print< std::string > p;   // built-in print kernel
    raft::map m;
    m += hello >> p;                // link kernels into a data flow graph
    m.exe();                        // execute the graph
    return( EXIT_SUCCESS );
}


Models of computation for stream processing

Apart from specifying streaming applications in high-level languages, models of computation (MoCs) have also been widely used, such as dataflow models and process-based models.
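As a rough illustration of the process-based style (not taken from the article; the Channel class and token counts below are purely illustrative), two concurrent processes can communicate only through a blocking FIFO channel, which is the essential structure of a Kahn-process-network-like model:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A minimal blocking FIFO channel, the communication primitive of
// process-based models of computation.
template <typename T>
class Channel {
public:
    void put(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
        cv_.notify_one();
    }
    T get() {  // blocks until a token is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
private:
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

int main() {
    Channel<int> channel;
    const int kTokens = 10;

    // Producer process: emits a finite stream of tokens.
    std::thread producer([&] {
        for (int i = 0; i < kTokens; ++i) channel.put(i * i);
    });

    // Consumer process: reads tokens and prints them; the two processes
    // run concurrently and synchronize only through the channel.
    std::thread consumer([&] {
        for (int i = 0; i < kTokens; ++i) std::cout << channel.get() << '\n';
    });

    producer.join();
    consumer.join();
    return 0;
}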


Generic processor architecture

Historically, CPUs began implementing various tiers of memory access optimizations because of the ever-increasing gap between processor performance and relatively slow-growing external memory bandwidth. As this gap widened, large amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes to those few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimation, consider it to be less than 10%).

A similar architecture exists on stream processors, but thanks to the new programming model, the amount of transistors dedicated to management is actually very small.

From a whole system point of view, stream processors usually exist in a controlled environment. GPUs do exist on an add-in board (this seems to also apply to Imagine). CPUs do the job of managing system resources, running applications, and such. The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multi-buses have been employed in the past). The exact number of memory lanes depends on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (actually up to 512 MB) with a slightly slower crossbar that is 256 bits wide. By contrast, standard processors from the Intel Pentium to some Athlon 64 have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The thing which most closely matches a multiple pointer indirection is an ''indirection chain'', which is however guaranteed to finally read or write from a specific memory area (inside a stream). Because of the SIMD nature of the stream processor's execution units (ALU clusters), read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example). This also allows for efficient memory bus negotiations.

Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays off.

Internally, a stream processor features some clever communication and management circuits, but what is interesting is the ''Stream Register File'' (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulk. As a cache-like, software-controlled structure for the various ALUs, the SRF is shared between all the ALU clusters. The key concept and innovation, demonstrated with Stanford's Imagine chip, is that the compiler is able to automate and allocate memory in an optimal way, fully transparent to the programmer. The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform flow analysis and optimally pack the SRFs. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler did as well as or better at scheduling memory than careful hand tuning.

Because inter-cluster communication is assumed to be rare, there can be many clusters. Internally, however, each cluster can efficiently exploit a much lower number of ALUs because intra-cluster communication is common and thus needs to be highly efficient. To keep those ALUs fed with data, each ALU is equipped with local register files (LRFs), which are basically its usable registers. This three-tiered data access pattern makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving.


Hardware-in-the-loop issues

Although an order of magnitude speedup can reasonably be expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this. Communication latencies are actually the biggest problem. Although PCI Express improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work can still take a long time. This means it is usually counter-productive to use them for small datasets. Because changing the kernel is a rather expensive operation, the stream architecture also incurs penalties for small streams, a behaviour referred to as the ''short stream effect''.

Pipelining is a very widespread and heavily used practice on stream processors, with GPUs featuring pipelines exceeding 200 stages. The cost for switching settings depends on the setting being modified, but it is now considered to always be expensive. To avoid those problems at various levels of the pipeline, many techniques have been deployed, such as "über shaders" and "texture atlases". Those techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.
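A back-of-envelope model makes the short stream effect concrete. The numbers below are not from the article; they assume, purely for illustration, a fixed per-invocation overhead of 10 microseconds and a sustained rate of one element per nanosecond once the kernel is running:

#include <cstdio>

int main() {
    const double overhead_us = 10.0;  // assumed fixed cost per kernel launch/transfer
    const double ns_per_elem = 1.0;   // assumed sustained processing rate

    // Effective throughput collapses for short streams, where the fixed
    // overhead dominates the total time.
    for (long n = 1000; n <= 100000000; n *= 10) {
        double total_ns = overhead_us * 1000.0 + n * ns_per_elem;
        double fraction_of_peak = n / total_ns;  // peak is 1 element per ns
        std::printf("%9ld elements -> %.3f of peak throughput\n", n, fraction_of_peak);
    }
    return 0;
}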


Examples

* The Blitter in the Commodore Amiga is an early (circa 1985) graphics processor capable of combining three source streams of 16-component bit vectors in 256 ways to produce an output stream consisting of 16-component bit vectors. Total input stream bandwidth is up to 42 million bits per second. Output stream bandwidth is up to 28 million bits per second.
* Imagine, headed by Professor William Dally of Stanford University, is a flexible architecture intended to be both fast and energy efficient. The project, originally conceived in 1996, included architecture, software tools, a VLSI implementation and a development board, and was funded by DARPA, Intel and Texas Instruments.
* Another Stanford project, called Merrimac, is aimed at developing a stream-based supercomputer. Merrimac intends to use a stream architecture and advanced interconnection networks to provide more performance per unit cost than cluster-based scientific computers built from the same technology.
* The Storm-1 family from Stream Processors, Inc, a commercial spin-off of Stanford's Imagine project, was announced during a feature presentation at ISSCC 2007. The family contains four members ranging from 30 GOPS to 220 16-bit GOPS (billions of operations per second), all fabricated at TSMC in a 130 nanometer process. The devices target the high end of the DSP market, including video conferencing, multifunction printers and digital video surveillance equipment.
* GPUs are widespread, consumer-grade stream processors designed mainly by AMD and Nvidia. Various generations are notable from a stream processing point of view:
** Pre-R2xx/NV2x: no explicit support for stream processing. Kernel operations were hidden in the API and provided too little flexibility for general use.
** R2xx/NV2x: kernel stream operations became explicitly under the programmer's control, but only for vertex processing (fragments were still using old paradigms). No branching support severely hampered flexibility, but some types of algorithms could be run (notably, low-precision fluid simulation).
** R3xx/NV4x: flexible branching support, although some limitations still exist on the number of operations to be executed and strict recursion depth, as well as array manipulation.
** R8xx: supports append/consume buffers and atomic operations. This generation is the state of the art.
* AMD FireStream, brand name for the product line targeting HPC
* Nvidia Tesla, brand name for the product line targeting HPC
* The Cell processor from STI, an alliance of Sony Computer Entertainment, Toshiba Corporation, and IBM, is a hardware architecture that can function like a stream processor with appropriate software support. It consists of a controlling processor, the PPE (Power Processing Element, an IBM PowerPC) and a set of SIMD coprocessors, called SPEs (Synergistic Processing Elements), each with independent program counters and instruction memory, in effect a MIMD machine. In the native programming model all DMA and program scheduling is left up to the programmer. The hardware provides a fast ring bus among the processors for local communication. Because the local memory for instructions and data is limited, the only programs that can exploit this architecture effectively either require a tiny memory footprint or adhere to a stream programming model. With a suitable algorithm the performance of the Cell can rival that of pure stream processors; however, this nearly always requires a complete redesign of algorithms and software.


Stream programming libraries and languages

Most programming languages for stream processors start with Java, C or C++ and add extensions which provide specific instructions to allow application developers to tag kernels and/or streams. This also applies to most shading languages, which can be considered stream programming languages to a certain degree.

Non-commercial examples of stream programming languages include:

* Ateji PX Free Edition, enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm on the JVM
* Auto-Pipe, from the Stream Based Supercomputing Lab at Washington University in St. Louis, an application development environment for streaming applications that allows authoring of applications for heterogeneous systems (CPU, GPGPU, FPGA). Applications can be developed in any combination of C, C++, and Java for the CPU, and Verilog or VHDL for FPGAs. CUDA is currently used for Nvidia GPGPUs. Auto-Pipe also handles coordination of TCP connections between multiple machines.
* ACOTES Programming Model: language from the Polytechnic University of Catalonia based on OpenMP
* BeepBeep, a simple and lightweight Java-based event stream processing library from the Formal Computer Science Lab at Université du Québec à Chicoutimi
* Brook language from Stanford
* CAL Actor Language: a high-level programming language for writing (dataflow) actors, which are stateful operators that transform input streams of data objects (tokens) into output streams.
* Cal2Many, a code generation framework from Halmstad University, Sweden. It takes CAL code as input and generates different target-specific languages including sequential C, Chisel, parallel C targeting the Epiphany architecture, and ajava & astruct targeting the Ambric architecture.
* DUP language from the Technical University of Munich and the University of Denver
* HSTREAM: a directive-based language extension for heterogeneous stream computing
* RaftLib, an open source C++ stream processing template library originally from the Stream Based Supercomputing Lab at Washington University in St. Louis
* SPar, a C++ domain-specific language for expressing stream parallelism from the Application Modelling Group (GMAP) at Pontifical Catholic University of Rio Grande do Sul
* Sh library from the University of Waterloo
* Shallows, an open source project
* S-Net coordination language from the University of Hertfordshire, which provides separation of coordination and algorithmic programming
* StreamIt from MIT
* Siddhi from WSO2
* WaveScript functional stream processing, also from MIT
* Functional reactive programming could be considered stream processing in a broad sense.

Commercial implementations are either general purpose or tied to specific hardware by a vendor. Examples of general purpose languages include:

* AccelerEyes' Jacket, a commercialization of a GPU engine for MATLAB
* Ateji PX, a Java extension that enables a simple expression of stream programming, the Actor model, and the MapReduce algorithm
* Embiot, a lightweight embedded streaming analytics agent from Telchemy
* Floodgate, a stream processor provided with the Gamebryo game engine for PlayStation 3, Xbox 360, Wii, and PC
* OpenHMPP, a "directive" vision of many-core programming
* PeakStream, a spinout of the Brook project (acquired by Google in June 2007)
* IBM Spade - Stream Processing Application Declarative Engine (B. Gedik, et al. SPADE: the system S declarative stream processing engine. ACM SIGMOD 2008.)
* RapidMind, a commercialization of Sh (acquired by Intel in August 2009)
* TStreams, Hewlett-Packard Cambridge Research Lab

Vendor-specific languages include:

* Brook+ (AMD hardware-optimized implementation of Brook) from AMD/ATI
* CUDA (Compute Unified Device Architecture) from Nvidia
* Intel Ct - C for Throughput Computing
* StreamC from Stream Processors, Inc, a commercialization of the Imagine work at Stanford

Event-based processing:

* Apama, a combined complex event processing and stream processing engine by Software AG
* Wallaroo
* WSO2 Stream Processor by WSO2
* Apache NiFi

Batch file-based processing (emulates some of actual stream processing, but with much lower performance in general):

* Apache Kafka
* Apache Storm
* Apache Apex
* Apache Spark Continuous Operator Stream Processing
* Apache Flink
* Walmartlabs Mupd8
* Eclipse Streamsheets - spreadsheet for stream processing

Stream processing services:

* Amazon Web Services - Kinesis
* Google Cloud - Dataflow
* Microsoft Azure - Stream Analytics
* Datastreams - Data Streaming Analytics Platform
* IBM Streams
** IBM Streaming Analytics
* Eventador SQLStreamBuilder

