Apache Kafka is a

distributed Distribution may refer to: Mathematics *Distribution (mathematics), generalized functions used to formulate solutions of partial differential equations *Probability distribution, the probability of a particular value or value range of a varia ...

event store and stream-processing platform. It is an

open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized so ...

system developed by the

Apache Software Foundation The Apache Software Foundation (ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open source software projects. The ASF was formed from a group of developers of the ...

written in

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...

and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides the Kafka Streams

libraries A library is a collection of Document, materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or electronic media, digital access (soft copies) materials, and may be a ...

for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks ..which allows Kafka to turn a bursty stream of random message writes into linear writes."

History

Kafka was originally developed at

LinkedIn LinkedIn () is an American business and employment-oriented online service that operates via websites and mobile apps. Launched on May 5, 2003, the platform is primarily used for professional networking and career development, and allows job se ...

, and was subsequently open sourced in early 2011. Jay Kreps,

Neha Narkhede Neha Narkhede is an Indian American technology entrepreneur and the co-founder and former CTO of Confluent, a streaming data technology company. She co-created the open source software platform Apache Kafka. Narkhede now serves as a board membe ...

and Jun Rao helped co-create Kafka.Li, S. (2020). He Left His High-Paying Job At LinkedIn And Then Built A $4.5 Billion Business In A Niche You've Never Heard Of. Forbes. Retrieved 8 June 2021, fro
Forbes_Kreps
Graduation from the

Apache Incubator Apache Incubator is the gateway for open-source projects intended to become fully fledged Apache Software Foundation projects. The Incubator project was created in October 2002 to provide an entry path to the Apache Software Foundation for projec ...

occurred on 23 October 2012. Jay Kreps chose to name the software after the author

Franz Kafka Franz Kafka (3 July 1883 – 3 June 1924) was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. It typ ...

because it is "a system optimized for writing", and he liked Kafka's work.

Applications

Apache Kafka is based on the commit log, and it allows users to subscribe to it and publish data to any number of systems or real-time applications. Example applications include managing passenger and driver matching at

Uber Uber Technologies, Inc. (Uber), based in San Francisco, provides mobility as a service, ride-hailing (allowing users to book a car and driver to transport them in a way similar to a taxi), food delivery ( Uber Eats and Postmates), pa ...

, providing real-time analytics and

predictive maintenance Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to estimate when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintena ...

for

British Gas British Gas (trading as Scottish Gas in Scotland) is an energy and home services provider in the United Kingdom. It is the trading name of British Gas Services Limited and British Gas New Heating Limited, both subsidiaries of Centrica. Servi ...

smart home, and performing numerous real-time services across all of LinkedIn.

Architecture

Kafka stores key-value messages that come from arbitrarily many processes called ''producers''. The data can be partitioned into different "partitions" within different "topics". Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition), and indexed and stored together with a timestamp. Other processes called "consumers" can read messages from partitions. For stream processing, Kafka offers the Streams API that allows writing Java applications that consume data from Kafka and write results back to Kafka. Apache Kafka also works with external stream processing systems such as

Apache Apex Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable. Apache Apex was named a t ...

Apache Beam Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. Beam Pipelines are defined using one of the provided SDKs and executed in one of ...

Apache Flink Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink execut ...

Apache Spark Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of Califor ...

, Apache Storm, and Apache NiFi. Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion and has allowed it to replace some of the conventional messaging systems like

Java Message Service The Jakarta Messaging API (formerly Java Message Service or JMS API) is a Java application programming interface (API) for message-oriented middleware. It provides generic messaging models, able to handle the producer–consumer problem, that can ...

(JMS),

Advanced Message Queuing Protocol The Advanced Message Queuing Protocol (AMQP) is an open standard application layer protocol for message-oriented middleware. The defining features of AMQP are message orientation, queuing, routing (including point-to-point and publish-and-su ...

(AMQP), etc. Since the 0.11.0.0 release, Kafka offers ''transactional writes'', which provide exactly-once stream processing using the Streams API. Kafka supports two types of topics: Regular and compacted. Regular topics can be configured with a retention time or a space bound. If there are records that are older than the specified retention time or if the space bound is exceeded for a partition, Kafka is allowed to delete old data to free storage space. By default, topics are configured with a retention time of 7 days, but it's also possible to store data indefinitely. For compacted topics, records don't expire based on time or space bounds. Instead, Kafka treats later messages as updates to older message with the same key and guarantees never to delete the latest message per key. Users can delete messages entirely by writing a so-called tombstone message with null-value for a specific key. There are five major APIs in Kafka: * Producer API – Permits an application to publish streams of records. * Consumer API – Permits an application to subscribe to topics and processes streams of records. * Connector API – Executes the reusable producer and consumer APIs that can link the topics to the existing applications. * Streams API – This API converts the input streams to output and produces the result. * Admin API – Used to manage Kafka topics, brokers, and other Kafka objects. The consumer and producer APIs are decoupled from the core functionality of Kafka through an underlying messaging

protocol Protocol may refer to: Sociology and politics * Protocol (politics), a formal agreement between nation states * Protocol (diplomacy), the etiquette of diplomacy and affairs of state * Etiquette, a code of personal behavior Science and technology ...

. This allows writing compatible API layers in any programming language that are as efficient as the Java APIs bundled with Kafka. The Apache Kafka project maintains a list of such third party APIs.

Kafka APIs

Connect API

Kafka Connect (or Connect API) is a framework to import/export data from/to other systems. It was added in the Kafka 0.9.0.0 release and uses the Producer and Consumer API internally. The Connect framework itself executes so-called "connectors" that implement the actual logic to read/write data from other systems. The Connect API defines the programming interface that must be implemented to build a custom connector. Many open source and commercial connectors for popular data systems are available already. However, Apache Kafka itself does not include production ready connectors.

Streams API

Kafka Streams (or Streams API) is a stream-processing library written in Java. It was added in the Kafka 0.10.0.0 release. The library allows for the development of stateful stream-processing applications that are scalable, elastic, and fully fault-tolerant. The main API is a stream-processing

domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...

(DSL) that offers high-level operators like filter,

map A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Although ...

, grouping, windowing, aggregation, joins, and the notion of tables. Additionally, the Processor API can be used to implement custom operators for a more low-level development approach. The DSL and Processor API can be mixed, too. For stateful stream processing, Kafka Streams uses

RocksDB RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit many CPU cores, and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound wo ...

to maintain local operator state. Because RocksDB can write to disk, the maintained state can be larger than available main memory. For fault-tolerance, all updates to local state stores are also written into a topic in the Kafka cluster. This allows recreating state by reading those topics and feed all data into RocksDB. The latest version of Streams API is 2.8.0. The link also contains information about how to upgrade to the latest version.

Version compatibility

Up to version 0.9.x, Kafka brokers are backward compatible with older clients only. Since Kafka 0.10.0.0, brokers are also forward compatible with newer clients. If a newer client connects to an older broker, it can only use the features the broker supports. For the Streams API, full compatibility starts with version 0.10.1.0: a 0.10.1.0 Kafka Streams application is not compatible with 0.10.0 or older brokers.

Performance

Monitoring end-to-end performance requires tracking metrics from brokers, consumer, and producers, in addition to monitoring

ZooKeeper A zookeeper, sometimes referred as animal keeper, is a person who manages zoo animals that are kept in captivity for conservation or to be displayed to the public.Hurwitz, Jane. Choosing a Career in Animal Care (World of Work). New York: Rosen Gr ...

, which Kafka uses for coordination among consumers. There are currently several monitoring platforms to track Kafka performance. In addition to these platforms, collecting Kafka data can also be performed using tools commonly bundled with Java, including

JConsole JConsole is a graphical monitoring tool to monitor Java Virtual Machine (JVM) and Java applications both on a local or remote machine. JConsole uses underlying features of Java Virtual Machine to provide information on performance and resource cons ...

References

External links

* {{Authority control LinkedIn software

Kafka Franz Kafka (3 July 1883 – 3 June 1924) was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. It typi ...

Enterprise application integration Free software Free software programmed in Scala Java platform Message-oriented middleware Service-oriented architecture-related products 2011 software