Apache Ignite is a distributed database management system for high-performance computing. Apache Ignite's database utilizes

RAM Ram, ram, or RAM may refer to: Animals * A male sheep * Ram cichlid, a freshwater tropical fish People * Ram (given name) * Ram (surname) * Ram (director) (Ramsubramaniam), an Indian Tamil film director * RAM (musician) (born 1974), Dutch ...

as the default storage and processing tier, thus, belonging to the class of in-memory computing platforms. The disk tier is optional but, once enabled, will hold the full data set whereas the memory tier will cache full or partial data set depending on its capacity. Data in Ignite is stored in the form of key-value pairs. The database component distributes key-value pairs across the cluster in such a way that every node owns a portion of the overall data set. Data is rebalanced automatically whenever a node is added to or removed from the cluster. Apache Ignite cluster can be deployed on-premise on a commodity hardware, in the cloud (e.g. Microsoft Azure,

AWS Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. These cloud computing web services provide di ...

Google Compute Engine Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform which is built on the global infrastructure that runs Google's search engine, Gmail, YouTube and other services. Google Compute Engine enab ...

) or in a containerized and provisioning environments such as

Kubernetes Kubernetes (, commonly stylized as K8s) is an open-source container orchestration system for automating software deployment, scaling, and management. Google originally designed Kubernetes, but the Cloud Native Computing Foundation now maintai ...

, Docker, Apache Mesos,

VMware VMware, Inc. is an American cloud computing and virtualization technology company with headquarters in Palo Alto, California. VMware was the first commercially successful company to virtualize the x86 architecture. VMware's desktop software ru ...

History

Apache Ignite was developed by GridGain Systems, Inc. and made open source in 2014. GridGain continues to be the main contributor to the source code, and offers both a commercial version and professional services around Apache Ignite. Once donated as open source, Ignite was accepted in the

Apache Incubator Apache Incubator is the gateway for open-source projects intended to become fully fledged Apache Software Foundation projects. The Incubator project was created in October 2002 to provide an entry path to the Apache Software Foundation for proj ...

program in October of 2014. The Ignite project graduated from the incubator program to become a top-level Apache project on September 18, 2015. In recent years, Apache Ignite has become one of the top 5 most active projects by some metrics, including user base activity and repository size. GridGain was founded in 2010 by Nikita Ivanov and Dmitriy Setrakyan in

Pleasanton, California Pleasanton is a city in Alameda County, California, United States. Located in the Amador Valley, it is a suburb in the East Bay region of the San Francisco Bay Area. The population was 79,871 at the 2020 census. In 2005 and 2007, Pleasanton ...

. A funding round of $2 to $3 million was disclosed in November, 2011. By 2013 it was located in

Foster City, California Foster City is a city located in San Mateo County, California. The 2020 census put the population at 33,805, an increase of more than 10% over the 2010 census figure of 30,567. Foster City is sometimes considered to be part of Silicon Valley ...

when it disclosed funding of $10 million.

Clustering

Apache Ignite clustering component uses a shared nothing architecture. Server nodes are storage and computational units of the cluster that hold both data and indexes and process incoming requests along with computations. Server nodes are also known as data nodes. Client nodes are connection points from applications and services to the distributed database on a cluster of server nodes. Client nodes are usually embedded in the application code written in

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...

, C# or C++ that have special libraries developed. On top of its distributed foundation, Apache Ignite supports interfaces including JCache-compliant key-value APIs, ANSI-99 SQL with joins, ACID transactions, as well as

MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filteri ...

like computations. Ignite provides

ODBC In computing, Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of database systems and operating systems. A ...

JDBC Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. ...

and

REST Rest or REST may refer to: Relief from activity * Sleep ** Bed rest * Kneeling * Lying (position) * Sitting * Squatting position Structural support * Structural support ** Rest (cue sports) ** Armrest ** Headrest ** Footrest Arts and ente ...

drivers as a way to work with the database from other programming languages or tools. The drivers utilize either client nodes or low-level socket connections internally in order to communicate to the cluster.

Partitioning and replication

Ignite database organizes data in the form of key-value pairs in distributed "caches" (the cache notion is used for historical reasons because initially, the database supported the memory tier). Generally, each cache represents one entity type such as an employee or organization. Every cache is split into a fixed set of "partitions" that are evenly distributed among cluster nodes using the rendezvous hashing algorithm. There is always one primary and zero or more backup copies of a partition. The number of copies is configured with a replication factor parameter. If the full replication mode is configured, then every cluster node will store a partition's copy. The partitions are rebalanced automatically if a node is added to or removed from the cluster in order to achieve an even data distribution and spread the workload. The key-value pairs are kept in the partitions. Apache Ignite maps a pair to a partition by taking the key's value and passing it to a special

hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...

Memory architecture

The memory architecture in Apache Ignite consists of two storage tiers and is called "durable memory". Internally, it uses

paging In computer operating systems, memory paging is a memory management scheme by which a computer stores and retrieves data from secondary storage for use in main memory. In this scheme, the operating system retrieves data from secondary storag ...

for memory space management and data reference, similar to the

virtual memory In computing, virtual memory, or virtual storage is a memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine" which "creates the illusion to users of a very ...

of systems like

Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, a ...

. However, one significant difference between the durable and virtual memory architectures is that the former always keeps the whole data set with indexes on disk (assuming that the disk tier is enabled), while the virtual memory uses the disk when it runs out of RAM, for swapping purposes only. The first tier of the memory architecture, memory tier, keeps data and indexes in

out of Java heap in so-called "off-heap regions". The regions are preallocated and managed by the database on its own which prevents Java heap utilization for storage needs, as a result, helping to avoid long garbage collection pauses. The regions are split into pages of fixed size that store data, indexes, and system metadata. Apache Ignite is fully operational from the memory tier but it is always possible to use the second tier, disk tier, for the sake of

durability Durability is the ability of a physical product to remain functional, without requiring excessive maintenance or repair, when faced with the challenges of normal operation over its design lifetime. There are several measures of durability in u ...

. The database comes with its own native persistence and, plus, can use

RDBMS A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relatio ...

NoSQL A NoSQL (originally referring to "non- SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed ...

Hadoop Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...

databases as its disk tier.

Native persistence

Apache Ignite native persistence is a distributed and strongly consistent disk store that always holds a superset of data and indexes on disk. The memory tier will only cache as much data as it can depending on its capacity. For example, if there are 1000 entries and the memory tier can fit only 300 of them, then all 1000 will be stored on disk and only 300 will be cached in RAM. Persistence uses the

write-ahead logging In computer science, write-ahead logging (WAL) is a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems. A write ahead log is an append-only auxiliary disk-resident structure used for crash ...

(WAL) technique for keeping immediate data modifications on disk. In the background, the store runs the "checkpointing process" which purpose is to copy dirty pages from the memory tier to the partition files. A dirty page is a page that is modified in memory with the modification recorded in WAL but not written to a respective partition file. The checkpointing allows removing outdated WAL segments over the time and reduces cluster restart time replaying only that part of WAL that has not been applied to the partition files.

Third-party persistence

The native persistence became available starting version 2.1. Before that Apache Ignite supported only third-party databases as its disk tier. Apache Ignite can be configured as the in-memory tier on top of

databases speeding up the latter. However, there are some limitations in comparison to the native persistence. For instance, SQL queries will be executed only on the data that is in RAM, thus, requiring to preload all the data set from disk to memory beforehand.

Swap space

When using pure memory storage, it is possible for the data size to exceed the physical RAM size, leading to Out-Of-Memory Errors (OOMEs). To avoid this, the ideal approach would be to enable Ignite native persistence or use third-party persistence. However, if you do not want to use native or third-party persistence, you can enable swapping, in which case, Ignite in-memory data will be moved to the swap space located on disk. Note that Ignite does not provide its own implementation of swap space. Instead, it takes advantage of the swapping functionality provided by the operating system (OS). When swap space is enabled, Ignites stores data in memory mapped files (MMF) whose content will be swapped to disk by the OS depending on the current RAM consumption

Consistency

Apache Ignite is a strongly consistent platform that implements

two-phase commit protocol In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed ...

. The consistency guarantees are met for both memory and disk tiers. Transactions in Apache Ignite are ACID-compliant and can span multiple cluster nodes and caches. The database supports pessimistic and optimistic concurrency modes,

deadlock In concurrent computing, deadlock is any situation in which no member of some group of entities can proceed because each waits for another member, including itself, to take action, such as sending a message or, more commonly, releasing a lo ...

-free transactions and deadlock detection techniques. In the scenarios where transactional guarantees are optional, Apache Ignite allows executing queries in the atomic mode that provides better performance.

Distributed SQL

Apache Ignite can be accessed using SQL APIs exposed via JDBC and ODBC drivers, and native libraries developed for

, C#, C++ programming languages. Both

data manipulation Statistics, when used in a misleading fashion, can trick the casual observer into believing something other than what the data shows. That is, a misuse of statistics occurs when a statistical argument asserts a falsehood. In some cases, the mis ...

and data definition languages' syntax complies with ANSI-99 specification. Being a distributed database, Apache Ignite supports both distributed collocated and non-collocated

joins Join may refer to: * Join (law), to include additional counts or additional defendants on an indictment *In mathematics: ** Join (mathematics), a least upper bound of sets orders in lattice theory ** Join (topology), an operation combining two topo ...

. When the data is collocated, joins are executed on the local data of cluster nodes avoiding data movement across the network. Non-collocated joins might move the data sets around the network in order to prepare a consistent result set.

Machine Learning

Apache Ignite provides machine learning training and inference functionality as well as

data preprocessing Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to ...

and model quality estimation. It natively supports classical training algorithms such as

Linear Regression In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is ...

Decision Trees A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains cond ...

Random Forest Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of th ...

Gradient Boosting Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tr ...

, SVM,

K-Means ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to Partition of a set, partition ''n'' observations into ''k'' clusters in which each observation belongs to the Cluster (statistics), cluster ...

and others. In addition to that, Apache Ignite has a deep integration with TensorFlow. This integrations allows to train neural networks on a data stored in Apache Ignite in a single-node or distributed manner. The key idea of Apache Ignite Machine Learning toolkit is an ability to perform distributed training and inference instantly without massive data transmissions. It's based on

approach, resilient to node failures and data rebalances, allows to avoid data transfers and so that speed up preprocessing and model training.

References

{{DEFAULTSORT:Ignite Free and open-source software Apache Software Foundation