deepset is a startup that provides software developers with the tools to build production-ready natural language processing (NLP) systems. It was founded in 2018 in Berlin by Milos Rusic, Malte Pietsch, and Timo Möller. deepset authored and maintains the open source software Haystack and its commercial SaaS offering deepset Cloud.
History
In June 2018, Milos Rusic, Malte Pietsch, and Timo Möller co-founded deepset in Berlin, Germany.
In the same year, the company served its first customers, who wanted to implement NLP services by tailoring BERT language models to their domain.
In July 2019, the company released the initial version of the open source software FARM.
In November 2019, the company released the initial version of the open source software Haystack.
Throughout 2020 and 2021, deepset published several applied research papers at EMNLP, COLING and ACL, leading conferences in the area of NLP. In 2020, the research contributions comprised German language models named GBERT and GELECTRA, and a question answering dataset addressing the COVID-19 pandemic called COVID-QA, which was created in collaboration with Intel and annotated by biomedical experts.
In 2021, the research contributions comprised German models and datasets for question answering and passage retrieval named GermanQuAD and GermanDPR, a semantic answer similarity metric, and an approach for multimodal retrieval of texts and tables to enable question answering on tabular data. Haystack contains implementations of all three contributions, enabling the use of the research through the open source framework.
In November 2021, the development of the FARM framework was discontinued and its main features were integrated into the Haystack framework.
In April 2022, the company announced its commercial SaaS offering deepset Cloud.
As of October 2022, the most popular finetuned language model created by deepset had been downloaded more than 7 million times.
Products and Applications
Haystack is an end-to-end Python framework for building semantic search solutions. With its modular building blocks, software developers can implement pipelines to address various search tasks over large document collections, such as question answering, document retrieval or summarization. It integrates with Hugging Face Transformers, Elasticsearch, OpenSearch and others. The framework has an active community on GitHub, where more than 140 people have contributed to its continuous development so far, and it also enjoys a vibrant community on Meetup.
Thousands of organizations use the framework, including Global 500 enterprises such as Airbus, as well as Infineon, Alcatel-Lucent Enterprise, BetterUp, Etalab, and Sooth.ai.
The deepset Cloud platform supports customers in building scalable NLP applications by covering the entire process of prototyping, experimentation, deployment, and monitoring. It is built on Haystack.
FARM was a framework for adapting representation models. One of its core concepts was the implementation of adaptive models, which comprised language models and an arbitrary number of prediction heads. FARM supported domain adaptation and finetuning of these models with advanced options, for example gradient accumulation, cross-validation or automatic mixed-precision training. Its main features were integrated into Haystack in November 2021 and its development was discontinued at that time.
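The adaptive model idea, one shared language model feeding an arbitrary number of prediction heads, can be illustrated with a minimal Python sketch. This is not FARM's actual API: the names `AdaptiveModel`, `toy_language_model`, `add_head` and `predict` are hypothetical, and the "language model" is a trivial hand-written feature extractor standing in for a pretrained network.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Vector = List[float]


def toy_language_model(text: str) -> Vector:
    """Hypothetical stand-in for a pretrained language model: text -> feature vector."""
    tokens = text.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [
        float(len(tokens)),                           # number of tokens
        sum(len(t) for t in tokens) / n,              # mean token length
        float(sum(t.endswith("?") for t in tokens)),  # tokens ending in "?"
    ]


@dataclass
class AdaptiveModel:
    """One shared language model plus any number of named prediction heads."""

    language_model: Callable[[str], Vector]
    prediction_heads: Dict[str, Callable[[Vector], Any]] = field(default_factory=dict)

    def add_head(self, name: str, head: Callable[[Vector], Any]) -> None:
        self.prediction_heads[name] = head

    def predict(self, text: str) -> Dict[str, Any]:
        # The text is encoded once; every head reads the same representation.
        features = self.language_model(text)
        return {name: head(features) for name, head in self.prediction_heads.items()}


# Illustrative usage with two toy heads (assumed tasks, not from FARM).
model = AdaptiveModel(language_model=toy_language_model)
model.add_head("is_question", lambda v: v[2] > 0)
model.add_head("length", lambda v: "short" if v[0] < 10 else "long")

print(model.predict("Is Berlin the capital of Germany?"))
```

In this sketch, adding a task means registering another head while the shared representation stays unchanged, which is the property the adaptive model design is described as providing.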
Funding
On April 28, 2022, deepset announced a Series A investment round of $14 million led by GV, with the participation of Harpoon Ventures, Acequia Capital and a group of experienced commercial open source software and machine learning founders, such as Alex Ratner (Snorkel AI), Mustafa Suleyman (DeepMind), Spencer Kimball (Cockroach Labs), Jeff Hammerbacher (Cloudera) and Emil Eifrem (Neo4j).
A previous pre-seed investment round of $1.6 million on March 8, 2021, was led by System.One and Lunar Ventures, who also participated in the subsequent Series A round.