LAION

LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit which makes open-source artificial intelligence models and datasets. It is best known for releasing a number of large datasets of images and captions scraped from the web, which have been used to train a number of high-profile text-to-image models, including Stable Diffusion and Imagen. In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stable Diffusion. In April 2023, LAION was directly sued by a German photographer who wanted to have his images removed from the training set. On April 15, 2023, LAION and contributors publicly released OpenAssistant, an open-source AI assistant chatbot.


Image datasets

LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from the Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions. LAION does not host the content of scraped images themselves; rather, the dataset contains URLs pointing to images, which researchers must download themselves.

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021. It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs they used to train the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset.
Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022. As of its release, it was the largest freely available dataset of image-caption pairs in existence. Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company behind the funding of the Stable Diffusion text-to-image model, which was trained on it.
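
The collection process described in this section (find image tags, take alt text as the caption, discard poorly matching pairs, keep only URLs) can be illustrated with a minimal Python sketch. This is not LAION's actual code. The names ImgAltExtractor, extract_pairs, and filter_pairs are hypothetical, the embed_image and embed_text callables stand in for a CLIP image and text encoder, and the 0.3 similarity threshold is an illustrative assumption.

```python
from html.parser import HTMLParser
from math import sqrt
from urllib.parse import urljoin


class ImgAltExtractor(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Keep only images that have both a source and a non-empty caption.
        if src and alt:
            # Store the resolved URL, not the image bytes.
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    """Return a list of (absolute image URL, caption) pairs found in html."""
    parser = ImgAltExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, embed_image, embed_text, threshold=0.3):
    """Keep pairs whose image and caption embeddings are similar enough.

    embed_image and embed_text are assumed stand-ins for CLIP encoders;
    threshold is an illustrative value, not a documented setting.
    """
    return [
        (url, caption)
        for url, caption in pairs
        if cosine_similarity(embed_image(url), embed_text(caption)) >= threshold
    ]


if __name__ == "__main__":
    page = '<img src="/cat.jpg" alt="A cat on a sofa"><img src="logo.png">'
    # Toy two-dimensional embeddings used in place of a real model.
    toy_image = {"https://example.com/cat.jpg": [1.0, 0.0]}
    toy_text = {"A cat on a sofa": [0.9, 0.1]}
    pairs = extract_pairs(page, "https://example.com/pets/")
    print(filter_pairs(pairs, toy_image.__getitem__, toy_text.__getitem__))
```

In a real pipeline the toy dictionaries would be replaced by calls to an image encoder and a text encoder that map into a shared embedding space. The sketch stores only URLs and captions, mirroring the fact that the released datasets do not contain the images themselves.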


OpenAssistant

OpenAssistant is an open-source, chat-based artificial intelligence (AI) assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so. The project is developed by a group of volunteers in collaboration with LAION. One of its development goals is free access to large language models that can be run locally on consumer hardware. The project is backed by a worldwide crowdsourcing effort involving over 13,500 volunteers who have created 600,000 human-generated data points.

