Foundation model

In artificial intelligence (AI), a foundation model (FM), also known as a large X model (LxM), is a machine learning or deep learning model trained on vast datasets so that it can be applied across a wide range of use cases (Competition and Markets Authority (2023), ''AI Foundation Models: Initial Report'', https://assets.publishing.service.gov.uk/media/65081d3aa41cc300145612c0/Full_report_.pdf). Generative AI applications like large language models (LLMs) are common examples of foundation models.

Building foundation models is often highly resource-intensive, with the most advanced models costing hundreds of millions of dollars to cover the expenses of acquiring, curating, and processing massive datasets, as well as the compute power required for training. These costs stem from the need for sophisticated infrastructure, extended training times, and advanced hardware such as GPUs. In contrast, adapting an existing foundation model for a specific task or using it directly is far less costly, as it leverages pre-trained capabilities and typically requires only fine-tuning on smaller, task-specific datasets.

Early examples of foundation models are language models (LMs) like OpenAI's GPT series and Google's BERT. Beyond text, foundation models have been developed across a range of modalities, including DALL-E and Flamingo for images, MusicGen for music, and RT-2 for robotic control. Foundation models are also being developed for fields like astronomy, radiology, genomics, music, coding, time-series forecasting, mathematics, and chemistry.


Definitions

The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) coined the term "foundation model" in August 2021 to mean "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks". This was based on their observation that preexisting terms, while overlapping, were not adequate, stating that "'(large) language model' was too narrow given that the focus is not only language; 'self-supervised model' was too specific to the training objective; and 'pretrained model' suggested that the noteworthy action all happened after 'pretraining'." The term "foundation model" was chosen over "foundational model" because "foundational" implies that these models provide fundamental principles in a way that "foundation" does not.

As governments regulate foundation models, new legal definitions have emerged.
* In the United States, the ''Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence'' defines a foundation model as "an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts".
* In the United States, the proposed AI Foundation Model Transparency Act of 2023 by House Representatives Don Beyer (D, VA) and Anna Eshoo (D, CA) defines a foundation model as "an artificial intelligence model trained on broad data, generally uses self supervision, generally contains at least 1,000,000,000 parameters, is applicable across a wide range of contexts, and exhibits, or could be easily modified to exhibit, high levels of performance at tasks that could pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters."
* In the European Union, the European Parliament's negotiated position on the E.U. AI Act defines a foundation model as an "AI model that is trained on broad data at scale, is designed for generality of output, and can be adapted to a wide range of distinctive tasks".
* In the United Kingdom, the Competition and Markets Authority's ''AI Foundation Models: Initial Report'' defines foundation models as "a type of AI technology that are trained on vast amounts of data that can be adapted to a wide range of tasks and operations."

The United States's definitions are the only ones to make reference to the size of a foundation model, and they differ on magnitude. Beyer and Eshoo's definition also specifies that foundation models must achieve a level of performance great enough to pose a potential danger. In contrast, the E.U. definition requires the model to be designed for generality of output. All definitions agree that foundation models must be trained on a broad range of data with potential applications in many domains.


History

Technologically, foundation models are built using established machine learning techniques like deep neural networks, transfer learning, and self-supervised learning. Foundation models differ from previous techniques in that they are general-purpose models that function as reusable infrastructure, instead of bespoke, one-off task-specific models. Advances in computer parallelism (e.g., CUDA GPUs), new developments in neural network architecture (e.g., Transformers), and the increased use of training data with minimal supervision all contributed to the rise of foundation models.

Foundation models began to materialize as the latest wave of deep learning models in the late 2010s. Relative to most prior work on deep learning, these language models demonstrated the potential of training on much larger web-sourced datasets using self-supervised objectives (e.g., predicting the next word in a large corpus of text). These approaches, which draw upon earlier works like word2vec and GloVe, deviated from prior supervised approaches that required annotated data (e.g., crowd-sourced labels).

The 2022 releases of Stable Diffusion and ChatGPT (initially powered by the GPT-3.5 model) led to foundation models and generative AI entering widespread public discourse. Further, the 2023 releases of LLaMA, Llama 2, and Mistral contributed to a greater emphasis on how foundation models are released, with open foundation models garnering both support and scrutiny.


Related concepts


Frontier models

Certain highly advanced foundation models are termed "frontier models", which have the potential to "possess dangerous capabilities sufficient to pose severe risks to public safety." These "dangerous capabilities" stem from the accidental or intentional misuse of such models, which in conjunction with their powerful nature can lead to severe harms. As foundation models continue to improve, some AI researchers speculate that almost all next-generation foundation models will be considered frontier models.

Since the concept of dangerous capabilities is inherently subjective, there is no strict designation for which foundation models qualify as frontier models. However, some generally held ideas for sufficiently dangerous capabilities include:
* Designing and synthesizing new biological or chemical weapons
* Producing and propagating convincing, tailored disinformation with minimal user instruction
* Harnessing unprecedented offensive cyber capabilities
* Evading human control through deceptive means

Due to frontier models' unique capabilities, it is difficult to effectively regulate their development and deployment. Because of their emergent nature, new dangerous capabilities can appear on their own in frontier models, both in the development stage and after being deployed. Additionally, since frontier models continue to adapt after deployment, it remains difficult to mitigate all harms that arise from already-deployed models. If a frontier model happens to be open-source or is released online, the model can also disseminate rapidly, further hampering regulators by creating a lack of accountability.


General-purpose AI

Due to their adaptability to a wide range of use cases, foundation models are sometimes considered examples of general-purpose AI. In designing the EU AI Act, the European Parliament has stated that a new wave of general-purpose AI technologies shapes the overall AI ecosystem. The fuller structure of the ecosystem, in addition to the properties of specific general-purpose AI systems, influences the design of AI policy and research. General-purpose AI systems also often appear in people's everyday lives through applications and tools like ChatGPT or DALL-E.

Government agencies like the EU Parliament have identified regulation of general-purpose AI, such as foundation models, as a high priority. General-purpose AI systems are often characterized by large size, opacity, and potential for emergence, all of which can create unintended harms. Such systems also heavily influence downstream applications, which further exacerbates the need for regulation. In regard to prominent legislation, a number of stakeholders have pushed for the EU AI Act to include restrictions on general-purpose AI systems, all of which would also apply to foundation models.


World models

In 2018, researchers David Ha and Jürgen Schmidhuber defined world models in the context of reinforcement learning: an agent with a variational autoencoder model V for representing visual observations, a recurrent neural network model M for representing memory, and a linear model C for making decisions. They suggested that agents trained on world models in environments that simulate reality could be applied to real-world settings.
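As a rough illustration of this V/M/C decomposition, the sketch below wires the three components together with random stand-in weights; the dimensions and weight matrices are illustrative assumptions, not values from the original paper (a real agent trains V as a VAE and M as an RNN).

```python
import numpy as np

# V encodes an observation into a latent code z, M (a recurrent model)
# updates a memory state h, and a linear controller C maps (z, h) to an
# action. All weights here are random stand-ins.
rng = np.random.default_rng(0)
obs_dim, z_dim, h_dim, act_dim = 64, 8, 16, 2
Wv = 0.1 * rng.normal(size=(z_dim, obs_dim))           # stand-in for V
Wm = 0.1 * rng.normal(size=(h_dim, h_dim + z_dim))     # stand-in for M
Wc = 0.1 * rng.normal(size=(act_dim, z_dim + h_dim))   # controller C

def step(obs, h):
    z = np.tanh(Wv @ obs)                       # V: compress observation
    h = np.tanh(Wm @ np.concatenate([h, z]))    # M: update memory
    action = Wc @ np.concatenate([z, h])        # C: linear decision
    return action, h

h = np.zeros(h_dim)
action, h = step(rng.normal(size=obs_dim), h)
print(action.shape)   # (2,)
```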
In 2022, Yann LeCun saw a world model (defined by him as a neural network that acts as a mental model for aspects of the world that are seen as relevant) as part of a larger system of cognitive architecture, alongside other neural networks that are analogous to different regions of the brain. In his view, this framework could lead to commonsense reasoning.

World models are trained on a variety of data modalities, including text, images, audio, and video, and have been applied to video generation. ''TechCrunch'' noted that world models could use more data than large language models and would require significantly more computational power (including the use of thousands of GPUs for training and inference). It also noted the risk of hallucinations, coverage bias, and algorithmic bias. ''TechCrunch'' saw Sora as an example of a world model, while in January 2025, Nvidia released its own set of world models. The ''South China Morning Post'' wrote that Manycore Tech was another company aiming to build a world model, viewing its work as an example of spatial intelligence. In May 2025, the Mohamed bin Zayed University of Artificial Intelligence released a world model for building simulations to test AI agents. Google DeepMind has also released two world models, operating in two-dimensional and three-dimensional space respectively, that were trained on video data, with Google claiming that the latter can serve as a training environment for AI agents. World models are intended for use in interactive media and environment simulation. Creative professionals have expressed concern that world models could disrupt jobs in their industries.


Technical details


Modeling

For a foundation model to effectively generalize, it must acquire rich representations of the training data. As a result, expressive model architectures that efficiently process large-scale data are often preferred in building foundation models. Currently, the Transformer architecture is the de facto choice for building foundation models across a range of modalities.
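As a minimal, hedged illustration of the architecture's core operation, the sketch below implements single-head scaled dot-product self-attention with toy dimensions and random weights; production Transformers stack many multi-head attention and feed-forward layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token affinities
    return softmax(scores) @ V               # each token mixes all values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```

Because every token attends to every other token in a single matrix operation, the computation parallelizes well on GPUs, which is one reason the architecture scales to internet-sized corpora.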


Training

Foundation models are built by optimizing one or more training objectives, mathematical functions that determine how model parameters are updated based on the model's predictions on training data. Language models are often trained with a next-token prediction objective, which measures how well the model predicts the next token in a sequence. Image models are commonly trained with contrastive learning or diffusion training objectives. For contrastive learning, images are randomly augmented before being evaluated on the resulting similarity of the model's representations. For diffusion models, images are noised and the model learns to gradually de-noise them via the objective. Multimodal training objectives also exist, with some separating images and text during training, while others examine them concurrently. In general, the training objectives for foundation models promote the learning of broadly useful representations of data.

With the rise of foundation models and the larger datasets that power them, a training objective must be able to parse internet-scale data for meaningful data points. Additionally, since foundation models are designed to solve a general range of tasks, training objectives ought to be ''domain complete'', i.e., able to support a broad set of downstream capabilities within the given domain. Lastly, foundation model training objectives should scale well and be computationally efficient. With model size and compute power both being relevant constraints, a training objective must be able to overcome such bottlenecks.
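As a concrete sketch of the next-token prediction objective, the snippet below computes the average cross-entropy between a model's output logits and the true next tokens; the shapes and values are toy placeholders.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average negative log-likelihood of the true next token.
    logits: (seq_len, vocab_size) model scores at each position;
    targets: (seq_len,) indices of the actual next tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))        # 5 positions, toy 100-token vocab
targets = rng.integers(0, 100, size=5)    # true next token at each position
print(next_token_loss(logits, targets))   # roughly ln(100) for random logits
```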


Data

Foundation models are trained on a large quantity of data, working under the maxim "the more data, the better." Performance evaluation does show that more data generally leads to better performance, but other issues arise as data quantity grows. Tasks like managing the dataset, integrating data across new applications, ensuring adherence to data licenses, and maintaining data quality all become more difficult as data size grows. The specific demands of foundation models have only exacerbated such issues, as it remains the norm for large foundation models to use public web-scraped data. Foundation model training data can also include search engine data and SEO meta tag data. Public web data remains a plentiful resource, but it also demands stringent moderation and data processing from foundation model developers before it can be successfully integrated into the training pipeline.

Training foundation models often risks violating user privacy, as private data can be disclosed, collected, or used in ways beyond the stated scope. Even if no private data is leaked, models can still inadvertently compromise security through learned behavior in the resulting foundation model. Data quality is another key point, as web-scraped data frequently contains biased, duplicate, and toxic material. Once foundation models are deployed, ensuring high-quality data remains an issue, as undesirable behavior can still emerge from small subsets of data.
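As a small, hedged example of one curation step named above (removing duplicate material from web-scraped text), the sketch below deduplicates documents by content hash; real pipelines also apply near-duplicate detection, language identification, and toxicity filtering.

```python
import hashlib

def deduplicate(documents):
    """Drop exact (case-insensitive) duplicates by hashing content."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the cat sat.", "A dog ran."]
print(deduplicate(corpus))   # the duplicate entry is removed
```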


Systems

The size of foundation models also brings about issues with the computer systems they run on. The average foundation model is too large to run within a single accelerator's memory, and the initial training process requires an expensive amount of resources. Such issues are predicted to worsen as foundation models grow to new heights. Due to this constraint, researchers have begun looking into compressing models for tighter model inference.

GPUs are the most common choice of compute hardware for machine learning, due to their large memory and computational power. Typical foundation model training requires many GPUs, all connected in parallel with fast interconnects. Acquiring a sufficient number of GPUs of requisite compute efficiency is a challenge for many foundation model developers, one that has led to an increasing dilemma in the field: larger models require greater compute power, but often at the cost of compute efficiency. Since training remains time-consuming and expensive, this tradeoff means only a few select companies can afford the production costs of large, state-of-the-art foundation models. Some techniques like compression and distillation can make inference more affordable, but they fail to completely shore up this weakness.
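To make the distillation technique mentioned above concrete, the hedged sketch below computes a standard temperature-softened distillation loss (the KL divergence between teacher and student output distributions); the sizes and logits are toy placeholders, not any particular model's outputs.

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between teacher and student distributions,
    softened by temperature T so the student sees the teacher's
    full distribution rather than only its top prediction."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```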


Scaling

The accuracy and capabilities of foundation models often scale predictably with the size of the model and the amount of training data. Specifically, scaling laws have been discovered, which are data-based empirical trends that relate resources (data, model size, compute usage) to model capabilities. Particularly, a model's scale is defined by compute, dataset size, and the number of parameters, all of which exhibit a power-law relationship with end performance. However, broken scaling laws have been discovered, in which this relationship smoothly transitions (at points referred to as breaks) from a power law with one exponent to a power law with another, different exponent. When one does not collect any points near (or after) the breaks, it can be difficult to obtain an accurate extrapolation.
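As an illustration, one widely cited functional form (from Hoffmann et al.'s 2022 "Chinchilla" analysis) expresses the pre-training loss L as a power law in the parameter count N and dataset size D, where E, A, B, alpha, and beta are empirically fitted constants:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

A broken scaling law then replaces a single fixed exponent with one that transitions smoothly between two values at a break point, which is why extrapolating from data collected entirely on one side of the break can be misleading.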


Adaptation

Foundation models are inherently multi-purpose: using these models for a specific use case requires some form of adaptation. At a minimum, models need to be adapted to perform the task of interest (task specification), but often better performance can be achieved by more extensive adaptation to the domain of interest (domain specialization). A variety of methods (e.g., prompting, in-context learning, fine-tuning, LoRA) provide different tradeoffs between the costs of adaptation and the extent to which models are specialized. Some major facets to consider when adapting a foundation model are compute budget and data availability. Foundation models can be very large, up to trillions of parameters in size, so adapting the entirety of a foundation model can be computationally expensive. Therefore, developers sometimes adapt only the last neural layer or only the bias vectors to save time and space. For particularly niche applications, specific data may also not be available to adapt the foundation model sufficiently. In such circumstances, data must be manually labeled, which is costly and can demand expert knowledge.
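The sketch below illustrates the idea behind the low-rank adaptation (LoRA) method named above: the frozen pretrained weight W is left untouched while a small rank-r update B A is trained, so only a tiny fraction of the parameters change. All sizes here are toy assumptions.

```python
import numpy as np

d_out, d_in, r = 512, 512, 8          # toy sizes; rank r is much smaller than d
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
B = np.zeros((d_out, r))              # trainable, initialized to zero
A = rng.normal(size=(r, d_in))        # trainable

def adapted_forward(x):
    # Behaves like (W + B @ A) @ x, but only A and B receive updates.
    return W @ x + B @ (A @ x)

y = adapted_forward(rng.normal(size=d_in))
print(f"full parameters: {W.size:,}; LoRA parameters: {A.size + B.size:,}")
# full parameters: 262,144; LoRA parameters: 8,192
```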


Evaluation

Evaluation is a key part of developing foundation models. Not only does evaluation allow for tracking the progress of high-performance models, it also creates benchmarks for future model development. Stakeholders rely on evaluations to understand model behaviors and gain insight into their various attributes. Traditionally, foundation models are evaluated relative to each other through standardized task benchmarks like MMLU, MMMU, HumanEval, and GSM8K. Given that foundation models are multi-purpose, meta-benchmarks that aggregate different underlying benchmarks are increasingly being developed. Examples include LM-Harness, BIG-bench, HELM, OpenLLM Leaderboard, DecodingTrust, and HEIM. Since foundation models' utility depends on both their own general capabilities and the performance of fine-tuned applications, evaluation must cover both metrics. Proper evaluation examines both a foundation model's downstream applications in aggregate and the direct properties the foundation model holds. To ensure further equity in evaluation, certain existing evaluation frameworks account for all adaptation resources, which leads to more informed analyses for the benefit of all stakeholders.
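As a schematic, hypothetical sketch of how such task benchmarks score models: each example's prompt is sent to the model, the output is compared against a reference answer, and results are aggregated into a single accuracy number. The model stub and examples below are placeholders, not any real benchmark's API.

```python
def evaluate(model_fn, benchmark):
    """Fraction of benchmark examples the model answers exactly."""
    correct = sum(model_fn(ex["prompt"]).strip() == ex["answer"]
                  for ex in benchmark)
    return correct / len(benchmark)

toy_benchmark = [
    {"prompt": "2 + 2 =", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

def stub_model(prompt):
    # Stand-in for a real foundation model's generate() call.
    return "4" if "2 + 2" in prompt else "Paris"

print(evaluate(stub_model, toy_benchmark))   # 1.0 for this stub
```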


Supply chain

Foundation models' general capabilities allow them to fulfill a unique role in the AI ecosystem, fueled by many upstream and downstream technologies. Training a foundation model requires several resources (e.g., data, compute, labor, hardware, code), with foundation models often involving immense amounts of data and compute (also referred to as computational power). Due to foundation models' large development costs and inexpensive adaptation requirements, the AI landscape has shifted to a small subset of AI companies making foundation models for downstream adaptation. Thus, most foundation model companies outsource this step to specialized data providers (e.g., Scale AI, Surge) and compute providers (e.g., Amazon Web Services, Google Cloud, Microsoft Azure). The foundation model developer itself will then take the data and use the supplied compute to actually train the foundation model. After the foundation model is completely built, much of the data and labor requirements abate.

In this development process, hardware and compute are the most necessary, and also the most exclusive, resources. To train larger and more complex AI, a sufficient amount of compute is key. However, compute is consolidated in the hands of a few select entities, on which most foundation model developers depend. As such, the foundation model pipeline is concentrated heavily around these providers. Compute is also costly; in 2023, AI companies spent more than 80% of total capital on compute resources.

Foundation models require a large amount of general data to power their capabilities. Early foundation models scraped subsets of the internet to provide this data. As the size and scope of foundation models grow, larger quantities of internet scraping become necessary, resulting in higher likelihoods of biased or toxic data. This toxic or biased data can disproportionately harm marginalized groups and exacerbate existing prejudices. To address this issue of low-quality data that arose with unsupervised training, some foundation model developers have turned to manual filtering. This practice, known as data labor, comes with its own host of issues. Such manual data detoxification is often outsourced to reduce labor costs, with some workers making less than $2 per hour.

The foundation model will then be hosted online either by the developer or by an external organization. Once released, other parties can create applications based on the foundation model, whether through fine-tuning or for wholly new purposes. People can then access these applications to serve their various needs, allowing one foundation model to power and reach a wide audience.


Release strategies

After a foundation model is built, it can be released in one of many ways. There are many facets to a release: the asset itself, who has access, how access changes over time, and the conditions on use. All these factors contribute to how a foundation model will affect downstream applications. The two most common forms of foundation model release are through APIs and direct model downloads. When a model is released via an API, users can query the model and receive responses, but cannot directly access the model itself. Comparatively, the model could be directly downloadable for users to access and modify. Both release strategies are often classified as an open release. The exact definition of an open release is disputed, but widely accepted requirements are provided by the Open Source Initiative.

Some open foundation models are PaLM 2, Llama 2, Granite, and Mistral. While open foundation models can further research and development more easily, they are also more susceptible to misuse. Open foundation models can be downloaded by anyone, and particularly powerful models can be fine-tuned to intentionally or unintentionally cause harm. During a closed release, the foundation model cannot be accessed by the public, but is used internally by an organization. Such releases are considered safer, but offer no additional value to the research community or the public at large. Some foundation models, like Google DeepMind's Flamingo, are fully closed, meaning they are available only to the model developer; others, such as OpenAI's GPT-4, are limited access, available to the public but only as a black box; and still others, such as Meta's Llama 2, are open, with broadly available model weights enabling downstream modification and scrutiny.
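As a hedged sketch of the difference between the two release modes, the snippet below contrasts querying a hosted model through a hypothetical HTTP API with loading downloaded weights locally; the endpoint, JSON fields, and file name are placeholders, not any real provider's interface.

```python
import json
import urllib.request
import numpy as np

def query_hosted_model(prompt):
    """API release: send a request, receive text back; the weights
    stay hidden behind the provider's server (hypothetical endpoint)."""
    req = urllib.request.Request(
        "https://api.example.com/v1/generate",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]

def load_downloaded_model(path="open-model-weights.npz"):
    """Weights release: the user holds the parameters locally and can
    inspect, modify, or fine-tune them (file name is a stand-in)."""
    return dict(np.load(path))
```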

