DALL-E, DALL-E 2, and DALL-E 3 (stylised DALL·E) are

text-to-image model A text-to-image model is a machine learning model which takes an input natural language prompt and produces an image matching that description. Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom ...

s developed by

OpenAI OpenAI, Inc. is an American artificial intelligence (AI) organization founded in December 2015 and headquartered in San Francisco, California. It aims to develop "safe and beneficial" artificial general intelligence (AGI), which it defines ...

using

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

methodologies to generate

digital image A digital image is an image composed of picture elements, also known as pixels, each with '' finite'', '' discrete quantities'' of numeric representation for its intensity or gray level that is an output from its two-dimensional functions f ...

s from

natural language A natural language or ordinary language is a language that occurs naturally in a human community by a process of use, repetition, and change. It can take different forms, typically either a spoken language or a sign language. Natural languages ...

descriptions known as ''prompts''. The first version of DALL-E was announced in January 2021. In the following year, its successor DALL-E 2 was released. DALL-E 3 was released natively into

ChatGPT ChatGPT is a generative artificial intelligence chatbot developed by OpenAI and released on November 30, 2022. It uses large language models (LLMs) such as GPT-4o as well as other Multimodal learning, multimodal models to create human-like re ...

for ChatGPT Plus and ChatGPT Enterprise customers in October 2023, with availability via OpenAI's API and "Labs" platform provided in early November. Microsoft implemented the model in Bing's Image Creator tool and plans to implement it into their Designer app. With Bing's Image Creator tool,

Microsoft Copilot Microsoft Copilot (or simply Copilot) is a generative artificial intelligence chatbot developed by Microsoft. Based on the GPT-4 series of large language models, it was launched in 2023 as Microsoft's primary replacement for the discontinued C ...

runs on DALL-E 3. In March 2025, DALL-E-3 was replaced in ChatGPT by GPT Image 1's native image-generation capabilities.

History and background

DALL-E was revealed by OpenAI in a blog post on 5 January 2021, and uses a version of

GPT-3 Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer model of deep neural network, which supersedes recurrence and convolution-based ...

modified to generate images. On 6 April 2022, OpenAI announced DALL-E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles". On 20 July 2022, DALL-E 2 entered into a beta phase with invitations sent to 1 million waitlisted individuals; users could generate a certain number of images for free every month and may purchase more. Access had previously been restricted to pre-selected users for a research preview due to concerns about

ethics Ethics is the philosophy, philosophical study of Morality, moral phenomena. Also called moral philosophy, it investigates Normativity, normative questions about what people ought to do or which behavior is morally right. Its main branches inclu ...

and safety. On 28 September 2022, DALL-E 2 was opened to everyone and the waitlist requirement was removed. In September 2023, OpenAI announced their latest image model, DALL-E 3, capable of understanding "significantly more nuance and detail" than previous iterations. In early November 2022, OpenAI released DALL-E 2 as an

API An application programming interface (API) is a connection between computers or between computer programs. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how to build ...

, allowing developers to integrate the model into their own applications.

Microsoft Microsoft Corporation is an American multinational corporation and technology company, technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the company became influential in the History of personal computers#The ear ...

unveiled their implementation of DALL-E 2 in their Designer app and Image Creator tool included in

Bing Bing most often refers to: * Bing Crosby (1903–1977), American singer * Microsoft Bing, a web search engine Bing may also refer to: Food and drink * Bing (bread), a Chinese flatbread * Bing (soft drink), a UK brand * Bing cherry, a varie ...

and

Microsoft Edge Microsoft Edge is a Proprietary Software, proprietary cross-platform software, cross-platform web browser created by Microsoft and based on the Chromium (web browser), Chromium open-source project, superseding Edge Legacy. In Windows 11, Edge ...

. The API operates on a cost-per-image basis, with prices varying depending on image resolution. Volume discounts are available to companies working with OpenAI's enterprise team. The software's name is a

portmanteau In linguistics, a blend—also known as a blend word, lexical blend, or portmanteau—is a word formed by combining the meanings, and parts of the sounds, of two or more words together.

of the names of animated robot

Pixar Pixar (), doing business as Pixar Animation Studios, is an American animation studio based in Emeryville, California, known for its critically and commercially successful computer-animated feature films. Pixar is a subsidiary of Walt Disney ...

character

WALL-E ''WALL-E'' (stylized with an interpunct as ''WALL·E'') is a 2008 American animated Romance film, romantic science fiction film produced by Pixar Animation Studios for Walt Disney Pictures. The film was directed by Andrew Stanton, produced b ...

and the Spanish surrealist artist

Salvador Dalí Salvador Domingo Felipe Jacinto Dalí i Domènech, Marquess of Dalí of Púbol (11 May 190423 January 1989), known as Salvador Dalí ( ; ; ), was a Spanish Surrealism, surrealist artist renowned for his technical skill, precise draftsmanship, ...

. In February 2024, OpenAI began adding watermarks to DALL-E generated images, containing metadata in the C2PA (Coalition for Content Provenance and Authenticity) standard promoted by the Content Authenticity Initiative.

Technology

The first

generative pre-trained transformer A generative pre-trained transformer (GPT) is a type of large language model (LLM) and a prominent framework for generative artificial intelligence. It is an Neural network (machine learning), artificial neural network that is used in natural ...

(GPT) model was initially developed by OpenAI in 2018, using a

Transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...

architecture. The first iteration, GPT-1, was scaled up to produce

GPT-2 Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of Generative pre-trained transformer, GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was par ...

in 2019; in 2020, it was scaled up again to produce

, with 175 billion parameters.

DALL-E

DALL-E has three components: a discrete VAE, an autoregressive decoder-only Transformer (12 billion parameters) similar to GPT-3, and a CLIP pair of image encoder and text encoder. The discrete VAE can convert an image to a sequence of tokens, and conversely, convert a sequence of tokens back to an image. This is necessary as the Transformer does not directly process image data. The input to the Transformer model is a sequence of tokenised image caption followed by tokenised image patches. The image caption is in English, tokenised by

byte pair encoding Byte-pair encoding (also known as BPE, or digram coding) is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table. A slightly modified version of the algori ...

(vocabulary size 16384), and can be up to 256 tokens long. Each image is a 256×256 RGB image, divided into 32×32 patches of 4×4 each. Each patch is then converted by a discrete

variational autoencoder In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian metho ...

to a token (vocabulary size 8192). DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training). CLIP is a separate model based on

contrastive learning Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self ...

that was trained on 400 million pairs of images with text captions scraped from the Internet. Its role is to "understand and rank" DALL-E's output by predicting which caption from a list of 32,768 captions randomly selected from the dataset (of which one was the correct answer) is most appropriate for an image. A trained CLIP pair is used to filter a larger initial list of images generated by DALL-E to select the image that is closest to the text prompt.

DALL-E 2

DALL-E 2 uses 3.5 billion parameters, a smaller number than its predecessor. Instead of an autoregressive Transformer, DALL-E 2 uses a

diffusion model In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable model, latent variable generative model, generative models. A diffusion model consists of two ...

conditioned on CLIP image embeddings, which, during inference, are generated from CLIP text embeddings by a prior model. This is the same architecture as that of

Stable Diffusion Stable Diffusion is a deep learning, text-to-image model released in 2022 based on Diffusion model, diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of ...

, released a few months later.

DALL-E 3

While a technical report was written for DALL-E 3, it does not include training or implementation details of the model, instead focusing on the improved prompt following capabilities developed for DALL-E 3.

Capabilities

DALL-E can generate imagery in multiple styles, including

photorealistic Photorealism is a genre of art that encompasses painting, drawing and other graphic media, in which an artist studies a photograph and then attempts to reproduce the image as realistically as possible in another medium. Although the term can b ...

imagery,

paintings Painting is a visual art, which is characterized by the practice of applying paint, pigment, color or other medium to a solid surface (called "matrix" or " support"). The medium is commonly applied to the base with a brush. Other implements, ...

, and

emoji An emoji ( ; plural emoji or emojis; , ) is a pictogram, logogram, ideogram, or smiley embedded in text and used in electronic messages and web pages. The primary function of modern emoji is to fill in emotional cues otherwise missing from type ...

. It can "manipulate and rearrange" objects in its images, and can correctly place design elements in novel compositions without explicit instruction. Thom Dunn writing for ''

BoingBoing ''Boing Boing'' is a website, first established as a zine in 1988, later becoming a group blog. Common topics and themes include technology, futurism, science fiction, gadgets, intellectual property, Disney, and left-wing politics. It twice wo ...

'' remarked that "For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL-E often draws the handkerchief, hands, and feet in plausible locations." DALL-E showed the ability to "fill in the blanks" to infer appropriate details without specific prompts, such as adding Christmas imagery to prompts commonly associated with the celebration, and appropriately placed shadows to images that did not mention them. Furthermore, DALL-E exhibits a broad understanding of visual and design trends. DALL-E can produce images for a wide variety of arbitrary descriptions from various viewpoints with only rare failures. Mark Riedl, an associate professor at the

Georgia Tech The Georgia Institute of Technology (commonly referred to as Georgia Tech, GT, and simply Tech or the Institute) is a public research university and institute of technology in Atlanta, Georgia, United States. Established in 1885, it has the lar ...

School of Interactive Computing, found that DALL-E could blend concepts (described as a key element of human

creativity Creativity is the ability to form novel and valuable Idea, ideas or works using one's imagination. Products of creativity may be intangible (e.g. an idea, scientific theory, Literature, literary work, musical composition, or joke), or a physica ...

). Its visual reasoning ability is sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence). DALL-E 3 AI image of an avocado speaking

DALL-E 3 AI image of an avocado speaking

DALL-E 3 follows complex prompts with more accuracy and detail than its predecessors, and is able to generate more coherent and accurate text. DALL-E 3 is integrated into ChatGPT Plus.

Image modification

Given an existing image, DALL-E 2 and DALL-E 3 can produce "variations" of the image as individual outputs based on the original, as well as edit the image to modify or expand upon it. The "inpainting" and "outpainting" abilities of these models use context from an image to fill in missing areas using a

medium Medium may refer to: Aircraft *Medium bomber, a class of warplane * Tecma Medium, a French hang glider design Arts, entertainment, and media Films * ''The Medium'' (1921 film), a German silent film * ''The Medium'' (1951 film), a film vers ...

consistent with the original, following a given prompt. For example, this can be used to insert a new subject into an image, or expand an image beyond its original borders. According to OpenAI, "Outpainting takes into account the image’s existing visual elements — including shadows, reflections, and textures — to maintain the context of the original image."

Technical limitations

DALL-E 2's language understanding has limits. It is sometimes unable to distinguish "A yellow book and a red vase" from "A red book and a yellow vase" or "A panda making latte art" from "Latte art of a panda". It generates images of an astronaut riding a horse when presented with the prompt "a horse riding an astronaut". It also fails to generate the correct images in a variety of circumstances. Requesting more than three objects, negation, numbers, and connected sentences may result in mistakes, and object features may appear on the wrong object. Additional limitations include handling text — which, even with legible lettering, almost invariably results in dream-like gibberish — and its limited capacity to address scientific information, such as astronomy or medical imagery. AI_tanuki_Japanese_text

Ethical concerns

DALL-E 2's reliance on public datasets influences its results and leads to

algorithmic bias Algorithmic bias describes systematic and repeatable harmful tendency in a computerized sociotechnical system to create " unfair" outcomes, such as "privileging" one category over another in ways different from the intended function of the a ...

in some cases, such as generating higher numbers of men than women for requests that do not mention gender. DALL-E 2's training data was filtered to remove violent and sexual imagery, but this was found to increase bias in some cases such as reducing the frequency of women being generated. OpenAI hypothesise that this may be because women were more likely to be sexualised in training data which caused the filter to influence results. In September 2022, OpenAI confirmed to ''

The Verge ''The Verge'' is an American Technology journalism, technology news website headquarters, headquartered in Lower Manhattan, New York City and operated by Vox Media. The website publishes news, feature stories, guidebooks, product reviews, cons ...

'' that DALL-E invisibly inserts phrases into user prompts to address bias in results; for instance, "black man" and "Asian woman" are inserted into prompts that do not specify gender or race. OpenAI claims to address concerns for potential "racy content"containing nudity or sexual content generation, with DALL-E 3 through input/output filters, blocklists, ChatGPT refusals, and model level interventions. However, DALL-E 3 continues to disproportionally represent people as White, female, and youthful. Users are able to somewhat remedy this through more specific prompts for image generation. A concern about DALL-E 2 and similar image generation models is that they could be used to propagate

deepfake ''Deepfakes'' (a portmanteau of and ) are images, videos, or audio that have been edited or generated using artificial intelligence, AI-based tools or AV editing software. They may depict real or fictional people and are considered a form of ...

s and other forms of misinformation. As an attempt to mitigate this, the software rejects prompts involving public figures and uploads containing human faces. Prompts containing potentially objectionable content are blocked, and uploaded images are analysed to detect offensive material. A disadvantage of prompt-based filtering is that it is easy to bypass using alternative phrases that result in a similar output. For example, the word "blood" is filtered, but "ketchup" and "red liquid" are not. Another concern about DALL-E 2 and similar models is that they could cause

technological unemployment The term technological unemployment is used to describe the loss of jobs caused by technological change. It is a key type of structural unemployment. Technological change typically includes the introduction of labour-saving "mechanical-muscle" ...

for artists, photographers, and graphic designers due to their accuracy and popularity. DALL-E 3 is designed to block users from generating art in the style of currently-living artists. While OpenAI states that images produced using these models do not require permission to reprint, sell, or merchandise, legal concerns have been raised regarding who owns those images. In 2023 Microsoft pitched the

United States Department of Defense The United States Department of Defense (DoD, USDOD, or DOD) is an United States federal executive departments, executive department of the federal government of the United States, U.S. federal government charged with coordinating and superv ...

to use DALL-E models to train

battlefield management system A battlefield management system (BMS) is a system meant to integrate information acquisition and processing to enhance command and control of a military unit through multiple other C4ISR (Command, Control, Communications, Computers, Intelligenc ...

s. In January 2024 OpenAI removed its blanket ban on military and warfare use from its usage policies.

Reception

Most coverage of DALL-E focuses on a small subset of "surreal" or "quirky" outputs. DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from ''Input'',

NBC The National Broadcasting Company (NBC) is an American commercial broadcast television and radio network serving as the flagship property of the NBC Entertainment division of NBCUniversal, a subsidiary of Comcast. It is one of NBCUniversal's ...

, ''

Nature Nature is an inherent character or constitution, particularly of the Ecosphere (planetary), ecosphere or the universe as a whole. In this general sense nature refers to the Scientific law, laws, elements and phenomenon, phenomena of the physic ...

'', and other publications. Its output for "an armchair in the shape of an avocado" was also widely covered. ''

ExtremeTech ExtremeTech is a technology weblog, launched in June 2001, which focuses on hardware, computer software, science, and other technologies. Between 2003 and 2005, ExtremeTech was also a print magazine and the publisher of a popular series of ho ...

'' stated "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed". ''

Engadget Engadget ( ) is a technology news, reviews and analysis website offering daily coverage of gadgets, consumer electronics, video games, gaming hardware, apps, social media, streaming, AI, space, robotics, electric vehicles and other potentially ...

'' also noted its unusual capacity for "understanding how telephones and other objects change over time". According to ''

MIT Technology Review ''MIT Technology Review'' is a bimonthly magazine wholly owned by the Massachusetts Institute of Technology. It was founded in 1899 as ''The Technology Review'', and was re-launched without "''The''" in its name on April 23, 1998, under then pu ...

'', one of OpenAI's objectives was to "give language models a better grasp of the everyday concepts that humans use to make sense of things". Wall Street investors have had a positive reception of DALL-E 2, with some firms thinking it could represent a turning point for a future multi-trillion dollar industry. By mid-2019, OpenAI had already received over $1 billion in funding from

and Khosla Ventures, and in January 2023, following the launch of DALL-E 2 and ChatGPT, received an additional $10 billion in funding from Microsoft. Japan's

anime is a Traditional animation, hand-drawn and computer animation, computer-generated animation originating from Japan. Outside Japan and in English, ''anime'' refers specifically to animation produced in Japan. However, , in Japan and in Ja ...

community has had a negative reaction to DALL-E 2 and similar models. Two arguments are typically presented by artists against the software. The first is that AI art is not art because it is not created by a human with intent. "The juxtaposition of AI-generated images with their own work is degrading and undermines the time and skill that goes into their art. AI-driven image generation tools have been heavily criticized by artists because they are trained on human-made art scraped from the web." The second is the trouble with

copyright law A copyright is a type of intellectual property that gives its owner the exclusive legal right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, e ...

and data text-to-image models are trained on. OpenAI has not released information about what dataset(s) were used to train DALL-E 2, inciting concern from some that the work of artists has been used for training without permission. Copyright laws surrounding these topics are inconclusive at the moment. After integrating DALL-E 3 into Bing Chat and ChatGPT, Microsoft and OpenAI faced criticism for excessive content filtering, with critics saying DALL-E had been "lobotomized." The flagging of images generated by prompts such as "man breaks server rack with sledgehammer" was cited as evidence. Over the first days of its launch, filtering was reportedly increased to the point where images generated by some of Bing's own suggested prompts were being blocked. ''

TechRadar ''TechRadar'' is an online technology publication owned by Future plc. It has editorial teams in the United States, United Kingdom, and Australia that provide news and reviews of tech products and gadgets. It was launched in 2008 and expanded t ...

'' argued that leaning too heavily on the side of caution could limit DALL-E's value as a creative tool.

Open-source implementations

Since OpenAI has not released

source code In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer. Since a computer, at base, only ...

for any of the three models, there have been several attempts to create

open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...

models offering similar capabilities. Released in 2022 on

Hugging Face Hugging Face, Inc. is a French-American company based in List of tech companies in the New York metropolitan area, New York City that develops computation tools for building applications using machine learning. It is most notable for its Transf ...

's Spaces platform, Craiyon (formerly DALL-E Mini until a name change was requested by OpenAI in June 2022) is an AI model based on the original DALL-E that was trained on unfiltered data from the Internet. It attracted substantial media attention in mid-2022, after its release due to its capacity for producing humorous imagery. Another example of an open source text-to-image model is

by Stability AI.

References

External links

* . The original report on DALL-E.
DALL-E 3 System Card

DALL-E 3
paper by OpenAI
DALL-E 2 website

Craiyon website
{{Authority control Artificial intelligence art Text-to-image generation Deep learning software applications Unsupervised learning Generative pre-trained transformers 2021 software ChatGPT