DALL-E (stylized as DALL·E) and DALL-E 2 are deep learning models developed by OpenAI to generate digital images from natural language descriptions, called "prompts". DALL-E was revealed by OpenAI in a blog post in January 2021, and uses a version of GPT-3 modified to generate images. In April 2022, OpenAI announced DALL-E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles". OpenAI has not released source code for either model.

On 20 July 2022, DALL-E 2 entered into a beta phase with invitations sent to 1 million waitlisted individuals; users can generate a certain number of images for free every month and may purchase more. Access had previously been restricted to pre-selected users for a research preview due to concerns about ethics and safety. On 28 September 2022, DALL-E 2 was opened to anyone and the waitlist requirement was removed.

In early November 2022, OpenAI released DALL-E 2 as an API, allowing developers to integrate the model into their own applications. Microsoft unveiled its implementation of DALL-E 2 in its Designer app and Image Creator tool included in Bing and Microsoft Edge. CALA and Mixtiles are among other early adopters of the DALL-E 2 API. The API operates on a cost-per-image basis, with prices varying depending on image resolution. Volume discounts are available to companies working with OpenAI’s enterprise team.

The software's name is a portmanteau of the names of the Pixar character WALL-E and the Spanish surrealist artist Salvador Dalí.

Technology

The Generative Pre-trained Transformer (GPT) model was initially developed by OpenAI in 2018, using a Transformer architecture. The first iteration, GPT, was scaled up to produce GPT-2 in 2019; in 2020 it was scaled up again to produce GPT-3, with 175 billion parameters. DALL-E's model is a multimodal implementation of GPT-3 with 12 billion parameters which "swaps text for pixels", trained on text-image pairs from the Internet. DALL-E 2 uses 3.5 billion parameters, a smaller number than its predecessor.

DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training). CLIP is a separate model based on zero-shot learning that was trained on 400 million pairs of images with text captions scraped from the Internet. Its role is to "understand and rank" DALL-E's output by predicting which caption from a list of 32,768 captions randomly selected from the dataset (of which one was the correct answer) is most appropriate for an image. This model is used to filter a larger initial list of images generated by DALL-E to select the most appropriate outputs.

DALL-E 2 uses a diffusion model conditioned on CLIP image embeddings, which, during inference, are generated from CLIP text embeddings by a prior model.
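The CLIP-based filtering step described above can be illustrated with a minimal sketch. It assumes that the prompt and each candidate image have already been mapped to embedding vectors in a shared space, and that candidates are ranked by cosine similarity to the prompt. The function names and the toy two-dimensional vectors are hypothetical and are not taken from OpenAI's implementation, whose source code has not been released.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(text_embedding, image_embeddings, top_k):
    """Return indices of the top_k candidates most similar to the prompt.

    Hypothetical helper: a real system would obtain the embeddings
    from trained text and image encoders.
    """
    scored = [
        (cosine_similarity(text_embedding, emb), idx)
        for idx, emb in enumerate(image_embeddings)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Toy embeddings standing in for encoder outputs (assumed values).
prompt_vec = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
best = rerank(prompt_vec, candidates, top_k=2)
print(best)
```

In this toy case the second and third candidates point most nearly in the same direction as the prompt vector, so they are the ones kept.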

Capabilities

DALL-E can generate imagery in multiple styles, including photorealistic imagery, paintings, and emoji. It can "manipulate and rearrange" objects in its images, and can correctly place design elements in novel compositions without explicit instruction. Thom Dunn writing for ''BoingBoing'' remarked that "For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL-E often draws the handkerchief, hands, and feet in plausible locations." DALL-E showed the ability to "fill in the blanks" to infer appropriate details without specific prompts, such as adding Christmas imagery to prompts commonly associated with the celebration, and appropriately-placed shadows to images that did not mention them. Furthermore, DALL-E exhibits broad understanding of visual and design trends.

DALL-E is able to produce images for a wide variety of arbitrary descriptions from various viewpoints with only rare failures. Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing, found that DALL-E could blend concepts (described as a key element of human creativity). Its visual reasoning ability is sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence).

Given an existing image, DALL-E 2 can produce "variations" of the image as unique outputs based on the original, as well as edit the image to modify or expand upon it. DALL-E 2's "inpainting" and "outpainting" use context from an image to fill in missing areas using a medium consistent with the original, following a given prompt. For example, this can be used to insert a new subject into an image, or expand an image beyond its original borders. According to OpenAI, "Outpainting takes into account the image’s existing visual elements — including shadows, reflections, and textures — to maintain the context of the original image."

Ethical concerns

DALL-E 2's reliance on public datasets influences its results and leads to algorithmic bias in some cases, such as generating higher numbers of men than women for requests that do not mention gender. DALL-E 2's training data was filtered to remove violent and sexual imagery, but this was found to increase bias in some cases, such as reducing the frequency of women being generated. OpenAI hypothesizes that this may be because women were more likely to be sexualized in training data, which caused the filter to influence results. In September 2022, OpenAI confirmed to ''The Verge'' that DALL-E invisibly inserts phrases into user prompts in order to address bias in results; for instance, "black man" and "Asian woman" are inserted into prompts that do not specify gender or race.

A concern about DALL-E 2 and similar image generation models is that they could be used to propagate deepfakes and other forms of misinformation. As an attempt to mitigate this, the software rejects prompts involving public figures and uploads containing human faces. Prompts containing potentially objectionable content are blocked, and uploaded images are analyzed to detect offensive material. A disadvantage of prompt-based filtering is that it is easy to bypass using alternative phrases that result in a similar output. For example, the word "blood" is filtered, but "ketchup" and "red liquid" are not.

Another concern about DALL-E 2 and similar models is that they could cause technological unemployment for artists, photographers, and graphic designers due to their accuracy and popularity.
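The weakness of prompt-based filtering noted above can be shown with a short sketch. It assumes a naive keyword blocklist; the actual filter used by OpenAI is not public, so the `BLOCKED_TERMS` set and the `is_blocked` function are hypothetical and serve only to show why synonyms slip through.

```python
# Hypothetical blocklist; only "blood" is taken from the example in the text.
BLOCKED_TERMS = {"blood"}

def is_blocked(prompt, blocked_terms=BLOCKED_TERMS):
    """Return True if any word of the prompt is on the blocklist.

    A naive keyword match (an assumption, not OpenAI's method):
    it has no notion of meaning, only of exact words.
    """
    words = prompt.lower().split()
    return any(word.strip('.,!?"\'') in blocked_terms for word in words)

prompts = [
    "a puddle of blood",
    "a puddle of ketchup",
    "a puddle of red liquid",
]
results = {p: is_blocked(p) for p in prompts}
for prompt, blocked in results.items():
    print(f"{prompt!r}: {'blocked' if blocked else 'allowed'}")
```

Only the first prompt is rejected, although all three could plausibly yield a similar image, which is the bypass the text describes.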

Technical limitations

DALL-E 2's language understanding has limits. It is sometimes unable to distinguish "A yellow book and a red vase" from "A red book and a yellow vase" or "A panda making latte art" from "Latte art of a panda". It generates images of "an astronaut riding a horse" when presented with the prompt "a horse riding an astronaut". It also fails to generate the correct images in a variety of circumstances. Prompts that request more than three objects, or that involve negation, numbers, or connected sentences, may result in mistakes, and object features may appear on the wrong object. Additional limitations include handling text (which, even with legible lettering, almost invariably results in dream-like gibberish) and a limited capacity to address scientific information, such as astronomy or medical imagery.

Reception

Most coverage of DALL-E focuses on a small subset of "surreal" or "quirky" outputs. DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from ''Input'', NBC, ''Nature'', and other publications. Its output for "an armchair in the shape of an avocado" was also widely covered.

''ExtremeTech'' stated "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed". ''Engadget'' also noted its unusual capacity for "understanding how telephones and other objects change over time". According to ''MIT Technology Review'', one of OpenAI's objectives was to "give language models a better grasp of the everyday concepts that humans use to make sense of things".

Wall Street investors have received DALL-E 2 positively, with some firms thinking it could represent a turning point for a future multi-trillion dollar industry. OpenAI has already received over 1 billion dollars in funding from investors including Khosla Ventures.

The art community has had a negative reaction to DALL-E. Two arguments are typically presented. The first is that AI art is not art because it is not created by a human with intent: "The juxtaposition of AI-generated images with their own work is degrading and undermines the time and skill that goes into their art. AI-driven image generation tools have been heavily criticized by artists because they are trained on human-made art scraped from the web." The second concerns copyright law and which works are used to train the AI. OpenAI has not released information about what dataset(s) were used to create the models, and there is a general concern that artists' work has been used for training without permission. Copyright law on this point is currently unsettled.

Open-source implementations

There have been several attempts to create open-source implementations of DALL-E. Released in 2022 on Hugging Face's Spaces platform, Craiyon (formerly DALL-E Mini until a name change was requested by OpenAI in June 2022) is an AI model based on the original DALL-E that was trained on unfiltered data from the Internet. It attracted substantial media attention in mid-2022 after its release due to its capacity for producing humorous imagery.
