Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from a generative artificial intelligence (AI) model.
A ''prompt'' is natural language text describing the task that an AI should perform.
A prompt for a text-to-text language model can be a query, a command, or a longer statement including context, instructions, and conversation history. Prompt engineering may involve phrasing a query, specifying a style, choice of words and grammar, providing relevant context, or describing a character for the AI to mimic.
When communicating with a text-to-image or a text-to-audio model, a typical prompt is a description of a desired output such as "a high-quality photo of an astronaut riding a horse" or "Lo-fi slow BPM electro chill with organic samples". Prompting a text-to-image model may involve adding, removing, or emphasizing words to achieve a desired subject, style, layout, lighting, and aesthetic.
History
In 2018, researchers first proposed that all previously separate tasks in natural language processing (NLP) could be cast as a question-answering problem over a context. In addition, they trained the first single, joint, multi-task model that would answer any task-related question like "What is the sentiment" or "Translate this sentence to German" or "Who is the president?"
The AI boom saw an increase in the number of prompting techniques used to get the model to output the desired outcome and avoid nonsensical output, a process characterized by trial and error. After the release of ChatGPT in 2022, prompt engineering was soon seen as an important business skill, albeit one with an uncertain economic future.
A repository for prompts reported that over 2,000 public prompts for around 170 datasets were available in February 2022. In 2022, the ''chain-of-thought'' prompting technique was proposed by Google researchers.
In 2023, several text-to-text and text-to-image prompt databases were made publicly available. The Personalized Image-Prompt (PIP) dataset, a generated image-text dataset categorized by 3,115 users, was also made publicly available in 2024.
Text-to-text
Multiple distinct prompt engineering techniques have been published.
Chain-of-thought
According to Google Research, ''chain-of-thought'' (CoT) prompting is a technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps before giving a final answer. In 2022, Google Brain reported that chain-of-thought prompting improves reasoning ability by inducing the model to answer a multi-step problem with steps of reasoning that mimic a train of thought.
Chain-of-thought techniques were developed to help LLMs handle multi-step reasoning tasks, such as arithmetic or commonsense reasoning questions.
For example, given the question, "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", Google claims that a CoT prompt might induce the LLM to answer "A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9."
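In practice, a CoT response like the one above is often post-processed to extract the final numeric answer. A minimal sketch, assuming the response follows the "The answer is ..." convention shown above (`parse_final_answer` is a hypothetical helper, not a library function):

```python
import re

def parse_final_answer(cot_response: str) -> int:
    """Return the number following "The answer is" in a CoT response."""
    match = re.search(r"The answer is (-?\d+)", cot_response)
    if match is None:
        raise ValueError("no final answer found")
    return int(match.group(1))

cot_response = (
    "A: The cafeteria had 23 apples originally. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have "
    "3 + 6 = 9. The answer is 9."
)
print(parse_final_answer(cot_response))  # → 9
```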
When applied to PaLM, a 540 billion parameter language model, according to Google, CoT prompting significantly aided the model, allowing it to perform comparably with task-specific fine-tuned models on several tasks, achieving state-of-the-art results at the time on the GSM8K mathematical reasoning benchmark.
It is possible to fine-tune models on CoT reasoning datasets to enhance this capability further and stimulate better interpretability.
An example of CoT prompting:
Q:
A: Let's think step by step.
As originally proposed by Google, each CoT prompt included a few Q&A examples. This made it a ''few-shot'' prompting technique. However, according to researchers at Google and the University of Tokyo, simply appending the words "Let's think step-by-step" was also effective, which makes CoT a ''zero-shot'' prompting technique. OpenAI claims that this prompt allows for better scaling, as a user no longer needs to formulate many specific CoT Q&A examples.
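The zero-shot variant can be sketched as a one-line prompt transformation; `build_zero_shot_cot` is a hypothetical helper name, and the trigger phrase is the one reported above:

```python
def build_zero_shot_cot(question: str) -> str:
    """Turn a bare question into a zero-shot CoT prompt."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = build_zero_shot_cot(
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)
print(prompt)
```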
In-context learning
''In-context learning'' refers to a model's ability to temporarily learn from prompts. For example, a prompt may include a few examples for a model to learn from, such as asking the model to complete "''maison'' house, ''chat'' cat, ''chien'' " (the expected response being ''dog''), an approach called ''few-shot learning''.
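The French-to-English example above can be assembled programmatically; this sketch only constructs the prompt string, with the model call left out:

```python
# Each (French, English) pair is one in-context example; the final item is
# left incomplete for the model to fill in (the expected completion is "dog").
examples = [("maison", "house"), ("chat", "cat")]
query = "chien"
prompt = ", ".join(f"{fr} {en}" for fr, en in examples) + f", {query} "
print(repr(prompt))  # → 'maison house, chat cat, chien '
```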
In-context learning is an emergent ability of large language models. It is an emergent property of model scale, meaning that breaks in downstream scaling laws occur, leading to its efficacy increasing at a different rate in larger models than in smaller models.
Unlike training and fine-tuning, which produce lasting changes, in-context learning is temporary. Training models to perform in-context learning can be viewed as a form of meta-learning, or "learning to learn".
Self-consistency decoding
''Self-consistency decoding'' performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.
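A minimal sketch of the voting step, assuming the final answers have already been parsed from each sampled rollout:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the conclusions of several CoT rollouts."""
    return Counter(final_answers).most_common(1)[0][0]

# answers parsed from five independently sampled rollouts (stand-in values)
rollout_answers = [9, 9, 3, 9, 11]
print(self_consistency(rollout_answers))  # → 9
```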
Tree-of-thought
''Tree-of-thought'' prompting generalizes chain-of-thought by generating multiple lines of reasoning in parallel, with the ability to backtrack or explore other paths. It can use tree search algorithms like breadth-first search, depth-first search, or beam search.
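A breadth-first variant can be sketched as follows; `expand` and `score` stand in for LLM calls that propose next reasoning steps and rate partial chains, and are toy deterministic functions here so the search is runnable:

```python
def expand(thoughts):
    """Stand-in for an LLM proposing candidate next reasoning steps."""
    return [thoughts + [step] for step in ("step-a", "step-b")]

def score(thoughts):
    """Stand-in for an LLM rating a partial chain (toy: prefer "step-a")."""
    return thoughts.count("step-a")

def tree_of_thought_bfs(depth=2, beam=2):
    frontier = [[]]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        # keep only the `beam` highest-scoring partial chains; dropping the
        # rest is what lets the search abandon unpromising paths
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tree_of_thought_bfs())  # → ['step-a', 'step-a']
```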
Prompting to estimate model sensitivity
Research consistently demonstrates that LLMs are highly sensitive to subtle variations in prompt formatting, structure, and linguistic properties. Some studies have observed performance differences of up to 76 accuracy points across formatting changes in few-shot settings.
Linguistic features such as morphology, syntax, and lexico-semantic changes significantly influence prompt effectiveness and can meaningfully enhance task performance across a variety of tasks.
Clausal syntax, for example, improves consistency and reduces uncertainty in knowledge retrieval. This sensitivity persists even with larger model sizes, additional few-shot examples, or instruction tuning.
To address sensitivity of models and make them more robust, several methods have been proposed. FormatSpread facilitates systematic analysis by evaluating a range of plausible prompt formats, offering a more comprehensive performance interval.
Similarly, PromptEval estimates performance distributions across diverse prompts, enabling robust metrics such as performance quantiles and accurate evaluations under constrained budgets.
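The kind of analysis FormatSpread and PromptEval perform can be sketched by scoring the same task under several plausible formats and reporting the resulting performance interval; the accuracy numbers below are illustrative stand-ins, not measurements:

```python
formats = ["Q: {q}\nA:", "Question: {q}\nAnswer:", "{q} ->"]

def mock_accuracy(fmt: str) -> float:
    # stand-in numbers; a real run would score model outputs on a benchmark
    return {"Q: {q}\nA:": 0.71, "Question: {q}\nAnswer:": 0.64, "{q} ->": 0.52}[fmt]

scores = [mock_accuracy(f) for f in formats]
spread = max(scores) - min(scores)
print(f"performance interval: {min(scores):.2f}-{max(scores):.2f}, "
      f"spread {spread:.2f}")
```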
Automatic prompt generation
Retrieval-augmented generation
Retrieval-augmented generation (RAG) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information.
RAG improves large language models (LLMs) by incorporating information retrieval before generating responses. Unlike traditional LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to ''Ars Technica'', "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have led to real-world issues like chatbots inventing policies or lawyers citing nonexistent legal cases. By dynamically retrieving information, RAG enables AI to provide more accurate responses without frequent retraining.
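A minimal RAG sketch, using word overlap as a stand-in for a real retriever and omitting the final LLM call:

```python
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free on orders over $50.",
]

def retrieve(query: str, docs, k: int = 1):
    """Rank documents by word overlap with the query (stand-in retriever)."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_rag_prompt(query: str) -> str:
    """Prepend the retrieved context so the model answers with reference to it."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only the context below.\nContext:\n{context}\nQ: {query}"

print(build_rag_prompt("What is the refund policy?"))
```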
Graph retrieval-augmented generation

GraphRAG (coined by Microsoft Research) is a technique that extends RAG with the use of a knowledge graph (usually, LLM-generated) to allow the model to connect disparate pieces of information, synthesize insights, and holistically understand summarized semantic concepts over large data collections. It was shown to be effective on datasets like the Violent Incident Information from News Articles (VIINA).
Earlier work showed the effectiveness of using a knowledge graph for question answering using text-to-query generation. These techniques can be combined to search across both unstructured and structured data, providing expanded context and improved ranking.
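Text-to-query question answering over a knowledge graph can be sketched with a toy triple store; `text_to_query` stands in for an LLM that translates the question into a query pattern:

```python
# toy knowledge graph of (subject, relation, object) triples
graph = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
]

def text_to_query(question):
    """Stand-in for LLM text-to-query generation: question -> triple pattern."""
    if "born" in question:
        return ("Marie Curie", "born_in", None)
    return None

def run_query(pattern):
    """Return the objects of all triples matching the (s, r, None) pattern."""
    s, r, _ = pattern
    return [obj for subj, rel, obj in graph if subj == s and rel == r]

print(run_query(text_to_query("Where was Marie Curie born?")))  # → ['Warsaw']
```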
Using language models to generate prompts
Large language models (LLMs) themselves can be used to compose prompts for large language models. The ''automatic prompt engineer'' algorithm uses one LLM to beam search over prompts for another LLM:
* There are two LLMs: one is the target LLM, and the other is the prompting LLM.
* The prompting LLM is presented with example input-output pairs, and asked to generate instructions that could have caused a model following the instructions to generate the outputs, given the inputs.
* Each of the generated instructions is used to prompt the target LLM, followed by each of the inputs. The log-probabilities of the outputs are computed and added together. This is the score of the instruction.
* The highest-scored instructions are given to the prompting LLM for further variation.
* Repeat until some stopping criterion is reached, then output the highest-scored instructions.
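The loop above can be sketched as follows; `propose`, `log_prob`, and `vary` are stand-ins for the prompting LLM, the target LLM's output log-probabilities, and instruction variation, with toy deterministic scores:

```python
pairs = [("2+2", "4"), ("3+3", "6")]   # example input-output pairs

def propose(pairs):
    """Prompting LLM stand-in: propose candidate instructions."""
    return ["Add the numbers.", "Echo the input.", "Answer in French."]

def log_prob(instruction, x, y):
    """Target LLM stand-in: log-probability of output y given instruction + x."""
    return 0.0 if "Add" in instruction else -2.0   # toy scores

def vary(instruction):
    """Prompting LLM stand-in: keep the best instruction plus one variation."""
    return [instruction, instruction + " Show only the result."]

candidates = propose(pairs)
for _ in range(3):                     # fixed iteration count as stopping criterion
    scored = [(sum(log_prob(c, x, y) for x, y in pairs), c) for c in candidates]
    best = max(scored, key=lambda t: (t[0], -len(t[1])))[1]  # break ties: shorter
    candidates = vary(best)
print(best)  # → Add the numbers.
```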
CoT examples can be generated by LLMs themselves. In "auto-CoT", a library of questions is converted to vectors by a model such as BERT. The question vectors are clustered. Questions close to the centroid of each cluster are selected, in order to have a subset of diverse questions. An LLM does zero-shot CoT on each selected question. The question and the corresponding CoT answer are added to a dataset of demonstrations. These diverse demonstrations can then be added to prompts for few-shot learning.
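The auto-CoT selection step can be sketched with a toy two-dimensional "embedding" in place of BERT vectors and two hand-made clusters:

```python
questions = ["2+2?", "What is 3*7?", "Why is the sky blue?",
             "Explain why ice floats on water."]

def embed(q):
    """Stand-in for a BERT sentence vector: (length, word count)."""
    return (len(q), len(q.split()))

def centroid(vectors):
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))

def nearest(vectors, target):
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(range(len(vectors)), key=lambda i: dist(vectors[i]))

# two hand-made clusters: short arithmetic vs. longer "why" questions
clusters = [questions[:2], questions[2:]]
demos = []
for cluster in clusters:
    vecs = [embed(q) for q in cluster]
    demos.append(cluster[nearest(vecs, centroid(vecs))])
print(demos)  # → ['2+2?', 'Why is the sky blue?']
```

Each selected question would then receive a zero-shot CoT answer, and the resulting Q&A pairs would serve as few-shot demonstrations.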
Text-to-image

In 2022, text-to-image models like DALL-E 2, Stable Diffusion, and Midjourney were released to the public. These models take text prompts as input and use them to generate images.
Prompt formats
Early text-to-image models typically do not understand negation, grammar, and sentence structure in the same way as large language models, and may thus require a different set of prompting techniques. The prompt "a party with no cake" may produce an image including a cake.
As an alternative, ''negative prompts'' allow a user to indicate, in a separate prompt, which terms should ''not'' appear in the resulting image. Techniques such as framing the normal prompt into a
sequence-to-sequence language modeling problem can be used to automatically generate an output for the negative prompt.
A text-to-image prompt commonly includes a description of the subject of the art, the desired medium (such as ''digital painting'' or ''photography''), style (such as ''hyperrealistic'' or ''pop-art''), lighting (such as ''rim lighting'' or ''crepuscular rays''), color, and texture. Word order also affects the output of a text-to-image prompt. Words closer to the start of a prompt may be emphasized more heavily.
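Such a prompt can be assembled from its components, with a separate negative prompt; the field values are illustrative, and ordering the subject first reflects the word-order effect noted above:

```python
parts = {
    "subject": "an astronaut riding a horse",
    "medium": "digital painting",
    "style": "hyperrealistic",
    "lighting": "rim lighting",
    "color and texture": "warm tones, film grain",
}
prompt = ", ".join(parts.values())       # subject first: weighted more heavily
negative_prompt = "blurry, low quality"  # terms that should NOT appear
print(prompt)
```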
The Midjourney documentation encourages short, descriptive prompts: instead of "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils", an effective prompt might be "Bright orange California poppies drawn with colored pencils".
Artist styles
Some text-to-image models are capable of imitating the style of particular artists by name. For example, the phrase ''in the style of Greg Rutkowski'' has been used in Stable Diffusion and Midjourney prompts to generate images in the distinctive style of Polish digital artist
Greg Rutkowski. Famous artists such as Vincent van Gogh and Salvador Dalí have also been used for styling and testing.
Non-text prompts
Some approaches augment or replace natural language text prompts with non-text input.
Textual inversion and embeddings
For text-to-image models, ''textual inversion'' performs an optimization process to create a new word embedding based on a set of example images. This embedding vector acts as a "pseudo-word" which can be included in a prompt to express the content or style of the examples.
Image prompting
In 2023, Meta's AI research released Segment Anything, a computer vision model that can perform image segmentation by prompting. As an alternative to text prompts, Segment Anything can accept bounding boxes, segmentation masks, and foreground/background points.
Using gradient descent to search for prompts
In "prefix-tuning", "prompt tuning", or "soft prompting", floating-point-valued vectors are searched directly by gradient descent to maximize the log-likelihood on outputs.
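The idea can be illustrated with a one-dimensional toy: a single scalar "soft token" z is tuned by gradient ascent on the log-likelihood of the correct output, with a fixed sigmoid standing in for the frozen model:

```python
import math

def prob_correct(z: float, x: float) -> float:
    """Stand-in for the frozen model: P(correct token | soft token z, input x)."""
    return 1.0 / (1.0 + math.exp(-(z + x)))

x, lr, z = -2.0, 0.5, 0.0          # input embedding, step size, soft token
for _ in range(200):
    p = prob_correct(z, x)
    grad = 1.0 - p                 # d/dz of log prob_correct(z, x)
    z += lr * grad                 # gradient ascent on the log-likelihood
print(round(prob_correct(z, x), 3))
```

Only the soft token z moves; the "model" stays fixed, which is what distinguishes soft prompting from fine-tuning.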
Formally, let E = [e_1, ..., e_k] be a set of soft prompt tokens (tunable embeddings), while X = [x_1, ..., x_m] and Y = [y_1, ..., y_n] be the token embeddings of the input and output respectively. During training, the tunable embeddings, input, and output tokens are concatenated into a single sequence concat(E; X; Y) and fed to the LLM. The losses are computed over the Y tokens; the gradients are backpropagated to prompt-specific parameters: in prefix-tuning, they are parameters associated with the prompt tokens at each layer; in prompt tuning, they are merely the soft tokens added to the vocabulary.
More formally, this is prompt tuning. Let an LLM be written as LLM(X) = F(E(X)), where X is a sequence of linguistic tokens, E is the token-to-vector function, and F is the rest of the model. In prefix-tuning, one provides a set of input-output pairs {(X^i, Y^i)}_i, and then uses gradient descent to search for