Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from a generative artificial intelligence (AI) model.
A ''prompt'' is natural language text describing the task that an AI should perform.
A prompt for a text-to-text language model can be a query, a command, or a longer statement including context, instructions, and conversation history. Prompt engineering may involve phrasing a query, specifying a style, choice of words and grammar, providing relevant context, or describing a character for the AI to mimic.
When communicating with a text-to-image or a text-to-audio model, a typical prompt is a description of a desired output such as "a high-quality photo of an astronaut riding a horse" or "Lo-fi slow BPM electro chill with organic samples". Prompting a text-to-image model may involve adding, removing, or emphasizing words to achieve a desired subject, style, layout, lighting, and aesthetic.
History
In 2018, researchers first proposed that all previously separate tasks in natural language processing (NLP) could be cast as a question-answering problem over a context. In addition, they trained the first single, joint, multi-task model that would answer any task-related question like "What is the sentiment" or "Translate this sentence to German" or "Who is the president?"
The AI boom saw an increase in the number of prompting techniques used to get the model to output the desired outcome and avoid nonsensical output, a process characterized by trial and error. After the release of ChatGPT in 2022, prompt engineering was soon seen as an important business skill, albeit one with an uncertain economic future.
A repository for prompts reported that over 2,000 public prompts for around 170 datasets were available in February 2022. In 2022, the ''chain-of-thought'' prompting technique was proposed by Google researchers.
In 2023, several text-to-text and text-to-image prompt databases were made publicly available. The Personalized Image-Prompt (PIP) dataset, a generated image-text dataset categorized by 3,115 users, was also made publicly available in 2024.
Text-to-text
Multiple distinct prompt engineering techniques have been published.
Chain-of-thought
According to Google Research, ''chain-of-thought'' (CoT) prompting is a technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps before giving a final answer. In 2022, Google Brain reported that chain-of-thought prompting improves reasoning ability by inducing the model to answer a multi-step problem with steps of reasoning that mimic a train of thought.
Chain-of-thought techniques were developed to help LLMs handle multi-step reasoning tasks, such as arithmetic or commonsense reasoning questions.
For example, given the question, "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", Google claims that a CoT prompt might induce the LLM to answer "A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9."
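In practice, a CoT response like the one above is often post-processed to extract the final numeric answer. A minimal sketch, assuming the response follows the "The answer is ..." convention shown above (`parse_final_answer` is a hypothetical helper, not a library function):

```python
import re

def parse_final_answer(cot_response: str) -> int:
    """Return the number following "The answer is" in a CoT response."""
    match = re.search(r"The answer is (-?\d+)", cot_response)
    if match is None:
        raise ValueError("no final answer found")
    return int(match.group(1))

cot_response = (
    "A: The cafeteria had 23 apples originally. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have "
    "3 + 6 = 9. The answer is 9."
)
print(parse_final_answer(cot_response))  # → 9
```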
When applied to PaLM, a 540 billion parameter language model, according to Google, CoT prompting significantly aided the model, allowing it to perform comparably with task-specific fine-tuned models on several tasks, achieving state-of-the-art results at the time on the GSM8K mathematical reasoning benchmark.
It is possible to fine-tune models on CoT reasoning datasets to enhance this capability further and stimulate better interpretability.
An example of CoT prompting:
Q:
A: Let's think step by step.
As originally proposed by Google, each CoT prompt included a few Q&A examples. This made it a ''few-shot'' prompting technique. However, according to researchers at Google and the University of Tokyo, simply appending the words "Let's think step-by-step" was also effective, which makes CoT a ''zero-shot'' prompting technique. OpenAI claims that this prompt allows for better scaling, as a user no longer needs to formulate many specific CoT Q&A examples.
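The zero-shot variant can be sketched as a one-line prompt transformation; `build_zero_shot_cot` is a hypothetical helper name, and the trigger phrase is the one reported above:

```python
def build_zero_shot_cot(question: str) -> str:
    """Turn a bare question into a zero-shot CoT prompt."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = build_zero_shot_cot(
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)
print(prompt)
```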
In-context learning
''In-context learning'' refers to a model's ability to temporarily learn from prompts. For example, a prompt may include a few examples for a model to learn from, such as asking the model to complete "''maison'' house, ''chat'' cat, ''chien'' " (the expected response being ''dog''), an approach called ''few-shot learning''.
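The French-to-English example above can be assembled programmatically; this sketch only constructs the prompt string, with the model call left out:

```python
# Each (French, English) pair is one in-context example; the final item is
# left incomplete for the model to fill in (the expected completion is "dog").
examples = [("maison", "house"), ("chat", "cat")]
query = "chien"
prompt = ", ".join(f"{fr} {en}" for fr, en in examples) + f", {query} "
print(repr(prompt))  # → 'maison house, chat cat, chien '
```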
In-context learning is an emergent ability of large language models. It is an emergent property of model scale, meaning that breaks in downstream scaling laws occur, leading to its efficacy increasing at a different rate in larger models than in smaller models.
Unlike training and fine-tuning, which produce lasting changes, in-context learning is temporary. Training models to perform in-context learning can be viewed as a form of meta-learning, or "learning to learn".
Self-consistency decoding
''Self-consistency decoding'' performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.
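A minimal sketch of the voting step, assuming the final answers have already been parsed from each sampled rollout:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the conclusions of several CoT rollouts."""
    return Counter(final_answers).most_common(1)[0][0]

# answers parsed from five independently sampled rollouts (stand-in values)
rollout_answers = [9, 9, 3, 9, 11]
print(self_consistency(rollout_answers))  # → 9
```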
Tree-of-thought
''Tree-of-thought'' prompting generalizes chain-of-thought by generating multiple lines of reasoning in parallel, with the ability to backtrack or explore other paths. It can use tree search algorithms like breadth-first search, depth-first search, or beam search.
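A breadth-first variant can be sketched as follows; `expand` and `score` stand in for LLM calls that propose next reasoning steps and rate partial chains, and are toy deterministic functions here so the search is runnable:

```python
def expand(thoughts):
    """Stand-in for an LLM proposing candidate next reasoning steps."""
    return [thoughts + [step] for step in ("step-a", "step-b")]

def score(thoughts):
    """Stand-in for an LLM rating a partial chain (toy: prefer "step-a")."""
    return thoughts.count("step-a")

def tree_of_thought_bfs(depth=2, beam=2):
    frontier = [[]]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        # keep only the `beam` highest-scoring partial chains; dropping the
        # rest is what lets the search abandon unpromising paths
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tree_of_thought_bfs())  # → ['step-a', 'step-a']
```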
Prompting to estimate model sensitivity
Research consistently demonstrates that LLMs are highly sensitive to subtle variations in prompt formatting, structure, and linguistic properties. Some studies have observed performance differences of up to 76 accuracy points across formatting changes in few-shot settings.
Linguistic features such as morphology, syntax, and lexico-semantic changes significantly influence prompt effectiveness and can meaningfully enhance task performance across a variety of tasks.
Clausal syntax, for example, improves consistency and reduces uncertainty in knowledge retrieval. This sensitivity persists even with larger model sizes, additional few-shot examples, or instruction tuning.
To address sensitivity of models and make them more robust, several methods have been proposed. FormatSpread facilitates systematic analysis by evaluating a range of plausible prompt formats, offering a more comprehensive performance interval.
Similarly, PromptEval estimates performance distributions across diverse prompts, enabling robust metrics such as performance quantiles and accurate evaluations under constrained budgets.
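The kind of analysis FormatSpread and PromptEval perform can be sketched by scoring the same task under several plausible formats and reporting the resulting performance interval; the accuracy numbers below are illustrative stand-ins, not measurements:

```python
formats = ["Q: {q}\nA:", "Question: {q}\nAnswer:", "{q} ->"]

def mock_accuracy(fmt: str) -> float:
    # stand-in numbers; a real run would score model outputs on a benchmark
    return {"Q: {q}\nA:": 0.71, "Question: {q}\nAnswer:": 0.64, "{q} ->": 0.52}[fmt]

scores = [mock_accuracy(f) for f in formats]
spread = max(scores) - min(scores)
print(f"performance interval: {min(scores):.2f}-{max(scores):.2f}, "
      f"spread {spread:.2f}")
```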
Automatic prompt generation
Retrieval-augmented generation
Retrieval-augmented generation (RAG) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information.
RAG improves large language models (LLMs) by incorporating information retrieval before generating responses. Unlike traditional LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to ''Ars Technica'', "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have led to real-world issues like chatbots inventing policies or lawyers citing nonexistent legal cases. By dynamically retrieving information, RAG enables AI to provide more accurate responses without frequent retraining.
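A minimal RAG sketch, using word overlap as a stand-in for a real retriever and omitting the final LLM call:

```python
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free on orders over $50.",
]

def retrieve(query: str, docs, k: int = 1):
    """Rank documents by word overlap with the query (stand-in retriever)."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_rag_prompt(query: str) -> str:
    """Prepend the retrieved context so the model answers with reference to it."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only the context below.\nContext:\n{context}\nQ: {query}"

print(build_rag_prompt("What is the refund policy?"))
```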
Graph retrieval-augmented generation

GraphRAG (coined by Microsoft Research) is a technique that extends RAG with the use of a knowledge graph (usually, LLM-generated) to allow the model to connect disparate pieces of information, synthesize insights, and holistically understand summarized semantic concepts over large data collections. It was shown to be effective on datasets like the Violent Incident Information from News Articles (VIINA).
Earlier work showed the effectiveness of using a knowledge graph for question answering using text-to-query generation. These techniques can be combined to search across both unstructured and structured data, providing expanded context and improved ranking.
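Text-to-query question answering over a knowledge graph can be sketched with a toy triple store; `text_to_query` stands in for an LLM that translates the question into a query pattern:

```python
# toy knowledge graph of (subject, relation, object) triples
graph = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
]

def text_to_query(question):
    """Stand-in for LLM text-to-query generation: question -> triple pattern."""
    if "born" in question:
        return ("Marie Curie", "born_in", None)
    return None

def run_query(pattern):
    """Return the objects of all triples matching the (s, r, None) pattern."""
    s, r, _ = pattern
    return [obj for subj, rel, obj in graph if subj == s and rel == r]

print(run_query(text_to_query("Where was Marie Curie born?")))  # → ['Warsaw']
```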
Using language models to generate prompts
Large language models (LLMs) themselves can be used to compose prompts for large language models. The ''automatic prompt engineer'' algorithm uses one LLM to beam search over prompts for another LLM:
* There are two LLMs: one is the target LLM, and the other is the prompting LLM.
* The prompting LLM is presented with example input-output pairs, and asked to generate instructions that could have caused a model following the instructions to generate the outputs, given the inputs.
* Each of the generated instructions is used to prompt the target LLM, followed by each of the inputs. The log-probabilities of the outputs are computed and added together. This is the score of the instruction.
* The highest-scored instructions are given to the prompting LLM for further variation.
* Repeat until some stopping criterion is reached, then output the highest-scored instructions.
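The loop above can be sketched as follows; `propose`, `log_prob`, and `vary` are stand-ins for the prompting LLM, the target LLM's output log-probabilities, and instruction variation, with toy deterministic scores:

```python
pairs = [("2+2", "4"), ("3+3", "6")]   # example input-output pairs

def propose(pairs):
    """Prompting LLM stand-in: propose candidate instructions."""
    return ["Add the numbers.", "Echo the input.", "Answer in French."]

def log_prob(instruction, x, y):
    """Target LLM stand-in: log-probability of output y given instruction + x."""
    return 0.0 if "Add" in instruction else -2.0   # toy scores

def vary(instruction):
    """Prompting LLM stand-in: keep the best instruction plus one variation."""
    return [instruction, instruction + " Show only the result."]

candidates = propose(pairs)
for _ in range(3):                     # fixed iteration count as stopping criterion
    scored = [(sum(log_prob(c, x, y) for x, y in pairs), c) for c in candidates]
    best = max(scored, key=lambda t: (t[0], -len(t[1])))[1]  # break ties: shorter
    candidates = vary(best)
print(best)  # → Add the numbers.
```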
CoT examples can be generated by LLMs themselves. In "auto-CoT", a library of questions is converted to vectors by a model such as BERT. The question vectors are clustered. Questions close to the centroid of each cluster are selected, in order to have a subset of diverse questions. An LLM does zero-shot CoT on each selected question. The question and the corresponding CoT answer are added to a dataset of demonstrations. These diverse demonstrations can then be added to prompts for few-shot learning.
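The auto-CoT selection step can be sketched with a toy two-dimensional "embedding" in place of BERT vectors and two hand-made clusters:

```python
questions = ["2+2?", "What is 3*7?", "Why is the sky blue?",
             "Explain why ice floats on water."]

def embed(q):
    """Stand-in for a BERT sentence vector: (length, word count)."""
    return (len(q), len(q.split()))

def centroid(vectors):
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))

def nearest(vectors, target):
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(range(len(vectors)), key=lambda i: dist(vectors[i]))

# two hand-made clusters: short arithmetic vs. longer "why" questions
clusters = [questions[:2], questions[2:]]
demos = []
for cluster in clusters:
    vecs = [embed(q) for q in cluster]
    demos.append(cluster[nearest(vecs, centroid(vecs))])
print(demos)  # → ['2+2?', 'Why is the sky blue?']
```

Each selected question would then receive a zero-shot CoT answer, and the resulting Q&A pairs would serve as few-shot demonstrations.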
Text-to-image

In 2022, text-to-image models like DALL-E 2, Stable Diffusion, and Midjourney were released to the public. These models take text prompts as input and use them to generate images.
Prompt formats
Early text-to-image models typically do not understand negation, grammar, and sentence structure in the same way as large language models, and may thus require a different set of prompting techniques. The prompt "a party with no cake" may produce an image including a cake.
As an alternative, ''negative prompts'' allow a user to indicate, in a separate prompt, which terms should ''not'' appear in the resulting image. Techniques such as framing the normal prompt into a
sequence-to-sequence language modeling problem can be used to automatically generate an output for the negative prompt.
A text-to-image prompt commonly includes a description of the subject of the art, the desired medium (such as ''digital painting'' or ''photography''), style (such as ''hyperrealistic'' or ''pop-art''), lighting (such as ''rim lighting'' or ''crepuscular rays''), color, and texture. Word order also affects the output of a text-to-image prompt. Words closer to the start of a prompt may be emphasized more heavily.
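Such a prompt can be assembled from its components, with a separate negative prompt; the field values are illustrative, and ordering the subject first reflects the word-order effect noted above:

```python
parts = {
    "subject": "an astronaut riding a horse",
    "medium": "digital painting",
    "style": "hyperrealistic",
    "lighting": "rim lighting",
    "color and texture": "warm tones, film grain",
}
prompt = ", ".join(parts.values())       # subject first: weighted more heavily
negative_prompt = "blurry, low quality"  # terms that should NOT appear
print(prompt)
```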
The Midjourney documentation encourages short, descriptive prompts: instead of "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils", an effective prompt might be "Bright orange California poppies drawn with colored pencils".
Artist styles
Some text-to-image models are capable of imitating the style of particular artists by name. For example, the phrase ''in the style of Greg Rutkowski'' has been used in Stable Diffusion and Midjourney prompts to generate images in the distinctive style of Polish digital artist
Greg Rutkowski. Famous artists such as Vincent van Gogh and Salvador Dalí have also been used for styling and testing.
Non-text prompts
Some approaches augment or replace natural language text prompts with non-text input.
Textual inversion and embeddings
For text-to-image models, ''textual inversion'' performs an optimization process to create a new word embedding based on a set of example images. This embedding vector acts as a "pseudo-word" which can be included in a prompt to express the content or style of the examples.
Image prompting
In 2023, Meta's AI research released Segment Anything, a computer vision model that can perform image segmentation by prompting. As an alternative to text prompts, Segment Anything can accept bounding boxes, segmentation masks, and foreground/background points.
Using gradient descent to search for prompts
In "prefix-tuning", "prompt tuning", or "soft prompting", floating-point-valued vectors are searched directly by gradient descent to maximize the log-likelihood on outputs.
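The idea can be illustrated with a one-dimensional toy: a single scalar "soft token" z is tuned by gradient ascent on the log-likelihood of the correct output, with a fixed sigmoid standing in for the frozen model:

```python
import math

def prob_correct(z: float, x: float) -> float:
    """Stand-in for the frozen model: P(correct token | soft token z, input x)."""
    return 1.0 / (1.0 + math.exp(-(z + x)))

x, lr, z = -2.0, 0.5, 0.0          # input embedding, step size, soft token
for _ in range(200):
    p = prob_correct(z, x)
    grad = 1.0 - p                 # d/dz of log prob_correct(z, x)
    z += lr * grad                 # gradient ascent on the log-likelihood
print(round(prob_correct(z, x), 3))
```

Only the soft token z moves; the "model" stays fixed, which is what distinguishes soft prompting from fine-tuning.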
Formally, let E = [e_1, ..., e_k] be a set of soft prompt tokens (tunable embeddings), while X = [x_1, ..., x_m] and Y = [y_1, ..., y_n] be the token embeddings of the input and output respectively. During training, the tunable embeddings, input, and output tokens are concatenated into a single sequence concat(E; X; Y) and fed to the LLM. The losses are computed over the Y tokens; the gradients are backpropagated to prompt-specific parameters: in prefix-tuning, they are parameters associated with the prompt tokens at each layer; in prompt tuning, they are merely the soft tokens added to the vocabulary.
More formally, this is prompt tuning. Let an LLM be written as LLM(X) = F(E(X)), where X is a sequence of linguistic tokens, E is the token-to-vector function, and F is the rest of the model. In prefix-tuning, one provides a set of input-output pairs {(X^i, Y^i)}_i, and then uses gradient descent to search for