Text-to-video
A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text. Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.
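For illustration, a prompt-to-clip call might look like the following sketch; it assumes the Hugging Face diffusers library and the openly released ModelScope checkpoint "damo-vilab/text-to-video-ms-1.7b", and the exact handling of the returned frames varies between library versions.

    # Illustrative only: assumes the "diffusers" library and the ModelScope
    # text-to-video checkpoint are installed and downloadable.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    # A natural-language description is the only required input.
    result = pipe("a panda playing guitar on a beach", num_frames=16)
    # Depending on the diffusers version, .frames may be a batch of clips or a
    # single list of frames; recent versions return a batch, hence the [0].
    export_to_video(result.frames[0], "panda.mp4")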


Models

There are different models, including open source models. CogVideo, which takes Chinese-language input, is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version of its open-source code was first presented on GitHub in 2022. That year, Meta Platforms released a partial text-to-video model called "Make-A-Video", and Google's Brain division (later Google DeepMind) introduced Imagen Video, a text-to-video model with a 3D U-Net.

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation. The VideoFusion model decomposes the diffusion process into two components: a base noise shared across frames, which ensures temporal coherence, and a frame-specific residual noise. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.
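The following is a minimal sketch, assuming PyTorch, of the shared base-noise and per-frame residual-noise split just described; the function name and the fixed mixing weight alpha are illustrative inventions, and the actual VideoFusion model additionally relies on a pre-trained image diffusion model to handle the shared component, as noted above.

    # A minimal sketch (not the authors' code) of VideoFusion's core idea: the noise
    # added to every frame is split into a "base" part shared by all frames and a
    # per-frame "residual" part, so neighbouring frames stay correlated.
    import torch

    def decomposed_noise(num_frames, latent_shape, alpha=0.7, generator=None):
        """Per-frame noise = sqrt(alpha) * shared base + sqrt(1 - alpha) * residual."""
        base = torch.randn(1, *latent_shape, generator=generator)                # shared across frames
        residual = torch.randn(num_frames, *latent_shape, generator=generator)   # frame-specific
        return alpha ** 0.5 * base + (1 - alpha) ** 0.5 * residual               # unit variance preserved

    # Example: correlated noise for a 16-frame clip of 4x32x32 latents.
    noise = decomposed_noise(16, (4, 32, 32))
    print(noise.shape)  # torch.Size([16, 4, 32, 32])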
In the same month, Adobe introduced Firefly AI as part of its features. In January 2024, Google announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities. Matthias Nießner and Lourdes Agapito at the AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.

In June 2024, Luma Labs launched its Dream Machine video tool.
That same month, Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology. By September 2024, the Chinese AI company MiniMax had debuted its video-01 model, joining other established AI model companies such as Zhipu AI, Baichuan, and Moonshot AI, which contribute to China's involvement in AI technology.

Alternative approaches to text-to-video models include Google's Phenaki, Hour One, Colossyan, Runway's Gen-3 Alpha, and OpenAI's Sora. Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.
FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA. Google was preparing to launch a video generation tool named Veo for YouTube Shorts in 2025. In May 2025, Google launched the Veo 3 iteration of the model, which was noted for its audio generation capabilities, an area that had previously been a limitation of text-to-video models.


Architecture and training

Several architectures have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for pixel-transformation models and stochastic video generation models, aiding consistency and realism respectively. An alternative to these is the transformer model.
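The following schematic PyTorch sketch is hypothetical (the class name and dimensions are invented for illustration) and only shows the general idea: an LSTM carries temporal state from one frame to the next while an embedding of the text prompt conditions the whole sequence.

    import torch
    import torch.nn as nn

    class TextConditionedFramePredictor(nn.Module):
        """Toy recurrent generator: the prompt embedding initialises the memory,
        and each step predicts the next frame latent from the previous one."""
        def __init__(self, text_dim=512, frame_dim=1024, hidden_dim=1024):
            super().__init__()
            self.init_state = nn.Linear(text_dim, hidden_dim)  # prompt sets the initial memory
            self.lstm = nn.LSTMCell(frame_dim, hidden_dim)
            self.to_frame = nn.Linear(hidden_dim, frame_dim)   # decode hidden state to a frame latent

        def forward(self, text_emb, num_frames):
            h = torch.tanh(self.init_state(text_emb))
            c = torch.zeros_like(h)
            frame = torch.zeros(text_emb.size(0), self.to_frame.out_features)
            frames = []
            for _ in range(num_frames):
                h, c = self.lstm(frame, (h, c))  # update the temporal state
                frame = self.to_frame(h)         # predict the next frame latent
                frames.append(frame)
            return torch.stack(frames, dim=1)    # (batch, num_frames, frame_dim)

    model = TextConditionedFramePredictor()
    video_latents = model(torch.randn(2, 512), num_frames=16)
    print(video_latents.shape)  # torch.Size([2, 16, 1024])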
Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion, and diffusion models have also been used to develop the image-generation aspects of these models.

Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M. These datasets contain millions of original videos, generated videos, captioned videos, and textual information that help train models for accuracy. Prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM; these provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.

The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. This predictive process is subject to a decline in quality as the length of the video increases, due to resource limitations.
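As a highly simplified illustration of that synchronization, the sketch below (assuming PyTorch; the backbone and the update rule are placeholders rather than a real denoiser and noise scheduler) conditions every frame of a clip on the same prompt embedding during iterative denoising, which is what keeps the frames aligned with the text and with each other.

    import torch

    def generate_video_latents(denoiser, text_emb, num_frames=16,
                               latent_shape=(4, 32, 32), num_steps=50):
        # Start from pure noise for every frame of the clip.
        x = torch.randn(num_frames, *latent_shape)
        for step in reversed(range(num_steps)):
            t = torch.full((num_frames,), step)
            # The denoiser sees the whole frame stack plus the shared text
            # embedding, so frames stay consistent with the prompt and each other.
            noise_pred = denoiser(x, t, text_emb)
            x = x - noise_pred / num_steps  # placeholder update, not a real scheduler
        return x

    # Stand-in for a 3D U-Net or transformer backbone.
    dummy_denoiser = lambda x, t, text: torch.zeros_like(x)
    latents = generate_video_latents(dummy_denoiser, torch.randn(512))
    print(latents.shape)  # torch.Size([16, 4, 32, 32])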


Limitations

Despite rapid improvements in the performance of text-to-video models, a primary limitation is that they are computationally heavy, which limits their capacity to produce high-quality and lengthy outputs. Additionally, these models require large amounts of specific training data to generate high-quality and coherent outputs, which raises the issue of accessibility.

Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing the semantic context embedded in text, which affects the model's ability to align the generated video with the user's intended message. Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.

Another issue is that text or fine details in AI-generated videos often appear garbled, a problem that Stable Diffusion models also struggle with; examples include distorted hands and unreadable text.
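One way such prompt-video misalignment can be quantified, sketched below under the assumption that the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint are available, is to score each generated frame against the prompt with CLIP; a real evaluation would pass decoded video frames rather than blank placeholder images.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def prompt_alignment(prompt, frames):
        """Mean cosine similarity between the prompt and each frame (PIL images)."""
        inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        return (image_emb @ text_emb.T).mean().item()

    # Toy example with blank frames; higher scores indicate better alignment.
    score = prompt_alignment("a dog surfing a wave", [Image.new("RGB", (224, 224))] * 4)
    print(round(score, 3))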


Ethics

The deployment of text-to-video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent. Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.


Impacts and applications

Text-to-video models offer a broad range of applications that may benefit various fields, from education and promotion to the creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate content.

During the Russo-Ukrainian war, fake videos made with artificial intelligence were created as part of a propaganda war against Ukraine and shared on social media. These included depictions of children in the Ukrainian Armed Forces, fake advertisements targeting children and encouraging them to denounce critics of the Ukrainian government, and fictitious statements by Ukrainian president Volodymyr Zelenskyy about the country's surrender, among others.


Comparison of existing models


See also

* Text-to-image model
* AI slop
* VideoPoet, an unreleased Google model and precursor of Lumiere
* Deepfake
* Human image synthesis
* ChatGPT

