In
computer vision
Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...
and
image processing
An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a pr ...
, motion estimation is the process of determining ''motion vectors'' that describe the transformation from one 2D image to another; usually from adjacent
frames in a video sequence. It is an
ill-posed problem as the
motion
In physics, motion is when an object changes its position with respect to a reference point in a given time. Motion is mathematically described in terms of displacement, distance, velocity, acceleration, speed, and frame of reference to an o ...
happens in three dimensions (3D) but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image (''global motion estimation'') or specific parts, such as rectangular blocks, arbitrary shaped patches or even per
pixel
In digital imaging, a pixel (abbreviated px), pel, or picture element is the smallest addressable element in a Raster graphics, raster image, or the smallest addressable element in a dot matrix display device. In most digital display devices, p ...
. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
Related terms
More often than not, the term motion estimation and the term ''
optical flow
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. Optical flow can also be defined as the distribution of apparent velocit ...
'' are used interchangeably. It is also related in concept to ''
image registration
Image registration is the process of transforming different sets of data into one coordinate system. Data may be multiple photographs, data from different sensors, times, depths, or viewpoints. It is used in computer vision, medical imaging, mil ...
'' and ''
stereo correspondence
The correspondence problem refers to the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photo ...
''.
In fact all of these terms refer to the process of
finding corresponding points between two images or video frames. The points that correspond to each other in two views (images or frames) of a real scene or object are "usually" the same point in that scene or on that object. Before we do motion estimation, we must define our measurement of correspondence, i.e., the matching metric, which is a measurement of how similar two image points are. There is no right or wrong here; the choice of matching metric is usually related to what the final estimated motion is used for as well as the optimisation strategy in the estimation process.
Each ''motion vector'' is used to represent a ''
macroblock
The macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the discrete cosine transform (DCT). A macroblock typically consists of 16×16 samples, and is further subdivided into transform ...
'' in a picture based on the position of this macroblock (or a similar one) in another picture, called the reference picture.
The
H.264/MPEG-4 AVC standard defines ''motion vector'' as:
motion vector: a two-dimensional vector used for inter prediction that provides an offset from the coordinates in the decoded picture to the coordinates in a reference picture.
Algorithms
The methods for finding motion vectors can be categorised into pixel based methods ("direct") and feature based methods ("indirect"). A famous debate resulted in two papers from the opposing factions being produced to try to establish a conclusion.
Direct methods
*
Block-matching algorithm
*
Phase correlation
Phase correlation is an approach to estimate the relative Translation (geometry), translative offset between two similar images (digital image correlation) or other data sets. It is commonly used in image registration and relies on a frequency-doma ...
and frequency domain methods
* Pixel recursive algorithms
*
Optical flow
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. Optical flow can also be defined as the distribution of apparent velocit ...
Indirect methods
''Indirect methods'' use features, such as
corner detection
Corner detection is an approach used within computer vision systems to extract certain kinds of Feature detection (computer vision), features and infer the contents of an image. Corner detection is frequently used in motion detection, image reg ...
, and match corresponding features between frames, usually with a statistical function applied over a local or global area. The purpose of the statistical function is to remove matches that do not correspond to the actual motion.
Statistical functions that have been successfully used include
RANSAC.
Additional note on the categorization
It can be argued that almost all methods require some kind of definition of the matching criteria. The difference is only whether you summarise over a local image region first and then compare the summarisation (such as feature based methods), or you compare each pixel first (such as squaring the difference) and then summarise over a local image region (block base motion and filter based motion). An emerging type of matching criteria summarises a local image region first for every pixel location (through some feature transform such as Laplacian transform), compares each summarised pixel and summarises over a local image region again. Some matching criteria have the ability to exclude points that do not actually correspond to each other albeit producing a good matching score, others do not have this ability, but they are still matching criteria.
Affine motion estimation
Affine motion estimation
Affine may describe any of various topics concerned with connections or affinities.
It may refer to:
* Affine, a relative by marriage in law and anthropology
* Affine cipher, a special case of the more general substitution cipher
* Affine com ...
is a technique used in computer vision and image processing to estimate the motion between two images or frames. It assumes that the motion can be modeled as an affine transformation (translation + rotation + zooming), which is a linear transformation followed by a translation.
Applications
Video coding
Applying the motion vectors to an image to synthesize the transformation to the next image is called
motion compensation
Motion compensation in computing is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video ...
.
It is most easily applied to
discrete cosine transform
A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequency, frequencies. The DCT, first proposed by Nasir Ahmed (engineer), Nasir Ahmed in 1972, is a widely ...
(DCT) based
video coding standards
A video coding format (or sometimes video compression format) is a content representation format of digital video content, such as in a data file or bitstream. It typically uses a standardized video compression algorithm, most commonly based on ...
, because the coding is performed in blocks.
As a way of exploiting temporal redundancy, motion estimation and compensation are key parts of
video compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression ...
. Almost all video coding standards use block-based motion estimation and compensation such as the
MPEG
The Moving Picture Experts Group (MPEG) is an alliance of working groups established jointly by International Organization for Standardization, ISO and International Electrotechnical Commission, IEC that sets standards for media coding, includ ...
series including the most recent
HEVC
High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is a video compression standard designed as part of the MPEG-H project as a successor to the widely used Advanced Video Coding (AVC, H.264, or MPEG-4 Part 10). In co ...
.
3D reconstruction
In
simultaneous localization and mapping
Simultaneous localization and mapping (SLAM) is the computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an Intelligent agent, agent's location within it. While this initially ap ...
, a 3D model of a scene is reconstructed using images from a moving camera.
[Kerl, Christian, Jürgen Sturm, and Daniel Cremers.]
Dense visual SLAM for RGB-D cameras
" 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013.
See also
*
Moving object detection
Moving object detection is a technique used in computer vision and image processing. Multiple consecutive frames from a video are compared by various methods to determine if any moving object is detected.
Moving objects detection has been used for ...
*
Graphics processing unit
A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...
*
Vision processing unit
*
Scale-invariant feature transform
The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local '' features'' in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, ...
References
{{Digital image processing
Video processing
Motion (physics)
Motion in computer vision