Video Super Resolution

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency. There are many approaches to this task, but it remains a popular and challenging problem.


Mathematical explanation

Most research models the degradation process of frames as

$\{y\} = (\{x\} * k)\downarrow + \{n\}$

where:
* $\{x\}$: original high-resolution frame sequence
* $k$: blur kernel
* $*$: convolution operation
* $\downarrow$: downscaling operation
* $\{n\}$: additive noise
* $\{y\}$: low-resolution frame sequence

Super-resolution is the inverse operation: the problem is to estimate a frame sequence $\{\hat{x}\}$ from the frame sequence $\{y\}$ so that $\{\hat{x}\}$ is close to the original $\{x\}$. The blur kernel, downscaling operation and additive noise should be estimated for the given input to achieve better results.

Video super-resolution approaches tend to have more components than their image counterparts, as they need to exploit the additional temporal dimension. Complex designs are not uncommon. The most essential components of VSR are guided by four basic functionalities: propagation, alignment, aggregation, and upsampling.
* Propagation refers to the way in which features are propagated temporally
* Alignment concerns the spatial transformation applied to misaligned images/features
* Aggregation defines the steps to combine aligned features
* Upsampling describes the method to transform the aggregated features to the final output image
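The degradation model described in this section (blur each high-resolution frame, downscale it, then add noise) can be simulated in a few lines. The following is only a minimal sketch in plain Python, not a reference implementation: the function names, the edge-replicating border handling, the "keep every s-th pixel" downscaling and the Gaussian noise are all assumptions chosen for illustration.

```python
import random


def blur(frame, kernel):
    """Convolve a 2D frame (list of rows) with a small odd-sized kernel.

    Border pixels are replicated (an assumption; other paddings are possible).
    """
    h, w = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Kernel indices are flipped: true convolution, not correlation.
                    y = min(max(i + kh // 2 - a, 0), h - 1)
                    x = min(max(j + kw // 2 - b, 0), w - 1)
                    acc += kernel[a][b] * frame[y][x]
            out[i][j] = acc
    return out


def downscale(frame, s):
    """Keep every s-th pixel in both directions (assumed downscaling operator)."""
    return [row[::s] for row in frame[::s]]


def degrade(frames, kernel, s, sigma, seed=0):
    """Apply blur, downscaling and additive Gaussian noise to every frame."""
    rng = random.Random(seed)
    low_res = []
    for frame in frames:
        small = downscale(blur(frame, kernel), s)
        low_res.append([[v + rng.gauss(0.0, sigma) for v in row] for row in small])
    return low_res
```

Calling `degrade(frames, kernel, 2, 0.01)` on a list of 8x8 frames would yield 4x4 noisy frames; a super-resolution method then tries to invert this mapping.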


Methods

When working with video, temporal information can be used to improve upscaling quality. Single-image super-resolution methods can be used too, generating high-resolution frames independently of their neighbours, but this is less effective and introduces temporal instability. There are a few traditional methods, which treat the video super-resolution task as an optimization problem. In recent years, deep learning based methods for video upscaling have outperformed traditional ones.


Traditional methods

There are several traditional methods for video upscaling. These methods try to use some natural preferences (priors) and to effectively estimate motion between frames. The high-resolution frame is reconstructed based on both the natural preferences and the estimated motion.


Frequency domain

First, the low-resolution frame is transformed to the frequency domain. The high-resolution frame is estimated in this domain. Finally, the resulting frame is transformed back to the spatial domain. Some methods use the Fourier transform, which helps to extend the spectrum of the captured signal and thereby increase resolution. There are different approaches for these methods: using weighted least squares theory, the total least squares (TLS) algorithm, or space-varying or spatio-temporal varying filtering. Other methods use the wavelet transform, which helps to find similarities in neighboring local areas. Later, the second-generation wavelet transform was used for video super-resolution.


Spatial domain

Iterative back-projection methods assume some function between low-resolution and high-resolution frames and try to improve their guessed function in each step of an iterative process. Projections onto convex sets (POCS), which defines a specific cost function, can also be used for iterative methods.

Iterative adaptive filtering algorithms use a Kalman filter to estimate the transformation from a low-resolution frame to a high-resolution one. To improve the final result, these methods consider temporal correlation among low-resolution sequences. Some approaches also consider temporal correlation among the high-resolution sequence. A common way to approximate the Kalman filter is to use least mean squares (LMS). One can also use steepest descent, least squares (LS), or recursive least squares (RLS).

Direct methods estimate motion between frames, upscale a reference frame, and warp neighboring frames to the high-resolution reference one. To construct the result, these upscaled frames are fused together by a median filter, a weighted median filter, adaptive normalized averaging, an AdaBoost classifier or SVD (singular value decomposition) based filters.

Non-parametric algorithms join motion estimation and frame fusion into one step, which is performed by considering patch similarities. Weights for fusion can be calculated by nonlocal-means filters. To strengthen the search for similar patches, one can use a rotation-invariant similarity measure or an adaptive patch size. Calculating intra-frame similarity helps to preserve small details and edges. Parameters for fusion can also be calculated by kernel regression.

Probabilistic methods use statistical theory to solve the task. Maximum likelihood (ML) methods estimate the most probable image. Another group of methods uses maximum a posteriori (MAP) estimation. The regularization parameter for MAP can be estimated by Tikhonov regularization. Markov random fields (MRF) are often used along with MAP and help to preserve similarity in neighboring patches. Huber MRFs are used to preserve sharp edges. A Gaussian MRF can smooth some edges, but removes noise.
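The iterative back-projection idea mentioned at the start of this section can be illustrated on a 1D signal. This is only a toy sketch under stated assumptions: the degradation is assumed to be block averaging over `s` samples, the error is back-projected by simple replication, and the names `simulate_lr`, `back_project` and `iterative_back_projection` are hypothetical. Practical methods work on 2D frames, use estimated blur kernels and combine several motion-compensated frames.

```python
def simulate_lr(hr, s):
    """Assumed degradation: average each block of s samples (blur + downscale)."""
    return [sum(hr[i:i + s]) / s for i in range(0, len(hr), s)]


def back_project(err, s):
    """Spread each low-resolution error value over its s high-resolution samples."""
    return [e for e in err for _ in range(s)]


def iterative_back_projection(lr, s, steps=30, step_size=0.5):
    """Refine a high-resolution guess until its simulated LR version matches lr."""
    hr = [0.0] * (len(lr) * s)  # deliberately poor initial guess
    for _ in range(steps):
        simulated = simulate_lr(hr, s)
        err = [a - b for a, b in zip(lr, simulated)]
        # Correct the guess with the back-projected reconstruction error.
        hr = [h + step_size * e for h, e in zip(hr, back_project(err, s))]
    return hr
```

With this particular degradation the error shrinks by a factor of `1 - step_size` per step, so the simulated low-resolution signal converges to the observed one. The result is only one of many high-resolution signals consistent with the observation, which is why practical methods add priors or further frames.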


Deep learning based methods


Aligned by motion estimation and motion compensation

In approaches with alignment, neighboring frames are first aligned with the target one. One can align frames by performing motion estimation and motion compensation (MEMC) or by using deformable convolution (DC). Motion estimation gives information about the motion of pixels between frames. Motion compensation is a warping operation, which aligns one frame to another based on motion information. Examples of such methods:
* Deep-DE (deep draft-ensemble learning) generates a series of SR feature maps and then processes them together to estimate the final frame
* VSRnet is based on SRCNN (a model for single-image super-resolution), but takes multiple frames as input. Input frames are first aligned by the Druleas algorithm
* VESPCN uses a spatial motion compensation transformer module (MCT), which estimates and compensates motion. Then a series of convolutions is performed to extract features and fuse them
* DRVSR (detail-revealing deep video super-resolution) consists of three main steps: motion estimation, motion compensation and fusion. The motion compensation transformer (MCT) is used for motion estimation. The sub-pixel motion compensation layer (SPMC) compensates motion. The fusion step uses an encoder-decoder architecture and a ConvLSTM module to unite information from both spatial and temporal dimensions
* RVSR (robust video super-resolution) has two branches: one for spatial alignment and another for temporal adaptation. The final frame is a weighted sum of the branches' outputs
* FRVSR (frame recurrent video super-resolution) estimates low-resolution optical flow, upsamples it to high resolution and warps the previous output frame using this high-resolution optical flow
* STTN (the spatio-temporal transformer network) estimates optical flow with a U-style network based on Unet and compensates motion by a trilinear interpolation method
* SOF-VSR (super-resolution optical flow for video super-resolution) calculates high-resolution optical flow in a coarse-to-fine manner. Then the low-resolution optical flow is estimated by a space-to-depth transformation. The final super-resolution result is obtained from aligned low-resolution frames
* TecoGAN (the temporally coherent GAN) consists of a generator and a discriminator. The generator estimates LR optical flow between consecutive frames and from this approximates HR optical flow to yield the output frame. The discriminator assesses the quality of the generator
* TOFlow (task-oriented flow) is a combination of an optical flow network and a reconstruction network. The estimated optical flow is suited to a particular task, such as video super-resolution
* MMCNN (the multi-memory convolutional neural network) aligns frames with the target one and then generates the final HR result through feature extraction, detail fusion and feature reconstruction modules
* RBPN (the recurrent back-projection network). The input of each recurrent projection module consists of features from the previous frame, features from the sequence of frames, and optical flow between neighboring frames
* MEMC-Net (the motion estimation and motion compensation network) uses both a motion estimation network and a kernel estimation network to warp frames adaptively
* RTVSR (real-time video super-resolution) aligns frames with an estimated convolutional kernel
* MultiBoot VSR (the multi-stage multi-reference bootstrapping method) aligns frames and then has two stages of SR reconstruction to improve quality
* BasicVSR aligns frames with optical flow and then fuses their features in a recurrent bidirectional scheme
* IconVSR is a refined version of BasicVSR with a recurrent coupled propagation scheme
* UVSR (unrolled network for video super-resolution) adapts unrolled optimization algorithms to solve the VSR problem
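Motion compensation as a warping operation can be sketched independently of any particular network. The example below is a hedged illustration, not the procedure of any method listed above: it assumes a dense flow field given as `(dx, dy)` per pixel, backward warping, bilinear interpolation and clamping at the frame border. The function names are hypothetical.

```python
import math


def sample_bilinear(frame, y, x):
    """Read frame at a fractional position (y, x), clamping to the border."""
    h, w = len(frame), len(frame[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = (1 - fx) * frame[y0][x0] + fx * frame[y0][x1]
    bottom = (1 - fx) * frame[y1][x0] + fx * frame[y1][x1]
    return (1 - fy) * top + fy * bottom


def warp(frame, flow):
    """Backward warp: output[i][j] is frame sampled at (i + dy, j + dx).

    flow[i][j] is a (dx, dy) pair, e.g. produced by a motion estimation step.
    """
    return [
        [
            sample_bilinear(frame, i + flow[i][j][1], j + flow[i][j][0])
            for j in range(len(frame[0]))
        ]
        for i in range(len(frame))
    ]
```

Backward warping is used here because every output pixel receives exactly one value; forward warping would leave holes where no source pixel lands.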


Aligned by deformable convolution

Another way to align neighboring frames with the target one is deformable convolution. While a usual convolution has a fixed kernel, a deformable convolution first estimates shifts for the kernel and then performs the convolution. Examples of such methods:
* EDVR (the enhanced deformable video restoration) can be divided into two main modules: the pyramid, cascading and deformable (PCD) module for alignment and the temporal-spatial attention (TSA) module for fusion
* DNLN (the deformable non-local network) has an alignment module (based on deformable convolution with the hierarchical feature fusion module (HFFB) for better quality) and a non-local attention module
* TDAN (the temporally deformable alignment network) consists of an alignment module and a reconstruction module. Alignment is performed by deformable convolution based on feature extraction and alignment
* Multi-Stage Feature Fusion Network for Video Super-Resolution uses multi-scale dilated deformable convolution for frame alignment and the Modulative Feature Fusion Branch to integrate aligned frames


Aligned by homography

Some methods align frames using a homography calculated between frames.
* TGA (Temporal Group Attention) divides input frames into N groups depending on time difference and extracts information from each group independently. A Fast Spatial Alignment module based on homography is used to align frames


Spatial non-aligned

Methods without alignment do not perform alignment as a first step and just process input frames.
* VSRResNet, like a GAN, consists of a generator and a discriminator. The generator upsamples input frames, extracts features and fuses them. The discriminator assesses the quality of the resulting high-resolution frames
* FFCVSR (frame and feature-context video super-resolution) takes unaligned low-resolution frames and previously output high-resolution frames to simultaneously restore high-frequency details and maintain temporal consistency
* MRMNet (the multi-resolution mixture network) consists of three modules: bottleneck, exchange, and residual. The bottleneck unit extracts features that have the same resolution as the input frames. The exchange module exchanges features between neighboring frames and enlarges feature maps. The residual module extracts features after the exchange one
* STMN (the spatio-temporal matching network) uses the discrete wavelet transform to fuse temporal features. A non-local matching block integrates super-resolution and denoising. At the final step, the SR result is obtained in the global wavelet domain
* MuCAN (the multi-correspondence aggregation network) uses a temporal multi-correspondence strategy to fuse temporal features and cross-scale nonlocal correspondence to extract self-similarities in frames


3D convolutions

While 2D convolutions work in the spatial domain, 3D convolutions use both spatial and temporal information. They perform motion compensation and maintain temporal consistency.
* DUF (the dynamic upsampling filters) uses deformable 3D convolution for motion compensation. The model estimates kernels for specific input frames
* FSTRN (the fast spatio-temporal residual network) includes a few modules: the LR video shallow feature extraction net (LFENet), the LR feature fusion and up-sampling module (LSRNet) and two residual modules: spatio-temporal and global
* 3DSRnet (the 3D super-resolution network) uses 3D convolutions to extract spatio-temporal information. The model also has a special approach for frames where a scene change is detected
* MP3D (the multi-scale pyramid 3D convolutional network) uses 3D convolution to extract spatial and temporal features simultaneously, which are then passed through a reconstruction module with 3D sub-pixel convolution for upsampling
* DMBN (the dynamic multiple branch network) has three branches to exploit information from multiple resolutions. Finally, information from the branches is fused dynamically
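To make the difference from 2D convolution concrete, the sketch below applies a single 3D kernel to a short clip stored as nested lists indexed `[time][row][column]`. It is a naive illustration under stated assumptions: a single channel, no padding ("valid" output), stride 1, and the deep-learning convention in which the kernel is not flipped (cross-correlation). The name `conv3d_valid` is hypothetical.

```python
def conv3d_valid(video, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) clip without padding.

    Each output value mixes pixels from kt consecutive frames, which is how
    a 3D convolution sees temporal as well as spatial context.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                for a in range(kt):
                    for b in range(kh):
                        for c in range(kw):
                            acc += kernel[a][b][c] * video[t + a][i + b][j + c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

A kernel with `kt = 1` reduces this to an ordinary 2D convolution applied to each frame independently, with no temporal mixing.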


Recurrent neural networks

Recurrent convolutional neural networks perform video super-resolution by storing temporal dependencies.
* STCN (the spatio-temporal convolutional network) extracts features in the spatial module and passes them through the recurrent temporal module and the final reconstruction module. Temporal consistency is maintained by a long short-term memory (LSTM) mechanism
* BRCN (the bidirectional recurrent convolutional network) has two subnetworks: one with forward fusion and one with backward fusion. The result of the network is a composition of the two branches' outputs
* RISTN (the residual invertible spatio-temporal network) consists of spatial, temporal and reconstruction modules. The spatial module is composed of residual invertible blocks (RIB), which extract spatial features effectively. The output of the spatial module is processed by the temporal module, which extracts spatio-temporal information and then fuses important features. The final result is calculated in the reconstruction module by a deconvolution operation
* RRCN (the residual recurrent convolutional network) is a bidirectional recurrent network, which calculates a residual image. The final result is then obtained by adding a bicubically upsampled input frame
* RRN (the recurrent residual network) uses a recurrent sequence of residual blocks to extract spatial and temporal information
* BTRPN (the bidirectional temporal-recurrent propagation network) uses a bidirectional recurrent scheme. The final result is combined from two branches with a channel attention mechanism
* RLSP (recurrent latent state propagation) is a fully convolutional network cell with highly efficient propagation of temporal information through a hidden state
* RSDN (the recurrent structure-detail network) divides the input frame into structure and detail components and processes them in two parallel streams


Non-local

Non-local methods extract both spatial and temporal information. The key idea is to compute the response at each position as a weighted sum over all possible positions. This strategy may be more effective than local approaches.
* PFNL (the progressive fusion non-local method) extracts spatio-temporal features with non-local residual blocks, then fuses them with a progressive fusion residual block (PFRB). The result of these blocks is a residual image. The final result is obtained by adding a bicubically upsampled input frame
* NLVSR (the novel video super-resolution network) aligns frames with the target one by a temporal-spatial non-local operation. To integrate information from aligned frames, an attention-based mechanism is used
* MSHPFNL also incorporates a multi-scale structure and hybrid convolutions to extract wide-range dependencies. To avoid artifacts such as flickering or ghosting, it uses generative adversarial training
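The weighted-sum idea behind non-local operations can be written down compactly. The sketch below is an assumption-laden illustration rather than the block used by any method above: every position (over all frames, flattened into one list) is represented by a feature vector, similarity is assumed to be a dot product, and weights are normalized with a softmax. The name `non_local` is hypothetical.

```python
import math


def non_local(features):
    """Replace each feature vector by a weighted sum over all positions.

    features: list of equal-length vectors, one per spatio-temporal position.
    Weights are a softmax over dot-product similarities (an assumed choice).
    """
    out = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [
                sum(w * key[d] for w, key in zip(weights, features))
                for d in range(len(query))
            ]
        )
    return out
```

Because every position attends to every other one, the cost grows quadratically with the number of positions, which is one reason practical networks apply such blocks to downsampled feature maps.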


Metrics

The common way to estimate the performance of video super-resolution
algorithms is to use a few metrics:

* PSNR (peak signal-to-noise ratio) measures the difference between two corresponding frames based on the mean squared error (''MSE'')
* SSIM (Structural Similarity Index) measures the similarity of structure between two corresponding frames
* IFC (Information Fidelity Criterion) shows information similarity with the reference frame
* MOVIE (Motion-based Video Integrity Evaluation index) integrates explicit motion information by estimating distortions along motion trajectories
* VMAF (Video Multimethod Assessment Fusion) predicts subjective video quality based on a reference and a distorted video sequence
* VIF (Visual Information Fidelity) is a full-reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system
* LPIPS (Learned Perceptual Image Patch Similarity) compares the perceptual similarity of frames based on high-order image structure
* tOF measures pixel-wise motion similarity with the reference frame based on optical flow
* tLP calculates how LPIPS changes from frame to frame in comparison with the reference sequence
* FSIM (Feature Similarity Index for Image Quality) uses phase congruency as the primary feature to measure the similarity between two corresponding frames

Few objective metrics currently exist for verifying a video super-resolution method's ability to restore real details, and research in this area is ongoing.

Another way to assess the performance of a video super-resolution algorithm is to organize a subjective evaluation. People are asked to compare the corresponding frames, and the final mean opinion score (MOS) is calculated as the arithmetic mean of all ratings.
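As a rough illustration of PSNR and of the mean opinion score described above, the following minimal sketch computes both in plain Python. It assumes grayscale frames stored as lists of rows with 8-bit values (peak value 255) and averages PSNR over frames, which is one common convention rather than the only one. The function names are illustrative and not taken from any particular library.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, max_value=255.0):
    """PSNR in decibels; identical frames give infinity."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def video_psnr(reference, restored, max_value=255.0):
    """Average of per-frame PSNR values (assumed convention)."""
    scores = [psnr(r, s, max_value) for r, s in zip(reference, restored)]
    return sum(scores) / len(scores)


def mean_opinion_score(ratings):
    """MOS as the arithmetic mean of individual ratings."""
    return sum(ratings) / len(ratings)
```

For example, `psnr([[100, 100]], [[110, 90]])` has an MSE of 100 and therefore evaluates to about 28.13 dB, while `mean_opinion_score([4, 5, 3])` gives 4.0.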


Datasets

While deep learning approaches to video super-resolution outperform traditional ones, it is crucial to form a high-quality dataset for evaluation. Such a dataset should verify a model's ability to restore small details, text, and objects with complicated structure, and to cope with large motion and noise.


Benchmarks

A few benchmarks in video super-resolution were organized by companies and conferences. The purposes of such challenges are to compare diverse algorithms and to find the state-of-the-art for the task.


NTIRE 2019 Challenge

The NTIRE 2019 Challenge was organized by CVPR and proposed two tracks for video super-resolution: clean (only bicubic degradation) and blur (blur applied first). Each track had more than 100 participants, and 14 final results were submitted.
The REDS dataset was collected for this challenge. It consists of 30 videos of 100 frames each. The resolution of ground-truth frames is 1280×720. The tested scale factor is 4. PSNR and SSIM were used to evaluate the models' performance. The best participants' results are presented in the table.


Youku-VESR Challenge 2019

The Youku-VESR Challenge was organized to check models' ability to cope with the degradation and noise that are typical of the Youku online video-watching application. The proposed dataset consists of 1000 videos, each 4–6 seconds long. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. PSNR and VMAF metrics were used for performance evaluation. The top methods are presented in the table.


AIM 2019 Challenge

The challenge was held by ECCV and had two tracks on video extreme super-resolution. The first track checks fidelity with the reference frame (measured by PSNR and SSIM). The second track checks the perceptual quality of videos (MOS). The dataset consists of 328 video sequences of 120 frames each. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 16. The top methods are presented in the table.


AIM 2020 Challenge

The conditions of this challenge are the same as those of the AIM 2019 Challenge. The top methods are presented in the table.


MSU Video Super-Resolution Benchmark

The MSU Video Super-Resolution Benchmark was organized by MSU and proposed three types of motion, two ways to lower resolution, and eight types of content in the dataset. The resolution of ground-truth frames is 1920×1280. The tested scale factor is 4. Fourteen models were tested. PSNR and SSIM with shift compensation were used to evaluate the models' performance. The benchmark also proposed a few new metrics: ERQAv1.0, QRCRv1.0, and CRRMv1.0. The top methods are presented in the table.
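The benchmark's exact shift-compensation procedure is not described here. As a hedged sketch of the general idea, the code below searches a small window of integer pixel shifts and keeps the shift that gives the highest PSNR on the overlapping region. The integer-only search, the window size `max_shift`, and the function names are all assumptions made for illustration. A real evaluation may use sub-pixel shifts or a different search.

```python
import math


def psnr_on_overlap(ref, out, dx, dy, max_value=255.0):
    """PSNR between ref[y][x] and out[y + dy][x + dx] over the overlap."""
    height, width = len(ref), len(ref[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            sy, sx = y + dy, x + dx
            if 0 <= sy < height and 0 <= sx < width:
                total += (ref[y][x] - out[sy][sx]) ** 2
                count += 1
    if count == 0:
        return -math.inf  # no overlapping pixels for this shift
    err = total / count
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


def shift_compensated_psnr(ref, out, max_shift=1):
    """Return (best_psnr, dx, dy) over integer shifts in [-max_shift, max_shift]."""
    best = (-math.inf, 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = psnr_on_overlap(ref, out, dx, dy)
            if score > best[0]:
                best = (score, dx, dy)
    return best
```

The point of this design is that a model whose output is globally displaced by a pixel is not penalized as if it had lost detail. Note that larger shifts shrink the overlap, so in practice the search window is kept small relative to the frame size.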


MSU Super-Resolution for Video Compression Benchmark

The MSU Super-Resolution for Video Compression Benchmark was organized by MSU. This benchmark tests models' ability to work with compressed videos. The dataset consists of 9 videos, compressed with different video codec standards and different bitrates. Models are ranked by BSQ-rate over subjective score. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. Seventeen models were tested, and 5 video codecs were used to compress the ground-truth videos. The top combinations of super-resolution methods and video codecs are presented in the table.


Application

Many areas that work with video have to deal with different types of video degradation, including downscaling. The resolution of video can be degraded because of imperfections of measuring devices, such as optical degradations and the limited size of camera sensors. Poor lighting and weather conditions add noise to video. Object and camera motion also decrease video quality. Super-resolution techniques help to restore the original video. This is useful in a wide range of applications, such as:

* video surveillance (to improve video captured from the camera and recognize license plates and faces)
* medical imaging (to better examine organs or tissues for clinical analysis and medical intervention)
* forensic science (to help in the investigation during criminal procedure)
* astronomy (to improve the quality of video of stars and planets)
* remote sensing (to ease observation of an object)
* microscopy (to strengthen microscopes' capabilities)

It also helps with object detection and with face and character recognition (as a preprocessing step). Interest in super-resolution is growing with the development of high-definition computer displays and TVs.

Video super-resolution finds practical use in some modern smartphones and cameras, where it is used to reconstruct digital photographs. Reconstructing details in digital photographs is a difficult task since these photographs are already incomplete: the camera sensor elements measure only the intensity of the light, not its color directly. A process called demosaicing is used to reconstruct the photos from partial color information. A single frame does not provide enough data to fill in the missing colors, but some of the missing information can be recovered from multiple images taken one after the other. This process is known as burst photography and can be used to restore a single image of good quality from multiple sequential frames.

When many sequential photos are captured with a smartphone or handheld camera, there is always some movement between the frames because of hand motion. This hand tremor can be exploited by combining the information in those images: a single image is chosen as the "base" or reference frame, and every other frame is aligned relative to it.

There are situations where hand motion is simply not present because the device is stabilized (e.g. placed on a tripod). In that case natural hand motion can be simulated by intentionally moving the camera slightly. The movements are extremely small, so they do not interfere with regular photos. These motions can be observed on a Google Pixel 3 phone by holding it perfectly still (e.g. pressing it against a window) and maximally pinch-zooming the viewfinder.
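The align-to-base step of burst photography can be sketched as follows. This is a deliberately simplified illustration, not the pipeline used by any particular phone: it assumes grayscale frames stored as lists of rows, estimates only a global integer shift per frame by minimizing the mean absolute difference against the base frame, and merges by plain averaging. Real systems rely on sub-pixel alignment and more robust merging, which is where the resolution gain comes from. All function names and the `max_shift` parameter are hypothetical.

```python
def mean_abs_diff(base, frame, dx, dy):
    """Mean absolute difference between base[y][x] and frame[y + dy][x + dx]."""
    height, width = len(base), len(base[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            sy, sx = y + dy, x + dx
            if 0 <= sy < height and 0 <= sx < width:
                total += abs(base[y][x] - frame[sy][sx])
                count += 1
    return total / count if count else None


def estimate_shift(base, frame, max_shift=2):
    """Integer (dx, dy) that best aligns `frame` to `base` within the window."""
    best_cost, best_shift = None, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = mean_abs_diff(base, frame, dx, dy)
            if cost is not None and (best_cost is None or cost < best_cost):
                best_cost, best_shift = cost, (dx, dy)
    return best_shift


def merge_burst(frames, base_index=0, max_shift=2):
    """Align every frame to the base frame and average the aligned pixels."""
    base = frames[base_index]
    height, width = len(base), len(base[0])
    sums = [[float(v) for v in row] for row in base]
    counts = [[1] * width for _ in range(height)]
    for index, frame in enumerate(frames):
        if index == base_index:
            continue
        dx, dy = estimate_shift(base, frame, max_shift)
        for y in range(height):
            for x in range(width):
                sy, sx = y + dy, x + dx
                if 0 <= sy < height and 0 <= sx < width:
                    sums[y][x] += frame[sy][sx]
                    counts[y][x] += 1
    return [[sums[y][x] / counts[y][x] for x in range(width)]
            for y in range(height)]
```

With integer shifts only, the merge mainly reduces noise. Pixels near the border that some frames do not cover are averaged over fewer samples, which the per-pixel `counts` array accounts for.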


See also

* Super-resolution imaging
* Image resolution
* High definition video
* Display resolution
* Ultra-high-definition television
* Oversampling
* High-dynamic-range video

