Video super-resolution

Comparison of VSR and SISR outputs: VSR restores more details by using temporal information.

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency across frames.


Many approaches have been proposed for this task, but it remains a popular and challenging problem.


Mathematical explanation

Most research considers the degradation process of frames as

$\{y_t\} = (\{x_t\} \ast k)\downarrow_s + \{n_t\},$

where:

$\{x_t\}$ — original high-resolution frame sequence,
$k$ — blur kernel,
$\ast$ — convolution operation,
$\downarrow_s$ — downscaling operation,
$\{n_t\}$ — additive noise,
$\{y_t\}$ — low-resolution frame sequence.

Super-resolution is the inverse operation, so the problem is to estimate a frame sequence $\{\hat{x}_t\}$ from the given low-resolution sequence $\{y_t\}$ so that $\{\hat{x}_t\}$ is close to the original $\{x_t\}$. The blur kernel, downscaling operation and additive noise should be estimated for the given input to achieve better results.
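
As an illustration, the degradation model can be simulated directly. The following sketch assumes a Gaussian blur kernel, direct s-fold decimation and Gaussian noise, which are common but not universal choices; applying it to every frame of a high-resolution sequence yields the corresponding low-resolution sequence.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade_frame(hr_frame: np.ndarray, scale: int = 4,
                  blur_sigma: float = 1.5, noise_sigma: float = 0.01) -> np.ndarray:
    """Synthesize a low-resolution frame y = (x * k) downscaled + n
    from a high-resolution frame x (H x W array, values in [0, 1])."""
    blurred = gaussian_filter(hr_frame, sigma=blur_sigma)        # x * k (Gaussian blur kernel)
    downscaled = blurred[::scale, ::scale]                       # s-fold decimation
    noisy = downscaled + np.random.normal(0.0, noise_sigma, downscaled.shape)  # + n
    return np.clip(noisy, 0.0, 1.0)

# Applying degrade_frame to every frame of {x_t} produces the LR sequence {y_t}.
```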

Video super-resolution approaches tend to have more components than their image counterparts, as they need to exploit the additional temporal dimension, and complex designs are not uncommon. The essential components of VSR pipelines can be grouped into four basic functionalities: propagation, alignment, aggregation, and upsampling. [1]

Methods

When working with video, temporal information can be used to improve upscaling quality. Single-image super-resolution methods can also be applied, generating high-resolution frames independently from their neighbours, but this is less effective and introduces temporal instability. There are a few traditional methods that treat the video super-resolution task as an optimization problem; in recent years, deep learning-based methods for video upscaling have outperformed them.

Traditional methods

There are several traditional methods for video upscaling. These methods exploit natural image priors and estimate motion between frames; the high-resolution frame is then reconstructed using both.

Frequency domain

First, the low-resolution frames are transformed to the frequency domain, the high-resolution frame is estimated in this domain, and the result is finally transformed back to the spatial domain. Some methods use the Fourier transform, which helps to extend the spectrum of the captured signal and thereby increase resolution. There are different approaches within this family: weighted least squares theory, [2] the total least squares (TLS) algorithm, [3] space-varying [4] or spatio-temporal [5] varying filtering. Other methods use the wavelet transform, which helps to find similarities in neighboring local areas. [6] Later, the second-generation wavelet transform was used for video super-resolution. [7]
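
As a toy illustration of the frequency-domain idea (not of any particular method cited above), the sketch below upscales a single grayscale frame by zero-padding its centered Fourier spectrum, which corresponds to sinc interpolation; the multi-frame methods above additionally combine information from several observed frames.

```python
import numpy as np

def fft_upscale(frame: np.ndarray, scale: int = 2) -> np.ndarray:
    """Upscale one grayscale frame by zero-padding its centered spectrum
    (sinc interpolation). Illustrates the transform -> estimate -> inverse
    transform pipeline, not a full multi-frame SR method."""
    h, w = frame.shape
    spectrum = np.fft.fftshift(np.fft.fft2(frame))          # centered spectrum
    padded = np.zeros((h * scale, w * scale), dtype=complex)
    top, left = (h * scale - h) // 2, (w * scale - w) // 2
    padded[top:top + h, left:left + w] = spectrum            # embed into larger spectrum
    upscaled = np.fft.ifft2(np.fft.ifftshift(padded)) * (scale ** 2)
    return np.real(upscaled)
```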

Spatial domain

Iterative back-projection methods assume an imaging function between the high-resolution and low-resolution frames and iteratively refine the estimated high-resolution frame at each step. [8] Projections onto convex sets (POCS), which define a specific cost function, can also be used in iterative methods. [9]
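
A minimal sketch of the iterative back-projection idea, assuming a known Gaussian blur and integer decimation as the forward model (real methods estimate or learn these and use more careful back-projection kernels):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def iterative_back_projection(lr: np.ndarray, scale: int = 2, blur_sigma: float = 1.0,
                              n_iters: int = 10, step: float = 1.0) -> np.ndarray:
    """Refine an HR estimate so that its simulated degradation matches the observed LR frame."""
    hr = zoom(lr, scale, order=3)                                    # initial HR guess
    for _ in range(n_iters):
        simulated_lr = gaussian_filter(hr, blur_sigma)[::scale, ::scale]  # forward model
        error = lr - simulated_lr                                    # residual in LR space
        error_up = zoom(error, scale, order=3)                       # lift error to HR grid
        hr = hr + step * gaussian_filter(error_up, blur_sigma)       # back-project the error
    return hr
```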

Iterative adaptive filtering algorithms use a Kalman filter to estimate the transformation from the low-resolution frame to the high-resolution one. [10] To improve the final result, these methods consider temporal correlations among the low-resolution sequence; some approaches also consider temporal correlations among the high-resolution sequence. [11] A common way to approximate the Kalman filter is to use least mean squares (LMS). [12] One can also use steepest descent, [13] least squares (LS), [14] or recursive least squares (RLS). [14]

Direct methods estimate motion between frames, upscale a reference frame, and warp neighboring frames to the high-resolution reference. To construct the result, these upscaled frames are fused by a median filter, [15] a weighted median filter, [16] adaptive normalized averaging, an AdaBoost classifier [17] or SVD-based filters. [18]

Non-parametric algorithms join motion estimation and frame fusion into one step, which is performed by considering patch similarities. Weights for fusion can be calculated by non-local means filters. [19] To strengthen the search for similar patches, one can use a rotation-invariant similarity measure [20] or an adaptive patch size. [21] Calculating intra-frame similarities helps to preserve small details and edges. [22] Parameters for fusion can also be calculated by kernel regression. [23]

Probabilistic methods use statistical theory to solve the task. Maximum likelihood (ML) methods estimate the most probable image. [24] [25] Another group of methods uses maximum a posteriori (MAP) estimation. The regularization parameter for MAP can be estimated via Tikhonov regularization. [26] Markov random fields (MRFs) are often used along with MAP and help to preserve similarity in neighboring patches. [27] Huber MRFs are used to preserve sharp edges, [28] while Gaussian MRFs remove noise but can smooth some edges. [29]

Deep learning based methods

Aligned by motion estimation and motion compensation

In approaches with alignment, neighboring frames are first aligned with the target one. Frames can be aligned by performing motion estimation and motion compensation (MEMC) or by using deformable convolution (DC). Motion estimation gives information about the motion of pixels between frames; motion compensation is a warping operation that aligns one frame to another based on this motion information (a minimal flow-based warping sketch is given after the list below). Examples of such methods:

  • Deep-DE [30] (deep draft-ensemble learning) generates a series of SR feature maps and then processes them together to estimate the final frame
  • VSRnet [31] is based on SRCNN (a model for single-image super-resolution) but takes multiple frames as input. Input frames are first aligned by the Druleas algorithm
  • VESPCN [32] uses a spatial motion compensation transformer module (MCT), which estimates and compensates motion. Then a series of convolutions is performed to extract features and fuse them
  • DRVSR [33] (detail-revealing deep video super-resolution) consists of three main steps: motion estimation, motion compensation and fusion. The motion compensation transformer (MCT) is used for motion estimation, and the sub-pixel motion compensation layer (SPMC) compensates motion. The fusion step uses an encoder-decoder architecture and a ConvLSTM module to unite information from both spatial and temporal dimensions
  • RVSR [34] (robust video super-resolution) has two branches: one for spatial alignment and another for temporal adaptation. The final frame is a weighted sum of the branches' outputs
  • FRVSR [35] (frame-recurrent video super-resolution) estimates low-resolution optical flow, upsamples it to high resolution and warps the previous output frame using this high-resolution optical flow
  • STTN [36] (the spatio-temporal transformer network) estimates optical flow with a U-Net-style network and compensates motion by trilinear interpolation
  • SOF-VSR [37] (super-resolution optical flow for video super-resolution) calculates high-resolution optical flow in a coarse-to-fine manner. The low-resolution optical flow is then obtained by a space-to-depth transformation, and the final super-resolution result is produced from the aligned low-resolution frames
  • TecoGAN [38] (the temporally coherent GAN) consists of a generator and a discriminator. The generator estimates LR optical flow between consecutive frames, approximates the HR optical flow from it and yields the output frame. The discriminator assesses the quality of the generator
  • TOFlow [39] (task-oriented flow) is a combination of an optical flow network and a reconstruction network. The estimated optical flow is tailored to a particular task, such as video super-resolution
  • MMCNN [40] (the multi-memory convolutional neural network) aligns frames with the target one and then generates the final HR result through feature extraction, detail fusion and feature reconstruction modules
  • RBPN [41] (the recurrent back-projection network). The input of each recurrent projection module consists of features from the previous frame, features from the sequence of frames, and the optical flow between neighboring frames
  • MEMC-Net [42] (the motion estimation and motion compensation network) uses both a motion estimation network and a kernel estimation network to warp frames adaptively
  • RTVSR [43] (real-time video super-resolution) aligns frames with an estimated convolutional kernel
  • MultiBoot VSR [44] (the multi-stage multi-reference bootstrapping method) aligns frames and then uses a two-stage SR reconstruction to improve quality
  • BasicVSR [45] aligns frames with optical flow and then fuses their features in a recurrent bidirectional scheme
  • IconVSR [45] is a refined version of BasicVSR with a recurrent coupled propagation scheme
  • UVSR [46] (unrolled network for video super-resolution) adapts unrolled optimization algorithms to solve the VSR problem
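
A minimal sketch of the MEMC operation shared by the methods above, using OpenCV's Farnebäck optical flow as a stand-in for the learned motion estimators (the specific estimators and warping layers differ from method to method):

```python
import cv2
import numpy as np

def align_to_reference(ref_gray: np.ndarray, neighbor_gray: np.ndarray) -> np.ndarray:
    """Align a neighboring frame to the reference frame: dense optical flow
    (motion estimation) followed by backward warping (motion compensation).
    Both inputs are 8-bit grayscale frames of the same size."""
    # Motion estimation: dense flow from the reference to the neighbor.
    flow = cv2.calcOpticalFlowFarneback(ref_gray, neighbor_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = ref_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Motion compensation: sample the neighbor at the flow-displaced positions.
    return cv2.remap(neighbor_gray, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```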

Aligned by deformable convolution

Another way to align neighboring frames with the target one is deformable convolution. While an ordinary convolution uses a fixed sampling grid, a deformable convolution first estimates offsets for the kernel's sampling positions and then performs the convolution at the shifted positions (a minimal sketch follows the list below). Examples of such methods:

  • EDVR [47] (The enhanced deformable video restoration) can be divided into two main modules: the pyramid, cascading and deformable (PCD) module for alignment and the temporal-spatial attention (TSA) module for fusion
  • DNLN [48] (the deformable non-local network) has an alignment module based on deformable convolution (with a hierarchical feature fusion block (HFFB) for better quality) and a non-local attention module
  • TDAN [49] (the temporally deformable alignment network) consists of an alignment module and a reconstruction module. Alignment is performed by deformable convolutions applied to the extracted features
  • Multi-Stage Feature Fusion Network [50] for Video Super-Resolution uses the multi-scale dilated deformable convolution for frame alignment and the Modulative Feature Fusion Branch to integrate aligned frames
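
A minimal alignment block in this spirit can be sketched with PyTorch and torchvision's deformable convolution; the offset-prediction layer and feature dimensions here are illustrative assumptions, not the design of any particular method above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlignBlock(nn.Module):
    """Predict sampling offsets from the concatenated neighbor/reference features,
    then sample the neighbor features with a deformable convolution."""
    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        # 2 offsets (x, y) per kernel sampling location.
        self.offset_conv = nn.Conv2d(channels * 2, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, neighbor_feat: torch.Tensor, ref_feat: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(torch.cat([neighbor_feat, ref_feat], dim=1))
        return self.deform_conv(neighbor_feat, offsets)

# aligned = DeformableAlignBlock()(neighbor_feat, ref_feat)   # both of shape (N, 64, H, W)
```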

Aligned by homography

Some methods align frames using a homography calculated between frames (a minimal sketch follows the list below).

  • TGA [51] (temporal group attention) divides the input frames into N groups depending on the time difference to the target frame and extracts information from each group independently. A fast spatial alignment module based on homography is used to align frames
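
A minimal homography-based alignment sketch using OpenCV; ORB features and RANSAC are one possible choice, and a single homography only models global, roughly planar motion, which is why such alignment is usually followed by further refinement.

```python
import cv2
import numpy as np

def align_by_homography(ref_gray: np.ndarray, neighbor_gray: np.ndarray) -> np.ndarray:
    """Estimate a global homography from the neighbor to the reference frame
    from ORB keypoint matches, then warp the neighbor accordingly."""
    orb = cv2.ORB_create(2000)
    kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
    kp_nbr, des_nbr = orb.detectAndCompute(neighbor_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_nbr, des_ref)
    src = np.float32([kp_nbr[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)      # robust estimation
    h, w = ref_gray.shape
    return cv2.warpPerspective(neighbor_gray, H, (w, h))
```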

Spatial non-aligned

Methods without alignment do not perform alignment as a first step and just process input frames.

  • VSRResNet [52], like a GAN, consists of a generator and a discriminator. The generator upsamples the input frames, extracts features and fuses them; the discriminator assesses the quality of the resulting high-resolution frames
  • FFCVSR [53] (frame and feature-context video super-resolution) takes unaligned low-resolution frames together with previously output high-resolution frames to simultaneously restore high-frequency details and maintain temporal consistency
  • MRMNet [54] (the multi-resolution mixture network) consists of three modules: bottleneck, exchange, and residual. The bottleneck unit extracts features that have the same resolution as the input frames. The exchange module exchanges features between neighboring frames and enlarges the feature maps. The residual module extracts features after the exchange module
  • STMN [55] (the spatio-temporal matching network) uses the discrete wavelet transform to fuse temporal features. A non-local matching block integrates super-resolution and denoising. In the final step, the SR result is obtained in the global wavelet domain
  • MuCAN [56] (the multi-correspondence aggregation network) uses a temporal multi-correspondence strategy to fuse temporal features and cross-scale non-local correspondence to exploit self-similarities within frames

3D convolutions

While 2D convolutions work in the spatial domain, 3D convolutions use both spatial and temporal information. They implicitly perform motion compensation and maintain temporal consistency (a minimal sketch follows the list below).

  • DUF [57] (the dynamic upsampling filters) uses deformable 3D convolution for motion compensation. The model estimates kernels for specific input frames
  • FSTRN [58] (The fast spatio-temporal residual network) includes a few modules: LR video shallow feature extraction net (LFENet), LR feature fusion and up-sampling module (LSRNet) and two residual modules: spatio-temporal and global
  • 3DSRnet [59] (the 3D super-resolution network) uses 3D convolutions to extract spatio-temporal information. The model also has a special approach for frames where a scene change is detected
  • MP3D [60] (the multi-scale pyramid 3D convolutional network) uses 3D convolutions to extract spatial and temporal features simultaneously, which are then passed through a reconstruction module with 3D sub-pixel convolution for upsampling
  • DMBN [61] (the dynamic multiple branch network) has three branches to exploit information from multiple resolutions. Finally, the information from the branches is fused dynamically
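
A toy 3D-convolutional VSR model can be sketched in PyTorch as follows; the layer sizes and the sub-pixel upsampling head are illustrative assumptions, not the architecture of any method above.

```python
import torch
import torch.nn as nn

class Simple3DSRNet(nn.Module):
    """Extract spatio-temporal features from a short frame window with Conv3d,
    then upscale the center frame with a sub-pixel (pixel-shuffle) layer."""
    def __init__(self, channels: int = 32, scale: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_hr = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # frames: (N, 3, T, H, W)
        feat = self.features(frames)                           # (N, C, T, H, W)
        center = feat[:, :, feat.shape[2] // 2]                # center frame's features
        return self.to_hr(center)                              # (N, 3, H*scale, W*scale)
```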

Recurrent neural networks

Recurrent convolutional neural networks perform video super-resolution by propagating temporal dependencies through a hidden state (a minimal recurrent cell sketch follows the list below).

  • STCN [62] (the spatio-temporal convolutional network) extracts features in a spatial module and passes them through a recurrent temporal module and a final reconstruction module. Temporal consistency is maintained by a long short-term memory (LSTM) mechanism
  • BRCN [63] (the bidirectional recurrent convolutional network) has two subnetworks: one with forward fusion and one with backward fusion. The result of the network is a composition of the two branches' outputs
  • RISTN [64] (the residual invertible spatio-temporal network) consists of spatial, temporal and reconstruction modules. The spatial module is composed of residual invertible blocks (RIB), which extract spatial features effectively. The output of the spatial module is processed by the temporal module, which extracts spatio-temporal information and then fuses the important features. The final result is calculated in the reconstruction module by a deconvolution operation
  • RRCN [65] (the residual recurrent convolutional network) is a bidirectional recurrent network that calculates a residual image. The final result is obtained by adding the bicubically upsampled input frame
  • RRN [66] (the recurrent residual network) uses a recurrent sequence of residual blocks to extract spatial and temporal information
  • BTRPN [67] (the bidirectional temporal-recurrent propagation network) uses a bidirectional recurrent scheme. The final result is combined from the two branches with a channel attention mechanism
  • RLSP [68] (recurrent latent state propagation) is a fully convolutional recurrent cell with highly efficient propagation of temporal information through a hidden state
  • RSDN [69] (the recurrent structure-detail network) divides the input frame into structure and detail components and processes them in two parallel streams
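
A minimal recurrent VSR cell can be sketched in PyTorch as follows; the hidden state carries information from previous frames and the cell is applied frame by frame (the cells used by the methods above are considerably more elaborate).

```python
import torch
import torch.nn as nn

class RecurrentVSRCell(nn.Module):
    """Process the current LR frame together with a hidden state that
    accumulates temporal information from earlier frames."""
    def __init__(self, channels: int = 64, scale: int = 4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 + channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, lr_frame, hidden):          # lr_frame: (N, 3, H, W), hidden: (N, C, H, W)
        hidden = self.fuse(torch.cat([lr_frame, hidden], dim=1))
        return self.upsample(hidden), hidden      # HR frame and updated hidden state

# cell = RecurrentVSRCell(); hidden = torch.zeros(1, 64, 64, 64)
# for lr in lr_frames:                            # lr: (1, 3, 64, 64)
#     sr, hidden = cell(lr, hidden)
```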

Non-local methods

Non-local methods extract both spatial and temporal information. The key idea is to compute each output position as a weighted sum over all possible positions, which can be more effective than purely local approaches (a minimal non-local block sketch follows the list below). PFNL (the progressive fusion non-local method) extracts spatio-temporal features with non-local residual blocks and then fuses them with progressive fusion residual blocks (PFRB). These blocks output a residual image, and the final result is obtained by adding the bicubically upsampled input frame.

  • NLVSR [70] (the novel video super-resolution network) aligns frames with the target one by a temporal-spatial non-local operation. An attention-based mechanism is used to integrate information from the aligned frames
  • MSHPFNL [71] extends PFNL with a multi-scale structure and hybrid convolutions to extract wide-range dependencies. To avoid artifacts such as flickering or ghosting, it uses generative adversarial training
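
A minimal (spatial) non-local block illustrating the weighted sum over all positions; in VSR the same idea is applied over positions drawn from several frames, and the channel sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NonLocalBlock2d(nn.Module):
    """Update each position with a weighted sum over all positions,
    with weights given by pairwise feature similarity."""
    def __init__(self, channels: int = 64, reduced: int = 32):
        super().__init__()
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (N, C, H, W)
        n, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)             # (N, HW, r)
        k = self.phi(x).flatten(2)                               # (N, r, HW)
        v = self.g(x).flatten(2).transpose(1, 2)                 # (N, HW, r)
        attn = torch.softmax(q @ k, dim=-1)                      # similarity between all position pairs
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)      # weighted sum over all positions
        return x + self.out(y)                                   # residual connection
```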

Metrics

Top: original sequence. Bottom: PSNR (peak signal-to-noise ratio) visualization of the output of a VSR method.

The common way to estimate the performance of video super-resolution algorithms is to use objective metrics such as PSNR and SSIM, which measure fidelity to the ground truth, and perceptual metrics such as LPIPS and VMAF.
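
For reference, PSNR, the most widely reported of these metrics, can be computed per frame as in the sketch below and is usually averaged over the sequence.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio between a ground-truth and a restored frame
    (pixel values assumed to lie in [0, peak])."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# For a video, PSNR is typically averaged over frames; SSIM is computed analogously per frame.
```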

Currently there are not many objective metrics that verify a video super-resolution method's ability to restore real details; research in this area is ongoing.

Another way to assess the performance of a video super-resolution algorithm is to organize a subjective evaluation: people are asked to compare the corresponding frames, and the final mean opinion score (MOS) is calculated as the arithmetic mean over all ratings.

Datasets

As deep learning approaches to video super-resolution outperform traditional ones, it is crucial to form high-quality datasets for evaluation. It is important to verify models' ability to restore small details, text, and objects with complicated structure, and to cope with large motion and noise.

Comparison of datasets
Dataset | Videos | Mean video length | Ground-truth resolution | Motion in frames | Fine details
Vid4 | 4 | 43 frames | 720×480 | Without fast motion | Some small details, without text
SPMCS | 30 | 31 frames | 960×540 | Slow motion | A lot of small details
Vimeo-90K (test SR set) | 7824 | 7 frames | 448×256 | A lot of fast, difficult, diverse motion | Few details, text in a few sequences
Xiph HD (complete sets) | 70 | 2 seconds | from 640×360 to 4096×2160 | A lot of fast, difficult, diverse motion | Few details, text in a few sequences
Ultra Video Dataset 4K | 16 | 10 seconds | 4096×2160 | Diverse motion | Few details, without text
REDS (test SR) | 30 | 100 frames | 1280×720 | A lot of fast, difficult, diverse motion | Few details, without text
Space-Time SR | 5 | 100 frames | 1280×720 | Diverse motion | Without small details and text
Harmonic | | | 4096×2160 | |
CDVL | | | 1920×1080 | |

Benchmarks

A few benchmarks in video super-resolution were organized by companies and conferences. The purposes of such challenges are to compare diverse algorithms and to find the state-of-the-art for the task.

Comparison of benchmarks
Benchmark | Organizer | Dataset | Upscale factor | Metrics
NTIRE 2019 Challenge | CVPR (Conference on Computer Vision and Pattern Recognition) | REDS | 4 | PSNR, SSIM
Youku-VESR Challenge 2019 | Youku | Youku-VESR | 4 | PSNR, VMAF
AIM 2019 Challenge | ECCV (European Conference on Computer Vision) | Vid3oC | 16 | PSNR, SSIM, MOS
AIM 2020 Challenge | ECCV (European Conference on Computer Vision) | Vid3oC | 16 | PSNR, SSIM, LPIPS
Mobile Video Restoration Challenge | ICIP (International Conference on Image Processing), Kwai | | | PSNR, SSIM, MOS
MSU Video Super-Resolution Benchmark 2021 | MSU (Moscow State University) | | 4 | ERQAv1.0, PSNR and SSIM with shift compensation, QRCRv1.0, CRRMv1.0
MSU Super-Resolution for Video Compression Benchmark 2022 | MSU (Moscow State University) | | 4 | ERQAv2.0, PSNR, MS-SSIM, VMAF, LPIPS

NTIRE 2019 Challenge

The NTIRE 2019 Challenge was organized by CVPR and proposed two tracks for video super-resolution: clean (only bicubic degradation) and blur (blur added first). Each track had more than 100 participants, and 14 final results were submitted.
The REDS dataset was collected for this challenge. It consists of 30 videos of 100 frames each. The resolution of ground-truth frames is 1280×720. The tested scale factor is 4. PSNR and SSIM were used to evaluate the models' performance. The best participants' results are presented in the table:

Top teams
Team | Model name | PSNR (clean track) | SSIM (clean track) | PSNR (blur track) | SSIM (blur track) | Runtime per image in sec (clean track) | Runtime per image in sec (blur track) | Platform | GPU | Open source
HelloVSR | EDVR | 31.79 | 0.8962 | 30.17 | 0.8647 | 2.788 | 3.562 | PyTorch | TITAN Xp | Yes
UIUC-IFP | WDVR | 30.81 | 0.8748 | 29.46 | 0.8430 | 0.980 | 0.980 | PyTorch | Tesla V100 | Yes
SuperRior | ensemble of RDN, RCAN, DUF | 31.13 | 0.8811 | | | 120.000 | | PyTorch | Tesla V100 | No
Cyberverse | SanDiegoRecNet | 31.00 | 0.8822 | 27.71 | 0.8067 | 3.000 | 3.000 | TensorFlow | RTX 2080 Ti | Yes
TTI | RBPN | 30.97 | 0.8804 | 28.92 | 0.8333 | 1.390 | 1.390 | PyTorch | TITAN X | Yes
NERCMS | PFNL | 30.91 | 0.8782 | 28.98 | 0.8307 | 6.020 | 6.020 | PyTorch | GTX 1080 Ti | Yes
XJTU-IAIR | FSTDN | 28.86 | 0.8301 | | | 13.000 | | PyTorch | GTX 1080 Ti | No

Youku-VESR Challenge 2019

The Youku-VESR Challenge was organized to check models' ability to cope with the degradation and noise found in real videos from the Youku online video-watching application. The proposed dataset consists of 1000 videos, each 4–6 seconds long. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. PSNR and VMAF were used for performance evaluation. Top methods are presented in the table:

Top teams
Team | PSNR | VMAF
Avengers Assemble | 37.851 | 41.617
NJU_L1 | 37.681 | 41.227
ALONG_NTES | 37.632 | 40.405

AIM 2019 Challenge

The challenge was held by ECCV and had two tracks on extreme video super-resolution: the first track measures fidelity to the reference frames (PSNR and SSIM), while the second track assesses the perceptual quality of the videos (MOS). The dataset consists of 328 video sequences of 120 frames each. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 16. Top methods are presented in the table:

Top teams
Team | Model name | PSNR | SSIM | MOS | Runtime per image in sec | Platform | GPU/CPU | Open source
fenglinglwb | based on EDVR | 22.53 | 0.64 | first result | 0.35 | PyTorch | 4× Titan X | No
NERCMS | PFNL | 22.35 | 0.63 | | 0.51 | PyTorch | 2× 1080 Ti | No
baseline | RLSP | 21.75 | 0.60 | | 0.09 | TensorFlow | Titan Xp | No
HIT-XLab | based on EDSR | 21.45 | 0.60 | second result | 60.00 | PyTorch | V100 | No

AIM 2020 Challenge

The challenge's conditions are the same as in the AIM 2019 Challenge. Top methods are presented in the table:

Top teams
Team | Model name | Params number | PSNR | SSIM | Runtime per image in sec | GPU/CPU | Open source
KirinUK | EVESRNet | 45.29M | 22.83 | 0.6450 | 6.1 | 1 × 2080 Ti | No
Team-WVU | | 29.51M | 22.48 | 0.6378 | 4.9 | 1 × Titan Xp | No
BOE-IOT-AIBD | 3D-MGBP | 53M | 22.48 | 0.6304 | 4.83 | 1 × 1080 | No
sr xxx | based on EDVR | | 22.43 | 0.6353 | 4 | 1 × V100 | No
ZZX | MAHA | 31.14M | 22.28 | 0.6321 | 4 | 1 × 1080 Ti | No
lyl | FineNet | | 22.08 | 0.6256 | 13 | | No
TTI | based on STARnet | | 21.91 | 0.6165 | 0.249 | | No
CET CVLab | | | 21.77 | 0.6112 | 0.04 | 1 × P100 | No

MSU Video Super-Resolution Benchmark

The MSU Video Super-Resolution Benchmark was organized by MSU; its dataset covers three types of motion, two ways of lowering resolution, and eight types of content. The resolution of ground-truth frames is 1920×1280. The tested scale factor is 4. 14 models were tested. PSNR and SSIM with shift compensation were used to evaluate the models' performance, along with a few newly proposed metrics: ERQAv1.0, QRCRv1.0, and CRRMv1.0. [72] Top methods are presented in the table:

Top methods
Model name | Multi-frame | Subjective | ERQAv1.0 | PSNR | SSIM | QRCRv1.0 | CRRMv1.0 | Runtime per image in sec | Open source
DBVSR | Yes | 5.561 | 0.737 | 31.071 | 0.894 | 0.629 | 0.992 | | Yes
LGFN | Yes | 5.040 | 0.740 | 31.291 | 0.898 | 0.629 | 0.996 | 1.499 | Yes
DynaVSR-R | Yes | 4.751 | 0.709 | 28.377 | 0.865 | 0.557 | 0.997 | 5.664 | Yes
TDAN | Yes | 4.036 | 0.706 | 30.244 | 0.883 | 0.557 | 0.994 | | Yes
DUF-28L | Yes | 3.910 | 0.645 | 25.852 | 0.830 | 0.549 | 0.993 | 2.392 | Yes
RRN-10L | Yes | 3.887 | 0.627 | 24.252 | 0.790 | 0.557 | 0.989 | 0.390 | Yes
RealSR | No | 3.749 | 0.690 | 25.989 | 0.767 | 0.000 | 0.886 | | Yes

MSU Super-Resolution for Video Compression Benchmark

The MSU Super-Resolution for Video Compression Benchmark was organized by MSU. This benchmark tests models' ability to work with compressed videos. The dataset consists of 9 videos, compressed with different video codecs at different bitrates. Models are ranked by BSQ-rate [73] over subjective score. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. 17 models were tested, and 5 video codecs were used to compress the ground-truth videos. Top combinations of super-resolution methods and video codecs are presented in the table:

Top methods
Model name | BSQ-rate (subjective score) | BSQ-rate (ERQAv2.0) | BSQ-rate (VMAF) | BSQ-rate (PSNR) | BSQ-rate (MS-SSIM) | BSQ-rate (LPIPS) | Open source
RealSR + x264 | 0.196 | 0.770 | 0.775 | 0.675 | 0.487 | 0.591 | Yes
ahq-11 + x264 | 0.271 | 0.883 | 0.753 | 0.873 | 0.719 | 0.656 | No
SwinIR + x264 | 0.304 | 0.760 | 0.642 | 6.268 | 0.736 | 0.559 | Yes
Real-ESRGAN + x264 | 0.335 | 5.580 | 0.698 | 7.874 | 0.881 | 0.733 | Yes
SwinIR + x265 | 0.346 | 1.575 | 1.304 | 8.130 | 4.641 | 1.474 | Yes
COMISR + x264 | 0.367 | 0.969 | 1.302 | 6.081 | 0.672 | 1.118 | Yes
RealSR + x265 | 0.502 | 1.622 | 1.617 | 1.064 | 1.033 | 1.206 | Yes

Application

In many areas of work with video we deal with different types of degradation, including downscaling. The resolution of a video can be degraded because of imperfections of measuring devices, such as optical degradations and the limited size of camera sensors. Bad lighting and weather conditions add noise, and object and camera motion further decrease video quality. Super-resolution techniques help to restore the original video and are useful in a wide range of applications.

Super-resolution also helps in tasks such as object detection and face and character recognition, where it serves as a preprocessing step. Interest in super-resolution is growing with the development of high-definition computer displays and TVs.

Simulating natural hand movements by "jiggling" the camera.

Video super-resolution finds its practical use in some modern smartphones and cameras, where it is used to reconstruct digital photographs.

Reconstructing details on digital photographs is a difficult task since these photographs are already incomplete: the camera sensor elements measure only the intensity of the light, not its color directly. A process called demosaicing is used to reconstruct the photos from this partial color information. A single frame does not give enough data to fill in the missing colors; however, some of the missing information can be recovered from multiple images taken one after the other. This process is known as burst photography and can be used to restore a single image of good quality from multiple sequential frames.

When we capture many sequential photos with a smartphone or handheld camera, there is always some movement between the frames because of hand motion. We can take advantage of this hand tremor by combining the information from those images: a single image is chosen as the "base" or reference frame, and every other frame is aligned relative to it.
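
A minimal sketch of this align-and-merge idea, assuming purely translational shifts between grayscale frames (real burst pipelines additionally handle rotation, local motion and robust, detail-preserving merging):

```python
import cv2
import numpy as np

def merge_burst(frames):
    """Align a burst of slightly shifted grayscale frames to the first (reference)
    frame with phase correlation and average them."""
    ref = frames[0].astype(np.float32)
    aligned = [ref]
    for frame in frames[1:]:
        frame = frame.astype(np.float32)
        (dx, dy), _ = cv2.phaseCorrelate(ref, frame)      # estimated shift of frame vs. reference
        undo_shift = np.float32([[1, 0, -dx], [0, 1, -dy]])  # translation that undoes the shift
        aligned.append(cv2.warpAffine(frame, undo_shift, (ref.shape[1], ref.shape[0])))
    return np.mean(aligned, axis=0)                        # simple merge by averaging
```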

There are situations where hand motion is simply not present because the device is stabilized (e.g. placed on a tripod). In that case, natural hand motion can be simulated by intentionally moving the camera very slightly. The movements are extremely small, so they do not interfere with regular photos. You can observe these motions on the Google Pixel 3 [74] phone by holding it perfectly still (e.g. pressing it against a window) and maximally pinch-zooming the viewfinder.


References

  1. Chan, Kelvin CK, et al. "BasicVSR: The search for essential components in video super-resolution and beyond." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  2. Kim, S. P.; Bose, N. K.; Valenzuela, H. M. (1989). "Reconstruction of high resolution image from noise undersampled frames". Lecture Notes in Control and Information Sciences. Vol. 129. Berlin/Heidelberg: Springer-Verlag. pp. 315–326. doi:10.1007/bfb0042742. ISBN   3-540-51424-4.
  3. Bose, N.K.; Kim, H.C.; Zhou, B. (1994). "Performance analysis of the TLS algorithm for image reconstruction from a sequence of undersampled noisy and blurred frames". Proceedings of 1st International Conference on Image Processing. Vol. 3. IEEE Comput. Soc. Press. pp. 571–574. doi:10.1109/icip.1994.413741. ISBN   0-8186-6952-7.
  4. Tekalp, A.M.; Ozkan, M.K.; Sezan, M.I. (1992). "High-resolution image reconstruction from lower-resolution image sequences and space-varying image restoration". [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE. pp. 169–172 vol.3. doi:10.1109/icassp.1992.226249. ISBN   0-7803-0532-9.
  5. Goldberg, N.; Feuer, A.; Goodwin, G.C. (2003). "Super-resolution reconstruction using spatio-temporal filtering". Journal of Visual Communication and Image Representation. 14 (4). Elsevier BV: 508–525. doi:10.1016/s1047-3203(03)00042-7. ISSN   1047-3203.
  6. Mallat, S (2010). "Super-Resolution With Sparse Mixing Estimators". IEEE Transactions on Image Processing. 19 (11). Institute of Electrical and Electronics Engineers (IEEE): 2889–2900. Bibcode:2010ITIP...19.2889M. doi:10.1109/tip.2010.2049927. ISSN   1057-7149. PMID   20457549. S2CID   856101.
  7. Bose, N.K.; Lertrattanapanich, S.; Chappalli, M.B. (2004). "Superresolution with second generation wavelets". Signal Processing: Image Communication. 19 (5). Elsevier BV: 387–391. doi:10.1016/j.image.2004.02.001. ISSN   0923-5965.
  8. Cohen, B.; Avrin, V.; Dinstein, I. (2000). "Polyphase back-projection filtering for resolution enhancement of image sequences". 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100). Vol. 4. IEEE. pp. 2171–2174. doi:10.1109/icassp.2000.859267. ISBN   0-7803-6293-4.
  9. Katsaggelos, A.K. (1997). "An iterative weighted regularized algorithm for improving the resolution of video sequences". Proceedings of International Conference on Image Processing. IEEE Comput. Soc. pp. 474–477. doi:10.1109/icip.1997.638811. ISBN   0-8186-8183-7.
  10. Farsiu, Sina; Elad, Michael; Milanfar, Peyman (2006-01-15). "A practical approach to superresolution". In Apostolopoulos, John G.; Said, Amir (eds.). Visual Communications and Image Processing 2006. Vol. 6077. SPIE. p. 607703. doi:10.1117/12.644391.
  11. Jing Tian; Kai-Kuang Ma (2005). "A new state-space approach for super-resolution image sequence reconstruction". IEEE International Conference on Image Processing 2005. IEEE. pp. I-881. doi:10.1109/icip.2005.1529892. ISBN   0-7803-9134-9.
  12. Costa, Guilherme Holsbach; Bermudez, Jos Carlos Moreira (2007). "Statistical Analysis of the LMS Algorithm Applied to Super-Resolution Image Reconstruction". IEEE Transactions on Signal Processing. 55 (5). Institute of Electrical and Electronics Engineers (IEEE): 2084–2095. Bibcode:2007ITSP...55.2084C. doi:10.1109/tsp.2007.892704. ISSN   1053-587X. S2CID   52857681.
  13. Elad, M.; Feuer, A. (1999). "Super-resolution reconstruction of continuous image sequences". Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348). Vol. 3. IEEE. pp. 459–463. doi:10.1109/icip.1999.817156. ISBN   0-7803-5467-2.
  14. 1 2 Elad, M.; Feuer, A. (1999). "Superresolution restoration of an image sequence: adaptive filtering approach". IEEE Transactions on Image Processing. 8 (3). Institute of Electrical and Electronics Engineers (IEEE): 387–395. Bibcode:1999ITIP....8..387E. doi:10.1109/83.748893. ISSN   1057-7149. PMID   18262881.
  15. Pickering, M.; Frater, M.; Arnold, J. (2005). "Arobust approach to super-resolution sprite generation". IEEE International Conference on Image Processing 2005. IEEE. pp. I-897. doi:10.1109/icip.2005.1529896. ISBN   0-7803-9134-9.
  16. Nasonov, Andrey V.; Krylov, Andrey S. (2010). "Fast Super-Resolution Using Weighted Median Filtering". 2010 20th International Conference on Pattern Recognition. IEEE. pp. 2230–2233. doi:10.1109/icpr.2010.546. ISBN   978-1-4244-7542-1.
  17. Simonyan, K.; Grishin, S.; Vatolin, D.; Popov, D. (2008). "Fast video super-resolution via classification". 2008 15th IEEE International Conference on Image Processing. IEEE. pp. 349–352. doi:10.1109/icip.2008.4711763. ISBN   978-1-4244-1765-0.
  18. Nasir, Haidawati; Stankovic, Vladimir; Marshall, Stephen (2011). "Singular value decomposition based fusion for super-resolution image reconstruction". 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA). IEEE. pp. 393–398. doi:10.1109/icsipa.2011.6144138. ISBN   978-1-4577-0242-6.
  19. Protter, M.; Elad, M.; Takeda, H.; Milanfar, P. (2009). "Generalizing the Nonlocal-Means to Super-Resolution Reconstruction". IEEE Transactions on Image Processing. 18 (1). Institute of Electrical and Electronics Engineers (IEEE): 36–51. Bibcode:2009ITIP...18...36P. doi:10.1109/tip.2008.2008067. ISSN   1057-7149. PMID   19095517. S2CID   2142115.
  20. Zhuo, Yue; Liu, Jiaying; Ren, Jie; Guo, Zongming (2012). "Nonlocal based Super Resolution with rotation invariance and search window relocation". 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. pp. 853–856. doi:10.1109/icassp.2012.6288018. ISBN   978-1-4673-0046-9.
  21. Cheng, Ming-Hui; Chen, Hsuan-Ying; Leou, Jin-Jang (2011). "Video super-resolution reconstruction using a mobile search strategy and adaptive patch size". Signal Processing. 91 (5). Elsevier BV: 1284–1297. Bibcode:2011SigPr..91.1284C. doi:10.1016/j.sigpro.2010.12.016. ISSN   0165-1684. S2CID   17920263.
  22. Huhle, Benjamin; Schairer, Timo; Jenke, Philipp; Straßer, Wolfgang (2010). "Fusion of range and color images for denoising and resolution enhancement with a non-local filter". Computer Vision and Image Understanding. 114 (12). Elsevier BV: 1336–1345. doi:10.1016/j.cviu.2009.11.004. ISSN   1077-3142.
  23. Takeda, Hiroyuki; Farsiu, Sina; Milanfar, Peyman (2007). "Kernel Regression for Image Processing and Reconstruction". IEEE Transactions on Image Processing. 16 (2). Institute of Electrical and Electronics Engineers (IEEE): 349–366. Bibcode:2007ITIP...16..349T. doi:10.1109/tip.2006.888330. ISSN   1057-7149. PMID   17269630. S2CID   12116009.
  24. Elad, M.; Feuer, A. (1997). "Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images". IEEE Transactions on Image Processing. 6 (12). Institute of Electrical and Electronics Engineers (IEEE): 1646–1658. Bibcode:1997ITIP....6.1646E. doi:10.1109/83.650118. ISSN   1057-7149. PMID   18285235.
  25. Farsiu, Sina; Robinson, Dirk; Elad, Michael; Milanfar, Peyman (2003-11-20). "Robust shift and add approach to superresolution". In Tescher, Andrew G. (ed.). Applications of Digital Image Processing XXVI. Vol. 5203. SPIE. p. 121. doi:10.1117/12.507194.
  26. Chantas, G.K.; Galatsanos, N.P.; Woods, N.A. (2007). "Super-Resolution Based on Fast Registration and Maximum a Posteriori Reconstruction". IEEE Transactions on Image Processing. 16 (7). Institute of Electrical and Electronics Engineers (IEEE): 1821–1830. Bibcode:2007ITIP...16.1821C. doi:10.1109/tip.2007.896664. ISSN   1057-7149. PMID   17605380. S2CID   1811280.
  27. Rajan, D.; Chaudhuri, S. (2001). "Generation of super-resolution images from blurred observations using Markov random fields". 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Vol. 3. IEEE. pp. 1837–1840. doi:10.1109/icassp.2001.941300. ISBN   0-7803-7041-4.
  28. Zibetti, Marcelo Victor Wust; Mayer, Joceli (2006). "Outlier Robust and Edge-Preserving Simultaneous Super-Resolution". 2006 International Conference on Image Processing. IEEE. pp. 1741–1744. doi:10.1109/icip.2006.312718. ISBN   1-4244-0480-0.
  29. Joshi, M.V.; Chaudhuri, S.; Panuganti, R. (2005). "A Learning-Based Method for Image Super-Resolution From Zoomed Observations". IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics. 35 (3). Institute of Electrical and Electronics Engineers (IEEE): 527–537. doi:10.1109/tsmcb.2005.846647. ISSN   1083-4419. PMID   15971920. S2CID   3162908.
  30. Liao, Renjie; Tao, Xin; Li, Ruiyu; Ma, Ziyang; Jia, Jiaya (2015). "Video Super-Resolution via Deep Draft-Ensemble Learning". 2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 531–539. doi:10.1109/iccv.2015.68. ISBN   978-1-4673-8391-2.
  31. Kappeler, Armin; Yoo, Seunghwan; Dai, Qiqin; Katsaggelos, Aggelos K. (2016). "Video Super-Resolution With Convolutional Neural Networks". IEEE Transactions on Computational Imaging. 2 (2). Institute of Electrical and Electronics Engineers (IEEE): 109–122. doi:10.1109/tci.2016.2532323. ISSN   2333-9403. S2CID   9356783.
  32. Caballero, Jose; Ledig, Christian; Aitken, Andrew; Acosta, Alejandro; Totz, Johannes; Wang, Zehan; Shi, Wenzhe (2016-11-16). "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation". arXiv: 1611.05250v2 [cs.CV].
  33. Tao, Xin; Gao, Hongyun; Liao, Renjie; Wang, Jue; Jia, Jiaya (2017). "Detail-Revealing Deep Video Super-Resolution". 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 4482–4490. arXiv: 1704.02738 . doi:10.1109/iccv.2017.479. ISBN   978-1-5386-1032-9.
  34. Liu, Ding; Wang, Zhaowen; Fan, Yuchen; Liu, Xianming; Wang, Zhangyang; Chang, Shiyu; Huang, Thomas (2017). "Robust Video Super-Resolution with Learned Temporal Dynamics". 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2526–2534. doi:10.1109/iccv.2017.274. ISBN   978-1-5386-1032-9.
  35. Sajjadi, Mehdi S. M.; Vemulapalli, Raviteja; Brown, Matthew (2018). "Frame-Recurrent Video Super-Resolution". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. pp. 6626–6634. arXiv: 1801.04590 . doi:10.1109/cvpr.2018.00693. ISBN   978-1-5386-6420-9.
  36. Kim, Tae Hyun; Sajjadi, Mehdi S. M.; Hirsch, Michael; Schölkopf, Bernhard (2018). "Spatio-Temporal Transformer Network for Video Restoration". Computer Vision – ECCV 2018. Lecture Notes in Computer Science. Vol. 11207. Cham: Springer International Publishing. pp. 111–127. doi:10.1007/978-3-030-01219-9_7. ISBN   978-3-030-01218-2. ISSN   0302-9743.
  37. Wang, Longguang; Guo, Yulan; Liu, Li; Lin, Zaiping; Deng, Xinpu; An, Wei (2020). "Deep Video Super-Resolution Using HR Optical Flow Estimation". IEEE Transactions on Image Processing. 29. Institute of Electrical and Electronics Engineers (IEEE): 4323–4336. arXiv: 2001.02129 . Bibcode:2020ITIP...29.4323W. doi:10.1109/tip.2020.2967596. ISSN   1057-7149. PMID   31995491. S2CID   210023539.
  38. Chu, Mengyu; Xie, You; Mayer, Jonas; Leal-Taixé, Laura; Thuerey, Nils (2020-07-08). "Learning temporal coherence via self-supervision for GAN-based video generation". ACM Transactions on Graphics. 39 (4). Association for Computing Machinery (ACM). arXiv: 1811.09393 . doi:10.1145/3386569.3392457. ISSN   0730-0301. S2CID   209460786.
  39. Xue, Tianfan; Chen, Baian; Wu, Jiajun; Wei, Donglai; Freeman, William T. (2019-02-12). "Video Enhancement with Task-Oriented Flow". International Journal of Computer Vision. 127 (8). Springer Science and Business Media LLC: 1106–1125. arXiv: 1711.09078 . doi:10.1007/s11263-018-01144-2. ISSN   0920-5691. S2CID   40412298.
  40. Wang, Zhongyuan; Yi, Peng; Jiang, Kui; Jiang, Junjun; Han, Zhen; Lu, Tao; Ma, Jiayi (2019). "Multi-Memory Convolutional Neural Network for Video Super-Resolution". IEEE Transactions on Image Processing. 28 (5). Institute of Electrical and Electronics Engineers (IEEE): 2530–2544. Bibcode:2019ITIP...28.2530W. doi:10.1109/tip.2018.2887017. ISSN   1057-7149. PMID   30571634. S2CID   58595890.
  41. Haris, Muhammad; Shakhnarovich, Gregory; Ukita, Norimichi (2019). "Recurrent Back-Projection Network for Video Super-Resolution". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 3892–3901. arXiv: 1903.10128 . doi:10.1109/cvpr.2019.00402. ISBN   978-1-7281-3293-8.
  42. Bao, Wenbo; Lai, Wei-Sheng; Zhang, Xiaoyun; Gao, Zhiyong; Yang, Ming-Hsuan (2021-03-01). "MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (3). Institute of Electrical and Electronics Engineers (IEEE): 933–948. arXiv: 1810.08768 . doi:10.1109/tpami.2019.2941941. ISSN   0162-8828. PMID   31722471. S2CID   53046739.
  43. Bare, Bahetiyaer; Yan, Bo; Ma, Chenxi; Li, Ke (2019). "Real-time video super-resolution via motion convolution kernel estimation". Neurocomputing. 367. Elsevier BV: 236–245. doi:10.1016/j.neucom.2019.07.089. ISSN   0925-2312. S2CID   201264266.
  44. Kalarot, Ratheesh; Porikli, Fatih (2019). "MultiBoot Vsr: Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE. pp. 2060–2069. doi:10.1109/cvprw.2019.00258. ISBN   978-1-7281-2506-0.
  45. 1 2 Chan, Kelvin C. K.; Wang, Xintao; Yu, Ke; Dong, Chao; Loy, Chen Change (2020-12-03). "BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond". arXiv: 2012.02181v1 [cs.CV].
  46. Naoto Chiche, Benjamin; Frontera-Pons, Joana; Woiselle, Arnaud; Starck, Jean-Luc (2020-11-09). "Deep Unrolled Network for Video Super-Resolution". 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE. pp. 1–6. arXiv: 2102.11720 . doi:10.1109/ipta50016.2020.9286636. ISBN   978-1-7281-8750-1.
  47. Wang, Xintao; Chan, Kelvin C. K.; Yu, Ke; Dong, Chao; Loy, Chen Change (2019-05-07). "EDVR: Video Restoration with Enhanced Deformable Convolutional Networks". arXiv: 1905.02716v1 [cs.CV].
  48. Wang, Hua; Su, Dewei; Liu, Chuangchuang; Jin, Longcun; Sun, Xianfang; Peng, Xinyi (2019). "Deformable Non-Local Network for Video Super-Resolution". IEEE Access. 7. Institute of Electrical and Electronics Engineers (IEEE): 177734–177744. arXiv: 1909.10692 . Bibcode:2019IEEEA...7q7734W. doi: 10.1109/access.2019.2958030 . ISSN   2169-3536.
  49. Tian, Yapeng; Zhang, Yulun; Fu, Yun; Xu, Chenliang (2020). "TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution". 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 3357–3366. arXiv: 1812.02898 . doi:10.1109/cvpr42600.2020.00342. ISBN   978-1-7281-7168-5.
  50. Song, Huihui; Xu, Wenjie; Liu, Dong; Liua, Bo; Liub, Qingshan; Metaxas, Dimitris N. (2021). "Multi-Stage Feature Fusion Network for Video Super-Resolution". IEEE Transactions on Image Processing. 30. Institute of Electrical and Electronics Engineers (IEEE): 2923–2934. Bibcode:2021ITIP...30.2923S. doi:10.1109/tip.2021.3056868. ISSN   1057-7149. PMID   33560986. S2CID   231864067.
  51. Isobe, Takashi; Li, Songjiang; Jia, Xu; Yuan, Shanxin; Slabaugh, Gregory; Xu, Chunjing; Li, Ya-Li; Wang, Shengjin; Tian, Qi (2020). "Video Super-Resolution With Temporal Group Attention". 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 8005–8014. arXiv: 2007.10595 . doi:10.1109/cvpr42600.2020.00803. ISBN   978-1-7281-7168-5.
  52. Lucas, Alice; Lopez-Tapia, Santiago; Molina, Rafael; Katsaggelos, Aggelos K. (2019). "Generative Adversarial Networks and Perceptual Losses for Video Super-Resolution". IEEE Transactions on Image Processing. 28 (7). Institute of Electrical and Electronics Engineers (IEEE): 3312–3327. arXiv: 1806.05764 . Bibcode:2019ITIP...28.3312L. doi:10.1109/tip.2019.2895768. ISSN   1057-7149. PMID   30714918. S2CID   73415655.
  53. Yan, Bo; Lin, Chuming; Tan, Weimin (2019-09-28). "Frame and Feature-Context Video Super-Resolution". arXiv: 1909.13057v1 [cs.CV].
  54. Tian, Zhiqiang; Wang, Yudiao; Du, Shaoyi; Lan, Xuguang (2020-07-10). Yang, You (ed.). "A multiresolution mixture generative adversarial network for video super-resolution". PLOS ONE. 15 (7). Public Library of Science (PLoS): e0235352. Bibcode:2020PLoSO..1535352T. doi: 10.1371/journal.pone.0235352 . ISSN   1932-6203. PMC   7351143 . PMID   32649694.
  55. Zhu, Xiaobin; Li, Zhuangzi; Lou, Jungang; Shen, Qing (2021). "Video super-resolution based on a spatio-temporal matching network". Pattern Recognition. 110: 107619. Bibcode:2021PatRe.11007619Z. doi:10.1016/j.patcog.2020.107619. ISSN   0031-3203. S2CID   225285804.
  56. Li, Wenbo; Tao, Xin; Guo, Taian; Qi, Lu; Lu, Jiangbo; Jia, Jiaya (2020-07-23). "MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution". arXiv: 2007.11803v1 [cs.CV].
  57. Jo, Younghyun; Oh, Seoung Wug; Kang, Jaeyeon; Kim, Seon Joo (2018). "Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. pp. 3224–3232. doi:10.1109/cvpr.2018.00340. ISBN   978-1-5386-6420-9.
  58. Li, Sheng; He, Fengxiang; Du, Bo; Zhang, Lefei; Xu, Yonghao; Tao, Dacheng (2019-04-05). "Fast Spatio-Temporal Residual Network for Video Super-Resolution". arXiv: 1904.02870v1 [cs.CV].
  59. Kim, Soo Ye; Lim, Jeongyeon; Na, Taeyoung; Kim, Munchurl (2019). "Video Super-Resolution Based on 3D-CNNS with Consideration of Scene Change". 2019 IEEE International Conference on Image Processing (ICIP). pp. 2831–2835. doi:10.1109/ICIP.2019.8803297. ISBN   978-1-5386-6249-6. S2CID   202763112.
  60. Luo, Jianping; Huang, Shaofei; Yuan, Yuan (2020). "Video Super-Resolution using Multi-scale Pyramid 3D Convolutional Networks". Proceedings of the 28th ACM International Conference on Multimedia. pp. 1882–1890. doi:10.1145/3394171.3413587. ISBN   9781450379885. S2CID   222278621.
  61. Zhang, Dongyang; Shao, Jie; Liang, Zhenwen; Liu, Xueliang; Shen, Heng Tao (2020). "Multi-branch Networks for Video Super-Resolution with Dynamic Reconstruction Strategy". IEEE Transactions on Circuits and Systems for Video Technology. 31 (10): 3954–3966. doi:10.1109/TCSVT.2020.3044451. ISSN   1051-8215. S2CID   235057646.
  62. Aksan, Emre; Hilliges, Otmar (2019-02-18). "STCN: Stochastic Temporal Convolutional Networks". arXiv: 1902.06568v1 [cs.LG].
  63. Huang, Yan; Wang, Wei; Wang, Liang (2018). "Video Super-Resolution via Bidirectional Recurrent Convolutional Networks". IEEE Transactions on Pattern Analysis and Machine Intelligence. 40 (4): 1015–1028. doi:10.1109/TPAMI.2017.2701380. ISSN   0162-8828. PMID   28489532. S2CID   136582.
  64. Zhu, Xiaobin; Li, Zhuangzi; Zhang, Xiao-Yu; Li, Changsheng; Liu, Yaqi; Xue, Ziyu (2019). "Residual Invertible Spatio-Temporal Network for Video Super-Resolution". Proceedings of the AAAI Conference on Artificial Intelligence. 33: 5981–5988. doi: 10.1609/aaai.v33i01.33015981 . ISSN   2374-3468.
  65. Li, Dingyi; Liu, Yu; Wang, Zengfu (2019). "Video Super-Resolution Using Non-Simultaneous Fully Recurrent Convolutional Network". IEEE Transactions on Image Processing. 28 (3): 1342–1355. Bibcode:2019ITIP...28.1342L. doi:10.1109/TIP.2018.2877334. ISSN   1057-7149. PMID   30346282. S2CID   53044490.
  66. Isobe, Takashi; Zhu, Fang; Jia, Xu; Wang, Shengjin (2020-08-13). "Revisiting Temporal Modeling for Video Super-resolution". arXiv: 2008.05765v2 [eess.IV].
  67. Han, Lei; Fan, Cien; Yang, Ye; Zou, Lian (2020). "Bidirectional Temporal-Recurrent Propagation Networks for Video Super-Resolution". Electronics. 9 (12): 2085. doi: 10.3390/electronics9122085 . ISSN   2079-9292.
  68. Fuoli, Dario; Gu, Shuhang; Timofte, Radu (2019-09-17). "Efficient Video Super-Resolution through Recurrent Latent Space Propagation". arXiv: 1909.08080 [eess.IV].
  69. Isobe, Takashi; Jia, Xu; Gu, Shuhang; Li, Songjiang; Wang, Shengjin; Tian, Qi (2020-08-02). "Video Super-Resolution with Recurrent Structure-Detail Network". arXiv: 2008.00455v1 [cs.CV].
  70. Zhou, Chao; Chen, Can; Ding, Fei; Zhang, Dengyin (2021). "Video super-resolution with non-local alignment network". IET Image Processing. 15 (8): 1655–1667. doi: 10.1049/ipr2.12134 . ISSN   1751-9659.
  71. Yi, Peng; Wang, Zhongyuan; Jiang, Kui; Jiang, Junjun; Lu, Tao; Ma, Jiayi (2020). "A Progressive Fusion Generative Adversarial Network for Realistic and Consistent Video Super-Resolution". IEEE Transactions on Pattern Analysis and Machine Intelligence. PP (5): 2264–2280. doi:10.1109/TPAMI.2020.3042298. ISSN   0162-8828. PMID   33270559. S2CID   227282569.
  72. "MSU VSR Benchmark Methodology". Video Processing. 2021-04-26. Retrieved 2021-05-12.
  73. Zvezdakova, A. V.; Kulikov, D. L.; Zvezdakov, S. V.; Vatolin, D. S. (2020). "BSQ-rate: a new approach for video-codec performance comparison and drawbacks of current solutions". Programming and Computer Software. 46 (3): 183–194. doi:10.1134/S0361768820030111. S2CID   219157416.
  74. "See Better and Further with Super Res Zoom on the Pixel 3". Google AI Blog. 2018-10-15.