Multi-focus image fusion is a multiple image compression technique using input images with different focus depths to make one output image that preserves all information.
In recent years, image fusion has been used in many applications such as remote sensing, surveillance, medical diagnosis, and photography applications. Two major applications of image fusion in photography are fusion of multi-focus images and multi-exposure images. [1] [2]
The main idea of image fusion is gathering important and the essential information from the input images into one single image which ideally has all of the information of the input images. [1] [3] [4] [5] The research history of image fusion spans over 30 years and many scientific papers. [2] [6] Image fusion generally has two aspects: image fusion methods and objective evaluation metrics. [6]
In visual sensor networks (VSN), sensors are cameras which record images and video sequences. In many applications of VSN, a camera can't give a perfect illustration including all details of the scene. This is because of the limited depth of focus of the optical lens of cameras. Therefore, just the object located in the focal length of camera is focused and clear, and other parts of the image are blurred.
VSN captures images with different depths of focus using several cameras. Due to the large amount of data generated by cameras compared to other sensors such as pressure and temperature sensors and some limitations of bandwidth, energy consumption and processing time, it is essential to process the local input images to decrease the amount of transmitted data. [2]
Much research on multi-focus image fusion has been done in recent years and can be classified into two categories: transform and spatial domains. Commonly used transforms for image fusion are Discrete cosine transform (DCT) and Multi-Scale Transform (MST). [2] [7] Recently, Deep learning (DL) has been thriving in several image processing and computer vision applications. [1] [3] [8]
Huang and Jing have reviewed and applied several focus measurements in the spatial domain for the multi-focus image fusion process, suitable for real-time applications. They mentioned some focus measurements including variance, energy of image gradient (EOG), Tenenbaum's algorithm (Tenengrad), energy of Laplacian (EOL), sum-modified-Laplacian (SML), and spatial frequency (SF). Their experiments showed that EOL gave better results than other methods like variance and spatial frequency. [9] [5]
Image fusion based on the multi-scale transform is the most commonly used and promising technique. Laplacian pyramid transform, gradient pyramid-based transform, morphological pyramid transform and the premier ones, discrete wavelet transform, shift-invariant wavelet transform (SIDWT), and discrete cosine harmonic wavelet transform (DCHWT) are some examples of image fusion methods based on multi-scale transform. [2] [5] [7] These methods are complex and have some limitations e.g. processing time and energy consumption. For example, multi-focus image fusion methods based on DWT require a lot of convolution operations, so they take more time and energy to process. Therefore, most methods in multi-scale transform are not suitable for real-time applications. [7] [5] Moreover, these methods are not very successful along edges, due to the wavelet transform process missing the edges of the image. They create ringing artefacts in the output image and reduce its quality.
Due to the aforementioned problems in the multi-scale transform methods, researchers are interested in multi-focus image fusion in the DCT domain. DCT-based methods are more efficient in terms of transmission and archiving images coded in Joint Photographic Experts Group (JPEG) standard to the upper node in the VSN agent. A JPEG system consists of a pair of an encoder and a decoder. In the encoder, images are divided into non-overlapping 8×8 blocks, and the DCT coefficients are calculated for each. Since the quantization of DCT coefficients is a lossy process, many of the small-valued DCT coefficients are quantized to zero, which corresponds to high frequencies. DCT-based image fusion algorithms work better when the multi-focus image fusion methods are applied in the compressed domain. [7] [5]
In addition, in the spatial-based methods, the input images must be decoded and then transferred to the spatial domain. After implementation of the image fusion operations, the output fused images must again be encoded. DCT domain-based methods do not require complex and time-consuming consecutive decoding and encoding operations. Therefore, the image fusion methods based on DCT domain operate with much less energy and processing time. [7] [5] Recently, a lot of research has been carried out in the DCT domain. DCT+Variance, DCT+Corr_Eng, DCT+EOL, and DCT+VOL are some prominent examples of DCT based methods. [5] [7]
Nowadays, the deep learning is utilized in image fusion applications such as multi-focus image fusion. Liu et al. were the first researchers that used CNN for multi-focus image fusion. They used the Siamese architecture for comparing the focused and unfocused patches. [4] C. Du et al. submitted MSCNN method that obtains the initial segmented decision map with image segmentation between the focused and unfocused patches through the multi-scale convolution neural network. [10] H. Tang et al. introduced the pixel-wise convolution neural network (p-CNN) for classification of the focused and unfocused patches. [11]
All of these CNN based multi-focus image fusion methods have enhanced the decision map. Nevertheless, their initial segmented decision maps have a lot of weakness and errors. Therefore, satisfaction of their final fusion decision map depends to use vast post-processing algorithms such as Consistency Verification (CV), morphological operations, watershed, guiding filters, and small region removal on the initial segmented decision map. Along with the CNN based multi-focus image fusion methods, fully convolutional network (FCN) is also utilized in multi-focus image fusion. [8] [12]
The Convolutional Neural Networks (CNNs) based multi-focus image fusion methods have recently attracted enormous attention. They greatly enhanced the constructed decision map compared with the previous state of the art methods that have been done in the spatial and transform domains. Nevertheless, these methods have not reached to the satisfactory initial decision map, and they need to undergo vast post-processing algorithms to achieve a satisfactory decision map.
In the method of ECNN, a novel CNNs based method with the help of the ensemble learning is proposed. It is very reasonable to use various models and datasets rather than just one. The ensemble learning based methods intend to pursue increasing diversity among the models and datasets in order to decrease the problem of the overfitting on the training dataset.
It is obvious that the results of an ensemble of CNNs are better than just one single CNNs. Also, the proposed method introduces a new simple type of multi-focus images dataset. It simply changes the arranging of the patches of the multi-focus datasets, which is very useful for obtaining the better accuracy. With this new type arrangement of datasets, the three different datasets including the original and the Gradient in directions of vertical and horizontal patches are generated from the COCO dataset. Therefore, the proposed method introduces a new network that three CNNs models which have been trained on three different created datasets to construct the initial segmented decision map. These ideas greatly improve the initial segmented decision map of the proposed method which is similar, or even better than, the other final decision map of CNNs based methods obtained after applying many post-processing algorithms. Many real multi-focus test images are used in our experiments, and the results are compared with quantitative and qualitative criteria. The obtained experimental results indicate that the proposed CNNs based network is more accurate and have the better decision map without post-processing algorithms than the other existing state of the art multi-focus fusion methods which used many post-processing algorithms.
This method introduces a new network for achieving the cleaner initial segmented decision map compared with the others. The pro- posed method introduces a new architecture which uses an ensemble of three CNNs trained on three different datasets. Also, the proposed method prepares a new simple type of multi- focus image datasets for achieving the better fusion performance than the other popular multi-focus image datasets.
This idea is very helpful to achieve the better initial segmented decision map, which is the same or even better than the others initial segmented decision map by using vast post-processing algorithms.
The source code of ECNN http://amin-naji.com/publications/ and https://github.com/mostafaaminnaji/ECNN
Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission. Algorithms may take advantage of visual perception and the statistical properties of image data to provide superior results compared with generic data compression methods which are used for other digital data.
Digital image processing is the use of a digital computer to process digital images through an algorithm. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and distortion during processing. Since images are defined over two dimensions digital image processing may be modeled in the form of multidimensional systems. The generation and development of digital image processing are mainly affected by three factors: first, the development of computers; second, the development of mathematics ; third, the demand for a wide range of applications in environment, agriculture, military, industry and medical science has increased.
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Sensor fusion is the process of combining sensor data or data derived from disparate sources such that the resulting information has less uncertainty than would be possible when these sources were used individually. For instance, one could potentially obtain a more accurate location estimate of an indoor object by combining multiple data sources such as video cameras and WiFi localization signals. The term uncertainty reduction in this case can mean more accurate, more complete, or more dependable, or refer to the result of an emerging view, such as stereoscopic vision.
In computer science and machine learning, cellular neural networks (CNN) or cellular nonlinear networks (CNN) are a parallel computing paradigm similar to neural networks, with the difference that communication is allowed between neighbouring units only. Typical applications include image processing, analyzing 3D surfaces, solving partial differential equations, reducing non-visual problems to geometric maps, modelling biological vision and other sensory-motor organs.
The image fusion process is defined as gathering all the important information from multiple images, and their inclusion into fewer images, usually a single one. This single image is more informative and accurate than any single source image, and it consists of all the necessary information. The purpose of image fusion is not only to reduce the amount of data but also to construct images that are more appropriate and understandable for the human and machine perception. In computer vision, multisensor image fusion is the process of combining relevant information from two or more images into a single image. The resulting image will be more informative than any of the input images.
Automatic target recognition (ATR) is the ability for an algorithm or device to recognize targets or other objects based on data obtained from sensors.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
Fault detection, isolation, and recovery (FDIR) is a subfield of control engineering which concerns itself with monitoring a system, identifying when a fault has occurred, and pinpointing the type of fault and its location. Two approaches can be distinguished: A direct pattern recognition of sensor readings that indicate a fault and an analysis of the discrepancy between the sensor readings and expected values, derived from some model. In the latter case, it is typical that a fault is said to be detected if the discrepancy or residual goes above a certain threshold. It is then the task of fault isolation to categorize the type of fault and its location in the machinery. Fault detection and isolation (FDI) techniques can be broadly classified into two categories. These include model-based FDI and signal processing based FDI.
Image restoration is the operation of taking a corrupt/noisy image and estimating the clean, original image. Corruption may come in many forms such as motion blur, noise and camera mis-focus. Image restoration is performed by reversing the process that blurred the image and such is performed by imaging a point source and use the point source image, which is called the Point Spread Function (PSF) to restore the image information lost to the blurring process.
Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
Data augmentation is a statistical technique which allows maximum likelihood estimation from incomplete data. Data augmentation has important applications in Bayesian analysis, and the technique is widely used in machine learning to reduce overfitting when training machine learning models, achieved by training models on several slightly-modified copies of existing data.
In computer vision, object co-segmentation is a special case of image segmentation, which is defined as jointly segmenting semantically similar objects in multiple images or video frames.
Visual temporal attention is a special case of visual attention that involves directing attention to specific instant of time. Similar to its spatial counterpart visual spatial attention, these attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance and human interpretable explanation of deep learning models.
Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. This method has been used by artists and designers around the globe to develop new artwork based on existent style(s).
LeNet is a convolutional neural network structure proposed by LeCun et al. in 1998. In general, LeNet refers to LeNet-5 and is a simple convolutional neural network. Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons can respond to a part of the surrounding cells in the coverage range and perform well in large-scale image processing.
Deep learning in photoacoustic imaging combines the hybrid imaging modality of photoacoustic imaging (PA) with the rapidly evolving field of deep learning. Photoacoustic imaging is based on the photoacoustic effect, in which optical absorption causes a rise in temperature, which causes a subsequent rise in pressure via thermo-elastic expansion. This pressure rise propagates through the tissue and is sensed via ultrasonic transducers. Due to the proportionality between the optical absorption, the rise in temperature, and the rise in pressure, the ultrasound pressure wave signal can be used to quantify the original optical energy deposition within the tissue.
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.
A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological features identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data while eliminating the need for explicit instructions and programming. Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components including the solid earth, atmosphere, hydrosphere and biosphere.