Deep learning in photoacoustic imaging combines the hybrid imaging modality of photoacoustic imaging (PA) with the rapidly evolving field of deep learning. Photoacoustic imaging is based on the photoacoustic effect, in which optical absorption causes a rise in temperature, which causes a subsequent rise in pressure via thermo-elastic expansion. [1] This pressure rise propagates through the tissue and is sensed via ultrasonic transducers. Due to the proportionality between the optical absorption, the rise in temperature, and the rise in pressure, the ultrasound pressure wave signal can be used to quantify the original optical energy deposition within the tissue. [2]
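The proportionality described above is commonly written in the following form. The symbols are not defined in this article and are introduced here only for illustration; the relation assumes that the laser pulse is short enough for thermal and stress confinement to hold.

```latex
p_0(\mathbf{r}) = \Gamma \, \mu_a(\mathbf{r}) \, F(\mathbf{r})
```

Here $p_0$ is the initial pressure rise, $\Gamma$ is the dimensionless Grüneisen parameter describing the efficiency of thermo-elastic conversion, $\mu_a$ is the optical absorption coefficient, and $F$ is the local optical fluence, so that $\mu_a F$ is the absorbed optical energy density.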
Deep learning has been applied in both photoacoustic computed tomography (PACT) and photoacoustic microscopy (PAM). PACT utilizes wide-field optical excitation and an array of unfocused ultrasound transducers. [1] Similar to other computed tomography methods, the sample is imaged at multiple view angles, which are then used in an inverse reconstruction algorithm based on the detection geometry (typically universal backprojection, [3] modified delay-and-sum, [4] or time reversal [5] [6] ) to recover the initial pressure distribution within the tissue. PAM, on the other hand, uses focused ultrasound detection combined with weakly focused optical excitation (acoustic resolution PAM or AR-PAM) or tightly focused optical excitation (optical resolution PAM or OR-PAM). [7] PAM typically captures images point-by-point via a mechanical raster scanning pattern. At each scanned point, the acoustic time-of-flight provides axial resolution while the acoustic focusing yields lateral resolution. [1]
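As a rough illustration of how a time-of-flight reconstruction such as delay-and-sum works, the following minimal sketch sums, for every pixel, the sample each sensor recorded at that pixel's acoustic travel time. It is a basic variant, not the modified delay-and-sum of the cited work; the function name, the ring geometry, the sampling rate, and the uniform speed of sound are all assumptions made for the example.

```python
import numpy as np

def delay_and_sum(sinogram, sensor_xy, grid_x, grid_y, fs, c=1500.0):
    """Basic delay-and-sum: add each sensor's sample at the pixel's time-of-flight.

    sinogram  : (n_sensors, n_samples) recorded pressure traces
    sensor_xy : (n_sensors, 2) sensor positions in metres
    grid_x/y  : 1-D pixel coordinates in metres
    fs        : sampling rate in Hz; c: assumed uniform speed of sound in m/s
    """
    n_sensors, n_samples = sinogram.shape
    xx, yy = np.meshgrid(grid_x, grid_y)           # image grid, shape (ny, nx)
    image = np.zeros_like(xx, dtype=float)
    for s in range(n_sensors):
        dist = np.hypot(xx - sensor_xy[s, 0], yy - sensor_xy[s, 1])
        idx = np.rint(dist / c * fs).astype(int)   # nearest sample index
        valid = idx < n_samples
        image[valid] += sinogram[s, idx[valid]]
    return image / n_sensors

# Synthetic check: one point source seen by a ring of 64 ideal sensors.
fs, c = 40e6, 1500.0
angles = np.linspace(0, 2 * np.pi, 64, endpoint=False)
sensor_xy = 20e-3 * np.column_stack([np.cos(angles), np.sin(angles)])
grid_x = np.linspace(-5e-3, 5e-3, 41)
grid_y = np.linspace(-5e-3, 5e-3, 41)
src = np.array([grid_x[28], grid_y[8]])            # source placed on a grid node

# Each sensor records a unit spike at the source's arrival time.
sinogram = np.zeros((64, 1024))
dist = np.hypot(sensor_xy[:, 0] - src[0], sensor_xy[:, 1] - src[1])
sinogram[np.arange(64), np.rint(dist / c * fs).astype(int)] = 1.0

image = delay_and_sum(sinogram, sensor_xy, grid_x, grid_y, fs, c)
peak_row, peak_col = np.unravel_index(np.argmax(image), image.shape)
```

In this idealized setting the reconstructed peak falls on the pixel containing the source. With fewer sensors, a partial ring, or band-limited traces, the same procedure produces the streak and blur artifacts that the deep learning methods below aim to remove.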
The first application of deep learning in PACT was by Reiter et al., [8] in which a deep neural network was trained to learn spatial impulse responses and locate photoacoustic point sources. The resulting mean axial and lateral point location errors on 2,412 randomly selected test images were 0.28 mm and 0.37 mm, respectively. After this initial implementation, the applications of deep learning in PACT have branched out primarily into removing artifacts caused by acoustic reflections, [9] sparse sampling, [10] [11] [12] limited view, [13] [14] [15] and limited bandwidth. [16] [14] [17] [18] There has also been some recent work in PACT toward using deep learning for wavefront localization. [19] Fusion-based networks have also been proposed that combine information from two different reconstructions to improve the final image. [20]
Traditional photoacoustic beamforming techniques modeled photoacoustic wave propagation by using detector array geometry and the time-of-flight to account for differences in the PA signal arrival time. However, this technique failed to account for reverberant acoustic signals caused by acoustic reflection, resulting in acoustic reflection artifacts that corrupt the true photoacoustic point source location information. In Reiter et al., [8] a convolutional neural network (similar to a simple VGG-16 [21] style architecture) was used that took pre-beamformed photoacoustic data as input and outputted a classification result specifying the 2-D point source location.
Johnstonbaugh et al. [19] were able to localize the source of photoacoustic wavefronts with a deep neural network. The network used was an encoder-decoder style convolutional neural network made of residual convolution, upsampling, and high field-of-view convolution modules. A Nyquist convolution layer and a differentiable spatial-to-numerical transform layer were also used within the architecture. Simulated PA wavefronts served as the input for training the model. To create the wavefronts, the forward simulation of light propagation was done with the NIRFAST toolbox and the light-diffusion approximation, while the forward simulation of sound propagation was done with the k-Wave toolbox. The simulated wavefronts were subjected to different scattering media and Gaussian noise. The output of the network was an artifact-free heat map of the target's axial and lateral position. The network had a mean error of less than 30 microns when localizing targets at depths below 40 mm and a mean error of 1.06 mm when localizing targets between 40 mm and 60 mm. [19] With a slight modification to the network, the model was able to accommodate multi-target localization. [19] A validation experiment was performed in which pencil lead was submerged in an intralipid solution at a depth of 32 mm. The network was able to localize the lead's position when the solution had a reduced scattering coefficient of 0, 5, 10, and 15 cm−1. [19] The results show improvements over standard delay-and-sum or frequency-domain beamforming algorithms, and Johnstonbaugh et al. propose that this technology could be used for optical wavefront shaping, circulating melanoma cell detection, and real-time vascular surgeries. [19]
Building on the work of Reiter et al., [8] Allman et al. [9] utilized a full VGG-16 [21] architecture to locate point sources and remove reflection artifacts within raw photoacoustic channel data (in the presence of multiple sources and channel noise). The network was trained on simulated data produced with the MATLAB k-Wave library, and the results were later confirmed on experimental data.
In PACT, tomographic reconstruction is performed, in which the projections from multiple solid angles are combined to form an image. When reconstruction methods like filtered backprojection or time reversal are applied to ill-posed inverse problems [22] caused by sampling below the Nyquist-Shannon sampling requirement or by limited bandwidth or limited view, the resulting reconstruction contains image artifacts. Traditionally these artifacts were removed with slow iterative methods like total variation minimization, but the advent of deep learning approaches has opened a new avenue that utilizes a priori knowledge from network training to remove artifacts. In the deep learning methods that seek to remove these sparse sampling, limited-bandwidth, and limited-view artifacts, the typical workflow involves first performing the ill-posed reconstruction technique to transform the pre-beamformed data into a 2-D representation of the initial pressure distribution that contains artifacts. Then, a convolutional neural network (CNN) is trained to remove the artifacts, in order to produce an artifact-free representation of the ground truth initial pressure distribution.
When the density of uniform tomographic view angles is below what is prescribed by the Nyquist-Shannon sampling theorem, the imaging system is said to be performing sparse sampling. Sparse sampling is typically used to keep production costs low and improve image acquisition speed. [10] The typical network architectures used to remove these sparse sampling artifacts are U-net [10] [12] and Fully Dense (FD) U-net. [11] Both of these architectures contain a compression and a decompression phase. The compression phase learns to compress the image to a latent representation that lacks the imaging artifacts and other details. [23] The decompression phase then combines this representation with information passed by the residual connections in order to add back image details without adding in the details associated with the artifacts. [23] FD U-net modifies the original U-net architecture by including dense blocks that allow layers to utilize information learned by previous layers within the dense block. [11] Another technique was proposed that uses a simple CNN-based architecture to remove artifacts and improve the k-Wave image reconstruction. [17]
When a range of solid angles is not captured, generally due to geometric limitations, the image acquisition is said to have limited view. [24] As illustrated by the experiments of Davoudi et al., [12] limited-view corruptions can be directly observed as missing information in the frequency domain of the reconstructed image. Limited view, similar to sparse sampling, makes the initial reconstruction algorithm ill-posed. Prior to deep learning, the limited-view problem was addressed with complex hardware such as acoustic deflectors [25] and full ring-shaped transducer arrays, [12] [26] as well as solutions like compressed sensing, [27] [28] [29] [30] [31] weighted factor, [32] and iterative filtered backprojection. [33] [34] The result of this ill-posed reconstruction is imaging artifacts that can be removed by CNNs. The deep learning algorithms used to remove limited-view artifacts include U-net [12] [15] [35] and FD U-net, [36] as well as generative adversarial networks (GANs) [14] and volumetric versions of U-net. [13] One GAN implementation of note improved upon U-net by using U-net as a generator and VGG as a discriminator, with the Wasserstein metric and gradient penalty to stabilize training (WGAN-GP). [14]
Guan et al. [36] were able to apply an FD U-net to remove artifacts from simulated limited-view reconstructed PA images. PA images reconstructed with the time-reversal process from PA data collected with either 16, 32, or 64 sensors served as the input to the network, and the ground truth images served as the desired output. The network was able to remove artifacts created in the time-reversal process from synthetic, mouse brain, fundus, and lung vasculature phantoms. [36] This process was similar to the work of Davoudi et al. on clearing artifacts from sparse and limited-view images. [12] To improve the speed of reconstruction and to allow the FD U-net to use more information from the sensors, Guan et al. proposed using a pixel-wise interpolation as the input to the network instead of a reconstructed image. [36] Using a pixel-wise interpolation removes the need to produce an initial image, which may lose small details or make details unrecoverable by obscuring them with artifacts. To create the pixel-wise interpolation, the time-of-flight for each pixel was calculated using the wave propagation equation. Next, a reconstruction grid was created from pressure measurements calculated from the pixels' time-of-flight. Using the reconstruction grid as an input, the FD U-net was able to create artifact-free reconstructed images. This pixel-wise interpolation method was faster and achieved better peak signal-to-noise ratios (PSNR) and structural similarity index measures (SSIM) than the artifact-free images created when the time-reversal images served as the input to the FD U-net. [36] It was also significantly faster than the computationally intensive iterative approach while achieving comparable PSNR and SSIM. [36] The pixel-wise method proposed in this study was only demonstrated for in silico experiments with a homogeneous medium, but Guan et al. posit that the pixel-wise method can be used for real-time PAT rendering. [36]
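The pixel-wise interpolation step can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function name, the linear interpolation, the ring geometry, and the uniform speed of sound are assumptions. For each sensor, the time-of-flight to every pixel is computed and the sensor's trace is sampled at that time, giving one image-sized channel per sensor.

```python
import numpy as np

def pixelwise_interpolation(sinogram, sensor_xy, grid_x, grid_y, fs, c=1500.0):
    """Return a (n_sensors, ny, nx) grid of per-sensor pressure values.

    Each channel holds one sensor's trace, linearly interpolated at the
    time-of-flight from that sensor to every pixel (uniform speed of sound c).
    """
    n_sensors, n_samples = sinogram.shape
    t = np.arange(n_samples) / fs                  # sample times in seconds
    xx, yy = np.meshgrid(grid_x, grid_y)
    out = np.empty((n_sensors,) + xx.shape)
    for s in range(n_sensors):
        tof = np.hypot(xx - sensor_xy[s, 0], yy - sensor_xy[s, 1]) / c
        values = np.interp(tof.ravel(), t, sinogram[s], left=0.0, right=0.0)
        out[s] = values.reshape(xx.shape)
    return out

# Hypothetical setup: 16 sensors on a 20 mm ring, 32 x 32 pixel grid.
fs, c = 40e6, 1500.0
angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
sensor_xy = 20e-3 * np.column_stack([np.cos(angles), np.sin(angles)])
grid = np.linspace(-5e-3, 5e-3, 32)

# A ramp trace makes the interpolated value equal to the fractional sample
# index, which makes the time-of-flight mapping easy to inspect.
sinogram = np.tile(np.arange(1024, dtype=float), (16, 1))
channels = pixelwise_interpolation(sinogram, sensor_xy, grid, grid, fs, c)
```

Summing the channels over the sensor axis would reduce this grid to a delay-and-sum-style image; keeping them separate is what lets a network weigh the contribution of each sensor itself.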
The limited-bandwidth problem occurs as a result of the ultrasound transducer array's limited detection frequency bandwidth. The transducer array acts like a band-pass filter in the frequency domain, attenuating both high and low frequencies within the photoacoustic signal. [15] [16] This limited bandwidth can cause artifacts and limit the axial resolution of the imaging system. [14] The primary deep neural network architectures used to remove limited-bandwidth artifacts have been WGAN-GP [14] and modified U-net. [15] [16] Before deep learning, the typical method to remove artifacts and denoise limited-bandwidth reconstructions was Wiener filtering, which helps to expand the PA signal's frequency spectrum. [14] The primary advantage of the deep learning method over Wiener filtering is that Wiener filtering requires a high initial signal-to-noise ratio (SNR), which is not always possible, while the deep learning model has no such restriction. [14]
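The band-pass behavior and the Wiener-filtering baseline can be illustrated with a one-dimensional sketch. Everything specific here is an assumption chosen for the example: the Gaussian-shaped transducer response, its centre frequency and bandwidth, the test pulse, the noise level, and the regularization constant `nsr` standing in for the noise-to-signal power ratio.

```python
import numpy as np

fs = 100e6                                   # sampling rate in Hz (assumed)
n = 1024
t = (np.arange(n) - n // 2) / fs
signal = np.exp(-0.5 * (t / 0.1e-6) ** 2)    # broadband test pulse

# Assumed transducer response: Gaussian band-pass centred at 5 MHz.
f = np.fft.fftfreq(n, d=1 / fs)
f0, sigma_f = 5e6, 2.5e6
H = np.exp(-0.5 * ((np.abs(f) - f0) / sigma_f) ** 2)

# Band-limited, slightly noisy measurement (high SNR case).
rng = np.random.default_rng(0)
measured = np.fft.ifft(H * np.fft.fft(signal)).real
measured += 1e-3 * rng.standard_normal(n)

def wiener_deconvolve(y, H, nsr):
    """Wiener deconvolution with a constant noise-to-signal power ratio."""
    W = np.conj(H) / (np.abs(H) ** 2 + nsr)
    return np.fft.ifft(W * np.fft.fft(y)).real

restored = wiener_deconvolve(measured, H, nsr=1e-4)
err_measured = np.linalg.norm(measured - signal)
err_restored = np.linalg.norm(restored - signal)
```

At this noise level the restored pulse is closer to the original than the band-limited measurement. Because the filter gain grows wherever the transducer response is small, raising the noise amplitude quickly degrades the result, which is the high-SNR requirement noted above.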
Fusion of information for improving photoacoustic images with deep neural networks
Fusion-based architectures exploit complementary information to improve photoacoustic image reconstruction. [20] Different reconstruction techniques promote different characteristics in the output, so image quality and characteristics vary with the technique used. [20] A fusion-based architecture was proposed that combines the outputs of two different reconstructions to yield better image quality than either reconstruction alone. It uses weight sharing and fusion of characteristics to achieve the improvement in output image quality. [20]
High-energy lasers allow light to reach deep into tissue, making deep structures visible in PA images. They provide a greater penetration depth than low-energy lasers: around 8 mm greater for lasers with wavelengths between 690 and 900 nm. [35] The American National Standards Institute has set a maximal permissible exposure (MPE) for different biological tissues. Lasers with specifications above the MPE can cause mechanical or thermal damage to the tissue they are imaging. [35] Manwar et al. were able to increase the penetration depth of low-energy lasers that meet the MPE standard by applying a U-net architecture to the images created by a low-energy laser. [35] The network was trained with images of an ex vivo sheep brain created by a low-energy laser of 20 mJ as the input and images of the same sheep brain created by a high-energy laser of 100 mJ, 20 mJ above the MPE, as the desired output. A perceptually sensitive loss function was used to train the network to increase the low signal-to-noise ratio in PA images created by the low-energy laser. The trained network was able to increase the peak-to-background ratio by 4.19 dB and the penetration depth by 5.88% for images of an in vivo sheep brain created by the low-energy laser. [35] Manwar et al. claim that this technology could be beneficial in neonatal brain imaging, where transfontanelle imaging can be used to look for lesions or injury.
Photoacoustic microscopy differs from other forms of photoacoustic tomography in that it uses focused ultrasound detection to acquire images pixel-by-pixel. PAM images are acquired as time-resolved volumetric data that is typically mapped to a 2-D projection via a Hilbert transform and maximum amplitude projection (MAP). [1] The first application of deep learning to PAM took the form of a motion-correction algorithm. [37] This procedure was proposed to correct the PAM artifacts that occur when an in vivo model moves during scanning. This movement creates the appearance of vessel discontinuities.
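The Hilbert transform and maximum amplitude projection step can be sketched as below. This is a minimal illustration on synthetic data, assuming the volume is stored with time as the last axis; the envelope is taken as the magnitude of the analytic signal, computed here with an FFT rather than a library Hilbert routine, and the tone-burst A-lines are invented for the example.

```python
import numpy as np

def envelope(a_lines):
    """Magnitude of the analytic signal along the last (time) axis."""
    n = a_lines.shape[-1]
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC
    gain[1:(n + 1) // 2] = 2.0         # double positive frequencies
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep Nyquist bin for even lengths
    analytic = np.fft.ifft(np.fft.fft(a_lines, axis=-1) * gain, axis=-1)
    return np.abs(analytic)

# Synthetic volume: one Gaussian-windowed tone burst per scan position,
# with a different amplitude and depth (arrival sample) at each pixel.
ny, nx, nt = 4, 5, 256
k = np.arange(nt)
amp = 1.0 + np.arange(ny * nx, dtype=float).reshape(ny, nx)
depth = 60 + 10 * np.arange(ny)[:, None] + np.arange(nx)[None, :]
shift = k - depth[..., None]
volume = amp[..., None] * np.exp(-0.5 * (shift / 10.0) ** 2) \
    * np.cos(2 * np.pi * 0.2 * shift)

# Maximum amplitude projection: peak of the envelope along time.
map_image = envelope(volume).max(axis=-1)
```

For these synthetic A-lines the projection recovers the per-pixel amplitude regardless of depth, which is why the MAP image is a convenient 2-D summary of the volumetric data.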
The two primary motion artifact types addressed by deep learning in PAM are displacements in the vertical and tilted directions. Chen et al. [37] used a simple three-layer convolutional neural network, with each layer represented by a weight matrix and a bias vector, to remove the PAM motion artifacts. Two of the convolutional layers contain ReLU activation functions, while the last has no activation function. [37] Using this architecture, kernel sizes of 3 × 3, 4 × 4, and 5 × 5 were tested, with the largest kernel size of 5 × 5 yielding the best results. [37] After training, the motion correction model was tested and performed well on both simulation and in vivo data. [37]
Frequency-domain PAM constitutes a powerful cost-efficient imaging method integrating intensity-modulated laser beams emitted by continuous wave sources for the excitation of single-frequency PA signals. [38] Nevertheless, this imaging approach generally provides lower signal-to-noise ratios (SNR), which can be up to two orders of magnitude lower than those of conventional time-domain systems. [39] To overcome the inherent SNR limitation of frequency-domain PAM, a U-Net neural network has been utilized to augment the generated images without the need for excessive averaging or the application of high optical power on the sample. In this context, the accessibility of PAM is improved as the system's cost is dramatically reduced while retaining sufficiently high image quality standards for demanding biological observations. [40]
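The structure of such a three-layer network can be sketched as a forward pass with untrained random weights. Only the layer count, the placement of the ReLU activations, and the 5 × 5 kernel size come from the description above; the number of hidden channels, the zero padding, and all function and variable names are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, weight, bias):
    """Zero-padded 'same' cross-correlation, as used in CNN layers.

    x: (c_in, h, w); weight: (c_out, c_in, k, k) with odd k; bias: (c_out,)
    """
    k = weight.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(xp, (k, k), axis=(1, 2))  # (c_in, h, w, k, k)
    return np.einsum("chwij,ocij->ohw", windows, weight) + bias[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def three_layer_cnn(image, params):
    """Two conv + ReLU layers followed by a conv layer with no activation."""
    (w1, b1), (w2, b2), (w3, b3) = params
    x = image[None]                        # add a channel axis
    x = relu(conv2d_same(x, w1, b1))
    x = relu(conv2d_same(x, w2, b2))
    return conv2d_same(x, w3, b3)[0]       # drop the channel axis

# Untrained example weights; hidden width of 8 channels is an assumption.
rng = np.random.default_rng(0)
k, hidden = 5, 8
params = [
    (0.1 * rng.standard_normal((hidden, 1, k, k)), np.zeros(hidden)),
    (0.1 * rng.standard_normal((hidden, hidden, k, k)), np.zeros(hidden)),
    (0.1 * rng.standard_normal((1, hidden, k, k)), np.zeros(1)),
]
image = rng.random((32, 32))
corrected = three_layer_cnn(image, params)
```

The output has the same size as the input image. In practice the weights would be learned from pairs of motion-corrupted and reference images rather than drawn at random.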
Microscopy is the technical field of using microscopes to view objects and areas of objects that cannot be seen with the naked eye. There are three well-known branches of microscopy: optical, electron, and scanning probe microscopy, along with the emerging field of X-ray microscopy.
The term biophotonics denotes a combination of biology and photonics, with photonics being the science and technology of generation, manipulation, and detection of photons, quantum units of light. Photonics is related to electronics and photons. Photons play a central role in information technologies, such as fiber optics, the way electrons do in electronics.
Optical coherence tomography (OCT) is an imaging technique that uses interferometry with short-coherence-length light to obtain micrometer-level depth resolution and uses transverse scanning of the light beam to form two- and three-dimensional images from light reflected from within biological tissue or other scattering media. Short-coherence-length light can be obtained using a superluminescent diode (SLD) with a broad spectral bandwidth or a broadly tunable laser with narrow linewidth. The first demonstration of OCT imaging was published by a team from MIT and Harvard Medical School in a 1991 article in the journal Science. The article introduced the term "OCT" to credit its derivation from optical coherence-domain reflectometry, in which the axial resolution is based on temporal coherence. The first demonstrations of in vivo OCT imaging quickly followed.
Tomographic reconstruction is a type of multidimensional inverse problem where the challenge is to yield an estimate of a specific system from a finite number of projections. The mathematical basis for tomographic imaging was laid down by Johann Radon. A notable example of applications is the reconstruction of computed tomography (CT) where cross-sectional images of patients are obtained in non-invasive manner. Recent developments have seen the Radon transform and its inverse used for tasks related to realistic object insertion required for testing and evaluating computed tomography use in airport security.
Super-resolution imaging (SR) is a class of techniques that enhance (increase) the resolution of an imaging system. In optical SR the diffraction limit of systems is transcended, while in geometrical SR the resolution of digital imaging sensors is enhanced.
An optical neural network is a physical implementation of an artificial neural network with optical components. Early optical neural networks used a photorefractive Volume hologram to interconnect arrays of input neurons to arrays of output with synaptic weights in proportion to the multiplexed hologram's strength. Volume holograms were further multiplexed using spectral hole burning to add one dimension of wavelength to space to achieve four dimensional interconnects of two dimensional arrays of neural inputs and outputs. This research led to extensive research on alternative methods using the strength of the optical interconnect for implementing neuronal communications.
Medical optical imaging is the use of light as an investigational imaging technique for medical applications, pioneered by American Physical Chemist Britton Chance. Examples include optical microscopy, spectroscopy, endoscopy, scanning laser ophthalmoscopy, laser Doppler imaging, and optical coherence tomography. Because light is an electromagnetic wave, similar phenomena occur in X-rays, microwaves, and radio waves.
Photoacoustic imaging or optoacoustic imaging is a biomedical imaging modality based on the photoacoustic effect. Non-ionizing laser pulses are delivered into biological tissues and part of the energy will be absorbed and converted into heat, leading to transient thermoelastic expansion and thus wideband ultrasonic emission. The generated ultrasonic waves are detected by ultrasonic transducers and then analyzed to produce images. It is known that optical absorption is closely associated with physiological properties, such as hemoglobin concentration and oxygen saturation. As a result, the magnitude of the ultrasonic emission, which is proportional to the local energy deposition, reveals physiologically specific optical absorption contrast. 2D or 3D images of the targeted areas can then be formed.
Ultrasound-modulated optical tomography (UOT), also known as Acousto-Optic Tomography (AOT), is a hybrid imaging modality that combines light and sound; it is a form of tomography involving ultrasound. It is used in imaging of biological soft tissues and has potential applications for early cancer detection. As a hybrid modality which uses both light and sound, UOT provides some of the best features of both: the use of light provides strong contrast and sensitivity, while the use of ultrasound allows for high resolution as well as a high imaging depth. However, the difficulty of tackling the two fundamental problems with UOT has caused UOT to evolve relatively slowly; most work in the field is limited to theoretical simulations or phantom / sample studies.
Time stretch microscopy, also known as serial time-encoded amplified imaging/microscopy or stretched time-encoded amplified imaging/microscopy (STEAM), is a fast real-time optical imaging method that provides MHz frame rate, ~100 ps shutter speed, and ~30 dB optical image gain. Based on the photonic time stretch technique, STEAM holds world records for shutter speed and frame rate in continuous real-time imaging. STEAM employs the Photonic Time Stretch with internal Raman amplification to realize optical image amplification to circumvent the fundamental trade-off between sensitivity and speed that affects virtually all optical imaging and sensing systems. This method uses a single-pixel photodetector, eliminating the need for the detector array and readout time limitations. Avoiding this problem and featuring the optical image amplification for improvement in sensitivity at high image acquisition rates, STEAM's shutter speed is at least 1000 times faster than the best CCD and CMOS cameras. Its frame rate is 1000 times faster than the fastest CCD cameras and 10–100 times faster than the fastest CMOS cameras.
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in a fully connected layer, 10,000 weights would be required to process an image of 100 × 100 pixels. However, when applying cascaded convolution kernels, only 25 weights are required to process 5 × 5 tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
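The weight counts in this comparison follow from simple arithmetic, shown here for a single-channel image with bias terms ignored (both simplifications are assumptions of this example).

```python
# One fully connected neuron needs one weight per input pixel.
height, width = 100, 100
weights_per_dense_neuron = height * width          # 10000

# One 5 x 5 convolution kernel is shared across every image position.
kernel_size = 5
weights_per_conv_kernel = kernel_size * kernel_size  # 25

reduction_factor = weights_per_dense_neuron // weights_per_conv_kernel  # 400
```

The saving comes from weight sharing: the same 25 kernel values are reused at every position of the image instead of learning a separate weight for each pixel.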
Fourier ptychography is a computational imaging technique based on optical microscopy that consists in the synthesis of a wider numerical aperture from a set of full-field images acquired at various coherent illumination angles, resulting in increased resolution compared to a conventional microscope.
The Beckman Laser Institute is an interdisciplinary research center for the development of optical technologies and their use in biology and medicine. Located on the campus of the University of California, Irvine in Irvine, California, an independent nonprofit corporation was created in 1982, under the leadership of Michael W. Berns, and the actual facility opened on June 4, 1986. It is one of a number of institutions focused on translational research, connecting research and medical applications. Researchers at the institute have developed laser techniques for the manipulation of structures within a living cell, and applied them medically in treatment of skin conditions, stroke, and cancer, among others.
Lihong V. Wang is the Bren Professor of Medical Engineering and Electrical Engineering at the Andrew and Peggy Cherng Department of Medical Engineering at California Institute of Technology and was formerly the Gene K. Beare Distinguished Professorship of Biomedical Engineering at Washington University in St. Louis. Wang is known for his contributions to the field of Photoacoustic imaging technologies. Wang was elected as the member of National Academy of Engineering (NAE) in 2018.
Super-resolution photoacoustic imaging is a set of techniques used to enhance spatial resolution in photoacoustic imaging. Specifically, these techniques primarily break the optical diffraction limit of the photoacoustic imaging system. It can be achieved in a variety of mechanisms, such as blind structured illumination, multi-speckle illumination, or photo-imprint photoacoustic microscopy in Figure 1.
Photoacoustic microscopy is an imaging method based on the photoacoustic effect and is a subset of photoacoustic tomography. Photoacoustic microscopy takes advantage of the local temperature rise that occurs as a result of light absorption in tissue. Using a nanosecond pulsed laser beam, tissues undergo thermoelastic expansion, resulting in the release of a wide-band acoustic wave that can be detected using a high-frequency ultrasound transducer. Since ultrasonic scattering in tissue is weaker than optical scattering, photoacoustic microscopy is capable of achieving high-resolution images at greater depths than conventional microscopy methods. Furthermore, photoacoustic microscopy is especially useful in the field of biomedical imaging due to its scalability. By adjusting the optical and acoustic foci, lateral resolution may be optimized for the desired imaging depth.
U-Net is a convolutional neural network that was developed for image segmentation. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segmentation of a 512 × 512 image takes less than a second on a modern (2015) GPU using the U-Net architecture.
Coherent Raman scattering (CRS) microscopy is a multi-photon microscopy technique based on Raman-active vibrational modes of molecules. The two major techniques in CRS microscopy are stimulated Raman scattering (SRS) and coherent anti-Stokes Raman scattering (CARS). SRS and CARS were theoretically predicted and experimentally realized in the 1960s. In 1982 the first CARS microscope was demonstrated. In 1999, CARS microscopy using a collinear geometry and high numerical aperture objective was developed in Xiaoliang Sunney Xie's lab at Harvard University. This advancement made the technique more compatible with modern laser scanning microscopes. Since then, the popularity of CRS in biomedical research has grown. CRS is mainly used to image lipid, protein, and other biomolecules in live or fixed cells or tissues without labeling or staining. CRS can also be used to image samples labeled with Raman tags, which can avoid interference from other molecules and normally allows for stronger CRS signals than would be obtained for common biomolecules. CRS also finds application in other fields, such as material science and environmental science.
Photoacoustic flow cytometry or PAFC is a biomedical imaging modality that utilizes photoacoustic imaging to perform flow cytometry. A flow of cells passes a photoacoustic system producing individual signal response. Each signal is counted to produce a quantitative evaluation of the input sample.
Single-pixel imaging is a computational imaging technique for producing spatially-resolved images using a single detector instead of an array of detectors. A device that implements such an imaging scheme is called a single-pixel camera. Combined with compressed sensing, the single-pixel camera can recover images from fewer measurements than the number of reconstructed pixels.