U-Net

U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. [1] The network is based on a fully convolutional neural network [2] whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segmentation of a 512 × 512 image takes less than a second on a modern (2015) GPU using the U-Net architecture. [1]

The U-Net architecture has also been employed in diffusion models for iterative image denoising. [3] This technology underlies many modern image generation models, such as DALL-E, Midjourney, and Stable Diffusion.

Description

The U-Net architecture stems from the so-called “fully convolutional network” proposed by Long, Shelhamer, and Darrell in 2014. [2]

The main idea is to supplement a usual contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence these layers increase the resolution of the output. A successive convolutional layer can then learn to assemble a precise output based on this information. [1]

One important modification in U-Net is that there are a large number of feature channels in the upsampling part, which allow the network to propagate context information to higher-resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path and yields a U-shaped architecture. The network uses only the valid part of each convolution, without any fully connected layers. [2] To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This overlap-tile strategy is important for applying the network to large images, since otherwise the resolution would be limited by the GPU memory.
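The border-mirroring idea can be illustrated in a few lines of Python (a minimal sketch, not the authors' code; the tile size and padding width here are arbitrary assumptions):

import numpy as np

def mirror_pad(tile: np.ndarray, pad: int) -> np.ndarray:
    # Extrapolate the missing border context by reflecting the image
    # across its edges, as in the overlap-tile strategy.
    return np.pad(tile, pad_width=pad, mode="reflect")

tile = np.arange(16).reshape(4, 4)   # a toy 4x4 image tile
padded = mirror_pad(tile, pad=2)     # mirrored 2-pixel border on all sides
print(padded.shape)                  # (8, 8)

Running the network on such mirrored tiles and stitching the valid central outputs together allows arbitrarily large images to be segmented with bounded GPU memory.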

History

U-Net was created by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 and reported in the paper “U-Net: Convolutional Networks for Biomedical Image Segmentation”. [1] It is an improvement and extension of the fully convolutional network (FCN) of Evan Shelhamer, Jonathan Long, and Trevor Darrell (2014), "Fully convolutional networks for semantic segmentation". [2]

Network architecture

The network consists of a contracting path and an expansive path, which gives it the U-shaped architecture. The contracting path is a typical convolutional network that consists of repeated application of convolutions, each followed by a rectified linear unit (ReLU) and a max pooling operation. During the contraction, the spatial information is reduced while feature information is increased. The expansive pathway combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. [4]

Example architecture of a U-Net for producing k 256-by-256 image masks from a 256-by-256 RGB image.
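A minimal PyTorch sketch of this architecture is given below (an illustration, not the reference implementation: only two resolution levels are shown, and padded convolutions replace the paper's valid convolutions so that the k output masks keep the 256-by-256 input size):

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by a ReLU. padding=1 keeps the
    # spatial size fixed (the original paper uses unpadded convolutions).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=3, k=2, base=64):
        super().__init__()
        self.down1 = double_conv(in_ch, base)        # contracting path
        self.down2 = double_conv(base, 2 * base)
        self.pool = nn.MaxPool2d(2)                  # halves the resolution
        self.mid = double_conv(2 * base, 4 * base)
        self.up2 = nn.ConvTranspose2d(4 * base, 2 * base, 2, stride=2)
        self.dec2 = double_conv(4 * base, 2 * base)  # after concatenation
        self.up1 = nn.ConvTranspose2d(2 * base, base, 2, stride=2)
        self.dec1 = double_conv(2 * base, base)
        self.head = nn.Conv2d(base, k, 1)            # k per-pixel mask channels

    def forward(self, x):
        d1 = self.down1(x)                  # (N, 64, 256, 256)
        d2 = self.down2(self.pool(d1))      # (N, 128, 128, 128)
        m = self.mid(self.pool(d2))         # (N, 256, 64, 64)
        u2 = self.dec2(torch.cat([self.up2(m), d2], dim=1))   # skip connection
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))  # skip connection
        return self.head(u1)                # (N, k, 256, 256)

masks = MiniUNet()(torch.randn(1, 3, 256, 256))
print(masks.shape)  # torch.Size([1, 2, 256, 256])

The channel-wise concatenations are the skip connections that carry high-resolution spatial detail from the contracting path into the expansive path.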

Applications

There are many applications of U-Net in biomedical image segmentation, such as brain image segmentation (BraTS [5] ) and liver image segmentation (SLIVER07 [6] ), as well as protein binding site prediction. [7] U-Net implementations have also found use in the physical sciences, for example in the analysis of micrographs of materials. [8] [9] [10] Variations of the U-Net have also been applied for medical image reconstruction. [11] Some variants and applications of U-Net are listed below:

  1. Pixel-wise regression using U-Net and its application on pansharpening; [12]
  2. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation; [13]
  3. TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation; [14]
  4. Image-to-image translation to estimate fluorescent stains; [15]
  5. Binding site prediction of protein structure. [7]

Related Research Articles

Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.
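A minimal sketch of the encoder/decoder pairing (illustrative only; the 784- and 32-dimensional sizes are arbitrary assumptions):

import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),    # encoding function: input -> code
    nn.Linear(32, 784), nn.Sigmoid()  # decoding function: code -> reconstruction
)
# Training minimizes a reconstruction loss, e.g. nn.MSELoss(), between the
# input and the decoder's output, so the 32-dimensional code must retain
# the information needed to rebuild the 784-dimensional input.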

Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models, and other sequence learning methods. It aims to provide a short-term memory for RNNs that can last thousands of timesteps, hence the name "long short-term memory". It is applicable to classification, processing, and predicting data based on time series, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.

Deep learning

Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised, or unsupervised.

MNIST database

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. Vanishing and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, each neuron in a fully connected layer would need 10,000 weights to process an image sized 100 × 100 pixels; with cascaded convolution kernels, a single shared 5 × 5 kernel needs only 25 weights to process every 5 × 5 tile of the image. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
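The weight counts in this comparison can be checked directly (a sketch; the layer shapes are just the figures quoted above):

import torch.nn as nn

fc_neuron = nn.Linear(100 * 100, 1, bias=False)      # one fully connected neuron
kernel = nn.Conv2d(1, 1, kernel_size=5, bias=False)  # one shared 5x5 kernel
print(sum(p.numel() for p in fc_neuron.parameters()))  # 10000
print(sum(p.numel() for p in kernel.parameters()))     # 25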

DeepDream

DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance reminiscent of a psychedelic experience in the deliberately overprocessed images.

Deep image prior is a type of convolutional neural network used to enhance a given image with no prior training data other than the image itself. A neural network is randomly initialized and used as prior to solve inverse problems such as noise reduction, super-resolution, and inpainting. Image statistics are captured by the structure of a convolutional image generator rather than by any previously learned capabilities.

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANNs), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy, and performance estimation strategy used.

In computer vision, SqueezeNet is the name of a deep neural network for image classification that was released in 2016. SqueezeNet was developed by researchers at DeepScale, University of California, Berkeley, and Stanford University. In designing SqueezeNet, the authors' goal was to create a smaller neural network with fewer parameters while achieving competitive accuracy.

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.

Neural style transfer

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. The method has been used by artists and designers around the globe to develop new artwork based on existing styles.

Deep learning in photoacoustic imaging

Deep learning in photoacoustic imaging combines the hybrid imaging modality of photoacoustic imaging (PA) with the rapidly evolving field of deep learning. Photoacoustic imaging is based on the photoacoustic effect, in which optical absorption causes a rise in temperature, which causes a subsequent rise in pressure via thermo-elastic expansion. This pressure rise propagates through the tissue and is sensed via ultrasonic transducers. Due to the proportionality between the optical absorption, the rise in temperature, and the rise in pressure, the ultrasound pressure wave signal can be used to quantify the original optical energy deposition within the tissue.

The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN). Unlike the earlier inception score (IS), which evaluates only the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images. The FID metric does not completely replace the IS metric: models that achieve the best (lowest) FID score tend to have greater sample variety, while models achieving the best (highest) IS score tend to have better quality within individual images.
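For reference, with the real and generated images each summarized by the mean and covariance of their Inception feature vectors, the FID is computed as (the standard formulation, not specific to U-Net):

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature mean and covariance of the real and generated distributions, respectively.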

Video super-resolution

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore fine details while preserving coarse ones, but also to maintain motion consistency.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
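The patch-serialization step can be sketched in a few lines of PyTorch (illustrative assumptions: a 224 × 224 RGB image, 16 × 16 patches, and a projection to 512 dimensions):

import torch

image = torch.randn(1, 3, 224, 224)
# Cut the image into non-overlapping 16x16 patches: (1, 3, 14, 14, 16, 16)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)
# Flatten each patch into a 3*16*16 = 768-dimensional vector: (1, 196, 768)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
W = torch.randn(768, 512)   # single learned projection matrix
tokens = patches @ W        # (1, 196, 512) token embeddings
# 'tokens' would then be fed, with position embeddings, to a transformer encoder.
print(tokens.shape)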

Biomedical data science is a multidisciplinary field which leverages large volumes of data to promote biomedical innovation and discovery. Biomedical data science draws from various fields including biostatistics, biomedical informatics, and machine learning, with the goal of understanding biological and medical data. It can be viewed as the study and application of data science to solve biomedical problems. Modern biomedical datasets often have specific features which make their analyses difficult.

Alexander Wong is a professor in the Department of Systems Design Engineering and a Co-Director of the Vision and Image Processing Research Group at the University of Waterloo. He is the Canada Research Chair in Artificial Intelligence and Medical Imaging, a Founding Member of the Waterloo Artificial Intelligence Institute, a Member of the College of the Royal Society of Canada, and a Fellow of the Institution of Engineering and Technology. He is also a Fellow of the Institute of Physics, a Fellow of the International Society for Design and Development in Education, a Fellow of the Royal Society for Public Health, and a Fellow of the Royal Society of Medicine.

Stable Diffusion

Stable Diffusion is a deep learning text-to-image model released in 2022, based on diffusion techniques. It is considered to be a part of the ongoing artificial intelligence boom.

References

  1. Ronneberger O, Fischer P, Brox T (2015). "U-Net: Convolutional Networks for Biomedical Image Segmentation". arXiv:1505.04597 [cs.CV].
  2. Shelhamer E, Long J, Darrell T (November 2014). "Fully Convolutional Networks for Semantic Segmentation". IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (4): 640–651. arXiv:1411.4038. doi:10.1109/TPAMI.2016.2572683. PMID 27244717. S2CID 1629541.
  3. Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models". arXiv:2006.11239 [cs.LG].
  4. "U-Net code".
  5. "MICCAI BraTS 2017: Scope | Section for Biomedical Image Analysis (SBIA) | Perelman School of Medicine at the University of Pennsylvania". www.med.upenn.edu. Retrieved 2018-12-24.
  6. "SLIVER07 : Home". www.sliver07.org. Retrieved 2018-12-24.
  7. 1 2 Nazem F, Ghasemi F, Fassihi A, Dehnavi AM (April 2021). "3D U-Net: A voxel-based method in binding site prediction of protein structure". Journal of Bioinformatics and Computational Biology. 19 (2): 2150006. doi:10.1142/S0219720021500062. PMID   33866960. S2CID   233300145.
  8. Chen, Fu-Xiang Rikudo; Lin, Chia-Yu; Siao, Hui-Ying; Jian, Cheng-Yuan; Yang, Yong-Cheng; Lin, Chun-Liang (2023-02-14). "Deep learning based atomic defect detection framework for two-dimensional materials". Scientific Data. 10 (1): 91. doi:10.1038/s41597-023-02004-6. ISSN   2052-4463. PMC   9929095 . PMID   36788235.
  9. Shi, Peng; Duan, Mengmeng; Yang, Lifang; Feng, Wei; Ding, Lianhong; Jiang, Liwu (2022-06-22). "An Improved U-Net Image Segmentation Method and Its Application for Metallic Grain Size Statistics". Materials. 15 (13): 4417. doi: 10.3390/ma15134417 . ISSN   1996-1944. PMC   9267311 . PMID   35806543.
  10. Patrick, Matthew J; Eckstein, James K; Lopez, Javier R; Toderas, Silvia; Asher, Sarah A; Whang, Sylvia I; Levine, Stacey; Rickman, Jeffrey M; Barmak, Katayun (2023-11-15). "Automated Grain Boundary Detection for Bright-Field Transmission Electron Microscopy Images via U-Net". Microscopy and Microanalysis. arXiv: 2312.09392 . doi: 10.1093/micmic/ozad115 . ISSN   1431-9276. PMID   37966960.
  11. Andersson J, Ahlström H, Kullberg J (September 2019). "Separation of water and fat signal in whole-body gradient echo scans using convolutional neural networks". Magnetic Resonance in Medicine. 82 (3): 1177–1186. doi:10.1002/mrm.27786. PMC   6618066 . PMID   31033022.
  12. Yao W, Zeng Z, Lian C, Tang H (2018-10-27). "Pixel-wise regression using U-Net and its application on pansharpening". Neurocomputing. 312: 364–371. doi:10.1016/j.neucom.2018.05.103. ISSN   0925-2312. S2CID   207119255.
  13. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O (2016). "3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation". arXiv: 1606.06650 [cs.CV].
  14. Iglovikov V, Shvets A (2018). "TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation". arXiv: 1801.05746 [cs.CV].
  15. Kandel ME, He YR, Lee YJ, Chen TH, Sullivan KM, Aydin O, et al. (December 2020). "Phase imaging with computational specificity (PICS) for measuring dry mass changes in sub-cellular compartments". Nature Communications. 11 (1): 6256. arXiv: 2002.08361 . doi:10.1038/s41467-020-20062-x. PMC   7721808 . PMID   33288761.
