An event camera, also known as a neuromorphic camera, [1] silicon retina, [2] or dynamic vision sensor, [3] is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
Event camera pixels independently respond to changes in brightness as they occur. [4] Each pixel stores a reference brightness level, and continuously compares it to the current brightness level. If the difference in brightness exceeds a threshold, that pixel resets its reference level and generates an event: a discrete packet that contains the pixel address and timestamp. Events may also contain the polarity (increase or decrease) of a brightness change, or an instantaneous measurement of the illumination level, [5] depending on the specific sensor model. Thus, event cameras output an asynchronous stream of events triggered by changes in scene illumination.
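The per-pixel behaviour described above can be sketched in a few lines of Python. This is a minimal, hypothetical model (the threshold value and the log-intensity encoding are illustrative assumptions, not a specific sensor's design):

```python
import math

# Hypothetical sketch of a single event-camera pixel: it compares the
# log of the current brightness against a stored reference level and
# emits an event whenever the difference exceeds a contrast threshold.

def pixel_events(samples, threshold=0.2):
    """Yield (timestamp, polarity) events for one pixel.

    samples   -- iterable of (timestamp_us, brightness) pairs
    threshold -- log-intensity contrast needed to trigger an event
    """
    events = []
    ref = None
    for t, brightness in samples:
        level = math.log(brightness)
        if ref is None:
            ref = level                # initialise the reference level
            continue
        diff = level - ref
        if abs(diff) >= threshold:
            events.append((t, +1 if diff > 0 else -1))
            ref = level                # reset the reference after firing
    return events

# A step increase in brightness produces a single positive event;
# constant brightness before and after stays silent.
evs = pixel_events([(0, 100.0), (10, 100.0), (20, 150.0), (30, 150.0)])
print(evs)   # -> [(20, 1)]
```

In a real sensor each pixel runs this comparison continuously in analog circuitry; the sampled loop here only illustrates the reference-and-threshold logic.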
Event cameras typically report timestamps with microsecond temporal resolution, offer a dynamic range of about 120 dB, and suffer less under/overexposure and motion blur [4] [6] than frame cameras. This allows them to track object and camera movement (optical flow) more accurately. They yield grey-scale information. Initially (2014), resolution was limited to 100 pixels; a later sensor reached 640×480 resolution in 2019. Because individual pixels fire independently, event cameras appear suitable for integration with asynchronous computing architectures such as neuromorphic computing. Pixel independence also lets these cameras handle scenes containing both brightly and dimly lit regions without having to average across them. [7] Note that while the camera reports events with microsecond resolution, the actual temporal resolution (equivalently, the sensing bandwidth) is on the order of tens of microseconds to a few milliseconds, depending on signal contrast, lighting conditions, and sensor design. [8]
Sensor | Dynamic range (dB) | Equivalent framerate (fps) | Spatial resolution (MP) | Power consumption (mW)
---|---|---|---|---
Human eye | 30–40 | 200–300* | – | 10 [9]
High-end DSLR camera (Nikon D850) | 44.6 [10] | 120 | 2–8 | –
Ultrahigh-speed camera (Phantom v2640) [11] | 64 | 12,500 | 0.3–4 | –
Event camera [12] | 120 | 50,000–300,000** | 0.1–1 | 30
* Human perception temporal resolution, including cognitive processing time. ** Refers to change-recognition rates; varies with signal and sensor model.
Temporal contrast sensors (such as DVS [4] (Dynamic Vision Sensor), or sDVS [13] (sensitive-DVS)) produce events that indicate polarity (increase or decrease in brightness), while temporal image sensors [5] indicate the instantaneous intensity with each event. The DAVIS [14] (Dynamic and Active-pixel Vision Sensor) contains a global shutter active pixel sensor (APS) in addition to the dynamic vision sensor (DVS) that shares the same photosensor array. Thus, it has the ability to produce image frames alongside events. Many event cameras additionally carry an inertial measurement unit (IMU).
Another class of event sensors are so-called retinomorphic sensors. While the term retinomorphic has been used to describe event sensors generally, [15] [16] in 2020 it was adopted as the name for a specific sensor design based on a resistor and photosensitive capacitor in series. [17] These capacitors are distinct from photocapacitors, which are used to store solar energy, [18] and are instead designed to change capacitance under illumination. They (dis)charge slightly when the capacitance is changed, but otherwise remain in equilibrium. When a photosensitive capacitor is placed in series with a resistor, and an input voltage is applied across the circuit, the result is a sensor that outputs a voltage when the light intensity changes, but otherwise does not.
Unlike other event sensors (typically built from a photodiode and a few other circuit elements), these sensors produce the signal inherently. A single retinomorphic device can therefore produce the same result as the small circuit used in other event cameras. Retinomorphic sensors have so far been studied only in research settings. [19] [20] [21] [22]
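The resistor-plus-photosensitive-capacitor circuit described above can be illustrated with a small numerical simulation. All component values, the illumination step, and the Euler time-stepping below are illustrative assumptions:

```python
# Numerical sketch of a retinomorphic sensor: a photosensitive capacitor
# C(t) in series with a resistor R, driven by a constant bias voltage.
# The voltage across the resistor is non-zero only while the
# capacitance (i.e. the light level) is changing.

R = 1e6            # series resistance, ohms (assumed)
V_in = 1.0         # constant bias voltage, volts (assumed)
dt = 1e-5          # simulation time step, seconds

def capacitance(t):
    # Assumed illumination step at t = 5 ms halves the capacitance.
    return 1e-9 if t < 5e-3 else 0.5e-9

q = capacitance(0.0) * V_in   # start in equilibrium: V_C = V_in, V_R = 0
trace = []
t = 0.0
while t < 10e-3:
    c = capacitance(t)
    v_r = V_in - q / c        # voltage dropped across the resistor
    q += (v_r / R) * dt       # charge flows until equilibrium returns
    trace.append((t, v_r))
    t += dt

# Before the step the output sits at 0 V; the capacitance change
# produces a transient spike that decays with time constant R*C.
print(max(abs(v) for _, v in trace))
```

The transient-then-silent output is exactly the event-like behaviour described in the text: the circuit responds to a *change* in illumination and returns to equilibrium afterwards.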
Image reconstruction from events has the potential to create images and video with high dynamic range, high temporal resolution, and reduced motion blur. Image reconstruction can be achieved using temporal smoothing, e.g. with a high-pass or complementary filter. [23] Alternative methods include optimization [24] and gradient estimation [25] followed by Poisson integration.
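A per-pixel temporal high-pass filter of the kind cited above can be sketched as follows. The contrast step `theta`, time constant `tau`, and function name are illustrative assumptions, not a specific published implementation:

```python
import math

# Per-pixel sketch of image reconstruction by temporal high-pass
# filtering: each event adds +/- theta to the log-intensity estimate,
# which then leaks back toward zero with time constant tau.

def reconstruct_pixel(events, theta=0.2, tau=0.1, t_end=1.0):
    """events: list of (timestamp_s, polarity) for one pixel.
    Returns the filtered log-intensity estimate at time t_end."""
    estimate, t_prev = 0.0, 0.0
    for t, pol in events:
        estimate *= math.exp(-(t - t_prev) / tau)  # leak (high-pass)
        estimate += pol * theta                    # integrate the event
        t_prev = t
    return estimate * math.exp(-(t_end - t_prev) / tau)

# Three positive events in quick succession leave a bright estimate,
# which would fade if no further events arrived (the high-pass leak).
print(reconstruct_pixel([(0.00, 1), (0.01, 1), (0.02, 1)], t_end=0.05))
```

Running this filter independently at every pixel yields a continuously updated intensity image; a complementary filter additionally blends in occasional full frames to recover the low-frequency content the events alone cannot provide.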
The concept of spatial event-driven convolution was postulated in 1999 [26] (before the DVS), but later generalized during EU project CAVIAR [27] (during which the DVS was invented) by projecting event-by-event an arbitrary convolution kernel around the event coordinate in an array of integrate-and-fire pixels. [28] Extension to multi-kernel event-driven convolutions [29] allows for event-driven deep convolutional neural networks. [30]
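The kernel-projection scheme described above can be sketched directly: each incoming event stamps a kernel around its coordinate into an array of integrate-and-fire accumulators. The kernel values, threshold, and array size below are illustrative assumptions:

```python
# Hypothetical sketch of event-driven convolution: each event projects
# a convolution kernel around its (x, y) coordinate into an array of
# integrate-and-fire cells; a cell crossing its threshold emits an
# output event and resets.

W, H = 8, 8
kernel = [[0.25, 0.5, 0.25],
          [0.5,  1.0, 0.5],
          [0.25, 0.5, 0.25]]
threshold = 2.0
state = [[0.0] * W for _ in range(H)]

def process_event(x, y, polarity):
    """Project the kernel around (x, y); return output events fired."""
    fired = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ix, iy = x + dx, y + dy
            if 0 <= ix < W and 0 <= iy < H:
                state[iy][ix] += polarity * kernel[dy + 1][dx + 1]
                if state[iy][ix] >= threshold:   # integrate-and-fire
                    fired.append((ix, iy))
                    state[iy][ix] = 0.0          # reset after firing
    return fired

# Two positive events at the same coordinate drive the centre cell
# past the threshold: no output after the first, then the cell fires.
print(process_event(4, 4, +1))   # -> []
print(process_event(4, 4, +1))   # -> [(4, 4)]
```

Cascading several such arrays, each with its own kernel, is the basic idea behind the event-driven deep convolutional networks mentioned above.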
Segmenting and detecting moving objects viewed by an event camera can seem trivial, since the sensor itself isolates changes on-chip. In practice these tasks are difficult, because events carry little information [31] and lack useful visual features like texture and color. [32] They become even more challenging with a moving camera, [31] because events are then triggered everywhere on the image plane, by both moving objects and the static scene (whose apparent motion is induced by the camera's ego-motion). Some recent approaches to this problem incorporate motion-compensation models [33] [34] and traditional clustering algorithms. [35] [36] [32] [37]
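The motion-compensation idea cited above can be illustrated with a toy example: events are warped back along a candidate image velocity, and the candidate that concentrates them onto the fewest pixels (the sharpest image of warped events) is selected. The grid, candidate velocities, sharpness score, and synthetic event data are all illustrative assumptions:

```python
# Toy sketch of motion compensation for event-based processing.

def sharpness(events, vx, vy):
    """Count distinct pixels hit after warping events back to t = 0."""
    hit = set()
    for t, x, y in events:
        hit.add((round(x - vx * t), round(y - vy * t)))
    return -len(hit)   # fewer pixels hit => sharper => higher score

# Synthetic events from an edge moving at 10 px/s along x (assumed).
events = [(t / 10, 5 + t, 3) for t in range(5)]   # (t_s, x_px, y_px)

# The candidate velocity matching the true motion aligns all events
# onto a single pixel and wins.
best = max([(0, 0), (10, 0), (0, 10)], key=lambda v: sharpness(events, *v))
print(best)   # -> (10, 0)
```

Real methods optimize over a continuous velocity (or warp) field and use richer sharpness measures such as image contrast, but the principle is the same: correct motion hypotheses produce sharp images of warped events.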
Potential applications include most tasks classically addressed with conventional cameras, with emphasis on machine vision tasks such as object recognition, autonomous vehicles, and robotics. [21] The US military is considering infrared and other event cameras because of their lower power consumption and reduced heat generation. [7]
Given these advantages over conventional image sensors, event cameras are well suited to applications that demand low power consumption and low latency, or where it is difficult to stabilize the camera's line of sight. These include the aforementioned autonomous systems, as well as space imaging, security, defense, and industrial monitoring. Research into color sensing with event cameras is underway, [38] but event cameras are not yet practical for applications requiring color sensing.