An event camera, also known as a neuromorphic camera, [1] silicon retina [2] or dynamic vision sensor, [3] is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
Event camera pixels independently respond to changes in brightness as they occur. [4] Each pixel stores a reference brightness level, and continuously compares it to the current brightness level. If the difference in brightness exceeds a threshold, that pixel resets its reference level and generates an event: a discrete packet that contains the pixel address and timestamp. Events may also contain the polarity (increase or decrease) of a brightness change, or an instantaneous measurement of the illumination level, [5] depending on the specific sensor model. Thus, event cameras output an asynchronous stream of events triggered by changes in scene illumination.
Event cameras typically report timestamps with a microsecond temporal resolution, 120 dB dynamic range, and less under/overexposure and motion blur [4] [6] than frame cameras. This allows them to track object and camera movement (optical flow) more accurately. They yield grey-scale information. Initially (2014), resolution was limited to 100 pixels. A later entry reached 640x480 resolution in 2019. Because individual pixels fire independently, event cameras appear suitable for integration with asynchronous computing architectures such as neuromorphic computing. Pixel independence allows these cameras to cope with scenes with brightly and dimly lit regions without having to average across them. [7] It is important to note that while the camera reports events with microsecond resolution, the actual temporal resolution (or, alternatively, the bandwidth for sensing) is in the order of tens of microseconds to a few miliseconds - depending on signal contrast, lighting conditions and sensor design. [8]
Sensor | Dynamic range (dB) | Equivalent framerate (fps) | Spatial resolution (MP) | Power consumption (mW) |
---|---|---|---|---|
Human eye | 30–40 | 200-300* | - | 10 [9] |
High-end DSLR camera (Nikon D850) | 44.6 [10] | 120 | 2–8 | - |
Ultrahigh-speed camera (Phantom v2640) [11] | 64 | 12,500 | 0.3–4 | - |
Event camera [12] | 120 | 50,000 - 300,000** | 0.1–1 | 30 |
* Indicates human perception temporal resolution, including cognitive processing time. **Refers to change recognition rates, and varies according to signal and sensor model.
Temporal contrast sensors (such as DVS [4] (Dynamic Vision Sensor), or sDVS [13] (sensitive-DVS)) produce events that indicate polarity (increase or decrease in brightness), while temporal image sensors [5] indicate the instantaneous intensity with each event. The DAVIS [14] (Dynamic and Active-pixel Vision Sensor) contains a global shutter active pixel sensor (APS) in addition to the dynamic vision sensor (DVS) that shares the same photosensor array. Thus, it has the ability to produce image frames alongside events. Many event cameras additionally carry an inertial measurement unit (IMU).
Another class of event sensors are so-called retinomorphic sensors. While the term retinomorphic has been used to describe event sensors generally, [15] [16] in 2020 it was adopted as the name for a specific sensor design based on a resistor and photosensitive capacitor in series. [17] These capacitors are distinct from photocapacitors, which are used to store solar energy, [18] and are instead designed to change capacitance under illumination. They charge/discharge slightly when the capacitance is changed, but otherwise remain in equilibrium. When a photosensitive capacitor is placed in series with a resistor, and an input voltage is applied across the circuit, the result is a sensor that outputs a voltage when the light intensity changes, but otherwise does not.
Unlike other event sensors (typically a photodiode and some other circuit elements), these sensors produce the signal inherently. They can hence be considered a single device that produces the same result as a small circuit in other event cameras. Retinomorphic sensors have to-date only been studied in a research environment. [19] [20] [21] [22]
Image reconstruction from events has the potential to create images and video with high dynamic range, high temporal resolution and reduced motion blur. Image reconstruction can be achieved using temporal smoothing, e.g. high-pass or complementary filter. [23] Alternative methods include optimization [24] and gradient estimation [25] followed by Poisson integration.
The concept of spatial event-driven convolution was postulated in 1999 [26] (before the DVS), but later generalized during EU project CAVIAR [27] (during which the DVS was invented) by projecting event-by-event an arbitrary convolution kernel around the event coordinate in an array of integrate-and-fire pixels. [28] Extension to multi-kernel event-driven convolutions [29] allows for event-driven deep convolutional neural networks. [30]
Segmentation and detection of moving objects viewed by an event camera can seem to be a trivial task, as it is done by the sensor on-chip. However, these tasks are difficult, because events carry little information [31] and do not contain useful visual features like texture and color. [32] These tasks become further challenging given a moving camera, [31] because events are triggered everywhere on the image plane, produced by moving objects and the static scene (whose apparent motion is induced by the camera’s ego-motion). Some of the recent approaches to solving this problem include the incorporation of motion-compensation models [33] [34] and traditional clustering algorithms. [35] [36] [32] [37]
Potential applications include most tasks classically fitting conventional camera, but with emphasis on machine vision tasks (such as object recognition, autonomous vehicles, and robotics. [21] ). The US military is considering infrared and other event cameras because of their lower power consumption and reduced heat generation. [7]
Considering the advantages the event camera possesses, compared to conventional image sensors, it is considered fitting for applications requiring low power consumption, low latency, and difficulty to stabilize camera line of sight. These applications include the aforementioned autonomous systems, but also space imaging, security, defense and industrial monitoring. It is notable that while research into color sensing with event cameras is underway, [38] it is not yet convenient for use with applications requiring color sensing.
Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
Neuromorphic computing is an approach to computing that is inspired by the structure and function of the human brain. A neuromorphic computer/chip is any device that uses physical artificial neurons to do computations. In recent times, the term neuromorphic has been used to describe analog, digital, mixed-mode analog/digital VLSI, and software systems that implement models of neural systems. The implementation of neuromorphic computing on the hardware level can be realized by oxide-based memristors, spintronic memories, threshold switches, transistors, among others. Training software-based neuromorphic systems of spiking neural networks can be achieved using error backpropagation, e.g., using Python based frameworks such as snnTorch, or using canonical learning rules from the biological learning literature, e.g., using BindsNet.
Carver Andress Mead is an American scientist and engineer. He currently holds the position of Gordon and Betty Moore Professor Emeritus of Engineering and Applied Science at the California Institute of Technology (Caltech), having taught there for over 40 years.
An optical neural network (ONN) is a physical implementation of an artificial neural network with optical components.
An image sensor or imager is a sensor that detects and conveys information used to form an image. It does so by converting the variable attenuation of light waves into signals, small bursts of current that convey the information. The waves can be light or other electromagnetic radiation. Image sensors are used in electronic imaging devices of both analog and digital types, which include digital cameras, camera modules, camera phones, optical mouse devices, medical imaging equipment, night vision equipment such as thermal imaging devices, radar, sonar, and others. As technology changes, electronic and digital imaging tends to replace chemical and analog imaging.
An active-pixel sensor (APS) is an image sensor, which was invented by Peter J.W. Noble in 1968, where each pixel sensor unit cell has a photodetector and one or more active transistors. In a metal–oxide–semiconductor (MOS) active-pixel sensor, MOS field-effect transistors (MOSFETs) are used as amplifiers. There are different types of APS, including the early NMOS APS and the now much more common complementary MOS (CMOS) APS, also known as the CMOS sensor. CMOS sensors are used in digital camera technologies such as cell phone cameras, web cameras, most modern digital pocket cameras, most digital single-lens reflex cameras (DSLRs), mirrorless interchangeable-lens cameras (MILCs), and lensless imaging for cells.
In digital imaging, a color filter array (CFA), or color filter mosaic (CFM), is a mosaic of tiny color filters placed over the pixel sensors of an image sensor to capture color information.
Spiking neural networks (SNNs) are artificial neural networks (ANN) that more closely mimic natural neural networks. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their operating model. The idea is that neurons in the SNN do not transmit information at each propagation cycle, but rather transmit information only when a membrane potential—an intrinsic quality of the neuron related to its membrane electrical charge—reaches a specific value, called the threshold. When the membrane potential reaches the threshold, the neuron fires, and generates a signal that travels to other neurons which, in turn, increase or decrease their potentials in response to this signal. A neuron model that fires at the moment of threshold crossing is also called a spiking neuron model.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
SpiNNaker is a massively parallel, manycore supercomputer architecture designed by the Advanced Processor Technologies Research Group (APT) at the Department of Computer Science, University of Manchester. It is composed of 57,600 processing nodes, each with 18 ARM9 processors and 128 MB of mobile DDR SDRAM, totalling 1,036,800 cores and over 7 TB of RAM. The computing platform is based on spiking neural networks, useful in simulating the human brain.
A cognitive computer is a computer that hardwires artificial intelligence and machine learning algorithms into an integrated circuit that closely reproduces the behavior of the human brain. It generally adopts a neuromorphic engineering approach. Synonyms include neuromorphic chip and cognitive chip.
Donhee Ham is the John A. and Elizabeth S. Armstrong Professor of Engineering and Applied Sciences at Harvard University and Fellow of Samsung Electronics.
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
In the domain of physics and probability, the filters, random fields, and maximum entropy (FRAME) model is a Markov random field model of stationary spatial processes, in which the energy function is the sum of translation-invariant potential functions that are one-dimensional non-linear transformations of linear filter responses. The FRAME model was originally developed by Song-Chun Zhu, Ying Nian Wu, and David Mumford for modeling stochastic texture patterns, such as grasses, tree leaves, brick walls, water waves, etc. This model is the maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where for each filter tuned to a specific scale and orientation, the marginal histogram is pooled over all the pixels in the image domain. The FRAME model is also proved to be equivalent to the micro-canonical ensemble, which was named the Julesz ensemble. Gibbs sampler is adopted to synthesize texture images by drawing samples from the FRAME model.
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.
Edoardo Charbon is a Swiss electrical engineer. He is a professor of quantum engineering at EPFL and the head of the Laboratory of Advanced Quantum Architecture (AQUA) at the School of Engineering.
Retinomorphic sensors are a type of event-driven optical sensor which produce a signal in response to changes in light intensity, rather than to light intensity itself. This is in contrast to conventional optical sensors such as charge coupled device (CCD) or complementary metal oxide semiconductor (CMOS) based sensors, which output a signal that increases with increasing light intensity. Because they respond to movement only, retinomorphic sensors are hoped to enable faster tracking of moving objects than conventional image sensors, and have potential applications in autonomous vehicles, robotics, and neuromorphic engineering.
Small object detection is a particular case of object detection where various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects having a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques under performed because of small objects.
Peter L. P. Dillon is an American physicist, and the inventor of integral color image sensors and single-chip color video cameras. The curator of the Technology Collection at the George Eastman Museum, Todd Gustavson, has stated that "the color sensor technology developed by Peter Dillon has revolutionized all forms of color photography. These color sensors are now ubiquitous in products such as smart phone cameras, digital cameras and camcorders, digital cinema cameras, medical cameras, automobile cameras, and drones". Dillon joined Kodak Research Labs in 1959 and retired from Kodak in 1991. He lives in Pittsford, New York.