Active vision

Active vision, sometimes also called active computer vision, is an area of computer vision. An active vision system is one that can manipulate the viewpoint of its camera(s) in order to investigate the environment and obtain better information from it. [1] [2] [3] [4]

Background

Interest in active camera systems began several decades ago. In the late 1980s, Aloimonos et al. introduced the first general framework for active vision, with the aim of improving the perceptual quality of tracking results. [3] Active vision is particularly important for coping with problems such as occlusions, limited field of view and limited camera resolution. [5] Other advantages include reducing the motion blur of a moving object [6] and enhancing depth perception by focusing two cameras on the same object or by moving the cameras. [3] Active control of the camera viewpoint also helps to focus computational resources on the relevant elements of the scene. [7] In this selective aspect, active vision is closely related to (overt and covert) visual attention in biological organisms, which has been shown to enhance the perception of selected parts of the visual field. This selective aspect of human (active) vision can be related to the foveal structure of the human eye, [8] [9] in which about 5% of the retina contains more than 50% of the colour receptors.
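To make the foveal idea concrete, the following is a minimal sketch (in Python with NumPy, not taken from the cited works) of foveated sampling on a log-polar grid, where sampling density falls off with eccentricity much as receptor density does in the retina. The function name, grid sizes and parameters are illustrative assumptions.

```python
# A minimal sketch of foveated sampling: pixel density falls off with
# eccentricity, mirroring how the retina concentrates receptors in the fovea.
# All names and parameters here are illustrative assumptions.
import numpy as np

def log_polar_samples(image, n_rings=32, n_wedges=64, r_min=1.0):
    """Sample an image on a log-polar grid centred on the image centre.

    Rings are spaced logarithmically, so the central (foveal) region is
    sampled far more densely than the periphery.
    """
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cy, cx)

    # Logarithmic radii: many samples near the centre, few near the border.
    radii = np.exp(np.linspace(np.log(r_min), np.log(r_max), n_rings))
    angles = np.linspace(0.0, 2.0 * np.pi, n_wedges, endpoint=False)

    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    ys = np.clip(np.round(cy + rr * np.sin(aa)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    return image[ys, xs]          # shape: (n_rings, n_wedges, ...)

# Usage: a 512x512 frame is reduced to 32x64 samples, most of them taken
# from the central few percent of the field of view.
frame = np.random.rand(512, 512, 3)
fovea_view = log_polar_samples(frame)
print(fovea_view.shape)           # (32, 64, 3)
```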

It has also been suggested that visual attention and the selective aspect of active camera control can help in other tasks, such as learning more robust models of objects and environments with fewer labelled samples, or even autonomously. [4] [10] [11]

Approaches

The autonomous camera approach

Autonomous cameras are cameras that can direct themselves within their environment, and several works have followed this approach. In the work of Denzler et al., the motion of a tracked object is modelled with a Kalman filter, and the focal length that minimizes the uncertainty of the state estimates is selected; a stereo set-up with two zoom cameras was used. A handful of papers address zoom control alone and do not deal with estimating the full object–camera position. An attempt to join estimation and control in the same framework can be found in the work of Bagdanov et al., where a pan-tilt-zoom camera is used to track faces. [12] Both the estimation and control models used there are ad hoc, and the estimation approach is based on image features rather than on 3-D properties of the tracked target. [13]
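The following sketch illustrates the general idea behind uncertainty-driven focal-length selection, in the spirit of the Denzler et al. approach but not their actual algorithm: a constant-velocity Kalman filter predicts the target state, and the candidate focal length with the smallest expected posterior covariance is chosen. The noise model linking focal length to measurement accuracy is an assumption made only for illustration.

```python
# A hedged sketch of information-driven zoom selection: predict the target
# state with a Kalman filter, then pick the candidate focal length whose
# expected posterior covariance is smallest. The link between focal length
# and measurement noise below is an illustrative assumption.
import numpy as np

dt = 1.0 / 30.0                                   # frame interval
F = np.array([[1, dt], [0, 1]])                   # constant-velocity model
H = np.array([[1.0, 0.0]])                        # we observe position only
Q = np.diag([1e-4, 1e-2])                         # process noise

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def expected_posterior_cov(P_pred, focal_length):
    # Assumption: image-plane noise is constant, so metric measurement noise
    # shrinks as focal length (magnification) grows.
    R = np.array([[(0.5 / focal_length) ** 2]])
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    return (np.eye(2) - K @ H) @ P_pred

def choose_focal_length(x, P, candidates=(20.0, 35.0, 50.0, 85.0)):
    x_pred, P_pred = predict(x, P)
    # Score each zoom setting by the trace of the expected posterior covariance.
    scores = {f: np.trace(expected_posterior_cov(P_pred, f)) for f in candidates}
    return min(scores, key=scores.get)

x0 = np.array([0.0, 0.1])                         # position, velocity
P0 = np.eye(2)
print(choose_focal_length(x0, P0))
```

In the cited work the criterion is information-theoretic and also accounts for the risk of the target leaving the narrower field of view at long focal lengths; that term is omitted here for brevity, which is why the longest focal length always wins in this toy example.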

The master/slave approach

In a master/slave configuration, a supervising static camera monitors a wide field of view and tracks every moving target of interest. The position of each target over time is then provided to a foveal (active) camera, which tries to observe the targets at a higher resolution. Both the static and the active cameras are calibrated to a common reference frame, so that data from one can easily be projected onto the other and the control of the active sensors can be coordinated. Another possible use of the master/slave approach is for a static (master) camera to extract visual features of an object of interest, while the active (slave) sensor uses these features to detect the desired object without the need for any training data. [13] [14]
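A minimal sketch of the hand-off implied by such a common reference frame is shown below. The ground-plane homography, camera placement and all numbers are placeholder assumptions, not values from the cited papers.

```python
# A minimal sketch of master/slave hand-off under a common reference frame.
# Assumptions (not from the cited papers): both cameras are calibrated to a
# shared ground plane, the master-to-ground homography H_master is known, and
# the slave is a pan-tilt unit at a known position. All numbers are placeholders.
import numpy as np

H_master = np.eye(3)                      # master image -> ground plane (placeholder)
slave_position = np.array([5.0, 0.0])     # slave camera location on the ground plane
slave_height = 3.0                        # metres above the plane

def master_pixel_to_ground(u, v):
    """Project a master-camera detection onto the common ground plane."""
    p = H_master @ np.array([u, v, 1.0])
    return p[:2] / p[2]

def ground_to_pan_tilt(target_xy):
    """Convert a ground-plane target position into slave pan/tilt angles."""
    d = target_xy - slave_position
    pan = np.arctan2(d[1], d[0])
    tilt = -np.arctan2(slave_height, np.linalg.norm(d))   # look down at the target
    return np.degrees(pan), np.degrees(tilt)

# The master tracks a target at pixel (320, 240); the slave is steered toward it.
pan, tilt = ground_to_pan_tilt(master_pixel_to_ground(320, 240))
print(f"pan {pan:.1f} deg, tilt {tilt:.1f} deg")
```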

The active camera network approach

In recent years there has been growing interest in building networks of active cameras, possibly combined with static cameras, so that a large area can be covered while maintaining high resolution on multiple targets. This is ultimately a scaled-up version of either the master/slave approach or the autonomous camera approach. It can be highly effective but also costly: not only are multiple cameras involved, but they must also communicate with one another, which can be computationally expensive. [13] [14]
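One concrete problem such a network must solve is deciding which active camera should follow which target. The sketch below shows a simple greedy assignment by steering cost; it is an illustrative baseline for exposition, not a method from the cited papers.

```python
# A hedged sketch of one scheduling problem in a network of active cameras:
# deciding which camera follows which target. Greedy assignment by distance
# is an illustrative baseline, not a method from the cited papers.
import numpy as np

camera_positions = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
target_positions = np.array([[2.0, 3.0], [9.0, 1.0]])

def greedy_assignment(cameras, targets):
    """Assign each target to the closest still-unassigned camera."""
    assignment, used = {}, set()
    for t_idx, t in enumerate(targets):
        costs = np.linalg.norm(cameras - t, axis=1)
        for c_idx in np.argsort(costs):
            if int(c_idx) not in used:
                assignment[t_idx] = int(c_idx)
                used.add(int(c_idx))
                break
    return assignment

print(greedy_assignment(camera_positions, target_positions))
# e.g. {0: 0, 1: 1}: camera 0 tracks target 0, camera 1 tracks target 1,
# leaving camera 2 free for wide-area coverage.
```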

Controlled active vision framework

Controlled active vision can be defined as the controlled motion of a vision sensor that maximizes the performance of a robotic algorithm involving that moving sensor. It is a hybrid of control theory and conventional vision. One application of this framework is real-time robotic servoing around static or moving arbitrary 3-D objects (see visual servoing). Algorithms that use multiple windows and numerically stable confidence measures are combined with stochastic controllers to provide a satisfactory solution to the tracking problem that arises when computer vision and control are combined. When only an inaccurate model of the environment is available, adaptive control techniques may be introduced. Further details and mathematical formulations of controlled active vision can be found in the thesis of Nikolaos Papanikolopoulos. [15]
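To make the control-theoretic side concrete, the sketch below implements the classic image-based visual servoing law v = -λ L⁺ e for point features. This is a generic textbook formulation rather than the specific controller of the cited thesis, and the feature positions and depths are placeholder assumptions.

```python
# A minimal image-based visual servoing sketch (the classic v = -lambda * L^+ * e
# law), included to make the control-theory side of the framework concrete.
# It is a generic textbook formulation, not the controller of the cited thesis;
# the feature values and depths below are placeholder assumptions.
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction (image Jacobian) matrix of a normalised image point (x, y)
    at depth Z, relating camera velocity to the point's image velocity."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,     -(1 + x * x),  y],
        [0.0,      -1.0 / Z, y / Z, 1 + y * y, -x * y,       -x],
    ])

def ibvs_velocity(features, desired, depths, gain=0.5):
    """Camera velocity screw (vx, vy, vz, wx, wy, wz) driving the features
    toward their desired image positions."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    error = (np.asarray(features) - np.asarray(desired)).ravel()
    return -gain * np.linalg.pinv(L) @ error

# Four tracked points, slightly offset from where we want them in the image.
current = [(0.11, 0.10), (-0.09, 0.10), (-0.10, -0.11), (0.10, -0.10)]
desired = [(0.10, 0.10), (-0.10, 0.10), (-0.10, -0.10), (0.10, -0.10)]
depths = [1.0, 1.0, 1.0, 1.0]
print(ibvs_velocity(current, desired, depths))
```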

Examples

Examples of active vision systems usually involve a robot-mounted camera, [16] but other systems have employed human operator-mounted cameras (a.k.a. "wearables"). [17] Applications include automatic surveillance, human-robot interaction (video), [18] [19] SLAM, route planning, [20] and more. In the DARPA Grand Challenge, most of the teams used LIDAR combined with active vision systems to guide driverless vehicles across an off-road course.

An example of active vision can be seen in this YouTube video, which shows face tracking with a pan-tilt camera system: https://www.youtube.com/watch?v=N0FjDOTnmm0

Active vision is also important for understanding how humans [8] [21] and other organisms endowed with visual sensors actually see the world, given the limits of their sensors, the richness and continuous variability of the visual signal, and the effects of their actions and goals on their perception. [7] [22] [23]

The controlled active vision framework can be applied in a number of ways, for example to vehicle tracking, robotics applications, [24] and interactive MRI segmentation. [25]

Interactive MRI segmentation uses controlled active vision by employing a Lyapunov control design to establish a balance between the influence of a data-driven gradient flow and the human's input over time, smoothly coupling automatic segmentation with interactivity. [25] Segmentation of MRIs is difficult because the scan picks up all fluid and tissue, so tracing out the desired segments normally requires an expert and can be very time-consuming. The controlled active vision methods described in the cited paper can speed up this process while relying less on the human operator.
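The sketch below is a schematic of the blending idea only: the segmentation evolves under an automatic data-driven term plus a user-correction term whose weight is scheduled over time. The update rule and weighting are illustrative assumptions and do not reproduce the Lyapunov controller of the cited paper.

```python
# A schematic sketch of blending automatic segmentation with user input: the
# level set evolves under a data-driven term plus a user term whose influence
# is scheduled over time. The update rule and weighting are illustrative
# assumptions, not the Lyapunov controller of the cited paper.
import numpy as np

def evolve_segmentation(phi, image, user_input, steps=100, dt=0.1):
    """phi: level-set function; user_input: signed corrections (+ inside,
    - outside), zero where the user has not clicked."""
    for k in range(steps):
        # Data-driven term: push phi toward a simple two-region (Chan-Vese-like) fit.
        inside, outside = phi > 0, phi <= 0
        c_in = image[inside].mean() if inside.any() else 0.0
        c_out = image[outside].mean() if outside.any() else 0.0
        data_flow = (image - c_out) ** 2 - (image - c_in) ** 2

        # User term: its weight decays as the automatic flow takes over.
        user_weight = np.exp(-0.05 * k)
        phi = phi + dt * (data_flow + user_weight * user_input)
    return phi

# Toy 2-D "scan": a bright square on a dark background, with one user click inside it.
image = np.zeros((64, 64)); image[20:44, 20:44] = 1.0
phi = np.random.randn(64, 64) * 0.1
user_input = np.zeros_like(image); user_input[32, 32] = 5.0
segmentation = evolve_segmentation(phi, image, user_input) > 0
print(segmentation.sum(), "pixels labelled foreground")
```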

References

  1. http://axiom.anu.edu.au/~rsl/rsl_active.html
  2. Ballard, Dana H. (1991). "Animate vision". Artificial Intelligence. 48: 57–86. doi:10.1016/0004-3702(91)90080-4.
  3. Aloimonos, John; Weiss, Isaac; Bandyopadhyay, Amit (1988). "Active vision". International Journal of Computer Vision. 1 (4): 333–356. doi:10.1007/BF00133571. S2CID 25458585.
  4. Ognibene, Dimitri; Baldassare, Gianluca (2015). "Ecological Active Vision: Four Bioinspired Principles to Integrate Bottom–Up and Adaptive Top–Down Attention Tested with a Simple Camera-Arm Robot". IEEE Transactions on Autonomous Mental Development. 7: 3–25. doi:10.1109/TAMD.2014.2341351. hdl:10281/301362.
  5. Denzler; Zobel; Niemann (2003). "Information theoretic focal length selection for real-time active 3D object tracking". Proceedings Ninth IEEE International Conference on Computer Vision. pp. 400–407, vol. 1. CiteSeerX 10.1.1.122.1594. doi:10.1109/ICCV.2003.1238372. ISBN 978-0-7695-1950-0. S2CID 17622133.
  6. Rivlin, Ehud; Rotstein, Héctor (2000). "Control of a Camera for Active Vision: Foveal Vision, Smooth Tracking and Saccade". International Journal of Computer Vision. 39 (2): 81–96. doi:10.1023/A:1008166825510. S2CID 8737891.
  7. Tatler, B. W.; Hayhoe, M. M.; Land, M. F.; Ballard, D. H. (2011). "Eye guidance in natural vision: Reinterpreting salience". Journal of Vision. 11 (5): 5. doi:10.1167/11.5.5. PMC 3134223. PMID 21622729.
  8. Findlay, J. M.; Gilchrist, I. D. (2003). Active Vision: The Psychology of Looking and Seeing. Oxford University Press.
  9. Tistarelli, M.; Sandini, G. (1993). "On the advantages of polar and log-polar mapping for direct estimation of time-to-impact from optical flow". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (4): 401–410. CiteSeerX 10.1.1.49.9595. doi:10.1109/34.206959.
  10. Walther, Dirk; Rutishauser, Ueli; Koch, Christof; Perona, Pietro (2005). "Selective visual attention enables learning and recognition of multiple objects in cluttered scenes" (PDF). Computer Vision and Image Understanding. 100 (1–2): 41–63. CiteSeerX 10.1.1.110.976. doi:10.1016/j.cviu.2004.09.004.
  11. Larochelle, H.; Hinton, G. (2010). "Learning to combine foveal glimpses with a third-order Boltzmann machine" (PDF). Proceedings of the 23rd International Conference on Neural Information Processing Systems. Vol. 1. pp. 1243–1251.
  12. Bagdanov, A. D.; Del Bimbo, A.; Nunziati, W. (2006). "Improving evidential quality of surveillance imagery through active face tracking". 18th International Conference on Pattern Recognition (ICPR'06). pp. 1200–1203. doi:10.1109/ICPR.2006.700. ISBN 978-0-7695-2521-1. S2CID 2273696.
  13. Al Haj, Murad; Fernández, Carles; Xiong, Zhanwu; Huerta, Ivan; Gonzàlez, Jordi; Roca, Xavier (2011). "Beyond the Static Camera: Issues and Trends in Active Vision". Visual Analysis of Humans. pp. 11–30. doi:10.1007/978-0-85729-997-0_2. ISBN 978-0-85729-996-3.
  14. Bellotto, Nicola; Benfold, Ben; Harland, Hanno; Nagel, Hans-Hellmut; Pirlo, Nicola; Reid, Ian; Sommerlade, Eric; Zhao, Chuan (2012). "Cognitive visual tracking and camera control" (PDF). Computer Vision and Image Understanding. 116 (3): 457–471. doi:10.1016/j.cviu.2011.09.011. S2CID 4937663.
  15. Papanikolopoulos, Nikolaos Panagiotis (1992). Controlled Active Vision (PhD thesis). Carnegie Mellon University.
  16. Mak, Lin Chi; Furukawa, Tomonari; Whitty, Mark (2008). "A localisation system for an indoor rotary-wing MAV using blade mounted LEDs". Sensor Review. 28 (2): 125–131. doi:10.1108/02602280810856688.
  17. Clemente, L. A.; Davison, A. J.; Reid, I. D.; Neira, J.; Tardós, J. D. (2007). "Mapping Large Loops with a Single Hand-Held Camera". Robotics: Science and Systems.
  18. Demiris, Yiannis; Khadhouri, Bassam (2006). "Hierarchical attentive multiple models for execution and recognition of actions". Robotics and Autonomous Systems. 54 (5): 361–369. CiteSeerX 10.1.1.226.5282. doi:10.1016/j.robot.2006.02.003.
  19. Ognibene, D.; Demiris, Y. (2013). "Towards active event recognition". Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013).
  20. http://www.surrey.ac.uk/eng/research/mechatronics/robots/Activities/ActiveVision/activevis.html Archived August 17, 2007, at the Wayback Machine
  21. Land, Michael F. (2006). "Eye movements and the control of actions in everyday life" (PDF). Progress in Retinal and Eye Research. 25 (3): 296–324. doi:10.1016/j.preteyeres.2006.01.002. PMID 16516530. S2CID 18946141.
  22. Lungarella, Max; Sporns, Olaf (2006). "Mapping Information Flow in Sensorimotor Networks". PLOS Computational Biology. 2 (10): e144. Bibcode:2006PLSCB...2..144L. doi:10.1371/journal.pcbi.0020144. PMC 1626158. PMID 17069456.
  23. Verschure, Paul F. M. J.; Voegtlin, Thomas; Douglas, Rodney J. (2003). "Environmentally mediated synergy between perception and behaviour in mobile robots". Nature. 425 (6958): 620–624. Bibcode:2003Natur.425..620V. doi:10.1038/nature02024. PMID 14534588. S2CID 4418697.
  24. Smith, C. E.; Papanikolopoulos, N. P.; Brandt, S. A. (1994). "Application of the controlled active vision framework to robotic and transportation problems". Proceedings of 1994 IEEE Workshop on Applications of Computer Vision. pp. 213–220. CiteSeerX 10.1.1.40.3470. doi:10.1109/ACV.1994.341311. ISBN 978-0-8186-6410-6. S2CID 9735967.
  25. Karasev, Peter; Kolesov, Ivan; Chudy, Karol; Tannenbaum, Allen; Muller, Grant; Xerogeanes, John (2011). "Interactive MRI segmentation with controlled active vision". 2011 50th IEEE Conference on Decision and Control and European Control Conference. pp. 2293–2298. doi:10.1109/CDC.2011.6161453. ISBN 978-1-61284-801-3. PMC 3935399. PMID 24584213.