Visual servoing, also known as vision-based robot control and abbreviated VS, is a technique that uses feedback information extracted from a vision sensor (visual feedback [1]) to control the motion of a robot. One of the earliest papers on visual servoing came from the SRI International Labs in 1979. [2]
There are two fundamental configurations of the robot end-effector (hand) and the camera: [4]
Visual Servoing control techniques are broadly classified into the following types: [5] [6]
IBVS was proposed by Weiss and Sanderson. [7] The control law is based on the error between current and desired features on the image plane, and does not involve any estimate of the pose of the target. The features may be the coordinates of visual features, lines or moments of regions. IBVS has difficulties [8] with motions involving very large rotations, a problem that has come to be called camera retreat. [9]
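As a rough illustration of a control law driven purely by image-plane error, the sketch below uses the classical proportional form v = -λ L⁺ (s - s*) for point features in normalized image coordinates. The function names, the gain lam and the assumption that the depth Z of every point is known are illustrative choices, not taken from the works cited above.

```python
import numpy as np


def point_interaction_matrix(x, y, Z):
    """2x6 interaction matrix of one normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])


def ibvs_velocity(s, s_star, depths, lam=0.5):
    """Camera velocity screw (vx, vy, vz, wx, wy, wz) from the feature error.

    s, s_star: current and desired point features, one (x, y) pair per point.
    depths:    assumed depth of each point (hypothetical input; a real system
               would have to estimate or approximate it).
    """
    s = np.asarray(s, dtype=float).reshape(-1, 2)
    s_star = np.asarray(s_star, dtype=float).reshape(-1, 2)
    # Stack one 2x6 block per point, evaluated at the current features.
    L = np.vstack([point_interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(s, depths)])
    error = (s - s_star).ravel()
    # Proportional law: drive the image error towards zero.
    return -lam * np.linalg.pinv(L) @ error


if __name__ == "__main__":
    desired = [[-0.2, -0.2], [0.2, -0.2], [0.2, 0.2], [-0.2, 0.2]]
    current = [[x + 0.1, y] for x, y in desired]  # features shifted along x
    print(ibvs_velocity(current, desired, depths=[1.0] * 4))
```

No pose of the target is computed anywhere in this sketch; only the feature error and the (assumed) depths enter the command.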
PBVS is a model-based technique (with a single camera): the pose of the object of interest is estimated with respect to the camera, and a command is then issued to the robot controller, which in turn controls the robot. In this case the image features are extracted as well, but are additionally used to estimate 3D information (the pose of the object in Cartesian space); hence it is servoing in 3D.
Hybrid approaches use some combination of 2D and 3D servoing. There have been a few different approaches to hybrid servoing.
The following description of the prior work is divided into 3 parts
Visual servo systems, also called servoing, have been around since the early 1980s, [11] although the term visual servo itself was only coined in 1987. [4] [5] [6] Visual servoing is, in essence, a method for robot control where the sensor used is a camera (visual sensor). Servoing consists primarily of two techniques. [6] The first uses information from the image to directly control the degrees of freedom (DOF) of the robot, and is thus referred to as Image Based Visual Servoing (IBVS). The second involves the geometric interpretation of the information extracted from the camera, such as estimating the pose of the target and the parameters of the camera (assuming some basic model of the target is known); this is position-based visual servoing (PBVS). Other servoing classifications exist, based on variations in each component of a servoing system. [5] Based on the location of the camera, the two kinds are eye-in-hand and hand–eye configurations. Based on the control loop, the two kinds are end-point-open-loop and end-point-closed-loop. Based on whether the control is applied to the joints (or DOF) directly or as a position command to a robot controller, the two types are direct servoing and dynamic look-and-move. In one of the earliest works, [12] the authors proposed a hierarchical visual servo scheme applied to image-based servoing. The technique relies on the assumption that a good set of features can be extracted from the object of interest (e.g. edges, corners and centroids) and used as a partial model along with global models of the scene and robot. The control strategy is applied to a simulation of a two and three DOF robot arm.
Feddema et al. [13] introduced the idea of generating the task trajectory with respect to the feature velocity. This ensures that the sensors are not rendered ineffective (stopping the feedback) by any of the robot motions. The authors assume that the objects are known a priori (e.g. from a CAD model) and that all the features can be extracted from the object. The work by Espiau et al. [14] discusses some of the basic questions in visual servoing. The discussion concentrates on modeling of the interaction matrix, the camera, and the visual features (points, lines, etc.). In [15] an adaptive servoing system was proposed with a look-and-move servoing architecture. The method used optical flow along with SSD (sum of squared differences) to provide a confidence metric, and a stochastic controller with Kalman filtering for the control scheme. The system assumes (in the examples) that the plane of the camera and the plane of the features are parallel. [16] discusses an approach to velocity control using the Jacobian relationship ṡ = Jv. In addition, the author uses Kalman filtering, assuming that the extracted position of the target has inherent errors (sensor errors). A model of the target velocity is developed and used as a feed-forward input in the control loop. The author also mentions the importance of looking into kinematic discrepancy, dynamic effects, repeatability, settling time oscillations and lag in response.
Corke [17] poses a set of very critical questions on visual servoing and tries to elaborate on their implications. The paper primarily focuses on the dynamics of visual servoing. The author tries to address problems like lag and stability, while also discussing feed-forward paths in the control loop. The paper also seeks justification for trajectory generation, the methodology of axis control and the development of performance metrics.
Chaumette in [18] provides good insight into the two major problems with IBVS: first, servoing to a local minimum, and second, reaching a Jacobian singularity. The author shows that image points alone do not make good features, due to the occurrence of singularities. The paper continues by discussing possible additional checks to prevent singularities, namely the condition numbers of J_s and Ĵ_s^+, and checking the null space of Ĵ_s and J_s^T. One main point that the author highlights is the relation between local minima and unrealizable image feature motions.
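The kind of numerical check mentioned above can be sketched as follows, assuming point features with known depths and the standard point interaction matrix. The helper names and the tolerance are hypothetical, and this is not the specific procedure of [18]; it only shows how the condition number and the two null-space dimensions can be read off a singular value decomposition.

```python
import numpy as np


def point_interaction_matrix(x, y, Z):
    """2x6 interaction matrix of one normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])


def stacked_jacobian(points, depths):
    """Stack the per-point blocks into a (2N)x6 image Jacobian."""
    return np.vstack([point_interaction_matrix(x, y, Z)
                      for (x, y), Z in zip(points, depths)])


def conditioning_report(J, tol=1e-9):
    """Rank, condition number and null-space dimensions of J and J^T."""
    sv = np.linalg.svd(J, compute_uv=False)  # sorted in descending order
    rank = int(np.sum(sv > tol * sv[0]))
    return {
        "rank": rank,
        # Ratio of largest to smallest singular value; inf if exactly singular.
        "cond": float(sv[0] / sv[-1]) if sv[-1] > 0 else float("inf"),
        # Camera motions that produce no feature motion.
        "null_dim": J.shape[1] - rank,
        # Feature errors that the pseudo-inverse maps to a zero command.
        "left_null_dim": J.shape[0] - rank,
    }


if __name__ == "__main__":
    square = [(-0.2, -0.2), (0.2, -0.2), (0.2, 0.2), (-0.2, 0.2)]
    print(conditioning_report(stacked_jacobian(square, [1.0] * 4)))
    print(conditioning_report(stacked_jacobian(square[:2], [1.0] * 2)))
```

A nonzero left null space means there are image errors for which the computed velocity is zero even though the error is not, which is one way to see the link between local minima and unrealizable feature motions.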
Over the years many hybrid techniques have been developed. [4] These involve computing a partial or complete pose from epipolar geometry using multiple views or multiple cameras. The values are obtained by direct estimation or through a learning or statistical scheme. Others have used a switching approach that changes between image-based and position-based servoing based on a Lyapunov function. [4] The early hybrid techniques that used a combination of image-based and pose-based (2D and 3D information) approaches for servoing required either a full or partial model of the object in order to extract the pose information, and used a variety of techniques to extract the motion information from the image. [19] used an affine motion model from the image motion, in addition to a rough polyhedral CAD model, to extract the object pose with respect to the camera in order to servo onto the object (along the lines of PBVS).
2-1/2-D visual servoing, developed by Malis et al., [20] is a well-known technique that organizes the information required for servoing in a way that decouples rotations and translations. The papers assume that the desired pose is known a priori. The rotational information is obtained from partial pose estimation, a homography (essentially 3D information), giving an axis of rotation and the angle (by computing the eigenvalues and eigenvectors of the homography). The translational information is obtained from the image directly by tracking a set of feature points. The only conditions are that the feature points being tracked never leave the field of view and that a depth estimate be predetermined by some off-line technique. 2-1/2-D servoing has been shown to be more stable than the techniques that preceded it. Another interesting observation about this formulation is the authors' claim that the visual Jacobian will have no singularities during the motions. The hybrid technique developed by Corke and Hutchinson, [21] [22] popularly called the partitioned approach, partitions the visual (or image) Jacobian into motions (both rotations and translations) relating to the X and Y axes and motions related to the Z axis. [22] outlines the technique for breaking out the columns of the visual Jacobian that correspond to the Z axis translation and rotation (namely, the third and sixth columns). The partitioned approach is shown to handle the Chaumette Conundrum discussed in [23]. This technique requires a good depth estimate in order to function properly. [24] outlines a hybrid approach where the servoing task is split into two, namely main and secondary. The main task is to keep the features of interest within the field of view, while the secondary task is to mark a fixation point and use it as a reference to bring the camera to the desired pose. The technique does need a depth estimate from an off-line procedure.
The paper discusses two examples for which depth estimates are obtained from robot odometry and by assuming that all features are on a plane. The secondary task is achieved by using the notion of parallax. The features that are tracked, typically points, are chosen by an initialization performed on the first frame. [25] discusses two aspects of visual servoing: feature modeling and model-based tracking. The primary assumption is that the 3D model of the object is available. The authors highlight the notion that ideal features should be chosen such that the DOF of motion can be decoupled by a linear relation. The authors also introduce an estimate of the target velocity into the interaction matrix to improve tracking performance. The results are compared to well-known servoing techniques even when occlusions occur.
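The column split used by the partitioned approach of Corke and Hutchinson described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Z-axis velocities v_z are taken as given (in practice they would come from a separate controller), and the function name and argument names are hypothetical.

```python
import numpy as np

# Column indices in the velocity screw (vx, vy, vz, wx, wy, wz).
Z_COLUMNS = [2, 5]         # third and sixth columns: Z translation and Z rotation
XY_COLUMNS = [0, 1, 3, 4]  # remaining X and Y translations and rotations


def partitioned_xy_velocity(J, s_dot_desired, v_z):
    """Solve s_dot = J_xy v_xy + J_z v_z for v_xy, given the Z-axis part v_z."""
    J_xy = J[:, XY_COLUMNS]
    J_z = J[:, Z_COLUMNS]
    # Remove the feature motion already explained by the Z-axis motion,
    # then solve the remainder in the least-squares sense.
    return np.linalg.pinv(J_xy) @ (s_dot_desired - J_z @ v_z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((8, 6))        # stand-in for an image Jacobian
    v_true = rng.standard_normal(6)
    s_dot = J @ v_true
    v_xy = partitioned_xy_velocity(J, s_dot, v_true[Z_COLUMNS])
    print(np.allclose(v_xy, v_true[XY_COLUMNS]))
```

Because the feature velocity is linear in the velocity screw, the split is exact: the two partial products always sum to the full product J v.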
This section discusses the work done in the field of visual servoing on the use of features, tracking the various techniques that have been proposed. Most of the work has used image points as visual features. The formulation of the interaction matrix in [5] assumes that points in the image are used to represent the target. There is some body of work that deviates from the use of points and uses feature regions, lines, image moments and moment invariants. [26] In [27] the authors discuss affine-based tracking of image features. The image features are chosen based on a discrepancy measure, which is based on the deformation that the features undergo. The features used were texture patches. One of the key points of the paper was that it highlighted the need to look at features for improving visual servoing. In [28] the authors look into the choice of image features (the same question was also discussed in [5] in the context of tracking). The effect of the choice of image features on the control law is discussed with respect to the depth axis only. The authors consider the distance between feature points and the area of an object as features. These features are used in the control law in slightly different forms to highlight the effects on performance. It was noted that better performance was achieved when the servo error was proportional to the change in the depth axis. [29] provides one of the early discussions of the use of moments. The authors provide a new, albeit complicated, formulation of the interaction matrix using the velocity of the moments in the image. Even though moments are used, they are moments of the small change in the location of contour points, obtained with the use of Green's theorem. The paper also tries to determine the set of features (on a plane) for a 6 DOF robot. [30] discusses the use of image moments to formulate the visual Jacobian. This formulation allows for decoupling of the DOF based on the type of moments chosen.
The simple case of this formulation is notionally similar to 2-1/2-D servoing. [30] The time variation of the moments (ṁ_ij) is determined using the motion between two images and Green's theorem. The relation between ṁ_ij and the velocity screw (v) is given as ṁ_ij = L_{m_ij} v. This technique avoids camera calibration by assuming that the objects are planar and using a depth estimate. The technique works well in the planar case but tends to be complicated in the general case. The basic idea is based on the work in [4]. Moment invariants have been used in [31]. The key idea is to find the feature vector that decouples all the DOF of motion. Some observations made were that centralized moments are invariant for 2D translations. A complicated polynomial form is developed for 2D rotations. The technique follows teaching-by-showing, hence requiring the values of the desired depth and area of the object (assuming that the plane of the camera and the object are parallel, and the object is planar). Other parts of the feature vector are the invariants R3 and R4. The authors claim that occlusions can be handled. [32] and [33] build on the work described in [29] [31] [32]. The major difference is that the authors use a technique similar to [16], where the task is broken into two (in the case where the features are not parallel to the camera plane). A virtual rotation is performed to bring the features parallel to the camera plane. [34] consolidates the work done by the authors on image moments.
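The observation that centralized moments are invariant to 2D translations can be checked numerically. The sketch below computes raw and centralized moments of a binary image by direct summation over pixels; the function names and the toy image are illustrative assumptions, and no interaction matrix is derived here.

```python
import numpy as np


def raw_moment(image, i, j):
    """Raw moment m_ij = sum over pixels of x^i * y^j * I(x, y)."""
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    return float(np.sum((xs ** i) * (ys ** j) * image))


def central_moment(image, i, j):
    """Centralized moment mu_ij, taken about the centroid (x_g, y_g)."""
    m00 = raw_moment(image, 0, 0)
    x_g = raw_moment(image, 1, 0) / m00
    y_g = raw_moment(image, 0, 1) / m00
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    return float(np.sum(((xs - x_g) ** i) * ((ys - y_g) ** j) * image))


if __name__ == "__main__":
    image = np.zeros((20, 20))
    image[5:9, 3:10] = 1.0  # a 4x7 rectangular blob
    # Translate the blob by 4 rows and 6 columns; it stays inside the frame.
    shifted = np.roll(image, shift=(4, 6), axis=(0, 1))
    for i, j in [(2, 0), (1, 1), (0, 2), (3, 0)]:
        print((i, j), central_moment(image, i, j), central_moment(shifted, i, j))
```

The raw moments m_10 and m_01 change with the shift, while m_00 (the area) and the centralized moments do not, which is what makes the latter candidates for features decoupled from in-plane translation.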
Espiau in [35] showed from purely experimental work that image based visual servoing (IBVS) is robust to calibration errors. The author used a camera with no explicit calibration along with point matching and without pose estimation. The paper looks at the effect of errors and uncertainty on the terms in the interaction matrix from an experimental approach. The targets used were points and were assumed to be planar.
A similar study was done in [36], where the authors carry out an experimental evaluation of a few uncalibrated visual servo systems that were popular in the 1990s. The major outcome was experimental evidence of the effectiveness of visual servo control over conventional control methods. Kyrki et al. [37] analyze servoing errors for position-based and 2-1/2-D visual servoing. The technique involves determining the error in extracting image position and propagating it to pose estimation and servoing control. Points from the image are mapped to points in the world a priori to obtain a mapping (which is basically the homography, although this is not explicitly stated in the paper). This mapping is broken down into pure rotations and translations. Pose estimation is performed using standard techniques from computer vision. Pixel errors are transformed to the pose and then propagated to the controller. The analysis shows that errors in the image plane are proportional to the depth, and that the error in the depth axis is proportional to the square of the depth. Measurement errors in visual servoing have been looked into extensively. Most error functions relate to two aspects of visual servoing: the steady-state error (once servoed) and the stability of the control loop. Other servoing errors that have been of interest are those that arise from pose estimation and camera calibration. In [38] the authors extend the work done in [39] by considering global stability in the presence of intrinsic and extrinsic calibration errors. [40] provides an approach to bound the task function tracking error. In [41] the authors use a teaching-by-showing visual servoing technique, where the desired pose is known a priori and the robot is moved from a given pose. The main aim of the paper is to determine the upper bound on the positioning error due to image noise using a convex-optimization technique.
[42] provides a discussion on stability analysis with respect to the uncertainty in depth estimates. The authors conclude the paper with the observation that for unknown target geometry a more accurate depth estimate is required in order to limit the error. Many of the visual servoing techniques [21] [22] [43] implicitly assume that only one object is present in the image and that the relevant feature for tracking, along with the area of the object, is available. Most techniques require either a partial pose estimate or a precise depth estimate of the current and desired pose.