Stixel

Top: Grayscale input image with stixels superimposed on it, with colour denoting depth (red denoting closer, blue denoting farther). Bottom: Dense disparity map, with brighter intensity denoting higher disparity (lower depth), darker intensity denoting lower disparity (higher depth), and black denoting invalid disparity.

In computer vision, a stixel (portmanteau of "stick" and "pixel") is a superpixel representation of depth information in an image, in the form of a vertical stick that approximates the closest obstacles within a certain vertical slice of the scene. Introduced in 2009,[1] stixels have applications in robotic navigation and advanced driver-assistance systems, where they can be used to define a representation of robotic environments and traffic scenes with a medium level of abstraction.[2][3]


Definition

One of the problems of scene understanding in computer vision is determining the horizontal freespace around the camera, in which the agent can move, and the vertical obstacles delimiting it. An image can be paired with depth information (produced e.g. from stereo disparity, lidar, or monocular depth estimation), allowing a dense three-dimensional reconstruction of the observed scene. One drawback of dense reconstruction is the large amount of data involved, since each pixel in the image is mapped to an element of a point cloud. Vision problems characterised by planar freespace delimited by mostly vertical obstacles, such as traffic scenes or robotic navigation, can benefit from a condensed representation that saves memory and processing time.
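
As a rough illustration of the data volume involved, the following sketch back-projects a dense depth map into a point cloud under an ideal pinhole camera model; the function name and the intrinsic parameters (fx, fy, cx, cy) are illustrative placeholders, not part of any stixel reference implementation.

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        """Map every pixel with valid depth to a 3D point in the camera frame."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
        valid = depth > 0                               # 0 marks invalid depth here
        z = depth[valid]
        x = (u[valid] - cx) * z / fx                    # pinhole back-projection
        y = (v[valid] - cy) * z / fy
        return np.stack([x, y, z], axis=1)              # one 3D point per valid pixel

    # A 480x640 depth map already yields up to 307,200 points:
    cloud = depth_to_point_cloud(np.full((480, 640), 10.0), 500.0, 500.0, 320.0, 240.0)
    print(cloud.shape)  # (307200, 3)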

Stixels are thin vertical rectangles, each representing a slice of a vertical surface belonging to the closest obstacle in the observed scene. They dramatically reduce the amount of information needed to represent a scene in such problems. A stixel is characterised by three parameters: the vertical coordinate of its bottom, the height of the stick, and its depth. Stixels have fixed width, with each stixel spanning a certain number of image columns, which allows downsampling of the horizontal image resolution. In the original formulation each column of the image contains at most one stixel; later extensions allow multiple stixels per column, making it possible to represent multiple objects at different distances.[4]
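
A minimal sketch of this parameterisation (the field names and the 5-pixel width are illustrative choices, not taken from the original papers):

    from dataclasses import dataclass

    STIXEL_WIDTH = 5  # image columns per stixel (assumed value)

    @dataclass
    class Stixel:
        column: int   # horizontal position, in stixel-width units
        bottom: int   # image row of the base point on the freespace boundary
        height: int   # vertical extent in rows; the top row is bottom - height
        depth: float  # distance to the obstacle surface, e.g. in metres

With one stixel per column, a 640-pixel-wide image is then described by at most 128 such records, compared with hundreds of thousands of point-cloud entries for a dense reconstruction.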

Stixel estimation

The input to stixel estimation is a dense depth map, which can be computed from stereo disparity or by other means. The original approach computes an occupancy grid that is segmented to estimate the freespace, with dynamic programming providing an efficient method to find an optimal segmentation.[5] Alternative approaches, such as manifold-based methods, can be used instead of occupancy grid mapping.[6]
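
The following is a generic dynamic-programming sketch of this segmentation step, assuming a precomputed table cost[v, u] giving the cost of placing the freespace boundary at row v of column u (derived, for instance, from an occupancy grid); the cost model and the linear jump penalty are assumptions for illustration, not the exact terms of the original method.

    import numpy as np

    def freespace_boundary(cost, jump_penalty=1.0):
        """Return one boundary row per column, minimising cost plus smoothness."""
        h, w = cost.shape
        acc = cost.copy()                   # accumulated cost table
        back = np.zeros((h, w), dtype=int)  # backpointers for path recovery
        rows = np.arange(h)
        for u in range(1, w):
            # transition cost from row v' in column u-1 to row v in column u
            trans = acc[:, u - 1][None, :] + jump_penalty * np.abs(rows[:, None] - rows[None, :])
            back[:, u] = np.argmin(trans, axis=1)
            acc[:, u] += trans[rows, back[:, u]]
        boundary = np.empty(w, dtype=int)   # recover the optimal path right to left
        boundary[-1] = int(np.argmin(acc[:, -1]))
        for u in range(w - 2, -1, -1):
            boundary[u] = back[boundary[u + 1], u + 1]
        return boundary  # boundary[u] = base row of the closest obstacle in column u

Each column costs O(h²) work in this vectorised form; the appeal of dynamic programming is that the globally optimal boundary is found in a single left-to-right pass rather than by enumerating all possible paths.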

The freespace boundary provides the base points of the obstacles at the closest longitudinal distance; however, multiple objects at different distances might appear in each column of the image. To fully define the obstacles, their height must be estimated, and this is accomplished by segmenting the depth of the object from the depth of the background. A membership function over the pixels can be defined based on the depth value, where the membership represents the confidence that a pixel belongs to the closest vertical obstacle rather than to the background, and a cut separating the obstacles from the background can again be computed efficiently with dynamic programming.
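
A per-column sketch of this idea follows; the Gaussian-shaped membership function and its scale parameter are assumptions chosen for illustration, and a full implementation would again couple neighbouring columns with dynamic programming, as for the freespace.

    import numpy as np

    def column_top(depth_col, base_row, obstacle_depth, scale=1.0):
        """Find the top row of the obstacle whose base is at base_row."""
        # membership in [-1, 1]: +1 where depth matches the obstacle, -1 for background
        m = 2.0 * np.exp(-((depth_col - obstacle_depth) / scale) ** 2) - 1.0
        best_top, best_score = base_row, -np.inf
        for top in range(base_row + 1):  # candidate upper end of the obstacle
            score = m[top:base_row + 1].sum() - m[:top].sum()
            if score > best_score:
                best_top, best_score = top, score
        return best_top                  # rows top..base_row belong to the obstacle

    col = np.full(240, 30.0)   # background at 30 m
    col[120:201] = 10.0        # obstacle occupying rows 120..200
    print(column_top(col, base_row=200, obstacle_depth=10.0, scale=2.0))  # 120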

Once both the freespace and the obstacle heights are known, the stixels can be estimated by fusing the information over the columns spanned by each stixel; finally, a refined depth of each stixel can be estimated via model fitting over the depths of the pixels it covers, possibly paired with confidence information (e.g. disparity confidence produced by methods such as semi-global matching).[7]
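
One simple way to realise this fusion, sketched below under the assumption of a constant-depth stixel model, is a confidence-weighted median over the pixels the stixel covers, so that outliers and low-confidence disparities have little influence; the weighting scheme is an illustrative choice, not the exact fitting procedure of the cited work.

    import numpy as np

    def stixel_depth(depths, confidences):
        """Confidence-weighted median of the pixel depths covered by a stixel."""
        order = np.argsort(depths)
        d, w = depths[order], confidences[order]
        cum = np.cumsum(w)
        return d[np.searchsorted(cum, 0.5 * cum[-1])]

    # the 42 m outlier with low confidence barely affects the result:
    print(stixel_depth(np.array([9.8, 10.1, 10.0, 42.0]),
                       np.array([1.0, 1.0, 1.0, 0.1])))  # 10.0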

Related Research Articles

<span class="mw-page-title-main">Computer vision</span> Computerized information extraction from images

Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the human visual system can do.

<span class="mw-page-title-main">Binocular vision</span> Ability to perceive a single three-dimensional image of surroundings with two eyes

In biology, binocular vision is a type of vision in which an animal has two eyes capable of facing the same direction to perceive a single three-dimensional image of its surroundings. Neurological researcher Manfred Fahle has stated six specific advantages of having two eyes rather than just one:

  1. It gives a creature a "spare eye" in case one is damaged.
  2. It gives a wider field of view. For example, humans have a maximum horizontal field of view of approximately 190 degrees with two eyes, approximately 120 degrees of which makes up the binocular field of view flanked by two uniocular fields of approximately 40 degrees.
  3. It can give stereopsis in which binocular disparity provided by the two eyes' different positions on the head gives precise depth perception. This also allows a creature to break the camouflage of another creature.
  4. It allows the angles of the eyes' lines of sight, relative to each other (vergence), and those lines relative to a particular object to be determined from the images in the two eyes. These properties are necessary for the third advantage.
  5. It allows a creature to see more of, or all of, an object behind an obstacle. This advantage was pointed out by Leonardo da Vinci, who noted that a vertical column closer to the eyes than an object at which a creature is looking might block some of the object from the left eye but that part of the object might be visible to the right eye.
  6. It gives binocular summation in which the ability to detect faint objects is enhanced.
<span class="mw-page-title-main">Stereoscopy</span> Technique for creating or enhancing the illusion of depth in an image

Stereoscopy is a technique for creating or enhancing the illusion of depth in an image by means of stereopsis for binocular vision. The word stereoscopy derives from Greek στερεός (stereos) 'firm, solid', and σκοπέω (skopeō) 'to look, to see'. Any stereoscopic image is called a stereogram. Originally, stereogram referred to a pair of stereo images which could be viewed using a stereoscope.

Ray casting Methodological basis for 3D CAD/CAM solid modeling and image rendering

Ray casting is the methodological basis for 3D CAD/CAM solid modeling and image rendering. It is essentially the same as ray tracing for computer graphics, where virtual light rays are "cast" or "traced" on their path from the focal point of a camera through each pixel in the camera sensor to determine what is visible along the ray in the 3D scene. The term "ray casting" was introduced by Scott Roth while at the General Motors Research Labs from 1978 to 1980. His paper, "Ray Casting for Modeling Solids", describes how to model solid objects by combining primitive solids, such as blocks and cylinders, using the set operators union (+), intersection (&), and difference (-). The general idea of using these binary operators for solid modeling is largely due to Voelcker and Requicha's geometric modelling group at the University of Rochester. See Solid modeling for a broad overview of solid modeling methods.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Autostereogram Visual illusion of 3D scene achieved by unfocusing eyes when viewing specific 2D images

An autostereogram is a single-image stereogram (SIS), designed to create the visual illusion of a three-dimensional (3D) scene from a two-dimensional image. Most people with normal binocular vision are capable of seeing the depth in autostereograms, but to do so they must overcome the normally automatic coordination between accommodation and horizontal vergence. The illusion is one of depth perception and involves stereopsis: depth perception arising from the different perspective each eye has of a three-dimensional scene, called binocular parallax.

Distance transform

A distance transform, also known as distance map or distance field, is a derived representation of a digital image. The choice of the term depends on the point of view on the object in question: whether the initial image is transformed into another representation, or it is simply endowed with an additional map or field.

Stereopsis is the perception of depth and three-dimensional structure through binocular vision, the combined visual information from two eyes. Because the eyes of humans, and many animals, are located at different lateral positions on the head, binocular vision results in two slightly different images projected to the retinas of the eyes. The differences are mainly in the relative horizontal position of objects in the two images. These positional differences are referred to as "horizontal disparities" or, more generally, "binocular disparities". Disparities are processed in the visual cortex of the brain to yield depth perception. While binocular disparities are naturally present when viewing a real three-dimensional scene with two eyes, they can also be simulated by artificially presenting two different images separately to each eye using a method called stereoscopy. The perception of depth in such cases is also referred to as "stereoscopic depth".

Image histogram

An image histogram is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram for a specific image a viewer will be able to judge the entire tonal distribution at a glance.

Glossary of machine vision

The following are common definitions related to the machine vision field.

A rotating line camera is a digital camera that uses a linear CCD array to assemble a digital image as the camera rotates. The CCD array may consist of three sensor lines, one for each RGB color channel. Advanced rotating line cameras may have multiple linear CCD arrays on the focal plate and may capture multiple panoramic images during their rotation.

Binocular disparity refers to the difference in image location of an object seen by the left and right eyes, resulting from the eyes’ horizontal separation (parallax). The brain uses binocular disparity to extract depth information from the two-dimensional retinal images in stereopsis. In computer vision, binocular disparity refers to the difference in coordinates of similar features within two stereo images.
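
For a rectified camera pair this relation is quantitative: a feature with disparity d, observed by cameras with focal length f and baseline b, lies at depth Z = f·b/d, a standard stereo identity. With f = 1000 pixels, b = 0.22 m, and d = 44 pixels, for example, Z = 1000 × 0.22 / 44 = 5 m; this inverse relation is why brighter (higher-disparity) pixels in a disparity map denote lower depth.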

<span class="mw-page-title-main">Image rectification</span>

Image rectification is a transformation process used to project images onto a common image plane. This process has several degrees of freedom and there are many strategies for transforming images to the common plane.

Range imaging is the name for a collection of techniques that are used to produce a 2D image showing the distance to points in a scene from a specific point, normally associated with some type of sensor device.

Time-of-flight camera Range imaging camera system

A time-of-flight camera is a range imaging camera system employing time-of-flight techniques to resolve distance between the camera and the subject for each point of the image, by measuring the round trip time of an artificial light signal provided by a laser or an LED. Laser-based time-of-flight cameras are part of a broader class of scannerless LIDAR, in which the entire scene is captured with each laser pulse, as opposed to point-by-point with a laser beam such as in scanning LIDAR systems. Time-of-flight camera products for civil applications began to emerge around 2000, as the semiconductor processes allowed the production of components fast enough for such devices. The systems cover ranges of a few centimeters up to several kilometers.
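
The underlying relation is elementary: a round-trip time t of a signal travelling at the speed of light c corresponds to a distance d = c·t/2, so a measured round trip of 10 ns, for instance, places the subject at roughly 1.5 m.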

Computer stereo vision is the extraction of 3D information from digital images, such as those obtained by a CCD camera. By comparing information about a scene from two vantage points, 3D information can be extracted by examining the relative positions of objects in the two panels. This is similar to the biological process of stereopsis.

2D to 3D video conversion is the process of transforming 2D ("flat") film to 3D form, which in almost all cases is stereo, so it is the process of creating imagery for each eye from one 2D image.

Stereoscopic motion, as introduced by Béla Julesz in his 1971 book Foundations of Cyclopean Perception, is a translational motion of figure boundaries defined by changes in binocular disparity over time in a real-life 3D scene, a 3D film or other stereoscopic scene. This translational motion gives rise to a mental representation of three-dimensional motion created in the brain on the basis of the binocular motion stimuli. Whereas the motion stimuli as presented to the eyes have a different direction for each eye, the stereoscopic motion is perceived as yet another direction on the basis of the views of both eyes taken together. Stereoscopic motion, as it is perceived by the brain, is also referred to as cyclopean motion, and the processing of visual input that takes place in the visual system relating to stereoscopic motion is called stereoscopic motion processing.

Air-Cobot French research and development project (2013–)

Air-Cobot is a French research and development project of a wheeled collaborative mobile robot able to inspect aircraft during maintenance operations. This multi-partner project involves research laboratories and industry. Research around this prototype was developed in three domains: autonomous navigation, human-robot collaboration and nondestructive testing.

Semi-global matching (SGM) is a computer vision algorithm for the estimation of a dense disparity map from a rectified stereo image pair, introduced in 2005 by Heiko Hirschmüller while working at the German Aerospace Center. Given its predictable run time, its favourable trade-off between quality of the results and computing time, and its suitability for fast parallel implementation in ASIC or FPGA, it has seen wide adoption in real-time stereo vision applications such as robotics and advanced driver assistance systems.

References

  1. (Badino, Franke & Pfeiffer 2009)
  2. (Benenson et al. 2012)
  3. (Erbs, Barth & Franke 2011)
  4. (Pfeiffer 2012, p. 5)
  5. (Badino, Franke & Pfeiffer 2009, sec. 2.3)
  6. (Saleem, Rezaei & Klette 2017)
  7. (Badino, Franke & Pfeiffer 2009, sec. 2.5)

Sources