Egocentric vision or first-person vision is a sub-field of computer vision that entails analyzing images and videos captured by a wearable camera, which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective to understand the user's activities and their context in a naturalistic setting. [1]
The forward-looking wearable camera is often supplemented with a camera looking inward at the user's eye that can measure the user's eye gaze, which is useful for revealing attention and for better understanding the user's activity and intentions.
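As a toy, purely illustrative sketch of how gaze can reveal attention (the coordinate convention, detection format, and values below are assumptions, not taken from the sources above), a normalized gaze estimate from the inward-facing camera can be mapped onto the forward-facing frame to find which detected object the wearer is fixating:

# Hypothetical illustration: map a normalized gaze estimate from the
# inward-facing eye camera onto the forward-facing frame and report which
# detected object, if any, the wearer is currently fixating.

def attended_object(gaze_norm, frame_size, detections):
    """gaze_norm: (gx, gy) in [0, 1]; detections: (label, x0, y0, x1, y1) boxes."""
    gx, gy = gaze_norm
    width, height = frame_size
    px, py = gx * width, gy * height          # gaze point in pixel coordinates
    for label, x0, y0, x1, y1 in detections:
        if x0 <= px <= x1 and y0 <= py <= y1:
            return label                      # gaze falls inside this object's box
    return None                               # the wearer is not fixating a detection

boxes = [("mug", 300, 400, 420, 520), ("keyboard", 100, 500, 600, 650)]
print(attended_object((0.28, 0.62), (1280, 720), boxes))   # -> "mug"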
The idea of using a wearable camera to gather visual data from a first-person perspective dates back to the 1970s, when Steve Mann invented the "Digital Eye Glass", a device that, when worn, causes the human eye itself to effectively become both an electronic camera and a television display. [2]
Subsequently, wearable cameras were used for health-related applications in the context of Humanistic Intelligence [3] and Wearable AI. [4] Egocentric vision is best captured from the point-of-eye, but may also be captured with a neck-worn camera when eyeglasses would be in the way. [5] This neck-worn variant was popularized by the Microsoft SenseCam in 2006 for experimental health research. [6] Interest in the egocentric paradigm within the computer vision community grew slowly at the start of the 2010s and has been increasing rapidly in recent years, [7] boosted both by impressive advances in wearable technology and by the increasing number of potential applications.
The prototypical first-person vision system described by Kanade and Hebert [8] in 2012 is composed of three basic components: a localization component able to estimate the surroundings, a recognition component able to identify objects and people, and an activity recognition component able to provide information about the current activity of the user. Together, these three components provide complete situational awareness of the user, which in turn can be used to provide assistance to the user or to the caregiver (a minimal illustrative sketch of such a pipeline is given below). Following this idea, the first computational techniques for egocentric analysis focused on hand-related activity recognition [9] and social interaction analysis. [10] Also, given the unconstrained nature of the video and the huge amount of data generated, temporal segmentation [11] and summarization [12] were among the first problems addressed. After almost ten years of egocentric vision (2007–2017), the field is still undergoing diversification. Emerging research topics include:
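The following is a minimal, hypothetical sketch of such a three-component pipeline, not Kanade and Hebert's actual system; all module names and outputs are invented placeholders standing in for real localization, recognition, and activity recognition models.

# A hypothetical, minimal sketch of the three-component pipeline: stubbed
# localization, recognition, and activity recognition modules feed a single
# situational-awareness summary. All names and outputs are placeholders.

from dataclasses import dataclass

@dataclass
class SituationalAwareness:
    location: str        # where the wearer is (localization component)
    entities: list       # objects and people in view (recognition component)
    activity: str        # what the wearer is doing (activity recognition component)

def localize(frame):
    return "kitchen"                        # placeholder localization result

def recognize(frame):
    return ["kettle", "mug", "person"]      # placeholder detections

def recognize_activity(frame, entities):
    return "making tea" if "kettle" in entities else "unknown"

def analyze(frame):
    entities = recognize(frame)
    return SituationalAwareness(localize(frame), entities,
                                recognize_activity(frame, entities))

print(analyze(frame=None))   # SituationalAwareness(location='kitchen', ...)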
Today's wearable cameras are small, lightweight digital recording devices that can acquire images and videos automatically, without user intervention, at different resolutions and frame rates, and from a first-person point of view. Wearable cameras are therefore naturally suited to gathering visual information from our everyday interactions, since they offer an intimate perspective on the visual field of the camera wearer.
Depending on the frame rate, it is common to distinguish between photo-cameras (also called lifelogging cameras) and video-cameras.
In both cases, since the camera is worn in a naturalistic setting, the visual data present huge variability in terms of illumination conditions and object appearance. Moreover, the camera wearer is not visible in the image, and what he or she is doing has to be inferred from the information in the camera's visual field, implying that important information about the wearer, such as, for instance, body pose or facial expression, is not available.
A collection of studies published in a special theme issue of the American Journal of Preventive Medicine [6] has demonstrated the potential of lifelogs captured with wearable cameras from a number of viewpoints. In particular, it has been shown that, used as a tool for understanding and tracking lifestyle behaviour, lifelogs could help prevent noncommunicable diseases associated with unhealthy trends and risky profiles (such as obesity, depression, etc.). In addition, used as a tool for re-memory cognitive training, lifelogs could help prevent cognitive and functional decline in elderly people.
More recently, egocentric cameras have been used to study human and animal cognition, human-human social interaction, human-robot interaction, and human expertise in complex tasks. Other applications include navigation and assistive technologies for the blind, [20] monitoring and assistance of industrial workflows, [21] [22] and augmented reality interfaces. [5]
Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
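As a toy illustration of turning image data into numerical and then symbolic information (the threshold and labels are arbitrary assumptions, not a method from the text above):

import numpy as np

# Toy illustration: a synthetic grayscale image is reduced first to numerical
# information (its mean intensity) and then to a symbolic decision via an
# arbitrary threshold.

image = np.zeros((64, 64), dtype=np.uint8)
image[16:48, 16:48] = 200                    # a bright square on a dark background

mean_intensity = image.mean()                # numerical information
decision = "bright scene" if mean_intensity > 100 else "dark scene"   # symbolic
print(mean_intensity, decision)              # 50.0 dark scene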
A wearable computer, also known as a body-borne computer, is a computing device worn on the body. The definition of 'wearable computer' may be narrow or broad, extending to smartphones or even ordinary wristwatches.
Augmented reality (AR) is an interactive experience that combines the real world and computer-generated content. The content can span multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory. AR can be defined as a system that incorporates three basic features: a combination of real and virtual worlds, real-time interaction, and accurate 3D registration of virtual and real objects. The overlaid sensory information can be constructive (adding to the natural environment) or destructive (masking parts of it). This experience is seamlessly interwoven with the physical world such that it is perceived as an immersive aspect of the real environment. In this way, augmented reality alters one's ongoing perception of a real-world environment, whereas virtual reality completely replaces the user's real-world environment with a simulated one.
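The 3D-registration requirement can be sketched with a minimal pinhole-camera projection; the intrinsics, camera pose, and anchor point below are invented for the example and stand in for values a real AR system would obtain from calibration and tracking.

import numpy as np

# Minimal pinhole-camera sketch of the 3D-registration step, with invented
# intrinsics and pose: a virtual anchor point in world coordinates is
# projected to the pixel where the overlay should be drawn.

K = np.array([[800.0,   0.0, 320.0],   # assumed intrinsics: focal length 800 px,
              [  0.0, 800.0, 240.0],   # principal point at the centre of a
              [  0.0,   0.0,   1.0]])  # 640x480 image

R = np.eye(3)                          # assumed camera orientation (identity)
t = np.array([0.0, 0.0, 2.0])          # camera placed 2 m from the world origin

def project(point_world):
    p_cam = R @ point_world + t        # world -> camera coordinates
    p_img = K @ p_cam                  # camera -> homogeneous image coordinates
    return p_img[:2] / p_img[2]        # perspective divide -> pixel (u, v)

print(project(np.array([0.1, -0.05, 0.0])))   # approx. [360. 220.]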
William Stephen George Mann is a Canadian engineer, professor, and inventor who works in augmented reality, computational photography, particularly wearable computing, and high-dynamic-range imaging. Mann is sometimes labeled the "Father of Wearable Computing" for early inventions and continuing contributions to the field. He cofounded InteraXon, makers of the Muse brain-sensing headband, and is also a founding member of the IEEE Council on Extended Intelligence (CXI). Mann is currently CTO and cofounder at Blueberry X Technologies and Chairman of MannLab. Mann was born in Canada, and currently lives in Toronto, Canada, with his wife and two children. In 2023, Mann unsuccessfully ran for mayor of Toronto.
Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While some core ideas in the field may be traced as far back as to early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing and her book Affective Computing published by MIT Press. One of the motivations for the research is the ability to give machines emotional intelligence, including to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions.
Computer-mediated reality refers to the ability to add to, subtract information from, or otherwise manipulate one's perception of reality through the use of a wearable computer or hand-held device such as a smartphone.
Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. It is a subdiscipline of computer vision. Gestures can originate from any bodily motion or state, but commonly originate from the face or hand. Focuses in the field include emotion recognition from the face and hand gesture recognition, since these are all forms of expression. Users can make simple gestures to control or interact with devices without physically touching them. Many approaches have been made using cameras and computer vision algorithms to interpret sign language; however, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques. Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a better bridge between machines and humans than older text user interfaces or even GUIs, which still limit the majority of input to keyboard and mouse, and allowing people to interact naturally without any mechanical devices.
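A toy sketch of interpreting a gesture with a mathematical rule, under the assumption that a tracker already provides a 2D hand trajectory; the swipe threshold and gesture labels are arbitrary choices for illustration.

import numpy as np

# Toy sketch: a tracked 2D hand trajectory is labelled as a left/right swipe
# or a tap from its net horizontal displacement.

def classify_gesture(trajectory, swipe_threshold=100.0):
    """trajectory: sequence of (x, y) hand positions in pixels over time."""
    traj = np.asarray(trajectory, dtype=float)
    dx = traj[-1, 0] - traj[0, 0]             # net horizontal motion
    if dx > swipe_threshold:
        return "swipe right"
    if dx < -swipe_threshold:
        return "swipe left"
    return "tap"                               # little net motion

print(classify_gesture([(400, 300), (330, 298), (250, 305), (180, 300)]))   # -> swipe left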
Eye tracking is the process of measuring either the point of gaze or the motion of an eye relative to the head. An eye tracker is a device for measuring eye positions and eye movement. Eye trackers are used in research on the visual system, in psychology, in psycholinguistics, marketing, as an input device for human-computer interaction, and in product design. In addition, eye trackers are increasingly being used for assistive and rehabilitative applications such as controlling wheelchairs, robotic arms, and prostheses. There are several methods for measuring eye movement, with the most popular variant using video images to extract eye position. Other methods use search coils or are based on the electrooculogram.
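As a simplified sketch of the video-based approach (a heuristic assumption, not any particular tracker's algorithm), the pupil centre can be estimated as the centroid of the darkest pixels in a grayscale eye image; real eye trackers add corneal-reflection tracking, calibration, and filtering.

import numpy as np

# Simplified sketch: estimate the pupil centre in a grayscale eye image as
# the centroid of its darkest pixels.

def pupil_center(eye_image, dark_threshold=50):
    ys, xs = np.nonzero(eye_image < dark_threshold)   # dark (pupil) pixels
    if len(xs) == 0:
        return None                                   # no pupil-like region found
    return float(xs.mean()), float(ys.mean())         # centroid (x, y)

eye = np.full((100, 100), 220, dtype=np.uint8)        # bright synthetic sclera
eye[40:50, 60:70] = 10                                # dark 10x10 "pupil"
print(pupil_center(eye))                              # -> (64.5, 44.5)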
Sensory substitution is a change of the characteristics of one sensory modality into stimuli of another sensory modality.
Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for input and output of data.
In computer security, shoulder surfing is a type of social engineering technique used to obtain information such as personal identification numbers (PINs), passwords and other confidential data by looking over the victim's shoulder. Unauthorized users watch the keystrokes inputted on a device or listen to sensitive information being spoken, which is also known as eavesdropping.
Light writing is an emerging form of stop motion animation wherein still images captured using the technique known as light painting or light drawing are put in sequence thereby creating the optical illusion of movement for the viewer.
Active vision, sometimes also called active computer vision, is an area of computer vision. An active vision system is one that can manipulate the viewpoint of its camera(s) in order to investigate the environment and obtain better information from it.
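A toy sketch of such viewpoint selection, with invented viewpoints and visibility sets: the system moves the camera to whichever candidate view is expected to reveal the most surface patches it has not yet seen.

# Toy sketch of an active-vision policy: pick the candidate viewpoint that
# reveals the most not-yet-seen surface patches.

def next_best_view(candidates, seen):
    """candidates: dict mapping viewpoint name -> set of visible surface patches."""
    return max(candidates, key=lambda view: len(candidates[view] - seen))

candidate_views = {
    "left":  {"A", "B", "C"},
    "front": {"B", "C", "D"},
    "right": {"D", "E", "F", "G"},
}
already_seen = {"A", "B", "C", "D"}
print(next_best_view(candidate_views, already_seen))   # -> right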
Activity recognition aims to recognize the actions and goals of one or more agents from a series of observations on the agents' actions and the environmental conditions. Since the 1980s, this research field has captured the attention of several computer science communities due to its strength in providing personalized support for many different applications and its connection to many different fields of study such as medicine, human-computer interaction, or sociology.
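A toy rule-based sketch of recognizing an activity from a series of observations; the activity "signatures" below are invented for illustration, whereas practical systems typically learn such mappings from data.

# Toy sketch: a series of observed objects is matched against hand-written
# activity signatures and the best-matching activity is returned.

ACTIVITY_SIGNATURES = {
    "making tea":     {"kettle", "mug", "teabag"},
    "washing dishes": {"sink", "sponge", "plate"},
    "reading":        {"book", "lamp"},
}

def recognize_activity(observations):
    observed = set(observations)
    scores = {name: len(signature & observed)
              for name, signature in ACTIVITY_SIGNATURES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(recognize_activity(["mug", "kettle", "spoon"]))   # -> making tea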
A lifelog is a personal record of one's daily life in a varying amount of detail, for a variety of purposes. The record contains a comprehensive dataset of a human's activities. The data could be used to increase knowledge about how people live their lives. In recent years, some lifelog data has been automatically captured by wearable technology or mobile devices. People who keep lifelogs about themselves are known as lifeloggers.
Visual privacy is the relationship between the collection and dissemination of visual information, the expectation of privacy, and the legal issues surrounding them. These days digital cameras are ubiquitous. They are one of the most common sensors found in electronic devices, ranging from smartphones to tablets, and laptops to surveillance cameras. However, the privacy and trust implications surrounding them limit their ability to blend seamlessly into the computing environment. In particular, large-scale camera networks have created increasing interest in understanding the advantages and disadvantages of such deployments. It is estimated that over 4 million CCTV cameras are deployed in the UK. Due to increasing security concerns, camera networks have continued to proliferate in other countries such as the United States. While the impact of such systems continues to be evaluated, tools for controlling how these camera networks are used, and for modifying the images and video sent to end-users, have been explored in parallel.
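One such image modification can be sketched as pixelating a sensitive region before a frame is shared; the region coordinates are assumed to come from some upstream detector, and the block size is an arbitrary choice.

import numpy as np

# Minimal sketch: pixelate a sensitive region (e.g. a detected face) before
# the frame is shared, by replacing it with coarse block averages.

def pixelate_region(image, y0, y1, x0, x1, block=8):
    """Replace image[y0:y1, x0:x1] with coarse block averages, in place."""
    region = image[y0:y1, x0:x1]
    for by in range(0, region.shape[0], block):
        for bx in range(0, region.shape[1], block):
            patch = region[by:by + block, bx:bx + block]
            patch[...] = patch.mean()          # flatten detail within the block
    return image

frame = np.random.randint(0, 256, size=(240, 320), dtype=np.uint8)
pixelate_region(frame, 40, 104, 100, 164)      # obscure a 64x64 "face" box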
SixthSense is a gesture-based wearable computer system developed at the MIT Media Lab by Steve Mann in 1994, 1997, and 1998, and further developed by Pranav Mistry in 2009; both developed hardware and software for head-worn and neck-worn versions of it. It comprises a head-worn or neck-worn pendant that contains both a data projector and a camera. Head-worn versions built at the MIT Media Lab in 1997 combined cameras and illumination systems for interactive photographic art, and also included gesture recognition.
Nadine is a gynoid humanoid social robot modelled on Professor Nadia Magnenat Thalmann. The robot has a strong human likeness, with natural-looking skin and hair and realistic hands. Nadine is a socially intelligent robot which returns a greeting, makes eye contact, and can remember all the conversations it has had. It is able to answer questions autonomously in several languages and to simulate emotions both in gestures and facially, depending on the content of the interaction with the user. Nadine can recognise persons it has previously seen and engage in flowing conversation. Nadine has been programmed with a "personality", in that its demeanour can change according to what is said to it. Nadine has a total of 27 degrees of freedom for facial expressions and upper body movements. With persons it has previously encountered, it remembers facts and events related to each person. It can assist people with special needs by reading stories, showing images, putting on Skype sessions, sending emails, and communicating with other members of the family. It can play the role of a receptionist in an office or be dedicated to being a personal coach.
Kristen Lorraine Grauman is a Professor of Computer Science at the University of Texas at Austin on leave as a research scientist at Facebook AI Research (FAIR). She works on computer vision and machine learning.
Rita Cucchiara is an Italian electrical and computer engineer, and professor of computer engineering and science in the Enzo Ferrari Department of Engineering at the University of Modena and Reggio Emilia (UNIMORE) in Italy. She teaches the courses “Computer Architecture” and “Computer Vision and Cognitive Systems”. Cucchiara's research work focuses on artificial intelligence, specifically deep network technologies and computer vision for human behavior understanding (HBU) and visual, language and multimodal generative AI. She is the scientific coordinator of the AImage Lab at UNIMORE and is director of the Artificial Intelligence Research and Innovation Center (AIRI) as well as the ELLIS Unit at Modena. From 2018 to 2021 she was founder and director of AIIS, the Italian national laboratory of artificial intelligence and intelligent systems of CINI. Cucchiara was also president of the CVPL from 2016 to 2018. She has been an IAPR Fellow since 2006 and an ELLIS Fellow since 2020.