Kismet (robot)

Kismet now resides at the MIT Museum.

Kismet is a robot head made in the 1990s at the Massachusetts Institute of Technology (MIT) by Dr. Cynthia Breazeal as an experiment in affective computing: a machine that can recognize and simulate emotions. The name Kismet comes from a Turkish word meaning "fate" or sometimes "luck". [1]


Hardware design and construction

In order for Kismet to interact properly with human beings, it is equipped with input devices that give it auditory, visual, and proprioceptive abilities. Kismet simulates emotion through various facial expressions, vocalizations, and movement. Facial expressions are created through movements of the ears, eyebrows, eyelids, lips, jaw, and head. The cost of physical materials was an estimated US$25,000. [1]

In addition to the equipment mentioned above, the control hardware comprises four Motorola 68332 microcontrollers, nine 400 MHz PCs, and a further 500 MHz PC. [1]

Software system

Kismet's social intelligence software system, or synthetic nervous system (SNS), was designed with human models of intelligent behavior in mind. It contains six subsystems, [2] three of which are described below.
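
As a rough illustration of how such a subsystem pipeline could be wired together, the sketch below connects only the three subsystems this article covers (low-level feature extraction, motivation, and motor). It is a minimal sketch in Python; the class, method, and parameter names are assumptions, not the actual SNS interfaces.

    class SyntheticNervousSystem:
        """Illustrative wiring only: the real SNS comprises six subsystems
        and a richer architecture; all names here are assumptions."""

        def __init__(self, vision, audio, motivation, motor):
            self.vision = vision          # low-level visual feature extraction
            self.audio = audio            # low-level auditory feature extraction
            self.motivation = motivation  # maintains the current emotional state
            self.motor = motor            # drives the face, head, and voice

        def step(self, prev_frame, frame, utterance):
            # Gather percepts from the low-level feature extractors.
            percepts = {
                "motion": self.vision.detect_motion(prev_frame, frame),
                "affect": self.audio.classify_affect(utterance),
            }
            # Let the motivation system pick a new emotional state,
            # then express it through the motor system.
            emotion = self.motivation.update_from_percepts(percepts)
            self.motor.express(emotion)
            return emotion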

Low-level feature extraction system

This system processes raw visual and auditory information from cameras and microphones. Kismet's vision system can perform eye detection, motion detection and, more controversially, skin-color detection. Whenever Kismet moves its head, it momentarily disables its motion-detection system so that its own movement is not mistaken for motion in the scene. It also uses its stereo cameras to estimate the distance of an object in its visual field, for example to detect threats: large, close objects with a lot of movement. [3]
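
As a minimal sketch of the two ideas just described, namely suppressing motion detection during self-motion and using stereo disparity to judge whether an object is a large, close, fast-moving threat, the fragment below may help. The class name, thresholds, and camera parameters are assumptions, not Kismet's actual implementation.

    import numpy as np

    class LowLevelVision:
        """Sketch only: thresholds and names are illustrative assumptions."""

        def __init__(self, focal_length_px, baseline_m,
                     close_m=0.5, large_px=5000, fast_px=20):
            self.f = focal_length_px   # camera focal length, in pixels
            self.b = baseline_m        # stereo baseline, in metres
            self.close_m = close_m     # below this depth an object is "close"
            self.large_px = large_px   # above this area a region is "large"
            self.fast_px = fast_px     # above this displacement it is "fast"
            self.head_moving = False   # set True while the head motors run

        def depth_from_disparity(self, disparity_px):
            # Standard stereo relation: depth = focal length * baseline / disparity.
            return self.f * self.b / max(disparity_px, 1e-6)

        def detect_motion(self, prev_frame, frame):
            # Ignore motion entirely while the head is moving, so that
            # self-motion is not reported as motion in the scene.
            if self.head_moving:
                return np.zeros_like(frame, dtype=bool)
            return np.abs(frame.astype(int) - prev_frame.astype(int)) > 25

        def is_threat(self, region_area_px, disparity_px, displacement_px):
            # A "threat" is large, close, and moving quickly.
            depth = self.depth_from_disparity(disparity_px)
            return (region_area_px > self.large_px
                    and depth < self.close_m
                    and displacement_px > self.fast_px)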

Kismet's audio system is mainly tuned towards identifying affect in infant-directed speech. In particular, it can detect five different types of affective speech: approval, prohibition, attention, comfort, and neutral. The affective intent classifier was created as follows. Low-level features such as pitch mean and energy (volume) variance were extracted from samples of recorded speech. The classes of affective intent were then modeled as a Gaussian mixture model and trained on these samples using the expectation-maximization algorithm. Classification is performed in multiple stages, first assigning an utterance to one of two general groups (e.g. soothing/neutral versus prohibition/attention/approval) and then performing a more detailed classification. This architecture significantly improved performance for hard-to-distinguish classes, such as approval ("You're a clever robot") versus attention ("Hey Kismet, over here"). [3]
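
The following is a minimal sketch of the general technique described above: one Gaussian mixture model per affect class and per coarse group, fitted with expectation-maximization, then applied in two stages. The use of scikit-learn, the particular grouping, and all names are assumptions rather than Kismet's actual implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture  # fitted internally with EM

    # Coarse groups for the first classification stage (grouping is illustrative).
    GROUPS = {
        "soothing_or_neutral": ["comfort", "neutral"],
        "other": ["prohibition", "attention", "approval"],
    }

    def fit_models(features_by_class, n_components=2):
        """features_by_class maps each affect class to an array of prosodic
        feature vectors (e.g. pitch mean, energy variance), one per utterance."""
        class_models = {c: GaussianMixture(n_components).fit(X)
                        for c, X in features_by_class.items()}
        group_models = {}
        for group, classes in GROUPS.items():
            X = np.vstack([features_by_class[c] for c in classes])
            group_models[group] = GaussianMixture(n_components).fit(X)
        return class_models, group_models

    def classify(x, class_models, group_models):
        """Stage 1: choose the coarse group; stage 2: choose a class within it."""
        x = np.asarray(x).reshape(1, -1)
        group = max(group_models, key=lambda g: group_models[g].score(x))
        return max(GROUPS[group], key=lambda c: class_models[c].score(x))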

Motivation system

Dr. Breazeal describes her relationship with the robot as 'something like an infant-caretaker interaction, where I'm the caretaker essentially, and the robot is like an infant'. The accompanying overview frames the human-robot relationship as one of learning, with Dr. Breazeal providing the scaffolding for Kismet's development. It offers a demonstration of Kismet's capabilities, narrated as emotive facial expressions that communicate the robot's 'motivational state'. Dr. Breazeal: "This one is anger (laugh) extreme anger, disgust, excitement, fear, this is happiness, this one is interest, this one is sadness, surprise, this one is tired, and this one is sleep." [4]

Kismet can only be in one emotional state at a time. However, Breazeal states that Kismet is not conscious, so it does not have feelings. [5]
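
As a small illustration of this single-active-state constraint, the sketch below keeps exactly one emotion current at any time. The state names come from the demonstration quoted above; the update rule and class names are assumptions.

    from enum import Enum, auto

    class Emotion(Enum):
        # States named in the demonstration quoted above.
        ANGER = auto()
        DISGUST = auto()
        EXCITEMENT = auto()
        FEAR = auto()
        HAPPINESS = auto()
        INTEREST = auto()
        SADNESS = auto()
        SURPRISE = auto()
        TIRED = auto()
        SLEEP = auto()

    class MotivationSystem:
        """Only one emotional state is active at any given moment."""

        def __init__(self):
            self.current = Emotion.INTEREST

        def update_from_percepts(self, activations):
            # 'activations' maps each Emotion to a drive level computed
            # elsewhere (an assumption); the single strongest state wins.
            self.current = max(activations, key=activations.get)
            return self.current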

Motor system

Kismet speaks a proto-language made up of a variety of phonemes, similar to a baby's babbling. It uses the DECtalk voice synthesizer, and varies pitch, timing, articulation, and other vocal qualities to express different emotions. Intonation is used to distinguish question-like from statement-like utterances. Lip synchronization was important for realism, and the developers adopted a strategy from animation: [6] "simplicity is the secret to successful lip animation." Thus, they did not try to imitate lip motions perfectly, but instead aimed to "create a visual shorthand that passes unchallenged by the viewer."
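
To illustrate the kind of emotion-dependent vocal control described above, the sketch below maps an emotional state to abstract pitch, rate, and intonation settings. The parameter names and values are assumptions; they are not DECtalk's actual control interface or Kismet's settings.

    from dataclasses import dataclass, replace

    @dataclass
    class Prosody:
        # Abstract prosodic controls; not DECtalk's actual parameter names.
        pitch_base: float    # baseline pitch, in Hz
        pitch_range: float   # how widely pitch varies around the baseline
        speech_rate: float   # rough speaking rate, words per minute
        rising_final: bool   # rising contour marks question-like utterances

    # Illustrative emotion-to-prosody table (values are assumptions).
    EMOTION_PROSODY = {
        "happiness": Prosody(pitch_base=220, pitch_range=80, speech_rate=180, rising_final=False),
        "sadness":   Prosody(pitch_base=180, pitch_range=20, speech_rate=120, rising_final=False),
        "interest":  Prosody(pitch_base=210, pitch_range=60, speech_rate=160, rising_final=True),
    }

    def babble(phonemes, emotion, is_question=False):
        """Return a phoneme string plus prosody settings for a synthesizer front end."""
        p = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["interest"])
        if is_question:
            p = replace(p, rising_final=True)  # questions get a rising contour
        return " ".join(phonemes), p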

Related Research Articles

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While some core ideas in the field may be traced as far back as early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing and her book Affective Computing, published by MIT Press. One of the motivations for the research is the ability to give machines emotional intelligence, including the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions.

A facial expression is one or more motions or positions of the muscles beneath the skin of the face. According to one set of controversial theories, these movements convey the emotional state of an individual to observers. Facial expressions are a form of nonverbal communication. They are a primary means of conveying social information between humans, but they also occur in most other mammals and some other animal species.

Lip reading, also known as speechreading, is a technique of understanding a limited range of speech by visually interpreting the movements of the lips, face and tongue without sound. Estimates of the range of lip reading vary, with some figures as low as 30% because lip reading relies on context, language knowledge, and any residual hearing. Although lip reading is used most extensively by deaf and hard-of-hearing people, most people with normal hearing process some speech information from sight of the moving mouth.

<span class="mw-page-title-main">McGurk effect</span> Perceptual illusion

The McGurk effect is a perceptual phenomenon that demonstrates an interaction between hearing and vision in speech perception. The illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound. The visual information a person gets from seeing a person speak changes the way they hear the sound. If a person is getting poor-quality auditory information but good-quality visual information, they may be more likely to experience the McGurk effect. Integration abilities for audio and visual information may also influence whether a person will experience the effect. People who are better at sensory integration have been shown to be more susceptible to the effect. Many people are affected differently by the McGurk effect based on many factors, including brain damage and other disorders.

<span class="mw-page-title-main">Paul Ekman</span> American psychologist (born 1934)

Paul Ekman is an American psychologist and professor emeritus at the University of California, San Francisco who is a pioneer in the study of emotions and their relation to facial expressions. He was ranked 59th out of the 100 most cited psychologists of the twentieth century. Ekman conducted seminal research on the specific biological correlations of specific emotions, attempting to demonstrate the universality and discreteness of emotions in a Darwinian approach.

<span class="mw-page-title-main">Cynthia Breazeal</span> American computer scientist

Cynthia Breazeal is an American robotics scientist and entrepreneur. She is a former chief scientist and chief experience officer of Jibo, a company she co-founded in 2012 that developed personal assistant robots. Currently, she is a professor of media arts and sciences at the Massachusetts Institute of Technology and the director of the Personal Robots group at the MIT Media Lab. Her most recent work has focused on the theme of living everyday life in the presence of AI, and gradually gaining insight into the long-term impacts of social robots.

Domo is an experimental robot made by the Massachusetts Institute of Technology designed to interact with humans. The brainchild of Jeff Weber and Aaron Edsinger, cofounders of Meka Robotics, its name comes from the Japanese phrase for "thank you very much", domo arigato, as well as the Styx song "Mr. Roboto". The Domo project was originally funded by NASA, and Toyota has since joined in funding the robot's development.

In linguistics, prosody is the study of elements of speech that are not individual phonetic segments but which are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm. Such elements are known as suprasegmentals.

<span class="mw-page-title-main">Rosalind Picard</span> American computer scientist

Rosalind Wright Picard is an American scholar and inventor who is Professor of Media Arts and Sciences at MIT, founder and director of the Affective Computing Research Group at the MIT Media Lab, and co-founder of the startups Affectiva and Empatica.

Categorical perception is the phenomenon of perceiving distinct categories when there is a gradual change in a variable along a continuum. It was originally observed for auditory stimuli but has since been found to apply to other perceptual modalities.

<span class="mw-page-title-main">Leonardo (robot)</span>

Leonardo is a 2.5-foot social robot, the first created by the Personal Robots Group of the Massachusetts Institute of Technology. Its development is credited to Cynthia Breazeal. The body is by Stan Winston Studios, leaders in animatronics, and was completed in 2002; it was the most complex robot the studio had ever attempted as of 2001. Other contributors to the project include NevenVision, Inc., Toyota, NASA's Lyndon B. Johnson Space Center, and the Navy Research Lab. It was created to facilitate the study of human–robot interaction and collaboration. The project has been partially funded by a DARPA Mobile Autonomous Robot Software (MARS) grant, an Office of Naval Research Young Investigators Program grant, and the Digital Life and Things That Think consortia. The MIT Media Lab Robotic Life Group, which also studied Robonaut 1, set out to create a more sophisticated social robot in Leonardo. They gave Leonardo a different visual tracking system and programs based on infant psychology that they hope will make for better human-robot collaboration. One of the goals of the project was to make it possible for untrained humans to interact with and teach the robot much more quickly and with fewer repetitions. Leonardo was awarded a spot in Wired Magazine's 50 Best Robots Ever list in 2006.

Non-verbal leakage is a form of non-verbal behavior that occurs when a person verbalizes one thing but their body language indicates another; common forms include facial movements and hand-to-face gestures. The term "non-verbal leakage" first appeared in the literature in 1968, prompting many subsequent studies on the topic throughout the 1970s, with related studies continuing today.

Social cues are verbal or non-verbal signals expressed through the face, body, voice, and motion that guide conversations as well as other social interactions by influencing our impressions of and responses to others. These percepts are important communicative tools as they convey social and contextual information and therefore facilitate social understanding.

Emotional prosody or affective prosody refers to the various non-verbal aspects of language that allow people to convey or understand emotion. It includes an individual's tone of voice in speech, conveyed through changes in pitch, loudness, timbre, speech rate, and pauses. It can be isolated from semantic information, and interacts with verbal content.

Emotion perception refers to the capacities and abilities of recognizing and identifying emotions in others, in addition to biological and physiological processes involved. Emotions are typically viewed as having three components: subjective experience, physical changes, and cognitive appraisal; emotion perception is the ability to make accurate decisions about another's subjective experience by interpreting their physical changes through sensory systems responsible for converting these observed changes into mental representations. The ability to perceive emotion is believed to be both innate and subject to environmental influence and is also a critical component in social interactions. How emotion is experienced and interpreted depends on how it is perceived. Likewise, how emotion is perceived is dependent on past experiences and interpretations. Emotion can be accurately perceived in humans. Emotions can be perceived visually, audibly, through smell and also through bodily sensations and this process is believed to be different from the perception of non-emotional material.

Affectiva was a software company that built artificial intelligence, acquired by SmartEye in 2021. The company claimed its AI understood human emotions, cognitive states, activities and the objects people use, by analyzing facial and vocal expressions. An offshoot of MIT Media Lab, Affectiva created a new technological category of Artificial Emotional Intelligence, namely, Emotion AI.

<span class="mw-page-title-main">Visage SDK</span> Software development kit

Visage SDK is a multi-platform software development kit (SDK) created by Visage Technologies AB. Visage SDK allows software programmers to build facial motion capture and eye tracking applications.

Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. Use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best if it uses multiple modalities in context. To date, the most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.

<span class="mw-page-title-main">Nadine Social Robot</span> Social Humanoid Robot

Nadine is a gynoid humanoid social robot modelled on Professor Nadia Magnenat Thalmann. The robot has a strong human likeness, with natural-looking skin and hair and realistic hands. Nadine is a socially intelligent robot which returns a greeting, makes eye contact, and can remember all the conversations it has had. It is able to answer questions autonomously in several languages and to simulate emotions both in gestures and facially, depending on the content of the interaction with the user. Nadine can recognise persons it has previously seen and engage in flowing conversation. Nadine has been programmed with a "personality", in that its demeanour can change according to what is said to it. Nadine has a total of 27 degrees of freedom for facial expressions and upper body movements. With persons it has previously encountered, it remembers facts and events related to each person. It can assist people with special needs by reading stories, showing images, setting up Skype sessions, sending emails, and communicating with other members of the family. It can play the role of a receptionist in an office or serve as a personal coach.

Artificial empathy or computational empathy is the development of AI systems—such as companion robots or virtual agents—that can detect emotions and respond to them in an empathic way.

References

  1. Peter Menzel and Faith D'Aluisio. Robo sapiens. Cambridge: The MIT Press, 2000. p. 66.
  2. Breazeal, Cynthia. Designing Sociable Robots. The MIT Press, 2002.
  3. "Kismet, the robot".
  4. Suchman, Lucy. "Subject Objects". Feminist Theory, 2011, p. 127.
  5. Breazeal, Cynthia. Designing Sociable Robots. The MIT Press, 2002, p. 112.
  6. Madsen, R. Animated Film: Concepts, Methods, Uses. Interland, New York, 1969.