Rich Representation Language

The Rich Representation Language, often abbreviated as RRL, is a computer animation language specifically designed to facilitate the interaction of two or more animated characters. [1] [2] [3] The research effort was funded by the European Commission as part of the NECA Project. The NECA (Net Environment for Embodied Emotional Conversational Agents) framework within which RRL was developed was not oriented towards the animation of movies, but towards the creation of intelligent "virtual characters" that interact within a virtual world and hold conversations with emotional content, coupled with suitable facial expressions. [3]

RRL was a pioneering research effort that influenced the design of later languages such as the Player Markup Language, which extended parts of RRL's design. [4] The language was specifically designed to lessen the training needed for modeling the interaction of multiple characters within a virtual world and to automatically generate much of the facial animation, as well as the skeletal animation, from the content of the conversations. Because nonverbal communication components such as facial expressions depend on the spoken words, no animation is possible in the language without considering the context of the scene in which the animation takes place, e.g. anger versus joy. [5]

Language design issues

The application domain for RRL consists of scenes with two or more virtual characters. The representation of these scenes requires multiple information types such as body postures, facial expressions, and the semantic content and meaning of conversations. The design challenge is that information of one type often depends on another type, e.g. the body posture, the facial expression and the semantic content of the conversation need to be coordinated. In an angry conversation, for instance, the semantics of the conversation dictate body postures and facial expressions in a form quite different from those of a joyful conversation. Hence any commands within the language that control facial expressions must inherently depend on the context of the conversation. [3]

The different types of information used in RRL require different forms of expression within the language, e.g. while semantic information is represented by grammars, the facial expression component requires graphic manipulation primitives. [3]

A key goal in the design of RRL was ease of development, making the construction of scenes and interactions accessible to users without advanced programming knowledge. The design also aimed to allow for incremental development in a natural form, so that scenes could be partially prototyped and then refined into more natural looking renderings, e.g. through the later addition of blinking or breathing. [3]

Scene description

Borrowing theatrical terminology, each interaction session between the synthetic characters in RRL is called a scene. A scene description specifies the content, timing, and emotional features employed within a scene. A specific module called the affective reasoner computes the emotional primitives involved in the scene, including the type and the intensity of the emotions, as well as their causes. The affective reasoner uses emotion dimensions such as intensity and assertiveness. [3]
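
The cited RRL description does not give the concrete syntax of the affective reasoner's output, but its role can be sketched as a small XML record per emotion instance; the element and attribute names below (emotion, type, intensity, cause, dimension, experiencer) are illustrative assumptions, not RRL's actual tag set.

    <!-- hypothetical sketch of an affective reasoner result; names are not from the RRL specification -->
    <emotion id="e1" type="anger" intensity="0.8">
      <cause>goal_of_agentA_blocked</cause>
      <dimension name="assertiveness" value="high"/>
      <experiencer agent="agentA"/>
    </emotion>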

Although XML is used as the base representation format, the scenes are described at a higher level within an object-oriented framework in which nodes (i.e. objects) are connected via arrows or links. For instance, a scene is the top-level node, which is linked to others. The scene may have three specific attributes: the agents/people who participate in the scene, the discourse representation which provides the basis for conversations, and a history which records the temporal relationships between the various actions. [3]
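
As a rough illustration of this node-and-link structure, a scene could be rendered in XML along the following lines; the element names (scene, agents, discourse, history) and the sample agent identifiers are assumptions made for the sketch and are not drawn from the published RRL schema.

    <!-- hypothetical scene skeleton; element names are illustrative only -->
    <scene id="scene1">
      <agents>
        <agent id="agentA" role="salesperson"/>
        <agent id="agentB" role="customer"/>
      </agents>
      <discourse>
        <!-- discourse representation: referents and conditions of the conversation -->
      </discourse>
      <history>
        <!-- temporal relationships between the actions of the scene -->
      </history>
    </scene>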

The scene descriptions are fed to the natural language generation module, which produces suitable sentences. Generating a natural flow of conversation requires a high degree of representational power for the emotional elements. RRL uses a discourse representation system based on the standard method of referents and conditions. The affective reasoner supplies the information needed to select the words and structures that correspond to specific sentences. [3]
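
To illustrate the referents-and-conditions style of discourse representation, a simple utterance such as "agent A greets agent B" could be encoded roughly as follows; the drs, referent and condition elements are hypothetical names that merely mirror the general shape of a discourse representation structure.

    <!-- hypothetical encoding of a discourse representation structure -->
    <drs>
      <referent id="x"/>  <!-- discourse referent for the speaker -->
      <referent id="y"/>  <!-- discourse referent for the addressee -->
      <condition predicate="agentA" args="x"/>
      <condition predicate="agentB" args="y"/>
      <condition predicate="greet" args="x y"/>
    </drs>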

Speech synthesis and emotive markers

The speech synthesis component is highly dependent on the semantic information and on the behavior of the gesture assignment module. It must operate before the gesture assignment system because its output includes the timing information for the spoken words and emotional interjections. After interpreting the natural language text to be spoken, this component adds prosodic structure such as rhythm, stress and intonation. [3]
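
A hedged sketch of such enriched speech output, with word-level timing and prosodic marks, is shown below; the utterance and word elements and the start, end, stress and pitch attributes are invented for the example and are not the actual output format of the NECA speech synthesis component.

    <!-- hypothetical timed, prosodically annotated utterance -->
    <utterance agent="agentA" emotion="anger">
      <word start="0.00" end="0.35" stress="strong" pitch="rising">what</word>
      <word start="0.35" end="0.55">were</word>
      <word start="0.55" end="0.80">you</word>
      <word start="0.80" end="1.40" stress="strong" pitch="falling">thinking</word>
    </utterance>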

The speech elements, once enriched with stress, intonation and emotional markers, are passed to the gesture assignment system. [3] RRL supports three separate aspects of emotion management. First, specific emotion tags may be provided for scenes and for specific sentences. A number of specific commands support the display of a wide range of emotions on the faces of animated characters. [3]

Second, there are built-in mechanisms for aligning specific facial features with emotive body postures. Third, specific emotive interjections such as sighs, yawns and chuckles may be interleaved within actions to enhance the believability of the character's utterances. [3]
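
A combined sketch of these three mechanisms, using invented tag names (an emotion attribute on a sentence, an align element, and an interjection element), might look as follows; none of these names are confirmed by the cited RRL description.

    <!-- hypothetical sentence-level emotion markup -->
    <sentence emotion="sadness" intensity="0.6">
      <align feature="eyebrows" posture="slumped"/>  <!-- align a facial feature with a body posture -->
      <text>I really thought we had won.</text>
      <interjection type="sigh"/>                    <!-- emotive interjection interleaved with the utterance -->
    </sentence>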

Gesture assignment and body movements

In RRL the term gesture is used in a general sense and applies to facial expressions, body posture and gestures proper. Three levels of information are processed within gesture assignment. [3]

The gesture assignment system distinguishes specific gesture types such as body movements (e.g. a shrug of the shoulders for indifference versus hanging shoulders for sadness), emblematic gestures (gestures that by convention signal yes or no), iconic gestures (e.g. imitating a telephone with the fingers), deictic gestures (pointing), contrast gestures (e.g. "on the one hand ... on the other hand") and facial features (e.g. raised eyebrows, frowning, surprise or gaze). [3]
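
These gesture categories could be tagged in a scene description roughly as in the sketch below; the gestures wrapper, the gesture element and its type values are again illustrative assumptions rather than documented RRL syntax.

    <!-- hypothetical gesture assignments for a scene -->
    <gestures>
      <gesture type="emblematic" meaning="no" agent="agentB"/>
      <gesture type="deictic" target="object1" agent="agentA"/>    <!-- pointing at an object in the scene -->
      <gesture type="iconic" depicts="telephone" agent="agentA"/>  <!-- imitating a telephone with the fingers -->
      <gesture type="facial" feature="raised_eyebrows" agent="agentB"/>
    </gestures>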

Related Research Articles

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While some core ideas in the field may be traced as far back as to early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing and her book Affective Computing published by MIT Press. One of the motivations for the research is the ability to give machines emotional intelligence, including to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions.

Poser is a figure posing and rendering 3D computer graphics program distributed by Bondware. Poser is optimized for the 3D modeling of human figures. Because it enables beginners to produce basic animations and digital images, and because of the extensive availability of third-party digital 3D models, it has attained much popularity.

Body language is a type of communication in which physical behaviors, as opposed to words, are used to express or convey information. Such behavior includes facial expressions, body posture, gestures, eye movement, touch and the use of space. The term body language is usually applied in regard to people but may also be applied to animals. The study of body language is also known as kinesics. Although body language is an important part of communication, most of it happens without conscious awareness.

Nonverbal communication (NVC) is the transmission of messages or signals through a nonverbal platform such as eye contact, facial expressions, gestures, posture, use of objects and body language. It includes the use of social cues, kinesics, distance (proxemics) and physical environments/appearance, of voice (paralanguage) and of touch (haptics). A signal has three parts: the basic signal, what the signal is trying to convey, and how it is interpreted. The signals transmitted to the receiver depend highly on the knowledge and empathy of that individual. It can also include the use of time (chronemics) and eye contact and the actions of looking while talking and listening, frequency of glances, patterns of fixation, pupil dilation, and blink rate (oculesics).

Kinesics is the interpretation of body communication such as facial expressions and gestures, nonverbal behavior related to movement of any part of the body or the body as a whole. The equivalent popular culture term is body language, a term Ray Birdwhistell, considered the founder of this area of study, neither used nor liked.

In linguistics, prosody is the study of elements of speech that are not individual phonetic segments but which are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm. Such elements are known as suprasegmentals.

The semantic gap characterizes the difference between two descriptions of an object by different linguistic representations, for instance languages or symbols. According to Andreas M. Hein, the semantic gap can be defined as "the difference in meaning between constructs formed within different representation systems". In computer science, the concept is relevant whenever ordinary human activities, observations, and tasks are transferred into a computational representation.

Computer facial animation is primarily an area of computer graphics that encapsulates methods and techniques for generating and animating images or models of a character face. The character can be a human, a humanoid, an animal, a legendary creature or character, etc. Due to its subject and output type, it is also related to many other scientific and artistic fields from psychology to traditional animation. The importance of human faces in verbal and non-verbal communication and advances in computer graphics hardware and software have caused considerable scientific, technological, and artistic interests in computer facial animation.

AutoTutor is an intelligent tutoring system developed by researchers at the Institute for Intelligent Systems at the University of Memphis, including Arthur C. Graesser, that helps students learn Newtonian physics, computer literacy, and critical thinking topics through tutorial dialogue in natural language. AutoTutor differs from other popular intelligent tutoring systems, such as the Cognitive Tutor, in that it focuses on natural language dialogue. This means that the tutoring occurs in the form of an ongoing conversation, with human input presented using either voice or free text input. To handle this input, AutoTutor uses computational linguistics algorithms including latent semantic analysis, regular expression matching, and speech act classifiers. These complementary techniques focus on the general meaning of the input, precise phrasing or keywords, and the functional purpose of the expression, respectively. In addition to natural language input, AutoTutor can also accept ad hoc events such as mouse clicks, learner emotions inferred from emotion sensors, and estimates of prior knowledge from a student model. Based on these inputs, the computer tutor determines when to reply and what speech acts to reply with. This process is driven by a "script" that includes a set of dialog-specific production rules.

OpenCog is a project that aims to build an open source artificial intelligence framework. OpenCog Prime is an architecture for robot and virtual embodied cognition that defines a set of interacting components designed to give rise to human-equivalent artificial general intelligence (AGI) as an emergent phenomenon of the whole system. OpenCog Prime's design is primarily the work of Ben Goertzel while the OpenCog framework is intended as a generic framework for broad-based AGI research. Research utilizing OpenCog has been published in journals and presented at conferences and workshops including the annual Conference on Artificial General Intelligence. OpenCog is released under the terms of the GNU Affero General Public License.

Non-verbal leakage is a form of non-verbal behavior that occurs when a person verbalizes one thing but their body language indicates another, common forms of which include facial movements and hand-to-face gestures. The term "non-verbal leakage" originated in the literature in 1968, prompting many subsequent studies on the topic throughout the 1970s, with related studies continuing today.

In humans, posture can provide a significant amount of important information through nonverbal communication. Psychological studies have also demonstrated the effects of body posture on emotions. This research can be traced back to Charles Darwin's studies of emotion and movement in humans and animals. Currently, many studies have shown that certain patterns of body movements are indicative of specific emotions. Researchers studied sign language and found that even non-sign language users can determine emotions from only hand movements. Another example is the fact that anger is characterized by forward whole body movement. The theories that guide research in this field are the self-validation or perception theory and the embodied emotion theory.

The Virtual Human Markup Language, often abbreviated as VHML, is a markup language used for the computer animation of human bodies and facial expressions. The language is designed to describe various aspects of human-computer interaction with regard to facial animation, text-to-speech, and multimedia information.

The NECA Project was a research project that focused on multimodal communication with animated agents in a virtual world. NECA was funded by the European Commission from 1998 to 2002 and the research results were published up to 2005.

Computer-generated imagery (CGI) is a specific technology or application of computer graphics for creating or improving images in art, printed media, simulators, videos and video games. These images are either static or dynamic. CGI refers to both 2D computer graphics and 3D computer graphics with the purpose of designing characters, virtual worlds, or scenes and special effects. The application of CGI for creating/improving animations is called computer animation, or CGI animation.

Emotional prosody or affective prosody is the various non-verbal aspects of language that allow people to convey or understand emotion. It includes an individual's tone of voice in speech that is conveyed through changes in pitch, loudness, timbre, speech rate, and pauses. It can be isolated from semantic information, and interacts with verbal content.

Emotion perception refers to the capacities and abilities of recognizing and identifying emotions in others, in addition to biological and physiological processes involved. Emotions are typically viewed as having three components: subjective experience, physical changes, and cognitive appraisal; emotion perception is the ability to make accurate decisions about another's subjective experience by interpreting their physical changes through sensory systems responsible for converting these observed changes into mental representations. The ability to perceive emotion is believed to be both innate and subject to environmental influence and is also a critical component in social interactions. How emotion is experienced and interpreted depends on how it is perceived. Likewise, how emotion is perceived is dependent on past experiences and interpretations. Emotion can be accurately perceived in humans. Emotions can be perceived visually, audibly, through smell and also through bodily sensations and this process is believed to be different from the perception of non-emotional material.

Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. The use of technology to help people with emotion recognition is a relatively nascent research area. Generally, the technology works best if it uses multiple modalities in context. To date, most work has been conducted on automating the recognition of facial expressions from video, spoken expressions from audio, written expressions from text, and physiology as measured by wearables.

Nadine is a gynoid humanoid social robot modelled on Professor Nadia Magnenat Thalmann. The robot has a strong human likeness, with natural-looking skin and hair and realistic hands. Nadine is a socially intelligent robot which returns a greeting, makes eye contact, and can remember all the conversations had with it. It is able to answer questions autonomously in several languages and to simulate emotions both in gestures and facially, depending on the content of the interaction with the user. Nadine can recognise persons it has previously seen and engage in flowing conversation. Nadine has been programmed with a "personality", in that its demeanour can change according to what is said to it. Nadine has a total of 27 degrees of freedom for facial expressions and upper body movements. With persons it has previously encountered, it remembers facts and events related to each person. It can assist people with special needs by reading stories, showing images, setting up Skype sessions, sending emails, and communicating with other members of the family. It can play the role of a receptionist in an office or serve as a dedicated personal coach.

Artificial empathy or computational empathy is the development of AI systems—such as companion robots or virtual agents—that can detect emotions and respond to them in an empathic way.

References

  1. Intelligent Virtual Agents: 6th International Working Conference, by Jonathan Matthew Gratch, 2006, ISBN 3-540-37593-7, page 221.
  2. Data-Driven 3D Facial Animation, by Zhigang Deng and Ulrich Neumann, 2007, ISBN 1-84628-906-8, page 54.
  3. P. Piwek, et al., "RRL: A Rich Representation Language for the Description of Agent Behaviour", in Proceedings of the AAMAS-02 Workshop on Embodied Conversational Agents, July 16, 2002, Bologna, Italy.
  4. Technologies for Interactive Digital Storytelling and Entertainment, by Stefan Göbel, 2004, ISBN 3-540-22283-9, page 83.
  5. Interactive Storytelling: First Joint International Conference, edited by Ulrike Spierling and Nicolas Szilas, 2008, ISBN 3-540-89424-1, page 93.