Face Modeling Language (FML)
Version 2.1
Interactive Media Technologies Inc. (iMediaTek)
September 15th, 2005
INTRODUCTION
Scope
Motivation
Basic Concepts
FML DOCUMENT STRUCTURE
LANGUAGE CONSTRUCTS
Primary Elements
Modeling
Story Description and Time Containers
Primitive Moves
Decision-Making and Event Handling
External Events
Exclusive Time Containers
Iteration
Definite Loops
Indefinite (Conditional) Loops
Behavioral Templates
FML OBJECT MODEL
Elements
Attributes
FML AND MPEG-4
REFERENCES
This document is the specification for Face Modeling Language (FML) designed based on research projects done in the Department of Electrical and Computer Engineering, University of British Columbia, and the School of Interactive Arts and Technology, Simon Fraser University. FML is a content description language for face animation. Motivations and basic concepts of FML are discussed, and its language constructs and object model (entities and their relations and attributes) are defined.
FML is based on Extensible Markup Language (XML) [1] and shares ideas and concepts with other standards, languages, and technologies. Such common issues (e.g. XML document structure and parsing) are not discussed in this document.
An FML-compatible animation system has three major parts:
* FML Processor
* Animation Player that uses the FML document as input
* Application that owns the player object
Face Modeling Language is designed to be independent of the face animation methods used to render the scenes it describes. Such methods are not explicitly discussed here but some aspects of FML are developed with certain needs of animation players in mind.
Face Animation, as a special type of multimedia presentation, has been a challenging subject for many researchers. Advances in computer hardware and software, together with new web-based applications, have recently intensified these research activities. Video conferencing and online services provided by human characters are good examples of applications that use face animation. Personalized Face Animation includes all the information and activities required to create a multimedia presentation resembling a specific person. The input to such a system can be a combination of audio/visual data and textual commands and descriptions. A successful face animation system needs efficient yet powerful solutions for providing and displaying the content, i.e. a content description format, decoding algorithms, and an architecture that puts the different components together in a flexible way.
Advances in computer graphics techniques, as mentioned before, have allowed the incorporation of computer-generated content in multimedia presentations. Many techniques, languages, and programming interfaces have been proposed to let developers define their virtual scenes. OpenGL [2], Virtual Reality Modeling Language (VRML) [3], and Synchronized Multimedia Integration Language (SMIL) [4] are only a few examples. The growing use of web-based systems has also encouraged the use of such languages, since they make it possible to transmit only a textual description rather than complete audio-visual data, provided the audio-visual effects of the described actions can be recreated with a minimum acceptable quality. Although new streaming technologies allow real-time download and playback of audio/video data, bandwidth limitation and its efficient usage still are, and probably will remain, major issues.
In face animation (and also other cases) minimizing the data transfer time is not the only advantage of content specifications. In many situations, the "real" multimedia data does not exist at all, and has to be created based on a description of desired actions. This leads to the whole new idea of representing the spatial and temporal relation of the facial actions. In a generalized view, such a description of facial presentation should provide a hierarchical structure with elements ranging from low level "images", to simple "moves", more complicated "actions", to complete "stories". We call this a Structured Content Description, which also requires means of defining capabilities, behavioural templates, dynamic contents, and event/user interaction.
Based on the above ideas, some research in face animation has aimed to translate certain facial actions into a predefined set of "codes". The Facial Action Coding System (FACS) [5] was probably the first successful attempt in this area. More recently, the MPEG-4 standard [6] has defined Face Definition and Animation Parameters (FDP and FAP) to encode low-level facial actions like jaw-down, and higher-level, more complicated ones like smile. It also provides the Extensible MPEG-4 Textual format (XMT) as a framework for incorporating textual descriptions in languages like SMIL and VRML. XMT does not yet include any face-specific features.
Due to its capabilities, popularity, and the availability of parsing tools, Extensible Markup Language (XML) seems to be the best choice as the basis of a content description language for face animation. Such a language can be considered a natural high-level abstraction on top of MPEG-4 FAPs and should be able to function as part of the XMT framework.
Face Modeling Language (FML) is a Structured Content Description mechanism based
on Extensible Markup Language. The main ideas behind FML are:
* Hierarchical representation of face animation
* Timeline definition of the relation between facial actions and external
events
* Defining capabilities and behavior templates
* Compatibility with MPEG-4 XMT and FAPs
* Compatibility with XML and related web technologies and existing tools
FACS and MPEG-4 FAPs provide the means of describing low-level face actions, but they do not cover temporal relations and higher-level structures. Languages like SMIL do this in a general-purpose form for any multimedia presentation and are not customized for specific applications like face animation. A language that brings the best of the two together, customized for face animation, seems to be an important requirement. FML is designed to do so, filling the gap in the XMT framework for a face animation language.
Fundamental to FML is the idea of Structured Content Description. It means a
hierarchical view of face animation, capable of representing anything from
simple, individually meaningless moves to complicated high-level stories. This
hierarchy can be thought of as consisting of the following levels (bottom-up):
* Frame, a single image showing a snapshot of the face (Naturally, may not be
accompanied by speech)
* Move, a set of frames representing linear transition between two frames (e.g.
making a smile)
* Action or Act, a "meaningful" combination of moves
* Story, a stand-alone piece of face animation
The boundaries between these levels are not rigid or sharply defined. Due to the
complicated and highly expressive nature of facial activities, a single move
can make a simple yet meaningful story (e.g. an expression). The levels are
needed mainly by the content designer in order to:
* Organize the content
* Define temporal relation between activities
* Develop behavioural templates, based on his/her presentation purposes and
structure.
FML defines a timeline of events (Figure 1) including head movements, speech,
and facial expressions, and their combinations. Since a face animation might be
used in an interactive environment, such a timeline may be altered or determined
by a user. Another function of FML is therefore to allow user interaction and,
in general, event handling (user input can be considered a special case of an
external event). This event handling may take the form of:
* Decision Making; choosing to go through one of possible paths in the story
* Dynamic Generation; creating a new set of actions to follow
Figure 1. FML Timeline and Temporal Relation of Face Activities
A major concern in designing FML is compatibility with existing standards and languages. The growing acceptance of the MPEG-4 standard makes it necessary to design FML so that it can be translated to and from a set of FAPs. Also, due to the similarity of concepts, it is desirable to use SMIL syntax and constructs as much as possible. Satisfying these requirements makes FML a good candidate for being a part of the MPEG-4 XMT framework.
FML is an XML-based language, following the same structural rules (e.g. well-formedness constraints) and sharing the same syntax. XML was chosen as the base for FML because of its capabilities as a markup language, its growing acceptance, and the system support available on different platforms. Figure 2 shows the typical structure of an FML document.
<fml>
  <model> <!-- Model Information -->
    <model-info-item>
  </model>
  <story> <!-- Animation Time Line -->
    <act>
      <time-container>
        <move-item>
        <...>
      </time-container>
      <...>
    </act>
    <...>
  </story>
</fml>
Figure 2. FML Document Map
An FML document consists, at the top level, of two types of elements: model and story. A model element is used for defining face capabilities, parameters, and initial configuration. A story element, on the other hand, describes the timeline of events in the face animation. It is possible to have more than one of each element, but because the animation may be executed sequentially in streaming applications, a model element affects only those parts of the document that come after it.
The face animation timeline consists of facial activities grouped into act modules. Within each group, activities are defined as simple Moves and their temporal relations. The timeline is primarily created using two time container elements, seq and par, corresponding to sequential and parallel temporal relations between moves. The begin times of activities inside a seq are relative to the previous activity, and those inside a par are relative to the container begin time. story and act are special cases of the sequential time container that can only be used at the top levels of an FML document.
FML supports three basic face activities (moves): talking, facial expressions, and 3D head movements. Combined in time containers, they create an FML act. This combination can also be done using nested containers.
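The document structure described above can be inspected with any XML parser. The following is a minimal Python sketch, not part of the FML Processor: the function name outline and the sample document are assumptions made for illustration, and the sample simply combines elements that appear in the figures of this specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document built from elements shown in this specification.
SAMPLE_FML = """
<fml>
  <model>
    <img file="me.jpg" type="front" />
    <range type="left" value="60" />
  </model>
  <story>
    <act>
      <par begin="0">
        <talk>Hello</talk>
        <expr end="3" type="3" value="50" />
      </par>
    </act>
  </story>
</fml>
"""


def outline(fml_text):
    """Return (section, child tag) pairs for each model and story element,
    in document order."""
    root = ET.fromstring(fml_text)
    if root.tag != "fml":
        raise ValueError("root element must be <fml>")
    pairs = []
    for section in root:
        # Only model and story are allowed directly under <fml>.
        if section.tag not in ("model", "story"):
            raise ValueError("unexpected top-level element: " + section.tag)
        for child in section:
            pairs.append((section.tag, child.tag))
    return pairs


print(outline(SAMPLE_FML))
```

Because the pairs are returned in document order, a caller can see which model items precede a given story, which matters since a model element only affects the parts of the document that follow it.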
The FML model element embodies all the modeling and configuration parts of the document. In version 2.1 this can include the following elements:
* character: The person to be displayed in the animation. This element has one major attribute, name, and is used to initialize the animation player database.
* img: The image to be used for animation. This element has two major attributes, file and type. It provides an image and tells the player where to use it. For instance, the image can be a frontal or profile picture used for creating a 3D geometric model. The usage and value of type are player-dependent.
* sound: The sound data to be used in animation. This element also has a file attribute that points to a player-dependent audio data file/directory.
* range: Acceptable range of head movement in a specific direction. It has two major attributes, type and value, specifying the direction and the related range value.
* param: Any player-specific parameter (e.g. MPEG-4 FDP). param has three attributes: type, name, and value.
* data: Any player-specific animation data file/directory (e.g. a 3D geometric model). data has two attributes: name and file.
* template and event: Behavioral models and external events. These elements are discussed in detail in later sections.
* bgsound (bgs): Background audio file
All these elements except template are XML empty elements (i.e. the information is in their attributes). Their absence is not considered a syntax error, since the animation player is supposed to use its default values. Figure 3 illustrates a sample FML model.
<model>
  <img file="me.jpg" type="front" />
  <range type="left" value="60" />
  <template name="hi">
    <seq begin="0">
      <talk>Hello</talk>
      <hdmv begin="0" end="5" type="0" value="30" />
    </seq>
  </template>
</model>
<story>
  <behavior name="hi" />
</story>
Figure 3. FML Model and Templates
The FML timeline, presented in Stories, consists primarily of Acts, which are purposeful sets of Moves. The Acts are performed sequentially but may themselves contain parallel Moves. Time Containers are FML elements that represent the temporal relation between moves. The basic Time Containers are seq and par, corresponding to sequential and parallel activities. The former contains moves that start one after another, and the latter contains moves that begin at the same time. Time Containers can include primitive moves and also other Time Containers in a nested way.
Time Containers have three other attributes, begin, duration, and end, that give the related times in milliseconds (the default value for begin is zero, and duration is an alternative to end).
FML also has a third type of Time Container, excl, used for implementing exclusive activities and decision-making, as discussed later.
FML supports three types of primitive moves: talk, expr, and hdmv for speech, facial expressions, and 3D head movements, respectively. A fap element is also provided for direct embedding of MPEG-4 FAPs.
* talk (spk) is a non-empty XML element and its content is the text to be spoken.
* expr (exp) specifies facial expressions with attributes type and value. The expression types can be neutral, joy, sadness, anger, fear, disgust, surprise, blink, and nod. They can have a value from zero to 100%. expr is an empty element.
* hdmv (mov) handles 3D head movements with attributes type (yaw, pitch, and roll) and value (-100% to 100%). Considering the three axes X (horizontal), Y (vertical), and Z (normal to 2D plane), these movements are rotation around the axes. This move is also an empty element and has the same attributes as facial expressions.
* fap inserts an MPEG-4 FAP into the document. It is also an empty element with attributes type (FAP number) and value (-100% to 100%).
* rprm (rfp) activates a legacy rFace parameter. It is an empty element with attributes type (param number) and value.
* param (prm): Any player-specific parameter (e.g. MPEG-4 FDP); param has three attributes: type, name, and value. For example, for a Component param in the iFACE system, type="comp", name="2-1-1" (group-param-subparam), and value="10".
* play (run) plays a wave or keyframe file. It is an empty element with only one necessary attribute, file (filename). A nonzero value attribute means play-to-file at the given FPS.
* capture (rec) captures the audio and animates the face accordingly. This is an empty element with no attributes.
* target (out) is the file that is the target of recording or playback.
* movie (f2m) makes a movie named in file using the current background audio and last output frames and the FPS given in value.
* txture (img) loads a new texture file.
* ptype (pty) loads a new personality type file.
* reset (clr) resets the face.
* bgsound (bgs): Background audio file
* geometry (geo) opens a new geometry file (x, BMP, JPG, MSH, IMG, CHR)
* wait (nop) performs no operation. It is an empty element with timing attributes, only.
* system (sys) executes a system command using value.
* exit (end) terminates the script. It is an empty element without any attributes.
<act>
  <seq begin="0">
    <talk>Hello</talk>
    <hdmv end="5" type="0" value="30" />
  </seq>
  <par begin="0">
    <talk>Hello</talk>
    <expr end="3" type="3" value="50" />
  </par>
</act>
Figure 4. FML Time Containers and Primitive Moves
All primitive moves have three other attributes, begin, duration, and end (the default value for begin is zero, and duration is an alternative to end). In a sequential time container, begin is relative to the start time of the previous move, and in a parallel container it is relative to the start time of the container. In case of a conflict, the duration of a move is set according to its own settings rather than those of the container. Figure 4 illustrates the use of time containers and primitive moves.
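As an illustration of the timing attributes, the Python sketch below reads the act from Figure 4 and reports the local begin and end of each move. It is a sketch under stated assumptions rather than the player's actual algorithm: the function names are hypothetical, values are taken to be plain millisecond numbers, and end is assumed to lie on the same clock as begin, so that end = begin + duration. The reported times are local to each container; converting them to absolute times is left to the player.

```python
import xml.etree.ElementTree as ET

# The act shown in Figure 4.
FIGURE_4 = """
<act>
  <seq begin="0">
    <talk>Hello</talk>
    <hdmv end="5" type="0" value="30" />
  </seq>
  <par begin="0">
    <talk>Hello</talk>
    <expr end="3" type="3" value="50" />
  </par>
</act>
"""


def resolve_interval(attrs):
    """Return (begin, end) from begin/duration/end attributes.

    begin defaults to zero. duration is treated as an alternative to end,
    assuming end = begin + duration. end is None when neither is given.
    """
    begin = float(attrs.get("begin", 0))
    if "end" in attrs:
        end = float(attrs["end"])
    elif "duration" in attrs:
        end = begin + float(attrs["duration"])
    else:
        end = None
    return begin, end


def list_moves(act_text):
    """Return (container tag, move tag, begin, end) for every move in an act."""
    act = ET.fromstring(act_text)
    rows = []
    for container in act:
        for move in container:
            begin, end = resolve_interval(move.attrib)
            rows.append((container.tag, move.tag, begin, end))
    return rows


for row in list_moves(FIGURE_4):
    print(row)
```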
The interaction between the owner application (or user) and the FML document is provided through FML External Events. In FML version 1.0, External Events are used in decision-making and indefinite iteration. Generally, they can be used for any interaction by users/applications to dynamically define or alter the behavior of an FML document.
External Events are defined by event elements in the model section
of an FML document. Each event is an empty XML element whose attributes give
it a name and an initial value, for example:
<event name="user" value="-1" />
Normal Time Containers (i.e. sequential and parallel) define the order in which activities inside an Action are performed. The Exclusive Time Container, excl, allows making decisions and choosing an option among a set of available activities. This is the primary means of dynamically controlling the behavior of an FML document. Each Exclusive Time Container is associated with a pre-defined External Event and performs only one of its available Move sets based on the event value, as shown in Figure 5.
<event name="user" value="-1" />
. . .
<excl event_name="user">
  <talk event_value="0">Hello</talk>
  <talk event_value="1">Bye</talk>
</excl>
Figure 5. FML Decision-Making
If the event value does not match any of the values specified by event_value, the FML document playback pauses until the value is set by the user/application. The FML Processor exposes an appropriate interface function to allow event values to be set at run time. event is the FML counterpart of familiar if-else constructs in normal programming languages.
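The selection rule for an Exclusive Time Container can be sketched in Python as follows. This only illustrates the rule and is not the FML Processor's interface: the function name select_branch and the plain dictionary holding current event values are assumptions, and event values are compared as strings.

```python
import xml.etree.ElementTree as ET

# The excl element from Figure 5.
FIGURE_5 = """
<excl event_name="user">
  <talk event_value="0">Hello</talk>
  <talk event_value="1">Bye</talk>
</excl>
"""


def select_branch(excl, events):
    """Return the child whose event_value matches the current value of the
    associated event, or None when nothing matches (playback would pause)."""
    current = events.get(excl.get("event_name"))
    for child in excl:
        if child.get("event_value") == current:
            return child
    return None


excl = ET.fromstring(FIGURE_5)

# Initial value from <event name="user" value="-1" />: no branch matches.
events = {"user": "-1"}
print(select_branch(excl, events))

# The application sets the event value at run time; a branch is selected.
events["user"] = "1"
print(select_branch(excl, events).text)
```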
Using the repeat attribute (discussed in the next section), we can allow event handlers to work more than once. Using the "resident" value for the type attribute of an excl puts the event handler in resident mode, where the script seems to terminate but the event handler continues to work.
Iteration in FML is provided by the repeat attribute of Time Container elements, which simply cycles through the content for the specified number of times (Definite Loops) or until a certain condition is satisfied (Indefinite Loops). For a Definite Loop, repeat is either a number or the name of an external event with a numeric non-negative value.
Indefinite Loops are formed when the repeat attribute is associated with an external event (e.g. "kbd;F1_up" for the F1 key released event). In such cases, the iteration continues until the event happens. Figure 6 shows examples of FML iteration.
<event name="select" value="kbd;F1_up" />
< ... >
<act repeat="select">
  <seq>
    <talk begin="1">Come In</talk>
    < ... >
  </seq>
</act>
Figure 6. FML Iteration
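One possible way to interpret the repeat attribute is sketched below in Python. The function name classify_repeat and the dictionary of event values are hypothetical, and the sketch assumes that any event value that is not a non-negative integer (such as "kbd;F1_up") denotes an indefinite loop condition.

```python
def classify_repeat(repeat, events):
    """Classify a repeat attribute as a definite or an indefinite loop.

    repeat is the raw attribute text; events maps event names to their
    current value text (a hypothetical representation).
    Returns ("definite", count) or ("indefinite", condition).
    """
    # A literal non-negative integer is a definite loop count.
    if repeat.isdecimal():
        return ("definite", int(repeat))
    # Otherwise repeat must name an external event.
    if repeat not in events:
        raise ValueError("repeat names an undefined event: " + repeat)
    value = events[repeat]
    if value.isdecimal():
        return ("definite", int(value))
    # Assumption: anything else is a condition to wait for.
    return ("indefinite", value)


print(classify_repeat("3", {}))
print(classify_repeat("select", {"select": "kbd;F1_up"}))
```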
In version 1.0, FML behavioral templates are similar to subroutines in programming languages. They define a set of parameterized activities to be recalled inside the Story using the behavior element. In later versions, they can be extended to include behavioral rules and knowledge bases, especially for interactive applications. A typical model element is illustrated in Figure 7, defining a behavioral template used later in the story.
<model>
  <template name="hi">
    <seq begin="0">
      <talk>Hello</talk>
      <hdmv begin="0" end="5" type="0" value="param-1" />
    </seq>
  </template>
</model>
<story>
  <behavior name="hi" param-1="50" />
</story>
Figure 7. FML Behavioral Templates
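The subroutine-like behavior of templates can be sketched in Python as a simple substitution. This is an illustration only: the function name expand_behavior is hypothetical, the fragment of Figure 7 is wrapped in an fml root so that it parses as one document, and the sketch assumes that a placeholder such as param-1 is replaced wherever it appears as a whole attribute value.

```python
import copy
import xml.etree.ElementTree as ET

# Figure 7, wrapped in an <fml> root element.
FIGURE_7 = """
<fml>
  <model>
    <template name="hi">
      <seq begin="0">
        <talk>Hello</talk>
        <hdmv begin="0" end="5" type="0" value="param-1" />
      </seq>
    </template>
  </model>
  <story>
    <behavior name="hi" param-1="50" />
  </story>
</fml>
"""


def expand_behavior(root, behavior):
    """Return copies of the named template's children with parameter
    placeholders replaced by the behavior element's attribute values."""
    name = behavior.get("name")
    template = None
    for candidate in root.iter("template"):
        if candidate.get("name") == name:
            template = candidate
            break
    if template is None:
        raise KeyError("no template named " + repr(name))
    # Every attribute of <behavior> other than name is a parameter.
    args = {k: v for k, v in behavior.attrib.items() if k != "name"}
    expanded = []
    for child in template:
        clone = copy.deepcopy(child)  # leave the template itself untouched
        for node in clone.iter():
            for key, val in list(node.attrib.items()):
                if val in args:
                    node.set(key, args[val])
        expanded.append(clone)
    return expanded


root = ET.fromstring(FIGURE_7)
behavior = root.find("story/behavior")
for element in expand_behavior(root, behavior):
    print(ET.tostring(element, encoding="unicode"))
```

Copying the template before substitution keeps it reusable, so the same template can be recalled again with different parameter values.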
The FML object model consists of FML elements and their base classes. Figure 8 summarizes the hierarchy of object classes in FML documents and their attributes.
FMLElement ( id )
    FMLTimeContainer ( begin, duration, end, value, repeat ) //value for event
        seq
            story
            act
        par
        excl ( name ) //name for event
    FMLMove ( type, value, file )
        talk
        expr
        hdmv
        fap
        play
        capt
        save
        txtr
    FMLModelItem ( type, name, value, file )
        character
        img
        sound
        param
        data
        range
    FMLEtc ( name, value )
        template
        event
        behavior
Figure 8. FML Object Model
It should be noted that each FMLMove object is in fact a Time Container including only one move. Also worth noting is that some attributes may be left unused by the related objects. For example, the elements in FMLEtc usually use either value or name.
Most of the attributes are self-explanatory. The following are some comments on those with special usage.
type
The type attribute is used in range, fap, expr, hdmv, and excl, and can have the following numeric or string values:
* hdmv (the two last values are related to movement in XY plane): yaw, pitch, roll
* expr: neutral, joy, sadness, anger, fear, disgust, surprise, nod, blink
* fap: standard MPEG-4 FAP numbers
* excl: "resident" if we want the event processing to remain active while allowing the script to end.
value
In FMLMove elements and also for range, value is numeric (adding a d at the end makes it relative, also known as a delta). Otherwise, it is a string (name, address, ...).
* range and hdmv: relative movement in degrees
* expr: percent of full expression
* fap: standard MPEG-4 FAP values
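The numeric form of value, including the relative (delta) suffix, could be parsed as in the following Python sketch. The function name parse_value is an assumption, as is the choice to return the number together with a flag.

```python
def parse_value(text):
    """Split a numeric FML value into (number, is_relative).

    A trailing "d" marks the value as relative (a delta).
    """
    text = text.strip()
    is_relative = text.endswith("d")
    if is_relative:
        text = text[:-1]
    return float(text), is_relative


print(parse_value("30"))
print(parse_value("10d"))
```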
begin, duration, end
All times are in milliseconds by default. An ending “s” means the value is in seconds, e.g. begin="2s".
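A time attribute could be converted to milliseconds as in the Python sketch below. The function name parse_time_ms is hypothetical, and only the "s" suffix described above is handled; any other suffix raises ValueError.

```python
def parse_time_ms(text):
    """Convert a begin/duration/end attribute to milliseconds.

    Plain numbers are already milliseconds; a trailing "s" means seconds.
    """
    text = text.strip()
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0
    return float(text)


print(parse_time_ms("2s"))
print(parse_time_ms("500"))
```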
FML is a high level abstraction on top of MPEG-4 Face Animation Parameters. FAPs
can be grouped into the following categories:
* Visemes
* Expressions
* Low-level facial movements
In FML, visemes are handled implicitly through the talk element. The FML processor translates the input text to a set of phonemes and visemes compatible with those defined in the MPEG-4 standard. FML facial expressions are defined in direct correspondence to those in MPEG-4 FAPs. For other face animation parameters, the fap element can be used. This element works like other FML moves, and its type and value attributes are compatible with FAP numbers and values.
Given this compatibility, FML documents can be easily translated into MPEG-4 streams, which makes FML a good candidate for the Extensible MPEG-4 Textual Format (XMT) framework.
[1] http://www.w3.org/xml
[2] http://www.opengl.org
[3] http://www.vrml.org
[4] Bulterman, D., "SMIL-2," IEEE Multimedia, October 2001.
[5] Ekman, P., and Friesen, W.V., Facial Action Coding System,
Consulting Psychologists Press Inc., 1978.
[6] Battista, S., et al., "MPEG-4: A Multimedia Standard for the Third
Millennium", IEEE Multimedia, October 1999.