A strong motivation for this work comes from one area of perception: computer vision. The field has been evolving as vision techniques transition from low-level static image analysis to dynamic, high-level video interpretation. The availability of real-time tracking of bodies, faces, and hands [2] [72] [22] has spurred the development of dynamical analysis of human motion [70] [24] [74] [57] [9]. Isard [24], Pentland [46], and Bregler [9] discuss the use of multiple linear dynamic models to account for the variability in motion induced by an underlying behavioural control mechanism. These approaches represent a transition from physical to behavioural dynamic constraints, and they can infer higher-order discrete descriptive models of the actions. In contrast to such behaviour-modeling vision systems, interactive applications such as Pfinder [72] have also been used for real-time interaction with synthetic characters.
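The idea of multiple linear dynamic models selected by a behavioural control mechanism can be sketched as a switching linear dynamical system. The sketch below is illustrative only, not the actual systems of [24], [46], or [9]: the model matrices, regime names, and transition probabilities are all made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 2-D linear models over a state [position, velocity].
A = {
    "slow": np.array([[1.0, 0.5], [0.0, 0.9]]),  # damped-motion regime
    "fast": np.array([[1.0, 1.0], [0.0, 1.0]]),  # constant-velocity regime
}
# Assumed Markov transition probabilities between behaviour labels.
P = {"slow": {"slow": 0.95, "fast": 0.05},
     "fast": {"slow": 0.05, "fast": 0.95}}

def simulate(steps, x0, label0, noise=0.01):
    """Roll the switching model forward; at each step the current discrete
    behaviour label selects which linear model drives the continuous state."""
    x, label = np.asarray(x0, dtype=float), label0
    states, labels = [x], [label]
    for _ in range(steps):
        x = A[label] @ x + noise * rng.standard_normal(2)
        probs = P[label]
        label = rng.choice(list(probs), p=list(probs.values()))
        states.append(x)
        labels.append(label)
    return np.array(states), labels

states, labels = simulate(50, x0=[0.0, 1.0], label0="slow")
```

Inference in the cited work runs the other direction: given observed motion, it recovers which discrete model (behaviour) is active at each moment, yielding the higher-order descriptive models mentioned above.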
In addition, the development of sophisticated 2D recovery, 3D rigid and non-rigid shape estimation, and 3D pose recovery has permitted the use of vision to synthesize and manipulate interesting virtual renderings of real objects [10] [27] [59]. For example, Terzopoulos [59] discusses facial tracking and the deformation of 3D models to create visually compelling facial animations.
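To make the idea of driving 3D model deformation from tracked facial parameters concrete, the toy sketch below uses a linear blendshape scheme. Note this is a deliberately simpler technique than the physics-based deformable models of [59]; the mesh vertices, expression names, and weights are all invented for illustration.

```python
import numpy as np

# Neutral pose of a toy 3-vertex mesh (placeholder coordinates).
neutral = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

# Per-expression displacement shapes ("deltas"), also placeholders.
deltas = {
    "smile":      np.array([[0.0, 0.1, 0.0],
                            [0.0, 0.1, 0.0],
                            [0.0, 0.0, 0.0]]),
    "brow_raise": np.array([[0.0, 0.0, 0.1],
                            [0.0, 0.0, 0.0],
                            [0.0, 0.2, 0.0]]),
}

def deform(weights):
    """Displace the neutral mesh by a weighted sum of expression deltas."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * deltas[name]
    return mesh

# In a full system these weights would come from the face tracker each frame;
# here they are fixed values for the demonstration.
animated = deform({"smile": 0.8, "brow_raise": 0.5})
```

Re-rendering the deformed mesh each frame, with weights updated from the tracker, is what ties the visual sensing channel to the synthesized animation.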
Visual sensing creates an intermediate channel tying together the real and virtual worlds. It is not the only such channel, however; multi-modal sensing is also becoming an important source of behavioural models and interaction. For instance, auditory perception is used in conjunction with hand tracking in synthetic animals [52] and in synthetic humanoids [60].