Of course, the learning system generates both a
and a
.
Therefore, it would be of no
extra cost to utilize the information in
in some
way while the system is interacting with the user. Recall the earlier
brief discussion of the similarity of this output to that of a filter
(e.g. a Kalman filter). Instead of explicitly using Kalman filters in
the vision systems (as described earlier), one could
consider the predicted
as an alternative
to filtering and smoothing. The ARL system then acts
as a sophisticated non-linear dynamical filter. In that sense, it
could be used to help the vision tracking and even resolve some vision
system errors.
Typically, tracking algorithms use a variety of temporal dynamic models to assist the frame-by-frame vision computations. The most trivial of these is to use the last estimate in a nearest neighbour approach to initialize the next vision iteration. Kalman filtering and other dynamic models involve more sophistication, ranging from constant velocity models to very complex control systems. Here, the feedback used to constrain the vision system results from dynamics and behaviour modeling. This is similar in spirit to the mixed dynamic and behaviour models in [46]. In the head and hand tracking case, the system continuously feeds prediction estimates of the 15 tracked parameters (3 Gaussians) back into the vision system for improved results.
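The three kinds of initialization mentioned above can be contrasted in a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the tracked state is treated as a flat list of numbers (for instance the 15 blob parameters), and `predictor` stands in for whatever learned behaviour model supplies the feedback; none of this is taken from the actual implementation.

```python
from typing import Callable, List, Sequence

# A history is a list of past state estimates, oldest first.
# Each state is a flat sequence of tracked parameters (assumed layout).
History = Sequence[Sequence[float]]


def last_estimate_init(history: History) -> List[float]:
    """Nearest-neighbour style: start the next frame from the last estimate."""
    return list(history[-1])


def constant_velocity_init(history: History) -> List[float]:
    """Constant-velocity model: extrapolate the last frame-to-frame change."""
    if len(history) < 2:
        # Not enough frames to estimate a velocity; fall back to the last estimate.
        return list(history[-1])
    prev, last = history[-2], history[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def behaviour_init(
    history: History,
    predictor: Callable[[History], Sequence[float]],
) -> List[float]:
    """Behaviour feedback: let a learned predictor (hypothetical) propose the next state."""
    return list(predictor(history))
```

In such a scheme the vision iteration for the next frame would start from whichever initial guess is returned, so the three functions are interchangeable from the tracker's point of view.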
More significant vision errors can also be handled. Consider the
specific case of head and hand tracking with skin blobs. As mentioned
earlier, colored gloves were used to overcome some correspondence
problems when heads and hands touched and moved past each other. The
first training sequences involved no mis-correspondence due to explicit
glove labeling of head, left hand and right hand. However, once
appropriately trained, the probabilistic model described above feeds
back the positions of the Gaussians to the vision system. This prevents blob
mislabeling by using the whole gesture as a predictor instead of
short-range dynamics. Thus, it is possible to recognize a blob as a hand
from its role in a gesture and to maintain proper tracking. This
permits us to reliably do away with colored gloves. In addition, a
coarse model of
is available and can be evaluated to
determine the likelihood of any past interaction. If permutations of
the blobs being tracked by the computer vision are occasionally tested
with
,
any mislabeling of the blob features can be
detected and corrected. The system merely selects whichever of the 6
permutations of the 3 blobs maximizes
and then feeds
back the appropriate
estimate to the computer
vision system. Instead of using complex static computations to resolve these
ambiguities, a reliable correspondence between the blobs is computed
from temporal information.
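The permutation test described above can be sketched in a few lines of Python. This is a hedged illustration, not the system's code: `best_labeling` and `toy_log_likelihood` are hypothetical names, each blob is reduced to a 2D position, and the toy score (negative squared distance to assumed expected positions) merely stands in for the coarse probabilistic model that the text evaluates.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Blob = Tuple[float, float]  # assumed: a blob summarized by its (x, y) mean

# Assumed expected positions for (head, left hand, right hand).
EXPECTED = [(0.0, 0.0), (-1.0, -1.0), (1.0, -1.0)]


def toy_log_likelihood(ordering: Sequence[Blob]) -> float:
    """Stand-in score: higher when blobs lie near their expected positions."""
    return -sum(
        (x - ex) ** 2 + (y - ey) ** 2
        for (x, y), (ex, ey) in zip(ordering, EXPECTED)
    )


def best_labeling(
    blobs: Sequence[Blob],
    log_likelihood: Callable[[Sequence[Blob]], float],
) -> Tuple[Blob, ...]:
    """Try every ordering of the blobs (6 for 3 blobs) and keep the best-scoring one."""
    return max(permutations(blobs), key=log_likelihood)


# Blobs as delivered by the tracker, with the labels scrambled.
observed = [(1.1, -0.9), (0.1, 0.0), (-1.0, -1.1)]
corrected = best_labeling(observed, toy_log_likelihood)
```

With the scrambled input above, `corrected` places the blob near the origin first, then the left-hand and right-hand blobs. Exhaustive search is reasonable here only because 3 blobs give just 6 orderings; the number of orderings grows factorially with the number of blobs.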