Figure 5.2 depicts this learning problem. Given the 999 measurements (called $\mathbf{x}$), which will always be sent to the salesman, what is the correct shoe size ($y$)? The problem with fitting a generative model to this data is that it will make no distinction between $\mathbf{x}$ and $y$. These variables are simply clumped together as measurements in a big joint space, and the model tries to describe the overall phenomenon governing the features. As a result, the system will be swamped and will learn useless interactions between the 999 variables in $\mathbf{x}$ (the file). It will not focus on learning how to use $\mathbf{x}$ to compute $y$ but will instead worry equally about estimating any variable in the common soup from any possible observation. Thus, the system will waste resources and be distracted by spurious features that have nothing to do with shoe-size estimation (such as hair and eye color) instead of optimizing the ones that do (height, age, etc.).
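To make the resource argument concrete, here is a back-of-the-envelope sketch (not from the original text; it assumes the simplest instance of each approach, namely a single joint Gaussian over $(\mathbf{x}, y)$ versus a linear-Gaussian regression for $p(y|\mathbf{x})$) that counts how many parameters each model must estimate and how many of the generative model's parameters describe only interactions among the inputs:

\begin{verbatim}
# Rough parameter-count comparison for the shoe-size example.
# Assumed models (illustrative only): a single joint Gaussian over (x, y)
# versus a conditional linear-Gaussian regression p(y | x).
d = 999                                   # number of input measurements in x

# Generative: joint Gaussian over the (d + 1)-dimensional vector (x, y).
joint_mean = d + 1
joint_cov = (d + 1) * (d + 2) // 2        # symmetric covariance entries
joint_total = joint_mean + joint_cov

# Covariance entries that describe only x-x interactions
# (hair darkness vs. eye darkness, etc.) and never involve the output y.
xx_only = d * (d + 1) // 2

# Discriminative/conditional: p(y | x) = N(w'x + b, sigma^2).
conditional_total = d + 2                 # weights, bias, noise variance

print(f"joint Gaussian parameters:       {joint_total:,}")
print(f"  ... spent purely on x-x terms: {xx_only:,} "
      f"({100 * xx_only / joint_total:.1f}%)")
print(f"conditional model parameters:    {conditional_total:,}")
\end{verbatim}

Under these assumptions, over 99% of the joint model's parameters describe input-input structure that the conditional model never has to estimate.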
A misallocation of resources occurs when a generative model is used as in Figure 5.3. Note how the model would place different linear sub-models to describe the interaction between hair darkness and eye darkness. This is wasteful for the application since these attributes have nothing to do with shoe size, as can be seen from the lack of correlation between hair color and shoe size. In fact, observing the height of the client alone seems sufficient to determine the shoe size. The misallocation of resources is a symptom of the joint density estimation in $p(\mathbf{x},y)$, which treats all variables equally. A discriminative or conditional model that estimates $p(y|\mathbf{x})$ does not use resources for the estimation of hair darkness or some other useless feature. Instead, only the estimation of $y$, the shoe size, is of concern. Thus, as the dimensionality of $\mathbf{x}$ increases (i.e. to 999), more resources are wasted modeling other features and performance on $y$ usually degrades even though information is being added to the system. However, if it is made explicit that only the output $y$ is of value, additional features in $\mathbf{x}$ should be more helpful than harmful.
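The misallocation in Figure 5.3 can be illustrated with a small numerical experiment (a toy sketch, not taken from the text; the data, the two-component Gaussian mixture, and the EM fit below are all illustrative assumptions). A joint mixture fit by maximum likelihood tends to spend its two components separating an irrelevant but strongly bimodal feature (hair darkness), while the same budget of two linear sub-models, if allocated according to the relevant input (height), fits the shoe-size relationship well:

\begin{verbatim}
import numpy as np

# Toy data: shoe size depends only on height (through |height|); hair
# darkness is irrelevant but strongly bimodal.  Names and numbers are
# illustrative assumptions, not values from the text.
rng = np.random.default_rng(0)
n = 400
height = rng.normal(0.0, 1.0, n)
hair = rng.normal(0.0, 0.3, n) + np.where(rng.random(n) < 0.5, -3.0, 3.0)
shoe = np.abs(height) + rng.normal(0.0, 0.3, n)
Z = np.column_stack([height, hair, shoe])   # joint space seen generatively

def gaussian_logpdf(X, mu, cov):
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T
    return -0.5 * (np.sum(diff * sol, axis=1)
                   + np.linalg.slogdet(cov)[1]
                   + X.shape[1] * np.log(2 * np.pi))

def fit_gmm(Z, K, iters, seed):
    """Plain EM for a K-component full-covariance Gaussian mixture."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    mus = Z[rng.choice(n, K, replace=False)].copy()
    covs = np.array([np.cov(Z.T) + 1e-3 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    loglik = -np.inf
    for _ in range(iters):
        log_p = np.column_stack(
            [np.log(pis[k]) + gaussian_logpdf(Z, mus[k], covs[k])
             for k in range(K)])
        m = log_p.max(axis=1, keepdims=True)
        loglik = float((m + np.log(np.exp(log_p - m).sum(1, keepdims=True))).sum())
        r = np.exp(log_p - m)
        r /= r.sum(axis=1, keepdims=True)
        for k in range(K):
            w = r[:, k]
            mus[k] = (w[:, None] * Z).sum(0) / w.sum()
            diff = Z - mus[k]
            covs[k] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(d)
            pis[k] = w.mean()
    return loglik, mus, covs, pis

# Best-of-restarts joint fit: the likelihood tends to prefer splitting the
# two components along the (irrelevant) bimodal hair feature.
loglik, mus, covs, pis = max((fit_gmm(Z, 2, 100, s) for s in range(5)),
                             key=lambda t: t[0])
print("component means [height, hair, shoe]:\n", np.round(mus, 2))

def gmm_conditional_mean(X):
    """E[shoe | height, hair] implied by the fitted joint mixture."""
    num, den = np.zeros(len(X)), np.zeros(len(X))
    for k in range(2):
        mu_x, mu_y = mus[k][:2], mus[k][2]
        Sxx, Sxy = covs[k][:2, :2], covs[k][:2, 2]
        gate = pis[k] * np.exp(gaussian_logpdf(X, mu_x, Sxx))
        num += gate * (mu_y + (X - mu_x) @ np.linalg.solve(Sxx, Sxy))
        den += gate
    return num / den

X = Z[:, :2]
gen_mse = np.mean((gmm_conditional_mean(X) - shoe) ** 2)

# Same budget (two linear sub-models), but allocated to the relevant split:
sse = 0.0
for mask in (height > 0, height <= 0):
    A = np.column_stack([height[mask], np.ones(mask.sum())])
    w, *_ = np.linalg.lstsq(A, shoe[mask], rcond=None)
    sse += np.sum((A @ w - shoe[mask]) ** 2)

print(f"conditional MSE from joint mixture : {gen_mse:.3f}")
print(f"two sub-models split on height sign: {sse / n:.3f}")
\end{verbatim}

In this toy setting the joint mixture's components typically differ in their hair means but not their height means, and its implied conditional predictor is markedly worse than the same two linear sub-models deployed on the relevant feature.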
We will now show that conditional or discriminative models fit nicely into a fully probabilistic framework. The following chapters will outline a monotonically convergent training and inference algorithm (CEM, a variation on EM) that will be derived for conditional density estimation. This overcomes some of the ad hoc inference aspects of conditional or discriminative models and yields a formal, efficient and reliable way to train them. In addition, a probabilistic model allows us to manipulate the model after training using principled Bayesian techniques.
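As a sketch of the objectives involved (standard notation; the symbols $\Theta$, $\mathbf{x}_i$, and $y_i$ are assumptions rather than definitions from this section), EM maximizes the joint log-likelihood while CEM maximizes the conditional log-likelihood, the part of the joint that actually scores the prediction of $y$ from $\mathbf{x}$:
\[
l_{\mathrm{joint}}(\Theta) = \sum_i \log p(\mathbf{x}_i, y_i \mid \Theta),
\qquad
l_{\mathrm{cond}}(\Theta) = \sum_i \log p(y_i \mid \mathbf{x}_i, \Theta)
= \sum_i \left[ \log p(\mathbf{x}_i, y_i \mid \Theta) - \log p(\mathbf{x}_i \mid \Theta) \right].
\]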