Figure 5.2 depicts this learning problem. Given the 999 measurements (called $\mathbf{x}$), which will always be sent to the salesman, what is the correct shoe size ($y$)? The problem with fitting a generative model to this data is that it makes no distinction between $\mathbf{x}$ and $y$. These variables are simply clumped together as measurements in a big joint space, and the model tries to describe the overall phenomenon governing the features. As a result, the system will be swamped and will learn useless interactions among the 999 variables in $\mathbf{x}$ (the file). It will not focus on learning how to use $\mathbf{x}$ to compute $y$ but will instead worry equally about estimating any variable in the common soup from any possible observation. Thus, the system will waste resources and be distracted by spurious features that have nothing to do with shoe size estimation (like hair and eye color) instead of optimizing the ones that do (height, age, etc.).
A misallocation of resources occurs when a generative model is used as in Figure 5.3. Note how the model places different linear sub-models to describe the interaction between hair darkness and eye darkness. This is wasteful for the application since these attributes have nothing to do with shoe size, as can be seen from the lack of correlation between hair color and shoe size. In fact, only the height of the client appears to be useful for determining the shoe size. The misallocation of resources is a symptom of joint density estimation of $p(\mathbf{x}, y)$, which treats all variables equally. A discriminative or conditional model that estimates $p(y|\mathbf{x})$ does not spend resources estimating hair darkness or some other useless feature. Instead, only the estimation of $y$, the shoe size, is of concern. Thus, as the dimensionality of $\mathbf{x}$ increases (i.e. to 999), more resources are wasted modeling other features and performance on $y$ usually degrades even though information is being added to the system. However, if it is made explicit that only the output $y$ is of value, additional features in $\mathbf{x}$ should be more helpful than harmful.
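As a rough illustration of this point (a sketch constructed here, not taken from the text), the following Python fragment builds a synthetic client file in which only height predicts shoe size while hair and eye darkness form a strongly clustered but irrelevant pattern. A capacity-limited joint mixture (two diagonal-covariance Gaussians, standing in for the limited sub-models of Figure 5.3) spends its components separating the colouring clusters, so its conditional prediction of shoe size is poor; a model fit directly to $p(y|\mathbf{x})$ (here an ordinary least-squares regression) ignores that structure and predicts well. The variable names, model choices, and use of NumPy/scikit-learn are all illustrative assumptions.

\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 1000

# Hypothetical client file: height drives shoe size, while hair and eye
# darkness form two strongly structured but irrelevant clusters.
height = rng.normal(0.0, 1.0, n)
group = rng.integers(0, 2, n)                      # dark vs. light colouring
hair = 4.0 * group + rng.normal(0.0, 0.3, n)
eye = 4.0 * group + rng.normal(0.0, 0.3, n)
y = 2.0 * height + rng.normal(0.0, 0.2, n)         # shoe size

X = np.column_stack([height, hair, eye])
Z = np.column_stack([X, y])

# Generative route: a resource-limited joint mixture over (x, y).  With only
# two diagonal components, the fit spends them separating the hair/eye
# clusters rather than capturing the height -> shoe-size relation.
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(Z)

# Predict y from x under the joint model: responsibilities from the
# x-marginal of each component, then a weighted average of the component
# means of y.
logp = np.zeros((n, 2))
for k in range(2):
    mu_k, var_k = gmm.means_[k, :3], gmm.covariances_[k, :3]
    logp[:, k] = np.log(gmm.weights_[k]) - 0.5 * np.sum(
        np.log(2 * np.pi * var_k) + (X - mu_k) ** 2 / var_k, axis=1)
resp = np.exp(logp - logp.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)
y_gen = resp @ gmm.means_[:, 3]

# Discriminative route: spend all resources on p(y | x); a simple
# least-squares regression stands in for the conditional model.
A = np.column_stack([X, np.ones(n)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
y_dis = A @ w

print("MSE, joint mixture     :", np.mean((y - y_gen) ** 2))
print("MSE, conditional model :", np.mean((y - y_dis) ** 2))
\end{verbatim}

In this sketch the joint model's error is close to the raw variance of the shoe size, while the conditional fit recovers the height relation almost exactly; giving the joint model more components or full covariances would close the gap, which is precisely the extra resource expenditure discussed above.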
We will now show that conditional or discriminative models fit nicely into a fully probabilistic framework. The following chapters will outline a monotonically convergent training and inference algorithm (CEM, a variation on EM) derived for conditional density estimation. This overcomes some of the ad hoc inference aspects of conditional or discriminative models and yields a formal, efficient and reliable way to train them. In addition, a probabilistic model allows us to manipulate the model after training using principled Bayesian techniques.
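To make the distinction concrete, the two maximum-likelihood objectives differ only in which density is scored on the training pairs; the notation below is a standard rendering assumed here rather than quoted from the following chapters:

\[
\Theta^{\mathrm{joint}} \;=\; \arg\max_{\Theta} \sum_{i=1}^{N} \log p(\mathbf{x}_i, y_i \mid \Theta),
\qquad
\Theta^{\mathrm{cond}} \;=\; \arg\max_{\Theta} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \Theta).
\]

EM provides a monotonically convergent route to the first objective for latent-variable models; CEM, as developed in the following chapters, plays the analogous role for the second.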