Having obtained a joint density $p(\mathbf{z} \mid \mathcal{Z})$ or, more
compactly, $p(\mathbf{z})$, we can compute the probability of any point
$\mathbf{z}$ in the vector space. However, evaluating the pdf in such a
manner is not necessarily the ultimate objective. Often, some components
of the vector are given as input ($\mathbf{x}$) and the learning system is
required to estimate the missing components as output ($\mathbf{y}$). In
other words, $\mathbf{z}$ can be broken up into two sub-vectors
$\mathbf{x}$ and $\mathbf{y}$, and a conditional pdf is computed from the
original joint pdf over the whole vector as in Equation 5.3. This
conditional pdf is $p(\mathbf{y} \mid \mathbf{x})^j$, with the $j$
superscript to indicate that it is obtained from the previous estimate of
the joint density. When an input $\mathbf{x}$ is specified, this
conditional density becomes a density over $\mathbf{y}$, the desired
output of the system. This density is the required function of the
learning system, and if a final output estimate $\hat{\mathbf{y}}$ is
needed, the expectation or arg max can be found via Equation 5.4.
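The conditioning step and the two point estimates can be illustrated with a minimal sketch. It assumes a small discrete joint table in place of an estimated continuous density; the table `JOINT` and the helper names `condition_on_input`, `expected_output`, and `most_likely_output` are hypothetical and chosen only for this illustration.

```python
# Hypothetical joint pmf p(x, y) on a small grid, standing in for an
# estimated joint density. Keys are (x, y) pairs; values sum to 1.
JOINT = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}


def condition_on_input(joint, x):
    """Return p(y | x) = p(x, y) / sum_y p(x, y) as a dict over y."""
    slice_at_x = {y: p for (xi, y), p in joint.items() if xi == x}
    marginal = sum(slice_at_x.values())  # this is p(x)
    if marginal == 0.0:
        raise ValueError("input x has zero probability under the joint")
    return {y: p / marginal for y, p in slice_at_x.items()}


def expected_output(conditional):
    """Point estimate by expectation: sum_y y * p(y | x)."""
    return sum(y * p for y, p in conditional.items())


def most_likely_output(conditional):
    """Point estimate by arg max: the y with the largest p(y | x)."""
    return max(conditional, key=conditional.get)


conditional_at_1 = condition_on_input(JOINT, 1)
y_mean = expected_output(conditional_at_1)
y_mode = most_likely_output(conditional_at_1)
```

For the input x = 1 in this toy table, the expectation is about 0.92 while the arg max is 0, which shows that the two ways of collapsing the conditional density into a single output need not agree.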
Obtaining a conditional density from the unconditional (i.e. joint) probability density function in such a roundabout way can be shown to be suboptimal. However, it has remained popular and is convenient partly because of the availability of powerful techniques for joint density estimation (such as EM).
If we know a priori that we will need the conditional density, it is
evident that it should be estimated directly from the training
data. Direct Bayesian conditional density estimation is defined in
Equation 5.5. The vector $\mathbf{x}$ (the input or covariate) is always
given and the vector $\mathbf{y}$ (the output or response) is to be
estimated. The training data is of course also explicitly split into the
corresponding $\mathcal{X}$ and $\mathcal{Y}$ vector sets. Note here that
the conditional density is referred to as $p(\mathbf{y} \mid \mathbf{x})^c$
to distinguish it from the expression in Equation 5.3.
Here, $\Theta^c$ parametrizes a conditional density
$p(\mathbf{y} \mid \mathbf{x})$. $\Theta^c$ is exactly the parametrization
of the conditional density $p(\mathbf{y} \mid \mathbf{x})$ that results
from the joint density $p(\mathbf{x}, \mathbf{y})$ parametrized by
$\Theta$. Initially, it seems intuitive that the above expression should
yield exactly the same conditional density as before. It seems natural
that $p(\mathbf{y} \mid \mathbf{x})^c$ should equal
$p(\mathbf{y} \mid \mathbf{x})^j$ since $\Theta^c$ is just the conditioned
version of $\Theta$. In other words, if the expression in Equation 5.1 is
conditioned as in Equation 5.3, then the result should be identical to
Equation 5.5. This conjecture is wrong.
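To make the comparison concrete, the two estimates can be sketched side by side. The notation is an assumption made for illustration ($\Theta$ for the joint parameters, $\Theta^c$ for the conditional parameters, $\mathcal{X}$ and $\mathcal{Y}$ for the training inputs and outputs). The block is a sketch of the forms the text describes, not a verbatim copy of Equations 5.3 and 5.5.

```latex
% Conditioned joint estimate: integrate over the joint parameters first,
% then condition the resulting joint density on the input x.
\[
p(\mathbf{y} \mid \mathbf{x})^{j}
  = \frac{p(\mathbf{x}, \mathbf{y} \mid \mathcal{X}, \mathcal{Y})}
         {\int p(\mathbf{x}, \mathbf{y} \mid \mathcal{X}, \mathcal{Y}) \, d\mathbf{y}}
  = \frac{\int p(\mathbf{x}, \mathbf{y} \mid \Theta) \,
          p(\Theta \mid \mathcal{X}, \mathcal{Y}) \, d\Theta}
         {\int p(\mathbf{x} \mid \Theta) \,
          p(\Theta \mid \mathcal{X}, \mathcal{Y}) \, d\Theta}
\]

% Direct conditional estimate: integrate a conditional model over its
% own parameters.
\[
p(\mathbf{y} \mid \mathbf{x})^{c}
  = \int p(\mathbf{y} \mid \mathbf{x}, \Theta^{c}) \,
    p(\Theta^{c} \mid \mathcal{X}, \mathcal{Y}) \, d\Theta^{c}
\]
```

The first form is a ratio of two integrals over the joint parameters, while the second is a single integral over the parameters of a conditional model, so nothing in their structure forces them to coincide.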
Upon closer examination, we note an important difference. The $\Theta^c$
we are integrating over in Equation 5.5 is not the same as the $\Theta$ in
Equation 5.1. In the direct conditional density estimate (Equation 5.5),
$\Theta^c$ only parametrizes a conditional density
$p(\mathbf{y} \mid \mathbf{x})$ and therefore provides no information
about the density of $\mathbf{x}$ or $\mathcal{X}$. In fact, we can assume
that the conditional density parametrized by $\Theta^c$ is just a function
over $\mathbf{x}$ with some parameters. Therefore, we can essentially
ignore any relationship it could have to some underlying joint density
parametrized by $\Theta$.
Since this is only a conditional model, the term
$p(\Theta^c \mid \mathcal{X}, \mathcal{Y})$ in Equation 5.5 behaves
differently than the similar term $p(\Theta \mid \mathcal{X}, \mathcal{Y})$
in Equation 5.1. This is illustrated in the manipulation involving Bayes
rule shown in Equation 5.6.
In the final line of Equation 5.6, an important manipulation is noted:
$p(\Theta^c \mid \mathcal{X})$ is replaced with $p(\Theta^c)$. This
implies that observing $\mathcal{X}$ does not affect the probability of
$\Theta^c$. This operation is invalid in the joint density estimation case
since $\Theta$ has parameters that determine a density in the $\mathbf{x}$
domain. However, in conditional density estimation, if $\mathcal{Y}$ is
not also observed, $\Theta^c$ is independent of $\mathcal{X}$. It in no
way constrains or provides information about the density of $\mathcal{X}$
since it merely parametrizes a conditional density over
$\mathbf{y} \mid \mathbf{x}$.
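The Bayes rule manipulation described here can be sketched as follows. The symbols are assumptions for illustration ($\Theta^c$ for the conditional parameters, $\mathcal{X}$ and $\mathcal{Y}$ for the training inputs and outputs), and the block is a sketch of the steps the text describes, not a verbatim copy of Equation 5.6.

```latex
\begin{align*}
p(\mathbf{y} \mid \mathbf{x})^{c}
  &= \int p(\mathbf{y} \mid \mathbf{x}, \Theta^{c}) \,
     p(\Theta^{c} \mid \mathcal{X}, \mathcal{Y}) \, d\Theta^{c} \\
  % Bayes rule applied to the posterior, keeping X as a conditioning variable
  &= \int p(\mathbf{y} \mid \mathbf{x}, \Theta^{c}) \,
     \frac{p(\mathcal{Y} \mid \Theta^{c}, \mathcal{X}) \,
           p(\Theta^{c} \mid \mathcal{X})}
          {p(\mathcal{Y} \mid \mathcal{X})} \, d\Theta^{c} \\
  % Theta^c is independent of X when Y is unobserved, so the prior drops X
  &= \int p(\mathbf{y} \mid \mathbf{x}, \Theta^{c}) \,
     \frac{p(\mathcal{Y} \mid \Theta^{c}, \mathcal{X}) \,
           p(\Theta^{c})}
          {p(\mathcal{Y} \mid \mathcal{X})} \, d\Theta^{c}
\end{align*}
```

Only the last step uses the independence argument. The second line is ordinary Bayes rule and would hold for a joint model as well.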
The graphical models in
Figure 5.4 depict the difference between joint
density models and conditional density models using a directed acyclic
graph [35] [28]. Note that the $\Theta^c$ model and the $\mathcal{X}$ data
are independent if $\mathcal{Y}$ is not observed in the conditional
density estimation scenario. In graphical terms, the joint parametrization
$\Theta$ is a parent of the children nodes $\mathcal{X}$ and
$\mathcal{Y}$. Meanwhile, the conditional parametrization $\Theta^c$ and
the $\mathcal{X}$ data are co-parents of the child $\mathcal{Y}$ (they are
marginally independent).
Equation 5.7 then finally illustrates the directly estimated conditional
density solution $p(\mathbf{y} \mid \mathbf{x})^c$.
The Bayesian integration estimate of the conditional density appears to be different from, and superior to, the conditioned Bayesian integration estimate of the unconditional (joint) density. The integral is typically difficult to evaluate. The corresponding conditional MAP and conditional ML solutions are given in Equation 5.8.
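The gap between the joint and conditional maximum likelihood criteria can be seen in a minimal sketch. It assumes a deliberately restricted, hypothetical one-parameter model over binary x and y in which the same theta controls both p(x = 1) and the probability that y equals x. The data, the model, and all function names are invented for this illustration and are not taken from the text.

```python
import math

# Hypothetical toy data: (x, y) pairs with binary input x and output y.
# x = 1 occurs in 2 of 10 points; y equals x in 9 of 10 points.
DATA = [(0, 0)] * 7 + [(0, 1)] * 1 + [(1, 1)] * 2


def log_p_x(x, theta):
    """Assumed input model: p(x = 1) = theta."""
    return math.log(theta if x == 1 else 1.0 - theta)


def log_p_y_given_x(y, x, theta):
    """Assumed conditional model: p(y = x | x) = theta."""
    return math.log(theta if y == x else 1.0 - theta)


def joint_log_likelihood(theta, data=DATA):
    """ML criterion: sum_i log p(x_i, y_i | theta)."""
    return sum(log_p_x(x, theta) + log_p_y_given_x(y, x, theta)
               for x, y in data)


def conditional_log_likelihood(theta, data=DATA):
    """Conditional ML criterion: sum_i log p(y_i | x_i, theta)."""
    return sum(log_p_y_given_x(y, x, theta) for x, y in data)


# Simple grid search over theta in (0, 1); adequate for one parameter.
GRID = [i / 100 for i in range(1, 100)]


def argmax_on_grid(objective):
    """Return the grid value of theta that maximizes the objective."""
    return max(GRID, key=objective)


theta_joint = argmax_on_grid(joint_log_likelihood)
theta_conditional = argmax_on_grid(conditional_log_likelihood)
```

In this toy setting the joint criterion settles on theta = 0.55 because it must also explain the inputs, while the conditional criterion settles on theta = 0.90 and gives a strictly higher conditional likelihood. The conditional criterion spends none of the model's capacity on the density of x.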
At this point, the reader is encouraged to read the Appendix for an
example of conditional Bayesian inference
($p(\mathbf{y} \mid \mathbf{x})^c$) and how it differs from conditioned
joint Bayesian inference ($p(\mathbf{y} \mid \mathbf{x})^j$). From this
example we note that (regardless of the
degree of sophistication of the inference) direct conditional density
estimation is different and superior to conditioned joint density
estimation. Since full Bayesian integration is computationally too
intensive in many applications, the $ML^c$ and $MAP^c$ cases derived
above will be emphasized. In the following, we shall specifically attend
to the conditional maximum likelihood case (which can be extended to
$MAP^c$) and see how General Bound Maximization (GBM) techniques can be
applied to it. The GBM framework is a set of operations and approaches
that can be used to optimize a wide variety of functions. Subsequently,
the framework is applied to the $ML^c$ and $MAP^c$ expressions advocated
above to find their maxima. The result of this derivation is the
Conditional Expectation Maximization (CEM) algorithm, which will be the
workhorse learning system we will be using for the ARL training data.