There are many issues that need to be addressed before implementing a machine learning system. A large number of approaches exist, and a reasonable one must be selected according to factors such as the type of problem at hand and the practical complexity constraints of its implementation. Moreover, some approaches are provably inferior to, or mere subsets of, more general frameworks. Many pitfalls have been avoided, for example, when reliable statistical approaches were used instead of ad hoc methods. A full outline of the issues in machine learning is, of course, beyond the scope of this document. However, we shall carefully discuss one critical decision: the use of model-based versus model-free learning, which in statistical terms corresponds to the estimation of joint densities versus conditional densities. The list below outlines this and other important machine learning concepts that will be referred to in this document.
Before using learning techniques, it is critical to identify the exact problem to be solved: are we trying to build a description of some phenomenon, or are we observing a phenomenon in order to learn how to perform a particular task? In probabilistic terms, are we trying to estimate a joint density over all variables, or is a conditional density ultimately desired? The former is called a generative model (`model-based') since it can simulate the whole phenomenon, while the latter is a discriminative model (or `model-free') since it models the phenomenon only enough to accomplish a particular task. For discriminative models, we refer to a system with the `task' of computing an output $y$ given an input $x$. Generative models do not explicitly partition output and input but simply describe $(x, y)$ as a joint phenomenon.
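As a small illustrative sketch (in Python with scikit-learn; the synthetic data and the specific model choices are ours, purely for illustration), a Gaussian naive Bayes classifier fits the joint density $p(x, y) = p(y)\,p(x \mid y)$ and could in principle simulate the phenomenon, whereas logistic regression fits only the conditional $p(y \mid x)$ needed for the classification task:

\begin{verbatim}
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian classes in two dimensions (illustrative only).
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Generative: models p(y) p(x|y), i.e. the joint phenomenon.
generative = GaussianNB().fit(X, y)
# Discriminative: models p(y|x) directly, just enough for the task.
discriminative = LogisticRegression().fit(X, y)

x_new = np.array([[0.5, -0.2]])
print(generative.predict_proba(x_new))      # p(y|x) obtained via Bayes' rule
print(discriminative.predict_proba(x_new))  # p(y|x) modeled directly
\end{verbatim}

Both models ultimately yield $p(y \mid x)$, but only the generative one also describes the inputs themselves and could, for instance, be used to sample new data.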
Learning requires data, and the learning system is typically fed variables that have been pre-selected and pre-processed. Feature selection (and re-parameterization) is often critical to ensure that the machine learning algorithm does not waste resources, get distracted by spurious features, or tackle needlessly high dimensionalities. However, it is advantageous if a learning system does not depend excessively on the feature parameterization.
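A hedged sketch of this step (the synthetic data, the univariate F-test, and the value of $k$ below are merely illustrative choices) discards features that carry no signal before any learning takes place:

\begin{verbatim}
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n = 200
informative = rng.normal(size=(n, 2))  # two features carry the signal
spurious = rng.normal(size=(n, 8))     # eight features are pure noise
X = np.hstack([informative, spurious])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Keep the k highest-scoring features under a univariate F-test, sparing
# the downstream learner from spurious features and high dimensionality.
selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.get_support())  # boolean mask of the retained columns
\end{verbatim}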
It is always necessary to identify the type of model that will be used to learn from data. Many statistical and deterministic approaches exist, ranging from Decision Trees and Multivariate Densities to Hidden Markov Models, Mixture Models, and Neural Networks. Ideally, a general estimation framework (such as that provided by statistical techniques) is desired which can cope with such a variety of models.
Certain parameters or attributes of the model will be predetermined by the user, who can select them before training on data. Often, these include complexity, constraints, topology, and so on. It is desirable that these decisions not be too critical: a learning system should solve its task despite having been given an incorrect structure or an inappropriate complexity.
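One common safeguard, sketched below under illustrative assumptions (a scikit-learn Gaussian mixture and a small candidate range of complexities), is to compare user-chosen complexities by held-out likelihood rather than committing blindly to a single guess:

\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# One-dimensional data drawn from two well-separated clusters.
X = np.vstack([rng.normal(-3.0, 1.0, (150, 1)),
               rng.normal(+3.0, 1.0, (150, 1))])

# Compare candidate structural choices by cross-validated log-likelihood.
for k in (1, 2, 3, 4):
    gmm = GaussianMixture(n_components=k, random_state=0)
    score = cross_val_score(gmm, X).mean()  # held-out average log-likelihood
    print(k, round(score, 3))
\end{verbatim}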
The remaining model parameters, which the user does not specify, can then be estimated from observed data. Typically, the criteria for estimating these parameters include (in increasing degree of sophistication) Least-Squares Error, Maximum Likelihood, Maximum A Posteriori, Bayesian Integration, and variants thereof. More sophisticated inference techniques will often converge to more global and more generalizable learning solutions.
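For concreteness, with a data set $\mathcal{D} = \{x_1, \ldots, x_N\}$ and model parameters $\Theta$, the latter three criteria can be summarized in textbook form (the notation here is generic rather than tied to any particular model):
\[
\hat{\Theta}_{ML} = \arg\max_{\Theta} \, p(\mathcal{D} \mid \Theta), \qquad
\hat{\Theta}_{MAP} = \arg\max_{\Theta} \, p(\mathcal{D} \mid \Theta) \, p(\Theta), \qquad
p(x \mid \mathcal{D}) = \int p(x \mid \Theta) \, p(\Theta \mid \mathcal{D}) \, d\Theta ,
\]
where Maximum Likelihood fits the data alone, Maximum A Posteriori additionally weighs a prior $p(\Theta)$, and Bayesian Integration avoids committing to a single parameter estimate altogether by averaging over the posterior.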
Typically, a variety of algorithms can be called upon to perform the desired inference. These include searching, sampling, integration, optimization, Expectation-Maximization (EM), gradient ascent, and approximation methods. Often, these algorithms and their effectiveness will greatly influence the performance of the resulting learning system.
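As a bare-bones example of one such algorithm, the following sketch runs EM on a one-dimensional mixture of two Gaussians (the initialization and the fixed iteration count are illustrative; a practical implementation would add convergence checks and numerical safeguards):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

# Initial guesses for the mixing weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)  # should roughly recover (0.5, 0.5), (-2, 2), (1, 1)
\end{verbatim}

Each iteration provably does not decrease the data likelihood, which is part of what makes EM attractive among these optimization algorithms.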