Pr[h(x) = c(x)] ≥ 1 − ε
for a future observation x. Furthermore, the learning agent wants to succeed at this goal with probability at least 1 − δ over the choice of the x_i's. In other words, with high probability we want to output a hypothesis that's approximately correct (hence "Probably Approximately Correct"). The basic sample-complexity theorem says that
m = O( 1⁄ε · log |C|⁄δ )
samples drawn independently from D are sufficient. Any hypothesis h ∈ C we can find that agrees with all of these m samples (i.e., such that h(x_i) = c(x_i) for all i) will satisfy Pr[h(x) = c(x)] ≥ 1 − ε with probability at least 1 − δ over the choice of the m samples.
Proof: (by contrapositive) Let h ∈ C be any "bad" hypothesis: that is, one such that Pr[h(x) = c(x)] < 1 − ε. If we independently pick m points from the sample distribution D, then h will be correct on all of these points with probability at most (1 − ε)^m. So the probability that there exists a bad hypothesis in C that nevertheless agrees with all our sample data is at most |C| (1 − ε)^m (the total number of hypotheses, good or bad, times the maximum probability of each bad hypothesis agreeing with the sample data).
Note that the above situation is a bad event (we picked a bad hypothesis because it happened to agree with all m samples). We want to limit the chance of this happening to at most δ. So we want |C| (1 − ε)^m ≤ δ, which holds whenever
m ≥ 1⁄ε · ln( |C|⁄δ ),
matching the O( 1⁄ε · log |C|⁄δ ) bound stated above.
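To spell out the algebra behind that last step, here is the derivation written out in LaTeX (a sketch; the only fact used beyond rearranging is that ln(1/(1 − ε)) ≥ ε for 0 < ε < 1):

    % Solving |C| (1 - eps)^m <= delta for m.
    \begin{align*}
      |C|\,(1-\varepsilon)^m \le \delta
        &\iff (1-\varepsilon)^m \le \frac{\delta}{|C|} \\
        &\iff m \,\ln\frac{1}{1-\varepsilon} \ge \ln\frac{|C|}{\delta} \\
        &\iff m \ge \frac{\ln(|C|/\delta)}{\ln\bigl(1/(1-\varepsilon)\bigr)} .
    \end{align*}
    % Since ln(1/(1-eps)) >= eps, it is enough to take m >= (1/eps) ln(|C|/delta),
    % which is exactly the O( (1/eps) log(|C|/delta) ) bound above.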
Note that the true concept c is in C, so a hypothesis consistent with all the samples is guaranteed to exist. The learning agent can therefore use the following algorithm to learn the concept: draw m samples from D, then output any hypothesis h ∈ C that agrees with c on all of them.
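As a concrete illustration, here is a minimal Python sketch of that algorithm for a toy finite concept class (threshold functions on {0, ..., N−1}); the class, the distribution, and names like pac_learn_consistent are illustrative choices, not anything prescribed by the theorem:

    import math
    import random

    # Toy finite concept class: threshold functions on {0, ..., N-1}.
    # The concept with threshold t maps x to 1 iff x >= t, so |C| = N + 1.
    N = 100
    CONCEPTS = [lambda x, t=t: int(x >= t) for t in range(N + 1)]

    def sample_bound(eps, delta, num_concepts):
        # m >= (1/eps) * ln(|C|/delta) samples suffice, per the bound above.
        return math.ceil(math.log(num_concepts / delta) / eps)

    def pac_learn_consistent(true_concept, eps, delta, draw_x):
        # Draw m samples from D (via draw_x) and return any consistent hypothesis.
        m = sample_bound(eps, delta, len(CONCEPTS))
        samples = [(x, true_concept(x)) for x in (draw_x() for _ in range(m))]
        for h in CONCEPTS:
            if all(h(x) == y for x, y in samples):
                return h  # any consistent hypothesis works, by the theorem
        return None  # unreachable when the true concept really is in CONCEPTS

    # Example: D uniform over {0, ..., N-1}, true concept = threshold at 37.
    true_c = CONCEPTS[37]
    h = pac_learn_consistent(true_c, eps=0.05, delta=0.01,
                             draw_x=lambda: random.randrange(N))
    print(sum(h(x) != true_c(x) for x in range(N)) / N)  # true error under uniform D

With ε = 0.05 and δ = 0.01 this draws a couple hundred samples, and with probability at least 0.99 the learned threshold has error below 0.05 under the uniform distribution.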
We need to capture the true complexity of a concept class (not just the number of elements in it). The key idea that fully characterizes the complexity of a concept class is called VC-dimension (after two of its inventors, Vapnik and Chervonenkis). We say the set of points x_1, ..., x_m is shattered by a concept class C if for all 2^m possible settings of c(x_1), ..., c(x_m) to 0 or 1 (reject or accept), there is some concept c ∈ C that agrees with those values. Then the VC-dimension of C, denoted Dim(C), is the size of the largest set of points shattered by C. (If we can find arbitrarily large finite-size sets of points that can be shattered, then Dim(C) = ∞).
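Since shattering is a finite, checkable condition for a finite class and a finite point set, here is a small brute-force Python sketch in the same spirit (the interval class and the function names are just illustrative):

    from itertools import product

    def labelings(concepts, points):
        # All distinct label vectors (c(x_1), ..., c(x_m)) realized by the class.
        return {tuple(c(x) for x in points) for c in concepts}

    def is_shattered(concepts, points):
        # points are shattered iff all 2^m labelings are realized by some concept.
        return labelings(concepts, points) == set(product((0, 1), repeat=len(points)))

    # Toy class: intervals [a, b] on {0, ..., 9}; c(x) = 1 iff a <= x <= b.
    intervals = [lambda x, a=a, b=b: int(a <= x <= b)
                 for a in range(10) for b in range(a, 10)]

    print(is_shattered(intervals, [3, 6]))     # True: any two points can be shattered
    print(is_shattered(intervals, [2, 5, 8]))  # False: the labeling (1, 0, 1) is impossible
    # Hence Dim(intervals) = 2 for this class.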
One can show that a concept class C is PAC-learnable if and only if its VC-dimension is finite, and that
m = O( Dim(C)⁄ε · log 1⁄εδ )
samples suffice. The learning algorithm can again simply output any hypothesis h ∈ C that is consistent with the m observations.
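Plugging numbers into that bound is straightforward; here is a one-function sketch (the constant factor 4 is an arbitrary placeholder, since the theorem only pins m down up to O(·)):

    import math

    def vc_sample_bound(vc_dim, eps, delta, const=4):
        # m = O( Dim(C)/eps * log(1/(eps*delta)) ); const is an illustrative placeholder.
        return math.ceil(const * vc_dim / eps * math.log(1.0 / (eps * delta)))

    # e.g., the interval class above has VC-dimension 2:
    print(vc_sample_bound(vc_dim=2, eps=0.05, delta=0.01))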
If the learner's goal is to output a hypothesis in some fixed format (such as a DNF expression), it is possible to show that in some cases finding a hypothesis that is consistent with the observations is NP-complete. But if the learner's output hypothesis can be in any format (i.e., the learner may output any polynomial-time algorithm that predicts the data), then we don't know whether finding such a hypothesis is NP-complete.