The medical diagnosis domain provides another example of learning from data whose attributes have associated costs. In this domain, certain tests that may assist in diagnosis carry a higher monetary cost than others or cause more patient discomfort. For example, a CAT scan is more expensive to perform than a blood pressure test, and an enema is more uncomfortable than having one's temperature taken.
Rule-sets for the classification of network intrusions are currently learned using RIPPER, a rule-learning algorithm developed by William Cohen of AT&T Laboratories. RIPPER incorporates a number of modifications to IREP, C4.5, and C4.5rules that have proven to yield faster training times and lower error rates than these predecessors, but it does not provide the capability of associating a cost metric with a data-set's attributes [Cohen 95]. This work extends RIPPER to support such cost metrics, enabling RIPPER to learn cost-sensitive hypotheses.
The information gain function can be modified in a number of ways to weight its value by the cost of the attribute being tested. Two such modifications are presented by Tom Mitchell in his Machine Learning textbook [Mitchell 97].
The first of these methods is that used by Tan and Schlimmer in a robot perception domain, where the robot must learn to classify different objects based on different sensor inputs. The cost metric here is the delay in receiving input from a particular sensor. They modified the information gain function to return the square of the standard information gain divided by the cost of the attribute: Gain^2(S, A) / Cost(A).
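As a concrete sketch, the Tan and Schlimmer criterion can be computed directly from the standard information gain. The toy dataset, attribute indices, and cost value below are illustrative assumptions, not values from the experiment.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Standard information gain from splitting on attribute index attr."""
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

def tan_schlimmer_gain(rows, labels, attr, cost):
    """Tan and Schlimmer's cost-weighted criterion: Gain^2 / Cost(A)."""
    g = info_gain(rows, labels, attr)
    return g * g / cost

# Toy data: attribute 0 separates the classes perfectly, attribute 1 not at all.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
# info_gain(rows, labels, 0) is 1.0; with cost 2.0 the weighted score halves to 0.5.
```

Squaring the gain before dividing lets a strongly informative attribute overcome a moderate cost, while a weakly informative one is penalized sharply.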
The second method presented by Mitchell is that used by Nunez for learning medical diagnosis rules. His modification to the information gain function is as follows: (2^Gain - 1) / (Cost(A) + 1)^w. Here, a cost-sensitivity constant w is used, which must lie on the interval [0, 1]. This constant defines the weight that cost should carry in the computation of information gain.
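A minimal sketch of the Nunez criterion follows; the particular gain and cost values used in the usage comment are made up for illustration.

```python
def nunez_score(gain, cost, w):
    """Nunez's cost-sensitive criterion: (2^Gain - 1) / (Cost(A) + 1)^w.

    w = 0 ignores cost entirely (the score is a monotone function of
    gain alone); w = 1 penalizes expensive attributes most heavily.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("cost-sensitivity constant w must lie in [0, 1]")
    return (2.0 ** gain - 1.0) / (cost + 1.0) ** w

# Illustrative values: an attribute with gain 1.0 and cost 9.0 scores
# 1.0 when w = 0 but only 0.1 when w = 1.
```

Note that because (Cost(A) + 1)^0 = 1, setting w = 0 recovers a pure gain-based ranking, which is what makes w a convenient dial for the experiment below.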
The second method was chosen for this experiment because its cost-sensitivity constant allows different levels of cost-sensitivity to be specified, producing rule-sets customized to how much computational cost matters in the network environment for which the rule-set is being produced.
The data used in this experiment constituted a small portion of the entire data-set, totaling 1.25 MB of connection records of 100 bytes each.
Classification was into one of nine different classes. Eight of these correspond to different types of known intrusions, and one corresponds to normal network activity.
The data contained 18 different attributes corresponding to different features of a network connection record. The attributes constructed by the Link Analysis and Sequence Analysis methods fall into three categories, each with a different cost of computation.
Cost Sensitivity | Train Error Rate (%) | Test Error Rate (%) | Total Cost
0.0              | 1.20                 | 5.82                | 43
0.1              | 1.20                 | 5.82                | 43
0.2              | 1.19                 | 5.82                | 42
0.3              | 1.21                 | 5.82                | 40
0.4              | 1.21                 | 5.82                | 40
0.5              | 1.21                 | 5.82                | 40
0.6              | 1.21                 | 5.84                | 38
0.7              | 1.24                 | 5.84                | 38
0.8              | 1.24                 | 5.84                | 38
0.9              | 1.25                 | 5.84                | 33
1.0              | 1.25                 | 5.84                | 33
It is interesting to note that the error rates over both the training and test data change very little as the cost of each attribute is weighted more heavily in the information gain function. The total cost of execution, however, decreases more than linearly as the cost-sensitivity level rises. This behavior appears to result from some classification rules being left out of the rule-set and from a decrease in the number of attributes tested by each rule.
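One plausible reading of the Total Cost column, offered here as an assumption since the tally is not defined above, is the summed cost of every attribute test across all rules in the rule-set. The attribute names and cost values below are hypothetical placeholders.

```python
# Hypothetical attribute costs; the real experiment assigns costs by the
# three computation-cost categories of the connection-record features.
ATTRIBUTE_COST = {"duration": 1, "num_failed_logins": 5, "count_same_host": 10}

def rule_set_cost(rules):
    """Sum the cost of every attribute test across all rules.

    Each rule is modeled as the list of attribute names it tests, so
    dropping a rule, or an attribute within a rule, lowers the total.
    """
    return sum(ATTRIBUTE_COST[attr] for rule in rules for attr in rule)

# Two illustrative rules: the first tests two attributes, the second one.
rules = [["duration", "count_same_host"], ["num_failed_logins"]]
# rule_set_cost(rules) -> 1 + 10 + 5 = 16
```

Under this accounting, both effects observed above (fewer rules and fewer attributes per rule) reduce the total directly, which is consistent with the super-linear drop in the table.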
First, there is a wealth of possibilities for different information gain functions that weight cost to different degrees. These functions should also be tested in multiple domains, over varied sets of data.
Currently, the cost metrics assigned to each attribute do not accurately reflect the actual cost of computation in a real-time intrusion detection system. We are in the process of implementing a real-time system using Network Flight Recorder [NFR 97], a powerful packet-sniffing engine that can be programmed with scripts in N-code (an interpreted traffic-analysis language) to compute our connection attributes [Lee2 99]. As this work progresses, a more thorough cost analysis of attribute computation can be performed, and the actual performance gains offered by training RIPPER with cost-sensitivity can be measured.
[Cohen 96] Cohen, William W.: Learning Trees and Rules with Set-valued Features. From Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996.
[Lee 99] Lee, Wenke; Stolfo, Salvatore J.; Mok, Kui W.: A Data Mining Framework for Building Intrusion Detection Models. To appear in the Proc. 1999 IEEE Symposium on Security and Privacy, 1999.
[Lee2 99] Lee, Wenke; Park, Christopher T.; Stolfo, Salvatore J.: Automated Intrusion Detection Using NFR: Methods and Experiences. From USENIX Workshop on Intrusion Detection and Network Monitoring (ID '99) Proceedings, 1999.
[Mitchell 97] Mitchell, Tom: Machine Learning. McGraw-Hill, 1997.
[NFR 97] Network Flight Recorder. http://www.nfr.net, 1997.
[Stolfo 99] Stolfo, Salvatore J.; Fan, Wei; Lee, Wenke; Prodromidis,
Andreas; Chan, Philip K.: Cost-based Modeling and Evaluation for Data
Mining With Application to Fraud and Intrusion Detection: Results from
the JAM Project. Yet to be published, April 21, 1999.