This idea is realized in the training procedure we call the "Learn-Forget Algorithm." The data D[i] are sequential, and each item is assumed to be more closely related to the items near it in the sequence. The neural network for the X-th data item D[X] is derived simply by Algorithm 1 (Alg. 1).
In Alg. 1, procedure 1 builds a general neural network; this serves as the base model, which is then modified by the "Learn-Forget" part of the algorithm. Procedure 2.1 learns the newer example, and procedure 2.2 "forgets" an older example by making the error on that example larger, so that the influence of older data is assumed to be smaller than that of newer data. The parameter d is the number of data items kept by the network: at step i, the (i-d)-th item is discarded. The parameters f and E are as defined in the explanation of the algorithm. With this training procedure, the neural network adapts to closer data (i.e., newer data when the sequence is temporal), so the model is expected to fit the current situation.
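As a concrete illustration, the sketch below gives one possible reading of the Learn-Forget step for a one-hidden-layer network with sigmoid units and squared error. The function names, the backpropagation routine, and the interpretation of "forgetting" as a gradient step that increases the error on the (i-d)-th item, scaled by the forgetting rate f, are assumptions made for illustration; this is not the literal statement of Alg. 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 @ x)                      # hidden activations
    return h, sigmoid(w2 @ h)                # hidden and output activations

def grads(w1, w2, x, t):
    """Backpropagation gradients of the squared error 0.5 * ||y - t||^2."""
    h, y = forward(w1, w2, x)
    dy = (y - t) * y * (1 - y)               # output-layer delta
    dh = (w2.T @ dy) * h * (1 - h)           # hidden-layer delta
    return np.outer(dh, x), np.outer(dy, h)

def learn_forget_step(w1, w2, data, i, r=5.0, f=0.5, d=30, E=1):
    """One Learn-Forget step for data item i.
    r: learning rate, f: forgetting rate, d: window size, E: passes per item."""
    x, t = data[i]
    for _ in range(E):                       # procedure 2.1: learn the new item
        g1, g2 = grads(w1, w2, x, t)
        w1 -= r * g1
        w2 -= r * g2
    if i >= d:                               # procedure 2.2: "forget" item i - d
        x_old, t_old = data[i - d]           #   by increasing its error
        g1, g2 = grads(w1, w2, x_old, t_old)
        w1 += r * f * g1
        w2 += r * f * g2
    return w1, w2
```

Looping learn_forget_step over i then yields the sequence of networks, one per data item, used in the experiments below.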
N = average number of names correctly recognized (MIN = 0, MAX = 14; higher is better).
A = average error in one output (MIN = 0, MAX = 1; lower is better).
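For concreteness, the two measures could be computed as in the following sketch, assuming the network produces 14 outputs in [0, 1] and the targets are 0/1 vectors; the 0.5 decision threshold and the use of absolute error for A are assumptions.

```python
import numpy as np

def evaluate(outputs, targets):
    """outputs, targets: arrays of shape (num_examples, 14)."""
    correct = (outputs >= 0.5) == (targets >= 0.5)
    N = correct.sum(axis=1).mean()           # names recognized per example (0..14)
    A = np.abs(outputs - targets).mean()     # error per single output (0..1)
    return N, A
```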
Table 1 shows the result for the original neural network. The network uses the first 200 of all data items as the training set and the remaining 1118 items as the test set. The learning rate r is set to 5 because, in a separate experiment with r < 1, the convergence of the network was very slow. The result shows that the network fits the training set but does not produce an effective result on the test set.
N for training set | A for training set | N for test set | A for test set |
---|---|---|---|
13.455 | 0.105 | 7.271 | 0.441 |
Table 2 shows the results of Algorithm 1 with several parameter settings for recognition on the test set. Note that, as described in the approach section and in the algorithm, a different neural network is prepared for each data item, and these networks are generated sequentially (see the sketch after Table 2). In this case the forgetting rate f is fixed to 0.5. From these results, the improvement of the Learn-Forget algorithm is less than 3.5%: an improvement is observed, but it is very small. These experiments indicate that E = 1 is best. In fact, in a further experiment with E = 100, the network overfit to a single data item, most of its outputs became almost exactly 0 or 1, and the result became worse.
d | N (E=1) | A (E=1) | N (E=2) | A (E=2) | N (E=3) | A (E=3) | N (E=4) | A (E=4) | N (E=5) | A (E=5) |
---|---|---|---|---|---|---|---|---|---|---|
5 | 7.495 | 0.437 | 7.379 | 0.436 | 7.350 | 0.441 | 7.297 | 0.443 | 7.308 | 0.444 |
30 | 7.525 | 0.436 | 7.446 | 0.442 | 7.423 | 0.446 | 7.446 | 0.442 | 7.308 | 0.444 |
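The sequential evaluation mentioned above could look like the following sketch: each item is classified by the network adapted on the items before it, and the network is then updated with that item. The hypothetical helpers forward, learn_forget_step, and evaluate are those of the earlier illustrative sketches.

```python
import numpy as np

def sequential_evaluate(w1, w2, data, **params):
    """Classify each item with the current network, then adapt the network to it."""
    outputs, targets = [], []
    for i, (x, t) in enumerate(data):
        outputs.append(forward(w1, w2, x)[1])              # predict before adapting
        targets.append(t)
        w1, w2 = learn_forget_step(w1, w2, data, i, **params)
    return evaluate(np.array(outputs), np.array(targets))
```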
Table 3 shows the results for d = 5, 10, 30, 50, and 100. In this case, E = 1 and f = 0.5 are fixed. From these experiments, d around 30 is best, although the differences between these settings are small.
d | 5 | 10 | 30 | 50 | 100 |
---|---|---|---|---|---|
N | 7.495 | 7.417 | 7.525 | 7.505 | 7.478 |
A | 0.436 | 0.438 | 0.437 | 0.437 | 0.441 |
Table 4 shows the results for f = 0, 0.5, 1, and 2. In this case, E = 1 and d = 30 are fixed. f = 0 means there is no forgetting in the algorithm, only additional backpropagation on the new data. These results show that even without forgetting previous data a small improvement is observed. With the forgetting step, the number of correct outputs N improves further, although the error A becomes larger; the forgetting step is therefore effective with respect to N. From this experiment, f around 1 is best. Moreover, f > 1, which amounts to "forgetting rather than learning," makes the network ineffective.
f | 0 | 0.5 | 1 | 2 |
---|---|---|---|---|
N | 7.353 | 7.525 | 7.528 | 7.262 |
A | 0.431 | 0.436 | 0.448 | 0.477 |
Overall, these experiments do not show a large improvement over a general neural network on the classification problem in this domain. On the other hand, they do show that the algorithm remains somewhat more effective than a general neural network on these measures. Even with a wide variety of parameter settings, no great improvement is achieved. From the experiments, the optimized parameter setting on measure N is E = 1, d = 30, and f = 1. (In practice, measure N is the more important one.)
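In terms of the illustrative sketches above (whose names are hypothetical, not the paper's code), the reported best setting corresponds to a call such as:

```python
# Best setting found on measure N: E=1, d=30, f=1, with learning rate r=5 as in Table 1.
# w1, w2 are the weights of the base network and data the list of (input, target)
# pairs; the base training on the first 200 items is omitted here.
N, A = sequential_evaluate(w1, w2, data, r=5.0, f=1.0, d=30, E=1)
```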
Appendix A: The names of the stocks used in this experiment
Hitachi, Toshiba, Mitsubishi Electric, Fuji Electric, Meidensha, NEC, Fujitsu, Oki Electric, Matsushita Electric, Sharp, Sony, Sanyo, Pioneer, Clarion
(Stock code:6501,6502,6503,6504,6508,6701,6702,6703,6752,6753,6758,6764,6773,6796)