SPIE Discovery Challenge

Problem:
The SPIE Discovery Challenge was posted on the web at: http://www.cs.uncc.edu/~zytkow/spie_challenge/.  A description of the data can be found at: http://www.cs.uncc.edu/~zytkow/spie_challenge/data_described-5.htm.

Method:
A software package named Cluzzifier was used to carry out the following steps:
  1. Data pre-processing and coding;
  2. Training of a three-layer (input, hidden and output) feed-forward neural network using back-propagation;
  3. Weight interpretation;
  4. Node pruning and network restructuring;
  5. Iteration of steps 2, 3 and 4;
  6. Result assessment; and
  7. Data post-processing and report generation.
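
As a rough illustration of steps 2 to 4, the sketch below trains a three-layer feed-forward network with back-propagation and then flags inputs whose weights stay small as pruning candidates.  The Cluzzifier package itself is not available, so the hidden-layer size, sigmoid activations, learning rate and the simple weight-magnitude pruning heuristic are assumptions for illustration, not the settings actually used.

    # Minimal back-propagation sketch for a three-layer feed-forward network.
    # All hyper-parameters below are assumptions for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, Y, n_hidden=10, lr=0.1, epochs=500):
        """X: (n_samples, n_inputs) coded attributes;
        Y: (n_samples, 5) one-hot unbalance classes (b, s, q, m, d)."""
        n_in, n_out = X.shape[1], Y.shape[1]
        W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))   # input  -> hidden weights
        W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))  # hidden -> output weights
        for _ in range(epochs):
            H = sigmoid(X @ W1)                 # hidden activations
            O = sigmoid(H @ W2)                 # output activations
            dO = (O - Y) * O * (1.0 - O)        # output delta (squared error)
            dH = (dO @ W2.T) * H * (1.0 - H)    # back-propagated hidden delta
            W2 -= lr * H.T @ dO / len(X)        # gradient-descent updates
            W1 -= lr * X.T @ dH / len(X)
        return W1, W2

    def weak_inputs(W1, threshold=0.05):
        """Stand-in for steps 3 and 4: inputs whose outgoing weights are all
        small are candidates for pruning before retraining (step 5)."""
        return np.where(np.abs(W1).max(axis=1) < threshold)[0]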

Results:
Out of the original 23 attributes, 10 are useless, if not harmful, for correct classification of the unbalance type.  The highest classification accuracy was achieved when the following 13 inputs were used: 1) RtSp; 2) A04D1; 3) A04D2; 4) A18D1; 5) A18D2; 6) R02D1; 7) R16D1; 8) R16D2; 9) R16TZ (obtained as R02TZ - RdifTZ); 10) node1_#; 11) node2_#; 12) unbal1; and 13) unbal2.  The confusion matrices for four runs using only these 13 inputs are shown below.
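
A minimal sketch of the corresponding input coding for these runs is given here; the file name and CSV layout are assumptions, while the attribute names follow the challenge data description, and R16TZ is derived as R02TZ - RdifTZ as noted above.

    # Derive R16TZ and keep only the 13 selected inputs.
    # "spie_challenge.csv" is a hypothetical file name.
    import pandas as pd

    SELECTED = ["RtSp", "A04D1", "A04D2", "A18D1", "A18D2",
                "R02D1", "R16D1", "R16D2", "R16TZ",
                "node1_#", "node2_#", "unbal1", "unbal2"]

    def load_inputs(path="spie_challenge.csv"):
        df = pd.read_csv(path)
        df["R16TZ"] = df["R02TZ"] - df["RdifTZ"]  # derived attribute
        return df[SELECTED]                        # only the 13 inputs remain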

True        Classified As
class -------------------------
      b     s     q     m     d     Accu.(%)
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     968   4     0     0        99.59
q     0     72    900   0     0        92.59
m     0     0     0     972   0       100.00
d     0     0     4     0     1076     99.62
Overall accuracy: 98.42%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     950   22    0     0        97.74
q     0     36    936   0     0        96.30
m     0     0     0     972   0       100.00
d     0     0     0     16    1064     98.52
Overall accuracy: 98.54%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     958   14    0     0        98.56
q     0     76    896   0     0        92.18
m     0     0     0     972   0       100.00
d     0     0     0     14    1066     98.70
Overall accuracy: 97.95%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     956   16    0     0        98.35
q     0     54    918   0     0        94.44
m     0     0     0     972   0       100.00
d     0     0     0     6     1074     99.44
Overall accuracy: 98.50%
--------------------------------------------
If a slightly lower classification accuracy is acceptable, only 8 attributes are needed: 1) A04D2; 2) A18D1; 3) A18D2; 4) R02D1; 5) R16TZ (obtained as R02TZ - RdifTZ); 6) node1_#; 7) node2_#; and 8) unbal1.  The confusion matrices for four runs using only these 8 inputs are shown below.

True        Classified As
class -------------------------
      b     s     q     m     d     Accu.(%)
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     958   14    0     0        98.56
q     0     80    892   0     0        91.77
m     0     0     0     972   0       100.00
d     0     0     0     18    1062     98.33
Overall accuracy: 97.79%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     950   22    0     0        97.74
q     0     50    922   0     0        94.86
m     0     0     0     972   0       100.00
d     2     0     0     46    1032     95.56
Overall accuracy: 97.64%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     952   20    0     0        97.94
q     0     94    878   0     0        90.33
m     0     0     0     972   0       100.00
d     0     0     0     24    1056     97.78
Overall accuracy: 97.28%
--------------------------------------------
b     1080  0     0     0     0       100.00
s     0     914   58    0     0        94.03
q     0     72    900   0     0        92.59
m     0     0     0     972   0       100.00
d     0     0     0     62    1010     93.52
Overall accuracy: 96.06%
--------------------------------------------

For comparison, when all 23 attributes were used, the confusion matrices are:

True        Classified As
class -------------------------
      b     s     q     m     d     Accu.(%)
--------------------------------------------
b     1075  0     0     4     1        99.54
s     0     859   111   0     2        88.37
q     0     95    873   1     3        89.81
m     0     0     1     970   1        99.79
d     5     0     13    66    996      92.22
Overall accuracy: 94.03%
--------------------------------------------
b     1064  0     1     13    2        98.52
s     0     866   106   0     0        89.09
q     0     106   864   2     0        88.89
m     0     0     1     964   7        99.18
d     16    2     19    28    1015     93.98
Overall accuracy: 94.03%
--------------------------------------------
b     1073  0     1     0     6        99.35
s     6     844   122   0     0        86.83
q     0     108   841   22    1        86.52
m     1     0     10    960   1        98.77
d     7     2     42    45    984      91.11
Overall accuracy: 92.63%
--------------------------------------------
b     1050  0     3     27    0        97.22
s     0     870   102   0     0        89.51
q     0     130   839   3     0        86.32
m     7     0     3     959   3        98.66
d     26    4     65    41    944      87.41
Overall accuracy: 91.84%
--------------------------------------------
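
The per-class and overall accuracy figures quoted in these tables follow directly from the confusion-matrix counts; the short check below reproduces the overall accuracy of the first 13-input run.

    # Accuracy computation from a confusion matrix (rows: true class,
    # columns: assigned class); counts taken from the first 13-input run.
    import numpy as np

    classes = ["b", "s", "q", "m", "d"]
    cm = np.array([[1080,    0,    0,    0,     0],
                   [   0,  968,    4,    0,     0],
                   [   0,   72,  900,    0,     0],
                   [   0,    0,    0,  972,     0],
                   [   0,    0,    4,    0,  1076]])

    per_class = 100.0 * cm.diagonal() / cm.sum(axis=1)   # row-wise accuracy
    overall = 100.0 * cm.diagonal().sum() / cm.sum()     # 4996 / 5076 = 98.42%

    for name, acc in zip(classes, per_class):
        print(f"{name}: {acc:6.2f}%")
    print(f"Overall accuracy: {overall:.2f}%")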

Knowledge discovered:

  1. Among the 23 attributes originally measured for the classification (or diagnosis) of unbalance types, some are redundant, if not conflicting.  The highest classification accuracy was achieved with a subset of only 13 attributes.  A subset of only 8 attributes (roughly one third of the original set) still achieves an overall classification accuracy of around 97%.
  2. From the confusion matrices, it can be seen that among the five classes (or unbalance types), Classes b and m are the easiest to identify.  Samples from Class d are sometimes misidentified as Class m and, in a few cases, as Class q.  Samples from Classes s and q are rarely assigned to any of the other three classes, but it is quite difficult to tell some samples of these two classes apart.  Perhaps, from the mechanics point of view, these two classes are similar to each other?
