Recognition of Genes in DNA Sequences

One more data set, donated by Noordewier et al. [1], was downloaded from: ftp://ftp.ncc.up.pt/pub/statlog/.  The authors of the original work used knowledge-based neural networks to model the existence of splice junctions within a DNA sequence based on 60 nucleotides, 30 on each side of a given point.  Each symbolic variable representing one of the four nucleotides (A, C, G and T) was coded as a set of three binary variables (100, 010, 001 and 000, respectively).  As a result, there were 180 input variables (coded #1, #2, ... , #180) for each sample, and each sample represented one of three output classes: an EI site, an IE site or neither.

The data set came with separate training and validation sample files, and the training file contained a different number of samples for each output class.  To make the results as comparable as possible (the number of samples in [1] is slightly different from that of the downloaded data), the sample sets were kept unchanged for this experiment: 2000 training and 1186 validation samples.

To cut computation time, a 180 x 18 x 3 network was trained five times as a preliminary process.  The ten most important inputs were selected based on the interpretation of the trained network, and thus the automatic feature selection process started with a network structure having ten input and three output nodes.  With the numbering of the original inputs, the ten most important ones were: a: #82; b: #84; c: #85; d: #90; e: #93; f: #94; g: #95; h: #96; i: #97; and j: #105.  Each training session continued for 300 cycles, and the process was repeated to get a total of six sets of results.  The order in which the inputs were deleted, the average MCSR* and the average CASR** at each iteration during the six processes are shown in the table below.  For the last iteration, the average CASR was not calculated where the MCSR was zero.
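The three-bit coding described above can be sketched as follows.  This is only an illustration: the function name encode_window is hypothetical, the example window is made up rather than taken from the data set, and the ordering of the bits (nucleotide 1 occupying inputs #1 to #3, nucleotide 2 occupying #4 to #6, and so on) is an assumption that matches the input numbering used in the text.

```python
# Three-bit coding from the text: A=100, C=010, G=001, T=000.
CODES = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (0, 0, 0)}


def encode_window(window):
    """Flatten a 60-nucleotide window into 180 binary inputs (#1..#180).

    Assumes position-major ordering: nucleotide k fills inputs 3k-2 .. 3k.
    """
    if len(window) != 60:
        raise ValueError("expected 60 nucleotides, 30 on each side of the point")
    bits = []
    for nucleotide in window.upper():
        bits.extend(CODES[nucleotide])  # KeyError for anything but A, C, G, T
    return bits


# Made-up window for illustration only (not a sample from the data set).
example = "ACGT" * 15
inputs = encode_window(example)
print(len(inputs))   # 180
print(inputs[:12])   # [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

Because T is coded as 000, it has no input of its own; a nucleotide is recognized as T only when all three of its bits are zero.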


Inputs Used

           Process
Iteration  Prel.       1           2           3           4           5           6
1          All 180     abcdefghij  abcdefghij  abcdefghij  abcdefghij  abcdefghij  abcdefghij
2                      abcdefgh j  abcdefgh j  abcdefgh j  abcdefgh j  abcdefgh j  abcdefgh j
3                      abcdefgh    bcdefgh j   bcdefgh j   bcdefgh j   bcdefgh j   bcdefgh j
4                      bcdefgh     bcdefgh     cdefgh j    bcdefgh     bcdefgh     bcdefgh
5                      cdefgh      cdefgh      cdefgh      cdefgh      cdefgh      cdefgh
6                      cdef h      cdef h      cdef h      cdef h      cdef h      cdef h
7                      cdef        cdef        cdef        cde h       cdef        def h
8                      def         def         def         de h        c ef        def
9                      de          de          de          de          c e         de
10                     d           d           d           d           e           d

Average MCSR

           Process
Iteration  Prel.       1           2           3           4           5           6
1          92.071      93.714      93.740      93.273      93.643      93.714      93.571
2                      91.280      91.617      92.460      91.023      91.881      92.435
3                      82.571      87.789      87.723      88.647      88.317      88.911
4                      83.929      83.929      85.809      83.929      84.071      84.143
5                      85.000      85.000      85.000      85.000      85.000      85.000
6                      73.571      73.571      73.571      73.571      73.571      73.571
7                      62.500      62.500      62.500      62.500      62.500      74.286
8                      63.214      63.214      63.214      63.214      59.736      63.214
9                      52.143      52.143      52.143      52.143      51.429      52.143
10                     0.000       0.000       0.000       0.000       0.000       0.000

Average CASR

           Process
Iteration  Prel.       1           2           3           4           5           6
1          93.052      94.334      94.165      94.165      94.283      94.148      94.384
2                      93.153      93.288      93.440      93.255      93.390      93.491
3                      90.320      92.209      92.243      92.411      92.310      92.293
4                      90.304      90.304      91.315      90.472      90.067      90.405
5                      89.241      89.713      89.477      89.241      89.713      88.887
6                      84.435      83.895      84.435      83.963      83.659      84.132
7                      79.848      79.848      79.848      79.933      79.848      75.717
8                      73.187      73.187      73.187      73.187      68.718      73.187
9                      70.658      70.658      70.658      70.658      66.105      70.658
10                     NA          NA          NA          NA          NA          NA

As inputs were deleted, the MCSRs and CASRs decreased as an overall trend.  With ten inputs, all the rates were even higher than when all 180 inputs were used, both in this experiment and in the ones reported by Noordewier et al. [1].  Given the way the nucleotides were coded, the importance of inputs d and e (i.e., #90 and #93 in the sequence) indicates that, to determine whether a splice junction exists at a given point, and its type if it does, it is very important to know whether or not the neighboring nucleotides are of type G.  Further interpretation and verification of the results from the last two experiments will be left to specialists with the corresponding domain knowledge.
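The reading of inputs #90 and #93 as "is this nucleotide a G" follows from simple index arithmetic, sketched below.  The helper describe_input is a hypothetical name, and the sketch assumes that the 180 inputs are laid out three per nucleotide in sequence order, with the bits in the order given by the coding (A = 100, C = 010, G = 001).

```python
# Which nucleotide each of the three bits flags (T = 000 has no bit of its own).
BIT_MEANING = {1: "A", 2: "C", 3: "G"}


def describe_input(index):
    """Map a 1-based input number (#1..#180) to (nucleotide position, flagged type).

    Assumes three consecutive inputs per nucleotide, in sequence order.
    """
    if not 1 <= index <= 180:
        raise ValueError("input number must be between 1 and 180")
    position = (index - 1) // 3 + 1   # 1..60; positions 30 and 31 flank the point
    bit = (index - 1) % 3 + 1         # 1..3 within the nucleotide's triple
    return position, BIT_MEANING[bit]


# The ten inputs selected by the preliminary process, as listed in the text.
selected = {"a": 82, "b": 84, "c": 85, "d": 90, "e": 93,
            "f": 94, "g": 95, "h": 96, "i": 97, "j": 105}

for label, number in selected.items():
    position, flagged = describe_input(number)
    print(f"{label}: #{number} -> nucleotide {position}, 1 if it is {flagged}")
```

Under these assumptions, #90 is the G bit of nucleotide 30 and #93 is the G bit of nucleotide 31, i.e. the two nucleotides immediately on either side of the candidate point.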

Reference:
[1]  M. O. Noordewier, G. G. Towell and J. W. Shavlik, "Training knowledge-based neural networks to recognize genes in DNA sequences," Advances in Neural Information Processing Systems (R. P. Lippmann, J. E. Moody and D. S. Touretzky, Eds.), Morgan Kaufmann Publishers, San Mateo, CA, vol. 3, pp. 530-536, 1991.

__________

*MCSR: The Minimum Class Success Rate is the lowest success rate among all the target classes.  The average MCSR is the MCSR averaged over the five training sessions within each process.

**CASR: The Class Average Success Rate is the success rate averaged over all the target classes.  The average CASR is the CASR averaged over the five training sessions within each process.
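As a rough illustration of the two definitions above, the sketch below computes both rates for a single training session.  It assumes that the success rate of a class is the percentage of validation samples of that class that are classified correctly; the function names and the small label lists are made up for the example and are not taken from the experiment.

```python
def class_success_rates(true_labels, predicted_labels):
    """Per-class success rate in percent (assumed: correct / total per true class)."""
    rates = {}
    for cls in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == cls)
        correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                      if t == cls and p == cls)
        rates[cls] = 100.0 * correct / total
    return rates


def mcsr(rates):
    """Minimum Class Success Rate: the lowest rate among the target classes."""
    return min(rates.values())


def casr(rates):
    """Class Average Success Rate: the unweighted mean over the target classes."""
    return sum(rates.values()) / len(rates)


# Made-up labels for illustration only.
true_labels      = ["EI", "EI", "IE", "IE", "N", "N", "N", "N"]
predicted_labels = ["EI", "IE", "IE", "IE", "N", "N", "N", "EI"]

rates = class_success_rates(true_labels, predicted_labels)
print(rates)        # {'EI': 50.0, 'IE': 100.0, 'N': 75.0}
print(mcsr(rates))  # 50.0
print(casr(rates))  # 75.0
```

The averages reported in the table would then be the means of these per-session values over the five training sessions within each process.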
