Supplementary Materials

Supplementary Materials for
Sparks, M.E. and Brendel, V. (2008) `MetWAMer`: eukaryotic translation initiation site prediction

Table S1. Method performance on C. elegans TIS-containing data. 17,016 TIS-containing instances were used in three separate five-fold cross-validation experiments. Results are shown for a non-stratified parameter set (homogeneous); a priori-known, cluster-specific parameter sets for k=3 (cluster-specific); and group-specific parameter sets for a random three-way split of the data (random split). TP is the number of instances for which the method correctly identified a TIS; FP, the number for which a prediction was made but was incorrect; and FN, the number for which no prediction was made although one should have been. Sn = TP/(TP+FP+FN), and Sp = TP/(TP+FP).

Deployment	Method	TP	FP	FN	Sn	Sp

	1st-ATG	14,797	2,219	0	0.8696	0.8696

homogeneous	LLKR	6,852	9,185	979	0.4027	0.4273
	WLLKR	10,339	4,378	2,299	0.6076	0.7025
	MFCWLLKR	12,178	4,775	63	0.7157	0.7183
	PFCWLLKR	12,002	4,334	680	0.7053	0.7347
	BAYES	7,079	6,257	3,680	0.4160	0.5308

cluster-specific	LLKR	9,618	6,638	760	0.5652	0.5917
	WLLKR	12,535	3,002	1,479	0.7367	0.8068
	MFCWLLKR	13,883	3,069	64	0.8159	0.8190
	PFCWLLKR	13,417	2,769	830	0.7885	0.8289
	BAYES	9,661	4,788	2,567	0.5678	0.6686

random split	LLKR	6,857	9,092	1,067	0.4030	0.4299
	WLLKR	10,278	4,412	2,326	0.6040	0.6997
	MFCWLLKR	12,189	4,764	63	0.7163	0.7190
	PFCWLLKR	11,752	4,275	989	0.6906	0.7333
	BAYES	7,159	6,283	3,574	0.4207	0.5326
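
The Sn and Sp definitions in the Table S1 caption can be checked against any row of the table. The following is a minimal Python sketch, not part of the MetWAMer distribution; the function name `tis_metrics` is hypothetical, and the counts are copied from the homogeneous MFCWLLKR row.

```python
def tis_metrics(tp, fp, fn):
    """Return (Sn, Sp) using the definitions in the Table S1 caption."""
    sn = tp / (tp + fp + fn)  # Sn = TP/(TP+FP+FN)
    sp = tp / (tp + fp)       # Sp = TP/(TP+FP)
    return sn, sp


# Counts from the homogeneous MFCWLLKR row of Table S1.
tp, fp, fn = 12178, 4775, 63

# Every TIS-containing instance falls into exactly one of TP, FP or FN.
total = tp + fp + fn

sn, sp = tis_metrics(tp, fp, fn)
print(f"instances = {total}, Sn = {sn:.4f}, Sp = {sp:.4f}")
# prints "instances = 17016, Sn = 0.7157, Sp = 0.7183"
```

Because TP, FP and FN partition the 17,016 instances, the denominator of Sn is the same for every row, so Sn differences between methods reflect TP alone.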

Table S2. Method performance on C. elegans non-TIS-containing data. 16,617 non-TIS-containing instances were used in three separate five-fold cross-validation experiments. Results are shown for a non-stratified parameter set (homogeneous); a priori-known, cluster-specific parameter sets for k=3 (cluster-specific); and group-specific parameter sets for a random three-way split of the data (random split). TN is the number of instances for which the method correctly declined to predict a TIS, and FP the number for which some prediction was made (necessarily incorrect, since no TIS is present). Sn = TN/(TN+FP).

Deployment	Method	TN	FP	Sn

	1st-ATG	908	15,709	0.0546

homogeneous	LLKR	4,464	12,153	0.2686
	WLLKR	6,589	10,028	0.3965
	MFCWLLKR	2,034	14,583	0.1224
	PFCWLLKR	5,310	11,307	0.3196
	BAYES	7,249	9,368	0.4362

cluster-specific	LLKR	5,698	10,919	0.3429
	WLLKR	7,239	9,378	0.4356
	MFCWLLKR	2,381	14,236	0.1433
	PFCWLLKR	8,244	8,373	0.4961
	BAYES	8,177	8,440	0.4921

random split	LLKR	4,605	12,012	0.2771
	WLLKR	6,577	10,040	0.3958
	MFCWLLKR	2,040	14,577	0.1228
	PFCWLLKR	6,224	10,393	0.3746
	BAYES	7,237	9,380	0.4355
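
For the non-TIS-containing data, the single reported statistic is TN/(TN+FP), as defined in the Table S2 caption. A minimal Python sketch under the same caveats follows; the function name `true_negative_rate` is hypothetical and the counts are copied from the homogeneous BAYES row.

```python
def true_negative_rate(tn, fp):
    """Return TN/(TN+FP), the quantity labelled Sn in the Table S2 caption."""
    return tn / (tn + fp)


# Counts from the homogeneous BAYES row of Table S2.
tn, fp = 7249, 9368

# Every non-TIS-containing instance is either a TN or an FP.
n_instances = tn + fp

rate = true_negative_rate(tn, fp)
print(f"instances = {n_instances}, TN/(TN+FP) = {rate:.4f}")
# prints "instances = 16617, TN/(TN+FP) = 0.4362"
```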

Table S3. Effect of parameter set indexing strategy on PFCWLLKR performance using C. elegans TIS-containing data. 17,016 TIS-containing instances were used in five-fold cross-validation experiments. Parameter sets for evaluating each putative TIS were selected using Hamming distance relative to cached medoids (edit), position weight matrix scores (PWM), or weight array matrix scores (WAM); parameter indexing was tested under both modulating and static approaches. k=3 clusters were considered. TP is the number of instances for which the method correctly identified a TIS; FP, the number for which a prediction was made but was incorrect; and FN, the number for which no prediction was made although one should have been. Sn = TP/(TP+FP+FN), and Sp = TP/(TP+FP).

Approach	Indexing strategy	TP	FP	FN	Sn	Sp

modulating	edit	11,548	4,981	487	0.6787	0.6987
	PWM	11,505	4,991	520	0.6761	0.6974
	WAM	11,621	4,917	478	0.6829	0.7027

static	edit	12,573	3,580	863	0.7389	0.7784
	PWM	12,514	3,534	968	0.7354	0.7798
	WAM	12,640	3,529	847	0.7428	0.7817
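
The "edit" indexing strategy named in the Table S3 caption selects a parameter set by Hamming distance to cached cluster medoids. The sketch below illustrates only that selection step, in Python. It is not the MetWAMer implementation: the function names, the three medoid strings, the query window and the tie-breaking rule (lowest index wins) are all hypothetical, chosen for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_medoid(site, medoids):
    """Index of the medoid closest to `site`; ties go to the lowest index."""
    return min(range(len(medoids)), key=lambda i: hamming(site, medoids[i]))


# Hypothetical medoids for k=3 clusters (invented sequences, not real data).
medoids = ["AAAAATGGC", "CACCATGGC", "TTTAATGTC"]

# Hypothetical window around a candidate ATG.
site = "GACCATGGC"

distances = [hamming(site, m) for m in medoids]
cluster = nearest_medoid(site, medoids)
print(distances, cluster)  # prints "[3, 1, 5] 1"
```

The returned index would then pick which of the k cluster-specific parameter sets scores the candidate site; the PWM and WAM strategies would replace the distance with a model score.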

Table S4. Effect of parameter set indexing strategy on PFCWLLKR performance using C. elegans non-TIS-containing data. 16,617 non-TIS-containing instances were used in five-fold cross-validation experiments. Parameter sets for evaluating each putative TIS were selected using Hamming distance relative to cached medoids (edit), position weight matrix scores (PWM), or weight array matrix scores (WAM); parameter indexing was tested under both modulating and static approaches. k=3 clusters were considered. TN is the number of instances for which the method correctly declined to predict a TIS, and FP the number for which some prediction was made (necessarily incorrect, since no TIS is present). Sn = TN/(TN+FP).

Approach	Indexing strategy	TN	FP	Sn

modulating	edit	4,470	12,147	0.2690
	PWM	4,524	12,093	0.2723
	WAM	4,468	12,149	0.2689

static	edit	5,466	11,151	0.3289
	PWM	5,560	11,057	0.3346
	WAM	5,460	11,157	0.3286

Table S5. Method performance on H. sapiens TIS-containing data. 273 TIS-containing instances, obtained from (non-withdrawn) gene annotations in the 30 April 2008 CCDS chromosome 21 annotation available at NCBI, were used in one ten-fold cross-validation experiment. Results from applying a non-stratified parameter set (homogeneous) are shown. TP is the number of instances for which the method correctly identified a TIS; FP, the number for which a prediction was made but was incorrect; and FN, the number for which no prediction was made although one should have been. Sn = TP/(TP+FP+FN), and Sp = TP/(TP+FP).

Method	TP	FP	FN	Sn	Sp

1st-ATG	273	0	0	1.0000	1.0000

LLKR	152	94	27	0.5568	0.6179
WLLKR	186	47	40	0.6813	0.7983
MFCWLLKR	211	61	1	0.7729	0.7757
PFCWLLKR	199	46	28	0.7289	0.8122
BAYES	144	86	43	0.5275	0.6261
