SubCellProt: Predicting protein subcellular localization using machine learning approaches
High-throughput proteome sequencing projects continue to churn out enormous amounts of raw sequence data.
However, most of this raw sequence data is unannotated and hence of limited use. Among the various approaches to decipher the
function of a protein, one is to determine its localization. Experimental approaches for proteome annotation including determination of
a protein's subcellular localization are very costly and labor intensive. Besides the available experimental methods, in silico methods
present alternative approaches to accomplish this task. Here, we present two machine learning approaches for prediction of the subcellular
localization of a protein from its primary sequence information. Two machine learning algorithms, k-Nearest Neighbor (k-NN) and
Probabilistic Neural Network (PNN), were used to classify an unknown protein into one of 11 subcellular localizations.
The final prediction is made on the basis of a consensus of the predictions made by the two algorithms, and a probability is assigned to it.
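The consensus idea described above can be illustrated with a minimal sketch. It is not the SubCellProt implementation: the amino acid composition features, the equal-weight averaging of the two class-probability vectors, the values of `k` and `sigma`, and all function names and toy sequences are assumptions chosen for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each standard amino acid in seq (assumed feature set)."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def knn_proba(X, y, x, n_classes, k=3):
    """k-NN: fraction of votes per class among the k nearest training points."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    votes = np.bincount(y[nearest], minlength=n_classes)
    return votes / votes.sum()

def pnn_proba(X, y, x, n_classes, sigma=0.1):
    """PNN: mean Gaussian kernel response per class, normalised over classes."""
    sq_dist = np.sum((X - x) ** 2, axis=1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    scores = np.array([
        kernel[y == c].mean() if np.any(y == c) else 0.0
        for c in range(n_classes)
    ])
    total = scores.sum()
    if total == 0:
        # All kernels underflowed: fall back to a uniform distribution.
        return np.full(n_classes, 1.0 / n_classes)
    return scores / total

def consensus_predict(X, y, x, n_classes, k=3, sigma=0.1):
    """Average the two probability vectors (assumed consensus rule)."""
    p = 0.5 * (knn_proba(X, y, x, n_classes, k) + pnn_proba(X, y, x, n_classes, sigma))
    label = int(np.argmax(p))
    return label, float(p[label])

# Toy data: class 0 is alanine-rich, class 1 is leucine-rich (hypothetical).
train = [("AAAAAAAG", 0), ("AAAAAAGG", 0), ("AAAAAGAA", 0),
         ("LLLLLLLK", 1), ("LLLLLLKK", 1), ("LLLLKLLL", 1)]
X_train = np.array([composition(s) for s, _ in train])
y_train = np.array([c for _, c in train])

label, prob = consensus_predict(X_train, y_train, composition("AAAAAAAA"), n_classes=2)
print(label, prob)
```

Averaging the two probability vectors is only one way to form a consensus; it has the convenient property that the reported probability is high only when both classifiers favor the same class.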
Location                 No. of sequences (Training set)   No. of sequences (Test set)
Nucleus                  4442                              227
Extracellular            5409                              324
Mitochondria             2803                              179
Cytoplasm                3419                              148
Chloroplast              3579                              307
Plasma membrane          4849                              303
Endoplasmic reticulum    828                               62
Golgi apparatus          287                               32
Peroxisome               186                               27
Lysosome                 150                               22
Vacuole                  204                               29