Protein-protein interactions are the molecular basis of most cellular processes in a biological system. Understanding these processes, in which proteins interact with each other to carry out physiological responses, requires knowledge of protein-protein interactions at the global level, as a network. Because currently available experimental methods have numerous limitations, in silico approaches are being devised for the prediction of protein-protein interactions (PPIs). Here we present a machine learning approach for predicting human protein-protein interactions using primary sequence information and the associated physicochemical properties of the constituent amino acids. Protein sequences were represented as 29-dimensional feature vectors derived from amino acid composition, the sequence-order effect and physicochemical properties; these vectors were used to determine whether two proteins interact.
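
To make the idea of a sequence-derived feature vector concrete, the following is a minimal sketch of one ingredient named above, amino acid composition. It is not the 29-dimensional encoding used in this work: it yields a plain 20-dimensional composition vector, and the function names `amino_acid_composition` and `pair_features`, as well as the choice to concatenate the two vectors of a pair, are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Illustrative only: this is a 20-dimensional composition vector, not the
    29-dimensional representation described in the text.
    """
    # Count only standard residues; ambiguous symbols such as X are ignored.
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq_a, seq_b):
    """Hypothetical pair encoding: concatenate the two composition vectors."""
    return amino_acid_composition(seq_a) + amino_acid_composition(seq_b)
```

A vector built this way for each protein pair could serve as classifier input; sequence-order and physicochemical descriptors would have to be appended to approach the representation described here.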

The present work uses a Support Vector Machine (SVM), a machine learning algorithm, which was trained to recognize patterns in these sequences and make a statistical decision as to whether a query protein pair will interact. To generate a good classifier and to avoid the high false-positive rates of database data, the data were filtered so that only protein pairs (both interacting and non-interacting) for which a high degree of confidence could be ascertained were selected. When applied to an independent data set of 3600 protein pairs, our approach achieved an average prediction accuracy, precision and sensitivity of 84.48%, 85.85% and 82.42%, respectively. Further, the approach was able to predict various interaction networks, such as signal transduction pathways, including immune signaling pathways and cancer signaling pathways, with an average prediction accuracy of 82.96%. These results show that protein-protein interactions can be predicted with high accuracy using the physicochemical properties of the constituent amino acids. This method can therefore be employed for studying genome-wide protein-protein interactions, as well as in conjunction with high-throughput interaction studies to further improve prediction accuracy.
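
For readers less familiar with the three reported measures, the sketch below shows how accuracy, precision and sensitivity are conventionally computed from the counts of a binary confusion matrix. The standard definitions are assumed here, since the text does not spell them out; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision and sensitivity from confusion-matrix counts.

    tp/fn: interacting pairs predicted as interacting / non-interacting.
    tn/fp: non-interacting pairs predicted as non-interacting / interacting.
    Standard textbook definitions are assumed.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or sensitivity is undefined for these counts")
    return {
        "accuracy": (tp + tn) / total,   # fraction of all pairs classified correctly
        "precision": tp / (tp + fp),     # fraction of predicted interactions that are real
        "sensitivity": tp / (tp + fn),   # fraction of real interactions that are recovered
    }
```

Calling `classification_metrics(tp=8, fp=2, tn=7, fn=3)` with these made-up counts gives an accuracy of 0.75 and a precision of 0.8; the counts are purely illustrative and are not taken from the data set described above.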