Protein Metal-Binding Residue Prediction

Based on Neural Networks

 

Chih-Hsien Yang (1), Chi-Hsu Wang (1), Chin-Teng Lin (1), and Yuh-Shyong Yang (2)

(1) Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

(2) Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

 

Content List

Abstract

Chapter 1 - Introduction

Chapter 2 - Biological Data Resource

Chapter 3 - Machine Learning Scheme

Chapter 4 - Results and Conclusion

References

 

Abstract

Traditionally, structural biologists have investigated the properties of metalloproteins (proteins that bind metal ions) by physical means, interpreting the function and reaction mechanism of an enzyme from its structure and from in vitro experiments. However, most proteins are known only by their primary structure (the amino acid sequence); three-dimensional structures are often unavailable. Moreover, predicting structure from sequence is still not fully reliable.

 

Consequently, this thesis proposes a direct method that predicts the metal-binding amino acid residues of a protein from its sequence information alone, using a neural network with sliding-window feature extraction and biochemical feature encoding. For the four major bulk elements in living systems (calcium, potassium, magnesium, and sodium), the proposed method identifies metal-binding residues with a binding sensitivity above 90% and nearly 100% accuracy under five-fold cross validation.

 

 

Chapter 1       Introduction

 

With the rapid growth of computer and information science in recent years, many aspects of daily life have changed, and so has biology, the study of living things. Over the last several decades, biologists have collected and accumulated data on the interactions of species and populations, the function of tissues and cells within an individual organism, and even the structure and function of molecules (such as proteins, DNA, and RNA) inside or outside the cell. Sophisticated laboratory technology helps biologists collect data faster, but it cannot speed up the interpretation of these massive and diverse biological data.

 

For instance, the Human Genome Project (HGP)[1] has produced a huge volume of human DNA sequence, but how do we know which parts of the sequence control which chemical processes or reactions in the human body (gene annotation or labeling)? Many outstanding structural biologists have spent great effort determining protein structures by Nuclear Magnetic Resonance (NMR) or X-ray crystallography, and the structures of some proteins are now known, but how do we determine the structure and function of the remaining proteins, or of an entirely new protein (protein structure prediction and function analysis)? Figure 1 shows the exponential data growth in GenBank[2] and the Protein Data Bank[3], respectively.

 

Consequently, biologists need current computational and internet technologies to store, share, and analyze their data on computers and the World Wide Web rather than by hand and eye. Doing so enables “high throughput” biology and accelerates discovery in the life sciences and the development of biomedical products, such as drugs and therapies for cancers or other currently incurable diseases.

 

 

Fig 1 Growth in GenBank (left) and Protein Data Bank (right)

 

Metalloproteins are proteins capable of binding one or more metal ions, which are required for their biological function, for the regulation of their activities, or for structural purposes. Remarkably, more than one-quarter of the elements in the periodic table are required for life, and most of them are metals.

Enzymes are essential for the function of cells and are very specific as to the reactions they catalyze and the chemicals (substrates) involved in those reactions. Substrates fit their enzymes like a key fits its lock. Many enzymes are composed of several proteins that act together as a unit. Most parts of an enzyme have regulatory and structural purposes. The catalyzed reaction takes place in only a small part of the enzyme called the active site. Many enzymes incorporate divalent metal cations and transition metal ions within their structures to stabilize the folded conformation of the protein or to participate directly in the chemical reactions catalyzed by the enzyme.

Metals also provide a template for protein folding, as in the zinc finger domain of nucleic acid binding proteins, the calcium ions of calmodulin (a protein that is necessary for many biochemical processes, including muscle contraction and the release of a chemical that carries nerve signals), and the zinc structural center of insulin. Metal ions can also serve as redox centers for catalysis, such as heme-iron centers, copper ions, and non-heme irons. Other metal ions can serve as electrophilic reactants in catalysis, as in the case of the active-site zinc ions of metalloproteases. For example, the metal ion of the enzyme carbonic anhydrase (Figure 2) is typically held by four coordinate bonds in a tetrahedral arrangement.

 

Fig 2 3D metal-binding structure of carbonic anhydrase II

 

In this thesis, two major data sets (a protein set and an enzyme set, as shown in Figure 3) are extracted from 19771 protein structures in PDB, and all experiments are based on the enzyme set. There are 7529 protein molecules with metal binding and 6890 protein molecules with an EC number[4]. Over one-third (36.72%) of the 19771 protein structures contain metal ions, and nearly 40% of those are enzymes (Figure 4).

 

 

Fig 3 Overview of data resource and working map

 

 

Fig 4 Distribution of metalloproteins and metalloenzymes

 

According to PIR[5], release 78.03 contained 283,336 entries on November 24, 2003. In contrast, only 24,358 structures were available in the Protein Data Bank on February 17, 2004. Clearly, sequence data outnumber structure data for proteins by a factor of ten or more. Consequently, a practical prediction method based on sequence alone would be very helpful. The objective of this thesis is to build a metal-binding model for proteins with a computer-based machine learning method, so that it can serve as a reliable predictor of metal-binding residues for proteins without coordinate information. Such a model can further be used to investigate and understand how the biochemical function of a metalloprotein forms, and eventually to offer “functional templates” as guidelines for designing new proteins with specified functions.

 

 

 

Chapter 2        Biological Data Resource

 

The main data resources come from two web sites. One is the Metalloprotein Database and Browser (MDB) of the Metalloprotein Structure and Design Program at the Scripps Research Institute (http://metallo.scripps.edu). The other is the Protein Data Bank (PDB, http://www.rcsb.org/pdb/), which provides general information about every protein structure. By combining these two data sets, a detailed description of each metalloprotein can be derived. For simplicity, the PDB information is replaced by the more compact PDBFinder data (http://www.cmbi.kun.nl/gv/pdbfinder/), released on September 14, 2003.

                    

 

Fig 5 DSD schematic of PDBFinder and MDB

Figure 5 shows the DSD (Data Structure Definition) schematics of PDBFinder and MDB. Abstractly, the data hierarchy can be defined as four layers ordered by size: PROTEIN, CHAIN, SITE, and LIGAND. The top-level PROTEIN may contain one or several chains, and each chain represents one polypeptide chain belonging to a protein in nature. Every site contains the coordinate information for an entire metal-center binding site, like the one shown in Figure 2. The environment information describes how many binding atoms (ligands) participate in the site, on which residue each ligand is located, and what the binding ligands are. The binding hierarchy model and the ERD (Entity-Relationship Diagram) are shown in Figure 6 and Figure 7.
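The four-layer hierarchy described above can be sketched as nested record types. This is only an illustration: the class and field names (Ligand, Site, Chain, Protein, pdb_id, residue_number, and so on) are assumptions and do not reproduce the actual PDBFinder or MDB schema, and the toy entry at the end is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ligand:
    """One binding atom and the residue that carries it."""
    residue_name: str      # three-letter residue code, e.g. "HIS"
    residue_number: int    # 1-based position in the chain sequence
    atom_name: str         # name of the coordinating atom, e.g. "NE2"


@dataclass
class Site:
    """One metal center together with its binding atoms."""
    metal: str
    ligands: List[Ligand] = field(default_factory=list)


@dataclass
class Chain:
    """One polypeptide chain and the metal sites located on it."""
    chain_id: str
    sequence: str
    sites: List[Site] = field(default_factory=list)


@dataclass
class Protein:
    """Top-level entry; may contain one or several chains."""
    pdb_id: str
    chains: List[Chain] = field(default_factory=list)


# Toy entry; every value here is hypothetical.
toy = Protein(
    pdb_id="XXXX",
    chains=[Chain(
        chain_id="A",
        sequence="ACDHEHKH",
        sites=[Site(metal="ZN", ligands=[
            Ligand("HIS", 4, "NE2"),
            Ligand("HIS", 6, "NE2"),
            Ligand("HIS", 8, "ND1"),
        ])],
    )],
)

# Walk PROTEIN -> CHAIN -> SITE -> LIGAND to collect binding positions.
binding_positions = sorted(
    lig.residue_number
    for chain in toy.chains
    for site in chain.sites
    for lig in site.ligands
)
print(binding_positions)
```

Keeping the ligand's residue number on the lowest layer is what later allows each sequence position to be marked as binding or non-binding.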

 

 

                                  

 

Fig 6 Metal-binding protein data hierarchy

 

 

 

 

Fig 7 Entity Relationship diagram of PDBFinder and MDB

 

 

After cross querying MDB and PDBFinder with scripts written in the web scripting language PHP against a local MySQL database, 41 and 35 metal types are found in the protein and enzyme sets, respectively. Table 1 lists the elements used for metal-binding residue prediction after cross querying, classified by biological level and ordered within each level by their enzyme-protein ratio (E/P, the last column).
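The cross query can be illustrated with a small join. The original work used PHP and MySQL; the sketch below instead uses Python with an in-memory SQLite database, and the table names, column names, and rows (pdbfinder, mdb_site, P001, and so on) are hypothetical stand-ins rather than the real schemas. It counts, for each metal, the chains in the protein set and in the enzyme set (chains whose entry has an EC number), then derives the E/P ratio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified tables standing in for PDBFinder and MDB.
cur.execute("CREATE TABLE pdbfinder (pdb_id TEXT, chain_id TEXT, ec_number TEXT)")
cur.execute("CREATE TABLE mdb_site (pdb_id TEXT, chain_id TEXT, metal TEXT)")

# Made-up rows: an entry counts as an enzyme when ec_number is not NULL.
cur.executemany("INSERT INTO pdbfinder VALUES (?, ?, ?)", [
    ("P001", "A", "4.2.1.1"),
    ("P002", "A", None),
    ("P003", "A", "3.4.24.27"),
])
cur.executemany("INSERT INTO mdb_site VALUES (?, ?, ?)", [
    ("P001", "A", "ZN"),
    ("P002", "A", "CA"),
    ("P003", "A", "ZN"),
    ("P003", "A", "CA"),
])

# Count distinct chains per metal, for all proteins and for enzymes only.
cur.execute("""
    SELECT m.metal,
           COUNT(DISTINCT m.pdb_id || ':' || m.chain_id) AS protein_chains,
           COUNT(DISTINCT CASE WHEN p.ec_number IS NOT NULL
                               THEN m.pdb_id || ':' || m.chain_id END) AS enzyme_chains
    FROM mdb_site AS m
    JOIN pdbfinder AS p
      ON p.pdb_id = m.pdb_id AND p.chain_id = m.chain_id
    GROUP BY m.metal
    ORDER BY m.metal
""")
rows = cur.fetchall()
conn.close()

for metal, n_protein, n_enzyme in rows:
    print(metal, n_protein, n_enzyme, n_enzyme / n_protein)  # last value is E/P
```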

 

For simplicity, each instance in the integrated database is treated as one protein chain; as a result, inter-chain metal binding is not considered. Using the binding information from MDB, every position in a protein chain sequence can be marked as binding or non-binding and used as input to the learning scheme. Figure 8 summarizes all required processing steps and the data flow.
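Marking each sequence position as binding or non-binding can be sketched as follows. The function name label_chain, the 1-based position convention, and the toy sequence are assumptions made for illustration.

```python
def label_chain(sequence, binding_positions):
    """Return one label per residue: 1 = metal-binding, 0 = non-binding.

    binding_positions holds 1-based residue positions within this chain.
    """
    labels = [0] * len(sequence)
    for pos in binding_positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the chain")
        labels[pos - 1] = 1
    return labels


# Toy chain with three hypothetical binding histidines.
sequence = "ACDHEHKH"
labels = label_chain(sequence, [4, 6, 8])
print(list(zip(sequence, labels)))
```

Because each chain is labeled on its own, a metal held by residues from two different chains contributes labels to each chain separately, which matches the simplification stated above.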

 

 

                           

 

Table 1 Number of chains in protein and enzyme

 

Figure 9 illustrates the elements of life in the periodic table. As indicated in [4], there are 11 bulk biological elements: hydrogen (H), carbon (C), nitrogen (N), oxygen (O), sodium (Na), magnesium (Mg), phosphorus (P), sulfur (S), chlorine (Cl), potassium (K), and calcium (Ca); 12 trace elements essential for life: vanadium (V), chromium (Cr), manganese (Mn), iron (Fe), cobalt (Co), nickel (Ni), copper (Cu), zinc (Zn), selenium (Se), molybdenum (Mo), tin (Sn), and iodine (I); and 2 possible trace elements: arsenic (As) and bromine (Br). After cross comparison, 4 of the 11 (36%) bulk biological elements, 11 of the 12 (91.7%) trace elements, and 1 of the 2 (50%) possible trace elements are found in MDB.

 

 

 

Fig 8 Data processing pipeline

 

 

 

Fig 9 Life elements in periodic table

 

 

 

Chapter 3        Machine Learning Scheme

 

The learning schemes used in this thesis are kept as simple as possible, so that it is easy to observe how prediction performance varies with codings based on non-biological or biological features. The relationship between performance and the size of the sequence sampling window can also be observed.

 

Neural Networks

 

A neural network consists of groups of parallel processing units with connections between layers, and each connection has one weight parameter. Neural networks use these weights to “memorize” the patterns fed from the input layer. The basic unit within a layer is an artificial neuron (node). In this thesis, multi-layer perceptron (MLP) neural networks trained with the back-propagation (BP) algorithm are chosen as the learning machine for the experiments. Only one hidden layer with 30 hidden nodes is used, so there are (30 × dimension of input vector) weights between the input layer and the hidden layer and (30 × dimension of output vector) weights between the hidden layer and the output layer.
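As a worked example of these weight counts, the sketch below assumes the 21-bit one-hot coding and the two output nodes described later in this chapter, so that the input dimension is (window size × 21). Bias terms are not counted, following the text; the function name is an assumption.

```python
def mlp_weight_count(window_size, bits_per_residue=21, hidden_nodes=30, outputs=2):
    """Return (input dimension, input-to-hidden weights, hidden-to-output weights)."""
    input_dim = window_size * bits_per_residue
    input_to_hidden = hidden_nodes * input_dim
    hidden_to_output = hidden_nodes * outputs
    return input_dim, input_to_hidden, hidden_to_output


for window_size in (5, 15, 17):
    print(window_size, mlp_weight_count(window_size))
```

For a window of 5 residues this gives a 105-dimensional input with 3150 input-to-hidden weights and 60 hidden-to-output weights; the first count grows linearly with the window size.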

 

The dimension of the input layer depends on the size of the sequence sampling window, and the dimension of the output layer is two. In the testing phase, if the first output value is larger than the second, the prediction is defined as positive (binding); otherwise it is negative (non-binding).
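The forward pass and the two-output decision rule can be sketched in plain Python. Several details are assumptions, since the text does not state them: the sigmoid activation, the omission of bias terms, and the random (untrained) weights, which only demonstrate the shapes involved. Training by back-propagation is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_output):
    """Forward pass through one hidden layer (biases omitted, an assumption)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]


def predict(outputs):
    """Positive (binding) if the first output exceeds the second, else negative."""
    return "binding" if outputs[0] > outputs[1] else "non-binding"


rng = random.Random(0)
input_dim, hidden_nodes, output_dim = 105, 30, 2   # window of 5 residues x 21 bits

# Untrained random weights, used here only to illustrate the layer shapes.
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(hidden_nodes)]
w_output = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_nodes)] for _ in range(output_dim)]

# A random but well-formed input: one bit set in each 21-bit block.
x = [0.0] * input_dim
for k in range(5):
    x[k * 21 + rng.randrange(21)] = 1.0

outputs = forward(x, w_hidden, w_output)
print(outputs, predict(outputs))
```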

 

Feature Encoding

 

Two input codings are used in our experiments. One is direct one-hot coding, which represents every amino acid as a 21-bit array in which exactly one bit is ‘1’ and all other bits are ‘0’. In this way, each of the 20 natural amino acids is indicated by the position of the only ‘1’ bit. Because a protein sequence may contain an amino acid of unknown type (usually written with the symbol ‘X’), one extra bit is added to record this condition. This is the non-biological coding for amino acids.
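A minimal sketch of the 21-bit one-hot coding follows. The alphabetical ordering of the bits and the choice to map every unrecognized symbol to the ‘X’ bit are assumptions; the text specifies only that 20 bits represent the natural amino acids and one bit marks the unknown type.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter codes of the 20 natural amino acids
ALPHABET = AMINO_ACIDS + "X"            # the 21st bit marks an unknown residue


def one_hot(residue):
    """Encode one residue as a 21-element list with exactly one bit set."""
    if len(residue) != 1:
        raise ValueError("expected a single one-letter residue code")
    index = ALPHABET.find(residue.upper())
    if index < 0:
        index = ALPHABET.index("X")     # assumption: other symbols count as unknown
    bits = [0] * len(ALPHABET)
    bits[index] = 1
    return bits


print(one_hot("H"))
print(one_hot("X"))
```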

 

The other coding method draws on five different types of biological features of amino acids. Their definitions and contents are shown in Table 2, and their values are listed in Table 3.

 

 

Feature Set (size)                    Definition and Content          References

Physical (3)                          mass, volume, and area          NCBI statistics [6]

Solvent Exposed Area Levels (3)       three levels:                   [8]
                                        SEA > 30
                                        10 < SEA < 30
                                        SEA < 10

Hydrophobicity Scales (6)             six scales:
                                        Engelman-Steitz               [9]
                                        Hopp-Woods                    [10]
                                        Kyte-Doolittle                [11]
                                        Janin                         [12]
                                        Chothia                       [13]
                                        Eisenberg-Weiss               [14]

Secondary Structure Propensity (3)    three secondary structures:     [1]
                                        Alpha helix
                                        Beta strand
                                        Turn (loop, coil)

Chemical Classification (8)           eight classifications:          [7]
                                        Polar
                                        Non-Polar
                                        Charged
                                        Positive
                                        Tiny
                                        Small
                                        Aromatic
                                        Aliphatic

 

Table 2 Definition and content of 5 biological feature sets

 

Table 3 Values of 5 biological feature sets
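Biological feature coding replaces each residue's 21 bits with a short vector of feature values. The sketch below shows the idea for a single feature, using the published Kyte-Doolittle hydropathy index [11] as one of the six hydrophobicity scales. Whether the thesis rescaled these values, and how the unknown residue ‘X’ was coded, is not stated, so the raw index values and the 0.0 default are assumptions.

```python
# Kyte-Doolittle hydropathy index for the 20 natural amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_residue(residue, scales):
    """One value per scale; unknown residues get 0.0 (an assumption)."""
    return [scale.get(residue, 0.0) for scale in scales]


def encode_window(window, scales):
    """Concatenate the per-residue feature values of a whole window."""
    features = []
    for residue in window:
        features.extend(encode_residue(residue, scales))
    return features


vec = encode_window("DHEHK", [KYTE_DOOLITTLE])
print(vec)
```

With all six hydrophobicity scales in the list, a window of 15 residues would yield 90 inputs instead of the 315 needed by one-hot coding.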

 

Because the binding behavior of the central metal atom is influenced by its surrounding environment in the protein, it is necessary to look at a wider scope than a single amino acid to determine whether binding happens. Accordingly, each input vector applied to the learning machine is extracted from one segment of the chain using a continuous sliding window. Each sliding window is centered on the “target” amino acid, and the remaining amino acids in the window are the “neighbors” of the target. Figure 10 shows the feature extraction, the learning scheme, and how the sliding window works. For simplicity, the window size illustrated is 5.
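The sliding-window extraction can be sketched as below for a window size of 5. How windows near the two ends of a chain were handled is not stated in the text; padding with the unknown symbol ‘X’ so that every residue can serve as a target is an assumption, as are the function name and the toy sequence.

```python
def sliding_windows(sequence, window_size, pad="X"):
    """Return one window per residue, each centered on that target residue."""
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    padded = pad * half + sequence + pad * half   # assumption: pad chain ends with 'X'
    return [padded[i:i + window_size] for i in range(len(sequence))]


sequence = "ACDHEHKH"
windows = sliding_windows(sequence, 5)
for target, window in zip(sequence, windows):
    print(target, window)
```

Each window is then passed through the chosen feature coding and paired with the binding label of its central residue to form one training instance.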

 

 

Fig 10 feature extraction, learning scheme and sliding window

 

 

 

Chapter 4        Results and Conclusion

 

First, one-hot coding is used with window sizes from 5 to 17 to observe how performance changes with window size. Owing to the extremely low P/N ratio (the ratio of positive to negative instances), specificity and the negative predictive rate both approach 100% and are much higher than sensitivity (Q-observed). As a result, sensitivity (Q-observed) becomes the only critical performance measure in such heavily unbalanced training. Table 4 shows Q-observed in the enzyme set for each window size.
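The effect of the low P/N ratio on these measures can be seen from their standard confusion-matrix definitions. In the sketch below the counts are made up for illustration (10 binding residues among 1000), and the dictionary keys are naming choices of this example. A predictor that labels every residue as non-binding still scores perfectly on specificity and almost perfectly on accuracy and negative predictive value, while its sensitivity is zero.

```python
def performance(tp, fn, tn, fp):
    """Compute rates from confusion-matrix counts; None if a denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (Q-observed)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
    }


# Hypothetical counts for a predictor that always answers "non-binding".
all_negative = performance(tp=0, fn=10, tn=990, fp=0)
print(all_negative)
```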

 

Table 4 True positive rate of 31 elements in enzyme

 

 

Increasing the window size generally improves sensitivity (that is, the true positive rate, Q-observed) in each metal-binding enzyme set, but in some sets, such as calcium (Ca) and zinc (Zn), a longer window does not necessarily give better performance. Moreover, the large computational cost of extending the sampling window is not matched by a comparable improvement in performance.

 

Next, one-hot coding is replaced by the biological feature sets shown in Table 2 and Table 3. The data set focuses on the four bulk-element subsets (calcium, potassium, magnesium, and sodium) with less than 25% sequence identity, and the sliding window size is 15. The comparison between different feature sets is listed in Table 5. For simplicity, only Q-observed (true positive rate) and Q-predicted (positive predictive value) are listed in the table.

 

Table 5 Comparison between different feature sets

 

 

Comparing the Q-observed values, the physical and solvent exposed area feature sets do not discriminate well between metal-binding and non-metal-binding residues; they perform even worse than direct one-hot coding. The other three biological feature sets (secondary structure propensity, hydrophobicity scales, and chemical classification) perform better than one-hot coding.

 

These results are consistent with the characteristics of metal-binding chelates, the three-dimensional cavities in which a metal ion “resides” in a protein. They can be interpreted as showing that the formation of a metal-binding chelate is highly related to the secondary structure tendency, degree of hydrophobicity, and chemical classification of the neighboring amino acids. Before these experiments began, it was already expected that metal binding is not dominated by the physical features of the surrounding amino acids alone. The results in this section support that expectation and also show that solvent exposed area is not strongly related to the formation of metal-binding chelates in proteins.

 

From Table 4 and Table 5, it is clear that biological insight plays an important role in predicting biochemical phenomena in nature. Although one-hot coding is a straightforward way to encode the 20 amino acids, it cannot completely represent the behavior and characteristics of metal binding in proteins. After the extensive experiments in this thesis, a direct metal-binding prediction method is proposed and shown to be useful and highly accurate for proteins binding the four bulk elements under five-fold cross validation.
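For readers unfamiliar with the evaluation protocol, five-fold cross validation can be sketched as follows. The text does not say whether the folds were formed over chains or over individual residue windows, nor how items were shuffled, so the generic item indices, the fixed seed, and the function name are all assumptions.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

In each round the model is trained on four folds and evaluated on the held-out fold, and the reported measure is the average over the five rounds.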

 

REFERENCES

 

[1] C. H. Wu and J. W. McLarty, Neural Networks and Genome Informatics, Elsevier Science Ltd, UK, pp. 67-86, 2000.

 

[2] C. T. Lin and C. S. George Lee, Neural Fuzzy Systems, Prentice-Hall, Inc., N.J., U.S.A., 1996.

 

[3] C. Branden and J. Tooze, Introduction to Protein Structure, 2nd edition, Garland Publishing, Inc., New York, pp. 205-220, 1999.

 

[4] M. J. Kendrick, M. T. May, M. J. Plishka, and K. D. Robinson, Metals in Biological Systems, Ellis Horwood Limited, England, pp. 11-48, 1992.

 

[5] R. A. Copeland, Enzymes: A Practical Introduction to Structure, Mechanism, and Data Analysis, 2nd edition, Wiley-VCH, Inc., Canada, pp. 42-74, 2000.

 

[6] J. M. Castagnetto, S. W. Hennessy, V. A. Roberts, E. D. Getzoff, J. A. Tainer, and M. E. Pique, “MDB: the Metalloprotein Database and Browser at The Scripps Research Institute”, Nucleic Acids Res., Vol. 30, No. 1, pp. 379-382, 2002.

 

[7] W. R. Taylor, “The Classification of Amino Acid Conservation”, J. Theor. Biol., Vol.119, pp. 205-218, 1986.

 

[8] D. Bordo and P. Argos, “Suggestions for Safe Residue Substitutions in Site-Directed Mutagenesis”, J. Mol. Biol., Vol. 217, pp. 721-729, 1991.

 

[9] D. M. Engelman, T. A. Steitz, and A. Goldman, “Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins”, Annu. Rev. Biophys. Biophys. Chem., Vol. 15, pp. 321-353, 1986.

 

[10] T. P. Hopp and K. R. Woods, “Prediction of protein antigenic determinants from amino acid sequences”, Proc. Natl. Acad. Sci., Vol. 78, pp. 3824-3828, 1981.

 

[11] J. Kyte and R. F. Doolittle, “A Simple Method for Displaying the Hydropathic Character of a Protein”, J. Mol. Biol., Vol. 157, pp. 105-132, 1982.

 

[12] J. Janin, “Surface and Inside Volumes in Globular Proteins”, Nature, Vol. 277, pp.491-492, 1979.

 

[13] C. Chothia, “Hydrophobic bonding and accessible surface area in proteins”, Nature, Vol.248, pp.338-339, 1974.

 

[14] D. Eisenberg, R. M. Weiss, T. C. Terwilliger, and W. Wilcox, “Hydrophobic moments and protein structure”, Faraday Symp. Chem. Soc., Vol. 17, pp. 109-120, 1982.

 


 

FOOTNOTES

[1] The Human Genome Project is an international research effort to identify the sequence and genes in human DNA. http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

[2] GenBank is the NIH (National Institutes of Health, http://www.nih.gov/) genetic sequence database, an annotated collection of all publicly available DNA sequences. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

[3] The Protein Data Bank is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data. http://www.rcsb.org/pdb/

[4] Enzyme Commission number, a nomenclature for enzymes, developed by The International Union of Biochemistry and Molecular Biology, is described by a sequence of four numbers, preceded by “EC” in the form of “EC X.X.X.X.”

[5] Protein Information Resource, an integrated public resource of protein informatics to support genomic and proteomic research and scientific discovery. http://pir.georgetown.edu/

[6] National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/