COMPARATIVE ANALYSIS OF SH2 DOMAIN STRUCTURES

Src Homology 2 (SH2) domains are compact globular domains that are involved in intracellular signaling pathways and play an important role in mediating specific protein-protein interactions. An SH2 domain consists of about 100 amino acids and includes seven β-strands and two α-helices. The binding pocket of SH2 domains comprises two highly conserved parts, the pTyr and pTyr+3 sites. Knowledge of the structure of protein complexes is an important step toward understanding the mechanisms of their functioning. The binding site of SH2 domains and the sequences surrounding it were analyzed using in silico methods. All SH2 domains were divided into groups by sequence similarity. Conservation and similarity analysis identified the sequence segments common to all domains as well as the segments unique to certain domains. Furthermore, surface area analysis shows that the most highly conserved structures occupy the smallest area. These results indicate that SH2 domains are able to recognize more than linear phosphopeptide sequences. This offers new insight into possible mechanisms of interaction between SH2 domains and ligands or proteins (e.g., the possibility that a protein or ligand binds to an SH2 domain outside the binding site area as well).


INTRODUCTION
Proteins are characterized by sequence similarity relationships. Understanding protein relationships is important for the annotation of genome sequences [12]. Proteins with high sequence identity and high structural similarity commonly share functional similarity and evolutionary relationships. Well-recognized exceptions to this general sequence/structure/function relationship exist. For example, high sequence identity combined with low structural similarity can arise from conformational plasticity, mutations, solvent effects and ligand binding.
SH2 domains were identified on the basis of structure and sequence similarity. These domains are highly structurally conserved motifs of about 100 amino acids; they participate in a large number of signal-transduction interactions between proteins. The human genome encodes about 120 SH2 domains, which are present in 110 proteins such as kinases (Src, Lck), phosphatases (SHP2, SHIP2), phospholipases (PLCγ1), transcription factors (STAT), regulatory proteins (SOCS), adapter proteins (Grb2), structural proteins (SHC) and others. The wide distribution of SH2 domains in animals and their almost complete absence in microorganisms (e.g. primitive SH2 fragments in yeast) suggest that their appearance is related to the increasing complexity of signal transduction mechanisms in multicellular organisms [14].
Recent research has shown that SH2 domains can be classified according to their recognition specificity for the residues C-terminal to pTyr. Such recognition may involve the residues at positions +1, +2 and +3 relative to pTyr [10]. Thus, each SH2 domain binds only to specific phosphotyrosine-containing fragments. For example, the Src SH2 domain preferentially recognizes Glu-Glu-Ile (binding fragment pYEEI), whereas the Grb2 SH2 domain binds to a different fragment, pYVNV. However, a complete understanding of this effect requires detailed study of the thermodynamics of the interaction between SH2 domains and phosphopeptides.
The 3D and 1D structures of SH2 domains were studied by computer modeling methods and analyzed according to sequence and structural similarity. Conservation that does not extend across all SH2 domains was found; it reflects differences between groups at the amino acid level, so conservation is higher in some groups than in others. In addition, several common motifs of the binding pocket and of the spatially close region were selected. Surface analysis identified potential for binding in the selected conserved regions, both in the binding site and in other parts of the domain.
The aim of this work was to identify the main conserved regions of the SH2 domain that can take part in protein-protein and ligand-protein interactions, and to compare the selected sequences within each group and across the whole domain.

MATERIALS AND METHODS
3D dataset of SH2 domain. 219 3D structures of SH2 domains (66 from nuclear magnetic resonance, 153 from X-ray crystallography) were retrieved from the PDB (Protein Data Bank) [2]. It was taken into account that a file may contain more than one SH2 domain. Consequently, the files were divided into 1129 separate coordinate structures. The resulting structures were used in further calculation steps.
3D structural environment analysis of SH2 domain binding pocket. 3D structure analysis of the SH2 domain binding site was done by using Chimera [5] and helixweb [6,9] software tools.
Protein residues with at least one atom contributing to the binding pocket (even if that atom occupies only a minimal area of the pocket) were included in the binding site definition. Residues were selected by evaluating the ASA (accessible surface area) of each protein residue. If an amino acid occupies more than 5 Å² and lies within 5 Å of the nearest ligand atom (i.e., makes strong contacts), it is considered to make a large contribution to forming the binding pocket. An amino acid with a smaller area or a greater distance between the nearest atoms of the interacting structures is considered to support binding in the pocket without contributing significantly to its formation.
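The two-threshold rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResidueContact` record, the residue labels and the numeric values are hypothetical, and the per-residue area and ligand distance are assumed to have been computed beforehand by a structure analysis tool.

```python
from dataclasses import dataclass

ASA_CUTOFF = 5.0       # Å^2, area threshold quoted in the text
DISTANCE_CUTOFF = 5.0  # Å, distance threshold quoted in the text


@dataclass
class ResidueContact:
    """Hypothetical record holding precomputed values for one pocket residue."""
    name: str
    asa: float           # area (Å^2) the residue occupies in the pocket
    min_distance: float  # distance (Å) to the nearest ligand atom


def classify_residue(res: ResidueContact) -> str:
    """Return 'major' for strong pocket-forming contacts, else 'supporting'."""
    if res.asa > ASA_CUTOFF and res.min_distance <= DISTANCE_CUTOFF:
        return "major"
    return "supporting"


# Illustrative values only; they are not taken from any structure in the dataset.
residues = [
    ResidueContact("RES_A", asa=18.4, min_distance=2.9),
    ResidueContact("RES_B", asa=3.1, min_distance=4.2),
    ResidueContact("RES_C", asa=12.0, min_distance=6.3),
]
labels = {r.name: classify_residue(r) for r in residues}
```

Both conditions must hold for a residue to count as a major contributor; failing either one places it in the supporting class, as described in the text.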
The comparison of SH2 domain structures.
All available coordinate files were compared step by step. First, structures were compared within each PDB ID using ClustalX [15] (sequence comparison) and Chimera [4] (comparison of RMSD between structures). Then the same procedure was applied to PDB structures with different IDs, and 56 structures were selected. Finally, the sequences were aligned with ClustalX, edited using Jalview [3] and divided into six groups.
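As a minimal sketch of the sequence side of such a comparison, the percent identity of two already aligned sequences can be computed as follows. The convention used here (columns with a gap in either sequence are skipped) is an assumption made for illustration; ClustalX and other tools may count gapped columns differently, and the function name is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over the aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments, not real SH2 sequences.
identity = percent_identity("ACD-F", "ACE-F")
```

Two coordinate files whose sequences give 100 % identity under this measure would still need the structural (RMSD) comparison described above before being treated as the same structure.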
Residue columns analysis. Residue consistency and sequence similarity of selected groups were assessed using PRALINE software [13].
The conservation scores were calculated with the Scorecons Server [1] using the entropy score method. The entropy scores used here normalize Shannon's entropy so that conserved (low entropy) columns score 1 and diverse (high entropy) columns score 0. The entropy score for every position within the alignment is defined as

$$S = 1 - \frac{-\sum_{a=1}^{K} p_a \log p_a}{\log\left(\min(N, K)\right)}$$

where $N$ is the number of residues in the column, $K$ is the number of residue types and $p_a = n_a / N$ ($n_a$ is the number of residues of type $a$). Two variants of the calculation were performed: (1) amino acids were classified into one of K = 21 types: 20 standard amino acid types + 1 gap type; (2) amino acids were classified into one of K = 7 types: aliphatic (AVLIMC), aromatic (FWYH), polar (STNQ), positive (KR), negative (DE), special conformations (GP) and gaps.
It should also be noted that all the calculations above were done with the BLOSUM62 substitution matrix [11]. Before a substitution matrix is used, however, it must first be transformed into a convenient range. Therefore a Karlin-like transformation was applied [8].
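A normalized entropy score of this kind can be sketched as below. The normalization assumed here is one minus Shannon's entropy divided by log(min(N, K)), which gives 1 for a fully conserved column and 0 for a maximally diverse one; this is an illustrative reading of the description, not the Scorecons implementation, and the function and variable names are hypothetical.

```python
import math
from collections import Counter

# The seven residue classes listed in the text (variant 2, K = 7).
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
    "gap": "-",
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def entropy_score(column: str, k: int, mapping=None) -> float:
    """Normalized entropy conservation score for one alignment column.

    column  : residues (one-letter codes, '-' for gaps) found in the column
    k       : number of residue types (21 or 7 in the text)
    mapping : optional dict that reduces residues to classes before counting;
              unknown symbols raise KeyError in this sketch
    """
    symbols = [mapping[c] for c in column] if mapping else list(column)
    n = len(symbols)
    norm = math.log(min(n, k))
    if norm == 0:  # a single-residue column is trivially conserved
        return 1.0
    counts = Counter(symbols)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / norm


# Toy column: four different residues, but only two residue classes.
score_21 = entropy_score("AVKR", k=21)
score_7 = entropy_score("AVKR", k=7, mapping=RESIDUE_TO_GROUP)
```

The toy column shows why the two variants are both useful: with 21 types the column looks fully diverse, while with 7 classes the shared physicochemical character of the residues raises the score.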
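One common form of a Karlin-like transformation divides each matrix entry by the geometric mean of the two corresponding diagonal entries, which maps the scores into the range [-1, 1] with every self-substitution equal to 1. The sketch below assumes this form; whether it matches the exact transformation used by the server is an assumption. Only a three-residue excerpt of BLOSUM62 is included, and the helper names are hypothetical.

```python
import math

# Excerpt of BLOSUM62 scores for alanine (A), arginine (R) and aspartate (D).
BLOSUM62_EXCERPT = {
    frozenset("A"): 4,
    frozenset("R"): 5,
    frozenset("D"): 6,
    frozenset("AR"): -1,
    frozenset("AD"): -2,
    frozenset("RD"): -2,
}


def raw_score(a: str, b: str) -> int:
    """Symmetric lookup: the pair (a, b) and the pair (b, a) share one entry."""
    return BLOSUM62_EXCERPT[frozenset((a, b))]


def karlin_like(a: str, b: str) -> float:
    """Scale m(a, b) by sqrt(m(a, a) * m(b, b)) so that self-scores become 1."""
    return raw_score(a, b) / math.sqrt(raw_score(a, a) * raw_score(b, b))


scaled_ar = karlin_like("A", "R")  # -1 / sqrt(4 * 5)
```

This scaling relies on the diagonal entries being positive, which holds for BLOSUM62.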

Selection of every separate group.
The SH2 domain groups were selected in the following order: (1) 346 PDB structures were retrieved from the Protein Data Bank [2]; (2) they were divided into 1129 structures; (3) the resulting coordinate files were compared within each PDB ID, and structures that showed any differences were treated as different structures. As a result, 56 distinct structures were obtained (see below in SH2 domain characteristics within the groups and Fig. 2); (4) in the next stage, all the resulting SH2 domain structures were divided into six groups based on amino acid similarity (Fig. 2, Table).
For example, under this algorithm the domain structure of 2K7A (group 5) represents seven other PDB structures (1LUI. (residues 238-344), 1LUK.A (residues 238-344), 1LUM.A (residues 238-344), 1LUN.A (residues 238-344), 2K79.B, 2ETZ.A and 2EU0.A; all of them are Tyrosine-protein kinase ITK/TSK from Mus musculus) and covers 18 similar coordinate files. Thus only one structure was carried forward to the next comparison stage.
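The reduction from many coordinate files to one representative per distinct structure can be sketched as follows. This simplified version treats two chains as the same structure only when their sequences are identical; the procedure in the text also compares structures, so the criterion here is an assumption. The chain identifiers and sequences are hypothetical placeholders.

```python
def select_representatives(chains):
    """Group chains by identical sequence and keep the first chain of each group.

    chains : iterable of (chain_id, sequence) pairs
    returns: dict mapping each representative chain_id to the ids it stands for
    """
    rep_for_sequence = {}
    members = {}
    for chain_id, sequence in chains:
        # The first chain seen with a given sequence becomes its representative.
        rep = rep_for_sequence.setdefault(sequence, chain_id)
        members.setdefault(rep, []).append(chain_id)
    return members


# Placeholder data; these are not real PDB entries or SH2 sequences.
toy_chains = [
    ("entry1.A", "MNNLET"),
    ("entry1.B", "MNNLET"),
    ("entry2.A", "MNNLET"),
    ("entry3.A", "MNKLET"),
]
clusters = select_representatives(toy_chains)
```

In the toy data, three chains collapse onto one representative and a single-residue difference is enough to keep the fourth chain separate, mirroring the rule that any difference yields a distinct structure.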
Accessible surface area determination showed that the SH2 domain binding site occupies about ¼ of the whole domain area. This is not surprising, because in all cases the binding site is flat with two main grooves (the pTyr-binding part and the hydrophobic part) [7] (Fig. 3). In addition, comparison of all available structures shows that similarity is higher in the binding site than in the whole domain (Table). For group 1 alone, however, the difference between the pocket and whole-domain parameters is not significant (e.g. the increase in sequence similarity is only 1 %). Groups 1 and 2 are highly conserved, especially in ligand binding positions: the increase in structure identity (compared with whole SH2 domains) is 61 and 47 %, and the increase in similarity is 55.1 and 35.2 %, respectively. These groups exhibit the largest conserved surface area, as might be expected from the high level of sequence similarity. This, however, affects the overall domain surface area and the surface area of its parts. For example, group 1 has the smallest binding pocket area, which is highly conserved, while the less conserved binding pocket of group 2 occupies a larger surface area (Table). This is achieved through the variable amino acid composition of the domain.
Groups 3-6 are much less conserved than groups 1 and 2: the increase in structure identity (compared with whole SH2 domains) is 29, 14, 8 and 14 %, and the increase in similarity is 14.5, 12.2, 15.5 and 11.63 %, respectively. This leads to a significant increase in surface area.
The increase in similarity and identity for groups 3-6 is, of course, not as large as for groups 1 and 2. However, these parameters are good in the most interesting regions of the domains, the binding pocket and surrounding areas (Fig. 2, Table) (e.g. in group 3 the increase in similarity in the binding pocket is 21 % and the increase in identity is 32 %). In all cases the conservation and identity parameters within each group are not lower than 40 % (such results are acceptable in bioinformatics research).
Moreover, in almost all cases the conserved regions are located in the binding site. However, a few highly conserved sequences lie outside the binding site but spatially close to it (for example, positions 1-5 and 10-11 in all sequences) (Fig. 2). Such conserved parts, which are not involved in the binding site, challenge the notion that SH2 domains recognize only linear phosphopeptide sequences.
3D structural environment of SH2 domain binding pocket. Eight main separate motifs and single amino acids are present in the environment of the SH2 domain binding pocket. Groups 3, 4 and 6, however, contain one additional binding point (number 9) (Table).
The SH2 domain binding site comprises 20-24 amino acids. It was defined by determining the accessible surface area of all amino acids that form the SH2 domain binding site or lie spatially close to it. The mean conservation is much greater in the binding sites than in the complete domain structure, which confirms that these sites are more conserved than the rest of the domain (Table 1). However, a few positions of the binding site are not involved in any of the binding motifs described above (e.g., T (position 37, group 3), L (position 69, group 4) and Q (position 40, group 6)). This diversity, located in the known binding interface, may be important in the recognition of ligands.

CONCLUSIONS
Members of a protein family can be studied using a framework built from protein classification data. This is a starting point for examining similarity and diversity within large protein families. Studies of this kind can indicate functionally important parts of the protein structure. It is difficult to draw useful conclusions until the protein family has been divided into groups.
It is likely that some parts of a protein, especially binding sites, are defined more completely than others. The differences in structure and similarity parameters between groups, together with the lack of conservation across the whole binding site, suggest that while the binding site is conserved across the whole family, the region spatially close to this site can be diverse. However, it is reasonable to assume that residue conservation within groups of similar domains outside the binding site area might correspond to conservation of the protein-protein interface between the SH2 domain and the phosphorylated protein. This, in turn, casts doubt on the notion that SH2 domains recognize only short linear peptide motifs. This assumption is based on the literature and on the data presented in this article. Native ligands are mostly short peptide structures that contact SH2 domains within the pTyr binding site. When an SH2 domain is bound to a protein or a long peptide, however, there are interactions outside the pTyr binding site.