env.Ztest()

Description

Searches for amino acids either overrepresented or avoided at given positions within a sequence environment.

Usage

env.Ztest(pos, ctr, alpha = 0.05)

Arguments

pos a 21 x m matrix containing the absolute frequencies of 21 amino acids at the m positions, in the positive environments.

ctr a 21 x m matrix containing the absolute frequencies of 21 amino acids at the m positions, in the control environments.

alpha significance level.

Value

Returns a list with three elements: (1) a matrix with the values of the Z statistical. (2) A dataframe with information regarding amino acid overrepresented in the positive environments, and (3) a dataframe similar to the previous one, but for amino acids avoided from the positive environments. Please, note that in addition to the 20 proteinogenetic amino acid we are using the symbol X when the target (central) residue is closer to the N-terminal or C-terminal of the protein than the radious used.

Reference

Amino acids in the vicinity of a PTM site are critical in promoting or hindering the incoming substituent on specific amino acid. Thus, knowledge about amino acids surronding modified sites and the correlation between them is very valuable.

Aledo et al. Sci Rep. 2015; 5: 16955

Details

The package ptm offers four functions that will assist you in the process of gaining such a knowledge:

env.e xtract
env.matrices
env.Ztest (the current document)
env.plot

A more elaborated vignette showing how these four functions can be coordinately used to gain knowledge can be found here

For the sake of concretion, let’s suppose we are interested in investigating whether the environment of serines that are susceptible to being phosphorylated (phosphosites) can be statistically discriminated against those other enviroments belonging to serines that are not phosphorylatable. To address this issue, we need a collection of positive environments (sequence of amino acids arround a phosphorylatable serine) and its control counterpart (sequence of amino acids arround a non phosphorylatable serine). These sets can be obtained with the help of env.extract(). Suppose we have already extracted the sequence environments of 40 phosphoserines, as well as, the same number of environments corresponding to non-phosphorylatable serines (to be used as control).

positive <- c("ERNLLsVAYKN", "SWRIIsSIEQK", "LNEPLsNEDRN", "LTLWTsDQQDD", "WRVLSsIEQKS",
              "ESELRsICTTV", "ASQAEsKVFYL", "RKILLsEWKSQ", "GGSSCsQTPSR", "QVLLEsGEKST",
              "ARAVYsDADIF", "NRQLPsDGKKM", "SPGYRsVRERT", "KDRTTsEAQTE", "KTEAEsYEGLL",
              "CGRTGsGKSSL", "HVPAPsPQGPG", "IQESEsHSKNG", "LHGKKsGKPPL", "SISAPsSDKPL",
              "NTVANsPQTLL", "PYAHLsKKEKK", "GKQQVsPIRNL", "VCEKQtITKWP", "ASQAGsRKESR",
              "FIRGVsGGERK", "LTPGGsMGLQV", "PCPRYsNPADF", "GITGSsQDTYV", "KLKGKsPGIIF",
              "GQQLAsMLRWT", "YKVLSsLGYHV", "FISGLsDQLIP", "LFRSRsLREFE", "PGIDLsQVYEL",
              "SPRTLsPTPSA", "IRRSSsDFFYS", "PASSTsGSPSR", "TPTSRsPQHYS", "MKARSsSYADP")

control <- c("PEKACsLAKTA", "AYKAAsDIAMT", "LALNFsVFYYE", "ATVVEsSEKAY", "TLSEDsYKDST",
             "AWRVIsSIEQK", "PEKACsLAKTA", "VKKENsVETQA", "MSGGSsCSQTP", "SGHQPsQSRAI",
             "FALVLsALILA", "RTFSEsSVWSQ", "CGSVGsGKTSL", "EDPQQsNPCPE", "STLEYsNERLK",
             "GTMDPsQVPEH", "PIVTPsGEVVV", "AVIQEsESHSK", "LYSNLsKPFLD", "TSTRGsVQMLT",
             "FDEPSsYLDVK", "PTQKFsGGWRM", "HIINLsLTFHG", "SRLESsGKNKS", "EKEILsNINGI",
             "TMIFSsVCYWT", "LVKTLsRLAKG", "QAAQHsPYVAL", "YSGVGsSDGNS", "VPVAPsSSSGG",
             "SSSSGsAAAAL", "SEGEAsEEGLY", "PADQFsDGREP", "GPAEEsRVRRH", "CSSEKsKVTSS",
             "SYGDVsGGVRD", "GIRCDsCEKYI", "SVPASsTSGSP", "RSGPEsGRSSP", "TTAGNsSQVSD")

After that, using env.matrices() we can compute two matrices: one containing the frequencies of each amino acid at the corresponding position within the sequence environment around the positive targen (p-Ser), and its counterpart matrix for the control (environment around non-phosphorilatable serine).

pos <- env.matrices(positive)[[2]]
ctr <- env.matrices(control)[[2]]

Once we have these two matrices, it’s time to use env.Ztest() to test the null hypothesis: the environment of phosphorylatable and non-phosphorylatable serine residues are not different.

results <- env.Ztest(pos, ctr, alpha = 0.05)

This object we have called ‘results’ is a list of three elements. The first one is a matrix showing the $Z_{ij}$ statistical for each amino acid ( $i$ ) at each position ( $j$ ) within the sequence environment. This matrix, accounting for the standardized difference in frequencies, is computed according to equation:

$Z_{ij} = \frac{f^{pos}_{ij}- f^{ctr}_{ij}}{\sqrt{ \frac{f^{pos}_{ij}(1 - f^{pos}_{ij}) + f^{ctr}_{ij}(1 - f^{ctr}_{ij})}{n} }}$

where $f_{ij}$ is the computed relative frequency for amino acid $i$ at position $j$ , and $n$ is is the total number of environmental sequences analyzed.

kable(round(results[[1]], 1))

	-5	-4	-3	-2	-1	1	2	3	4	5
A	-0.4	-1.4	0.0	0.3	-0.6	-1.5	0.0	-0.5	-2.1	-1.4
C	-0.6	1.5	0.0	-1.0	-0.6	-1.5	0.0	-1.0	0.0	0.0
D	0.0	-0.6	-1.0	-0.6	-1.5	1.2	0.6	0.0	1.8	-1.0
E	0.0	-1.8	1.2	-2.9	-0.4	0.0	-0.7	2.0	-0.6	0.0
F	0.0	1.0	-1.0	-1.0	-1.8	0.0	0.0	0.0	1.0	1.8
G	0.4	0.5	-1.2	1.2	-0.4	-0.3	0.7	-0.5	-1.5	-0.5
H	0.0	1.0	-1.0	1.0	-1.0	1.0	0.0	0.0	0.0	-1.5
I	1.5	0.4	-0.5	0.0	0.0	1.8	-0.5	0.0	1.8	-1.8
K	1.8	0.9	-1.7	0.6	0.6	0.0	0.4	0.7	0.9	-0.4
L	0.7	0.0	-0.9	3.2	0.6	-0.5	-1.0	1.5	-1.2	1.7
M	0.0	-1.0	-1.0	0.0	0.0	1.5	0.0	-1.0	-1.0	0.0
N	1.5	1.0	1.0	-1.8	-0.6	-0.5	0.0	-1.5	0.6	1.5
P	-0.4	0.5	0.5	-0.6	-0.4	2.0	-0.6	1.4	1.0	-0.4
Q	0.0	1.5	1.7	-1.7	0.0	0.5	1.2	1.0	0.0	0.0
R	-0.6	1.7	1.8	0.6	2.1	-0.6	0.6	-0.9	0.0	1.8
S	-1.0	-1.1	-0.4	1.5	0.0	-1.1	-1.2	-0.6	-0.3	0.0
T	-1.4	-0.7	0.6	0.5	1.8	-1.0	1.0	0.0	-0.5	-0.4
V	-0.6	-0.5	0.0	-1.7	0.6	-0.9	-0.9	-2.1	-1.5	1.4
W	1.0	0.0	0.0	1.0	0.0	0.0	1.0	-1.5	0.6	0.0
Y	0.0	-1.0	0.0	1.0	0.6	-0.6	0.0	0.5	0.9	-1.5
X	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

A standard score, $Z_{ij}$ , that is much greater than zero or much less than zero indicates that amino acid $i$ is overrepresented or underrepresented at position $j$ , respectively, in an environment of the positive sequences with respect to an environment of the control ones. To this respect, the second element of the list ‘results’ is a dataframe with information regarding those amino acids overrepresented in the positive environments (Z values much higher than 0):

kable(results[[2]])

	aa	at	Z	pValue
5	L	-2	3.162278	0.0007827
6	R	-1	2.108185	0.0175075
9	P	1	2.028602	0.0212494
10	E	3	2.028602	0.0212494
4	R	-3	1.804807	0.0355524
1	K	-5	1.800901	0.0358593
7	T	-1	1.800901	0.0358593
8	I	1	1.800901	0.0358593
11	D	4	1.800901	0.0358593
12	I	4	1.800901	0.0358593
13	F	5	1.800901	0.0358593
15	R	5	1.800901	0.0358593
2	R	-4	1.729351	0.0418732
3	Q	-3	1.729351	0.0418732
14	L	5	1.650696	0.0494004

The third element of the list is also a dataframe, but in this case with information regarding amino acid underrepresented in the positive environments (Z values much lower thant 0):

kable(results[[3]])

	aa	at	Z	pValue
3	E	-2	-2.912876	0.0017906
8	V	3	-2.108185	0.0175075
9	A	4	-2.108185	0.0175075
1	E	-4	-1.800901	0.0358593
4	N	-2	-1.800901	0.0358593
7	F	-1	-1.800901	0.0358593
10	I	5	-1.800901	0.0358593
2	K	-3	-1.729351	0.0418732
5	Q	-2	-1.729351	0.0418732
6	V	-2	-1.729351	0.0418732

Finally, the function env.Zplot() allows us to visualize the results of these analyses graphycally. For that we choose the amino acid we want to focus on. In this way, we can visualize that at the position -2 (towards the N-terminal) glutamate es significantly avoided from the environment of p-Ser:

env.plot(Z = results[[1]], 
         aa = 'E', 
         pValue = 0.05, 
         title = "Glutamete around p-Ser")

while leucine is clearly enriched at that position:

env.plot(Z = results[[1]], 
         aa = 'L', 
         pValue = 0.05, 
         title = "Leucine around p-Ser")