env.Ztest()

Description

Searches for amino acids either overrepresented or avoided at given positions within a sequence environment.

Usage

env.Ztest(pos, ctr, alpha = 0.05)

Arguments

pos a 21 x m matrix containing the absolute frequencies of 21 amino acids at the m positions, in the positive environments.

ctr a 21 x m matrix containing the absolute frequencies of 21 amino acids at the m positions, in the control environments.

alpha significance level.

Value

Returns a list with three elements: (1) a matrix with the values of the Z statistical. (2) A dataframe with information regarding amino acid overrepresented in the positive environments, and (3) a dataframe similar to the previous one, but for amino acids avoided from the positive environments. Please, note that in addition to the 20 proteinogenetic amino acid we are using the symbol X when the target (central) residue is closer to the N-terminal or C-terminal of the protein than the radious used.

Reference

Amino acids in the vicinity of a PTM site are critical in promoting or hindering the incoming substituent on specific amino acid. Thus, knowledge about amino acids surronding modified sites and the correlation between them is very valuable.

Aledo et al. Sci Rep. 2015; 5: 16955

See Also

env.extract(), env.matrices(), env.plot()

Details

The package ptm offers four functions that will assist you in the process of gaining such a knowledge:

A more elaborated vignette showing how these four functions can be coordinately used to gain knowledge can be found here

For the sake of concretion, let’s suppose we are interested in investigating whether the environment of serines that are susceptible to being phosphorylated (phosphosites) can be statistically discriminated against those other enviroments belonging to serines that are not phosphorylatable. To address this issue, we need a collection of positive environments (sequence of amino acids arround a phosphorylatable serine) and its control counterpart (sequence of amino acids arround a non phosphorylatable serine). These sets can be obtained with the help of env.extract(). Suppose we have already extracted the sequence environments of 40 phosphoserines, as well as, the same number of environments corresponding to non-phosphorylatable serines (to be used as control).

positive <- c("ERNLLsVAYKN", "SWRIIsSIEQK", "LNEPLsNEDRN", "LTLWTsDQQDD", "WRVLSsIEQKS",
              "ESELRsICTTV", "ASQAEsKVFYL", "RKILLsEWKSQ", "GGSSCsQTPSR", "QVLLEsGEKST",
              "ARAVYsDADIF", "NRQLPsDGKKM", "SPGYRsVRERT", "KDRTTsEAQTE", "KTEAEsYEGLL",
              "CGRTGsGKSSL", "HVPAPsPQGPG", "IQESEsHSKNG", "LHGKKsGKPPL", "SISAPsSDKPL",
              "NTVANsPQTLL", "PYAHLsKKEKK", "GKQQVsPIRNL", "VCEKQtITKWP", "ASQAGsRKESR",
              "FIRGVsGGERK", "LTPGGsMGLQV", "PCPRYsNPADF", "GITGSsQDTYV", "KLKGKsPGIIF",
              "GQQLAsMLRWT", "YKVLSsLGYHV", "FISGLsDQLIP", "LFRSRsLREFE", "PGIDLsQVYEL",
              "SPRTLsPTPSA", "IRRSSsDFFYS", "PASSTsGSPSR", "TPTSRsPQHYS", "MKARSsSYADP")

control <- c("PEKACsLAKTA", "AYKAAsDIAMT", "LALNFsVFYYE", "ATVVEsSEKAY", "TLSEDsYKDST",
             "AWRVIsSIEQK", "PEKACsLAKTA", "VKKENsVETQA", "MSGGSsCSQTP", "SGHQPsQSRAI",
             "FALVLsALILA", "RTFSEsSVWSQ", "CGSVGsGKTSL", "EDPQQsNPCPE", "STLEYsNERLK",
             "GTMDPsQVPEH", "PIVTPsGEVVV", "AVIQEsESHSK", "LYSNLsKPFLD", "TSTRGsVQMLT",
             "FDEPSsYLDVK", "PTQKFsGGWRM", "HIINLsLTFHG", "SRLESsGKNKS", "EKEILsNINGI",
             "TMIFSsVCYWT", "LVKTLsRLAKG", "QAAQHsPYVAL", "YSGVGsSDGNS", "VPVAPsSSSGG",
             "SSSSGsAAAAL", "SEGEAsEEGLY", "PADQFsDGREP", "GPAEEsRVRRH", "CSSEKsKVTSS",
             "SYGDVsGGVRD", "GIRCDsCEKYI", "SVPASsTSGSP", "RSGPEsGRSSP", "TTAGNsSQVSD")

After that, using env.matrices() we can compute two matrices: one containing the frequencies of each amino acid at the corresponding position within the sequence environment around the positive targen (p-Ser), and its counterpart matrix for the control (environment around non-phosphorilatable serine).

pos <- env.matrices(positive)[[2]]
ctr <- env.matrices(control)[[2]]

Once we have these two matrices, it’s time to use env.Ztest() to test the null hypothesis: the environment of phosphorylatable and non-phosphorylatable serine residues are not different.

results <- env.Ztest(pos, ctr, alpha = 0.05)

This object we have called ‘results’ is a list of three elements. The first one is a matrix showing the Z_{ij} statistical for each amino acid (i) at each position (j) within the sequence environment. This matrix, accounting for the standardized difference in frequencies, is computed according to equation:

Z_{ij} = \frac{f^{pos}_{ij}- f^{ctr}_{ij}}{\sqrt{ \frac{f^{pos}_{ij}(1 - f^{pos}_{ij}) +  f^{ctr}_{ij}(1 - f^{ctr}_{ij})}{n} }}

where f_{ij} is the computed relative frequency for amino acid i at position j, and n is is the total number of environmental sequences analyzed.

kable(round(results[[1]], 1))
-5-4-3-2-1012345
A-0.4-1.40.00.3-0.60-1.50.0-0.5-2.1-1.4
C-0.61.50.0-1.0-0.60-1.50.0-1.00.00.0
D0.0-0.6-1.0-0.6-1.501.20.60.01.8-1.0
E0.0-1.81.2-2.9-0.400.0-0.72.0-0.60.0
F0.01.0-1.0-1.0-1.800.00.00.01.01.8
G0.40.5-1.21.2-0.40-0.30.7-0.5-1.5-0.5
H0.01.0-1.01.0-1.001.00.00.00.0-1.5
I1.50.4-0.50.00.001.8-0.50.01.8-1.8
K1.80.9-1.70.60.600.00.40.70.9-0.4
L0.70.0-0.93.20.60-0.5-1.01.5-1.21.7
M0.0-1.0-1.00.00.001.50.0-1.0-1.00.0
N1.51.01.0-1.8-0.60-0.50.0-1.50.61.5
P-0.40.50.5-0.6-0.402.0-0.61.41.0-0.4
Q0.01.51.7-1.70.000.51.21.00.00.0
R-0.61.71.80.62.10-0.60.6-0.90.01.8
S-1.0-1.1-0.41.50.00-1.1-1.2-0.6-0.30.0
T-1.4-0.70.60.51.80-1.01.00.0-0.5-0.4
V-0.6-0.50.0-1.70.60-0.9-0.9-2.1-1.51.4
W1.00.00.01.00.000.01.0-1.50.60.0
Y0.0-1.00.01.00.60-0.60.00.50.9-1.5
X0.00.00.00.00.000.00.00.00.00.0

A standard score, Z_{ij}, that is much greater than zero or much less than zero indicates that amino acid i is overrepresented or underrepresented at position j, respectively, in an environment of the positive sequences with respect to an environment of the control ones. To this respect, the second element of the list ‘results’ is a dataframe with information regarding those amino acids overrepresented in the positive environments (Z values much higher than 0):

kable(results[[2]])

aaatZpValue
5L-23.1622780.0007827
6R-12.1081850.0175075
9P12.0286020.0212494
10E32.0286020.0212494
4R-31.8048070.0355524
1K-51.8009010.0358593
7T-11.8009010.0358593
8I11.8009010.0358593
11D41.8009010.0358593
12I41.8009010.0358593
13F51.8009010.0358593
15R51.8009010.0358593
2R-41.7293510.0418732
3Q-31.7293510.0418732
14L51.6506960.0494004

The third element of the list is also a dataframe, but in this case with information regarding amino acid underrepresented in the positive environments (Z values much lower thant 0):

kable(results[[3]])

aaatZpValue
3E-2-2.9128760.0017906
8V3-2.1081850.0175075
9A4-2.1081850.0175075
1E-4-1.8009010.0358593
4N-2-1.8009010.0358593
7F-1-1.8009010.0358593
10I5-1.8009010.0358593
2K-3-1.7293510.0418732
5Q-2-1.7293510.0418732
6V-2-1.7293510.0418732

Finally, the function env.Zplot() allows us to visualize the results of these analyses graphycally. For that we choose the amino acid we want to focus on. In this way, we can visualize that at the position -2 (towards the N-terminal) glutamate es significantly avoided from the environment of p-Ser:

env.plot(Z = results[[1]], 
         aa = 'E', 
         pValue = 0.05, 
         title = "Glutamete around p-Ser")

while leucine is clearly enriched at that position:

env.plot(Z = results[[1]], 
         aa = 'L', 
         pValue = 0.05, 
         title = "Leucine around p-Ser")