Whenever you embark yourself on a computational project, you should not forget the maxim that affirms “junk in means junk out”. So, the very first step must be to build a clean and solid dataset. To this respect, ptm can be your friend!
For the sake of concretion, let’s suppose we are interested in investigating whether, in human proteins, the environment of serines that are susceptible to being phosphorylated (phosphosites) can be statistically discriminated against those other enviroments belonging to serines that are not phosphorilatable. Alternativaly, we may be interested in addressing whether or not the sequence environments of those phospho-serines that have been described as playing a regulatory function, can be statistically distinguished from those environments belonging to phospho-serines detected in high-throughput analyses, but for which we have no information about their biological relevance.
To address these researches, the raw dataset we are going to build consist in a dataframe where each row will correspond to a serine residue from the human phosphoproteome, and the columns will be:
- UniProt ID of the phosphoprotein.
- Position of the serine residue in the primary structure of the protein.
- Whether or not the serine is phosphorylatable.
- Whether or not the serine has been described as regulatory.
- Sequence environment (flanking amino acids at positions from –10 to +10 relative to the central Ser).
As starting point, we are going to download a raw dataframe containing phosphosites reported by the following popular databases:
url <- "https://github.com/jcaledo/ptm_db/raw/master/p_db.Rda"
destfile <- tempfile(pattern = 'all',
tmpdir = tempdir(),
fileext = '.Rda')
download.file(url, destfile, quiet = TRUE)
load(destfile)
head(p_db)
## up_id organism modification database ## 1 Q9CTA4 Mus musculus S10-p PSP ## 2 Q9CTA4 Mus musculus T11-p PSP ## 3 P31946 Homo sapiens T2-p PSP ## 4 Q9CQV8 Mus musculus T2-p PSP ## 5 P31946 Homo sapiens S6-p PSP ## 6 Q9CQV8 Mus musculus S6-p PSP
These lines of code will load an object (dataframe) called p_db
in the global environment of R. This dataframe contains data regarding phosphosites found in different databases and for different species. So, we need to filter out non-human phosphoproteins, as well as those human sites corresponding to tyrosine or threonine.
hpSer <- p_db[which(p_db$organism == 'Homo sapiens' &
p_db$database == 'PSP' &
grepl('S', p_db$modification)), c(1,3) ]
head(hpSer)
## up_id modification ## 5 P31946 S6-p ## 10 P31946 S39-p ## 11 P31946 S47-p ## 16 P31946 S59-p ## 17 P31946 S60-p ## 22 P31946 S66-p
Then, we’re going to build a dataframe (df) where each row corresponds to a single human phosphoprotein:
id <- unique(hpSer$up_id)
df <- data.frame(id = id, seq = rep(NA, length(id)), pSer = rep(NA, length(id)))
sites <- function(x){
t <- hpSer$modification[which(hpSer$up_id == x)]
output <- lapply(t, function(y) stringr::str_extract_all(y,"([0-9]+)", simplify = TRUE))
return(as.numeric(unlist(output)))
}
df$pSer <- lapply(id, sites)
df$seq <- lapply(id, get.seq) # Please, note that the execution of this command can take a while!
Next, after removing from our dataframe those entries that are currently considered as obsolete in UniProt, we will add the positions where serine residues are present, regardless of their phospho-status:
obsolete <- df$id[which(nchar(df$seq) < 10)]
df <- df[which(!df$id %in% obsolete), ]
df$allSer <- NA
for (i in 1:nrow(df)){
print(i)
df$allSer[i] <- list(gregexpr("S", df$seq[i])[[1]])
}
df$non_pSer <- NA
for (i in 1:nrow(df)){
print(i)
df$non_pSer[i] <- list(setdiff(df$allSer[[i]], df$pSer[[i]]))
}
Before continuing with our process, let’s carry out some data-quality tests:
- Test1: Check that the intersection between df$allSer and df$pSer is equal to df$pSer
- Test2: Check that the union df$pSer and df$non_pSer is equal to df$allSer
Next, we build a dataframe with as many rows as serine sites we are analyzing:
ser.sites <- function(prot_id){
p <- df$pSer[[which(df$id == prot_id)]]
np <- df$non_pSer[[which(df$id == prot_id)]]
all <- c(p, np)
output <- data.frame(id = rep(prot_id, length(all)),
pos = all,
pSer = rep(FALSE, length(all)))
output$pSer[which(output$pos %in% p)] <- TRUE
return(output)
}
serine_sites <- ser.sites(df$id[1]) # seed
for (i in 2:nrow(df)){ # This step can take a few minutes!
print(i)
serine_sites <- rbind(serine_sites, ser.sites(df$id[i]))
}
serine_sites <- serine_sites[complete.cases(serine_sites), ] # When a protein hasn't non-phosphorylatable Ser
Searching for regulatory phospho-serines
We have collected tens of thousands of phospho-serine sites in our data set. However, as pointed out by Gustav E. Lienhard, there are reasons for believing that some portion of these reported phosphorylations might have little or no functional importance. Thus, we are going to use an astringent criterion to label a phosphosite as regulatory: it should have been empirically proved that the phosphorylation of that site has an effect on the phosphoprotein. To search for those phosphosites fulfilling such criterion, we will make use of the function reg.scan():
serine_sites$regulatory <- NA
prot <- unique(serine_sites$id)
for (i in 11:length(prot)){
print(i)
t <- reg.scan(prot[i])
if (is.data.frame(t)){
w <- t$modification
w <- unlist(lapply(w, function(x) strsplit(x, "-")))
w <- w[grepl("S", w)]
reg <- as.numeric(substr(w, 2, nchar(w)))
serine_sites$regulatory[which(serine_sites$id == prot[i] & serine_sites$pos %in% reg)] <- TRUE
}
closeAllConnections()
}
serine_sites$regulatory[is.na(serine_sites$regulatory)] <- FALSE
Extracting the sequence environments
We can start analyzing the sequence environment of these phospho-acceptors. To this end, we’ll start extracting the sequence environment using the function env.extract():
serine_sites$env <- NA
for (i in 1:nrow(serine_sites)){
print(i)
serine_sites$env[i] <- env.extract(prot = df$seq[which(df$id == serine_sites$id[i])][[1]],
c = serine_sites$pos[i],
r = 10)$Positive
}
Only serine residues with ten or more neighbors in both directions will be considered in this analysis:
serine_sites <- serine_sites[!grepl('X', serine_sites$env), ]
head(serine_sites)
## id pos pSer regulatory env ## 2 P31946 39 TRUE FALSE KAVTEQGHELsNEERNLLSVA ## 3 P31946 47 TRUE FALSE ELSNEERNLLsVAYKNVVGAR ## 4 P31946 59 TRUE FALSE AYKNVVGARRsSWRVISSIEQ ## 5 P31946 60 TRUE FALSE YKNVVGARRSsWRVISSIEQK ## 6 P31946 66 TRUE FALSE ARRSSWRVISsIEQKTERNEK ## 7 P31946 116 TRUE FALSE YLIPNATQPEsKVFYLKMKGD
This is the dataframe we need to start addressing our hypothesis: Regulatory phosphoserines have a differential sequence environment with respect to non-regulatory phosphoserines. A detailed description of how to carry out such analysis can be found in the vignette Regulatory versus Non-Regulatory Phosphoserines.