Building Up a Dataset for the Study of Phosphoserines

Whenever you embark yourself on a computational project, you should not forget the maxim that affirms “junk in means junk out”. So, the very first step must be to build a clean and solid dataset. To this respect, ptm can be your friend!

For the sake of concretion, let’s suppose we are interested in investigating whether, in human proteins, the environment of serines that are susceptible to being phosphorylated (phosphosites) can be statistically discriminated against those other enviroments belonging to serines that are not phosphorilatable. Alternativaly, we may be interested in addressing whether or not the sequence environments of those phospho-serines that have been described as playing a regulatory function, can be statistically distinguished from those environments belonging to phospho-serines detected in high-throughput analyses, but for which we have no information about their biological relevance.

To address these researches, the raw dataset we are going to build consist in a dataframe where each row will correspond to a serine residue from the human phosphoproteome, and the columns will be:

UniProt ID of the phosphoprotein.
Position of the serine residue in the primary structure of the protein.
Whether or not the serine is phosphorylatable.
Whether or not the serine has been described as regulatory.
Sequence environment (flanking amino acids at positions from –10 to +10 relative to the central Ser).

As starting point, we are going to download a raw dataframe containing phosphosites reported by the following popular databases:

url <- "https://github.com/jcaledo/ptm_db/raw/master/p_db.Rda"
destfile <- tempfile(pattern = 'all',
                     tmpdir = tempdir(),
                     fileext = '.Rda')
download.file(url, destfile, quiet = TRUE)
load(destfile)
head(p_db)

##    up_id     organism modification database
## 1 Q9CTA4 Mus musculus        S10-p      PSP
## 2 Q9CTA4 Mus musculus        T11-p      PSP
## 3 P31946 Homo sapiens         T2-p      PSP
## 4 Q9CQV8 Mus musculus         T2-p      PSP
## 5 P31946 Homo sapiens         S6-p      PSP
## 6 Q9CQV8 Mus musculus         S6-p      PSP

These lines of code will load an object (dataframe) called p_db in the global environment of R. This dataframe contains data regarding phosphosites found in different databases and for different species. So, we need to filter out non-human phosphoproteins, as well as those human sites corresponding to tyrosine or threonine.

hpSer <- p_db[which(p_db$organism == 'Homo sapiens' &
                      p_db$database == 'PSP' &
                      grepl('S', p_db$modification)), c(1,3) ]
head(hpSer)

##     up_id modification
## 5  P31946         S6-p
## 10 P31946        S39-p
## 11 P31946        S47-p
## 16 P31946        S59-p
## 17 P31946        S60-p
## 22 P31946        S66-p

Then, we’re going to build a dataframe (df) where each row corresponds to a single human phosphoprotein:

id <- unique(hpSer$up_id)
df <- data.frame(id = id, seq = rep(NA, length(id)), pSer = rep(NA, length(id)))

sites <- function(x){
  t <- hpSer$modification[which(hpSer$up_id == x)]
  output <- lapply(t,  function(y) stringr::str_extract_all(y,"([0-9]+)", simplify = TRUE))
  return(as.numeric(unlist(output)))
}

df$pSer <- lapply(id, sites) 
df$seq <- lapply(id, get.seq) # Please, note that the execution of this command can take a while!

Next, after removing from our dataframe those entries that are currently considered as obsolete in UniProt, we will add the positions where serine residues are present, regardless of their phospho-status:

obsolete <- df$id[which(nchar(df$seq) < 10)]
df <- df[which(!df$id %in% obsolete), ]
df$allSer <- NA
for (i in 1:nrow(df)){
   print(i)
   df$allSer[i] <- list(gregexpr("S", df$seq[i])[[1]])
}
df$non_pSer <- NA
for (i in 1:nrow(df)){
   print(i)
   df$non_pSer[i] <- list(setdiff(df$allSer[[i]], df$pSer[[i]]))
}

Before continuing with our process, let’s carry out some data-quality tests:

Test1: Check that the intersection between df$allSer and df$pSer is equal to df$pSer
Test2: Check that the union df$pSer and df$non_pSer is equal to df$allSer

Next, we build a dataframe with as many rows as serine sites we are analyzing:

ser.sites <- function(prot_id){
  p <- df$pSer[[which(df$id == prot_id)]]
  np <- df$non_pSer[[which(df$id == prot_id)]]
  all <- c(p, np)
  output <- data.frame(id = rep(prot_id, length(all)),
                       pos = all,
                       pSer = rep(FALSE, length(all)))
  output$pSer[which(output$pos %in% p)] <- TRUE
  return(output)
}

serine_sites <- ser.sites(df$id[1]) # seed
for (i in 2:nrow(df)){ # This step can take a few minutes!
   print(i)
   serine_sites <- rbind(serine_sites, ser.sites(df$id[i]))
}
serine_sites <- serine_sites[complete.cases(serine_sites), ] # When a protein hasn't non-phosphorylatable Ser

Searching for regulatory phospho-serines

We have collected tens of thousands of phospho-serine sites in our data set. However, as pointed out by Gustav E. Lienhard, there are reasons for believing that some portion of these reported phosphorylations might have little or no functional importance. Thus, we are going to use an astringent criterion to label a phosphosite as regulatory: it should have been empirically proved that the phosphorylation of that site has an effect on the phosphoprotein. To search for those phosphosites fulfilling such criterion, we will make use of the function reg.scan():

serine_sites$regulatory <- NA
prot <- unique(serine_sites$id)
for (i in 11:length(prot)){
   print(i)
   t <- reg.scan(prot[i])
   if (is.data.frame(t)){
     w <- t$modification
     w <- unlist(lapply(w, function(x) strsplit(x, "-")))
     w <- w[grepl("S", w)]
     reg <- as.numeric(substr(w, 2, nchar(w)))
     serine_sites$regulatory[which(serine_sites$id == prot[i] & serine_sites$pos %in% reg)] <- TRUE
    }
    closeAllConnections()
}
serine_sites$regulatory[is.na(serine_sites$regulatory)] <- FALSE

Extracting the sequence environments

We can start analyzing the sequence environment of these phospho-acceptors. To this end, we’ll start extracting the sequence environment using the function env.extract():

serine_sites$env <- NA
for (i in 1:nrow(serine_sites)){
   print(i)
   serine_sites$env[i] <- env.extract(prot = df$seq[which(df$id == serine_sites$id[i])][[1]],
                                      c = serine_sites$pos[i],
                                      r = 10)$Positive
}

Only serine residues with ten or more neighbors in both directions will be considered in this analysis:

serine_sites <- serine_sites[!grepl('X', serine_sites$env), ]
head(serine_sites)

##       id pos pSer regulatory                   env
## 2 P31946  39 TRUE      FALSE KAVTEQGHELsNEERNLLSVA
## 3 P31946  47 TRUE      FALSE ELSNEERNLLsVAYKNVVGAR
## 4 P31946  59 TRUE      FALSE AYKNVVGARRsSWRVISSIEQ
## 5 P31946  60 TRUE      FALSE YKNVVGARRSsWRVISSIEQK
## 6 P31946  66 TRUE      FALSE ARRSSWRVISsIEQKTERNEK
## 7 P31946 116 TRUE      FALSE YLIPNATQPEsKVFYLKMKGD

This is the dataframe we need to start addressing our hypothesis: Regulatory phosphoserines have a differential sequence environment with respect to non-regulatory phosphoserines. A detailed description of how to carry out such analysis can be found in the vignette Regulatory versus Non-Regulatory Phosphoserines.

Searching for regulatory phospho-serines

Extracting the sequence environments

Compártelo: