get.seq()

Description

Imports a protein (or nucleotide) sequence from a selected database.

Usage

get.seq(id, db = 'uniprot', as.string = TRUE)

Arguments

id the identifier of the protein of interest.

db a character string specifying the desired database; it must be one of ‘uniprot’, ‘metosite’, ‘pdb’, ‘kegg-aa’, ‘kegg-nt’.

as.string logical, if TRUE the imported sequence will be returned as a character string.

Value

Returns a protein (or nucleotide) sequence, either as a character vector or as a character string.
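The two return shapes are easy to convert between in base R. The following is a minimal sketch that uses a short made-up peptide rather than a downloaded sequence, and it assumes that the character vector holds one residue per element:

```r
# Toy peptide used only for illustration (not fetched from any database)
seq_string <- "MAYPMQ"

# String -> vector with one residue per element
seq_vector <- strsplit(seq_string, split = "")[[1]]

# Vector -> single string
seq_back <- paste(seq_vector, collapse = "")

length(seq_vector)   # number of residues
nchar(seq_back)      # the same number, counted as characters
```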

Details

The ptm package offers a set of functions that help us download and handle sequences from different databases.

At the heart of this set of functions is get.seq(), which imports a biological sequence (either as a string or an array) from the selected database.

Regarding the identifiers, it should be noted that MetOSite uses the same type of protein ID as UniProt. However, if the chosen database is PDB, the identifier should be the 4-character unique identifier characteristic of PDB, followed by a colon and the chain of interest (for proteins with quaternary structure). For instance, ‘2OCC:B’ means we are interested in the sequence of chain B from the structure 2OCC. Please note that while the function is case-insensitive regarding the PDB ID, that is not the case for the letter that identifies the chain, which must be a capital letter. KEGG uses its own IDs (see examples).
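The PDB identifier rules just described can be checked before sending a query. The helper below is hypothetical (it is not part of ptm) and is only a sketch of those rules, assuming the chain is always a single capital letter:

```r
# Hypothetical helper, not part of ptm: split a query such as '2occ:B'
# into its PDB ID and chain, and check both pieces.
parse_pdb_query <- function(query) {
  parts <- strsplit(query, split = ":", fixed = TRUE)[[1]]
  pdb_id <- toupper(parts[1])   # the PDB ID is case-insensitive
  chain <- if (length(parts) > 1) parts[2] else NA_character_
  if (nchar(pdb_id) != 4) {
    stop("The PDB ID should have exactly 4 characters")
  }
  if (!is.na(chain) && !grepl("^[A-Z]$", chain)) {
    stop("The chain should be a single capital letter")
  }
  list(id = pdb_id, chain = chain)
}

parse_pdb_query("2occ:B")   # id "2OCC", chain "B"
parse_pdb_query("2OCC")     # id "2OCC", chain NA (no chain requested)
```

A query such as '2occ:b' stops with an error here, which mirrors the requirement that the chain letter be capitalized.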


# Two valid queries:
chainB <- get.seq('2occ:B', db = 'pdb')

##    PDB has ALT records, taking A only, rm.alt=TRUE

CHAINB <- get.seq('2OCC:B', db = 'pdb')

##    PDB has ALT records, taking A only, rm.alt=TRUE

chainB == CHAINB

## [1] TRUE

If we request the PDB sequence of an oligomeric protein but don’t provide the chain identifier, the function will return the concatenated sequences of all the chains:

all_chains <- get.seq('2occ', db = 'pdb')

##    PDB has ALT records, taking A only, rm.alt=TRUE

nchar(all_chains)

## [1] 3612

We can check that chain B is indeed found within this super-sequence:

gregexpr(chainB, all_chains)

## [[1]]
## [1]  515 2321
## attr(,"match.length")
## [1] 227 227
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
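The object returned by gregexpr() carries the start positions and the match lengths, which is enough to pull each hit back out of the longer sequence. A minimal sketch with toy strings standing in for all_chains and chainB (fixed = TRUE is used here so that the pattern is read literally):

```r
# Toy strings standing in for all_chains and chainB
super_seq <- "AAAMKVLAAAMKVLAA"
sub_seq <- "MKVL"

hits <- gregexpr(sub_seq, super_seq, fixed = TRUE)[[1]]
starts <- as.integer(hits)                        # start position of each match
ends <- starts + attr(hits, "match.length") - 1   # end position of each match

# Extract every match and confirm that it equals the query
found <- substring(super_seq, starts, ends)
all(found == sub_seq)
```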

We have been using chain B from the structure 2OCC as an example to show the usage of get.seq(), but you may be wondering what protein it is. If so, you may find the function id.features() useful: it returns diverse features related to the queried protein. However, this function requires the UniProt ID as its argument, and all we have so far is the PDB ID. Fortunately, that won’t be a problem, because id.mapping() provides the ID conversion we need.

up_id <- id.mapping('2occ', from = 'pdb', to = 'uniprot')
up_id

##  [1] "P00396" "P00415" "P00423" "P00426" "P00428" "P00429" "P00430" "P04038" "P07470" "P07471" "P10175"
## [12] "P13183" "P68530"

We can verify that the structure 2OCC consists of 13 different polypeptide chains. We are now in a position to find out more about 2OCC. For this purpose we are going to build, with the help of id.features(), a data frame with some features for each polypeptide (type ?id.features in the RStudio console for further details).

features_2occ <- data.frame(Entry = rep(NA, length(up_id)),
                            Status = rep(NA, length(up_id)),
                            Entry_name = rep(NA, length(up_id)),
                            Organism = rep(NA, length(up_id)))


for (i in seq_along(up_id)) {
  chain <- id.features(up_id[i], features = 'ec')
  features_2occ$Entry[i] <- chain$Entry
  features_2occ$Status[i] <- chain$Status
  features_2occ$Entry_name[i] <- chain$Entry_name
  features_2occ$Organism[i] <- chain$Organism
}
library(knitr)
kable(features_2occ)

Entry	Status	Entry_name	Organism
P00396	reviewed	COX1_BOVIN	Bos taurus (Bovine)
P00415	reviewed	COX3_BOVIN	Bos taurus (Bovine)
P00423	reviewed	COX41_BOVIN	Bos taurus (Bovine)
P00426	reviewed	COX5A_BOVIN	Bos taurus (Bovine)
P00428	reviewed	COX5B_BOVIN	Bos taurus (Bovine)
P00429	reviewed	CX6B1_BOVIN	Bos taurus (Bovine)
P00430	reviewed	COX7C_BOVIN	Bos taurus (Bovine)
P04038	reviewed	COX6C_BOVIN	Bos taurus (Bovine)
P07470	reviewed	CX7A1_BOVIN	Bos taurus (Bovine)
P07471	reviewed	CX6A2_BOVIN	Bos taurus (Bovine)
P10175	reviewed	COX8B_BOVIN	Bos taurus (Bovine)
P13183	reviewed	COX7B_BOVIN	Bos taurus (Bovine)
P68530	reviewed	COX2_BOVIN	Bos taurus (Bovine)

Now that we know that the 2OCC structure corresponds to cytochrome c oxidase (COX), let’s suppose that we are interested in obtaining the DNA sequence that codes for the COX2 chain. We know the UniProt ID for this chain, but we need the ID of that sequence in the suitable database: KEGG. An important part of handling sequences from different databases is the conversion of identifiers between them, and we already know how to do that with id.mapping():

kegg_id <- id.mapping(id = 'P68530', from = 'uniprot', to = 'kegg')
kegg_id

##     up:P68530 
## "bta:3283880"
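As the output shows, the result is a named character vector, and the KEGG identifier combines an organism code with a gene ID separated by a colon. If the two parts are needed separately, a small base R sketch (the value is retyped from the output above, so no query is made) could look like this:

```r
# Value copied from the output above, rebuilt by hand so no query is needed
kegg_id <- c("up:P68530" = "bta:3283880")

# Drop the name, then split on the colon
parts <- strsplit(unname(kegg_id), split = ":", fixed = TRUE)[[1]]
organism_code <- parts[1]   # "bta"
gene_id <- parts[2]         # "3283880"
```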

Thus, we can proceed to download the nucleotide sequence from KEGG:

cox2_dna <- get.seq(kegg_id, db = 'kegg-nt')
cox2_dna

## [1] "ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTACTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTACATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAAGTAGAGACAATCTGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCTTTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATAGGACATCAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCCTACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAATCGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTATTACACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAACCAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGCGGGTCAAACCACAGTTTCATGCCCATTGTCCTTGAGTTAGTCCCACTAAAGTACTTTGAAAAATGATCTGCGTCAATATTATAA"

Next, we are going to check that this DNA sequence does indeed encode the polypeptide sequence contained in the object chainB that we obtained above. To carry out this task we are going to use the packages seqinr and bio3d. If you have not done so already, install them with the command:

install.packages(c("seqinr", "bio3d"))

library(seqinr)

## 
## Attaching package: 'seqinr'

## The following object is masked from 'package:ptm':
## 
##     read.fasta

translated <- seqinr::translate(seqinr::s2c(cox2_dna), numcode = 2)
translated <- paste(translated, collapse = "")

Observe that, since COX2 is an mtDNA-encoded protein, we have passed the argument numcode = 2 to indicate that the vertebrate mitochondrial genetic code should be used.
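To see why the choice of genetic code matters, the sketch below translates a short fragment with two deliberately partial codon tables. The helper translate_with() and both tables are illustrative only (they are not part of seqinr or ptm) and cover just the codons used here; the two differences shown, ATA and TGA, are among those that separate the vertebrate mitochondrial code from the standard one.

```r
# Partial codon tables covering only the codons in this example
standard <- c(ATG = "M", GCA = "A", TAT = "Y", CCC = "P",
              ATA = "I", CAA = "Q", TGA = "*")

# Vertebrate mitochondrial code: ATA -> Met and TGA -> Trp
mito <- standard
mito[c("ATA", "TGA")] <- c("M", "W")

# Hypothetical helper: split a DNA string into codons and look them up
translate_with <- function(dna, code) {
  starts <- seq(1, nchar(dna) - 2, by = 3)
  codons <- substring(dna, starts, starts + 2)
  paste(unname(code[codons]), collapse = "")
}

# First six codons of cox2_dna, followed by a TGA codon for illustration
dna <- "ATGGCATATCCCATACAATGA"
translate_with(dna, standard)   # "MAYPIQ*"
translate_with(dna, mito)       # "MAYPMQW"
```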

library(bio3d)

## 
## Attaching package: 'bio3d'

## The following objects are masked from 'package:seqinr':
## 
##     consensus, read.fasta, write.fasta

## The following object is masked from 'package:ptm':
## 
##     get.seq

translated <- as.character(translated)
sequences <- seqbind(chainB, translated, blank = '-') 
myaln <- seqaln(sequences, id = c("chainB", "translated"))
myaln

##              1        .         .         .         .         .         60 
## chainB       MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
## translated   MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
##              ************************************************************ 
##              1        .         .         .         .         .         60 
## 
##             61        .         .         .         .         .         120 
## chainB       VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
## translated   VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
##              ************************************************************ 
##             61        .         .         .         .         .         120 
## 
##            121        .         .         .         .         .         180 
## chainB       YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
## translated   YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
##              ************************************************************ 
##            121        .         .         .         .         .         180 
## 
##            181        .         .         .         .      227 
## chainB       QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
## translated   QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
##              *********************************************** 
##            181        .         .         .         .      227 
## 
## Call:
##   seqaln(aln = sequences, id = c("chainB", "translated"))
## 
## Class:
##   fasta
## 
## Alignment dimensions:
##   2 sequence rows; 227 position columns (227 non-gap, 0 gap) 
## 
## + attr: id, ali, call

Indeed, both protein sequences are identical!
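Since the alignment contains no gaps, the same conclusion can also be reached without aligning, for instance with identical(chainB, translated). When two equal-length sequences do differ, it helps to know where. The helper below is a hypothetical base R sketch for that, shown with toy peptides:

```r
# Hypothetical helper: positions at which two equal-length sequences differ
mismatch_positions <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  which(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

mismatch_positions("MAYPMQ", "MAYPMQ")   # integer(0): the sequences are identical
mismatch_positions("MAYPMQ", "MAYPIQ")   # 5
```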

A straightforward alternative for getting the DNA coding sequence is the function prot2codon(). This function accepts as its argument either the UniProt ID or the PDB ID. In the latter case, we have to provide the chain ID.

prot_dna <- prot2codon(prot = '2occ', chain = 'B')

##    PDB has ALT records, taking A only, rm.alt=TRUE

paste(prot_dna$codon, collapse = "")

## [1] "ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTACTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTACATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAAGTAGAGACAATCNATGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCTTTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATAGGACATNANACAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCCTACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAATCGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTANATTACACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAACCAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGCGGGTCAAACCACAGTTTCATGCCCATTGTCCTTGAGTTAGTCCCACTAAAGNATACTTTGAAAAATGA"

When using prot2codon(), a caveat to keep in mind is that the translation is carried out using the standard genetic code. For that reason, methionines encoded by ATA are marked as check = FALSE, because the script expected isoleucine instead of methionine.

library(knitr)
kable(head(prot_dna))

id	chain	pos	aa	codon	check
2occ	B	1	M	ATG	TRUE
2occ	B	2	A	GCA	TRUE
2occ	B	3	Y	TAT	TRUE
2occ	B	4	P	CCC	TRUE
2occ	B	5	M	ATA	FALSE
2occ	B	6	Q	CAA	TRUE
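The check column makes it easy to isolate the positions that deserve a second look. A minimal sketch, using a data frame retyped from the six rows shown above (not the full prot_dna object, where other codons may be flagged as well):

```r
# First six rows of prot_dna, retyped from the table above
prot_head <- data.frame(
  pos = 1:6,
  aa = c("M", "A", "Y", "P", "M", "Q"),
  codon = c("ATG", "GCA", "TAT", "CCC", "ATA", "CAA"),
  check = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rows where the standard code disagrees with the observed residue
flagged <- prot_head[!prot_head$check, ]
flagged

# Are all flagged residues methionines encoded by ATA?
all(flagged$aa == "M" & flagged$codon == "ATA")
```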

Finally, we present the function species.mapping(), which maps a protein ID (either from UniProt or PDB) to its corresponding organism. We have seen above that this functionality can be obtained using the more general-purpose function id.features(), but when all we want is to assign a species to a given protein, the easier and faster option is:

species.mapping('2occ', db = 'pdb')

## [1] "Bos taurus"