Description
Imports a protein sequence from a selected database
Usage
get.seq(id, db = 'uniprot', as.string = TRUE)
Arguments
id
the identifier of the protein of interest.
db
a character string specifying the desired database; it must be one of ‘uniprot’, ‘metosite’, ‘pdb’, ‘kegg-aa’, ‘kegg-nt’.
as.string
logical, if TRUE the imported sequence will be returned as a character string.
Value
Returns a protein (or nucleotide) sequence either as a character vector or as a character string.
Details
The ptm package offers a set of functions to help us download and handle sequences from different databases:
- get.seq (current tutorial)
- prot2codon
- id.mapping
- id.features
- species.mapping
At the heart of this set of functions is get.seq(), which imports a biological sequence (either as a string or an array) from the selected database.
Regarding the identifiers, note that MetOSite uses the same type of protein ID as UniProt. If the chosen database is PDB, however, the identifier should be the 4-character PDB ID, optionally followed by a colon and the chain of interest (for proteins with quaternary structure). For instance, ‘2OCC:B’ means that we are interested in the sequence of chain B from the structure 2OCC. Please note that, while the function is case-insensitive regarding the PDB ID, this is not so for the letter that identifies the chain, which must be a capital letter. KEGG uses its own IDs (see examples).
# Two valid queries:
chainB <- get.seq('2occ:B', db = 'pdb')
## PDB has ALT records, taking A only, rm.alt=TRUE
CHAINB <- get.seq('2OCC:B', db = 'pdb')
## PDB has ALT records, taking A only, rm.alt=TRUE
chainB == CHAINB
## [1] TRUE
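The identifier rules described above can be checked before sending a query. The following is only a minimal sketch: parse_pdb_id() is a hypothetical helper, not part of ptm, and it assumes the chain is a single capital letter, as stated in this tutorial.

```r
# Hypothetical helper (not part of ptm): split and validate 'XXXX' or 'XXXX:C'.
parse_pdb_id <- function(id) {
  parts <- strsplit(id, ":", fixed = TRUE)[[1]]
  pdb <- toupper(parts[1])                 # the PDB ID is case-insensitive
  if (!grepl("^[0-9A-Z]{4}$", pdb)) {
    stop("the PDB ID must have exactly 4 alphanumeric characters")
  }
  # No chain given: NA signals that all chains would be requested
  chain <- if (length(parts) > 1) parts[2] else NA_character_
  if (!is.na(chain) && !grepl("^[A-Z]$", chain)) {
    stop("the chain must be a single capital letter")
  }
  list(pdb = pdb, chain = chain)
}

with_chain <- parse_pdb_id("2occ:B")   # pdb = "2OCC", chain = "B"
no_chain   <- parse_pdb_id("2OCC")     # chain is NA
```

A lowercase chain such as '2occ:b' makes the helper stop with an error, which mirrors the case-sensitivity rule for the chain letter.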
If we request the PDB sequence of an oligomeric protein without providing a chain identifier, the function returns the concatenated sequences of all the chains:
all_chains <- get.seq('2occ', db = 'pdb')
## PDB has ALT records, taking A only, rm.alt=TRUE
nchar(all_chains)
## [1] 3612
We can check that chain B is indeed found within this super-sequence:
gregexpr(chainB, all_chains)
## [[1]]
## [1]  515 2321
## attr(,"match.length")
## [1] 227 227
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
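The output of gregexpr() reports where each match starts and how long it is; here the two start positions suggest that the sequence of chain B occurs twice in the concatenated sequence. As a minimal sketch of how such positions can be turned back into substrings, the toy example below uses made-up sequences (the names chain and super_seq are only illustrative) and passes fixed = TRUE so that the pattern is matched literally rather than as a regular expression.

```r
# Toy data: the 'chain' occurs twice within the concatenated sequence.
chain <- "MAYP"
super_seq <- paste0("GGG", chain, "TTTTT", chain, "CC")

hits <- gregexpr(chain, super_seq, fixed = TRUE)[[1]]
starts <- as.integer(hits)                       # start position of each match
ends <- starts + attr(hits, "match.length") - 1  # end position of each match

starts                                # 4 13
substring(super_seq, starts, ends)    # "MAYP" "MAYP"
```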
We have been using chain B from the structure 2OCC as an example to show the usage of get.seq(), but you may be wondering what protein it is. If that is the case, you may find the function id.features() useful, since it returns diverse features related to the protein being queried. However, this function requires the UniProt ID as its argument, and all we have so far is the PDB ID. Fortunately, that won’t be a problem, because id.mapping() provides the ID conversion we need.
up_id <- id.mapping('2occ', from = 'pdb', to = 'uniprot')
up_id
##  [1] "P00396" "P00415" "P00423" "P00426" "P00428" "P00429" "P00430" "P04038" "P07470" "P07471" "P10175"
## [12] "P13183" "P68530"
We can verify that the structure 2OCC consists of 13 different polypeptide chains. We are now in a position to find out more about 2OCC. For this purpose we are going to build, with the help of id.features(), a data frame with some features of each polypeptide (please type ?id.features in the RStudio console for further details).
# Pre-allocate one row per UniProt ID
features_2occ <- data.frame(Entry = rep(NA, length(up_id)),
                            Status = rep(NA, length(up_id)),
                            Entry_name = rep(NA, length(up_id)),
                            Organism = rep(NA, length(up_id)))
# Query each ID and copy the fields of interest into the table
for (i in seq_along(up_id)){
  chain <- id.features(up_id[i], features = 'ec')
  features_2occ$Entry[i] <- chain$Entry
  features_2occ$Status[i] <- chain$Status
  features_2occ$Entry_name[i] <- chain$Entry_name
  features_2occ$Organism[i] <- chain$Organism
}
library(knitr)
kable(features_2occ)
Entry | Status | Entry_name | Organism |
---|---|---|---|
P00396 | reviewed | COX1_BOVIN | Bos taurus (Bovine) |
P00415 | reviewed | COX3_BOVIN | Bos taurus (Bovine) |
P00423 | reviewed | COX41_BOVIN | Bos taurus (Bovine) |
P00426 | reviewed | COX5A_BOVIN | Bos taurus (Bovine) |
P00428 | reviewed | COX5B_BOVIN | Bos taurus (Bovine) |
P00429 | reviewed | CX6B1_BOVIN | Bos taurus (Bovine) |
P00430 | reviewed | COX7C_BOVIN | Bos taurus (Bovine) |
P04038 | reviewed | COX6C_BOVIN | Bos taurus (Bovine) |
P07470 | reviewed | CX7A1_BOVIN | Bos taurus (Bovine) |
P07471 | reviewed | CX6A2_BOVIN | Bos taurus (Bovine) |
P10175 | reviewed | COX8B_BOVIN | Bos taurus (Bovine) |
P13183 | reviewed | COX7B_BOVIN | Bos taurus (Bovine) |
P68530 | reviewed | COX2_BOVIN | Bos taurus (Bovine) |
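The table above was filled row by row inside a loop. An alternative pattern, sketched below, builds one small data frame per ID and stacks them with do.call(rbind, ...). To keep the sketch runnable without network access, fake_features() is a hypothetical stand-in for id.features() that returns placeholder values; with the real package one would wrap id.features() instead, assuming it returns the fields used in the loop above.

```r
# Hypothetical stand-in for id.features(): returns a one-row data frame
# with placeholder values, so the pattern can be run offline.
fake_features <- function(id) {
  data.frame(Entry = id,
             Status = "reviewed",
             Entry_name = paste0(id, "_NAME"),   # placeholder, not a real entry name
             Organism = "Bos taurus (Bovine)",
             stringsAsFactors = FALSE)
}

ids <- c("P00396", "P00415", "P68530")
rows <- lapply(ids, fake_features)     # one data frame per ID
features <- do.call(rbind, rows)       # stack them into a single table
features
```

This avoids pre-allocating the columns by hand, at the cost of one rbind() call at the end.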
Now that we know that the 2OCC structure corresponds to cytochrome c oxidase (COX), let’s suppose that we are interested in obtaining the DNA sequence that codes for the COX2 chain. We know the UniProt ID for this chain, but we need the ID of that sequence in the suitable database: KEGG. Indeed, an important part of handling sequences from different databases is converting identifiers between them, and we already know how to do that with id.mapping():
kegg_id <- id.mapping(id = 'P68530', from = 'uniprot', to = 'kegg')
kegg_id
##     up:P68530 
## "bta:3283880"
Thus, we can proceed to download the nucleotide sequence from KEGG:
cox2_dna <- get.seq(kegg_id, db = 'kegg-nt')
cox2_dna
## [1] "ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTACTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTACATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAAGTAGAGACAATCTGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCTTTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATAGGACATCAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCCTACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAATCGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTATTACACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAACCAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGCGGGTCAAACCACAGTTTCATGCCCATTGTCCTTGAGTTAGTCCCACTAAAGTACTTTGAAAAATGATCTGCGTCAATATTATAA"
Next, we are going to check that this DNA sequence indeed encodes the polypeptide sequence contained in the object chainB that we obtained above. To carry out this task we are going to use the packages seqinr and bio3d. So, if you have not done so already, install them with the command:
install.packages(c("seqinr", "bio3d"))
library(seqinr)
## 
## Attaching package: 'seqinr'
## The following object is masked from 'package:ptm':
## 
##     read.fasta
translated <- seqinr::translate(seqinr::s2c(cox2_dna), numcode = 2)
translated <- paste(translated, collapse = "")
Observe that, since COX2 is a mtDNA encoded protein, we have passed the argument numcode = 2 to indicate that the vertebrate mitochondrial genetic code should be used.
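To see why the choice of genetic code matters, the sketch below lists the four codons whose meaning differs between the standard code and the vertebrate mitochondrial code. The two lookup vectors are written by hand for illustration and are not taken from seqinr.

```r
# Codons whose meaning differs between the two codes ("*" = stop).
standard  <- c(ATA = "I", TGA = "*", AGA = "R", AGG = "R")
vert_mito <- c(ATA = "M", TGA = "W", AGA = "*", AGG = "*")

# Both ATA and TGA occur in the COX2 coding sequence shown above
codons <- c("ATA", "TGA")
comparison <- data.frame(codon = codons,
                         standard = unname(standard[codons]),
                         mitochondrial = unname(vert_mito[codons]),
                         stringsAsFactors = FALSE)
comparison
```

With the standard code, ATA would be read as isoleucine and TGA as a stop codon, so the translation would not reproduce the methionine and tryptophan residues found in the protein.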
library(bio3d)
## 
## Attaching package: 'bio3d'
## The following objects are masked from 'package:seqinr':
## 
##     consensus, read.fasta, write.fasta
## The following object is masked from 'package:ptm':
## 
##     get.seq
translated <- as.character(translated)
sequences <- seqbind(chainB, translated, blank = '-')
myaln <- seqaln(sequences, id = c("chainB", "translated"))
myaln
##              1        .         .         .         .         .         60 
## chainB       MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
## translated   MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
##              ************************************************************ 
##              1        .         .         .         .         .         60 
## 
##              61       .         .         .         .         .       120 
## chainB       VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
## translated   VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
##              ************************************************************ 
##              61       .         .         .         .         .       120 
## 
##              121      .         .         .         .         .       180 
## chainB       YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
## translated   YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
##              ************************************************************ 
##              121      .         .         .         .         .       180 
## 
##              181      .         .         .         .    227 
## chainB       QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
## translated   QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
##              *********************************************** 
##              181      .         .         .         .    227 
## 
## Call:
##   seqaln(aln = sequences, id = c("chainB", "translated"))
## 
## Class:
##   fasta
## 
## Alignment dimensions:
##   2 sequence rows; 227 position columns (227 non-gap, 0 gap) 
## 
## + attr: id, ali, call
Indeed, both protein sequences are identical!
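When only a yes/no answer is needed, a plain string comparison may be enough. The sketch below uses short toy sequences (toy_protein and toy_translated are illustrative names) and assumes that the translated sequence may end with a '*' symbol for the stop codon, which has to be removed before comparing.

```r
toy_protein    <- "MAYPMQ"     # toy stand-in for the PDB chain sequence
toy_translated <- "MAYPMQ*"    # toy stand-in for the translated DNA

# Drop a trailing stop symbol, if present, before comparing
toy_clean <- sub("\\*$", "", toy_translated)
same <- identical(toy_protein, toy_clean)
same
```

An alignment remains the better tool when the sequences differ, since it shows where the mismatches are.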
A straightforward alternative for obtaining the coding DNA sequence is the function prot2codon(). This function accepts as argument either the UniProt ID or the PDB ID. In the latter case, we have to provide the chain ID.
prot_dna <- prot2codon(prot = '2occ', chain = 'B')
## PDB has ALT records, taking A only, rm.alt=TRUE
paste(prot_dna$codon, collapse = "")
## [1] "ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTACTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTACATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAAGTAGAGACAATCNATGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCTTTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATAGGACATNANACAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCCTACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAATCGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTANATTACACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAACCAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGCGGGTCAAACCACAGTTTCATGCCCATTGTCCTTGAGTTAGTCCCACTAAAGNATACTTTGAAAAATGA"
When using prot2codon(), a caveat to keep in mind is that the translation is carried out using the standard genetic code. For that reason, those methionines encoded by ATA are marked as check = FALSE, because the script expects isoleucine instead of methionine.
library(knitr)
kable(head(prot_dna))
id | chain | pos | aa | codon | check |
---|---|---|---|---|---|
2occ | B | 1 | M | ATG | TRUE |
2occ | B | 2 | A | GCA | TRUE |
2occ | B | 3 | Y | TAT | TRUE |
2occ | B | 4 | P | CCC | TRUE |
2occ | B | 5 | M | ATA | FALSE |
2occ | B | 6 | Q | CAA | TRUE |
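The rows flagged with check = FALSE can be pulled out with ordinary data frame indexing. The sketch below re-types the six rows shown above as a toy data frame, so it runs without the package; it assumes that check is a logical column and uses which() so that any NA values in that column would be skipped.

```r
# The six rows shown above, re-typed as a toy data frame.
toy <- data.frame(pos   = 1:6,
                  aa    = c("M", "A", "Y", "P", "M", "Q"),
                  codon = c("ATG", "GCA", "TAT", "CCC", "ATA", "CAA"),
                  check = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)

# which() keeps only the rows where check is FALSE and ignores NA
mismatches <- toy[which(!toy$check), ]
mismatches                   # position 5: M encoded by ATA
table(mismatches$codon)      # which codons are responsible
```

Applied to the full prot_dna object, the same indexing would list every position where the standard code disagrees with the observed residue.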
Finally, we present the function species.mapping(), which maps a protein ID (either from UniProt or PDB) to its corresponding organism. We have seen above that this functionality can be obtained with the more general-purpose function id.features(), but when all we want is to assign a species to a given protein, the easier and faster option is:
species.mapping('2occ', db = 'pdb')
## [1] "Bos taurus"