Finds the codons corresponding to a given protein


prot2codon(prot, chain = “”, laxity = TRUE)


prot is either a UniProt or PDB id, or the path to a pdb file.

chain when prot corresponds to a pdb, the chain of interest must be provided.

laxity logical, if FALSE the program stop when a mismatch between the protein and the gene sequences is detected. Otherwise, the program finishes running and at the end points out the mismatches.


Returns a dataframe with as many rows as residues has the protein.


The ptm package offers a set of functions aimed to assist us to download and handle sequences from different databases:

At the heart of this set of functions is get.seq(), which imports a biological sequence (either as a string or an array) from the selected database.

This function can take three arguments:

  • id: the identifier of the molecule in the desired database.
  • db: the database of interest, which must be one of ‘uniprot’, ‘metosite’, ‘pdb’, ‘kegg-aa’ or ‘kegg-nt’. By default, db = ‘uniprot’.
  • as.string: logical, if TRUE the imported sequence will be returned as a character string, otherwise the sequence is returned as an array. By default, as.string = TRUE.

Regarding the identifiers, it should be noted that MetOSite uses the same type of protein ID than UniProt. However, if the chosen database is PDB, the identifier should be the 4-character unique identifier characteristic of PDB, followed by a dot and the chain of interest (for proteins with quaternary structure). For instance, ‘2OCC.B’ means we are interested in the sequence of chain B from the structure 2OCC. Please, note that while the function is case-insensitive regarding the PDB ID, that is not the case for the letter that identify the chain, which must be a capital letter.

# Two valid queries:
chainB <- get.seq('2occ:B', db = 'pdb')

CHAINB <- get.seq('2OCC:B', db = 'pdb')
chainB == CHAINB
## [1] TRUE

If we request the PDB sequence of an oligomeric protein, but we don’t provide the chain identifier, then the function will return the concatenated sequences of all the chains

all_chains <- get.seq('2occ', db = 'pdb')
## Fetching... Please wait. Done.
##    3612 

We can check that chain B is indeed found within this super-sequence:

gregexpr(chainB, all_chains)
[1] 515 2321
[1] 227 227
[1] "chars"
[1] TRUE

We have been using the chain B from the structure 2OCC as an example to show the usage of get.seq(), but you may be wondering what protein it is. If that is the case, you may find useful the function id.features(), which returns diverse features related to the protein being query. However, this function require the UniProt ID as argument, and all we have so far is the PDB ID. Fortunately, that won’t be a problem because id.mapping() allows the ID interconversion we need.

up_id <- id.mapping('2occ', from = 'pdb', to = 'uniprot')
##  [1] "P00396" "P00415" "P00423" "P00426" "P00428" "P00429" "P00430" "P04038" "P07470" "P07471" "P10175"
## [12] "P13183" "P68530"

We can verify that the structure 2OCC consist of 13 different polypeptide chains. Now, we are in condition to find out more about 2OCC. For this purpose, we are going to build, with the help of id.features(), a dataframe with some features for each polypeptide (Please, type ?id.features in the RStudio console to get further details).

features_2occ <- data.frame(Entry = rep(NA, length(up_id)),
                            Status = rep(NA, length(up_id)),
                            Entry_name = rep(NA, length(up_id)),
                            Organism = rep(NA, length(up_id)))

for (i in 1:length(up_id)){ 
  chain <- id.features(up_id[i], features = 'ec')
  features_2occ$Entry[i] <- chain$Entry
  features_2occ$Status[i] <- chain$Status
  features_2occ$Entry_name[i] <- chain$Entry_name
  features_2occ$Organism[i] <- chain$Organism
P00396reviewedCOX1_BOVINBos taurus (Bovine)
P00415reviewedCOX3_BOVINBos taurus (Bovine)
P00423reviewedCOX41_BOVINBos taurus (Bovine)
P00426reviewedCOX5A_BOVINBos taurus (Bovine)
P00428reviewedCOX5B_BOVINBos taurus (Bovine)
P00429reviewedCX6B1_BOVINBos taurus (Bovine)
P00430reviewedCOX7C_BOVINBos taurus (Bovine)
P04038reviewedCOX6C_BOVINBos taurus (Bovine)
P07470reviewedCX7A1_BOVINBos taurus (Bovine)
P07471reviewedCX6A2_BOVINBos taurus (Bovine)
P10175reviewedCOX8B_BOVINBos taurus (Bovine)
P13183reviewedCOX7B_BOVINBos taurus (Bovine)
P68530reviewedCOX2_BOVINBos taurus (Bovine)

Now, that we know that the 2OCC structure correspond to the cytochrome c oxidase (COX),
let’s suppose that we are interested in obtaining the DNA sequence that codes for the COX2 chain. We know the UniProt ID for this chain, but we need to know the ID of that sequence in the suitable database: KEGG. Actually, an important part of handling sequences from different databases involves the conversion of identifiers among different databases, but we already know how to do it with id.mapping()

kegg_id <- id.mapping(id = 'P68530', from = 'uniprot', to = 'kegg')

##     up:P68530 
## "bta:3283880"

Thus, we can proceed to download the nucleotide sequence from KEGG:

cox2_dna <- get.seq(kegg_id, db = 'kegg-nt')
[1] "kegg-nt"

Next, we are going to check that this DNA sequence encodes indeed for the polypeptide sequence contained in the object chainB that we got above. To carry out this task we are going to make use of the packages seqinr and bio3d. So, if you’ve not done so already, install them with the command:

translated <- seqinr::translate(seqinr::s2c(cox2_dna), numcode = 2)
translated <- paste(translated, collapse = "")

Observe that, since COX2 is a mtDNA encoded protein, we have passed the argument numcode = 2 to indicate that the vertebrate mitochondrial genetic code should be used.

translated <- as.character(translated)
sequences <- seqbind(chainB, translated, blank = '-') 
myaln <- seqaln(sequences, id = c("chainB", "translated"))

##              1        .         .         .         .         .         60 
##              ************************************************************ 
##              1        .         .         .         .         .         60 
##             61        .         .         .         .         .         120 
##              ************************************************************ 
##             61        .         .         .         .         .         120 
##            121        .         .         .         .         .         180 
##              ************************************************************ 
##            121        .         .         .         .         .         180 
##            181        .         .         .         .      227 
##              *********************************************** 
##            181        .         .         .         .      227 
## Call:
##   seqaln(aln = sequences, id = c("chainB", "translated"))
## Class:
##   fasta
## Alignment dimensions:
##   2 sequence rows; 227 position columns (227 non-gap, 0 gap) 
## + attr: id, ali, call

Indeed, both protein sequences are identical!

A straightforward alternative approach to get the DNA coding sequence, implies using the function prot2codon(). This function accepts as argument either the UniProt ID or the PDB ID. In the last case, we have to provide the chain ID.

prot_dna <- prot2codon(prot = '2occ', chain = 'B')

##    PDB has ALT records, taking A only, rm.alt=TRUE
paste(prot_dna$codon, collapse = "")


When using prot2codon(), a caveat to keep in mind is that the translation is carry out using the standard genetic code. For that reason, those methionine encoded by ATA are marked as check = FALSE, because the script expected isoleucine instead of methionine.



Finally, we present the function species.mapping(), which maps a protein ID (either from UniProt or PDB) to its corresponding organism. We have seen above, that this functionality can be obtanined using the more general purpose function id.features(), but when all we wish is to assign a species to a given protein, the easier and faster option is:

species.mapping('2occ', db = 'pdb')

## [1] "Bos taurus"