pdb.select()

Description

Selects the PDB entry and chain with the highest coverage of a given UniProt sequence.

Usage

pdb.select(up_id)

Arguments

up_id the UniProt ID.

Value

A list of two elements: (i) the PDB ID and (ii) the chain. The coverage with the UniProt sequence is given as an attribute.
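
As a minimal sketch of how the returned value can be unpacked, the following builds by hand an object with the structure described above and reads each piece with base R. The values are copied from the P04406 example in Details, so no network access is needed, and the variable names are only illustrative.

```r
# Hand-built stand-in with the documented structure of a pdb.select() result
# (in practice this would come from: res <- pdb.select('P04406'))
res <- structure(list("1u8f", "O"), coverage = 0.994)

pdb_id   <- res[[1]]                # first element: the PDB ID
chain    <- res[[2]]                # second element: the chain
coverage <- attr(res, "coverage")   # the coverage is stored as an attribute
```

Because the coverage is an attribute rather than a third list element, it is read with attr() and not with [[ ]].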

Details

The ptm package contains a number of ancillary functions that deal with Protein Data Bank (PDB) files. These functions may be useful when 3D structural data need to be analyzed; pdb.select() is one of them.

I have often found myself with a UniProt ID in hand, wanting to know the best PDB ID associated with that UniProt sequence. But what should we understand by “the best”?
Dealing with PDB files can be challenging. For example, many structures (particularly those determined by crystallography) include only part of the functional biological assembly. In addition, many PDB entries are missing portions of the molecule that were not observed in the experiment. Therefore, in our case, “the best” means the PDB structure that covers the largest fraction of the UniProt sequence. The function that performs this task is pdb.select().
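
The selection criterion can be illustrated with a toy sketch. This is not the internal implementation of pdb.select(); it only assumes that coverage is the fraction of UniProt residues present in a chain, and that the chain with the largest value wins. The candidate IDs, chains and residue counts below are made up for illustration.

```r
# Hypothetical candidates: made-up IDs and residue counts, for illustration only
uniprot_length <- 300
candidates <- data.frame(
  pdb     = c("aaaa", "aaaa", "bbbb"),
  chain   = c("A", "B", "A"),
  covered = c(150, 285, 90),   # UniProt residues present in each chain
  stringsAsFactors = FALSE
)

# Coverage: fraction of the UniProt sequence found in the chain
candidates$coverage <- candidates$covered / uniprot_length

# "The best" candidate is the one with the largest coverage
best <- candidates[which.max(candidates$coverage), ]
```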

For instance, if we ask for a PDB of the human enzyme glyceraldehyde-3-phosphate dehydrogenase (P04406), all we need to type is:

pdb.select('P04406')
##    PDB has ALT records, taking A only, rm.alt=TRUE
## [[1]]
## [1] "1u8f"
## 
## [[2]]
## [1] "O"
## 
## attr(,"coverage")
## [1] 0.994

As we can observe, chain ‘O’ of PDB entry ‘1U8F’ contains nearly 100 % (99.4 %) of the amino acids in the sequence retrieved from UniProt.

Sometimes we are not so lucky, and the coverage is well below that figure. That is the case for the splicing factor, proline- and glutamine-rich (P23246):

pdb.select('P23246')
##    PDB has ALT records, taking A only, rm.alt=TRUE
##    PDB has ALT records, taking A only, rm.alt=TRUE
##    PDB has ALT records, taking A only, rm.alt=TRUE
##    PDB has ALT records, taking A only, rm.alt=TRUE
##    PDB has ALT records, taking A only, rm.alt=TRUE
## [[1]]
## [1] "5wpa"
## 
## [[2]]
## [1] "A"
## 
## attr(,"coverage")
## [1] 0.317

We got a PDB and a chain, but only 31.7 % of the amino acids in the UniProt sequence are present in this chain.

Even worse, many proteins present in the UniProt database are not represented in the PDB database. In these cases we will get, I’m afraid, a negative response:

pdb.select('G3SB67')
## [1] "NO PDB FOUND"
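
Since the function returns a list when a structure is found and a plain character string otherwise, code that calls it in a pipeline may want to tell the two cases apart. The sketch below is one possible way to do so; has_structure() is a hypothetical helper, not part of ptm, and the two objects are hand-built stand-ins that mirror the outputs shown above.

```r
# Hypothetical helper (not part of ptm): TRUE when the result is a PDB/chain pair
has_structure <- function(res) {
  is.list(res) && length(res) == 2
}

# Hand-built stand-ins for the two kinds of result shown above
found     <- structure(list("5wpa", "A"), coverage = 0.317)
not_found <- "NO PDB FOUND"

if (has_structure(found)) {
  message("PDB ", found[[1]], ", chain ", found[[2]],
          ", coverage ", attr(found, "coverage"))
}
if (!has_structure(not_found)) {
  message("No structure available for this UniProt ID")
}
```

Testing the type of the result, rather than comparing against the exact message text, keeps the check independent of the wording of the negative response.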