get.hssp()

Description

Gets and parses the HSSP file of the requested PDB structure.

Usage

get.hssp(pdb, path, keepfiles = TRUE)

Arguments

pdb the 4-letter identifier of the PDB file.

path character string providing the path to the in-house HSSP database.

keepfiles logical, if TRUE the dataframes will be saved in the working directory and the hssp file will be kept.

Value

Returns the ‘profile’ dataframe. When ‘keepfiles’ is TRUE, four dataframes containing the information found in hssp files are also saved in the working directory, as described below.

References

Touw et al. (2015) Nucleic Acids Res. 43:D364-368.
Lange et al. (2020) Protein Sci. 29:330-344.

See Also

msa(), custom.aln(), list.hom(), parse.hssp(), shannon(), site.type()

Details

Multiple sequence alignment (MSA) consists of the alignment of three or more biological sequences. From the output, homology can be inferred and the evolutionary relationships between the sequences can be studied. Thus, alignment is the most important stage in most evolutionary analyses. In addition, MSA is an essential tool for protein structure and function prediction. The package ptm offers several functions that will assist you in the process of sequence analysis:

msa
custom.aln
list.hom
parse.hssp
get.hssp (current doc)
shannon
site.type

The function get.hssp() will obtain and parse the requested HSSP file. HSSP stands for Homology-derived Secondary Structure of Proteins. These files contain information related to MSAs of UniProtKB against PDB. When the argument ‘keepfiles’ is set to TRUE, the get.hssp() function will build and save (in the working directory) the following 4 dataframes:

  • id_seq_list.Rda: This dataframe holds the metadata per sequence, and some alignment statistics. For a detailed description of the information that can be found in this block, check here.

  • id_aln.Rda: This dataframe contains the alignment itself (each sequence is a column). Additional information such as secondary structure, SASA (solvent accessible surface area), etc., is also found in this block.

  • id_profile.Rda: This dataframe holds, for each amino acid type, its percentage in the list of residues observed at the indicated position. In addition, this dataframe also reports the entropy at each position, as well as the number of sequences spanning this position (NOCC).

  • id_insertions.Rda: A dataframe with information regarding those sequences that contain insertions. Click here for further details.

In order to use this function, you need to obtain a local copy of the HSSP database. This process can take a few minutes, but it only has to be done once. To do so, you can follow the instructions given here. I have created my copy of HSSP in a folder whose absolute path is ‘/Users/juancarlosaledo/ptm_outdropbox/local_HSSP/’, so to obtain the hssp information related to the PDB structure 1U8F, all I have to type in R is:

profile <- get.hssp(pdb = '1u8f', 
                     path = "/Users/juancarlosaledo/ptm_outdropbox/local_HSSP/", 
                     keepfiles = TRUE)
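
Before calling get.hssp(), it can be worth checking that the folder passed as ‘path’ actually exists. The helper below, check_hssp_path(), is hypothetical (it is not part of ptm) and relies only on base R. It is a minimal sketch, shown here with a temporary folder standing in for the local HSSP copy.

```r
# Hypothetical helper (not part of ptm): validate the folder that will be
# passed to get.hssp() as 'path'.
check_hssp_path <- function(path) {
  if (!dir.exists(path)) {
    stop("HSSP folder not found: ", path)
  }
  # Absolute path with forward slashes and exactly one trailing separator
  paste0(sub("/+$", "", normalizePath(path, winslash = "/")), "/")
}

# Usage with a temporary folder standing in for the local HSSP copy
demo_dir <- file.path(tempdir(), "local_HSSP")
dir.create(demo_dir, showWarnings = FALSE)
hssp_path <- check_hssp_path(demo_dir)
print(hssp_path)
```

The value returned by the helper could then be supplied as the ‘path’ argument of get.hssp().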

The object ‘profile’ is a dataframe with as many rows as there are residues in the protein. For each position, the following variables (columns) are shown:

  • SeqNo: Sequence residue number.
  • PDBNo: PDB residue number.
  • V: Percentage at which the amino acid valine (Val) is found at that position.
  • L: Percentage at which the amino acid leucine (Leu) is found at that position.
  • I: Percentage at which the amino acid isoleucine (Ile) is found at that position.
  • M: Percentage at which the amino acid methionine (Met) is found at that position.
  • F: Percentage at which the amino acid phenylalanine (Phe) is found at that position.
  • W: Percentage at which the amino acid tryptophan (Trp) is found at that position.
  • Y: Percentage at which the amino acid tyrosine (Tyr) is found at that position.
  • A: Percentage at which the amino acid alanine (Ala) is found at that position.
  • G: Percentage at which the amino acid glycine (Gly) is found at that position.
  • P: Percentage at which the amino acid proline (Pro) is found at that position.
  • S: Percentage at which the amino acid serine (Ser) is found at that position.
  • T: Percentage at which the amino acid threonine (Thr) is found at that position.
  • C: Percentage at which the amino acid cysteine (Cys) is found at that position.
  • Q: Percentage at which the amino acid glutamine (Gln) is found at that position.
  • N: Percentage at which the amino acid asparagine (Asn) is found at that position.
  • H: Percentage at which the amino acid histidine (His) is found at that position.
  • R: Percentage at which the amino acid arginine (Arg) is found at that position.
  • K: Percentage at which the amino acid lysine (Lys) is found at that position.
  • E: Percentage at which the amino acid glutamate (Glu) is found at that position.
  • D: Percentage at which the amino acid aspartate (Asp) is found at that position.
  • NOCC: Number of aligned sequences spanning this position (including the test sequence).
  • NDEL: Number of sequences with a deletion in the test protein at this position.
  • NINS: Number of sequences with an insertion in the test protein at this position.
  • ENTROPY: Entropy measure of sequence variability at this position.
  • RELENT: Relative entropy, i.e. entropy normalized to the range 0-100.
  • WEIGHT: Conservation weight.
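
To make the relation between the percentage columns and the entropy columns more concrete, the sketch below computes a Shannon entropy from per-position percentages. The toy dataframe is made up, and both the use of natural logarithms and the normalization by ln(20) are assumptions about how ENTROPY and RELENT are derived, so the numbers need not match those stored in a real ‘profile’.

```r
# Toy stand-in for three rows of 'profile'. Only four amino acid columns are
# listed; the remaining ones are implicitly 0.
toy <- data.frame(SeqNo = 1:3,
                  V = c(100, 50, 25),
                  L = c(0, 50, 25),
                  I = c(0, 0, 25),
                  M = c(0, 0, 25))

# Shannon entropy (natural log) of one row of percentages
entropy_from_pct <- function(pct) {
  p <- pct / sum(pct)   # convert percentages to proportions
  p <- p[p > 0]         # the term 0 * log(0) is taken as 0
  -sum(p * log(p))
}

toy$entropy <- apply(toy[, c("V", "L", "I", "M")], 1, entropy_from_pct)
toy$relent  <- 100 * toy$entropy / log(20)   # assumed normalization by ln(20)
print(toy)
```

A fully conserved position (first row) yields an entropy of zero, and the entropy grows as the residues are spread more evenly over more amino acid types.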

We can have a visual impression of which are the most variable and the most conserved positions by plotting the relative entropy as a function of the position:

plot(profile$SeqNo, profile$RELENT, type = 'h', xlab = 'Position', ylab = 'Relative Entropy')


In this way, the most variable position is:

# Row with the highest entropy (the first one, if several positions are tied)
maxS_at <- which.max(profile$ENTROPY)
# Percentages of the 20 amino acids (columns 3 to 22) at that position
x <- as.data.frame(t(profile[maxS_at, 3:22]))
x$col <- c(rep("orange", 8), rep("purple", 2), rep("green", 5), rep("blue", 3), rep("red", 2))
names(x) <- c('frequency', 'col')

barplot(height = x$frequency,
        names.arg = rownames(x),
        col = x$col,
        main = paste("Position:", profile$PDBNo[maxS_at]))

Here, we have colored the amino acids according to their physicochemical nature: acidic (E, D) in red, basic (H, R, K) in blue, hydrophobic (V, L, I, M, F, W, Y, A) in orange, polar (S, T, C, Q, N) in green and special (G, P) in purple.
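
The color vector above depends on the order of the amino acid columns. As an alternative sketch, the same colors can be derived from a named lookup, which remains correct if the columns are reordered. The object names (aa, groups, aa_col) are introduced here only for illustration.

```r
# Amino acids in the same order as columns 3 to 22 of 'profile'
aa <- c("V", "L", "I", "M", "F", "W", "Y", "A", "G", "P",
        "S", "T", "C", "Q", "N", "H", "R", "K", "E", "D")

# Physicochemical groups, keyed by the color used in the bar plots
groups <- list(orange = c("V", "L", "I", "M", "F", "W", "Y", "A"),  # hydrophobic
               purple = c("G", "P"),                                 # special
               green  = c("S", "T", "C", "Q", "N"),                  # polar
               blue   = c("H", "R", "K"),                            # basic
               red    = c("E", "D"))                                 # acidic

# Named vector mapping each amino acid to its color
aa_col <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

# Colors in column order; same values as the rep()-based vector
cols <- unname(aa_col[aa])
print(cols)
```

With this lookup, the colors for any subset or ordering of amino acids can be obtained by indexing aa_col with the one-letter codes.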

In contrast, the most conserved position is:

# Row with the lowest entropy (the first one, if several positions are tied)
minS_at <- which.min(profile$ENTROPY)
# Percentages of the 20 amino acids (columns 3 to 22) at that position
x <- as.data.frame(t(profile[minS_at, 3:22]))
x$col <- c(rep("orange", 8), rep("purple", 2), rep("green", 5), rep("blue", 3), rep("red", 2))
names(x) <- c('frequency', 'col')

barplot(height = x$frequency,
        names.arg = rownames(x),
        col = x$col,
        main = paste("Position:", profile$PDBNo[minS_at]))


where valine is the only amino acid present!
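
Several positions can share the same minimum (or maximum) entropy. The sketch below, which uses a small made-up dataframe in place of ‘profile’, lists all the tied positions and ranks the positions by variability.

```r
# Toy stand-in for a few columns of 'profile' (values are made up)
toy <- data.frame(SeqNo   = 1:6,
                  PDBNo   = 3:8,
                  ENTROPY = c(0.00, 1.20, 0.00, 2.10, 0.35, 2.10))

# All positions sharing the minimum (most conserved) or maximum (most variable) entropy
most_conserved <- toy$PDBNo[toy$ENTROPY == min(toy$ENTROPY)]
most_variable  <- toy$PDBNo[toy$ENTROPY == max(toy$ENTROPY)]

# Positions ranked from most to least variable
ranked <- toy[order(toy$ENTROPY, decreasing = TRUE), ]

print(most_conserved)
print(most_variable)
print(head(ranked, 3))
```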

In addition to this dataframe, which we have called ‘profile’, we can access, if we wish, the alignment itself:

load("./1u8f_aln.Rda")
dim(aln)
## [1] 333 528

This dataframe, which we have placed in an object called ‘aln’, has 333 rows (one per residue) and 528 columns. The first eight columns are:

  • SeqNo: Sequence residue number.
  • PDBNo: PDB residue number.
  • Chain: Chain identifier.
  • AA: Amino Acid at that position in the reference sequence.
  • SS: Element of secondary structure.
  • ACC: Solvent accessible area.
  • NOCC: Number of aligned sequences spanning this position (including the reference sequence).
  • VAR: Sequence variability on a scale of 0-100 as derived from the number of sequences aligned.

The ninth column (named in this example ‘P04406’) gives the reference sequence, while the remaining columns provide the sequences of the proteins included in the alignment. These columns are named with the UniProt ID of the corresponding protein.
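
As an illustration of how such an alignment dataframe can be queried, the sketch below counts the residues observed at one position and computes, for each position, the fraction of sequences that match the reference. It uses a small made-up dataframe with fewer annotation columns than the real ‘aln’, and the sequence column names (REF, HOM1, ...) are placeholders rather than real UniProt IDs.

```r
# Toy stand-in for 'aln': three annotation columns, the reference and 3 homologs
toy_aln <- data.frame(SeqNo = 1:4,
                      PDBNo = 3:6,
                      AA    = c("G", "K", "V", "K"),
                      REF   = c("G", "K", "V", "K"),
                      HOM1  = c("G", "R", "V", "K"),
                      HOM2  = c("G", "K", "I", "K"),
                      HOM3  = c("G", "R", "V", "Q"),
                      stringsAsFactors = FALSE)

seq_cols <- 4:ncol(toy_aln)   # in the real 'aln' the sequences start at column 9

# Residues observed at the position with PDB number 4
pos <- which(toy_aln$PDBNo == 4)
residues <- unlist(toy_aln[pos, seq_cols])
counts <- table(residues)
print(counts)

# Fraction of sequences identical to the reference residue at each position
ref_identity <- rowMeans(as.matrix(toy_aln[, seq_cols]) == toy_aln$AA)
print(ref_identity)
```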

Information regarding the metadata per sequence, and some alignment statistics, can be found in a third dataframe:

load("./1u8f_seq_list.Rda")
head(seq_list)
##   NR           ID  IDE WSIM IFIR ILAS JFIR JLAS LALI NGAP LGAP LESEQ2     ACCNUM
## 1  1    G3P_HUMAN 1.00    1    1  333    3  335  333    0    0    335     P04406
## 2  2 G3R288_GORGO 1.00    1    1  333    3  335  333    0    0    335     G3R288
## 3  3 H2Q5A6_PANTR 1.00    1    1  333    3  335  333    0    0    335     H2Q5A6
## 4  4 V9HVZ4_HUMAN 1.00    1    1  333    3  335  333    0    0    335     V9HVZ4
## 5  5 A0A096MS12_P 0.99    1    1  333    3  335  333    0    0    335 A0A096MS12
## 6  6 A0A0A7KUP9_M 0.99    1    1  333    3  335  333    0    0    335 A0A0A7KUP9
##                                                                                  PROTEIN
## 1            Glyceraldehyde-3-phosphate dehydrogenase OS=Homo sapiens GN=GAPDH PE=1 SV=3
## 2 Glyceraldehyde-3-phosphate dehydrogenase OS=Gorilla gorilla gorilla GN=GAPDH PE=3 SV=1
## 3         Glyceraldehyde-3-phosphate dehydrogenase OS=Pan troglodytes GN=GAPDH PE=3 SV=1
## 4      Glyceraldehyde-3-phosphate dehydrogenase OS=Homo sapiens GN=HEL-S-162eP PE=2 SV=1
## 5                     Glyceraldehyde-3-phosphate dehydrogenase OS=Papio anubis PE=3 SV=1
## 6     Glyceraldehyde-3-phosphate dehydrogenase OS=Macaca fascicularis GN=GAPDH PE=2 SV=1

The number of rows, in our example, is 520 (one per sequence included in the alignment). The variables (columns) held in this dataframe are:

  • NR: Sequence number.
  • ID: EMBL/SWISSPROT identifier of the aligned (homologous) protein.
  • IDE: Residue identity of the alignment, expressed as a fraction (1.00 corresponds to 100% identity).
  • WSIM: Weighted similarity of the alignment.
  • IFIR: First residue of the alignment in the test sequence.
  • ILAS: Last residue of the alignment in the test sequence.
  • JFIR: First residue of the alignment in the aligned protein.
  • JLAS: Last residue of the alignment in the aligned protein.
  • LALI: Length of the alignment excluding insertions and deletions.
  • NGAP: Number of insertions and deletions in the alignment.
  • LGAP: Total length of all insertions and deletions.
  • LESEQ2: Length of the entire sequence of the aligned protein (LSEQ2 in the HSSP file).
  • ACCNUM: SwissProt accession number.
  • PROTEIN: One-line description of aligned protein.
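
These variables make it easy to select subsets of homologs. The sketch below keeps the sequences above an identity threshold and tabulates sequences per identity range. The toy dataframe reuses a few values from the output above and adds one made-up distant homolog (ID ‘TOY_EXAMPLE’, accession ‘X00000’) purely for illustration. The 0.90 threshold is arbitrary.

```r
# Toy stand-in for 'seq_list' (the last row is invented)
toy_seq_list <- data.frame(
  NR     = 1:4,
  ID     = c("G3P_HUMAN", "G3R288_GORGO", "A0A096MS12_P", "TOY_EXAMPLE"),
  IDE    = c(1.00, 1.00, 0.99, 0.45),
  LALI   = c(333, 333, 333, 310),
  ACCNUM = c("P04406", "G3R288", "A0A096MS12", "X00000"),
  stringsAsFactors = FALSE)

# Keep homologs with at least 90% identity (IDE is stored as a fraction)
close_homologs <- subset(toy_seq_list, IDE >= 0.90)
print(close_homologs$ACCNUM)

# Number of sequences per identity range
ide_bins <- table(cut(toy_seq_list$IDE,
                      breaks = c(0, 0.5, 0.9, 1),
                      include.lowest = TRUE))
print(ide_bins)
```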

Finally, a fourth dataframe, saved as ‘1u8f_insertions.Rda’ and loaded as the object ‘inser’, can be accessed:

load("./1u8f_insertions.Rda")
inser
##     AliNo IPOS JPOS Len             Sequence
## 1      72  235  221   2                 pTPt
## 2      73  297  297   5              gAGIAGa
## 3      77  175  177   3                nWLLp
## 4      77  189  194   3                gTGSp
## 5      78  289  286   1                  tQh
## 6      90   23   54   1                  sGc
## 7     112   27   27   1                  gVe
## 8     113   27   27   1                  gVe
## 9     119   23   23   2                 sSAs
## 10    123   27   27   1                  gVe
## 11    126   41   41   3                yMAMl
## 12    127   27   26   1                  vVe
## 13    130  229  218   2                 gMAt
## 14    137   56   56   1                  kLg
## 15    138   56   75   1                  kLg
## 16    139   27   27   1                  gVe
## 17    141   23   23   2                 sSAs
## 18    143  133  128  18 gVNQDKYDNSLKIVSNVMGv
## 19    146  263  248   1                  rRr
## 20    149   56   56   1                  kHs
## 21    150   27   25   1                  gAe
## 22    152  192  196  15    gKLCITCSRRWGYSVSl
## 23    152  245  264   4               sCHLTc
## 24    154   56   56   1                  kHg
## 25    157   27   25   1                  gGd
## 26    159  263  268   1                  aSa
## 27    160   27   17   1                  gAk
## 28    161   27   17   1                  gAk
## 29    162   27   25   1                  gAq
## 30    163   27   51   1                  gAt
## 31    165   27   25   1                  gAq
## 32    167  289  256   2                 pPAh
## 33    168  142  141   1                  aSm
## 34    171   27   25   1                  gAq
## 35    172   27   25   1                  gAk
## 36    173  276  268   1                  tTd
## 37    173  307  300   1                  sLk
## 38    174   41   40   1                  yIl
## 39    175   27   25   1                  gAq
## 40    176   27   25   1                  gGq
## 41    177   27   25   1                  gAq
## 42    178   27   25   1                  gAq
## 43    179   27   27   1                  gAq
## 44    180   27   51   1                  gAt
## 45    181   27   51   1                  gAt
## 46    182   27   51   1                  gAt
## 47    183   27   51   1                  gAt
## 48    184   27   51   1                  gAt
## 49    185   27   51   1                  gAt
## 50    186   27   51   1                  gAt
## 51    187   27   51   1                  gAt
## 52    188   23   23   2                 eRGg
## 53    188   24   26   1                  gQv
## 54    188   27   30   1                  vNq
## 55    189   27   25   1                  gAq
## 56    190   27   55   1                  gGq
## 57    191   27   25   1                  gGq
## 58    192   27   32   1                  gGq
## 59    193   27   53   1                  gGq
## 60    194   27   25   1                  gAq
## 61    195   27   25   1                  gAn
## 62    196   27   25   1                  gAn
## 63    197   27   25   1                  gAs
## 64    198   27   25   1                  gAn
## 65    199   27   25   1                  gAq
## 66    200   23   26   1                  eKg
## 67    201   27   25   1                  gGq
## 68    202   27   25   1                  gGq
## 69    203   69   71   5              tKDGKSq
## 70    203  142  149   1                  aNd
## 71    204   27   25   1                  gAq
## 72    205   27   25   1                  gAt
## 73    207   69   71   5              tKDGKTq
## 74    207  142  149   1                  aNd
## 75    208   27   25   1                  gAq
## 76    209   27   25   1                  gAq
## 77    210   27   25   1                  gAq
## 78    211   27   25   1                  gAq
## 79    212   27   25   1                  gAt
## 80    213   27   25   1                  gAq
## 81    214   27   25   1                  gAt
## 82    215   27   27   1                  gIk
## 83    215  142  143   1                  sSm
## 84    216   27   25   1                  gAq
## 85    217   27   25   1                  gAq
## 86    218  263  264   1                  aAs
## 87    219   27   28   1                  gGt
## 88    220   27   25   1                  gVe
## 89    220  263  262   1                  aAa
## 90    221   27   25   1                  gAq
## 91    222   27   51   1                  gAt
## 92    223   23   24   1                  aLn
## 93    223  263  264   1                  kAs
## 94    224  263  264   1                  aAs
## 95    225   27   25   1                  gAq
## 96    226   27   25   1                  gVe
## 97    226  263  262   1                  aAa
## 98    227   27   25   1                  gAs
## 99    228   27   25   1                  gAs
## 100   229   27   25   1                  gAe
## 101   230   27   25   1                  gAq
## 102   231   27   25   1                  gAn
## 103   232   27   25   1                  gAs
## 104   233   27   25   1                  gAn
## 105   234   27   25   1                  gAn
## 106   235   27   25   1                  gAn
## 107   236   27   25   1                  gAq
## 108   237   27   27   1                  gIk
## 109   237  142  143   1                  sSm
## 110   238   27   25   1                  gVe
## 111   238  263  262   1                  aAa
## 112   239   27   25   1                  gGq
## 113   240   69   71   5              tKEGKSq
## 114   240  142  149   1                  aNd
## 115   241   18   18   1                  tSr
## 116   242   27   27   1                  gIk
## 117   242  142  143   1                  sSm
## 118   243   69   71   5              aNEGKSq
## 119   243  142  149   1                  aNd
## 120   244   27   25   1                  gAs
## 121   245   27   25   1                  gAe
## 122   246   27   25   1                  gAq
## 123   247   27   25   1                  gAq
## 124   248   27   25   1                  gAq
## 125   250   27   25   1                  gAn
## 126   251   27   25   1                  gAs
## 127   252   27   27   1                  gIk
## 128   252  142  143   1                  sSm
## 129   253   27   27   1                  gIk
## 130   253  142  143   1                  sSm
## 131   254   27   27   1                  gIk
## 132   254  142  143   1                  sSm
## 133   255   27   25   1                  gVe
## 134   255  263  262   1                  aAa
## 135   256   27   25   1                  gVd
## 136   256  263  262   1                  aAa
## 137   257   27   25   1                  gAq
## 138   258   27   25   1                  gAs
## 139   259   27   27   1                  gIk
## 140   259  142  143   1                  sSm
## 141   260  263  264   1                  aAs
## 142   261   27   25   1                  gAq
## 143   262   27   25   1                  gAe
## 144   265   69   71   5              kQDGKDt
## 145   265  142  149   1                  aNd
## 146   266   27   42   1                  gAe
## 147   267   69   77   5              sKEGKSt
## 148   267  142  155   1                  aNn
## 149   268   27   38   1                  gIk
## 150   268  142  154   1                  sSm
## 151   269   27   25   1                  gAs
## 152   270   27   25   1                  gAq
## 153   271   27   25   1                  gAq
## 154   272   27   25   1                  gAs
## 155   273   27   25   1                  gAq
## 156   274   69   71   4               sFGGSt
## 157   274  142  148   1                  nTn
## 158   275   27   25   1                  gGq
## 159   276  263  277   1                  aAa
## 160   277   27   25   1                  gAn
## 161   278  113  115   1                  nDr
## 162   279  113  115   1                  nDr
## 163   280  113  115   1                  nDr
## 164   281  113  115   1                  nDr
## 165   282  113  115   1                  nDr
## 166   283  113  115   1                  nDr
## 167   284  113  115   1                  nDr
## 168   285  113  115   1                  nDr
## 169   286  113  115   1                  nDr
## 170   287  113  115   1                  nDr
## 171   288  113  115   1                  nDr
## 172   289   23   19   2                 sSAs
## 173   289  200  185   1                  gGa
## 174   290  189  188   2                 gSYv
## 175   292   27   27   1                  gIk
## 176   292  142  143   1                  sSm
## 177   294   23   24   1                  lNp
## 178   295   27   27   1                  gIk
## 179   295  142  143   1                  sSm
## 180   296  113  115   1                  nDr
## 181   297  113  115   1                  nDr
## 182   298  263  264   1                  aAs
## 183   299   69   71   5              sLDGKSt
## 184   299  142  149   1                  aNn
## 185   299  235  243   2                 rVPt
## 186   300  189  149   1                  gWp
## 187   301   69   71   5              kQDGKDa
## 188   301  142  149   1                  aNd
## 189   302  142  142   1                  tKh
## 190   303   23   24   1                  dNp
## 191   303  263  264   1                  kAs
## 192   304   23   24   1                  dNp
## 193   304  263  264   1                  kAs
## 194   305   69   71   5              gLDGKSt
## 195   305  142  149   1                  aNn
## 196   306   23   24   1                  aNp
## 197   307   23   23   1                  qLp
## 198   308   69   71   4               tNGKTt
## 199   308  142  148   1                  aNn
## 200   309   69   71   5              kQDGKDp
##  [ reached 'max' / getOption("max.print") -- omitted 337 rows ]

Further details regarding the information provided by this dataframe can be obtained here.
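
A table of this size is easier to interpret once summarized. The sketch below rebuilds the first eight rows shown above as a toy dataframe, then counts insertions per position and per aligned sequence. Reading IPOS as the position in the reference (test) sequence is an assumption, since the columns are not described here.

```r
# First eight rows of the 'inser' dataframe shown above
toy_inser <- data.frame(
  AliNo    = c(72, 73, 77, 77, 78, 90, 112, 113),
  IPOS     = c(235, 297, 175, 189, 289, 23, 27, 27),
  JPOS     = c(221, 297, 177, 194, 286, 54, 27, 27),
  Len      = c(2, 5, 3, 3, 1, 1, 1, 1),
  Sequence = c("pTPt", "gAGIAGa", "nWLLp", "gTGSp", "tQh", "sGc", "gVe", "gVe"),
  stringsAsFactors = FALSE)

# Number of insertions recorded at each IPOS value, most frequent first
ins_per_pos <- sort(table(toy_inser$IPOS), decreasing = TRUE)
print(ins_per_pos)

# Total inserted length per aligned sequence (AliNo)
total_len <- tapply(toy_inser$Len, toy_inser$AliNo, sum)
print(total_len)

# In these rows, Len equals the length of 'Sequence' minus the two flanking
# lowercase residues
len_matches <- all(nchar(toy_inser$Sequence) - 2 == toy_inser$Len)
print(len_matches)
```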

Please note that if the argument ‘keepfiles’ is set to FALSE, only the dataframe ‘profile’ will be returned, and the hssp file will be deleted from your machine.
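
When the files are kept, note that load() restores objects under the names they were saved with (for instance ‘inser’ for the insertions file), which need not match the file name. The sketch below, using a toy file written to a temporary folder, shows one way to discover those names and to avoid overwriting objects already present in the workspace.

```r
# Save a toy dataframe under the object name 'inser', mimicking an insertions file
inser <- data.frame(AliNo = c(72, 73), IPOS = c(235, 297))
rda_file <- file.path(tempdir(), "toy_insertions.Rda")
save(inser, file = rda_file)
rm(inser)

# load() returns (invisibly) the names of the restored objects; loading into a
# dedicated environment keeps the workspace untouched
e <- new.env()
loaded <- load(rda_file, envir = e)
print(loaded)

# Retrieve the restored object under a name of our choice
insertions <- get(loaded[1], envir = e)
print(dim(insertions))
```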