parse.hssp()

Description

Parses an HSSP file and returns its contents as dataframes.

Usage

parse.hssp(file, keepfiles = TRUE)

Arguments

file path to the input hssp file.

keepfiles logical, if TRUE the dataframes will be saved in the working directory and the hssp file will be kept.

Value

Returns 4 dataframes containing the information found in hssp files, as described below.

References

Touw et al (2015) Nucl. Ac. Res. 43:D364-368.
Lange et al (2020) Protein Sci. 29:330-344.

Details

Multiple sequence alignment (MSA) consists of the alignment of three or more biological sequences. From the output, homology can be inferred and the evolutionary relationships between the sequences can be studied. Thus, alignment is the most important stage in most evolutionary analyses. In addition, MSA is an essential tool for protein structure and function prediction. The package ptm offers several functions that will assist you in the process of sequence analysis:

msa
custom.aln
list.hom
parse.hssp (current doc)
get.hssp
shannon
site.type

The function parse.hssp() is a parser of HSSP files. HSSP stands for Homology-derived Secondary Structure of Proteins; these files contain information related to MSAs of UniProtKB against PDB. When the argument ‘keepfiles’ is set to TRUE, parse.hssp() builds and saves (in the working directory) the following 4 dataframes:

id_seq_list.Rda: This dataframe holds the metadata per sequence, and some alignment statistics. For a detailed description of the information that can be found in this block, see the HSSP format description.
id_aln.Rda: This dataframe contains the alignment itself (each sequence is a column). Additional information such as secondary structure, SASA (solvent accessible surface area), etc., is also found in this block.
id_profile.Rda: This dataframe holds, per amino acid type, its percentage in the list of residues observed at the indicated position. In addition, this dataframe reports the entropy at each position, as well as the number of sequences spanning this position (NOCC).
id_insertions.Rda: A dataframe with information regarding those sequences that contain insertions. See the HSSP format description for further details.
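
The naming scheme above can be turned into a small check of which output files are present. This is only a sketch: hssp_output_files() is a hypothetical helper, not a ptm function, and it assumes the files follow the id_<block>.Rda pattern listed above.

```r
# Hypothetical helper (not part of ptm): build the expected output file
# names for a given identifier, following the id_<block>.Rda pattern.
hssp_output_files <- function(id, dir = ".") {
  blocks <- c("seq_list", "aln", "profile", "insertions")
  files <- file.path(dir, paste0(id, "_", blocks, ".Rda"))
  names(files) <- blocks
  files
}

files <- hssp_output_files("3cwm")

# Report which of the expected files exist in the directory.
present <- file.exists(files)
names(present) <- names(files)
print(present)
```

Running this after parse.hssp(..., keepfiles = TRUE) should show which of the four blocks were written.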

Since parse.hssp() is a parser, the corresponding hssp file must already be on your machine; you may have downloaded it previously, for instance, using the xssp server. To illustrate the use of parse.hssp(), I have placed the file ‘3cwm.hssp’ in my current directory:

profile <- parse.hssp(file = "./3cwm.hssp", keepfiles = TRUE)

The object ‘profile’ is a dataframe with as many rows as the protein has residues. For each position, the following variables (columns) are shown:

SeqNo: Sequence residue number.
PDBNo: PDB residue number.
V: Percentage at which the amino acid valine (Val) is found at that position.
L: Percentage at which the amino acid leucine (Leu) is found at that position.
I: Percentage at which the amino acid isoleucine (Ile) is found at that position.
M: Percentage at which the amino acid methionine (Met) is found at that position.
F: Percentage at which the amino acid phenylalanine (Phe) is found at that position.
W: Percentage at which the amino acid tryptophan (Trp) is found at that position.
Y: Percentage at which the amino acid tyrosine (Tyr) is found at that position.
A: Percentage at which the amino acid alanine (Ala) is found at that position.
G: Percentage at which the amino acid glycine (Gly) is found at that position.
P: Percentage at which the amino acid proline (Pro) is found at that position.
S: Percentage at which the amino acid serine (Ser) is found at that position.
T: Percentage at which the amino acid threonine (Thr) is found at that position.
C: Percentage at which the amino acid cysteine (Cys) is found at that position.
Q: Percentage at which the amino acid glutamine (Gln) is found at that position.
N: Percentage at which the amino acid asparagine (Asn) is found at that position.
H: Percentage at which the amino acid histidine (His) is found at that position.
R: Percentage at which the amino acid arginine (Arg) is found at that position.
K: Percentage at which the amino acid lysine (Lys) is found at that position.
E: Percentage at which the amino acid glutamate (Glu) is found at that position.
D: Percentage at which the amino acid aspartate (Asp) is found at that position.
NOCC: Number of aligned sequences spanning this position (including the test sequence).
NDEL: Number of sequences with a deletion in the test protein at this position.
NINS: Number of sequences with an insertion in the test protein at this position.
ENTROPY: Entropy measure of sequence variability at this position.
RELENT: Relative entropy, i.e. entropy normalized to the range 0-100.
WEIGHT: Conservation weight.
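
To make the relation between the percentage columns and the ENTROPY and RELENT columns more concrete, the following sketch computes a Shannon entropy from the percentages of one toy position. The percentages are invented, and both the natural logarithm and the normalization by log(20) are assumptions; the exact convention used in HSSP files should be checked against the HSSP format description.

```r
# Toy percentages for one position (invented values, summing to 100).
aa <- c("V", "L", "I", "M", "F", "W", "Y", "A", "G", "P",
        "S", "T", "C", "Q", "N", "H", "R", "K", "E", "D")
pct <- setNames(rep(0, 20), aa)
pct[c("L", "I", "V", "M")] <- c(50, 25, 15, 10)

# Shannon entropy, H = -sum(p * log(p)), skipping residues never observed.
shannon <- function(pct) {
  p <- pct / sum(pct)
  p <- p[p > 0]
  -sum(p * log(p))
}

H <- shannon(pct)

# Assumed normalization: divide by the maximum entropy for 20 residue types.
relent <- 100 * H / log(20)
round(c(ENTROPY = H, RELENT = relent), 3)
```

Under these assumptions, a position occupied by a single residue type has entropy 0, and a position where all 20 residues are equally frequent has a relative entropy of 100.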

We can get a visual impression of the most variable and the most conserved positions by plotting the relative entropy as a function of the position:

plot(profile$SeqNo, profile$RELENT, type = 'h', xlab = 'Position', ylab = 'Relative Entropy')
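
A numeric counterpart to the plot is to rank the positions by relative entropy. The sketch below uses a toy dataframe with invented values and only three of the columns described above; with the real ‘profile’ object the same two lines of ranking code should apply.

```r
# Toy stand-in for the 'profile' dataframe (invented values).
toy_profile <- data.frame(
  SeqNo  = 1:6,
  PDBNo  = 23:28,
  RELENT = c(12, 0, 87, 45, 87, 3)
)

# Order rows by decreasing relative entropy; ties keep their original order.
ranked <- toy_profile[order(-toy_profile$RELENT), ]

head(ranked, 3)  # the three most variable positions
tail(ranked, 1)  # the most conserved position
```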

The most variable position can be located and inspected as follows:

# which.max() returns a single index (the first one in case of ties)
maxS_at <- which.max(profile$ENTROPY)
x <- as.data.frame(t(profile[maxS_at, 3:22]))
x$col <- c(rep("orange", 8), rep("purple", 2), rep("green", 5), rep("blue", 3), rep("red",2))
names(x) <- c('frequency', 'col')

barplot(height = x$frequency,
        names = rownames(x),
        col = x$col,
        main = paste("Position:", profile$PDBNo[maxS_at]))

Here, we have colored the amino acids according to their physicochemical nature: acidic (E, D) in red, basic (H, R, K) in blue, hydrophobic (V, L, I, M, F, W, Y, A) in orange, polar (S, T, C, Q, N) in green and special (G, P) in purple. We observe that, except for Trp and Cys, any amino acid can be found at this position.

In contrast, the most conserved position is:

# which.min() returns a single index (the first one in case of ties)
minS_at <- which.min(profile$ENTROPY)
x <- as.data.frame(t(profile[minS_at, 3:22]))
x$col <- c(rep("orange", 8), rep("purple", 2), rep("green", 5), rep("blue", 3), rep("red",2))
names(x) <- c('frequency', 'col')

barplot(height = x$frequency,
        names = rownames(x),
        col = x$col,
        main = paste("Position:", profile$PDBNo[minS_at]))

where phenylalanine is the only amino acid present!
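
Since the two barplots differ only in the row selected, the shared steps can be wrapped in a function. The following is a sketch: aa_table() and plot_position() are hypothetical helpers, not ptm functions, and they assume that columns 3 to 22 hold the amino acid percentages in the order listed above. The one-row dataframe is invented for illustration.

```r
# One colour per amino acid, in the column order
# V L I M F W Y A G P S T C Q N H R K E D.
aa_cols <- c(rep("orange", 8), rep("purple", 2), rep("green", 5),
             rep("blue", 3), rep("red", 2))

# Hypothetical helper: frequencies and colours for row i of a profile-like
# dataframe whose columns 3:22 hold the amino acid percentages.
aa_table <- function(profile, i) {
  data.frame(aa = names(profile)[3:22],
             frequency = as.numeric(unlist(profile[i, 3:22])),
             col = aa_cols,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: barplot for row i, titled with its PDB residue number.
plot_position <- function(profile, i) {
  x <- aa_table(profile, i)
  barplot(height = x$frequency, names.arg = x$aa, col = x$col,
          main = paste("Position:", profile$PDBNo[i]))
}

# Toy one-row profile (invented): a position occupied only by Phe.
aa <- c("V", "L", "I", "M", "F", "W", "Y", "A", "G", "P",
        "S", "T", "C", "Q", "N", "H", "R", "K", "E", "D")
toy <- cbind(data.frame(SeqNo = 1, PDBNo = 208),
             as.data.frame(t(setNames(rep(0, 20), aa))))
toy$F <- 100

aa_table(toy, 1)
```

With the real data, plot_position(profile, maxS_at) and plot_position(profile, minS_at) would stand in for the two blocks above.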

The 3D structure of the human protein is shown below. The conserved Phe208 and the highly variable position 360, which in the reference protein (PDB ID: 3CWM) is occupied by Ile, are marked.

In addition to this dataframe we have called ‘profile’, we can access, if we wish, the alignment itself:

load("./3cwm_aln.Rda")
dim(aln)

## [1]  370 1205

This dataframe, which we have placed in an object named ‘aln’, has 370 rows (one per residue) and 1205 columns. The first eight columns are:

SeqNo: Sequence residue number.
PDBNo: PDB residue number.
Chain: Chain identifier.
AA: Amino Acid at that position in the reference sequence.
SS: Element of secondary structure.
ACC: Solvent accessible area.
NOCC: Number of aligned sequences spanning this position (including the reference sequence).
VAR: Sequence variability on a scale of 0-100 as derived from the number of sequences aligned.

The ninth column (named in this example ‘P01009’) gives the reference sequence, while the remaining columns provide the sequences of the other proteins included in the alignment. These columns are named with the UniProt ID of the corresponding protein.
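
Because of this layout (eight annotation columns followed by one column per sequence), the sequences can be separated from the annotations by column index. The sketch below uses a toy dataframe with invented values; the identifiers Q00001 and Q00002 are placeholders, not real entries of the 3cwm alignment.

```r
# Toy stand-in for 'aln' (invented): 3 residues, 8 annotation columns
# followed by one column per aligned sequence.
toy_aln <- data.frame(
  SeqNo = 1:3, PDBNo = 23:25, Chain = "A", AA = c("F", "N", "K"),
  SS = c("H", "H", "H"), ACC = c(10, 55, 80),
  NOCC = c(3, 3, 3), VAR = c(0, 20, 40),
  P01009 = c("F", "N", "K"),   # reference sequence
  Q00001 = c("F", "D", "R"),   # placeholder identifier
  Q00002 = c("F", "N", "K"),   # placeholder identifier
  stringsAsFactors = FALSE
)

# Keep only the sequence columns (ninth column onwards).
seqs <- toy_aln[, 9:ncol(toy_aln)]
ncol(seqs)  # number of sequences in the alignment

# Residue counts observed at the second position across all sequences.
table(unlist(seqs[2, ]))
```

In the 3cwm example, the same subtraction gives 1205 - 8 = 1197 sequences.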

Information regarding the metadata per sequence, and some alignment statistics, can be found in a third dataframe:

load("./3cwm_seq_list.Rda")
head(seq_list)

##   NR           ID  IDE WSIM IFIR ILAS JFIR JLAS LALI NGAP LGAP LESEQ2     ACCNUM
## 1  1   A1AT_HUMAN 1.00 1.00    1  370   48  417  370    0    0    418     P01009
## 2  2 E9KL23_HUMAN 1.00 1.00    1  370   48  417  370    0    0    418     E9KL23
## 3  3 A0A024R6I7_H 0.99 0.99    1  370   48  417  370    0    0    418 A0A024R6I7
## 4  4 A0A0G2JRN3_H 0.99 0.99    1  308   48  355  308    0    0    359 A0A0G2JRN3
## 5  5 A0A2J8QMJ1_P 0.99 0.99    1  370   48  417  370    0    0    418 A0A2J8QMJ1
## 6  6 A0A2J8QMJ5_P 0.99 0.99    1  308   48  355  308    0    0    359 A0A2J8QMJ5
##                                                                                           PROTEIN
## 1                               Alpha-1-antitrypsin OS=Homo sapiens OX=9606 GN=SERPINA1 PE=1 SV=3
## 2 Epididymis secretory sperm binding protein Li 44a OS=Homo sapiens OX=9606 GN=SERPINA1 PE=2 SV=1
## 3                               Alpha-1-antitrypsin OS=Homo sapiens OX=9606 GN=SERPINA1 PE=1 SV=1
## 4                               Alpha-1-antitrypsin OS=Homo sapiens OX=9606 GN=SERPINA1 PE=1 SV=1
## 5                             SERPINA1 isoform 1 OS=Pan troglodytes OX=9598 GN=SERPINA1 PE=3 SV=1
## 6                      SERPINA1 isoform 19 OS=Pan troglodytes OX=9598 GN=CK820_G0025082 PE=3 SV=1

The number of rows, in our example, is 1197 (one per sequence included in the alignment). The variables (columns) held in this dataframe are:

NR: Sequence number.
ID: EMBL/SWISSPROT identifier of the aligned (homologous) protein.
IDE: Percentage of residue identity of the alignment.
WSIM: Weighted similarity of the alignment.
IFIR: First residue of the alignment in the test sequence.
ILAS: Last residue of the alignment in the test sequence.
JFIR: First residue of the alignment in the aligned protein.
JLAS: Last residue of the alignment in the aligned protein.
LALI: Length of the alignment excluding insertions and deletions.
NGAP: Number of insertions and deletions in the alignment.
LGAP: Total length of all insertions and deletions.
LSEQ2: Length of the entire sequence of the aligned protein.
ACCNUM: SwissProt accession number.
PROTEIN: One-line description of aligned protein.
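
These columns make it straightforward to select subsets of homologues. As a sketch, the following rebuilds a few columns of the six rows printed above and keeps the sequences that are nearly identical to the reference and aligned over all 370 residues. The thresholds are arbitrary choices for illustration.

```r
# First rows of 'seq_list' as printed above (subset of columns).
seq_list_head <- data.frame(
  NR = 1:6,
  ID = c("A1AT_HUMAN", "E9KL23_HUMAN", "A0A024R6I7_H",
         "A0A0G2JRN3_H", "A0A2J8QMJ1_P", "A0A2J8QMJ5_P"),
  IDE = c(1.00, 1.00, 0.99, 0.99, 0.99, 0.99),
  LALI = c(370, 370, 370, 308, 370, 308),
  ACCNUM = c("P01009", "E9KL23", "A0A024R6I7",
             "A0A0G2JRN3", "A0A2J8QMJ1", "A0A2J8QMJ5"),
  stringsAsFactors = FALSE
)

# Keep homologues with high identity that cover all 370 residues.
full <- subset(seq_list_head, IDE >= 0.99 & LALI == 370)
full$ACCNUM
```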

Finally, a fourth dataframe, named ‘insertions’, can be accessed:

load("./3cwm_insertions.Rda")
inser

##     AliNo IPOS JPOS Len                                                Sequence
## 1      69  252  302   1                                                     nTs
## 2      85  355  397   1                                                     rQt
## 3      97  355  397   1                                                     rQt
## 4     121  298  348   1                                                     gVt
## 5     163  169  219  53 kVSIATAFAMPSLGAKGDARTEIMKALGYNSKKALNADVHGGVHHLLDISIRQDg
## 6     164  298  362   1                                                     gIt
## 7     165  298  339   1                                                     gIt
## 8     166  298  339   1                                                     gIt
## 9     168  298  339   1                                                     gIt
## 10    169  298  339   1                                                     gIt
## 11    170  298  339   1                                                     gIt
## 12    171  298  338   1                                                     gIt
## 13    174  298  339   1                                                     gIt
## 14    176  298  339   1                                                     gIt
## 15    177  298  339   1                                                     gIt
## 16    178  298  339   1                                                     gIt
## 17    198  298  339   1                                                     gIt
## 18    207  336  387   2                                                    mHCw
## 19    213  336  339   3                                                   gASMd
## 20    217  211  277   2                                                    rDNe
## 21    222  336  477   5                                                 gSCFSYv
## 22    230  336  348   4                                                  lSAPFp
## 23    233  298  333   1                                                     gVt
## 24    234  336  393   3                                                   gASMd
## 25    250  336  448   6                                                xXXALLYv
## 26    250  339  457   8                                              gSCSYAAPPp
## 27    253  252  287   1                                                     nRr
## 28    256  336  391   4                                                  mSAPId
## 29    258  336  392   4                                                  fSAEFp
## 30    260  336  348   4                                                  lSAQFp
## 31    262  336  348   4                                                  lSAGFp
## 32    268  336  336   4                                                  rSGDFp
## 33    270  336  348   4                                                  mSAEIp
## 34    271  336  348   4                                                  lSAEFp
## 35    273  336  348   4                                                  ySIEIp
## 36    276  336  348   4                                                  lSAEIp
## 37    278  336  346   4                                                  lSAEFp
## 38    281  336  348   4                                                  lTAEFp
## 39    282  336  338   4                                                  rSGDFp
## 40    284  336  336   4                                                  rSGDFp
## 41    285  336  348   4                                                  mSAEFp
## 42    286  336  348   4                                                  lSALFp
## 43    288  336  348   4                                                  lSAEFp
## 44    290  336  346   4                                                  lSAMFp
## 45    291  336  335   4                                                  rSGDFp
## 46    294  336  348   4                                                  lSRRFp
## 47    297  336  348   4                                                  lSAVIp
## 48    300  336  348   4                                                  mSLPIp
## 49    301  336  348   4                                                  mSAQIp
## 50    306  200  244   1                                                     sIe
## 51    306  252  297   1                                                     kWk
## 52    306  336  382   3                                                   tSAKl
## 53    307  260  362   2                                                    sEDl
## 54    309  336  392   4                                                  rSGDFp
## 55    317  308  369   2                                                    sKVr
## 56    322   24   70   2                                                    dGKs
## 57    323   24   90   2                                                    dGEs
## 58    326  308  329   2                                                    sKVr
## 59    328  200  203   1                                                     tQk
## 60    328  336  340   4                                                  kSAPFi
## 61    330  336  383   4                                                  vCQRNr
## 62    331  336  387   3                                                   fSAMv
## 63    334  336  392   4                                                  rSGQIs
## 64    336  226  277   2                                                    rNVa
## 65    336  252  305   1                                                     rLe
## 66    336  260  314   1                                                     rSr
## 67    340  336  336   4                                                  rSGDFp
## 68    342  336  348   4                                                  fLLPFp
## 69    343  336  348   4                                                  mSAVIp
## 70    344  336  346   4                                                  vSALSp
## 71    345  336  348   4                                                  lSAAYp
## 72    347  336  336   4                                                  kSGDIp
## 73    348   24   36   1                                                     gNn
## 74    348  336  346   4                                                  mSAHFh
## 75    349  336  348   4                                                  lLASFp
## 76    351  336  336   4                                                  rSGDFp
## 77    353  336  343   4                                                  lSAMFp
## 78    354  336  348   4                                                  lSALFp
## 79    355  336  348   4                                                  lSAEFp
## 80    356  336  348   4                                                  mSAQIp
## 81    362  336  336   4                                                  rSGDFp
## 82    363  336  346   4                                                  lSAAFp
## 83    365  336  348   4                                                  lSAQFp
## 84    366  336  348   4                                                  lSAEFp
## 85    367  336  332   4                                                  rSGDFp
## 86    369  211  291   1                                                     dEe
## 87    369  260  341   1                                                     rWv
## 88    369  336  418   4                                                  mSGKIg
## 89    372  336  386   4                                                  mSAPFp
## 90    374  336  405   3                                                   fSAMi
## 91    375  211  253   1                                                     dEe
## 92    375  260  303   1                                                     rLi
## 93    375  336  380   1                                                     lSs
## 94    376  308  317   2                                                    sKVr
## 95    379  169  208  13                                         kGKIIFILNHIPSTh
## 96    380  336  389   4                                                  mSFSSp
## 97    381  336  392   4                                                  rSGDFp
## 98    383  336  414   4                                                  rSGDFp
## 99    384  336  336   4                                                  rSGDFp
## 100   387  336  348   5                                                 mSSMFEt
## 101   388  336  331   4                                                  rSGDFp
## 102   389  336  327   4                                                  rSGDFp
## 103   390  336  348   6                                                mSAGPASp
## 104   391  336  332   4                                                  rSGDFp
## 105   392  336  348   4                                                  tSLQLp
## 106   394  336  426   4                                                  rSGDFp
## 107   395  336  384   1                                                     aLs
## 108   396  336  404   4                                                  rSGDFp
## 109   397  336  397   4                                                  rSGDFs
## 110   399  211  260   1                                                     dEe
## 111   399  260  310   1                                                     rRi
## 112   399  336  387   1                                                     sTl
## 113   400  336  386   6                                                tSAFAEFs
## 114   402  336  374   3                                                   rSARl
## 115   403  211  257   1                                                     dEe
## 116   403  260  307   1                                                     rHi
## 117   403  336  384   4                                                  lSGLVd
## 118   404  211  257   1                                                     dEe
## 119   404  260  307   1                                                     rWi
## 120   404  336  384   4                                                  lSAKVg
## 121   405  211  257   1                                                     dEe
## 122   405  260  307   1                                                     rHi
## 123   405  336  384   4                                                  lSALVd
## 124   406  336  382   3                                                   kSAPm
## 125   407  336  382   3                                                   kSAPm
## 126   408  127  168   1                                                     gAt
## 127   408  211  253   1                                                     dEv
## 128   408  260  303   1                                                     rLi
## 129   408  336  380   4                                                  tSSRNg
## 130   409  336  378   4                                                  mSLPFp
## 131   411  211  254   1                                                     dEe
## 132   411  260  304   1                                                     rRi
## 133   411  336  381   1                                                     tSi
## 134   414  211  263   1                                                     dEe
## 135   414  260  313   1                                                     rGi
## 136   414  336  390   2                                                    tSIn
## 137   415  211  254   1                                                     dEe
## 138   415  260  304   1                                                     rGi
## 139   415  336  381   2                                                    tSIn
## 140   416  211  304   1                                                     dEe
## 141   416  260  354   1                                                     rGi
## 142   416  336  431   4                                                  tSINNh
## 143   418  336  384   1                                                     aLs
## 144   419  336  385   3                                                   fSAMi
## 145   420  211  255   1                                                     dEe
## 146   420  260  305   1                                                     rRi
## 147   420  336  382   1                                                     sTl
## 148   421  211  248   1                                                     dKw
## 149   421  336  374   3                                                   lSAPk
## 150   422  169  219   2                                                    kGKa
## 151   423  211  250   1                                                     dEe
## 152   423  260  300   1                                                     rRi
## 153   424  336  343   2                                                    rSGr
## 154   425  211  257   1                                                     dKe
## 155   425  336  383   4                                                  lSAKMg
## 156   427  169  212   1                                                     kEa
## 157   427  336  380   4                                                  rSGDFp
## 158   428  211  254   1                                                     dEe
## 159   428  260  304   1                                                     rRi
## 160   428  336  381   1                                                     tSi
## 161   429  336  374   3                                                   rSARl
## 162   430  211  257   1                                                     dEe
## 163   430  260  307   1                                                     rEi
## 164   430  336  384   4                                                  lSALVe
## 165   431  336  324   4                                                  rSGDFp
## 166   432  336  336   4                                                  rSGDFp
## 167   433  336  324   4                                                  rSGEFp
## 168   434  336  333   4                                                  rSGDFp
## 169   435  336  332   4                                                  rSGDFp
## 170   437  336  348   4                                                  lSARFp
## 171   438  336  329   4                                                  rSGEFp
## 172   440  336  336   4                                                  rSGDFp
## 173   441  336  374   3                                                   rSARl
## 174   442  211  282   1                                                     dEe
## 175   442  260  332   1                                                     rHi
## 176   442  336  409   3                                                   lSGFv
## 177   442  339  415   1                                                     pKi
## 178   443  211  282   1                                                     dEe
## 179   443  260  332   1                                                     rHi
## 180   443  336  409   3                                                   lSAFv
## 181   443  339  415   1                                                     pKi
## 182   444  211  265   1                                                     dEe
## 183   444  260  315   1                                                     rRi
## 184   444  336  392   3                                                   tSGKi
## 185   445  336  392   4                                                  rSGDFp
## 186   446  336  383   1                                                     vHq
## 187   447  211  255   1                                                     dEe
## 188   447  260  305   1                                                     rVi
## 189   447  336  382   3                                                   kSRKw
## 190   448  336  383   4                                                  lFAAFp
## 191   450  211  260   1                                                     dEe
## 192   450  260  310   1                                                     rRi
## 193   450  336  387   1                                                     tSi
## 194   451  211  300   1                                                     dEe
## 195   451  260  350   1                                                     rRi
## 196   451  339  427   4                                                  tSIEFl
## 197   452  336  383   3                                                   rSARl
## 198   453  211  257   1                                                     dEe
## 199   453  260  307   1                                                     rEi
## 200   453  336  384   4                                                  lSALVe
##  [ reached 'max' / getOption("max.print") -- omitted 1757 rows ]

Further details regarding the information provided by this dataframe can be found in the HSSP format description.
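
A long table like this is easier to read once summarized. The sketch below rebuilds the first six rows printed above and counts how many insertions start at each position. It assumes that IPOS refers to the position in the reference sequence, which should be confirmed in the HSSP format description.

```r
# First rows of 'inser' as printed above.
inser_head <- data.frame(
  AliNo = c(69, 85, 97, 121, 163, 164),
  IPOS  = c(252, 355, 355, 298, 169, 298),
  JPOS  = c(302, 397, 397, 348, 219, 362),
  Len   = c(1, 1, 1, 1, 53, 1),
  Sequence = c("nTs", "rQt", "rQt", "gVt",
               "kVSIATAFAMPSLGAKGDARTEIMKALGYNSKKALNADVHGGVHHLLDISIRQDg",
               "gIt"),
  stringsAsFactors = FALSE
)

# Number of insertions starting at each IPOS value.
ipos_counts <- table(inser_head$IPOS)
sort(ipos_counts, decreasing = TRUE)

# In these rows the Sequence string is two characters longer than Len,
# i.e. it appears to include the two flanking (lowercase) residues.
all(nchar(inser_head$Sequence) == inser_head$Len + 2)
```

Applied to the full ‘inser’ dataframe, the same table() call would show where insertions tend to cluster.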

Please note that if the argument ‘keepfiles’ is set to FALSE, only the dataframe ‘profile’ will be returned, and the hssp file will be deleted from your machine.
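
Because the input file is removed in that case, one cautious pattern is to parse a temporary copy and leave the original untouched. This is a sketch only: copy_for_parsing() is a hypothetical helper, not a ptm function, and the demonstration uses a placeholder file rather than a real HSSP file, so the parse.hssp() call is shown as a comment.

```r
# Hypothetical helper: copy a file into a fresh temporary directory and
# return the path of the copy, leaving the original where it is.
copy_for_parsing <- function(file, dest_dir = tempfile("hssp_copy_")) {
  stopifnot(file.exists(file))
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, basename(file))
  if (!file.copy(file, dest, overwrite = TRUE)) stop("could not copy ", file)
  dest
}

# Demonstration with a placeholder file (not a real HSSP file).
original <- tempfile(fileext = ".hssp")
writeLines("placeholder", original)
work_copy <- copy_for_parsing(original)

# With ptm installed and a real HSSP file, the copy could then be parsed:
# profile <- ptm::parse.hssp(file = work_copy, keepfiles = FALSE)

file.exists(original)  # the original is still in place
```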