Description
Aligns multiple protein sequences.
Usage
msa(sequences, ids = names(sequences), sfile = FALSE, inhouse = FALSE)
Arguments
sequences
vector containing the sequences.
ids
vector containing the sequences’ ids.
sfile
path to the file where the fasta alignment should be saved, if any.
inhouse
logical, if TRUE the in-house MUSCLE software is used. It must be installed on your system and in the search path for executables.
Value
Returns a list of four elements. The first one (seq) provides the sequences analyzed, the second element (ids) returns the identifiers, the third element (aln) provides the alignment in fasta format and the fourth element (ali) gives the alignment in matrix format.
References
Edgar RC. Nucleic Acids Res. 2004 32:1792-1797.
H. Pagès, P. Aboyoun, R. Gentleman and S. DebRoy (2019). Biostrings: Efficient manipulation of biological strings. R package version 2.52.0.
Edgar RC. BMC Bioinformatics. 2004 5(1):113.
See Also
custom.aln(), list.hom(), parse.hssp(), get.hssp(), shannon(), site.type()
Details
Multiple sequence alignment (MSA) is generally the alignment of three or more biological sequences. From the output, homology can be inferred and the evolutionary relationships between the sequences can be studied. Thus, alignment is the most important stage in most evolutionary analyses. In addition, MSA is an essential tool for protein structure and function prediction. The package ptm offers several functions that will assist you in the process of sequence analysis:
msa (the current document)
custom.aln
list.hom
parse.hssp
get.hssp
shannon
site.type
The function msa() carries out MSAs either taking advantage of the functionalities of Biostrings or, alternatively, making use of the program MUSCLE. In the first case, you must have installed the R package Biostrings. To install that package, start R and enter:
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("Biostrings")
Alternatively, if you have previously installed MUSCLE on your machine, msa() can call this software to carry out the alignment when you pass the argument ‘inhouse = TRUE’. MUSCLE is a fast multiple sequence alignment program available from the MUSCLE home page. Details to guide you through the installation of MUSCLE can be found here.
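Before choosing between the two modes, it can help to check whether R can actually see a MUSCLE executable. The following is a minimal base-R sketch, not part of ptm; it assumes the executable is called "muscle" (the name that appears in the bio3d call printed in the example output), and the variable names muscle_path and use_inhouse are purely illustrative.

```r
# Look for an executable named "muscle" in the search path.
# Sys.which() returns an empty string when nothing is found.
# Assumption: the program is installed under the name "muscle".
muscle_path <- Sys.which("muscle")
use_inhouse <- unname(nzchar(muscle_path))

if (use_inhouse) {
  message("MUSCLE found at: ", muscle_path)
} else {
  message("MUSCLE not found in the search path; keep inhouse = FALSE.")
}
```

The resulting flag could then be passed on, for instance as msa(sequences, ids, inhouse = use_inhouse).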
Let’s see msa() in action. To this end, we will use as a case study the protein COX3 (subunit 3 from the Cytochrome c Oxidase Complex) that will help to illustrate the relevance of epistatic effects on protein evolution.
Leber’s hereditary optic neuropathy (LHON) is a degeneration of the retinal gangliocytes and their axons, inherited mitochondrially (from the mother to all her children), leading to an acute or subacute loss of central vision. LHON is only transmitted through the mother since it is mainly due to mutations in the mitochondrial genome (not the nuclear one) and only the egg contributes mitochondria to the embryo. The pathogenic A32 to T32 mutation (change from alanine to threonine at position 32) in the COX3 protein has been related to LHON.
We can check that an alanine residue is indeed found at position 32 in the human protein:
aa.at(at = 32, target = 'P00414')
## [1] "A"
Next we will obtain the COX3 sequence for human, bonobo, chimp, gorilla and orangutan (Hominidae family) and carry out the MSA using msa():
sequences <- sapply(c('P00414', 'E0XI88', 'Q9T9V9', 'Q9T9Y6', 'P92696' ), ptm::get.seq)
ids <- c('human', 'bonobo', 'chimpazee', 'gorilla', 'orangutan')
msa(sequences, ids, inhouse = TRUE)
##             1        .         .         .         .         .         60
## human       MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSMTLLMLGLLTNTLTMYQWWRD
## bonobo      MAHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRD
## chimpazee   MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRD
## gorilla     MIHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSTTLLMLGLLTNMLTMYQWWRD
## orangutan   MAHQSHAYHMVKPSPWPLTGALSALLTTSGLTMWFHFHSTTLLLTGLLTNALTMYQWWRD
##             * ************************ **** ***** * ***  ***** *********
##             1        .         .         .         .         .         60
##
##             61       .         .         .         .         .         120
## human       VTRESTYQGHHTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTG
## bonobo      VMRESTYQGHHTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTG
## chimpazee   VMREGTYQGHHTPPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTG
## gorilla     VMRESTYQGHHTLPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGAHWPPTG
## orangutan   VVRESTYQGHHTLPVQKGLRYGMILFITSEVFFFAGFFWAFYHSSLAPTPQLGGHWPPTG
##             * ** ******* ****************************************^******
##             61       .         .         .         .         .         120
##
##             121      .         .         .         .         .         180
## human       ITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASE
## bonobo      ITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASE
## chimpazee   ITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASE
## gorilla     ITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMIQALLITILLGLYFTLLQASE
## orangutan   IIPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRTQMIQALLITILLGIYFTLLQASE
##             * ********************************** *************^*********
##             121      .         .         .         .         .         180
##
##             181      .         .         .         .         .         240
## human       YFESPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTICFIRQLMFHFTSKHHFGFEAAAW
## bonobo      YFESPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTICLIRQLMFHFTSKHHFGFEAAAW
## chimpazee   YFESPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTICLIRQLMFHFTSKHHFGFQAAAW
## gorilla     YFEAPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTICLIRQLMFHFTSKHHFGFEAAAW
## orangutan   YIEAPFTISDGIYGSTFFMATGFHGLHVIIGSTFLTVCLARQLLFHFTSKHHFGFEAAAW
##             * * **************^*****************^*  ***^*********** ****
##             181      .         .         .         .         .         240
##
##             241      .         .261
## human       YWHFVDVVWLFLYVSIYWWGS
## bonobo      YWHFVDVVWLFLYVSIYWWGS
## chimpazee   YWHFVDVVWLFLYVSIYWWGS
## gorilla     YWHFVDVVWLFLYVSIYWWGS
## orangutan   YWHFVDVVWLFLYVSIYWWGS
##             *********************
##             241      .         .261
##
## Call:
##   bio3d::seqaln(aln = sqs, id = ids, exefile = "muscle")
##
## Class:
##   fasta
##
## Alignment dimensions:
##   5 sequence rows; 261 position columns (261 non-gap, 0 gap)
##
## + attr: id, ali, call, seq
Which amino acid has been fixed at position 32 in the orangutan wild-type sequence? Yes, threonine! Thus, while a threonine at this position causes a disease in humans, in the genetic context of orangutans, T32 is fine!
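The same observation can be checked programmatically rather than by eye. As a minimal sketch in base R, the block below copies the first 60 alignment columns from the output above, splits them into a character matrix with one row per species and one column per alignment position, and reads column 32. Treating this matrix as a stand-in for the ali element is an assumption made for illustration; the object names first60 and ali_sketch are hypothetical and not part of ptm.

```r
# First 60 alignment columns, copied from the MSA output shown above.
first60 <- c(
  human      = "MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSMTLLMLGLLTNTLTMYQWWRD",
  bonobo     = "MAHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRD",
  chimpanzee = "MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFYSTTLLTLGLLTNTLTMYQWWRD",
  gorilla    = "MIHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSTTLLMLGLLTNMLTMYQWWRD",
  orangutan  = "MAHQSHAYHMVKPSPWPLTGALSALLTTSGLTMWFHFHSTTLLLTGLLTNALTMYQWWRD"
)

# One row per species, one column per alignment position
# (an illustrative stand-in for a matrix-format alignment).
ali_sketch <- do.call(rbind, strsplit(first60, split = ""))

# Residue found at alignment position 32 in each species.
ali_sketch[, 32]
# human: "A", bonobo: "A", chimpanzee: "A", gorilla: "A", orangutan: "T"
```

Because this alignment contains no gaps, alignment column 32 coincides with residue 32 of every sequence; with gapped alignments the two numberings can differ.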