gorilla()

Description

Performs GO terms enrichment analyses

Usage

gorilla(target, background = NULL, mode = 'mhg', db = 'proc', pv = 0.001, spe = NULL)

Arguments

target path to the txt file containing (one per line) the UniProt id of the proteins belonging to the target set.

background path to the txt file containing (one per line) the UniProt id of the proteins belonging to the background set.

mode a character string specifying the desired analysis mode; it must be one of ‘mhg’ (identifies enriched GO terms in ranked lists), ‘hg’ (identifies enriched GO terms in the target set compared to the background set).

db a character string specifying the chosen ontology; it must be one of ‘proc’ (biological process), ‘func’ (molecular function), ‘comp’ (cellular component), ‘all’ (all the three previous ontologies).

pv a numeric value for the p-value threshold. Only GO terms with a p-value better than this threshold are reported.

spe a character string specifying the organism of interest. The species supported by GOrilla are: (Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Homo sapiens, Mus musculus, Rattus norvegicus)

Value

Returns either a dataframe with the enrichment results if a single ontology has been selected, or a list with three dataframe if the three ontologies were selected.

References

Rhee, Wood, Dolinkski & Draghici, Nature Reviews Genetics 2008, 9:509–515.

Eden et al. (2009) BMC Bioinformatics 10:48.

Details

The Gene Ontolory project (GO) provides a controlled vocabulary to describe gene and gene product attributes. Thus, a GO annotation is a statement describing some biological aspect. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about three disjoint categories: cellular component, molecular function and biological process. In this way, GO terms, using a controlled and hierarchical vocabulary, help to describe how a gene functions at the molecular level, where in the cell it functions, and what biological processes it helps to carry out.

The use of this vocabulary (GO annotations) has diverse applications, perhaps the most popular of them is the functional profiling. The goal of functional profiling is to determine which processes might be different in particular sets of genes (or gene products). That is, GO annotations are used to determine which biological processes, functions, and/or locations are significantly over- or under-represented in a given group of genes (or gene products). This approach is also known as GO term enrichment analysis.

Whenever we have a set of genes that are differentially expressed between different conditions (for example, cancerous versus healthy), we can apply GO enrichment analyses. For the sake of concretion let’s build some toy data that may be useful to illustrate the theory behind the GO term enrichment analysis. So, suppose we have analyzed the expression of 20 genes (N = 20) in cancerous and healthy cells. Let’s continue imagining: among these 20 genes, 5 of them are differentially expressed between the two conditions. This set of differentially expressed genes constitutes our target set (T), the remaining 15 genes belong to the complementary set ( $T^c$ ). Please, note that the union of these two sets gives the background set.

At this point, we must be aware that two types of questions can be addressed when performing functional profiling:

Hypothesis-generating query.
Hypothesis-driven query.

In the former (hypothesis-generating query) we assess which GO terms are significant in our target set, while in the later (hypothesis-driven query) we evaluate whether, for instance, apoptosis (or any other preselected process, function or component) is significantly enriched or depleted in our target set. Let’s start with the latter.

Thus, we start computing how many of the 20 genes forming the background set (B), are related to apoptosis and how many of the 5 differentially expressed genes are linked to apoptosis. Suppose, that after performing that computations we obtain the following result:

We can observe that the proportion of genes related to apoptosis within the target set is (3/5 = 0.6) much higher than that in the background set (5/20 = 0.25), but …, what can we conclude? The main problem here is that any enrichment value can occur just by chance. That is, we need an appropriate statistical model. Thus, we define our random variable X as the number of genes in the target set that are related to apoptosis. Now, as a first approach, we can assume that X follows a binomial distribution with parameters p = 7/20 (probability of randomly pick an apoptotic gene from the background set) and n = 5 (cardinal of the target set). That is, $X \sim Bin(p = 0.35, n = 5)$ . Therefore, the probability of getting just by random three apoptotic genes in the target set would be: $P[X = 3] = \binom{5}{3} 0.35^3 0.65^2 = 0.181$ . If we want to estimate the p-value, using the binomial model, we have to estimate the following probability:

1- pbinom(2, size = 5, p = 0.35)

## [1] 0.2351694

$p-value = P[X \geq 3] = 0.235$

A valid critic to the model we have chosen (the binomial one) is that the binomial model assumes sampling with replacement, that is, that the probability of picking from the background set a second gene (or protein) related to apoptosis once we have drawn a first apoptotic gene remain constant. However, such an assumption may not be acceptable, especially when the cardinal of the background set is not very high. For instance, in our toy example, the probability of picking an apoptotic gene is 7/20, but the probability of obtaining a second apoptotic gene is 6/19, and for a third one would be 5/18. Under these circumstances, a better suited model is the hypergeometric distribution $(X \sim HG(N, K, n))$ that describes the number of success (number of apoptotic genes in the target set) in a sequence of $n$ draws (cardinal of the target set) from a finite population (the background set of cardinal N) without replacement. Now, using this HG model we are going to compute again the probability of getting 3 apoptotic genes in our target set just by chance, to this end we will use the probability mass function of the hypergeometric distribution. $P[X = 3] = \frac{\binom{7}{3} \binom{13}{2}}{\binom{20}{5}}$

dhyper(3, 7, 13, 5)

## [1] 0.1760836

Again, to compute the p-value we have to resort to the cumulative distribution function instead to the probability mass function. That is, $P[X \geq 3] = \sum_{i=3}^5 \frac{\binom{7}{i} \binom{13}{5-i}}{\binom{20}{5}}$

1- phyper(2, 7, 13, 5)

## [1] 0.2067853

In general, if we have summarized the results in a contingency table as the one shown in the figure above, the question we want to answer is: knowing that we have $a + b$ genes related to the process (function or component) of interest in the background set, which is formed by $N$ genes, and assuming the null hypothesis (any observed enrichment is due to chance), what is the probability that the these genes of interest would be so unevenly (or even more) distributed as in our observation? In other words, what is the probability of observing data as extreme or more extreme than our observation? To answer this question we use, as we have already noted, the hypergeometric distribution, although in this context we may refer to the statistic model as the Fisher’s exact test. In any case, the reasoning is as follows: In a population of $N$ genes, we have $a + b$ belonging to the process of interest. Now, if we randomly draw (that is the null hypothesis) $a + c$ genes from the background set, what is the probability that $a$ of them would belong to the process of interest?

$P[X = a] = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{N}{a+c}} = \frac{(a + b)! (c + d)! (a + c)! (b + d)!}{a! b! c! d! N!}$

To obtain the p-value, the probability we have to compute is $P[X \geq a]$ , which can be expressed as $1 - F_X(a -1)$ , where $F_X$ is the cumulative distribution function of the random variable X. Remember that $F_X(x) = P[X \leq x]$ . In R:

#  p-value = 1 - phyper(a-1, a+b, c+d, a+c)

If instead of an enrichment test we would like to perform a depletion test, using the same contingency table we should compute: $P[X \leq a] = F_X(a)$ :

#  p-value = phyper(a, a+b, c+d, a+c)

Hitherto, we have addressed the so-called hypothesis-driven approach. Now, we will present the hypothesis-generating one. In this approach we wonder what GO terms are significantly enriched in our test set. To this end, instead of performing a single test as the one we have described to test for enrichment of apoptotic genes, we carry out multiple tests for different categories of genes. Whenever using this second approach, it is important to correct for the fact that many test performed in parallel will greatly increase the number of false positives in the target set. For instance, in the background set shown in the figure above, suppose that the 20 genes belong to 5 disjoint GO categories. If we pick randomly 5 genes from this background set, the probability of the 5 being from the GO category ‘red’ is

dhyper(5, 7, 13, 5)

this probability is as low as 0.001. However, if we compute the probability of the 5 genes belonging to the same category, we find that this probability has been multiplied by the number of tests we have to performed (as many as colors are being tested).

Thus, it is obvious that we have to make a correction for multiple tests. The simplest correction would multiply the p-values of all terms with the number of parallel test performed. This correction is known as the Bonferroni correction. Bonferroni correction is very stringent, that is, the Bonferroni correction can be too conservative in the sense that while it reduces the number of false positives, it also reduces the number of true discoveries. The False Discovery Rate (FDR) approach is a more recent development. This approach also determines adjusted p-values for each test. However, it controls the number of false discoveries in those tests that result in a discovery (i.e. a significant result). Because of this, it is less conservative that the Bonferroni approach and has greater ability (i.e. power) to find truly significant results. The FDR threshold is often called the q-value. Another way to look at the difference between p-values and q-values is that a p-value of 0.05 implies that 5% of all tests will result in false positives, while q-value of 0.05 implies that 5% of significant tests will result in false positives.

How do we compute the q-value? To illustrate its calculation, suppose we have carried out 55 tests for as many GO categories, and the results obtained, ranked from lowest to highest p-value, were:

We are going to add a new column to this table to shown the “adjusted p-value”, which is computed as the p-value times the number of test divided by the rank:

We have emphasized with red the adjusted p-value lower than our significance level $alpha = 0.05$ with the highest rank number. Next, we assign this value to all the rows with a rank number lower than that:

Q-values can be compute from p-values using the qvalue() function from the qvalue package:

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("qvalue")
# 
# library(qvalue)
# p <- c(0.001, 0.002, 0.003, 0.0035, 0.005) # p-values vector
# qvalue(p = p) # computes the corresponding q-values

After this rather long introduction to GO term enrichment analysis, let’s point out that the package ptm offers several functions that will assist you in the process of GO analysis:

search.go
term.go
get.go
bg.go
hdfisher.go
gorilla (current doc)
net.go

This function is a client of GOrilla, which is a web-based application that identifies enriched GO terms. It can be run in one of two modes:

Searching for enriched GO terms that appear densely at the top of a ranked list of proteins (mode = ‘mgh’).
Searching for enriched GO terms in a target list of genes compared to a background list of genes (mode = ‘gh’).

To illustrate its use, we are going to use two unranked lists of proteins (target and background), so we start getting a text file containing the UniProt ID of the proteins from the background set (all the human proteins with MetO oxidized with hydrogen peroxide), which we will save temporally in the current directory:

sites <- meto.search(organism = 'Homo sapiens', 
                     oxidant = 'hydrogen peroxide')
bs <- unique(sites$prot_id)

for (i in 1:length(bs)){
  cat(paste(bs[i], "\n", sep = ""), 
      file = "./fichero_de_texto_temporal_background.txt", 
      append = TRUE)
}

Then we form the target set (proteins oxidized exclusively in vivo):

vivo <- sites[which(sites$met_vivo_vitro == "vivo"), ]
target <- unique(vivo$prot_id)

for (i in 1:length(target)){
  cat(paste(target[i], "\n", sep = ""),
      file = "./fichero_de_texto_temporal_target.txt", 
      append = TRUE)
}

Finally, we are ready to use the function gorilla(). As stated above, herein we will use the mode ‘hg’, which is for two unranked lists of proteins (target and background), but GOrilla offers us the possibility to carry out an enrichment analysis in a ranked list of proteins (the background is ordered according to the desired criteria, without need of a target set), in that case the mode ‘mhg’ must be selected. Also we can choose the database to be used, which can be either (i): ‘proc’ for biological process, (ii) ‘func’ for molecular function, (iii) ‘comp’ for cellular component, or (iv) ‘all’ for all the three previous ontologies. The parameter ‘pv’ allows us to select the threshold p-value: only GO terms with a p-value better than this threshold will be reported. The default value is 0.001.

gori <- gorilla(target = "./fichero_de_texto_temporal_target.txt", 
                background = "./fichero_de_texto_temporal_background.txt", 
                mode = "hg", 
                db = 'proc', 
                spe = 'Homo sapiens')

The function returns a dataframe with the results ordered from more to less significance:

kable(gori[, 1:9])

GO.Term	Description	P.value	FDR.q.value	Enrichment	N	B	n	b
GO:0006397	mRNA processing	0.00e+00	4.09e-05	1.31	2149	161	1386	136
GO:0008380	RNA splicing	3.00e-07	1.23e-03	1.29	2149	143	1386	119
GO:0006396	RNA processing	3.00e-07	9.89e-04	1.22	2149	243	1386	191
GO:0090304	nucleic acid metabolic process	1.30e-06	2.83e-03	1.13	2149	526	1386	384
GO:0000375	RNA splicing, via transesterification reactions	7.10e-06	1.28e-02	1.28	2149	116	1386	96
GO:0000377	RNA splicing, via transesterification reactions with bulged adenosine as nucleophile	9.40e-06	1.40e-02	1.28	2149	115	1386	95
GO:0000398	mRNA splicing, via spliceosome	9.40e-06	1.20e-02	1.28	2149	115	1386	95
GO:0090501	RNA phosphodiester bond hydrolysis	1.24e-04	1.40e-01	1.45	2149	32	1386	30
GO:0022402	cell cycle process	1.37e-04	1.37e-01	1.17	2149	234	1386	176
GO:0070125	mitochondrial translational elongation	1.48e-04	1.33e-01	1.55	2149	20	1386	20
GO:0006259	DNA metabolic process	2.06e-04	1.69e-01	1.19	2149	169	1386	130
GO:0016070	RNA metabolic process	2.33e-04	1.74e-01	1.12	2149	404	1386	291
GO:1903047	mitotic cell cycle process	2.54e-04	1.76e-01	1.18	2149	178	1386	136
GO:0070126	mitochondrial translational termination	3.58e-04	2.30e-01	1.55	2149	18	1386	18
GO:0006325	chromatin organization	5.33e-04	3.19e-01	1.21	2149	127	1386	99
GO:0033044	regulation of chromosome organization	5.37e-04	3.02e-01	1.24	2149	99	1386	79
GO:0032774	RNA biosynthetic process	5.50e-04	2.91e-01	1.22	2149	113	1386	89
GO:0016073	snRNA metabolic process	8.08e-04	4.03e-01	1.48	2149	22	1386	21

Once we have finish, we are going to remove from the current directory the two txt files we have created:

file.remove('fichero_de_texto_temporal_target.txt', 
            'fichero_de_texto_temporal_background.txt')

## [1] TRUE TRUE