Gene Ontology

Gene ontology data is represented by two data types; details of both are given in Gene Ontology. At present, the only supported input format is the one produced by DAVID. Here’s an example that loads several DAVID output files and prints a tab-separated table of the GO terms, categories, and significance (p-value) from each enrichment analysis, keeping only those terms for which at least one of the analyses was significant:

>>> import sys
>>> from pyokit.io.david import david_results_iterator
>>> PVAL_THRESHOLD = 0.01
>>> filenames = sys.argv[1:]
>>> # load all of the DAVID results for each file, indexed by term then file
>>> by_trm = {}
>>> for fn in filenames:
...   for r in david_results_iterator(fn):
...     if r.name not in by_trm:
...       by_trm[r.name] = {}
...     by_trm[r.name][fn] = r
>>> # drop terms where no file has p < threshold
>>> by_trm = {term: by_trm[term] for term in by_trm
...           if min(by_trm[term][fn].pvalue for fn in by_trm[term]) < PVAL_THRESHOLD}
>>> # output one tab-separated row per (term, file) pair
>>> for term in by_trm:
...   for fn in by_trm[term]:
...     r = by_trm[term][fn]
...     print(r.name + "\t" + str(r.pvalue) + "\t" + r.catagory + "\t" + fn)
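
The filtering step can also be tried without pyokit or any DAVID files. The following is a minimal sketch under stated assumptions: `Result` and `filter_significant` are hypothetical names introduced here for illustration, and `Result` only models the three attributes used in the example above.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects yielded by david_results_iterator;
# only the three attributes used in the example above are modelled.
Result = namedtuple("Result", ["name", "pvalue", "catagory"])


def filter_significant(by_term, threshold):
    """Keep terms whose smallest p-value across all files is below threshold."""
    return {
        term: per_file
        for term, per_file in by_term.items()
        if min(r.pvalue for r in per_file.values()) < threshold
    }


# Made-up sample data: term -> file name -> result
by_term = {
    "intracellular transport": {
        "a.txt": Result("intracellular transport", 1e-9, "GOTERM_BP_FAT"),
        "b.txt": Result("intracellular transport", 0.2, "GOTERM_BP_FAT"),
    },
    "cell cycle": {
        "a.txt": Result("cell cycle", 0.5, "GOTERM_BP_FAT"),
    },
}

kept = filter_significant(by_term, 0.01)
```

A term that passes the filter keeps its results from every file, including files where it was not significant, so the output table stays complete for that term.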

The loading example uses david_results_iterator, an iterator over DAVID output-format files. Here are the details of that function:

pyokit.io.david.david_results_iterator(fn, verbose=False)

Iterate over a DAVID result set and yield GeneOntologyTerm objects representing each of the terms reported. A DAVID result file is expected to be tab-separated, with the following fields present:

Num  Field            Type    Example
0    Category         string  GOTERM_BP_FAT
1    Term             string  GO:0046907~intracellular transport
2    Count            int     43
3    Percent          float   11.345646437994723
4    PValue           float   1.3232857694449546E-9
5    Genes            string  ARSB, KPNA6, GNAS
6    List Total       int     310
7    Pop Hits         int     657
8    Pop Total        int     13528
9    Fold Enrichment  float   2.8561103746256196
10   Bonferroni       float   2.6293654579179204E-6
11   Benjamini        float   2.6293654579179204E-6
12   FDR              float   2.2734203852792234E-6

The first line is a header giving the field names; it is ignored, and the fields are expected in the order given above.

Most of the fields are ignored at present; we take fields 0, 1, and 11 (the last as the significance/p-value). When parsing the term field, we try to extract a term ID by splitting on the tilde; if that is not possible, the ID is set to None.
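
The field handling described above might look roughly like the sketch below. This is not pyokit's implementation: `parse_david_line` is a hypothetical helper, and using the whole term field as the name when no tilde is present is an assumption made here.

```python
def parse_david_line(line):
    """Return (category, term_id, term_name, significance) for one data line.

    Hypothetical helper, not part of pyokit.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated fields")
    category = fields[0]
    term = fields[1]
    significance = float(fields[11])  # field 11 is used as the p-value
    if "~" in term:
        # e.g. "GO:0046907~intracellular transport"
        term_id, term_name = term.split("~", 1)
    else:
        # Assumption: without a tilde, the whole field is the term name
        term_id, term_name = None, term
    return category, term_id, term_name, significance


# Sample data line built from the example values in the table above
sample = "\t".join([
    "GOTERM_BP_FAT", "GO:0046907~intracellular transport", "43",
    "11.345646437994723", "1.3232857694449546E-9", "ARSB, KPNA6, GNAS",
    "310", "657", "13528", "2.8561103746256196",
    "2.6293654579179204E-6", "2.6293654579179204E-6",
    "2.2734203852792234E-6",
])
parsed = parse_david_line(sample)
```

Splitting with a limit of one keeps any later tilde inside the term name rather than discarding it.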

Parameters:
  • fn – the file to parse
  • verbose – if True, output progress to stderr.