Gene Ontology

Gene ontology data is represented by two data types; you can find details of these in Gene Ontology. At present, the only input format supported is that produced by DAVID. Here’s an example of loading several DAVID output files and writing out a new tab-separated table containing the GO terms, categories, and significance (p-value) from each enrichment analysis, restricted to those terms where at least one of the analyses was significant:

>>> import sys
>>> from pyokit.io.david import david_results_iterator
>>> PVAL_THRESHOLD = 0.01
>>> filenames = sys.argv[1:]
>>> # load all of the DAVID results for each file
>>> by_trm = {}
>>> for fn in filenames:
...   for r in david_results_iterator(fn):
...     if r.name not in by_trm:
...       by_trm[r.name] = {}
...     by_trm[r.name][fn] = r
>>> # drop terms where no file has p < threshold
>>> by_trm = {term: by_trm[term] for term in by_trm
...           if min([by_trm[term][fn].pvalue for fn in by_trm[term]]) < PVAL_THRESHOLD}
>>> # output: one tab-separated line per (term, file) pair
>>> for term in by_trm:
...   for fn in by_trm[term]:
...     r = by_trm[term][fn]
...     print(r.name + "\t" + str(r.pvalue) + "\t" + r.catagory + "\t" + fn)
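
The filtering step can be tried without any DAVID files. The following is a minimal sketch, not part of pyokit: Result is a hypothetical stand-in for the objects yielded by david_results_iterator, modelling only the name and pvalue attributes, and the file names and p-values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects yielded by david_results_iterator;
# only the two attributes used by the filter are modelled.
Result = namedtuple("Result", ["name", "pvalue"])

PVAL_THRESHOLD = 0.01

# term name -> {file name -> result}, the same shape as by_trm in the example
by_trm = {
    "intracellular transport": {
        "a.txt": Result("intracellular transport", 0.0005),
        "b.txt": Result("intracellular transport", 0.2),
    },
    "cell cycle": {
        "a.txt": Result("cell cycle", 0.3),
        "b.txt": Result("cell cycle", 0.05),
    },
}

# keep a term if its smallest p-value across all files is below the threshold
kept = {
    term: per_file
    for term, per_file in by_trm.items()
    if min(r.pvalue for r in per_file.values()) < PVAL_THRESHOLD
}

print(sorted(kept))
```

Note that a term that passes the filter keeps its results from every file, including those where it was not significant, so the output still has one row per file for that term.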

The loading example above relies on david_results_iterator, an iterator for DAVID output-format files. Here are the details of that function:

pyokit.io.david.david_results_iterator(fn, verbose=False)

Iterate over a DAVID result set and yield GeneOntologyTerm objects representing each of the terms reported. A DAVID result file is expected to be tab-separated, with the following fields present:

Num	Field	Type	Example
0	Category	string	GOTERM_BP_FAT
1	Term	string	GO:0046907~intracellular transport
2	Count	int	43
3	Percent	float	11.345646437994723
4	PValue	float	1.3232857694449546E-9
5	Genes	string	ARSB, KPNA6, GNAS
6	List Total	int	310
7	Pop Hits	int	657
8	Pop Total	int	13528
9	Fold Enrichment	float	2.8561103746256196
10	Bonferroni	float	2.6293654579179204E-6
11	Benjamini	float	2.6293654579179204E-6
12	FDR	float	2.2734203852792234E-6

The first line is a header giving the field names. It is ignored; the fields are expected in the order given above.

Most of the fields are ignored at present; we take fields 0, 1, and 11 (the last as the significance/p-value). When parsing the term field, we try to extract a term ID by splitting on the tilde; if that fails, the ID is set to None.

Parameters:
    fn – the file to parse
    verbose – if True, output progress to stderr
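
The parsing behaviour described above (header skipped, fields read by position, term ID split on the tilde) can be sketched as follows. This is an illustrative approximation and not pyokit's implementation: split_term_field and iter_david_rows are hypothetical names, the sketch yields plain tuples rather than GeneOntologyTerm objects, and using the whole field as the term name when no tilde is present is an assumption.

```python
import csv
import io


def split_term_field(term_field):
    """Split 'GO:0046907~intracellular transport' into (id, name).

    Assumption: when there is no tilde, the ID is None and the whole
    field is used as the name.
    """
    if "~" in term_field:
        term_id, name = term_field.split("~", 1)
        return term_id, name
    return None, term_field


def iter_david_rows(handle):
    """Yield (category, term_id, term_name, significance) per data row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line; columns are read by position
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        term_id, name = split_term_field(row[1])
        # field 0 = Category, field 1 = Term, field 11 = Benjamini
        yield row[0], term_id, name, float(row[11])


# A header plus one data row built from the example values in the field table.
header = ["Category", "Term", "Count", "Percent", "PValue", "Genes",
          "List Total", "Pop Hits", "Pop Total", "Fold Enrichment",
          "Bonferroni", "Benjamini", "FDR"]
record = ["GOTERM_BP_FAT", "GO:0046907~intracellular transport", "43",
          "11.345646437994723", "1.3232857694449546E-9", "ARSB, KPNA6, GNAS",
          "310", "657", "13528", "2.8561103746256196",
          "2.6293654579179204E-6", "2.6293654579179204E-6",
          "2.2734203852792234E-6"]
sample = "\t".join(header) + "\n" + "\t".join(record) + "\n"

for parsed in iter_david_rows(io.StringIO(sample)):
    print(parsed)
```

Reading columns by position rather than by header name mirrors the description above, which means a file with reordered columns would be misread silently.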
