Processing BED files¶
Pyokit has a number of classes and functions for processing BED files, which are plain-text files containing genomic intervals. There are basically two approches for processing them in Pyokit: iterate over the BED file, which provides access to one (or a small subset of) genomic interval at a time without storing the whole file in memory, or bulk-load them into some data structure. Both approaches are covered below. Pyokit also contains facilities for processing multiple BED files simultaneously with the intention of finding matching or missing entries in one or more of the files. Examples of this are also given below.
You can find the definition of the BED format here: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
Iterating over BED files¶
If you only need sequential access to the genomic intervals in a BED file, then this is the best approach – memeory usage will be O(1).
The basic BED iterator¶
This is the basic iterator, and will be sufficient in most situations. The usual idiom would be:
>>> from pyokit.io.bedIterators import BEDIterator
>>> for e in BEDIterator(myFile) :
>>> # e is a GenomicInterval object.
>>> # Do whatever you want with it here.
Here’s the signature of the basic BED iterator function:
-
pyokit.io.bedIterators.
BEDIterator
(filehandle, sortedby=None, verbose=False, scoreType=<type 'int'>, dropAfter=None)¶ Get an iterator for a BED file
Parameters: - filehandle – this can be either a string, or a stream-like object. In the former case, it is treated as a filename. The format of the file/stream must be BED.
- sortedby – if None, order is not checked. if == ITERATOR_SORTED_START, elements in file must be sorted by chrom and start index (an exception is raised if they are not) if == ITERATOR_SORTED_END, element must be sorted by chrom and end index.
- verbose – if True, output additional progress messages to stderr
- scoreType – The data type for scores (the fifth column) in the BED file.
- dropAfter – an int indicating that any fields after and including this field should be ignored as they don’t conform to the BED format. By default, None, meaning we use all fields. Index from zero.
Returns: iterator where subsequent calls to next() yield the next BED element in the stream as a GenomicInterval object.
The paired BED iterator¶
Sometimes you may want to iterate over multiple BED files at the same time. That’s what the paired BED iterator is for. You can either ignore intervals that don’t exist in all of the BED files (the default), or you can populate missing intervals on-the-fly with a default. By deault, intervals are considered to match if they have the same chromosome, start and end index, and strand. Which fields are used for matching can be adjusted though. Here’s an example of using the iterator to find all the matching intervals in a set of BED files.
>>> from itertools import izip
>>> from pyokit.io.bedIterators import pairedBEDIterator
>>>
>>> filenames = ["one.bed","two.bed","three.bed"]
>>> for lst in pairedBEDIterator(filenames) :
>>> # each element yielded by the iterator will be a list of GenomicInterval
>>> # objects; the index in the list will match the index of the filename that
>>> # the interval came from. The chrom, start and end will always match
>>> chrom = set([x.chrom for x in lst])
>>> start = set([x.start for x in lst])
>>> end = set([x.end for x in lst])
>>> assert(len(chrom) == 1)
>>> assert(len(start) == 1)
>>> assert(len(end) == 1)
>>> chrom = list(chrom)[0]
>>> start = list(start)[0]
>>> end = list(end)[0]
>>>
>>> msgs = ", ".join([" in ".join(list(x)) for x in
>>> izip([x.name for x in lst], filenames)])
>>> print "region at " + str(chrom) + " " + str(start) + " " +\
>>> str(end) + " has name " + msgs
Given the following contents of one.bed
, two.bed
and three.bed
:
$ cat one.bed
chr1 10 20 reg1 1 +
chr1 40 50 reg2 2 -
$ cat two.bed
chr1 10 20 regA 1 +
chr1 30 35 regB 5 -
chr1 40 50 regC 2 -
chr2 10 20 regD 2 +
$ cat three.bed
chr1 10 20 regW 1 +
chr1 30 35 regX 5 -
chr1 40 50 regY 2 -
chr2 10 20 regZ 2 +
The output of the above script would be:
region at chr1 10 20 has name reg1 in one.bed, regA in two.bed, regW in three.bed
region at chr1 40 50 has name reg2 in one.bed, regC in two.bed, regY in three.bed
Here’s the signature of the paired BED iterator function:
-
pyokit.io.bedIterators.
pairedBEDIterator
(inputStreams, mirror=False, mirrorScore=None, ignoreStrand=False, ignoreScore=True, ignoreName=True, sortedby=2, scoreType=<type 'float'>, verbose=False)¶ Iterate over multiple BED format files simultaneously and yield lists of genomic intervals for each matching set of intervals found. By default, regions which are not found in all files will be skipped (mirror = false). Optionally (by setting mirror to true) if a file is missing an interval, it can be added on-the-fly, and will have the same chrom, start and end and name as in other files. The score will be taken from the first file in inputStreams if mirrorScore is not set, otherwise that value will be used.
Parameters: - inputStreams – a list of input streams in BED format
- mirror – if true, add missing elements so all streams contain the same elements. Inserted elements will have the same
- ignoreStrand – ignore strand when comparing elements for equality?
- ignoreScore – ignore score when comparing elements for equality?
- ignoreScore – ignore name when comparing elements for equality?
- sortedby – must be set to one of the sorting orders for BED streams; we require the streams to be sorted in some fashion.
- scoreType – interpret scores as what type? Defaults to float, which is generally the most flexible.
Bulk-loading BED files¶
Sometimes you’ll need to load all of the intervals in a BED file, generally when you need random-access to them (rather than sequential). It’s easy to load them into a list (see below), but this isn’t practical if you’re trying to look up intervals that fall within a certain part of the genome. An easy way to do that is to load them into an interval tree, which then allows O(log(n)) time to find intervals (instead of O(n) for checking a list).
Note
this isn’t neccessarily the fastest way to do this; depending on what you’re doing, it may be faster to sort your BED files beforehand and then iterate over them simultaneously. Still, interval trees are easy and I find they’re usually fast enough in practice.
Loading into a list¶
The easiest way to do this is to use any of the above iterators and populate a list, for example:
>>> allMyRegions = [x for x in BEDIterator(fn)]
Loading into an interval tree¶
the intervalTrees function will load a BED file and return a dictionary of interval trees. The keys to the dictionary are chromosome names, and each entry is an interval tree for that chromosome. For details about the interval tree objects, see Interval Trees.
-
pyokit.io.bedIterators.
intervalTrees
(reffh, scoreType=<type 'int'>, verbose=False)¶ Build a dictionary of interval trees indexed by chrom from a BED stream or file
Parameters: - reffh – This can be either a string, or a stream-like object. In the former case, it is treated as a filename. The format of the file/stream must be BED.
- scoreType – The data type for scores (the fifth column) in the BED file.
- verbose – output progress messages to sys.stderr if True