==================== Processing BED files ==================== Pyokit has a number of classes and functions for processing BED files, which are plain-text files containing genomic intervals. There are basically two approches for processing them in Pyokit: iterate over the BED file, which provides access to one (or a small subset of) genomic interval at a time without storing the whole file in memory, or bulk-load them into some data structure. Both approaches are covered below. Pyokit also contains facilities for processing multiple BED files simultaneously with the intention of finding matching or missing entries in one or more of the files. Examples of this are also given below. You can find the definition of the BED format here: http://genome.ucsc.edu/FAQ/FAQformat.html#format1 ------------------------ Iterating over BED files ------------------------ If you only need sequential access to the genomic intervals in a BED file, then this is the best approach -- memeory usage will be O(1). The basic BED iterator `````````````````````` This is the basic iterator, and will be sufficient in most situations. The usual idiom would be: >>> from pyokit.io.bedIterators import BEDIterator >>> for e in BEDIterator(myFile) : >>> # e is a GenomicInterval object. >>> # Do whatever you want with it here. Here's the signature of the basic BED iterator function: .. autofunction:: pyokit.io.bedIterators.BEDIterator The paired BED iterator ``````````````````````` Sometimes you may want to iterate over multiple BED files at the same time. That's what the paired BED iterator is for. You can either ignore intervals that don't exist in all of the BED files (the default), or you can populate missing intervals on-the-fly with a default. By deault, intervals are considered to match if they have the same chromosome, start and end index, and strand. Which fields are used for matching can be adjusted though. Here's an example of using the iterator to find all the matching intervals in a set of BED files. >>> from itertools import izip >>> from pyokit.io.bedIterators import pairedBEDIterator >>> >>> filenames = ["one.bed","two.bed","three.bed"] >>> for lst in pairedBEDIterator(filenames) : >>> # each element yielded by the iterator will be a list of GenomicInterval >>> # objects; the index in the list will match the index of the filename that >>> # the interval came from. The chrom, start and end will always match >>> chrom = set([x.chrom for x in lst]) >>> start = set([x.start for x in lst]) >>> end = set([x.end for x in lst]) >>> assert(len(chrom) == 1) >>> assert(len(start) == 1) >>> assert(len(end) == 1) >>> chrom = list(chrom)[0] >>> start = list(start)[0] >>> end = list(end)[0] >>> >>> msgs = ", ".join([" in ".join(list(x)) for x in >>> izip([x.name for x in lst], filenames)]) >>> print "region at " + str(chrom) + " " + str(start) + " " +\ >>> str(end) + " has name " + msgs Given the following contents of ``one.bed``, ``two.bed`` and ``three.bed``: .. code-block:: bash $ cat one.bed chr1 10 20 reg1 1 + chr1 40 50 reg2 2 - $ cat two.bed chr1 10 20 regA 1 + chr1 30 35 regB 5 - chr1 40 50 regC 2 - chr2 10 20 regD 2 + $ cat three.bed chr1 10 20 regW 1 + chr1 30 35 regX 5 - chr1 40 50 regY 2 - chr2 10 20 regZ 2 + The output of the above script would be: .. code-block:: bash region at chr1 10 20 has name reg1 in one.bed, regA in two.bed, regW in three.bed region at chr1 40 50 has name reg2 in one.bed, regC in two.bed, regY in three.bed Here's the signature of the paired BED iterator function: .. autofunction:: pyokit.io.bedIterators.pairedBEDIterator ---------------------- Bulk-loading BED files ---------------------- Sometimes you'll need to load all of the intervals in a BED file, generally when you need random-access to them (rather than sequential). It's easy to load them into a list (see below), but this isn't practical if you're trying to look up intervals that fall within a certain part of the genome. An easy way to do that is to load them into an interval tree, which then allows O(log(n)) time to find intervals (instead of O(n) for checking a list). .. note:: this isn't neccessarily the fastest way to do this; depending on what you're doing, it may be faster to sort your BED files beforehand and then iterate over them simultaneously. Still, interval trees are easy and I find they're usually fast enough in practice. Loading into a list ``````````````````` The easiest way to do this is to use any of the above iterators and populate a list, for example: >>> allMyRegions = [x for x in BEDIterator(fn)] Loading into an interval tree ````````````````````````````` the `intervalTrees` function will load a BED file and return a dictionary of interval trees. The keys to the dictionary are chromosome names, and each entry is an interval tree for that chromosome. For details about the interval tree objects, see :ref:`intervalTreesSection`. .. autofunction:: pyokit.io.bedIterators.intervalTrees