==============================
Manipulating Genomic Intervals
==============================

Pyokit has a class for representing and manipulating genomic intervals, and some
associated functions for manipulating collections of genomic intervals.

------------------------
Collapsing interval sets
------------------------
.. autofunction:: pyokit.datastruct.genomicInterval.collapseRegions

--------------------------
Intersecting interval sets
--------------------------
.. autofunction:: pyokit.datastruct.genomicInterval.regionsIntersection


-----------------------
Searching interval sets
-----------------------
Usually when you're searching for particular intervals, you're looking for
ones that exist in some region of the genome, or contain some point in the
genome. The naive approach would be to check all intervals, requiring O(n) time.
Although this sounds inefficient, it might be the best option if you only
need to make one pass of the intervals. Here's an example:

>>> from pyokit.io.bedIterators import BEDIterator
>>>
>>> # print out all intervals that intersect chr1, 10, 20
>>> roi = GenomicInterval("chr1", 10, 20)
>>> hits = [e for e in BEDIterator("somefile.bed") if e.intersects(roi)]

More often though, you'll want to make multiple passes. For example, you might
want to count how many intervals in ``two.bed`` intersect each interval in
``one.bed``. The naive approach is to use nested iterators, but it has
worst case complexity of O(n^2). This occurs sufficiently often that Pyokit has
an iterator for dealing with exactly this situation: the bucket iterator --
which will iterate over one set of genomic intervals (the buckets), and for each
one will give you the list of intervals in another set that are intersected by
it. You can provide it with regular BEDIterators, or lists, or any iterable
object that contains genomic intervals. Here's the signature of the iterator:

.. autofunction:: pyokit.datastruct.genomicInterval.bucketIterator

As an example of using this iterator, consider that you have the following
sets of buckets, and elements that you want to bin:

>>> # set up some buckets
>>> g1 = GenomicInterval("chr1", 1,  6)
>>> g2 = GenomicInterval("chr1", 4,  10)
>>> g3 = GenomicInterval("chr1", 5,  19)
>>> g4 = GenomicInterval("chr1", 9,  15)
>>> g5 = GenomicInterval("chr1", 18, 22)
>>> g6 = GenomicInterval("chr2", 1,  12)
>>> g7 = GenomicInterval("chr2", 4,  7)
>>> buckets = [g1,g2,g3,g4,g5,g6,g7]
>>>
>>> # set up some elements to bin
>>> e1 = GenomicInterval("chr1", 2,  3)
>>> e2 = GenomicInterval("chr1", 4,  5)
>>> e3 = GenomicInterval("chr1", 4,  5)
>>> e4 = GenomicInterval("chr1", 7,  9)
>>> e5 = GenomicInterval("chr1", 7,  8)
>>> e6 = GenomicInterval("chr1", 10, 14)
>>> e7 = GenomicInterval("chr1", 12, 20)
>>> e8 = GenomicInterval("chr1", 20, 23)
>>> e9 = GenomicInterval("chr1", 24, 26)
>>> e10 = GenomicInterval("chr2", 1,  3)
>>> e11 = GenomicInterval("chr2", 2,  6)
>>> e12 = GenomicInterval("chr2", 4,  8)
>>> e13 = GenomicInterval("chr2", 10, 15)
>>> elements = [e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13]

Now we can bin these elements as follows:

>>> actual = [(x, set(l)) for x,l in bucketIterator(elements, buckets)]
>>> actualStr = "\n".join([str(x) + " == " + ", ".join([str(y) for y in l])
>>>                        for x, l in actual])
>>> print actualStr

This will give you the following:

.. code-block:: bash

   chr1	1	6 == chr1	4	5, chr1	2	3
   chr1	4	10 == chr1	4	5, chr1	7	9,chr1	7	8
   chr1	5	19 == chr1	10	14, chr1	12	20, chr1	7	9, chr1	7	8
   chr1	9	15 == chr1	12	20, chr1	10	14
   chr1	18	22 == chr1	20	23, chr1	12	20
   chr2	1	12 == chr2	2	6, chr2	4	8, chr2	10	15, chr2	1	3
   chr2	4	7 == chr2	2	6, chr2	4	8


In certain more complicated scenarios, it might be easier to use an interval
tree to do this kind of thing. Here is code using the interval tree from list
function to perform the same operation:

>>> trees = intervalTreesFromList(elements, openEnded = True)
>>> treeRes = [(g, set(trees[g.chrom].intersectingInterval(g.start, g.end)))
>>>            for g in buckets]
>>> treeReStr = "\n".join([str(x) + " == " + ",".join([str(y) for y in l])
>>>                        for x, l in treeRes])
>>> print treeReStr

The output will be the same.