Manipulating Genomic Intervals

Pyokit has a class for representing and manipulating genomic intervals, and some associated functions for manipulating collections of genomic intervals.

Collapsing interval sets

pyokit.datastruct.genomicInterval.collapseRegions(s, stranded=False, names=False, verbose=False)

Get the union of a set of genomic intervals.

given a list of genomic intervals with chromosome, start and end field, collapse into a set of non-overlapping intervals. Intervals must be sorted by chromosome and then start coordinate.

Note:

O(n) time, O(n) space

Returns:

list of intervals that define the collapsed regions. Note that these are all new objects, no existing object from s is returned or altered. Returned regions will all have name “X” and score 0

Parameters:
  • s – list of genomic regions to collapse
  • stranded – if True, regions on positive and negative strands are collapsed separately. Otherwise, strand is ignored and all output regions are on positive strand.
Raises GenomicIntervalError:
 

if the input regions are not correctly sorted (chromosome then start)

Intersecting interval sets

pyokit.datastruct.genomicInterval.regionsIntersection(s1, s2, collapse=True)

given two lists of genomic regions with chromosome, start and end coordinates, return a new list of regions which is the intersection of those two sets. Lists must be sorted by chromosome and start index

Returns:

new list that represents the intersection of the two input lists. output regions will all have name “X”, be on strand “+” and have score 0

Parameters:
  • s1 – first list of genomic regions
  • s2 – second list of genomic regions
Raises GenomicIntervalError:
 

if the input regions are not sorted correctly (by chromosome and start index)

Note:

O(n) time, O(n) space; informally, might use up to 3x space of input

Searching interval sets

Usually when you’re searching for particular intervals, you’re looking for ones that exist in some region of the genome, or contain some point in the genome. The naive approach would be to check all intervals, requiring O(n) time. Although this sounds inefficient, it might be the best option if you only need to make one pass of the intervals. Here’s an example:

>>> from pyokit.io.bedIterators import BEDIterator
>>>
>>> # print out all intervals that intersect chr1, 10, 20
>>> roi = GenomicInterval("chr1", 10, 20)
>>> hits = [e for e in BEDIterator("somefile.bed") if e.intersects(roi)]

More often though, you’ll want to make multiple passes. For example, you might want to count how many intervals in two.bed intersect each interval in one.bed. The naive approach is to use nested iterators, but it has worst case complexity of O(n^2). This occurs sufficiently often that Pyokit has an iterator for dealing with exactly this situation: the bucket iterator – which will iterate over one set of genomic intervals (the buckets), and for each one will give you the list of intervals in another set that are intersected by it. You can provide it with regular BEDIterators, or lists, or any iterable object that contains genomic intervals. Here’s the signature of the iterator:

pyokit.datastruct.genomicInterval.bucketIterator(elements, buckets)

For each bucket in buckets, yield it and any elements that overlap it.

Parameters:
  • elements – the genomic intervals to place into the buckets. Must be sorted by chromosome and start index. This could be a list, or an iterator.
  • buckets – the buckets into which genomic intervals should be binned. Must be sorted by chromosome and start index. This could be a list, or an iterator
Returns:

iterator that will yeild a tuple of 1 bucket and 1 list of elements in the bucket for each call to __next__().

As an example of using this iterator, consider that you have the following sets of buckets, and elements that you want to bin:

>>> # set up some buckets
>>> g1 = GenomicInterval("chr1", 1,  6)
>>> g2 = GenomicInterval("chr1", 4,  10)
>>> g3 = GenomicInterval("chr1", 5,  19)
>>> g4 = GenomicInterval("chr1", 9,  15)
>>> g5 = GenomicInterval("chr1", 18, 22)
>>> g6 = GenomicInterval("chr2", 1,  12)
>>> g7 = GenomicInterval("chr2", 4,  7)
>>> buckets = [g1,g2,g3,g4,g5,g6,g7]
>>>
>>> # set up some elements to bin
>>> e1 = GenomicInterval("chr1", 2,  3)
>>> e2 = GenomicInterval("chr1", 4,  5)
>>> e3 = GenomicInterval("chr1", 4,  5)
>>> e4 = GenomicInterval("chr1", 7,  9)
>>> e5 = GenomicInterval("chr1", 7,  8)
>>> e6 = GenomicInterval("chr1", 10, 14)
>>> e7 = GenomicInterval("chr1", 12, 20)
>>> e8 = GenomicInterval("chr1", 20, 23)
>>> e9 = GenomicInterval("chr1", 24, 26)
>>> e10 = GenomicInterval("chr2", 1,  3)
>>> e11 = GenomicInterval("chr2", 2,  6)
>>> e12 = GenomicInterval("chr2", 4,  8)
>>> e13 = GenomicInterval("chr2", 10, 15)
>>> elements = [e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13]

Now we can bin these elements as follows:

>>> actual = [(x, set(l)) for x,l in bucketIterator(elements, buckets)]
>>> actualStr = "\n".join([str(x) + " == " + ", ".join([str(y) for y in l])
>>>                        for x, l in actual])
>>> print actualStr

This will give you the following:

chr1 1       6 == chr1       4       5, chr1 2       3
chr1 4       10 == chr1      4       5, chr1 7       9,chr1  7       8
chr1 5       19 == chr1      10      14, chr1        12      20, chr1        7       9, chr1 7       8
chr1 9       15 == chr1      12      20, chr1        10      14
chr1 18      22 == chr1      20      23, chr1        12      20
chr2 1       12 == chr2      2       6, chr2 4       8, chr2 10      15, chr2        1       3
chr2 4       7 == chr2       2       6, chr2 4       8

In certain more complicated scenarios, it might be easier to use an interval tree to do this kind of thing. Here is code using the interval tree from list function to perform the same operation:

>>> trees = intervalTreesFromList(elements, openEnded = True)
>>> treeRes = [(g, set(trees[g.chrom].intersectingInterval(g.start, g.end)))
>>>            for g in buckets]
>>> treeReStr = "\n".join([str(x) + " == " + ",".join([str(y) for y in l])
>>>                        for x, l in treeRes])
>>> print treeReStr

The output will be the same.