Manipulating Genomic Intervals¶
Pyokit has a class for representing and manipulating genomic intervals, and some associated functions for manipulating collections of genomic intervals.
Collapsing interval sets¶
-
pyokit.datastruct.genomicInterval.
collapseRegions
(s, stranded=False, names=False, verbose=False)¶ Get the union of a set of genomic intervals.
given a list of genomic intervals with chromosome, start and end field, collapse into a set of non-overlapping intervals. Intervals must be sorted by chromosome and then start coordinate.
Note: O(n) time, O(n) space
Returns: list of intervals that define the collapsed regions. Note that these are all new objects, no existing object from s is returned or altered. Returned regions will all have name “X” and score 0
Parameters: - s – list of genomic regions to collapse
- stranded – if True, regions on positive and negative strands are collapsed separately. Otherwise, strand is ignored and all output regions are on positive strand.
Raises GenomicIntervalError: if the input regions are not correctly sorted (chromosome then start)
Intersecting interval sets¶
-
pyokit.datastruct.genomicInterval.
regionsIntersection
(s1, s2, collapse=True)¶ given two lists of genomic regions with chromosome, start and end coordinates, return a new list of regions which is the intersection of those two sets. Lists must be sorted by chromosome and start index
Returns: new list that represents the intersection of the two input lists. output regions will all have name “X”, be on strand “+” and have score 0
Parameters: - s1 – first list of genomic regions
- s2 – second list of genomic regions
Raises GenomicIntervalError: if the input regions are not sorted correctly (by chromosome and start index)
Note: O(n) time, O(n) space; informally, might use up to 3x space of input
Searching interval sets¶
Usually when you’re searching for particular intervals, you’re looking for ones that exist in some region of the genome, or contain some point in the genome. The naive approach would be to check all intervals, requiring O(n) time. Although this sounds inefficient, it might be the best option if you only need to make one pass of the intervals. Here’s an example:
>>> from pyokit.io.bedIterators import BEDIterator
>>>
>>> # print out all intervals that intersect chr1, 10, 20
>>> roi = GenomicInterval("chr1", 10, 20)
>>> hits = [e for e in BEDIterator("somefile.bed") if e.intersects(roi)]
More often though, you’ll want to make multiple passes. For example, you might
want to count how many intervals in two.bed
intersect each interval in
one.bed
. The naive approach is to use nested iterators, but it has
worst case complexity of O(n^2). This occurs sufficiently often that Pyokit has
an iterator for dealing with exactly this situation: the bucket iterator –
which will iterate over one set of genomic intervals (the buckets), and for each
one will give you the list of intervals in another set that are intersected by
it. You can provide it with regular BEDIterators, or lists, or any iterable
object that contains genomic intervals. Here’s the signature of the iterator:
-
pyokit.datastruct.genomicInterval.
bucketIterator
(elements, buckets)¶ For each bucket in buckets, yield it and any elements that overlap it.
Parameters: - elements – the genomic intervals to place into the buckets. Must be sorted by chromosome and start index. This could be a list, or an iterator.
- buckets – the buckets into which genomic intervals should be binned. Must be sorted by chromosome and start index. This could be a list, or an iterator
Returns: iterator that will yeild a tuple of 1 bucket and 1 list of elements in the bucket for each call to __next__().
As an example of using this iterator, consider that you have the following sets of buckets, and elements that you want to bin:
>>> # set up some buckets
>>> g1 = GenomicInterval("chr1", 1, 6)
>>> g2 = GenomicInterval("chr1", 4, 10)
>>> g3 = GenomicInterval("chr1", 5, 19)
>>> g4 = GenomicInterval("chr1", 9, 15)
>>> g5 = GenomicInterval("chr1", 18, 22)
>>> g6 = GenomicInterval("chr2", 1, 12)
>>> g7 = GenomicInterval("chr2", 4, 7)
>>> buckets = [g1,g2,g3,g4,g5,g6,g7]
>>>
>>> # set up some elements to bin
>>> e1 = GenomicInterval("chr1", 2, 3)
>>> e2 = GenomicInterval("chr1", 4, 5)
>>> e3 = GenomicInterval("chr1", 4, 5)
>>> e4 = GenomicInterval("chr1", 7, 9)
>>> e5 = GenomicInterval("chr1", 7, 8)
>>> e6 = GenomicInterval("chr1", 10, 14)
>>> e7 = GenomicInterval("chr1", 12, 20)
>>> e8 = GenomicInterval("chr1", 20, 23)
>>> e9 = GenomicInterval("chr1", 24, 26)
>>> e10 = GenomicInterval("chr2", 1, 3)
>>> e11 = GenomicInterval("chr2", 2, 6)
>>> e12 = GenomicInterval("chr2", 4, 8)
>>> e13 = GenomicInterval("chr2", 10, 15)
>>> elements = [e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13]
Now we can bin these elements as follows:
>>> actual = [(x, set(l)) for x,l in bucketIterator(elements, buckets)]
>>> actualStr = "\n".join([str(x) + " == " + ", ".join([str(y) for y in l])
>>> for x, l in actual])
>>> print actualStr
This will give you the following:
chr1 1 6 == chr1 4 5, chr1 2 3
chr1 4 10 == chr1 4 5, chr1 7 9,chr1 7 8
chr1 5 19 == chr1 10 14, chr1 12 20, chr1 7 9, chr1 7 8
chr1 9 15 == chr1 12 20, chr1 10 14
chr1 18 22 == chr1 20 23, chr1 12 20
chr2 1 12 == chr2 2 6, chr2 4 8, chr2 10 15, chr2 1 3
chr2 4 7 == chr2 2 6, chr2 4 8
In certain more complicated scenarios, it might be easier to use an interval tree to do this kind of thing. Here is code using the interval tree from list function to perform the same operation:
>>> trees = intervalTreesFromList(elements, openEnded = True)
>>> treeRes = [(g, set(trees[g.chrom].intersectingInterval(g.start, g.end)))
>>> for g in buckets]
>>> treeReStr = "\n".join([str(x) + " == " + ",".join([str(y) for y in l])
>>> for x, l in treeRes])
>>> print treeReStr
The output will be the same.