Retrotransposons¶
Pyokit has two data structures for representing retro-transposons: Retrotransposon and RetrotransposonOccurrence. The first of these contains information about retro-transposons that is general to all occurrences, such as the name, family name and concensus sequence. The second describes one particular occurrence of a retrotransposon, including details about where it occurs and the name of the sequence it occurrs in (usually a chromosome). An occurrence objsect also stores the start and end co-ordinates into the concensus sequence (becuase sometimes it is truncated at the 3’ or 5’ end), and whether the match is to the concensus or the reverse compliment of the concensus. RetrotransposonOccurrence objects contain references to Retrotransposon objects in a one-to-many fashion. Optionally, the full alignment between the consensus and the occurrence might also be stored by the object, which allows more accurate liftover between the two.
Loading annotations¶
At the moment, the only annotations that are supported are those of RepeatMasker. The coordinate annoation (which describes the locations of the repeat occurrences), as well the full alignments can be loaded seperately, or at the same time.
Repeatmasker coordinate annotation¶
Repeatmasker coordinate annotations are loaded using the following function from the IO module. alignment_index should be left as None (default) when loading only annotations (more details about this below).
The description of what each field means is given below
-
pyokit.datastruct.retrotransposon.
from_repeat_masker_string
(s)¶ Parse a RepeatMasker string. This format is white-space separated and has 15 columns. The first 11 are invariably as follows:
Description Type Example Notes alignment score int (?) 463 percent divergence float 1.3 percent deletions float 0.6 percent insertions float 1.7 query seq. name string chr1 query seq. start int 10001 always pos. strand coords query seq. end int 10468 always pos. strand query seq. remaining int always in parenthesis, always negative strand coords. Bases left in query sequence before end (alt. can be interpreted as negative strand coord of genomic end coord). strand char C either + or C (complement); this refers to whether the match is to the consensus or the rev. comp. of it. repeat name string AluJo repeat family string SINE/Alu The next three depend on whether the genomic sequence was a match to the consensus sequence, or to the reverse complement of the consensus sequence. In the former case, they are as follows:
Description Type Example Notes consensus match start int 45 awlays pos strand consensus match end int 60 always pos strand; end >= start; it is inclusive of the final coordinate (unlike e.g. UCSC coords) consensus match remain (int) for ‘+’ match, always negative strand coords. Num of bases after match before end of consensus. (Alt. can be considered the negative strand coords of the match end in consensus seq.) In the latter case, they have this interpretation:
Description Type Example Notes consensus match before (int) always neg. strand (always parens); number of bases before start of match on negative strand (alt. can be interpreted as the negative strand coordinates of the match start) consensus match start int 345 start of match in pos. strand coords. consensus match end int 143 end of match in pos. strand coords. Because match is on neg. strand, ‘start’ is always >= end. The final column should be an ID, but I have noticed this is sometimes missing. We don’t use all of the above fields, and we convert everything into positive strand coords where start < end always.
Parameters: s – string representation of a retrotransposon occurrence in RepeatMasker format. Returns: Retrotransposon object corresponding to the string s.
Repeatmasker full alignments¶
There is an iterator in the io.alignmentIterators module for iterating over repeatmasker alignments; you can read about how it works in Reading and writing pairwise alignments.
This should be sufficient for any sequential access you need to do. If you need random access to the alignments you could use this to load them into into whatever data-structure you like, but the large size of these files generally makes it impractical to load the full thing. Instead, it’s easier to build an index of the file and then load the alignments you need on-demand. Building the index is fairly straighforward and simply requires an index friendly iterator (one which doesn’t buffer the stream, and supports seeking; see the above link on reading and writing alignments), and a hash function to hash each element the iterator yields to a unique ID (easily done, since the repeat-masker format contains a unique ID for each alignment):
from pyokit.io.indexedFile import IndexedFile
from pyokit.io.alignmentIterators import repeat_masker_alignment_iterator
def extract_UID(rm_alignment):
return rm_alignment.meta[multipleAlignment.RM_ID_KEY]
index = IndexedFile(filename, repeat_masker_alignment_iterator, extract_UID)
You can then use index to extract any alignment you want in O(1), as long as you know the ID for it, using the subscript operator:
index[5158]
You can read more about how indexing works in pyokit by taking a look at File indexes.
This may not be all that useful on its own, since you need to know the IDs to get the alignment you want, but it is very helpful when paired with the repeatmasker coordinate annotations, as described in the next section.
Repeatmasker annotations with on-demand full alignments¶
The iterator for repeatmasker coordinate annotations accepts an alignment index as a parameter. If this is provided, each RetrotransposonOccurrence yielded by the iterator will have access to its full alignment by using the index. This is loaded on-demand when it is needed using the index without the application programmer needing to do anything.
Liftover¶
One operation that I often want to perform is to lift a genomic region that overlaps a retrotransposon occurrence to the consensus sequence – to convert the coordinates of the region so they’re relative to the consensus sequence start. Although this is conceptually straighfoward, there are a couple of details that complicate matters.
The first wrinkle is insertions and deletions. A retrotransposon occurrence might have been inserted a long time ago, in which case other insertions and deletions can occur inside of it. So although you know the start and end coordinates of the occurrence, these may define a region that is longer or shorter than the consensus because of insertions and deletions. Without knowing exactly where these indels are located, your lift over will be inaccurate. If you have the full alignment, then these can be corrected for, but that leads to the second wrinkle: the full alignments are large; loading them whole is generally not practical.
Coordinate-only liftover¶
The folloing program demonstrates how to compute the frequency with which BED regions in a given file overlap each region of the consensus sequence of retrotransposons if only the annotations are avaialable (i.e. no full alignments):
import os, sys
from pyokit.datastruct.genomicInterval import intervalTreesFromList
from pyokit.io.bedIterators import BEDIterator
from pyokit.io.repeatMaskerIterators import repeat_masker_iterator
def build_profiles(repeat_masker_fn, peaks_fn):
l = [e for e in repeat_masker_iterator(repeat_masker_fn, header=False)]
transposon_trees = intervalTreesFromList(l, verbose=True)
for peak in BEDIterator(peaks_fn):
if not peak.chrom in transposon_trees : continue
hits = transposon_trees[peak.chrom].intersectingInterval(peak.start,
peak.end)
for hit in hits :
sys.stderr.write("found hit for peak " + str(peak) +\
" in transposon " + str(hit) + "\n")
lifted_peak = hit.liftover(peak)
sys.stderr.write("\tconverted peak to " + str(lifted_peak) + "\n")
for i in range(lifted_peak.start, lifted_peak.end):
print hit.name + "\t" + hit.family_name + "\t" + str(i)
repeat_masker_fn = sys.argv[1]
peaks_fn = sys.argv[2]
build_profiles(repeat_masker_fn, peaks_fn)