================ Retrotransposons ================ Pyokit has two data structures for representing retro-transposons: Retrotransposon and RetrotransposonOccurrence. The first of these contains information about retro-transposons that is general to all occurrences, such as the name, family name and concensus sequence. The second describes one particular occurrence of a retrotransposon, including details about where it occurs and the name of the sequence it occurrs in (usually a chromosome). An occurrence objsect also stores the start and end co-ordinates into the concensus sequence (becuase sometimes it is truncated at the 3' or 5' end), and whether the match is to the concensus or the reverse compliment of the concensus. RetrotransposonOccurrence objects contain references to Retrotransposon objects in a one-to-many fashion. Optionally, the full alignment between the consensus and the occurrence might also be stored by the object, which allows more accurate liftover between the two. ------------------- Loading annotations ------------------- At the moment, the only annotations that are supported are those of RepeatMasker. The coordinate annoation (which describes the locations of the repeat occurrences), as well the full alignments can be loaded seperately, or at the same time. `````````````````````````````````` Repeatmasker coordinate annotation `````````````````````````````````` Repeatmasker coordinate annotations are loaded using the following function from the IO module. alignment_index should be left as None (default) when loading only annotations (more details about this below). .. autofunction:: pyokit.io.repeatMaskerIterators.repeat_masker_iterator The description of what each field means is given below .. autofunction:: pyokit.datastruct.retrotransposon.from_repeat_masker_string ```````````````````````````` Repeatmasker full alignments ```````````````````````````` There is an iterator in the io.alignmentIterators module for iterating over repeatmasker alignments; you can read about how it works in :ref:`readWriteAlignmentsSection`. This should be sufficient for any sequential access you need to do. If you need random access to the alignments you could use this to load them into into whatever data-structure you like, but the large size of these files generally makes it impractical to load the full thing. Instead, it's easier to build an index of the file and then load the alignments you need on-demand. Building the index is fairly straighforward and simply requires an index friendly iterator (one which doesn't buffer the stream, and supports seeking; see the above link on reading and writing alignments), and a hash function to hash each element the iterator yields to a unique ID (easily done, since the repeat-masker format contains a unique ID for each alignment):: from pyokit.io.indexedFile import IndexedFile from pyokit.io.alignmentIterators import repeat_masker_alignment_iterator def extract_UID(rm_alignment): return rm_alignment.meta[multipleAlignment.RM_ID_KEY] index = IndexedFile(filename, repeat_masker_alignment_iterator, extract_UID) You can then use index to extract any alignment you want in O(1), as long as you know the ID for it, using the subscript operator:: index[5158] You can read more about how indexing works in pyokit by taking a look at :ref:`fileIndexesSection`. This may not be all that useful on its own, since you need to know the IDs to get the alignment you want, but it is very helpful when paired with the repeatmasker coordinate annotations, as described in the next section. ``````````````````````````````````````````````````````` Repeatmasker annotations with on-demand full alignments ``````````````````````````````````````````````````````` The iterator for repeatmasker coordinate annotations accepts an alignment index as a parameter. If this is provided, each RetrotransposonOccurrence yielded by the iterator will have access to its full alignment by using the index. This is loaded on-demand when it is needed using the index without the application programmer needing to do anything. -------- Liftover -------- One operation that I often want to perform is to lift a genomic region that overlaps a retrotransposon occurrence to the consensus sequence -- to convert the coordinates of the region so they're relative to the consensus sequence start. Although this is conceptually straighfoward, there are a couple of details that complicate matters. The first wrinkle is insertions and deletions. A retrotransposon occurrence might have been inserted a long time ago, in which case other insertions and deletions can occur inside of it. So although you know the start and end coordinates of the occurrence, these may define a region that is longer or shorter than the consensus because of insertions and deletions. Without knowing exactly where these indels are located, your lift over will be inaccurate. If you have the full alignment, then these can be corrected for, but that leads to the second wrinkle: the full alignments are large; loading them whole is generally not practical. ```````````````````````` Coordinate-only liftover ```````````````````````` The folloing program demonstrates how to compute the frequency with which BED regions in a given file overlap each region of the consensus sequence of retrotransposons if only the annotations are avaialable (i.e. no full alignments):: import os, sys from pyokit.datastruct.genomicInterval import intervalTreesFromList from pyokit.io.bedIterators import BEDIterator from pyokit.io.repeatMaskerIterators import repeat_masker_iterator def build_profiles(repeat_masker_fn, peaks_fn): l = [e for e in repeat_masker_iterator(repeat_masker_fn, header=False)] transposon_trees = intervalTreesFromList(l, verbose=True) for peak in BEDIterator(peaks_fn): if not peak.chrom in transposon_trees : continue hits = transposon_trees[peak.chrom].intersectingInterval(peak.start, peak.end) for hit in hits : sys.stderr.write("found hit for peak " + str(peak) +\ " in transposon " + str(hit) + "\n") lifted_peak = hit.liftover(peak) sys.stderr.write("\tconverted peak to " + str(lifted_peak) + "\n") for i in range(lifted_peak.start, lifted_peak.end): print hit.name + "\t" + hit.family_name + "\t" + str(i) repeat_masker_fn = sys.argv[1] peaks_fn = sys.argv[2] build_profiles(repeat_masker_fn, peaks_fn) ``````````````````````` Full-alignment liftover ```````````````````````