Sequence Alignments

At present, pyokit has support only for pairwise alignments; multiple sequence alignment support is coming soon.

Pairwise alignments

Pairwise alignments are represented by the PairwiseAlignment class; the full documentation for this class is given in The PairwiseAlignment class. Briefly, objects of this class store the sequence data of the two component sequences, along with meta-data stored as key-value pairs. Construction is fairly straightforward:

class pyokit.datastruct.multipleAlignment.PairwiseAlignment(s1, s2, meta_data=None)

An alignment of two sequences (DNA, RNA, protein...).

Parameters:
  • s1 – the first sequence, with gaps
  • s2 – the second sequence, with gaps
  • meta_data – a dictionary with key-value pairs representing meta-data about this alignment.
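As a rough illustration of what the constructor expects, the sketch below prepares two gapped strings and a meta-data dictionary before they are handed to PairwiseAlignment. The helper names (ungapped_length, check_gapped_pair) and the meta-data key strings are hypothetical and exist only for this example; the sequences are the first block of the repeat-masker alignment used later in this section. It assumes that both gapped sequences must have the same length, which is the usual convention for pairwise alignments rather than something stated by the class documentation above, and that plain strings are accepted for s1 and s2.

```python
GAP = "-"


def ungapped_length(seq):
    """Number of residues in a gapped sequence, ignoring gap characters."""
    return len(seq) - seq.count(GAP)


def check_gapped_pair(s1, s2):
    """Return the number of alignment columns, or raise ValueError if the
    two gapped sequences do not have the same length."""
    if len(s1) != len(s2):
        raise ValueError("gapped sequences differ in length: %d vs %d"
                         % (len(s1), len(s2)))
    return len(s1)


s1 = "GAAACT--GGCCCAGAGAGGTGAGGCAGCG"
s2 = "GAAACTGAGGCCCAGAGAGGTGAAGTGACG"
columns = check_gapped_pair(s1, s2)

# Hypothetical key names, used here for illustration only; the keys that the
# class actually recognises are defined by pyokit itself.
meta = {"s1_name": "chr1", "s1_start": 15266, "s1_end": 15293,
        "s2_name": "MIR3#SINE/MIR", "s2_start": 114, "s2_end": 143}

# With pyokit installed, the documented constructor would then be called as:
#   from pyokit.datastruct.multipleAlignment import PairwiseAlignment
#   aln = PairwiseAlignment(s1, s2, meta_data=meta)
```

Checking the lengths up front is a design choice for the example: it surfaces a malformed pair before any alignment object is built.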

Pairwise Alignment Meta-data

The class attaches meaning to a number of key-value pairs that can be included in the meta-data dictionary. These keys are used for various things, mostly (at least for now) formatting the alignment for display. None of them strictly needs to be defined in the meta-data dictionary, although some may be required for producing certain string representations. The known keys can be broken into related groups. First, there are those that store the names and co-ordinates of the two sequences.

Then there are keys that index data about the alignment itself.

Some are specific to pairwise alignments generated by repeat-masker.

Finally, there is a set containing all of the values.

Other meta-data can be stored in the dictionary as needed, but won’t be used by the class.

Reading and writing pairwise alignments

At present, there is only one iterator for reading pairwise alignments: one for repeat-masker alignments. This is the documentation for the iterator, which describes the format, and the header-parsing function, which describes the header format and the mapping between the header columns and the above meta-data keys.
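To make the idea of mapping header columns to meta-data keys concrete, here is a minimal stand-alone sketch of such a parser. It is not pyokit's header-parsing function: the name parse_rm_header and every dictionary key are hypothetical, and the column order (score, percent divergence, percent deletions, percent insertions, then names and co-ordinates) is assumed from the usual repeat-masker layout. The two header lines are the ones used in the example further down; the last two columns of each header are left unparsed.

```python
def parse_rm_header(line):
    """Split a repeat-masker alignment header line into a dictionary."""
    parts = line.split()
    if len(parts) < 14:
        raise ValueError("expected at least 14 header columns, got %d"
                         % len(parts))
    # The ninth column is 'C' only when the repeat matches the reverse strand.
    complement = parts[8] == "C"
    rest = parts[9:] if complement else parts[8:]
    # Of the three repeat co-ordinate columns, the parenthesised one is the
    # number of bases left over; the other two are the start and end.
    coords = [int(p) for p in rest[1:4] if not p.startswith("(")]
    return {
        "score": int(parts[0]),
        "pct_divergence": float(parts[1]),
        "pct_deletions": float(parts[2]),
        "pct_insertions": float(parts[3]),
        "s1_name": parts[4],
        "s1_start": int(parts[5]),
        "s1_end": int(parts[6]),
        "s1_remaining": int(parts[7].strip("()")),
        "s2_name": rest[0],
        "s2_start": min(coords),
        "s2_end": max(coords),
        "s2_strand": "-" if complement else "+",
    }


header_rev = ("318 22.50 3.61 0.00 chr1 15266 15323 (249235276) C "
              "MIR3#SINE/MIR (65) 143 84 m_b1s601i1 5")
header_fwd = ("18 23.18 0.00 1.96 chr1 15798 15830 (249234772) "
              "(TGCTCC)n#Simple_repeat 1 32 (0) m_b1s252i0 6")
meta_rev = parse_rm_header(header_rev)
meta_fwd = parse_rm_header(header_fwd)
```

The reverse-strand header carries one extra column (the C marker) and lists its repeat co-ordinates in the opposite order, which is why the sketch normalises them with min and max.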

Here is an example of a simple program that counts the number of alignments in a repeat-masker alignment file:

>>> import StringIO
>>> from pyokit.io.alignmentIterators import repeat_masker_alignment_iterator
>>> alig_2_header = "318 22.50 3.61 0.00 chr1 15266 15323 (249235276) C " +\
...                 "MIR3#SINE/MIR (65) 143 84 m_b1s601i1 5"
>>> alig_2 = "  chr1          15266 GAAACT--GGCCCAGAGAGGTGAGGCAGCG 15293 \n" +\
...          "                            --               i iii         \n" +\
...          "C MIR3#SINE/MIR   143 GAAACTGAGGCCCAGAGAGGTGAAGTGACG 114   \n" +\
...          "                                                           \n" +\
...          "  chr1          15294 GGTCACAGAGCAAGGCAAAAGCGCGCTGGG 15323 \n" +\
...          "                             v   ?  vi ivi    v            \n" +\
...          "C MIR3#SINE/MIR   113 GGTCACACAGCKAGTTAGTGGCGAGCTGGG 84"
>>> alig_2_m = "Matrix = 25p47g.matrix                                   \n" +\
...            "Kimura (with divCpGMod) = 26.25                          \n" +\
...            "Transitions / transversions = 2.40 (12/5)                \n" +\
...            "Gap_init rate = 0.03 (2 / 79), avg. gap size = 1.50 (3 / 2)"
>>> alig_3_header = "18 23.18 0.00 1.96 chr1 15798 15830 (249234772) " +\
...                 "(TGCTCC)n#Simple_repeat 1 32 (0) m_b1s252i0 6"
>>> alig_3 = "  chr1          15798 GCTGCTTCTCCAGCTTTCGCTCCTTCATGCT 15828  \n" +\
...          "                         v  v    v   iii      v - v          \n" +\
...          "  (TGCTCC)n#Sim     1 GCTCCTGCTCCTGCTCCTGCTCCTGC-TCCT 30     \n" +\
...          "                                                             \n" +\
...          "  chr1          15829 GC 15830                               \n" +\
...          "                                                             \n" +\
...          "  (TGCTCC)n#Sim    31 GC 32                                    "
>>> # NOTE: alig_3_m was not defined in the original listing; the value below
>>> # is a placeholder (an assumption), not output from a real repeat-masker run
>>> alig_3_m = "Matrix = Unknown"
>>> # each record is a header, an alignment block and trailing meta-data,
>>> # separated by blank lines
>>> records = [alig_2_header + "\n\n" + alig_2 + "\n\n" + alig_2_m,
...            alig_3_header + "\n\n" + alig_3 + "\n\n" + alig_3_m]
>>> input_d = "\n\n".join(records)
>>> results = [r for r in
...            repeat_masker_alignment_iterator(StringIO.StringIO(input_d))]
>>> print len(results)
2
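For readers without pyokit at hand, a similar count can be approximated in plain Python by looking for header lines. This is a heuristic sketch, not pyokit's parser: it assumes that a header is any line with at least 14 whitespace-separated columns whose first column is an integer score, and the names is_header_line and count_alignments are hypothetical. The sample text is an abbreviated version of the two records above.

```python
def is_header_line(line):
    """Heuristic: a header has 14 or more columns and starts with an
    integer score; alignment and meta-data lines do not."""
    parts = line.split()
    return len(parts) >= 14 and parts[0].isdigit()


def count_alignments(text):
    """Count the lines in text that look like alignment headers."""
    return sum(1 for line in text.splitlines() if is_header_line(line))


sample = "\n".join([
    "318 22.50 3.61 0.00 chr1 15266 15323 (249235276) C "
    "MIR3#SINE/MIR (65) 143 84 m_b1s601i1 5",
    "",
    "  chr1          15266 GAAACT--GGCCCAGAGAGGTGAGGCAGCG 15293",
    "C MIR3#SINE/MIR   143 GAAACTGAGGCCCAGAGAGGTGAAGTGACG 114",
    "",
    "Gap_init rate = 0.03 (2 / 79), avg. gap size = 1.50 (3 / 2)",
    "",
    "18 23.18 0.00 1.96 chr1 15798 15830 (249234772) "
    "(TGCTCC)n#Simple_repeat 1 32 (0) m_b1s252i0 6",
    "",
    "  chr1          15829 GC 15830",
    "  (TGCTCC)n#Sim    31 GC 32",
])
n_alignments = count_alignments(sample)
```

Unlike the iterator, this only counts records; it does not validate the alignment blocks or build any alignment objects.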