Pyokit data-structures¶
Genomic Intervals¶
Genomic intervals in pyokit are represented using the GenomicInterval class. The basic information needed to define an interval is the chromosome it’s on and the start and end indices of the interval. The end index is exclusive (i.e. it’s not included in the interval). All intervals also have a DNA strand, which defaults to the positive strand if not set. Additionally, intervals can be given names and scores. The GenomicInterval class is described below. For more information about using objects created from this class, see Genomic Intervals.
The GenomicInterval class¶
-
class
pyokit.datastruct.genomicInterval.
GenomicInterval
(chrom, start, end, name=None, score=None, strand=None, scoreType=<type 'int'>)¶ Represents a contiguous segment of a genome; inclusive of start, but not end.
Parameters: - chrom – The chromosome this genomic interval is on.
- start – The start of this genomic interval (inclusive).
- end – The end of this genomic interval (exclusive).
- name – A name associated with this genomic interval.
- score – A score associated with this genomic interval.
- strand – The DNA strand (+ or -) that this genomic interval is on.
- scoreType – The type (e.g. int, float) of the score associated with this genomic interval
-
distance
(e)¶ Distance between this interval and e – number of nucleotides.
We consider intervals that overlap to have a distance of 0 to each other. The distance between two intervals on different chromosomes is considered undefined, and causes an exception to be raised.
Returns: the distance from this GenomicInterval to e. Parameters: e – the other genomic interval to find the distance to. Raises GenomicIntervalError: if self and e are on different chromosomes.
-
intersects
(e)¶ Check whether e intersects self.
Returns: True if this element intersects the element e
-
isNegativeStrand
()¶ Check if this genomic interval is on the negative strand.
Returns: True if this element is on the negative strand
-
isPositiveStrand
()¶ Check if this genomic region is on the positive strand.
Returns: True if this element is on the positive strand
-
sameRegion
(e)¶ Check whether self represents the same DNA region as e.
Parameters: e – genomic region to compare against Returns: True if self and e are for the same region (ignores differences in non-region related fields, such as name or score – but does consider strand)
-
signedDistance
(e)¶ Signed distance between this interval and e – number of nucleotides.
Get the signed distance from this genomic interval to another one. We consider intervals that overlap to have a distance of 0 to each other. The distance between two intervals on different chromosomes is considered undefined, and causes an exception to be raised. If e comes earlier than self, the distance will be negative.
Returns: the signed distance from this GenomicInterval to e. Parameters: e – the other genomic interval to find the distance to. Raises GenomicIntervalError: if self and e are on different chromosomes.
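The sign convention described above can be illustrated with a small standalone sketch. It does not use pyokit: `signed_distance` is a hypothetical helper that works on `(chrom, start, end)` tuples with half-open coordinates. It assumes the distance is the number of nucleotides lying strictly between the two intervals, so adjacent intervals are at distance 0; the documentation above does not say whether pyokit counts adjacent intervals the same way.

```python
def signed_distance(a, b):
    """Signed gap, in nucleotides, from half-open interval a to interval b.

    Both arguments are (chrom, start, end) tuples. Assumption: the distance
    is the count of bases strictly between the two intervals.
    """
    a_chrom, a_start, a_end = a
    b_chrom, b_start, b_end = b
    if a_chrom != b_chrom:
        # Mirrors the documented behaviour: undefined across chromosomes.
        raise ValueError("intervals are on different chromosomes")
    if b_end <= a_start:
        # b lies entirely before a, so the distance is negative.
        return b_end - a_start
    if a_end <= b_start:
        # b lies entirely after a, so the distance is positive.
        return b_start - a_end
    return 0  # the intervals overlap


here = ("chr1", 100, 200)
print(signed_distance(here, ("chr1", 250, 300)))  # 50
print(signed_distance(here, ("chr1", 10, 60)))    # -40
print(signed_distance(here, ("chr1", 150, 400)))  # 0
```

Under the same assumption, the unsigned distance is simply the absolute value of this result.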
-
sizeOfOverlap
(e)¶ Get the size of the overlap between self and e.
Returns: the number of bases that are shared in common between self and e.
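For half-open intervals, the overlap size follows directly from the endpoints. The sketch below is standalone and is not pyokit's code; `size_of_overlap` and `intersects` are hypothetical helpers that take plain integer coordinates and assume both intervals are on the same chromosome.

```python
def size_of_overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by half-open [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def intersects(a_start, a_end, b_start, b_end):
    """Two half-open intervals intersect when they share at least one position."""
    return size_of_overlap(a_start, a_end, b_start, b_end) > 0


print(size_of_overlap(10, 20, 15, 30))  # 5
print(size_of_overlap(10, 20, 20, 30))  # 0
print(intersects(10, 20, 19, 30))       # True
```

Because the end index is exclusive, two intervals that merely touch (one ends where the other starts) share no bases.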
-
subtract
(es)¶ Subtract the BED elements in es from self.
Parameters: es – a list of BED elements (or anything with chrom, start, end) Returns: a list of BED elements which represent what is left of self after the subtraction. This might be an empty list.
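A minimal sketch of this kind of interval subtraction, written without pyokit. The `subtract` function here is a hypothetical stand-in that works on `(start, end)` pairs with half-open coordinates and ignores chromosome and strand; it only illustrates how the remaining pieces can be computed.

```python
def subtract(start, end, others):
    """Return what is left of [start, end) after removing every (s, e) in others."""
    remaining = []
    cursor = start  # leftmost position not yet known to be removed
    for s, e in sorted(others):
        if e <= cursor:
            continue  # lies entirely left of the part still uncovered
        if s >= end:
            break  # this piece, and every later one, starts past the interval
        if s > cursor:
            remaining.append((cursor, s))  # the stretch before this piece survives
        cursor = max(cursor, e)
        if cursor >= end:
            break
    if cursor < end:
        remaining.append((cursor, end))
    return remaining


print(subtract(0, 100, [(10, 20), (50, 60)]))  # [(0, 10), (20, 50), (60, 100)]
print(subtract(0, 100, [(-5, 200)]))           # []
```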
-
transform_center
(size)¶ Transform self so it is centered on the same spot, but has new size.
If the region grows, the extra nucleotides will be distributed evenly to the 5’ and 3’ side of the current region if possible. If the extra is odd, the 3’ side will get the extra one. Similarly, if the resize shrinks the interval, bases will be removed from the 5’ and 3’ sides equally; if the number to remove is odd, the extra one will be removed from the 3’ side.
Parameters: size – size of the region after transformation.
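The resizing rule can be sketched as follows. This is a standalone illustration, not pyokit's implementation: the hypothetical `transform_center` below returns new coordinates rather than modifying an object, and it assumes a positive-strand interval, so the 5’ side is the start and the 3’ side is the end. It does not guard against the start dropping below zero.

```python
def transform_center(start, end, size):
    """Resize half-open [start, end) to `size` bases around the same centre.

    Assumes the positive strand: `start` is the 5' side, `end` the 3' side.
    When the change in size is odd, the 3' side takes the extra base.
    """
    current = end - start
    change = abs(size - current)
    five_prime = change // 2
    three_prime = change - five_prime
    if size >= current:
        # Growing: push both boundaries outwards.
        return start - five_prime, end + three_prime
    # Shrinking: pull both boundaries inwards.
    return start + five_prime, end - three_prime


print(transform_center(100, 110, 15))  # (98, 113)
print(transform_center(100, 110, 7))   # (101, 108)
```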
Interval Trees¶
Interval trees are binary trees that allow random access lookup of intervals that are intersected by a given point or interval.
The IntervalTree class¶
-
class
pyokit.datastruct.intervalTree.
IntervalTree
(intervals, openEnded=False)¶ An interval tree is a binary tree that allows fast O(log(n)) lookup of intervals that intersect a given point or interval.
Parameters: intervals – list of intervals; it doesn’t need to be sorted in any way. The items can be any objects, as long as they have ‘start’ and ‘end’ attributes.
-
intersectingInterval
(start, end)¶ Given an interval, get the intervals in the tree that it intersects.
Parameters: - start – start of the intersecting interval
- end – end of the intersecting interval
Returns: the list of intersected intervals
-
intersectingIntervalIterator
(start, end)¶ Get an iterator which will iterate over those objects in the tree which intersect the given interval - sorted in order of start index
Parameters: - start – find intervals in the tree that intersect an interval with this start index (inclusive)
- end – find intervals in the tree that intersect an interval with this end index (exclusive)
Returns: an iterator that will yield intersected intervals
-
intersectingPoint
(p)¶ Given a point, get the intervals in the tree that it intersects.
Parameters: p – intersection point Returns: the list of intersected intervals
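To make the idea of a point lookup concrete, here is a compact sketch of a centred interval tree. It is an illustration under stated assumptions, not pyokit's IntervalTree: the class name `IntervalTreeSketch`, the method name `intersecting_point`, and the `Interval` tuple are all hypothetical, intervals are treated as half-open, and at least one interval must be supplied.

```python
from collections import namedtuple

Interval = namedtuple("Interval", ["start", "end"])  # hypothetical stand-in


class IntervalTreeSketch(object):
    """Centred interval tree over objects with .start and .end (half-open)."""

    def __init__(self, intervals):
        intervals = list(intervals)  # must be non-empty
        # Use the lower median endpoint as the centre of this node.
        points = sorted(p for i in intervals for p in (i.start, i.end))
        self.mid = points[(len(points) - 1) // 2]
        here = [i for i in intervals if i.start <= self.mid <= i.end]
        left = [i for i in intervals if i.end < self.mid]
        right = [i for i in intervals if i.start > self.mid]
        # Intervals spanning the centre are kept in two sort orders.
        self.by_start = sorted(here, key=lambda i: i.start)
        self.by_end = sorted(here, key=lambda i: i.end, reverse=True)
        self.left = IntervalTreeSketch(left) if left else None
        self.right = IntervalTreeSketch(right) if right else None

    def intersecting_point(self, p):
        """Return the intervals with start <= p < end."""
        hits = []
        if p < self.mid:
            # Everything stored here ends at or after mid, so only start matters.
            for i in self.by_start:
                if i.start > p:
                    break
                hits.append(i)
            child = self.left
        else:
            # Everything stored here starts at or before mid, so only end matters.
            for i in self.by_end:
                if i.end <= p:
                    break
                hits.append(i)
            child = self.right if p > self.mid else None
        if child is not None:
            hits.extend(child.intersecting_point(p))
        return hits


tree = IntervalTreeSketch([Interval(0, 10), Interval(5, 15), Interval(20, 30)])
print(sorted(tree.intersecting_point(7)))   # [Interval(start=0, end=10), Interval(start=5, end=15)]
print(tree.intersecting_point(10))          # [Interval(start=5, end=15)]
print(tree.intersecting_point(17))          # []
```

Each query visits one node per level and scans only as far into the sorted lists as there are hits, which is where the fast lookup comes from when the tree is reasonably balanced.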
Sequences¶
Sequences in Pyokit are represented using the Sequence class, which wraps a name and the actual sequence data, and provides much of the basic functionality for manipulating sequences. Specializations exist for particular sequence formats. At present, fasta and fastq are supported. The Sequence, FastaSequence and FastqSequence classes are described below. For more information about manipulating sequence data, see Biological Sequences.
The Sequence class¶
-
class
pyokit.datastruct.sequence.
Sequence
(seqName, seqData, start_coord=None, end_coord=None, strand='+', remaining=0, meta_data=None, useMutableString=False)¶ This is the base class for all sequences in Pyokit. Objects from this class will have only a sequence name and actual nucleotide sequence data.
Parameters: - seqName – A name describing the sequence. Can be any string.
- seqData – The nucleotide sequence data. Can be DNA or RNA. Note that there is no check to make sure the sequence data is valid, that’s the responsibility of the caller.
- start_coord – TODO
- end_coord – TODO
- strand – By default, this is +, but can also be set to - to indicate that this sequence is a reverse complement.
- remaining – the amount of sequence that comes after this; 0 if this is the whole sequence. Alternatively, you might think of this as the negative strand coordinates of the end of this sequence.
- meta_data – dictionary containing meta-data key-value pairs
- useMutableString – Store the sequence data as a mutable string, rather than a regular Python string. This should make editing operations much faster, but it comes at the expense of less flexibility (e.g. the object cannot be used as a hash key because it is mutable).
-
clip_end
(seq, mm_score)¶ Clip a sequence from the end of this sequence. We assume the sequence to be clipped will always begin somewhere in this sequence, but may not be fully contained. If found, it is replaced with Ns.
Parameters: - seq – sequence to be clipped
- mm_score – the number of matching bases needed to consider a hit, mm_score = len(seq) would be 100% match
-
copy
()¶ Copy constructor for Sequence objects.
-
effective_len
¶ Get the length of the sequence if N’s are disregarded.
-
end
¶ Returns: The coordinate of the end of this sequence; as with all other indexing of sequences in pyokit, sequences are not inclusive of their last index. Computed just-in-time from the ungapped sequence length if it wasn’t provided at construction time.
-
gapped_relative_subsequence
(start, end)¶ Extract a subsequence from this sequence using coordinates that are relative to the start of the sequence (relative position 1) and the number of nucleotides in the sequence, including gaps. For example:
46 --> A--CTGC-TAGC-GATCGACT <-- 62
gapped_relative_subsequence(2,7) == --CTG
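When gaps count as positions, the lookup reduces to an ordinary slice with the one-based coordinates shifted down by one. The function below is a hypothetical standalone helper that operates on a plain string; it is not the pyokit method, only a sketch of the coordinate arithmetic.

```python
def gapped_relative_subsequence(seq, start, end):
    """Slice `seq` with one-based, end-exclusive coordinates that count gaps."""
    return seq[start - 1:end - 1]


aligned = "A--CTGC-TAGC-GATCGACT"
print(gapped_relative_subsequence(aligned, 2, 7))  # --CTG
```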
-
isDNA
()¶ Make a guess as to whether this sequence is a DNA sequence or not by looking at the symbols it contains.
Returns: True if contains only DNA nucleotides, False otherwise
-
isLowQuality
()¶ Determine whether this is a low quality sequence. To be considered a low quality sequence, it must have > 10% Ns.
Returns: True if this sequence meets the above definition of low-quality.
-
isPolyA
()¶ Determine whether this sequence is polyA. To be a polyA sequence, it must have > 90% Adenine.
Returns: True if the sequence is PolyA by the above definition.
-
isPolyT
()¶ Determine whether this sequence is polyT. To be a polyT sequence, it must have > 90% Thymine.
Returns: True if the sequence is PolyT by the above definition.
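The three composition checks above (low quality, polyA, polyT) all compare a nucleotide percentage against a threshold. The sketch below shows that logic on plain strings. The helper names are hypothetical, the comparison is case-sensitive, and percentages are assumed to run from 0 to 100; pyokit's own scale and case handling are not stated here.

```python
def percent_nuc(seq, nuc):
    """Percentage (0-100) of characters in seq that equal nuc."""
    if not seq:
        return 0.0
    return 100.0 * seq.count(nuc) / len(seq)


def is_poly_a(seq):
    return percent_nuc(seq, "A") > 90


def is_poly_t(seq):
    return percent_nuc(seq, "T") > 90


def is_low_quality(seq):
    return percent_nuc(seq, "N") > 10


print(is_poly_a("A" * 19 + "C"))    # True
print(is_poly_a("A" * 9 + "C"))     # False
print(is_low_quality("ACGTN" * 2))  # True
```

Note that the thresholds are strict: a sequence that is exactly 90% adenine is not polyA under this reading.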
-
isRNA
()¶ Make a guess as to whether this sequence is an RNA sequence or not by looking at the symbols it contains.
Returns: True if contains only RNA nucleotides, False otherwise
-
is_positive_strand
()¶
-
maskMatch
(mask)¶ Determine whether this sequence matches the given mask.
Parameters: mask – string to match against. Ns in the mask are considered to match anything in the sequence – all other chars must match exactly. Returns: True if the mask matches at all places, False otherwise
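The matching rule can be sketched on plain strings as follows. `mask_match` is a hypothetical helper, and treating a length mismatch as a failed match is an assumption of this sketch rather than documented pyokit behaviour.

```python
def mask_match(seq, mask):
    """True if every non-N character of mask equals the character of seq at that position."""
    if len(seq) != len(mask):
        return False  # assumption: different lengths never match
    return all(m == "N" or m == s for s, m in zip(seq, mask))


print(mask_match("ACGT", "ANNT"))  # True
print(mask_match("ACGT", "ANNA"))  # False
```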
-
maskRegion
(region)¶ Replace the nucleotides of this sequence that fall in the given region with Ns.
Parameters: region – any object with .start and .end attributes. Co-ords are zero based and inclusive of both end points. Any other attributes (e.g. chrom.) are ignored. Raises SequenceError: if region specifies nucleotides not present in this sequence
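Unlike the half-open convention used for intervals elsewhere, the region here includes both end points. A standalone sketch of what that means for the masking arithmetic, using a hypothetical `mask_region` helper that returns a new string instead of editing a Sequence object and raises `ValueError` in place of SequenceError:

```python
def mask_region(seq, start, end):
    """Replace seq[start..end] (zero-based, both ends inclusive) with Ns."""
    if start < 0 or end >= len(seq) or start > end:
        raise ValueError("region is not contained in the sequence")
    # end - start + 1 bases are masked because the end point is included.
    return seq[:start] + "N" * (end - start + 1) + seq[end + 1:]


print(mask_region("ACGTACGT", 2, 4))  # ACNNNCGT
```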
-
maskRegions
(regions, verbose=False)¶ Mask the given regions in this sequence with Ns.
Parameters: - regions – iterable of regions to mask. Each region can be any object with .start and .end attributes. Co-ords are zero based and inclusive of both end points. Any other attributes (e.g. chrom.) are ignored.
- verbose – print status messages to stderr if True
-
meta_data_to_string
()¶
-
nsLeft
(amount)¶ Replace leftmost <amount> bases by Ns.
-
nsRight
(amount)¶ Replace rightmost <amount> bases by Ns
-
percentNuc
(nuc)¶ Return the percentage of the sequence which is equal to the passed nuc.
Parameters: nuc – the nucleotide to compute percentage composition for. There is no check to make sure this is a valid nucleotide. Returns: the percentage of the sequence that is <nuc>
-
relative_subsequence
(start, end)¶ Extract a subsequence from this sequence using coordinates that are relative to the start (relative position 1) and end coordinates of the sequence. For example:
46 --> A--CTGC-TAGC-GATCGACT <-- 62
relative_subsequence(2,7) == CTGC-T
Parameters: - start – the index marking the start (inclusive) of the subsequence. This is a one-based index, and is in the coordinate space of the sequence (i.e. from 1 to N, where N is the number of non-gap nucleotides in the sequence)
- end – the index marking the end (exclusive) of the subsequence. This is a one-based index, and is in the coordinate space of the sequence (i.e. from 1 to N, where N is the number of non-gap nucleotides in the sequence)
Returns: a new sequence object that represents the subsequence
Raises SequenceError: if the start coordinate is less than 1 or the end coordinate is greater than the ungapped length of this sequence.
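When coordinates count only non-gap nucleotides, the lookup has to walk the sequence and keep a running count. The sketch below is a hypothetical standalone helper on plain strings, not the pyokit method. It omits bounds checking, and it keeps any gap characters that directly follow an included nucleotide; how pyokit treats gaps at the edges of the range is not stated above.

```python
def relative_subsequence(seq, start, end, gap="-"):
    """Slice by one-based, end-exclusive coordinates that skip gap characters."""
    out = []
    position = 0  # number of non-gap characters seen so far
    for char in seq:
        if char != gap:
            position += 1
        if position >= end:
            break  # reached the first excluded nucleotide
        if position >= start:
            out.append(char)  # includes gaps that sit inside the range
    return "".join(out)


aligned = "A--CTGC-TAGC-GATCGACT"
print(relative_subsequence(aligned, 2, 7))  # CTGC-T
```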
-
reverseComplement
(isRNA=None)¶ Reverse complement this sequence in-place.
Parameters: isRNA – if True, treat this sequence as RNA. If False, treat it as DNA. If None (default), inspect the sequence and make a guess as to whether it is RNA or DNA.
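A minimal sketch of reverse complementing with an RNA/DNA guess. This is not pyokit's implementation: the hypothetical `reverse_complement` below returns a new string rather than working in-place, handles uppercase symbols only, and guesses RNA whenever the sequence contains U but no T.

```python
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A", "N": "N"}


def reverse_complement(seq, is_rna=None):
    """Return the reverse complement of seq as a new string."""
    if is_rna is None:
        # Crude guess (an assumption of this sketch): U present and T absent means RNA.
        is_rna = "U" in seq and "T" not in seq
    table = RNA_COMPLEMENT if is_rna else DNA_COMPLEMENT
    # Characters missing from the table (e.g. gaps) pass through unchanged.
    return "".join(table.get(base, base) for base in reversed(seq))


print(reverse_complement("AACGT"))  # ACGTT
print(reverse_complement("AACGU"))  # ACGUU
```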
-
similarity
(self_start, self_end, other_start, other_end, other)¶ Compute the number of matching bases in the subsequences self[self_start, self_end] and other[other_start, other_end]. Note that the subsequences must be the same length.
Parameters: - self_start – start index for sub-sequence in self
- self_end – end index for sub-sequence in self
- other_start – start index for subsequence in other sequence
- other_end – end index for subsequence in other sequence
- other – other sequence to compare to this.
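Counting matching bases between two equal-length windows can be sketched like this. The helper is hypothetical and works on plain strings; it assumes zero-based, end-exclusive indices (ordinary Python slicing), which may differ from the indexing pyokit uses for this method.

```python
def similarity(a, a_start, a_end, b, b_start, b_end):
    """Count positions at which a[a_start:a_end] and b[b_start:b_end] agree."""
    if a_end - a_start != b_end - b_start:
        raise ValueError("subsequences must be the same length")
    window_a = a[a_start:a_end]
    window_b = b[b_start:b_end]
    return sum(1 for x, y in zip(window_a, window_b) if x == y)


# Compares "ACGT" with "ACGA": three of the four positions match.
print(similarity("ACGTAC", 0, 4, "TTACGA", 2, 6))  # 3
```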
-
split
(point=None)¶ Split this sequence into two halves and return them. The original sequence remains unmodified.
Parameters: point – defines the split point, if None then the centre is used Returns: two Sequence objects – one for each side
-
start
¶ Returns: The coordinate of the first nucleotide in this sequence; by convention, we call this coordinate 1 if no other value was provided.
-
subsequence
(start, end)¶ Extract a subsequence from this sequence object using absolute coordinates that exist in the same coordinate space as the sequence itself. For example:
46 --> A--CTGC-TAGC-GATCGACT <-- 62
subsequence(47,52) == CTGC-T
Parameters: - start – the index marking the start (inclusive) of the subsequence. This is a one-based index, and is in the same coordinate space as this sequence object.
- end – the index marking the end (exclusive) of the subsequence. This is a one-based index, and is in the same coordinate space as this sequence object.
Returns: a new sequence object that represents the subsequence of this from position start (indexed from 1, inclusive) to end (indexed from 1, exclusive). Ungapped length of this will always be equal to end - start
Raises SequenceError: if the coordinates given fall outside of the start and end indices of this sequence object.
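Absolute coordinates differ from relative ones only by the offset of the sequence's start coordinate. The conversion can be sketched with a hypothetical helper (not part of pyokit):

```python
def absolute_to_relative(start, end, seq_start):
    """Convert absolute one-based coordinates to positions relative to seq_start."""
    offset = seq_start - 1  # a sequence starting at 1 needs no shift
    return start - offset, end - offset


# For the example sequence, which starts at coordinate 46:
print(absolute_to_relative(47, 52, seq_start=46))  # (2, 7)
```

This is why subsequence(47,52) in the example above selects the same nucleotides as relative coordinates (2,7).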
-
toDNA
()¶ Convert this sequence in-place to a DNA sequence by changing any Us to Ts
-
toRNA
()¶ Convert this sequence in-place to an RNA sequence by changing any Ts to Us
-
to_fasta_str
(line_width=50, include_coords=True)¶ Returns: string representation of this sequence object in fasta format
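As a rough illustration of what the line_width parameter controls, the sketch below wraps sequence data under a fasta header line. The function is a hypothetical standalone helper on plain strings; it writes only the name in the header and does not attempt to reproduce whatever include_coords adds, since that format is not described here.

```python
def to_fasta_str(name, seq, line_width=50):
    """Format a name and sequence as fasta text, wrapping at line_width characters."""
    lines = [">" + name]
    for i in range(0, len(seq), line_width):
        lines.append(seq[i:i + line_width])
    return "\n".join(lines)


print(to_fasta_str("s1", "ACGTACGTAC", line_width=4))
# >s1
# ACGT
# ACGT
# AC
```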
-
truncate
(newLength)¶ Truncate this sequence in-place so it’s only <newLength> nucleotides long.
Parameters: newLength – the length to truncate this sequence to.
The FastaSequence class¶
The FastqSequence class¶
Sequence Alignments¶
The PairwiseAlignment class¶
-
class
pyokit.datastruct.multipleAlignment.
PairwiseAlignment
(s1, s2, meta_data=None)¶ An alignment of two sequences (DNA, RNA, protein...).
Parameters: - s1 – the first sequence, with gaps
- s2 – the second sequence, with gaps
- meta_data – a dictionary with key-value pairs representing meta-data about this alignment.
-
s1
¶ Returns: the first sequence in the alignment.
-
s2
¶ Returns: the second sequence in the alignment.
Gene Ontology¶
There are two classes for representing gene ontology data in Pyokit. The first of these is a simple data type wrapping basic information about a GO term:
-
class
pyokit.datastruct.geneOntology.
GeneOntologyTerm
(name, identified=None, catagory=None)¶ Represents a gene ontology term.
Parameters: - name – the term name, can contain spaces; e.g.: intracellular transport
- identifier – the term identifier; e.g.: GO:0046907
- catagory – the database or category the term belongs to; e.g.: GOTERM_BP_FAT
The second is a subclass of the general GO term class that adds information about term enrichment.
-
class
pyokit.datastruct.geneOntology.
GeneOntologyEnrichmentResult
(name, pvalue, identifier=None, catagory=None)¶ Represents the result of a gene ontology enrichment calculation for a single GO term.
Parameters: - name – the term name, can contain spaces; e.g.: intracellular transport
- pvalue – significance of the enrichment of this term
- identifier – the term identifier; e.g.: GO:0046907
- catagory – the database or category the term belongs to; e.g.: GOTERM_BP_FAT