Biological Sequences

Sequences in Pyokit are represented using the Sequence class, which wraps a name and the actual sequence data and provides much of the basic functionality for manipulating sequences. Further specializations provide extra functionality for fasta-format and fastq-format reads. For the class descriptions, see The Sequence class, The FastaSequence class, and The FastqSequence class.

Processing Fasta files

Pyokit contains an iterator for fasta files. Here’s an example of the usual idiom for processing a fasta file in a scenario that requires only one pass (in this case, reverse complement all the sequences in the fasta file).

>>> from pyokit.io.fastaIterators import fastaIterator
>>> for s in fastaIterator("sequences.fa"):
...   s.reverseComplement()
...   print s

Given the following contents for sequences.fa:

>one
ACTGATGCGCTAGCGCGTA
CTGACGCG
>two
CTAGCTAGCGCGCTAGTGCGGG
>three
TTTCGAGCCG
GGGCAAAAA

This script will produce this output:

>one
CGCGTCAGTACGCGCTAGC
GCATCAGT
>two
CCCGCACTAGCGCGCTAGCTAG
>three
TTTTTGCCCC
GGCTCGAAA

Notice that the iterator can handle sequence data split over multiple lines, and that output formatting respects line width from the input file (with the minor exception that line-width for a single sequence cannot be ragged, so it takes the length of the first line for that sequence).
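To make the line-width behaviour concrete, here is a minimal sketch of the idea in plain Python. It is not Pyokit's implementation: the names read_fasta and format_fasta are hypothetical and exist only for this illustration, and the sketch assumes the input is any iterable of text lines.

```python
def read_fasta(lines):
    """Yield (name, sequence, line_width) tuples from fasta-format lines.

    Hypothetical helper for illustration; not part of Pyokit.
    line_width is the length of the first data line of each record.
    """
    name, chunks, width = None, [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks), width
            name, chunks, width = line[1:], [], None
        else:
            if width is None:
                width = len(line)
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks), width


def format_fasta(name, seq, width):
    """Format a record, wrapping the sequence at a single fixed width."""
    width = width or max(len(seq), 1)
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([">" + name] + wrapped)


# A ragged record is re-wrapped at the width of its first data line.
ragged = [">ragged", "ACGT", "AC", "ACGTA"]
for record in read_fasta(ragged):
    formatted = format_fasta(*record)  # ">ragged\nACGT\nACAC\nGTA"
```

Remembering only the first line's width keeps the output format down to one number per sequence, which is why a ragged layout cannot be reproduced exactly.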

Here’s the signature for the iterator:

pyokit.io.fastaIterators.fastaIterator(fn, useMutableString=False, verbose=False)

A generator function which yields FastaSequence objects from a fasta-format file or stream.

Parameters:
  • fn – a file-like stream or a string; if this is a string, it’s treated as a filename, else it’s treated as a file-like object, which must have a readline() method.
  • useMutableString – if True, construct sequences from lists of chars, rather than python string objects, to allow more efficient editing. Use with caution.
  • verbose – if True, output additional status messages to stderr about progress
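The "filename or stream" convention for fn can be sketched as follows. This is only an illustration of the pattern, written for Python 3; as_stream and first_line are hypothetical names, not Pyokit functions, and Pyokit's own handling may differ in detail.

```python
import io


def as_stream(fn):
    """Return (stream, opened_here) for a filename or a file-like object.

    Hypothetical helper for illustration; not part of Pyokit.
    """
    if isinstance(fn, str):
        # A string is treated as a filename.
        return open(fn), True
    if not hasattr(fn, "readline"):
        raise TypeError("expected a filename or an object with readline()")
    return fn, False


def first_line(fn):
    """Read the first line from a filename or stream."""
    stream, opened_here = as_stream(fn)
    try:
        return stream.readline().strip()
    finally:
        # Only close what we opened; the caller owns streams it passed in.
        if opened_here:
            stream.close()


header = first_line(io.StringIO(">one\nACGT\n"))  # ">one"
```

Accepting either form means the same code path serves files on disk, sys.stdin, and in-memory buffers used in tests.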

Processing FastQ files

Processing fastq files works much the same way as processing fasta files, but there are a few extra wrinkles. Pyokit provides three iterators for fastq files, which differ in how they handle sequence and/or quality data split over multiple lines. The first, fastqIteratorSimple, ignores this possibility and expects every fastq record to occupy exactly 4 lines. In general this will be fine, as fastq data, in my experience, always follows this convention. The second, fastqIteratorComplex, can process files/streams where this convention isn’t followed (but it is slower, so it should be avoided if possible). Finally, there is a general function, fastqIterator, which at present just wraps the simple iterator. In almost all cases, it will be sufficient just to use this.

Here’s the same example from above, but this time using a fastq iterator:

>>> from pyokit.io.fastqIterators import fastqIterator
>>> for s in fastqIterator("sequences.fq"):
...   s.reverseComplement()
...   print s

Given this input fastq file:

@one
ACTGATGCGCTAGCGCGTACTGACGCG
+one
!''*((((***+))%%%++)(%%%%).
@two
CTAGCTAGCGCGCTAGTGCGGG
+two
1***-+*''))**55CCF>>!*
@three
TTTCGAGCCGGGGCAAAAA
+three
CCCCCCC65+))%%%+%%)

The output will be:

@one
CGCGTCAGTACGCGCTAGCGCATCAGT
+one
.)%%%%()++%%%))+***((((*''!
@two
CCCGCACTAGCGCGCTAGCTAG
+two
*!>>FCC55**))''*+-***1
@three
TTTTTGCCCCGGCTCGAAA
+three
)%%+%%%))+56CCCCCCC

Notice that quality scores are also correctly reversed.
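The reason the quality string has to be reversed is that each quality character belongs to one base, so the two must stay aligned after the sequence is flipped. Here is a minimal sketch of that operation in plain Python; reverse_complement_read is a hypothetical name used only for illustration and is not how Pyokit implements reverseComplement().

```python
# Complement table for unambiguous DNA bases plus N (an assumption of
# this sketch; other IUPAC codes are not handled).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_read(seq, qual):
    """Return the reverse complement of seq and the reversed quality string.

    Hypothetical helper for illustration; not part of Pyokit.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    rc = "".join(COMPLEMENT[base] for base in reversed(seq.upper()))
    # Qualities are only reversed, never complemented.
    return rc, qual[::-1]


# Read "two" from the example above.
rc_seq, rc_qual = reverse_complement_read(
    "CTAGCTAGCGCGCTAGTGCGGG", "1***-+*''))**55CCF>>!*")
# rc_seq  is "CCCGCACTAGCGCGCTAGCTAG"
# rc_qual is "*!>>FCC55**))''*+-***1"
```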

Here’s the signature for the general fastQ iterator:

pyokit.io.fastqIterators.fastqIterator(fn, verbose=False, allowNameMissmatch=False)

A generator function which yields FastqSequence objects read from a file or stream. This is a general function which wraps fastqIteratorSimple. In future releases, we may allow dynamic switching of which base iterator is used.

Parameters:
  • fn – A file-like stream or a string; if this is a string, it’s treated as a filename specifying the location of an input fastq file, else it’s treated as a file-like object, which must have a readline() method.
  • useMutableString – if True, construct sequences from lists of chars, rather than python string objects, to allow more efficient editing. Use with caution.
  • verbose – if True, print messages on progress to stderr.
  • debug – if True, print debugging messages to stderr.
  • sanger – if True, assume quality scores are in sanger format. Otherwise, assume they’re in Illumina format.
  • allowNameMissmatch – don’t throw an error if the names in the sequence data and quality data parts of a read don’t match. Newer versions of CASAVA seem to output data like this, probably to save space.
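The sanger flag matters because the two encodings map quality characters to scores with different ASCII offsets. The sketch below assumes that "Illumina format" means the Phred+64 encoding used by Illumina pipeline versions 1.3 to 1.7, and that sanger format means Phred+33; decode_quality is a hypothetical name, and Pyokit's own handling may differ.

```python
SANGER_OFFSET = 33    # Phred+33
ILLUMINA_OFFSET = 64  # Phred+64 (assumed meaning of "Illumina format")


def decode_quality(qual, sanger=True):
    """Convert a quality string to a list of integer Phred scores.

    Hypothetical helper for illustration; not part of Pyokit.
    """
    offset = SANGER_OFFSET if sanger else ILLUMINA_OFFSET
    scores = [ord(ch) - offset for ch in qual]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong encoding was assumed.
        raise ValueError("quality character below the offset for this format")
    return scores


sanger_scores = decode_quality("!+5I", sanger=True)     # [0, 10, 20, 40]
illumina_scores = decode_quality("@JTh", sanger=False)  # [0, 10, 20, 40]
```

Decoding with the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before relying on the default.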

Here’s the signature for the simple iterator:

pyokit.io.fastqIterators.fastqIteratorSimple(fn, verbose=False, allowNameMissmatch=False)

A generator function that yields FastqSequence objects read from a fastq-format stream or filename. This iterator requires that all sequence and quality data is provided on a single line – put another way, it cannot parse fastq files with newline characters interspersed in the sequence and/or quality strings. That’s probably okay though, as fastq files tend not to be formatted like that (famous last words...).

Parameters:
  • fn – filename or stream to read data from.
  • allowNameMissmatch – don’t throw an error if the names in the sequence data and quality data parts of a read don’t match. Newer versions of CASAVA seem to output data like this, probably to save space.
  • verbose – if True, output additional status messages to stderr about progress.
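The four-lines-per-record assumption and the name check can be sketched together in a few lines of plain Python. This is an illustration only, not Pyokit's code: fastq_records_simple and its allow_name_mismatch argument are hypothetical, and treating a bare "+" separator as a mismatch is an assumption of this sketch.

```python
def fastq_records_simple(lines, allow_name_mismatch=False):
    """Yield (name, seq, qual) tuples, assuming four lines per record.

    Hypothetical helper for illustration; not part of Pyokit.
    """
    stripped = (line.strip() for line in lines)
    it = (line for line in stripped if line)  # skip blank lines
    for header in it:
        seq = next(it, None)
        separator = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError("truncated record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        name, qual_name = header[1:], separator[1:]
        if name != qual_name and not allow_name_mismatch:
            raise ValueError("name mismatch: %r vs %r" % (name, qual_name))
        yield name, seq, qual


lines = ["@one", "ACGT", "+one", "!!!!", "@two", "GG", "+", "II"]
records = list(fastq_records_simple(lines, allow_name_mismatch=True))
# records is [("one", "ACGT", "!!!!"), ("two", "GG", "II")]
```

Because the layout is fixed, the parser never has to inspect the quality line to find the end of a record, which is what keeps this approach fast.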

And here’s the signature for the iterator that handles multi-line quality and sequence data:

pyokit.io.fastqIterators.fastqIteratorComplex(fn, useMutableString=False, verbose=False)

A generator function which yields FastqSequence objects read from a file or stream. This iterator can handle fastq files that have their sequence and/or their quality data split across multiple lines (i.e. there are newline characters in the sequence and quality strings).

Parameters:
  • fn – A file-like stream or a string; if this is a string, it’s treated as a filename specifying the location of an input fastq file, else it’s treated as a file-like object, which must have a readline() method.
  • useMutableString – if True, construct sequences from lists of chars, rather than python string objects, to allow more efficient editing. Use with caution.
  • verbose – if True, print messages on progress to stderr.
  • debug – if True, print debugging messages to stderr.
  • sanger – if True, assume quality scores are in sanger format. Otherwise, assume they’re in Illumina format.
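Multi-line fastq is harder to parse than it looks, because "@" is also a valid quality character, so a line starting with "@" is not necessarily a new record. One common way around this, sketched below, is to read sequence lines up to the "+" separator and then read quality lines until their total length matches the sequence length. This is an illustration of that approach under those assumptions, not Pyokit's implementation; fastq_records_multiline is a hypothetical name.

```python
def fastq_records_multiline(lines):
    """Yield (name, seq, qual) from fastq lines with wrapped seq/qual data.

    Hypothetical helper for illustration; not part of Pyokit.
    """
    it = iter(lines)
    header = next(it, None)
    while header is not None:
        header = header.strip()
        if not header:  # skip blank lines between records
            header = next(it, None)
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header)
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = next(it, None)
        while line is not None and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = next(it, None)
        if line is None:
            raise ValueError("missing '+' separator in record: " + header)
        seq = "".join(seq_parts)
        # Quality lines run until we have as many characters as bases.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality in record: " + header)
            part = line.strip()
            qual_parts.append(part)
            qual_len += len(part)
        if qual_len != len(seq):
            raise ValueError("quality longer than sequence: " + header)
        yield header[1:], seq, "".join(qual_parts)
        header = next(it, None)


wrapped = ["@r1", "ACGT", "ACGT", "+", "@III", "IIII", "@r2", "GG", "+r2", "!!"]
records = list(fastq_records_multiline(wrapped))
# records is [("r1", "ACGTACGT", "@IIIIIII"), ("r2", "GG", "!!")]
```

The extra bookkeeping per record gives a feel for why a parser of this kind is slower than one that can assume four lines per record.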

Manipulating sequences

Pretty much everything you need should be covered in the class descriptions for the sequence classes: see The Sequence class, The FastaSequence class, and The FastqSequence class.