.. _sequencesSection:

====================
Biological Sequences
====================

Pyokit represents sequences with the Sequence class, which wraps a name and
the sequence data and provides the basic functionality for manipulating
sequences. Further specializations add functionality for fasta-format and
fastq-format reads. For the class descriptions, see
:ref:`sequenceClassSection`, :ref:`fastaSequenceClassSection`, and
:ref:`fastqSequenceClassSection`.

----------------------
Processing Fasta files
----------------------

Pyokit contains an iterator for fasta files. Here's the usual idiom for
processing a fasta file when only one pass is required (in this case, reverse
complementing every sequence in the file):

>>> from pyokit.io.fastaIterators import fastaIterator
>>> for s in fastaIterator("sequences.fa"):
...   s.reverseComplement()
...   print(s)

Given the following contents for sequences.fa:

.. code-block:: text

   >one
   ACTGATGCGCTAGCGCGTA
   CTGACGCG
   >two
   CTAGCTAGCGCGCTAGTGCGGG
   >three
   TTTCGAGCCG
   GGGCAAAAA

The script produces this output:

.. code-block:: text

   >one
   CGCGTCAGTACGCGCTAGC
   GCATCAGT
   >two
   CCCGCACTAGCGCGCTAGCTAG
   >three
   TTTTTGCCCC
   GGCTCGAAA

Notice that the iterator handles sequence data split over multiple lines, and
that the output preserves the line width of the input file. The one minor
exception is that the lines of a single sequence cannot be ragged, so every
line of a sequence is written at the length of that sequence's first input line.
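
The line-width rule can be illustrated without Pyokit. The following is a
minimal sketch, not Pyokit's implementation; the helper names
``reverse_complement`` and ``wrap`` are hypothetical, and the lookup table
assumes upper-case DNA with no ambiguity codes other than N. It reverse
complements a sequence read from several lines and re-wraps the result at the
length of the first input line.

.. code-block:: python

   # Illustrative only: these helpers are hypothetical, not part of Pyokit.
   COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

   def reverse_complement(seq):
       """Return the reverse complement of an upper-case DNA string."""
       return "".join(COMPLEMENT[base] for base in reversed(seq))

   def wrap(seq, width):
       """Split seq into chunks of at most width characters."""
       return [seq[i:i + width] for i in range(0, len(seq), width)]

   # Record "one" from sequences.fa, as it appears on two input lines.
   lines = ["ACTGATGCGCTAGCGCGTA", "CTGACGCG"]
   width = len(lines[0])  # the first line sets the width for the whole record
   wrapped = wrap(reverse_complement("".join(lines)), width)
   print("\n".join(wrapped))

For record "one" this gives the same two lines as the output listing above.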

Here's the signature for the iterator:

.. autofunction:: pyokit.io.fastaIterators.fastaIterator

----------------------
Processing FastQ files
----------------------

Processing fastq files works much the same way as processing fasta files, but
there are a few extra wrinkles. Pyokit provides three iterators for fastq
files, which differ in how they handle sequence and/or quality data split over
multiple lines. The first, fastqIteratorSimple, ignores this possibility and
expects every fastq record to occupy exactly 4 lines. In general this will be
fine, as fastq data, in my experience, always follows this convention. The
second iterator, fastqIteratorComplex, can process files and streams that do
not follow the convention, but it is slower and should be avoided where
possible. Finally, there is a general function, fastqIterator, which at present
just wraps the simple iterator. In almost all cases it is sufficient to use
this one.
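
The four-line convention can be sketched in plain Python. This is a minimal
illustration of the idea, not Pyokit's implementation; the generator name
``four_line_records`` and its error messages are hypothetical. It reads lines
in groups of four and checks only that the header and separator lines start
with the expected characters and that sequence and quality have equal length.

.. code-block:: python

   # Illustrative only: four_line_records is hypothetical, not Pyokit's API.
   def four_line_records(lines):
       """Yield (name, sequence, quality) from strict four-line fastq input."""
       lines = iter(lines)
       for header in lines:
           # Pull the remaining three lines of the record in one step.
           rest = [next(lines, None) for _ in range(3)]
           if None in rest:
               raise ValueError("truncated record: " + header.strip())
           seq, plus, qual = [line.rstrip("\r\n") for line in rest]
           header = header.rstrip("\r\n")
           if not header.startswith("@") or not plus.startswith("+"):
               raise ValueError("malformed record: " + header)
           if len(seq) != len(qual):
               raise ValueError("sequence/quality length mismatch: " + header)
           yield header[1:], seq, qual

   example = ["@one\n", "ACTG\n", "+one\n", "!''*\n"]
   records = list(four_line_records(example))

A parser of this shape never has to inspect the quality string to find the end
of a record, which is one plausible reason a four-line reader can be faster
than one that accepts wrapped records.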

Here's the same example from above, but this time using a fastq iterator:

>>> from pyokit.io.fastqIterators import fastqIterator
>>> for s in fastqIterator("sequences.fq"):
...   s.reverseComplement()
...   print(s)

Given this input fastq file:

.. code-block:: text

   @one
   ACTGATGCGCTAGCGCGTACTGACGCG
   +one
   !''*((((***+))%%%++)(%%%%).
   @two
   CTAGCTAGCGCGCTAGTGCGGG
   +two
   1***-+*''))**55CCF>>!*
   @three
   TTTCGAGCCGGGGCAAAAA
   +three
   CCCCCCC65+))%%%+%%)

The output will be:

.. code-block:: text

   @one
   CGCGTCAGTACGCGCTAGCGCATCAGT
   +one
   .)%%%%()++%%%))+***((((*''!
   @two
   CCCGCACTAGCGCGCTAGCTAG
   +two
   *!>>FCC55**))''*+-***1
   @three
   TTTTTGCCCCGGCTCGAAA
   +three
   )%%+%%%))+56CCCCCCC

Notice that the quality scores are reversed along with the sequence, so each
score stays paired with its base.
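
Why the quality string is reversed but not otherwise changed can be shown with
a short sketch. This is not Pyokit's code; the function name
``reverse_complement_read`` is hypothetical. Bases are complemented and
reversed, while quality characters are only reversed, because a score describes
a position in the read rather than a particular nucleotide.

.. code-block:: python

   # Illustrative only: this function is hypothetical, not part of Pyokit.
   COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

   def reverse_complement_read(seq, qual):
       """Return (reverse-complemented sequence, reversed quality string)."""
       if len(seq) != len(qual):
           raise ValueError("sequence and quality must have equal length")
       rc_seq = "".join(COMPLEMENT[base] for base in reversed(seq))
       return rc_seq, qual[::-1]

   # Record "two" from the input fastq file above.
   rc_seq, rc_qual = reverse_complement_read(
       "CTAGCTAGCGCGCTAGTGCGGG", "1***-+*''))**55CCF>>!*")
   print(rc_seq)
   print(rc_qual)

The two printed lines match the sequence and quality lines of record "two" in
the output listing.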

Here's the signature for the general fastq iterator:

.. autofunction:: pyokit.io.fastqIterators.fastqIterator

Here's the signature for the simple iterator:

.. autofunction:: pyokit.io.fastqIterators.fastqIteratorSimple

And here's the signature for the iterator that handles multi-line quality and
sequence data:

.. autofunction:: pyokit.io.fastqIterators.fastqIteratorComplex
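
One common way to read records whose sequence and quality are wrapped is
sketched below. It is an assumption about the general technique, not a
description of how fastqIteratorComplex is implemented, and the name
``multi_line_records`` is hypothetical. Sequence lines are collected up to the
``+`` separator, then quality lines are collected until the quality string is
as long as the sequence. Counting length is needed because ``@`` is a valid
quality character, so a line starting with ``@`` is not necessarily a header.

.. code-block:: python

   # Illustrative only: multi_line_records is hypothetical, not Pyokit's API.
   def multi_line_records(lines):
       """Yield (name, sequence, quality); both fields may span many lines."""
       lines = iter(lines)
       header = next(lines, None)
       while header is not None:
           header = header.rstrip("\r\n")
           if not header.startswith("@"):
               raise ValueError("expected '@' header, got: " + header)
           # Sequence lines run until the '+' separator line.
           seq_parts = []
           for line in lines:
               line = line.rstrip("\r\n")
               if line.startswith("+"):
                   break
               seq_parts.append(line)
           else:
               raise ValueError("no '+' separator for record " + header)
           seq = "".join(seq_parts)
           # Quality lines run until they cover the whole sequence.
           qual = ""
           while len(qual) < len(seq):
               line = next(lines, None)
               if line is None:
                   raise ValueError("truncated quality for record " + header)
               qual += line.rstrip("\r\n")
           if len(qual) != len(seq):
               raise ValueError("quality longer than sequence in " + header)
           yield header[1:], seq, qual
           header = next(lines, None)

   text = "@one\nACTG\nATGC\n+one\n!''*\n@(((\n@two\nCTAG\n+\n1***\n"
   wrapped_records = list(multi_line_records(text.splitlines(True)))

In the example input, the second quality line of record "one" begins with
``@``, which a parser that looked only for header characters would misread as
the start of a new record.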

----------------------
Manipulating sequences
----------------------

Most of what you need is covered in the class descriptions for the sequence
classes: see :ref:`sequenceClassSection`,
:ref:`fastaSequenceClassSection`, and :ref:`fastqSequenceClassSection`.