my-server
← Wiki

FASTQ format

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA formatted sequence and its quality data, but has become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.

Format

A FASTQ file has four line-separated fields per sequence:

  • Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Field 2 is the raw sequence letters.
  • Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

<pre> @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !*((((***+))%%%++)(%%%%).1***-+*))**55CCF>>>>>>CCCCCCC65 </pre>

The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII). Here are the quality value characters in left-to-right increasing order of quality (ASCII): <pre> !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz