Monday, December 29, 2014

FASTQ and Sequencing Quality



The output from an Illumina platform is the FASTQ file. To understand this file, we’ll jump right into a real example from our data:

QD_0_ACTTGA_L005_R1_001.fastq.gz

First thing to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by underscores:
          - QD_0 = sample name (defined by our lab in this case)
          - ACTTGA = barcode sequence ligated to each DNA fragment
          - L005 = sequencing lane on the flow cell
          - 001 = set number

Each file is can only hold a certain amount of sequencing data and anything exceeding that size is split into multiple files denoted by the set number. This means that files that differ only by set number are part of the same sample and must be combined before analysis.

FASTQ files have an extension of either .fastq or .fq but as you may have noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file is compressed by the Gzip algorithm to minimize the space each file takes up.

To unzip the file, simply navigate to the file in a terminal window and use the following command:

gunzip <filename>.gz

To gzip a file, use this command:

gzip <filename>.gz

It is not recommended to unzip the FASTQ file unless necessary to keep file sizes as small as possible. However, certain actions such as viewing the file will require you to first unzip it. To view the first 12 lines of the FASTQ file without unzipping the whole file, do this:

gunzip -cd <filename>.gz | head -12

For the file I mentioned above, the output looks like this:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094 1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137 1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9

This is what the guts of a FASTQ file looks like. Each sequenced fragment – called a read – is contained in 4 lines of the file. Since this file is 12 lines, it describes 3 reads. The first line in the file:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA

The first line always begins with @ and must be a unique identifier of that read. The information contained here is given by the Illumina platform and tells the exact X and Y coordinate on the flow cell where this read was sequenced. For more detailed information, go here.

GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA

The second line contains the raw sequence for the read.

+

The third line is a “+” followed by an optional description (nothing, in this case).

CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>

The final line is the quality score for each base. Each score is ASCII encoded, where the higher the score, the higher the quality. The following table can be used to calculate the score. Take the value of the character and subtract 33 to find the quality score.


Ex. for the character C, 34 is the quality score (67 – 33 = 34).

But what does this score mean? The score is actually a Phred quality score where:
Q = –10 log P
Where Q = quality score and P = probability base was incorrectly called

A Phred score of 34, therefore, means that a base has a 99.96% chance of being correct whereas a score of 10 has a 90% probability of being correct. We can use this scoring system to evaluate the quality of our sequencing results. An efficient way of doing this is using the tool FastQC.  To run this tool, simply use the following command: 

fastqc <filename>

The program computes and graphs a bunch of statistics on the quality of the data.


This image shows quality distribution at each base position in a read. As typical of Illumina sequencing data, the quality drops towards the end of the read due to phasing and pre-phasing. Recall that a single read is sequenced from a cluster of DNA strand. Phasing and pre-phasing occurs when strands in a sequencing cluster fail to insert a nucleotide or add multiple nucleotides during a sequencing cycle. These strands fall out of sync with the rest of the cluster and reduce the purity of the wavelength emitted by the cluster as a whole, thus resulting in a greater probability of the wrong base being called. Because this problem increases with every cycle, end of the read tends to have lower quality.


Overall, FastQC shows that most bases have a quality score greater than 30 (0.1% chance of error for each base). This suggests that the sequencing data has sufficiently high quality. For more information on the output from FastQC, please see this PDF.

The next post will attempt to detail how to map these reads to a reference genome. 

No comments:

Post a Comment