The output
from an Illumina platform is the FASTQ file. To understand this file, we’ll
jump right into a real example from our data:
QD_0_ACTTGA_L005_R1_001.fastq.gz
First thing
to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by
underscores:
- QD_0
= sample name (defined by our lab in this case)
- ACTTGA
= barcode sequence ligated to each DNA fragment
- L005 =
sequencing lane on the flow cell
- 001
= set number
Each file is
can only hold a certain amount of sequencing data and anything exceeding that
size is split into multiple files denoted by the set number. This means that
files that differ only by set number are part of the same sample and must be
combined before analysis.
FASTQ files
have an extension of either .fastq
or .fq but as you may have
noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file
is compressed by the Gzip algorithm to minimize the space each file takes up.
To unzip the
file, simply navigate to the file in a terminal window and use the following
command:
gunzip <filename>.gz
To gzip a file, use this command:
gzip <filename>.gz
It is not recommended to unzip the FASTQ file unless necessary to
keep file sizes as small as possible. However, certain actions such as viewing
the file will require you to first unzip it. To view the first 12 lines of the
FASTQ file without unzipping the whole file, do this:
gunzip -cd <filename>.gz
| head -12
For the file I mentioned above, the output looks like this:
@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093
1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094
1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137
1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9
This is what the guts of a FASTQ file looks like. Each sequenced
fragment – called a read – is contained in 4 lines of the file. Since this file
is 12 lines, it describes 3 reads. The first line in the file:
@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093
1:N:0:ACTTGA
The first line always begins with @ and must be a unique identifier
of that read. The information contained here is given by the Illumina platform and tells the exact X and Y coordinate
on the flow cell where this read was sequenced. For more detailed information,
go here.
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
The second line contains the raw sequence for the read.
+
The third line is a “+” followed by an optional description (nothing,
in this case).
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
The final line is the quality score for each base. Each score is
ASCII encoded, where the higher the score, the higher the quality. The following
table can be used to calculate the score. Take the value of the character and
subtract 33 to find the quality score.
Ex. for the character C, 34 is the quality score (67 – 33 =
34).
But what does this
score mean? The score is actually a Phred quality score where:
Q = –10 log P
Where Q = quality
score and P = probability base was incorrectly called
A Phred score of 34,
therefore, means that a base has a 99.96%
chance of being correct whereas a score of 10 has a 90% probability of being
correct. We can use this scoring system to evaluate the quality of our
sequencing results. An efficient way of doing this is using the tool FastQC. To run this tool, simply use the following
command:
fastqc <filename>
The program computes and graphs a bunch of
statistics on the quality of the data.
This image shows
quality distribution at each base position in a read. As typical of Illumina
sequencing data, the quality drops towards the end of the read due to phasing
and pre-phasing. Recall that a single read is sequenced from a cluster of DNA
strand. Phasing and pre-phasing occurs when strands in a sequencing cluster fail
to insert a nucleotide or add multiple nucleotides during a sequencing cycle.
These strands fall out of sync with the rest of the cluster and reduce the
purity of the wavelength emitted by the cluster as a whole, thus resulting in a
greater probability of the wrong base being called. Because this problem
increases with every cycle, end of the read tends to have lower quality.
Overall, FastQC
shows that most bases have a quality score greater than 30 (0.1% chance of
error for each base). This suggests that the sequencing data has sufficiently
high quality. For more information on the output from FastQC, please see this
PDF.
The next post will attempt to detail how to map these reads to a reference genome.