Monday, December 29, 2014

FASTQ and Sequencing Quality

The output from an Illumina platform is the FASTQ file. To understand this file, we’ll jump right into a real example from our data:


First thing to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by underscores:
          - QD_0 = sample name (defined by our lab in this case)
          - ACTTGA = barcode sequence ligated to each DNA fragment
          - L005 = sequencing lane on the flow cell
          - 001 = set number

Each file is can only hold a certain amount of sequencing data and anything exceeding that size is split into multiple files denoted by the set number. This means that files that differ only by set number are part of the same sample and must be combined before analysis.

FASTQ files have an extension of either .fastq or .fq but as you may have noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file is compressed by the Gzip algorithm to minimize the space each file takes up.

To unzip the file, simply navigate to the file in a terminal window and use the following command:

gunzip <filename>.gz

To gzip a file, use this command:

gzip <filename>.gz

It is not recommended to unzip the FASTQ file unless necessary to keep file sizes as small as possible. However, certain actions such as viewing the file will require you to first unzip it. To view the first 12 lines of the FASTQ file without unzipping the whole file, do this:

gunzip -cd <filename>.gz | head -12

For the file I mentioned above, the output looks like this:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094 1:N:0:ACTTGA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137 1:N:0:ACTTGA

This is what the guts of a FASTQ file looks like. Each sequenced fragment – called a read – is contained in 4 lines of the file. Since this file is 12 lines, it describes 3 reads. The first line in the file:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA

The first line always begins with @ and must be a unique identifier of that read. The information contained here is given by the Illumina platform and tells the exact X and Y coordinate on the flow cell where this read was sequenced. For more detailed information, go here.


The second line contains the raw sequence for the read.


The third line is a “+” followed by an optional description (nothing, in this case).


The final line is the quality score for each base. Each score is ASCII encoded, where the higher the score, the higher the quality. The following table can be used to calculate the score. Take the value of the character and subtract 33 to find the quality score.

Ex. for the character C, 34 is the quality score (67 – 33 = 34).

But what does this score mean? The score is actually a Phred quality score where:
Q = –10 log P
Where Q = quality score and P = probability base was incorrectly called

A Phred score of 34, therefore, means that a base has a 99.96% chance of being correct whereas a score of 10 has a 90% probability of being correct. We can use this scoring system to evaluate the quality of our sequencing results. An efficient way of doing this is using the tool FastQC.  To run this tool, simply use the following command: 

fastqc <filename>

The program computes and graphs a bunch of statistics on the quality of the data.

This image shows quality distribution at each base position in a read. As typical of Illumina sequencing data, the quality drops towards the end of the read due to phasing and pre-phasing. Recall that a single read is sequenced from a cluster of DNA strand. Phasing and pre-phasing occurs when strands in a sequencing cluster fail to insert a nucleotide or add multiple nucleotides during a sequencing cycle. These strands fall out of sync with the rest of the cluster and reduce the purity of the wavelength emitted by the cluster as a whole, thus resulting in a greater probability of the wrong base being called. Because this problem increases with every cycle, end of the read tends to have lower quality.

Overall, FastQC shows that most bases have a quality score greater than 30 (0.1% chance of error for each base). This suggests that the sequencing data has sufficiently high quality. For more information on the output from FastQC, please see this PDF.

The next post will attempt to detail how to map these reads to a reference genome. 

Sunday, December 28, 2014

Library construction and Illumina sequencing

This will be the first of a series of posts that will detail the process I have used to analyze a set of RNA seq data for our lab. 

Library Preparation

We’ll start by assuming that RNA has been successfully extracted from some biological source and ribosomal RNA has been removed. 

Step 1A: RNA -> cDNA 
To create a DNA library that can be sequenced, it is first necessary to convert our pool of RNA into complementary DNA (cDNA). One technique is to use a small DNA primer to target the poly-A tail on strands of mRNA and extend the primer to create cDNA.

For species that don’t attach a poly-A tail to mRNA, a strategy called random priming is used. The RNA is combined with synthetic DNA hexamer primers (444444 = 4096 different sequences) and the cDNA is constructed by extending off wherever a primer anneals to the RNA molecule.

Step 1B: Strand specificity (optional)
During the creation of the cDNA library, the initial strand cDNA is synthesized from a primer annealed to an RNA strand. During the synthesis of the complementary cDNA strand, it is possible to use dUTP instead of dTTP to mark the second strand with “U”s. This method results in double stranded cDNA where each strand is distinguishable from each other (where one contains T and the other contains U). The second strand can be targeted and degraded leaving a single strand that can be sequenced. The benefit of this method is that it allows each sequenced fragment to be mapped uniquely to one specific strand of the reference genome. For instance, if two genes overlap on opposite strands of a genome then this strand-specific sequencing strategy will be able to determine which strand (and therefore gene) each RNA fragment was derived from.

Step 2: Size selection

At this point, some protocol is used to break down all the cDNA fragments to a common size. Typically, this is some physical process such as sonication. The ends of these fragments are repaired and additional segments of synthetic DNA are ligated on the ends of the fragments. These include primers required to initiate sequencing and barcode sequences to identify which fragments belong to which sample (crucial for multiplexed experiments where multiple samples are sequenced at the same time). 

Step 3: Amplification

Finally, the library is PCR amplified and is ready to be sequenced.

Illumina Sequencing

For those out there (including myself) that are visually-orientated, this video will demonstrate the basic steps:

After the library is prepared, the first step is to wash and mount the fragments on a flow cell. Fragments will be randomly distributed over this surface. Next, a technique called bridge amplification duplicates the annealed fragments to create a monoclonal cluster of DNA. 

This is where the sequencing begins. First a primer is annealed to every DNA segment attached to the flow cell. A series of cycles is preformed where a single (reversible terminators) fluorescent nucleotide is added to each growing DNA fragment. The each cluster of fragments emits a specific wavelength corresponding to one of four nucleotides. By measuring the wavelength of light emitted from each cluster after each cycle, the sequence of each fragment can be determined.

Finally, a technique known as paired-end sequencing can be used to double the sequencing information derived from each sample. The general idea is that each strand of DNA is sequenced from 5’ to 3’.  This results in two sets of sequences, one corresponding to each DNA strand. Each pair may or may not overlap depending on the size of the fragment and the length of each read.

(For further details and figures, please check out this page:

Once all this is done, the Illumina platform analyzes a huge set of images to derive the sequence of all the fragments. Our RNA-seq experiment uses random priming, pair-end sequencing, and is not strand specific. The blog posts that follow this will be written for this type of analysis.

In the next post, I will explain the output of this sequencing and how to analyze its quality.

Sunday, November 2, 2014

Reviewing Data and Revising Hypotheses

Hi blog, it’s been a while. I recently gathered evidence (via restriction digest) that a pair of double mutants I’ve been using (sxy-1 Δhfq and murE749 Δhfq) were swapped at some point. Since then, I’ve been carefully evaluating all the data I’ve collected and diving into the literature to try to make sense of everything. My goal was to come up with a concrete hypothesis that explains what I have seen and I think I have one. But first, we will look at the data:

If you get lost at any point, I have provided a little bit of background at the bottom of this post.
(order of the bars in the graphs is KW20, sxy-1, murE749 for any colour-blind readers out there)

First, we’ll look at murE749. We really don’t know much about why this mutant is so competent, but previous work in our lab has shown that the murE mutant expresses the same amount of Sxy protein as KW20, but has the Sxy-dependent competence genes turned on when they are typically off. ( The murE mutant also shows no competence when sxy is removed. This suggests that the effect of the murE mutation takes place somewhere downstream of Sxy in the competence-inducing pathway and its effect is dependent on the presence of Sxy. Looking at my data, it seems that removing hfq makes no significant difference to how competence murE749 is (except for, perhaps, the late-log + cAMP condition). I understand this to mean that whatever the murE mutation does, its effect supersedes the negative effect of removing hfq.

Turning to sxy-1, we see something similar. There may be a tiny reduction in competence under most conditions, but for the most part it does not seem very strong or even statistically significant. Competence is strongly induced when it is not normally in KW20 (see log phase) presumably because of higher translational efficiency of basal levels of sxy mRNA.

Finally, the most interesting results are with the single Δhfq mutant. We see a 10 fold reduction in cells lacking hfq when grown in MIV. At low density, cells with and without hfq show similar levels of competence. However, as the culture gets dense we fail to see any induction of competence in Δhfq while KW0 becomes about 50 fold more competent. The story is different when exogenous cyclic AMP is added though. The data suggests that exogenous cAMP temporarily rescues competence (log phase + cAMP).

To illustrate this further, here’s time course data of KW20 and Δhfq with and without cAMP:

This data illustrates a few things:
  • Without cAMP, KW20 can become 100 fold more competent than Δhfq
  • KW20 becomes more competent as cells get dense, see little to no increase for Δhfq
  • Competence in both strains drops as time continues
  • With exogenous cAMP, both strains become almost equally competent, perhaps KW20 slightly more than Δhfq
  • Competence drops off back to no cAMP levels at about OD = 0.1 (also seen in previous set of graphs)
  • Knowing all this, what conclusions can be drawn? I think it’s very likely that Hfq has a role in cAMP regulation.  
If we are to assume that Hfq in necessary to fully induce the production of intracellular cAMP, the data begins to make sense. As cells become dense and the nutrients in the medium in which they are growing become depleted, competence increases presumably due to activation of the pathway that induces cAMP production. Perhaps by removing hfq, cAMP levels fail to increase to such an environmental influence. Giving cells exogenous cAMP removes the influence of intracellular regulation, and under this condition, we see competence rescued (at least for a period of time) in the hfq knockout.

This makes me think that the influence Hfq has on competence happens before the transcription of sxy. Hfq’s function could be to:
  • increase the sensitivity of the phosphotransferase system
  • increase translation of adenylate cyclase (increase cAMP levels)
  • reduce translation of cAMP phosphodiesterase (prevent depletion of cAMP)
  • increase translation of crp (increase response to cAMP)
I don’t think that CRP is regulated by Hfq as it is in Y.pestis. If Hfq regulated CRP at the transcriptional level, it would be odd that adding exogenous cAMP would somehow remove that layer of regulation.

I would be surprised if translation of protein in the phosphotransferase system was regulated by Hfq merely because it would odd to affect sugar uptake to indirectly change cAMP levels in the cell. There may be something here I’m missing but my initial thoughts are that this should not be my focus.

So in terms of a hypothesis, I’m favouring adenylate cyclase. cAMP phosphodiesterase is another option, but I think it makes less sense. To use an analogy: if you want a room dark, it makes more sense to prevent a light switch from being turned on rather than standing there waiting to turn it off every time it comes on.

So here’s my current hypothesis:
Removing hfq causes a competence defect by reducing adenylate cyclase translation and therefore reducing intracellular cAMP levels, preventing the induction of competence.

I think this hypothesis aligns with what is seen between Δhfq/KW20. The murE mutant appears to be consistently hypercompetent whether or not cAMP is present, suggesting that it is independent of adenylate cyclase activity and it makes sense that little to no difference is seen when hfq is removed. As for sxy-1, even basal levels of sxy expression are enough to induce competence, downplaying the role of cAMP in competence regulation. Perhaps that’s why the difference between the mutant and double mutant is almost trivial.

I did want to talk about future experiments, but this post is already extremely long, so I’ll do that in a separate one.

Some Background:

About the strains
KW20 is the Rd Haemophilus influenzae strain. Δhfq is KW20 with the hfq gene removed with a spc resistance gene added. sxy-1 is the Rd strain with a point mutation that increases translation of a gene necessary for transcribing the competence regulon. This mutation increases competence by removing an inhibitory structure in the sxy mRNA and inducing higher rates of translation of the Sxy protein. murE749 has a point mutation in an enzyme required for peptidoglycan biosynthesis that increases competence by some unknown mechanism. Double mutants carry both mutations. Error bars indicate that replications have been done, typically 2 or 3 times.

About competence and cAMP
When H. influenzae is starved, the phosphotransferase system detects a lack of preferred sugar and activated adenylate cyclase (cyaA). Adenylate cyclase then catalyzes the production of intracellular cAMP. This cAMP binds to C-reactive protein (CRP), this complex then induced the transcription of CRP-dependent genes.  This set of genes involves sxy, which when induced activates the competence genes.