In this brief post we will examine the concepts of read accuracy and read quality scores, and how they are calculated by Oxford Nanopore Technologies.

Phred Scores

Quality scores may be given for a single base, a read, or a consensus sequence. In each case quality scores simply reflect the probability of error in the associated sequence. A higher score indicates a higher probability that a sequence decision is correct whilst a lower score indicates a higher probability that the decision is incorrect.

You are likely familiar with Phred scores which originated from a base calling program written to identify fluorescently labelled DNA bases, separated via electrophoresis. The quality score for each base pair was calculated by comparing observed and expected chromatogram peak shapes and resolution.

The Phred quality score’s relationship with error probability is:

$$ Q = -10 \log_{10}P $$

where P is the error probability. The error probability can also be calculated from a given quality score from the inverse transform:

$$ p = 10^\frac{-q}{10}$$

Using Log10 means that a quality score of 10 represents a 1 in 10 chance of an incorrect base call (a base call accuracy of 90%), where as quality score of 20 represents a 1 in 100 chance of incorrect base call (or 99% accuracy).

Phred quality scores are usually recorded in fastq files using ASCII characters, which you can learn more about by looking at our Introduction to FastQ tutorial.

Producing quality scores

The way per-nucleotide quality scores are calculated depends on the base caller. The quality scores output by ONT base callers are based on the outputs of the neural networks used to produce the base call. The direct per-base score output from the neural networks do not reflect-well empirically observed errors: the scores must be calibrated globally across all sequence positions in the training data. After such calibration the probability implied by the score for each group of bases (bases with that score), is expected to correlate with the empirical probability that the bases in that group are correct, as judged by alignment to reference sequences.

When thinking about quality scores, it is important to consider how discriminative they are. Take an example of 100 bases at a locus, 50 of which are incorrect. Now we could label all bases with ~Q3 indicating 50% error. However, that’s rather unhelpful in determining the true identity of the base at the locus. We’d prefer a score which discriminates the correct bases from the incorrect bases: so, a score that indicates 100% confidence in the correct bases and 100% dismissal of the incorrect bases (though naturally if you knew a base was definitely incorrect you might be better off randomly selecting another base!). Algorithms which produce quality scores attempt to do the latter, with varying degree of success. The scores which they output need to be interpreted based on studies of populations.

Read accuracy

Much has been made in the past of how exactly Oxford Nanopore Technologies calculates single-read accuracy. The definition that we choose is the proportion of base identical match columns within an alignment of a read to a reference sequence:

$$ \frac{N_{matches}}{N_{matches} + N_{mismatches} + N_{deletions} + N_{insertions}} $$

This is the same definition as describe in Heng Li’s blog post, referred to there as BLAST identity. This definition was in use at ONT and used in all our publically available materials long before it became fashionable to cite Heng’s post! The definition is used by the Guppy basecaller (when asked to perform alignments on the fly), and several research tools including the stats_from_bam program from pomoxis.

Guppy read Q-scores and read accuracies

Let’s have a closer inspection of the read quality scores from the base caller Guppy. First let us look at the Guppy’s predicted read Q-score (the “mean_qscore_template” column of its summary file) versus the read accuracy as measured by alignment to a reference sequence. This per-read quality score is calculated from the per-base quality scores as:

$$ \text{read Q} = -10\log_{10}\big[\tfrac{1}{N}\sum 10^{-q_i/10}\big] $$

That is to say it is the average per-base error probability, expressed on the log (Phred) scale.

The data below are taken from around twenty thousand reads from chromosome 1 of our NA24385 open dataset.

Empirical read error as calculated from read-to-reference alignment vs. Guppy 'mean Q-score' read metric, for ~20k reads from chromosome 1 of NA24385. The Guppy per-base Q-scores are calibrated such that the mean Q-score predicts the empirical read error; this plot is therefore approximately linear.

The black dashed line shown in the plot indicates what would be a perfect correlation between the two quantities. We see that there is a strong correlation between the quantities but some samples are overconfident and error for those reads was higher than predicted.

It is in fact these quantities which are used to calibrate the per-base quality scores. The calibration procedure proceeds as follows. A fit between read accuracy and mean Q-score is performed using iteratively reweighted least squares regression with a Huber downweighting of outliers. The derived regression coefficients are then used to rescale the per-base quality scores.

Guppy per-base quality scores are calibrated using the predicted mean per-base error for a read and the read accuracy measured by alignment.

Let us now focus on the per-base quality score, as stored within fastq and bam files. For each read base within an alignment we can determine whether it is a match, mismatch, insertion, or deletion. Using the base’s assigned Q-score we can build a database of the proportion of bases which are errors for each Q-score value. We can then plot the empirical error over all Q-scores:

Empirical error rate for bases labelled with given Q-score. Substitution, insertion, and deletion errors are included (with deletion events assigned the Q-score of the preceding base). Guppy's Q-score calibration does not attempt to directly optimise the fit of these quantities.

Again the dashed line in the plot indicates what would be a perfect correlation between these quantities. We see that up to around Q10 the per-base quality scores do reflect the empirical error for that score. Quality scores of greater than 10 tend to underestimate the error probability. For example of all bases labeled as Q20 the proportional of them which are errors in the sequencing is equivalent to ≈Q12.5. This unfortunately compounds the use of the per-base qualities when attempts are made to use them in algorithms such as consensus or variant calling. Notwithstanding this observation, the calibration performed by Guppy, such as it is, is performed globally across all sequence positions: it does not attempt to model error contigent on the genomic context (although we may presume that the neural network outputs take-in such effects to some extent).

Highly accurate reads

This section was contributed by Tim Massingham, Research Fellow, Oxford Nanopore Technologies, and edited by the EPI2ME Labs Team.

As read accuracies increase, read accuracy distributions can have an artifactual mass on infinity due to shorter, perfect reads.

Extrapolating from these reads to declare that a proportion of reads from sequencing are >Q50 is unsafe and plotting raw read error-rate distributions gives a misleading view of the sequencing accuracy. Can we derive alternative ways of estimating per-read Q-scores.

Let’s first define $e_i$ the number of errors in a read $i$, and $N_i$ the alignment length of read $i$. The following are then all ways to produce improved predicted read error rate distributions:

Frequentist “plus-4” rule

The Wilson Score Interval is an improvement to the standard Normal confidence interval with better behaviour for small samples and extremely skewed distributions.

Intuitively, the confidence interval pulls the centre of the distribution, and so the point estimate of error rate, towards one-half in a manner dependent on the sequence length. This leads to the “plus 4 rule”:

$$\hat{p_i} = \frac{e_i + 2}{N_i + 4}$$

Bayesian: Jeffreys’ prior

For those of a Bayesian persuasion, we could instead use a “non-informative” Jeffreys’ prior over the error proportion, this is the Beta distribution with parameters (1/2 , 1/2).

The Beta distribution is self-conjugate with respect to the Binomial distribution; the posterior probability is also a Beta distribution and we can obtain a MAP estimate of the error rate:

$$ Post \sim \beta(\tfrac{1}{2} + e_i, \tfrac{1}{2} + N_i - e_i) $$ $$ \hat{p_i} = \frac{e_i + \tfrac{1}{2}}{N_i + 1} $$

In this case, a Jeffreys’ prior is equivalent to adding half an error and half a success to the calculation of a read’s error.

Empirical Bayes

We can go futher than a Jeffreys’ prior and estimate the error proportion from all of the reads and assume a Beta prior distribution with this error rate.

As with the Jeffreys’ prior approach, the posterior distribution for the error rate for each read is also a Beta distribution and we can obtain a MAP estimate of the error rate, leading to:

$$ \hat{p} = \sum_i \frac{e_i}{N_i} $$ $$ \hat{v} = \sum_i \frac{e_i^2}{N_i^2} - \hat{p}^2 $$ $$ w = \frac{\hat{p}(1-\hat{p})}{\hat{v}} - 1 $$ $$ \alpha = \hat{p}w $$ $$ \hat{p_i} = \frac{e_i + \alpha}{N_i + w} $$

Let’s apply the second method to some “duplex” nanopore sequencing data:

Comparison of read error histograms with and without inclusion of a prior. raw: error calculated via alignment. Jeffreys': as for raw with inclusion of uninformative prior. In the former case reads with error <1e-5 have been trucated to this value for display purposes. The count axis is truncated at 450.

The effect of the prior is to regress the large values back toward the the uniform prior, that is toward an error of one-half.

Summary

To summarise, quality scores are often taken for granted but there are many aspects to consider and keep in mind when using quality scores for evaluating samples or as part of further analysis. Knowing how they have been produced, and what they reflect is crucial to using them effectively.

The data presented here suggest that current Guppy read “mean Q-scores” are useful for filtering lower quality reads from a dataset, but that care should be taken in using the per-base quality scores to compare bases between reads.