In this blog post we will explore small variant calling using the previously released HG002 (GM24385 Ashkenazi Son) data release.
The GM24385 dataset comprises whole genome sequencing of a well-characterised human cell line. It therefore provides a useful benchmark sample; the cell line was also used as a “seen” sample in the recent [https://precision.fda.gov/challenges/10/view](PrecisionFDA Truth Challenge V2) competition.
As an easily reproducible example we will focus on chromosome 20 of the genome rather than performing computation on the whole genome.
This walkthrough assumes some familiarity with standard bioinformatic tools for handling genomics data. A working installation of samtools, bedtools, medaka, docker, and the AWS command-line tools are required to follow the process below.
To start let us download the pre-aligned reads corresponding to chromosome 20 from the dataset resource (see our tutorial FAQs) for more information on downloading data):
for ext in .bam .bam.bai; doaws s3 --no-sign-request cp s3://ont-open-data/gm24385_2020.09/analysis/r9.4.1/20200914_1354_6B_PAF27096_e7c9eae6/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/chr20/calls2ref${ext} PAF27096.chr20${ext}aws s3 --no-sign-request cp s3://ont-open-data/gm24385_2020.09/analysis/r9.4.1/20200914_1357_1-E11-H11_PAF27462_d3c9678e/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/chr20/calls2ref${ext} PAF27462.chr20${ext}done
and merge these into a single .bam
file:
samtools merge chr20.bam PAF27096.chr20.bam PAF27462.chr20.bamsamtools index chr20.bam
As an additional step we will filter the chr20.bam
file to leave only primary
alignments, removing secondary and supplementary alignments. This is necessary
as DeepVariant, which we will use later, will otherwise process these
additional alignments.
samtools view chr20.bam -F 2308 -@ 64 -b > chr20.primary.bamsamtools index chr20.primary.bam
To perform variant calling we also require the human reference sequence, the same sequence as used to create the alignment files prepared above. An indexed copy of this is available from the dataset resource:
for ext in .fasta .fasta.fai; doaws s3 --no-sign-request cp s3://ont-open-data/gm24385_2020.09/config/ref/GCA_000001405.15_GRCh38_no_alt_analysis_set${ext} .done
We have now all the inputs required to perform variant calling with medaka and DeepVariant.
Medaka is Oxford Nanopore Technologies’ software for performing consensus and small variant calling from nanopore long-read data. It can perform diploid variant calling either in isolation or in tandem with DeepVariant; to enable this second use-case small modifications have been made to how medaka represents variants to allow it to function as a variant-candidate generator for DeepVariant.
To perform candidate generation with medaka we run:
medaka_variant -i chr20.primary.bam -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \-r chr20 -t 8 -P 0 -l
The final two options here (-P 0 -l
) instruct medaka to leave its
substitution calls unfiltered and to decompose multi-nucleotide substitutions
into independent single nucleotide substitutions. Details of workflow employed
by the medaka_variant
program can be found in the medaka
documentation.
The result of running the above command will be a directory medaka_variant
containing (amongst other files):
round_0_hap_mixed_phased.bam
: alignments (as in chr20.primary.bam
), tagged with a calculated haplotype,round_1.vcf
: the final output variant candidates for DeepVariant.With variant-candidate generation performed by medaka we can now use DeepVariant to calculate our final variant calls. In order to simplify this we will download a program that automates the execution of the docker container used to run DeepVariant:
wget https://gist.githubusercontent.com/cjw85/23d2b0675ec5a5c7fd4074456524c971/raw/c716c85639f047b9a9cff2079be2868bccb61659/run_deepvariant.shchmod +x run_deepvariant.sh
This script simply gathers together the required inputs before executing DeepVariant within the docker container, it can be run simply with:
./run_deepvariant.sh \-b medaka_variant/round_0_hap_mixed_phased.bam \-v medaka_variant/round_1.vcf \-r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \-o deep_variant -t 64 -q
The final variant calls will be present at deepvariant/deepvariant.vcf.gz
.
The -q
option here runs a “high quality” version of the DeepVariant
calculation; this version incurs a runtime cost of aaround a factor of
four and gives marginable improvement in results.
The veracity of the variant calling performed above can be obtained by comparing the results to the Genome In A Bottle truth sets for the GM24385 sample. The truth sets can be downloaded from the NCBI repository:
for ext in .bed .bed.gz .bed.gz.tbi .vcf.gz .vcf.gz.tbi; dowget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.1/GRCh38/HG002_GRCh38_1_22_v4.1_draft_benchmark${ext}done
With these reference data we will use hap.py to assess the recall and precision of the variant calls made by DeepVariant. Hap.py is most easily run using the docker container provided by it’s authors:
docker run -it -v ${PWD}:${PWD} pkrusche/hap.py /opt/hap.py/bin/hap.py \${PWD}/HG002_GRCh38_1_22_v4.1_draft_benchmark.vcf.gz ${PWD}/deepvariant/deepvariant.vcf.gz \-f ${PWD}/HG002_GRCh38_1_22_v4.1_draft_benchmark.bed \-r ${PWD}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \-o happy_out --pass-only -l chr20 --engine=vcfeval --threads=20
The output of the above will be a table summarising the results of the comparison. An abbreviated form is given below:
Type | INDEL | SNP |
---|---|---|
TRUTH.TOTAL | 11271 | 71334 |
METRIC.Recall | 0.5927 | 0.9972 |
METRIC.Precision | 0.8384 | 0.9973 |
METRIC.F1_Score | 0.6944 | 0.9973 |
An active area of research is the calling of variants in low complexity regions. The NCBI reference data includes an index of such regions, we can mask these regions from the comparison:
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/LowComplexity/GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed.gzbedtools intersect \-a HG002_GRCh38_1_22_v4.1_draft_benchmark.bed \-b GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz> HG002_GRCh38_1_22_v4.1_draft_benchmark.norepeat.beddocker run -it -v ${PWD}:${PWD} pkrusche/hap.py /opt/hap.py/bin/hap.py \${PWD}/HG002_GRCh38_GIABv4.1.vcf.gz ${PWD}/deepvariant/deepvariant.vcf.gz \-f ${PWD}/ HG002_GRCh38_1_22_v4.1_draft_benchmark.norepeat.bed \-r ${PWD}/truths/GRCh38_no_alt_chr20.fa \-o happy_out --pass-only -l chr20 --engine=vcfeval --threads=20
With these regions masked we now obtain:
Type | INDEL | SNP |
---|---|---|
METRIC.Recall | 0.9458 | 0.9992 |
METRIC.Precision | 0.9827 | 0.9992 |
METRIC.F1_Score | 0.9639 | 0.9992 |
The latest research basecallers are better able to provide accurate calls through low complexity regions. The example below compares basecalls produced with Guppy 4.0.11 and the soon to be release Bonito version 0.3.0 basecaller.
The IGV screenshot shows how the Bonito basecaller is less prone to long deletion tracks in the low complexity region. With improvements to basecalling we therefore anticipate being able to produce highly accurate INDEL calls in these regions in the near future.
We kindly acknowledge Andrew Carroll of Google Health and Benedict Paten’s group at the Computational Genomics Lab, UC Santa Cruz Genomics Institute for releasing the DeepVariant inference models for nanopore data and discussions concerning how to present variant candidates to DeepVariant.