This repository contains a nextflow workflow for analysing variation in human genomic data. Specifically this workflow can perform the following:
The wf-human-variation workflow consolidates the small variant calling from the previous wf-human-snp, structural variant calling from wf-human-sv, CNV calling from wf-cnv (all of which are now deprecated), as well as performing STR expansion genotyping. This pipeline performs the steps of the four pipelines simultaneously and the results are generated and output in the same way as they would have been had the pipelines been run separately.
This workflow uses Clair3 for calling small variants from long reads. Clair3 makes the best of two methods: pileup (for fast calling of variant candidates in high confidence regions), and full-alignment (to improve precision of calls of more complex candidates).
This workflow uses sniffles2 for calling structural variants.
This workflow uses modbam2bed to aggregate modified base counts into a bedMethyl file.
This workflow uses Dorado
for basecalling pod5
or fast5
signal data.
This workflow uses QDNAseq for calling copy number variants.
This workflow uses a fork of Straglr for genotyping short tandem repeat expansions.
The workflow uses Nextflow to manage compute and software resources, as such Nextflow will need to be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed.
It is not required to clone or download the git repository in order to run the workflow. For more information on running EPI2ME Labs workflows visit our website.
Workflow options
To obtain the workflow, having installed nextflow
, users can run:
nextflow run epi2me-labs/wf-human-variation --help
to see the options for the workflow.
Download demonstration data
A small test dataset is provided for the purposes of testing the workflow software, it can be downloaded using:
wget -O demo_data.tar.gz \https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-human-variation/demo_data.tar.gztar -xzvf demo_data.tar.gz
The basecalling, SNP, SV, 5mC aggregation, and CNV workflows are all independent and can be run in isolation or together using options to activate them. The STR workflow can also be run independently but will trigger the SNP workflow to run first, as a phased VCF is required to haplotag the input BAM file in order to successfully perform STR genotyping.
The SNP and SV workflows can be run with the demonstration data using:
OUTPUT=outputnextflow run epi2me-labs/wf-human-variation \-w ${OUTPUT}/workspace \-profile standard \--snp --sv \--bam demo_data/demo.bam \--bed demo_data/demo.bed \--ref demo_data/demo.fasta \--basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2' \--sample_name MY_SAMPLE \--out_dir ${OUTPUT}
Each subworkflow is enabled with a command line option:
--fast5_dir <input_dir>
--snp
--sv
--methyl
--cnv
--str
Subworkflows where the relevant option is omitted will not be run.
Some subworkflows have additional required options:
The SV workflow takes an optional --tr_bed
option to specify tandem
repeats in the reference sequence --- see the sniffles
documentation for more information.
The STR workflow takes a required --sex
option which is male
or female
. If --sex
is not specified, the workflow will default to female
. Please be aware that incorrect sex assignment will result in the wrong number of calls for all repeats on chrX.
To enable the 5mC aggregation step use the --methyl
option. For this
step to produce meaningful output the input BAM file must have been produced
by a basecaller capable of emitting the 5mC calls.
This brings us to activating the basecalling workflow. To run all the above including basecalling:
OUTPUT=outputnextflow run epi2me-labs/wf-human-variation \-w ${OUTPUT}/workspace \-profile standard \--snp --sv --methyl \--fast5_dir path/to/fast5/dir \--basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2' \--remora_cfg 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2' \--bed path/to.bed \--ref path/to.fasta \--out_dir ${OUTPUT}
Workflow outputs
The primary outputs of the workflow include:
--snp
)--sv
)--methyl
)--cnv
)--str
)<sample_name>.pass.cram
contains reads with qscore >= threshold
(only pass reads are used to make downstream variant cals)<sample_name>.fail.cram
contains reads with qscore < threshold
The secondary outputs of the workflow include:
{sample_name}.mapula.csv
and {sample_name}.mapula.json
provide basic alignment metrics (primary and secondary counts, read N50, median accuracy)mosdepth
outputs include:{sample_name}.mosdepth.global.dist.txt
: a cumulative distribution indicating the proportion of total bases for each and all reference sequences more info{sample_name}.regions.bed.gz
: reports the mean coverage for each region in the provided BED file{sample_name}.thresholds.bed.gz
: reports the number of bases in each region that are covered at or above each threshold value (1, 10, 20, 30X) more info{sample_name}.readstats.tsv.gz
: a gzipped TSV summarising per-alignment statistics produced by bamstats
Workflow tips
wf-human-snp
and wf-human-sv
are recommended to familiarise themselves with any parameter changes by using --help
, in particular:--ref
(not --reference
) and --bed
(not --target_bedfile
)--tr_bed
can improve the accuracy of SV calling.--methyl
requires data to be basecalled with a model that includes base modifications, providing the MM
and ML
BAM tagsInformation