wf-somatic-variation documentation

By EPI2ME Labs
6 min read

Somatic variation workflow

Nextflow workflow to identify somatic variation.

Introduction

This workflow calls variants from the alignment files of a paired tumor/normal sample.

This workflow can be used for the following:

  • Alignment QC and statistics.
  • Somatic short variant calling (SNV and Indels).
  • Somatic structural variants calling (SV).
  • Modified sites calling (mod).

Compute requirements

Recommended requirements:

  • CPUs = 64
  • Memory = 256GB

Minimum requirements:

  • CPUs = 12
  • Memory = 32GB

Approximate run time: Variable depending on sequencing modality (targeted sequencing or whole genome sequencing), as well as coverage and the individual analyses requested. For instance, a 60X/30X Tumor/Normal pair takes approximately 17h:30m on a 64 core machine with 256Gb of memory.

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME application.

The workflow uses nextflow to manage compute and software resources, therefore nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either docker or singularity is installed. This is controlled by the -profile parameter as exemplified in the example below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-somatic-variation –help

A demo dataset is provided for testing of the workflow. It can be downloaded using:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-somatic-variation/wf-somatic-variation-demo.tar.gz
tar -xzvf wf-somatic-variation-demo.tar.gz

The workflow can be run with the demo data using:

nextflow run epi2me-labs/wf-somatic-variation \
--bam_normal wf-somatic-variation-demo/demo_normal.bam \
--bam_tumor wf-somatic-variation-demo/demo_tumor.bam \
--ref wf-somatic-variation-demo/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna \
--sv --snv --mod \
--sample_name SAMPLE \
--normal_min_coverage 0 \
--tumor_min_coverage 0 \
-profile standard

For further information about running a workflow on the cmd line see https://labs.epi2me.io/wfquickstart/

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related protocols in the Nanopore community.

Input example

This workflow accepts BAM files (aligned or unaligned) as input.

The BAM input parameters for this workflow accept the path to a single BAM file for the tumor sample (--bam_tumor), and one optional bam for the normal sample (--bam_normal). A sample name can be supplied with --sample.

Input parameters

Workflow Options

Nextflow parameter nameTypeDescriptionHelpDefault
snvbooleanCall for somatic small variants.If this option is selected, small variant calling will be carried out using ClairS.False
svbooleanCall for somatic structural variants.If this option is selected, the workflow will call somatic structural variants using nanomonsv.False
modbooleanEnable output of differentially modified sites and differentially modified regions [requires input BAMs with ML and MM tags].If this option is selected, modified bases will be aggregated with modkit and differential modifications will be computed with DSS.False

Main options

Nextflow parameter nameTypeDescriptionHelpDefault
sample_namestringSample name to be displayed in workflow outputs.The sample name will be used from the workflow to correctly name output files.SAMPLE
bam_normalstringPath to a BAM (or CRAM) containing aligned or unaligned reads for the normal sample.You may choose to provide a BAM/CRAM, but not both.
bam_tumorstringPath to a BAM (or CRAM) containing aligned or unaligned reads for the tumor sample.You may choose to provide a BAM/CRAM, but not both.
refstringPath to a reference FASTA file.Reference against which to compare reads for variant calling.
bedstringAn optional BED file enumerating regions to process for variant calling.
tr_bedstringAn optional BED file enumerating simple repeat regions.This command provides a bed file specifying the location of the simple repetitive elements in the genome of choice. This file should be a standard bed file, as described in the UCSC specification.
out_dirstringDirectory for output of all workflow results.output
annotationbooleanPerform SnpEff annotation.If this option is deselected, VCFs will not be annotated with SnpEff.True

Quality Control Options

Nextflow parameter nameTypeDescriptionHelpDefault
depth_intervalsbooleanOutput a bedGraph file with entries for each genomic interval featuring homogeneous depth.The output bedGraph file will have an entry for each genomic interval in which all positions have the same alignment depth. By default this workflow outputs summary depth information from your aligned reads. Per-base depth outputs are slower to generate but may be required for some downstream applications.False
tumor_min_coveragenumberMinimum read coverage for the tumor sample required to run analysis.A tumor BAM below this coverage value will fail the coverage threshold, and will not be processed in downstream analyses.20
normal_min_coveragenumberMinimum read coverage for the normal sample required to run analysis.A normal BAM below this coverage value will fail the coverage threshold, and will not be processed in downstream analyses.20

Small variant calling options

Nextflow parameter nameTypeDescriptionHelpDefault
basecaller_cfgstringName of the model to use for converting signal and selecting a small variant calling model.Required for basecalling and small variant calling. The basecaller configuration is used to automatically select the appropriate small variant calling model. Refer to the model table on the Dorado repository for selecting a simplex basecalling model.dna_r10.4.1_e8.2_400bps_sup@v3.5.2
hybrid_mode_vcfstringEnable hybrid calling mode that combines the de novo calling results and genotyping results at the positions in the VCF file given.
genotyping_mode_vcfstringVCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.
normal_vcfstringVCF file input containing normal sites for the given sample.Pointing to a pre-computed normal VCF file will prevent the workflow from calling the germline sites for the normal sample, reducing processing time.
include_all_ctgsbooleanCall for variants on all sequences in the reference, otherwise small variants will only be called on chr{1..22,X,Y}.Enabling this option will call for variants on all contigs of the input reference sequence. Typically this option is not required as standard human reference sequences contain decoy and unplaced contigs that are usually omitted for the purpose of variant calling. This option might be useful for non-standard reference sequence databases.False
skip_haplotype_filterbooleanSkip haplotype filtering of variants.Setting this will skip haplotype filtering of variants.False
fast_modebooleanFast germline variants calling in Clair3 (does not emit germline calls).Setting this will speed up the germline calling from Clair3 by relaxing the variant calling parameters; this matches ClairS default behaviour, and therefore will not emit germline VCFs.False
germlinebooleanThe workflow will perform germline calling and tumor phasing as default; set to false to disable this (greatly speeds up execution).True
phase_normalbooleanPhase and tag the normal, in addition to the tumor dataset.False

Somatic structural variant calling options

Nextflow parameter nameTypeDescriptionHelpDefault
min_sv_lengthintegerMinimum SV size to call.Provide the minimum size of the structural variants to call with nanomonsv.50
classify_insertbooleanPerform SV insert classification.Run nanomonsv insert_classify to annotate transposable and repetitive elements for the inserted SV sequences.False
qvintegerApproximate single base quality value (QV), one of 10, 15, 20 or 25.Expected single base quality as described in the nanomonsv web page.
control_panelstringPath to the directory containing the non-matched control panel data generated with nanomonsv merge_control.nanomonsv get can use a panel of non-matched control data to remove reads carrying common alleles; see here for more details.

Methylation calling options

Nextflow parameter nameTypeDescriptionHelpDefault
force_strandbooleanRequire modkit to call strand-aware modifications.False

Multiprocessing Options

Nextflow parameter nameTypeDescriptionHelpDefault
ubam_map_threadsintegerSet max number of threads to use for aligning reads from uBAM (limited by config executor cpus).8
ubam_sort_threadsintegerSet max number of threads to use for sorting and indexing aligned reads from uBAM (limited by config executor cpus).3
ubam_bam2fq_threadsintegerSet max number of threads to use for uncompressing uBAM and generating FASTQ for alignment (limited by config executor cpus).1
nanomonsv_get_threadsintegerTotal number of threads to use in nanomonsv get (minimum of 2 and limited by config executor cpus).4
dss_threadsintegerTotal number of threads to use in the DSS differential modification analysis (limited by config executor cpus).1
modkit_threadsintegerTotal number of threads to use in modkit modified base calling (limited by config executor cpus).4
haplotype_filter_threadsintegerSet max number of threads to use for the haplotype filtering stage in SNV workflow (limited by config executor cpus).4

Miscellaneous Options

Nextflow parameter nameTypeDescriptionHelpDefault
disable_pingbooleanEnable to prevent sending a workflow ping.False

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

TitleFile pathDescriptionPer sample or aggregated
Workflow alignment statistics report{{ alias }}.wf-somatic-variation-readQC-report.htmlReport of the alignment statistics for each tumor/normal paired sample.per-sample
Workflow SNV report{{ alias }}.wf-somatic-snv-report.htmlReport of the SNV for each tumor/normal paired sample.per-sample
Workflow SV report{{ alias }}.wf-somatic-sv-report.htmlReport of the SV for each tumor/normal paired sample.per-sample
Workflow MOD report{{ alias }}.wf-somatic-mod-report.htmlReport of the modified bases for each tumor/normal paired sample.per-sample
Somatic short variant VCF{{ alias }}.wf-somatic-snv.vcf.gzVCF file with the somatic SNVs for the sample.per-sample
Somatic short variant VCF index{{ alias }}.wf-somatic-snv.vcf.gz.tbiThe index of the resulting VCF file with the somatic SNVs for the sample.per-sample
Somatic structural variant VCF{{ alias }}.wf-somatic-sv.vcf.gzVCF file with the somatic SVs for the sample.per-sample
Somatic structural variant VCF index{{ alias }}.wf-somatic-sv.vcf.gz.tbiThe index of the resulting VCF file with the somatic SVs for the sample.per-sample
Modified bases BEDMethyl (normal){{ alias }}_normal.wf_mod.bedmethyl.gzBED file with the aggregated modification counts for the normal sample.per-sample
Modified bases BEDMethyl (tumor){{ alias }}_tumor.wf_mod.bedmethyl.gzBED file with the aggregated modification counts for the tumor sample.per-sample
Haplotagged alignment file (normal){{ alias }}_normal.ht.bamBAM or CRAM file with the haplotagged reads for the normal sample.per-sample
Haplotagged alignment file index (normal){{ alias }}_normal.ht.bam.baiThe index of the resulting BAM or CRAM file with the haplotagged reads for the normal sample.per-sample
Haplotagged alignment file (tumor){{ alias }}_tumor.ht.bamBAM or CRAM file with the haplotagged reads for the tumor sample.per-sample
Haplotagged alignment file index (tumor){{ alias }}_tumor.ht.bam.baiThe index of the resulting BAM or CRAM file with the haplotagged reads for the tumor sample.per-sample

Pipeline overview

This workflow is designed to perform variant calling of small variants, structural variants and modified bases aggregation from paired tumor/normal BAM files for a single sample.

Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

1. Input and data preparation.

The workflow relies on three primary input files:

  1. A reference genome in fasta format
  2. A BAM file for the tumor sample (either aligned or unaligned)
  3. An optional BAM file for the normal sample (either aligned or unaligned)

The BAM files can be generated from:

  1. POD5/FAST5 files using the wf-basecalling workflow, or
  2. fastq files using wf-alignment. Both workflows will generate aligned BAM files that are ready to be used with wf-somatic-variation. It is possible to run the workflow without the BAM file of the “normal” sample. See tumor-only mode for more details.

2. Data QC and pre-processing.

The workflow starts by performing multiple checks of the input BAM files, as well as computing:

  1. The depth of sequencing of each BAM file with mosdepth.
  2. The read alignment statistics for each BAM file with fastcat.

After computing the coverage, the workflow will check that the input BAM files have a depth greater than --tumor_min_coverage and --normal_min_coverage for the tumor and normal BAM files, respectively. It is necessary that both BAM files have passed the respective thresholds. In cases where the user sets the minimum coverage to 0, the check will be skipped and the workflow will proceed directly to the downstream analyses.

3. Somatic short variants calling with ClairS.

The workflow currently implements a deconstructed version of ClairS (v0.1.6) to identify somatic variants in a paired tumor/normal sample. This workflow takes advantage of the parallel nature of Nextflow, providing optimal efficiency in high-performance, distributed systems.

Currently, ClairS supports the following basecalling models:

Workflow basecalling modelClairS model
dna_r10.4.1_e8.2_400bps_sup@v4.2.0ont_r10_dorado_5khz
dna_r10.4.1_e8.2_400bps_sup@v4.1.0ont_r10_dorado_4khz
dna_r10.4.1_e8.2_400bps_sup@v3.5.2ont_r10_guppy
dna_r9.4.1_e8_hac@v3.3ont_r9_guppy
dna_r9.4.1_e8_sup@v3.3ont_r9_guppy
dna_r9.4.1_450bps_hac_promont_r9_guppy
dna_r9.4.1_450bps_hacont_r9_guppy

Any other model provided will prevent the workflow to start.

Currently, indel calling is supported only for dna_r10 basecalling models. When the user specifies an r9 model the workflow will automatically skip the indel processes and perform only the SNV calling.

The workflow uses Clair3 to call germline sites on both the normal and tumor sample, which are then used internally to refine the somatic variant calling. This mode is computationally demanding, and it’s behaviour can be changed with a few options:

  • Reduce the accuracy of the variant calling with --fast_mode.
  • Provide a pre-computed VCF file reporting the germline calls for the normal sample with --normal_vcf.
  • Disable the germline calling altogether with --germline false.

SNVs called using the GRCh37 or GRCh38 genomes can be annotated using SnpEff by setting --annotation true. Furthermore, the workflow will add annotations from the ClinVar database.

4. Somatic structural variant (SV) calling with Nanomonsv.

The workflow allows for the calling of somatic SVs using long-read sequencing data. Starting from the paired cancer/control samples, the workflow will:

  1. Parse the SV signatures in the reads using nanomonsv parse
  2. Call the somatic SVs using nanomonsv get
  3. Filter out the SVs in simple repeats using add_simple_repeat.py (optional)
  4. Annotate transposable and repetitive elements using nanomonsv insert_classify (optional)

As of nanomonsv v0.7.1 (and v0.4.0 of this workflow), users can provide the approximate single base quality value (QV) for their dataset. To decide which is the most appropriate value for your dataset, visit the get section of the nanomonsv web page, but it can be summarized as follow:

BasecallerQuality value
guppy (v5)10
guppy (v5 or v6)15
dorado20

To provide the correct qv value, simply use --qv 20.

The VCF produced by nanomonsv is now processed to have one sample with the name specified with --sample_name (rather than the two-sample TUMOR/CONTROL). By default, this file does not report a genotype in the resulting VCF, unless the user specifies the --genotype_sv option. In this case, a genotype is defined based on the number of reference-supporting reads in the tumor sample: if the value is >= min_ref_support, then it is called as heterozygote, otherwise it is called as homozygote. The original VCFs generated by nanomonsv can still be accessed by the user, and is saved as {{ alias }}/sv/vcf/{{ alias }}.results.nanomonsv.vcf.

SVs called using the GRCh37 or GRCh38 genomes can be annotated using SnpEff by setting --annotation true.

It is possible to provide a non-matching control panel with the --control_panel option. nanomonsv get can use a panel of non-matched control data to remove reads carrying common alleles, as described in the documentation. These datasets can be generated by using nanomonsv merge_control as described here.

5. Modified base calling with modkit

Modified base calling can be performed by specifying --mod. The workflow will aggregate the modified bases using modkit and perform differential modification analyses using DSS. The default behaviour of the workflow is to run modkit with the --cpg --combine-strands options set. It is possible to report strand-aware modifications by providing --force_strand. Users can further change the behaviour of modkit by passing options directly to modkit via the --modkit_args option. This will override any preset, and allow full control over the run of modkit. For more details on the usage of modkit pileup, checkout the software documentation.

6. Tumor-only mode

It is possible to run a reduced version of the workflow using only the tumor BAM files. Currently, the following components can run in tumor-only mode:

  • base workflow: BAM coverage and QC statistics
  • --mod: the workflow will run modkit on the tumor BAM file, but will skip the differentially modified region and loci detection

7. Run the workflow on a region

When sequencing specific regions or genes, the runtime can vary substantially. The table below provides a guideline on the average compute time for a given number of genes or regions. All analyses are run using up to 128GB of RAM and 16 cores, computing --sv and --snv on a 75X/55X tumor/normal coverage in the regions of interest.

Number of genesRegion sizeCPU/hRuntime
1-10200Kb-1Mb~2.08m-15m
11-1001Mb-6Mb~7.025m-45m
100-100020Mb-60Mb~14.045m-1h:30m

Troubleshooting

  • If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
  • See how to interpret some common nextflow exit codes here.
  • To run the workflow with a non-human organism, proceed as follows:
    • EPI2ME app: disable the Classify insertion sequence and Annotation options.
    • Command line: set --classify_insert false and --annotation false.
  • The SV calling workflow requires a computing system supporting AVX2 instructions. Please, ensure that your system supports these before running it.
  • Short somatic Indel calling is supported only for dna_r10 basecalling models.

FAQ’s

  • Does the workflow calls 5hmC, on top of 5mC? - Yes, the workflow does call 5hmC, but only if you performed the basecalling with the appropriate module; for more details, check out the dorado github page.

  • Can I run the workflow in tumor-only mode? - Yes, but currently this mode is only available for --mod.

See the EPI2ME website for lots of other resources and blog posts.


Share

EPI2ME Labs

EPI2ME Labs

Senior Button Pusher

Quick Links

TutorialsWorkflowsOpen DataContact

Social Media

© 2020 - 2024 Oxford Nanopore Technologies plc. All rights reserved. Registered Office: Gosling Building, Edmund Halley Road, Oxford Science Park, OX4 4DQ, UK | Registered No. 05386273 | VAT No 336942382. Oxford Nanopore Technologies, the Wheel icon, EPI2ME, Flongle, GridION, Metrichor, MinION, MinIT, MinKNOW, Plongle, PromethION, SmidgION, Ubik and VolTRAX are registered trademarks of Oxford Nanopore Technologies plc in various countries. Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.