This repository contains a Nextflow workflow to identify somatic variation in a paired normal/tumor sample. This workflow currently performs:
This workflow enables analysis of somatic variation using the following tools:
The workflow uses Nextflow to manage compute and software resources, so Nextflow must be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed.
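Before launching the workflow, it can help to confirm which of the two container engines is available. The snippet below is a minimal sketch, not part of the workflow: the function name `detect_container_engine` is hypothetical, and it only checks that the executable is on `PATH`, not that the engine is correctly configured.

```shell
#!/bin/sh
# Print the first container engine found on PATH: docker, singularity, or none.
# Hypothetical helper for illustration; the workflow does not ship this function.
detect_container_engine() {
    for engine in docker singularity; do
        if command -v "$engine" >/dev/null 2>&1; then
            echo "$engine"
            return 0
        fi
    done
    echo "none"
    return 1
}

engine=$(detect_container_engine) || true
echo "Container engine: $engine"
```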
It is not required to clone or download the git repository in order to run the workflow. For more information on running EPI2ME Labs workflows visit our website.
Workflow options
To obtain the workflow, having installed nextflow, users can run:
nextflow run epi2me-labs/wf-somatic-variation --help
to see the options for the workflow.
Input and Data preparation
The workflow relies on three primary input files:
The workflow is designed to work with human samples, and the reference genome should be either hg19 (GRCh37) or hg38 (GRCh38). Despite this, the majority of tasks within the workflow are species agnostic. The following options will require the workflow to check for the genome build, and will require hg19 or hg38:
--classify_insert
The aligned BAM files can be generated starting from:
Both workflows will generate aligned BAM files that are ready to be used with wf-somatic-variation.
Demo data
The workflow comes with matched demo data accessible here:
wget -q -O demo_data.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-somatic-variation/wf-somatic-variation-demo.tar.gz
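Once the download has finished, the archive needs to be unpacked. The following is a small sketch of a guarded extraction; the helper name `extract_archive` and the destination directory `demo` are assumptions made for illustration, and the layout of the files inside the archive is not described here.

```shell
#!/bin/sh
# Unpack a gzipped tarball into a destination directory,
# failing early with a message if the archive is missing.
extract_archive() {
    archive="$1"
    dest="${2:-.}"
    if [ ! -f "$archive" ]; then
        echo "Archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
}

# Usage, once demo_data.tar.gz has been downloaded ("demo" is a hypothetical target directory):
# extract_archive demo_data.tar.gz demo
```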
This demo is derived from a tumor/normal pair of samples that we have made publicly accessible. Check out our blog post for more details.
Somatic short variant calling
The workflow currently implements a deconstructed version of ClairS (v0.1.0) to identify somatic variants in a paired tumor/normal sample. Splitting ClairS into separate steps takes advantage of the parallel nature of Nextflow, providing the best performance on high-performance, distributed systems.
Currently, ClairS supports the following basecalling models:
Indel calling
Currently, indel calling is supported only for dna_r10 basecalling models. When the user specifies an r9 model, the workflow automatically skips the indel processes and performs only SNV calling.
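The rule above can be illustrated with a prefix check on the model name. This is only a sketch of the decision, not the workflow's actual implementation: the function name `supports_indel_calling` is hypothetical, and the two model names are used purely as examples.

```shell
#!/bin/sh
# Return 0 when the basecalling model name starts with "dna_r10", 1 otherwise.
# Hypothetical helper: it mirrors the documented rule, not the workflow's code.
supports_indel_calling() {
    case "$1" in
        dna_r10*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example model names, used here only for illustration
for model in dna_r10.4.1_e8.2_400bps_sup dna_r9.4.1_e8_hac; do
    if supports_indel_calling "$model"; then
        echo "$model: SNV and indel calling"
    else
        echo "$model: SNV calling only"
    fi
done
```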
Somatic structural variant (SV) calling with Nanomonsv
The workflow calls somatic SVs using long-read sequencing data. Starting from the paired cancer/control samples, the workflow will run:
nanomonsv parse
nanomonsv get
add_simple_repeat.py (optional)
nanomonsv insert_classify (optional)
As of nanomonsv v0.7.1, users can provide the approximate single base quality value (QV) for their dataset. To decide which value is most appropriate for your dataset, visit the nanomonsv get web page; in summary:
Basecaller | Quality value |
---|---|
guppy (<v5) | 10 |
guppy (v5 or v6) | 15 |
dorado | 20 |
To provide the correct QV, use the --qv option, for example --qv 20.
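The table can be expressed as a small lookup that turns a basecaller name and major version into a `--qv` value. This is a sketch under stated assumptions: the function name `qv_for_basecaller` is hypothetical, and the QV 10 row is assumed to apply to Guppy releases older than v5, since the 15 row already covers v5 and v6.

```shell
#!/bin/sh
# Map a basecaller name (and, for guppy, a major version) to the approximate QV in the table.
# Assumption: the QV 10 row refers to Guppy releases older than v5.
qv_for_basecaller() {
    basecaller="$1"
    major="${2:-}"
    case "$basecaller" in
        dorado)
            echo 20 ;;
        guppy)
            case "$major" in
                5|6)     echo 15 ;;
                1|2|3|4) echo 10 ;;
                *) echo "unsupported guppy major version: $major" >&2; return 1 ;;
            esac ;;
        *)
            echo "unknown basecaller: $basecaller" >&2; return 1 ;;
    esac
}

# Build the option to pass to the workflow; prints: --qv 15
echo "--qv $(qv_for_basecaller guppy 6)"
```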
Modified base calling
Modified base calling can be performed by specifying --mod. The workflow will call modified bases using modkit.
The default behaviour of the workflow is to run modkit with the --cpg --combine-strands options set. It is possible to report strand-aware modifications by providing --force_strand, which will trigger modkit to run in default mode.
The modkit run can be fully customized by providing --modkit_args. This will override any preset, and allow full control over the run of modkit.
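The precedence between these three settings can be summarized in a few lines. The sketch below only illustrates the documented behaviour; the function name `select_modkit_options` and its two positional arguments are assumptions, and the workflow's own implementation may differ.

```shell
#!/bin/sh
# Print the modkit options implied by the workflow parameters.
# $1: "true" when --force_strand is given; $2: the value of --modkit_args (may be empty).
# Hypothetical helper illustrating the documented precedence only.
select_modkit_options() {
    force_strand="$1"
    modkit_args="$2"
    if [ -n "$modkit_args" ]; then
        # Custom arguments override any preset
        echo "$modkit_args"
    elif [ "$force_strand" = "true" ]; then
        # modkit default mode: no extra options
        echo ""
    else
        # Workflow default preset
        echo "--cpg --combine-strands"
    fi
}

select_modkit_options false ""        # prints: --cpg --combine-strands
select_modkit_options false "--cpg"   # prints: --cpg
```

Custom arguments win over `--force_strand`, which in turn wins over the default preset.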
Output folder
The output directory has the following structure:
output/
├── execution  # Execution reports
│   ├── report.html
│   ├── timeline.html
│   └── trace.txt
│
├── SAMPLE
│   ├── qc
│   │   ├── coverage
│   │   │   ├── SAMPLE_normal.mosdepth.global.dist.txt
│   │   │   ├── SAMPLE_normal.mosdepth.summary.txt
│   │   │   ├── SAMPLE_normal.per-base.bed.gz
│   │   │   ├── SAMPLE_normal.regions.bed.gz
│   │   │   ├── SAMPLE_normal.thresholds.bed.gz
│   │   │   ├── SAMPLE_tumor.mosdepth.global.dist.txt
│   │   │   ├── SAMPLE_tumor.mosdepth.summary.txt
│   │   │   ├── SAMPLE_tumor.per-base.bed.gz
│   │   │   ├── SAMPLE_tumor.regions.bed.gz
│   │   │   └── SAMPLE_tumor.thresholds.bed.gz
│   │   └── readstats
│   │       ├── SAMPLE_normal.flagstat.tsv
│   │       ├── SAMPLE_normal.readstats.tsv.gz
│   │       ├── SAMPLE_tumor.flagstat.tsv
│   │       └── SAMPLE_tumor.readstats.tsv.gz
│   │
│   ├── snv  # ClairS outputs
│   │   ├── change_counts  # Mutational change counts for the sample; for now, it only works for the SNVs
│   │   │   └── SAMPLE_changes.csv
│   │   ├── varstats  # Bcftools stats output
│   │   │   └── SAMPLE.stats
│   │   └── vcf  # VCF outputs
│   │       ├── SAMPLE_tumor_germline.vcf.gz
│   │       ├── SAMPLE_tumor_germline.vcf.gz.tbi
│   │       ├── SAMPLE_normal_germline.vcf.gz
│   │       ├── SAMPLE_normal_germline.vcf.gz.tbi
│   │       ├── SAMPLE_somatic_indels.vcf.gz
│   │       ├── SAMPLE_somatic_indels.vcf.gz.tbi
│   │       ├── SAMPLE_somatic_snv.vcf.gz
│   │       └── SAMPLE_somatic_snv.vcf.gz.tbi
│   │
│   ├── sv
│   │   ├── single_breakend
│   │   │   └── SAMPLE.nanomonsv.sbnd.result.txt
│   │   └── txt
│   │       └── SAMPLE.nanomonsv.result.annot.txt
│   │
│   └── mod
│       ├── modC  # Modified bases code
│       │   ├── DML  # Differentially methylated loci
│       │   │   └── SAMPLE.modC.dml.tsv
│       │   ├── DMR  # Differentially methylated regions
│       │   │   └── SAMPLE.modC.dmr.tsv
│       │   ├── DSS  # DSS input files
│       │   │   ├── modC.SAMPLE_normal.dss.tsv
│       │   │   └── modC.SAMPLE_tumor.dss.tsv
│       │   └── bedMethyl  # bedMethyl output files
│       │       ├── modC.SAMPLE_normal.bed.gz
│       │       └── modC.SAMPLE_tumor.bed.gz
│       └── raw  # Raw outputs from modkit
│           ├── SAMPLE_normal.bed
│           └── SAMPLE_tumor.bed
│
├── info  # single component runtime info
│   ├── mod
│   │   ├── params.json
│   │   └── versions.txt
│   ├── snv
│   │   ├── params.json
│   │   └── versions.txt
│   └── sv
│       ├── params.json
│       └── versions.txt
│
├── SAMPLE_somatic_mutype.vcf.gz
├── SAMPLE_somatic_mutype.vcf.gz.tbi
├── SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz
├── SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz.tbi
├── SAMPLE.normal.mod_summary.tsv
├── SAMPLE.tumor.mod_summary.tsv
├── SAMPLE.wf-somatic-snp-report.html
├── SAMPLE.wf-somatic-sv-report.html
├── SAMPLE.wf-somatic-variation-readQC-report.html
├── params.json
└── versions.txt
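Based on the directory layout above, a quick post-run check can confirm that the top-level VCF files were written. This is a minimal sketch: the function name `missing_outputs` is hypothetical, the sample name stands in for SAMPLE, and both files are only expected when the corresponding SNV and SV steps were actually run.

```shell
#!/bin/sh
# List the top-level result VCFs that are missing from an output directory.
# $1: output directory; $2: sample name (replaces SAMPLE in the file names).
# Hypothetical helper; it prints nothing when both files are present.
missing_outputs() {
    outdir="$1"
    sample="$2"
    for f in \
        "${sample}_somatic_mutype.vcf.gz" \
        "${sample}.nanomonsv.result.wf_somatic_sv.vcf.gz"; do
        [ -e "$outdir/$f" ] || echo "$outdir/$f"
    done
    return 0
}

# Usage: missing_outputs output SAMPLE
```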
The primary outputs are:
output/SAMPLE_somatic_mutype.vcf.gz: the final VCF file with SNVs and, if r10, indels
output/SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz: the final VCF with the somatic SVs from nanomonsv
output/*.html: the reports of the different stages
output/SAMPLE/snp/spectra/SAMPLE_changes.csv: the mutation changes for the sample
output/SAMPLE/snp/vcf/germline/[tumor/normal]: the germline calls for both the tumor and normal BAM files
output/SAMPLE/sv/txt/SAMPLE.nanomonsv.result.annot.txt: the somatic SVs called with nanomonsv in tabular format
output/SAMPLE/sv/single_breakend/SAMPLE.nanomonsv.sbnd.result.txt: the single break-end SVs called with nanomonsv
output/SAMPLE/mod/: the results from modkit and DSS
Hardware limitations: the SV calling workflow must run on a system that supports AVX2 instructions. Please ensure that your system supports them before running it.
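On Linux, AVX2 support can be checked from the CPU flags before starting a run. The sketch below assumes a Linux host where the flags are listed in `/proc/cpuinfo`; the function name `has_avx2` is hypothetical, and other operating systems need a different check.

```shell
#!/bin/sh
# Report whether a CPU flags file lists AVX2.
# Assumption: a Linux host, where CPU flags are listed in /proc/cpuinfo.
# $1: optional path to a cpuinfo-style file (defaults to /proc/cpuinfo).
has_avx2() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -q: no output, -w: match "avx2" as a whole word
    [ -r "$cpuinfo" ] && grep -qw avx2 "$cpuinfo"
}

if has_avx2; then
    echo "AVX2 supported"
else
    echo "AVX2 not detected; the SV calling step may not run on this machine"
fi
```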
Information