Data Releases
GM24385 Dataset Release
September 22, 2020

The R9.4.1 data referenced on this page have been superseded by the June 2021 Guppy 5 rebasecalling of the November 2020 dataset.

We are happy to announce the release of a Nanopore sequencing dataset of the Genome in a Bottle sample GM24385.

The dataset was generated on multiple PromethION flowcells using both the R9.4.1 and R10.3 nanopores. The direct sequencer output is included: raw signal data stored in .fast5 files and basecalled data in .fastq files. Secondary analyses are also included, notably alignments of the sequence data to the reference genome and statistics derived from these alignments.

What's included?

The dataset comprises multiple R9.4.1 and R10.3 flowcells of multiple separately prepared samples; each sample was run on each flowcell type.

The initial sequencer outputs are included in self-contained directories. In addition, derived outputs from an automated pipeline are stored separately.

Details concerning sample preparations

The method of analyte preparation is described briefly below. Standard, published protocols were followed with no intentional deviation.

The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM24385

  • High molecular weight DNA from GM24385 lymphoblastoid cells was prepared by Evotec.
  • The Circulomics Short Read Eliminator XL protocol was used to deplete DNA fragments < 40 kb in length.
  • DNA was end-repaired and dA-tailed prior to LSK-based library preparation.
  • DNA sequencing was performed using a PromethION device.

The dataset comprises multiple flowcells for each pore:

Pore     Treatment  Flowcells
R9.4.1   SRE        PAF27096, PAF27462
R10.3    SRE        PAF26223, PAF26161

Location of the data

The data is located in an Amazon Web Services S3 bucket at:

s3://ont-open-data/gm24385_2020.09/

See the tutorials page for information on downloading the dataset.
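As a rough illustration of how the bucket location above can be explored programmatically, the sketch below splits the S3 URI into bucket and prefix and builds an S3 ListObjectsV2 REST URL for it. It assumes the bucket allows anonymous listing over HTTPS through the generic `s3.amazonaws.com` endpoint, which this page does not state. The function names (`split_s3_uri`, `listing_url`, `list_top_level`) are hypothetical helpers, not part of any Oxford Nanopore tooling.

```python
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Location quoted in this post.
S3_URI = "s3://ont-open-data/gm24385_2020.09/"


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def listing_url(uri):
    """Build a ListObjectsV2 URL listing one level below the prefix."""
    bucket, prefix = split_s3_uri(uri)
    query = urlencode({"list-type": "2", "prefix": prefix, "delimiter": "/"})
    # Assumption: anonymous access via the generic S3 endpoint is permitted.
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def list_top_level(uri):
    """Return the sub-prefixes directly under uri (requires network access)."""
    with urlopen(listing_url(uri)) as response:
        root = ET.parse(response).getroot()
    ns = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
    return [el.text for el in root.findall("s3:CommonPrefixes/s3:Prefix", ns)]


if __name__ == "__main__":
    print(listing_url(S3_URI))
```

Calling `list_top_level(S3_URI)` would then return the top-level directories of the release if the assumption about anonymous access holds. The tutorials page remains the authoritative guide for downloading the data.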

Description of available data

The uploaded dataset has been prepared using a snakemake pipeline to:

  1. Align basecalls to the reference sequence. All primary, secondary and supplementary alignments are kept.
  2. Filter the .bam file to the list of regions defined in the configuration file, retaining only primary alignments.
  3. Produce read statistics from per-region .bams.
  4. Repack/group source .fast5 files according to primary alignment .bams to produce per-region .fast5 file sets.
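The filtering and grouping steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: it assumes alignments are available as simple tuples of (read ID, SAM flag, reference name, start, end), and it uses the flag bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary) to decide what counts as a primary alignment. The function names and the example records are hypothetical.

```python
# SAM flag bits (from the SAM specification).
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_primary(flag):
    """True for mapped alignments that are neither secondary nor supplementary."""
    return not flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY)


def overlaps(region, ref, start, end):
    """True if the half-open interval [start, end) on ref overlaps the region."""
    region_ref, region_start, region_end = region
    return ref == region_ref and start < region_end and end > region_start


def primary_reads_per_region(alignments, regions):
    """Map each region to the IDs of reads whose primary alignment overlaps it."""
    per_region = {region: [] for region in regions}
    for read_id, flag, ref, start, end in alignments:
        if not is_primary(flag):
            continue
        for region in regions:
            if overlaps(region, ref, start, end):
                per_region[region].append(read_id)
    return per_region


# Hypothetical alignment records: (read_id, flag, ref, start, end).
alignments = [
    ("read1", 0, "chr20", 100, 5000),    # primary, forward strand
    ("read1", 2048, "chr1", 0, 800),     # supplementary, dropped
    ("read2", 256, "chr20", 200, 900),   # secondary, dropped
    ("read3", 16, "chr20", 7000, 9000),  # primary, reverse strand
    ("read4", 4, "*", 0, 0),             # unmapped, dropped
]
regions = [("chr20", 0, 8000), ("chr1", 0, 1000)]

if __name__ == "__main__":
    print(primary_reads_per_region(alignments, regions))
```

The resulting per-region read ID lists are the kind of information that would be needed to regroup the source .fast5 files into per-region sets, as in step 4.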

For more details see our post detailing the pipeline and its outputs.


Tags

#datasets #human cell-line #R9.4.1 #R10.3 #basecalling

© 2020 - 2021
Oxford Nanopore Technologies
All Rights Reserved.
