Phasing, also referred to as haplotyping, covers methods used to infer which genetic variants occur together on the same copy of a chromosome, usually by distinguishing the variants that came from the maternal haplotype from those that came from the paternal haplotype. In polyploid phasing it may also be used to predict more than two haplotypes. Phasing can be applied to the whole genome or to a smaller subsequence.
Understanding which variants are inherited together, or ‘in phase’, can give a clearer picture of how certain genes affect expression and regulation, the functional impact of genes or sets of genes, diseases associated with sets of inherited genes, genetics within populations and gene inheritance (Tewhey et al., 2011). It can also help find variants resulting from compound heterozygosity (when each parent donates one alternate allele, located at different loci within the same gene) and show whether two heterozygous variants occur in one copy of a gene or in both copies (Choi et al., 2018). Phased variants are described as cis when they lie on the same copy of a chromosome and as trans when they lie on opposite copies.
There are various methods for phasing. The choice of method depends on the sequencing technology used and the level of detail required:
Physically separating sequences in the lab involves isolating chromosomes, amplification, sequencing and then piecing the fragments together, which requires detailed lab protocols (Martin et al., 2016). The latter step can introduce errors where segments are phased accurately but pieced back together incorrectly.
Population-based phasing methods use large data sets to predict haplotypes based on statistical likelihood. Projects such as 1000 Genomes and HapMap catalogue common variants and haplotype blocks, and tools such as SHAPEIT and Beagle fit a maximum-likelihood model to these data sets to predict phasing. This approach cannot phase variants that are rare or unique to an individual, and it relies on the data sets being representative of the population (Choi et al., 2018).
Trio phasing, or genetic haplotyping, compares the heterozygous variant calls of parent(s) and offspring and uses Mendel’s laws of inheritance to predict which haplotype each variant comes from. This is a very straightforward method, but it can be relatively expensive because three genomes need to be sequenced, and it will not phase de novo mutations or variants that are heterozygous in all individuals (Martin et al., 2016).
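The Mendelian reasoning behind trio phasing can be illustrated with a minimal Python sketch for a single biallelic site. The function name phase_child_by_trio and the tuple encoding of genotypes (0 for the reference allele, 1 for the alternate allele) are assumptions made for illustration and are not taken from any particular tool.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # unordered pair of alleles: 0 = reference, 1 = alternate


def phase_child_by_trio(child: Genotype, mother: Genotype,
                        father: Genotype) -> Optional[Tuple[int, int]]:
    """Return (maternal_allele, paternal_allele) for the child, or None.

    None is returned when Mendelian inheritance does not give a unique
    answer, for example when all three individuals are heterozygous or
    when the child carries an allele that neither parent has (de novo).
    """
    a, b = child
    if a == b:
        return (a, b)  # homozygous sites need no phasing

    # Try both assignments of the child's two alleles to the parents and
    # keep those that are consistent with the parental genotypes.
    candidates = set()
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            candidates.add((maternal, paternal))

    return candidates.pop() if len(candidates) == 1 else None


if __name__ == "__main__":
    # The mother is homozygous reference, so the alternate allele is paternal.
    print(phase_child_by_trio((0, 1), (0, 0), (0, 1)))
    # Every individual is heterozygous, so the site cannot be phased.
    print(phase_child_by_trio((0, 1), (0, 1), (0, 1)))
```

The two failure cases in the sketch correspond to the limitations described above: sites that are heterozygous in all individuals are ambiguous, and de novo variants are inconsistent with both parents.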
Read-based phasing, also known as haplotype assembly, relies on reads that are long enough to span two or more heterozygous variants (Martin et al., 2016). If a read spans two variant positions and contains both variants, this suggests that both variants lie on the same haplotype; if it contains only one of them, this suggests that they lie on different haplotypes (Hager et al., 2020). A weighted minimum error correction algorithm is used to infer haplotypes. This approach can phase individual-specific variants, providing the most specific detail on haplotypes.
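The evidence a spanning read provides can be shown with a small Python sketch that takes a majority vote over the reads covering two heterozygous positions. The read representation (a dictionary from variant position to observed allele) and the function name vote_cis_trans are hypothetical. Real tools also weight each observation by its base quality rather than counting reads equally.

```python
from collections import Counter
from typing import Dict, List, Optional

Read = Dict[int, int]  # variant position -> allele observed (0 = ref, 1 = alt)


def vote_cis_trans(reads: List[Read], pos_a: int, pos_b: int) -> Optional[str]:
    """Majority vote on whether the alternate alleles at two sites are linked.

    A read showing the same allele at both sites (ref/ref or alt/alt)
    supports 'cis'; a read showing different alleles supports 'trans'.
    Reads that do not cover both positions carry no phase information.
    """
    votes = Counter()
    for read in reads:
        if pos_a in read and pos_b in read:
            votes["cis" if read[pos_a] == read[pos_b] else "trans"] += 1

    if votes["cis"] == votes["trans"]:
        return None  # no spanning reads, or conflicting evidence in balance
    return "cis" if votes["cis"] > votes["trans"] else "trans"


if __name__ == "__main__":
    reads = [
        {100: 1, 250: 1},  # both alternate alleles on one read
        {100: 0, 250: 0},  # both reference alleles on one read
        {100: 1, 250: 0},  # a conflicting read, e.g. a sequencing error
        {100: 1},          # covers only one site, so it is ignored
    ]
    print(vote_cis_trans(reads, 100, 250))
```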
Sequencing read data from Oxford Nanopore Technologies can be used for all of these methods but is uniquely suited to read-based phasing, because the reads are long enough that they are likely to contain multiple SNVs. The greater sequence context also helps align the reads to a reference. Phasing short reads can be difficult because each read will contain fewer SNVs. Read-based phasing is possible with as little as 60x read coverage, and it is now also possible to include methylation information in the phasing step.
There are many tools for read-based phasing, but we currently favour WhatsHap because of its good run time and low error rates. WhatsHap can phase SNPs, insertions, deletions, multiple adjacent SNPs and some complex variants. WhatsHap solves a weighted minimum error correction problem: taking phred-scaled error probabilities into account, it finds the minimum-cost set of corrections required to arrange the reads into two haplotypes. The algorithm is described in Martin et al. (2016).
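To make the weighted minimum error correction idea concrete, the following Python sketch solves a toy instance by brute force. It is not WhatsHap's implementation: it simply enumerates every candidate haplotype, which is only feasible for a handful of variants. The data layout and the name solve_wmec are assumptions made for illustration. All sites are assumed to be heterozygous and biallelic, so the second haplotype is the complement of the first, and at least one position must be supplied.

```python
from itertools import product
from typing import Dict, List, Tuple

# A read maps variant position -> (observed allele, phred quality of the call).
Read = Dict[int, Tuple[int, int]]


def solve_wmec(reads: List[Read], positions: List[int]):
    """Brute-force weighted MEC for heterozygous biallelic sites.

    Returns (haplotype_1, haplotype_2, cost), where cost is the summed
    phred weight of the allele calls that must be 'corrected' so that
    every read agrees with one of the two haplotypes.
    """
    best_cost, best_hap = None, None
    # Fix the first allele to 0: swapping the two haplotypes gives the same solution.
    for tail in product((0, 1), repeat=len(positions) - 1):
        hap = dict(zip(positions, (0,) + tail))
        cost = 0
        for read in reads:
            # Weight of calls that disagree with haplotype 1 ...
            cost_h1 = sum(q for p, (a, q) in read.items() if a != hap[p])
            # ... and with its complement, haplotype 2.
            cost_h2 = sum(q for p, (a, q) in read.items() if a == hap[p])
            cost += min(cost_h1, cost_h2)  # assign the read to the cheaper haplotype
        if best_cost is None or cost < best_cost:
            best_cost, best_hap = cost, hap

    hap1 = [best_hap[p] for p in positions]
    hap2 = [1 - allele for allele in hap1]
    return hap1, hap2, best_cost


if __name__ == "__main__":
    reads = [
        {10: (0, 30), 20: (0, 30)},
        {10: (1, 30), 20: (1, 30)},
        {20: (0, 30), 30: (1, 30)},
        {20: (1, 30), 30: (0, 30)},
        {10: (0, 5), 20: (1, 5)},  # low-quality read that conflicts with the rest
    ]
    print(solve_wmec(reads, [10, 20, 30]))
```

In this toy data set the cheapest solution corrects a single low-quality call, which shows why weighting by phred score matters: one confident read outweighs several doubtful ones.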
WhatsHap requires sequencing reads and an unphased VCF as input. The initial VCF can be created by aligning sequence reads to a reference and running a variant calling tool, and WhatsHap is also integrated into popular variant calling tools including Medaka and Clair3. Phasing information is written to the VCF file in the genotype field: phased heterozygous genotypes use a pipe separator (0|1 or 1|0) and carry a phase set (PS) tag, whereas unphased heterozygous genotypes keep the slash separator (0/1). Homozygous genotypes (0/0 and 1/1) do not need phasing. The result can be visualised in IGV or other tools. Whilst WhatsHap is currently our choice of phasing tool, there is ongoing research and development of tools by algorithm experts, which we continually review.
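As a rough illustration of how phased genotypes can be read back from a VCF, the Python sketch below parses the FORMAT and sample columns of a single record and decides whether two heterozygous variants are in cis or in trans. The helper names parse_phased_gt and cis_or_trans are hypothetical, and the sketch handles only diploid, biallelic genotypes. For real files, an established VCF parsing library would be a safer choice.

```python
from typing import Optional, Tuple

Phased = Tuple[Tuple[int, int], Optional[str]]  # ((allele_hap1, allele_hap2), phase set)


def parse_phased_gt(format_field: str, sample_field: str) -> Optional[Phased]:
    """Parse the FORMAT and sample columns of one VCF record.

    Returns ((hap1_allele, hap2_allele), PS) for a phased genotype, or
    None if the genotype is unphased ('/') or contains a missing allele.
    """
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = values.get("GT", "")
    if "|" not in gt or "." in gt:
        return None
    first, second = gt.split("|")
    return (int(first), int(second)), values.get("PS")


def cis_or_trans(first: Optional[Phased], second: Optional[Phased]) -> Optional[str]:
    """Compare two phased heterozygous genotypes from the same phase set."""
    if first is None or second is None:
        return None
    (alleles_1, ps_1), (alleles_2, ps_2) = first, second
    # Phase is only comparable within one phase set; a record without a
    # PS value is treated conservatively here as not comparable.
    if ps_1 is None or ps_1 != ps_2:
        return None
    if sorted(alleles_1) != [0, 1] or sorted(alleles_2) != [0, 1]:
        return None  # this sketch only handles biallelic heterozygous sites
    return "cis" if alleles_1 == alleles_2 else "trans"


if __name__ == "__main__":
    variant_a = parse_phased_gt("GT:PS", "0|1:12345")
    variant_b = parse_phased_gt("GT:PS", "1|0:12345")
    print(cis_or_trans(variant_a, variant_b))
```

Two variants in trans within the same gene are the pattern to look for when screening for compound heterozygosity.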
Why not try out our new Human SNP workflow, which uses Clair3, and have a look at the phased output VCF?
Choi, Y., Chan, A. P., Kirkness, E., Telenti, A. and Schork, N. J., 2018. Comparison of phasing strategies for whole human genomes. PLOS Genetics [online], 14 (4), e1007308. Available from: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308 [Accessed 26 Nov 2021].
Hager, P., Mewes, H.-W., Rohlfs, M., Klein, C. and Jeske, T., 2020. SmartPhase: Accurate and fast phasing of heterozygous variant pairs for genetic diagnosis of rare diseases. PLOS Computational Biology [online], 16 (2), e1007613. Available from: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007613 [Accessed 26 Nov 2021].
Martin, M., Patterson, M., Garg, S., Fischer, S. O., Pisanti, N., Klau, G. W., Schönhuth, A. and Marschall, T., 2016. WhatsHap: fast and accurate read-based phasing [online]. Bioinformatics. preprint. Available from: http://biorxiv.org/lookup/doi/10.1101/085050 [Accessed 26 Nov 2021].
Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. and Schork, N. J., 2011. The importance of phase information for human genomics. Nature Reviews Genetics [online], 12 (3), 215–223. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3753045/ [Accessed 26 Nov 2021].