The following figure shows important milestones in genome assembly. The different background colors mark the main assembly results of each era: the earliest nucleotide-based sequencing, Sanger-based shotgun sequencing, large-scale next-generation sequencing (NGS), and today's third-generation sequencing (TGS). The Human Genome Project (HGP), which lasted 13 years (1990-2003) and cost $3 billion, undoubtedly accelerated the development of genome assembly. NGS gave rise to a series of novel applications, including whole-exome sequencing, RNA-seq, ChIP-seq and WGBS-seq, which greatly broadened the use of genome sequencing. After 2010, new technologies opened the TGS era of long-read sequencing, which greatly improved the contiguity of genome assemblies.
Definitions of TGS vary, but the term usually refers to technologies that sequence a single DNA molecule directly, without amplification. These technologies produce longer reads than NGS; a single read can span several kb to several hundred kb. NGS-based technologies such as 10x Genomics linked reads and Hi-C can improve the contiguity of a genome assembly, but the arrival of TGS has made contiguous assemblies much easier to achieve.
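Contiguity is commonly summarized with the N50 statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The text above does not name a metric, so the choice of N50 here is an assumption, and the contig lengths are invented purely for illustration. A minimal sketch:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assemblies of the same 200-unit genome (made-up numbers).
fragmented_contigs = [50, 40, 30, 20, 10, 10, 10, 10, 10, 10]
contiguous_contigs = [120, 50, 30]

print("fragmented N50:", n50(fragmented_contigs))
print("contiguous N50:", n50(contiguous_contigs))
```

Both toy assemblies contain the same number of bases, but the one made of fewer, longer contigs has the higher N50.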
At present, two third-generation sequencing technologies are widely used. One is single-molecule real-time (SMRT) sequencing, perfected and commercialized by Pacific Biosciences (PacBio); the other is nanopore sequencing, commercialized by Oxford Nanopore Technologies (ONT). SMRT sequencing applies the principle of sequencing by synthesis. The SMRT chip serves as the sequencing carrier and holds millions of nanoscale zero-mode waveguides (ZMWs). A polymerase in each ZMW captures a DNA molecule from the library, and fluorescently labeled dNTPs emit light as they are incorporated, so the sequence can be read from the captured fluorescence signals while the strand is synthesized. SMRT sequencing currently has two modes: continuous long read (CLR) mode and circular consensus sequencing (CCS) mode. CLR reads are longer but have a higher base error rate (about 90% accuracy, far below the 99.9% of NGS); these errors, however, are randomly distributed. CCS mode exploits this property: the same molecule is read repeatedly and the passes are combined into a self-corrected consensus, which brings the error rate down to NGS levels at the cost of read length.
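Why random errors can be corrected by reading the same molecule several times can be shown with a simple calculation. The sketch below assumes that each pass calls a base correctly with the same independent probability, and that the consensus is correct whenever a strict majority of passes is correct. This is a simplified model for illustration, not PacBio's actual consensus algorithm; the function name and the choice of pass counts are hypothetical.

```python
from math import comb


def consensus_accuracy(per_pass_accuracy, passes):
    """Probability that a strict majority of independent passes is correct."""
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    p = per_pass_accuracy
    majority = passes // 2 + 1
    # Binomial tail: at least `majority` of `passes` calls are correct.
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(majority, passes + 1)
    )


# Per-pass accuracy of 90%, as quoted for CLR reads in the text.
for passes in (1, 3, 5, 9, 15):
    print(passes, round(consensus_accuracy(0.90, passes), 6))
```

Under these assumptions the consensus accuracy rises quickly with the number of passes. More passes over one molecule mean a shorter insert for a given polymerase read length, which is the trade-off between accuracy and read length.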
Nanopore sequencing uses genetically engineered bacterial nanopores inserted into artificial lipid bilayers. Each membrane sits in a microwell tens of microns wide, and the wells are arrayed on a sensor chip. As a single-stranded DNA molecule passes through a pore, it disturbs the ionic current flowing through the pore, and this change is measured by a semiconductor sensor. Different bases disturb the current in slightly different ways, so the recorded current changes can be converted into a DNA sequence. ONT reads can be even longer, limited mainly by the size of the DNA fragments in the prepared library, but the base errors are difficult to correct and the sequencing error rate is high.
Because of its long reads, third-generation sequencing can span complex regions of the genome and thus significantly improve assembly quality. In addition, in diploid or polyploid genomes, TGS makes it easier to generate long haplotype-resolved fragments, separate the genetic information inherited from each parent, and avoid chimeric assemblies. It also helps to accurately detect structural variation (SV), including large insertions and deletions, duplications, inversions and translocations, even in highly repetitive regions. Third-generation sequencing can also detect epigenetic modifications, through polymerase kinetics in PacBio sequencing or the ionic current signal in nanopore sequencing.
FALCON is an assembler for third-generation data, developed by PacBio and released in 2013. It follows the hierarchical genome assembly process (HGAP): the reads are first aligned to each other to correct errors in the third-generation reads, and a string graph is then used to construct contigs, as shown in the following figure. FALCON is diploid-aware: it outputs primary contigs (p-contigs), which represent the main genome sequence, and alternative contigs (a-contigs), which carry the allelic sequences with their variant information. FALCON-Unzip is an upgraded version of FALCON. It uses the heterozygous SNPs identified in the initial assembly to obtain highly phased haplotypes, then builds an assembly graph using Hi-C data, and fully assembles the two haplotypes using haplotigs and * * * sequences.
Canu is a third-generation assembler derived from Celera Assembler, and it can be applied to both PacBio and Nanopore data. It follows the overlap-layout-consensus (OLC) approach, which builds the assembly from overlaps between long reads, and it runs in three main steps: error correction, trimming and assembly. As with FALCON, although error-corrected long reads give a much better assembly than short reads, the assembled haplotypes are still chimeric, and repeated sequences are often collapsed into a single copy. To address this, TrioCanu, released in 2018, uses parental information to phase haplotypes completely. Before assembly, it uses Illumina (second-generation) data from the parents to sort the offspring's reads by parent-specific SNPs, and then assembles the two parental haplotypes independently, so TrioCanu is especially suitable for highly heterozygous genomes.
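The idea of assembling from overlaps between reads can be illustrated with a toy greedy merger. This is only a sketch of the overlap principle under strong assumptions (error-free reads, exact suffix-prefix matches, a single strand); it is not Canu's implementation, and it omits the correction, trimming and consensus steps. The function names and the example reads are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three error-free toy reads sampled from the sequence ATGCGTACGTTAGC.
toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(toy_reads))
```

A greedy merger like this one joins reads wherever the sequence matches, so two copies of a repeat look like one overlap. That is a simple way to see why repeated sequences tend to be collapsed.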
Canu is computationally very slow. hifiasm is a fast, haplotype-resolved de novo assembler developed in recent years for PacBio HiFi reads. It runs multi-threaded on a single machine, completes a genome assembly quickly with modest resources, and can also produce haplotype-resolved assemblies of an offspring when parental data are provided. However, its phasing accuracy is slightly lower than that of TrioCanu.
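Using parental data to separate an offspring's reads by haplotype can be sketched with k-mers that occur in only one parent. The example below is a toy illustration under stated assumptions: the parental sequences are short invented strings that differ by a single SNP, whereas real tools derive parent-specific k-mers from the parents' short-read data and handle sequencing errors. The function names, the value of k and the sequences are all hypothetical and do not reproduce any tool's implementation.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


K = 5
# Invented parental haplotypes that differ by one SNP (C versus G).
maternal_seq = "ACGGATTCAGCTTAGC"
paternal_seq = "ACGGATTGAGCTTAGC"

maternal_kmers = kmers(maternal_seq, K)
paternal_kmers = kmers(paternal_seq, K)
maternal_only = maternal_kmers - paternal_kmers
paternal_only = paternal_kmers - maternal_kmers

for read in ("GATTCAGCT", "GATTGAGCT", "AGCTTAGC"):
    print(read, bin_read(read, maternal_only, paternal_only, K))
```

Reads that do not cover a site where the parents differ stay unassigned, which is one reason this kind of binning works best for highly heterozygous genomes.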
Both the accuracy of the result and the computational cost need to be considered in assembly. Many other de novo assemblers have been developed, including wtdbg2, Flye, Peregrine and Shasta. They are relatively fast, but their assemblies may be less accurate. Every genome assembly method and tool has advantages and disadvantages. In practice, the choice of assembler should take into account the species, the sequencing strategy and the goals of the assembly.
For large genomes, even long reads cannot span a whole chromosome, and additional linkage information is needed to order and orient the assembled contigs so that the assembly can be raised to the scaffold level. Bionano optical mapping is a single-molecule DNA technology. It generates genome-wide optical maps from labeled DNA, which are then combined with the initial contigs to order and orient them and produce longer scaffolds. Bionano maps can also be used for SV and methylation analysis.
Another technique for ordering and orienting contigs is Hi-C, which is based on chromosome conformation capture (3C). Hi-C first fixes the spatial conformation of chromosomes with formaldehyde, then digests the DNA with a restriction endonuclease and ligates DNA fragments that are spatially adjacent. This spatial information about the genome is used to assign contigs and scaffolds to chromosomes. Hi-C is currently the only way to achieve chromosome-level scaffolds for large genomes, but it is often less conservative than Bionano scaffolding. Unpredictable chromatin folding brings distant chromosomal regions into contact, which can cause assembly errors such as artificial inversions, misplaced scaffolds within a chromosome, or scaffolds assigned to the wrong chromosome. Combining different technologies helps to correct these errors and can even yield telomere-to-telomere assemblies of whole chromosomes.
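How contact information can be turned into a contig order can be sketched with a greedy path builder: contigs that lie next to each other on a chromosome tend to share more Hi-C contacts than distant ones. The code below is a minimal illustration under that assumption only. It orders contigs but does not orient them, the contact counts are invented, and the function name is hypothetical; real Hi-C scaffolders are considerably more elaborate.

```python
def order_contigs(contacts):
    """Greedily order contigs from pairwise contact counts.

    `contacts` maps frozenset({a, b}) to the number of Hi-C contacts
    observed between contigs a and b.
    """
    def count(a, b):
        return contacts.get(frozenset((a, b)), 0)

    contigs = sorted({name for pair in contacts for name in pair})
    # Seed the path with the pair of contigs sharing the most contacts.
    path = sorted(max(contacts, key=contacts.get))
    remaining = [c for c in contigs if c not in path]
    while remaining:
        # Best candidate for each end of the current path.
        left = max(remaining, key=lambda c: count(path[0], c))
        right = max(remaining, key=lambda c: count(path[-1], c))
        if count(path[0], left) > count(path[-1], right):
            path.insert(0, left)
            remaining.remove(left)
        else:
            path.append(right)
            remaining.remove(right)
    return path


# Invented contact counts for four contigs whose true order is A-B-C-D.
toy_contacts = {
    frozenset(("A", "B")): 90,
    frozenset(("B", "C")): 80,
    frozenset(("C", "D")): 85,
    frozenset(("A", "C")): 30,
    frozenset(("B", "D")): 25,
    frozenset(("A", "D")): 5,
}
print(order_contigs(toy_contacts))
```

A procedure like this trusts every strong contact, so an unusually strong long-range contact caused by chromatin folding would lead it to join the wrong contigs. This is the kind of error the text describes.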
Approaches to genome assembly have been continually innovated and optimized. Assembly quality has improved through steady refinement of existing technology and the introduction of new DNA sequencing methods and bioinformatics tools. The high throughput introduced by NGS and the higher-quality sequences provided by TGS have finally made complex genomes accessible to whole-genome research. Human genetics research, including population genomics, mapping and diagnosis of genetic diseases, personalized medicine, cancer research and prenatal testing, has benefited from the progress in genome sequencing and assembly over the past decade. Likewise, these methods are increasingly applied to non-model organisms to understand ecological and evolutionary processes. Reference genome sequencing and assembly efforts have expanded from single-species projects to coordinated multi-species initiatives, and projects that aim to produce high-quality genomes for most organisms by combining NGS and TGS methods are currently under way.
Long walk to genomics: History and current approaches to genome sequencing and assembly. Computational and Structural Biotechnology Journal. 2019 Nov 17;18:9-19. doi: 10.1016/j.csbj.2019.11.002. PMID: 31890139; PMCID: PMC6926122.