Review Article, Cbrt Vol: 10 Issue: 7
Advances in Simple Sequencing, NGS and WGS: from Cancer to Oleaginous Yeast and to COVID-19
Tania Iouk*
Department of Biology, Concordia University, Montreal, Canada
*Corresponding Author: Dr. Tania Iouk
Department of Biology, Concordia University
Science Pavilion, room SP 532.01, 7141 rue Sherbrooke, H4B 1R6,
Montreal, Canada
E-mail: tania.iouk@concordia.ca
Received: December 15, 2021 Accepted: December 20, 2021 Published: December 25, 2021
Citation: Iouk T (2021) Advances in Simple Sequencing, NGS and WGS: from Cancer to Oleaginous Yeast and to COVID-19. Cell Biol (Henderson, NV) 10:7.
Abstract
Objective: Until recently, whole-genome sequencing (WGS) results for new strains and organisms were not free of errors, and individual coding sequences from each contig had to be revisited and re-deposited based on the mRNA sequence.
Developed to detect somatic mutations in cancer patients, the next-generation sequencing (NGS) approach substantially decreased the error rate and was then applied to the sequencing of new or frequently mutating genomes, including the SARS-CoV-2 genome. The sequence assembly of a novel strain of the oleaginous yeast Yarrowia lipolytica also showed the usefulness of NGS in new-strain sequencing. Previously, oleic-acid-utilizing yeast led to the discovery of novel transcription factors and also proved useful in testing chemical libraries.
Conclusions: The advances in genome sequencing including NGS and emerging long-read sequencing (LRS) technologies are constantly improving the characterization of genetic variants. Some alternative bioinformatics tools are described in this review.
Keywords: DNA sequencing, NGS, cancer, oleaginous yeast, SARS-CoV-2, genome annotation
Introduction
While analyzing the new sequences of COVID-19 coronavirus isolates, we often look back at the previously established error profiles in reporting sequencing data, including the cancer-related deep NGS (next-generation sequencing) results [1-3].
In a way, the coronavirus mutations are similar to the cancer-related somatic mutations, as they develop in the patient's organism, sometimes against the background of certain other abnormalities or due to aging [4-5]. NGS has increasingly been suggested as the method of choice for the analysis of the coronavirus genome [6].
Revealed by biopsy, including the blood biopsy that analyzes circulating tumor DNA, the NGS-elucidated somatic cancer mutations include the p53 tumor-suppressor gene mutations in patients with Li-Fraumeni syndrome, the somatic mutations in the NRAS/KRAS GTPase genes that cause a relapse of leukemias, and the somatic mutations in the DNMT3A, ASXL1, and TET2 genes that increase the likelihood of blood cancer in aging patients (65 years and older) [3, 7-8].
The detected somatic cancer mutations often exhibit mosaicism (or clonal hematopoiesis, in the case of blood cancer), in which only a small percentage of otherwise identical reads (3-20% in the case of p53, for example) contains new mutations [7]. In the case of SARS-CoV-2, the mosaicism could be linked to the presence of two or more virus variants in the same organism [9].
The somatic cancer mutations are considered low-frequency ones, as the coronavirus mutations initially were [4-5]. Indeed, early SARS-CoV-2 isolates were treated as the products of low-frequency mutagenesis until the appearance of the Delta variant with its 7 hot-spot mutations and then the Omicron variant (a.k.a. B.1.1.529) with 30 mutations in the spike-protein RNA [10-11]. In the case of Omicron, the exceptionally high mutation rate was attributed to the presence of HIV in an immunosuppressed patient.
In an experiment, a high rate of mutagenesis is usually achieved by decreasing (10-fold) the concentration of one of the nucleotides in the PCR reaction, usually dATP, and also by using a non-high-fidelity polymerase such as Taq [12]. However, a shortage of ATP in organisms is not known to increase the incidence of somatic mutations. Instead, it causes energy-related dysfunctions and increased blood urate levels (after AMP deaminase degrades AMP to IMP) [13]. IMP is then transformed into inosine and subsequently degraded in the hypoxanthine -> xanthine -> urate chain of biochemical reactions. Treatment with xanthine oxidoreductase inhibitors and inosine is known to increase the pool of salvageable purines and to restore ATP levels in humans [13]. Remarkably, however, one publication suggests that SARS-CoV-2 infection causes xanthine oxidoreductase inhibition and that it could therefore cause a localized depletion of ATP, thereby increasing the likelihood of random mutagenesis during the propagation of viral RNA [14].
COVID-19 studies
Whole-genome sequencing (WGS) was originally applied to study SARS-CoV-2. The mutation rate was initially estimated at 33 genomic mutations/year. The analysis of numerous (>18,000) SARS-CoV-2 sequences sampled between December 2019 and fall 2020 suggested that mutations across the genomes were due to neutral evolution and not adaptive selection [15]. Lineages were assigned based on the mutations present in the viral genome (Omicron, for example, is B.1.1.529) [10, 11].
Recently, NGS was proposed as the method of choice to identify and track the emergence and prevalence of novel strains of SARS-CoV-2, and a low- to mid-throughput NGS assay was developed specifically to combat the COVID-19 pandemic [6, 16-17]. A commercial kit was developed that includes 98 amplicons, so that a SARS-CoV-2 consensus could be reported with confidence if 90 or more amplicons were detected. The sequencing procedures are being standardized according to the ARTIC (Advancing Real-Time Infection Control Network) guidelines, which were also developed specifically to fight COVID-19 [18].
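The reporting rule described above (98 amplicons in the panel, consensus reported when 90 or more are detected) can be sketched as a simple check. This is only an illustration: the function name, the integer amplicon identifiers, and the example dropouts are assumptions, not part of the kit's software.

```python
def consensus_reportable(detected_amplicons, panel_size=98, min_detected=90):
    """Return True when enough distinct panel amplicons were detected.

    Amplicons are assumed (hypothetically) to be numbered 1..panel_size.
    """
    detected = {a for a in detected_amplicons if 1 <= a <= panel_size}
    return len(detected) >= min_detected


# Hypothetical run in which five amplicons dropped out
dropouts = {7, 23, 41, 64, 88}
detected = [a for a in range(1, 99) if a not in dropouts]
print(len(detected), consensus_reportable(detected))
```

With 93 of 98 amplicons detected, the hypothetical sample passes the threshold; a sample with 89 or fewer would not.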
NGS-related breakthroughs in the eukaryotic genome sequencing and annotation
The genome sequencing data available at NCBI and other databases are not free from errors [3]. The deposited contig-based gene sequences from lower eukaryotes may carry a 2-3% sequence error, and the genome assemblies are not devoid of errors either, as they may contain structural defects. Correcting programs were developed to analyze reads that do not map properly to the assembly (e.g. when there is a difference between the read size and insert size or when "soft-clipped" reads are present, i.e. when one end of the read is mapped to the reference while the second is not). One program, known as NucBreak, analyzes the alignments of reads that are properly mapped to an assembly, and it is designed to elucidate the alternative read alignments, which could actually be the correct ones [19]. Several tools aimed at assessing genome assembly accuracy exist (e.g. REAPR, FRCbam, Pilon) that detect structural errors, including medium to long insertions and deletions, inversions, duplications, and inter- and intra-chromosomal rearrangements [19-20].
Analyzing the whole genome using NGS with Illumina systems provides high throughput and is also cost-effective. The sequencing data are characterized by a relatively short read length (100–300 bp) and high accuracy; however, they may include sequencing errors towards the 3'-end of the reads, and they do not necessarily provide a uniform distribution of reads across the genome [21-22]. Still, despite the short read length, Illumina data are often used for de novo genome assembly, sometimes complemented by data generated on other platforms, such as PacBio, which generate longer "bridging" reads [21, 23].
All input reads are used to generate a de Bruijn graph which establishes the overlap between individual reads, ultimately creating a path through a contig.
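A toy sketch of this idea follows, assuming error-free reads and no repeats; production assemblers add error correction, coverage handling, and graph simplification that are not shown here. The function names and the example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Follow unambiguous edges from `start`.

    The walk stops at a branch, a dead end, or a node already visited.
    """
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break
        node = successors.pop()
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping hypothetical reads
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
print(walk_contig(graph, "ATG"))  # ATGGCGTGCAAT
```

The overlaps between reads are never computed pairwise; they emerge because overlapping reads share k-mers and therefore share edges in the graph.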
A good example of such a combined Illumina/PacBio NGS approach to whole-genome sequencing is the sequencing of the industrially relevant strain W29/CLIB89 of Yarrowia lipolytica, which utilizes oleic acid (OA, C18) as a sole carbon source. Genome characterization of other strains was reported previously, as OA-utilizing yeast cells had been extensively studied in the past. These cells and their organelles responsible for OA metabolism became unique experimental models in which to identify novel transcription factors, test chemical libraries, and sharpen genome sequencing and annotation skills (Figure 1) [23-25].
Figure 1: (A) Originally used to characterize somatic cancer mutations, NGS was subsequently applied to whole-genome sequencing (WGS) and more recently to identifying COVID-19 variants. (B) The genome of the new Yarrowia lipolytica strain was sequenced in recent years using a combination of NGS and PacBio long-read sequencing [23]. Previously, oleic-acid consumption and the mechanisms involved in it were used to identify new transcription factors and to chemically immortalize yeast cells.
In the project mentioned above, each Y. lipolytica chromosome was represented by a single contig. The Illumina system produced ≈14×10⁹ very short (≈100 bp) reads, and the third-generation PacBio sequencing reads (3000-5000 bp long), along with the optical Irys genome mapping system, were used to evaluate the integrity of the assembly, to estimate the extent of unassembled sequence (in telomeric regions), and to localize rDNA repeats [23].
The Irys system from BioNano Genomics Inc. is designed to stretch long chromosome segments inside an array of nanochannels for genetic analysis. It should be noted that the acquisition of stretched and labeled DNA images was enabled years ago, before the appearance of NGS [26-27].
The non-optical genome mapping systems
The already mentioned NucBreak program enables non-optical mapping of data, and it does not require long PacBio reads. Together with the BreakDancer, Lumpy, and Wham tools, it belongs to the computer programs capable of detecting structural variants that cannot be seen through conventional NGS data compilation [19].
The NucBreak tool was created to detect structural errors in the assembly by using paired-end Illumina reads. The reads are first mapped to the assembly, and then the mapping results are analyzed to detect the locations of assembly errors. The error detection process starts with mapping the reads to the assembly by using Bowtie2, a gapped-read alignment tool.
Bowtie2 is run separately for each read file to report all local alignments with an added nucleotide match bonus. The resulting files contain all possible alignments for each read, and they do not depend on the second read in a pair. Notably, a read alignment may contain either a full read sequence or a read sequence clipped on one or both ends [19].
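The distinction between full-length and clipped alignments can be read from the CIGAR string of each SAM record that an aligner such as Bowtie2 writes, where `S` marks soft-clipped bases. The sketch below only illustrates that classification; it is not NucBreak's code, and the function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipping(cigar):
    """Return (left, right) soft-clipped lengths of an alignment."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may flank soft clips, so they are skipped first
    core = [(n, op) for n, op in ops if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def classify(cigar):
    """Label an alignment by how many of its ends are soft-clipped."""
    left, right = clipping(cigar)
    if left and right:
        return "clipped at both ends"
    if left or right:
        return "clipped at one end"
    return "full-length alignment"


for cigar in ["100M", "20S80M", "15S70M15S"]:
    print(cigar, "->", classify(cigar))
```

Positions where many reads switch from aligned to clipped bases are the kind of signal that structural-error detectors examine more closely.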
Back to the new Y. lipolytica-strain sequencing
The genome annotation pipeline included the previous Y. lipolytica annotations from NCBI/RefSeq, yeast YGAP [28], and SnowyOwl fungal HMM (hidden Markov model) predictions [29]. The strongest overlap in terms of the CDS prediction and identification was reported between the sequencing data and RefSeq, and the weakest between the sequencing data and SnowyOwl [29].
SnowyOwl uses RNA-Seq data to provide hints for the generation of hidden Markov model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed based on three manually curated gene models in fungal genomes, including, among others, Aspergillus niger (ATCC 1015).
A. niger is an industrially relevant producer of organic metabolites and enzymes used in numerous biotechnology applications. It secretes numerous glycoside hydrolases capable of hydrolyzing cellulose, hemicellulose, or pectin to sugars, and it is a well-known model for studying the signal-peptide-containing proteins that travel through the ER (endoplasmic reticulum) to become secreted [30]. It is also used as a host in which to express secreted enzymes from other phylogenetically diverse and frequently pathogenic fungal species [31]. However, the A. niger genomic DNA, with its 53.3% GC content, is characterized by reduced usage of the subset of synonymous A- and T-ending codons; therefore, the expression of A,T-rich sequences requires numerous codon substitutions [32]. Many GC-rich fungal cDNAs mimic the average codon usage in A. niger; however, the >60% GC content of those organisms could interfere with protein biosynthesis due to the limited availability of some tRNAs, and strong GC enrichment also interferes with the diversification of synonymous codon usage [33].
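Both quantities discussed above, the overall GC content and the share of A- or T-ending codons, are simple to compute for a candidate coding sequence before expression in a GC-rich host. The following is a minimal sketch; the function names and the short example sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


def at_ending_codon_fraction(cds):
    """Fraction of complete codons whose third position is A or T."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(codon[2] in "AT" for codon in codons) / len(codons)


# Hypothetical six-codon reading frame: ATG GCC AAA TTT GGG TAA
cds = "ATGGCCAAATTTGGGTAA"
print(round(gc_content(cds), 3), at_ending_codon_fraction(cds))
```

A sequence with a high fraction of A- or T-ending codons would be a candidate for synonymous codon substitution before expression in A. niger.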
DNA sequencing and alternative genome browsers
The SnowyOwl label used to designate a fungal HMM prediction pipeline continues a 15-year-old tradition of naming biology-related data-engineering platforms, genome browsers, and programs after birds, with the best-known examples being the MAGPIE data-engineering platform, the Bluejay genome browser, and the Seahawk program designed to efficiently load and retrieve data-containing Web pages and text files [34-36]. All modern operating systems perform these operations almost instantly today, but the early operating systems and browsers did not always satisfy biologists, thus prompting the creation of additional browsing tools. Bluejay, for example, is a genome viewer that integrates genome annotation with gene expression information and comparative analysis, potentially with several other genomes in the same view. Bluejay was remarkable for its versatile and detailed displays, and it also offered circular genome viewing, the prototype of CIRCOS, as well as visualization of read alignment and WGS error profiling [36].
The NGS error profiling and its computational suppression
The substitution error rate by conventional NGS is >0.1%, which is higher than the FDA-authorized detection limit of 0.02 mutant allele fraction (MAF) for hotspot mutations, and 0.05 for non-hotspot mutations at a read depth of 500–1000x [3].
Errors could be introduced at various steps of the NGS procedure, including sample handling, library preparation, enrichment PCR, and sequencing itself.
The C>A/G>T errors were reported to result from DNA damage during sample processing [37, 38]. Spontaneous deamination of cytosine to uracil (or of methylated cytosine to thymine) can cause C>T/G>A errors. Additional errors, as already mentioned, can also be introduced by the target-enrichment PCR and the sequencing step [2].
Close evaluation of the read-specific error distribution suggests the possibility of computational suppression, by which the substitution error rate can be reduced 10- to 100-fold. The error rates differ by nucleotide substitution type, ranging from 10⁻⁵ (for A>C/T>G, C>A/G>T, and C>G/G>C) to 10⁻⁴ (for A>G/T>C). The C>T/G>A errors exhibit strong sequence context dependency, while the C>A/G>T errors are dominated by sample-specific effects. The target-enrichment PCR leads to a ~6-fold increase in the overall error rate, and more than 70% of hotspot variants can be detected at 0.01–0.1% frequency with the current NGS technology by applying in silico error suppression [3].
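Per-type error rates of this kind are usually reported with strand-complementary substitutions pooled (for example C>T together with G>A). The sketch below shows one way such a tally could be made. It assumes, hypothetically, that mismatch counts and the number of sequenced reference bases are already available, and the counts in the example are invented for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Pool a substitution with its reverse-strand equivalent, e.g. 'C>T/G>A'."""
    if ref in "AC":
        first, second = (ref, alt), (COMPLEMENT[ref], COMPLEMENT[alt])
    else:
        first, second = (COMPLEMENT[ref], COMPLEMENT[alt]), (ref, alt)
    return f"{first[0]}>{first[1]}/{second[0]}>{second[1]}"


def error_rates(mismatches, ref_bases):
    """Error rate per pooled class.

    mismatches: {(ref, alt): mismatch count}
    ref_bases:  {base: number of sequenced reference bases of that type}
    """
    errors = Counter()
    totals = {}
    for (ref, alt), count in mismatches.items():
        cls = substitution_class(ref, alt)
        errors[cls] += count
        totals[cls] = ref_bases[ref] + ref_bases[COMPLEMENT[ref]]
    return {cls: errors[cls] / totals[cls] for cls in errors}


# Invented counts, for illustration only
ref_bases = {base: 5_000_000 for base in "ACGT"}
mismatches = {("C", "T"): 300, ("G", "A"): 200, ("A", "C"): 60, ("T", "G"): 40}
for cls, rate in sorted(error_rates(mismatches, ref_bases).items()):
    print(cls, f"{rate:.1e}")
```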
To determine the lowest frequency at which a true somatic mutation can be distinguished from a sequencing error, and to determine site-specific sequencing error rates, a dilution experiment should be performed using cancer and normal cells from the same patient. In such an experiment, biopsy material from a tumor is used together with healthy cells such as lymphocytes, which are easy to isolate from blood by density centrifugation. The dilution is then performed by spiking 0.1% and 0.02% of cancer genomic DNA into normal DNA [3]. The already known somatic substitution mutations are targeted by amplicon sequencing, using short amplicon reads.
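The mutant allele fraction expected at each dilution follows from simple arithmetic. The sketch below assumes that the spike-in fraction equals the fraction of tumor genomes in the mixture and that the copy numbers at the site are known; the function name and the default copy numbers are illustrative assumptions.

```python
def expected_maf(spike_fraction, mutant_copies=1, tumor_copies=2, normal_copies=2):
    """Expected mutant allele fraction at one site after dilution.

    spike_fraction: fraction of tumor genomes in the mixture (assumed)
    mutant_copies:  mutant alleles per tumor genome at the site
    tumor_copies:   total alleles per tumor genome at the site
    normal_copies:  total alleles per normal genome at the site
    """
    tumor_alleles = spike_fraction * tumor_copies
    normal_alleles = (1 - spike_fraction) * normal_copies
    return spike_fraction * mutant_copies / (tumor_alleles + normal_alleles)


# Heterozygous mutation in a diploid tumor at the two dilutions from the text
for spike in (0.001, 0.0002):
    print(spike, expected_maf(spike))
```

Under these assumptions, a heterozygous mutation is expected at half the spike-in fraction (0.0005 and 0.0001), so a site must have an error rate well below these values for the mutation to be distinguishable.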
In high-quality reads, there should be no false-positive detections in the indicated dilution datasets, i.e. the detected mutant allele fractions (MAF) should not exceed the already known number of single-nucleotide variants (SNV) [39]. The exceptions include loss of heterozygosity (LOH), when a chromosome in the cancer line has 1, 2, or 4 mutant alleles. In such a case, one SNV may produce several MAF numbers.
The error rate in sequencing is estimated as the ratio of reads with a modified nucleotide to the total number of reads. Low-quality reads are those with an error rate of ≥1%. They usually contain low-quality bases (a base quality score ≤20), sometimes due to adapter contamination or low mapping quality. Identified in HiSeq data, these reads are then either trimmed by 5 bp at both ends to remove adapter contamination or discarded altogether. Ideally, reads with a base quality score ≥30 and an estimated error rate of less than 0.1% should be used. In silico error suppression methods have been developed to identify and filter the low-quality reads (LQReads). Among other criteria, these methods account for the concordance between forward and reverse readouts, so that discordant readouts are not counted [3, 21-22].
The mutation-identifying algorithms are designed to screen the aligned reads, and they also suppress the error rates resulting from nucleotide substitution; consequently, MAF < 0.002 cannot be distinguished from sequencing errors. On the other hand, "forced calling" of hotspot mutations without considering error may result in false positives. For example, the hotspot mutation BRAF K601E is a T>C change that was detected in >100 tumors [39]. This site was also shown to have an allele fraction of ~0.0003 in the melanoma cancer cell dilution experiment mentioned above, and it therefore appeared to be a mutation also present in that particular cancer. That, however, was not confirmed by an undiluted cancer experiment, in which it had an exceptionally low MAF of 0.00002 [3].
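The quality thresholds quoted above map directly onto the Phred scale, on which a score Q corresponds to an error probability of 10^(-Q/10), so that Q20 is a 1% error and Q30 is a 0.1% error. The following sketch of read trimming and filtering uses that relation. The function names and the exact filtering rule are assumptions made for illustration, not the method of the cited studies.

```python
def phred_to_error(q):
    """Error probability for a Phred quality score: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_ends(seq, quals, n=5):
    """Remove n bases from both ends; reads that are too short become empty."""
    if len(seq) <= 2 * n:
        return "", []
    return seq[n:len(seq) - n], quals[n:len(quals) - n]


def expected_error_rate(quals):
    """Mean per-base error probability implied by the quality scores."""
    return sum(phred_to_error(q) for q in quals) / len(quals)


def keep_read(seq, quals, trim=5, min_q=30, max_error=0.001):
    """Keep a trimmed read only if every base has Q >= min_q and the
    expected error rate stays below max_error."""
    seq, quals = trim_ends(seq, quals, trim)
    if not quals:
        return False
    return min(quals) >= min_q and expected_error_rate(quals) < max_error


read = "ACGT" * 5
print(keep_read(read, [35] * 20))                          # good read
print(keep_read(read, [10] * 5 + [35] * 10 + [10] * 5))    # rescued by trimming
print(keep_read(read, [35] * 10 + [15] + [35] * 9))        # low-quality base inside
```

The second read shows why trimming precedes filtering: its low-quality bases lie only in the 5 bp flanks, so the trimmed read passes, whereas the third read has a Q15 base in its interior and is discarded.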
As already mentioned, a C>T/G>A change has the highest error rate [2]. However, C>T/G>A mutations are also the most common mutation type in cancers [39]. In such cases, a signature analysis is performed that is similar to a signature analysis in disease. It was shown, for example, that the C>T errors exhibit a strong context dependency, with elevated error rates for G(C>T)N or N(C>T)G and the highest error rate in the G(C>T)G consensus [3]. The same pattern was observed for G>A errors in the reverse complement. Other substitution types do not exhibit sequence context dependency as strong as that of C>T/G>A.
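Tabulating such context dependency amounts to looking up the bases flanking each mismatch and folding G>A changes onto the opposite strand. The sketch below is a minimal illustration; the function names, the short reference string, and the mismatch list are all hypothetical.

```python
from collections import Counter

COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


def ct_context(reference, pos, alt):
    """Context such as 'G(C>T)G' for a C>T or G>A mismatch at 0-based pos.

    G>A changes are reported on the opposite strand as C>T.
    Returns None for other substitutions or positions without both flanks.
    """
    if pos < 1 or pos > len(reference) - 2:
        return None
    tri = reference[pos - 1:pos + 2]
    ref = tri[1]
    if (ref, alt) == ("G", "A"):
        tri = reverse_complement(tri)
    elif (ref, alt) != ("C", "T"):
        return None
    return f"{tri[0]}(C>T){tri[2]}"


# Hypothetical reference and mismatches given as (position, alternative base)
reference = "AGCGTTCGCA"
observed = [(2, "T"), (7, "A"), (6, "T"), (4, "C")]
contexts = Counter(
    context
    for context in (ct_context(reference, pos, alt) for pos, alt in observed)
    if context is not None
)
print(contexts)
```

In this toy example the C>T change at position 2 and the G>A change at position 7 fall into the same G(C>T)G class, which is how errors from the two strands are pooled before context-specific rates are compared.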
The semi-automated analysis of small-scale DNA sequencing
Routine DNA sequencing aimed at verifying a molecular-cloning or lab-performed PCR result has an error margin comparable to that of NGS (Figure 2). We used mRNA of chemically challenged yeast to synthesize cDNA, and then PCR-amplified distinct ORFs and sequenced them in search of mutations. At the same time, we analyzed the sequencing errors. Our best sequencing results contained anywhere from 0.1% to 1.1% error; however, some sequenced ORFs had up to 16% errors. In the good-quality sequencing results, G>A/C>T changes were more frequent than other changes (p<0.05) (Figure 2) [40-41]. Some poor-quality sequencing results, which nevertheless still enabled peptide identification, contained: (i) up to 34% of nucleotide substitutions, (ii) 0.02% of inserted triplet codons, and (iii) up to 13% of sequence loops, suggesting an amplification error rather than a sequencing error. The sequencing results with a 40% nucleotide change error were linked to poor DNA quality and were discarded.
Conclusions and perspectives
It has frequently been suggested that NGS short reads pose a limitation for the identification of structural variants, the sequencing of repetitive regions, the phasing of alleles, and the distinguishing of highly homologous genomic regions. These limitations may contribute to the diagnostic gap in patients with genetic disorders. The emerging long-read sequencing (LRS) technologies may improve the characterization of genetic variants. LRS has primarily been used to investigate genetic disorders with previously known disease loci, and future studies will determine whether LRS technologies can be used for routine WGS to trigger further advancements in medical genetics.
Meanwhile, NGS remains the first choice in WGS and in cancer and COVID-19 sequencing. The already dated shotgun WGS, by contrast, often introduces not only assembly errors but also sequence errors, including ≤2% errors in deposited protein sequences. Because of this, the individual cDNA sequences or mRNAs must be reexamined and re-deposited, with an improved error margin of <0.3%.
The sequencing of the novel strain of Y. lipolytica outlined in this review and the other projects mentioned suggest that genome annotation will soon become completely automated and computerized, and that it will also be error-free.
Moreover, not only has NGS been optimized further to avoid errors, but the methodology itself can now be used in a small laboratory setting, without requiring a giant hospital lab or an over-equipped university research center.