Sponsored content brought to you by
As sequencing costs decrease, the volume of whole genome sequencing (WGS) and whole exome sequencing (WES) continues to rise. But sequencing is just the first step. Delivering the best results requires analyzing sequencing data with accelerated compute, data science, and AI to read and understand the genome, from base calls to variant interpretation. The challenge is substantial.
Human genomes are complex. According to the National Human Genome Research Institute, the current understanding is that, compared with a reference human genome, an individual's ~3B-nucleotide genome sequence will on average have ~4M SNVs, ~600K insertion/deletion variants, and ~25K structural variants that involve more than 20M nucleotides.1 As of now, the clinical impact of most of these variants is unknown. Can genomic AI help us identify the handful of clinically significant genetic variants in this vast ocean of data?
AI methods excel when large amounts of structured data can be paired with validated outcomes for training. Recent population-level sequencing efforts, as well as validation data sets like NIST Genome in a Bottle, have spurred a new category of AI: genomic AI. Genomic AI has the potential to dramatically reduce the time it takes to analyze, decipher, and interpret sequencing data, but only if the data is carefully assembled across the full breadth of the challenge, from alignment to interpretation.
DNA sequencing has substantial promise to guide healthcare and treatment if the needed tools become more accurate, easier to use, and more cost-effective. Illumina believes that genomic AI is an emerging tool, complementary to traditional analysis methods and known biology, that can further advance accuracy and provide a fully featured genome, including annotation and interpretation. To achieve this, the company is using its access to large data sets and world-class AI talent to integrate genomic AI into Illumina's software products.
Three examples illustrate the utility of this advanced technology: variant calling, annotation and prioritization, and interpretation.
The upstream DRAGEN secondary analysis pipeline improves variant calling accuracy over a larger portion of the human genome, while ensuring that these improvements are generalizable to a wide and diverse population of samples. Hardware-accelerated DRAGEN analysis won the 2020 Precision FDA germline accuracy competition in the Difficult-to-Map regions and All-Benchmark-Regions categories.2
Building on that success, Illumina added powerful and efficient machine learning (ML) algorithms that drive significant performance improvements.
DRAGEN-ML integrates closely with the existing Bayesian Variant Calling pipeline, driving germline accuracy to new heights and addressing challenges in the most difficult genomic regions. Sophisticated and efficient machine learning enables improvement in sensitivity and genotyping accuracy, recovering low-confidence false negative calls and filtering over 50% of false positive calls. "Access to deep internal data and numerous collaborations have allowed us to model how Illumina sequencing reads map to a genomic reference," says Rami Mehio, Head of Software and Informatics, Illumina. "Machine learning has been critical to how our engineers and their algorithms continually improve mapping sensitivity in DRAGEN."
The latest DRAGEN release, DRAGEN v4.2, adds enhanced machine learning trained on a vast amount of data. It detects variants with an analytical accuracy of 99.84%, reducing both false positive and false negative rates.* This extends Illumina's lead in providing the most accurate secondary analysis in all benchmark regions compared to other solutions using PrecisionFDA v2 Truth Challenge benchmark data.3
Delivering a comprehensive platform for genomic analysis, the team continues to invest more in machine learning algorithms for use in RNA analysis, somatic pipelines, methylation analysis and large variant calling for release in future versions of the DRAGEN platform.
Out of the tens of millions of protein-coding variants in the human genome, only 0.1% are presently annotated in clinical variant databases, while the vast majority remain variants of unknown significance (VUS).
To address this challenge, Illumina scientists have developed PrimateAI-3D, a three-dimensional convolutional neural network for variant effect prediction, trained using primate variants and 3-D protein structure. PrimateAI-3D leverages the premise that common variants from non-human primates are unlikely to cause human disease, and has been validated to identify disease-causing variants with superior accuracy across six clinical benchmarks based on real-world patient cohorts.
Published in Science, the PrimateAI-3D project helped drive a massive international collaborative effort to sequence 809 individuals from 233 primate species and create a catalog of common missense variants. Importantly, the species selected for sequencing represent close to half of Earth's 521 extant primate species and cover all major primate families.4 These WGS data were used to train PrimateAI-3D with millions of primate variants.
In a related Science publication, PrimateAI-3D was used to estimate the pathogenicity of rare coding variants in over 450K UK Biobank individuals in order to improve rare-variant association tests and genetic risk prediction for common diseases and complex traits. Stratification of the missense variants using PrimateAI-3D enabled discovery of 73% more significant gene-phenotype associations in rare variant burden tests, outperforming other existing variant interpretation algorithms.5
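The idea of stratifying variants before a burden test can be illustrated with a minimal sketch. It is not the method used in the publication: the `Variant` record, the `burden_by_gene` helper, the gene names, and the allele-frequency and score thresholds are all hypothetical and chosen only for illustration. The sketch counts, per gene, the rare variants an individual carries whose pathogenicity score clears a cutoff; those counts are what an association test would then relate to a phenotype.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one missense variant carried by an individual."""
    gene: str
    allele_frequency: float  # population allele frequency, 0..1
    score: float             # assumed pathogenicity score, 0..1, higher = more damaging


def burden_by_gene(
    variants: Iterable[Variant],
    max_af: float = 0.001,   # illustrative rarity cutoff, not taken from the article
    min_score: float = 0.8,  # illustrative score cutoff, not taken from the article
) -> Dict[str, int]:
    """Count rare, high-scoring variants per gene for one individual."""
    counts: Dict[str, int] = {}
    for v in variants:
        if v.allele_frequency <= max_af and v.score >= min_score:
            counts[v.gene] = counts.get(v.gene, 0) + 1
    return counts


# Toy data for a single individual (gene names are placeholders).
carried = [
    Variant("GENE1", 0.0002, 0.93),  # rare and high-scoring: counted
    Variant("GENE1", 0.0004, 0.35),  # rare but low-scoring: excluded by stratification
    Variant("GENE2", 0.0200, 0.95),  # high-scoring but too common: excluded
    Variant("GENE2", 0.0001, 0.88),  # rare and high-scoring: counted
]

stratified = burden_by_gene(carried)
unstratified = burden_by_gene(carried, min_score=0.0)
```

Setting `min_score=0.0` reproduces an unstratified burden count, which makes it easy to compare how much the score filter changes the per-gene totals fed into the association test.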
PrimateAI-3D also enables rare-variant polygenic risk scores (PRS), which are substantially more portable to different cohorts and ancestry groups not used during model training.5 This outcome is highly relevant because existing PRS algorithms are most often trained on data from individuals of European descent and generalize poorly to individuals from other populations.
The PrimateAI-3D deep learning scores and the primate population variant database, which enables classification of 4.3M missense variants as likely benign, are publicly available to the genomics community for research use, in addition to being made available through Illumina software products.
Complementary to PrimateAI-3D's role for protein-coding variants, Illumina scientists earlier released SpliceAI, a deep learning model for identifying pathogenic variants in the non-coding genome. Currently, clinical exome sequencing for rare disease patients is only able to detect a pathogenic variant in around one third of cases by examining the 1% of the genome that is protein-coding. Improving identification of disease-causing variants in the non-coding genome extends clinical sequencing beyond the exome to the whole genome, marking an important step towards helping patients and their families.6
Explainable AI (XAI), created by and integrated in Emedgene tertiary analysis software, prioritizes the variants that are most likely to solve a case. Emedgene's XAI lets users apply its algorithms while keeping the geneticist in full control. By definition, XAI must be accurate, secure, transparent, and efficient.
For hereditary disease data interpretation applications and assays spanning genomes, exomes, targeted panels, and virtual panels, Emedgene leverages its XAI and full suite of automation capabilities so that users can streamline their end-to-end germline analysis workflows and minimize touchpoints. This variant interpretation research platform for rare genetic diseases, hereditary cancer, other genetic diseases, and large-scale screening projects significantly reduces time per case.
The use of genomic XAI in Emedgene mimics the work performed by a scientist and provides a full causal explanation of the most relevant variants with accompanying linked and curated evidence. Significant time savings of 50-75% are achieved per case. "Emedgene's Explainable AI (XAI) simplifies the highly complex task of variant prioritization, allowing us to handle more tests every day," relates Ray Louie, PhD, Associate Director, Greenwood Genetic Center.
In addition, a study performed by Baylor Genetics showed that, in a 180-sample cohort, Emedgene accurately pinpointed the manually reported variants as candidates to resolve the case. The reported variants were ranked among the top 10 candidate variants in 98.4% of trio cases, in 93.0% of single proband cases, and in 96.7% of all cases. Where the model's accuracy was lower, the cause was incomplete variant calling or an incomplete phenotypic description.7 The study demonstrated that Emedgene can assist genetic laboratories in prioritizing candidate variants effectively, thereby helping to streamline lab operations.
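A top-10 figure of this kind is a top-k rate: the share of cases in which the reported variant appears among the first k ranked candidates. The sketch below shows one way such a rate could be computed. It is an illustration only, not the study's evaluation code; the `top_k_rate` function, the case identifiers, and the variant labels are hypothetical, and the toy data is unrelated to the 180-sample cohort.

```python
from typing import Dict, List


def top_k_rate(
    ranked: Dict[str, List[str]],  # case id -> candidate variants, best first
    reported: Dict[str, str],      # case id -> variant reported by manual review
    k: int = 10,
) -> float:
    """Fraction of cases whose reported variant is within the first k candidates."""
    if not reported:
        raise ValueError("no cases supplied")
    hits = 0
    for case_id, variant in reported.items():
        # A case with no ranked candidates simply counts as a miss.
        candidates = ranked.get(case_id, [])
        if variant in candidates[:k]:
            hits += 1
    return hits / len(reported)


# Toy data: four made-up cases with made-up variant labels.
ranked = {
    "case_a": ["v3", "v1", "v7"],
    "case_b": ["v2", "v9"],
    "case_c": ["v5", "v4", "v8"],
    "case_d": [],  # e.g. the variant was never called, so nothing could be ranked
}
reported = {"case_a": "v1", "case_b": "v9", "case_c": "v6", "case_d": "v2"}

rate_top10 = top_k_rate(ranked, reported, k=10)
rate_top1 = top_k_rate(ranked, reported, k=1)
```

Computing the rate separately for subsets of cases, such as trios and single probands, would give per-group figures of the kind quoted above. A case like `case_d`, where the reported variant is absent from the candidate list altogether, mirrors the incomplete variant calling scenario the study mentions.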
Decades of internal development and multiple population-level collaborations give Illumina access to massive amounts of data to train new genomic AI algorithms. The data, in combination with Illumina's world-class products and talent, can help speed genomic AI on its path towards providing a better genome.
References
* Secondary analysis run times on HG002 Illumina sequencing data from PrecisionFDA Truth Challenge V2 with 34.46X coverage. DRAGEN was run on a DRAGEN v4 server with a U200 FPGA card and Machine Learning enabled. BWA GATK 4.1.4.0 was run on a local 2x Intel Xeon Gold 6126 (48 threads) with 394 GB RAM and 2TB NVME SSD using BCBIO for parallelization.
For Research Use Only. Not for use in diagnostic procedures.
Learn more at illumina.com
Leveraging Genomic AI to Deliver a More Accurate and Comprehensive Genome - Genetic Engineering & Biotechnology News