Opinion

Beyond the Pipeline: How AI Is Rewriting the Rules of NGS Data Analysis

AI is poised to revolutionize how we analyze and interpret next-generation sequencing (NGS) data.

Ron Zhu, May 23, 2026

Foundation models trained on trillions of base pairs, AI tools that detect cancer weeks before imaging, and single-cell atlases that map entire tumors at single-cell resolution: the latest wave of AI in genomics is nothing like the incremental improvements of five years ago. Here's what's actually changing in 2025–2026.

The original pitch for AI in genomics was practical: machines that could align reads faster, call variants with fewer errors, and annotate consequences more accurately. That pitch held for years, and it was valid. Tools like DeepVariant delivered genuine improvements over classical GATK heuristics. But what's happened since 2023 is categorically different. We are not refining existing pipelines — we are replacing entire conceptual frameworks.

Two shifts are driving this. First, the scale of pre-training has crossed a threshold: models trained on billions to trillions of base pairs across all domains of life can now generalize across species and tasks in ways that are genuinely surprising. Second, the integration of multimodal data — combining sequencing read counts with spatial coordinates, protein expression, epigenetic marks, and imaging — has turned isolated analyses into whole-tissue, whole-organism maps. The result is a field that increasingly feels less like bioinformatics and more like computational biology at the scale of the cell atlas.

This article covers where the field actually stands heading into mid-2026: which AI advances are solid, which remain aspirational, and what the honest technical constraints look like.

9T: base pairs in Evo 2's training corpus (Arc Institute, 2026)
~650M: parameters in the largest public Nucleotide Transformer v3 (InstaDeep, 2025)
1 Mb: single-nucleotide context window in the NTv3 U-Net architecture
>1,000: AI/ML algorithms FDA-authorized as of 2025, predominantly radiology, with genomics a growing share (npj Digit. Med., 2025)

01 DNA Foundation Models: From BERT to Evo 2

The most consequential shift in AI-powered genomics is the emergence of large-scale DNA language models — the genomic equivalent of GPT trained on human language, but on nucleotide sequences across all living organisms. In 2023–2024, several architectures competed: DNABERT-2 used byte-pair encoding for multi-species modeling; HyenaDNA replaced attention with implicit convolutions to enable million-token single-nucleotide contexts; Caduceus added reverse-complement equivariance for long-range variant-effect prediction.
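To make the tokenization trade-off concrete, here is a minimal sketch that contrasts single-base tokens with fixed k-mers and shows the reverse-complement operation that strand-aware models build on. The function names are illustrative and not taken from any of the models above; DNABERT-2's byte-pair encoding produces variable-length tokens and is not reproduced here.

```python
# Illustrative helpers only; none of these names come from a published model's codebase.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string containing only A/C/G/T."""
    return seq.translate(COMPLEMENT)[::-1]

def single_base_tokens(seq: str) -> list[str]:
    """One token per nucleotide (single-nucleotide resolution)."""
    return list(seq)

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Non-overlapping k-mers; a trailing remainder shorter than k is kept as-is."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

seq = "ACGTTGCAAGCT"
print(len(single_base_tokens(seq)), "single-base tokens")
print(kmer_tokens(seq, k=6))
print(reverse_complement(seq))

# Token budget needed to cover a 1 Mb window under each scheme.
window_bp = 1_000_000
print("single-base:", window_bp, "tokens")
print("6-mer:", -(-window_bp // 6), "tokens")  # ceiling division
```

The arithmetic at the end is the practical point: single-base tokenization keeps every nucleotide addressable, but a 1 Mb window then costs a million tokens, which is why architectures that avoid quadratic attention matter at this scale.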

The 2025–2026 generation raised the bar sharply. Nucleotide Transformer v3 (NTv3), released by InstaDeep in 2025, uses a U-Net-like architecture with single-base tokenization enabling contexts up to 1 megabase — enough to span entire regulatory landscapes. Pre-trained on 9 trillion base pairs from the OpenGenome2 dataset and post-trained on over 16,000 functional tracks from 24 animal and plant species, NTv3 achieves state-of-the-art accuracy for functional-track prediction and genome annotation across species. It can also be fine-tuned as a controllable generative model for designing enhancer sequences with specified activity levels, validated experimentally via STARR-seq.

Arc Institute's Evo 2, published in Nature in early 2026, is arguably the most ambitious genomic foundation model to date. Trained on 9 trillion base pairs from all domains of life — bacteria, archaea, eukaryotes — it predicts functional properties from genomic sequences and provides a rich generative model for biology. At 7B and 40B parameter scales with sequences up to 1 million base pairs at nucleotide resolution, Evo 2 can generate novel protein-coding genes, regulatory elements, and even entire synthetic genomes with functional properties.

[Figure: timeline graphic of the DNA foundation model landscape, 2021–2026. DNABERT (2021, 110M parameters, k-mer tokenization); HyenaDNA (2023, 1M-token single-nucleotide context, implicit convolution); Nucleotide Transformer (2024, 2.5B parameters, 3,202 genomes, multi-species, 6-mer tokens); Nucleotide Transformer v3 (2025, ~650M public parameters, U-Net with single-base tokens, 1 Mb context, 9T bp OpenGenome2, 16,000 functional tracks); Evo 2 (2026, 7B–40B parameters, StripedHyena, generative, 9T bp across all domains of life, 1 Mb context; Nature, 2026). Figure note: context window and parameter count are proxies for genomic reasoning capability, not the full story.]
Figure 1 — The genomic foundation model timeline. From encoder-only transformer models with ~100M parameters in 2021 to the 7–40B parameter Evo 2 in 2026, trained on 9 trillion base pairs from all domains of life. Context window growth (from ~512 tokens to 1 megabase) has been as important as parameter count for capturing long-range regulatory interactions.

An important caveat: a 2026 benchmark study raised questions about whether purely sequence-trained genomic foundation models actually generalize as well as claimed. Specialized smaller models, trained specifically for a given task, often match or exceed large GFM embeddings while being orders of magnitude faster and cheaper to run. This honest tension — general-purpose vs. task-specific — is one the field is still working through.

"Evo 2 is an AI-based biological foundation model trained on 9 trillion DNA base pairs spanning all domains of life that predicts functional properties from genomic sequences and provides a rich generative model for researchers in biology."

Nature, Arc Institute & collaborators, March 2026

02 Variant Calling in the Clinical Era

Classical variant calling combined probabilistic models (GATK HaplotypeCaller, FreeBayes) with manual filtering rules. DeepVariant's introduction of a CNN-based approach — treating variant calling as an image classification problem over read pileup representations — marked the first major AI transition. But the current generation goes further on two fronts: long-read error correction and clinical deployment at scale.
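The "pileup as image" idea is easier to see in code. The sketch below encodes aligned reads as a read-by-position-by-channel grid, which is the kind of input a CNN can classify. It is a simplified illustration under stated assumptions: the five channels (one-hot base plus a mismatch flag) are hypothetical, reads are assumed to be gap-free, and DeepVariant's actual encoding uses a richer set of channels than shown here.

```python
BASES = "ACGT"

def pileup_tensor(ref: str, reads: list[tuple[int, str]]) -> list[list[list[int]]]:
    """Encode reads as a [read][position][channel] grid.

    Channels 0-3 are a one-hot base; channel 4 flags a mismatch against the reference.
    Each read is (start_offset, sequence) in reference coordinates, with no indels.
    """
    width = len(ref)
    tensor = []
    for start, bases in reads:
        row = [[0] * 5 for _ in range(width)]
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < width and base in BASES:
                row[pos][BASES.index(base)] = 1
                row[pos][4] = int(base != ref[pos])
        tensor.append(row)
    return tensor

ref = "ACGTACGT"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (1, "CGTTCG")]
tensor = pileup_tensor(ref, reads)

# Column-wise mismatch support: a candidate variant shows up as a vertical stripe.
mismatches_per_column = [sum(row[p][4] for row in tensor) for p in range(len(ref))]
print(mismatches_per_column)
```

In this toy pileup, two of the three reads disagree with the reference at the same position, which is the visual pattern an image classifier learns to separate from scattered sequencing errors.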

Modern AI variant callers now handle Oxford Nanopore and PacBio long reads natively, contending with systematic error profiles that broke short-read tools. RNN- and transformer-based base-calling models (Dorado, Bonito) now run directly on the sequencing instrument, improving read accuracy in near real-time and pushing the entire workflow closer to the clinic. A 2025 review in Current Issues in Molecular Biology notes that CNNs, RNNs, and hybrid architectures now consistently outperform traditional methods across variant calling, epigenomic profiling, and single-cell sequencing.

The integration pressure is also institutional. Over 1,000 AI algorithms and devices had been authorized by the FDA as of 2025 — with genomic AI representing a small but rapidly growing share (the majority are radiology devices). This regulatory momentum is driving demand for interpretable AI — models whose variant prioritizations can be audited by clinicians — and for federated learning approaches that can train on distributed patient data without centralizing sensitive records.

03 Liquid Biopsy + AI: Detecting Cancer from Blood

Perhaps nowhere is the clinical translation of AI + NGS more tangible than in liquid biopsy. Circulating tumor DNA (ctDNA) offers a non-invasive window into tumor genomics, but the challenge is signal: tumor fractions can be below 0.01% in early-stage cancers, drowning in a background of normal cfDNA. AI has changed what's detectable.
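A back-of-the-envelope calculation shows why signal is the bottleneck at these tumor fractions. The sketch below uses a Poisson approximation for the chance of sampling at least one mutant fragment. Every number in it is an assumption chosen for illustration: roughly 3,000 haploid genome equivalents in the sample, every molecule captured and read without error, and the tumor fraction treated as the mutant fraction at each tracked site.

```python
import math

def expected_mutant_fragments(tumor_fraction: float, genome_equivalents: int, n_sites: int) -> float:
    """Expected number of mutant fragments summed over all tracked sites."""
    return tumor_fraction * genome_equivalents * n_sites

def p_detect_at_least_one(tumor_fraction: float, genome_equivalents: int, n_sites: int) -> float:
    """Poisson approximation of P(at least one mutant fragment is sampled)."""
    lam = expected_mutant_fragments(tumor_fraction, genome_equivalents, n_sites)
    return 1.0 - math.exp(-lam)

TUMOR_FRACTION = 1e-4        # 0.01%, as in the text
GENOME_EQUIVALENTS = 3000    # assumed input amount, not a figure from the article

for n_sites in (1, 10, 1000):
    lam = expected_mutant_fragments(TUMOR_FRACTION, GENOME_EQUIVALENTS, n_sites)
    p = p_detect_at_least_one(TUMOR_FRACTION, GENOME_EQUIVALENTS, n_sites)
    print(f"sites={n_sites:5d}  expected mutant fragments={lam:7.2f}  P(detect)={p:.3f}")
```

Under these assumptions a single tracked mutation is expected to contribute well under one mutant fragment per sample, so it is missed most of the time. Aggregating evidence across many sites is what makes detection feasible, and deciding which weak signals to trust is where machine learning enters.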

MRD-EDGE, developed by researchers at Weill Cornell Medicine, NewYork-Presbyterian, the NY Genome Center, and Memorial Sloan Kettering, trains a machine-learning model to analyze ctDNA sequencing data for minimal residual disease (MRD) detection. Published in Nature Medicine in June 2024, it identified all five colorectal cancer patients with recurrence in a validation cohort, and it also detected ctDNA shed by pre-cancerous polyps at levels previously considered undetectable. Critically, it achieved this without requiring pre-training on sequencing data from the patient's own tumor, and it could detect immunotherapy responses in melanoma and lung cancer weeks before changes appeared on CT imaging.

The broader liquid biopsy field is moving from mutation-centric assays toward multimodal frameworks that integrate cfDNA signals with circulating tumor cells (CTCs), extracellular vesicles (EVs), and fragmentomic features. At low tumor fractions, methylation profiles and fragment length patterns preserve tissue-of-origin information that mutation profiling misses. AI-powered multi-cancer early detection (MCED) tests that combine methylation, fragmentomics, and protein markers are entering clinical trials, with the goal of a single blood draw detecting cancer across 50+ tumor types.
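As a rough illustration of what a fragmentomic feature looks like, the sketch below summarizes cfDNA fragment lengths into a short-to-long ratio. The size windows (100–150 bp and 151–220 bp) and the function name are assumptions made for this example; real assays compute such features per genomic bin and combine them with many other signals.

```python
import statistics

def fragment_length_features(lengths, short_range=(100, 150), long_range=(151, 220)):
    """Summarize cfDNA fragment lengths (in bp) into simple fragmentomic features.

    The size windows are illustrative defaults, not values taken from the article.
    """
    n_short = sum(short_range[0] <= n <= short_range[1] for n in lengths)
    n_long = sum(long_range[0] <= n <= long_range[1] for n in lengths)
    return {
        "n_short": n_short,
        "n_long": n_long,
        "short_to_long_ratio": n_short / n_long if n_long else float("nan"),
        "median_length": statistics.median(lengths),
    }

# Toy fragment lengths; a real sample would contain millions of fragments.
lengths = [90, 120, 140, 166, 167, 170, 200]
features = fragment_length_features(lengths)
print(features)
```

A feature vector like this, computed across the genome, is the kind of input a downstream classifier could combine with methylation and mutation signals.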

For gastrointestinal cancers specifically, a 2025 analysis found that AI-based algorithms effectively profiled exosome and ctDNA biomarkers, and that federated learning improved model generalizability across diverse clinical settings — addressing a long-standing concern about cohort-specific overfitting in liquid biopsy AI.

[Figure: flowchart of an AI-powered multi-analyte liquid biopsy pipeline. Blood draw (cfDNA / plasma fraction isolated) → analytes: ctDNA mutations (SNVs, indels, CNV); methylation profile (tissue-of-origin signal); fragmentomics (fragment length, WGBS); EVs / exosomes (miRNA, exoRNA) → AI integration (deep learning fusion, federated training, multi-omics embedding; examples: MRD-EDGE, GRAIL) → outputs: MCED detection (50+ cancer types, tissue of origin); MRD surveillance (post-treatment relapse); therapy response (weeks before imaging, immunotherapy tracking).]
Figure 2 — Multimodal AI liquid biopsy pipeline. Modern AI-powered liquid biopsy integrates ctDNA mutations, methylation patterns, fragmentomic features, and extracellular vesicle data through deep learning models trained in federated settings. Outputs range from multi-cancer early detection (MCED) to therapy response monitoring weeks ahead of conventional imaging.

04 Single-Cell Sequencing and Spatial Transcriptomics: AI Maps the Cellular Landscape

Single-cell RNA sequencing (scRNA-seq) was named Nature Methods' "Method of the Year" in 2013, and spatial transcriptomics received the same recognition in 2020. Together, they are generating datasets that are categorically different from bulk NGS: massive (millions of cells), sparse (most genes undetected per cell), and richly multimodal (gene expression, chromatin accessibility, protein abundance, and physical location simultaneously). This is exactly the regime where AI excels and traditional statistics struggles.

A 2026 review in Advanced Science provides the most comprehensive survey of the current AI toolkit for these data types. Key advances include:

Cell-type annotation and atlas integration

Transformer-based models now treat single-cell gene expression profiles as token sequences — analogous to words in a sentence — enabling pre-trained representations that generalize across tissues and species. Tools like scGPT and Geneformer pre-train on millions of cells from public atlases (the Human Cell Atlas, CELLxGENE), then fine-tune for specific annotation tasks. This dramatically reduces the labeled data requirement for new tissues.
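One way to turn an expression profile into a token sequence is to rank genes by how unusually high their expression is in a given cell. The sketch below is loosely modeled on the rank-based encoding described for Geneformer; it is an assumption-laden illustration rather than either tool's API, and scGPT uses a different, binned-value scheme. The gene counts and corpus medians are made up.

```python
def rank_value_tokens(cell_counts: dict, gene_medians: dict, max_len: int = 2048) -> list:
    """Order detected genes by count / corpus median, highest first.

    Undetected genes (count 0) and genes without a positive median are dropped.
    Ties are broken alphabetically so the output is deterministic.
    """
    scored = [
        (count / gene_medians[gene], gene)
        for gene, count in cell_counts.items()
        if count > 0 and gene_medians.get(gene, 0) > 0
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:max_len]]

# Hypothetical raw counts for one cell and hypothetical corpus-wide medians.
cell = {"GAPDH": 50, "CD3E": 8, "FOXP3": 3, "ACTB": 0}
medians = {"GAPDH": 100, "CD3E": 4, "FOXP3": 1, "ACTB": 80}

tokens = rank_value_tokens(cell, medians)
print(tokens)
```

Dividing by a corpus-wide median pushes ubiquitous housekeeping genes down the sequence and cell-type-specific genes up, so the start of the "sentence" carries the most distinguishing information.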

Spatial domain identification

Identifying functionally distinct zones within tissue (tumor cores, immune infiltrates, fibrotic stroma) from spatial transcriptomics data requires balancing gene expression signal with spatial continuity. Methods like GraphST and BayesSpace excel at domain boundary detection, while newer "interpretable-by-design" approaches like STAMP (topic modeling of gene modules via graph convolutional networks) and SpaHDmap (multimodal fusion of expression with histological morphology) address a key gap: not just identifying where domains are, but understanding what they are biologically.
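The balance between expression signal and spatial continuity can be illustrated with the simplest possible version: blending each spot's expression with the mean of its physical neighbors. This is a generic neighbor-averaging sketch, not the algorithm used by GraphST, BayesSpace, STAMP, or SpaHDmap; the `radius` and `alpha` parameters are hypothetical knobs for this example.

```python
import math

def spatially_smooth(coords, expr, radius=1.5, alpha=0.5):
    """Blend each spot's expression vector with the mean of its spatial neighbors.

    alpha = 0 keeps the raw expression; alpha = 1 uses only the neighborhood mean.
    Spots with no neighbor within `radius` are left unchanged.
    """
    smoothed = []
    for i, (xi, yi) in enumerate(coords):
        neighbors = [
            j for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius
        ]
        if not neighbors:
            smoothed.append(list(expr[i]))
            continue
        n_genes = len(expr[i])
        neighbor_mean = [
            sum(expr[j][g] for j in neighbors) / len(neighbors) for g in range(n_genes)
        ]
        smoothed.append(
            [(1 - alpha) * expr[i][g] + alpha * neighbor_mean[g] for g in range(n_genes)]
        )
    return smoothed

# Four spots on a line; the last one is isolated. One gene per spot for brevity.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
expr = [[10], [0], [10], [5]]
result = spatially_smooth(coords, expr)
print(result)
```

The dropout-like zero in the middle spot is pulled toward its neighbors, while the isolated spot is untouched. Graph-based methods learn this trade-off instead of fixing it with a single weight.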

Tumor microenvironment dissection

A 2026 study in npj Digital Medicine integrated scRNA-seq (141,986 cells) with spatial transcriptomics across localized, hormone-sensitive, and castration-resistant prostate cancer. Using explainable ML (SHAP analysis), it identified a malignant epithelial subpopulation with high chromosomal instability and stemness potential, spatially mapped to immune-suppressive niches where fibroblasts and myeloid cells coexisted with exhausted lymphocytes. This kind of mechanistic, spatially-grounded tumor characterization was not possible before the convergence of scRNA-seq, spatial transcriptomics, and AI.

A Key Limitation to Watch

Current spatial transcriptomics platforms that infer single-cell resolution using external reference databases suffer 30–50% accuracy drops when the reference doesn't match the sample. Morphology-guided approaches — integrating histological image features directly into the model — may be necessary to achieve true single-cell resolution in diverse clinical contexts. This remains an active area of methods development as of mid-2026.

05 AI Across the Full NGS Workflow

A 2025 review in Current Issues in Molecular Biology explicitly frames AI as enhancing "every aspect of NGS workflows — from experimental design and wet-lab automation to bioinformatics analysis." Here is an honest accounting of where AI currently delivers clear value, and where promises still exceed results:

Workflow stage | AI approach | Maturity | Honest caveat
Base calling | RNN/CNN on signal data (Dorado, Bonito) | Production | Well-validated; runs on-instrument in real time
Read alignment | Deep learning alignment (e.g., DRAGEN neural model) | Production | Incremental improvement; heuristic tools still competitive
Variant calling | CNN image-based (DeepVariant) · LLM embeddings for prioritization | Production | Short-read mature; long-read models improving rapidly
Epigenomic profiling | CNN/transformer for ChIP-seq, ATAC-seq peak prediction | Clinical research | Performant; interpretability still limited for clinical sign-off
RNA-seq expression | Transformer cell models (scGPT, Geneformer) | Clinical research | Atlas transfer works; novel tissue types still need fine-tuning
Liquid biopsy / MRD | DL multi-analyte fusion (MRD-EDGE, Galleri/GRAIL) | Clinical trials | Landmark results; regulatory approval pathway still being established
Genome design | Generative models (Evo 2, NTv3 fine-tuned) | Early research | Impressive proof-of-concept; biological validation lags model capability
Spatial transcriptomics | GNN-based domain ID (GraphST, STAMP, SpaHDmap) | Clinical research | Single-cell resolution at scale still an open methods problem

06 The Honest Challenges: Where AI + NGS Still Struggles

Any serious treatment of this field has to reckon with what isn't working, or at least isn't working yet.

Model interpretability

A deep learning model calling a variant pathogenic with 98% confidence is useful; a model doing so without being able to explain which features drove the prediction is a liability in a clinical setting. Explainable AI methods — SHAP values, attention visualization, probing studies — are improving, but the gap between model performance and clinical trust remains real. The field is increasingly using interpretable-by-design architectures alongside post-hoc explanation tools.
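To show what a SHAP-style attribution actually computes, the sketch below derives exact Shapley values for a three-feature toy scorer by brute force over all feature orderings. The scoring function and its feature meanings (conservation, rarity, functional-domain membership) are hypothetical, and production tools rely on approximations because this exhaustive approach scales factorially with the number of features.

```python
from itertools import permutations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to observed value
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [value / factorial(n) for value in phi]

def toy_pathogenicity_score(f):
    """Hypothetical scorer: f[0]=conservation, f[1]=rarity, f[2]=in functional domain."""
    return 0.2 * f[0] + 0.5 * f[1] + 0.3 * f[0] * f[2]

x = [1, 1, 1]          # the variant being explained
baseline = [0, 0, 0]   # reference input with every feature switched off
phi = shapley_values(toy_pathogenicity_score, x, baseline)
print([round(v, 3) for v in phi])
print(round(sum(phi), 3), "= score(x) - score(baseline)")
```

The attributions sum to the difference between the prediction and the baseline, and the interaction term is split evenly between the two features that create it. That additivity is what makes these explanations auditable, although it says nothing about whether the model itself is right.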

Data heterogeneity and batch effects

Models trained on one sequencing platform, library prep protocol, or patient population often degrade significantly when applied to another. This is the most practical barrier to clinical deployment. Federated learning — where models train across distributed sites without pooling raw data — is increasingly seen as both the privacy-preserving solution and the generalization-improving one, but it adds substantial infrastructure complexity.
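The core of federated learning is small enough to sketch: each site trains locally and shares only model parameters, which a coordinator combines in proportion to local sample counts. The example below shows that aggregation step in the style of federated averaging. The site sizes and weights are invented, and a real deployment would add multiple training rounds, secure aggregation, and handling for sites that drop out.

```python
def federated_average(site_updates):
    """Sample-weighted mean of model weights reported by each site.

    site_updates: list of (n_samples, weights), where weights is a flat list of floats.
    Only these summaries leave each site; raw patient data is never pooled.
    """
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [sum(n * w[k] for n, w in site_updates) / total for k in range(dim)]

# Two hypothetical hospitals with different cohort sizes and locally trained weights.
site_updates = [
    (100, [1.0, 0.0]),
    (300, [0.0, 1.0]),
]
global_weights = federated_average(site_updates)
print(global_weights)
```

The larger cohort pulls the global model toward its local solution, which is one reason federated training can still inherit bias from dominant sites even though no raw data is shared.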

The "general foundation model" question

A 2026 preprint ("Entropy, Disagreement, and the Limits of Foundation Models in Genomics") challenges the assumption that larger genomic language models trained purely on sequences generalize better. Empirically, smaller task-specific models often outperform GFM embeddings on concrete downstream tasks, while being faster and cheaper. This doesn't invalidate Evo 2 or NTv3 — generative capacity and sequence design are distinct from classification performance — but it does caution against assuming scale automatically translates to utility.

Ethical concerns and genomic privacy

Clinical metagenomic NGS (mNGS) often sequences all nucleic acids in a sample, including the human host genome. This creates genuine patient privacy and informed consent challenges that technical solutions alone cannot resolve. Governance frameworks for genomic AI are lagging the technology by years.

"Specialized solutions — smaller models trained specifically for the task — often outperform genomic foundation model embeddings while being faster and cheaper to train and much more convenient to use."

— Entropy, Disagreement, and the Limits of Foundation Models in Genomics, arXiv 2026

07 What to Watch in 2026–2027

Based on the current trajectory of the field, several developments appear likely to materialize in the near term:

Multimodal genome-to-phenotype models

The next generation of foundation models will not train on DNA sequences alone, but on coordinated DNA + RNA + chromatin + protein data from the same cells. The Human Cell Atlas and similar initiatives are generating exactly this kind of data at scale. Models that learn the cellular program holistically — how a genomic variant propagates through transcript expression to protein abundance to cellular phenotype — would represent a qualitative leap in what NGS data can predict.

Regulatory approval of AI-native genomic diagnostics

The FDA's authorization of over 1,000 AI medical algorithms has largely covered imaging. Genomic AI, including liquid biopsy tests and AI-assisted variant interpretation, is next on the regulatory agenda. The clinical validation data is accumulating; the open question is how quickly the regulatory framework matures.

Spatial multi-omics at true single-cell resolution

Platforms like Xenium, MERFISH, and Slide-seq are converging toward profiling thousands of genes at single-cell spatial resolution across centimeter-scale tissue sections. At that scale and dimensionality, AI is not an optional enhancement — it is the only practical analysis approach. Expect AI-driven tissue atlases of disease to become standard pre-clinical and clinical research tools.

Generative biology in practice

Evo 2 and NTv3's controllable enhancer design represent the first wave of AI-generated genomic elements validated experimentally (via STARR-seq and related assays). The near-term roadmap includes AI-designed promoters, terminators, and synthetic regulatory circuits for gene therapy and metabolic engineering applications. The gap between in silico generation and biological validation remains a major bottleneck, but it is narrowing rapidly.

The convergence of bioinformatics and AI is not a future promise — it is an ongoing technical reality with an accelerating publication rate, a growing roster of clinical deployments, and a foundation model ecosystem that is architecturally more capable every six months. What distinguishes the current moment from earlier waves of machine learning in genomics is scope: we are no longer applying AI to improve individual steps in a fixed pipeline. We are rethinking what a genomic analysis is, what data it can integrate, and what questions it can answer.
