Diagnostic AI Across the Life Sciences (2015–2025): A PRISMA-ScR Scoping Review and Bibliometric Synthesis of External Validity, Calibration, Fairness, and Reproducibility
Abstract
Artificial intelligence (AI) is transforming diagnostic decision-making across the life sciences, yet the evidence base remains fragmented across human, veterinary, plant, environmental, and microbial domains. We conducted a PRISMA-ScR scoping review (protocol preregistered on OSF; details in the Supplement) and a bibliometric analysis covering 2015–2025. Searches in PubMed/MEDLINE, Scopus, Web of Science, and IEEE Xplore (plus arXiv/bioRxiv tagging) identified 28,541 records and 68 preprints; after deduplication and dual screening, 689 primary studies met the inclusion criteria (42 preprints were analyzed descriptively but excluded from citation-based bibliometrics). Human medicine dominated the corpus (81.3%), followed by veterinary (6.2%), plant (5.1%), environmental (4.2%), and microbial diagnostics (3.2%). Medical imaging led the modalities (65.0%), followed by omics (18.0%), time series (8.1%), spectra (4.1%), text (2.9%), and eDNA (1.9%). Reported performance was high (median AUROC 0.94), but external validity and transparency were limited: only 28.0% of studies performed external validation, 9.0% used prospective designs, and 5.2% reported probability calibration. Reproducibility signals were weak (code available for 22.9% of studies, data for 18.0%, explicit preregistration rare), and fairness/bias assessments appeared in only 7.0% of studies, concentrated in human health. Bibliometrics showed rapid year-on-year growth, with the United States (32.1%) and China (28.4%) leading output and collaboration. Trends indicate a shift from task-specific CNNs toward multimodal and foundation-model approaches, with early evidence of gains from data fusion, but consistent gaps persist in leakage control, calibration, subgroup reporting, and regulatory alignment. We recommend domain-aware, leakage-resistant data splits; at least one independent, real-world evaluation; prevalence-aware metrics with calibration and decision-utility analyses; open datasheets and model cards; and federated or external benchmarking to probe generalization. These practices can convert impressive internal results into dependable, equitable diagnostics that work across clinics, farms, rivers, and labs.
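To make the recommended evaluation practices concrete, the sketch below pairs a site-grouped, leakage-resistant cross-validation split with prevalence-aware and calibration metrics. It is a minimal illustration only, assuming scikit-learn and synthetic data; the site identifiers, logistic-regression model, and 5-fold setup are hypothetical choices for exposition, not the pipeline of any reviewed study.

```python
"""Minimal sketch: leakage-resistant grouped splits with
prevalence-aware and calibration metrics (assumes scikit-learn;
data, grouping, and model are illustrative)."""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic low-prevalence diagnostic dataset (~10% positives),
# with each sample assigned to one of 20 hypothetical sites.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
sites = rng.integers(0, 20, size=len(y))

# Domain-aware split: GroupKFold keeps every sample from a site on
# the same side of each fold, so site-specific artifacts cannot leak
# from training into evaluation.
cv = GroupKFold(n_splits=5)
aurocs, auprcs, briers = [], [], []
for train_idx, test_idx in cv.split(X, y, groups=sites):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    aurocs.append(roc_auc_score(y[test_idx], p))            # discrimination
    auprcs.append(average_precision_score(y[test_idx], p))  # prevalence-aware
    briers.append(brier_score_loss(y[test_idx], p))         # calibration

print(f"AUROC {np.mean(aurocs):.3f} | AUPRC {np.mean(auprcs):.3f} "
      f"| Brier {np.mean(briers):.3f}")
```

Grouping by site (or patient, farm, or sampling location) blocks the identity leakage that inflates many internally validated AUROCs, while reporting AUPRC and the Brier score alongside AUROC makes low-prevalence performance and probability calibration visible rather than hidden behind a single discrimination number.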