AI as a second pair of eyes — never a replacement.
Sumora's products are built around a simple operating principle: AI augments the clinician's judgement, surfaces patterns a tired human might miss, and watches data streams that no one would otherwise watch. The clinician remains accountable. Always.
This isn't only an ethical stance. It's also what the evidence supports. Across published radiology, dermatology, and pathology studies, the strongest results consistently come from clinician + AI configurations rather than AI alone [1, 2]. Our models, our user interfaces, and our evaluation protocols are all built around that finding.
Three commitments that shape every model we ship
Calibration over confidence. A model that says “70% likely” should be right 70% of the time on cases like this one. We measure and report calibration alongside accuracy, because a confidently wrong model is worse than no model at all [3].
Reasoning is shown, not hidden. Every Barnard suggestion comes with the features it weighed and the literature it drew from. A clinician who disagrees can see exactly where to disagree.
Out-of-distribution awareness. Models flag when they're being asked something they weren't trained for. A pediatric-trained model declines an adult case rather than guessing.
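The calibration commitment above can be made concrete with a reliability table. What follows is a minimal sketch, not Sumora's evaluation code: the function names, the equal-width binning, and the toy data are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_bins=10):
    """Bin predictions by predicted probability and compare, per bin,
    the mean prediction with the observed event rate."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        rows.append({"bin": idx, "n": len(pairs),
                     "mean_pred": mean_pred, "observed": observed})
    return rows


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average gap between predicted and observed rates, weighted by bin size."""
    total = len(probs)
    return sum(r["n"] / total * abs(r["mean_pred"] - r["observed"])
               for r in reliability_table(probs, outcomes, n_bins))


# Toy data only: ten cases all scored 0.9, of which five were actually positive.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(f"ECE for the overconfident toy model: {overconfident:.2f}")
```

A well-calibrated model keeps the gap between `mean_pred` and `observed` small in every bin, which is a different property from high accuracy.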
Different problems, different models.
“AI” in healthcare covers wildly different techniques. We don't use one giant model for everything — we use the right family of model for the right problem, and combine them carefully when products span more than one.
Why this layered approach matters: a single model trying to do all of these things would be worse at each one and harder to evaluate. Specialised models with clear inputs and outputs are easier to test, audit, and update independently when better techniques arrive.
How a diagnosis gets ranked.
The clearest way to explain Barnard is to walk through one illustrative case end-to-end. The case below is a composite, built from common presentations to demonstrate the workflow; it is not a real patient. Numbers are illustrative, not benchmark claims.
A 58-year-old presents with chest discomfort on exertion.
Inputs to Barnard
- Age, sex, BMI, smoking status
- Symptom narrative (free-text, transcribed)
- Vitals: BP 148/92, HR 88, SpO₂ 97%
- Recent labs (lipid panel, HbA1c)
- Family history flags
- Resting ECG image
Pipeline
- LLM extracts structured features from the narrative
- ECG image scored by a vision model
- Tabular model produces a cardiovascular-risk (CV-risk) estimate
- Reasoning model combines them, retrieves relevant guidelines, ranks differentials with calibrated probabilities
The clinician sees the ranking, the features that drove it (age + exertional pattern + ECG findings + lipid profile), and the cited guidelines. They're free to disagree, override, and document why. Every override is logged and used to improve calibration over time.
Barnard does not — and should not — make autonomous decisions. Its output is one input among many that the clinician weighs.
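The review step described above (a ranking, the features behind it, and a logged override) might be represented roughly as follows. This is a sketch under stated assumptions: every class, field, and method name is hypothetical, the diagnoses are placeholders, and the probabilities are illustrative rather than Barnard output.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Differential:
    diagnosis: str
    probability: float      # calibrated probability from the reasoning step
    drivers: list[str]      # features that drove the ranking
    guidelines: list[str]   # cited guideline identifiers (placeholders here)


@dataclass
class OverrideRecord:
    suggested: str
    chosen: str
    reason: str
    logged_at: str


@dataclass
class CaseReview:
    ranking: list[Differential]
    overrides: list[OverrideRecord] = field(default_factory=list)

    def top(self) -> Differential:
        """Highest-probability suggestion shown to the clinician."""
        return max(self.ranking, key=lambda d: d.probability)

    def record_decision(self, chosen: str, reason: str = "") -> str:
        """The clinician's choice always stands; disagreements are logged."""
        suggested = self.top().diagnosis
        if chosen != suggested:
            stamp = datetime.now(timezone.utc).isoformat()
            self.overrides.append(OverrideRecord(suggested, chosen, reason, stamp))
        return chosen


review = CaseReview(ranking=[
    Differential("differential A", 0.6, ["age", "exertional pattern"], ["guideline-1"]),
    Differential("differential B", 0.3, ["lipid profile"], ["guideline-2"]),
])
final = review.record_decision("differential B", reason="history not captured in inputs")
print(final, len(review.overrides))
```

Note that `record_decision` never blocks or alters the clinician's choice; the model's ranking is only ever an input to it.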
From phone camera to verified packet.
VRx turns a phone into a counterfeit-detection tool. The pipeline runs partly on-device for speed and partly in the cloud for verification against pharmaceutical-company reference data.
Camera frame
User points the phone at packaging. App stabilises, detects packet orientation, and selects the best frame.
Feature extraction
Mobile-optimised CNN identifies hologram regions, print boundaries, batch-code blocks, and packaging colour signature.
Reference match
Extracted features are compared against the manufacturer's reference fingerprint for that batch and lot number.
Verified or flagged
Result returned in seconds: verified, suspect, or unable-to-verify (with the specific failing checks shown).
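The three-way outcome above can be sketched as a comparison between extracted features and a reference fingerprint. This is an illustration only: it assumes each check reduces to a single numeric score and a shared tolerance, and the names `Verdict`, `check_packet`, and the feature keys are hypothetical rather than part of VRx.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "verified"
    SUSPECT = "suspect"
    UNABLE_TO_VERIFY = "unable-to-verify"


def check_packet(extracted, reference, tolerance=0.1):
    """Compare extracted feature scores with a manufacturer reference.

    extracted: dict of check name -> score measured from the camera frame
    reference: dict of check name -> expected score, or None when the
               manufacturer has supplied no reference data for this batch
    Returns (verdict, list of checks that failed or could not be run).
    """
    if reference is None:
        return Verdict.UNABLE_TO_VERIFY, ["no reference data"]
    missing = [name for name in reference if name not in extracted]
    if missing:
        # A check that could not be run is never treated as a pass or a fail.
        return Verdict.UNABLE_TO_VERIFY, missing
    failing = [name for name, expected in reference.items()
               if abs(extracted[name] - expected) > tolerance]
    if failing:
        return Verdict.SUSPECT, failing
    return Verdict.VERIFIED, []


reference = {"hologram": 0.9, "batch_code": 1.0, "colour_signature": 0.8}
scan = {"hologram": 0.5, "batch_code": 1.0, "colour_signature": 0.82}
print(check_packet(scan, reference))
```

Returning the failing check names alongside the verdict mirrors the page's point that a flagged result should show which specific checks failed.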
What the model is actually looking at
Counterfeit packaging usually fails on the small details: the things a forger can't reproduce cheaply at scale. VRx's vision model is trained to find those specific signals.
The verification network only works to the extent that pharmaceutical manufacturers contribute reference data. Sumora's partnership programme is structured to make that easy: a manufacturer onboards once, and from that point every VRx scan in the field protects their brand and their patients.
Listening to the signals a patient sends home.
SERA's core problem is signal triage at scale. A single patient on continuous monitoring produces hundreds of thousands of data points per day. Most of them are normal. The work is finding the small fraction that isn't, without flooding the clinical team with false alarms.
What SERA is watching for
The model layer combines on-device anomaly detection with cloud-side context-aware classification.
The escalation thresholds are not magic numbers — they're configured per-patient by their care team, based on baseline values, prior conditions, and recovery stage. A post-cardiac-surgery patient and a chronic-disease patient at home need different sensitivity profiles.
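Per-patient thresholds of this kind could be represented as a small configuration object. The sketch below is hypothetical: the class names, signal keys, and numeric ranges are placeholders invented for illustration and are not clinical guidance or SERA defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalRange:
    low: float
    high: float


@dataclass
class PatientProfile:
    patient_id: str
    ranges: dict  # signal name -> VitalRange, set by the care team


def breaches(profile, reading):
    """Return the signals in `reading` that fall outside this patient's ranges."""
    out = []
    for signal, value in reading.items():
        limits = profile.ranges.get(signal)
        if limits is not None and not (limits.low <= value <= limits.high):
            out.append(signal)
    return out


# Placeholder ranges: a tighter profile and a looser one for the same signals.
post_surgery = PatientProfile("patient-a", {
    "heart_rate": VitalRange(50, 100), "spo2": VitalRange(94, 100)})
chronic_home = PatientProfile("patient-b", {
    "heart_rate": VitalRange(50, 120), "spo2": VitalRange(90, 100)})

reading = {"heart_rate": 105, "spo2": 96}
print(breaches(post_surgery, reading), breaches(chronic_home, reading))
```

The same reading escalates for one profile and not the other, which is the behaviour the paragraph above describes.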
Why false alarm rate is the metric we obsess over
Continuous monitoring has a known failure mode: alarm fatigue. If a system raises alerts too often, clinicians stop responding to all of them, including the real ones [7]. SERA is evaluated as much on its specificity (true negative rate) as on its sensitivity. The honest target isn't “catch everything” — it's “be trustworthy enough that the alerts you do raise are taken seriously.”
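A short worked example shows why specificity matters so much when true events are rare. The counts below are invented for illustration (10,000 monitoring windows containing 50 true events) and are not SERA results; `alert_metrics` is a hypothetical helper.

```python
def alert_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and the share of raised alerts that are real.

    Assumes at least one true event, one non-event, and one raised alert,
    so that no denominator is zero.
    """
    return {
        "sensitivity": tp / (tp + fn),        # true events that were caught
        "specificity": tn / (tn + fp),        # quiet windows left quiet
        "alert_precision": tp / (tp + fp),    # raised alerts that were real
    }


# 50 true events (45 caught, 5 missed); 9,950 normal windows (199 false alarms).
m = alert_metrics(tp=45, fp=199, tn=9751, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts, sensitivity is 90% and specificity is 98%, yet fewer than one alert in five corresponds to a real event. That is the arithmetic behind alarm fatigue.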
The numbers we report and the ones we don't.
“Accuracy” alone is almost meaningless in clinical AI. A model that says “no disease” to every patient in a population where 5% have the disease is 95% accurate — and useless. We evaluate every clinical-facing model across a fuller set of measures, on held-out data the model has never seen, with subgroup analysis to catch performance gaps.
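The 5% example above can be checked directly. This is a minimal sketch with a hypothetical `evaluate` helper, using 100 synthetic patients of whom 5 have the disease and a "model" that always answers "no disease".

```python
def evaluate(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # None when the data contains no positives / no negatives at all
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


y_true = [1] * 5 + [0] * 95   # 5% prevalence
y_pred = [0] * 100            # always predicts "no disease"
print(evaluate(y_true, y_pred))
# accuracy 0.95, sensitivity 0.0, specificity 1.0
```

The accuracy looks strong while sensitivity is zero: the model misses every patient who has the disease. Running the same function on each subgroup separately is one simple way to surface performance gaps.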
The protocol on every Sumora model
What we don't do: cherry-pick the best evaluation slice for marketing. If a model performs well on adults and poorly on adolescents, both numbers go in the report. If a competitor's published number is higher than ours on a benchmark, we say so and explain why.
What our AI cannot do.
An honest technology page lists the boundaries as clearly as the capabilities. These aren't temporary engineering gaps — they're principled limits we hold ourselves to. Anyone claiming otherwise about clinical AI should be treated with caution.
Boundaries we name openly
- / 01 Sumora's models do not make autonomous clinical decisions. Every clinically significant output is a suggestion to a licensed clinician, not a directive.
- / 02 Performance degrades on patient populations that differ materially from training data. We name the populations our models are validated on, and decline cases that fall meaningfully outside them.
- / 03 Bisma is a triage assistant, not a diagnostic device. It is designed to help patients seek care appropriately, never to replace it.
- / 04 VRx verifies what manufacturers have provided reference data for. Medicines from manufacturers not in the network return “unable to verify”, never a false positive.
- / 05 SERA monitors what its sensors can measure. It cannot detect conditions that don't manifest in observable physiological signals, and it is not a substitute for clinically indicated continuous monitoring.
- / 06 No Sumora product diagnoses, treats, cures, or prevents disease as a regulatory claim. Specific clinical indications follow specific regulatory clearances; we report those clearances honestly as they are obtained.
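Boundary 02 above, declining cases outside the validated population, can be illustrated with a deliberately simple gate. This sketch is an assumption-heavy simplification: it checks age only, the 0 to 17 range is a placeholder, and the names `ValidatedPopulation` and `screen_case` are hypothetical. Real out-of-distribution detection would consider far more than one field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedPopulation:
    name: str
    min_age: float
    max_age: float


def screen_case(age, population):
    """Accept a case only if it falls inside the validated age range.

    Returns ("accept", "") or ("decline", reason) so that the reason for
    declining can be shown to the clinician instead of a guess.
    """
    if population.min_age <= age <= population.max_age:
        return ("accept", "")
    reason = (f"age {age} is outside the range validated for "
              f"{population.name} ({population.min_age}-{population.max_age})")
    return ("decline", reason)


PEDIATRIC = ValidatedPopulation("pediatric model", 0, 17)  # placeholder range
print(screen_case(9, PEDIATRIC))
print(screen_case(45, PEDIATRIC))
```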
Where the evidence sits.
Selected sources cited above. This is a working bibliography, not an exhaustive one — the field is large and moving. We prioritise peer-reviewed journals, regulatory body guidance, and open clinical guidelines.
// Cited above
- [1] Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56.
- [2] McKinney, S. M. et al. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577, 89–94.
- [3] Van Calster, B. et al. (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine, 17(1), 230.
- [4] Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data? NeurIPS.
- [5] Hannun, A. Y. et al. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1), 65–69.
- [6] Royal College of Physicians. (2017). National Early Warning Score (NEWS) 2: Standardising the assessment of acute-illness severity in the NHS.
- [7] Sendelbach, S., & Funk, M. (2013). Alarm fatigue: a patient safety concern. AACN Advanced Critical Care, 24(4), 378–386.
Standards we build against
Beyond the cited literature, our development and evaluation processes draw from: FDA Good Machine Learning Practice (GMLP) guiding principles for medical device software; WHO ethics and governance of artificial intelligence for health; ISO 14971 for risk management of medical devices; HL7 FHIR for interoperability; and HIPAA, GDPR, and the UAE Dubai Health Authority data frameworks for privacy and security.