Technology — Sumora Health

Our Approach

AI as a second pair of eyes — never a replacement.

Sumora's products are built around a simple operating principle: AI augments the clinician's judgement, surfaces patterns a tired human might miss, and watches data streams that no one would otherwise watch. The clinician remains accountable. Always.

This isn't only an ethical stance. It's also what the evidence supports. Across published radiology, dermatology, and pathology studies, the strongest results consistently come from clinician + AI configurations rather than AI alone [1, 2]. Our models, our user interfaces, and our evaluation protocols are all built around that finding.

Three commitments that shape every model we ship

Calibration over confidence. A model that says “70% likely” should be right 70% of the time on cases like this one. We measure and report calibration alongside accuracy, because a confidently wrong model is worse than no model at all [3].

Reasoning is shown, not hidden. Every Barnard suggestion comes with the features it weighed and the literature it drew from. A clinician who disagrees can see exactly where to disagree.

Out-of-distribution awareness. Models flag when they're being asked something they weren't trained for. A pediatric-trained model declines an adult case rather than guessing.
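
A minimal sketch of the third commitment, assuming the simplest possible guard: checking a patient's age against the range a model was validated on. The names ValidatedRange and check_in_distribution and the age bounds are hypothetical illustrations, not Sumora's implementation. A real out-of-distribution check would typically also examine the input features themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedRange:
    """Age range (in years) a model was validated on. Values are illustrative."""
    min_age: float
    max_age: float


def check_in_distribution(age_years: float, validated: ValidatedRange) -> tuple[bool, str]:
    """Return (accept, reason). Decline rather than guess when out of range."""
    if validated.min_age <= age_years <= validated.max_age:
        return True, "within validated age range"
    return False, (
        f"age {age_years} is outside the validated range "
        f"{validated.min_age}-{validated.max_age}; declining"
    )


# Hypothetical paediatric model validated on ages 0 to 17.
paediatric = ValidatedRange(min_age=0, max_age=17)
print(check_in_distribution(9, paediatric))
print(check_in_distribution(58, paediatric))
```

The point of returning a reason alongside the decision is that the clinician sees why the model declined, not just that it did.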

The Model Stack

Different problems, different models.

“AI” in healthcare covers wildly different techniques. We don't use one giant model for everything — we use the right family of model for the right problem, and combine them carefully when products span more than one.

FOUNDATION

Large language models (reasoning + dialogue)

For clinical reasoning chains, patient-facing conversation, and synthesising free-text notes. Used in Barnard's differential generation and Bisma's triage flow.

LLM

VISION

Convolutional + vision-transformer models (images)

For VRx packaging verification (CNN ensembles for hologram detection, ViT for printed-mark micro-features) and supporting Barnard's image inputs (X-ray, dermoscopy).

SIGNALS

Time-series models (physiological streams)

For SERA's continuous vitals monitoring. 1D-CNN and Transformer architectures over heart rhythm, SpO₂, and movement data, with anomaly detection layered on top.

STRUCTURED

Gradient-boosted trees (labs + tabular)

For risk scoring on lab panels, vitals snapshots, and EHR-derived features. Boring, well-understood, and often the right tool — strong baselines that newer architectures rarely beat on tabular data [4].

GBT

RETRIEVAL

Medical-knowledge retrieval (citations)

A vector store over curated, peer-reviewed sources (clinical reference texts, PubMed abstracts, society guidelines) so LLM outputs are grounded in citable evidence rather than free-floating training memory.

RAG

Why this layered approach matters: a single model trying to do all of these things would be worse at each one and harder to evaluate. Specialised models with clear inputs and outputs are easier to test, audit, and update independently when better techniques arrive.
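
To make the retrieval layer concrete, here is a toy sketch of nearest-neighbour lookup over a vector store using cosine similarity. The document ids, the three-dimensional vectors, and the function names are all hypothetical. How Sumora's store is actually indexed and queried is not described here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k source ids most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
store = {
    "guideline:chest-pain": [0.9, 0.1, 0.0],
    "abstract:lipids": [0.6, 0.7, 0.1],
    "reference:dermatology": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

for doc_id, score in top_k(query, store):
    print(f"{doc_id}: {score:.3f}")
```

Because the retrieval step has a clear input (a query vector) and a clear output (ranked source ids), it can be tested on its own, which is the argument for the layered design.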

Barnard · Worked Example

How a diagnosis gets ranked.

The clearest way to explain Barnard is to walk through one anonymised, illustrative case end-to-end. The case below is composite — built from common presentations to demonstrate the workflow, not a real patient. Numbers are illustrative, not benchmark claims.

ILLUSTRATIVE CASE · Setting: General practice · Time: 90 seconds

A 58-year-old presents with chest discomfort on exertion.

Inputs to Barnard

Age, sex, BMI, smoking status
Symptom narrative (free-text, transcribed)
Vitals: BP 148/92, HR 88, SpO₂ 97%
Recent labs (lipid panel, HbA1c)
Family history flags
Resting ECG image

Pipeline

LLM extracts structured features from the narrative
ECG image scored by a vision model
Tabular model produces a CV-risk estimate
Reasoning model combines them, retrieves relevant guidelines, ranks differentials with calibrated probabilities

// Barnard output (ranked differential)

Stable angina

~62%

GERD / oesophageal

~18%

Musculoskeletal

~12%

Anxiety-related

~5%

Other / further workup

~3%

Suggested next steps: exercise stress test, troponin if pain recurs at rest, lipid management. Cited: ESC chronic coronary syndrome guideline; NICE CG95 (chest pain of recent onset).

The clinician sees the ranking, the features that drove it (age + exertional pattern + ECG findings + lipid profile), and the cited guidelines. They're free to disagree, override, and document why. Every override is logged and used to improve calibration over time.

Barnard does not — and should not — make autonomous decisions. Its output is one input among many that the clinician weighs.
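
The shape of that output can be sketched as a small data structure: a ranked list, the features that drove it, the citations, and a log of clinician overrides. This is an illustration under stated assumptions, not Barnard's schema. The class names, field names, and the override reason are hypothetical, and the probabilities are the illustrative figures from the case above, assumed to be calibrated upstream.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Differential:
    label: str
    probability: float  # assumed already calibrated upstream


@dataclass
class Suggestion:
    ranked: list[Differential]
    drivers: list[str]
    citations: list[str]
    overrides: list[dict] = field(default_factory=list)

    def record_override(self, clinician_dx: str, reason: str) -> None:
        """Log a clinician's disagreement so it can feed later calibration work."""
        self.overrides.append({
            "model_top": self.ranked[0].label,
            "clinician_dx": clinician_dx,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })


def rank(scores: dict[str, float]) -> list[Differential]:
    """Normalise non-negative scores to sum to 1 and sort, highest first."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    items = [Differential(label, s / total) for label, s in scores.items()]
    return sorted(items, key=lambda d: d.probability, reverse=True)


# Illustrative numbers from the worked case; not benchmark claims.
suggestion = Suggestion(
    ranked=rank({
        "Stable angina": 62, "GERD / oesophageal": 18, "Musculoskeletal": 12,
        "Anxiety-related": 5, "Other / further workup": 3,
    }),
    drivers=["age", "exertional pattern", "ECG findings", "lipid profile"],
    citations=["ESC chronic coronary syndrome guideline", "NICE CG95"],
)
# Hypothetical override entered by the clinician.
suggestion.record_override("Musculoskeletal", "pain reproducible on palpation")
for d in suggestion.ranked:
    print(f"{d.label}: {d.probability:.0%}")
```

Note that the override never changes the ranking itself. It is recorded next to it, so the model's suggestion and the clinician's decision stay separately auditable.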

VRx · The Verification Pipeline

From phone camera to verified packet.

VRx turns a phone into a counterfeit-detection tool. The pipeline runs partly on-device for speed and partly in the cloud for verification against pharmaceutical-company reference data.

/ 01 CAPTURE

Camera frame

User points the phone at packaging. App stabilises, detects packet orientation, and selects the best frame.

/ 02 ON-DEVICE

Feature extraction

Mobile-optimised CNN identifies hologram regions, print boundaries, batch-code blocks, and packaging colour signature.

/ 03 CLOUD

Reference match

Extracted features are compared against the manufacturer's reference fingerprint for that batch and lot number.

/ 04 RESULT

Verified or flagged

Result returned in seconds: verified, suspect, or unable-to-verify (with the specific failing checks shown).

What the model is actually looking at

Counterfeit packaging usually fails on the small details — the things a forger can't reproduce cheaply at scale. VRx's vision model is trained to find those specific signals:

hologram_consistency

Diffraction pattern under different angles. Genuine holograms shift colour predictably; counterfeits are usually static stickers or low-fidelity reproductions.

print_microfeatures

Sub-millimetre print details: kerning, edge sharpness, security micro-text. Inkjet copies show a characteristic dot pattern absent in offset printing.

colour_signature

Spectral profile of the packaging compared to the manufacturer's reference. Counterfeit dyes drift outside tolerance bounds.

batch_lot_consistency

Cross-checks the printed lot/batch code against the manufacturer's database for that production run, including expiry validation.
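
One way to picture how the four signals above become a single result is a small decision function. This is a sketch under assumptions: the rule that any failed check yields "suspect", the use of None for a check that could not be evaluated, and the names Verdict and decide are ours for illustration. VRx's actual scoring logic is not described on this page.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED = "verified"
    SUSPECT = "suspect"
    UNABLE_TO_VERIFY = "unable-to-verify"


CHECKS = (
    "hologram_consistency",
    "print_microfeatures",
    "colour_signature",
    "batch_lot_consistency",
)


def decide(results: dict[str, Optional[bool]], has_reference: bool) -> tuple[Verdict, list[str]]:
    """Combine per-check results into a verdict plus the checks responsible.

    results maps a check name to True (pass), False (fail), or None
    (could not be evaluated, for example glare on the hologram).
    """
    if not has_reference:
        # No manufacturer reference data: never claim "verified".
        return Verdict.UNABLE_TO_VERIFY, []
    failed = [name for name in CHECKS if results.get(name) is False]
    if failed:
        return Verdict.SUSPECT, failed
    missing = [name for name in CHECKS if results.get(name) is None]
    if missing:
        return Verdict.UNABLE_TO_VERIFY, missing
    return Verdict.VERIFIED, []


scan = {
    "hologram_consistency": True,
    "print_microfeatures": False,
    "colour_signature": True,
    "batch_lot_consistency": True,
}
print(decide(scan, has_reference=True))
```

Returning the list of responsible checks alongside the verdict mirrors the "specific failing checks shown" behaviour described in step 04.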

The verification network only works to the extent that pharmaceutical manufacturers contribute reference data. Sumora's partnership programme is structured to make that easy: a manufacturer onboards once, and from that point every VRx scan in the field protects their brand and their patients.

SERA · Signal Processing

Listening to the signals a patient sends home.

SERA's core problem is signal triage at scale. A single patient on continuous monitoring produces hundreds of thousands of data points per day. Most of it is normal. The work is finding the small fraction that isn't — without flooding the clinical team with false alarms.

What SERA is watching for

The model layer combines on-device anomaly detection with cloud-side context-aware classification:

arrhythmia detection

A 1D convolutional network trained on annotated ECG data identifies rhythm abnormalities (atrial fibrillation, supraventricular tachycardia, ventricular ectopy). Comparable architectures have reached cardiologist-level performance on specific tasks [5].

deterioration trend

National Early Warning Score 2 (NEWS2) computed continuously, with smoothed trend lines. NEWS2 has been used in clinical practice for years; SERA's contribution is running it between hospital visits rather than only at the bedside [6].

contextual filtering

A separate model down-weights spurious alerts: a heart rate spike during recorded movement, a brief desaturation while the sensor was being adjusted. This is where false-alarm reduction lives.

escalation logic

Tiered: a single anomaly is logged silently. Persistent or worsening anomalies notify the patient. Critical patterns notify the care team directly, with the raw signal waveform attached.

The escalation thresholds are not magic numbers — they're configured per-patient by their care team, based on baseline values, prior conditions, and recovery stage. A post-cardiac-surgery patient and a chronic-disease patient at home need different sensitivity profiles.
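
The tiered logic and per-patient configuration can be sketched as follows. Everything here is an assumption made for illustration: the names, the severity scale of 0 to 1, and the threshold values are hypothetical. The sketch also simplifies in two ways: contextual filtering is a hard filter rather than a down-weighting model, and "worsening" trends are not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOG_ONLY = "log silently"
    NOTIFY_PATIENT = "notify patient"
    NOTIFY_CARE_TEAM = "notify care team"


@dataclass(frozen=True)
class PatientThresholds:
    """Per-patient settings chosen by the care team. Values are illustrative."""
    persistent_count: int      # credible anomalies in the window before the patient is notified
    critical_severity: float   # severity (0 to 1) that goes straight to the care team


@dataclass(frozen=True)
class Anomaly:
    severity: float            # 0 to 1, assumed to come from an upstream classifier
    during_movement: bool      # context flag, e.g. from an accelerometer
    sensor_adjusting: bool     # context flag, e.g. from a sensor contact check


def escalate(window: list[Anomaly], cfg: PatientThresholds) -> Tier:
    """Apply contextual filtering, then the tiered escalation rules."""
    credible = [a for a in window if not (a.during_movement or a.sensor_adjusting)]
    if any(a.severity >= cfg.critical_severity for a in credible):
        return Tier.NOTIFY_CARE_TEAM
    if len(credible) >= cfg.persistent_count:
        return Tier.NOTIFY_PATIENT
    return Tier.LOG_ONLY


# Two hypothetical sensitivity profiles for the same window of anomalies.
post_surgery = PatientThresholds(persistent_count=2, critical_severity=0.8)
chronic = PatientThresholds(persistent_count=4, critical_severity=0.9)
window = [
    Anomaly(0.5, False, False),
    Anomaly(0.6, False, False),
    Anomaly(0.95, True, False),  # spike during recorded movement: filtered out
]
print(escalate(window, post_surgery))
print(escalate(window, chronic))
```

The same window produces a patient notification under the post-surgery profile and a silent log under the chronic-disease profile, which is the behaviour per-patient thresholds are meant to allow.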

Why false alarm rate is the metric we obsess over

Continuous monitoring has a known failure mode: alarm fatigue. If a system raises alerts too often, clinicians stop responding to all of them, including the real ones [7]. SERA is evaluated as much on its specificity (true negative rate) as on its sensitivity. The honest target isn't “catch everything” — it's “be trustworthy enough that the alerts you do raise are taken seriously.”
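
A back-of-envelope calculation shows why specificity dominates at this scale. The sketch assumes one classification every 30 seconds and treats every window as truly normal. Both are simplifying assumptions for illustration, not SERA's actual windowing or measured performance.

```python
def expected_false_alerts(windows_per_day: int, specificity: float) -> float:
    """Expected false alerts per patient per day if every window is truly normal."""
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    return windows_per_day * (1.0 - specificity)


# Assumption: one classification every 30 seconds, i.e. 2,880 windows per day.
WINDOWS_PER_DAY = 24 * 60 * 2

for spec in (0.99, 0.999, 0.9999):
    alerts = expected_false_alerts(WINDOWS_PER_DAY, spec)
    print(f"specificity {spec}: {alerts:.2f} false alerts per day")
```

Under these assumed numbers, 99% specificity on individual windows would still mean roughly 29 false alerts per patient per day, which is why contextual filtering and tiered escalation sit on top of the raw classifier.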

How We Evaluate

The numbers we report and the ones we don't.

“Accuracy” alone is almost meaningless in clinical AI. A model that says “no disease” to every patient in a population where 5% have the disease is 95% accurate — and useless. We evaluate every clinical-facing model across a fuller set of measures, on held-out data the model has never seen, with subgroup analysis to catch performance gaps.
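
The arithmetic behind that example, as a short sketch. The cohort size of 1,000 is an assumption chosen to make the 5% prevalence easy to read, and the function name is ours.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# 1,000 patients, 5% prevalence, and a model that always predicts "no disease".
always_negative = confusion_metrics(tp=0, fp=0, tn=950, fn=50)
print(always_negative)
```

Accuracy comes out at 0.95 while sensitivity is 0.0: the model finds none of the 50 patients who have the disease.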

AUROC

Discrimination across thresholds

Sens./Spec.

At clinically chosen thresholds

PPV/NPV

At expected disease prevalence

ECE

Expected calibration error
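
Two of the measures above are easy to show in a few lines. The sketch below computes a binary-outcome form of expected calibration error with equal-width bins, and positive predictive value at a stated prevalence via Bayes' rule. The function names, bin count, and example inputs are illustrative assumptions; binning strategy in particular varies between evaluations.

```python
def expected_calibration_error(probs: list[float], labels: list[int], n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted mean of |observed rate - mean predicted probability|."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(observed - mean_conf)
    return ece


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value at a given prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Ten cases scored 0.7 with seven positives: well calibrated.
print(expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3))
# Ten cases scored 0.9 with five positives: overconfident.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
# Same test characteristics, two prevalence levels.
print(ppv(0.9, 0.9, 0.05), ppv(0.9, 0.9, 0.30))
```

The PPV comparison is the reason prevalence has to be stated: identical sensitivity and specificity give very different predictive values in a low-prevalence and a high-prevalence setting.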

The protocol on every Sumora model

held-out test set

A geographically and temporally separate dataset the model has never seen during training or hyperparameter tuning. No exceptions.

subgroup analysis

Performance reported separately by age band, sex, ethnicity (where available and consented), and clinical setting. Gaps trigger investigation, not silence.

prospective validation

Before clinical deployment, every model is run in shadow mode against live data at partner sites for a defined period, with disagreement cases reviewed by clinicians.

drift monitoring

Once deployed, distribution drift on inputs and predictions is monitored continuously. Models retrain on a defined cadence with new data and re-evaluation.

adverse event capture

A formal channel for clinicians to flag suspected model errors. Every report is investigated; trends inform the next training cycle.
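
As an illustration of the drift-monitoring step, the sketch below compares a training-time feature distribution with a live one using the Population Stability Index (PSI). PSI is a common choice that we use here as an assumption; this page does not specify which drift statistic Sumora uses. The bin edges, sample values, and function name are hypothetical.

```python
import math


def psi(expected: list[float], actual: list[float], edges: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples over shared bin edges."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index is the number of edges at or below v.
            index = sum(1 for e in edges if v >= e)
            counts[index] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    exp_p, act_p = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))


# Hypothetical patient ages at training time versus two live samples.
train_ages = [25, 35, 45, 55, 65, 75]
live_same = [28, 33, 47, 52, 68, 71]
live_older = [70, 72, 75, 78, 80, 85]
edges = [40, 60]

print(psi(train_ages, live_same, edges))
print(psi(train_ages, live_older, edges))
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift. That threshold is a convention from general practice, not a Sumora figure, and any alerting threshold would need to be set per feature.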

What we don't do: cherry-pick the best evaluation slice for marketing. If a model performs well on adults and poorly on adolescents, both numbers go in the report. If a competitor's published number is higher than ours on a benchmark, we say so and explain why.

Limitations

What our AI cannot do.

An honest technology page lists the boundaries as clearly as the capabilities. These aren't temporary engineering gaps — they're principled limits we hold ourselves to. Anyone claiming otherwise about clinical AI should be treated with caution.

Boundaries we name openly

/ 01 Sumora's models do not make autonomous clinical decisions. Every clinically significant output is a suggestion to a licensed clinician, not a directive.
/ 02 Performance degrades on patient populations that differ materially from training data. We name the populations our models are validated on, and decline cases that fall meaningfully outside them.
/ 03 Bisma is a triage assistant, not a diagnostic device. It is designed to help patients seek care appropriately, never to replace it.
/ 04 VRx verifies what manufacturers have provided reference data for. Medicines from manufacturers not in the network return “unable to verify”, never a false “verified”.
/ 05 SERA monitors what its sensors can measure. It cannot detect conditions that don't manifest in observable physiological signals, and it is not a substitute for clinically indicated continuous monitoring.
/ 06 No Sumora product diagnoses, treats, cures, or prevents disease as a regulatory claim. Specific clinical indications follow specific regulatory clearances; we report those clearances honestly as they are obtained.

References

Where the evidence sits.

Selected sources cited above. This is a working bibliography, not an exhaustive one — the field is large and moving. We prioritise peer-reviewed journals, regulatory body guidance, and open clinical guidelines.

// Cited above

[1] Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56.
[2] McKinney, S. M. et al. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577, 89–94.
[3] Van Calster, B. et al. (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine, 17(1), 230.
[4] Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on tabular data? NeurIPS.
[5] Hannun, A. Y. et al. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1), 65–69.
[6] Royal College of Physicians. (2017). National Early Warning Score (NEWS) 2: Standardising the assessment of acute-illness severity in the NHS.
[7] Sendelbach, S., & Funk, M. (2013). Alarm fatigue: a patient safety concern. AACN Advanced Critical Care, 24(4), 378–386.

Standards we build against

Beyond the cited literature, our development and evaluation processes draw from: FDA Good Machine Learning Practice (GMLP) guiding principles for medical device software; WHO ethics and governance of artificial intelligence for health; ISO 14971 for risk management of medical devices; HL7 FHIR for interoperability; and HIPAA, GDPR, and the UAE Dubai Health Authority data frameworks for privacy and security.

Want to see this in practice, or have a question for our team?

Talk to Sumi → hello@sumora.health

How AI actually meets medicine.