CASE STUDY · Field note · 001

Hand-written referrals to structured EHR records: a clinic NER case study (UAE)

A UAE clinic group turned hand-written referral letters into structured EHR records in 4 seconds with 99.2% NER precision - on a single on-prem A6000, never touching the public internet.

Published: May 19, 2026

Reading time: 11 min

Author: Nikita Chetverikov

Category: on-prem · case study

01 - THE BRIEF
240 letters a day, four minutes each, on paper.

The clinic group's COO described the workload in numbers: 240 referral letters per day across three sites, four to seven minutes of administrative transcription per letter, and a steady-state backlog of about ninety minutes by end of shift. The transcription error rate on a quarterly audit sample was ~3.4%. Most errors were small (an inverted date, a misread medication abbreviation); a few were consequential.

The privacy frame was set before the technical brief. The clinic operates under UAE federal health-data rules and, because of its insurance mix, also under contractual commitments to GDPR-aligned data handling for European reinsurers. The COO's rule was the same one I hear from every regulated client: "No PHI leaves the building. Including for OCR. Including for benchmarking. Including for the model vendor."

The brief, in one line: turn the paper letter into a structured EHR record, faster and more accurate than a human, without sending a single byte to anyone outside the clinic network. The constraint mirrors the one I described in the private RAG contract review case study for a Dubai and London law firm - same deny-by-default posture, different entity schema, different regulator on the other side of the table.

02 - THE DATA PROBLEM
Hand-written, multilingual, abbreviation-heavy.

The referral corpus was harder than a standard OCR benchmark in three specific ways. Each shaped the pipeline.

Hand-writing dominates. ~78% of letters were partially or fully hand-written. The remainder were templated forms with hand-written fields. Off-the-shelf OCR engines tuned for printed text were unusable; the per-character error rate on hand-written sections was above 12%.
Multilingual entities. Patient names and addresses appeared in Arabic, English, and occasionally transliterated forms of both. Medication names appeared in English brand and generic forms, sometimes with Arabic phonetic spellings. Referring-physician names followed local naming conventions with multi-part family names that fixed-template parsers handled poorly.
Abbreviation-heavy clinical language. The presenting-complaint and history sections were dense with clinical abbreviations - some standard (HTN, T2DM, COPD), some local to the referring physician. A generic clinical NER model trained on western EHR data missed roughly a quarter of the abbreviations on the first pass.
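
The mix of standard and physician-local abbreviations can be handled with a layered lookup. The sketch below is illustrative only: the physician ID, the override entry, and the function name are hypothetical, and the real system resolves abbreviations inside a fine-tuned NER model rather than a dictionary. It shows only the precedence idea: a physician-specific override wins, the standard table is the fallback, and an unknown token returns nothing so it can be sent to a human.

```python
from typing import Optional

# Standard abbreviations named in the text; expansions are common clinical usage.
STANDARD_ABBREVIATIONS = {
    "HTN": "hypertension",
    "T2DM": "type 2 diabetes mellitus",
    "COPD": "chronic obstructive pulmonary disease",
}

# Hypothetical per-physician overrides. The key and the entry are placeholders,
# not real clinic data.
PHYSICIAN_OVERRIDES = {
    "physician-a": {"ABC": "example local expansion"},
}


def expand_abbreviation(token: str, physician_id: Optional[str] = None) -> Optional[str]:
    """Return the expansion for a token, or None if it is unknown.

    A physician-specific override takes precedence over the standard table.
    None means "do not guess": the caller should route the letter to review.
    """
    key = token.strip().upper()
    overrides = PHYSICIAN_OVERRIDES.get(physician_id or "", {})
    if key in overrides:
        return overrides[key]
    return STANDARD_ABBREVIATIONS.get(key)
```

Returning `None` instead of a best guess matches the pipeline's stated preference for a routed-to-human miss over a confident-but-wrong write.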

None of these were solvable with a single off-the-shelf model. The pipeline that shipped was a cascade of three model stages (OCR, NER, de-identification) feeding a deterministic mapper, each model stage tuned against a clinician-graded reference set built from 1,800 letters across a six-month sample window.

03 - THE PIPELINE
OCR, NER, de-identification, EHR mapping.

The pipeline runs end-to-end inside the clinic network. From the moment the scanner produces a TIFF to the moment a structured record lands in the EHR staging table, no byte traverses the firewall.

Stage 1 - OCR. A fine-tuned TrOCR-family model with a clinic-specific hand-writing adapter, trained on a labelled subset of 4,200 letters. Per-character error rate on the held-out hand-writing test set: 2.1% - in line with recent transformer-based OCR benchmarks for handwritten medical prescriptions, which report CER around 1.4% on cleaner, single-domain corpora. Pages whose confidence score falls below a threshold go to a human-review queue instead of downstream processing.
Stage 2 - Clinical NER. A fine-tuned NER model running on vLLM on the A6000. The entity schema covers patient identifiers, dates, medications, conditions, procedures, allergies, referring physicians, and service-line requests. Schema-validated outputs only; anything that fails schema validation is routed to manual review.
Stage 3 - De-identification gate. A separate model pass that re-examines the free-text fields for any identifiers the NER might have missed, ensuring the audit log entry contains only the structured fields that the EHR is supposed to receive. This was added in week five after a near-miss in pilot.
Stage 4 - EHR mapping. A deterministic mapper that writes the structured record to a Postgres staging table inside the EHR. The clinic's EHR vendor consumes the staging table over its existing internal HL7 channel - no new integration surface, no new vendor in the chain.
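
The two routing rules in the stages above (low OCR confidence goes to a human, and so does anything that fails schema validation) can be sketched as a single decision function. This is a minimal illustration under stated assumptions: the threshold value, the field names, and the `Routing` type are hypothetical, and the real schema covers more entity types than the four fields shown. In the real pipeline a low-confidence page would stop before NER runs; the sketch collapses both checks into one function for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any

# Assumed value; the source says only that "a threshold" exists.
OCR_CONFIDENCE_THRESHOLD = 0.90

# Hypothetical subset of the entity schema: field name -> expected type.
REQUIRED_FIELDS = {
    "patient_id": str,
    "referral_date": date,
    "medications": list,
    "conditions": list,
}


@dataclass
class Routing:
    destination: str  # "ehr_staging" or "manual_review"
    reason: str


def validate_schema(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}")
    return errors


def route(ocr_confidence: float, record: dict[str, Any]) -> Routing:
    """Send a record to EHR staging only if both checks pass."""
    if ocr_confidence < OCR_CONFIDENCE_THRESHOLD:
        return Routing("manual_review", "low OCR confidence")
    errors = validate_schema(record)
    if errors:
        return Routing("manual_review", "; ".join(errors))
    return Routing("ehr_staging", "ok")
```

Both failure paths end in the same place, the manual-review queue, so a record is never written on a partial pass.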

Keeping the de-identification stage separate from the NER, rather than folding it into a single multi-task model, came out of the pilot. A combined model was 1.4 seconds faster but produced one PHI leak into a free-text notes field across the 1,800-letter pilot set. A separate gate that re-examines the output before write is slower and harder to break.
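
The idea of a separate gate that re-examines output before the write can be illustrated with a much simpler rule-based stand-in. To be clear about the swap: the deployed gate is a model pass, whereas the sketch below is a plain string check, and every field name in it is hypothetical. It shows only the shape of the control: an independent step that looks at the finished record and blocks the write if an identifier value has ended up in a free-text field.

```python
# Hypothetical field names; the real schema is not published in this post.
FREE_TEXT_FIELDS = ("notes", "presenting_complaint")
IDENTIFIER_FIELDS = ("patient_name", "patient_id")


def find_phi_leaks(record: dict) -> list[str]:
    """Return the free-text fields that contain a known identifier value.

    Rule-based stand-in for the model pass: it can only catch identifiers
    that the NER stage already extracted into structured fields.
    """
    identifiers = [str(record[f]).lower() for f in IDENTIFIER_FIELDS if record.get(f)]
    leaking_fields = []
    for field in FREE_TEXT_FIELDS:
        text = str(record.get(field, "")).lower()
        if any(value in text for value in identifiers):
            leaking_fields.append(field)
    return leaking_fields


def passes_gate(record: dict) -> bool:
    """True only if no free-text field leaks an identifier; otherwise block the write."""
    return not find_phi_leaks(record)
```

Because the gate runs on the finished record and shares no state with the NER stage, a failure in one does not silently disable the other, which is the property the pilot result argues for.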

04 - HARDWARE SIZING
One A6000, not a cluster.

240 letters per day across an eight-hour clinical shift is half a letter per minute on average. Not a throughput problem. The hardware question was whether a single GPU could absorb the OCR and NER stages with comfortable headroom for re-runs and the de-identification gate.

The answer was a single NVIDIA RTX A6000 48GB - same family I have argued for elsewhere when sizing on-prem inference appliances; the companion benchmarks in Apple Silicon as an inference node cover comparable sizing on a different model class. The A6000 sits in a 2U chassis in the clinic's existing server room, on the clinical-systems VLAN, with no external network route.

End-to-end latency: 3.8 s p50. Scanner output to EHR staging row. P95 is 5.2 seconds, driven by occasional OCR re-runs on low-confidence pages.

GPU utilisation: ~22%. Average across the clinical day. Headroom is deliberate - the clinic's next service-line expansion will roughly double daily letter volume.

Power draw: ~280 W under load. Single chassis, single power feed, fits in the existing server room without HVAC changes. Operational footprint matters as much as throughput for a clinical site.
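
The sizing argument can be checked with back-of-envelope arithmetic using the figures above. The sketch assumes letters arrive evenly across the shift and are processed strictly one at a time at the P95 latency, which is pessimistic on latency and optimistic on burstiness. It does not model GPU utilisation; the function names are mine.

```python
def arrival_rate_per_minute(letters_per_day: int, shift_hours: float) -> float:
    """Average letters arriving per minute, assuming an even spread over the shift."""
    return letters_per_day / (shift_hours * 60)


def serial_capacity_per_minute(latency_seconds: float) -> float:
    """Letters per minute one pipeline can finish if it handles them one at a time."""
    return 60 / latency_seconds


def headroom_factor(letters_per_day: int, shift_hours: float, latency_seconds: float) -> float:
    """How many times the serial capacity exceeds the average arrival rate."""
    capacity = serial_capacity_per_minute(latency_seconds)
    return capacity / arrival_rate_per_minute(letters_per_day, shift_hours)


if __name__ == "__main__":
    # 240 letters/day is the current volume; 480 models the expected doubling.
    for volume in (240, 480):
        factor = headroom_factor(volume, shift_hours=8, latency_seconds=5.2)
        print(f"{volume} letters/day: capacity is {factor:.1f}x the average arrival rate")
```

Under these assumptions the single pipeline keeps a wide margin even after daily volume doubles, which is consistent with sizing for one GPU rather than a cluster.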

05 - EVALUATION
Double-blind clinician review.

The eval set was built by two clinicians over three weeks: 1,800 letters from a six-month sample window, each transcribed and entity-tagged independently. Disagreements between the two clinicians were resolved by a third reviewer and added to the reference set as harder cases. The final reference covered approximately 14,000 entity instances across the schema.

The evaluation was double-blind on the production model: the clinicians grading the post-pilot outputs did not know which records were human-transcribed and which were model-generated. The results on the locked reference at handover:

NER precision: 99.2%. Across all entity types, double-blind clinician-graded. Errors concentrated in medication abbreviations local to two referring physicians; both were added to a per-physician overrides table.

NER recall: 97.6%. Misses concentrated in low-confidence OCR pages routed to manual review - by design, the pipeline prefers a routed-to-human miss over a confident-but-wrong write.

PHI leak rate: 0. Zero leaks into free-text fields after the separate de-identification gate was added. One leak in the pre-gate pilot, none in the production reference set.
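
For readers who want the definitions behind the precision and recall figures, here is a minimal scoring sketch. It assumes exact-match scoring, where a predicted entity counts only if letter, type, and text all match the reference; the post does not state which matching rule the clinicians used, and the `Entity` type is hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    letter_id: str
    entity_type: str
    text: str


def precision_recall(predicted: set[Entity], reference: set[Entity]) -> tuple[float, float]:
    """Exact-match precision and recall of predicted entities against a reference set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference entities
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

A pipeline that routes uncertain pages to a human instead of writing them will tend to score like this one: few wrong writes (high precision) at the cost of some entities it never emits (lower recall).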

For external calibration, recent JAMIA work on prompt-engineered LLM clinical NER reports relaxed F1 around 0.861 for GPT-4 and 0.901 for specialised BioClinicalBERT on similar entity-tagging tasks; the fine-tuned, schema-validated cascade here trades generality for the precision number the clinical team can defend in audit.

The precision number is the one the clinical team uses externally. The leak rate is the one the COO uses with the insurance reinsurers. Both are necessary; neither is sufficient on its own.

06 - PRIVACY POSTURE
Air-gap, audit log, named keys.

The privacy posture follows the same four-clause definition I apply to every on-prem deployment - locality, egress posture, key custody, operational independence - written up in detail in why on-premises is not cloud without internet. For this deployment, the specifics:

Locality. The A6000 chassis sits in the clinic's main-site server room, badge-controlled by the clinic's existing physical-security process. The two satellite sites send scanned letters over the clinic's private inter-site link; nothing crosses a public network.
Egress posture. Deny-by-default at the perimeter firewall. The inference VLAN has no egress route at all - model updates arrive on signed media, on a six-month cadence agreed with the clinical-systems team.
Key custody. Administrative access is held by two named clinical-systems engineers employed by the clinic. I do not retain standing access after handover; emergency access is via a sealed-envelope procedure that has not been opened.
Operational independence. The pipeline continues to serve the EHR with the upstream link severed indefinitely. A 30-day disconnection drill was run in week eight; no degradation observed.
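
A deny-by-default egress posture is easy to claim and worth verifying from inside the VLAN. The sketch below is one hypothetical way to do that and is not part of the deployment described here: it attempts an outbound TCP connection and treats failure as the expected result. The function name is mine, and the `connect` parameter exists only so the check can be exercised without a network.

```python
import socket
from typing import Callable


def egress_blocked(
    host: str,
    port: int,
    timeout_s: float = 3.0,
    connect: Callable = socket.create_connection,
) -> bool:
    """Return True if an outbound TCP connection to host:port cannot be opened.

    On a VLAN with no egress route the attempt should fail with a timeout or
    an unreachable-network error, both of which are OSError subclasses.
    """
    try:
        connection = connect((host, port), timeout=timeout_s)
    except OSError:
        return True
    # The connection succeeded, so egress is NOT blocked; clean up and report it.
    connection.close()
    return False
```

A check like this only proves that one destination is unreachable at one moment; it complements the firewall's own alerting and does not replace it.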

For the regulatory framing - particularly for the European-reinsurer-facing side of the clinic's business - the relevant background is in GDPR liability for public AI assistants. The same reasoning that makes a public assistant unworkable for student data makes a public OCR unworkable for PHI. The contrast with cloud-mediated approaches is easy to see in Stanford Medicine's published Secure GPT infrastructure for clinical generative AI, which still routes PHI through a private Azure OpenAI tenant - the right answer for an academic research workflow, the wrong answer for a clinic operating under a deny-by-default egress posture.

07 - RESULT AND NEXT STEPS
Faster than a human, more accurate, on a single GPU.

Ninety days after handover, the measured outcomes:

Time per letter: baseline 4-7 minutes of administrative transcription, post-deployment ~4 seconds of automated pipeline plus a sampled spot-check by the administrative team. End-of-day backlog eliminated.
Transcription error rate: baseline ~3.4% on quarterly audit sample, post-deployment 0.8% on continuous double-blind sampling. The reduction is concentrated in date and medication-abbreviation errors that were the most common manual failure modes.
PHI egress events: zero detected, audited weekly. The egress alerting on the perimeter firewall has fired twice in ninety days, both times for time-sync traffic covered by named exceptions.
Administrative workload: the two administrators previously dedicated to referral transcription have moved to insurance pre-authorisation work, which the clinic had been outsourcing.

The clinic is now scoping a second pipeline over discharge summaries. Harder problem - longer free text, more clinical narrative, less structured. Architecture stays the same; the eval-set work gets larger. For the operational discipline that makes the posture sustainable after handover, the closest reference is the cadence documented in the 14-day private LLM rollout.

The technical work is rarely the hard part. Naming what the clinic is actually buying - a defensible PHI-handling posture, not just faster transcription - is.

For clinical operators weighing a similar medical NER pipeline under HIPAA and GDPR constraints, the four engagement tiers documented under on-premises AI deployment services - audit, deployment, monthly support, and custom clinical automation - map onto the shape of most healthcare on-prem programmes I take on.

Nikita Chetverikov

Fullstack · Private AI
