CASE STUDY · Field note · 001

Private RAG for contract review: a law firm case study (Dubai/London)

How a Dubai/London law firm cut contract review time by 73% with a private Llama 3.1 70B RAG running on-prem over 12k binding documents - audit-grade citations, NDA-safe, no cloud.

Published: May 18, 2026
Reading time: 12 min
Author: Nikita Chetverikov
Category: on-prem · case study

01 - THE BRIEF
Four associate-days per matter, two jurisdictions, one rule.

The first call was with the managing partner and the firm's COO. The workload was described in one sentence: "Every matter starts with an associate reading the same twelve thousand documents looking for the three clauses that matter." The constraint was added in a second sentence by the COO, who had spent the previous quarter answering questions from a Dubai regulator: "Whatever you build cannot send a byte to anyone outside this office. Not for metrics, not for model updates, not for autocomplete."

The matter mix straddled DIFC and English law. The binding corpus was roughly 12,000 documents - master agreements, side letters, addenda, regulatory carve-outs, and the firm's own precedent bank going back fourteen years. Some of it was scanned, most of it was native PDF, a non-trivial fraction was Word with tracked changes still live in the file. The baseline cost per matter was four associate-days of triage before a partner could be put on the file.

The brief was not "build us an AI." The brief was: build the assistant that gives the partner a defensible first-pass digest of the binding corpus, with paragraph-level citations, and prove on a Friday afternoon that nothing left the building.

This conversation was not happening in a vacuum. By 2025, the Thomson Reuters survey of generative AI adoption in professional services put the share of law firms with GenAI in active workflows at roughly one in four, with document review and legal research as the dominant use cases. The firm wanted to join that cohort without the cloud-tenancy compromise that most of those deployments had quietly accepted.

02 - WHY NOT CLOUD RAG
The pitch deck that died in the second meeting.

Two managed-RAG vendors had already pitched the firm by the time I was brought in. Both decks had a slide titled "Private AI for legal." Both meant a managed cloud tenant with a VPN. I have written elsewhere about why this is not on-premises - the full version is in the companion piece on why on-premises is not cloud without internet, and the COO had already read it before the second meeting.

The regulatory posture behind this is not a Dubai-only concern either. The UK Information Commissioner's Office strategy on AI governance, published in 2024, makes the data-controller accountability point bluntly: liability for personal data processed through a model does not transfer to a vendor by contract. That is exactly the gap the firm's COO had been told to close before procurement could move forward.

The COO's first question to vendor two was the one that ended the conversation: "If a partner reads a clause out of your tool to a client over the phone tomorrow morning, where exactly was that clause sitting when the partner pressed Enter?" The answer involved a region, a sub-processor, and a Standard Contractual Clause. None of those are a rack unit. Under DIFC data-protection rules and the firm's London-side Article 28 obligations, the partner could not have answered the client honestly about where the clause was. The pitch did not survive lunch.

Disqualifier

The cloud RAG was not slower or worse. It was incompatible with the firm's professional-conduct posture. A solicitor who cannot answer "where is the privileged material" cannot use the tool, regardless of how good the retrieval is.

03 - ARCHITECTURE
Two H100s, one Postgres, one boundary.

The reference deployment has three zones. Inference and retrieval live inside the firm's secure office network; the front-end and identity layer live in a DMZ that only the firm's own users can reach; everything else does not exist, by network policy.

  • Inference. Two H100 80GB SXM GPUs in a single chassis, Llama 3.1 70B at BF16 served by Ollama. Single-user latency to first token is under 400 ms; sustained throughput is ~52 tokens/sec.
  • Retrieval. pgvector on PostgreSQL 16. Embeddings via a locally hosted bge-large-en-v1.5, 1024-dim, HNSW index. Citation metadata lives in the same Postgres - paragraph hash, document ID, page anchor - so an auditor can JOIN, not cross-reference two systems.
  • Application. Next.js front-end behind the firm's existing identity provider, deployed via a small k8s cluster. The control plane is in-cluster; nothing phones home.
  • Boundary. Egress policy is deny-by-default at the firewall, named exceptions only for the firm's own SMTP and time servers. Model updates arrive on signed media, on a quarterly cadence.

The deliberate choice here is that the citation database and the document database are the same Postgres instance. The cost is operational: one more thing to back up, one more thing to vacuum. The benefit is that when an auditor or opposing counsel asks "where did this paragraph come from," the answer is a deterministic query, not a cross-system reconciliation. For a contract-review tool, that is the entire product.
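To make the "deterministic query" point concrete, here is a minimal sketch of tracing an answer's citations back to document, clause, and page with a single JOIN. It uses Python's built-in sqlite3 as a stand-in so it runs without a server; the deployment itself is Postgres with pgvector. The table names, column names, and sample rows are hypothetical, not the firm's schema.

```python
import sqlite3

# In-memory SQLite stands in for the production Postgres so the sketch runs anywhere.
# All table and column names below are hypothetical, chosen for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,
    title  TEXT NOT NULL
);
CREATE TABLE paragraphs (
    para_hash TEXT PRIMARY KEY,
    doc_id    TEXT NOT NULL REFERENCES documents(doc_id),
    clause_no TEXT NOT NULL,
    page      INTEGER NOT NULL,
    body      TEXT NOT NULL
);
CREATE TABLE citations (
    answer_id TEXT NOT NULL,
    para_hash TEXT NOT NULL REFERENCES paragraphs(para_hash)
);
""")

# Made-up sample rows: one document, one paragraph, one citation.
conn.execute("INSERT INTO documents VALUES ('MA-0042', 'Master Agreement (example)')")
conn.execute(
    "INSERT INTO paragraphs VALUES ('h1', 'MA-0042', '14.2', 37, 'Force majeure carve-out text')"
)
conn.execute("INSERT INTO citations VALUES ('ans-1', 'h1')")


def trace_citations(conn, answer_id):
    """Return (doc_id, title, clause_no, page, para_hash) for every citation of an answer."""
    return conn.execute(
        """
        SELECT d.doc_id, d.title, p.clause_no, p.page, p.para_hash
        FROM citations c
        JOIN paragraphs p ON p.para_hash = c.para_hash
        JOIN documents d  ON d.doc_id = p.doc_id
        WHERE c.answer_id = ?
        ORDER BY d.doc_id, p.page
        """,
        (answer_id,),
    ).fetchall()


rows = trace_citations(conn, "ans-1")
```

Because citations and paragraphs share one database, the provenance of an answer is whatever this query returns; there is no second system whose state could disagree.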

The H100 sizing is right for a 12,000-document binding corpus with ten-to-fifteen concurrent users. Smaller firms reach the same architectural posture on commodity silicon - the trade-offs for Apple Silicon inference nodes on M4 Max and M3 Ultra are written up separately, and the same retrieval and citation logic applies regardless of what sits underneath.
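The two-card count in the section heading follows from simple arithmetic on the weights alone. The sketch below is a lower bound under stated assumptions: roughly 70 billion parameters, 2 bytes per parameter at BF16, and 80 GB per card. It ignores KV cache, activations, and runtime overhead, which is why the remaining headroom matters for ten-to-fifteen concurrent users. The function names are illustrative.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, bytes_per_param: int, gpu_memory_gb: float) -> int:
    """Smallest number of cards whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_memory_gb)


weights_gb = weight_memory_gb(70, 2)                    # BF16 stores 2 bytes per parameter
gpus = min_gpus_for_weights(70, 2, 80)                  # H100 80GB cards
headroom_gb = gpus * 80 - weights_gb                    # left for KV cache and overhead
```

Under these assumptions the weights take about 140 GB, so one 80 GB card cannot hold them and two cards leave about 20 GB for everything else.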

04 - INGESTION AND CHUNKING
Clause-aware splitting beats fixed-window.

The first pass of the corpus used a 1,024-token fixed-window splitter. The retrieval metrics on the partner-graded eval set were unembarrassing but not good - recall at 5 was 0.71. The failure mode was predictable: a clause that crossed a chunk boundary lost both halves. A force-majeure carve-out split across two windows is two useless windows.

The fix was a clause-aware splitter built on top of the firm's own document conventions. Master agreements follow a numbered-clause structure; side letters reference clauses by number; addenda are short and self-contained. The splitter walks the document tree using the numbering as anchors and falls back to semantic boundaries (sentence + heading detection) for free-form prose. Each chunk carries its parent clause number, document ID, and page range in metadata.
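The idea can be sketched in a few lines. This is a simplified illustration, not the firm's splitter: it anchors on numbered clause headings with a regular expression, falls back to blank-line paragraphs instead of the sentence and heading detection described above, and omits page-range tracking. The names `Chunk`, `CLAUSE_RE`, and `split_by_clause` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# A numbered clause heading at the start of a line, e.g. "14.", "14.2", "14.2.1".
# Limitation: any line that starts with a number (such as a year) would also match.
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S", re.MULTILINE)


@dataclass
class Chunk:
    doc_id: str
    clause_no: Optional[str]  # None for preamble or free-form prose
    text: str


def split_by_clause(doc_id: str, text: str) -> List[Chunk]:
    """Split on clause numbering; fall back to blank-line paragraphs if none is found."""
    matches = list(CLAUSE_RE.finditer(text))
    if not matches:
        paragraphs = re.split(r"\n\s*\n", text)
        return [Chunk(doc_id, None, p.strip()) for p in paragraphs if p.strip()]

    chunks: List[Chunk] = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        chunks.append(Chunk(doc_id, None, preamble))
    for i, m in enumerate(matches):
        # Each clause runs from its own heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(Chunk(doc_id, m.group(1), text[m.start():end].strip()))
    return chunks


sample = """MASTER AGREEMENT

14. Force Majeure
Neither party is liable for delay caused by events beyond its control.

14.1 Carve-out
Payment obligations are not suspended.

15. Termination
Either party may terminate on thirty days' notice.
"""
chunks = split_by_clause("MA-0042", sample)
```

The property that matters is that a clause is never cut in the middle: chunk boundaries fall only on clause headings, so a carve-out stays in one piece together with its clause number.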

With clause-aware chunking, recall at 5 moved to 0.93 on the same eval set. The remaining 7% was concentrated in scanned addenda where OCR errors corrupted the clause numbering. That subset gets routed to a manual review queue rather than the model - a partner reviewing thirty OCR-flagged paragraphs is a better outcome than the model hallucinating a clause that never existed.

"The model is allowed to be wrong. The citation is not."

- Partner sign-off rule, written into the system prompt.

05 - EVALUATION
Partner-graded, not vendor-graded.

The evaluation harness was built before the model was. Three partners and two senior associates spent a day writing 180 questions across the corpus, each with the expected clause cited by document and paragraph. The harness measures three things: retrieval recall at 5, answer correctness against the reference, and citation faithfulness - whether every claim in the model's answer is supported by a retrieved chunk.
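Two of the three measures reduce to simple counting once the data is in the right shape. The sketch below is an illustration, not the firm's harness. It assumes each question has one expected paragraph ID and a ranked list of retrieved IDs, and it models a claim as "supported" when the paragraph it cites is in the retrieved set. Real support checking needs more than set membership, and answer correctness is human-graded, so it is not shown. The function names and sample IDs are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[List[str]], expected: Sequence[str], k: int = 5) -> float:
    """Fraction of questions whose expected paragraph appears in the top-k retrieved IDs."""
    hits = sum(1 for ranked, gold in zip(retrieved, expected) if gold in ranked[:k])
    return hits / len(expected)


def citation_faithfulness(claims: Sequence[Tuple[str, Set[str]]]) -> float:
    """Fraction of claims whose cited paragraph ID is in that answer's retrieved set.

    Each claim is (cited_id, retrieved_ids). This is a simplification: it checks
    that the citation exists in the retrieval set, not that the text supports the claim.
    """
    supported = sum(1 for cited, retrieved_ids in claims if cited in retrieved_ids)
    return supported / len(claims)


# Made-up data: three questions, two claims.
retrieved = [["p1", "p2", "p3"], ["p9", "p4"], ["p7"]]
expected = ["p2", "p5", "p7"]
recall = recall_at_k(retrieved, expected, k=5)          # 2 of 3 questions hit

claims = [("p2", {"p1", "p2"}), ("p8", {"p1"})]
faithfulness = citation_faithfulness(claims)            # 1 of 2 claims supported
```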

The final numbers on the locked eval set at handover:

  • Recall@5: 0.93. The right clause was in the top five retrieved chunks in 93% of cases. The remaining 7% was OCR-flagged and routed to manual review.
  • Answer correctness: 0.88. Graded by the two senior associates, blind to which answers were model-generated. The baseline for first-draft associate answers on the same set was 0.79.
  • Citation faithfulness: 0.997. 99.7% of claims in the model's answers map to a retrieved paragraph. The non-faithful 0.3% was caught by a post-hoc verifier that re-checks every emitted citation against the retrieval set.

The citation faithfulness number is the one that mattered for sign-off. Recall and correctness are nice-to-haves; a hallucinated citation in a partner's first-pass digest is a professional-conduct incident - and that is not an abstract worry, given the Stanford RegLab audit of commercial legal-AI tools that found 17-33% hallucination rates on legal queries. The verifier is a small classifier that runs after every answer; it rejected 11 of the 180 reference answers on the first pass and forced a regeneration. That cost ~2 seconds per regeneration and zero professional incidents.
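The reject-and-regenerate loop can be sketched as follows. This shows only the mechanical part of the check, re-testing every emitted citation against the retrieval set; the classifier itself is not reproduced here. The attempt cap and the fall-through to manual review are assumptions of this sketch, and `generate` is a hypothetical callable standing in for the model call.

```python
from typing import Callable, List, Optional, Set, Tuple

Answer = Tuple[str, List[str]]  # (answer text, cited paragraph hashes)


def unsupported_citations(cited: List[str], retrieved: Set[str]) -> List[str]:
    """Citations the model emitted that are not in the retrieval set."""
    return [h for h in cited if h not in retrieved]


def answer_with_verification(
    generate: Callable[[int], Answer],
    retrieved: Set[str],
    max_attempts: int = 3,  # assumed cap, not taken from the deployment
) -> Optional[Tuple[str, List[str], int]]:
    """Regenerate until every citation is in the retrieval set.

    Returns (text, cited, attempt_index), or None when no attempt passes,
    in which case the caller would route the question to manual review.
    """
    for attempt in range(max_attempts):
        text, cited = generate(attempt)
        # An answer with no citations at all is rejected as well.
        if cited and not unsupported_citations(cited, retrieved):
            return text, cited, attempt
    return None


def fake_generate(attempt: int) -> Answer:
    # Stand-in for the model: the first attempt cites a paragraph that was never retrieved.
    return ("draft", ["ghost"]) if attempt == 0 else ("final", ["h1"])


result = answer_with_verification(fake_generate, {"h1", "h2"})
```

The design choice is that a bad citation never reaches the reader: the answer is either regenerated until it verifies or withheld.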

06 - RESULT
Four days down to ninety minutes.

Sixty days after handover, the measured outcomes on the firm's own matter log:

  • Contract review time per matter: baseline four associate-days (~32 hours), post-deployment ~90 minutes of partner review over a model-prepared digest. A 73% reduction in total review time, normalised to matters of comparable complexity.
  • Partner sign-off latency: baseline median 6.5 business days from intake to partner sign-off, post-deployment 1.5 business days.
  • Outbound bytes from the inference network: zero detected, audited weekly. Egress alerts on the firewall fired three times in 60 days - all three were time-sync requests to the firm's own NTP appliance, which is a named exception.
  • Citation incidents: zero. Two near-misses caught by the post-hoc verifier in week three; both were upstream chunking issues, both fixed.

The number partners cite externally is the 73%. The number that closed the original procurement question - and the reason the firm's London office is now scoping a second deployment - is the third one: zero outbound bytes. A solicitor who can answer "where is the privileged material" with a rack unit, not a region, has a different conversation with regulators than a solicitor who cannot.

07 - WHAT WE WOULD NOT REPEAT
Two honest mistakes, written down.

Every deployment has a list. This one had two items worth writing down.

One: the first eval set was too small. 180 questions felt comfortable in week two and turned out to be thin by week four. Two clause categories - jurisdictional carve-outs and termination triggers - were under-represented, and the model's recall on those was worse than the headline number suggested. The fix was an additional 120 questions, written by associates specifically targeting those two categories. The next deployment will start with at least 400 graded questions before any tuning runs.

Two: we underestimated the OCR tail. The 7% of eval questions that fell out of the recall number all traced back to scanned addenda from the firm's older matters, where the clause numbering was corrupted by OCR. We had budgeted two days for OCR cleanup and spent six. The next deployment with a comparable corpus will budget two weeks and treat OCR as a first-class workstream, not a preprocessing step.

Everything else - the H100 sizing, the pgvector choice, the clause-aware splitter, the post-hoc citation verifier - would be repeated as-is. For the broader compliance framing behind the cloud disqualification, the companion notes are on-prem versus cloud without internet and GDPR liability for public AI assistants. For the rollout cadence and the operational discipline that maintains the posture after handover, the closest reference is the 14-day private LLM rollout.

For the cross-industry view of the same on-prem posture in a different regulated vertical, the parallel private medical-NER deployment over EHR referral letters covers the case where the regulator is a health authority rather than a financial-services watchdog, and the corpus is clinical notes rather than master agreements - the architectural answer is structurally identical.

For firms with a comparable binding corpus and a similar professional-conduct posture, the same on-prem private LLM stack - Ollama-served Llama 3.1 70B, pgvector RAG over binding documents, network-isolated inference - is what a turn-key private AI deployment for law firms delivers, with audit-grade paragraph-level citations and post-handover support baked into the engagement.

Nikita Chetverikov
Fullstack · Private AI