
ARCHITECTURE · Field note · 001

A private LLM for a research lab: notes from a 14-day rollout

Field notes from a 14-day on-premises private LLM deployment for a 28-person genomics lab: Mac Studio M3 Ultra running Llama 3.3 70B via Ollama, AnythingLLM RAG over 2,400 unpublished documents, pfSense deny-by-default egress, and zero outbound bytes after handover - with the hardware trade-offs, GDPR and grant-compliance framing, and the three things that broke.

Published: Apr 28, 2026

Reading time: 14 min

Author: Nikita Chetverikov

Category: on-prem · architecture

01 - THE BRIEF

Twenty-eight researchers, unpublished data, one rule.

The kick-off call lasted forty minutes. The lab director described the workload in two sentences: "We want to ask questions over our own corpus - preprints, lab notes, grant drafts. We need it to be useful enough that people stop using ChatGPT for it." The compliance officer, dialled in from another building, added the constraint that turned this from a productivity project into an architecture project: "Nothing leaves the network. Including telemetry. Including model updates. Including the question of whether the model exists."

The corpus was substantial. Roughly 2,400 unpublished documents - preprints under embargo, internal review drafts, grant proposals at various stages - plus seven years of meeting notes and supervision records. Three live grant submissions were due within sixty days, and the lab director estimated that a working AI assistant over this corpus would save each researcher one to two hours per day on literature triage and draft review.

The funding body's data-handling clause was explicit: research outputs and any data derived from research outputs could not be processed by third-party AI services without a written contract reviewed by the funder. No such contract existed for any cloud AI provider available in the institution's procurement catalogue. The cloud was not slow. The cloud was not expensive. The cloud was simply not an option.

Project frame

The brief was not "build us an AI." The brief was "build us the AI that the funding body, the DPO, and the senior researchers will all sign off on, in a single fortnight, with no detectable outbound traffic after handover."

02 - WHY CLOUD WAS DISQUALIFIED

Three questions, twenty minutes.

The lab's IT lead arrived at the second meeting with a pre-built spreadsheet comparing five cloud AI vendors. The compliance officer read the spreadsheet and asked three questions. None of the vendors survived the third.

1. Where, physically, does an embedded preprint live the moment after upload? All five vendors answered with a region (EU, EU+US, EU-only with replication). None could name a rack unit, a datacenter aisle, or a contractual guarantee that the data would not be replicated for redundancy purposes.
2. Who has root on the inference machine? Three vendors answered "we do, with audit logging." Two answered with a shared-responsibility model. None answered with "your sysadmin, with these named keys, after handover."
3. If the vendor's enterprise contract terminates - bankruptcy, acquisition, breach of grant conditions - what happens to the corpus on Monday morning? All five answered with a contractual data-deletion clause. None could demonstrate that the deletion was technically enforceable across all replication zones, sub-processors, and backup tiers.

The third question was the disqualifier. A grant-funded research output cannot live in a system whose continuity depends on a contractual clause with an external party. The funding body's audit framework requires the institution to prove, technically, that the data is recoverable and controlled regardless of vendor status. No cloud vendor in the catalogue could meet that standard for an unpublished corpus. The lab's stance was not an isolated one: it sits inside the wider institutional shift toward technology sovereignty for AI workloads that Deloitte's 2026 TMT predictions describe, and by the end of 2025 it had become the default posture for European public-sector research.

The framing was not unique to research labs either: the same Article 28 processor problem and DPIA gap show up in the GDPR exposure of public AI assistants in higher education, where the institution carries the controller liability and the cloud vendor's standard terms simply do not absorb it. The decision to go on-premises was made before the end of the call. The remaining work was to confirm the hardware budget, define the architecture, and schedule the deployment.

03 - THE HARDWARE DECISION

Mac Studio M3 Ultra, with the trade-offs named.

The shortlist was two machines: a Mac Studio M3 Ultra with 192 GB unified memory, or a Linux workstation with two NVIDIA RTX 6000 Ada cards (96 GB total VRAM). Both could run a 70-billion-parameter model at usable speeds. The decision came down to three operational factors that had nothing to do with raw throughput.

Mac Studio M3 Ultra - 192 GB unified memory. 800 GB/s memory bandwidth. Single machine, single power cable, sits on a desk. ~95-115 W under inference load. €7,200 inc. tax.

2× RTX 6000 Ada - 96 GB VRAM total. Per-card 960 GB/s, but PCIe limits inter-card throughput. Workstation chassis, dedicated cooling, ~700 W under load. ~€14,000 inc. tax.

Throughput on 70B Q4_K_M - ~28 tok/s (Mac Studio) vs ~38 tok/s (2× RTX 6000 Ada), single-user. The Mac is slower in raw terms, faster in setup, lower in power, and quieter by an order of magnitude.
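The power gap is easier to weigh as energy per year. The sketch below is a back-of-envelope calculation using the wattage figures quoted above; the 30% duty cycle is an assumption for illustration, not a measured value, and idle draw is ignored.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float, duty_cycle: float = 1.0) -> float:
    """Energy in kWh for a device drawing `watts` for `duty_cycle` of the year."""
    return watts * HOURS_PER_YEAR * duty_cycle / 1000.0


# Wattage from the comparison above; the duty cycle is an assumed figure.
ASSUMED_DUTY_CYCLE = 0.3
mac_kwh = annual_kwh(105, ASSUMED_DUTY_CYCLE)  # midpoint of ~95-115 W
gpu_kwh = annual_kwh(700, ASSUMED_DUTY_CYCLE)

print(f"Mac Studio:       {mac_kwh:.0f} kWh/year")
print(f"2x RTX 6000 Ada:  {gpu_kwh:.0f} kWh/year")
print(f"Ratio:            {gpu_kwh / mac_kwh:.1f}x")
```

The ratio does not depend on the duty cycle chosen, only on the two load figures, so the comparison holds even if the assumed 30% is well off.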

The lab chose the Mac Studio. The decision rationale, written into the procurement memo:

Operational simplicity. The lab has no dedicated systems engineer. A Mac Studio integrates into existing university IT processes - same patching, same MDM, same physical-security posture as any other Mac on the network. The Linux workstation would have required a bespoke maintenance plan.
Acoustic and thermal envelope. The machine sits in a shared researcher office. The Linux workstation under load is audible across the room and produces enough heat to require local extraction. The Mac Studio is silent and stays cool.
Throughput sufficient for the team. Twenty-eight researchers, of whom at most eight are active concurrent users. At 5-6 tokens per second per user under that load, queries return inside ten seconds for short prompts. Fast enough that nobody falls back to ChatGPT.

The faster hardware, in this case, was the wrong hardware. The right hardware was the one that fit into the existing operational regime without requiring a new one. The numbers behind the choice - tokens per second on quantised 70B weights, memory bandwidth, sustained thermal envelope - line up with the open community benchmarks of llama.cpp on Apple Silicon M-series and with our own write-up of honest Apple Silicon inference benchmarks for M4 Max and M3 Ultra, which is the more granular companion to this section.
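For anyone reproducing the throughput figure, a minimal sketch follows. It assumes an Ollama server on its default local port and relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns from a non-streaming `/api/generate` call. The model tag and the prompt are placeholders, and this is not necessarily how the numbers in this note were collected.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(metrics: dict) -> float:
    """Generation speed from Ollama's response metrics (eval_duration is in ns)."""
    return metrics["eval_count"] / (metrics["eval_duration"] / 1e9)


def measure(model: str, prompt: str, url: str = OLLAMA_URL) -> float:
    """Send one non-streaming request and return generated tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return tokens_per_second(json.load(response))
```

On the inference node this would be called as, for example, `measure("llama3.3:70b", "Summarise the role of CRISPR in one paragraph.")`. Repeating the call a few times and discarding the first run keeps model load time out of the figure.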

04 - THE STACK

Ollama, AnythingLLM, RAG, audit log.

The stack was deliberately boring. Every component was open-source, deployable from the local network, and had at least two viable maintenance paths if the upstream project stagnated. No SaaS components. No license servers. No telemetry endpoints in the default configuration. The component choices mirror the ones documented in Simon Willison's ongoing field notes on running local LLMs on your own hardware - the same Ollama-plus-front-end-plus-local-vector-store shape that has become the default for privacy-constrained teams across the EU and the CIS.

REF / 003 - Reference deployment for the genomics lab. Corpus NAS pre-existed; everything inside Zone B was new. The audit log is on a separate machine to make tampering harder, not impossible - perfect tamper-resistance was out of scope.

Inference: Ollama. Llama 3.3 70B at Q4_K_M quantization, plus a smaller 8B model for fast tasks where the quality drop was acceptable. Both models loaded into unified memory on startup; no model swapping during normal use.
Workspace and RAG: AnythingLLM. Per-researcher workspaces with shared corpus access. Built-in document ingestion and retrieval, no separate vector database to manage. Embedding model: bge-large-en-v1.5 for English, with a Russian variant available for the Slavic-language preprints.
Vector store: LanceDB. Embedded in AnythingLLM, file-based, local. No separate database server, no network port, no operational surface area beyond the file system.
Audit log: a small Linux box. Separate machine, write-only network share, immutable filesystem. Every prompt and response logged with timestamp, user, model, and a hash of the documents retrieved. Not glamorous; absolutely required.

05 - NETWORK ISOLATION

Deny by default. Logged. Quarterly review.

The compliance officer's hard requirement was that the inference node had zero outbound connectivity after handover. The university's network team configured a dedicated VLAN with deny-by-default egress at the perimeter. Any connection attempt the inference node made to a non-internal address would be logged as an alert.

Two specific allowances were made, both internal-only: NTP for clock sync (against an internal time server, not a public NTP pool), and inbound HTTPS from the lab VLAN for researcher access. Everything else was blocked. The Ollama model server was configured with telemetry disabled at startup. AnythingLLM was configured with all external integrations disabled and its auto-update channel pointed at localhost, so that update checks fail by design.

Field note

The first time the firewall blocked an outbound connection from the inference node, it was AnythingLLM trying to phone home for a "telemetry opt-in check" - despite telemetry being explicitly disabled in the config. The connection was logged, blocked, and the alert was the first proof the isolation worked.
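Isolation of this kind can also be checked from the inside. The sketch below is a hypothetical self-test, not part of the deployed tooling: it tries to open TCP connections from the inference node to a few public addresses and reports any that succeed. The target list is an assumption, and literal IP addresses are used so that no DNS lookup is needed.

```python
import socket

# Hypothetical probe targets: well-known public IPs, chosen only for illustration.
TARGETS = [("1.1.1.1", 443), ("8.8.8.8", 53)]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable networks
        return False


def egress_leaks(targets, probe=can_connect):
    """Return the targets that were reachable; an empty list means no leak found."""
    return [(host, port) for host, port in targets if probe(host, port)]
```

Running `egress_leaks(TARGETS)` on the node should return an empty list. Under an alert-on-block policy each probe should also show up in the firewall log, which gives a second, independent confirmation that the rule is live.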

06 - DAY-BY-DAY TIMELINE

What actually happened in fourteen days.

The deployment was scheduled across three weeks of calendar time, with fourteen working days of engagement. The timeline below is what occurred in practice - not what was on the Gantt chart, which had to be revised on day five.

Day 1-2 - Scoping and procurement. Confirmed brief. Ordered Mac Studio M3 Ultra 192 GB. Specified the audit-log appliance. Drafted DPIA addendum referencing the institution's existing AI policy.
Day 3 - Hardware arrival, racking. Mac Studio physically installed in the lab's secured server cabinet. Audit-log appliance racked alongside. VLAN provisioned by the university IT team.
Day 4-5 - Operating system, Ollama, base configuration. macOS configured for unattended operation. Ollama installed and configured. First test of Llama 3.3 70B at single-user load. Tokens-per-second measured and logged.
Day 6 - AnythingLLM installation, workspace structure. AnythingLLM installed and configured. Per-researcher workspaces created. Shared corpus workspace created with read-only access from all user accounts.
Day 7-8 - Corpus ingestion, embedding model selection. Initial ingestion of 2,400 documents. First embedding model (smaller, faster) produced retrieval that was too imprecise for the lab director's test queries. Switched to bge-large-en-v1.5 on day 8. Re-embedding took six hours.
Day 9 - Network isolation, firewall rules, audit log. pfSense rules deployed. First blocked outbound connection logged. Audit log appliance receiving prompt and response records.
Day 10 - Backup configuration, disaster-recovery test. Encrypted backups configured to a second internal NAS. Restoration test passed. Note in handover: backup script needs more aggressive monitoring (this proved prescient - see "What broke").
Day 11 - User accounts, role-based access, MDM integration. Twenty-eight researcher accounts provisioned. Two admin accounts (lab director, IT lead). Mac Studio enrolled in university MDM for patching.
Day 12 - First training session. Ninety-minute session with the senior researchers. Query patterns, retrieval quality expectations, what not to ask. First questions surfaced: a permission edge case in AnythingLLM that required a configuration change (see "What broke").
Day 13 - Second training session, junior researchers. Ninety-minute session for the rest of the lab. Different question set: more "how do I" than "what is." Documentation written based on the actual questions asked.
Day 14 - Handover, documentation pack, sign-off. Final walkthrough with lab director, IT lead, and DPO. Documentation pack delivered: architecture, runbooks, escalation procedures, DPIA addendum, first-90-days operational checklist. Sign-off obtained.

Days five and eight contained the two unplanned days of work that compressed the rest of the schedule. The lesson, written up in the post-project review: budget two extra days inside any fourteen-day deployment specifically for embedding-model and permission issues, because these are the two things that consistently take longer than estimated.
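One way to surface an embedding problem before a full ingestion is to score candidate models against a handful of real questions with known relevant documents. The sketch below computes a simple hit rate at k. The `retrieve` callable is hypothetical and stands in for whatever wraps the candidate embedding model and vector store; the test cases would come from the lab's own queries.

```python
from typing import Callable, Sequence


def hit_rate_at_k(
    cases: Sequence[tuple[str, set[str]]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant document in the top k.

    `cases` pairs each question with the IDs of documents a domain expert
    considers relevant. `retrieve(question, k)` returns ranked document IDs.
    """
    if not cases:
        raise ValueError("at least one test case is required")
    hits = 0
    for question, relevant in cases:
        top_k = retrieve(question, k)[:k]
        if any(doc_id in relevant for doc_id in top_k):
            hits += 1
    return hits / len(cases)
```

A few dozen questions are usually enough to separate a model that ranks an unrelated subdomain first from one that does not, and the same cases can be rerun after any later change to chunking or embedding settings.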

07 - WHAT BROKE

Three problems, three fixes, named.

No deployment is clean. The three problems below were the ones that mattered - caught and fixed inside the fourteen days, but worth documenting because they will recur on similar deployments.

Embedding model mismatch. The first embedding model selected (a smaller multilingual model) produced retrieval that ranked irrelevant documents above relevant ones for the lab director's test queries. Symptom: queries about a specific gene returned papers from an unrelated subdomain. Diagnosis: the smaller model was not trained on enough scientific text for the corpus. Fix: switched to bge-large-en-v1.5 on day 8, re-embedded 2,400 documents over six hours, retrieval quality became acceptable. Lesson: test embedding quality on real corpus questions before committing.
AnythingLLM permission edge case. A researcher discovered, during the day-12 training, that they could see workspace metadata for workspaces they were not assigned to. Not the documents themselves, but the workspace names - which in some cases were the names of unpublished projects. Diagnosis: a default in AnythingLLM that exposed workspace listings to all authenticated users. Fix: configuration change on day 12, plus a follow-up with the AnythingLLM project to confirm the default in the next release.
Backup script failed silently for two days. The encrypted backup script ran successfully on day 10 (during the disaster-recovery test). On days 11 and 12, the script reported success but produced empty backup files due to a path issue introduced when the corpus directory was moved during ingestion. Caught on day 13 by the handover checklist, which included a manual verification of backup file size. Fix: corrected the path, added a size check to the script, and configured it to raise an alert if the backup file falls below a sanity-check threshold. Lesson: a backup that reports success is not the same as a backup that contains data.
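The size check from the third fix is small enough to show. This is a minimal sketch rather than the script that was deployed: the exception name and the idea of passing the threshold as a parameter are assumptions, and the threshold itself would be set from the size of a known-good backup.

```python
import os


class BackupCheckError(RuntimeError):
    """Raised when a backup file is missing or implausibly small."""


def verify_backup(path: str, min_bytes: int) -> int:
    """Return the backup size in bytes, or raise if it fails the sanity check."""
    if not os.path.isfile(path):
        raise BackupCheckError(f"backup missing: {path}")
    size = os.path.getsize(path)
    if size < min_bytes:
        raise BackupCheckError(
            f"backup too small: {path} is {size} bytes, expected at least {min_bytes}"
        )
    return size
```

Raising an exception rather than returning a flag means a cron job or wrapper script exits non-zero on failure, which is what turns a silent failure into a loud one.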

"The point of the handover checklist is not to verify that the system works. The point is to verify that the system fails loudly when it stops working."

- Field journal, day 13

08 - TRAINING THE TEAM

Two sessions, ninety minutes each.

Training was deliberately split into two ninety-minute sessions, by seniority rather than role. The senior researchers (PI, post-docs) attended the first; the junior researchers (PhD students, technicians) the second. The sessions had different content because the questions were different.

Session 1 - Senior researchers

Coverage: What the system can and cannot do. How retrieval works. Why some queries return nothing and what to change. How to interpret confidence in model output. What to escalate to the IT lead.
Questions asked: "Can it write the introduction for me?" "How do I know if it is making something up?" "What happens if I paste in a draft from a colleague?" The third question opened a useful conversation about workspace isolation that changed the workspace structure for two ongoing collaborations.

Session 2 - Junior researchers

Coverage: How to log in. How to upload a document to your workspace. How to ask a question that returns a useful answer. How to verify the citations the system produces. Practical examples from the lab's actual workflow, drafted with the lab director the previous week.
Questions asked: "Is this faster than ChatGPT?" "Can I use it on my phone?" "What if I forget my password?" Concrete, operational, no philosophical concerns. Documentation was written immediately afterward, mirroring the questions verbatim.

The senior session generated requirements for the system. The junior session generated documentation. Both were necessary; the second was where most of the lasting value of the deployment came from, because it converted a piece of infrastructure into a tool the team would actually use.

09 - NUMBERS AFTER 60 DAYS

Usage, latency, and zero detected egress.

Sixty days after handover, the lab IT lead exported the audit log, ran the standard reporting queries, and shared the numbers. The summary below is reproduced with permission and with one identifying detail redacted.

Active researchers - 26 of 28. Two researchers had not used the system at all by day 60. Both were on extended leave. Adoption among present researchers: 100%.

Daily prompts - ~340 median. Median across active days. Range 180-560 depending on the day of the week. Highest days were Tuesdays and Wednesdays.

Median latency - 3.8 seconds. From prompt submission to first token. P95 at 9.1 seconds. Acceptable for the workload; nobody fell back to ChatGPT.

Detected outbound bytes - 0 after handover. Across 60 days, with deny-default firewall and alert-on-block. Two blocked attempts in the first 48 hours, both during initial configuration; none since.

Storage growth - 38 GB / 60 days. Audit log plus new corpus documents plus vector store delta. Well within the planned capacity.

Operational tickets - 7 total. Three password resets, two RAG quality questions, one MDM patch issue, one researcher requesting a new workspace. None required vendor intervention.
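Figures like these can be derived from an exported audit log with a few lines of code. The sketch below is illustrative rather than the lab's actual reporting queries. It assumes each record carries an ISO `timestamp`, a `user`, and a `first_token_seconds` latency field; the latency field in particular is an assumption, since it is not among the audit fields listed earlier.

```python
import math
from collections import Counter
from statistics import median


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(records: list[dict]) -> dict:
    """Usage and latency summary over exported audit records."""
    prompts_per_day = Counter(r["timestamp"][:10] for r in records)  # YYYY-MM-DD prefix
    latencies = [r["first_token_seconds"] for r in records]
    return {
        "active_users": len({r["user"] for r in records}),
        "median_daily_prompts": median(prompts_per_day.values()),
        "median_latency": median(latencies),
        "p95_latency": percentile(latencies, 95),
    }
```

Days with no prompts never appear in the counter, so the daily median is taken across active days only, matching the definition used in the summary.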

The most consequential number is the zero: no detected outbound bytes after handover. The grant compliance audit, scheduled for the eighth week, took forty minutes - most of which was the auditor reading the architecture document and the audit log queries. The institution's DPO, who had previously written three memos about the lab's ChatGPT usage, closed the file and reassigned the open risk to "monitor."

That last sentence, more than the throughput numbers, is what the deployment was for. The rest is engineering. The hard part was naming what the lab was actually buying - not faster literature review, but the institutional posture that lets the lab use AI on unpublished data without a memo from the funding body, the DPO, or the news desk. A near-identical posture, with a different corpus and a different regulator, is documented in a parallel private RAG deployment in a Dubai/London law firm - same on-prem private LLM shape, different industry constraints.

Research groups, clinics, finance desks and accounting teams that want this same on-prem private LLM posture without rebuilding it from scratch - the Mac Studio M3 Ultra 192 GB chassis, Llama 3.3 70B served via Ollama, AnythingLLM workspaces with private RAG, deny-by-default egress and the DPO-ready audit pack - usually start from the pre-configured Sanctum Mac Studio appliance, sized for 8-12 concurrent researchers and delivered with the GDPR and EU AI Act handover documentation already written.

