
INFRA · Field note · 001

Apple Silicon as an inference node: M4 Max & M3 Ultra, honest digits

Benchmarks for 70B models on M4 Max and M3 Ultra. Why Apple is betting on local inference - and what the token economics tell us about the future.

Published: May 7, 2026

Reading time: 16 min

Author: Nikita Chetverikov

Category: on-prem · infra

01 - THESIS · Local inference is not a hobby project anymore.

Two years ago, running a 70-billion-parameter model on a desktop machine was a party trick. You showed it to colleagues, they nodded politely, and everyone went back to the API. The latency was bad, the quantization artifacts were visible, and the hardware was hard to justify against a monthly cloud bill.

That calculation has flipped. Cloud inference is getting more expensive, not less. Rate limits tighten every quarter. And Apple - quietly, deliberately - has built hardware that makes local inference not just possible but economically rational for small teams.

This article covers two machines: the Mac Studio M4 Max with 128 GB unified memory, and the Mac Studio M3 Ultra with 192 GB. Benchmarks, concurrent-user numbers, the ceilings nobody publishes. First the context though - the hardware story only makes sense inside the economic one.

Heuristic

If your inference bill is growing faster than your user count, the model is not the problem. The delivery mechanism is.

02 - TOKEN ECONOMICS CRISIS · Cloud inference does not scale profitably.

Here is the arithmetic nobody in the API business likes to discuss publicly. Serving a 70B model in the cloud requires either a cluster of A100s or a pair of H100s. The hardware amortizes at roughly $2.50-3.00 per GPU-hour. At typical throughput, that translates to a hard floor on cost per token - a floor that sits uncomfortably close to the price most providers charge.

Cloud cost: $3-15 per 1M output tokens. Depending on provider and model size. The range itself tells you the market has not found equilibrium.

GPU-hour floor: $2.50 amortized. Per A100 GPU-hour at datacenter scale, before networking, cooling, and staff.

Margin at current pricing: ~5-12%. For 70B-class models. Negative for some providers subsidizing adoption.
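The floor is simple arithmetic once you pick a throughput. A minimal sketch, assuming a hypothetical aggregate throughput per serving instance (the article does not state one); the function name and the throughput values are illustrative, only the GPU-hour prices come from the text above.

```python
def cost_per_million_tokens(gpu_hour_usd: float, n_gpus: int, tokens_per_second: float) -> float:
    """Hardware-only cost floor in USD per 1M output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * n_gpus / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # GPU-hour prices are from the article; throughput values are assumptions.
    for gpu_hour_usd in (2.50, 3.00):
        for tokens_per_second in (250, 500, 1000):
            floor = cost_per_million_tokens(gpu_hour_usd, n_gpus=2,
                                            tokens_per_second=tokens_per_second)
            print(f"${gpu_hour_usd:.2f}/GPU-h, 2 GPUs, {tokens_per_second} tok/s "
                  f"-> ${floor:.2f} per 1M tokens")
```

The floor moves inversely with sustained throughput, so real utilization decides how much margin is left at a given price.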

The implication is structural. Cloud providers cannot serve frontier-class models profitably at current pricing - they either raise prices, degrade model quality, or push the compute elsewhere. Most are doing all three at once: smaller default models, higher prices for the large ones, an aggressive push toward on-device inference for routine tasks.

Apple chose the third path with more conviction than anyone else. Not because they are altruistic, but because they sell hardware. Every task that moves from cloud to device makes the device more valuable.

"The token is not a product. It is a cost center dressed as a revenue line."

- Field observation, Q1 2026

03 - APPLE'S HARDWARE BET · Silicon over software, deliberately.

Apple's recent executive reshuffling reads clearly if you treat it as a resource allocation signal, not corporate gossip. Hardware engineering leads now sit closer to AI strategy than at any point in the company's history. The message is not subtle. Apple will not compete with OpenAI or Google on foundation models. Apple will build the silicon that runs any model locally.

The logic is simple. Foundation models are commoditizing - open weights close the gap with proprietary ones every quarter. The scarce resource is not the model. It is the ability to run it without a network connection, without per-token billing, without your data leaving the office.

Strategic frame

Apple does not sell intelligence. Apple sells the silicon that makes intelligence a feature of the machine you already own.

Unified memory is the technical keystone of this strategy. In a discrete GPU setup, any weights that do not fit in VRAM must cross a PCIe bus - 64 GB/s on a good day, typically less under real workloads. Apple Silicon puts CPU, GPU, and Neural Engine in one package, sharing a single memory pool at roughly 400-800 GB/s. For inference, where the bottleneck is almost always memory bandwidth, this is not an incremental improvement. It is a different category of machine.

Trajectory: AI as a built-in feature of the device - like Spotlight or autocorrect - not a subscription. The hardware becomes more valuable, which is Apple's business model. The user becomes less dependent on the network, which is Apple's privacy model. Both incentives point in the same direction.

04 - THE HARDWARE · Two machines, one architecture.

Both machines share the same unified memory architecture and the same software stack. The differences are quantitative (more bandwidth, more cores, more memory); the operational experience is identical. Same OS, same llama.cpp build, same Ollama config. You scale by buying a bigger machine, not by learning a new one.

M4 Max · 128 GB: 546 GB/s bandwidth. 16-core CPU, 40-core GPU. 65-80 W under inference. The workhorse for teams of 4-6.

M3 Ultra · 192 GB: 800 GB/s bandwidth. 24-core CPU, 76-core GPU. 90-115 W under inference. Roughly 55% more single-user throughput, extends to teams of 8-12.

NVIDIA A100 · 80 GB: 2,039 GB/s bandwidth. 300-400 W. Faster raw throughput, but limited VRAM forces multi-GPU for 70B models.

REF / 002: M3 Ultra has roughly 1.5× the memory bandwidth of the M4 Max (about 800 GB/s against 546 GB/s) and 50% more memory capacity. For inference workloads, bandwidth is the primary throughput determinant.

The M3 Ultra's 192 GB opens a door the M4 Max cannot: unquantized 70B, or quantized up to roughly 120B parameters. For most practical deployments the extra bandwidth matters more than the extra memory - it translates directly into faster token generation under concurrent load.
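Whether a model fits is a one-line calculation. A rough sketch, with two assumptions that are not from the article: Q4_K_M is taken as about 4.85 bits per weight, and `usable_fraction` is a hypothetical headroom factor for the OS and KV cache rather than a measured limit.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         usable_fraction: float = 0.75) -> bool:
    """True if the weights fit in the assumed usable share of unified memory."""
    return weights_gb(params_billion, bits_per_weight) <= memory_gb * usable_fraction


if __name__ == "__main__":
    FP16_BITS = 16
    Q4_K_M_BITS = 4.85  # assumption: approximate effective bits per weight
    for machine, memory_gb in (("M4 Max", 128), ("M3 Ultra", 192)):
        for label, bits in (("FP16", FP16_BITS), ("Q4_K_M", Q4_K_M_BITS)):
            size = weights_gb(70, bits)
            verdict = "fits" if fits(70, bits, memory_gb) else "does not fit"
            print(f"70B {label} ({size:.0f} GB) on {machine} {memory_gb} GB: {verdict}")
```

The estimate covers weights only; context length adds KV cache on top, so treat a marginal fit as a warning rather than a green light.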

05 - BENCHMARKS · Real numbers, no marketing.

Test setup: Llama 3.3 70B and Qwen 2.5 72B, both quantized to Q4_K_M. Served via llama.cpp with Metal acceleration, default context length of 4096 tokens. Test harness sends structured prompts of varying length (512, 1024, 2048 tokens) and measures tokens per second, time to first token, and throughput stability over 100 sequential requests.
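The three metrics can be derived from token arrival timestamps alone. This is a hypothetical sketch of that calculation, not the harness used for these benchmarks; the function names and the choice of coefficient of variation as the stability measure are assumptions.

```python
from statistics import mean, pstdev


def request_metrics(sent_at: float, token_times: list[float]) -> dict:
    """TTFT and generation speed for one request, from timestamps in seconds."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure generation speed")
    ttft_s = token_times[0] - sent_at
    # Generation speed excludes prompt processing: tokens after the first,
    # divided by the time between the first and the last token.
    tok_per_s = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return {"ttft_s": ttft_s, "tok_per_s": tok_per_s}


def stability(tok_per_s_samples: list[float]) -> float:
    """Coefficient of variation across requests; lower means steadier throughput."""
    return pstdev(tok_per_s_samples) / mean(tok_per_s_samples)


if __name__ == "__main__":
    # Synthetic timestamps for illustration only, not measured data.
    example = request_metrics(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5])
    print(example)
    print(f"variation across runs: {stability([19.5, 20.0, 20.5]):.3f}")
```

Separating TTFT from generation speed matters because longer prompts inflate the first number without changing the second.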

M4 Max · single user: 18-22 tok/s generation speed. TTFT under 800 ms for prompts up to 2048 tokens. Consistent across both models.

M3 Ultra · single user: 28-35 tok/s (projected). Projected by scaling the M4 Max result with memory bandwidth. TTFT under 500 ms. Validated on shorter sequences; full benchmark in progress.

Throughput delta: ~55% improvement, M3 Ultra over M4 Max. Tracks the memory bandwidth ratio closely, consistent with bandwidth as the binding constraint.

Honest caveat: these are Q4_K_M numbers. Four-bit quantization trades roughly 2-3% of model quality for a reduction in memory footprint of roughly 70% relative to FP16. The M3 Ultra can run FP16 70B models - they fit in 192 GB - but generation speed drops to 12-15 tok/s as the model saturates available bandwidth. For most production use cases, Q4_K_M is the right trade-off.

06 - CONCURRENCY AND LIMITS · Where the ceiling is, honestly.

Single-user benchmarks are marketing. The real question is how many people can use the system simultaneously before it stops feeling interactive. Numbers below. The ones most vendors prefer you see after the purchase order, not before.

M4 Max · 4 users: 5-6 tok/s per user. Usable for code completion and short-form generation. Feels responsive.

M4 Max · 8 users: 2-3 tok/s per user. Practical ceiling for interactive use. Long-form generation becomes a patience exercise.

M3 Ultra · 4 users: 8-10 tok/s per user. Feels like single-user on the M4 Max. Headroom for background tasks alongside interactive use.

M3 Ultra · 8 users: 5-6 tok/s per user. Comfortable interactive use for a full team. The sweet spot for most deployments.

M3 Ultra · 12 users: 3-4 tok/s per user. Practical ceiling. Beyond this, plan a small cluster or accept queuing delays.
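Per-user rates are easier to judge as waiting time. A small sketch that turns the figures above into seconds per response and the aggregate throughput they imply; the 500-token response length is an assumption chosen for illustration.

```python
# (machine, concurrent users) -> (low, high) tok/s per user, from the figures above.
PER_USER_TOK_S = {
    ("M4 Max", 4): (5, 6),
    ("M4 Max", 8): (2, 3),
    ("M3 Ultra", 4): (8, 10),
    ("M3 Ultra", 8): (5, 6),
    ("M3 Ultra", 12): (3, 4),
}


def wait_seconds(response_tokens: int, tok_per_s: float) -> float:
    """Time to stream a full response at a given per-user rate."""
    return response_tokens / tok_per_s


def aggregate_tok_s(users: int, per_user_range: tuple) -> tuple:
    """Total tok/s across all users implied by a per-user range."""
    low, high = per_user_range
    return users * low, users * high


if __name__ == "__main__":
    RESPONSE_TOKENS = 500  # assumption: a medium-length answer
    for (machine, users), rates in PER_USER_TOK_S.items():
        slow = wait_seconds(RESPONSE_TOKENS, rates[0])
        fast = wait_seconds(RESPONSE_TOKENS, rates[1])
        total = aggregate_tok_s(users, rates)
        print(f"{machine}, {users} users: {fast:.0f}-{slow:.0f} s per response, "
              f"{total[0]}-{total[1]} tok/s aggregate")
```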

The power equation

M4 Max draws 65-80W under inference load. M3 Ultra draws 90-115W. An NVIDIA A100 draws 300-400W - and you likely need two of them for a 70B model. Over a year of continuous operation, the electricity delta alone adds up, but the real saving is infrastructure: no rack, no cooling, no dedicated power circuit. A Mac Studio sits on a desk and runs on a standard outlet.
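The electricity delta is easy to put a number on. A minimal sketch using the draw figures above for continuous operation; the price per kWh is a placeholder assumption, so substitute your own tariff.

```python
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.25  # assumption: placeholder tariff in your local currency

# (low, high) watts under inference load, from the article.
DRAW_WATTS = {
    "M4 Max": (65, 80),
    "M3 Ultra": (90, 115),
    "2x NVIDIA A100": (600, 800),
}


def annual_kwh(watts: float) -> float:
    """Energy used in one year of continuous operation."""
    return watts * HOURS_PER_YEAR / 1000


if __name__ == "__main__":
    for name, (low_w, high_w) in DRAW_WATTS.items():
        low_kwh, high_kwh = annual_kwh(low_w), annual_kwh(high_w)
        print(f"{name}: {low_kwh:.0f}-{high_kwh:.0f} kWh/yr, "
              f"{low_kwh * PRICE_PER_KWH:.0f}-{high_kwh * PRICE_PER_KWH:.0f} per year")
```

The GPU figure covers the cards only; host system, cooling, and rack overhead come on top.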

Practical guidance

The question is not whether Apple Silicon is faster than NVIDIA. It is not. The question is whether it is fast enough for your team size, at a tenth of the power budget and a twentieth of the infrastructure complexity.

Sizing recommendation

4-6 people: Mac Studio M4 Max 128 GB. Single machine, single desk, done.
8-12 people: Mac Studio M3 Ultra 192 GB. Same simplicity, 2× the headroom.
12+ people: Two M4 Max units behind a load balancer, or cloud hybrid for peak overflow.
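The same rule as a lookup, for anyone wiring it into an intake form. A sketch only: the article does not cover teams under four or a team of exactly seven, so the boundaries below are assumptions (small teams map to the M4 Max, seven rounds up to the M3 Ultra, twelve stays on a single M3 Ultra).

```python
def recommend(team_size: int) -> str:
    """Map team size to the sizing recommendation above (boundaries are assumptions)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if team_size <= 6:
        return "Mac Studio M4 Max 128 GB"
    if team_size <= 12:
        return "Mac Studio M3 Ultra 192 GB"
    return "Two M4 Max units behind a load balancer, or cloud hybrid for peak overflow"


if __name__ == "__main__":
    for size in (4, 7, 12, 20):
        print(f"{size} people: {recommend(size)}")
```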

07 - BACKGROUND AI AGENTS · The workload that changes the math.

Everything above assumes interactive use - a person typing a prompt and waiting for a response. But the workload that matters next is not interactive at all. It is background agents: autonomous processes that review code, summarize documents, triage emails, monitor logs, and generate reports - continuously, without human prompting.

On cloud APIs, background agents are a budget nightmare. An agent that reviews a codebase overnight might consume millions of tokens - at $5-15 per million, that is a meaningful line item for doing work nobody is awake to supervise. Rate limits add insult: your agent queues behind everyone else's agents, and throughput becomes unpredictable.

On a local inference node, background agents cost exactly zero in incremental token fees. The hardware is already amortized. The electricity delta is negligible. And there are no rate limits - your agent runs at full throughput whenever the machine is not serving interactive requests.
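To see what the cloud line item looks like, multiply it out. A sketch under stated assumptions: the nightly token volume and the number of nights per month are hypothetical; only the $5-15 per million range comes from the text.

```python
def cloud_cost_usd(tokens: int, usd_per_million: float) -> float:
    """API cost for a given token volume at a flat per-million price."""
    return tokens / 1_000_000 * usd_per_million


def monthly_range(tokens_per_night: int, nights: int,
                  price_range: tuple = (5.0, 15.0)) -> tuple:
    """Low and high monthly cloud cost for a recurring overnight agent."""
    total_tokens = tokens_per_night * nights
    return tuple(cloud_cost_usd(total_tokens, price) for price in price_range)


if __name__ == "__main__":
    TOKENS_PER_NIGHT = 3_000_000  # assumption: one codebase-review agent
    NIGHTS_PER_MONTH = 22         # assumption: working nights only
    low, high = monthly_range(TOKENS_PER_NIGHT, NIGHTS_PER_MONTH)
    print(f"Cloud: ${low:.0f}-${high:.0f} per month per agent")
    print("Local: no per-token fee; hardware and electricity are fixed costs")
```

The figure scales linearly with the number of agents, which is where the budget pressure comes from.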

A practical setup for a team of six: during working hours, the Mac Studio serves interactive queries. After hours, two or three background agents take over - reviewing the day's pull requests, updating documentation, scanning logs for anomalies. The team arrives in the morning to a summary of what happened overnight, produced at zero marginal cost.
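The day/night split comes down to one gate that the agent runner checks before taking work. A minimal sketch, assuming working hours of 09:00-18:00 and a counter of in-flight interactive requests; both names and the hours are hypothetical, and a real deployment would read them from its own scheduler.

```python
from datetime import time

WORK_START = time(9, 0)   # assumption: start of working hours
WORK_END = time(18, 0)    # assumption: end of working hours


def agents_may_run(now: time, interactive_in_flight: int,
                   allow_daytime_idle: bool = False) -> bool:
    """Background agents yield to interactive use.

    They run after hours when nothing interactive is in flight, and during
    working hours only if idle-time use is explicitly allowed.
    """
    if interactive_in_flight > 0:
        return False
    after_hours = not (WORK_START <= now < WORK_END)
    return after_hours or allow_daytime_idle


if __name__ == "__main__":
    print(agents_may_run(time(22, 0), interactive_in_flight=0))  # after hours, idle
    print(agents_may_run(time(10, 0), interactive_in_flight=0))  # working hours
    print(agents_may_run(time(22, 0), interactive_in_flight=1))  # someone is working late
```

Checking the gate between agent tasks, rather than once at startup, lets a late-night interactive request take priority at the next task boundary.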

"The most productive AI deployment is the one that works while the team is asleep."

- Field journal, April 2026

This is where Apple's strategy connects back to the individual. AI stops being a tool you pick up and put down. It becomes a background process - like email sync or cloud backup - that chews through your professional context whether you are at the desk or not. The question is no longer "should I use AI for this task?" but "what has my AI already done about this while I was not looking?"

08 - WHERE THIS GOES · AI as a device feature, not a service.

Apple's trajectory points toward a future where AI is a property of the device, not a service accessed through it. The Mac Studio is the professional-grade version of what every iPhone and Mac will eventually do natively - run meaningful models locally, with no cloud dependency, no per-query cost, no data leaving the network.

If you are making infrastructure decisions today, this is not a speculative argument. The M4 Max and M3 Ultra are available hardware, running production models, at the measured and projected numbers above. They are good enough for teams of 4-12. The power budget is rational. The privacy guarantee is structural: prompts and documents stay on hardware you control.

What you are buying is not a faster way to call an API. You are buying the ability to stop calling the API. That is a different kind of asset - one that appreciates as models get smaller and more capable, rather than depreciating as cloud pricing fluctuates.

If you want the deployment counterpart of these numbers, the 14-day private LLM rollout walks through the appliance build that turns this hardware into a working stack. And for the framing of why local inference is not just "cloud without internet", see why on-premises is not cloud without internet.

The rest is engineering.

09 - WORK WITH ME · Sanctum: the Mac Studio private AI appliance these benchmarks describe.

If the sizing tables above describe a team you actually run - four to twelve people, a 70B-class model, no cloud egress - the appliance already exists. It is called Sanctum: a turnkey Mac Studio private AI appliance with Llama 3.3 70B, built on the same M4 Max and M3 Ultra configurations benchmarked above, pre-loaded with Ollama, Open WebUI and AnythingLLM, delivered to your office and installed on your LAN in two weeks. Outbound internet severed at the firewall. Documents never leave the box.

Sanctum / Max is the 18-22 tok/s M4 Max for 4-6 users; Sanctum / Ultra is the 28-35 tok/s M3 Ultra 192 GB for 8-12 users with headroom for the background-agent workload from section 07. Both ship with private RAG over your corpus, pf-firewall network isolation, and an audit pack suitable for a DPO, a GDPR auditor or an EU AI Act review - the kit law firms, clinics, finance, crypto and accountants across the EU and UAE need instead of another cloud subscription. The deployment timeline behind these numbers is in the 14-day private LLM rollout playbook; when you are ready to size the box, the intake form is on the Sanctum Mac Studio appliance page.

Nikita Chetverikov

Fullstack · Private AI
