Prince Canuma · 01:30 "In 2020 my father went blind. The same year, Apple released its most powerful chip for on-device inference (M1). I told my father, 'I'll find a way to get you back to reading' — from then on, on-device AI became my future."
A talk that stands apart from other sessions — a technical argument starting from personal motivation. Prince's father lives in Africa and lost his sight in 2020. He lives in a place with unstable internet, where "cloud AI" cannot reach him. Prince's conviction that "on-device AI is the future" begins here. The same year, Apple announced the M1. A coincidental crossing.
Three years later (2023), Prince found MLX on GitHub and became a contributor. "Three years on, 1.5M downloads, 4,000-plus ported models, day-zero support relationships with frontier labs" (03:00). As a central figure in the MLX ecosystem, he shows in 22 minutes the frontier of on-device AI implementation.
From the MEMEX editorial vantage, what matters is that this is the first time MEMEX treats on-device inference in depth, as a line distinct from the vibe coding / agent coverage so far. Karpathy / Boris / Schluntz assume cloud inference (Claude API, GPT API); Mehedi and Eric (Granola, Trigger.dev) likewise. Prince instead treats "everything runs locally" itself as a philosophy. Accessibility, sovereignty, privacy: angles that don't surface when cloud is the default.
Key Observations
Personal history — his father's blindness and the coincidence of Apple Silicon (01:00 to 02:30)
Prince's starting point. In 2020 his father went blind. "My father is the most voracious reader I know." "I told my father, I'll find a way to get you back to reading" (01:30). The same year, Apple announced the M1.
The reasoning for the technical choice is both personal and structural: "My father lives in Africa. Internet isn't readily available there the way it is here, and subscription plans are genuinely terrible. So on-device is the future" (02:30). Set against the vibe coding Karpathy and Boris talk about under the cloud-AI assumption, Prince directly problematizes the geographic inequality of AI access. A direct response to the same structural issue that Hinton and Sejnowski raised at the DWC in the form of "the tobacco-asbestos precedent (the Global South as victims)."
The state of the MLX ecosystem — 1.5M downloads, 4,000 models (03:00)
MLX itself is the Apple Silicon ML framework Apple released in 2023. Prince found it early and has led the community extensions.
MLX extensions from Neywa Labs:
- MLX VLM: vision-language models (image understanding plus text generation). Adopted as the foundation under LM Studio, Liquid AI models, and others
- MLX Audio: speech recognition (Whisper-family), speech synthesis (Marvis TTS, sub-100ms generation)
- MLX Video: video generation. Runs on a 16GB-RAM MacBook
- MLX Omni series: integrated image + audio + text models (Gemma 4 E, Qwen3 Omni 30B, etc.)
The meaning of "day-zero support": "Gemma 4 launched last week. We had support ready in MLX on day one" (03:00). This indicates an official pre-coordinated relationship with frontier labs. Open-source MLX, updated at the same pace as the latest models from Apple, Google, and Meta.
An era in which Gemma 4 26B runs on iPhone (04:42 to 05:30)
The technical highlight. "Models with tens of billions of parameters run on an M1 MacBook. And on iPhone, Gemma 4 26B runs — using storage" (05:00). With the qualifier "at a reasonable speed."
This supplies, on the device side, the physical basis for Karpathy's "Software 3.0" at AI Ascent 2026. Karpathy's paradigm of "the LLM is the computer" holds locally only if an LLM can actually run on a local machine, and Apple Silicon's unified memory architecture (CPU and GPU sharing one memory pool) is what makes that practical.
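The "using storage" qualifier is easier to read with the weight arithmetic in view. The sketch below is a rough estimate only, assuming dense weights stored at one uniform bit width and a hypothetical 8 GB of device RAM; neither the bit widths nor the RAM figure comes from the talk, and `weight_gb` is an illustrative helper, not part of MLX.

```python
def weight_gb(parameters, bits_per_weight):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9

PARAMETERS = 26e9      # the "26B" quoted in the talk
DEVICE_RAM_GB = 8      # assumption for illustration, not a figure from the talk

# Weight footprint at a few common storage precisions.
footprint = {bits: weight_gb(PARAMETERS, bits) for bits in (16, 8, 4)}

for bits, size in footprint.items():
    verdict = "fits in RAM" if size <= DEVICE_RAM_GB else "exceeds RAM"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict})")
```

Under these assumptions even a 4-bit copy of the weights (13 GB) is larger than the assumed RAM, which is consistent with the talk's remark that the iPhone run leans on storage.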
Marvis TTS and sub-100ms generation (06:00)
Neywa Labs' open custom voice generation model, Marvis. "Generates audio in under 100ms." Real-time dialog systems become practical as a result. "Anyone using Whisperflow or Super Whisper? You can vibe code that in 10 minutes by pointing Claude Code or Codex at MLX Audio" (06:20).
An ideal example of Schluntz's "leaf node strategy": a fully local audio I/O library with enough polish, plus Claude Code for assembly, lets an individual build a voice dictation app in 10 minutes. It extends the "combine cloud AI with Claude Code" vibe coding covered elsewhere on MEMEX into a different dimension: "combine local AI with Claude Code."
TurboQuant — Prince's research contribution (21:00 to 22:30)
A key Q&A moment. "I open-sourced TurboQuant 30 minutes after the paper dropped, the first in the world. The 3 a.m. tweet hit 700,000 views" (21:30).
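For readers new to the idea, quantization stores values in fewer bits and accepts a small reconstruction error in exchange. The sketch below is plain uniform 4-bit quantization of one toy vector, chosen only to illustrate that trade. It is not TurboQuant's algorithm, and the function names and sample values are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integer codes 0..15 plus the offset and scale needed to decode."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]

vector = [-1.2, -0.4, 0.0, 0.25, 0.9, 1.8]   # toy data, not real KV cache values
codes, lo, scale = quantize_4bit(vector)
restored = dequantize_4bit(codes, lo, scale)

# Rounding to the nearest code keeps each error within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, f"max error {max_error:.3f}")
```

Each code fits in 4 bits instead of the 16 a half-precision float would take; the price is the rounding error, which published methods work to keep small.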
Practical impact: "A full model uses about 1GB in KV cache / RAM. TurboQuant cuts that by 4x. Same quality. At 300K context, throughput roughly doubles" (22:00). And decisively: "With this, we can deliver 1M context on device — depending on model size and hardware" (22:30).
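A back-of-envelope calculation shows why a 4x cut matters at long context. The sketch below uses the standard KV cache size estimate (keys plus values, per layer, per KV head, per token). The model shape is hypothetical, and the 16-bit baseline and 4-bit compressed format are assumptions picked to produce a 4x ratio; the source does not state bit widths.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bits_per_value):
    """Bytes needed to hold keys and values for one sequence."""
    stored_values = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = keys + values
    return stored_values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128)

for context in (8_000, 300_000, 1_000_000):
    baseline = kv_cache_bytes(**shape, context_tokens=context, bits_per_value=16)
    # Ignores per-block scale/offset overhead that real quantized formats carry.
    compressed = kv_cache_bytes(**shape, context_tokens=context, bits_per_value=4)
    print(f"{context:>9,} tokens: {baseline / 1e9:6.1f} GB -> {compressed / 1e9:6.1f} GB")
```

With this assumed shape each token costs 128 KiB at 16 bits, so a 4x reduction lowers the memory bar for long contexts without removing it, which fits the "depending on model size and hardware" caveat.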
Read alongside Arize's hierarchical memory and Trigger.dev's snapshot/restore as "three distinct solutions to the LLM scale problem." Arize at the software layer (context strategy), Trigger.dev at the infrastructure layer (state management), Prince at the model layer (quantization) — different angles on the same "too-big" problem.
The Reachy Mini robot — cloning the Iron Man Jarvis voice (15:45 to 16:30)
The talk's closing demo, ahead of the Q&A. Prince acquired a Reachy Mini robot and gave it "sight plus hearing" via MLX Audio plus MLX Vision. "Iron Man's Jarvis voice, cloned in real time." "You can build an agent that runs on your iPhone, iPad, Mac, or robot, starting today" (16:30).
This view of the future — agents running on personal devices, regardless of shape (phone, robot) — sees the same landscape Karpathy's "LLM is the new computer" and Boris Cherny's "I write code on a phone" describe, but in the context of hardware diversity. Phone, tablet, Mac, robot — all become "LLM host devices."
Related
- Karpathy: Software 3.0 — the "LLM as new computer" concept, with local execution underwriting its feasibility
- Hinton: UN DWC — the structural problem of the Global South and accessibility
- Trigger.dev: durable agents — infrastructure-layer solution to the scale problem
- Arize: hierarchical memory — software-layer solution to the scale problem
- Samuel Humeau / Mistral: TTS — the cloud-side TTS counterpoint
Key Quotes
- "In 2020 my father went blind. The same year, Apple released its most powerful chip for on-device inference (M1)" (01:30)
- "My father lives in Africa, where internet isn't readily available. So on-device is the future" (02:30)
- "In three years MLX hit 1.5M downloads, 4,000-plus models, day-zero support with frontier labs" (03:00)
- "Gemma 4 26B runs on iPhone — using storage. At a reasonable speed" (05:00)
- "Marvis TTS generates audio in under 100ms. Point Claude Code at MLX Audio and you can vibe code Whisperflow in 10 minutes" (06:20)
- "TurboQuant — first implementation in the world 30 minutes after the paper dropped, the 3 a.m. tweet hit 700,000 views" (21:30)
- "We can deliver 1M context on device — depending on model size and hardware" (22:30)
- "You can build an agent that runs on your iPhone, iPad, Mac, or robot — starting today" (16:30)
Sources
Why MLX — AI Engineer (YouTube)
Related resources:
- MLX VLM on GitHub (Prince Canuma)
- MLX Audio on GitHub
- MLX (Apple)
- LM Studio (a local LLM client built on MLX)
Glossary
- MLX
- A machine learning framework for Apple Silicon, published by Apple's research team in 2023. Plays the role on Apple Silicon that PyTorch or TensorFlow plays elsewhere. Uses the unified memory architecture (CPU/GPU shared memory) to run LLM inference efficiently.
- Neywa Labs
- A startup co-founded by Prince Canuma. Leads development of the MLX ecosystem (MLX VLM, MLX Audio, MLX Video). A center of open-source AI tooling on Apple Silicon.
- Marvis TTS
- A custom speech synthesis model developed by Neywa Labs. Generates audio in under 100ms, fully on device. Listed under MLX Audio's speech synthesis support, the same library Prince suggests for self-built voice tools in the style of Whisperflow / Super Whisper.
- TurboQuant
- Quantization research released in March 2026. Compresses the LLM KV cache (the key-value cache, the main consumer of memory at inference time) by 4x. Prince Canuma published an open implementation 30 minutes after the paper appeared, and it spread widely the same day. Prince says it makes 1M context on device achievable, depending on model size and hardware.
- RF-DETR (Roboflow)
- A real-time object detection model published by Roboflow. Used in Prince's MLX demo to run "detect my face and the background, blur the background" on-device in real time. Explained in more depth in Isaac Robinson's AI Engineer Europe talk.
- Reachy Mini
- An open-source compact robot sold by Hugging Face and Pollen Robotics. Prince combined MLX Audio and MLX Vision to give it sight plus hearing, modifying it into a robot that converses in an Iron Man Jarvis-style voice.