How Oaken Notes Does On-Device Meeting Transcription on Apple Silicon

Let’s be honest: Apple Intelligence isn’t very good. As an LLM, it trails every major cloud model. GPT-4, Claude, and Gemini all run circles around it for general reasoning.

But Apple Silicon? That’s a different story entirely. The Neural Engine and unified memory architecture make on-device inference shockingly fast. I’ve been building on Core ML since the M1 days, and I’m more convinced than ever that on-device AI on Apple’s hardware is the future. Not because Apple’s models are the smartest, but because the silicon is fast enough that you don’t need to send your data anywhere.

That’s the bet I made with Oaken Notes. I wanted a meeting assistant that didn’t need an account, a cloud subscription, or your voice leaving your Mac. Here’s how the internals actually work.

Architecture at a Glance

┌──────────────────────────────────────────────────────────────┐
│                      Audio Capture                           │
│  ┌─────────────────────┐    ┌──────────────────────────┐    │
│  │ Mic (AVAudioEngine)  │    │ System Audio              │    │
│  │ installTap -> memcpy │    │ (ScreenCaptureKit/SCStream)│    │
│  └────────┬────────────┘    └────────────┬─────────────┘    │
│           └──────────┬───────────────────┘                   │
│                      v                                       │
│            AudioBufferSink protocol                          │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       v
┌──────────────────────────────────────────────────────────────┐
│                    Transcription                             │
│  Primary: Apple Intelligence                                 │
│  SpeechAnalyzer -> SpeechTranscriber (AsyncStream)           │
│  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Fallback: WhisperKit (CoreML)                               │
│  WhisperKitHolder actor | model selected by RAM              │
│  2-min pre-buffer | memory pressure -> auto-downgrade        │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       v
┌──────────────────────────────────────────────────────────────┐
│                  AI Enhancement                              │
│  SystemLanguageModel + LanguageModelSession                  │
│  @Generable -> MeetingSummary / TemplateMeetingSummary       │
│  Short meetings: singlePass | Long: chunked -> reduceSummaries│
└──────────────────────┬───────────────────────────────────────┘
                       │
                       v
┌──────────────────────────────────────────────────────────────┐
│                     Storage                                  │
│  GRDB/SQLite (DatabaseQueue) + FTS5 full-text search         │
│  Notes: ~/Library/.../Notes/<uuid>.md                        │
│  Audio: .m4a via AVAssetWriter                               │
└──────────────────────────────────────────────────────────────┘

The Audio Capture Pipeline

Everything starts with getting clean audio without stalling the real-time audio thread. Oaken uses AVAudioEngine with an installTap on the microphone bus.

The trick is that we don’t do any processing in the tap’s callback. That’s the fast lane. We memcpy the buffers immediately and hand them off to a serial processingQueue. If you block that callback for even a few milliseconds, you get dropped frames and robotic-sounding audio.
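A minimal sketch of that copy-then-hand-off pattern, using only Foundation so it runs without a microphone. The TapForwarder type and its method names are hypothetical, and a plain array copy stands in for the memcpy. A production tap would more likely copy into preallocated storage, since allocating on the audio thread has its own cost.

```swift
import Foundation

/// Hypothetical stand-in for the stage that receives tap buffers.
/// Marked @unchecked Sendable because all access to `received`
/// goes through the serial queue.
final class TapForwarder: @unchecked Sendable {
    // Serial queue: chunks are handled in arrival order, off the audio thread.
    private let processingQueue = DispatchQueue(label: "example.audio.processing")
    private var received: [[Float]] = []

    /// Called from the (simulated) tap callback. It only copies and enqueues.
    func handleTap(_ samples: UnsafeBufferPointer<Float>) {
        // Copy the samples out of memory that the caller will reuse.
        let copy = Array(samples)
        processingQueue.async {
            // Conversion, metering, and transcription hand-off would go here.
            self.received.append(copy)
        }
    }

    /// Waits for queued work to finish and returns what has been processed.
    func processedChunks() -> [[Float]] {
        processingQueue.sync { received }
    }
}

// Simulate one tap callback delivering three samples.
let forwarder = TapForwarder()
let frame: [Float] = [0.1, -0.2, 0.3]
frame.withUnsafeBufferPointer { forwarder.handleTap($0) }
```

The callback returns as soon as the copy is enqueued, so slow downstream work never holds up the caller.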

System audio (the stuff coming from Zoom, Meet, or Teams) is a different beast. We use ScreenCaptureKit and its SCStream API. It hands us a CMSampleBuffer, which we have to convert into an AVAudioPCMBuffer before it hits our internal sink.

Both the microphone and system audio streams feed into a unified AudioBufferSink protocol. This lets our transcription engines consume the data without caring where it came from.
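To make the idea concrete, here is one possible shape for such a protocol. Only the name AudioBufferSink comes from the app. The AudioChunk struct, the AudioSource enum, and the consume method are assumptions made for illustration.

```swift
import Foundation

/// Where a chunk of audio came from. Illustrative names.
enum AudioSource { case microphone, systemAudio }

/// A copied, self-owned block of mono samples.
struct AudioChunk {
    let source: AudioSource
    let sampleRate: Double
    let samples: [Float]
}

/// Hypothetical shape of the sink: one entry point for any capture path.
protocol AudioBufferSink: AnyObject {
    func consume(_ chunk: AudioChunk)
}

/// Example consumer that only counts samples per source.
final class CountingSink: AudioBufferSink {
    private(set) var counts: [AudioSource: Int] = [:]

    func consume(_ chunk: AudioChunk) {
        counts[chunk.source, default: 0] += chunk.samples.count
    }
}

// Both capture paths deliver to the same sink.
let countingSink = CountingSink()
countingSink.consume(AudioChunk(source: .microphone, sampleRate: 48_000, samples: [0, 0, 0]))
countingSink.consume(AudioChunk(source: .systemAudio, sampleRate: 48_000, samples: [0, 0]))
```

A transcription backend written against a protocol like this never needs to know whether samples came from the microphone or from SCStream.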

Dual Transcription Backends

I didn’t want to rely on just one transcription method. If Apple’s built-in stuff is available, we use it. If not, we have a heavy-duty fallback.

Primary: Apple Intelligence. On macOS 26 (Tahoe), we use the new SpeechAnalyzer and SpeechTranscriber APIs from the Speech framework. Audio is fed to the analyzer via an AsyncStream. It’s incredibly efficient because it’s baked into the OS and highly optimized for the hardware.

Fallback: WhisperKit. When the OS-native tools aren’t enough, we spin up WhisperKit, a Core ML implementation of OpenAI’s Whisper. We wrap this in a WhisperKitHolder actor to keep things thread-safe and isolated.

Memory management is the biggest challenge here. We don’t just load the “large” model and hope for the best. We have a WhisperModelSize enum whose cases each carry an estimatedRAMMB value. We check the machine’s physical RAM and choose the largest model that fits.
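A sketch of how that selection could look. The enum name and the estimatedRAMMB property are from the app, but the case list, the megabyte figures, and the maxFraction policy are placeholders invented for this example, not measured values.

```swift
import Foundation

/// Model tiers, ordered smallest to largest.
/// The MB figures are placeholders, not measurements.
enum WhisperModelSize: CaseIterable {
    case tiny, small, medium, large

    var estimatedRAMMB: Int {
        switch self {
        case .tiny: return 300
        case .small: return 900
        case .medium: return 1_500
        case .large: return 3_500
        }
    }
}

/// Picks the largest model whose estimate fits within a fraction of
/// physical RAM. `maxFraction` is an assumed policy knob.
func bestModel(physicalRAMMB: Int, maxFraction: Double = 0.25) -> WhisperModelSize {
    let allowance = Int(Double(physicalRAMMB) * maxFraction)
    return WhisperModelSize.allCases.last(where: { $0.estimatedRAMMB <= allowance }) ?? .tiny
}

// On a real machine the input comes from ProcessInfo.
let installedMB = Int(ProcessInfo.processInfo.physicalMemory / 1_048_576)
let selectedModel = bestModel(physicalRAMMB: installedMB)
```

Keeping the policy in a pure function makes it easy to test against RAM sizes you don’t have on your desk.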

We also use a MemoryPressureMonitor. If the system starts screaming about memory, we downgrade the model on the fly, moving from medium to small, or even dropping down to the Apple Intelligence backend if things get really tight.
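The downgrade ladder can be modeled as a small pure function. The internals of MemoryPressureMonitor aren’t described here, so everything below is an assumption: the Backend and PressureLevel enums, and the policy of stepping down once per warning and jumping straight to the OS transcriber on a critical event. On macOS, pressure events like these are commonly delivered by DispatchSource.makeMemoryPressureSource.

```swift
import Foundation

/// Transcription backends, heaviest first. Illustrative only.
enum Backend: Equatable {
    case whisperMedium, whisperSmall, systemSpeech
}

/// Simplified view of a memory pressure event.
enum PressureLevel { case normal, warning, critical }

/// Assumed policy: one step down per warning, and a critical
/// event switches directly to the OS-provided transcriber.
func nextBackend(after level: PressureLevel, current: Backend) -> Backend {
    switch (level, current) {
    case (.normal, _): return current
    case (.critical, _): return .systemSpeech
    case (.warning, .whisperMedium): return .whisperSmall
    case (.warning, .whisperSmall): return .systemSpeech
    case (.warning, .systemSpeech): return .systemSpeech
    }
}

// Two warnings in a row walk all the way down the ladder.
let afterFirstWarning = nextBackend(after: .warning, current: .whisperMedium)
let afterSecondWarning = nextBackend(after: .warning, current: afterFirstWarning)
```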

To prevent the user from missing the start of a meeting while the model loads, we collect up to 2 minutes of 16 kHz audio in a pre-buffer. Once the model is warm, we drain that buffer into the inference engine.
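A bounded pre-buffer along those lines might look like the sketch below. The PreBuffer type is hypothetical, and so is the overflow policy: once full, it drops new samples so the start of the meeting is preserved. At 16 kHz mono, 2 minutes works out to 1,920,000 samples.

```swift
import Foundation

/// Holds audio captured while the model is still loading.
struct PreBuffer {
    static let sampleRate = 16_000
    static let maxSeconds = 120

    let capacity: Int
    private(set) var samples: [Float] = []

    init(capacity: Int = PreBuffer.sampleRate * PreBuffer.maxSeconds) {
        self.capacity = capacity
    }

    /// Appends samples until the buffer is full. Anything beyond
    /// capacity is dropped (assumed policy: keep the earliest audio).
    mutating func append(_ chunk: [Float]) {
        let room = capacity - samples.count
        guard room > 0 else { return }
        samples.append(contentsOf: chunk.prefix(room))
    }

    /// Returns everything buffered so far and empties the buffer.
    mutating func drain() -> [Float] {
        let buffered = samples
        samples = []
        return buffered
    }
}

// Tiny capacity so the overflow behavior is easy to see.
var demoBuffer = PreBuffer(capacity: 4)
demoBuffer.append([1, 2, 3])
demoBuffer.append([4, 5, 6])
let drained = demoBuffer.drain()
```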

To keep the app’s footprint small, we have an idle unload. If you haven’t recorded anything for 10 minutes, we unload the model and reclaim that ~1.5 GB of RAM.
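The idle check itself is simple enough to sketch with an injected clock. The IdleUnloader type and its method names are hypothetical. A real version would release the loaded model where the comment indicates.

```swift
import Foundation

/// Decides when a loaded model has been idle long enough to release.
struct IdleUnloader {
    let idleLimit: TimeInterval
    private(set) var lastActivity: Date
    private(set) var isModelLoaded = true

    init(idleLimit: TimeInterval = 10 * 60, now: Date) {
        self.idleLimit = idleLimit
        self.lastActivity = now
    }

    /// Call whenever a recording starts or stops.
    mutating func noteRecordingActivity(at time: Date) {
        lastActivity = time
    }

    /// Call periodically. Returns true only on the call that unloads.
    mutating func unloadIfIdle(now: Date) -> Bool {
        guard isModelLoaded, now.timeIntervalSince(lastActivity) >= idleLimit else {
            return false
        }
        isModelLoaded = false  // a real app would release the model here
        return true
    }
}

// Drive the policy with fixed timestamps instead of the wall clock.
let startTime = Date(timeIntervalSince1970: 0)
var unloader = IdleUnloader(now: startTime)
let unloadedEarly = unloader.unloadIfIdle(now: startTime.addingTimeInterval(599))
let unloadedOnTime = unloader.unloadIfIdle(now: startTime.addingTimeInterval(600))
```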

AI Summarization and FoundationModels

Once we have a transcript, we need to make sense of it. We use SystemLanguageModel and LanguageModelSession from Apple’s FoundationModels framework.

I’m a big fan of structured data. We use @Generable typed output schemas, like MeetingSummary and TemplateMeetingSummary. We’re not parsing strings and hoping for the best. We’re getting back structured objects that we can actually use in the UI.

Token budgeting is another fun part. We call tokenBudget(for: model) to calculate how much room we have left after subtracting our prompt overhead from the model.contextSize.
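The arithmetic is easy to show in isolation. This sketch takes plain integers rather than a model object, so its signature differs from the app’s tokenBudget(for:). The reservedForOutput parameter, the EnhancementStrategy enum, and all of the numbers are assumptions chosen for the example.

```swift
import Foundation

/// Tokens left for transcript text once fixed costs are subtracted.
/// `reservedForOutput` is an added assumption: room kept for the reply.
func tokenBudget(contextSize: Int, promptOverhead: Int, reservedForOutput: Int = 0) -> Int {
    max(0, contextSize - promptOverhead - reservedForOutput)
}

enum EnhancementStrategy: Equatable { case singlePass, chunked }

/// A transcript that fits the budget goes through in one pass.
func strategy(transcriptTokens: Int, budget: Int) -> EnhancementStrategy {
    transcriptTokens <= budget ? .singlePass : .chunked
}

// Example figures only: a 4,096-token window, 600 tokens of
// instructions, 800 tokens held back for the generated summary.
let exampleBudget = tokenBudget(contextSize: 4_096, promptOverhead: 600, reservedForOutput: 800)
let shortMeeting = strategy(transcriptTokens: 1_200, budget: exampleBudget)
let longMeeting = strategy(transcriptTokens: 9_000, budget: exampleBudget)
```

With these figures the budget comes to 2,696 tokens, so a 1,200-token transcript goes through in one pass and a 9,000-token transcript does not.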

For short meetings, we do a singlePassEnhancement. For long ones, we have to do a chunkedEnhancement. We summarize each chunk, then run a reduceSummaries pass recursively until the combined summaries fit in the budget and we can synthesize a final structured summary.
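The chunk-then-reduce recursion can be sketched without a language model. Here summarize is a stand-in that keeps the first few words, character counts act as a rough proxy for tokens, and the progress guard is an assumption added so the recursion always terminates. Only the reduceSummaries name comes from the app.

```swift
import Foundation

/// Splits text on spaces into pieces of at most `maxChars` characters.
/// A single word longer than `maxChars` becomes its own oversized chunk.
func chunk(_ text: String, maxChars: Int) -> [String] {
    var chunks: [String] = []
    var current = ""
    for word in text.split(separator: " ") {
        if !current.isEmpty && current.count + 1 + word.count > maxChars {
            chunks.append(current)
            current = ""
        }
        if !current.isEmpty { current += " " }
        current += String(word)
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

/// Stand-in for a model call: keeps the first `keep` words.
func summarize(_ text: String, keep: Int = 8) -> String {
    text.split(separator: " ").prefix(keep).joined(separator: " ")
}

/// Re-chunks and re-summarizes until the combined text fits one pass.
func reduceSummaries(_ summaries: [String], maxChars: Int) -> String {
    let combined = summaries.joined(separator: " ")
    if combined.count <= maxChars { return summarize(combined) }
    let next = chunk(combined, maxChars: maxChars).map { summarize($0) }
    // Stop if a round made no progress, so the recursion cannot loop forever.
    guard next.joined(separator: " ").count < combined.count else {
        return summarize(combined)
    }
    return reduceSummaries(next, maxChars: maxChars)
}

// A 40-word transcript and a deliberately small limit to force chunking.
let demoTranscript = (0..<40).map { "w\($0)" }.joined(separator: " ")
let demoChunks = chunk(demoTranscript, maxChars: 60)
let finalSummary = reduceSummaries(demoChunks.map { summarize($0) }, maxChars: 60)
```

In a real pipeline, summarize would be a model call and the limit would come from the token budget rather than a character count.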

On macOS 26.4+, we use model.tokenCount for precise measurement. On older versions, we fall back to a “chars divided by 4” heuristic, which is surprisingly close most of the time. If the model fails or hits a guardrail, we fall back to a plain-string generation with a FallbackOutputParser.
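The heuristic is a one-liner. In this sketch the function names are hypothetical, rounding up is an assumption (plain integer division would round down), and the optional closure stands in for whatever precise counter is available at runtime.

```swift
import Foundation

/// Rough estimate: one token per four characters, rounded up.
func estimatedTokenCount(_ text: String) -> Int {
    (text.count + 3) / 4
}

/// Uses a precise counter when one is supplied, otherwise the heuristic.
func tokenCount(of text: String, preciseCounter: ((String) -> Int)? = nil) -> Int {
    preciseCounter?(text) ?? estimatedTokenCount(text)
}

// "hello world" has 11 characters, so the estimate is 3 tokens.
let heuristicCount = tokenCount(of: "hello world")
// A supplied counter always wins over the estimate.
let preciseCount = tokenCount(of: "hello world", preciseCounter: { _ in 42 })
```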

Storage without the Bloat

I’ve been a software engineer for 30 years, and I’ve seen enough Core Data migrations to last a lifetime. For Oaken, I chose GRDB with a DatabaseQueue. It’s lightweight, predictable, and doesn’t have the overhead or “magic” of Core Data.

We use FTS5 virtual tables for full-text search. Triggers automatically flatten our transcript segments (stored as a JSON column) into the FTS index. This means you can search across every meeting, note, and transcript almost instantly.
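In the app this flattening happens in SQL triggers. The sketch below shows the equivalent transformation in Swift to illustrate what ends up in the index. The TranscriptSegment fields are a guess at the JSON shape, not the actual schema.

```swift
import Foundation

/// Assumed shape of one transcript segment in the JSON column.
struct TranscriptSegment: Codable {
    let start: Double
    let text: String
}

/// Decodes the segment array and joins the text into one searchable string.
func flattenedText(fromSegmentsJSON json: String) throws -> String {
    let segments = try JSONDecoder().decode([TranscriptSegment].self, from: Data(json.utf8))
    return segments.map(\.text).joined(separator: " ")
}

// Two segments collapse into a single line of indexable text.
let segmentsJSON = #"[{"start":0.0,"text":"Hello team"},{"start":2.5,"text":"let's begin"}]"#
let indexedText = try! flattenedText(fromSegmentsJSON: segmentsJSON)
```

Timestamps are dropped on purpose: a full-text index only needs the words, and the JSON column remains the source for timing.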

The actual note content lives as plain .md files in ~/Library/Application Support/. We index these into the FTS table so they’re searchable, but the source of truth is just a simple file on your disk. Audio is saved as .m4a files using AVAssetWriter.

Privacy by Design

Privacy isn’t just a marketing line for us. It’s the architecture.

If you look at our project manifest, you won’t find a single analytics SDK. No Sentry, no Firebase, no Mixpanel. Nothing. We don’t even have an “account” system. There’s no server for your app to authenticate against.

We do have a network entitlement, but it’s strictly for Sparkle (our auto-updater), Paddle (for license verification), and the one-time WhisperKit model download. All the AI inference (the transcription and the summarization) happens on your machine using Apple’s local frameworks.

I’m happy to dive deeper into any of this. If you have questions about the macOS 26 APIs, how we handle WhisperKit’s memory footprint, or the privacy model, ask away. I’ll be hanging out in the comments.

Oaken Notes is a private AI meeting assistant for macOS. Your meetings never leave your Mac.
