Writing a Swift Inference Pipeline for Chatterbox Turbo

How I ported the Chatterbox TTS model to run fully on-device using ONNX Runtime Swift bindings — including KV cache management, memory-safe autorelease scoping, and audio resampling without a single third-party DSP library.

Chatterbox is Resemble AI's open-weight TTS model — emotionally expressive, zero-shot voice cloning, and competitive with ElevenLabs. The Python reference implementation runs fine on a GPU server. I wanted it running entirely on-device in a Swift app, with no server, no Python, and no CoreML conversion headaches.

The path I chose: ONNX export → ONNX Runtime Swift bindings → a hand-rolled inference loop that mirrors the Python generate() method as closely as possible. What looked like a weekend project turned into a deep education in autorelease pools, KV cache semantics, and why memory management and ML inference are a genuinely tricky combination.


The Architecture

Chatterbox Turbo has four distinct model components. Each gets its own ONNX session, loaded once and reused for every synthesis call:

Speech Encoder
Takes raw PCM audio from the reference speaker. Outputs audio_features, audio_tokens, speaker_embeddings, and speaker_features — collectively the speaker conditioning.
Embed Tokens
Converts token IDs (text on step 0, speech tokens on subsequent steps) into dense float embeddings via a learned lookup table.
Language Model
The autoregressive core. Takes embeddings + KV cache + position IDs, outputs logits for the next speech token. Runs up to 1024 times per generation.
Conditional Decoder
Converts the full sequence of predicted speech tokens back into a PCM waveform at 24kHz, conditioned on the speaker embeddings and features.

The entire synthesis flow is: encode speaker → tokenize text → autoregressive token loop → decode to audio. The expensive part is the middle step — every token requires a full LM forward pass.
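For orientation, the session setup might look something like this. This is a minimal sketch assuming the ONNX Runtime Objective-C/Swift bindings; the class, property, and file names are illustrative, not the actual Spokio source:

import OnnxRuntimeBindings   // or `import onnxruntime_objc` if you use the CocoaPods package

final class ChatterboxSessions {
    let env: ORTEnv
    let speechEncoder: ORTSession
    let embedTokens: ORTSession
    let languageModel: ORTSession
    let conditionalDecoder: ORTSession

    init(modelDirectory: URL) throws {
        let env = try ORTEnv(loggingLevel: .warning)
        self.env = env
        // One .onnx file per component, loaded once and reused for every synthesis call.
        func load(_ file: String) throws -> ORTSession {
            try ORTSession(env: env,
                           modelPath: modelDirectory.appendingPathComponent(file).path,
                           sessionOptions: nil)
        }
        speechEncoder      = try load("speech_encoder.onnx")
        embedTokens        = try load("embed_tokens.onnx")
        languageModel      = try load("language_model.onnx")
        conditionalDecoder = try load("conditional_decoder.onnx")
    }
}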


The Generation Loop

The Python reference uses a fairly standard autoregressive loop with a KV cache for efficiency. My Swift port in generateSpeech(tokenIds:conditioning:) mirrors this structure exactly. Here's the skeleton:

Chatterbox+LM.swift
var generateTokens = [Int64(startSpeechToken)]
var currentInputIds = tokenIds  // text tokens on step 0

for i in 0..<maxNewTokens {
    let nextToken: Int64 = try autoreleasepool {
        // embed → lm forward → sample token
    }
    generateTokens.append(nextToken)
    if nextToken == Int64(stopSpeechToken) { break }
    currentInputIds = [nextToken]  // next step uses predicted token
}

On step 0, we pass the full tokenized text. On every subsequent step, currentInputIds becomes just the single predicted token — exactly mirroring Python's input_ids[:, -1:] slicing behavior. When we hit stopSpeechToken (6562), we break early.

Step 0 is special

The first step is different from all subsequent ones. On step 0, we need to prepend the speaker conditioning embedding to the text embeddings before running the LM. This is the prefix — it tells the model whose voice to synthesize in. We track its length as prefixLen because it determines the starting cachedSeqLen for all subsequent attention mask calculations.

if i == 0 {
    inputsEmbeds = try concatEmbeds(conditioning.condEmb, inputsEmbeds)
    let shape = try inputsEmbeds.tensorTypeAndShapeInfo().shape
    prefixLen = shape[1].intValue
    cachedSeqLen = prefixLen
}

The attention mask and position IDs diverge accordingly: on step 0 they span the full prefix sequence; on later steps they're a single position pointing at the tail of the cache.
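A sketch of how those two arrays could be built per step, reusing the i, prefixLen, and cachedSeqLen values from the loop above (wrapping them into ORTValue tensors is omitted):

// Step 0: mask and positions span the whole prefix (conditioning + text).
// Later steps: the mask covers every cached position plus the new token,
// and there is a single position ID pointing at the tail of the cache.
let attentionMask: [Int64]
let positionIds: [Int64]

if i == 0 {
    attentionMask = [Int64](repeating: 1, count: prefixLen)
    positionIds   = Array(0..<Int64(prefixLen))
} else {
    attentionMask = [Int64](repeating: 1, count: cachedSeqLen + 1)
    positionIds   = [Int64(cachedSeqLen)]
}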


The Four Bugs I Fixed

The initial port "worked" — in that it produced audio — but had four correctness and performance issues I had to track down. I annotated all of them as // Fix #N in the source.

Fix #1
Problem: Attention mask length was miscalculated after step 0, causing incorrect attention across cached positions.
Solution: Track cachedSeqLen explicitly; increment after each successful LM run.

Fix #2
Problem: Memory spiking from ~2GB to ~16GB over long generations, with 50s latency spikes.
Solution: Scope the autoreleasepool tightly per step; call kvCache.update() inside the pool.

Fix #3
Problem: KV cache not cleared on early break or thrown error.
Solution: Add defer { kvCache.clear() } immediately after allocation.

Fix #4
Problem: Repetition penalty re-allocated a new Set<Int64> on every step.
Solution: Maintain penaltyTokenSet as a persistent var; pass by reference to sampleToken.

Fix #2 deserves its own explanation

This was the most subtle one. The naive approach wraps the entire generation step in an autoreleasepool. That sounds right — you want transient tensors released each step. But ONNX Runtime's ORTValue objects have internal retain semantics, and draining the pool mid-step while the runtime still holds references causes stalls. In practice: latency spikes to 50 seconds.

The fix is to scope the pool to wrap the step, but ensure the new KV cache tensors get a strong reference inside the pool before it drains:

let nextToken: Int64 = try autoreleasepool {
    // ... build inputs, run lm session ...

    // update() moves new present.*.key/value into kvCache.store
    // (a strong ref that SURVIVES the pool drain)
    try kvCache.update(from: lmOutputs, numLayers: numLayers)

    // OLD KV tensors: no remaining refs → freed when pool drains ✓
    // Transient tensors (inputIds, mask, embeds): also freed here ✓
    return try sampleToken(from: lmOutputs["logits"]!, ...)
}
// Memory stays flat across all 1024 steps
⚠ What happens without this
Without any autoreleasepool, all ORTValue objects accumulate until generateSpeech returns. For a 1024-step generation that's thousands of tensor allocations, and memory climbs from ~2GB to ~16GB. With a pool that drains too eagerly (wrapping only parts of the step), ONNX Runtime stalls waiting on internal ref counts. The ordering is the key: update the cache inside the pool so the new KV tensors get a strong reference in kvCache.store before the drain, and return only a plain Int64 so nothing else escapes it.
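For context, the cache object referenced above could look roughly like this. It is a sketch: the present.*.key/value output names come from the snippet above, but the past_* input-side key names are an assumption about the export rather than verified against it:

final class KVCache {
    // Strong references to the newest per-layer key/value tensors.
    private(set) var store: [String: ORTValue] = [:]

    // Called INSIDE the autoreleasepool (Fix #2): the strong reference taken here is
    // what lets the new present.*.key/value tensors survive the pool drain.
    func update(from outputs: [String: ORTValue], numLayers: Int) throws {
        for layer in 0..<numLayers {
            store["past_key_values.\(layer).key"]   = outputs["present.\(layer).key"]
            store["past_key_values.\(layer).value"] = outputs["present.\(layer).value"]
        }
    }

    // Fix #3: `defer { kvCache.clear() }` right after allocation empties the cache
    // even on an early break or a thrown error.
    func clear() {
        store.removeAll()
    }
}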

Speaker Conditioning and Audio Loading

Voice cloning starts with a reference audio file. The speakerConditioning(for:) method in Chatterbox+Voice.swift loads that audio, runs it through the speech encoder, and returns a SpeakerConditioning struct holding the four encoder outputs. Results are cached in VoiceCache so switching between the same voices in a session is essentially free.
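In sketch form, the lookup flow is roughly the following, where loadReferenceAudio and runSpeechEncoder are hypothetical helpers standing in for the real audio-loading and encoder-session code:

func speakerConditioning(for voiceURL: URL) async throws -> SpeakerConditioning {
    if let cached = await voiceCache.get(voiceURL) {
        return cached                                 // same voice again: essentially free
    }
    let pcm = try loadReferenceAudio(from: voiceURL)  // 24kHz mono Float32 samples
    let conditioning = try runSpeechEncoder(on: pcm)  // wraps the four encoder outputs
    await voiceCache.set(voiceURL, value: conditioning)
    return conditioning
}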

The trickier part is the audio loading itself. The model expects 24kHz mono float PCM. Reference audio from users comes in any format — 44.1kHz stereo AAC, 48kHz mono FLAC, whatever. I handle this with AVAudioConverter and added a fast path for audio that's already in the target format:

// Fast path — already 24kHz mono float
if isTargetSampleRate && isMono {
    let buffer = AVAudioPCMBuffer(...)
    try file.read(into: buffer)
    return Array(UnsafeBufferPointer(start: buffer.floatChannelData![0], ...))
}

// Resample / downmix path via AVAudioConverter
let frameCapacity = AVAudioFrameCount(ceil(Double(file.length) * ratio)) + 512

That + 512 on frameCapacity is a buffer guard: floating-point rounding in the ratio calculation can leave the output buffer one block short. Adding a small margin is a well-known idiom in audio code and far cheaper than silently truncating the last few frames of output.
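For completeness, the converter and output buffer around that capacity calculation are plain AVFoundation. A sketch, with the error case being hypothetical:

import AVFoundation

// The model's target format: 24kHz, mono, non-interleaved Float32.
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                 sampleRate: 24_000,
                                 channels: 1,
                                 interleaved: false)!

guard let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat),
      let outputBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                          frameCapacity: frameCapacity) else {
    throw ChatterboxError.audioConversionFailed   // hypothetical error case
}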

One subtlety: the AVAudioConverter input block needs to capture inputBuffer, but under Swift's strict concurrency checking a closure handed off to an Objective-C block callback like this can't capture actor-isolated state. The fix is a nonisolated(unsafe) local copy:

nonisolated(unsafe) let capturedInput = inputBuffer

converter.convert(to: outputBuffer, error: &conversionError) { _, outStatus in
    outStatus.pointee = capturedInput.frameLength > 0 ? .haveData : .endOfStream
    return capturedInput
}

Text Chunking and Caching

Long text can't be fed to the LM as one sequence — the model has a practical token limit and very long inputs degrade quality. The TextChunker splits normalized text into chunks of at most 300 characters, trying to break on sentence boundaries. Each chunk is synthesized independently and the resulting audio arrays are concatenated.
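A sketch of the chunking pass, assuming sentence-final punctuation as the break points; the real TextChunker may use different heuristics, and a single sentence longer than the limit would need an extra hard split not shown here:

import Foundation

func chunk(_ text: String, maxLength: Int = 300) -> [String] {
    // 1. Split into sentences, keeping the terminating punctuation.
    var sentences: [String] = []
    var current = ""
    for char in text {
        current.append(char)
        if ".?!".contains(char) {
            sentences.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        }
    }
    let rest = current.trimmingCharacters(in: .whitespaces)
    if !rest.isEmpty { sentences.append(rest) }

    // 2. Greedily pack sentences into chunks of at most maxLength characters.
    var chunks: [String] = []
    var chunk = ""
    for sentence in sentences {
        if chunk.isEmpty {
            chunk = sentence
        } else if chunk.count + 1 + sentence.count <= maxLength {
            chunk += " " + sentence
        } else {
            chunks.append(chunk)
            chunk = sentence
        }
    }
    if !chunk.isEmpty { chunks.append(chunk) }
    return chunks
}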

Chunking enables a useful caching strategy: ChunkCache keys on (chunkText, voiceURL). If you call synthesize with slightly different text where only one sentence changed, the unchanged chunks hit the cache and only the modified chunk is re-synthesized. In practice this makes iterative voice preview in an editor feel instant.

let cacheKey = ChunkCacheKey(text: chunk, voiceURL: voice.refAudioURL)
if let cached = await chunkCache.get(cacheKey) {
    audio = cached
} else {
    let tokenIds = try tokenize(text: chunk)
    audio = try await generateSpeech(tokenIds: tokenIds, conditioning: conditioning)
    await chunkCache.set(cacheKey, value: audio)
}

The Actor Model

Chatterbox is a Swift actor. This is the right call for an ML engine: ONNX sessions aren't thread-safe, and the generation loop maintains mutable state (KV cache, cached sequence length). The actor serializes access automatically.
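Boiled down, the engine's shape is roughly this. A sketch that reuses names appearing elsewhere in this post (ChatterboxSessions is the illustrative container from earlier, Status is the enum behind the stream below):

public actor Chatterbox {
    // ONNX sessions are not thread-safe; actor isolation serializes every call.
    private var sessions: ChatterboxSessions?          // loaded lazily during preparation

    // Mutable generation and status state, protected by the same isolation.
    private let kvCache = KVCache()
    private var cachedSeqLen = 0
    private var continuations: [UUID: AsyncStream<Status>.Continuation] = [:]
    private var preparationTask: Task<Void, Error>?

    // synthesize, generateSpeech, statusStream, ensureReady, ...
}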

Status management uses an AsyncStream-based broadcast so multiple UI consumers can observe loading and synthesis state without polling:

public func statusStream() -> AsyncStream<Status> {
    AsyncStream(bufferingPolicy: .bufferingNewest(1)) { continuation in
        let id = UUID()
        self.continuations[id] = continuation
        continuation.onTermination = { [weak self] _ in
            Task { await self?.removeContinuation(id: id) }
        }
    }
}

The bufferingNewest(1) policy means late subscribers always get the current state, and a slow consumer never blocks synthesis by backpressuring the stream.

On the ensureReady() dance
The ensureReady() method handles four cases: already ready (fast path), currently preparing (await existing task), currently synthesizing (wait for status stream to return to ready), and idle/failed (start a new preparation task). Using a stored preparationTask means concurrent callers all await the same Task rather than each starting their own model load.
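In sketch form, assuming an Equatable Status enum with ready, preparing, synthesizing, idle, and failed cases, a current `status` property, and a hypothetical prepareModels() that does the actual loading:

public func ensureReady() async throws {
    switch status {
    case .ready:
        return                                            // 1. fast path
    case .preparing:
        _ = try await preparationTask?.value              // 2. join the in-flight load
    case .synthesizing:
        for await s in statusStream() where s == .ready { // 3. wait for synthesis to finish
            break
        }
    default:                                              // 4. idle or failed: start a load
        let task = Task { try await prepareModels() }     // prepareModels() is hypothetical
        preparationTask = task
        try await task.value
    }
}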

Repetition Penalty and Greedy Sampling

The sampling strategy mirrors the Python reference: apply a repetition penalty to logits of previously generated tokens, then take the argmax (greedy). No temperature, no top-k, no nucleus sampling — the model was trained expecting greedy decoding.

The penalty is applied asymmetrically: logits that are already negative are multiplied by the penalty (made more negative), while positive logits are divided (reduced). This discourages the model from looping on tokens it has already generated:

let penalty = Float(repetitionPenalty)  // 1.5
for token in penaltyTokenSet {
    let idx = Int(token)
    if idx < values.count {
        values[idx] = values[idx] < 0 ? values[idx] * penalty : values[idx] / penalty
    }
}
// Greedy: argmax via vDSP (Accelerate) for performance
var maxValue: Float = 0
var maxIndex: vDSP_Length = 0
vDSP_maxvi(values, 1, &maxValue, &maxIndex, vDSP_Length(values.count))

Using vDSP_maxvi from the Accelerate framework instead of a Swift loop gives the argmax in a single vectorized call, which is meaningfully faster when the scan runs over a 6562-element logits vector up to 1024 times per generation.


What I'd Do Differently

KV cache as a ring buffer. The current implementation grows the cache linearly up to the token limit. A fixed-size ring buffer would cap memory usage regardless of generation length and enable streaming synthesis for very long text without the memory cliff.

Beam search or temperature sampling. Greedy decoding mirrors the reference but can produce repetitive or flat prosody on certain inputs. Even a small temperature (0.7) with top-k sampling would add more natural variation, though it would require re-validating output quality against the reference.

CoreML for the decoder. The conditional decoder is run once per generation and accounts for a meaningful chunk of total latency. It's a good candidate for CoreML conversion to take advantage of the Neural Engine on Apple Silicon, which would leave the ONNX sessions to handle the embedding and LM passes where the overhead of ANE dispatch would hurt more than help.


The full implementation is part of Spokio, a native macOS/iOS TTS app built on this pipeline. The ONNX model weights are the official Chatterbox Turbo exports from Resemble AI. If you're interested in on-device TTS or have questions about the ONNX Runtime Swift bindings, feel free to reach out.