Fixing Voice Drift in TTS with a Verification Pipeline

The Hidden Problem with Text-to-Speech: Why Voices Drift

Text-to-speech has improved dramatically, but one frustrating issue still shows up in real-world use: inconsistency.

You can generate something that sounds almost perfect (great pacing, convincing emotion, clean delivery) and then suddenly, halfway through, the voice shifts. The accent drifts. The tone changes. A confident narrator turns into something slightly off.

This isn't rare. It's structural.

Modern TTS systems, including high-quality ones like ElevenLabs, rely on probabilistic generation. Even when you provide detailed tags like [excited] or [British accent], those instructions are interpreted, not enforced. That means two generations with the same input can produce slightly different results. Sometimes those differences are subtle. Sometimes they break the illusion entirely.

And here's the real pain point: even if 90% of the output is correct, you often have to regenerate everything just to fix that last 10%. When you're on a deadline, that uncertainty kills productivity.

In this article, you'll see why this happens and how a simple pipeline approach (based on chunking, tagging, and automated verification) can turn TTS from a fragile, trial-and-error process into a reliable, production-ready system.

Why Tags Alone Don't Solve It

ElevenLabs-style tags are powerful. They let you guide emotion, pacing, and even accent. In many cases, they're the only practical way to get expressive audio without manual editing.

But tags are still suggestions, not guarantees.

You might write:

[Australian accent]
[calm]
[whispers]

…and get something close. But over longer passages, models tend to "forget" or soften those constraints. Accent is especially fragile because it depends on consistent phonology across the entire clip, not just a label at the start.

So you end up in a loop:

Generate audio
Notice drift
Regenerate
Hope it sticks this time

That loop doesn't scale. When something goes wrong, you want to know exactly where to look, not start over from scratch.

A Pipeline Approach Instead of a Single Prompt

The way out of this isn't better prompting. It's better structure.

Instead of asking one model to do everything perfectly in one pass, you break the problem into a pipeline where each stage has a narrow responsibility and where outputs can be checked.

In Kemu, this kind of system is built as a Recipe, a visual workflow made of connected widgets that pass data step by step.

The pipeline follows three stages:

Text segmentation (to stabilize delivery)
Tagging (to control expression)
Generation + verification (to enforce consistency)

Each stage reduces one source of drift. Keep it simple until you have a reason not to.

Stage 1: Stabilizing the Voice with Semantic Chunking

One of the biggest causes of inconsistency is simply giving the model too much text at once.

Long passages force the model to maintain tone, pacing, and accent over extended sequences, which increases the chance of drift.

The solution is to split text into semantically coherent chunks.

Not arbitrary chunks, not fixed-length cuts. Meaningful units.

The segmentation logic works hierarchically:

First split by paragraphs
If needed, split long paragraphs into sentences
If still too long, split by clauses

But there's a key constraint: never break natural speech flow. A subject and predicate stay together, and dramatic pauses remain intact.
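The hierarchy above can be sketched in a few lines of Python. This is a minimal illustration, not the Recipe's actual logic: the names chunk_text and pack and the 200-character budget are assumptions, and splitting on punctuation is only a rough proxy for "never break natural speech flow".

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # split after sentence punctuation
CLAUSE_END = re.compile(r"(?<=[,;:])\s+")    # split after clause punctuation


def pack(pieces, max_chars):
    """Greedily merge adjacent pieces so each chunk stays under max_chars."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text, max_chars=200):
    """Split by paragraph, then sentence, then clause, only as far as needed."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # normalize whitespace
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        for group in pack(SENTENCE_END.split(paragraph), max_chars):
            if len(group) <= max_chars:
                chunks.append(group)
            else:
                # A single sentence is still too long: fall back to clauses.
                # A clause that is itself too long is kept whole, never cut.
                chunks.extend(pack(CLAUSE_END.split(group), max_chars))
    return chunks
```

Note the design choice: the budget is a target, not a hard limit. When the only way to hit it would be cutting mid-clause, the sketch keeps the clause intact and accepts a longer chunk.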

This matters because a TTS model shapes each stretch of audio from the text around it. Smaller, well-structured inputs give it less room to wander, which tends to produce more stable prosody and more consistent accents.

By the end of this stage, you no longer have one fragile generation. You have multiple controlled segments.

Stage 2: Turning Text into a Directed Performance

Once the text is chunked, the next step is to make it expressive, but in a controlled way.

This is where tagging comes in.

Instead of manually inserting tags, the workflow uses an AI agent acting as a script director. Its job is to translate plain text into a performance script using ElevenLabs-compatible tags.

It applies rules like:

Insert tags at natural boundaries
Replace punctuation with expressive cues
Introduce emotional variation where appropriate

For example, a neutral sentence might become:

"We made it to the launch site."
→ [excited] We made it to the launch site

Or:

"I'm not sure this will work."
→ [hesitant] I'm not sure this will work
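Because the director is itself a model, it can help to sanity-check its output before spending a TTS call. The sketch below is an assumption layered on top of the workflow, not part of it: check_tagged_line and the ALLOWED_TAGS list are hypothetical. It confirms that every tag is one you expect and that the spoken words still match the source.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")
# Assumed allow-list for illustration; use the tags your TTS voice supports.
ALLOWED_TAGS = {"excited", "hesitant", "calm", "whispers"}


def words(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def check_tagged_line(original, tagged, allowed=ALLOWED_TAGS):
    """Return a list of problems; an empty list means the line looks safe."""
    problems = []
    for tag in TAG.findall(tagged):
        if tag.strip().lower() not in allowed:
            problems.append(f"unknown tag: [{tag}]")
    spoken = TAG.sub(" ", tagged)  # what is left once the tags are removed
    if words(spoken) != words(original):
        problems.append("spoken words differ from the source text")
    return problems
```

Punctuation is ignored on purpose, since the director is allowed to replace it with expressive cues.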

This step improves expressiveness, but it still doesn't guarantee consistency. It just makes the model more likely to behave correctly. The catch is that you're still trusting the model to interpret the tags the same way every time.

The real shift happens in the final stage.

Stage 3: Enforcing Accent Consistency with Verification

This is where the pipeline moves beyond typical TTS workflows.

After generating audio, the system doesn't assume success. It checks the result.

An additional AI agent acts as a linguistic validator. Instead of looking at text, it analyzes the generated audio and evaluates whether the accent is actually correct.

For an Australian accent, it looks for:

Rising intonation patterns (like High Rising Terminal)
Vowel shifts (FACE and PRICE vowels)
Non-rhotic pronunciation (dropping the "r" sound unless a vowel follows, as in "car" or "water")

This is important because accent isn't just a label. It's a collection of phonetic behaviors. If those aren't present, the output is rejected.
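One way to make that rejection rule explicit is to have the validator return a structured verdict instead of a free-text opinion. The sketch below is hypothetical: AccentReport and its field names are assumptions, and how the agent actually detects each feature is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccentReport:
    """Hypothetical verdict from the validator agent for one audio clip."""
    rising_intonation: bool  # e.g. High Rising Terminal
    vowel_shifts: bool       # FACE and PRICE vowels
    non_rhotic: bool         # "r" dropped where no vowel follows

    def missing(self):
        """Names of the expected features the validator did not hear."""
        return [name for name, present in vars(self).items() if not present]

    def passed(self):
        """Accept the clip only when every expected feature is present."""
        return not self.missing()
```

A report like AccentReport(True, True, False) fails, and missing() tells you which feature to look at instead of leaving you to re-listen to the whole clip.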

So the pipeline becomes:

Generate audio
Analyze accent
Pass or reject

If rejected, the system can regenerate automatically, and because the text is already chunked, it only has to redo the chunk that failed.
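In code, the generate, analyze, pass-or-reject loop is a bounded retry. The sketch below assumes two callables you supply yourself: generate wraps your TTS call and verify wraps the validator agent. Both names, and the limit of three attempts, are placeholders.

```python
from typing import Callable, Optional


def generate_verified(
    script: str,
    generate: Callable[[str], bytes],  # assumed wrapper around the TTS call
    verify: Callable[[bytes], bool],   # assumed wrapper around the validator
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Return the first audio clip that passes verification, else None."""
    for _ in range(max_attempts):
        audio = generate(script)
        if verify(audio):
            return audio
    # Every attempt was rejected: let the caller decide what happens next.
    return None
```

Capping the attempts matters. Without a limit, a script the model simply cannot voice correctly would loop forever and burn credits.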

This transforms TTS from a best-effort process into a controlled system with quality gates.

Why This Actually Fixes the Drift Problem

Each stage addresses a different failure mode:

Chunking shortens the span over which the model has to hold tone and accent
Tagging makes the expressive intent explicit
Verification checks the audio against explicit criteria and rejects what fails

Instead of hoping the model behaves, you detect when it doesn't.

That's the key shift.

Extending the System Beyond Accent

Accent is just the starting point.

The same validation approach can be extended to other dimensions of voice consistency:

Tone stability (e.g., staying calm or energetic)
Emotional continuity across chunks
Speaking speed and rhythm
Character voice matching

You can chain multiple validation agents after generation, each responsible for one constraint.

For example:

Agent 1: Accent verification
Agent 2: Emotional consistency
Agent 3: Tone stability

If any of them fail, the audio is regenerated.
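A chain like this can be sketched as a dictionary of named checks. Everything here is an assumption for illustration: the validator names, the generate callable, and the retry limit are placeholders for whatever your agents and TTS service expose.

```python
from typing import Callable, Dict, List, Optional, Tuple

Validator = Callable[[bytes], bool]


def failed_checks(audio: bytes, validators: Dict[str, Validator]) -> List[str]:
    """Run every validator and return the names of the ones that rejected."""
    return [name for name, check in validators.items() if not check(audio)]


def generate_until_clean(
    script: str,
    generate: Callable[[str], bytes],  # assumed wrapper around the TTS call
    validators: Dict[str, Validator],
    max_attempts: int = 3,
) -> Tuple[Optional[bytes], List[List[str]]]:
    """Regenerate until all validators pass; also return a log of failures."""
    failure_log: List[List[str]] = []
    for _ in range(max_attempts):
        audio = generate(script)
        failures = failed_checks(audio, validators)
        if not failures:
            return audio, failure_log
        failure_log.append(failures)
    return None, failure_log
```

Keeping the failure log is the useful part. If the same validator rejects every attempt, the problem is more likely the script or the tags than bad luck.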

This creates something closer to a production pipeline than a simple API call.

Building This in Kemu

In Kemu, this entire system exists as a single Recipe, a visual workflow where each step is a widget connected on a canvas.

An Input Widget receives the original text
AI agent widgets handle chunking and tagging
A TTS service widget generates audio
Additional agent widgets validate the result
An Output Widget returns only approved audio

Because Kemu workflows are event-driven, you can also introduce control logic (like retries, branching, or conditional loops) without writing traditional backend code.

While implementation details depend on your setup, the key idea is to separate generation from validation and ensure only verified outputs move forward. If you're using Kemu, this can be done visually by connecting agent and service widgets with conditional routing between them.

The Bigger Takeaway

The core issue with TTS today isn't quality. It's reliability.

You can get great results. You just can't guarantee them on the first try.

What this pipeline shows is a different approach. Don't rely on a single generation to be perfect. Design a system that can detect imperfection and correct it automatically.

That's what turns TTS from a creative tool into a dependable production system.