The Hidden Problem with Text-to-Speech: Why Voices Drift
Text-to-speech has improved dramatically, but one frustrating issue still shows up in real-world use: inconsistency.
You can generate something that sounds almost perfect (great pacing, convincing emotion, clean delivery) and then suddenly, halfway through, the voice shifts. The accent drifts. The tone changes. A confident narrator turns into something slightly off.
This isn't rare. It's structural.
Modern TTS systems, including high-quality ones like ElevenLabs, rely on probabilistic generation. Even when you provide detailed tags like [excited] or [British accent], those instructions are interpreted, not enforced. That means two generations with the same input can produce slightly different results. Sometimes those differences are subtle. Sometimes they break the illusion entirely.
And here's the real pain point: even if 90% of the output is correct, you often have to regenerate everything just to fix that last 10%. When you're on a deadline, that uncertainty kills productivity.
In this article, you'll see why this happens and how a simple pipeline approach (based on chunking, tagging, and automated verification) can turn TTS from a fragile, trial-and-error process into a reliable, production-ready system.
Why Tags Alone Don't Solve It
ElevenLabs-style tags are powerful. They let you guide emotion, pacing, and even accent. In many cases, they're the only practical way to get expressive audio without manual editing.
But tags are still suggestions, not guarantees.
You might write:
[Australian accent][calm][whispers]
…and get something close. But over longer passages, models tend to "forget" or soften those constraints. Accent is especially fragile because it depends on consistent phonology across the entire clip, not just a label at the start.
So you end up in a loop:
- Generate audio
- Notice drift
- Regenerate
- Hope it sticks this time
That loop doesn't scale. When something goes wrong, you want to know exactly where to look, not start over from scratch.
A Pipeline Approach Instead of a Single Prompt
The way out of this isn't better prompting. It's better structure.
Instead of asking one model to do everything perfectly in one pass, you break the problem into a pipeline where each stage has a narrow responsibility and where outputs can be checked.
In Kemu, this kind of system is built as a Recipe, a visual workflow made of connected widgets that pass data step by step.
The pipeline follows three stages:
- Text segmentation (to stabilize delivery)
- Tagging (to control expression)
- Generation + verification (to enforce consistency)
Each stage reduces one source of drift. Keep it simple until you have a reason not to.
Stage 1: Stabilizing the Voice with Semantic Chunking
One of the biggest causes of inconsistency is simply giving the model too much text at once.
Long passages force the model to maintain tone, pacing, and accent over extended sequences, which increases the chance of drift.
The solution is to split text into semantically coherent chunks.
Not arbitrary chunks, not fixed-length cuts. Meaningful units.
The segmentation logic works hierarchically:
- First split by paragraphs
- If needed, split long paragraphs into sentences
- If still too long, split by clauses
But there's a key constraint: never break natural speech flow. A subject and predicate stay together, and dramatic pauses remain intact.
This matters because TTS models rely heavily on context windows. Smaller, well-structured inputs lead to more stable prosody and more consistent accents.
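The hierarchy above can be sketched in a few lines of Python. This is a minimal illustration, not the logic Kemu's chunking agent actually runs: the `chunk_text` and `_pack` names, the 300-character budget, and the use of punctuation as a stand-in for "natural speech flow" are all assumptions made for the example.

```python
import re
from typing import List

MAX_CHARS = 300  # assumed per-chunk budget; tune it for your TTS model


def _pack(pieces: List[str], max_chars: int) -> List[str]:
    """Greedily merge adjacent pieces so each chunk stays within max_chars."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split by paragraph, then by sentence, then by clause, in that order."""
    chunks: List[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # normalize whitespace
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)  # the paragraph fits as one unit
            continue
        pieces: List[str] = []
        # Sentence boundaries: whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if len(sentence) <= max_chars:
                pieces.append(sentence)
            else:
                # Clause boundaries: whitespace that follows , ; or :
                pieces.extend(re.split(r"(?<=[,;:])\s+", sentence))
        # Chunks never cross a paragraph boundary.
        chunks.extend(_pack(pieces, max_chars))
    return chunks
```

A clause that is still longer than the budget is kept whole rather than cut mid-phrase, which errs on the side of preserving speech flow. Punctuation is only a rough proxy for meaning (an abbreviation such as "Dr." would trigger a false sentence break), which is one reason an AI agent is a better fit for this stage than a regex.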

By the end of this stage, you no longer have one fragile generation. You have multiple controlled segments.
Stage 2: Turning Text into a Directed Performance
Once the text is chunked, the next step is to make it expressive, but in a controlled way.
This is where tagging comes in.
Instead of manually inserting tags, the workflow uses an AI agent acting as a script director. Its job is to translate plain text into a performance script using ElevenLabs-compatible tags.
It applies rules like:
- Insert tags at natural boundaries
- Replace punctuation with expressive cues
- Introduce emotional variation where appropriate
For example, a neutral sentence might become:
"We made it to the launch site."
→ [excited] We made it to the launch site
Or:
"I'm not sure this will work."
→ [hesitant] I'm not sure this will work
This step improves expressiveness, but it still doesn't guarantee consistency. It just makes the model more likely to behave correctly. The catch is that you're still trusting the model to interpret the tags the same way every time.
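One inexpensive guard at this stage is to confirm that the director agent only added tags and did not rewrite the script. The sketch below assumes a hypothetical `call_agent(prompt, text)` callable standing in for whatever LLM client or agent widget you use; the prompt wording and function names are illustrative, not part of Kemu or ElevenLabs.

```python
import re
from typing import Callable, List

# Matches a bracketed performance tag such as [excited] or [Australian accent].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")

# Illustrative prompt; the rules mirror the ones listed in the article.
DIRECTOR_PROMPT = (
    "You are a script director. Add bracketed performance tags at natural "
    "boundaries. You may replace punctuation with expressive cues. "
    "Do not add, remove, or reorder any spoken words."
)


def strip_tags(tagged: str) -> str:
    """Remove bracketed tags and collapse the leftover whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", tagged).split())


def spoken_words(text: str) -> List[str]:
    """Lowercased words only, so punctuation changes are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tag_chunk(chunk: str, call_agent: Callable[[str, str], str]) -> str:
    """Ask the agent for a tagged script; fall back to the plain chunk
    if the spoken words no longer match the original."""
    tagged = call_agent(DIRECTOR_PROMPT, chunk)
    if spoken_words(strip_tags(tagged)) != spoken_words(chunk):
        return chunk  # the agent changed the script, so discard its output
    return tagged
```

Comparing words rather than raw strings is deliberate: the director is allowed to swap punctuation for cues, so only the spoken content is checked.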
The real shift happens in the final stage.
Stage 3: Enforcing Accent Consistency with Verification
This is where the pipeline moves beyond typical TTS workflows.
After generating audio, the system doesn't assume success. It checks the result.
An additional AI agent acts as a linguistic validator. Instead of looking at text, it analyzes the generated audio and evaluates whether the accent is actually correct.
For an Australian accent, it looks for:
- Rising intonation patterns (like High Rising Terminal)
- Vowel shifts (FACE and PRICE vowels)
- Non-rhotic pronunciation (dropping the "r" sound when no vowel follows it, as in "car" or "hard")
This is important because accent isn't just a label. It's a collection of phonetic behaviors. If those aren't present, the output is rejected.
So the pipeline becomes:
- Generate audio
- Analyze accent
- Pass or reject
If rejected, the system can regenerate automatically.
This transforms TTS from a best-effort process into a controlled system with quality gates.
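The generate, analyze, and pass-or-reject loop can be sketched as a small retry function. Everything named here is an assumption for illustration: `synthesize` stands in for your TTS call, `validate` for the accent-checking agent, and the cap of three attempts is an added safeguard so a stubborn chunk cannot loop forever.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    reason: str = ""  # e.g. which accent marker was missing


def generate_verified(
    tagged_text: str,
    synthesize: Callable[[str], bytes],   # hypothetical TTS call
    validate: Callable[[bytes], Verdict],  # hypothetical accent validator
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Return the first take that passes validation, or None if every
    attempt is rejected."""
    for _ in range(max_attempts):
        audio = synthesize(tagged_text)
        verdict = validate(audio)
        if verdict.passed:
            return audio  # only approved audio leaves the loop
    return None  # the caller decides whether to escalate or fall back
```

Because Stage 1 already split the script into chunks, a rejection costs one chunk's regeneration rather than the whole recording.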
Why This Actually Fixes the Drift Problem
Each stage addresses a different failure mode:
- Chunking shortens the span over which the model has to hold tone, pacing, and accent
- Tagging improves expressive intent
- Verification checks each output against explicit criteria and rejects the ones that miss
Instead of hoping the model behaves, you detect when it doesn't.
That's the key shift.
Extending the System Beyond Accent
Accent is just the starting point.
The same validation approach can be extended to other dimensions of voice consistency:
- Tone stability (e.g., staying calm or energetic)
- Emotional continuity across chunks
- Speaking speed and rhythm
- Character voice matching
You can chain multiple validation agents after generation, each responsible for one constraint.
For example:
- Agent 1: Accent verification
- Agent 2: Emotional consistency
- Agent 3: Tone stability
If any of them fails, the audio is regenerated.
This creates something closer to a production pipeline than a simple API call.
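A chain of validators can be modeled as a list of named checks that all have to pass. The sketch below is a hedged illustration: the `failed_checks` helper and the validator names are hypothetical, and each check is assumed to be a callable that takes audio bytes and returns True or False.

```python
from typing import Callable, List, Tuple

# A validator takes audio bytes and reports whether its one constraint holds.
Validator = Callable[[bytes], bool]


def failed_checks(audio: bytes, validators: List[Tuple[str, Validator]]) -> List[str]:
    """Run every validator and return the names of those that failed.
    An empty list means the audio is approved."""
    return [name for name, check in validators if not check(audio)]
```

With a chain such as `[("accent", check_accent), ("emotion", check_emotion), ("tone", check_tone)]`, a non-empty result tells you which constraint was missed before you regenerate. Running every check instead of stopping at the first failure costs more per take, but it shows whether a chunk is failing for one reason or several.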
Building This in Kemu
In Kemu, this entire system exists as a single Recipe, a visual workflow where each step is a widget connected on a canvas.
- An Input Widget receives the original text
- AI agent widgets handle chunking and tagging
- A TTS service widget generates audio
- Additional agent widgets validate the result
- An Output Widget returns only approved audio
Because Kemu workflows are event-driven, you can also introduce control logic (like retries, branching, or conditional loops) without writing traditional backend code.
While implementation details depend on your setup, the key idea is to separate generation from validation and ensure only verified outputs move forward. If you're using Kemu, this can be done visually by connecting agent and service widgets with conditional routing between them.
The Bigger Takeaway
The core issue with TTS today isn't quality. It's reliability.
You can get great results. You just can't guarantee them on the first try.
What this pipeline shows is a different approach. Don't rely on a single generation to be perfect. Design a system that can detect imperfection and correct it automatically.
That's what turns TTS from a creative tool into a dependable production system.
