AudioShake Debuts Dialogue RT: The First AI Dialogue Isolation Model Built for Live Broadcast

Dialogue isolation is the process of extracting speech as a distinct signal from a mixed audio feed — removing crowd noise, PA bleed, and ambient sound entirely, rather than suppressing them. Until now, AI-based dialogue isolation has been a post-production tool. The models that do it well introduce too much latency to be usable in live broadcast chains.
AudioShake’s Dialogue RT changes that. It delivers end-to-end dialogue isolation in 11ms, from input signal to isolated dialogue output, running on NVIDIA DGX Spark, making it the first AI dialogue isolation model to operate within the latency envelope required for live production.
What Is Dialogue RT?
Dialogue RT is AudioShake’s ultra-low latency dialogue isolation model. It takes a live mixed audio feed — commentary, crowd noise, PA bleed, ambient sound — and returns a clean, isolated dialogue signal in real time, with 11ms of end-to-end latency.
It runs on NVIDIA DGX Spark and NVIDIA’s Blackwell architecture family of GPUs, and is available via the AudioShake SDK for direct integration into broadcast infrastructure and media workflows.
Unlike noise suppression tools, which modify a mixed signal to reduce unwanted sound, Dialogue RT performs true isolation — extracting the speech stem and discarding everything else. The output is a clean dialogue signal that can be used independently of the original mix.
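To make the distinction concrete, here is a minimal Python sketch. The function names and the generic separation model are illustrative stand-ins, not any AudioShake API; the point is the shape of the output. Suppression returns one modified mix, while isolation returns two independent signals:

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, standard broadcast rate

def suppress(mix: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Suppression: attenuate unwanted energy inside the mix itself.
    The output is still one mixed signal; crowd and PA are quieter
    but not gone, and cannot be adjusted independently."""
    return mix * mask

def isolate(mix: np.ndarray, model) -> tuple[np.ndarray, np.ndarray]:
    """Isolation: extract speech as its own stem. `model` stands in
    for a separation model (hypothetical interface). The caller gets
    two independent signals: dialogue, and everything else."""
    dialogue = model(mix)        # speech stem only
    residual = mix - dialogue    # crowd, PA, ambience
    return dialogue, residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
    speech = 0.5 * np.sin(2 * np.pi * 220 * t)        # stand-in dialogue
    mix = speech + 0.2 * rng.standard_normal(len(t))  # plus crowd noise

    # A perfect oracle stands in for a real separation model here.
    dialogue, residual = isolate(mix, model=lambda m: speech)
```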
The Problem It Solves
A1s and broadcast audio engineers contend with audio feeds that are rarely clean. Live broadcast puts microphones in front of crowds, chaotic fields, and noisy press conferences. Crowd noise bleeds into commentary feeds. Stadium PA bleeds into sideline mics. And field reporters broadcast from streets that have no respect for production schedules.
The consequences show up in two places. First, ASR and captioning accuracy drops significantly when noisy audio is fed into transcription systems — especially in live workflows already operating under tight latency constraints. Second, broadcast engineers spend significant effort managing the mix manually: adjusting thresholds on denoising hardware, compensating for bleed, and working around tools that suppress noise but cannot fully separate overlapping sources.
The underlying issue is that existing tools treat these as two separate problems requiring two separate workflows. Dialogue RT treats them as one.
How Dialogue RT Compares to Existing Tools
Roughly 10–15ms is the latency window inside which live broadcast chains can operate without creating perceptible lip-sync issues. Until Dialogue RT, no AI dialogue isolation model operated within that window.
For example, Waves Clarity Vx Pro, an AI dialogue isolation plugin for post-production, introduces approximately 42ms of latency at 48kHz, roughly three to four times the live broadcast threshold.
In contrast, CEDAR’s hardware DNS products achieve near-zero latency, but perform noise suppression rather than dialogue isolation. Suppression and isolation are different operations: suppression attenuates unwanted sound within a mixed signal; isolation extracts speech as a separate stem. Isolation gives engineers independent control over dialogue and the rest of the mix. Suppression does not.
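In sample terms, assuming a 48kHz feed and latency dominated by buffering (a simplification, since processing time also contributes), the budgets above work out as follows:

```python
SAMPLE_RATE = 48_000  # Hz, standard broadcast rate

def latency_ms(frames: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Milliseconds of delay contributed by buffering `frames` samples."""
    return 1_000 * frames / sample_rate

print(latency_ms(528))   # 11.0 -> Dialogue RT's end-to-end budget
print(latency_ms(720))   # 15.0 -> upper edge of the live window
print(latency_ms(2016))  # 42.0 -> typical post-production plugin
```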
Dialogue RT is the first AI model to perform true dialogue isolation within the live broadcast latency envelope.
What This Enables in Broadcast Workflows
Dialogue RT’s practical implications extend across the broadcast chain:
A single source of truth for audio.
Instead of managing parallel feeds for production and transcription, teams can isolate dialogue directly from the main feed and route it downstream—eliminating duplicate workflows.
Fewer microphones, less complexity.
In many live productions, engineers deploy dozens of microphones to compensate for bleed between crowd, PA, and commentary. Dialogue RT reduces that dependency—enabling cleaner results with fewer inputs and simpler signal chains.
Better captioning and ASR accuracy.
Feeding transcription systems isolated dialogue rather than a noisy mix directly improves accuracy by removing background interference at the source. It also simplifies downstream dubbing, localization, and international distribution, all from a single live source.
A more hands-off mix, with control where it matters.
Broadcast engineers can set dialogue isolation on the primary feed and avoid constantly managing crowd noise, stadium PA, or unpredictable field conditions. Unlike denoising tools, which require ongoing threshold tuning, Dialogue RT adapts in real time to changing environments—while still giving engineers the ability to dial in or override as needed.
AI-Media, which provides live captioning and translation to broadcasters worldwide, is already putting Dialogue RT to work in production. Their use case is a clean illustration of what real-time isolation unlocks: instead of feeding a noisy mixed signal into a transcription engine and accepting the accuracy tradeoff, they can isolate dialogue first and route a clean stem downstream — better captions, better translations, from a single live source.
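As a rough sketch of that routing pattern, here is what a captioning pipeline built on real-time isolation might look like. The `DialogueRT` class, its `process` method, and the `feed` call on the ASR client are illustrative assumptions, not the published AudioShake SDK or the AI-Media API:

```python
import numpy as np

FRAME = 528  # samples per 11ms frame at 48kHz

class DialogueRT:
    """Stand-in for a real-time isolation model (hypothetical
    interface): consumes one frame of mixed audio, returns the
    isolated dialogue frame of the same length."""
    def process(self, frame: np.ndarray) -> np.ndarray:
        return frame  # a real model would return only the speech stem

class _PrintASR:
    """Trivial stand-in for a streaming ASR/captioning client."""
    def feed(self, frame: np.ndarray) -> None:
        pass  # a real client would emit caption text here

def caption_pipeline(frames, model: DialogueRT, asr) -> None:
    """Route the isolated stem, not the raw mix, into ASR."""
    for frame in frames:
        dialogue = model.process(frame)  # clean dialogue, frame by frame
        asr.feed(dialogue)               # captions/translation downstream

mix_frames = (np.zeros(FRAME) for _ in range(3))  # silent demo frames
caption_pipeline(mix_frames, DialogueRT(), _PrintASR())
```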
Frequently Asked Questions
What is Dialogue RT?
Dialogue RT is AudioShake’s real-time dialogue isolation model. It extracts clean speech from a live mixed audio feed at 11ms end-to-end latency — the first AI model to meet the latency requirements for live broadcast production.
What is the difference between dialogue isolation and noise suppression?
Noise suppression modifies a mixed signal to reduce unwanted sound while keeping the mix intact. Dialogue isolation extracts speech as a separate signal, removing everything else entirely. Isolation gives engineers independent control over dialogue and the surrounding audio — crowd, PA, ambient sound — as distinct elements. Suppression is extremely fast, but it doesn’t offer clean isolation or the same level of audio control.
Why does 11ms matter for live broadcast?
Live broadcast audio chains require processing latency below approximately 10–15ms to avoid perceptible lip-sync issues. Other AI dialogue isolation tools, including Waves Clarity Vx Pro, operate at 42ms or more — too slow for live production. Dialogue RT operates at 11ms, making it the first AI isolation model usable inside a live broadcast chain.
What hardware does Dialogue RT run on?
Dialogue RT runs on NVIDIA DGX Spark and NVIDIA’s Blackwell architecture family of GPUs. It is available via the AudioShake SDK for integration into existing broadcast infrastructure and media workflows.
How does Dialogue RT improve ASR and captioning accuracy?
ASR and captioning accuracy degrades when transcription systems receive noisy mixed audio. By isolating the dialogue signal before it reaches the ASR engine, Dialogue RT removes background interference at the source — improving transcription performance without adding perceptible delay.
Get Started
Dialogue RT is available now via the AudioShake SDK, designed for direct integration into existing broadcast infrastructure, voice AI pipelines, and media workflows.
Get started at dashboard.audioshake.ai or read the documentation at developer.audioshake.ai.