Targeted vs. Generative Audio Separation: Reflections on Meta’s SAM Audio Benchmark Results

Meta recently released its latest SAM Audio model, which shows impressive results in its ability to isolate (or “segment”) individual sounds from complex audio mixes using text prompts, visual prompts, and time-span prompts. In other words, you can tell the model what to pull out–“guitar,” “noise,” etc.–and it will attempt to isolate that sound.
Conceptually, query-based separation has many benefits–above all, it can separate far more than a model that only hunts for one or more sounds in a fixed configuration. For example, you could theoretically ask a model like SAM Audio to separate all kazoos or triangles, and it may be able to produce nice-sounding output. In addition, it can use multi-modal input like visuals to help with the separation task–for example, “seeing” that a video contains a female speaker and a male speaker can help the model isolate the female speaker’s voice.
In contrast to SAM Audio, companies like AudioShake focus on “targeted” models–that is, source-specific and mostly discriminative (not generative) models. These models are built to separate a particular sound or set of sounds–for example, a model trained to separate multiple voices, or to isolate a wind instrument from a jazz combo. The downsides are exactly what you might expect. For one, it’s inefficient–you may have to use multiple models to separate multiple sounds. Second, you can’t build models for every sound in the universe–are you really going to build a targeted model for bird sounds and dog sounds and car horns, and so on? Probably not.
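To make that contrast concrete, here is a minimal sketch of the two interfaces as described above. The function names, stem names, and placeholder bodies are hypothetical stand-ins for real model calls–they do not reflect the actual SAM Audio or AudioShake APIs.

```python
import numpy as np

# Illustration only: neither function reflects the real SAM Audio or
# AudioShake APIs; the bodies are placeholders standing in for a model call.

def query_based_separate(mixture: np.ndarray, prompt: str) -> np.ndarray:
    """Open-vocabulary separation: describe the target sound ("guitar",
    "kazoo", "the female speaker") and get a single stem back."""
    return np.zeros_like(mixture)  # placeholder for the promptable model's output

def targeted_separate(mixture: np.ndarray) -> dict[str, np.ndarray]:
    """Targeted separation: the model is trained for a fixed set of sources
    and returns all of them at once, with no prompt required."""
    return {name: np.zeros_like(mixture)  # placeholder stems
            for name in ("dialogue", "music", "effects")}

mix = np.random.randn(44_100)                 # one second of mono audio at 44.1 kHz
guitar = query_based_separate(mix, "guitar")  # one stem per query
stems = targeted_separate(mix)                # every stem, every time, no prompt
```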
So why do we and others use this approach?
1. Targeted Models Can Often Produce Better Perceptual Quality
First, for the most part, you can achieve better quality when you build targeted models. Across the listening tests reported by Meta, AudioShake was the highest-performing discriminative model on perceptual quality (higher than Moises / Music.AI, FADR, Lalal.ai, and others), and in some categories—such as instrument separation—listeners preferred its outputs over SAM’s.
For our customers–who include the world’s largest film studios, music labels, sports leagues, and tech companies–robustness is of the highest importance. You can’t dub a film or make a music track immersive if you’re suddenly grappling with imperfectly separated sound. Nor would you want to train your models on imperfectly separated audio data.
2. Non-Generative Targeted Models = Ground-Truth Separation with No Hallucinations
Next, for many workflows, it’s incredibly important that no new audio information is added. Imagine a news broadcaster or a forensics expert who wants separated audio, but needs to ensure that the separations are nothing more than the full mix, separated.
A generative audio model cannot guarantee that. Indeed, SAM Audio is a diffusion model and shows many hallucinations in its output. This is akin to the dangers of writing a legal or research paper using an LLM today—you might get made-up references, which could have ethical, societal, or legal consequences.
If you listen to outputs from the SAM Audio model, they can often sound quite good–but they don’t actually match the original audio. The amplitude doesn’t match the original mix, and the model can hallucinate, so the output audio can end up sounding like a different instrument.
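One way to make “nothing more than the full mix, separated” testable is a mixture-consistency check: the stems from a discriminative separator should sum back to (approximately) the original mix, while a generative model offers no such guarantee. This is a generic sketch, not AudioShake’s actual QA code, and the -40 dB threshold is an arbitrary assumption:

```python
import numpy as np

def stems_reconstruct_mix(stems: list[np.ndarray], mixture: np.ndarray,
                          tol_db: float = -40.0) -> bool:
    """Check that separated stems add back up to the original mixture.

    A mask-based discriminative separator decomposes the mix, so the residual
    should be tiny; a generative separator can synthesize audio that was never
    in the mix, so this check can fail badly.
    """
    residual = mixture - np.sum(stems, axis=0)
    # Residual energy relative to mixture energy, in dB (epsilons avoid log(0)).
    residual_db = 10 * np.log10(
        (np.sum(residual ** 2) + 1e-12) / (np.sum(mixture ** 2) + 1e-12)
    )
    return residual_db < tol_db
```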
This is also why the SAM Audio team (quite reasonably) needed to develop its own criteria for evaluating these models–the usual metric that targeted models like ours use (the signal-to-distortion ratio, or SDR) wouldn’t make much sense here. The converse is also true–most of the SAM Audio evaluation criteria, such as the ability to create all the sounds in a mix, are irrelevant to the evaluation of a targeted model, but absolutely relevant to the evaluation of a generative model.
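For readers unfamiliar with it, SDR measures how closely a separated estimate matches a ground-truth stem–which is exactly why it presupposes that such a reference exists. Here is the metric in its simple, scale-sensitive form; benchmark suites typically use refinements such as SI-SDR or the BSSEval variant:

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB: reference energy over error energy.

    Higher is better; an estimate identical to the reference is effectively
    unbounded (capped here only by the small epsilon terms).
    """
    error = estimate - reference
    return 10 * np.log10(
        (np.sum(reference ** 2) + 1e-12) / (np.sum(error ** 2) + 1e-12)
    )
```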
3. Targeted Models Can Separate Without Prompting
SAM’s prompting approach opens up a lot of creative use cases and the ability to harness multi-modal information to aid with separation. But prompting also introduces friction and ambiguity. Unlike SAM Audio, which cannot reliably extract multiple sources of the same class without additional visual or temporal cues, targeted models like AudioShake’s multi-speaker system separate overlapping voices automatically and without prompting.
For example, while SAM’s model needs you to tell it what to separate (“separate the female speaker”), AudioShake’s model separates all the speakers, regardless of gender, age, or number of voices.
4. Targeted Models = Ultra-Performant
Finally, a core focus for AudioShake, in addition to quality, is performance. While generative models will become more efficient, for now, models like SAM Audio are simply too large to deploy on-device–let alone on a CPU. And there are many, many use cases–from speech isolation in headphones through to consumer apps like DJing–that require on-device capabilities.
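One rough yardstick for on-device viability is the real-time factor: processing time divided by audio duration, where values comfortably below 1.0 are needed for live, on-device use. This is a generic sketch–the `separate` callable is a stand-in for any separation model, not a specific AudioShake or SAM Audio call:

```python
import time
import numpy as np

def real_time_factor(separate, mixture: np.ndarray, sample_rate: int = 44_100) -> float:
    """Return processing time divided by audio duration (lower is better)."""
    audio_seconds = len(mixture) / sample_rate
    start = time.perf_counter()
    separate(mixture)                      # run the separation model once
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Example with a trivial stand-in "model" on ten seconds of audio:
rtf = real_time_factor(lambda x: x * 0.5, np.random.randn(10 * 44_100))
print(f"real-time factor: {rtf:.4f}")
```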
SAM Audio meaningfully expands what’s possible with promptable, open-vocabulary separation. At the same time, the benchmarks highlight that for production-critical separation—especially when faithfulness, determinism, or multi-speaker handling is required—AudioShake’s targeted models continue to set the standard.
Make no mistake: we are super excited about the SAM Audio model and the many opportunities it opens up. The world likely needs both approaches–general and targeted–to solve varying challenges. It’s motivating and validating to see so much momentum in the audio space after years of people telling us we have weird hobbies :) We’re grateful to the Meta team for including us in their benchmarking test, and proud of the AudioShake research team for its best-in-class performance.