AI Music Detection

Detect AI-generated music - vocals, instrumentals, and everything in between.

Most detectors return a single pass/fail for an entire track. Modulate's AI Music Detection API tells you where in a track AI was used, not just whether it was.
How It Works

Clip-level verdicts built from per-window evidence.

Send audio, get back a primary verdict, per-window scores, and confidence values. Batch or real-time streaming — same structured output either way.

Batch API

Send a complete audio file, receive a clip-level primary_verdict plus a per-window breakdown. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav — up to 100 MB.
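
A client-side pre-check can catch unsupported files before upload. The sketch below uses only the format list and size cap stated above; the helper name `check_upload` is hypothetical, and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Formats and size cap as listed above.
# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")
    return problems
```

Running this check before sending a file avoids a round trip for uploads that would likely be rejected.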

Streaming API (WebSocket)

Connect over WebSocket and receive per-window vocal AI verdicts as audio arrives. Instrumental AI detection and the final clip-level verdict are delivered in the done message at end of stream.

Structured output

Every response includes primary_verdict, vocal_ai_percentage, vocal_ai_confidence, instrumental_ai_percentage, instrumental_ai_confidence, and a full per-window breakdown.
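
One way to consume these fields is to parse the response into a typed record. This is a minimal sketch, not the official client: the five top-level field names come from the list above, while the `windows` key for the per-window breakdown and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DetectionResult:
    primary_verdict: str
    vocal_ai_percentage: float
    vocal_ai_confidence: float
    instrumental_ai_percentage: float
    instrumental_ai_confidence: float
    # Hypothetical key name for the per-window breakdown.
    windows: list = field(default_factory=list)


def parse_result(payload: str) -> DetectionResult:
    """Parse a JSON response body, raising KeyError if a documented field is missing."""
    data = json.loads(payload)
    return DetectionResult(
        primary_verdict=data["primary_verdict"],
        vocal_ai_percentage=float(data["vocal_ai_percentage"]),
        vocal_ai_confidence=float(data["vocal_ai_confidence"]),
        instrumental_ai_percentage=float(data["instrumental_ai_percentage"]),
        instrumental_ai_confidence=float(data["instrumental_ai_confidence"]),
        windows=list(data.get("windows", [])),
    )


# Illustrative payload with made-up values, not real API output.
sample = json.dumps({
    "primary_verdict": "ai-vocal-music",
    "vocal_ai_percentage": 62.5,
    "vocal_ai_confidence": 0.9,
    "instrumental_ai_percentage": 10.0,
    "instrumental_ai_confidence": 0.7,
})
result = parse_result(sample)
```

Failing loudly on a missing field keeps schema drift from silently turning into wrong moderation decisions.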

Tunable thresholds

Adjust the precision/recall tradeoff without retraining. Set vocal and instrumental thresholds independently to match your use case.
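
To illustrate the tradeoff, the sketch below applies thresholds to per-window scores on the client side. How thresholds are actually passed to the API is not shown here; the function name, the 0 to 1 score scale, and the sample scores are all assumptions made for illustration.

```python
def flag_windows(scores, threshold):
    """Return the indices of windows whose score meets or exceeds the threshold."""
    return [index for index, score in enumerate(scores) if score >= threshold]


# Made-up per-window scores on an assumed 0-1 scale.
vocal_scores = [0.10, 0.55, 0.92, 0.48, 0.81]
instrumental_scores = [0.30, 0.35, 0.70, 0.65, 0.20]

# A higher threshold flags fewer windows (favors precision);
# a lower threshold flags more windows (favors recall).
strict_vocal = flag_windows(vocal_scores, threshold=0.8)
lenient_vocal = flag_windows(vocal_scores, threshold=0.5)

# The instrumental threshold is chosen independently of the vocal one.
instrumental = flag_windows(instrumental_scores, threshold=0.6)
```

With these sample scores, the strict vocal setting flags two windows and the lenient one flags three, which is the precision/recall lever described above.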
AI Music Detection Capabilities

Two detection paths. One API.

Most AI music detectors break down on hybrid tracks, multi-part productions, and any track where AI was used for only part of the composition. Modulate runs two independent models, one for vocals and one for instrumentals, and scores each 4-second window separately, so every clip-level result is grounded in per-window evidence.

Vocal detection

Modulate identifies AI-generated singing, rap, or any vocal performance within a 4-second window — including AI vocals over organic human music. Each window is scored independently, returning vocal_ai_percentage and vocal_ai_confidence.

Instrumental detection

Modulate identifies AI-generated instrumental content when no sufficient vocal content is present — including fully AI-generated tracks with no vocals at all. Results are delivered per-window and aggregated into the clip-level verdict.
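
Per-window results can be turned into time ranges that show where in a track AI was detected. The sketch below merges consecutive flagged windows into spans; it assumes windows are back-to-back and non-overlapping, and the function name `ai_spans` is hypothetical.

```python
WINDOW_SECONDS = 4  # window length stated above


def ai_spans(flags, window_seconds=WINDOW_SECONDS):
    """Merge consecutive flagged windows into (start_seconds, end_seconds) spans.

    Assumes windows are contiguous and non-overlapping.
    """
    spans = []
    start = None
    for index, flagged in enumerate(flags):
        if flagged and start is None:
            start = index  # a new flagged run begins here
        elif not flagged and start is not None:
            spans.append((start * window_seconds, index * window_seconds))
            start = None
    if start is not None:
        # The track ended while a flagged run was still open.
        spans.append((start * window_seconds, len(flags) * window_seconds))
    return spans


# Five windows (20 seconds): windows 1-2 and window 4 are flagged.
spans = ai_spans([False, True, True, False, True])
```

Spans like these are easier for a reviewer to audit than a list of individual window indices.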
Modulate vs. industry standard

A side-by-side comparison for teams evaluating AI music detection solutions.

Feature | Modulate | Industry Standard
Detection paths | Vocal + instrumental, independent | Single combined score per track
Output granularity | Per-4-second window + clip-level verdict | Clip-level only
Speech signals | Per-4s window for vocal AI; instrumental AI at end-of-stream | Single score at end of file
Hybrid content handling | Window-level visibility into mixed tracks | Not supported
Threshold tuning | Yes, no retraining needed | Fixed or not offered
Streaming support | Real-time WebSocket streaming | Batch only
Additional Velma models | Transcription, Deepfake Detection, PII Redaction, Emotion, Accent | Music detection only
Self-serve API | Yes, sign up and start in minutes | Varies; some require enterprise partnership negotiation
Cost | $0.07/hr | Not publicly disclosed, or per-track pricing
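
For budgeting, the hourly rate can be turned into a rough catalog estimate. This sketch assumes billing is linear in audio duration with no minimums or rounding, which is not confirmed here; the function name and the catalog figures are illustrative.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.07  # rate listed in the comparison above


def estimate_cost_usd(track_count, average_minutes):
    """Estimate cost, assuming billing is linear in audio duration (an assumption)."""
    audio_hours = track_count * average_minutes / 60
    return round(audio_hours * PRICE_PER_AUDIO_HOUR_USD, 2)


# Illustrative catalog: 10,000 tracks averaging 3.5 minutes each.
catalog_cost = estimate_cost_usd(10_000, 3.5)
```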
What We Solve

It's time to get ahead of the AI music challenges your team is already dealing with.

AI-generated music is evolving faster than manual review can handle. AI Music Detect by Modulate gives you a scalable, self-serve detection layer, so you can enforce policies, protect rights, and stay compliant.

Reduce false positives

Window-level scoring and tunable thresholds mean you’re not stuck with a fixed cutoff. Adjust vocal and instrumental sensitivity independently to match your tolerance, so legitimate tracks don’t get flagged and real violations don’t slip through.

Protect royalty payouts

AI-generated tracks are being uploaded at scale to farm streaming royalties. Modulate flags them at ingestion before they dilute payouts for legitimate human artists.

Enforce platform policies at scale

Spotify, Apple Music, YouTube Music, and most major DSPs require AI disclosure or restrict AI monetization. Manual review at ingestion volume isn't viable. Modulate gives you automated detection that enforces policy without adding headcount.

Know what you're licensing

Purely AI-generated works without meaningful human authorship aren't copyrightable under US Copyright Office guidance. Velma lets licensors, sync agencies, and clearance teams verify content before contracts are issued.

Screen content before it reaches platforms

DSPs are pushing distributors to screen uploads upstream. Velma gives distributors in the mold of DistroKid, TuneCore, and CD Baby a drop-in detection layer, so they can meet platform requirements before those requirements become mandates.

Stay ahead of AI disclosure regulations

The EU AI Act requires disclosure of AI-generated content in covered contexts. Velma gives platforms operating in affected jurisdictions a reliable way to identify non-disclosing uploads before they create compliance exposure.
How It Works

1. Send audio
Point Velma AI Music Detect at a file or a live stream. One endpoint, no pipeline to assemble. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav.

2. Get per-window verdicts
Every 4-second window is scored independently for vocal AI and instrumental AI content. Streaming returns vocal results in real time; instrumental results and the clip-level verdict arrive at end of stream.

3. Receive structured output
Get back a primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music, plus confidence scores and a full per-window breakdown ready to plug into your pipeline.
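
The three primary_verdict values lend themselves to a small routing table. The sketch below uses the verdict strings listed above; the action names and the `route` helper are hypothetical placeholders for whatever your platform policy requires.

```python
from enum import Enum


class PrimaryVerdict(Enum):
    AI_VOCAL_MUSIC = "ai-vocal-music"
    AI_INSTRUMENTAL = "ai-instrumental"
    NOT_AI_MUSIC = "not-ai-music"


# Hypothetical policy actions; substitute your own.
ACTIONS = {
    PrimaryVerdict.AI_VOCAL_MUSIC: "hold_for_review",
    PrimaryVerdict.AI_INSTRUMENTAL: "hold_for_review",
    PrimaryVerdict.NOT_AI_MUSIC: "publish",
}


def route(primary_verdict: str) -> str:
    """Map a primary_verdict string to a policy action; unknown values raise ValueError."""
    return ACTIONS[PrimaryVerdict(primary_verdict)]
```

Raising on an unrecognized verdict, rather than defaulting to "publish", keeps a future new verdict value from bypassing review unnoticed.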
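Streaming, start to finish:

import json
from websockets.sync.client import connect  # pip install websockets

# 1 · open a connection   2 · stream audio   3 · read results
with connect("wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming?api_key=…") as ws:
    ws.send(audio_chunk)  # stream your audio (bytes); repeat for each chunk
    ws.send("")           # signal end of stream
    for message in ws:    # window verdicts, then the final done message
        event = json.loads(message)  # assumes each message is JSON text
        if "primary_verdict" in event:  # clip-level verdict arrives in the done message
            print(event["primary_verdict"])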
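
The snippet above sends a single audio_chunk; in practice the audio is sent as a sequence of chunks. A minimal sketch of a chunk reader follows. The helper name and the default chunk size are assumptions, not values documented for this API.

```python
import io


def iter_chunks(stream, chunk_size=32_000):
    """Yield successive byte chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# In-memory stand-in for an audio file opened with open(path, "rb").
chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

Each yielded chunk would be passed to ws.send(chunk), followed by the empty message that signals end of stream.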

Start building with Modulate.

Grab an API key or try the playground to see Velma analyze a real track.
More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Velma API

Voice-native AI that detects emotion, intent, compliance, and coaching — from audio, not transcripts.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

Transcription API

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.