AI Music Detection
Detect AI-generated music — vocals, instrumentals, and everything in between.
Most detectors return a single pass/fail for an entire track. Modulate AI Music Detect scores every 4-second segment, separately evaluating vocal and instrumental content — so you know where in a track AI was used, not just whether it was.
How It Works
Clip-level verdicts built from per-window evidence.
Send audio, get back a primary verdict, per-window scores, and confidence values. Batch or real-time streaming — same structured output either way.
Batch API
Send a complete audio file, receive a clip-level primary_verdict plus a per-window breakdown. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav — up to 100 MB.
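A client-side pre-flight check can catch unsupported files before an upload is attempted. The sketch below is an illustration only, not part of the Modulate API: the helper name `is_uploadable` is hypothetical, and it assumes "100 MB" means 100 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Formats listed in the Batch API description above.
SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024


def is_uploadable(filename: str, size_bytes: int) -> bool:
    """Return True if the extension and size fit the documented limits."""
    extension = Path(filename).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES
```

Running a check like this locally avoids spending an upload on a file that would be rejected anyway.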
Streaming API (WebSocket)
Connect over WebSocket and receive per-window vocal AI verdicts as audio arrives. Instrumental AI detection and the final clip-level verdict are delivered in the done message at end of stream.
Structured output
Every response includes primary_verdict, vocal_ai_percentage, vocal_ai_confidence, instrumental_ai_percentage, instrumental_ai_confidence, and a full per-window breakdown.
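To make the response shape concrete, here is a minimal sketch of a typed container for the fields named above. The field names come from this page; the Python types, the `parse_result` helper, and the `windows` key used for the per-window breakdown are assumptions, since the exact schema is not given here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MusicDetectionResult:
    primary_verdict: str
    vocal_ai_percentage: float
    vocal_ai_confidence: float
    instrumental_ai_percentage: float
    instrumental_ai_confidence: float
    # Hypothetical key: the name of the per-window field is not stated above.
    windows: list[dict[str, Any]] = field(default_factory=list)


def parse_result(payload: dict[str, Any]) -> MusicDetectionResult:
    """Build a typed result from an already-decoded JSON response."""
    return MusicDetectionResult(
        primary_verdict=payload["primary_verdict"],
        vocal_ai_percentage=float(payload["vocal_ai_percentage"]),
        vocal_ai_confidence=float(payload["vocal_ai_confidence"]),
        instrumental_ai_percentage=float(payload["instrumental_ai_percentage"]),
        instrumental_ai_confidence=float(payload["instrumental_ai_confidence"]),
        windows=list(payload.get("windows", [])),
    )
```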
Tunable thresholds
Adjust the precision/recall tradeoff without retraining. Set vocal and instrumental thresholds independently to match your use case.
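The effect of independent thresholds can be illustrated client-side. This page does not say how thresholds are passed to the API, so the sketch below only demonstrates the concept: the function name, the 0-to-1 score scale, and the example values are all assumptions.

```python
from typing import Sequence


def flag_windows(
    vocal_scores: Sequence[float],
    instrumental_scores: Sequence[float],
    vocal_threshold: float,
    instrumental_threshold: float,
) -> list[tuple[bool, bool]]:
    """Return (vocal_flagged, instrumental_flagged) for each window.

    Raising a threshold flags fewer windows (favoring precision);
    lowering it flags more (favoring recall).
    """
    return [
        (vocal >= vocal_threshold, instrumental >= instrumental_threshold)
        for vocal, instrumental in zip(vocal_scores, instrumental_scores)
    ]
```

For example, a platform that treats AI vocals as a policy violation might use a lower vocal threshold while keeping the instrumental threshold strict.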
AI Detection Capabilities
Two detection paths. One API.
Most AI music detectors treat a track as a monolith — score it once, return a number, move on. That approach breaks down on hybrid tracks, multi-part productions, and anything where AI was used for only part of the composition.
Vocal detection
Identifies AI-generated singing, rap, or any vocal performance within a 4-second window, including AI vocals over organic human music, a common pattern in fraudulent or policy-violating uploads. Each window is scored independently, returning vocal_ai_percentage and vocal_ai_confidence.
Instrumental detection
Identifies AI-generated instrumental content when a window lacks sufficient vocal content, including fully AI-generated tracks with no vocals at all. Results are delivered per-window and aggregated into the clip-level verdict.
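Per-window results make it possible to report where in a track AI content appears. The sketch below merges consecutive flagged windows into time ranges. It assumes windows are contiguous, non-overlapping, and exactly 4 seconds long; the function name and the list-of-booleans input are hypothetical.

```python
WINDOW_SECONDS = 4  # window length described above


def flagged_ranges(flags: list[bool]) -> list[tuple[int, int]]:
    """Merge runs of flagged windows into (start_second, end_second) ranges."""
    ranges: list[tuple[int, int]] = []
    run_start = None
    for index, flagged in enumerate(flags):
        if flagged and run_start is None:
            run_start = index  # a new run of flagged windows begins
        elif not flagged and run_start is not None:
            ranges.append((run_start * WINDOW_SECONDS, index * WINDOW_SECONDS))
            run_start = None
    if run_start is not None:  # close a run that reaches the end of the clip
        ranges.append((run_start * WINDOW_SECONDS, len(flags) * WINDOW_SECONDS))
    return ranges
```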
Modulate vs. industry standard
A side-by-side comparison for teams evaluating AI music detection.
Feature | Modulate | Industry Standard
Detection paths | Vocal + instrumental, independent | Single combined score per track
Output granularity | Per-4-second window + clip-level verdict | Clip-level only
Speech signals | Per-4s window for vocal AI; instrumental AI at end-of-stream | Single score at end of file
Hybrid content handling | Window-level visibility into mixed tracks | Not supported
Threshold tuning | Yes, no retraining needed | Fixed or not offered
Streaming support | Real-time WebSocket streaming | Batch only
Pricing | $0.07/hr | $0.15–0.45/hr (estimated)
Additional Velma models | Transcription, Deepfake Detection, PII Redaction, Emotion, Accent | Music detection only
Self-serve API | Yes, sign up and start in minutes | Varies; some require enterprise partnership negotiation
Cost | $0.07/hr | Not publicly disclosed or per-track pricing
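As a rough budgeting aid, the quoted $0.07/hr rate can be turned into a cost estimate. This is a sketch under stated assumptions: it assumes billing is pro rata by audio duration with no minimums or rounding rules, which this page does not confirm, and the function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_HOUR_USD = Decimal("0.07")  # rate quoted in the comparison above


def estimated_cost_usd(total_audio_seconds: int) -> Decimal:
    """Estimate cost for a given amount of audio, rounded to whole cents."""
    hours = Decimal(total_audio_seconds) / Decimal(3600)
    return (hours * PRICE_PER_HOUR_USD).quantize(Decimal("0.01"))
```

Under those assumptions, 10,000 tracks averaging 210 seconds each (about 583 hours of audio) would come to roughly $40.83.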
How It Works
Clip-level verdicts built from per-window evidence.
1. Send audio
Point Velma AI Music Detect at a file or a live stream. One endpoint, no pipeline to assemble. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav.
2. Get per-window verdicts
Every 4-second window is scored independently for vocal AI and instrumental AI content. Streaming returns vocal results in real time; instrumental results and the clip-level verdict arrive at end of stream.
3. Receive structured output
Get back a primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music, plus confidence scores and a full per-window breakdown ready to plug into your pipeline.
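Downstream code typically branches on the three primary_verdict values. The sketch below shows one way to do that; the verdict strings come from the step above, while the action names and the `action_for` helper are hypothetical placeholders for whatever a moderation pipeline actually does.

```python
# Hypothetical actions; only the verdict strings come from this page.
VERDICT_ACTIONS = {
    "ai-vocal-music": "label_ai_vocals",
    "ai-instrumental": "label_ai_instrumental",
    "not-ai-music": "allow",
}


def action_for(primary_verdict: str) -> str:
    """Map a primary_verdict to a pipeline action, rejecting unknown values."""
    try:
        return VERDICT_ACTIONS[primary_verdict]
    except KeyError:
        raise ValueError(f"unexpected primary_verdict: {primary_verdict!r}") from None
```

Failing loudly on an unknown verdict keeps a new or misspelled value from being silently treated as "allow".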
Streaming, start to finish:
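# 1 · open a connection   2 · stream audio   3 · read results
import json
from websockets.sync.client import connect  # assumes the third-party "websockets" package

with connect("wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming?api_key=…") as ws:
    ws.send(audio_chunk)                  # stream your audio (bytes); repeat per chunk
    ws.send("")                           # signal end of stream
    for message in ws:                    # window verdicts, then the final done message
        event = json.loads(message)       # assumes each message is JSON text
        if "primary_verdict" in event:    # present on the final clip-level result
            print(event["primary_verdict"])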
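When streaming from a file or buffer, the audio has to be split into chunks before it is sent. A minimal sketch of that step follows; the helper name and the 32 KiB default chunk size are assumptions, as this page does not specify a required chunk size.

```python
from typing import Iterator


def iter_chunks(audio: bytes, chunk_size: int = 32_768) -> Iterator[bytes]:
    """Yield successive slices of audio, each at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]
```

Each yielded chunk would be passed to ws.send(...) in turn, followed by the empty-string message that signals end of stream.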
Start building with Velma.
Grab an API key or try the playground to see Velma understand a real conversation.