AI Music Detection API – Vocal & Instrumental

How It Works

Clip-level verdicts built from per-window evidence.

Send audio, get back a primary verdict, per-window scores, and confidence values. Batch or real-time streaming — same structured output either way.

Batch API

Send a complete audio file, receive a clip-level primary_verdict plus a per-window breakdown. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav — up to 100 MB.
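A client can check a file against the format and size limits above before uploading. The sketch below is a local pre-flight check only and does not call the API. The helper name `check_batch_file` is hypothetical, and it assumes that "100 MB" means 100 × 1024 × 1024 bytes, which the text does not specify.

```python
from pathlib import Path

# Extensions listed for the batch API.
SUPPORTED_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".wav"}

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024


def check_batch_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


print(check_batch_file("track.mp3", 5_000_000))        # no problems
print(check_batch_file("track.wma", 200 * 1024 ** 2))  # wrong format and too large
```

Returning a list of problems rather than raising on the first one lets an ingestion job report every issue with a file in a single pass.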

Streaming API (WebSocket)

Connect over WebSocket and receive per-window vocal AI verdicts as audio arrives. Instrumental AI detection and the final clip-level verdict are delivered in the done message at end of stream.

Structured output

Every response includes primary_verdict, vocal_ai_percentage, vocal_ai_confidence, instrumental_ai_percentage, instrumental_ai_confidence, and a full per-window breakdown.
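One way to consume these fields is to parse them into a typed record as soon as a response arrives. The sketch below uses the field names listed above; the `windows` key for the per-window breakdown, the numeric types, and the example payload are assumptions made for illustration, not documented API output.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ClipResult:
    primary_verdict: str
    vocal_ai_percentage: float
    vocal_ai_confidence: float
    instrumental_ai_percentage: float
    instrumental_ai_confidence: float
    # Hypothetical key name for the per-window breakdown.
    windows: list[dict[str, Any]] = field(default_factory=list)


def parse_clip_result(payload: dict[str, Any]) -> ClipResult:
    """Build a ClipResult; a missing clip-level field raises KeyError."""
    return ClipResult(
        primary_verdict=payload["primary_verdict"],
        vocal_ai_percentage=float(payload["vocal_ai_percentage"]),
        vocal_ai_confidence=float(payload["vocal_ai_confidence"]),
        instrumental_ai_percentage=float(payload["instrumental_ai_percentage"]),
        instrumental_ai_confidence=float(payload["instrumental_ai_confidence"]),
        windows=list(payload.get("windows", [])),
    )


# Made-up payload for illustration only; values are not real API output.
example = {
    "primary_verdict": "ai-vocal-music",
    "vocal_ai_percentage": 82.5,
    "vocal_ai_confidence": 0.91,
    "instrumental_ai_percentage": 10.0,
    "instrumental_ai_confidence": 0.77,
}
result = parse_clip_result(example)
print(result.primary_verdict)
```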

Tunable thresholds

Adjust the precision/recall tradeoff without retraining. Set vocal and instrumental thresholds independently to match your use case.
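The tradeoff itself can be illustrated offline. The sketch below applies different thresholds to a handful of made-up scores with made-up ground-truth labels and reports precision and recall at each one. It does not show the API's threshold parameters, whose names are not given here, and it assumes scores in the range 0 to 1.

```python
def precision_recall(scores, labels, threshold):
    """Treat score >= threshold as an "AI" flag and compare with ground truth."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))          # true positives
    fp = sum(f and not l for f, l in zip(flagged, labels))      # false positives
    fn = sum((not f) and l for f, l in zip(flagged, labels))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up scores and labels for illustration only.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but misses one true AI window. That is the choice a stricter or looser threshold makes.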

AI Music Detection Capabilities

Two detection paths. One API.

Most AI music detectors break down on hybrid tracks, multi-part productions, and anything where AI was used for only part of the composition. Modulate runs two independent models, one for vocals and one for instrumentals, and scores each 4-second window separately, so every verdict is grounded in per-window evidence.

Vocal detection

Modulate identifies AI-generated singing, rap, or any vocal performance within a 4-second window — including AI vocals over organic human music. Each window is scored independently, returning vocal_ai_percentage and vocal_ai_confidence.
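Because scoring happens per 4-second window, it helps to map a window index back to a position in the track. The sketch below assumes windows are consecutive, non-overlapping, and start at 0 seconds, and that a partial final window still counts as a window. The text confirms neither assumption.

```python
import math

WINDOW_SECONDS = 4  # window length stated in the text


def window_bounds(index: int) -> tuple[int, int]:
    """Return (start_s, end_s) for a zero-based window index."""
    start = index * WINDOW_SECONDS
    return (start, start + WINDOW_SECONDS)


def window_count(duration_seconds: float) -> int:
    """Windows needed to cover a clip, counting a partial last window."""
    return math.ceil(duration_seconds / WINDOW_SECONDS)


print(window_bounds(3))  # (12, 16)
print(window_count(30))  # 8
```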

Instrumental detection

Modulate identifies AI-generated instrumental content when a window does not contain enough vocal content to score, including fully AI-generated tracks with no vocals at all. Results are delivered per window and aggregated into the clip-level verdict.

Modulate vs. industry standard

A side-by-side comparison for teams evaluating AI music detection solutions.

| | Modulate | Industry Standard |
| --- | --- | --- |
| Detection paths | Vocal + instrumental, independent | Single combined score per track |
| Accuracy | Reliably detects only AI-generated music or instrumentals | Frequent false positives on digital alterations including autotune, compression, and other common techniques |
| Output granularity | Per-4-second window + clip-level verdict | Clip-level only |
| Speech signals | Per-4s window for vocal AI; instrumental AI at end-of-stream | Single score at end of file |
| Hybrid content handling | Window-level visibility into mixed tracks | Not supported |
| Threshold tuning | Yes, no retraining needed | Fixed or not offered |
| Streaming support | Real-time WebSocket streaming | Batch only |
| Additional Velma models | Transcription, Deepfake Detection, PII Redaction, Emotion, Accent | Music detection only |
| Self-serve API | Yes, sign up and start in minutes | Varies; some require enterprise partnership negotiation |
| Cost | $0.07/hr | Not publicly disclosed, or per-track pricing |

What We Solve

It's time to get ahead of the AI music challenges your team is already dealing with.

AI-generated music is evolving faster than manual review can handle. AI Music Detect by Modulate gives you a scalable, self-serve detection layer, so you can enforce policies, protect rights, and stay compliant.

Reduce false positives

Window-level scoring and tunable thresholds mean you’re not stuck with a fixed cutoff, so legitimate tracks don’t get flagged and real violations don’t slip through.

Protect royalty payouts

AI-generated tracks are being uploaded at scale to farm streaming royalties. Modulate flags them at ingestion before they dilute payouts for legitimate human artists.

Enforce platform policies at scale

Spotify, Apple Music, YouTube Music, and most major digital service providers (DSPs) require AI disclosure or restrict AI monetization. Manual review at ingestion volume isn't viable. Modulate gives you automated detection that enforces policy without adding headcount.

Know what you're licensing

Purely AI-generated works without meaningful human authorship aren't copyrightable under US Copyright Office guidance. Velma lets licensors, sync agencies, and clearance teams verify content before contracts are issued.

Screen content before it reaches platforms

DSPs are pushing distributors to screen uploads upstream. Velma gives distributors in the mold of DistroKid, TuneCore, and CD Baby a drop-in detection layer, so they can get ahead of platform requirements before those requirements become mandates.

Stay ahead of AI disclosure regulations

The EU AI Act requires disclosure of AI-generated content in covered contexts. Velma gives platforms operating in affected jurisdictions a reliable way to identify non-disclosing uploads before they create compliance exposure.

How It Works

Clip-level verdicts built from per-window evidence.

Send audio

Point Velma AI Music Detect at a file or a live stream. One endpoint — no pipeline to assemble. Supported formats: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav.

Get per-window verdicts

Every 4-second window is scored independently for vocal AI and instrumental AI content. Streaming returns vocal results in real time; instrumental results and the clip-level verdict arrive at end of stream.

Receive structured output

Get back a primary_verdict of ai-vocal-music, ai-instrumental, or not-ai-music — plus confidence scores and a full per-window breakdown ready to plug into your pipeline.
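Since primary_verdict takes one of three values, a pipeline can branch on it directly. The sketch below models the three values as an enum and maps each to an action. The action names and the `route` helper are hypothetical placeholders for your own workflow, not part of the API.

```python
from enum import Enum


class Verdict(str, Enum):
    AI_VOCAL_MUSIC = "ai-vocal-music"
    AI_INSTRUMENTAL = "ai-instrumental"
    NOT_AI_MUSIC = "not-ai-music"


# Hypothetical policy: replace these action names with your own workflow steps.
ACTIONS = {
    Verdict.AI_VOCAL_MUSIC: "hold-for-review",
    Verdict.AI_INSTRUMENTAL: "hold-for-review",
    Verdict.NOT_AI_MUSIC: "publish",
}


def route(primary_verdict: str) -> str:
    """Map a primary_verdict string to an action; unknown values raise ValueError."""
    return ACTIONS[Verdict(primary_verdict)]


print(route("not-ai-music"))  # publish
```

Converting the string to an enum first means an unexpected verdict fails loudly instead of silently falling through to a default action.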

Streaming, start to finish:

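# 1 · open a connection   2 · stream audio   3 · read results
import json
from websockets.sync.client import connect  # assumes the third-party "websockets" package

with connect("wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming?api_key=…") as ws:
    ws.send(audio_chunk)  # stream your audio (bytes); call once per chunk
    ws.send("")           # signal end of stream
    for message in ws:    # window verdicts, then the final message
        event = json.loads(message)      # assumes each message is a JSON text frame
        if "primary_verdict" in event:   # clip-level verdict arrives at end of stream
            print(event["primary_verdict"])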

Start building with Modulate.

Grab an API key or try the playground to see Velma AI Music Detect in action.


More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Velma API

Voice-native AI that detects emotion, intent, compliance, and coaching — from audio, not transcripts.


Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.


Transcription API

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.


PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.


Frequently Asked Questions

What is AI Music Detect by Modulate?

Modulate's API for identifying AI-generated music in audio. It analyzes both vocal and instrumental content across 4-second windows, returning per-segment scores and a clip-level verdict of ai-vocal-music, ai-instrumental, or not-ai-music.

What does the API actually return?

For each clip, the API returns a primary_verdict, clip-level vocal_ai_percentage and instrumental_ai_percentage with confidence scores, and a per-window breakdown showing where in the track AI content was detected.

How is this different from a single-score detector?

Most detectors return one probability score for an entire track. Modulate's API scores every 4-second segment independently and separates vocal AI detection from instrumental AI detection. This matters for hybrid tracks, where only part of the content is AI-generated.
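For a hybrid track, per-window results can be merged into time spans that show where AI content was detected. The sketch below takes a list of per-window boolean flags, which are made-up inputs here, and merges consecutive flagged windows. It assumes consecutive, non-overlapping 4-second windows starting at 0 seconds.

```python
WINDOW_SECONDS = 4  # window length stated in the text


def ai_spans(flags: list[bool]) -> list[tuple[int, int]]:
    """Merge consecutive flagged windows into (start_s, end_s) spans."""
    spans = []
    start = None  # index of the first window in the current flagged run
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            spans.append((start * WINDOW_SECONDS, i * WINDOW_SECONDS))
            start = None
    if start is not None:  # close a run that reaches the end of the clip
        spans.append((start * WINDOW_SECONDS, len(flags) * WINDOW_SECONDS))
    return spans


# Made-up flags: windows 1-2 and window 5 were flagged as AI.
flags = [False, True, True, False, False, True]
print(ai_spans(flags))  # [(4, 12), (20, 24)]
```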

What can Modulate's AI Music Detection API reliably identify?

The API reliably detects fully AI-generated songs (AI vocals + AI instrumentals), AI vocals over human/organic music, and AI-only instrumentals. Known current limitations include AI choral or background vocals, and AI backing tracks underneath a live human vocal performance.

Does the API support streaming?

Yes. In addition to the batch API, Velma AI Music Detect supports real-time WebSocket streaming, returning per-window vocal AI results as audio arrives and a final clip-level verdict at end of stream.

How much does it cost?

Velma AI Music Detect is priced at $0.07/hr of audio. For current pricing details, see the API Pricing page.
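As a rough worked example of the quoted rate, the sketch below estimates the cost of scanning a catalog. It assumes billing is strictly proportional to audio duration with no rounding or minimum charge, which the text does not state, and the catalog size is made up.

```python
PRICE_PER_HOUR_USD = 0.07  # rate quoted in the text


def estimate_cost(total_seconds: float) -> float:
    """Estimated cost in USD, assuming purely duration-proportional billing."""
    return total_seconds / 3600 * PRICE_PER_HOUR_USD


# Made-up catalog: 10,000 tracks averaging 3.5 minutes each.
catalog_seconds = 10_000 * 3.5 * 60
print(f"${estimate_cost(catalog_seconds):.2f}")  # $40.83
```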

What audio formats are supported?

Supported formats for batch: .aac, .flac, .m4a, .mp3, .mp4, .ogg, .opus, .wav. Maximum file size is 100 MB. For streaming, container formats need only an audio_format query parameter; raw PCM requires sample_rate and num_channels as well.
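The streaming parameters named above can be assembled into a connection URL as follows. The endpoint is the one used in the streaming example on this page. The helper name `build_streaming_url` and the example values ("mp3", "pcm", 16000, 1) are hypothetical, so check the API docs for the accepted values.

```python
from urllib.parse import urlencode

STREAMING_URL = "wss://platform.modulate.ai/api/velma-2-ai-music-detection-streaming"


def build_streaming_url(api_key, audio_format, sample_rate=None, num_channels=None):
    """Build the WebSocket URL; sample_rate and num_channels are only for raw PCM."""
    params = {"api_key": api_key, "audio_format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if num_channels is not None:
        params["num_channels"] = num_channels
    return f"{STREAMING_URL}?{urlencode(params)}"


# Container format: audio_format alone (the value "mp3" is an assumed example).
print(build_streaming_url("YOUR_API_KEY", "mp3"))

# Raw PCM: also pass sample_rate and num_channels (values are assumed examples).
print(build_streaming_url("YOUR_API_KEY", "pcm", sample_rate=16000, num_channels=1))
```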

Is Modulate ISO 27001 certified?

Yes. Modulate maintains ISO 27001 certification as part of its organization-wide security program.
