Microsoft Foundry MAI Speech MCP connector for Power Platform
April 06, 2026
MAI-Transcribe-1 launched April 2, 2026, as part of Microsoft's three-model MAI announcement alongside MAI-Voice-1 and MAI-Image-2. It delivers state-of-the-art speech-to-text accuracy (3.9% average word error rate), batch transcription 2.5x faster than Azure Fast, and first place on the FLEURS benchmark in 11 core languages among the top 25 most-used languages.
Built for messy, real-world audio environments—not clean studio recordings. Meeting rooms with background chatter, phone calls with connection noise, field recordings with wind and traffic.
This connector brings MAI-Transcribe-1 and Whisper into Power Platform with two MCP tools for Copilot Studio and three REST operations for Power Automate and Power Apps. A borrowed chat completion operation lets agents summarize, analyze, or discuss transcription results within a conversation.
Full source: GitHub repository
The models
| Model | Accuracy | Languages | Pricing |
|---|---|---|---|
| MAI-Transcribe-1 | 3.9% avg WER, #1 on FLEURS in 11 core languages | 25+ | $0.36 per hour of audio |
| Whisper | Varies by deployment | 50+ | Varies by deployment |
Both models support the same operations—transcription and translation—through separate REST endpoints that take a deployment ID parameter. Select the model at deployment time.
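Because only the deployment ID in the path changes, switching models is a one-parameter change. A minimal sketch of how the audio endpoint URLs are composed — the resource name is a placeholder, and the `api-version` value is an assumption (check the versions your resource supports):

```python
# Compose the transcription/translation endpoint URL for a deployment.
RESOURCE = "my-foundry-resource"  # placeholder: your Foundry resource name
API_VERSION = "2024-06-01"        # assumed; verify against your resource

def audio_endpoint(deployment_id: str, operation: str) -> str:
    """operation is 'transcriptions' or 'translations'."""
    return (
        f"https://{RESOURCE}.openai.azure.com"
        f"/openai/deployments/{deployment_id}/audio/{operation}"
        f"?api-version={API_VERSION}"
    )
```

Calling `audio_endpoint("MAI-Transcribe-1", "transcriptions")` versus `audio_endpoint("whisper", "transcriptions")` is the entire difference between the two models at request time.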
Tools
MCP tools for Copilot Studio
| Tool | Description |
|---|---|
| `transcribe_audio` | Transcribe audio from a URL to text, with optional deployment ID, language, and response format |
| `chat_completion` | Send a chat message to summarize or discuss transcriptions (borrowed from parent Foundry connector) |
How it works
```
User: "Transcribe the recording from today's standup meeting"
      [provides audio URL]

1. Orchestrator calls transcribe_audio({
     audio_url: "https://storage.blob.core.windows.net/.../standup.mp3",
     language: "en",
     response_format: "verbose_json"
   })

   → Returns:
     text: "Good morning everyone. Let's start with updates..."
     language: "en"
     duration: 847.5
     segments: [{ start: 0.0, end: 3.2, text: "Good morning everyone." }, ...]

User: "Summarize that meeting and pull out the action items"

2. Orchestrator calls chat_completion({
     prompt: "Summarize this transcript and list action items:
              Good morning everyone. Let's start with updates...",
     system_prompt: "You are a meeting notes assistant"
   })

   → Returns structured meeting summary with action items
```
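On the wire, MCP tool invocations are JSON-RPC 2.0 `tools/call` requests. A hedged sketch of what the orchestrator's first call above might look like as a payload — argument names follow the tool description, and the URL is a placeholder:

```python
import json

# Illustrative JSON-RPC 2.0 request for the transcribe_audio MCP tool.
call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "transcribe_audio",
        "arguments": {
            "audio_url": "https://example.com/standup.mp3",  # placeholder URL
            "language": "en",
            "response_format": "verbose_json",
        },
    },
}
payload = json.dumps(call)
```

Copilot Studio's orchestrator builds and sends this for you; the shape matters only if you are debugging the MCP traffic in `script.csx`.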
REST operations for Power Automate and Power Apps
| Operation | Operation ID | Method | Path |
|---|---|---|---|
| Transcribe Audio | `TranscribeAudio` | POST | `/openai/deployments/{deployment_id}/audio/transcriptions` |
| Translate Audio to English | `TranslateAudio` | POST | `/openai/deployments/{deployment_id}/audio/translations` |
| Chat Completion | `ChatCompletion` | POST | `/models/chat/completions` |
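The two audio operations take the file and parameters as a multipart/form-data upload. A stdlib-only sketch of how such a body can be assembled — the actual HTTP send, `api-key` header, and query string are omitted, and a real client would usually lean on an HTTP library instead:

```python
import io
import uuid

def build_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Assemble a multipart/form-data body: one part per form field,
    plus a final binary part for the audio file."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        (
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
    )
    buf.write(file_bytes + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    {"language": "en", "response_format": "verbose_json"},
    "standup.mp3",
    b"...audio bytes...",  # placeholder; read real bytes from your file
)
```

This mirrors what the connector's `script.csx` does when it downloads the audio and forwards it as multipart form data.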
Parameter reference
| Operation | Parameter | Type | Default | Required |
|---|---|---|---|---|
| Transcribe Audio | `deployment_id` | string | — | Yes |
| Transcribe Audio | `file` | file | — | Yes |
| Transcribe Audio | `language` | string | — | No |
| Transcribe Audio | `prompt` | string | — | No |
| Transcribe Audio | `response_format` | enum | — | No |
| Transcribe Audio | `temperature` | float | 0 | No |
| Translate Audio | `deployment_id` | string | — | Yes |
| Translate Audio | `file` | file | — | Yes |
| Translate Audio | `prompt` | string | — | No |
| Translate Audio | `response_format` | enum | — | No |
| Translate Audio | `temperature` | float | 0 | No |
| Chat Completion | `messages` | array | — | Yes |
| Chat Completion | `temperature` | float | 0.7 | No |
| Chat Completion | `max_tokens` | int | 4096 | No |
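For Chat Completion, the `messages` array follows the familiar role/content shape. An illustrative request body using the defaults from the table above (the message text is a placeholder):

```python
# Example Chat Completion request body; temperature and max_tokens
# shown at their documented defaults.
chat_body = {
    "messages": [
        {"role": "system", "content": "You are a meeting notes assistant"},
        {"role": "user", "content": "Summarize this transcript and list action items: ..."},
    ],
    "temperature": 0.7,
    "max_tokens": 4096,
}
```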
Response format options
| Format | Output | Use case |
|---|---|---|
| `json` | Simple JSON with text | Quick text extraction |
| `text` | Plain text only | Downstream text processing |
| `verbose_json` | JSON with timestamps and segments | Meeting minutes, indexed search |
| `srt` | SubRip subtitle format | Video subtitles |
| `vtt` | WebVTT subtitle format | Web video captions |
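If you already have a `verbose_json` result, you can also derive subtitles yourself rather than re-transcribing with `srt`. A sketch that converts timestamped segments (the shape shown in the walkthrough above) into SubRip text:

```python
def to_srt(segments) -> str:
    """Convert verbose_json-style segments to SubRip (SRT) text."""
    def stamp(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))                                      # cue number
        lines.append(f"{stamp(seg['start'])} --> {stamp(seg['end'])}")  # cue timing
        lines.append(seg["text"].strip())                         # cue text
        lines.append("")                                          # blank separator
    return "\n".join(lines)

srt = to_srt([{"start": 0.0, "end": 3.2, "text": "Good morning everyone."}])
```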
Transcription response
| Field | Description |
|---|---|
| `text` | The transcribed or translated text |
| `language` | Detected or specified language |
| `duration` | Audio duration in seconds |
| `segments` | Timestamped segments (`verbose_json` only) |
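These fields can be consumed directly in a flow or script. A small example parsing a response of this shape — the sample values are illustrative, not real model output:

```python
import json

# A stand-in for the JSON body returned by TranscribeAudio (verbose_json).
sample = json.dumps({
    "text": "Good morning everyone. Let's start with updates...",
    "language": "en",
    "duration": 847.5,
    "segments": [{"start": 0.0, "end": 3.2, "text": "Good morning everyone."}],
})

resp = json.loads(sample)
minutes = resp["duration"] / 60  # duration arrives in seconds
```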
Use cases
Meeting transcription and minutes: Transcribe a meeting recording, then use chat_completion to extract action items, decisions, and key discussion points. The verbose_json format provides timestamps for linking notes back to specific moments.
Call recording analysis and compliance: Process customer service or sales calls through TranscribeAudio, then analyze for compliance keywords, sentiment, or escalation triggers using the chat completion tool.
Multilingual content transcription: MAI-Transcribe-1 supports 25+ languages with auto-detection. Transcribe content in the original language or use TranslateAudio to translate any supported language into English text.
Subtitle generation: Use the srt or vtt response format to generate ready-to-use subtitle files for video content. Drop the output directly into your video publishing workflow.
Voice memo processing: Build a Power Automate flow that watches a SharePoint folder for audio uploads, transcribes each file, and stores the text alongside the original recording.
Podcast and media transcription: Process long-form audio content for searchable archives, show notes, or accessibility compliance.
Prerequisites
- An Azure subscription with access to Microsoft Foundry
- Deploy a speech-to-text model from the Foundry Model Catalog:
- MAI-Transcribe-1 — for state-of-the-art transcription accuracy
- Whisper — for broad language support
- Note the Resource Name (for example, `my-foundry-resource` from `https://my-foundry-resource.openai.azure.com`) and the API Key from the deployment
Setting up the connector
1. Deploy a speech model in Microsoft Foundry
- Go to the Foundry Model Catalog
- Search for MAI-Transcribe-1 or Whisper and deploy the model
- Copy the Resource Name and API Key from the deployment page
2. Create the custom connector
- Go to Power Platform Maker Portal
- Navigate to Custom connectors > + New custom connector > Import an OpenAPI file
   - Upload `apiDefinition.swagger.json`
   - On the Security tab:
     - Authentication type: API Key
     - Parameter label: API Key
     - Parameter name: `api-key`
     - Parameter location: Header
   - On the Code tab:
     - Enable Code
     - Upload `script.csx`
- Select Create connector
3. Create a connection
- Select Test > + New connection
- Enter your Resource Name and API Key
- Select Create connection
4. Test the connector
Test TranscribeAudio by uploading a short audio file (mp3, wav, or m4a) and setting the deployment ID to your model name (for example, MAI-Transcribe-1).
5. Add to Copilot Studio
- In Copilot Studio, open your agent
- Add this connector as an action—Copilot Studio detects the MCP endpoint via `x-ms-agentic-protocol`
- Test with prompts like “Transcribe this audio file” or “What was discussed in this recording?”
The MCP transcribe_audio tool downloads audio from a URL, so provide publicly accessible links or SAS URLs for files in Azure Storage.
Known limitations
- Audio files must be 25 MB or smaller
- Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- For files larger than 25 MB, use the Azure Speech batch transcription API instead
- The MCP `transcribe_audio` tool downloads audio from the provided URL—files must be publicly accessible or use a SAS URL
- The `chat_completion` tool is borrowed from the parent Microsoft Foundry connector and requires a chat-capable model deployed to the same resource
- The connector uses two different host patterns: `.openai.azure.com` for audio operations and `.services.ai.azure.com` for chat and MCP
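The dual-host limitation above is easy to trip over when calling the API directly instead of through the connector. A sketch of the routing rule, using a placeholder resource name:

```python
RESOURCE = "my-foundry-resource"  # placeholder: your Foundry resource name

def host_for(operation_id: str) -> str:
    """Pick the host suffix by operation, mirroring the connector's
    dual dynamic-host policy: audio on .openai.azure.com, chat/MCP
    on .services.ai.azure.com."""
    audio_ops = {"TranscribeAudio", "TranslateAudio"}
    suffix = ".openai.azure.com" if operation_id in audio_ops else ".services.ai.azure.com"
    return f"https://{RESOURCE}{suffix}"
```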
Files
| File | Purpose |
|---|---|
| `apiDefinition.swagger.json` | OpenAPI 2.0 definition with MCP endpoint and 3 REST operations |
| `apiProperties.json` | API Key auth config, dual dynamic host URL policies, and script operation bindings |
| `script.csx` | C# script handling MCP protocol, audio file download and multipart forwarding, chat completion, and Application Insights telemetry |
| `readme.md` | Setup and usage documentation |