Copilot Studio Evaluations MCP connector for Power Platform
April 17, 2026
You built a Copilot Studio agent. It answers questions, calls connectors, routes to topics. But how do you know if the answers are actually good? Copilot Studio has a built-in evaluation framework that scores agents on abstention (did it answer?), relevance (was the answer on topic?), and completeness (was the answer thorough?). The problem is it’s manual—you open the UI, pick a test set, click run, and wait.
This connector makes agent evaluation programmatic. Five MCP tools and five REST operations that let a Copilot Studio agent or Power Automate flow manage test sets, trigger evaluation runs, and retrieve quality metrics through the Power Platform REST API. Schedule nightly quality checks, build dashboards from evaluation results, or let an agent evaluate itself and recommend improvements.
Full source: GitHub repository
Tools
MCP tools for Copilot Studio
| Tool | Description |
|---|---|
get_test_sets |
List all test sets for an agent |
get_test_set_details |
Get a specific test set with all test cases |
start_evaluation |
Launch an evaluation run with optional authentication |
get_run_details |
Retrieve quality metrics and test results |
list_test_runs |
Get all historical evaluation runs |
REST operations for Power Automate
| Operation | Method | Description |
|---|---|---|
| Get Agent Test Sets | GET | Retrieve test set inventory with metadata |
| Get Agent Test Set Details | GET | View individual test cases within a set |
| Start Agent Evaluation | GET | Trigger an async evaluation run and get a run ID |
| Get Agent Test Run Details | GET | Retrieve completed evaluation results |
| Get Agent Test Runs | GET | List all evaluation runs for an agent |
Quality metrics
Each completed evaluation returns three scores:
| Metric | What it measures |
|---|---|
| Abstention | Percentage of questions the agent declined to answer. Low is usually better, but some abstention on edge cases is healthy. |
| Relevance | How well the agent’s responses matched the expected topic. Higher is better. |
| Completeness | How thorough the agent’s responses were. Higher is better. |
Results also include aiResultReason—an AI-generated explanation of overall performance that summarizes what went well and where the agent struggled.
How it works
User: "Run the compliance test set against our customer service agent"
1. Orchestrator calls get_test_sets({
environmentId: "env-uuid",
botId: "bot-uuid"
})
→ Returns test sets: ["Compliance Testing", "FAQ Coverage",
"Edge Cases"]
2. Orchestrator calls start_evaluation({
environmentId: "env-uuid",
botId: "bot-uuid",
testSetId: "compliance-test-set-uuid"
})
→ { runId: "run-uuid-12345",
state: "Running",
totalTestCases: 25,
testCasesProcessed: 0 }
3. After 2-3 minutes, orchestrator calls get_run_details({
environmentId: "env-uuid",
botId: "bot-uuid",
testSetId: "compliance-test-set-uuid",
runId: "run-uuid-12345"
})
→ { state: "Completed",
metricsResults: [
{ name: "Abstention", score: 0.12 },
{ name: "Relevance", score: 0.94 },
{ name: "Completeness", score: 0.87 }
],
aiResultReason: "Agent handled 22 of 25 cases well.
Abstained on 3 policy exception questions.
Relevance was strong across all categories.
Completeness dropped on multi-step procedures." }
4. Agent responds: "Compliance evaluation complete.
Abstention: 12%, Relevance: 94%, Completeness: 87%.
The agent deferred on 3 policy exception questions—
consider adding knowledge articles on policy exceptions
to improve completeness on multi-step procedures."
Scheduling nightly evaluations
Build a Power Automate flow that evaluates agent quality on a schedule:
1. Trigger: Scheduled (daily at 2 AM)
2. Get Agent Test Sets
- Environment ID: [your-env-id]
- Bot ID: [your-agent-id]
3. For Each test set:
4. Start Agent Evaluation
- Test Set ID: current test set ID
5. Do Until state = "Completed" or "Failed"
- Get Agent Test Run Details
- Delay 30 seconds
6. Parse metrics from response
7. Send results summary to admin email or post to Teams
This catches quality regressions before users do—if a knowledge source update breaks answers in a specific category, you’ll know by morning.
Authenticated evaluations
Some agents use authenticated connections (for example, a Copilot Studio connection that accesses user-specific data). To evaluate these agents accurately, pass the mcsConnectionId parameter when starting an evaluation:
- Go to Power Automate
- Open the Connections page
- Select the Microsoft Copilot Studio connection
- Copy the
mcsConnectionIdfrom the URL
The evaluation then runs with that connection’s credentials, testing the agent as a real user would experience it.
Authentication
OAuth 2.0 with Microsoft Entra ID. Register an app with Power Platform API permissions and the .default scope.
Prerequisites
- App registration in Microsoft Entra ID with Power Platform API access
- Environment ID and Bot ID for the target agent
- Test sets created in Copilot Studio with Active state
- (Optional) MCS Connection ID for authenticated evaluations
Application Insights logging
The connector includes hardcoded Application Insights telemetry. Enable it in script.csx:
private const bool APP_INSIGHTS_ENABLED = true;
private const string APP_INSIGHTS_KEY = "your-instrumentation-key";
Logged events:
- Operation events — Request/response for each API call
- MCP tool invocations — Details of each Copilot Studio tool usage
- Evaluation lifecycle — Start, progress updates, completion
- Errors and exceptions — Diagnostics with stack traces
Useful queries
// All MCP tool calls
customEvents
| where name == "MCP_ToolCall"
| summarize Count = count() by tostring(customDimensions.ToolName)
// Evaluation errors
customExceptions
| where tostring(customDimensions.Connector) == "Copilot Studio Agent Evaluation"
| project timestamp, outerType, outerMessage, customDimensions.Operation
Limitations
- Evaluations are asynchronous—allow 2-5 minutes for completion depending on test set size
- Test sets must have Active state to run
- Maximum 200 test cases per set (Power Platform limit)
- Requires tenant admin or delegated Power Platform API access
Files
| File | Purpose |
|---|---|
apiDefinition.swagger.json |
OpenAPI 2.0 definition with MCP endpoint and 5 REST operations |
apiProperties.json |
OAuth 2.0 auth config and script operation bindings |
script.csx |
C# script handling MCP protocol routing and Application Insights telemetry |
readme.md |
Setup and usage documentation |