Copilot Studio Evaluations MCP connector for Power Platform

April 17, 2026

You built a Copilot Studio agent. It answers questions, calls connectors, routes to topics. But how do you know if the answers are actually good? Copilot Studio has a built-in evaluation framework that scores agents on abstention (did it answer?), relevance (was the answer on topic?), and completeness (was the answer thorough?). The problem is it’s manual—you open the UI, pick a test set, click run, and wait.

This connector makes agent evaluation programmatic. It provides five MCP tools and five REST operations that let a Copilot Studio agent or Power Automate flow list test sets, trigger evaluation runs, and retrieve quality metrics through the Power Platform REST API. You can schedule nightly quality checks, build dashboards from evaluation results, or let an agent evaluate itself and recommend improvements.

Full source: GitHub repository

Tools

MCP tools for Copilot Studio

Tool	Description
`get_test_sets`	List all test sets for an agent
`get_test_set_details`	Get a specific test set with all test cases
`start_evaluation`	Launch an evaluation run with optional authentication
`get_run_details`	Retrieve quality metrics and test results
`list_test_runs`	Get all historical evaluation runs

REST operations for Power Automate

Operation	Method	Description
Get Agent Test Sets	GET	Retrieve test set inventory with metadata
Get Agent Test Set Details	GET	View individual test cases within a set
Start Agent Evaluation	POST	Trigger an async evaluation run and get a run ID
Get Agent Test Run Details	GET	Retrieve completed evaluation results
Get Agent Test Runs	GET	List all evaluation runs for an agent

Quality metrics

Each completed evaluation returns three scores:

Metric	What it measures
Abstention	Percentage of questions the agent declined to answer. Low is usually better, but some abstention on edge cases is healthy.
Relevance	How well the agent’s responses matched the expected topic. Higher is better.
Completeness	How thorough the agent’s responses were. Higher is better.

Results also include aiResultReason—an AI-generated explanation of overall performance that summarizes what went well and where the agent struggled.
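
As a rough illustration of how these scores might be consumed, the sketch below flattens a run-details payload into a name-to-score mapping and formats it as percentages. It assumes the payload has the metricsResults shape shown in the walkthrough in this article; the function names summarize_metrics and format_summary are made up for the example and are not part of the connector.

```python
from typing import Any


def summarize_metrics(run: dict[str, Any]) -> dict[str, float]:
    """Map each metric name to its score (a fraction between 0 and 1)."""
    return {m["name"]: float(m["score"]) for m in run.get("metricsResults", [])}


def format_summary(run: dict[str, Any]) -> str:
    """Render the scores as a one-line percentage summary."""
    scores = summarize_metrics(run)
    return ", ".join(f"{name}: {score:.0%}" for name, score in scores.items())


# Sample payload copied from the shape used in this article (assumed, not live data).
sample_run = {
    "state": "Completed",
    "metricsResults": [
        {"name": "Abstention", "score": 0.12},
        {"name": "Relevance", "score": 0.94},
        {"name": "Completeness", "score": 0.87},
    ],
}

print(format_summary(sample_run))
```

Keeping the scores as fractions until the last step makes it easier to compare them against thresholds before formatting them for people.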

How it works

User: "Run the compliance test set against our customer service agent"

1. Orchestrator calls get_test_sets({
     environmentId: "env-uuid",
     botId: "bot-uuid"
   })
   → Returns test sets: ["Compliance Testing", "FAQ Coverage",
                          "Edge Cases"]

2. Orchestrator calls start_evaluation({
     environmentId: "env-uuid",
     botId: "bot-uuid",
     testSetId: "compliance-test-set-uuid"
   })
   → { runId: "run-uuid-12345",
       state: "Running",
       totalTestCases: 25,
       testCasesProcessed: 0 }

3. After 2-3 minutes, orchestrator calls get_run_details({
     environmentId: "env-uuid",
     botId: "bot-uuid",
     testSetId: "compliance-test-set-uuid",
     runId: "run-uuid-12345"
   })
   → { state: "Completed",
       metricsResults: [
         { name: "Abstention", score: 0.12 },
         { name: "Relevance", score: 0.94 },
         { name: "Completeness", score: 0.87 }
       ],
       aiResultReason: "Agent handled 22 of 25 cases well.
         Abstained on 3 policy exception questions.
         Relevance was strong across all categories.
         Completeness dropped on multi-step procedures." }

4. Agent responds: "Compliance evaluation complete.
   Abstention: 12%, Relevance: 94%, Completeness: 87%.
   The agent deferred on 3 policy exception questions—
   consider adding knowledge articles on policy exceptions
   to improve completeness on multi-step procedures."
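
Step 3 above waits a fixed 2-3 minutes before fetching results. A polling loop is one alternative; the sketch below is a minimal version, assuming the run reports a state field that eventually becomes "Completed" or "Failed". The get_run_details argument is a hypothetical callable standing in for whatever actually calls the connector, so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_run(
    get_run_details: Callable[[], dict[str, Any]],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the run reaches a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        run = get_run_details()
        if run.get("state") in TERMINAL_STATES:
            return run
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {run.get('state')!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned responses instead of a real API call.
fake_responses = iter([
    {"state": "Running", "testCasesProcessed": 10},
    {"state": "Running", "testCasesProcessed": 20},
    {"state": "Completed", "testCasesProcessed": 25},
])
final = wait_for_run(lambda: next(fake_responses), poll_seconds=1, sleep=lambda s: None)
print(final["state"])
```

Injecting the sleep function keeps the loop testable, and the explicit timeout prevents a stuck run from blocking the caller indefinitely.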

Scheduling nightly evaluations

Build a Power Automate flow that evaluates agent quality on a schedule:

1. Trigger: Scheduled (daily at 2 AM)
2. Get Agent Test Sets
   - Environment ID: [your-env-id]
   - Bot ID: [your-agent-id]
3. For Each test set:
   4. Start Agent Evaluation
      - Test Set ID: current test set ID
   5. Do Until state = "Completed" or "Failed"
      - Get Agent Test Run Details
      - Delay 30 seconds
   6. Parse metrics from response
   7. Send results summary to admin email or post to Teams

This catches quality regressions before users do—if a knowledge source update breaks answers in a specific category, you’ll know by morning.
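
To turn the nightly results into an alert rather than a report, the flow needs some notion of "regressed". The sketch below shows one possible check, assuming the same metricsResults shape used earlier in this article. The threshold values and the find_regressions name are illustrative assumptions, not values recommended by the connector or by Copilot Studio.

```python
from typing import Any

# Hypothetical thresholds; tune them to your own agent's baseline.
MIN_SCORES = {"Relevance": 0.90, "Completeness": 0.80}
MAX_ABSTENTION = 0.20


def find_regressions(run: dict[str, Any]) -> list[str]:
    """Return a human-readable line for every metric outside its threshold."""
    scores = {m["name"]: m["score"] for m in run.get("metricsResults", [])}
    problems: list[str] = []
    for name, minimum in MIN_SCORES.items():
        score = scores.get(name)
        if score is None:
            problems.append(f"{name} is missing from the results")
        elif score < minimum:
            problems.append(f"{name} {score:.0%} is below {minimum:.0%}")
    abstention = scores.get("Abstention")
    if abstention is not None and abstention > MAX_ABSTENTION:
        problems.append(f"Abstention {abstention:.0%} is above {MAX_ABSTENTION:.0%}")
    return problems


nightly_run = {
    "metricsResults": [
        {"name": "Abstention", "score": 0.12},
        {"name": "Relevance", "score": 0.94},
        {"name": "Completeness", "score": 0.70},
    ]
}
for line in find_regressions(nightly_run):
    print(line)
```

An empty list means the run passed, so the flow can send a message only when the list is non-empty.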

Authenticated evaluations

Some agents use authenticated connections (for example, a Copilot Studio connection that accesses user-specific data). To evaluate these agents accurately, pass the mcsConnectionId parameter when starting an evaluation:

1. Go to Power Automate
2. Open the Connections page
3. Select the Microsoft Copilot Studio connection
4. Copy the mcsConnectionId from the URL

The evaluation then runs with that connection’s credentials, testing the agent as a real user would experience it.
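
Because mcsConnectionId is optional, a caller would typically include it only when a value is available. The sketch below builds the argument object for start_evaluation under that assumption. The parameter names match those used in this article; the build_start_args helper itself is hypothetical.

```python
from typing import Optional


def build_start_args(
    environment_id: str,
    bot_id: str,
    test_set_id: str,
    mcs_connection_id: Optional[str] = None,
) -> dict[str, str]:
    """Assemble start_evaluation arguments, adding the connection ID only if given."""
    args = {
        "environmentId": environment_id,
        "botId": bot_id,
        "testSetId": test_set_id,
    }
    if mcs_connection_id:
        # Only authenticated evaluations need this key.
        args["mcsConnectionId"] = mcs_connection_id
    return args


anonymous = build_start_args("env-uuid", "bot-uuid", "compliance-test-set-uuid")
authenticated = build_start_args(
    "env-uuid", "bot-uuid", "compliance-test-set-uuid", mcs_connection_id="connection-uuid"
)
print(sorted(authenticated))
```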

Authentication

OAuth 2.0 with Microsoft Entra ID. Register an app with Power Platform API permissions and the .default scope.
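
In a custom connector the platform runs the sign-in flow for you, so the following is only a sketch of what a .default scope request looks like on the wire. It builds a Microsoft identity platform authorization URL with the standard library. The resource URI https://api.powerplatform.com is an assumption about the Power Platform API audience and should be checked against your app registration; the tenant, client ID, and redirect URI are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed resource URI for the Power Platform API; verify it for your tenant.
POWER_PLATFORM_RESOURCE = "https://api.powerplatform.com"


def build_authorize_url(tenant_id: str, client_id: str, redirect_uri: str) -> str:
    """Build an authorization-code request URL that asks for the .default scope."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": f"{POWER_PLATFORM_RESOURCE}/.default",
    }
    base = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize"
    return f"{base}?{urlencode(params)}"


url = build_authorize_url("your-tenant-id", "your-client-id", "https://example.com/redirect")
# Parse the query string back to confirm what was requested.
requested_scope = parse_qs(urlparse(url).query)["scope"][0]
print(requested_scope)
```

The .default scope asks for whatever permissions were already granted to the app registration, which is why the prerequisite is to configure Power Platform API permissions there first.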

Prerequisites

App registration in Microsoft Entra ID with Power Platform API access
Environment ID and Bot ID for the target agent
Test sets created in Copilot Studio with Active state
(Optional) MCS Connection ID for authenticated evaluations

Application Insights logging

The connector includes Application Insights telemetry, configured through constants in script.csx. To enable it, set:

private const bool APP_INSIGHTS_ENABLED = true;
private const string APP_INSIGHTS_KEY = "your-instrumentation-key";

Logged events:

Operation events — Request/response for each API call
MCP tool invocations — Details of each Copilot Studio tool usage
Evaluation lifecycle — Start, progress updates, completion
Errors and exceptions — Diagnostics with stack traces

Useful queries

// All MCP tool calls
customEvents
| where name == "MCP_ToolCall"
| summarize Count = count() by tostring(customDimensions.ToolName)

// Evaluation errors
exceptions
| where tostring(customDimensions.Connector) == "Copilot Studio Agent Evaluation"
| project timestamp, outerType, outerMessage, customDimensions.Operation

Limitations

Evaluations are asynchronous—allow 2-5 minutes for completion depending on test set size
Test sets must have Active state to run
Maximum 200 test cases per set (Power Platform limit)
Requires tenant admin or delegated Power Platform API access
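
Two of these limits can be checked before a run is started. The sketch below filters a list of test sets down to those that are Active and within the 200-case limit. The field names state, displayName, and totalTestCases on a test set object are assumptions for illustration; check them against what get_test_sets actually returns.

```python
from typing import Any

MAX_TEST_CASES = 200  # limit stated in this article


def split_runnable(
    test_sets: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[tuple[str, str]]]:
    """Return (runnable test sets, [(name, reason)] for the ones skipped)."""
    runnable: list[dict[str, Any]] = []
    skipped: list[tuple[str, str]] = []
    for test_set in test_sets:
        name = test_set.get("displayName", "<unnamed>")
        if test_set.get("state") != "Active":
            skipped.append((name, "not in Active state"))
        elif test_set.get("totalTestCases", 0) > MAX_TEST_CASES:
            skipped.append((name, f"more than {MAX_TEST_CASES} test cases"))
        else:
            runnable.append(test_set)
    return runnable, skipped


# Hypothetical inventory used only to exercise the filter.
inventory = [
    {"displayName": "Compliance Testing", "state": "Active", "totalTestCases": 25},
    {"displayName": "FAQ Coverage", "state": "Draft", "totalTestCases": 40},
    {"displayName": "Edge Cases", "state": "Active", "totalTestCases": 250},
]
runnable, skipped = split_runnable(inventory)
print([t["displayName"] for t in runnable], skipped)
```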

Files

File	Purpose
`apiDefinition.swagger.json`	OpenAPI 2.0 definition with MCP endpoint and 5 REST operations
`apiProperties.json`	OAuth 2.0 auth config and script operation bindings
`script.csx`	C# script handling MCP protocol routing and Application Insights telemetry
`readme.md`	Setup and usage documentation
