OpenDataLoader PDF MCP connector for Copilot Studio

April 27, 2026

PDFs are everywhere in enterprise workflows, and getting structured data out of them usually means either a paid API or a fragile parsing script. OpenDataLoader PDF is the top-ranked open-source PDF parser (0.907 overall accuracy), and this connector brings it directly into Copilot Studio and Power Automate.

The connector wraps OpenDataLoader PDF in a Flask API running on Azure Container Apps. By default, documents are processed entirely within the container: no external API calls, no data leaving the tenant.

Full source: GitHub repository

What this connector does

The connector exposes five capabilities as both MCP tools and REST operations:

Tool	Purpose
`convert_pdf`	Convert PDF to Markdown, JSON (with bounding boxes), HTML, or text
`extract_tables`	Extract tables with row/column structure and cell content
`get_page_elements`	Get elements with bounding boxes and semantic types for RAG citations
`check_accessibility`	Check PDF accessibility tags for EAA/ADA/Section 508 compliance
`get_server_info`	Get service version, capabilities, and configuration

MCP tools work in Copilot Studio agents. The matching REST operations work in Power Automate flows. Same connector, both modes.
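
For a rough picture of the MCP side: a tool invocation in MCP is a JSON-RPC 2.0 `tools/call` request that names the tool and passes its arguments. The sketch below builds such a request for `convert_pdf`. The argument names (`source`, `sourceType`) are an assumption borrowed from the REST request body, not a confirmed tool schema, and `build_tool_call` is a hypothetical helper rather than part of the connector.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC only needs each id to be unique per session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request envelope (JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumed argument names, mirroring the connector's REST request body.
request = build_tool_call(
    "convert_pdf",
    {"source": "https://example.com/document.pdf", "sourceType": "url"},
)
print(json.dumps(request, indent=2))
```

In Copilot Studio the agent produces this envelope for you; the sketch is only meant to show what travels over the wire.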

Why local processing matters

Every PDF stays within your Azure Container Apps instance. The processing pipeline runs locally inside the container using the OpenDataLoader PDF library and a Java 21 JRE. No document content is sent to external services unless you opt into hybrid mode for advanced features like OCR or formula extraction.

This matters for:

Sensitive financial documents and contracts
Documents under data residency requirements
Scenarios where you need deterministic, repeatable output

OpenDataLoader PDF capabilities

Feature	Mode	Cost
Text extraction with reading order	Local	Free (Apache 2.0)
Bounding boxes for every element	Local	Free
Table extraction (simple)	Local	Free
Table extraction (complex/borderless)	Hybrid	Free
Heading hierarchy detection	Local	Free
Image extraction with coordinates	Local	Free
OCR for scanned PDFs (80+ languages)	Hybrid	Free
Formula extraction (LaTeX)	Hybrid	Free
AI chart/image descriptions	Hybrid	Free
Prompt injection filtering	Local	Free

Architecture

┌────────────────────┐     ┌──────────────────────────────┐
│  Copilot Studio    │     │  Azure Container Apps        │
│  Agent             │     │                              │
│                    │ MCP │  ┌──────────────────────┐    │
│  ┌──────────────┐  │────>│  │  Flask API (Python)  │    │
│  │OpenDataLoader│  │     │  │  + OpenDataLoader PDF│    │
│  │ PDF MCP      │  │<────│  │  + Java 21 JRE       │    │
│  │ (connector)  │  │     │  └──────────────────────┘    │
│  └──────────────┘  │     │                              │
└────────────────────┘     └──────────────────────────────┘

The Flask API accepts PDFs via URL or base64-encoded content and returns structured output. All POST endpoints use the same input format:

{
  "source": "https://example.com/document.pdf",
  "sourceType": "url"
}

Or for base64:

{
  "source": "<base64-encoded-pdf>",
  "sourceType": "base64"
}
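
Both request shapes can come from one small helper. The following is a minimal Python sketch, assuming the caller already holds either a URL string or the raw PDF bytes. `build_pdf_request` is a hypothetical name used for illustration, not a function shipped with the connector.

```python
import base64
import json
from typing import Union


def build_pdf_request(source: Union[str, bytes]) -> dict:
    """Return the request body for a PDF given as a URL string or raw bytes."""
    if isinstance(source, bytes):
        # Raw bytes are inlined as base64 text so they fit in a JSON string.
        encoded = base64.b64encode(source).decode("ascii")
        return {"source": encoded, "sourceType": "base64"}
    # Any string is passed through unchanged and treated as a URL.
    return {"source": source, "sourceType": "url"}


url_body = build_pdf_request("https://example.com/document.pdf")
inline_body = build_pdf_request(b"%PDF-1.7 example bytes")

print(json.dumps(url_body))
print(inline_body["sourceType"])
```

Base64 makes the payload roughly a third larger than the file, so the URL form is usually the lighter choice when the document is already reachable over HTTPS.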

Quick deploy with pre-built image

The fastest path uses the pre-built image from GitHub Container Registry:

cd "OpenDataLoader PDF MCP/infra"
.\deploy.ps1 -ResourceGroup rg-opendataloader -UseGhcrImage

This provisions Azure Container Apps infrastructure and deploys ghcr.io/troystaylor/opendataloader-pdf-api:latest. The script outputs your service URL and API key.

Deploy from source

Build the container image in your own Azure Container Registry:

cd "OpenDataLoader PDF MCP/infra"
.\deploy.ps1 -ResourceGroup rg-opendataloader

Deploy script parameters

Parameter	Required	Default	Purpose
`ResourceGroup`	Yes	—	Azure resource group (created if needed)
`Location`	No	westus2	Azure region
`ApiKey`	No	auto-generated	API key for the service
`SkipInfra`	No	false	Skip Bicep deployment
`SkipBuild`	No	false	Skip container image build
`ImageTag`	No	latest	Container image tag
`UseGhcrImage`	No	false	Use pre-built GHCR image

Azure resources deployed

The Bicep template provisions:

Resource	SKU	Purpose
Container Registry	Basic	Stores container image
Log Analytics Workspace	PerGB2018	Container logs
Application Insights	Web	Telemetry and monitoring
Container Apps Environment	—	Hosting environment
Container App	1 CPU / 2 GiB	OpenDataLoader PDF API

The container app scales from 0 to 3 replicas and includes liveness and readiness probes on the /health endpoint.

Connector setup

Use the service URL and API key that the deploy script printed to set up the connector:

Update the host field in apiDefinition.swagger.json to your service FQDN
Deploy the connector:

pac connector create `
  --settings-file apiProperties.json `
  --api-definition apiDefinition.swagger.json `
  --script script.csx

Create a connection using your API key
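
The first step above, pointing the `host` field at your deployment, can be scripted if you redeploy often. This is a minimal sketch that works on an in-memory Swagger 2.0 document. The service URL and the sample definition are made-up values, and `set_swagger_host` is a hypothetical helper.

```python
import json
from urllib.parse import urlparse


def set_swagger_host(definition: dict, service_url: str) -> dict:
    """Return a copy of a Swagger 2.0 definition whose `host` is the FQDN of service_url."""
    # Accept a bare FQDN or a full URL; urlparse only fills netloc when a scheme is present.
    has_scheme = "://" in service_url
    parsed = urlparse(service_url if has_scheme else f"https://{service_url}")
    updated = dict(definition)  # shallow copy so the caller's dict is untouched
    updated["host"] = parsed.netloc
    return updated


# Hypothetical values for illustration only.
definition = {"swagger": "2.0", "host": "placeholder.example.com", "paths": {}}
updated = set_swagger_host(definition, "https://my-app.example.azurecontainerapps.io/")
print(json.dumps(updated, indent=2))
```

In practice you would load apiDefinition.swagger.json with `json.load`, pass the result through the helper, and write it back with `json.dump` before running `pac connector create`.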

Use cases

PDF to Markdown for RAG

Convert documents to clean Markdown for grounding AI responses:

“Convert this PDF to markdown so I can analyze its contents”

Table extraction

Extract structured tables from financial reports, invoices, or data sheets:

“Extract all tables from this quarterly report PDF”

Document analysis with citations

Get element-level data with bounding boxes for source citations:

“Analyze this research paper and show me where each finding is located”
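
To make the citation idea concrete, here is how a flow might turn one returned element into a citation string. The element shape used here (`type`, `page`, `bbox`, `text`) is assumed purely for illustration and the sample values are invented; check the actual `get_page_elements` response before relying on any field name.

```python
def format_citation(element: dict, max_chars: int = 40) -> str:
    """Render a short, human-readable citation for one page element."""
    x0, y0, x1, y1 = element["bbox"]
    excerpt = element["text"][:max_chars]
    return (
        f'"{excerpt}" '
        f'({element["type"]}, page {element["page"]}, '
        f"box {x0},{y0} to {x1},{y1})"
    )


# Hypothetical element; the keys and values are illustrative, not real output.
sample_element = {
    "type": "paragraph",
    "page": 3,
    "bbox": [72, 540, 523, 610],
    "text": "Revenue grew year over year.",
}

citation = format_citation(sample_element)
print(citation)
```

Keeping the page number and box together lets a downstream viewer highlight the exact region a finding came from.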

Accessibility compliance

Check if organizational PDFs meet accessibility standards:

“Check if this PDF has proper accessibility tags for EAA compliance”

Observability with Application Insights

The deploy script outputs the App Insights instrumentation key. Enable telemetry in the connector by editing script.csx:

private const string APP_INSIGHTS_KEY = "your-instrumentation-key-here";

CI/CD with GitHub Actions

A workflow at .github/workflows/opendataloader-pdf-build.yml builds and publishes the container image to ghcr.io/troystaylor/opendataloader-pdf-api.

Triggers:

Manual dispatch (workflow_dispatch) with optional version tag
Push to main when files in OpenDataLoader PDF MCP/container-app/ change

Tags applied: latest and the Git commit SHA.

Files in this project

File	Purpose
`apiDefinition.swagger.json`	Swagger with MCP + REST operations
`apiProperties.json`	Connector properties (API key auth)
`script.csx`	MCP protocol handler
`container-app/app.py`	Flask REST API wrapping opendataloader-pdf
`container-app/Dockerfile`	Python 3.12 + Java 21 JRE
`container-app/requirements.txt`	Python dependencies
`infra/main.bicep`	Azure infrastructure template
`infra/deploy.ps1`	Deployment script

When to use this connector

Use this connector when you need PDF processing inside Copilot Studio or Power Automate and want the data to stay within your Azure tenant. It fits well for:

RAG pipelines that need clean Markdown from source PDFs
Document intake workflows that extract tables for downstream processing
Compliance checks for accessibility standards across document libraries
Citation-aware analysis where bounding box data ties findings back to source locations
