
Baselines

Baselines capture a snapshot of your MCP server's expected behavior, enabling drift detection and regression testing.

What Is a Baseline?

A baseline is a JSON file containing:

  • Server capabilities - Tools, prompts, and resources
  • Tool schemas - Parameter types and requirements
  • Behavioral observations - How tools actually behave
  • Security findings - Any identified vulnerabilities (when check.security.enabled is true)
  • Performance metrics - P50/P95 latency, success rates, and confidence levels
  • Response fingerprints - Content types, sizes, and structure hashes
  • Error patterns - Categorized errors with root cause analysis
  • Schema evolution - Response schema stability tracking
  • Documentation quality - Score and grade for tool documentation

Creating a Baseline

# Initialize config (first time only)
bellwether init npx your-server

# Run check
bellwether check

# Then save baseline
bellwether baseline save

With the default config, this generates .bellwether/bellwether-baseline.json (relative baseline paths resolve under output.dir):

{
  "version": "2.1.1",
  "metadata": {
    "mode": "check",
    "generatedAt": "2026-01-25T10:30:00Z",
    "cliVersion": "2.1.1",
    "serverCommand": "npx @modelcontextprotocol/server-filesystem /tmp",
    "serverName": "@modelcontextprotocol/server-filesystem",
    "durationMs": 2341,
    "personas": [],
    "model": "none"
  },
  "server": {
    "name": "@modelcontextprotocol/server-filesystem",
    "version": "0.10.1",
    "protocolVersion": "2025-11-25",
    "capabilities": ["tools"]
  },
  "capabilities": {
    "tools": [
      {
        "name": "read_file",
        "description": "Read contents of a file",
        "inputSchema": {
          "type": "object",
          "properties": {
            "path": { "type": "string", "description": "Path to file" }
          },
          "required": ["path"]
        },
        "schemaHash": "def456...",
        "baselineP50Ms": 45,
        "baselineP95Ms": 120,
        "baselineSuccessRate": 0.98,
        "lastTestedAt": "2026-01-25T10:30:00Z",
        "inputSchemaHashAtTest": "def456...",
        "performanceConfidence": {
          "sampleCount": 15,
          "successfulSamples": 15,
          "validationSamples": 0,
          "totalTests": 15,
          "standardDeviation": 12.5,
          "coefficientOfVariation": 0.28,
          "confidenceLevel": "high"
        },
        "responseFingerprint": {
          "contentType": "text",
          "sizeCategory": "small",
          "structureHash": "ghi789..."
        }
      }
    ]
  },
  "interviews": [],
  "toolProfiles": [
    {
      "name": "read_file",
      "description": "Read contents of a file",
      "schemaHash": "def456...",
      "assertions": [
        {
          "type": "expects",
          "condition": "Returns UTF-8 text for text files",
          "tool": "read_file"
        }
      ],
      "securityNotes": ["Path traversal normalized within root"],
      "limitations": ["Maximum file size: 10MB"],
      "behavioralNotes": []
    }
  ],
  "assertions": [],
  "summary": "Filesystem server with 1 tool",
  "hash": "abc123...",
  "documentationScore": {
    "overallScore": 85,
    "grade": "B",
    "toolCount": 1,
    "issueCount": 2
  }
}
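The performanceConfidence block describes how stable a tool's latency samples are. The coefficient of variation is the standard deviation divided by the mean. In the example above, 12.5 / 0.28 implies a mean of roughly 45 ms, which is close to baselineP50Ms. The sketch below computes these statistics from a list of latency samples. It is not Bellwether's implementation. The helper name `latency_stats` is hypothetical, the sketch uses the population standard deviation (Bellwether may use the sample form), and it does not attempt to reproduce how Bellwether maps these numbers to a confidenceLevel.

```python
import statistics


def latency_stats(samples_ms: list[float]) -> dict:
    """Summarize latency samples (hypothetical helper, not Bellwether's API)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    mean = statistics.fmean(samples_ms)
    # Population standard deviation; Bellwether's exact formula is an assumption here.
    std_dev = statistics.pstdev(samples_ms)
    return {
        "sampleCount": len(samples_ms),
        "standardDeviation": std_dev,
        # Coefficient of variation: spread relative to the mean (unitless).
        "coefficientOfVariation": std_dev / mean if mean else 0.0,
    }


stats = latency_stats([40, 42, 45, 47, 50])
print(stats)
```

A lower coefficient of variation means the samples cluster tightly around the mean. That is why a value like 0.28 can still support a "high" confidence level when enough samples were collected.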

Custom Baseline Path

# Save to specific path
bellwether baseline save ./baselines/v1.json

# Compare against specific baseline
bellwether baseline compare ./baselines/v1.json

Or configure paths in bellwether.yaml:

baseline:
  comparePath: "./baselines/v1.json"
  savePath: "./baselines/current.json"
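The rule stated earlier, that relative baseline paths resolve under output.dir, can be sketched as a small path-joining function. This is an illustration only. The name `resolve_baseline_path` is hypothetical, the sketch assumes the default output.dir of `.bellwether`, and it assumes the rule applies uniformly to comparePath and savePath. Check the behavior of your CLI version before relying on it.

```python
from pathlib import PurePosixPath


def resolve_baseline_path(path: str, output_dir: str = ".bellwether") -> str:
    """Hypothetical helper: place relative baseline paths under output.dir."""
    p = PurePosixPath(path)
    if p.is_absolute():
        return str(p)  # absolute paths are used as-is
    # PurePosixPath drops the leading "./", so "./baselines/v1.json" joins cleanly.
    return str(PurePosixPath(output_dir) / p)


print(resolve_baseline_path("./baselines/v1.json"))
```

Under these assumptions, `./baselines/v1.json` ends up at `.bellwether/baselines/v1.json`.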

Baseline in CI/CD

Commit to Version Control

# Create baseline
bellwether check
bellwether baseline save

# Commit both config and baseline
git add bellwether.yaml .bellwether/bellwether-baseline.json
git commit -m "Update behavioral baseline"

Check in CI

Configure baseline path in bellwether.yaml:

baseline:
  comparePath: "./bellwether-baseline.json"
  failOnDrift: true

Then run the check in your CI workflow, for example as a GitHub Actions step:

# GitHub Actions
- name: Check Behavioral Drift
  run: npx @dotsetlabs/bellwether check --fail-on-drift

Accepting Intentional Changes

When you intentionally change your MCP server (adding features, modifying behavior), you need to update the baseline. Bellwether provides three ways to do this.

Option 1: Accept Command

The baseline accept command marks drift as intentional and records metadata for audit trails:

# Run check to detect drift
bellwether check

# Review the drift, then accept it with a reason
bellwether baseline accept --reason "Added new delete_file tool"

# For breaking changes, use --force
bellwether baseline accept --reason "Major API update" --force

# Commit
git add .bellwether/bellwether-baseline.json
git commit -m "Update baseline: added delete_file tool"

Accept Command Options

Option | Description
--- | ---
--reason <text> | Why the drift was accepted
--accepted-by <name> | Who accepted (for audit trail)
--dry-run | Preview what would be accepted
--force | Required for breaking changes

Option 2: Accept During Check

You can also accept drift as part of the check command:

# Check and accept in one command
bellwether check --accept-drift --accept-reason "Improved error handling"

# Commit
git add .bellwether/bellwether-baseline.json
git commit -m "Update baseline: improved error handling"

Option 3: Force Save

For simple cases, you can overwrite the baseline directly:

# Run check and review changes
bellwether check

# Overwrite baseline (no acceptance metadata)
bellwether baseline save --force

# Commit
git add .bellwether/bellwether-baseline.json
git commit -m "Update baseline: added delete_file tool"

Acceptance Metadata

When you use baseline accept, the baseline records full acceptance metadata, including acceptedBy (set with --accepted-by).
When you use check --accept-drift, Bellwether records only acceptedAt and reason.

{
  "acceptance": {
    "acceptedAt": "2026-01-21T10:30:00Z",
    "acceptedBy": "dev-team",
    "reason": "Added new delete_file tool",
    "acceptedDiff": {
      "toolsAdded": ["delete_file"],
      "toolsRemoved": [],
      "toolsModified": [],
      "severity": "info",
      "breakingCount": 0,
      "warningCount": 0,
      "infoCount": 1
    }
  }
}

Use baseline accept --accepted-by <name> if you need a full audit trail of who approved the change.

Baseline Format Versioning

Baselines use the CLI package version as the format version (e.g., 2.1.1):

Component | Description
--- | ---
Major | Breaking baseline format changes (recreate baseline)
Minor | Backward-compatible format additions
Patch | Bug fixes in baseline generation

Compatibility Rules

  • Same major version = Compatible (can compare baselines)
  • Different major version = Incompatible (recreate baseline)

When comparing baselines with incompatible versions, recreate the older baseline with the latest CLI.
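The compatibility rule only looks at the major component of each baseline's version field. Here is a minimal sketch, assuming plain MAJOR.MINOR.PATCH strings without pre-release suffixes. The function names are hypothetical and are not part of the Bellwether CLI.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into integers (no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def baselines_compatible(version_a: str, version_b: str) -> bool:
    """Two baselines can be compared only when their major versions match."""
    return parse_version(version_a)[0] == parse_version(version_b)[0]


print(baselines_compatible("2.1.1", "2.0.5"))
print(baselines_compatible("2.1.1", "1.9.0"))
```

A baseline written by 2.0.5 can still be compared against one written by 2.1.1. A baseline written by 1.9.0 must be recreated.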

What's Captured

Category | Content
--- | ---
Server Info | Name, version, protocol version, capabilities, instructions
Tools | Name, description, schema hash, title, annotations, output schema, execution/task support, security notes, limitations
Prompts | Prompt names, descriptions, titles, and argument metadata
Resources | Resource URIs, names, descriptions, titles, and MIME types
Performance | P50/P95 latency, success rate, confidence level per tool
Response Fingerprint | Content type, size category, structure hash
Error Patterns | Categorized errors with root cause and remediation
Schema Evolution | Response schema stability and field changes
Security | Vulnerability findings and risk scores (when check.security.enabled is true)
Documentation | Quality score, grade, and improvement suggestions
Assertions | Behavioral assertions
Workflows | Workflow signatures and results
Hash | SHA-256 hash for detecting file tampering
Metadata | Timestamp, mode, server command
Acceptance | Optional: when and why drift was accepted

Baseline Comparison

Comparisons are protocol-version-aware — version-specific fields (annotations, titles, output schemas, execution/task support, server instructions) are only compared when both baselines support the relevant MCP protocol version. This prevents false positives when upgrading servers across protocol versions.

When comparing baselines, Bellwether detects:

Change Type | Example
--- | ---
Added | New tool delete_file
Removed | Tool legacy_read no longer exists
Schema change | Parameter path now required
Behavior change | Error message format changed
Prompt change | Prompt argument added/removed or required flag changed
Resource change | Resource added/removed or MIME type changed
Server change | Server metadata or protocol version changed
Capability change | Server capabilities added/removed
Security change | New vulnerability detected
Performance regression | P50 latency increased by >10%
Confidence change | Metrics reliability improved/degraded
Response structure change | JSON schema fields added/removed
Error pattern change | New error types or resolved errors
Schema evolution | Response schema stability changes
Documentation degradation | Quality score decreased
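The first three rows (added, removed, and schema change) can be illustrated by comparing the capabilities.tools arrays of two baselines by name and schemaHash. This sketch covers only those rows and is not Bellwether's comparison logic. It reuses the field names from the example baseline and the acceptedDiff keys, and the function name `diff_tools` is hypothetical.

```python
def diff_tools(old: dict, new: dict) -> dict[str, list[str]]:
    """Compare the tool lists of two baseline dicts by name and schemaHash (sketch)."""
    old_tools = {t["name"]: t.get("schemaHash") for t in old["capabilities"]["tools"]}
    new_tools = {t["name"]: t.get("schemaHash") for t in new["capabilities"]["tools"]}
    return {
        "toolsAdded": sorted(new_tools.keys() - old_tools.keys()),
        "toolsRemoved": sorted(old_tools.keys() - new_tools.keys()),
        # Present in both baselines, but the input schema hash differs.
        "toolsModified": sorted(
            name for name in old_tools.keys() & new_tools.keys()
            if old_tools[name] != new_tools[name]
        ),
    }


old = {"capabilities": {"tools": [{"name": "read_file", "schemaHash": "a1"},
                                  {"name": "legacy_read", "schemaHash": "b2"}]}}
new = {"capabilities": {"tools": [{"name": "read_file", "schemaHash": "c3"},
                                  {"name": "delete_file", "schemaHash": "d4"}]}}
print(diff_tools(old, new))
```

Here delete_file is reported as added, legacy_read as removed, and read_file as modified because its schema hash changed.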

Performance Comparison

When baselines include performance metrics, Bellwether compares:

  • P50 latency - Median response time
  • P95 latency - 95th percentile response time
  • Success rate - Percentage of successful calls
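The three metrics can be sketched from a list of recorded calls. This example uses the nearest-rank percentile method and includes failed calls in the latency figures. Both choices are assumptions, because Bellwether may interpolate percentiles or count only successful samples. The helper names are hypothetical.

```python
import math


def percentile(sorted_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize_calls(calls: list[tuple[float, bool]]) -> dict:
    """calls: (latency_ms, succeeded) pairs from repeated invocations of one tool."""
    latencies = sorted(ms for ms, _ in calls)
    successes = sum(1 for _, ok in calls if ok)
    return {
        "p50Ms": percentile(latencies, 50),
        "p95Ms": percentile(latencies, 95),
        "successRate": successes / len(calls),
    }


summary = summarize_calls([(10, True), (20, True), (30, True), (40, False)])
print(summary)
```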

Configure the regression threshold in bellwether.yaml:

check:
  performanceThreshold: 10 # Flag if P50 latency increases by >10%
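The threshold is a percentage increase in P50 latency relative to the baseline. The sketch below shows the arithmetic behind performanceThreshold. It is not Bellwether's code, and `is_regression` is a hypothetical name. With the example baseline's baselineP50Ms of 45, a current P50 of 50 ms is an increase of about 11.1% and would be flagged at a threshold of 10. A current P50 of 49 ms is about 8.9% and would not.

```python
def is_regression(baseline_p50_ms: float, current_p50_ms: float,
                  threshold_pct: float = 10.0) -> bool:
    """Flag when current P50 exceeds the baseline P50 by more than threshold_pct percent."""
    if baseline_p50_ms <= 0:
        return False  # no meaningful baseline to compare against
    increase_pct = (current_p50_ms - baseline_p50_ms) / baseline_p50_ms * 100
    return increase_pct > threshold_pct


print(is_regression(45, 50))
print(is_regression(45, 49))
```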

Incremental Checking

Bellwether supports incremental checking to speed up CI runs. Only tools with changed schemas are re-tested. Configure in bellwether.yaml:

check:
  incremental: true
  incrementalCacheHours: 168 # 1 week cache validity

Each tool fingerprint includes:

  • lastTestedAt - When the tool was last tested
  • inputSchemaHashAtTest - Schema hash at test time

When a tool's schema changes, it's automatically re-tested. Unchanged tools reuse cached fingerprints while the cache is still valid (incrementalCacheHours).
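The decision for each tool can be sketched from the two fields above and the cache window. The sketch assumes a tool is re-tested when its current schema hash differs from inputSchemaHashAtTest, when it has never been tested, or when lastTestedAt is older than incrementalCacheHours. The expiry rule is an assumption based on the "cache validity" setting, and `needs_retest` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def needs_retest(tool: dict, current_schema_hash: str, now: datetime,
                 cache_hours: float = 168) -> bool:
    """Decide whether a tool's cached fingerprint can be reused (sketch)."""
    if tool.get("inputSchemaHashAtTest") != current_schema_hash:
        return True  # schema changed since the last test
    last_tested = tool.get("lastTestedAt")
    if last_tested is None:
        return True  # never tested, nothing to reuse
    # Replace the trailing "Z" so fromisoformat accepts it on older Python versions.
    tested_at = datetime.fromisoformat(last_tested.replace("Z", "+00:00"))
    return now - tested_at > timedelta(hours=cache_hours)


tool = {"inputSchemaHashAtTest": "def456", "lastTestedAt": "2026-01-25T10:30:00Z"}
print(needs_retest(tool, "def456", datetime(2026, 1, 26, tzinfo=timezone.utc)))
```

In this example the schema hash is unchanged and the last test is less than a day old, so the cached fingerprint would be reused.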

See Also