Tutorial: Test and Debug
Write deterministic tests for your agents using MockLLM and TestHarness. No API keys, no flaky tests.
Real LLMs are non-deterministic and cost money. You don't want tests that break because the model phrased something differently, or that charge you every time CI runs. AgentForge's testing tools solve both problems.
This tutorial shows you how to write tests that verify your agent calls the right tools, in the right order, with the right inputs — without hitting any LLM API.
What you'll learn
- Script exact LLM responses with `createMockLLM()`
- Run agents in a test harness with `createTestHarness()`
- Assert on tool calls, step counts, and output
- Use the step debugger to inspect execution
Prerequisites
Same project from the previous tutorials. Add vitest:
```sh
npm install -D vitest
```

Add a test script to `package.json`:

```json
{
  "scripts": {
    "test": "vitest run",
    "test:watch": "vitest"
  }
}
```

Step 1: Understand MockLLM
createMockLLM() takes a list of scripted responses. Each time the agent calls the LLM, it gets the next response in the list. You control exactly what the LLM "says".
Two response types:
- `{ text: "..." }` — the LLM returns a text response (task complete)
- `{ toolCalls: [{ name: "...", input: {...} }] }` — the LLM asks to call a tool
Here's how a typical agent run maps to mock responses:
```
Agent calls LLM (step 1) → mock returns toolCalls → agent runs the tool
Agent calls LLM (step 2) → mock returns text      → agent completes
```

That's two responses in the mock. The agent loop drives the sequence.
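If you want to picture what "scripted responses" means mechanically: each LLM call consumes the next response in order, and running past the end of the list is an error. Here is a minimal sketch of that idea — the `MockResponse` shape mirrors the two response types above, but `makeScriptedLLM` is a hypothetical illustration, not AgentForge's actual implementation:

```ts
// The two scripted response shapes, mirroring the types above.
type MockResponse =
  | { text: string }
  | { toolCalls: { name: string; input: Record<string, unknown> }[] };

// A scripted "LLM": every call returns the next response in order.
// Hypothetical sketch — not AgentForge's real internals.
function makeScriptedLLM(responses: MockResponse[]) {
  let index = 0;
  return {
    complete(): MockResponse {
      if (index >= responses.length) {
        throw new Error(`Mock exhausted after ${responses.length} responses`);
      }
      return responses[index++];
    },
  };
}

// Two scripted responses cover the two LLM calls in the sequence above.
const llm = makeScriptedLLM([
  { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
  { text: 'Found it.' },
]);
```

A third call to `llm.complete()` here would throw — which is exactly the failure you hit when your mock scripts fewer responses than the agent makes LLM calls.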
Step 2: Write your first test
Create tests/agent.test.ts:
```ts
import { describe, it, expect } from 'vitest';
import { z } from 'zod';
import { defineTool, createMockLLM, createTestHarness } from '@ahzan-agentforge/core';

// --- Define tools (same as your real agent) ---

const searchProducts = defineTool({
  name: 'search_products',
  description: 'Search products by keyword',
  input: z.object({ query: z.string() }),
  output: z.object({
    results: z.array(z.object({
      id: z.string(),
      name: z.string(),
      price: z.number(),
      stock: z.number(),
    })),
    total: z.number(),
  }),
  execute: async ({ query }) => {
    // Real tool logic — this actually runs during tests
    if (query.toLowerCase().includes('keyboard')) {
      return {
        results: [{ id: 'KB-001', name: 'Wireless Keyboard', price: 49.99, stock: 23 }],
        total: 1,
      };
    }
    return { results: [], total: 0 };
  },
});

const checkOrder = defineTool({
  name: 'check_order',
  description: 'Check order status',
  input: z.object({ orderId: z.string() }),
  output: z.object({
    orderId: z.string(),
    status: z.string(),
    items: z.array(z.string()),
    total: z.number(),
  }),
  execute: async ({ orderId }) => ({
    orderId,
    status: 'delivered',
    items: ['Wireless Keyboard'],
    total: 49.99,
  }),
});

// --- Tests ---

describe('Product Assistant', () => {
  it('searches products when asked about inventory', async () => {
    const mockLLM = createMockLLM({
      responses: [
        // Step 1: LLM decides to search
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        // Step 2: LLM reads the result and responds
        { text: 'We have the Wireless Keyboard (KB-001) for $49.99. 23 in stock.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({ task: 'Do you have keyboards?' });

    // Basic assertions
    expect(result.status).toBe('completed');
    expect(result.output).toContain('Wireless Keyboard');

    // Tool call assertions
    const toolCalls = result.toolCalls();
    expect(toolCalls).toHaveLength(1);
    expect(toolCalls[0].name).toBe('search_products');
    expect(toolCalls[0].input).toEqual({ query: 'keyboard' });

    // The tool actually ran — check its output
    expect(toolCalls[0].output).toEqual({
      results: [{ id: 'KB-001', name: 'Wireless Keyboard', price: 49.99, stock: 23 }],
      total: 1,
    });
  });

  it('checks order status when given an order ID', async () => {
    const mockLLM = createMockLLM({
      responses: [
        { toolCalls: [{ name: 'check_order', input: { orderId: 'ORD-100' } }] },
        { text: 'Order ORD-100 has been delivered. It contained a Wireless Keyboard.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({ task: "What's the status of ORD-100?" });

    expect(result.status).toBe('completed');

    const orderCalls = result.toolCalls('check_order');
    expect(orderCalls).toHaveLength(1);
    expect(orderCalls[0].input).toEqual({ orderId: 'ORD-100' });
    expect(orderCalls[0].output).toMatchObject({ status: 'delivered' });
  });

  it('handles multi-tool sequences', async () => {
    const mockLLM = createMockLLM({
      responses: [
        // Step 1: search first
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        // Step 2: then check an order
        { toolCalls: [{ name: 'check_order', input: { orderId: 'ORD-100' } }] },
        // Step 3: summarize
        { text: 'Found the keyboard at $49.99, and your order ORD-100 was delivered.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({
      task: 'Find keyboards and check order ORD-100',
    });

    expect(result.status).toBe('completed');

    // Verify tool call order
    const allCalls = result.toolCalls();
    expect(allCalls).toHaveLength(2);
    expect(allCalls[0].name).toBe('search_products');
    expect(allCalls[1].name).toBe('check_order');
  });

  it('verifies the agent does not exceed step limit', async () => {
    const mockLLM = createMockLLM({
      responses: [
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        { text: 'Done.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 3,
      },
    });

    const result = await harness.run({ task: 'Find keyboards' });

    // Should complete in 2 steps (1 tool call + 1 text response)
    expect(result.trace.steps.length).toBeLessThanOrEqual(3);
    expect(result.status).toBe('completed');
  });
});
```

Step 3: Run the tests
```sh
npm test
```

Expected output:

```
✓ tests/agent.test.ts (4)
  ✓ Product Assistant
    ✓ searches products when asked about inventory
    ✓ checks order status when given an order ID
    ✓ handles multi-tool sequences
    ✓ verifies the agent does not exceed step limit

Test Files  1 passed (1)
     Tests  4 passed (4)
```

No API keys needed. No network calls. Runs in milliseconds.
Step 4: Use the step debugger
For complex agents, TestHarness provides a step debugger that lets you inspect execution one step at a time:
```ts
it('debug step by step', async () => {
  const mockLLM = createMockLLM({
    responses: [
      { toolCalls: [{ name: 'search_products', input: { query: 'monitor' } }] },
      { text: 'Found monitors.' },
    ],
  });

  const harness = createTestHarness({
    agent: {
      name: 'debug-test',
      description: 'Test',
      tools: [searchProducts],
      llm: mockLLM,
      systemPrompt: 'Test',
      maxSteps: 10,
    },
  });

  // Step-by-step execution
  const debugger_ = await harness.debug({ task: 'Find monitors' });

  // Inspect state at each step
  const step1 = debugger_.stepAt(0);
  console.log('Step 1 type:', step1?.type);     // 'tool_call'
  console.log('Step 1 tool:', step1?.toolName); // 'search_products'

  const step2 = debugger_.stepAt(1);
  console.log('Step 2 type:', step2?.type); // 'llm_call'

  // Check final state
  console.log('Total steps:', debugger_.state.steps.length);
  console.log('Status:', debugger_.state.status);
});
```

Step 5: Test error handling
MockLLM can also simulate failures:
```ts
it('handles tool not found gracefully', async () => {
  const mockLLM = createMockLLM({
    responses: [
      // LLM asks to call a tool that doesn't exist
      { toolCalls: [{ name: 'nonexistent_tool', input: {} }] },
      // After the error, LLM recovers
      { text: 'Sorry, I was unable to process that request.' },
    ],
  });

  const harness = createTestHarness({
    agent: {
      name: 'error-test',
      description: 'Test',
      tools: [searchProducts],
      llm: mockLLM,
      systemPrompt: 'Test',
      maxSteps: 10,
    },
  });

  const result = await harness.run({ task: 'Do something impossible' });

  // The failed tool call is recorded as an error...
  expect(result.hasError()).toBe(true);
  // ...but the agent still completes, because the LLM recovers on step 2
  expect(result.status).toBe('completed');
});
```

Key patterns
One mock response per LLM call. If the agent makes 3 LLM calls (2 tool calls + 1 text), you need 3 mock responses.
Tool execute functions still run. Only the LLM is mocked. Your tool logic executes for real, so you're testing the actual integration between tools and the agent loop.
Use result.toolCalls('name') to filter. Instead of digging through steps manually, use the filter to grab calls to a specific tool.
Match your mock to reality. The tool inputs in your mock responses should match what a real LLM would produce. Look at real traces to calibrate.
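The `toolCalls('name')` filter is just a convenience over the recorded calls, which makes your assertions easier to reason about. If you ever assert on raw trace data yourself, the same filter is a one-liner — a hypothetical sketch, assuming each recorded call carries a `name`, `input`, and `output` (check the Testing reference for the real shape):

```ts
// A recorded tool call, as the assertions in this tutorial treat it.
// Hypothetical shape — not AgentForge's actual type.
interface RecordedToolCall {
  name: string;
  input: unknown;
  output: unknown;
}

// Filter recorded calls by tool name; omitting the name returns all calls.
function filterToolCalls(calls: RecordedToolCall[], name?: string): RecordedToolCall[] {
  return name ? calls.filter((c) => c.name === name) : calls;
}

// Example trace data mirroring the multi-tool test above.
const calls: RecordedToolCall[] = [
  { name: 'search_products', input: { query: 'keyboard' }, output: { total: 1 } },
  { name: 'check_order', input: { orderId: 'ORD-100' }, output: { status: 'delivered' } },
];

const orderCalls = filterToolCalls(calls, 'check_order');
// orderCalls contains only the check_order call
```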
Complete test file structure
```
your-project/
├── src/
│   ├── tools.ts        ← Real tool definitions
│   └── main.ts         ← Agent entry point
├── tests/
│   └── agent.test.ts   ← Tests with MockLLM
├── package.json
└── tsconfig.json
```

Next steps
- Testing reference — full TestHarness API
- Trace formatting — print and inspect run traces
- Agents reference — hooks for onError, onStep callbacks