Tutorial: Test and Debug
Write deterministic tests for your agents using MockLLM and TestHarness. No API keys, no flaky tests.
Real LLMs are non-deterministic and cost money. You don't want tests that break because the model phrased something differently, or that charge you every time CI runs. AgentForge's testing tools solve both problems.
This tutorial shows you how to write tests that verify your agent calls the right tools, in the right order, with the right inputs — without hitting any LLM API.
What you'll learn
- Script exact LLM responses with `createMockLLM()`
- Run agents in a test harness with `createTestHarness()`
- Assert on tool calls, step counts, and output
- Use the step debugger to inspect execution
Prerequisites
Same project from the previous tutorials. Add vitest:
```sh
npm install -D vitest
```

Add a test script to `package.json`:

```json
{
  "scripts": {
    "test": "vitest run",
    "test:watch": "vitest"
  }
}
```

Step 1: Understand MockLLM
createMockLLM() takes a list of scripted responses. Each time the agent calls the LLM, it gets the next response in the list. You control exactly what the LLM "says".
Two response types:
- `{ text: "..." }` — the LLM returns a text response (task complete)
- `{ toolCalls: [{ name: "...", input: {...} }] }` — the LLM asks to call a tool
Here's how a typical agent run maps to mock responses:
```
Agent calls LLM (step 1) → mock returns toolCalls → agent runs the tool
Agent calls LLM (step 2) → mock returns text      → agent completes
```

That's two responses in the mock. The agent loop drives the sequence.
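If you want to picture what "scripted responses" means mechanically: each LLM call consumes the next response in order, and running past the end of the list is an error. Here is a minimal sketch of that idea — the `MockResponse` shape mirrors the two response types above, but `makeScriptedLLM` is a hypothetical illustration, not AgentForge's actual implementation:

```ts
// The two scripted response shapes, mirroring the types above.
type MockResponse =
  | { text: string }
  | { toolCalls: { name: string; input: Record<string, unknown> }[] };

// A scripted "LLM": every call returns the next response in order.
// Hypothetical sketch — not AgentForge's real internals.
function makeScriptedLLM(responses: MockResponse[]) {
  let index = 0;
  return {
    complete(): MockResponse {
      if (index >= responses.length) {
        throw new Error(`Mock exhausted after ${responses.length} responses`);
      }
      return responses[index++];
    },
  };
}

// Two scripted responses cover the two LLM calls in the sequence above.
const llm = makeScriptedLLM([
  { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
  { text: 'Found it.' },
]);
```

A third call to `llm.complete()` here would throw — which is exactly the failure you hit when your mock scripts fewer responses than the agent makes LLM calls.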
Step 2: Write your first test
Create tests/agent.test.ts:
```ts
import { describe, it, expect } from 'vitest';
import { z } from 'zod';
import { defineTool, createMockLLM, createTestHarness } from '@ahzan-agentforge/core';

// --- Define tools (same as your real agent) ---

const searchProducts = defineTool({
  name: 'search_products',
  description: 'Search products by keyword',
  input: z.object({ query: z.string() }),
  output: z.object({
    results: z.array(z.object({
      id: z.string(),
      name: z.string(),
      price: z.number(),
      stock: z.number(),
    })),
    total: z.number(),
  }),
  execute: async ({ query }) => {
    // Real tool logic — this actually runs during tests
    if (query.toLowerCase().includes('keyboard')) {
      return {
        results: [{ id: 'KB-001', name: 'Wireless Keyboard', price: 49.99, stock: 23 }],
        total: 1,
      };
    }
    return { results: [], total: 0 };
  },
});

const checkOrder = defineTool({
  name: 'check_order',
  description: 'Check order status',
  input: z.object({ orderId: z.string() }),
  output: z.object({
    orderId: z.string(),
    status: z.string(),
    items: z.array(z.string()),
    total: z.number(),
  }),
  execute: async ({ orderId }) => ({
    orderId,
    status: 'delivered',
    items: ['Wireless Keyboard'],
    total: 49.99,
  }),
});

// --- Tests ---

describe('Product Assistant', () => {
  it('searches products when asked about inventory', async () => {
    const mockLLM = createMockLLM({
      responses: [
        // Step 1: LLM decides to search
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        // Step 2: LLM reads the result and responds
        { text: 'We have the Wireless Keyboard (KB-001) for $49.99. 23 in stock.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({ task: 'Do you have keyboards?' });

    // Basic assertions
    expect(result.status).toBe('completed');
    expect(result.output).toContain('Wireless Keyboard');

    // Tool call assertions
    const toolCalls = result.toolCalls();
    expect(toolCalls).toHaveLength(1);
    expect(toolCalls[0].name).toBe('search_products');
    expect(toolCalls[0].input).toEqual({ query: 'keyboard' });

    // The tool actually ran — check its output
    expect(toolCalls[0].output).toEqual({
      results: [{ id: 'KB-001', name: 'Wireless Keyboard', price: 49.99, stock: 23 }],
      total: 1,
    });
  });

  it('checks order status when given an order ID', async () => {
    const mockLLM = createMockLLM({
      responses: [
        { toolCalls: [{ name: 'check_order', input: { orderId: 'ORD-100' } }] },
        { text: 'Order ORD-100 has been delivered. It contained a Wireless Keyboard.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({ task: "What's the status of ORD-100?" });

    expect(result.status).toBe('completed');

    const orderCalls = result.toolCalls('check_order');
    expect(orderCalls).toHaveLength(1);
    expect(orderCalls[0].input).toEqual({ orderId: 'ORD-100' });
    expect(orderCalls[0].output).toMatchObject({ status: 'delivered' });
  });

  it('handles multi-tool sequences', async () => {
    const mockLLM = createMockLLM({
      responses: [
        // Step 1: search first
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        // Step 2: then check an order
        { toolCalls: [{ name: 'check_order', input: { orderId: 'ORD-100' } }] },
        // Step 3: summarize
        { text: 'Found the keyboard at $49.99, and your order ORD-100 was delivered.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 10,
      },
    });

    const result = await harness.run({
      task: 'Find keyboards and check order ORD-100',
    });

    expect(result.status).toBe('completed');

    // Verify tool call order
    const allCalls = result.toolCalls();
    expect(allCalls).toHaveLength(2);
    expect(allCalls[0].name).toBe('search_products');
    expect(allCalls[1].name).toBe('check_order');
  });

  it('verifies the agent does not exceed step limit', async () => {
    const mockLLM = createMockLLM({
      responses: [
        { toolCalls: [{ name: 'search_products', input: { query: 'keyboard' } }] },
        { text: 'Done.' },
      ],
    });

    const harness = createTestHarness({
      agent: {
        name: 'product-assistant',
        description: 'Test',
        tools: [searchProducts, checkOrder],
        llm: mockLLM,
        systemPrompt: 'You are a product assistant.',
        maxSteps: 3,
      },
    });

    const result = await harness.run({ task: 'Find keyboards' });

    // Should complete in 2 steps (1 tool call + 1 text response)
    expect(result.trace.steps.length).toBeLessThanOrEqual(3);
    expect(result.status).toBe('completed');
  });
});
```

Step 3: Run the tests
```sh
npm test
```

Expected output:

```
✓ tests/agent.test.ts (4)
  ✓ Product Assistant
    ✓ searches products when asked about inventory
    ✓ checks order status when given an order ID
    ✓ handles multi-tool sequences
    ✓ verifies the agent does not exceed step limit

Test Files  1 passed (1)
     Tests  4 passed (4)
```

No API keys needed. No network calls. Runs in milliseconds.
Step 4: Use the step debugger
For complex agents, TestHarness provides a step debugger that lets you inspect execution one step at a time:
```ts
it('debug step by step', async () => {
  const mockLLM = createMockLLM({
    responses: [
      { toolCalls: [{ name: 'search_products', input: { query: 'monitor' } }] },
      { text: 'Found monitors.' },
    ],
  });

  const harness = createTestHarness({
    agent: {
      name: 'debug-test',
      description: 'Test',
      tools: [searchProducts],
      llm: mockLLM,
      systemPrompt: 'Test',
      maxSteps: 10,
    },
  });

  // Step-by-step execution
  const debugger_ = await harness.debug({ task: 'Find monitors' });

  // Inspect state at each step
  const step1 = debugger_.stepAt(0);
  console.log('Step 1 type:', step1?.type);     // 'tool_call'
  console.log('Step 1 tool:', step1?.toolName); // 'search_products'

  const step2 = debugger_.stepAt(1);
  console.log('Step 2 type:', step2?.type); // 'llm_call'

  // Check final state
  console.log('Total steps:', debugger_.state.steps.length);
  console.log('Status:', debugger_.state.status);
});
```

Step 5: Test error handling
MockLLM can also simulate failures:
```ts
it('handles tool not found gracefully', async () => {
  const mockLLM = createMockLLM({
    responses: [
      // LLM asks to call a tool that doesn't exist
      { toolCalls: [{ name: 'nonexistent_tool', input: {} }] },
      // After the error, LLM recovers
      { text: 'Sorry, I was unable to process that request.' },
    ],
  });

  const harness = createTestHarness({
    agent: {
      name: 'error-test',
      description: 'Test',
      tools: [searchProducts],
      llm: mockLLM,
      systemPrompt: 'Test',
      maxSteps: 10,
    },
  });

  const result = await harness.run({ task: 'Do something impossible' });

  // The failed tool call is recorded as an error...
  expect(result.hasError()).toBe(true);
  // ...but the agent still completes, because the LLM recovers on step 2
  expect(result.status).toBe('completed');
});
```

Key patterns
One mock response per LLM call. If the agent makes 3 LLM calls (2 tool calls + 1 text), you need 3 mock responses.
Tool execute functions still run. Only the LLM is mocked. Your tool logic executes for real, so you're testing the actual integration between tools and the agent loop.
Use result.toolCalls('name') to filter. Instead of digging through steps manually, use the filter to grab calls to a specific tool.
Match your mock to reality. The tool inputs in your mock responses should match what a real LLM would produce. Look at real traces to calibrate.
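The `toolCalls('name')` filter is just a convenience over the recorded calls, which makes your assertions easier to reason about. If you ever assert on raw trace data yourself, the same filter is a one-liner — a hypothetical sketch, assuming each recorded call carries a `name`, `input`, and `output` (check the Testing reference for the real shape):

```ts
// A recorded tool call, as the assertions in this tutorial treat it.
// Hypothetical shape — not AgentForge's actual type.
interface RecordedToolCall {
  name: string;
  input: unknown;
  output: unknown;
}

// Filter recorded calls by tool name; omitting the name returns all calls.
function filterToolCalls(calls: RecordedToolCall[], name?: string): RecordedToolCall[] {
  return name ? calls.filter((c) => c.name === name) : calls;
}

// Example trace data mirroring the multi-tool test above.
const calls: RecordedToolCall[] = [
  { name: 'search_products', input: { query: 'keyboard' }, output: { total: 1 } },
  { name: 'check_order', input: { orderId: 'ORD-100' }, output: { status: 'delivered' } },
];

const orderCalls = filterToolCalls(calls, 'check_order');
// orderCalls contains only the check_order call
```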
Complete test file structure
```
your-project/
├── src/
│   ├── tools.ts        ← Real tool definitions
│   └── main.ts         ← Agent entry point
├── tests/
│   └── agent.test.ts   ← Tests with MockLLM
├── package.json
└── tsconfig.json
```

Next steps
- Testing reference — full TestHarness API
- Trace formatting — print and inspect run traces
- Agents reference — hooks for onError, onStep callbacks