Open Source · TypeScript + Python · v0.1.3

The Next.js for AI Agents

You write the business logic. AgentForge runs the agent loop, checkpoints state, enforces budgets, and traces every decision. If it crashes, it picks up where it left off.

npm install @ahzan-agentforge/core
pip install ahzan-agentforge
npx skills add ahzan-dev/ahzan-agentforge@agentforge
AI coding skill for Claude Code, Copilot, Gemini CLI

What is AgentForge?

Most agent frameworks hand you a bag of parts. You wire up an LLM, bolt on some tools, maybe add a vector store, and hope it holds together when real traffic hits. When it crashes at 2am, you get logs that say nothing useful.

AgentForge works differently. It runs the agent loop itself: plan, act, observe, decide. Your code says what the agent can do. The framework handles how it executes, recovers from failures, and reports what happened.

The analogy is Next.js. Next.js doesn't just hand you React. It gives you routing, rendering, caching, and a mental model for web apps. AgentForge does the same thing for agents: execution, memory, observability, governance, and deployment all wired together.

So you get agents that checkpoint every step, resume after crashes, stay within cost budgets, trace every decision, and swap LLM providers without touching your business logic.

The problem with agent frameworks today

Crash and you lose everything

AgentForge checkpoints every step. Resume from the last save point with a runId.

No idea what the agent decided or why

Full decision trace via OpenTelemetry — every LLM call, tool result, token count, and cost.

Token costs spiral out of control

Built-in budget governor. Set max tokens and dollar limits. The framework enforces them.

Agent calls a dangerous tool unchecked

Autonomy policy engine — allow, deny, or escalate any tool call. Runtime enforcement, not guidelines.

Locked into one LLM provider

Swap between Anthropic, OpenAI, Gemini, and Ollama. Same agent code, different brain.

"It works on my machine" but fails in prod

MockLLM for deterministic tests. TestHarness for structured assertions. StepDebugger for inspecting every decision.

Architecture

Seven layers, one system

Each layer builds on the one below. The execution engine sits at the bottom. Memory, tools, policy, multi-agent coordination, observability, and deployment stack on top of it.

Code

Write agents, not infrastructure

You define tools with validated schemas, wire them into an agent with a budget and a policy, and the framework takes care of the rest. Here's what that looks like.

TypeScript|src/agent.ts
import { defineAgent, createLLM } from '@ahzan-agentforge/core';

const agent = defineAgent({
  name: 'support-triage',
  description: 'Customer support triage agent',
  tools: [getOrder, createTicket, sendEmail],
  llm: createLLM({
    provider: 'anthropic',
    model: 'claude-sonnet-4-20250514',
    maxTokens: 4096,
  }),
  maxSteps: 15,
  systemPrompt: `You are a support triage agent.
    1. Look up their order using get_order
    2. Create a support ticket using create_ticket
    3. Send confirmation via send_email`,
  budget: { maxTokens: 50000, maxCostUsd: 0.50 },
  policy: {
    tools: [
      { pattern: 'delete*', permission: 'escalate' },
      { pattern: '*', permission: 'allow' },
    ],
  },
});

const result = await agent.run({
  task: 'Order #4521 arrived damaged',
});
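
The tools referenced above are plain definitions you write yourself. As a rough sketch, here is what getOrder might look like, assuming the same defineTool + Zod pattern shown in the Rollback section below (the orderSystem client is a stand-in for your own backend):

TypeScript|src/tools.ts
import { z } from 'zod';
import { defineTool } from '@ahzan-agentforge/core';

// Hypothetical order-lookup tool: the order schema and the
// orderSystem client are illustrative assumptions.
const getOrder = defineTool({
  name: 'get_order',
  description: 'Look up an order by ID',
  input: z.object({ orderId: z.string() }),
  output: z.object({ orderId: z.string(), status: z.string() }),
  execute: async ({ orderId }) => {
    return await orderSystem.fetch(orderId);
  },
});
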
Capabilities

The stuff that matters after the demo works

Checkpoint & Crash Recovery

Every step in the agent loop gets checkpointed before the agent moves on. State transitions are atomic. If the process dies (power failure, network drop, OOM kill), pass the same runId and the agent picks up exactly where it left off.

In development, state lives in memory. In production, it goes to Redis via ioredis. Same checkpoint format either way, same resume behavior.

TypeScript|recovery.ts
// First run — crashes at step 7
const result = await agent.run({
  task: 'Process batch of 100 orders',
  runId: 'run_1710000000_abc123',
});

// Resume — picks up from step 7
const resumed = await agent.run({
  task: 'Process batch of 100 orders',
  runId: 'run_1710000000_abc123', // same ID
});
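
How the production Redis store gets wired up isn't shown on this page. A hypothetical configuration sketch, assuming a store option on defineAgent and a RedisStore export (both are assumptions for illustration):

TypeScript|redis-store.ts
import Redis from 'ioredis';
import { defineAgent, RedisStore } from '@ahzan-agentforge/core';

// Hypothetical wiring: the `store` option and `RedisStore` export are
// assumptions; in development the framework defaults to in-memory state.
const agent = defineAgent({
  // ...tools, llm, systemPrompt
  store: new RedisStore(new Redis('redis://localhost:6379')),
});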

Cost Governor & Autonomy Policy

Set token limits and dollar budgets per run. The framework tracks usage against per-model cost tables and stops the agent before it exceeds the budget. Warning thresholds let you react before hitting the hard limit.

Autonomy policy is separate from budget. It controls what the agent can do. Glob patterns let you allow, deny, or escalate tool calls. The framework enforces these at runtime as hard constraints. Application code can't skip them, even accidentally.

TypeScript|governance.ts
const agent = defineAgent({
  // ...tools, llm, systemPrompt

  budget: {
    maxTokens: 50_000,
    maxCostUsd: 0.50,
    warnThreshold: 0.8, // warn at 80%
  },

  policy: {
    tools: [
      { pattern: 'delete*', permission: 'escalate' },
      { pattern: 'send_email', permission: 'escalate' },
      { pattern: 'read*', permission: 'allow' },
      { pattern: '*', permission: 'allow' },
    ],
    maxStepsPerRun: 20,
  },
});

Rollback & Compensating Actions

When a run fails partway through, some tools have already fired. A ticket got created, an order got updated. AgentForge runs compensating actions in reverse order (LIFO) to undo those effects, and keeps an audit trail of what it rolled back.

Tools that can't be undone (sending an email, hitting an external API) get marked 'irreversible' and skipped during rollback. You always know what was undone and what wasn't.

TypeScript|rollback.ts
import { z } from 'zod';
import { defineTool } from '@ahzan-agentforge/core';

const createTicket = defineTool({
  name: 'create_ticket',
  input: z.object({ title: z.string() }),
  output: z.object({ ticketId: z.string() }),
  execute: async ({ title }) => {
    return await ticketSystem.create(title);
  },
  // Undo on run failure
  compensate: async (input, output) => {
    await ticketSystem.delete(output.ticketId);
  },
});

const sendEmail = defineTool({
  name: 'send_email',
  // ...
  compensate: 'irreversible', // skip during rollback
});

Real-Time Streaming

Stream tokens as the agent thinks. Events are typed: LLM tokens, tool starts, tool completions, step transitions, errors. You can build UIs that show what the agent is doing right now instead of waiting for the final result.

TypeScript|stream.ts
for await (const event of agent.stream({
  task: 'Analyze customer feedback'
})) {
  switch (event.type) {
    case 'llm_token':
      process.stdout.write(event.content);
      break;
    case 'tool_start':
      console.log(`Calling ${event.toolName}...`);
      break;
    case 'tool_end':
      console.log(`Result: ${event.output}`);
      break;
    case 'step_complete':
      console.log(`Step ${event.step} done`);
      break;
  }
}

Deterministic Testing

MockLLM lets you script exact responses: text, tool calls, or sequences of both. No API keys, no network calls, fully deterministic. Replay mode re-runs a previous trace against your current code so you catch regressions.

TestHarness wraps an agent run and gives you structured assertions: check which tools were called, verify step counts, inspect the full message history, assert on error presence. StepDebugger pauses after each step so you can inspect state in real-time.

TypeScript|agent.test.ts
import { MockLLM, TestHarness } from '@ahzan-agentforge/core';

const mockLLM = new MockLLM([
  { text: 'Let me look up that order.' },
  { toolCalls: [{
    name: 'get_order',
    input: { orderId: '4521' }
  }]},
  { text: 'Order #4521 is being processed.' },
]);

const harness = new TestHarness(agent, mockLLM);
const result = await harness.run({
  task: 'Check order #4521'
});

expect(result.toolCalls()).toContain('get_order');
expect(result.hasError()).toBe(false);
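
StepDebugger's exact API isn't shown on this page. A hypothetical sketch of step-by-step inspection, assuming a callback-style interface (the constructor and onStep hook are assumptions):

TypeScript|debug.ts
import { StepDebugger } from '@ahzan-agentforge/core';

// Hypothetical usage: pause after each step, inspect state, continue.
const dbg = new StepDebugger(agent);
dbg.onStep(async (step) => {
  console.log(`Step ${step.index}:`, step.state);
  await dbg.continue();
});
await dbg.run({ task: 'Check order #4521' });
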
Philosophy

Non-negotiable principles

Own the loop, not the LLM

The LLM is a swappable part. AgentForge controls execution, retries, checkpointing, and tracing around every decision the model makes. Switch from Claude to GPT to Gemini to a local Ollama model. Your agent logic doesn't change.
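
A minimal sketch of that swap, reusing the createLLM call from the Code section (the Ollama model name is an illustrative assumption):

TypeScript|swap.ts
import { createLLM } from '@ahzan-agentforge/core';

// Only this object changes; the agent definition stays the same.
const claude = createLLM({ provider: 'anthropic', model: 'claude-sonnet-4-20250514' });
const local = createLLM({ provider: 'ollama', model: 'llama3.1' }); // model name is an assumption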

State is sacred

Every step is checkpointed before the agent moves on. Crash the process, lose power, hit a network failure. Restart and resume from the last checkpoint. Zero data loss, atomic state transitions.

Reliability over features

Ten things that always work beat fifty that sometimes don't. Every capability ships with retry logic, timeout enforcement, error classification, and rollback support built into the framework itself.

Observability is not optional

Every LLM call, tool execution, token count, and timing is recorded as a core part of the architecture. When something breaks at 3am, you trace the exact decision chain that caused it.

Comparison

What you get that others don't

Other frameworks give you pieces. AgentForge gives you a runtime that handles the hard parts: crash recovery, cost limits, rollback, tracing. The stuff that only matters after your demo works and you need it to keep working.

| Capability | AgentForge | Other Frameworks |
| --- | --- | --- |
| Crash Recovery | Checkpoint + resume via runId | — |
| Cost Governance | Framework-enforced budgets | — |
| Autonomy Policy | Runtime trust enforcement | — |
| Rollback System | Compensating actions + audit trail | — |
| Error Classification | Framework vs LLM vs Tool errors | — |
| Full Decision Trace | OpenTelemetry native | — |
| Multi-Agent Patterns | 4 built-in orchestration patterns | partial |
| Python + TypeScript | Full API parity, not a wrapper | partial |
| Tool I/O Validation | Zod + Pydantic schemas | partial |
| Mock & Test Utilities | MockLLM + TestHarness + StepDebugger | — |

Stack

What it's built on

TypeScript First

Core SDK with full type inference. Strict mode. Published to npm as @ahzan-agentforge/core.

Python SDK

Full API parity with the TypeScript core. camelCase becomes snake_case, Zod becomes Pydantic. It's a native SDK, not a wrapper.

Zod + Pydantic

Schema validation at every boundary. Tool inputs validated before execution, outputs validated after. Type safety end to end.

Redis + BullMQ

State checkpoints and session data in Redis via ioredis. Queue-backed execution through BullMQ for high-throughput workloads.
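
The queue integration isn't detailed here. A hypothetical sketch using BullMQ directly, with a worker that executes one agent run per job (the integration shape is an assumption; AgentForge may ship its own adapter):

TypeScript|worker.ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const runs = new Queue('agent-runs', { connection });

// One agent run per job; resumable via the runId carried in job data.
new Worker('agent-runs', async (job) => {
  return agent.run({ task: job.data.task, runId: job.data.runId });
}, { connection });

await runs.add('triage', {
  task: 'Order #4521 arrived damaged',
  runId: 'run_1710000000_abc123',
});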

Postgres + PgVector

Long-term memory stored as embeddings in Postgres with pgvector. Retrieval with relevance scoring and recency weighting.
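
The exact retrieval formula isn't documented on this page. A generic sketch of combining vector similarity with recency weighting (the weights and half-life are illustrative assumptions):

TypeScript|scoring.ts
// Generic relevance + recency scoring; not AgentForge's actual formula.
function score(similarity: number, ageHours: number): number {
  const halfLifeHours = 72; // recency halves every 72h (assumption)
  const recency = Math.pow(0.5, ageHours / halfLifeHours);
  return 0.8 * similarity + 0.2 * recency; // weights are assumptions
}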

OpenTelemetry

Every span is instrumented: agent runs, LLM calls, tool executions. Export to Jaeger, Grafana, Datadog, or any OTLP-compatible backend.
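
Exporting those spans is standard OpenTelemetry setup. A sketch using the Node SDK with an OTLP exporter (the endpoint is illustrative):

TypeScript|otel.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Any OTLP-compatible backend works: Jaeger, Grafana, Datadog, etc.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
});
sdk.start();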

Supported LLM Providers

Anthropic Claude
OpenAI GPT
Google Gemini
Ollama (Local)

Open source core. Cloud when you need it.

The core framework is MIT-licensed. Run it on your own servers, forever, for free. If you'd rather not manage infrastructure, AgentForge Cloud (coming in Phase 5) handles deployment and scaling.

Open Source

Everything in the framework, MIT license. Self-host wherever you want.

Cloud

One-command deploy, managed scaling, visual inspector for live agent runs. Coming in Phase 5.

Enterprise

RBAC, audit logs, SSO, on-premise deployment. For teams where compliance isn't optional.

Start building agents that actually work in production

Get a working agent running in five minutes. Type-safe from tool definitions to LLM calls. Crash recovery and decision tracing from the start. MIT licensed.

npm install @ahzan-agentforge/core
pip install ahzan-agentforge