AgentForge

An autonomous AI agent that plans and executes multi-step tasks using tool-calling — built from scratch, no frameworks.

No LangChain, no LlamaIndex, no CrewAI. Just an async FastAPI backend that drives OpenAI function-calling in a plan-act-observe loop, a pluggable tool registry, MongoDB persistence, and a Next.js streaming chat UI that shows every step the agent takes in real time.

Architecture

flowchart LR
    U[User] -->|goal| FE[Next.js UI]
    FE -->|POST /api/runs| BE[FastAPI backend]
    FE <-->|SSE /stream| BE
    BE --> LOOP[Agent loop<br/>plan · act · observe]
    LOOP <-->|function calling| LLM[(OpenAI<br/>gpt-4o / gpt-4o-mini)]
    LOOP --> REG[Tool registry]
    REG --> T1[web_search]
    REG --> T2[code_execution]
    REG --> T3[db_query]
    REG --> T4[task_complete]
    LOOP -->|persist each step| DB[(MongoDB<br/>agent_runs)]
    T3 --> DB

The stack runs as three containers — frontend, backend, and mongo — orchestrated by docker-compose.yml.

Quick start

git clone <your-fork-url> agentforge
cd agentforge

cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...

docker compose up --build

Then open:

Frontend: http://localhost:3000
Backend API docs: http://localhost:8000/docs

The sample MongoDB product catalog is seeded automatically on first boot, so the db_query tool works immediately. web_search works with no extra keys (keyless DuckDuckGo) and upgrades to Tavily if you set TAVILY_API_KEY.

Running the backend without Docker

cd backend
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Point at a local Mongo and export your key:
export OPENAI_API_KEY=sk-... MONGO_URI=mongodb://localhost:27017
uvicorn app.main:app --reload

How the agent loop works

The core lives in backend/app/agent/loop.py. It implements a textbook plan → act → observe cycle:

sequenceDiagram
    participant U as User
    participant L as Agent loop
    participant M as OpenAI
    participant T as Tool
    participant DB as MongoDB

    U->>L: goal
    loop until task_complete or max_iter
        L->>M: messages + tool schemas (tool_choice=auto)
        M-->>L: function_call (tool, args)
        L->>T: execute(args) with timeout
        T-->>L: result (or error string)
        L->>DB: persist step
        L-->>U: stream step (SSE)
        L->>L: append tool result to history
    end
    L-->>U: stream final answer (SSE)

User sends a goal via POST /api/runs.
A system prompt tells the LLM it has tools and must work step by step.
The LLM returns a function_call (tool name + JSON arguments).
The backend executes the tool and captures the result.
The result is appended to the message history as a tool message.
Loop back to step 3 until either:
- the LLM calls the terminal task_complete tool, or
- the max-iteration cap (default 10) is hit.
The final answer is streamed to the frontend.

Built-in robustness

Unknown tool → an error string is returned into context so the LLM self-corrects.
Tool failure → wrapped in try/except; the exception text becomes the tool result.
Context-window management → when the running history exceeds TOKEN_THRESHOLD tokens, older steps are summarized (cheap model) while the original goal and recent steps are kept verbatim (context.py).
Per-tool timeouts → each tool call is bounded by TOOL_TIMEOUT (default 30s); the sandboxed code executor has its own tighter timeout.
Crash-safe persistence → every step is $push-ed to MongoDB as it happens, not just at the end.

Adding a custom tool

Adding a tool is one decorated async function. Drop a new file in backend/app/tools/ — it is auto-discovered at startup.

# backend/app/tools/weather.py
from typing import Any
import httpx
from .registry import ToolContext, register_tool


@register_tool(
    name="get_weather",
    description="Get the current temperature for a city.",
    parameters={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
)
async def get_weather(args: dict[str, Any], ctx: ToolContext) -> str:
    city = args["city"]
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://wttr.in/" + city, params={"format": "3"}
        )
    return resp.text

That's it. No registration list to edit — the @register_tool decorator adds it to the registry, its JSON schema is sent to the LLM automatically, and the loop can call it. Mark a tool terminal=True to make calling it end the run (that's how task_complete works).

Each tool receives a ToolContext carrying shared dependencies (settings, the OpenAI client, the Mongo database) so tools stay easy to test.

Example run

Goal: "What's the cheapest in-stock book in the catalog, and is it cheaper than a USB-C Hub?"

Step	Tool	Args	Observation (truncated)
1	`db_query`	`{"question": "cheapest in-stock book"}`	`Clean Code — $33.99, in_stock: true`
2	`db_query`	`{"question": "price of USB-C Hub"}`	`USB-C Hub — $39.99, in_stock: false`
3	`code_execution`	`{"code": "print(33.99 < 39.99)"}`	`stdout: True`
4	`task_complete`	`{"answer": "..."}`	(ends run)

Final answer: "The cheapest in-stock book is Clean Code at $33.99. Yes — it's cheaper than the USB-C Hub ($39.99) by $6.00 (and the hub is out of stock anyway)."

You can watch each of these steps stream in as collapsible cards in the UI, and replay any past run from the history sidebar.

API reference

Method	Path	Description
`POST`	`/api/runs`	Start a new run. Body: `{ "goal": "...", "model"?, "max_iterations"? }`. Returns `{ "run_id" }`.
`GET`	`/api/runs/{run_id}/stream`	Server-Sent Events: each `step`, then `final`, then `done`. Replays persisted steps for finished runs.
`GET`	`/api/runs/{run_id}`	Full run document with every step.
`GET`	`/api/runs?limit=50`	Recent runs (compact summaries).
`GET`	`/api/health`	Liveness probe.

Run document shape (`agent_runs` collection)

{
  "run_id": "…",
  "goal": "…",
  "status": "running | completed | failed | max_iter",
  "created_at": "…", "completed_at": "…",
  "model": "gpt-4o-mini",
  "steps": [
    {
      "step_number": 1,
      "tool_name": "db_query",
      "tool_args": { "question": "…" },
      "tool_result": "…",
      "llm_response_raw": { "…": "…" },
      "tokens_used": 812,
      "latency_ms": 640,
      "timestamp": "…"
    }
  ],
  "final_answer": "…",
  "total_tokens": 3120,
  "total_latency_ms": 4210
}

Tech stack

Backend

Python 3.11+, FastAPI (async)
openai ≥ 1.0 (function calling)
motor (async MongoDB driver)
httpx (async HTTP for tools)
pydantic v2 + pydantic-settings (typed models & config)
tiktoken (token accounting)

Frontend

Next.js 14 (App Router) + React 18
TypeScript (everything typed)
Tailwind CSS
EventSource for SSE streaming

Infrastructure

MongoDB 7
Docker + Docker Compose

Project structure

agentforge/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app + CORS + lifespan + SSE broker
│   │   ├── config.py            # Settings via pydantic-settings
│   │   ├── agent/
│   │   │   ├── loop.py          # The core plan-act-observe loop
│   │   │   ├── context.py       # Message history + summarization
│   │   │   └── schemas.py       # Pydantic models
│   │   ├── tools/
│   │   │   ├── registry.py      # @register_tool decorator + discovery
│   │   │   ├── web_search.py
│   │   │   ├── code_exec.py
│   │   │   ├── db_query.py
│   │   │   └── task_complete.py
│   │   └── storage/
│   │       └── mongo.py         # Run persistence + sample data seeding
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/app/
│   │   ├── page.tsx             # Main chat UI
│   │   ├── lib/api.ts           # Typed API client + SSE helpers
│   │   └── components/
│   │       ├── ChatInput.tsx
│   │       ├── StepCard.tsx     # Collapsible tool-call display
│   │       └── RunHistory.tsx
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml
├── .env.example
├── README.md
└── LICENSE

A note on the code sandbox

The code_execution tool runs snippets in a fresh, isolated Python subprocess (-I isolated mode, a temporary working directory, a scrubbed environment, and a hard timeout). This is appropriate for trusted/educational use. It is not a hardened security boundary — for untrusted input, run it inside a container with seccomp/gVisor and resource limits.

Built as a side project exploring agentic AI workflows.

License

MIT — see LICENSE.

Agentforge