Technical Deep Dive

AI Agent Architecture: Theory & Production Practice

Every AI agent, regardless of its domain or framework, is built from the same six architectural components. This guide first establishes that universal blueprint through theory and research, then applies it line by line to a production fleet of twenty-eight specialized agents — equipped with 45 Sentinel tools, a Knowledge Library of 43 technical books, Jira integration, and autonomous self-healing — running on self-hosted bare-metal infrastructure.

Part I deconstructs the general architecture: the reasoning engine at the center, the perception layer that feeds it, the planning system that decomposes goals, the memory that retains context, the tools that connect to the outside world, and the evaluation loop that drives continuous improvement. Part II maps each of these components onto the 28-agent autonomous fleet, from Sentinel tooling and Jira Cloud integration to the Knowledge Library, MLOps pipelines, observability stacks, and CI/CD automation, to show exactly how theory translates into running code and measurable operational outcomes.


Part I

General theory

Reasoning, perception, memory, tools, evaluation, and context engineering.

Part II

Production mapping

A 28-agent fleet with orchestration, Jira integration, Knowledge Library, RAG, observability, and CI/CD.

Output

Operational evidence

Measured latency, uptime, resource allocation, and autonomous incident handling.

Architecture briefing

Professional field guide for modern agent systems

production-backed

Architecture scope

Theory to production

A complete blueprint spanning cognition, orchestration, and infrastructure.

Operational baseline

28 agents · 45 tools · 43 books

Concrete fleet telemetry, Jira integration, and Knowledge Library powered by GPT-4o-mini.

Visual system

Interactive D3 analysis

Topology, state, resource, and loop behavior rendered as live explanatory graphics.

Reading mode

Technical narrative

Visual mode

Interactive D3.js

Infra basis

Self-hosted bare metal

Focus

Autonomy with observability

Foundation Model
Reasoning Engine

The LLM or neural network that provides natural language understanding, chain-of-thought reasoning, and structured output generation. This is the agent's 'brain' — it processes inputs and decides what to do next.

Perception
Input Processing

How the agent observes its environment. Transforms raw inputs — text, images, API responses, sensor data — into structured representations that the reasoning engine can process effectively.

Planning
Task Decomposition

Breaks complex goals into executable sub-tasks. Uses patterns like ReAct, Tree-of-Thought, or DAG construction to create multi-step action plans with dependency resolution.

Memory
Context Management

Maintains state across interactions. Working memory (current context), episodic memory (past experiences), and semantic memory (generalized knowledge) enable learning from history.

Tool Integration
Execution Layer

The agent's hands — how it acts on the world. Connects to APIs, databases, code interpreters, and external services. The reasoning engine selects tools, constructs parameters, and interprets results.

Evaluation
Feedback Loop

Assesses outcomes and drives improvement. Compares actual results against expectations, scores decisions, and feeds learnings back into memory and planning for continuous adaptation.

Workflows vs Agents

Seven levels of increasing autonomy

Not every LLM integration is an agent. The critical distinction lies in who controls the execution flow. In a workflow, the developer pre-defines every step — the LLM generates text but makes no decisions about what happens next. In an agent, the LLM itself determines which actions to take, when to use tools, and when to stop. Between these two poles lies a spectrum of seven distinct levels, each granting the model progressively more control over the current step, the next step, and the set of available steps.

Understanding where your system sits on this spectrum is essential for choosing the right architecture. Levels 1–2 are appropriate when reliability and predictability are paramount. Levels 3–5 suit most production agent deployments where the LLM needs to interact with external systems. Levels 6–7 represent fully autonomous systems that learn from experience and coordinate across multiple specialized agents — the territory occupied by the 28-agent fleet described in Part II.

[Interactive D3 visualization]

Workflows (L1–L2)

  • Developer pre-defines all execution paths
  • LLM generates content but controls nothing
  • Deterministic, auditable, low risk
  • Cannot handle novel situations outside pre-built branches

Agents (L4–L7)

  • LLM dynamically decides actions based on context
  • Iterative loop with unpredictable step count
  • Can handle novel, complex, multi-step tasks
  • Requires guardrails, evaluation, and memory to be reliable

The Agent Loop

LLM + Tool + Loop: the core triad

Every LLM agent, regardless of framework or domain, reduces to three elements. The LLM serves as the decision-making core — it interprets context and chooses what to do next. Tools give the agent hands to interact with the external world: API calls, database queries, code execution. The Loop repeats the observe–think–act cycle until the agent determines the goal is achieved. This iterative loop is what transforms a stateless text generator into an autonomous problem-solver.

The loop continues because the agent cannot predict in advance how many steps will be required. A simple query might resolve in one iteration. A complex research task might require dozens of tool calls, each observation enriching the context for the next decision. The key insight is that the LLM decides both the next action and when to terminate — this is the fundamental property that distinguishes an agent from traditional software where execution paths are fixed at compile time.
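The triad can be sketched in a few lines of Python. This is a minimal illustration, not the fleet's code: the "LLM" here is a hard-coded stub decision function, and the single `search` tool is invented for the example.

```python
# Minimal observe/think/act loop. stub_llm stands in for a real model call;
# it decides the next action and, crucially, when to stop.
def stub_llm(context):
    if "search_result" not in context:
        return {"tool": "search", "args": {"query": context["goal"]}}
    return {"final": f"Answer based on: {context['search_result']}"}

TOOLS = {
    "search": lambda query: f"top hit for '{query}'",  # illustrative tool
}

def run_agent(goal, max_steps=5):
    context = {"goal": goal}
    for _ in range(max_steps):              # the Loop
        decision = stub_llm(context)        # Think
        if "final" in decision:             # the LLM terminates the loop
            return decision["final"]
        tool = TOOLS[decision["tool"]]      # Act
        context["search_result"] = tool(**decision["args"])  # Observe
    return "max steps reached"

print(run_agent("agent architectures"))
```

The `max_steps` guard is the conventional safety net: the LLM decides when to stop, but the runtime caps how long it may try.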

Decision mechanics

Observe, reason, act, evaluate, and remember as one continuous control loop

This orbital D3 view turns the abstract concept of agency into a readable systems primitive with explicit state transitions and memory closure.

Loop nodes

5 operational states

Center

LLM decision core

[Interactive D3 visualization]

LLM — The Brain

Understands context, classifies intent, generates reasoning chains, selects tools, and decides when to stop. The foundation model provides generalization that no rule-based system can match.

Qwen 2.5 7B / GPT-4 / Claude

Tools — The Hands

APIs, databases, code interpreters, infrastructure control planes. Each tool is presented as a function signature with name, description, and JSON parameter schema. The LLM selects and parameterizes tools autonomously.

Function calling interface

Loop — The Heartbeat

The iterative cycle that gives agents their power. Each iteration adds new observations to the context, enabling compound reasoning. Without the loop, you have a chatbot. With it, you have an agent.

Observe → Think → Act → Evaluate

Reasoning Engine

The LLM at the core of every decision

At the center of every agent sits a foundation model that serves as its reasoning engine. When an input arrives, the model does not simply generate a response. It first decomposes the problem into explicit reasoning steps, then decides whether the answer requires external data or action, and finally evaluates the quality of its own output before releasing it. This three-phase pipeline — reasoning, acting, reflecting — is the fundamental decision loop that every other component supports.

The dominant pattern for this loop is ReAct (Yao et al., 2023), where the LLM alternates between a Thought step (reasoning about what to do), an Action step (calling a tool or generating output), and an Observation step (interpreting the result). A self-reflection gate at the end checks whether the output meets confidence thresholds before it is finalized. The diagram below traces this full decision pipeline from raw input to structured output.

[Interactive D3 visualization]

Chain-of-Thought

The LLM breaks reasoning into explicit steps: understand intent → identify constraints → evaluate options → select action. This makes decisions interpretable and auditable.

Pattern
Thought → Action → Observation → Thought → ...

Tool Selection

When the task requires external information or action, the LLM selects an appropriate tool from its available toolkit, constructs structured parameters, and interprets the result.

Format
{ tool: "api_call", params: { ... } }

Self-Reflection

Before outputting a final answer, the LLM evaluates its own response: Is it complete? Accurate? Does it need another reasoning step? This creates an internal quality gate.

Check
confidence ≥ threshold → output

Perception & Input

How agents observe their environment

A reasoning engine is only as good as the data it receives. Perception is the sensory layer that sits between the raw environment and the reasoning core, responsible for transforming unstructured signals into clean, structured representations. In text-based agents, this means tokenizing input, extracting entities, classifying intent, and embedding content into vector space for retrieval. In multimodal systems, it extends to images via vision-language models and audio via speech recognition. The quality of perception directly determines the quality of every downstream decision.

Text Understanding

Natural language inputs are tokenized, entities extracted, and intent classified. Modern agents use the LLM itself for this, but specialized NER and intent classifiers can run as pre-processing steps.

Multimodal Input

Vision-language models (GPT-4V, Claude Vision) can process images alongside text. Audio is transcribed via Whisper or similar ASR models before entering the text pipeline.

Structured Data Ingestion

API responses, database query results, and log streams are parsed into structured formats (JSON, dataframes) that the agent can reason over directly.

Embedding & Retrieval

Dense vector embeddings (sentence-transformers, text-embedding-3) enable semantic similarity search. This is the foundation of RAG — grounding decisions in retrieved context.
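The retrieval step reduces to cosine similarity over embedding vectors. A toy sketch, with hand-made 3-dimensional vectors standing in for real model output (production embeddings such as all-MiniLM-L6-v2 have 384 dimensions, and these corpus entries are invented):

```python
import math

# Toy "embeddings"; a real system would call an embedding model.
corpus = {
    "nginx 502 after deploy": [0.9, 0.1, 0.0],
    "disk usage alert":       [0.1, 0.8, 0.1],
    "certificate expiry":     [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    # Rank corpus entries by similarity to the query embedding
    ranked = sorted(corpus, key=lambda t: cosine(query_vec, corpus[t]), reverse=True)
    return ranked[:k]

print(retrieve([1.0, 0.0, 0.1], k=1))  # → ['nginx 502 after deploy']
```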

Planning & Decomposition

Breaking complex goals into executable sub-tasks

Planning is the capability that separates agents from simple chatbots. A chatbot produces a single response to a single prompt. An agent, by contrast, receives a complex goal, breaks it into a sequence of sub-tasks with explicit dependencies, determines which can run in parallel, and executes them iteratively while monitoring intermediate results. Research has converged on several planning paradigms, each suited to different types of problems. The choice of paradigm fundamentally shapes how an agent handles multi-step tasks, error recovery, and resource allocation.

ReAct (Reason + Act)

Yao et al., 2023

The agent alternates between reasoning steps (Thought) and action steps (Tool Call). Each observation from the environment triggers a new reasoning step. Simple, effective, and widely adopted.

Thought → Action → Observation → Thought → ...

Plan-and-Execute

Wang et al., 2023

The agent creates a complete plan upfront, then executes each step sequentially. Better for long-horizon tasks where the full plan can be validated before execution begins.

Plan [step1, step2, ...] → Execute(step1) → Execute(step2) → ...

Tree-of-Thought

Yao et al., 2023

Multiple reasoning paths are explored in parallel as a tree. The agent evaluates each branch and selects the most promising path. Powerful for tasks requiring search and exploration.

Root → Branch A, Branch B → Evaluate → Best path → ...

DAG-based Orchestration

LangGraph, CrewAI

Tasks are modeled as a directed acyclic graph with explicit dependencies. Enables parallel execution of independent sub-tasks and checkpointing for recovery. Used in production multi-agent systems.

Goal → DAG(nodes, edges) → Topo-sort → Parallel dispatch → Merge
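The topo-sort-then-dispatch line above can be sketched with Kahn's algorithm, grouping tasks into layers where everything in one layer may run in parallel. Task names and dependencies here are invented for illustration:

```python
# Hypothetical sub-task DAG: values are prerequisites of each task.
deps = {
    "fetch_logs": [],
    "fetch_metrics": [],
    "correlate": ["fetch_logs", "fetch_metrics"],
    "report": ["correlate"],
}

def topo_layers(deps):
    # Kahn's algorithm in layers: a task enters a layer once all of
    # its prerequisites have completed in earlier layers.
    indegree = {t: len(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for task, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(task)
    layer = [t for t, n in indegree.items() if n == 0]
    layers = []
    while layer:
        layers.append(sorted(layer))   # dispatch this layer in parallel
        nxt = []
        for t in layer:
            for child in dependents[t]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    nxt.append(child)
        layer = nxt
    return layers

print(topo_layers(deps))
# → [['fetch_logs', 'fetch_metrics'], ['correlate'], ['report']]
```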

When to Use Each Pattern

SIMPLE TASKS

Use ReAct for tasks with predictable complexity — single-domain queries, straightforward tool calls. The iterative loop handles uncertainty without upfront planning overhead.

COMPLEX MULTI-STEP

Use Plan-and-Execute or DAG for cross-domain tasks requiring coordination between multiple agents. The upfront planning phase catches dependency conflicts before execution begins.

Memory & Context

Maintaining state across interactions and time

An agent without memory is stateless: it has no record of what it has done, what worked, or what failed. It will repeat the same mistakes indefinitely. Cognitive science identifies three distinct memory systems in the human brain, and modern AI agent architectures replicate all three. Working memory holds the current task context, analogous to holding a phone number in your head while you dial it. Episodic memory records complete traces of past interactions, analogous to remembering a specific conversation. Semantic memory captures generalized patterns and facts extracted from many episodes, analogous to knowing that a particular configuration always causes a timeout. Together, these three layers give agents the ability to learn from experience without retraining the underlying model.

Working Memory

Analogy
Human: holding a phone number

Current conversation context, active variables, intermediate reasoning steps. Typically the LLM's context window (4K–128K tokens).

Context window, Redis hash, in-memory state

Episodic Memory

Analogy
Human: remembering a specific event

Complete records of past interactions: what was asked, what the agent did, what happened. Enables 'learning from experience' and few-shot retrieval.

Vector database, PostgreSQL, conversation logs

Semantic Memory

Analogy
Human: knowing that Paris is in France

Generalized facts and patterns extracted from many episodes. Domain knowledge, runbook templates, common resolution patterns stored as embeddings.

Knowledge graphs, vector embeddings, fine-tuning

Why Memory Matters: APIs Are Stateless

LLM APIs are fundamentally stateless — each call is independent with no built-in memory of prior interactions. Web interfaces like ChatGPT create the illusion of continuity, but under the hood, the entire conversation history is resent with every request. This means the developer must explicitly manage conversation state, accumulating user messages, assistant responses, tool results, and system context into a growing message list. Memory systems (working, episodic, semantic) solve this at scale — they determine what to keep, what to compress, and what to retrieve, turning a stateless API into a stateful agent that learns from experience.
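The resend-everything pattern is easy to see in code. A sketch with a stubbed client (a real implementation would call an OpenAI- or Anthropic-style chat endpoint; `call_llm` here is a placeholder):

```python
# The API is stateless, so the full history is resent on every call.
def call_llm(messages):
    # Stub standing in for a real chat-completions request
    return f"(reply to {len(messages)} messages)"

history = [{"role": "system", "content": "You are an ops agent."}]

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)        # entire accumulated history goes out
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Why is nginx returning 502?")
chat("Restart it.")
print(len(history))  # → 5: one system + two user + two assistant messages
```

Every turn grows the list, which is exactly why the sliding-window and compaction strategies described later in this article exist.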

Tool Integration

Connecting agents to the external world

Reasoning alone cannot change the state of the world. Tools are the execution layer that connects the agent to external systems: APIs, databases, code interpreters, file systems, and infrastructure control planes. When the reasoning engine determines that it needs information it does not have or needs to perform an action it cannot perform internally, it generates a structured tool call specifying the function name and parameters. The runtime validates the call, executes it, captures the result (or the error), and feeds it back as an observation for the next reasoning step.

The critical design principle is a unified tool interface. Every tool, whether it queries a PostgreSQL database, restarts a Docker container, or calls a third-party API, is presented to the LLM as a function signature with a name, a natural-language description, and a JSON parameter schema. This is the same pattern used by OpenAI Function Calling, LangChain Tools, and Anthropic Tool Use. Standardizing the interface means new tools can be added without modifying the reasoning engine.
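A minimal registry illustrating that unified interface. The tool name, description, and schema shapes follow the function-calling convention the text describes; the `search` tool itself is invented for the example:

```python
import json

REGISTRY = {}

def register(name, description, parameters):
    # Expose a Python function as {name, description, JSON schema}
    def wrap(fn):
        REGISTRY[name] = {"fn": fn, "description": description,
                          "parameters": parameters}
        return fn
    return wrap

@register("search", "Search the knowledge base",
          {"type": "object",
           "properties": {"query": {"type": "string"}},
           "required": ["query"]})
def search(query):
    return f"results for '{query}'"

def dispatch(tool_call_json):
    # Execute a structured tool call emitted by the LLM
    call = json.loads(tool_call_json)
    return REGISTRY[call["name"]]["fn"](**call["args"])

print(dispatch('{"name": "search", "args": {"query": "HNSW index"}}'))
```

New tools only touch the registry; the reasoning engine sees one uniform shape.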

[Interactive D3 visualization]

Universal Tool Call Format

Input Schema
{ name: string, description: string, parameters: JSONSchema }
LLM Output
{ tool_call: { name: "search", args: { query: "..." } } }
Execution
Runtime validates args → executes tool → captures result + errors
Observation
Tool result injected as new message → LLM reasons over it

Evaluation Loop

Closing the loop with continuous feedback

The evaluation and feedback loop is the sixth and final component, and it is the one that transforms a static system into an adaptive one. Without it, an agent can only apply the knowledge embedded in its training weights and its fixed prompt templates. With it, the agent continuously measures its own performance against real-world outcomes and adjusts its behavior accordingly. This is what makes the difference between an agent that degrades over time and one that improves.

The feedback loop operates at three distinct time horizons. The inner loop runs per-action, on the order of seconds, enabling self-correction within a single task. The middle loop runs per-episode, on the order of minutes, storing and learning from complete task traces. The outer loop runs per-epoch, on the order of days or weeks, performing systematic analysis of accumulated metrics and tuning parameters across the entire system.

[Interactive D3 visualization]

Key Research References

ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al., ICLR 2023
Foundational pattern for interleaving reasoning and tool use in LLM agents.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al., NeurIPS 2020
RAG pattern for grounding LLM outputs in retrieved documents.

Generative Agents: Interactive Simulacra of Human Behavior
Park et al., UIST 2023
Memory architecture for agents with reflection, planning, and observation.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Yao et al., NeurIPS 2023
Multi-path reasoning where the LLM explores and evaluates multiple strategies.

Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn et al., NeurIPS 2023
Self-reflection mechanism for agents to learn from mistakes without weight updates.

Context Engineering

Delivering the right context at the right time

Agent development is fundamentally about context engineering — the discipline of assembling precisely the right information into the LLM's prompt window so that its intelligence can be fully utilized. An agent framework is, at its core, a context management system. Every component described in this article — perception, memory, tools, planning — exists to solve one problem: ensuring the LLM has exactly the information it needs to make the right decision at each step of the loop.

The context window is a finite resource. A 128K-token window sounds vast, but production agents must partition it across six competing demands: system instructions, conversation history, retrieved documents, tool results, few-shot examples, and output space. Each token carries an opportunity cost — spending tokens on irrelevant history means fewer tokens available for critical RAG context. The visualization below maps the complete context assembly pipeline, from system prompt to structured output.

[Interactive D3 visualization]

Prompt Architecture

System Message Design

The system message defines the agent's identity, constraints, and output schema. A well-structured system prompt is the single highest-leverage intervention in agent performance — it shapes every downstream decision.

Role + constraints + format + safety

Sliding Window

Conversation Management

As conversations exceed the token budget, older messages must be evicted. A sliding window retains the most recent N turns, but smart implementations weight messages by relevance rather than recency alone.

TTL-based eviction · priority weighting

Turn Compaction

Context Compression

When token pressure increases, the agent compresses verbose messages into concise summaries while preserving key facts. An 800-character response might compact to 300 characters without losing critical information.

800 → 300 chars · fact-preserving

Dynamic Few-Shot

Example Selection

Instead of static examples hardcoded into prompts, the agent retrieves semantically similar input-output pairs from episodic memory. This grounds the LLM's behavior in proven successful patterns for each specific task type.

Embedding similarity → top-k examples

Token Budgeting

Resource Allocation

Each context component receives a token budget proportional to its importance for the current task. Infrastructure alerts allocate more tokens to tool results. Research tasks allocate more to RAG context.

Task-adaptive allocation strategy
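One way to implement task-adaptive allocation is a per-task weight profile normalized against the total window. The profiles and weights below are invented for illustration, not the fleet's actual configuration:

```python
# Hypothetical budget profiles: relative weights per context component.
PROFILES = {
    "infra_alert": {"system": 1, "history": 1, "tools": 4, "rag": 2},
    "research":    {"system": 1, "history": 1, "tools": 1, "rag": 5},
}

def allocate(task_type, total_tokens):
    # Scale each weight so the parts sum to the total token budget
    weights = PROFILES[task_type]
    scale = total_tokens / sum(weights.values())
    return {part: int(w * scale) for part, w in weights.items()}

print(allocate("infra_alert", 6000))
# → {'system': 750, 'history': 750, 'tools': 3000, 'rag': 1500}
```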

Context Grounding

RAG Integration

No decision should rely solely on parametric knowledge. The agent retrieves relevant documents, past episodes, and domain patterns via hybrid search before reasoning. This is the bridge between memory and reasoning.

Dense + sparse + reranking

The Context Engineering Principle

“The essence of an agent lies in delivering the right context to an LLM so that its intelligence can be fully utilized. Agent frameworks are essentially designed to manage this challenge effectively.”

This insight reframes agent development from a model problem to an information architecture problem. The quality of the model matters, but the quality of the context it receives matters more. A smaller model with excellent context engineering will consistently outperform a larger model with poor context assembly — as demonstrated by the fleet's Qwen 2.5 7B achieving 94% classification accuracy through carefully engineered RAG pipelines and dynamic few-shot selection.

Stage 1 of 4

Input Ingestion

Webhook Receiver
FastAPI endpoints accepting JSON payloads from GitHub, Docker Hub, Cloudflare, and Prometheus AlertManager
Log Tail Stream
Structured log ingestion from Docker containers via Loki push API (labels: container, severity, service)
Metric Scraper
Prometheus pull-based collection of /metrics endpoints at 15-second intervals across all agent containers
CLI / REST API
Manual task submission via authenticated FastAPI endpoints with OpenAPI schema validation
Stage 2 of 4

Embedding & Indexing

Text Embedder
Sentence-transformers (all-MiniLM-L6-v2, 384 dimensions) converts raw text into dense vectors for semantic search
Chunk Splitter
Recursive character splitter with 512-token chunks and 64-token overlap for maintaining context continuity
Vector Indexer
Milvus collections partitioned by domain (infra, security, ops, analytics, data) with HNSW index for sub-10ms retrieval
Metadata Tagger
Each chunk annotated with source, timestamp, agent_id, severity, and TTL for freshness-aware retrieval
Stage 3 of 4

RAG Retrieval

Query Encoder
Incoming event text → embedding → cosine similarity search against relevant Milvus collection (top-k=5)
Context Ranker
Cross-encoder reranker (ms-marco-MiniLM-L-6-v2) scores retrieval candidates for relevance and recency
Hybrid Search
Combines dense vector search (semantic) with BM25 sparse search (keyword) via reciprocal rank fusion (RRF)
Context Window
Retrieved chunks assembled into a structured prompt template with system instructions, context, and task description
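The RRF fusion step named above is a small algorithm. A sketch, using k = 60 (the constant from the original RRF formulation) and invented document IDs:

```python
def rrf(rankings, k=60):
    # Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # semantic ranking (illustrative)
sparse = ["doc_b", "doc_d", "doc_a"]   # BM25 keyword ranking (illustrative)

print(rrf([dense, sparse]))  # doc_b wins: ranked high in both lists
```

A document ranked moderately well by both retrievers beats one ranked first by only one of them, which is exactly why RRF is a robust default for hybrid search.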
Stage 4 of 4

LLM Reasoning

Qwen 2.5 7B (Q4)
Primary reasoning model running locally on Ollama. 4-bit quantized for 6GB VRAM footprint, ~40 tok/s on RTX-class GPU
Chain-of-Thought
Structured prompting with reasoning steps: (1) classify intent, (2) assess severity, (3) select agent, (4) generate action plan
Tool Calling
Function-calling format where LLM outputs structured JSON specifying tool name, parameters, and expected response type
Fallback Router
If confidence < 0.7, the event is escalated to a fallback model (Mistral 7B) or queued for human review

RAG Pipeline Technical Specification

Embedding Model
all-MiniLM-L6-v2 · 384 dimensions · 22M params
Chunking Strategy
RecursiveCharacterTextSplitter · 512 tokens · 64 overlap
Vector Store
Milvus · HNSW index · domain-partitioned collections with etcd coordination
Retrieval
Hybrid: cosine similarity (dense) + BM25 (sparse) · RRF fusion
Reranker
ms-marco-MiniLM-L-6-v2 · cross-encoder · top-k=5 → top-3
LLM (Primary)
Qwen 2.5 7B Q4_K_M · Ollama · ~40 tok/s
LLM (Fallback)
Mistral 7B Q4_K_M · triggered when confidence < 0.7
Latency Budget
Embed: 5ms · Retrieve: 8ms · Rerank: 15ms · LLM: ~2s total
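The 512-token / 64-overlap chunking strategy in the spec above can be sketched with whitespace tokens standing in for real tokenizer tokens (the production pipeline uses RecursiveCharacterTextSplitter over true token counts):

```python
def chunk(text, size=512, overlap=64):
    # Step back by `overlap` tokens between consecutive chunks so no
    # sentence is stranded at a chunk boundary without context.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk(doc)
print(len(parts), parts[1].split()[0])  # → 3 w448 (chunk 2 overlaps chunk 1 by 64)
```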

Knowledge Library Index

Corpus Size
43 technical books · 15,655 pages · 13 categories
Extraction
PyMuPDF (fitz) · TOC, keywords, first-page text
Index Format
JSON · 368 KB · in-memory at startup · TF-IDF scoring
LLM Backend
OpenAI GPT-4o-mini · intent classification + answer synthesis
Categories
ML/AI, Data Engineering, DevOps, Python, Statistics, Math, Architecture, more
API Endpoints
/knowledge/stats · /categories · /search · /ask · /book/:title

Memory Systems

Short-term vs long-term context retention

The theoretical memory model described in Part I identified three tiers: working, episodic, and semantic. In production, each tier maps to a specific storage backend chosen for its access pattern. Working memory demands sub-millisecond reads and tolerates volatility. Episodic memory requires durability and structured querying. Semantic memory requires high-dimensional vector search across generalized knowledge. The fleet uses Redis for the first, PostgreSQL for the second, and Milvus for the third.

Working memory answers the question: what is happening right now across the fleet. Episodic memory answers: what happened the last time this type of event occurred, and what was the outcome. Semantic memory answers: given everything we have learned, what is the best general approach to this class of problem. Together, these three layers give each agent access to the collective experience of the entire fleet.

Working Memory

Active task context held in Redis. Each agent publishes its current state to a Redis hash (key: agent:{id}:state) with a 30-second TTL heartbeat. Sentinel reads all agent states every 10 seconds to build fleet-wide situational awareness.

Redis Hash + TTL + Pub/Sub
Example
agent:security:state → { task: 'cve-scan', target: 'web-proxy', progress: 0.7, started: 1738800000 }
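The hash-plus-TTL pattern can be sketched in pure Python; this is a dict-based stand-in for the Redis behavior described above (a real implementation would use redis-py's hash and expiry commands against a live server):

```python
import time

class WorkingMemory:
    """In-process stand-in for the Redis hash + TTL heartbeat pattern."""
    def __init__(self):
        self._store = {}

    def set_state(self, agent_id, state, ttl=30):
        # Each write refreshes the 30-second heartbeat
        self._store[agent_id] = (state, time.monotonic() + ttl)

    def get_state(self, agent_id):
        entry = self._store.get(agent_id)
        if entry is None:
            return None
        state, expires = entry
        if time.monotonic() > expires:   # heartbeat expired: agent is stale
            del self._store[agent_id]
            return None
        return state

wm = WorkingMemory()
wm.set_state("security", {"task": "cve-scan", "progress": 0.7})
print(wm.get_state("security")["task"])  # → cve-scan
```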

Episodic Memory

Every completed task is persisted as an episode in PostgreSQL: the input event, classification decision, execution plan, each step result, and final outcome. This creates a complete decision trace for auditing and learning.

PostgreSQL JSONB + Indexed Timestamps
Example
episodes(id, agent_id, event_type, plan_dag, steps[], outcome, duration_ms, created_at)
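The episode schema above can be sketched with stdlib sqlite3 as a stand-in for the production PostgreSQL JSONB table; the decision trace is stored as a JSON blob and queried back by event type:

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE episodes (
    id INTEGER PRIMARY KEY,
    agent_id TEXT, event_type TEXT,
    trace TEXT,              -- JSON: plan, step results, outcome
    duration_ms INTEGER, created_at REAL)""")

def record_episode(agent_id, event_type, trace, duration_ms):
    # Persist one complete decision trace for auditing and few-shot retrieval
    conn.execute(
        "INSERT INTO episodes (agent_id, event_type, trace, duration_ms, created_at)"
        " VALUES (?, ?, ?, ?, ?)",
        (agent_id, event_type, json.dumps(trace), duration_ms, time.time()))

record_episode("infra", "nginx_502",
               {"plan": ["check upstream", "restart container"],
                "outcome": "resolved"}, 4200)

row = conn.execute("SELECT trace FROM episodes WHERE event_type = ?",
                   ("nginx_502",)).fetchone()
print(json.loads(row[0])["outcome"])  # → resolved
```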

Semantic Memory

Generalized knowledge from two sources: (1) episodic patterns stored as vector embeddings in Milvus for operational RAG retrieval, and (2) a Knowledge Library of 43 technical books (15,655 pages, 13 categories) indexed via PyMuPDF for deep engineering Q&A via GPT-4o-mini.

Milvus + all-MiniLM-L6-v2 (384d) + Knowledge Library (TF-IDF + GPT-4o-mini)
Example
Operational: 'nginx 502' → top-3 past resolutions | Knowledge: 'explain RAFT consensus' → book-grounded answer

Short-Term vs Long-Term Memory — Technical Comparison

Dimension | Short-Term (Working) | Long-Term (Episodic + Semantic)
Storage Backend | Redis (in-memory hash maps with TTL) | PostgreSQL (persistent tables) + Milvus (vector embeddings)
Retention Period | 30 seconds to 15 minutes (configurable per agent) | Indefinite — partitioned by month, archived after 90 days
Data Format | JSON blobs: current task state, active context, pending actions | Structured rows: agent_id, action, params, result, duration, embedding vector
Access Pattern | O(1) key-value lookup; pub/sub for real-time state changes | Indexed SQL queries + ANN (approximate nearest neighbor) vector search
Use Case | Active task tracking, heartbeat monitoring, coordination state | Pattern recognition, decision replay, few-shot prompt context, audit trail
Failure Mode | Volatile — data lost on Redis restart (acceptable for ephemeral state) | Durable — WAL-replicated PostgreSQL with daily pg_dump backups

Planning & Reflection

Task decomposition and self-assessment in production

Planning separates agents from chatbots. A chatbot produces a single response; an agent decomposes a complex goal into ordered sub-tasks, executes them iteratively, and uses reflection to assess progress between steps. The fleet implements both capabilities as LangGraph tools that the ReAct loop can invoke on demand.

TASK

Task Decomposition

The create_tasks tool receives a list of step descriptions and returns a structured checklist with status tracking. Each task follows a state machine: pending → in_progress → completed.

from typing import List
from dataclasses import dataclass
from langchain_core.tools import tool

@dataclass
class Task:
    content: str
    status: str = "pending"  # pending → in_progress → completed

def format_checklist(tasks: List[Task]) -> str:
    # Minimal renderer, an illustrative stand-in
    return "\n".join(f"[{t.status}] {t.content}" for t in tasks)

@tool
def create_tasks(tasks: List[str]) -> str:
    """Decompose complex goals into
    ordered sub-task checklists."""
    task_objects = [Task(content=t) for t in tasks]
    task_objects[0].status = "in_progress"
    return format_checklist(task_objects)
Multi-step operations · Order-dependent tasks · Audit workflows
REF

Self-Reflection

The reflection tool creates structured self-assessments recorded in conversation context. Four patterns: progress review, error analysis, result synthesis, and self-check. Setting need_replan=True triggers plan revision.

from langchain_core.tools import tool

@tool
def reflection(
    analysis: str,
    need_replan: bool = False
) -> str:
    """Self-assess progress and decide
    whether to continue or replan."""
    # Recorded in LangGraph context as a
    # ToolResult for the next reasoning
    # step — closing the feedback loop
    verdict = "REPLAN" if need_replan else "CONTINUE"
    return f"[{verdict}] {analysis}"
Progress review · Error analysis · Self-correction

Planning ↔ Reflection Cycle

These two tools form a complementary cycle within the ReAct loop. The planner creates structure; the reflector evaluates results and triggers replanning when conditions change. This enables the agent to handle multi-step operations that would previously require human intervention — a container health audit that discovers cascading failures, a deployment that needs rollback, or a data pipeline repair spanning multiple services.

1

Create Plan

create_tasks

2

Execute Step

domain tools

3

Reflect

reflection

4

Continue / Replan

loop or done

Context Management

Token-aware memory with sliding window & compaction

Long multi-step operations accumulate tool results that can overflow the LLM's context window. The context management layer automatically measures token usage, compacts verbose tool outputs, and applies a sliding window to keep conversations within budget — all without losing critical information.

TOK

Token Counting

Every message is measured via tiktoken (cl100k_base) before LLM invocation. Budget tracking prevents context overflow during long ReAct chains with many tool calls. Falls back to word-based estimation when tiktoken is unavailable.

Budget: 6,000 tokens (GitHub Models safe limit)
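The tiktoken-with-fallback behavior described above is a small function. A sketch (tiktoken's `get_encoding("cl100k_base")` is its real API; the 4/3 words-to-tokens ratio in the fallback is a common rough estimate, not a measured constant):

```python
def count_tokens(text):
    # Prefer exact tiktoken counts; degrade to a rough word-based
    # estimate when tiktoken is not installed.
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return max(1, int(len(text.split()) * 4 / 3))

msg = "Restart the nginx container and verify the upstream health checks."
print(count_tokens(msg))
```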
WIN

Sliding Window

Retains only the last 20 messages while always preserving system prompts and the initial user message. Prevents unbounded context growth during multi-step operations where the ReAct loop accumulates many tool call/result pairs.

Window: 20 messages (system msgs always preserved)
CMP

Compaction

Tool results exceeding 800 characters are automatically summarized, preserving key data (numbers, statuses, errors) while removing verbose formatting. Priority-based extraction ensures critical information survives compaction.

Threshold: 800 chars → 300 char target summary
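A compaction pass of this kind can be sketched as a priority filter. The heuristics below (keep lines containing numbers, statuses, or errors) are illustrative, not the fleet's exact rules:

```python
import re

def compact(tool_result, threshold=800, target=300):
    # Leave short results untouched; otherwise keep only the lines
    # that carry hard data, then trim to the target length.
    if len(tool_result) <= threshold:
        return tool_result
    priority = re.compile(r"\d|error|fail|status|warn", re.IGNORECASE)
    keep = [line for line in tool_result.splitlines() if priority.search(line)]
    summary = " | ".join(keep) or tool_result
    return summary[:target]

verbose = "\n".join(["decorative banner line"] * 40
                    + ["status: 502", "errors: 3"])
out = compact(verbose)
print(len(out) <= 300, "502" in out)  # → True True
```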

Pre-processing Pipeline: process_llm_request()

Before every LLM invocation, the context manager chains three strategies in order — compact, window, and budget — ensuring the agent stays within token limits even during complex multi-tool ReAct chains. This is injected automatically into chat(), achat(), and astream() methods.

def process_llm_request(messages, max_tokens=6000):
    # Step 1: Compact long tool results
    messages = compact_messages(messages)
    # Step 2: Sliding window (last 20 msgs)
    messages = apply_sliding_window(messages)
    # Step 3: Token budget enforcement
    tokens = count_messages_tokens(messages)
    if tokens > max_tokens:
        trim_oldest_non_system(messages)  # drops oldest non-system msgs in place
    stats = {"tokens": tokens, "budget": max_tokens}
    return messages, stats

Adaptive Learning

Agents that improve with experience

A static rule-based system degrades as its environment evolves because the rules were written for conditions that no longer hold. The agent fleet avoids this failure mode by implementing four distinct learning loops, each operating at a different time scale and optimizing a different aspect of system performance. These loops continuously refine decision quality, retrieval accuracy, human dependency, and resource efficiency without ever retraining the underlying language model.

The design principle behind all four loops is the same: learning happens at the retrieval and configuration layer, not at the model weight layer. The base LLM remains frozen and deterministic. What changes over time is the content of the vector store, the set of few-shot examples retrieved for each classification, the confidence thresholds that trigger escalation, and the resource limits assigned to each container. This separation of concerns makes the learning process auditable, reversible, and safe to run continuously in production.

Outcome Feedback Loop

After every task execution, the result (success/failure/partial) is compared against the predicted outcome. This feedback is stored in episodic memory and used to update the confidence scores of the classification model's few-shot examples.

Mechanism
Post-execution diff comparison → outcome label → update few-shot retrieval weights
Measured Impact
Classification accuracy improved from 78% to 94% over 90 days of continuous operation
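One simple way to realize such an update is an exponential moving average over outcome rewards. The `weight` field, the learning rate, and the reward mapping below are illustrative assumptions, not the fleet's actual parameters:

```python
def update_example_weight(example, outcome, lr=0.1):
    """Move a few-shot example's retrieval weight toward the observed
    outcome reward: successes reinforce, failures decay."""
    reward = {"success": 1.0, "partial": 0.5, "failure": 0.0}[outcome]
    example["weight"] += lr * (reward - example["weight"])
    return example

example = {"text": "restart container on OOM kill", "weight": 0.5}
update_example_weight(example, "success")   # weight rises toward 1.0
update_example_weight(example, "failure")   # weight decays toward 0.0
```

Because each update is a bounded, logged delta, the learning process stays auditable and reversible, in line with the design principle stated above.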

Retrieval Quality Loop

Each RAG retrieval is scored by the downstream agent: did the retrieved context lead to a successful resolution? Low-scoring retrievals trigger re-embedding with updated metadata. High-scoring contexts are promoted in the reranker's training set.

Mechanism
Agent feedback score (1–5) → Milvus metadata update → reranker fine-tuning (weekly batch)
Measured Impact
Mean retrieval relevance score increased from 3.1 to 4.4 (out of 5) over 60 days

Escalation Learning Loop

When Sentinel escalates a task to a human operator, it records the human's resolution. These expert traces become new training examples — stored as high-confidence episodes that future RAG queries preferentially retrieve.

Mechanism
Human resolution capture → episode with expert_label=true → boosted retrieval weight (2x)
Measured Impact
Escalation rate decreased from 23% to 8% as the system learned from human interventions
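The 2x boost can be folded directly into the retrieval ranking. This sketch assumes a precomputed vector-similarity score and an `expert_label` flag on each stored episode:

```python
def retrieval_score(similarity, episode, expert_boost=2.0):
    """Weight a vector-similarity hit, boosting episodes that record a
    human expert's resolution (expert_label) by a fixed factor."""
    return similarity * (expert_boost if episode.get("expert_label") else 1.0)

episodes = [
    {"id": 1, "expert_label": False, "similarity": 0.8},
    {"id": 2, "expert_label": True, "similarity": 0.5},
]
# The expert episode outranks a closer non-expert match: 0.5 * 2 > 0.8 * 1
ranked = sorted(episodes, key=lambda e: retrieval_score(e["similarity"], e), reverse=True)
```

The effect is that a human resolution, once captured, preferentially shapes future decisions even when its raw embedding similarity is middling.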

Resource Optimization Loop

The Cost Agent continuously monitors resource utilization vs. allocation across all containers. Historical utilization patterns (7-day rolling window) feed a linear regression model that recommends resource limit adjustments.

Mechanism
cAdvisor metrics → 7-day rolling stats → regression model → Sentinel approval → apply limits
Measured Impact
Infrastructure cost reduced by 31% through automated rightsizing without SLA degradation
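A least-squares trend over daily peaks plus a headroom factor is enough to illustrate the idea. The 1.25 headroom and the one-day extrapolation are assumptions for this sketch, not the Cost Agent's actual model:

```python
def recommend_limit(daily_peaks_mb, headroom=1.25):
    """Fit a least-squares trend to daily peak memory and recommend a
    limit of the next-day projection times a headroom factor."""
    n = len(daily_peaks_mb)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peaks_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_peaks_mb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var if var else 0.0
    projected = mean_y + slope * (n - mean_x)  # one day past the window
    return round(projected * headroom)
```

For a flat series the recommendation is simply peak times headroom; for a rising series the slope term adds growth allowance before Sentinel approves the new limit.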

Continuous Improvement Pipeline

1. Observe

Collect structured telemetry from every agent action: input, decision, output, latency, resource usage. All observations are schema-validated before storage.

2. Evaluate

Compare actual outcomes against expected outcomes. Success criteria are defined per agent type: deployment agents check health endpoints, security agents verify CVE counts.

3. Store

Persist evaluated episodes to PostgreSQL (structured) and Milvus (embedded). Tag each episode with outcome quality score, agent version, and environmental context.

4. Retrieve

When a new, similar event occurs, the RAG pipeline retrieves the most relevant past episodes. Recency and outcome quality are weighted in the retrieval ranking.

5. Adapt

Update decision parameters: few-shot examples, confidence thresholds, retry limits, resource allocations. Changes are logged as configuration episodes for traceability.
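Step 4's weighting of recency and outcome quality can be sketched as an exponential recency decay multiplied into the similarity score. The 30-day half-life and field names are illustrative assumptions:

```python
import math
import time

def rank_episodes(episodes, now=None, half_life_days=30.0):
    """Order episodes by similarity * recency decay * outcome quality,
    so fresh, high-quality episodes surface first."""
    now = time.time() if now is None else now
    def score(ep):
        age_days = (now - ep["ts"]) / 86400
        # Half-life decay: an episode's weight halves every `half_life_days`
        recency = math.exp(-math.log(2) * age_days / half_life_days)
        return ep["similarity"] * recency * ep["quality"]
    return sorted(episodes, key=score, reverse=True)
```

A multiplicative score means any single weak factor (stale, low quality, or dissimilar) pulls an episode down the ranking rather than being averaged away.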

Topology

Control relationships

Flow

Volume across stages

State

Workflow transitions

Resources

Per-container cost

1. Event Detection (< 1s)

Webhooks, cron triggers, and anomaly detectors feed raw events into the LangGraph ingestion node. Each event is tagged with source, severity, and timestamp before entering the classification pipeline.

2. Task Classification (< 2s)

The Router node uses Qwen 2.5 7B to classify each event into a domain category. Ambiguous events trigger a secondary LLM-based analysis with chain-of-thought reasoning before routing. The classifier uses few-shot examples stored in vector memory for accuracy.

3. DAG Construction (< 3s)

The Planner node decomposes the classified task into a directed acyclic graph of sub-tasks, resolving dependencies and parallelization opportunities. Each DAG node maps to a specific agent capability. Dependencies are topologically sorted for execution order.
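Python's standard library can express the topological sort directly. The sub-task DAG below is a hypothetical example in the shape the text describes (`graphlib` requires Python 3.9+):

```python
from graphlib import TopologicalSorter

# Hypothetical sub-task DAG for a container health audit: log analysis
# and resource checks are independent; both gate the restart, which
# gates verification. Keys map each node to its predecessors.
dag = {
    "restart_if_needed": {"analyze_logs", "check_resources"},
    "verify": {"restart_if_needed"},
}
order = list(TopologicalSorter(dag).static_order())
# The first two steps can fan out in parallel; the rest run in sequence.
```

The same structure exposes parallelization opportunities: any nodes emitted together by the sorter with no pending predecessors are safe to dispatch concurrently.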

4. Parallel Execution (variable)

Worker agents execute their assigned sub-tasks concurrently, communicating progress through Redis pub/sub channels. Failed tasks trigger automatic retry with exponential backoff (base 2s, max 3 retries) before escalation to Sentinel.
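The retry policy (base 2 s, max 3 retries) maps onto a small helper like this sketch; the jitter term is an assumption commonly added to avoid synchronized retries, not something the source specifies:

```python
import random
import time

def retry_with_backoff(task, base=2.0, max_retries=3):
    """Run `task`; on failure sleep base * 2**attempt seconds
    (2 s, 4 s, 8 s) before retrying, then re-raise for escalation."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: escalate to Sentinel
            # Small random jitter (an assumption) spreads out retry storms
            time.sleep(base * 2 ** attempt + random.uniform(0, 0.1))
```

Re-raising on the final failure keeps the escalation path explicit: the caller, not the helper, decides what "hand off to Sentinel" means.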

5. Validation and Consolidation (< 5s)

Quality gates verify each result against predefined schemas and business rules. Validated outputs are consolidated into a single report, persisted to PostgreSQL, and metrics pushed to Prometheus. Failures trigger rollback via LangGraph checkpoints.

Single control plane

Every event enters through Sentinel first. This centralizes classification, priority assignment, trace IDs, and rollback authority before any specialist agent runs.

Ingress → classify → authorize

Parallel work where safe

The DAG planner isolates independent checks such as log analysis, resource inspection, and cost evaluation so they execute concurrently instead of serially.

Topo-sort → fan-out → merge

Validation before memory

The system does not store raw outcomes blindly. It first verifies health probes, schema compliance, and side effects so future RAG retrieval is grounded in trusted episodes.

Verify → persist → reuse

Request Lifecycle Pipeline

1. Event Ingestion (< 100 ms)

Raw events enter via FastAPI endpoints on Sentinel (port 8002). Each event is tagged with source, severity, timestamp, and a correlation ID. Entry point: Cloudflare Tunnel → Nginx reverse proxy → Sentinel container.

Webhook Receiver · Cron Trigger · Prometheus AlertManager

2. Task Classification (< 2 s)

Sentinel's LLM (Qwen 2.5 via Ollama at localhost:11434) classifies the event into a domain—infrastructure, security, data, analytics, or management—and assigns severity P0–P3. Few-shot examples are retrieved from Milvus (HNSW index) to improve classification accuracy.

Qwen 2.5 via Ollama · Milvus Vector Search · LangGraph Router

3. DAG Construction (< 500 ms)

The Orchestrator (port 8030) builds a directed acyclic graph of sub-tasks using LangGraph's state-machine engine. It resolves agent dependencies, identifies parallelization opportunities, and attaches checkpoint metadata for rollback capability.

LangGraph Planner · Agent Capability Matrix · Conditional Edges

4. Agent Dispatch (< 200 ms)

Sub-tasks are published to domain-specific Redis Pub/Sub channels (agent.{domain}.{action}). Target agents pick up tasks based on subscription and current capacity. Working memory is pre-loaded from Redis HSET.
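The channel convention and PSUBSCRIBE-style glob matching can be illustrated without a live Redis; `channel_for` and `matches` are hypothetical helpers showing only the naming scheme:

```python
import fnmatch

def channel_for(domain, action):
    """Compose the fleet's channel naming scheme: agent.{domain}.{action}."""
    return f"agent.{domain}.{action}"

def matches(pattern, channel):
    """PSUBSCRIBE-style glob match for an agent's subscription pattern."""
    return fnmatch.fnmatch(channel, pattern)

channel = channel_for("infrastructure", "restart_container")
```

An infrastructure agent subscribing to `agent.infrastructure.*` receives this task, while security agents subscribed to `agent.security.*` never see it.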

Redis Pub/Sub · Agent Working Memory · Load Balancing

5. Tool Execution (1–30 s)

Each agent executes its assigned tools via FastAPI endpoints on ports 8002–8050. Tools include container restarts (Docker API), vulnerability scans (Trivy), query optimization (PostgreSQL), deployment rollouts, and log analysis.

Docker API · PostgreSQL · Trivy · Proxmox API · Nginx

6. Result Validation (< 500 ms)

Quality gates verify each agent's output. Health probes confirm system state, JSON schema validators check response structure, and diff checkers verify no unintended side-effects. Failed actions trigger retry (max 3) with exponential backoff before human escalation.
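A quality gate in this spirit can be sketched with a minimal `{field: type}` schema check; a production system would typically use full JSON Schema validation instead, so treat this as an illustration of the gating idea only:

```python
def validate_result(result, schema):
    """Check required keys and value types against a flat
    {field: type} schema; return a list of violations (empty = pass)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in result:
            errors.append(f"missing: {field}")
        elif not isinstance(result[field], expected_type):
            errors.append(f"type: {field}")
    return errors

# Hypothetical schema for a deployment agent's response
schema = {"status": str, "duration_ms": int}
```

Returning a list of violations, rather than a boolean, lets the retry and escalation machinery log exactly which gate failed.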

Health Probes · Schema Validators · LangGraph Checkpoints

7. State Persistence (< 100 ms)

Results are written to four persistence layers: episodic memory (PostgreSQL JSONB), semantic embeddings (Milvus HNSW index), time-series metrics (Prometheus push gateway), and ephemeral cache (Redis with TTL). Future similar incidents benefit from RAG retrieval of this episode.

PostgreSQL JSONB · Milvus HNSW · Prometheus Push · Redis TTL

Concrete Scenario: Container Memory Alert

Prometheus AlertManager fires: container 'data-pipeline' memory > 85%

50 ms · AlertManager · POST /webhook → Sentinel (port 8002)
1.8 s · Sentinel · Classify event → domain: infrastructure, severity: P1
400 ms · Orchestrator (8030) · Build DAG: [analyze_logs → check_resources → restart_if_needed → verify]
3.0 s · Log Analysis (8006) · Analyze recent container logs → detect OOM kill patterns
2.0 s · Infra Health (8004) · Check container resource limits vs. 7-day baseline
1.5 s · Cost Optimization (8008) · Recommend memory limit: 512 MB → 768 MB based on usage trend
8.0 s · Deployment (8007) · Apply new resource limits, graceful container restart
2.0 s · Sentinel (8002) · Verify health probe passes → persist episode to PostgreSQL
Auto-healed in < 19 seconds
Container auto-healed without human intervention. Episode stored in PostgreSQL for future RAG retrieval by Milvus.
Detection (50 ms): Alert enters Sentinel and is normalized.
Planning (< 500 ms): DAG is built with dependency-aware branches.
Execution (~14.5 s): Specialists analyze, recommend, and remediate.
Persistence (< 100 ms): Validated episode becomes future retrieval context.

System Topology — 28 Agent Fleet

Orchestration

2 agents

Sentinel (LangGraph) -- Orchestrator v2 (Redis DAG)

Infrastructure

7 agents

Infra Health -- Proxmox -- Deployment -- Log Analysis -- Security -- Network -- Realtime Alert

Data Engineering

5 agents

Data Pipeline -- Data Engineer (Airflow) -- Data Quality -- Database -- RAG Knowledge (Milvus + Knowledge Library)

Analytics & Intelligence

6 agents

BI -- Customer Analytics -- Cost Optimization -- Data Analyst -- Data Science (ML) -- MLOps (MLflow)

Business & Management

8 agents

Project Mgr -- Delivery Mgr -- Product Owner -- Reporting -- Job Search -- Outreach -- Portfolio Analytics -- Workspace Tidy (deprecated)

Layer 1

Compute Layer

Proxmox VE
Hypervisor managing LXC containers and VMs across bare-metal nodes with VLAN isolation
Docker + Compose
39 services orchestrated via docker-compose profiles, with host networking and an unless-stopped restart policy
Portainer
Visual management plane for container fleet operations, stack deployment, and log access
Layer 2

Data Layer

PostgreSQL
Persistent state store — 185k+ health check rows, materialized views refreshed every 6 hours, agent episodes and SLA data
Redis
Workflow state backend for Orchestrator v2 DAG execution and checkpoint recovery
Milvus + etcd
Vector database for RAG Knowledge Agent — semantic search, document embedding, episodic memory retrieval. Complemented by a Knowledge Library of 43 indexed books (15,655 pages, 13 categories)
Layer 3

Network Layer

Nginx
Reverse proxy on CT-150 with path-based routing to Sentinel, BI Agent, and Dashboard services
Cloudflare Tunnel
Zero-trust ingress — encrypted tunnels from CDN edge to origin via cloudflared, no ports exposed to internet
Cloudflare CDN
DDoS mitigation, DNS management, WAF rules, and global content delivery with edge caching
Layer 4

Observability Layer

Prometheus
31 scrape targets across all agents, 15-second intervals, 6 alert groups with severity routing
Grafana
12 operational dashboards — fleet overview, agent drill-down, LLM performance, SLA heatmaps, resource utilization, and anomaly panels
Jaeger
Distributed tracing across 14 instrumented services — span waterfall, dependency graph, and latency analysis via OpenTelemetry
Langfuse
LLM observability — tracks every prompt/completion, token usage, cost, and quality scores across all agent LLM calls
Alertmanager
Alert routing, grouping, and notification — receives from Prometheus, deduplicates, and dispatches to Slack/email/webhook channels
Loki + Promtail
Centralized log aggregation — structured labels per container, severity-based filtering, and correlation with metrics
Layer 5

MLOps Layer

MLflow
Experiment tracking, model registry, and run comparison — wired to the MLOps Agent for lifecycle automation
MinIO
S3-compatible object storage for ML artifacts, model binaries, and training datasets
Apache Airflow
DAG-based pipeline orchestration — managed by the Data Engineer Agent for scheduled ETL and data workflows
Layer 6

Intelligence Layer

Ollama + OpenAI
Local LLM inference via Ollama (Qwen 2.5 Coder 7B, CPU, 8-15s) for fleet orchestration, plus OpenAI GPT-4o-mini for Knowledge Library Q&A and advanced reasoning tasks
LangGraph
Stateful ReAct orchestration in Sentinel Agent — conditional routing, checkpointing, and plan-execute loops
LangChain
Tool-augmented chains for structured agent interactions, document loaders, and output parsers across the fleet
End-to-End Production Lifecycle — 8 Stages
1. Hardware: Bare-Metal Server

Dedicated Hetzner root server running Proxmox VE hypervisor. Hosts VMs and LXC containers across isolated VLANs.

Proxmox VE · VLAN

2. Containerize: Docker Build

Each agent is a Python 3.11-slim image with FastAPI, LangChain tools, and a shared library. Dockerfile per agent, ~150 MB each.

Docker · FastAPI

3. Deploy: Fleet Launch

Containers start with host networking on unique ports (8002-8064). Environment variables injected, restart policy set to unless-stopped.

--network host · 28 ports

4. Register: Health Proxy

Agent Health Proxy discovers all 28 agents. Polls /health and /healthz endpoints, aggregates status into a single /all-health JSON response.

:8085/all-health
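The aggregation step can be sketched as a pure function over the per-agent statuses; the payload shape and the "degraded" rule here are assumptions, not the proxy's documented response format:

```python
def aggregate_health(agent_statuses):
    """Fold per-agent /health results into one /all-health payload
    with a healthy count and an overall fleet status."""
    healthy = sum(1 for status in agent_statuses.values() if status == "healthy")
    return {
        "agents": agent_statuses,
        "healthy": healthy,
        "total": len(agent_statuses),
        "fleet_status": "ok" if healthy == len(agent_statuses) else "degraded",
    }
```

Keeping aggregation stateless means the proxy can be restarted at any time without losing fleet status; every response is recomputed from live polls.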
5. Monitor: Sentinel Dashboard

Sentinel (port 8003) serves the Command Center UI. JavaScript polls the proxy every 15 seconds, renders agent cards with status, tools, and tier labels.

:8003 · 15s poll

6. Expose: Cloudflare Tunnel

Zero-trust encrypted tunnel from Cloudflare edge to Sentinel. Public domain sentinel.simondatalab.de routes to the dashboard. No ports exposed to internet.

cloudflared · TLS

7. Orchestrate: Task Routing

Events arrive via webhooks. Sentinel classifies, Orchestrator builds DAG, agents execute tools in parallel via Redis Pub/Sub. Results validated and persisted.

LangGraph · Redis

8. Self-Heal: Auto Recovery

Failed health checks trigger Docker restart policies. Sentinel detects degraded agents, redistributes tasks, and stores recovery episodes for future RAG retrieval.

auto-restart · RAG
Network Topology — Request Path (28 Agents)
Diagram of the request path: Cloudflare CDN (DDoS + TLS edge) → cloudflared Tunnel (zero-trust ingress) → Nginx on CT-150 (reverse proxy) → Proxmox VM Docker host → Sentinel Agent (LangGraph: classify, route, proxy, 45 tools). The orchestration and data tier comprises the Health Proxy (/all-health), Orchestrator v2 (Redis DAG), PostgreSQL (185k+ rows), Redis (workflow state), Milvus (vectors), Prometheus (31 targets), Grafana + Loki, MLflow (experiments), MinIO (artifacts), Airflow (DAGs), and Ollama running Qwen 2.5 Coder 7B on CPU (5 LLM profiles, 8-15s response). The agent fleet (28 containers) spans Infrastructure (7), Data Engineering (5), Analytics + Intelligence (6), and Business + Management (7).
28 Active Agents · 45 Sentinel Tools · 3 Agent Tiers · 15s Health Poll Interval
How Health Monitoring Works
1. Browser → HTTPS → Cloudflare Edge
2. Cloudflare → Tunnel → Sentinel :8003
3. Dashboard JS → fetch() → /proxy/all-health
4. Sentinel → HTTP → Health Proxy :8085
5. Health Proxy → GET /health → 28 Agent Ports
6. Each Agent → 200 OK → Health Proxy
7. Health Proxy → JSON → Dashboard
8. Dashboard → render → Agent Cards
Infrastructure

Sentinel, Orchestrator v2, Infrastructure Health, Proxmox, Deployment, Log Analysis, Security, Network, Realtime Alert

9 agents
Data + Analytics + Intelligence

Data Pipeline, Data Engineer, Data Quality, Database, RAG Knowledge, BI, Customer Analytics, Cost Optimization, Data Analyst, Data Science, MLOps

11 agents
Business + Management

Project Manager, Delivery Manager, Product Owner, Reporting, Job Search, Outreach Networking, Portfolio Analytics, Workspace Tidy (deprecated)

8 agents

Explore more projects

This case study demonstrates autonomous AI agent architecture with perception, memory, and adaptive learning. See more data engineering and infrastructure projects.