AI Agent Architecture:
Theory & Production Practice
Every AI agent, regardless of its domain or framework, is built from the same six architectural components. This guide first establishes that universal blueprint through theory and research, then applies it line by line to a production fleet of twenty-eight specialized agents — equipped with 45 Sentinel tools, a Knowledge Library of 43 technical books, Jira integration, and autonomous self-healing — running on self-hosted bare-metal infrastructure.
Part I deconstructs the general architecture: the reasoning engine at the center, the perception layer that feeds it, the planning system that decomposes goals, the memory that retains context, the tools that connect to the outside world, and the evaluation loop that drives continuous improvement. Part II maps each of these components onto a 28-agent autonomous fleet — complete with 45 Sentinel tools, Jira Cloud integration, a Knowledge Library of 43 indexed books, MLOps pipelines, observability stacks, and CI/CD automation — to show exactly how theory translates into running code and measurable operational outcomes.
Part I
General theory
Reasoning, perception, memory, tools, evaluation, and context engineering.
Part II
Production mapping
A 28-agent fleet with orchestration, Jira integration, Knowledge Library, RAG, observability, and CI/CD.
Output
Operational evidence
Measured latency, uptime, resource allocation, and autonomous incident handling.
Architecture briefing
Professional field guide for modern agent systems
Architecture scope
Theory to production
A complete blueprint spanning cognition, orchestration, and infrastructure.
Operational baseline
28 agents · 45 tools · 43 books
Concrete fleet telemetry, Jira integration, and Knowledge Library powered by GPT-4o-mini.
Visual system
Interactive D3 analysis
Topology, state, resource, and loop behavior rendered as live explanatory graphics.
Reading mode
Technical narrative
Visual mode
Interactive D3.js
Infra basis
Self-hosted bare metal
Focus
Autonomy with observability
The LLM or neural network that provides natural language understanding, chain-of-thought reasoning, and structured output generation. This is the agent's 'brain' — it processes inputs and decides what to do next.
How the agent observes its environment. Transforms raw inputs — text, images, API responses, sensor data — into structured representations that the reasoning engine can process effectively.
Breaks complex goals into executable sub-tasks. Uses patterns like ReAct, Tree-of-Thought, or DAG construction to create multi-step action plans with dependency resolution.
Maintains state across interactions. Working memory (current context), episodic memory (past experiences), and semantic memory (generalized knowledge) enable learning from history.
The agent's hands — how it acts on the world. Connects to APIs, databases, code interpreters, and external services. The reasoning engine selects tools, constructs parameters, and interprets results.
Assesses outcomes and drives improvement. Compares actual results against expectations, scores decisions, and feeds learnings back into memory and planning for continuous adaptation.
Seven levels of increasing autonomy
Not every LLM integration is an agent. The critical distinction lies in who controls the execution flow. In a workflow, the developer pre-defines every step — the LLM generates text but makes no decisions about what happens next. In an agent, the LLM itself determines which actions to take, when to use tools, and when to stop. Between these two poles lies a spectrum of seven distinct levels, each granting the model progressively more control over the current step, the next step, and the set of available steps.
Understanding where your system sits on this spectrum is essential for choosing the right architecture. Levels 1–2 are appropriate when reliability and predictability are paramount. Levels 3–5 suit most production agent deployments where the LLM needs to interact with external systems. Levels 6–7 represent fully autonomous systems that learn from experience and coordinate across multiple specialized agents — the territory occupied by the 28-agent fleet described in Part II.
Workflows (L1–L2)
- → Developer pre-defines all execution paths
- → LLM generates content but controls nothing
- → Deterministic, auditable, low risk
- → Cannot handle novel situations outside pre-built branches
Agents (L3–L7)
- → LLM dynamically decides actions based on context
- → Iterative loop with unpredictable step count
- → Can handle novel, complex, multi-step tasks
- → Requires guardrails, evaluation, and memory to be reliable
LLM + Tool + Loop: the core triad
Every LLM agent, regardless of framework or domain, reduces to three elements. The LLM serves as the decision-making core — it interprets context and chooses what to do next. Tools give the agent hands to interact with the external world: API calls, database queries, code execution. The Loop repeats the observe–think–act cycle until the agent determines the goal is achieved. This iterative loop is what transforms a stateless text generator into an autonomous problem-solver.
The loop continues because the agent cannot predict in advance how many steps will be required. A simple query might resolve in one iteration. A complex research task might require dozens of tool calls, each observation enriching the context for the next decision. The key insight is that the LLM decides both the next action and when to terminate — this is the fundamental property that distinguishes an agent from traditional software where execution paths are fixed at compile time.
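This triad can be sketched in a few lines of Python. All names here are hypothetical: `call_llm` stands in for any chat-model API, and the stub decision policy simply acts once and then finishes, which is enough to show the observe–think–act cycle and the LLM-controlled termination.

```python
# Minimal observe-think-act loop: the LLM decides both the next action
# and when to stop. `call_llm` is a stand-in for any chat-model API.
def call_llm(context):
    # Stub policy for illustration: act once, then declare success.
    if not any(step["type"] == "observation" for step in context):
        return {"type": "tool_call", "tool": "lookup", "args": {"q": "status"}}
    return {"type": "final", "answer": "all systems nominal"}

TOOLS = {"lookup": lambda q: f"result for {q!r}"}  # tools: the hands

def run_agent(goal, max_steps=10):
    context = [{"type": "goal", "text": goal}]
    for _ in range(max_steps):                      # loop: the heartbeat
        decision = call_llm(context)                # LLM: the brain
        if decision["type"] == "final":
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        context.append({"type": "observation", "text": result})
    return "step budget exhausted"
```

Swapping the stub for a real model call and a real tool registry is all that separates this sketch from a production loop; the control structure is unchanged.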
Observe, reason, act, evaluate, and remember as one continuous control loop
This orbital D3 view turns the abstract concept of agency into a readable systems primitive with explicit state transitions and memory closure.
Loop nodes
5 operational states
Center
LLM decision core
LLM — The Brain
Understands context, classifies intent, generates reasoning chains, selects tools, and decides when to stop. The foundation model provides generalization that no rule-based system can match.
Tools — The Hands
APIs, databases, code interpreters, infrastructure control planes. Each tool is presented as a function signature with name, description, and JSON parameter schema. The LLM selects and parameterizes tools autonomously.
Loop — The Heartbeat
The iterative cycle that gives agents their power. Each iteration adds new observations to the context, enabling compound reasoning. Without the loop, you have a chatbot. With it, you have an agent.
The LLM at the core of every decision
At the center of every agent sits a foundation model that serves as its reasoning engine. When an input arrives, the model does not simply generate a response. It first decomposes the problem into explicit reasoning steps, then decides whether the answer requires external data or action, and finally evaluates the quality of its own output before releasing it. This three-phase pipeline — reasoning, acting, reflecting — is the fundamental decision loop that every other component supports.
The dominant pattern for this loop is ReAct (Yao et al., 2023), where the LLM alternates between a Thought step (reasoning about what to do), an Action step (calling a tool or generating output), and an Observation step (interpreting the result). A self-reflection gate at the end checks whether the output meets confidence thresholds before it is finalized. The diagram below traces this full decision pipeline from raw input to structured output.
Chain-of-Thought
The LLM breaks reasoning into explicit steps: understand intent → identify constraints → evaluate options → select action. This makes decisions interpretable and auditable.
Tool Selection
When the task requires external information or action, the LLM selects an appropriate tool from its available toolkit, constructs structured parameters, and interprets the result.
Self-Reflection
Before outputting a final answer, the LLM evaluates its own response: Is it complete? Accurate? Does it need another reasoning step? This creates an internal quality gate.
How agents observe their environment
A reasoning engine is only as good as the data it receives. Perception is the sensory layer that sits between the raw environment and the reasoning core, responsible for transforming unstructured signals into clean, structured representations. In text-based agents, this means tokenizing input, extracting entities, classifying intent, and embedding content into vector space for retrieval. In multimodal systems, it extends to images via vision-language models and audio via speech recognition. The quality of perception directly determines the quality of every downstream decision.
Text Understanding
Natural language inputs are tokenized, entities extracted, and intent classified. Modern agents use the LLM itself for this, but specialized NER and intent classifiers can run as pre-processing steps.
Multimodal Input
Vision-language models (GPT-4V, Claude Vision) can process images alongside text. Audio is transcribed via Whisper or similar ASR models before entering the text pipeline.
Structured Data Ingestion
API responses, database query results, and log streams are parsed into structured formats (JSON, dataframes) that the agent can reason over directly.
Embedding & Retrieval
Dense vector embeddings (sentence-transformers, text-embedding-3) enable semantic similarity search. This is the foundation of RAG — grounding decisions in retrieved context.
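The retrieval principle can be sketched with a toy bag-of-words embedding standing in for a real model such as text-embedding-3 or sentence-transformers; the corpus and query below are illustrative:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding
    # model (text-embedding-3, sentence-transformers) instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank documents by semantic (here: lexical) similarity to the query.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "restart the postgres container",
    "rotate tls certificates",
    "postgres connection pool exhausted",
]
top = retrieve("postgres container failure", docs, k=1)
```

Dense embeddings replace the word counts with learned vectors, but the ranking-by-similarity step that grounds RAG is exactly this.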
Breaking complex goals into executable sub-tasks
Planning is the capability that separates agents from simple chatbots. A chatbot produces a single response to a single prompt. An agent, by contrast, receives a complex goal, breaks it into a sequence of sub-tasks with explicit dependencies, determines which can run in parallel, and executes them iteratively while monitoring intermediate results. Research has converged on several planning paradigms, each suited to different types of problems. The choice of paradigm fundamentally shapes how an agent handles multi-step tasks, error recovery, and resource allocation.
ReAct (Reason + Act)
Yao et al., 2023
The agent alternates between reasoning steps (Thought) and action steps (Tool Call). Each observation from the environment triggers a new reasoning step. Simple, effective, and widely adopted.
Plan-and-Execute
Wang et al., 2023
The agent creates a complete plan upfront, then executes each step sequentially. Better for long-horizon tasks where the full plan can be validated before execution begins.
Tree-of-Thought
Yao et al., 2023
Multiple reasoning paths are explored in parallel as a tree. The agent evaluates each branch and selects the most promising path. Powerful for tasks requiring search and exploration.
DAG-based Orchestration
LangGraph, CrewAI
Tasks are modeled as a directed acyclic graph with explicit dependencies. Enables parallel execution of independent sub-tasks and checkpointing for recovery. Used in production multi-agent systems.
When to Use Each Pattern
Use ReAct for tasks with predictable complexity — single-domain queries, straightforward tool calls. The iterative loop handles uncertainty without upfront planning overhead.
Use Plan-and-Execute or DAG for cross-domain tasks requiring coordination between multiple agents. The upfront planning phase catches dependency conflicts before execution begins.
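The dependency-resolution step of the DAG pattern can be sketched with Kahn's algorithm grouped into execution "waves": every task in a wave has all its dependencies satisfied, so the wave can run in parallel. The task names are illustrative, not drawn from any particular fleet.

```python
# Group a dependency DAG into parallel execution waves (Kahn's algorithm).
def execution_waves(deps):
    indegree = {t: len(d) for t, d in deps.items()}
    waves = []
    while indegree:
        ready = sorted(t for t, n in indegree.items() if n == 0)
        if not ready:
            raise ValueError("cycle detected")      # DAGs must be acyclic
        waves.append(ready)
        for t in ready:
            del indegree[t]
        for t, d in deps.items():                   # release dependents
            if t in indegree:
                indegree[t] -= sum(1 for x in d if x in ready)
    return waves

dag = {
    "collect_logs": [],
    "inspect_resources": [],
    "diagnose": ["collect_logs", "inspect_resources"],
    "remediate": ["diagnose"],
}
waves = execution_waves(dag)
```

Here the two independent checks land in the first wave and execute concurrently, while diagnosis and remediation wait for their upstream results.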
Maintaining state across interactions and time
An agent without memory is stateless: it has no record of what it has done, what worked, or what failed. It will repeat the same mistakes indefinitely. Cognitive science identifies three distinct memory systems in the human brain, and modern AI agent architectures replicate all three. Working memory holds the current task context, analogous to holding a phone number in your head while you dial it. Episodic memory records complete traces of past interactions, analogous to remembering a specific conversation. Semantic memory captures generalized patterns and facts extracted from many episodes, analogous to knowing that a particular configuration always causes a timeout. Together, these three layers give agents the ability to learn from experience without retraining the underlying model.
Working Memory
Current conversation context, active variables, intermediate reasoning steps. Typically the LLM's context window (4K–128K tokens).
Episodic Memory
Complete records of past interactions: what was asked, what the agent did, what happened. Enables 'learning from experience' and few-shot retrieval.
Semantic Memory
Generalized facts and patterns extracted from many episodes. Domain knowledge, runbook templates, common resolution patterns stored as embeddings.
Why Memory Matters: APIs Are Stateless
LLM APIs are fundamentally stateless — each call is independent with no built-in memory of prior interactions. Web interfaces like ChatGPT create the illusion of continuity, but under the hood, the entire conversation history is resent with every request. This means the developer must explicitly manage conversation state, accumulating user messages, assistant responses, tool results, and system context into a growing message list. Memory systems (working, episodic, semantic) solve this at scale — they determine what to keep, what to compress, and what to retrieve, turning a stateless API into a stateful agent that learns from experience.
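A minimal sketch of this accumulation pattern, with a hypothetical `fake_llm` in place of a real API so the growing message list is visible:

```python
# A stateless chat API has no memory: the caller must resend the full
# message list on every turn. This stand-in model just reports how many
# messages it received, making the accumulation observable.
def fake_llm(messages):
    return f"I can see {len(messages)} messages"

history = [{"role": "system", "content": "You are a helpful agent."}]

def chat_turn(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_llm(history)          # entire history resent each call
    history.append({"role": "assistant", "content": reply})
    return reply
```

Each turn the visible message count grows by two, which is exactly why unmanaged histories eventually overflow the context window and why the memory tiers above must decide what to keep, compress, or retrieve.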
Connecting agents to the external world
Reasoning alone cannot change the state of the world. Tools are the execution layer that connects the agent to external systems: APIs, databases, code interpreters, file systems, and infrastructure control planes. When the reasoning engine determines that it needs information it does not have or needs to perform an action it cannot perform internally, it generates a structured tool call specifying the function name and parameters. The runtime validates the call, executes it, captures the result (or the error), and feeds it back as an observation for the next reasoning step.
The critical design principle is a unified tool interface. Every tool, whether it queries a PostgreSQL database, restarts a Docker container, or calls a third-party API, is presented to the LLM as a function signature with a name, a natural-language description, and a JSON parameter schema. This is the same pattern used by OpenAI Function Calling, LangChain Tools, and Anthropic Tool Use. Standardizing the interface means new tools can be added without modifying the reasoning engine.
Universal Tool Call Format
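An illustrative example of such a signature in the OpenAI function-calling style; the tool name, parameters, and validator below are hypothetical, not the fleet's actual definitions:

```python
# A tool as the LLM sees it: name, natural-language description, and a
# JSON parameter schema. The names here are illustrative.
restart_container_tool = {
    "name": "restart_container",
    "description": "Restart a Docker container by name and report status.",
    "parameters": {
        "type": "object",
        "properties": {
            "container": {"type": "string", "description": "Container name"},
            "timeout_s": {"type": "integer", "default": 10},
        },
        "required": ["container"],
    },
}

# The LLM emits a matching structured call; the runtime validates it
# against the schema before executing anything.
tool_call = {"name": "restart_container", "arguments": {"container": "data-pipeline"}}

def validate(call, tool):
    if call["name"] != tool["name"]:
        return False
    required = tool["parameters"].get("required", [])
    return all(k in call["arguments"] for k in required)
```

Because every tool is described in this one shape, adding a new capability means registering one more schema entry; the reasoning engine itself never changes.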
Closing the loop with continuous feedback
The evaluation and feedback loop is the sixth and final component, and it is the one that transforms a static system into an adaptive one. Without it, an agent can only apply the knowledge embedded in its training weights and its fixed prompt templates. With it, the agent continuously measures its own performance against real-world outcomes and adjusts its behavior accordingly. This is what makes the difference between an agent that degrades over time and one that improves.
The feedback loop operates at three distinct time horizons. The inner loop runs per-action, on the order of seconds, enabling self-correction within a single task. The middle loop runs per-episode, on the order of minutes, storing and learning from complete task traces. The outer loop runs per-epoch, on the order of days or weeks, performing systematic analysis of accumulated metrics and tuning parameters across the entire system.
Key Research References
Delivering the right context at the right time
Agent development is fundamentally about context engineering — the discipline of assembling precisely the right information into the LLM's prompt window so that its intelligence can be fully utilized. An agent framework is, at its core, a context management system. Every component described in this article — perception, memory, tools, planning — exists to solve one problem: ensuring the LLM has exactly the information it needs to make the right decision at each step of the loop.
The context window is a finite resource. A 128K-token window sounds vast, but production agents must partition it across six competing demands: system instructions, conversation history, retrieved documents, tool results, few-shot examples, and output space. Each token carries an opportunity cost — spending tokens on irrelevant history means fewer tokens available for critical RAG context. The visualization below maps the complete context assembly pipeline, from system prompt to structured output.
Prompt Architecture
The system message defines the agent's identity, constraints, and output schema. A well-structured system prompt is the single highest-leverage intervention in agent performance — it shapes every downstream decision.
Sliding Window
As conversations exceed the token budget, older messages must be evicted. A sliding window retains the most recent N turns, but smart implementations weight messages by relevance rather than recency alone.
Turn Compaction
When token pressure increases, the agent compresses verbose messages into concise summaries while preserving key facts. An 800-character response might compact to 300 characters without losing critical information.
Dynamic Few-Shot
Instead of static examples hardcoded into prompts, the agent retrieves semantically similar input-output pairs from episodic memory. This grounds the LLM's behavior in proven successful patterns for each specific task type.
Token Budgeting
Each context component receives a token budget proportional to its importance for the current task. Infrastructure alerts allocate more tokens to tool results. Research tasks allocate more to RAG context.
Context Grounding
No decision should rely solely on parametric knowledge. The agent retrieves relevant documents, past episodes, and domain patterns via hybrid search before reasoning. This is the bridge between memory and reasoning.
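A sketch of proportional token budgeting under assumed weights: the component names follow the six competing demands listed above, but the window size and weightings are illustrative, not measured values.

```python
# Split a fixed context window across components by task-dependent
# weights. An infrastructure alert weights tool results most heavily.
def allocate_budget(window_tokens, weights):
    total = sum(weights.values())
    return {k: int(window_tokens * w / total) for k, w in weights.items()}

infra_alert_weights = {
    "system_prompt": 1, "history": 2, "rag_context": 2,
    "tool_results": 4, "few_shot": 1, "output": 2,
}
budget = allocate_budget(12000, infra_alert_weights)
```

Switching the weight table per task type is what lets the same assembly pipeline serve both alert triage (tool-result heavy) and research tasks (RAG heavy).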
The Context Engineering Principle
“The essence of an agent lies in delivering the right context to an LLM so that its intelligence can be fully utilized. Agent frameworks are essentially designed to manage this challenge effectively.”
This insight reframes agent development from a model problem to an information architecture problem. The quality of the model matters, but the quality of the context it receives matters more. A smaller model with excellent context engineering will consistently outperform a larger model with poor context assembly — as demonstrated by the fleet's Qwen 2.5 7B achieving 94% classification accuracy through carefully engineered RAG pipelines and dynamic few-shot selection.
Input Ingestion
Embedding & Indexing
RAG Retrieval
LLM Reasoning
RAG Pipeline Technical Specification
Knowledge Library Index
Short-term vs long-term context retention
The theoretical memory model described in Part I identified three tiers: working, episodic, and semantic. In production, each tier maps to a specific storage backend chosen for its access pattern. Working memory demands sub-millisecond reads and tolerates volatility. Episodic memory requires durability and structured querying. Semantic memory requires high-dimensional vector search across generalized knowledge. The fleet uses Redis for the first, PostgreSQL for the second, and Milvus for the third.
Working memory answers the question: what is happening right now across the fleet. Episodic memory answers: what happened the last time this type of event occurred, and what was the outcome. Semantic memory answers: given everything we have learned, what is the best general approach to this class of problem. Together, these three layers give each agent access to the collective experience of the entire fleet.
Working Memory
Active task context held in Redis. Each agent publishes its current state to a Redis hash (key: agent:{id}:state) with a 30-second TTL heartbeat. Sentinel reads all agent states every 10 seconds to build fleet-wide situational awareness.
Episodic Memory
Every completed task is persisted as an episode in PostgreSQL: the input event, classification decision, execution plan, each step result, and final outcome. This creates a complete decision trace for auditing and learning.
Semantic Memory
Generalized knowledge from two sources: (1) episodic patterns stored as vector embeddings in Milvus for operational RAG retrieval, and (2) a Knowledge Library of 43 technical books (15,655 pages, 13 categories) indexed via PyMuPDF and queried for deep engineering Q&A through GPT-4o-mini.
Short-Term vs Long-Term Memory — Technical Comparison
| Dimension | Short-Term (Working) | Long-Term (Episodic + Semantic) |
|---|---|---|
| Storage Backend | Redis (in-memory hash maps with TTL) | PostgreSQL (persistent tables) + Milvus (vector embeddings) |
| Retention Period | 30 seconds to 15 minutes (configurable per agent) | Indefinite — partitioned by month, archived after 90 days |
| Data Format | JSON blobs: current task state, active context, pending actions | Structured rows: agent_id, action, params, result, duration, embedding vector |
| Access Pattern | O(1) key-value lookup; pub/sub for real-time state changes | Indexed SQL queries + ANN (approximate nearest neighbor) vector search |
| Use Case | Active task tracking, heartbeat monitoring, coordination state | Pattern recognition, decision replay, few-shot prompt context, audit trail |
| Failure Mode | Volatile — data lost on Redis restart (acceptable for ephemeral state) | Durable — WAL-replicated PostgreSQL with daily pg_dump backups |
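The working-memory pattern in the table above can be sketched with an in-memory stand-in for Redis; a real deployment would use redis-py HSET plus EXPIRE, and the agent IDs and state fields here are illustrative.

```python
import time

# In-memory stand-in for the Redis heartbeat pattern: each agent writes
# its state under agent:{id}:state with a TTL; a reader treats expired
# keys as "agent down". With redis-py this would be HSET + EXPIRE.
class WorkingMemory:
    def __init__(self):
        self.store = {}

    def heartbeat(self, agent_id, state, ttl=30):
        self.store[f"agent:{agent_id}:state"] = (state, time.time() + ttl)

    def read(self, agent_id):
        entry = self.store.get(f"agent:{agent_id}:state")
        if entry is None or time.time() > entry[1]:
            return None          # missing or expired: agent presumed down
        return entry[0]

wm = WorkingMemory()
wm.heartbeat("sentinel", {"task": "classify", "queue_depth": 3})
```

The TTL doubles as liveness detection: an agent that stops refreshing its key simply disappears from the fleet-wide state view, with no separate failure-detection machinery.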
Task decomposition and self-assessment in production
Planning separates agents from chatbots. A chatbot produces a single response; an agent decomposes a complex goal into ordered sub-tasks, executes them iteratively, and uses reflection to assess progress between steps. The fleet implements both capabilities as LangGraph tools that the ReAct loop can invoke on demand.
Task Decomposition
The create_tasks tool receives a list of step descriptions and returns a structured checklist with status tracking. Each task follows a state machine: pending → in_progress → completed.
@tool
def create_tasks(tasks: List[str]) -> str:
    """Decompose complex goals into
    ordered sub-task checklists."""
    task_objects = [
        Task(content=t, status="pending")
        for t in tasks
    ]
    task_objects[0].status = "in_progress"
    return format_checklist(task_objects)
Self-Reflection
The reflection tool creates structured self-assessments recorded in conversation context. Four patterns: progress review, error analysis, result synthesis, and self-check. Setting need_replan=True triggers plan revision.
@tool
def reflection(
    analysis: str,
    need_replan: bool = False
) -> str:
    """Self-assess progress and decide
    whether to continue or replan."""
    # Recorded in LangGraph context
    # as ToolResult for next reasoning
    # step — closing the feedback loop
Planning ↔ Reflection Cycle
These two tools form a complementary cycle within the ReAct loop. The planner creates structure; the reflector evaluates results and triggers replanning when conditions change. This enables the agent to handle multi-step operations that would previously require human intervention — a container health audit that discovers cascading failures, a deployment that needs rollback, or a data pipeline repair spanning multiple services.
Create Plan
create_tasks
Execute Step
domain tools
Reflect
reflection
Continue / Replan
loop or done
Token-aware memory with sliding window & compaction
Long multi-step operations accumulate tool results that can overflow the LLM's context window. The context management layer automatically measures token usage, compacts verbose tool outputs, and applies a sliding window to keep conversations within budget — all without losing critical information.
Token Counting
Every message is measured via tiktoken (cl100k_base) before LLM invocation. Budget tracking prevents context overflow during long ReAct chains with many tool calls. Falls back to word-based estimation when tiktoken is unavailable.
Sliding Window
Retains only the last 20 messages while always preserving system prompts and the initial user message. Prevents unbounded context growth during multi-step operations where the ReAct loop accumulates many tool call/result pairs.
Compaction
Tool results exceeding 800 characters are automatically summarized, preserving key data (numbers, statuses, errors) while removing verbose formatting. Priority-based extraction ensures critical information survives compaction.
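A sketch of priority-based compaction under assumed rules: keep lines carrying numbers, statuses, or error markers, and drop the rest once the length cap is exceeded. The regex and the cap are illustrative stand-ins for the fleet's actual extraction rules.

```python
import re

# Keep only high-signal lines (numbers, statuses, error markers) when a
# tool result exceeds the cap; otherwise pass it through untouched.
KEEP = re.compile(r"\d|error|fail|status|ok", re.IGNORECASE)

def compact(text, max_chars=800):
    if len(text) <= max_chars:
        return text
    kept = [ln for ln in text.splitlines() if KEEP.search(ln)]
    return "\n".join(kept)[:max_chars]
```

A verbose multi-kilobyte tool dump collapses to the handful of lines a downstream reasoning step actually needs, which is what keeps long ReAct chains inside the token budget.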
Pre-processing Pipeline: process_llm_request()
Before every LLM invocation, the context manager chains three strategies in order — compact, window, and budget — ensuring the agent stays within token limits even during complex multi-tool ReAct chains. This is injected automatically into chat(), achat(), and astream() methods.
def process_llm_request(messages, max_tokens=6000):
    # Step 1: Compact long tool results
    messages = compact_messages(messages)
    # Step 2: Sliding window (last 20 msgs)
    messages = apply_sliding_window(messages)
    # Step 3: Token budget enforcement
    tokens = count_messages_tokens(messages)
    if tokens > max_tokens:
        trim_oldest_non_system(messages)
    return messages, stats
Agents that improve with experience
A static rule-based system degrades as its environment evolves because the rules were written for conditions that no longer hold. The agent fleet avoids this failure mode by implementing four distinct learning loops, each operating at a different time scale and optimizing a different aspect of system performance. These loops continuously refine decision quality, retrieval accuracy, human dependency, and resource efficiency without ever retraining the underlying language model.
The design principle behind all four loops is the same: learning happens at the retrieval and configuration layer, not at the model weight layer. The base LLM remains frozen and deterministic. What changes over time is the content of the vector store, the set of few-shot examples retrieved for each classification, the confidence thresholds that trigger escalation, and the resource limits assigned to each container. This separation of concerns makes the learning process auditable, reversible, and safe to run continuously in production.
Outcome Feedback Loop
After every task execution, the result (success/failure/partial) is compared against the predicted outcome. This feedback is stored in episodic memory and used to update the confidence scores of the classification model's few-shot examples.
Retrieval Quality Loop
Each RAG retrieval is scored by the downstream agent: did the retrieved context lead to a successful resolution? Low-scoring retrievals trigger re-embedding with updated metadata. High-scoring contexts are promoted in the reranker's training set.
Escalation Learning Loop
When Sentinel escalates a task to a human operator, it records the human's resolution. These expert traces become new training examples — stored as high-confidence episodes that future RAG queries preferentially retrieve.
Resource Optimization Loop
The Cost Agent continuously monitors resource utilization vs. allocation across all containers. Historical utilization patterns (7-day rolling window) feed a linear regression model that recommends resource limit adjustments.
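The recommendation step can be sketched as a simple least-squares trend over the 7-day window plus a headroom factor; the utilization series, headroom value, and function name are illustrative assumptions, not the Cost Agent's actual model.

```python
# Fit a least-squares trend to 7 daily utilization samples (percent of
# the current limit), project one day ahead, and recommend a new limit
# with a headroom multiplier. All numbers are illustrative.
def recommend_limit(daily_util_pct, current_limit_mb, headroom=1.2):
    n = len(daily_util_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_util_pct) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_util_pct))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    projected_pct = mean_y + slope * (n - mean_x)   # one day past the window
    projected_mb = current_limit_mb * projected_pct / 100
    return round(projected_mb * headroom)

limit = recommend_limit([40, 45, 50, 55, 60, 65, 70], current_limit_mb=1024)
```

A steadily climbing utilization curve yields a proactive limit increase before the container ever hits its ceiling, rather than a reactive restart after an OOM kill.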
Continuous Improvement Pipeline
1. Observe
Collect structured telemetry from every agent action: input, decision, output, latency, resource usage. All observations are schema-validated before storage.
2. Evaluate
Compare actual outcomes against expected outcomes. Success criteria are defined per agent type: deployment agents check health endpoints, security agents verify CVE counts.
3. Store
Persist evaluated episodes to PostgreSQL (structured) and Milvus (embedded). Tag each episode with outcome quality score, agent version, and environmental context.
4. Retrieve
When a new, similar event occurs, the RAG pipeline retrieves the most relevant past episodes. Recency and outcome quality are weighted in the retrieval ranking.
5. Adapt
Update decision parameters: few-shot examples, confidence thresholds, retry limits, resource allocations. Changes are logged as configuration episodes for traceability.
Topology
Control relationships
Flow
Volume across stages
State
Workflow transitions
Resources
Per-container cost
Event Detection
< 1 s
Webhooks, cron triggers, and anomaly detectors feed raw events into the LangGraph ingestion node. Each event is tagged with source, severity, and timestamp before entering the classification pipeline.
Task Classification
< 2 s
The Router node uses Qwen 2.5 7B to classify each event into a domain category. Ambiguous events trigger a secondary LLM-based analysis with chain-of-thought reasoning before routing. The classifier uses few-shot examples stored in vector memory for accuracy.
DAG Construction
< 3 s
The Planner node decomposes the classified task into a directed acyclic graph of sub-tasks, resolving dependencies and parallelization opportunities. Each DAG node maps to a specific agent capability. Dependencies are topologically sorted for execution order.
Parallel Execution
Variable
Worker agents execute their assigned sub-tasks concurrently, communicating progress through Redis pub/sub channels. Failed tasks trigger automatic retry with exponential backoff (base 2s, max 3 retries) before escalation to Sentinel.
Validation and Consolidation
< 5 s
Quality gates verify each result against predefined schemas and business rules. Validated outputs are consolidated into a single report, persisted to PostgreSQL, and metrics pushed to Prometheus. Failures trigger rollback via LangGraph checkpoints.
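The retry policy described above (base 2 s, max 3 retries, then escalation) can be sketched as follows; the flaky task and the returned delay list are illustrative devices, and the actual sleep is omitted from the sketch.

```python
# Retry with exponential backoff: delays of 2 s, 4 s, 8 s, then escalate.
# `delays` is returned so the backoff schedule is inspectable.
def run_with_retry(task, base=2.0, max_retries=3):
    delays = []
    for attempt in range(max_retries + 1):
        try:
            return task(), delays
        except Exception:
            if attempt == max_retries:
                raise RuntimeError("escalate to Sentinel")
            delay = base * (2 ** attempt)
            delays.append(delay)
            # time.sleep(delay)  # omitted in this sketch

calls = {"n": 0}
def flaky():
    # Illustrative task that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

result, delays = run_with_retry(flaky)
```

Exponential spacing gives transient faults (a restarting container, a brief network blip) time to clear, while the hard cap guarantees a bounded path to human escalation.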
Single control plane
Every event enters through Sentinel first. This centralizes classification, priority assignment, trace IDs, and rollback authority before any specialist agent runs.
Parallel work where safe
The DAG planner isolates independent checks such as log analysis, resource inspection, and cost evaluation so they execute concurrently instead of serially.
Validation before memory
The system does not store raw outcomes blindly. It first verifies health probes, schema compliance, and side effects so future RAG retrieval is grounded in trusted episodes.
Request Lifecycle Pipeline
Event Ingestion
< 100 ms
Raw events enter via FastAPI endpoints on Sentinel (port 8002). Each event is tagged with source, severity, timestamp, and a correlation ID. Entry point: Cloudflare Tunnel → Nginx reverse proxy → Sentinel container.
Task Classification
< 2 s
Sentinel's LLM (Qwen 2.5 via Ollama at localhost:11434) classifies the event into a domain—infrastructure, security, data, analytics, or management—and assigns severity P0–P3. Few-shot examples are retrieved from Milvus (HNSW index) to improve classification accuracy.
DAG Construction
< 500 ms
The Orchestrator (port 8030) builds a directed acyclic graph of sub-tasks using LangGraph's state-machine engine. It resolves agent dependencies, identifies parallelization opportunities, and attaches checkpoint metadata for rollback capability.
Agent Dispatch
< 200 ms
Sub-tasks are published to domain-specific Redis Pub/Sub channels (agent.{domain}.{action}). Target agents pick up tasks based on subscription and current capacity. Working memory is pre-loaded from Redis HSET.
Tool Execution
1–30 s
Each agent executes its assigned tools via FastAPI endpoints on ports 8002–8050. Tools include container restarts (Docker API), vulnerability scans (Trivy), query optimization (PostgreSQL), deployment rollouts, and log analysis.
Result Validation
< 500 ms
Quality gates verify each agent's output. Health probes confirm system state, JSON schema validators check response structure, and diff checkers verify no unintended side-effects. Failed actions trigger retry (max 3) with exponential backoff before human escalation.
State Persistence
< 100 ms
Results are written to four persistence layers: episodic memory (PostgreSQL JSONB), semantic embeddings (Milvus HNSW index), time-series metrics (Prometheus push gateway), and ephemeral cache (Redis with TTL). Future similar incidents benefit from RAG retrieval of this episode.
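The dispatch step can be sketched with an in-memory stand-in for Redis Pub/Sub using the agent.{domain}.{action} channel convention; the pattern matching mirrors Redis PSUBSCRIBE, and the handler below is illustrative.

```python
import fnmatch
from collections import defaultdict

# In-memory stand-in for Redis Pub/Sub. Agents subscribe with glob
# patterns (as with PSUBSCRIBE) and receive any matching channel.
class Bus:
    def __init__(self):
        self.subs = defaultdict(list)

    def psubscribe(self, pattern, handler):
        self.subs[pattern].append(handler)

    def publish(self, channel, payload):
        delivered = 0
        for pattern, handlers in self.subs.items():
            if fnmatch.fnmatch(channel, pattern):
                for h in handlers:
                    h(channel, payload)
                    delivered += 1
        return delivered

bus = Bus()
received = []
bus.psubscribe("agent.infrastructure.*", lambda ch, p: received.append((ch, p)))
count = bus.publish("agent.infrastructure.restart", {"container": "data-pipeline"})
```

The channel hierarchy is what lets an infrastructure agent subscribe to its whole domain with one pattern while ignoring analytics or management traffic entirely.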
Concrete Scenario: Container Memory Alert
Prometheus AlertManager fires: container 'data-pipeline' memory > 85%
System Topology — 28 Agent Fleet
Orchestration
2 agents
Sentinel (LangGraph) -- Orchestrator v2 (Redis DAG)
Infrastructure
7 agents
Infra Health -- Proxmox -- Deployment -- Log Analysis -- Security -- Network -- Realtime Alert
Data Engineering
5 agents
Data Pipeline -- Data Engineer (Airflow) -- Data Quality -- Database -- RAG Knowledge (Milvus + Knowledge Library)
Analytics & Intelligence
6 agents
BI -- Customer Analytics -- Cost Optimization -- Data Analyst -- Data Science (ML) -- MLOps (MLflow)
Business & Management
8 agents
Project Mgr -- Delivery Mgr -- Product Owner -- Reporting -- Job Search -- Outreach -- Portfolio Analytics -- Workspace Tidy (deprecated)
Compute Layer
Data Layer
Network Layer
Observability Layer
MLOps Layer
Intelligence Layer
Dedicated Hetzner root server running Proxmox VE hypervisor. Hosts VMs and LXC containers across isolated VLANs.
Each agent is a Python 3.11-slim image with FastAPI, LangChain tools, and a shared library. Dockerfile per agent, ~150 MB each.
Containers start with host networking on unique ports (8002-8064). Environment variables injected, restart policy set to unless-stopped.
Agent Health Proxy discovers all 28 agents. Polls /health and /healthz endpoints, aggregates status into a single /all-health JSON response.
Sentinel (port 8003) serves the Command Center UI. JavaScript polls the proxy every 15 seconds, renders agent cards with status, tools, and tier labels.
Zero-trust encrypted tunnel from Cloudflare edge to Sentinel. Public domain sentinel.simondatalab.de routes to the dashboard. No ports exposed to internet.
Events arrive via webhooks. Sentinel classifies, Orchestrator builds DAG, agents execute tools in parallel via Redis Pub/Sub. Results validated and persisted.
Failed health checks trigger Docker restart policies. Sentinel detects degraded agents, redistributes tasks, and stores recovery episodes for future RAG retrieval.
Sentinel, Orchestrator v2, Infrastructure Health, Proxmox, Deployment, Log Analysis, Security, Network, Realtime Alert
Data Pipeline, Data Engineer, Data Quality, Database, RAG Knowledge, BI, Customer Analytics, Cost Optimization, Data Analyst, Data Science, MLOps
Project Manager, Delivery Manager, Product Owner, Reporting, Job Search, Outreach Networking, Portfolio Analytics
Explore more projects
This case study demonstrates autonomous AI agent architecture with perception, memory, and adaptive learning. See more data engineering and infrastructure projects.