Multi-Agent Orchestration System Design v2.0

Version: 2.0.0
Author: Jeff Smith + Claude
Date: January 11, 2026
Status: RFC (Request for Comments)
Data Source: Venice.ai billing history Dec 12, 2025 - Jan 11, 2026 (13,582 transactions)


Executive Summary

This document describes the architecture for a multi-model AI development system built on Open WebUI, Venice.ai, and Gitea. Based on 30 days of actual billing data, we demonstrate that full NPE automation costs about 0.29 DIEM/day (3.6% of the 8.1 DIEM daily budget), leaving roughly 7.81 DIEM for interactive work.

Core Thesis: Your actual usage patterns prove orchestrated automation is not just feasible—it's virtually free. The real challenge is context management and cache optimization, not model costs.


Table of Contents

  1. Billing Data Analysis
  2. Model Selection Strategy
  3. Context Management
  4. Tool Architecture
  5. NPE Personas & Roles
  6. Cron & Scheduling
  7. Workflow Patterns
  8. Cost Management
  9. Implementation Roadmap
  10. Open Questions

1. Billing Data Analysis

1.1 30-Day Summary

Period:           Dec 12, 2025 - Jan 11, 2026
Total Records:    13,582 billing events
Total Spend:      61.24 DIEM + 2.86 USD
Days Active:      18 days
Average Daily:    3.56 DIEM/day
Max Daily:        9.96 DIEM (Jan 2, 2026)
Median Daily:     3.72 DIEM/day

Budget Analysis:

  • Daily Budget: 8.1 DIEM (staked)
  • Average Spend: 3.56 DIEM/day
  • Average Surplus: 4.54 DIEM/day (56% unutilized)
  • This surplus is lost at the 19:00 EST daily reset

1.2 Spend by Model Family

┌────────────────────────────────────────────────────────────────────┐
│                    ACTUAL SPEND BY MODEL (30 days)                 │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GLM-4.6         ████████████████████████████  16.66 DIEM (26.0%)  │
│  Qwen            ████████████████████████      15.60 DIEM (24.3%)  │
│  Claude-Opus-4.5 ██████████████████            11.86 DIEM (18.5%)  │
│  Grok-41-Fast    ██████████                     6.71 DIEM (10.5%)  │
│  Image-Gen       █████████                      6.01 DIEM  (9.4%)  │
│  MiniMax-M21     █████                          3.31 DIEM  (5.2%)  │
│  Kimi-K2         ██                             1.53 DIEM  (2.4%)  │
│  GLM-4.7         ██                             1.25 DIEM  (2.0%)  │
│  Other           █                              1.17 DIEM  (1.7%)  │
│                                                                     │
│  TOTAL                                         64.10 DIEM          │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

1.3 Actual Effective Rates (DIEM per 1M Tokens)

From your billing data, sorted by cost efficiency:

| Model | Input Rate | Output Rate | Cache Rate | Effective Rate | Calls | Total Cost |
|---|---|---|---|---|---|---|
| Qwen-Instruct | $0.071 | $0.27 | - | 0.0626 | 408 | 0.92 |
| Grok-Code | $0.140 | $1.87 | $0.030 | 0.0721 | 611 | 1.67 |
| DeepSeek | $0.329 | $1.00 | $0.200 | 0.1825 | 36 | 0.10 |
| Grok-41-Fast | $0.314 | $1.25 | $0.125 | 0.1857 | 1,152 | 5.03 |
| MiniMax-M21 | $0.316 | $1.60 | $0.040 | 0.2232 | 360 | 3.31 |
| Qwen-Thinking | $0.450 | $3.50 | - | 0.3339 | 105 | 1.61 |
| Qwen-Coder | $0.750 | $3.00 | - | 0.3830 | 530 | 12.99 |
| Kimi-K2-Thinking | $0.595 | $3.20 | $0.375 | 0.4152 | 111 | 1.53 |
| GLM-4.6 | $0.850 | $2.75 | - | 0.4455 | 1,724 | 16.66 |
| Claude-Opus-4.5 | $6.000 | $30.00 | - | 5.2751 | 78 | 11.86 |

Key Insights:

  1. Grok-41-Fast is ~28× cheaper than Claude per token (effective 0.1857 vs 5.2751 DIEM/1M)
  2. Qwen-Instruct is ~84× cheaper than Claude for bulk work (effective 0.0626 vs 5.2751 DIEM/1M)
  3. Cache hits cut Grok-41-Fast input costs by 75% ($0.125 cached vs $0.500 full)
  4. Claude accounts for only 78 calls yet 18.5% of total spend

1.4 Context Size Distribution

Your actual context sizes reveal optimization opportunities:

GROK-41-FAST (1,152 calls):
  Median: 5,291 tokens | P75: 7,217 | Max: 582,506
  Distribution: 0-5k: 1671 | 5-10k: 1245 | 10-20k: 485 | 20-50k: 23 | 50k+: 17
  ✓ WELL MANAGED - 85% of calls under 10k tokens

KIMI-K2-THINKING (111 calls):
  Median: 5,164 tokens | P75: 17,797 | Max: 48,893
  Distribution: 0-5k: 144 | 5-10k: 39 | 10-20k: 83 | 20-50k: 34
  ⚠ CONTEXT BLEEDING - P75 jumps to 18k, needs pruning

MINIMAX-M21 (360 calls):
  Median: 10,090 tokens | P75: 23,071 | Max: 169,708
  Distribution: 0-5k: 211 | 5-10k: 202 | 10-20k: 181 | 20-50k: 198 | 50k+: 38
  ⚠ HEAVY CONTEXTS - 11% of calls over 50k tokens

QWEN-CODER (530 calls):
  Median: 17,016 tokens | P75: 47,601 | Max: 253,462
  Distribution: 0-5k: 340 | 5-10k: 124 | 10-20k: 82 | 20-50k: 284 | 50k+: 230
  ⚠ CODE CONTEXTS ARE LARGE - expected but optimize where possible

CLAUDE-OPUS-4.5 (78 calls):
  Median: 7,198 tokens | P75: 11,956 | Max: 51,368
  ✓ REASONABLE - given high cost, context is well-controlled

1.5 Cache Efficiency Analysis

┌────────────────────────────────────────────────────────────────────┐
│                    CACHE HIT RATES BY MODEL                        │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  GROK-CODE:     ████████████████████████████████████  67.8%        │
│                 Saved: $1.08 | Cache rate: $0.030 vs $0.250 full   │
│                                                                     │
│  KIMI-K2:       █████                                   9.8%        │
│                 Saved: $0.05 | Cache rate: $0.375 vs $0.750 full   │
│                                                                     │
│  GROK-41-FAST:  █████                                   8.8%        │
│                 Saved: $0.27 | Cache rate: $0.125 vs $0.500 full   │
│                                                                     │
│  MINIMAX-M21:   ████                                    7.2%        │
│                 Saved: $0.16 | Cache rate: $0.040 vs $0.400 full   │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

Grok-Code's 67.8% cache hit rate demonstrates what's possible with sequential operations using the same system prompt and context.
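If 67.8% of input tokens are billed at the cached rate, the blended input price works out to roughly 0.678 × $0.030 + 0.322 × $0.250 ≈ $0.10 per 1M tokens, about 60% below the full rate.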

1.6 Hourly Usage Pattern (EST)

Hour (EST) | Spend    | Activity Level
───────────┼──────────┼────────────────────────────────
  00-03    |   0.00   | 💤 Dead
  04-05    |   0.84   | 🌅 Early morning
  06-08    |   9.86   | ☕ Morning peak
  09-11    |   3.57   | 📊 Late morning
  12-13    |   7.34   | 🍽️ Lunch peak
  14-15    |   8.13   | 💻 Afternoon work
  16-17    |  14.25   | 🔥 PEAK (4-6pm)
  18-19    |   8.13   | 🌆 Evening
  20-21    |   9.93   | 🌙 Night session
  22-23    |   0.00   | 💤 Dead

AUTOMATION WINDOW: 22:00 - 07:00 EST (9 hours of minimal usage)

2. Model Selection Strategy

2.1 Tier Architecture (Data-Driven)

Based on actual billing patterns:

┌─────────────────────────────────────────────────────────────────────┐
│                    MODEL SELECTION PYRAMID                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                        ┌─────────────┐                              │
│                        │   TIER 4    │  Claude Opus 4.5             │
│                        │   ORACLE    │  5.28 DIEM/1M effective      │
│                        │    <2%      │  Architecture, Security,     │
│                        └──────┬──────┘  Deadlock Resolution         │
│                               │         Budget: 0.05 DIEM/day       │
│                               │                                      │
│                     ┌─────────┴─────────┐                           │
│                     │      TIER 3       │  Kimi-K2, Qwen-Thinking   │
│                     │    REASONING      │  0.33-0.42 DIEM/1M        │
│                     │       5%          │  Complex analysis,        │
│                     └─────────┬─────────┘  Multi-step reasoning     │
│                               │            Budget: 0.10 DIEM/day    │
│                               │                                      │
│               ┌───────────────┴───────────────┐                     │
│               │           TIER 2              │  MiniMax, DeepSeek  │
│               │         BALANCED              │  0.18-0.22 DIEM/1M  │
│               │           15%                 │  Standard tasks,    │
│               └───────────────┬───────────────┘  Code generation    │
│                               │                  Budget: 0.50 DIEM  │
│                               │                                      │
│  ┌────────────────────────────┴────────────────────────────┐        │
│  │                       TIER 1                             │        │
│  │                    WORKHORSES                            │        │
│  │                       78%                                │        │
│  │  Grok-41-Fast: 0.19 DIEM/1M | Grok-Code: 0.07 DIEM/1M  │        │
│  │  Qwen-Instruct: 0.06 DIEM/1M                            │        │
│  │  Routing, Quick checks, Bulk processing, PM tasks       │        │
│  │  Budget: Remaining (~7.45 DIEM/day)                     │        │
│  └─────────────────────────────────────────────────────────┘        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

2.2 Model Selection Matrix

| Task Type | Primary Model | Fallback | Cost/Call (DIEM) | Rationale |
|---|---|---|---|---|
| Routing/Dispatch | Grok-41-Fast | Qwen-Instruct | 0.002 | Cheapest with cache |
| PM Coordination | Grok-41-Fast | MiniMax | 0.003 | Simple decisions |
| Code Generation | Grok-Code | Grok-41-Fast | 0.005 | 67% cache hits |
| Code Review | Grok-41-Fast | DeepSeek | 0.004 | Good reasoning |
| Bulk Processing | Qwen-Instruct | Grok-41-Fast | 0.001 | 0.06 DIEM/1M |
| Complex Reasoning | Kimi-K2 | Qwen-Thinking | 0.012 | Thinking models |
| Architecture | Claude-Opus | Kimi-K2 | 0.054 | Only when needed |
| Security Review | Claude-Opus | - | 0.054 | Non-negotiable |

2.3 Model Selection Logic

from enum import Enum
from dataclasses import dataclass
from typing import Optional

class TaskComplexity(Enum):
    TRIVIAL = 1   # Routing, yes/no decisions
    SIMPLE = 2    # Single-step tasks
    MODERATE = 3  # Multi-step, needs context
    COMPLEX = 4   # Reasoning required
    CRITICAL = 5  # Architecture, security

@dataclass
class ModelConfig:
    id: str
    input_rate: float   # DIEM per 1M tokens
    output_rate: float  # DIEM per 1M tokens
    cache_rate: float   # DIEM per 1M tokens (0 if no cache)
    max_context: int    # Recommended max context
    tier: int

MODELS = {
    "qwen-instruct": ModelConfig("qwen3-235b-a22b-instruct-2507", 0.15, 0.27, 0, 32000, 1),
    "grok-code": ModelConfig("grok-code-fast-1", 0.25, 1.87, 0.03, 16000, 1),
    "grok-fast": ModelConfig("grok-41-fast", 0.50, 1.25, 0.125, 16000, 1),
    "deepseek": ModelConfig("deepseek-chat", 0.50, 1.00, 0.20, 32000, 2),
    "minimax": ModelConfig("minimax-m21", 0.40, 1.60, 0.04, 32000, 2),
    "kimi": ModelConfig("kimi-k2", 0.75, 3.20, 0.375, 32000, 3),
    "qwen-thinking": ModelConfig("qwen3-235b-a22b-thinking-2507", 0.45, 3.50, 0, 32000, 3),
    "claude": ModelConfig("claude-opus-4-5", 6.00, 30.00, 0, 200000, 4),
}

def select_model(
    task_type: str,
    complexity: TaskComplexity,
    budget_remaining: float,
    context_size: int = 0
) -> str:
    """
    Select optimal model based on task, complexity, and budget.
    
    Returns Venice model ID.
    """
    # Budget gates
    if budget_remaining < 0.5:
        return MODELS["qwen-instruct"].id  # Emergency mode
    
    if budget_remaining < 2.0:
        # Low budget - force Tier 1
        if complexity == TaskComplexity.CRITICAL:
            return MODELS["kimi"].id  # Downgrade from Claude
        return MODELS["grok-fast"].id
    
    # Task-based selection
    selection_map = {
        # Task type -> {complexity: model_key}
        "routing": {
            TaskComplexity.TRIVIAL: "qwen-instruct",
            TaskComplexity.SIMPLE: "grok-fast",
            TaskComplexity.MODERATE: "grok-fast",
        },
        "pm_coordination": {
            TaskComplexity.TRIVIAL: "grok-fast",
            TaskComplexity.SIMPLE: "grok-fast",
            TaskComplexity.MODERATE: "minimax",
            TaskComplexity.COMPLEX: "kimi",
        },
        "code_generation": {
            TaskComplexity.SIMPLE: "grok-code",
            TaskComplexity.MODERATE: "grok-code",
            TaskComplexity.COMPLEX: "minimax",
            TaskComplexity.CRITICAL: "kimi",
        },
        "code_review": {
            TaskComplexity.SIMPLE: "grok-fast",
            TaskComplexity.MODERATE: "grok-fast",
            TaskComplexity.COMPLEX: "deepseek",
            TaskComplexity.CRITICAL: "claude",
        },
        "architecture": {
            TaskComplexity.MODERATE: "kimi",
            TaskComplexity.COMPLEX: "kimi",
            TaskComplexity.CRITICAL: "claude",
        },
        "security": {
            TaskComplexity.SIMPLE: "grok-fast",
            TaskComplexity.MODERATE: "kimi",
            TaskComplexity.COMPLEX: "claude",
            TaskComplexity.CRITICAL: "claude",
        },
    }
    
    task_map = selection_map.get(task_type, {})
    model_key = task_map.get(complexity, "grok-fast")
    
    return MODELS[model_key].id


def estimate_cost(model_key: str, input_tokens: int, output_tokens: int, cache_tokens: int = 0) -> float:
    """Estimate DIEM cost for a completion."""
    model = MODELS[model_key]
    
    input_cost = ((input_tokens - cache_tokens) / 1_000_000) * model.input_rate
    cache_cost = (cache_tokens / 1_000_000) * model.cache_rate
    output_cost = (output_tokens / 1_000_000) * model.output_rate
    
    return input_cost + cache_cost + output_cost
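
A quick usage sketch of the two helpers above (the task type, budget, and token counts are illustrative):

# Route a moderate code-generation task with 5.0 DIEM of budget left
model_id = select_model("code_generation", TaskComplexity.MODERATE, budget_remaining=5.0)
# -> "grok-code-fast-1"

# Estimate the cost of that call at ~8K input / 1K output tokens
cost = estimate_cost("grok-code", input_tokens=8_000, output_tokens=1_000)
# (8,000/1M × 0.25) + (1,000/1M × 1.87) ≈ 0.0039 DIEM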

3. Context Management

3.1 The Context Problem

Your billing data reveals context size is the primary cost driver:

  • Kimi calls at 48K tokens: 0.040 DIEM each
  • Kimi calls at 5K tokens: 0.006 DIEM each (6.7× cheaper)
  • Grok calls at 5K tokens: 0.003 DIEM each

Every 10K tokens of unnecessary context costs ~0.005-0.015 DIEM.
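
These figures line up with the estimate_cost helper from Section 2.3 (an output size of ~500 tokens per call is assumed for illustration):

estimate_cost("kimi", 48_000, 500)      # ≈ 0.038 DIEM (48K-token context)
estimate_cost("kimi", 5_000, 500)       # ≈ 0.005 DIEM (pruned to 5K)
estimate_cost("grok-fast", 5_000, 500)  # ≈ 0.003 DIEM (pruned + cheaper model)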

3.2 Context Budget by Role

# Maximum context tokens per NPE role
CONTEXT_LIMITS = {
    "orchestrator": 4_000,   # Minimal - just state and decisions
    "pm": 6_000,             # Moderate - task list and status
    "coder": 12_000,         # Larger - needs code context
    "reviewer": 8_000,       # Moderate - diff + surrounding code
    "editorial": 10_000,     # Article + guidelines
}

# Warning thresholds (emit warning in logs)
CONTEXT_WARNINGS = {
    "orchestrator": 3_000,
    "pm": 5_000,
    "coder": 10_000,
    "reviewer": 6_000,
    "editorial": 8_000,
}

3.3 Context Management Strategy

from typing import Optional
import tiktoken

class ContextManager:
    """
    Aggressive context pruning for cost control.
    
    Strategy:
    1. Always keep: system prompt, last 2 user messages, last assistant response
    2. Summarize: everything older than 3 exchanges
    3. Prune: tool outputs older than 2 exchanges
    4. Compress: code blocks to signatures only (in summaries)
    """
    
    def __init__(self, model: str = "grok-fast"):
        self.encoder = tiktoken.get_encoding("cl100k_base")
        self.summary_model = model  # Use cheap model for summaries
    
    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.encoder.encode(text))
    
    def count_messages(self, messages: list[dict]) -> int:
        """Count total tokens in message list."""
        total = 0
        for msg in messages:
            total += self.count_tokens(msg.get("content", ""))
            # Add overhead for role, etc.
            total += 4
        return total
    
    async def prepare_context(
        self,
        role: str,
        system_prompt: str,
        messages: list[dict],
        force_limit: Optional[int] = None
    ) -> tuple[str, list[dict]]:
        """
        Prune context to fit within role's budget.
        
        Returns: (possibly_modified_system_prompt, pruned_messages)
        """
        limit = force_limit or CONTEXT_LIMITS.get(role, 8_000)
        warning = CONTEXT_WARNINGS.get(role, limit - 1000)
        
        system_tokens = self.count_tokens(system_prompt)
        message_tokens = self.count_messages(messages)
        total = system_tokens + message_tokens
        
        if total <= limit:
            if total > warning:
                print(f"⚠️ Context at {total} tokens (warning: {warning})")
            return system_prompt, messages
        
        print(f"🔄 Context pruning: {total} -> {limit} tokens")
        
        # Strategy 1: Keep essential messages
        essential = []
        essential_tokens = 0
        
        # Always keep last 3 messages (2 user + 1 assistant typically)
        for msg in messages[-3:]:
            essential.append(msg)
            essential_tokens += self.count_tokens(msg.get("content", "")) + 4
        
        remaining_budget = limit - system_tokens - essential_tokens - 500  # Buffer for summary
        
        if remaining_budget < 200:
            # Can't fit summary, just use essential
            return system_prompt, essential
        
        # Strategy 2: Summarize older messages
        old_messages = messages[:-3]
        if old_messages:
            summary = await self._summarize_messages(old_messages, remaining_budget)
            
            # Prepend summary as system context
            augmented_system = f"{system_prompt}\n\n## Previous Context Summary\n{summary}"
            return augmented_system, essential
        
        return system_prompt, essential
    
    async def _summarize_messages(self, messages: list[dict], max_tokens: int) -> str:
        """
        Summarize old messages into compact form.
        Uses Grok for cheap summarization (~0.001 DIEM).
        """
        # Build summary request
        content_parts = []
        for msg in messages[-10:]:  # Last 10 messages max
            role = msg.get("role", "unknown")
            text = msg.get("content", "")[:1000]  # Truncate each
            content_parts.append(f"{role}: {text}")
        
        content = "\n---\n".join(content_parts)
        
        summary_prompt = f"""Summarize this conversation in {max_tokens // 4} words or less.
Focus on: decisions made, current task, blockers, key context.
Do NOT include pleasantries or meta-commentary.

Conversation:
{content[:4000]}

Summary:"""
        
        # Call cheap model for summary
        response = await venice_completion(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=min(max_tokens, 500)
        )
        
        return response.strip()
    
    def compress_code_context(self, code: str, max_lines: int = 50) -> str:
        """
        Compress code to essential structure for context.
        Keeps signatures, docstrings, removes implementation.
        """
        lines = code.split("\n")
        
        if len(lines) <= max_lines:
            return code
        
        compressed = []
        in_function = False
        
        for line in lines:
            stripped = line.strip()
            
            # Always keep: imports and class/function definitions
            if any(stripped.startswith(kw) for kw in ["import ", "from ", "class ", "def ", "async def "]):
                compressed.append(line)
                if stripped.startswith(("def ", "async def ", "class ")):
                    in_function = True
            # Keep docstring lines, then stub out the function body
            elif in_function and stripped.startswith(('"""', "'''")):
                compressed.append(line)
                if stripped.count('"""') == 2 or stripped.count("'''") == 2:
                    compressed.append("        # ... implementation ...")
                    in_function = False
            elif stripped == "":
                compressed.append("")
        
        return "\n".join(compressed)
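
A minimal sketch of how an NPE turn might use the manager before calling the model (venice_completion is the same assumed helper used in _summarize_messages):

async def run_pm_turn(messages: list[dict]) -> str:
    cm = ContextManager(model="grok-fast")
    system_prompt = "You are a Project Manager NPE. Respond with structured JSON only."
    # Prune to the PM's 6,000-token budget before spending tokens on the call
    system, pruned = await cm.prepare_context("pm", system_prompt, messages)
    return await venice_completion(
        model="grok-41-fast",
        messages=[{"role": "system", "content": system}, *pruned],
        max_tokens=600,
    )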

3.4 Cache Optimization

Grok-Code's 67.8% cache hit rate (Section 1.5) shows what's achievable:

class CacheOptimizer:
    """
    Maximize Venice's prompt caching for cost savings.
    
    Venice caches the PREFIX of prompts. To maximize hits:
    1. Put static content (system prompt) FIRST
    2. Put stable context (project info) SECOND
    3. Put variable content (current task) LAST
    """
    
    @staticmethod
    def build_cacheable_prompt(
        system_prompt: str,
        project_context: str,
        task_context: str,
        user_message: str
    ) -> list[dict]:
        """
        Build message list optimized for cache hits.
        
        Structure:
        1. System prompt (static) - CACHED after first call
        2. Project context as system addendum - CACHED if unchanged
        3. Task context as assistant message - Varies
        4. User message - Always new
        """
        messages = [
            {
                "role": "system",
                "content": f"{system_prompt}\n\n## Project Context\n{project_context}"
            }
        ]
        
        if task_context:
            messages.append({
                "role": "assistant",
                "content": f"Current task context:\n{task_context}"
            })
        
        messages.append({
            "role": "user",
            "content": user_message
        })
        
        return messages
    
    @staticmethod
    def batch_similar_tasks(tasks: list[dict]) -> list[list[dict]]:
        """
        Group tasks by system prompt and project to maximize cache hits.
        
        Running 5 code reviews for the same project sequentially
        means prompts 2-5 get ~75% cache hits on the system prompt.
        """
        batches = {}
        
        for task in tasks:
            key = (task.get("system_prompt_hash"), task.get("project_id"))
            if key not in batches:
                batches[key] = []
            batches[key].append(task)
        
        return list(batches.values())
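
A sketch of the batching pattern in practice (the task fields, prompt text, and dispatch step are illustrative):

REVIEWER_PROMPT = "You are a Code Reviewer NPE. Respond with structured JSON only."

pending_tasks = [
    {"system_prompt_hash": "rev-v1", "project_id": "acme-api",
     "project_context": "FastAPI service, Python 3.11",
     "diff_summary": "Add rate limiting middleware",
     "instructions": "Review the attached diff for bugs and security issues."},
    # ... more queued reviews for the same project ...
]

# Group tasks so identical prompt prefixes hit Venice's cache
for batch in CacheOptimizer.batch_similar_tasks(pending_tasks):
    for task in batch:
        messages = CacheOptimizer.build_cacheable_prompt(
            system_prompt=REVIEWER_PROMPT,             # static -> cached after call 1
            project_context=task["project_context"],   # stable within the batch
            task_context=task["diff_summary"],         # varies per task
            user_message=task["instructions"],
        )
        # ... dispatch messages to grok-41-fast ...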

4. Tool Architecture

4.1 Current Tools

| Tool | Version | Purpose | Used By |
|---|---|---|---|
| gitea_dev | 1.1.0 | File ops, branches, PRs, issues | Coder NPEs |
| gitea_admin | 1.1.0 | Teams, permissions, org management | PM NPEs |
| venice_info | 1.0.0 | Model discovery, cost tracking | All NPEs |
| editorial_pipeline | 1.0.0 | Content creation workflow | Editorial NPEs |

4.2 Required New Tools

4.2.1 Cost Tracker Tool (cost_tracker.py)

Purpose: Real-time cost monitoring and budget enforcement.

from pydantic import BaseModel, Field

class Valves(BaseModel):
    VENICE_API_KEY: str = Field(default="", description="Venice API key")
    DAILY_BUDGET: float = Field(default=8.1, description="Daily DIEM budget")
    AUTOMATION_RESERVE: float = Field(default=0.5, description="Reserve for automation")
    ESCALATION_RESERVE: float = Field(default=0.1, description="Reserve for Claude escalations")
    WARNING_THRESHOLD: float = Field(default=0.8, description="Warn at this % of budget")

class Tools:
    async def get_balance(self) -> str:
        """Get current Venice DIEM balance."""
    
    async def get_remaining_today(self) -> str:
        """Get remaining budget for today (resets 19:00 EST)."""
    
    async def estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> str:
        """Estimate DIEM cost for a completion."""
    
    async def can_afford(self, estimated_cost: float, include_reserve: bool = True) -> str:
        """Check if operation fits within budget."""
    
    async def record_cost(self, amount: float, model: str, npe_id: str, project_id: str = None) -> str:
        """Record actual cost after operation."""
    
    async def get_daily_report(self) -> str:
        """Get today's spend breakdown by model and NPE."""
    
    async def check_budget_alerts(self) -> str:
        """Check for budget warnings and return any alerts."""

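A minimal sketch of how can_afford could be implemented against the valves above (self.valves access follows the usual Open WebUI tool pattern; _get_spent_today is an assumed internal ledger lookup, not an existing API):

    async def can_afford(self, estimated_cost: float, include_reserve: bool = True) -> str:
        """Check if operation fits within budget."""
        import json
        spent = await self._get_spent_today()  # assumed helper backed by record_cost data
        reserve = (self.valves.AUTOMATION_RESERVE + self.valves.ESCALATION_RESERVE) if include_reserve else 0.0
        available = self.valves.DAILY_BUDGET - spent - reserve
        return json.dumps({
            "can_afford": estimated_cost <= available,
            "available": round(available, 4),
            "estimated_cost": estimated_cost,
        })
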
4.2.2 Project Manager Tool (project_manager.py)

Purpose: Manage development projects and work items.

class Tools:
    # Project CRUD
    async def create_project(self, name: str, description: str, daily_budget: float = 1.0) -> str
    async def get_project(self, project_id: str) -> str
    async def list_projects(self, status: str = "active") -> str
    async def update_project(self, project_id: str, **updates) -> str
    
    # Work Item Management
    async def create_work_item(self, project_id: str, title: str, item_type: str, assigned_model: str = None) -> str
    async def get_work_item(self, item_id: str) -> str
    async def list_work_items(self, project_id: str, status: str = "open") -> str
    async def update_work_item(self, item_id: str, **updates) -> str
    async def add_comment(self, item_id: str, comment: str, author: str) -> str
    
    # Budget Tracking
    async def get_project_budget(self, project_id: str) -> str
    async def record_project_expense(self, project_id: str, amount: float, description: str) -> str

4.2.3 NPE Manager Tool (npe_manager.py)

Purpose: Create and manage NPE identities.

class Tools:
    # NPE Lifecycle
    async def create_npe(self, name: str, role: str, model: str, persona: str, tools: list[str]) -> str
    async def get_npe(self, npe_id: str) -> str
    async def list_npes(self, role: str = None, status: str = "active") -> str
    async def update_npe(self, npe_id: str, **updates) -> str
    async def deactivate_npe(self, npe_id: str) -> str
    
    # Activity Tracking
    async def get_npe_activity(self, npe_id: str, days: int = 7) -> str
    async def get_npe_cost_report(self, npe_id: str, period: str = "today") -> str

4.2.4 Workflow Engine Tool (workflow_engine.py)

Purpose: Execute and monitor multi-step workflows.

class Tools:
    # Workflow Execution
    async def start_workflow(self, workflow_type: str, params: dict, project_id: str = None) -> str
    async def get_workflow_status(self, workflow_id: str) -> str
    async def complete_step(self, workflow_id: str, step_id: str, result: dict) -> str
    async def fail_step(self, workflow_id: str, step_id: str, error: str) -> str
    
    # Circuit Breaker
    async def check_circuit(self, workflow_id: str) -> str
    async def trip_circuit(self, workflow_id: str, reason: str) -> str
    async def reset_circuit(self, workflow_id: str) -> str
    
    # Escalation
    async def escalate(self, workflow_id: str, reason: str, to_model: str = "claude-opus-4-5") -> str

4.3 Tool Permission Matrix

Tool Orchestrator PM Coder Reviewer
cost_tracker R - -
project_manager R R
npe_manager - - -
workflow_engine - -
gitea_dev (read)
gitea_dev (write) - - -
gitea_admin - -
venice_info

5. NPE Personas & Roles

5.1 NPE Identity Structure

@dataclass
class NPEIdentity:
    # Core Identity
    id: str                      # e.g., "npe-pm-main"
    name: str                    # e.g., "Project Manager - Main"
    role: str                    # orchestrator, pm, coder, reviewer
    status: str                  # active, suspended, archived
    
    # Model Configuration
    base_model: str              # Venice model ID
    tier: int                    # 1-4 based on cost
    
    # Context Limits
    max_context: int             # Max input tokens
    target_output: int           # Target output tokens
    
    # Budget
    daily_budget: float          # DIEM limit per day
    spent_today: float           # Running total
    
    # Tools
    enabled_tools: list[str]     # Tool IDs this NPE can use
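
For example, the Orchestrator persona from Section 5.2 maps onto this structure roughly as follows (target_output and enabled_tools are illustrative, not prescribed):

orchestrator = NPEIdentity(
    id="npe-orchestrator",
    name="Orchestrator",
    role="orchestrator",
    status="active",
    base_model="grok-41-fast",
    tier=1,
    max_context=4_000,           # matches CONTEXT_LIMITS["orchestrator"]
    target_output=800,           # illustrative
    daily_budget=0.10,           # ~96 calls/day x 0.001 DIEM (Section 8.1)
    spent_today=0.0,
    enabled_tools=["cost_tracker", "workflow_engine", "venice_info"],  # illustrative
)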

5.2 Orchestrator Persona

ID: npe-orchestrator
Model: grok-41-fast (Tier 1)
Cost: ~0.003 DIEM/call
Context Limit: 4,000 tokens

# System Prompt: Orchestrator NPE

You are the Orchestrator, responsible for coordinating all automated development work.

## Core Responsibilities
1. Receive triggers from cron jobs and webhooks
2. Route work to appropriate NPEs
3. Monitor workflow progress
4. Handle escalations
5. Manage budget allocation

## Constraints
- You do NOT perform work yourself
- You MUST check budget before spawning work
- You MUST use structured JSON for all outputs
- You MUST keep context under 4,000 tokens

## Output Format
All outputs must be valid JSON:

### Spawn Work
{
  "action": "spawn_workflow",
  "workflow_type": "code_review|feature|bugfix",
  "project_id": "string",
  "assigned_npe": "npe-id",
  "budget_limit": 0.5,
  "priority": "high|medium|low"
}

### Route Escalation
{
  "action": "escalate",
  "workflow_id": "string",
  "reason": "string",
  "to_model": "claude-opus-4-5",
  "context_summary": "string (max 500 words)"
}

### Budget Check
{
  "action": "budget_check",
  "remaining": 5.5,
  "can_proceed": true,
  "warnings": []
}

## Decision Rules
1. If remaining budget < 0.5 DIEM: STOP all non-critical work
2. If task is security-related: Route to Claude
3. If task is simple routing: Do it yourself (no spawn needed)
4. If stuck for > 30 minutes: Escalate

5.3 PM Persona

ID: npe-pm-{project}
Model: grok-41-fast (Tier 1)
Cost: ~0.003 DIEM/call
Context Limit: 6,000 tokens

# System Prompt: Project Manager NPE

You are a Project Manager responsible for coordinating development work.

## Core Responsibilities
1. Break down requirements into work items
2. Assign work to Coder NPEs
3. Review completed work
4. Track progress and budget

## Constraints
- You do NOT write code
- You do NOT modify files
- You MUST check project budget before assigning work
- You MUST use structured JSON for work assignments

## Work Assignment Format
{
  "action": "assign_work",
  "work_item": {
    "id": "WI-{timestamp}",
    "title": "Brief title",
    "description": "Requirements in 200 words or less",
    "type": "feature|bugfix|refactor",
    "assigned_to": "npe-coder-{specialty}",
    "estimated_tokens": 5000,
    "files_to_modify": ["path/to/file.py"],
    "acceptance_criteria": ["criterion 1"]
  }
}

## Review Format
{
  "action": "review_complete",
  "work_item_id": "WI-xxx",
  "verdict": "approve|request_changes|escalate",
  "feedback": "string (50 words max)"
}

5.4 Coder Persona

ID: npe-coder-{specialty}
Model: grok-code-fast-1 (Tier 1)
Cost: ~0.005 DIEM/call
Context Limit: 12,000 tokens

# System Prompt: Coder NPE

You are a Coder responsible for implementing assigned work items.

## Core Responsibilities
1. Read work item requirements
2. Examine existing code
3. Implement changes
4. Commit via Gitea tool

## Constraints
- You ONLY work on assigned items
- You do NOT make architectural decisions
- You MUST follow existing code style
- You MUST output structured JSON for commits

## Code Output Format
{
  "action": "commit_changes",
  "work_item_id": "WI-xxx",
  "changes": [
    {
      "file_path": "src/module/file.py",
      "action": "create|update|delete",
      "content": "full file content",
      "description": "what this change does (20 words max)"
    }
  ],
  "commit_message": "feat: description",
  "ready_for_review": true
}

## When Stuck
{
  "action": "request_help",
  "work_item_id": "WI-xxx",
  "blocker": "description (50 words max)",
  "attempted": ["approach 1", "approach 2"]
}

5.5 Reviewer Persona

ID: npe-reviewer-{specialty}
Model: grok-41-fast (Tier 1)
Cost: ~0.004 DIEM/call
Context Limit: 8,000 tokens

# System Prompt: Code Reviewer NPE

You are a Code Reviewer responsible for ensuring code quality.

## Core Responsibilities
1. Review code changes
2. Check for bugs, security issues, style violations
3. Provide actionable feedback
4. Approve or request changes

## Constraints
- You do NOT modify code
- You do NOT approve your own changes
- You MUST be specific and actionable
- You MUST output structured JSON

## Review Output Format
{
  "action": "review_complete",
  "work_item_id": "WI-xxx",
  "verdict": "approve|request_changes|escalate",
  "summary": "one line summary",
  "issues": [
    {
      "severity": "critical|major|minor",
      "file": "path",
      "line": 42,
      "issue": "what's wrong",
      "fix": "how to fix"
    }
  ],
  "security_concerns": [],
  "approved": true|false
}

## Escalation Triggers
- Security vulnerability
- Architectural concern
- >3 major issues

6. Cron & Scheduling

6.1 Architecture: Hybrid Scheduler

Based on your usage patterns, I recommend a hybrid scheduler:

┌─────────────────────────────────────────────────────────────────────┐
│                      SCHEDULER ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  KUBERNETES CRONJOBS                                                │
│  ═══════════════════                                                │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │              MASTER SCHEDULER (Every 15 min)                 │   │
│  │                                                              │   │
│  │  • Health check all systems                                  │   │
│  │  • Process pending triggers                                  │   │
│  │  • Check for stuck workflows                                 │   │
│  │  • Route escalations                                         │   │
│  │                                                              │   │
│  │  Cost: ~0.003 DIEM × 4/hour = 0.012 DIEM/hour               │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐          │
│  │ BURN WINDOW   │  │    DAILY      │  │    WEEKLY     │          │
│  │ (18:45 EST)   │  │ (00:00 UTC)   │  │ (Sun 06:00)   │          │
│  │               │  │               │  │               │          │
│  │ Use surplus   │  │ Cleanup       │  │ Full report   │          │
│  │ before reset  │  │ Archive       │  │ Cost analysis │          │
│  │               │  │ Reset budgets │  │               │          │
│  └───────────────┘  └───────────────┘  └───────────────┘          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

6.2 Master Scheduler Script

#!/usr/bin/env python3
"""
Master Scheduler for NPE Orchestration.
Runs every 15 minutes via Kubernetes CronJob.
"""

import asyncio
import os
from datetime import datetime, timezone
from typing import Optional

import httpx

# Configuration
OWUI_URL = os.environ["OWUI_URL"]
OWUI_TOKEN = os.environ["OWUI_TOKEN"]
ORCHESTRATOR_CHAT_ID = os.environ.get("ORCHESTRATOR_CHAT_ID")
EST_OFFSET = -5  # EST timezone offset


async def main():
    """Main scheduler loop."""
    async with httpx.AsyncClient(
        base_url=OWUI_URL,
        headers={"Authorization": f"Bearer {OWUI_TOKEN}"},
        timeout=60.0
    ) as client:
        
        now = datetime.now(timezone.utc)
        hour_est = (now.hour + EST_OFFSET) % 24
        
        # 1. Always: Health check
        health = await check_system_health(client)
        if not health["ok"]:
            await alert_admin(client, f"System unhealthy: {health['issues']}")
            return
        
        # 2. Always: Budget check
        budget = await get_budget_status(client)
        log_budget(budget)
        
        if budget["remaining"] < 0.5:
            await alert_admin(client, f"Budget critical: {budget['remaining']:.2f} DIEM")
            # Don't stop - still need to process escalations
        
        # 3. Conditional: Process work based on hour
        if 22 <= hour_est or hour_est < 7:
            # Night automation window (22:00 - 07:00 EST)
            await process_automation_queue(client, budget)
        elif hour_est == 18 and now.minute >= 45:
            # Burn window (18:45 - 19:00 EST)
            await run_burn_window(client, budget)
        else:
            # Daytime - only process high-priority triggers
            await process_high_priority_only(client, budget)
        
        # 4. Always: Check for stuck workflows
        stuck = await find_stuck_workflows(client, max_age_minutes=30)
        for workflow in stuck:
            await handle_stuck_workflow(client, workflow)
        
        # 5. Always: Route pending escalations
        escalations = await get_pending_escalations(client)
        for escalation in escalations:
            await route_escalation(client, escalation, budget)


async def check_system_health(client: httpx.AsyncClient) -> dict:
    """Verify all system components are operational."""
    issues = []
    
    # Check Open WebUI
    try:
        resp = await client.get("/health")
        if resp.status_code != 200:
            issues.append("Open WebUI unhealthy")
    except Exception as e:
        issues.append(f"Open WebUI unreachable: {e}")
    
    # Check Venice balance
    try:
        resp = await client.get("/api/v1/venice/balance")
        balance = resp.json().get("balance", 0)
        if balance < 0.1:
            issues.append(f"Venice balance critical: {balance}")
    except Exception as e:
        issues.append(f"Venice check failed: {e}")
    
    return {"ok": len(issues) == 0, "issues": issues}


async def get_budget_status(client: httpx.AsyncClient) -> dict:
    """Get current budget status."""
    # Calculate time until reset (19:00 EST = 00:00 UTC)
    now = datetime.now(timezone.utc)
    hours_until_reset = (24 - now.hour) % 24
    
    try:
        resp = await client.get("/api/v1/venice/balance")
        data = resp.json()
        remaining = data.get("balance", 0)
        
        # Get today's spend from cost tracker
        spend_resp = await client.get("/api/v1/cost-tracker/today")
        spent_today = spend_resp.json().get("total", 0)
    except Exception:
        remaining = 8.1  # Assume full budget on error
        spent_today = 0
    
    return {
        "remaining": remaining,
        "spent_today": spent_today,
        "hours_until_reset": hours_until_reset,
        "automation_reserve": 0.5,
        "escalation_reserve": 0.1,
        "available_for_work": remaining - 0.6  # reserves
    }


async def run_burn_window(client: httpx.AsyncClient, budget: dict):
    """
    Use surplus DIEM before 19:00 EST reset.
    
    ROI: 0.10 DIEM spend can utilize 1.0+ DIEM that would be lost.
    """
    surplus = budget["remaining"] - 2.0  # Keep 2.0 DIEM for tomorrow morning
    
    if surplus < 0.10:
        print(f"No surplus to burn: {budget['remaining']:.2f} DIEM")
        return
    
    print(f"Burn window: {surplus:.2f} DIEM surplus available")
    
    tasks = []
    
    # Priority 1: Summarize active workflows (saves context tomorrow)
    if surplus >= 0.05:
        tasks.append(("summarize_workflows", 0.05))
        surplus -= 0.05
    
    # Priority 2: Pre-plan tomorrow's tasks
    if surplus >= 0.08:
        tasks.append(("pre_plan", 0.08))
        surplus -= 0.08
    
    # Priority 3: Run pending reviews
    if surplus >= 0.05:
        tasks.append(("pending_reviews", surplus))
    
    for task, budget_limit in tasks:
        await dispatch_burn_task(client, task, budget_limit)


async def process_automation_queue(client: httpx.AsyncClient, budget: dict):
    """Process automation tasks during night window."""
    if budget["available_for_work"] < 0.1:
        print("Budget too low for automation")
        return
    
    # Get pending automation tasks
    triggers = await get_pending_triggers(client)
    
    for trigger in triggers:
        # Estimate cost
        estimated = estimate_trigger_cost(trigger)
        
        if estimated > budget["available_for_work"]:
            print(f"Skipping {trigger['type']}: cost {estimated} > available {budget['available_for_work']}")
            continue
        
        await dispatch_trigger(client, trigger)
        budget["available_for_work"] -= estimated


def estimate_trigger_cost(trigger: dict) -> float:
    """Estimate DIEM cost for a trigger."""
    costs = {
        "code_review": 0.015,      # Grok review + routing
        "feature_request": 0.050,  # PM + Coder + Review
        "bug_fix": 0.030,          # Triage + Fix + Review
        "cleanup": 0.005,          # Simple Grok task
        "health_check": 0.003,     # Minimal
    }
    return costs.get(trigger.get("type"), 0.010)


if __name__ == "__main__":
    asyncio.run(main())

6.3 Cron Schedule (Kubernetes)

# Master Scheduler - Every 15 minutes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: npe-master-scheduler
  namespace: open-webui
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scheduler
            image: python:3.11-slim
            command: ["python", "/scripts/master_scheduler.py"]
            envFrom:
            - secretRef:
                name: npe-secrets
          restartPolicy: OnFailure
          
---
# Burn Window - 18:45 EST (23:45 UTC)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: npe-burn-window
spec:
  schedule: "45 23 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: burner
            image: python:3.11-slim
            command: ["python", "/scripts/burn_window.py"]
          restartPolicy: OnFailure

---
# Daily Maintenance - 00:00 UTC (19:00 EST - after reset)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: npe-daily-maintenance
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: maintenance
            image: python:3.11-slim
            command: ["python", "/scripts/daily_maintenance.py"]
          restartPolicy: OnFailure

7. Workflow Patterns

7.1 Code Review Workflow

Trigger: Gitea webhook on PR create/update
Cost: ~0.015 DIEM
Duration: 2-5 minutes

┌─────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────┐
│  START  │───▶│ Load Context │───▶│    Review    │───▶│ Verdict │
│         │    │ (Grok, 0.003)│    │ (Grok, 0.008)│    │         │
└─────────┘    └──────────────┘    └──────────────┘    └────┬────┘
                                                             │
                    ┌────────────────────────────────────────┤
                    │                    │                   │
                    ▼                    ▼                   ▼
              ┌──────────┐        ┌───────────┐       ┌───────────┐
              │ APPROVE  │        │  REQUEST  │       │ ESCALATE  │
              │          │        │  CHANGES  │       │ to Claude │
              │ Post     │        │           │       │ (0.054)   │
              │ Comment  │        │ Post      │       └───────────┘
              │ (0.002)  │        │ Feedback  │
              └──────────┘        │ (0.002)   │
                                  └───────────┘
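
The same flow can be expressed as step definitions for the workflow engine; a sketch, with step names and budget caps taken from the diagram above (the dictionary structure itself is illustrative):

CODE_REVIEW_STEPS = [
    {"step": "load_context", "model": "grok-41-fast", "budget": 0.003},
    {"step": "review",       "model": "grok-41-fast", "budget": 0.008},
    {"step": "post_verdict", "model": "grok-41-fast", "budget": 0.002},
]
# Escalation path: hand off to claude-opus-4-5, capped at ~0.054 DIEM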

7.2 Feature Development Workflow

Trigger: Issue with label "feature"
Cost: ~0.050 DIEM
Duration: 15-30 minutes

┌─────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  START  │───▶│ PM: Analyze  │───▶│ PM: Breakdown│───▶│   Coder:     │
│         │    │ (Grok, 0.003)│    │ (Grok, 0.003)│    │  Implement   │
└─────────┘    └──────────────┘    └──────────────┘    │ (Grok, 0.015)│
                                                        └──────┬───────┘
                                                               │
                    ┌──────────────────────────────────────────┤
                    │                                          │
                    ▼                                          ▼
              ┌───────────┐                             ┌───────────┐
              │  Review   │◀────── Revision Loop ──────│  FAILED   │
              │  (0.008)  │        (max 3x)            │           │
              └─────┬─────┘                            └───────────┘
                    │
                    ▼
              ┌───────────┐
              │ Create PR │
              │  (0.003)  │
              └───────────┘

7.3 Workflow State Machine

from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

class WorkflowState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    WAITING_INPUT = "waiting_input"
    STEP_FAILED = "step_failed"
    ESCALATED = "escalated"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class CircuitBreaker:
    max_failures: int = 3
    failure_count: int = 0
    last_failure: Optional[datetime] = None
    cooldown_seconds: int = 300
    
    def record_failure(self):
        self.failure_count += 1
        self.last_failure = datetime.now()
    
    def is_open(self) -> bool:
        if self.failure_count >= self.max_failures:
            if self.last_failure:
                elapsed = (datetime.now() - self.last_failure).total_seconds()
                if elapsed < self.cooldown_seconds:
                    return True
                # Reset after cooldown
                self.failure_count = 0
        return False
    
    def should_escalate(self) -> bool:
        return self.failure_count >= (self.max_failures - 1)

@dataclass
class Workflow:
    id: str
    type: str
    state: WorkflowState = WorkflowState.PENDING
    current_step: str = ""
    assigned_npe: str = ""
    project_id: Optional[str] = None
    budget_limit: float = 1.0
    budget_spent: float = 0.0
    circuit: CircuitBreaker = field(default_factory=CircuitBreaker)
    steps_completed: list = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    
    def can_proceed(self) -> tuple[bool, str]:
        if self.circuit.is_open():
            return False, "Circuit breaker open"
        if self.budget_spent >= self.budget_limit:
            return False, "Budget exhausted"
        return True, "OK"
    
    def record_cost(self, amount: float):
        self.budget_spent += amount
        self.updated_at = datetime.now()
    
    def complete_step(self, step_id: str, result: dict):
        self.steps_completed.append({
            "step_id": step_id,
            "completed_at": datetime.now().isoformat(),
            "result": result
        })
        self.updated_at = datetime.now()
    
    def fail_step(self, step_id: str, error: str):
        self.circuit.record_failure()
        self.state = WorkflowState.STEP_FAILED
        self.updated_at = datetime.now()
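
A short sketch of how a scheduler step might drive this state machine (the step costs are illustrative):

wf = Workflow(id="wf-001", type="code_review", budget_limit=0.05)

ok, reason = wf.can_proceed()
if ok:
    wf.record_cost(0.008)   # actual cost reported for the review step
    wf.complete_step("review", {"verdict": "approve"})
    wf.state = WorkflowState.COMPLETED
else:
    wf.fail_step("review", reason)
    if wf.circuit.should_escalate():
        wf.state = WorkflowState.ESCALATED   # route to the Orchestrator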

8. Cost Management

8.1 Budget Allocation (Based on Actual Data)

┌─────────────────────────────────────────────────────────────────────┐
│                    DAILY BUDGET ALLOCATION                          │
│                    (8.1 DIEM total)                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ INTERACTIVE WORK                                   5.50 DIEM  │ │
│  │ (68% of budget)                                               │ │
│  │                                                               │ │
│  │   Claude sessions (1-2/day)         ~1.00 DIEM               │ │
│  │   Grok chat (continuous)            ~2.50 DIEM               │ │
│  │   Qwen bulk processing              ~1.00 DIEM               │ │
│  │   Image generation                  ~1.00 DIEM               │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ NPE AUTOMATION                                     0.25 DIEM  │ │
│  │ (3% of budget)                                                │ │
│  │                                                               │ │
│  │   PM checks (12/day × 0.003)        ~0.04 DIEM               │ │
│  │   Coder tasks (10/day × 0.005)      ~0.05 DIEM               │ │
│  │   Reviews (8/day × 0.004)           ~0.03 DIEM               │ │
│  │   Orchestrator (96/day × 0.001)     ~0.10 DIEM               │ │
│  │   Buffer                            ~0.03 DIEM               │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ RESERVES                                           0.60 DIEM  │ │
│  │ (7% of budget)                                                │ │
│  │                                                               │ │
│  │   Escalation reserve (Claude)       ~0.10 DIEM               │ │
│  │   Automation reserve                ~0.50 DIEM               │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ BUFFER / BURN WINDOW                              1.75 DIEM  │ │
│  │ (22% of budget)                                               │ │
│  │                                                               │ │
│  │   Available for burn window automation if unused              │ │
│  │   Target: Use 80%+ of daily budget                           │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

8.2 Cost Enforcement

from typing import Optional

class BudgetEnforcer:
    """Enforce budget limits at multiple levels."""
    
    def __init__(self, daily_budget: float = 8.1):
        self.daily_budget = daily_budget
        self.reserves = {
            "escalation": 0.10,
            "automation": 0.50,
        }
    
    async def can_proceed(
        self,
        estimated_cost: float,
        npe_id: str,
        project_id: Optional[str] = None,
        use_reserves: bool = False
    ) -> tuple[bool, str]:
        """Check if operation can proceed within budget."""
        
        # Get current balance
        balance = await self.get_venice_balance()
        
        # Calculate available
        reserved = sum(self.reserves.values()) if not use_reserves else 0
        available = balance - reserved
        
        # Check global
        if available < estimated_cost:
            return False, f"Insufficient budget: {available:.3f} < {estimated_cost:.3f}"
        
        # Check per-NPE daily limit
        npe_spent = await self.get_npe_spent_today(npe_id)
        npe_limit = await self.get_npe_daily_limit(npe_id)
        
        if npe_spent + estimated_cost > npe_limit:
            return False, f"NPE budget exceeded: {npe_spent:.3f} + {estimated_cost:.3f} > {npe_limit:.3f}"
        
        # Check per-project if applicable
        if project_id:
            project_spent = await self.get_project_spent_today(project_id)
            project_limit = await self.get_project_daily_limit(project_id)
            
            if project_spent + estimated_cost > project_limit:
                return False, f"Project budget exceeded"
        
        return True, "OK"
    
    async def record_and_verify(
        self,
        actual_cost: float,
        estimated_cost: float,
        npe_id: str,
        operation: str
    ):
        """Record cost and check for anomalies."""
        # Record
        await self.record_cost(actual_cost, npe_id, operation)
        
        # Check for cost overrun
        if actual_cost > estimated_cost * 1.5:
            await self.alert(
                f"Cost overrun: {operation} estimated {estimated_cost:.4f}, actual {actual_cost:.4f}"
            )
        
        # Check for budget warnings
        remaining = await self.get_remaining_today()
        if remaining < self.daily_budget * 0.2:
            await self.alert(f"Budget warning: only {remaining:.2f} DIEM remaining today")
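
A usage sketch from inside a workflow step (the NPE id and costs are illustrative; the enforcer's data-access methods are the stubs referenced above):

async def run_review_step(enforcer: BudgetEnforcer):
    estimated = 0.015
    ok, reason = await enforcer.can_proceed(estimated, npe_id="npe-reviewer-api")
    if not ok:
        print(f"Blocked: {reason}")
        return
    # ... run the review via Venice ...
    actual = 0.013  # taken from the completion's billing data in practice
    await enforcer.record_and_verify(actual, estimated, "npe-reviewer-api", "code_review")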

8.3 Automation Cost Projections

Based on your actual rates:

| Scenario | Model | Per Call | Calls/Day | Daily Cost | Monthly |
|---|---|---|---|---|---|
| PM Check | Grok | 0.0021 | 12 | 0.0252 | 0.76 |
| Coder Task | Grok-Code | 0.0050 | 10 | 0.0500 | 1.50 |
| Code Review | Grok | 0.0035 | 8 | 0.0280 | 0.84 |
| Orchestrator | Grok | 0.0010 | 96 | 0.0960 | 2.88 |
| Deep Analysis | Kimi | 0.0120 | 3 | 0.0360 | 1.08 |
| Escalation | Claude | 0.0540 | 1 | 0.0540 | 1.62 |
| TOTAL | | | | 0.2892 | 8.68 |

Conclusion: Full NPE automation costs 0.29 DIEM/day (3.6% of budget), leaving 7.81 DIEM for interactive work.


9. Implementation Roadmap

Phase 1: Foundation (Week 1)

Goal: Basic infrastructure working
Budget Impact: None (setup only)

  • Deploy corrected Gitea tools (v1.1.0)
  • Create cost_tracker tool
  • Set up 2 NPEs manually:
    • Orchestrator (Grok)
    • PM (Grok)
  • Create master scheduler (health check only)
  • Test: Manual trigger → Orchestrator routes to PM

Success Criteria:

  • Orchestrator receives triggers
  • Cost tracked per operation
  • No automation yet - just infrastructure

Phase 2: Automation (Week 2)

Goal: Night automation working
Budget Impact: +0.05 DIEM/day

  • Add Coder NPE (Grok-Code)
  • Add Reviewer NPE (Grok)
  • Implement Code Review workflow
  • Enable night automation (22:00-07:00 EST)
  • Test: PR created → auto-review → comment posted

Success Criteria:

  • PRs reviewed automatically during night window
  • Reviews posted as Gitea comments
  • Circuit breaker prevents loops

Phase 3: Full Workflows (Week 3-4)

Goal: Complete workflow coverage
Budget Impact: +0.20 DIEM/day

  • Create project_manager tool
  • Create workflow_engine tool
  • Implement Feature Development workflow
  • Implement Bug Fix workflow
  • Enable burn window automation
  • Test: Full feature cycle from issue to PR

Success Criteria:

  • Features developed from issue to merged PR
  • Budget tracked per project
  • Burn window uses surplus effectively

Phase 4: Escalation (Week 5)

Goal: Claude integration for complex cases
Budget Impact: +0.05 DIEM/day

  • Implement escalation paths
  • Create context compression for Claude calls
  • Add security review workflow
  • Test: Complex review → escalate → Claude response

Success Criteria:

  • Escalation triggers when needed
  • Claude calls stay under 0.06 DIEM each
  • Responses routed back to workflow

Phase 5: Optimization (Week 6+)

Goal: Cost optimization and scaling
Budget Impact: -0.05 DIEM/day (savings)

  • Implement cache optimization strategy
  • Add context compression to all NPEs
  • Tune model selection based on success rates
  • Add metrics dashboard
  • Document runbooks

Success Criteria:

  • Cache hit rates > 50% for Grok
  • System runs 7 days unattended
  • Total automation < 0.25 DIEM/day

10. Open Questions

10.1 Unresolved Decisions

| Question | Options | Recommendation | Notes |
|---|---|---|---|
| State storage | OpenWebUI folders vs. SQLite | OpenWebUI folders | Simpler, no new deps |
| Token rotation | 30 vs 90 days | 90 days | Manual for now |
| Max concurrent workflows | 3 vs 5 vs 10 | 5 | Test and adjust |
| Chat retention | 7 vs 30 vs 90 days | 30 days | Balance audit vs. storage |

10.2 Your Input Needed

  1. Gitea Webhooks: How are webhooks exposed? Need ingress path for triggers.

  2. Claude API Key: Using Venice's Claude or direct Anthropic? Venice is simpler.

  3. Multi-file Commits: Do you need atomic batch commits in gitea_dev?

  4. Test Execution: Skip CI/CD for Phase 1-5? Add later?

  5. Human Approval UI: Chat-only for now? Dashboard later?

10.3 Known Limitations

  1. No real-time collaboration - NPEs work asynchronously
  2. No visual review - Can't review UI changes
  3. Venice dependency - All LLM calls through Venice
  4. Single Gitea instance - No multi-repo federation yet

Appendix A: Quick Reference

Model Rates (DIEM per 1M tokens)

| Model | Input | Output | Cache | Effective |
|---|---|---|---|---|
| Qwen-Instruct | 0.15 | 0.27 | - | 0.06 |
| Grok-Code | 0.25 | 1.87 | 0.03 | 0.07 |
| Grok-41-Fast | 0.50 | 1.25 | 0.125 | 0.19 |
| MiniMax-M21 | 0.40 | 1.60 | 0.04 | 0.22 |
| Kimi-K2 | 0.75 | 3.20 | 0.375 | 0.42 |
| Claude-Opus | 6.00 | 30.00 | - | 5.28 |

Context Limits

| Role | Max Tokens | Warning At |
|---|---|---|
| Orchestrator | 4,000 | 3,000 |
| PM | 6,000 | 5,000 |
| Coder | 12,000 | 10,000 |
| Reviewer | 8,000 | 6,000 |

Budget Summary

| Category | Daily DIEM | % of Budget |
|---|---|---|
| Interactive | 5.50 | 68% |
| Automation | 0.25 | 3% |
| Reserves | 0.60 | 7% |
| Buffer/Burn | 1.75 | 22% |
| Total | 8.10 | 100% |

Document Status: RFC v2.0 - Based on 30-day billing analysis
Last Updated: January 11, 2026
Data Source: 13,582 Venice.ai transactions