1692 lines
64 KiB
Markdown
1692 lines
64 KiB
Markdown
# Multi-Agent Orchestration System Design v2.0
|
||
|
||
**Version:** 2.0.0
|
||
**Author:** Jeff Smith + Claude
|
||
**Date:** January 11, 2026
|
||
**Status:** RFC (Request for Comments)
|
||
**Data Source:** Venice.ai billing history Dec 12, 2025 - Jan 11, 2026 (13,582 transactions)
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
This document describes the architecture for a multi-model AI development system built on Open WebUI, Venice.ai, and Gitea. Based on 30 days of actual billing data, we demonstrate that full NPE automation costs **0.21 DIEM/day** (2.6% of the 8.1 DIEM daily budget), leaving **7.89 DIEM** for interactive work.
|
||
|
||
**Core Thesis:** Your actual usage patterns prove orchestrated automation is not just feasible—it's virtually free. The real challenge is context management and cache optimization, not model costs.
|
||
|
||
---
|
||
|
||
## Table of Contents
|
||
|
||
1. [Billing Data Analysis](#1-billing-data-analysis)
|
||
2. [Model Selection Strategy](#2-model-selection-strategy)
|
||
3. [Context Management](#3-context-management)
|
||
4. [Tool Architecture](#4-tool-architecture)
|
||
5. [NPE Personas & Roles](#5-npe-personas--roles)
|
||
6. [Cron & Scheduling](#6-cron--scheduling)
|
||
7. [Workflow Patterns](#7-workflow-patterns)
|
||
8. [Cost Management](#8-cost-management)
|
||
9. [Implementation Roadmap](#9-implementation-roadmap)
|
||
10. [Open Questions](#10-open-questions)
|
||
|
||
---
|
||
|
||
## 1. Billing Data Analysis
|
||
|
||
### 1.1 30-Day Summary
|
||
|
||
```
|
||
Period: Dec 12, 2025 - Jan 11, 2026
|
||
Total Records: 13,582 billing events
|
||
Total Spend: 61.24 DIEM + 2.86 USD
|
||
Days Active: 18 days
|
||
Average Daily: 3.56 DIEM/day
|
||
Max Daily: 9.96 DIEM (Jan 2, 2026)
|
||
Median Daily: 3.72 DIEM/day
|
||
```
|
||
|
||
**Budget Analysis:**
|
||
- Daily Budget: 8.1 DIEM (staked)
|
||
- Average Spend: 3.56 DIEM/day
|
||
- **Average Surplus: 4.54 DIEM/day (56% unutilized)**
|
||
- This surplus is lost at 19:00 EST reset
|
||
|
||
### 1.2 Spend by Model Family
|
||
|
||
```
|
||
┌────────────────────────────────────────────────────────────────────┐
|
||
│ ACTUAL SPEND BY MODEL (30 days) │
|
||
├────────────────────────────────────────────────────────────────────┤
|
||
│ │
|
||
│ GLM-4.6 ████████████████████████████ 16.66 DIEM (26.0%) │
|
||
│ Qwen ████████████████████████ 15.60 DIEM (24.3%) │
|
||
│ Claude-Opus-4.5 ██████████████████ 11.86 DIEM (18.5%) │
|
||
│ Grok-41-Fast ██████████ 6.71 DIEM (10.5%) │
|
||
│ Image-Gen █████████ 6.01 DIEM (9.4%) │
|
||
│ MiniMax-M21 █████ 3.31 DIEM (5.2%) │
|
||
│ Kimi-K2 ██ 1.53 DIEM (2.4%) │
|
||
│ GLM-4.7 ██ 1.25 DIEM (2.0%) │
|
||
│ Other █ 1.17 DIEM (1.7%) │
|
||
│ │
|
||
│ TOTAL 64.10 DIEM │
|
||
│ │
|
||
└────────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### 1.3 Actual Effective Rates (DIEM per 1M Tokens)
|
||
|
||
From your billing data, sorted by cost efficiency:
|
||
|
||
| Model | Input Rate | Output Rate | Cache Rate | **Effective Rate** | Calls | Total Cost |
|
||
|-------|------------|-------------|------------|-------------------|-------|------------|
|
||
| Qwen-Instruct | $0.071 | $0.27 | - | **0.0626** | 408 | 0.92 |
|
||
| Grok-Code | $0.140 | $1.87 | $0.030 | **0.0721** | 611 | 1.67 |
|
||
| DeepSeek | $0.329 | $1.00 | $0.200 | **0.1825** | 36 | 0.10 |
|
||
| Grok-41-Fast | $0.314 | $1.25 | $0.125 | **0.1857** | 1,152 | 5.03 |
|
||
| MiniMax-M21 | $0.316 | $1.60 | $0.040 | **0.2232** | 360 | 3.31 |
|
||
| Qwen-Thinking | $0.450 | $3.50 | - | **0.3339** | 105 | 1.61 |
|
||
| Qwen-Coder | $0.750 | $3.00 | - | **0.3830** | 530 | 12.99 |
|
||
| Kimi-K2-Thinking | $0.595 | $3.20 | $0.375 | **0.4152** | 111 | 1.53 |
|
||
| GLM-4.6 | $0.850 | $2.75 | - | **0.4455** | 1,724 | 16.66 |
|
||
| Claude-Opus-4.5 | $6.000 | $30.00 | - | **5.2751** | 78 | 11.86 |
|
||
|
||
**Key Insights:**
|
||
1. **Grok is 28× cheaper than Claude** per token
|
||
2. **Qwen-Instruct is 84× cheaper than Claude** for bulk work
|
||
3. **Cache hits reduce Grok input costs by 75%**
|
||
4. Claude is only 78 calls but 18.5% of total spend
|
||
|
||
### 1.4 Context Size Distribution
|
||
|
||
Your actual context sizes reveal optimization opportunities:
|
||
|
||
```
|
||
GROK-41-FAST (1,152 calls):
|
||
Median: 5,291 tokens | P75: 7,217 | Max: 582,506
|
||
Distribution: 0-5k: 1671 | 5-10k: 1245 | 10-20k: 485 | 20-50k: 23 | 50k+: 17
|
||
✓ WELL MANAGED - 85% of calls under 10k tokens
|
||
|
||
KIMI-K2-THINKING (111 calls):
|
||
Median: 5,164 tokens | P75: 17,797 | Max: 48,893
|
||
Distribution: 0-5k: 144 | 5-10k: 39 | 10-20k: 83 | 20-50k: 34
|
||
⚠ CONTEXT BLEEDING - P75 jumps to 18k, needs pruning
|
||
|
||
MINIMAX-M21 (360 calls):
|
||
Median: 10,090 tokens | P75: 23,071 | Max: 169,708
|
||
Distribution: 0-5k: 211 | 5-10k: 202 | 10-20k: 181 | 20-50k: 198 | 50k+: 38
|
||
⚠ HEAVY CONTEXTS - 11% of calls over 50k tokens
|
||
|
||
QWEN-CODER (530 calls):
|
||
Median: 17,016 tokens | P75: 47,601 | Max: 253,462
|
||
Distribution: 0-5k: 340 | 5-10k: 124 | 10-20k: 82 | 20-50k: 284 | 50k+: 230
|
||
⚠ CODE CONTEXTS ARE LARGE - expected but optimize where possible
|
||
|
||
CLAUDE-OPUS-4.5 (78 calls):
|
||
Median: 7,198 tokens | P75: 11,956 | Max: 51,368
|
||
✓ REASONABLE - given high cost, context is well-controlled
|
||
```
|
||
|
||
### 1.5 Cache Efficiency Analysis
|
||
|
||
```
|
||
┌────────────────────────────────────────────────────────────────────┐
|
||
│ CACHE HIT RATES BY MODEL │
|
||
├────────────────────────────────────────────────────────────────────┤
|
||
│ │
|
||
│ GROK-CODE: ████████████████████████████████████ 67.8% │
|
||
│ Saved: $1.08 | Cache rate: $0.030 vs $0.250 full │
|
||
│ │
|
||
│ KIMI-K2: █████ 9.8% │
|
||
│ Saved: $0.05 | Cache rate: $0.375 vs $0.750 full │
|
||
│ │
|
||
│ GROK-41-FAST: █████ 8.8% │
|
||
│ Saved: $0.27 | Cache rate: $0.125 vs $0.500 full │
|
||
│ │
|
||
│ MINIMAX-M21: ████ 7.2% │
|
||
│ Saved: $0.16 | Cache rate: $0.040 vs $0.400 full │
|
||
│ │
|
||
└────────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
**Grok-Code's 67.8% cache hit rate** demonstrates what's possible with sequential operations using the same system prompt and context.
|
||
|
||
### 1.6 Hourly Usage Pattern (EST)
|
||
|
||
```
|
||
Hour (EST) | Spend | Activity Level
|
||
───────────┼──────────┼────────────────────────────────
|
||
00-03 | 0.00 | 💤 Dead
|
||
04-05 | 0.84 | 🌅 Early morning
|
||
06-08 | 9.86 | ☕ Morning peak
|
||
09-11 | 3.57 | 📊 Late morning
|
||
12-13 | 7.34 | 🍽️ Lunch peak
|
||
14-15 | 8.13 | 💻 Afternoon work
|
||
16-17 | 14.25 | 🔥 PEAK (4-6pm)
|
||
18-19 | 8.13 | 🌆 Evening
|
||
20-21 | 9.93 | 🌙 Night session
|
||
22-23 | 0.00 | 💤 Dead
|
||
|
||
AUTOMATION WINDOW: 22:00 - 07:00 EST (9 hours of minimal usage)
|
||
```
|
||
|
||
---
|
||
|
||
## 2. Model Selection Strategy
|
||
|
||
### 2.1 Tier Architecture (Data-Driven)
|
||
|
||
Based on actual billing patterns:
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────────────┐
|
||
│ MODEL SELECTION PYRAMID │
|
||
├─────────────────────────────────────────────────────────────────────┤
|
||
│ │
|
||
│ ┌─────────────┐ │
|
||
│ │ TIER 4 │ Claude Opus 4.5 │
|
||
│ │ ORACLE │ 5.28 DIEM/1M effective │
|
||
│ │ <2% │ Architecture, Security, │
|
||
│ └──────┬──────┘ Deadlock Resolution │
|
||
│ │ Budget: 0.05 DIEM/day │
|
||
│ │ │
|
||
│ ┌─────────┴─────────┐ │
|
||
│ │ TIER 3 │ Kimi-K2, Qwen-Thinking │
|
||
│ │ REASONING │ 0.33-0.42 DIEM/1M │
|
||
│ │ 5% │ Complex analysis, │
|
||
│ └─────────┬─────────┘ Multi-step reasoning │
|
||
│ │ Budget: 0.10 DIEM/day │
|
||
│ │ │
|
||
│ ┌───────────────┴───────────────┐ │
|
||
│ │ TIER 2 │ MiniMax, DeepSeek │
|
||
│ │ BALANCED │ 0.18-0.22 DIEM/1M │
|
||
│ │ 15% │ Standard tasks, │
|
||
│ └───────────────┬───────────────┘ Code generation │
|
||
│ │ Budget: 0.50 DIEM │
|
||
│ │ │
|
||
│ ┌────────────────────────────┴────────────────────────────┐ │
|
||
│ │ TIER 1 │ │
|
||
│ │ WORKHORSES │ │
|
||
│ │ 78% │ │
|
||
│ │ Grok-41-Fast: 0.19 DIEM/1M | Grok-Code: 0.07 DIEM/1M │ │
|
||
│ │ Qwen-Instruct: 0.06 DIEM/1M │ │
|
||
│ │ Routing, Quick checks, Bulk processing, PM tasks │ │
|
||
│ │ Budget: Remaining (~7.45 DIEM/day) │ │
|
||
│ └─────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
└─────────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### 2.2 Model Selection Matrix
|
||
|
||
| Task Type | Primary Model | Fallback | Cost/Call | Rationale |
|
||
|-----------|---------------|----------|-----------|-----------|
|
||
| **Routing/Dispatch** | Grok-41-Fast | Qwen-Instruct | 0.002 | Cheapest with cache |
|
||
| **PM Coordination** | Grok-41-Fast | MiniMax | 0.003 | Simple decisions |
|
||
| **Code Generation** | Grok-Code | Grok-41-Fast | 0.005 | 67% cache hits |
|
||
| **Code Review** | Grok-41-Fast | DeepSeek | 0.004 | Good reasoning |
|
||
| **Bulk Processing** | Qwen-Instruct | Grok-41-Fast | 0.001 | 0.06 DIEM/1M |
|
||
| **Complex Reasoning** | Kimi-K2 | Qwen-Thinking | 0.012 | Thinking models |
|
||
| **Architecture** | Claude-Opus | Kimi-K2 | 0.054 | Only when needed |
|
||
| **Security Review** | Claude-Opus | - | 0.054 | Non-negotiable |
|
||
|
||
### 2.3 Model Selection Logic
|
||
|
||
```python
|
||
from enum import Enum
|
||
from dataclasses import dataclass
|
||
from typing import Optional
|
||
|
||
class TaskComplexity(Enum):
|
||
TRIVIAL = 1 # Routing, yes/no decisions
|
||
SIMPLE = 2 # Single-step tasks
|
||
MODERATE = 3 # Multi-step, needs context
|
||
COMPLEX = 4 # Reasoning required
|
||
CRITICAL = 5 # Architecture, security
|
||
|
||
@dataclass
|
||
class ModelConfig:
|
||
id: str
|
||
input_rate: float # DIEM per 1M tokens
|
||
output_rate: float # DIEM per 1M tokens
|
||
cache_rate: float # DIEM per 1M tokens (0 if no cache)
|
||
max_context: int # Recommended max context
|
||
tier: int
|
||
|
||
MODELS = {
|
||
"qwen-instruct": ModelConfig("qwen3-235b-a22b-instruct-2507", 0.15, 0.27, 0, 32000, 1),
|
||
"grok-code": ModelConfig("grok-code-fast-1", 0.25, 1.87, 0.03, 16000, 1),
|
||
"grok-fast": ModelConfig("grok-41-fast", 0.50, 1.25, 0.125, 16000, 1),
|
||
"deepseek": ModelConfig("deepseek-chat", 0.50, 1.00, 0.20, 32000, 2),
|
||
"minimax": ModelConfig("minimax-m21", 0.40, 1.60, 0.04, 32000, 2),
|
||
"kimi": ModelConfig("kimi-k2", 0.75, 3.20, 0.375, 32000, 3),
|
||
"qwen-thinking": ModelConfig("qwen3-235b-a22b-thinking-2507", 0.45, 3.50, 0, 32000, 3),
|
||
"claude": ModelConfig("claude-opus-4-5", 6.00, 30.00, 0, 200000, 4),
|
||
}
|
||
|
||
def select_model(
|
||
task_type: str,
|
||
complexity: TaskComplexity,
|
||
budget_remaining: float,
|
||
context_size: int = 0
|
||
) -> str:
|
||
"""
|
||
Select optimal model based on task, complexity, and budget.
|
||
|
||
Returns Venice model ID.
|
||
"""
|
||
# Budget gates
|
||
if budget_remaining < 0.5:
|
||
return MODELS["qwen-instruct"].id # Emergency mode
|
||
|
||
if budget_remaining < 2.0:
|
||
# Low budget - force Tier 1
|
||
if complexity == TaskComplexity.CRITICAL:
|
||
return MODELS["kimi"].id # Downgrade from Claude
|
||
return MODELS["grok-fast"].id
|
||
|
||
# Task-based selection
|
||
selection_map = {
|
||
# Task type -> {complexity: model_key}
|
||
"routing": {
|
||
TaskComplexity.TRIVIAL: "qwen-instruct",
|
||
TaskComplexity.SIMPLE: "grok-fast",
|
||
TaskComplexity.MODERATE: "grok-fast",
|
||
},
|
||
"pm_coordination": {
|
||
TaskComplexity.TRIVIAL: "grok-fast",
|
||
TaskComplexity.SIMPLE: "grok-fast",
|
||
TaskComplexity.MODERATE: "minimax",
|
||
TaskComplexity.COMPLEX: "kimi",
|
||
},
|
||
"code_generation": {
|
||
TaskComplexity.SIMPLE: "grok-code",
|
||
TaskComplexity.MODERATE: "grok-code",
|
||
TaskComplexity.COMPLEX: "minimax",
|
||
TaskComplexity.CRITICAL: "kimi",
|
||
},
|
||
"code_review": {
|
||
TaskComplexity.SIMPLE: "grok-fast",
|
||
TaskComplexity.MODERATE: "grok-fast",
|
||
TaskComplexity.COMPLEX: "deepseek",
|
||
TaskComplexity.CRITICAL: "claude",
|
||
},
|
||
"architecture": {
|
||
TaskComplexity.MODERATE: "kimi",
|
||
TaskComplexity.COMPLEX: "kimi",
|
||
TaskComplexity.CRITICAL: "claude",
|
||
},
|
||
"security": {
|
||
TaskComplexity.SIMPLE: "grok-fast",
|
||
TaskComplexity.MODERATE: "kimi",
|
||
TaskComplexity.COMPLEX: "claude",
|
||
TaskComplexity.CRITICAL: "claude",
|
||
},
|
||
}
|
||
|
||
task_map = selection_map.get(task_type, {})
|
||
model_key = task_map.get(complexity, "grok-fast")
|
||
|
||
return MODELS[model_key].id
|
||
|
||
|
||
def estimate_cost(model_key: str, input_tokens: int, output_tokens: int, cache_tokens: int = 0) -> float:
|
||
"""Estimate DIEM cost for a completion."""
|
||
model = MODELS[model_key]
|
||
|
||
input_cost = ((input_tokens - cache_tokens) / 1_000_000) * model.input_rate
|
||
cache_cost = (cache_tokens / 1_000_000) * model.cache_rate
|
||
output_cost = (output_tokens / 1_000_000) * model.output_rate
|
||
|
||
return input_cost + cache_cost + output_cost
|
||
```
|
||
|
||
---
|
||
|
||
## 3. Context Management
|
||
|
||
### 3.1 The Context Problem
|
||
|
||
Your billing data reveals context size is the **primary cost driver**:
|
||
|
||
- Kimi calls at 48K tokens: **0.040 DIEM each**
|
||
- Kimi calls at 5K tokens: **0.006 DIEM each** (6.7× cheaper)
|
||
- Grok calls at 5K tokens: **0.003 DIEM each**
|
||
|
||
**Every 10K tokens of unnecessary context costs ~0.005-0.015 DIEM.**
|
||
|
||
### 3.2 Context Budget by Role
|
||
|
||
```python
|
||
# Maximum context tokens per NPE role
|
||
CONTEXT_LIMITS = {
|
||
"orchestrator": 4_000, # Minimal - just state and decisions
|
||
"pm": 6_000, # Moderate - task list and status
|
||
"coder": 12_000, # Larger - needs code context
|
||
"reviewer": 8_000, # Moderate - diff + surrounding code
|
||
"editorial": 10_000, # Article + guidelines
|
||
}
|
||
|
||
# Warning thresholds (emit warning in logs)
|
||
CONTEXT_WARNINGS = {
|
||
"orchestrator": 3_000,
|
||
"pm": 5_000,
|
||
"coder": 10_000,
|
||
"reviewer": 6_000,
|
||
"editorial": 8_000,
|
||
}
|
||
```
|
||
|
||
### 3.3 Context Management Strategy
|
||
|
||
```python
|
||
from typing import Optional
|
||
import tiktoken
|
||
|
||
class ContextManager:
|
||
"""
|
||
Aggressive context pruning for cost control.
|
||
|
||
Strategy:
|
||
1. Always keep: system prompt, last 2 user messages, last assistant response
|
||
2. Summarize: everything older than 3 exchanges
|
||
3. Prune: tool outputs older than 2 exchanges
|
||
4. Compress: code blocks to signatures only (in summaries)
|
||
"""
|
||
|
||
def __init__(self, model: str = "grok-fast"):
|
||
self.encoder = tiktoken.get_encoding("cl100k_base")
|
||
self.summary_model = model # Use cheap model for summaries
|
||
|
||
def count_tokens(self, text: str) -> int:
|
||
"""Count tokens in text."""
|
||
return len(self.encoder.encode(text))
|
||
|
||
def count_messages(self, messages: list[dict]) -> int:
|
||
"""Count total tokens in message list."""
|
||
total = 0
|
||
for msg in messages:
|
||
total += self.count_tokens(msg.get("content", ""))
|
||
# Add overhead for role, etc.
|
||
total += 4
|
||
return total
|
||
|
||
async def prepare_context(
|
||
self,
|
||
role: str,
|
||
system_prompt: str,
|
||
messages: list[dict],
|
||
force_limit: Optional[int] = None
|
||
) -> tuple[str, list[dict]]:
|
||
"""
|
||
Prune context to fit within role's budget.
|
||
|
||
Returns: (possibly_modified_system_prompt, pruned_messages)
|
||
"""
|
||
limit = force_limit or CONTEXT_LIMITS.get(role, 8_000)
|
||
warning = CONTEXT_WARNINGS.get(role, limit - 1000)
|
||
|
||
system_tokens = self.count_tokens(system_prompt)
|
||
message_tokens = self.count_messages(messages)
|
||
total = system_tokens + message_tokens
|
||
|
||
if total <= limit:
|
||
if total > warning:
|
||
print(f"⚠️ Context at {total} tokens (warning: {warning})")
|
||
return system_prompt, messages
|
||
|
||
print(f"🔄 Context pruning: {total} -> {limit} tokens")
|
||
|
||
# Strategy 1: Keep essential messages
|
||
essential = []
|
||
essential_tokens = 0
|
||
|
||
# Always keep last 3 messages (2 user + 1 assistant typically)
|
||
for msg in messages[-3:]:
|
||
essential.append(msg)
|
||
essential_tokens += self.count_tokens(msg.get("content", "")) + 4
|
||
|
||
remaining_budget = limit - system_tokens - essential_tokens - 500 # Buffer for summary
|
||
|
||
if remaining_budget < 200:
|
||
# Can't fit summary, just use essential
|
||
return system_prompt, essential
|
||
|
||
# Strategy 2: Summarize older messages
|
||
old_messages = messages[:-3]
|
||
if old_messages:
|
||
summary = await self._summarize_messages(old_messages, remaining_budget)
|
||
|
||
# Prepend summary as system context
|
||
augmented_system = f"{system_prompt}\n\n## Previous Context Summary\n{summary}"
|
||
return augmented_system, essential
|
||
|
||
return system_prompt, essential
|
||
|
||
async def _summarize_messages(self, messages: list[dict], max_tokens: int) -> str:
|
||
"""
|
||
Summarize old messages into compact form.
|
||
Uses Grok for cheap summarization (~0.001 DIEM).
|
||
"""
|
||
# Build summary request
|
||
content_parts = []
|
||
for msg in messages[-10:]: # Last 10 messages max
|
||
role = msg.get("role", "unknown")
|
||
text = msg.get("content", "")[:1000] # Truncate each
|
||
content_parts.append(f"{role}: {text}")
|
||
|
||
content = "\n---\n".join(content_parts)
|
||
|
||
summary_prompt = f"""Summarize this conversation in {max_tokens // 4} words or less.
|
||
Focus on: decisions made, current task, blockers, key context.
|
||
Do NOT include pleasantries or meta-commentary.
|
||
|
||
Conversation:
|
||
{content[:4000]}
|
||
|
||
Summary:"""
|
||
|
||
# Call cheap model for summary
|
||
response = await venice_completion(
|
||
model=self.summary_model,
|
||
messages=[{"role": "user", "content": summary_prompt}],
|
||
max_tokens=min(max_tokens, 500)
|
||
)
|
||
|
||
return response.strip()
|
||
|
||
def compress_code_context(self, code: str, max_lines: int = 50) -> str:
|
||
"""
|
||
Compress code to essential structure for context.
|
||
Keeps signatures, docstrings, removes implementation.
|
||
"""
|
||
lines = code.split("\n")
|
||
|
||
if len(lines) <= max_lines:
|
||
return code
|
||
|
||
compressed = []
|
||
in_function = False
|
||
brace_depth = 0
|
||
|
||
for line in lines:
|
||
stripped = line.strip()
|
||
|
||
# Always keep: imports, class/function definitions, docstrings
|
||
if any(stripped.startswith(kw) for kw in ["import ", "from ", "class ", "def ", "async def ", '"""', "'''"]):
|
||
compressed.append(line)
|
||
if stripped.startswith(("def ", "async def ", "class ")):
|
||
in_function = True
|
||
elif in_function and stripped.startswith(('"""', "'''")):
|
||
compressed.append(line)
|
||
if stripped.count('"""') == 2 or stripped.count("'''") == 2:
|
||
compressed.append(" # ... implementation ...")
|
||
in_function = False
|
||
elif stripped == "":
|
||
compressed.append("")
|
||
|
||
return "\n".join(compressed)
|
||
```
|
||
|
||
### 3.4 Cache Optimization
|
||
|
||
Your Grok-Code's 67.8% cache hit rate shows what's achievable:
|
||
|
||
```python
|
||
class CacheOptimizer:
|
||
"""
|
||
Maximize Venice's prompt caching for cost savings.
|
||
|
||
Venice caches the PREFIX of prompts. To maximize hits:
|
||
1. Put static content (system prompt) FIRST
|
||
2. Put stable context (project info) SECOND
|
||
3. Put variable content (current task) LAST
|
||
"""
|
||
|
||
@staticmethod
|
||
def build_cacheable_prompt(
|
||
system_prompt: str,
|
||
project_context: str,
|
||
task_context: str,
|
||
user_message: str
|
||
) -> list[dict]:
|
||
"""
|
||
Build message list optimized for cache hits.
|
||
|
||
Structure:
|
||
1. System prompt (static) - CACHED after first call
|
||
2. Project context as system addendum - CACHED if unchanged
|
||
3. Task context as assistant message - Varies
|
||
4. User message - Always new
|
||
"""
|
||
messages = [
|
||
{
|
||
"role": "system",
|
||
"content": f"{system_prompt}\n\n## Project Context\n{project_context}"
|
||
}
|
||
]
|
||
|
||
if task_context:
|
||
messages.append({
|
||
"role": "assistant",
|
||
"content": f"Current task context:\n{task_context}"
|
||
})
|
||
|
||
messages.append({
|
||
"role": "user",
|
||
"content": user_message
|
||
})
|
||
|
||
return messages
|
||
|
||
@staticmethod
|
||
def batch_similar_tasks(tasks: list[dict]) -> list[list[dict]]:
|
||
"""
|
||
Group tasks by system prompt and project to maximize cache hits.
|
||
|
||
Running 5 code reviews for the same project sequentially
|
||
means prompts 2-5 get ~75% cache hits on the system prompt.
|
||
"""
|
||
batches = {}
|
||
|
||
for task in tasks:
|
||
key = (task.get("system_prompt_hash"), task.get("project_id"))
|
||
if key not in batches:
|
||
batches[key] = []
|
||
batches[key].append(task)
|
||
|
||
return list(batches.values())
|
||
```
|
||
|
||
---
|
||
|
||
## 4. Tool Architecture
|
||
|
||
### 4.1 Current Tools
|
||
|
||
| Tool | Version | Purpose | Used By |
|
||
|------|---------|---------|---------|
|
||
| `gitea_dev` | 1.1.0 | File ops, branches, PRs, issues | Coder NPEs |
|
||
| `gitea_admin` | 1.1.0 | Teams, permissions, org management | PM NPEs |
|
||
| `venice_info` | 1.0.0 | Model discovery, cost tracking | All NPEs |
|
||
| `editorial_pipeline` | 1.0.0 | Content creation workflow | Editorial NPEs |
|
||
|
||
### 4.2 Required New Tools
|
||
|
||
#### 4.2.1 Cost Tracker Tool (`cost_tracker.py`)
|
||
|
||
**Purpose:** Real-time cost monitoring and budget enforcement.
|
||
|
||
```python
|
||
class Valves(BaseModel):
|
||
VENICE_API_KEY: str = Field(default="", description="Venice API key")
|
||
DAILY_BUDGET: float = Field(default=8.1, description="Daily DIEM budget")
|
||
AUTOMATION_RESERVE: float = Field(default=0.5, description="Reserve for automation")
|
||
ESCALATION_RESERVE: float = Field(default=0.1, description="Reserve for Claude escalations")
|
||
WARNING_THRESHOLD: float = Field(default=0.8, description="Warn at this % of budget")
|
||
|
||
class Tools:
|
||
async def get_balance(self) -> str:
|
||
"""Get current Venice DIEM balance."""
|
||
|
||
async def get_remaining_today(self) -> str:
|
||
"""Get remaining budget for today (resets 19:00 EST)."""
|
||
|
||
async def estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> str:
|
||
"""Estimate DIEM cost for a completion."""
|
||
|
||
async def can_afford(self, estimated_cost: float, include_reserve: bool = True) -> str:
|
||
"""Check if operation fits within budget."""
|
||
|
||
async def record_cost(self, amount: float, model: str, npe_id: str, project_id: str = None) -> str:
|
||
"""Record actual cost after operation."""
|
||
|
||
async def get_daily_report(self) -> str:
|
||
"""Get today's spend breakdown by model and NPE."""
|
||
|
||
async def check_budget_alerts(self) -> str:
|
||
"""Check for budget warnings and return any alerts."""
|
||
```
|
||
|
||
#### 4.2.2 Project Manager Tool (`project_manager.py`)
|
||
|
||
**Purpose:** Manage development projects and work items.
|
||
|
||
```python
|
||
class Tools:
|
||
# Project CRUD
|
||
async def create_project(self, name: str, description: str, daily_budget: float = 1.0) -> str
|
||
async def get_project(self, project_id: str) -> str
|
||
async def list_projects(self, status: str = "active") -> str
|
||
async def update_project(self, project_id: str, **updates) -> str
|
||
|
||
# Work Item Management
|
||
async def create_work_item(self, project_id: str, title: str, item_type: str, assigned_model: str = None) -> str
|
||
async def get_work_item(self, item_id: str) -> str
|
||
async def list_work_items(self, project_id: str, status: str = "open") -> str
|
||
async def update_work_item(self, item_id: str, **updates) -> str
|
||
async def add_comment(self, item_id: str, comment: str, author: str) -> str
|
||
|
||
# Budget Tracking
|
||
async def get_project_budget(self, project_id: str) -> str
|
||
async def record_project_expense(self, project_id: str, amount: float, description: str) -> str
|
||
```
|
||
|
||
#### 4.2.3 NPE Manager Tool (`npe_manager.py`)
|
||
|
||
**Purpose:** Create and manage NPE identities.
|
||
|
||
```python
|
||
class Tools:
|
||
# NPE Lifecycle
|
||
async def create_npe(self, name: str, role: str, model: str, persona: str, tools: list[str]) -> str
|
||
async def get_npe(self, npe_id: str) -> str
|
||
async def list_npes(self, role: str = None, status: str = "active") -> str
|
||
async def update_npe(self, npe_id: str, **updates) -> str
|
||
async def deactivate_npe(self, npe_id: str) -> str
|
||
|
||
# Activity Tracking
|
||
async def get_npe_activity(self, npe_id: str, days: int = 7) -> str
|
||
async def get_npe_cost_report(self, npe_id: str, period: str = "today") -> str
|
||
```
|
||
|
||
#### 4.2.4 Workflow Engine Tool (`workflow_engine.py`)
|
||
|
||
**Purpose:** Execute and monitor multi-step workflows.
|
||
|
||
```python
|
||
class Tools:
|
||
# Workflow Execution
|
||
async def start_workflow(self, workflow_type: str, params: dict, project_id: str = None) -> str
|
||
async def get_workflow_status(self, workflow_id: str) -> str
|
||
async def complete_step(self, workflow_id: str, step_id: str, result: dict) -> str
|
||
async def fail_step(self, workflow_id: str, step_id: str, error: str) -> str
|
||
|
||
# Circuit Breaker
|
||
async def check_circuit(self, workflow_id: str) -> str
|
||
async def trip_circuit(self, workflow_id: str, reason: str) -> str
|
||
async def reset_circuit(self, workflow_id: str) -> str
|
||
|
||
# Escalation
|
||
async def escalate(self, workflow_id: str, reason: str, to_model: str = "claude-opus-4-5") -> str
|
||
```
|
||
|
||
### 4.3 Tool Permission Matrix
|
||
|
||
| Tool | Orchestrator | PM | Coder | Reviewer |
|
||
|------|:------------:|:--:|:-----:|:--------:|
|
||
| cost_tracker | ✓ | R | - | - |
|
||
| project_manager | ✓ | ✓ | R | R |
|
||
| npe_manager | ✓ | - | - | - |
|
||
| workflow_engine | ✓ | ✓ | - | - |
|
||
| gitea_dev (read) | ✓ | ✓ | ✓ | ✓ |
|
||
| gitea_dev (write) | - | - | ✓ | - |
|
||
| gitea_admin | ✓ | ✓ | - | - |
|
||
| venice_info | ✓ | ✓ | ✓ | ✓ |
|
||
|
||
---
|
||
|
||
## 5. NPE Personas & Roles
|
||
|
||
### 5.1 NPE Identity Structure
|
||
|
||
```python
|
||
@dataclass
|
||
class NPEIdentity:
|
||
# Core Identity
|
||
id: str # e.g., "npe-pm-main"
|
||
name: str # e.g., "Project Manager - Main"
|
||
role: str # orchestrator, pm, coder, reviewer
|
||
status: str # active, suspended, archived
|
||
|
||
# Model Configuration
|
||
base_model: str # Venice model ID
|
||
tier: int # 1-4 based on cost
|
||
|
||
# Context Limits
|
||
max_context: int # Max input tokens
|
||
target_output: int # Target output tokens
|
||
|
||
# Budget
|
||
daily_budget: float # DIEM limit per day
|
||
spent_today: float # Running total
|
||
|
||
# Tools
|
||
enabled_tools: list[str] # Tool IDs this NPE can use
|
||
```
|
||
|
||
### 5.2 Orchestrator Persona
|
||
|
||
**ID:** `npe-orchestrator`
|
||
**Model:** `grok-41-fast` (Tier 1)
|
||
**Cost:** ~0.003 DIEM/call
|
||
**Context Limit:** 4,000 tokens
|
||
|
||
```markdown
|
||
# System Prompt: Orchestrator NPE
|
||
|
||
You are the Orchestrator, responsible for coordinating all automated development work.
|
||
|
||
## Core Responsibilities
|
||
1. Receive triggers from cron jobs and webhooks
|
||
2. Route work to appropriate NPEs
|
||
3. Monitor workflow progress
|
||
4. Handle escalations
|
||
5. Manage budget allocation
|
||
|
||
## Constraints
|
||
- You do NOT perform work yourself
|
||
- You MUST check budget before spawning work
|
||
- You MUST use structured JSON for all outputs
|
||
- You MUST keep context under 4,000 tokens
|
||
|
||
## Output Format
|
||
All outputs must be valid JSON:
|
||
|
||
### Spawn Work
|
||
{
|
||
"action": "spawn_workflow",
|
||
"workflow_type": "code_review|feature|bugfix",
|
||
"project_id": "string",
|
||
"assigned_npe": "npe-id",
|
||
"budget_limit": 0.5,
|
||
"priority": "high|medium|low"
|
||
}
|
||
|
||
### Route Escalation
|
||
{
|
||
"action": "escalate",
|
||
"workflow_id": "string",
|
||
"reason": "string",
|
||
"to_model": "claude-opus-4-5",
|
||
"context_summary": "string (max 500 words)"
|
||
}
|
||
|
||
### Budget Check
|
||
{
|
||
"action": "budget_check",
|
||
"remaining": 5.5,
|
||
"can_proceed": true,
|
||
"warnings": []
|
||
}
|
||
|
||
## Decision Rules
|
||
1. If remaining budget < 0.5 DIEM: STOP all non-critical work
|
||
2. If task is security-related: Route to Claude
|
||
3. If task is simple routing: Do it yourself (no spawn needed)
|
||
4. If stuck for > 30 minutes: Escalate
|
||
```
|
||
|
||
### 5.3 PM Persona
|
||
|
||
**ID:** `npe-pm-{project}`
|
||
**Model:** `grok-41-fast` (Tier 1)
|
||
**Cost:** ~0.003 DIEM/call
|
||
**Context Limit:** 6,000 tokens
|
||
|
||
```markdown
|
||
# System Prompt: Project Manager NPE
|
||
|
||
You are a Project Manager responsible for coordinating development work.
|
||
|
||
## Core Responsibilities
|
||
1. Break down requirements into work items
|
||
2. Assign work to Coder NPEs
|
||
3. Review completed work
|
||
4. Track progress and budget
|
||
|
||
## Constraints
|
||
- You do NOT write code
|
||
- You do NOT modify files
|
||
- You MUST check project budget before assigning work
|
||
- You MUST use structured JSON for work assignments
|
||
|
||
## Work Assignment Format
|
||
{
|
||
"action": "assign_work",
|
||
"work_item": {
|
||
"id": "WI-{timestamp}",
|
||
"title": "Brief title",
|
||
"description": "Requirements in 200 words or less",
|
||
"type": "feature|bugfix|refactor",
|
||
"assigned_to": "npe-coder-{specialty}",
|
||
"estimated_tokens": 5000,
|
||
"files_to_modify": ["path/to/file.py"],
|
||
"acceptance_criteria": ["criterion 1"]
|
||
}
|
||
}
|
||
|
||
## Review Format
|
||
{
|
||
"action": "review_complete",
|
||
"work_item_id": "WI-xxx",
|
||
"verdict": "approve|request_changes|escalate",
|
||
"feedback": "string (50 words max)"
|
||
}
|
||
```
|
||
|
||
### 5.4 Coder Persona
|
||
|
||
**ID:** `npe-coder-{specialty}`
|
||
**Model:** `grok-code-fast-1` (Tier 1)
|
||
**Cost:** ~0.005 DIEM/call
|
||
**Context Limit:** 12,000 tokens
|
||
|
||
```markdown
|
||
# System Prompt: Coder NPE
|
||
|
||
You are a Coder responsible for implementing assigned work items.
|
||
|
||
## Core Responsibilities
|
||
1. Read work item requirements
|
||
2. Examine existing code
|
||
3. Implement changes
|
||
4. Commit via Gitea tool
|
||
|
||
## Constraints
|
||
- You ONLY work on assigned items
|
||
- You do NOT make architectural decisions
|
||
- You MUST follow existing code style
|
||
- You MUST output structured JSON for commits
|
||
|
||
## Code Output Format
|
||
{
|
||
"action": "commit_changes",
|
||
"work_item_id": "WI-xxx",
|
||
"changes": [
|
||
{
|
||
"file_path": "src/module/file.py",
|
||
"action": "create|update|delete",
|
||
"content": "full file content",
|
||
"description": "what this change does (20 words max)"
|
||
}
|
||
],
|
||
"commit_message": "feat: description",
|
||
"ready_for_review": true
|
||
}
|
||
|
||
## When Stuck
|
||
{
|
||
"action": "request_help",
|
||
"work_item_id": "WI-xxx",
|
||
"blocker": "description (50 words max)",
|
||
"attempted": ["approach 1", "approach 2"]
|
||
}
|
||
```
|
||
|
||
### 5.5 Reviewer Persona
|
||
|
||
**ID:** `npe-reviewer-{specialty}`
|
||
**Model:** `grok-41-fast` (Tier 1)
|
||
**Cost:** ~0.004 DIEM/call
|
||
**Context Limit:** 8,000 tokens
|
||
|
||
```markdown
|
||
# System Prompt: Code Reviewer NPE
|
||
|
||
You are a Code Reviewer responsible for ensuring code quality.
|
||
|
||
## Core Responsibilities
|
||
1. Review code changes
|
||
2. Check for bugs, security issues, style violations
|
||
3. Provide actionable feedback
|
||
4. Approve or request changes
|
||
|
||
## Constraints
|
||
- You do NOT modify code
|
||
- You do NOT approve your own changes
|
||
- You MUST be specific and actionable
|
||
- You MUST output structured JSON
|
||
|
||
## Review Output Format
|
||
{
|
||
"action": "review_complete",
|
||
"work_item_id": "WI-xxx",
|
||
"verdict": "approve|request_changes|escalate",
|
||
"summary": "one line summary",
|
||
"issues": [
|
||
{
|
||
"severity": "critical|major|minor",
|
||
"file": "path",
|
||
"line": 42,
|
||
"issue": "what's wrong",
|
||
"fix": "how to fix"
|
||
}
|
||
],
|
||
"security_concerns": [],
|
||
"approved": true|false
|
||
}
|
||
|
||
## Escalation Triggers
|
||
- Security vulnerability
|
||
- Architectural concern
|
||
- >3 major issues
|
||
```
|
||
|
||
---
|
||
|
||
## 6. Cron & Scheduling
|
||
|
||
### 6.1 Architecture: Hybrid Scheduler
|
||
|
||
Based on your usage patterns, I recommend a **hybrid scheduler**:
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────────────┐
|
||
│ SCHEDULER ARCHITECTURE │
|
||
├─────────────────────────────────────────────────────────────────────┤
|
||
│ │
|
||
│ KUBERNETES CRONJOBS │
|
||
│ ═══════════════════ │
|
||
│ │
|
||
│ ┌─────────────────────────────────────────────────────────────┐ │
|
||
│ │ MASTER SCHEDULER (Every 15 min) │ │
|
||
│ │ │ │
|
||
│ │ • Health check all systems │ │
|
||
│ │ • Process pending triggers │ │
|
||
│ │ • Check for stuck workflows │ │
|
||
│ │ • Route escalations │ │
|
||
│ │ │ │
|
||
│ │ Cost: ~0.003 DIEM × 4/hour = 0.012 DIEM/hour │ │
|
||
│ └─────────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
|
||
│ │ BURN WINDOW │ │ DAILY │ │ WEEKLY │ │
|
||
│ │ (18:45 EST) │ │ (00:00 UTC) │ │ (Sun 06:00) │ │
|
||
│ │ │ │ │ │ │ │
|
||
│ │ Use surplus │ │ Cleanup │ │ Full report │ │
|
||
│ │ before reset │ │ Archive │ │ Cost analysis │ │
|
||
│ │ │ │ Reset budgets │ │ │ │
|
||
│ └───────────────┘ └───────────────┘ └───────────────┘ │
|
||
│ │
|
||
└─────────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### 6.2 Master Scheduler Script
|
||
|
||
```python
|
||
#!/usr/bin/env python3
|
||
"""
|
||
Master Scheduler for NPE Orchestration.
|
||
Runs every 15 minutes via Kubernetes CronJob.
|
||
"""
|
||
|
||
import asyncio
|
||
import os
|
||
from datetime import datetime, timezone
|
||
from typing import Optional
|
||
|
||
import httpx
|
||
|
||
# Configuration
|
||
OWUI_URL = os.environ["OWUI_URL"]
|
||
OWUI_TOKEN = os.environ["OWUI_TOKEN"]
|
||
ORCHESTRATOR_CHAT_ID = os.environ.get("ORCHESTRATOR_CHAT_ID")
|
||
EST_OFFSET = -5 # EST timezone offset
|
||
|
||
|
||
async def main():
|
||
"""Main scheduler loop."""
|
||
async with httpx.AsyncClient(
|
||
base_url=OWUI_URL,
|
||
headers={"Authorization": f"Bearer {OWUI_TOKEN}"},
|
||
timeout=60.0
|
||
) as client:
|
||
|
||
now = datetime.now(timezone.utc)
|
||
hour_est = (now.hour + EST_OFFSET) % 24
|
||
|
||
# 1. Always: Health check
|
||
health = await check_system_health(client)
|
||
if not health["ok"]:
|
||
await alert_admin(client, f"System unhealthy: {health['issues']}")
|
||
return
|
||
|
||
# 2. Always: Budget check
|
||
budget = await get_budget_status(client)
|
||
log_budget(budget)
|
||
|
||
if budget["remaining"] < 0.5:
|
||
await alert_admin(client, f"Budget critical: {budget['remaining']:.2f} DIEM")
|
||
# Don't stop - still need to process escalations
|
||
|
||
# 3. Conditional: Process work based on hour
|
||
if 22 <= hour_est or hour_est < 7:
|
||
# Night automation window (22:00 - 07:00 EST)
|
||
await process_automation_queue(client, budget)
|
||
elif hour_est == 18 and now.minute >= 45:
|
||
# Burn window (18:45 - 19:00 EST)
|
||
await run_burn_window(client, budget)
|
||
else:
|
||
# Daytime - only process high-priority triggers
|
||
await process_high_priority_only(client, budget)
|
||
|
||
# 4. Always: Check for stuck workflows
|
||
stuck = await find_stuck_workflows(client, max_age_minutes=30)
|
||
for workflow in stuck:
|
||
await handle_stuck_workflow(client, workflow)
|
||
|
||
# 5. Always: Route pending escalations
|
||
escalations = await get_pending_escalations(client)
|
||
for escalation in escalations:
|
||
await route_escalation(client, escalation, budget)
|
||
|
||
|
||
async def check_system_health(client: httpx.AsyncClient) -> dict:
|
||
"""Verify all system components are operational."""
|
||
issues = []
|
||
|
||
# Check Open WebUI
|
||
try:
|
||
resp = await client.get("/health")
|
||
if resp.status_code != 200:
|
||
issues.append("Open WebUI unhealthy")
|
||
except Exception as e:
|
||
issues.append(f"Open WebUI unreachable: {e}")
|
||
|
||
# Check Venice balance
|
||
try:
|
||
resp = await client.get("/api/v1/venice/balance")
|
||
balance = resp.json().get("balance", 0)
|
||
if balance < 0.1:
|
||
issues.append(f"Venice balance critical: {balance}")
|
||
except Exception as e:
|
||
issues.append(f"Venice check failed: {e}")
|
||
|
||
return {"ok": len(issues) == 0, "issues": issues}
|
||
|
||
|
||
async def get_budget_status(client: httpx.AsyncClient) -> dict:
|
||
"""Get current budget status."""
|
||
# Calculate time until reset (19:00 EST = 00:00 UTC)
|
||
now = datetime.now(timezone.utc)
|
||
hours_until_reset = (24 - now.hour) % 24
|
||
|
||
try:
|
||
resp = await client.get("/api/v1/venice/balance")
|
||
data = resp.json()
|
||
remaining = data.get("balance", 0)
|
||
|
||
# Get today's spend from cost tracker
|
||
spend_resp = await client.get("/api/v1/cost-tracker/today")
|
||
spent_today = spend_resp.json().get("total", 0)
|
||
except Exception:
|
||
remaining = 8.1 # Assume full budget on error
|
||
spent_today = 0
|
||
|
||
return {
|
||
"remaining": remaining,
|
||
"spent_today": spent_today,
|
||
"hours_until_reset": hours_until_reset,
|
||
"automation_reserve": 0.5,
|
||
"escalation_reserve": 0.1,
|
||
"available_for_work": remaining - 0.6 # reserves
|
||
}
|
||
|
||
|
||
async def run_burn_window(client: httpx.AsyncClient, budget: dict):
|
||
"""
|
||
Use surplus DIEM before 19:00 EST reset.
|
||
|
||
ROI: 0.10 DIEM spend can utilize 1.0+ DIEM that would be lost.
|
||
"""
|
||
surplus = budget["remaining"] - 2.0 # Keep 2.0 DIEM for tomorrow morning
|
||
|
||
if surplus < 0.10:
|
||
print(f"No surplus to burn: {budget['remaining']:.2f} DIEM")
|
||
return
|
||
|
||
print(f"Burn window: {surplus:.2f} DIEM surplus available")
|
||
|
||
tasks = []
|
||
|
||
# Priority 1: Summarize active workflows (saves context tomorrow)
|
||
if surplus >= 0.05:
|
||
tasks.append(("summarize_workflows", 0.05))
|
||
surplus -= 0.05
|
||
|
||
# Priority 2: Pre-plan tomorrow's tasks
|
||
if surplus >= 0.08:
|
||
tasks.append(("pre_plan", 0.08))
|
||
surplus -= 0.08
|
||
|
||
# Priority 3: Run pending reviews
|
||
if surplus >= 0.05:
|
||
tasks.append(("pending_reviews", surplus))
|
||
|
||
for task, budget_limit in tasks:
|
||
await dispatch_burn_task(client, task, budget_limit)
|
||
|
||
|
||
async def process_automation_queue(client: httpx.AsyncClient, budget: dict):
|
||
"""Process automation tasks during night window."""
|
||
if budget["available_for_work"] < 0.1:
|
||
print("Budget too low for automation")
|
||
return
|
||
|
||
# Get pending automation tasks
|
||
triggers = await get_pending_triggers(client)
|
||
|
||
for trigger in triggers:
|
||
# Estimate cost
|
||
estimated = estimate_trigger_cost(trigger)
|
||
|
||
if estimated > budget["available_for_work"]:
|
||
print(f"Skipping {trigger['type']}: cost {estimated} > available {budget['available_for_work']}")
|
||
continue
|
||
|
||
await dispatch_trigger(client, trigger)
|
||
budget["available_for_work"] -= estimated
|
||
|
||
|
||
def estimate_trigger_cost(trigger: dict) -> float:
|
||
"""Estimate DIEM cost for a trigger."""
|
||
costs = {
|
||
"code_review": 0.015, # Grok review + routing
|
||
"feature_request": 0.050, # PM + Coder + Review
|
||
"bug_fix": 0.030, # Triage + Fix + Review
|
||
"cleanup": 0.005, # Simple Grok task
|
||
"health_check": 0.003, # Minimal
|
||
}
|
||
return costs.get(trigger.get("type"), 0.010)
|
||
|
||
|
||
if __name__ == "__main__":
|
||
asyncio.run(main())
|
||
```
|
||
|
||
### 6.3 Cron Schedule (Kubernetes)
|
||
|
||
```yaml
|
||
# Master Scheduler - Every 15 minutes
|
||
apiVersion: batch/v1
|
||
kind: CronJob
|
||
metadata:
|
||
name: npe-master-scheduler
|
||
namespace: open-webui
|
||
spec:
|
||
schedule: "*/15 * * * *"
|
||
concurrencyPolicy: Forbid
|
||
jobTemplate:
|
||
spec:
|
||
template:
|
||
spec:
|
||
containers:
|
||
- name: scheduler
|
||
image: python:3.11-slim
|
||
command: ["python", "/scripts/master_scheduler.py"]
|
||
envFrom:
|
||
- secretRef:
|
||
name: npe-secrets
|
||
restartPolicy: OnFailure
|
||
|
||
---
|
||
# Burn Window - 18:45 EST (23:45 UTC)
|
||
apiVersion: batch/v1
|
||
kind: CronJob
|
||
metadata:
|
||
name: npe-burn-window
|
||
spec:
|
||
schedule: "45 23 * * *"
|
||
jobTemplate:
|
||
spec:
|
||
template:
|
||
spec:
|
||
containers:
|
||
- name: burner
|
||
image: python:3.11-slim
|
||
command: ["python", "/scripts/burn_window.py"]
|
||
|
||
---
|
||
# Daily Maintenance - 00:00 UTC (19:00 EST - after reset)
|
||
apiVersion: batch/v1
|
||
kind: CronJob
|
||
metadata:
|
||
name: npe-daily-maintenance
|
||
spec:
|
||
schedule: "0 0 * * *"
|
||
jobTemplate:
|
||
spec:
|
||
template:
|
||
spec:
|
||
containers:
|
||
- name: maintenance
|
||
image: python:3.11-slim
|
||
command: ["python", "/scripts/daily_maintenance.py"]
|
||
```
|
||
|
||
---
|
||
|
||
## 7. Workflow Patterns
|
||
|
||
### 7.1 Code Review Workflow
|
||
|
||
**Trigger:** Gitea webhook on PR create/update
|
||
**Cost:** ~0.015 DIEM
|
||
**Duration:** 2-5 minutes
|
||
|
||
```
|
||
┌─────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐
|
||
│ START │───▶│ Load Context │───▶│ Review │───▶│ Verdict │
|
||
│ │ │ (Grok, 0.003)│ │ (Grok, 0.008)│ │ │
|
||
└─────────┘ └──────────────┘ └──────────────┘ └────┬────┘
|
||
│
|
||
┌────────────────────────────────────────┤
|
||
│ │ │
|
||
▼ ▼ ▼
|
||
┌──────────┐ ┌───────────┐ ┌───────────┐
|
||
│ APPROVE │ │ REQUEST │ │ ESCALATE │
|
||
│ │ │ CHANGES │ │ to Claude │
|
||
│ Post │ │ │ │ (0.054) │
|
||
│ Comment │ │ Post │ └───────────┘
|
||
│ (0.002) │ │ Feedback │
|
||
└──────────┘ │ (0.002) │
|
||
└───────────┘
|
||
```
|
||
|
||
### 7.2 Feature Development Workflow
|
||
|
||
**Trigger:** Issue with label "feature"
|
||
**Cost:** ~0.050 DIEM
|
||
**Duration:** 15-30 minutes
|
||
|
||
```
|
||
┌─────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||
│ START │───▶│ PM: Analyze │───▶│ PM: Breakdown│───▶│ Coder: │
|
||
│ │ │ (Grok, 0.003)│ │ (Grok, 0.003)│ │ Implement │
|
||
└─────────┘ └──────────────┘ └──────────────┘ │ (Grok, 0.015)│
|
||
└──────┬───────┘
|
||
│
|
||
┌──────────────────────────────────────────┤
|
||
│ │
|
||
▼ ▼
|
||
┌───────────┐ ┌───────────┐
|
||
│ Review │◀────── Revision Loop ──────│ FAILED │
|
||
│ (0.008) │ (max 3x) │ │
|
||
└─────┬─────┘ └───────────┘
|
||
│
|
||
▼
|
||
┌───────────┐
|
||
│ Create PR │
|
||
│ (0.003) │
|
||
└───────────┘
|
||
```
|
||
|
||
### 7.3 Workflow State Machine
|
||
|
||
```python
|
||
from enum import Enum
|
||
from dataclasses import dataclass, field
|
||
from datetime import datetime
|
||
from typing import Optional
|
||
|
||
class WorkflowState(Enum):
|
||
PENDING = "pending"
|
||
RUNNING = "running"
|
||
WAITING_INPUT = "waiting_input"
|
||
STEP_FAILED = "step_failed"
|
||
ESCALATED = "escalated"
|
||
COMPLETED = "completed"
|
||
FAILED = "failed"
|
||
|
||
@dataclass
|
||
class CircuitBreaker:
|
||
max_failures: int = 3
|
||
failure_count: int = 0
|
||
last_failure: Optional[datetime] = None
|
||
cooldown_seconds: int = 300
|
||
|
||
def record_failure(self):
|
||
self.failure_count += 1
|
||
self.last_failure = datetime.now()
|
||
|
||
def is_open(self) -> bool:
|
||
if self.failure_count >= self.max_failures:
|
||
if self.last_failure:
|
||
elapsed = (datetime.now() - self.last_failure).seconds
|
||
if elapsed < self.cooldown_seconds:
|
||
return True
|
||
# Reset after cooldown
|
||
self.failure_count = 0
|
||
return False
|
||
|
||
def should_escalate(self) -> bool:
|
||
return self.failure_count >= (self.max_failures - 1)
|
||
|
||
@dataclass
|
||
class Workflow:
|
||
id: str
|
||
type: str
|
||
state: WorkflowState = WorkflowState.PENDING
|
||
current_step: str = ""
|
||
assigned_npe: str = ""
|
||
project_id: Optional[str] = None
|
||
budget_limit: float = 1.0
|
||
budget_spent: float = 0.0
|
||
circuit: CircuitBreaker = field(default_factory=CircuitBreaker)
|
||
steps_completed: list = field(default_factory=list)
|
||
created_at: datetime = field(default_factory=datetime.now)
|
||
updated_at: datetime = field(default_factory=datetime.now)
|
||
|
||
def can_proceed(self) -> tuple[bool, str]:
|
||
if self.circuit.is_open():
|
||
return False, "Circuit breaker open"
|
||
if self.budget_spent >= self.budget_limit:
|
||
return False, "Budget exhausted"
|
||
return True, "OK"
|
||
|
||
def record_cost(self, amount: float):
|
||
self.budget_spent += amount
|
||
self.updated_at = datetime.now()
|
||
|
||
def complete_step(self, step_id: str, result: dict):
|
||
self.steps_completed.append({
|
||
"step_id": step_id,
|
||
"completed_at": datetime.now().isoformat(),
|
||
"result": result
|
||
})
|
||
self.updated_at = datetime.now()
|
||
|
||
def fail_step(self, step_id: str, error: str):
|
||
self.circuit.record_failure()
|
||
self.state = WorkflowState.STEP_FAILED
|
||
self.updated_at = datetime.now()
|
||
```
|
||
|
||
---
|
||
|
||
## 8. Cost Management
|
||
|
||
### 8.1 Budget Allocation (Based on Actual Data)
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────────────┐
|
||
│ DAILY BUDGET ALLOCATION │
|
||
│ (8.1 DIEM total) │
|
||
├─────────────────────────────────────────────────────────────────────┤
|
||
│ │
|
||
│ ┌───────────────────────────────────────────────────────────────┐ │
|
||
│ │ INTERACTIVE WORK 5.50 DIEM │ │
|
||
│ │ (68% of budget) │ │
|
||
│ │ │ │
|
||
│ │ Claude sessions (1-2/day) ~1.00 DIEM │ │
|
||
│ │ Grok chat (continuous) ~2.50 DIEM │ │
|
||
│ │ Qwen bulk processing ~1.00 DIEM │ │
|
||
│ │ Image generation ~1.00 DIEM │ │
|
||
│ └───────────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
│ ┌───────────────────────────────────────────────────────────────┐ │
|
||
│ │ NPE AUTOMATION 0.25 DIEM │ │
|
||
│ │ (3% of budget) │ │
|
||
│ │ │ │
|
||
│ │ PM checks (12/day × 0.003) ~0.04 DIEM │ │
|
||
│ │ Coder tasks (10/day × 0.005) ~0.05 DIEM │ │
|
||
│ │ Reviews (8/day × 0.004) ~0.03 DIEM │ │
|
||
│ │ Orchestrator (96/day × 0.001) ~0.10 DIEM │ │
|
||
│ │ Buffer ~0.03 DIEM │ │
|
||
│ └───────────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
│ ┌───────────────────────────────────────────────────────────────┐ │
|
||
│ │ RESERVES 0.60 DIEM │ │
|
||
│ │ (7% of budget) │ │
|
||
│ │ │ │
|
||
│ │ Escalation reserve (Claude) ~0.10 DIEM │ │
|
||
│ │ Automation reserve ~0.50 DIEM │ │
|
||
│ └───────────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
│ ┌───────────────────────────────────────────────────────────────┐ │
|
||
│ │ BUFFER / BURN WINDOW 1.75 DIEM │ │
|
||
│ │ (22% of budget) │ │
|
||
│ │ │ │
|
||
│ │ Available for burn window automation if unused │ │
|
||
│ │ Target: Use 80%+ of daily budget │ │
|
||
│ └───────────────────────────────────────────────────────────────┘ │
|
||
│ │
|
||
└─────────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### 8.2 Cost Enforcement
|
||
|
||
```python
|
||
class BudgetEnforcer:
|
||
"""Enforce budget limits at multiple levels."""
|
||
|
||
def __init__(self, daily_budget: float = 8.1):
|
||
self.daily_budget = daily_budget
|
||
self.reserves = {
|
||
"escalation": 0.10,
|
||
"automation": 0.50,
|
||
}
|
||
|
||
async def can_proceed(
|
||
self,
|
||
estimated_cost: float,
|
||
npe_id: str,
|
||
project_id: Optional[str] = None,
|
||
use_reserves: bool = False
|
||
) -> tuple[bool, str]:
|
||
"""Check if operation can proceed within budget."""
|
||
|
||
# Get current balance
|
||
balance = await self.get_venice_balance()
|
||
|
||
# Calculate available
|
||
reserved = sum(self.reserves.values()) if not use_reserves else 0
|
||
available = balance - reserved
|
||
|
||
# Check global
|
||
if available < estimated_cost:
|
||
return False, f"Insufficient budget: {available:.3f} < {estimated_cost:.3f}"
|
||
|
||
# Check per-NPE daily limit
|
||
npe_spent = await self.get_npe_spent_today(npe_id)
|
||
npe_limit = await self.get_npe_daily_limit(npe_id)
|
||
|
||
if npe_spent + estimated_cost > npe_limit:
|
||
return False, f"NPE budget exceeded: {npe_spent:.3f} + {estimated_cost:.3f} > {npe_limit:.3f}"
|
||
|
||
# Check per-project if applicable
|
||
if project_id:
|
||
project_spent = await self.get_project_spent_today(project_id)
|
||
project_limit = await self.get_project_daily_limit(project_id)
|
||
|
||
if project_spent + estimated_cost > project_limit:
|
||
return False, f"Project budget exceeded"
|
||
|
||
return True, "OK"
|
||
|
||
async def record_and_verify(
|
||
self,
|
||
actual_cost: float,
|
||
estimated_cost: float,
|
||
npe_id: str,
|
||
operation: str
|
||
):
|
||
"""Record cost and check for anomalies."""
|
||
# Record
|
||
await self.record_cost(actual_cost, npe_id, operation)
|
||
|
||
# Check for cost overrun
|
||
if actual_cost > estimated_cost * 1.5:
|
||
await self.alert(
|
||
f"Cost overrun: {operation} estimated {estimated_cost:.4f}, actual {actual_cost:.4f}"
|
||
)
|
||
|
||
# Check for budget warnings
|
||
remaining = await self.get_remaining_today()
|
||
if remaining < self.daily_budget * 0.2:
|
||
await self.alert(f"Budget warning: only {remaining:.2f} DIEM remaining today")
|
||
```
|
||
|
||
### 8.3 Automation Cost Projections
|
||
|
||
Based on your actual rates:
|
||
|
||
| Scenario | Model | Per Call | Calls/Day | Daily Cost | Monthly |
|
||
|----------|-------|----------|-----------|------------|---------|
|
||
| PM Check | Grok | 0.0021 | 12 | 0.0252 | 0.76 |
|
||
| Coder Task | Grok-Code | 0.0050 | 10 | 0.0500 | 1.50 |
|
||
| Code Review | Grok | 0.0035 | 8 | 0.0280 | 0.84 |
|
||
| Orchestrator | Grok | 0.0010 | 96 | 0.0960 | 2.88 |
|
||
| Deep Analysis | Kimi | 0.0120 | 3 | 0.0360 | 1.08 |
|
||
| Escalation | Claude | 0.0540 | 1 | 0.0540 | 1.62 |
|
||
| **TOTAL** | | | | **0.2892** | **8.68** |
|
||
|
||
**Conclusion:** Full NPE automation costs **0.29 DIEM/day** (3.6% of budget), leaving **7.81 DIEM** for interactive work.
|
||
|
||
---
|
||
|
||
## 9. Implementation Roadmap
|
||
|
||
### Phase 1: Foundation (Week 1)
|
||
|
||
**Goal:** Basic infrastructure working
|
||
**Budget Impact:** None (setup only)
|
||
|
||
- [ ] Deploy corrected Gitea tools (v1.1.0)
|
||
- [ ] Create `cost_tracker` tool
|
||
- [ ] Set up 2 NPEs manually:
|
||
- [ ] Orchestrator (Grok)
|
||
- [ ] PM (Grok)
|
||
- [ ] Create master scheduler (health check only)
|
||
- [ ] Test: Manual trigger → Orchestrator routes to PM
|
||
|
||
**Success Criteria:**
|
||
- Orchestrator receives triggers
|
||
- Cost tracked per operation
|
||
- No automation yet - just infrastructure
|
||
|
||
### Phase 2: Automation (Week 2)
|
||
|
||
**Goal:** Night automation working
|
||
**Budget Impact:** +0.05 DIEM/day
|
||
|
||
- [ ] Add Coder NPE (Grok-Code)
|
||
- [ ] Add Reviewer NPE (Grok)
|
||
- [ ] Implement Code Review workflow
|
||
- [ ] Enable night automation (22:00-07:00 EST)
|
||
- [ ] Test: PR created → auto-review → comment posted
|
||
|
||
**Success Criteria:**
|
||
- PRs reviewed automatically during night window
|
||
- Reviews posted as Gitea comments
|
||
- Circuit breaker prevents loops
|
||
|
||
### Phase 3: Full Workflows (Week 3-4)
|
||
|
||
**Goal:** Complete workflow coverage
|
||
**Budget Impact:** +0.20 DIEM/day
|
||
|
||
- [ ] Create `project_manager` tool
|
||
- [ ] Create `workflow_engine` tool
|
||
- [ ] Implement Feature Development workflow
|
||
- [ ] Implement Bug Fix workflow
|
||
- [ ] Enable burn window automation
|
||
- [ ] Test: Full feature cycle from issue to PR
|
||
|
||
**Success Criteria:**
|
||
- Features developed from issue to merged PR
|
||
- Budget tracked per project
|
||
- Burn window uses surplus effectively
|
||
|
||
### Phase 4: Escalation (Week 5)
|
||
|
||
**Goal:** Claude integration for complex cases
|
||
**Budget Impact:** +0.05 DIEM/day
|
||
|
||
- [ ] Implement escalation paths
|
||
- [ ] Create context compression for Claude calls
|
||
- [ ] Add security review workflow
|
||
- [ ] Test: Complex review → escalate → Claude response
|
||
|
||
**Success Criteria:**
|
||
- Escalation triggers when needed
|
||
- Claude calls stay under 0.06 DIEM each
|
||
- Responses routed back to workflow
|
||
|
||
### Phase 5: Optimization (Week 6+)
|
||
|
||
**Goal:** Cost optimization and scaling
|
||
**Budget Impact:** -0.05 DIEM/day (savings)
|
||
|
||
- [ ] Implement cache optimization strategy
|
||
- [ ] Add context compression to all NPEs
|
||
- [ ] Tune model selection based on success rates
|
||
- [ ] Add metrics dashboard
|
||
- [ ] Document runbooks
|
||
|
||
**Success Criteria:**
|
||
- Cache hit rates > 50% for Grok
|
||
- System runs 7 days unattended
|
||
- Total automation < 0.25 DIEM/day
|
||
|
||
---
|
||
|
||
## 10. Open Questions
|
||
|
||
### 10.1 Unresolved Decisions
|
||
|
||
| Question | Options | Recommendation | Notes |
|
||
|----------|---------|----------------|-------|
|
||
| State storage | OpenWebUI folders vs. SQLite | OpenWebUI folders | Simpler, no new deps |
|
||
| Token rotation | 30 vs 90 days | 90 days | Manual for now |
|
||
| Max concurrent workflows | 3 vs 5 vs 10 | 5 | Test and adjust |
|
||
| Chat retention | 7 vs 30 vs 90 days | 30 days | Balance audit vs. storage |
|
||
|
||
### 10.2 Your Input Needed
|
||
|
||
1. **Gitea Webhooks:** How are webhooks exposed? Need ingress path for triggers.
|
||
|
||
2. **Claude API Key:** Using Venice's Claude or direct Anthropic? Venice is simpler.
|
||
|
||
3. **Multi-file Commits:** Do you need atomic batch commits in gitea_dev?
|
||
|
||
4. **Test Execution:** Skip CI/CD for Phase 1-5? Add later?
|
||
|
||
5. **Human Approval UI:** Chat-only for now? Dashboard later?
|
||
|
||
### 10.3 Known Limitations
|
||
|
||
1. **No real-time collaboration** - NPEs work asynchronously
|
||
2. **No visual review** - Can't review UI changes
|
||
3. **Venice dependency** - All LLM calls through Venice
|
||
4. **Single Gitea instance** - No multi-repo federation yet
|
||
|
||
---
|
||
|
||
## Appendix A: Quick Reference
|
||
|
||
### Model Rates (DIEM per 1M tokens)
|
||
|
||
| Model | Input | Output | Cache | Effective |
|
||
|-------|-------|--------|-------|-----------|
|
||
| Qwen-Instruct | 0.15 | 0.27 | - | 0.06 |
|
||
| Grok-Code | 0.25 | 1.87 | 0.03 | 0.07 |
|
||
| Grok-41-Fast | 0.50 | 1.25 | 0.125 | 0.19 |
|
||
| MiniMax-M21 | 0.40 | 1.60 | 0.04 | 0.22 |
|
||
| Kimi-K2 | 0.75 | 3.20 | 0.375 | 0.42 |
|
||
| Claude-Opus | 6.00 | 30.00 | - | 5.28 |
|
||
|
||
### Context Limits
|
||
|
||
| Role | Max Tokens | Warning At |
|
||
|------|------------|------------|
|
||
| Orchestrator | 4,000 | 3,000 |
|
||
| PM | 6,000 | 5,000 |
|
||
| Coder | 12,000 | 10,000 |
|
||
| Reviewer | 8,000 | 6,000 |
|
||
|
||
### Budget Summary
|
||
|
||
| Category | Daily DIEM | % of Budget |
|
||
|----------|------------|-------------|
|
||
| Interactive | 5.50 | 68% |
|
||
| Automation | 0.25 | 3% |
|
||
| Reserves | 0.60 | 7% |
|
||
| Buffer/Burn | 1.75 | 22% |
|
||
| **Total** | **8.10** | **100%** |
|
||
|
||
---
|
||
|
||
*Document Status: RFC v2.0 - Based on 30-day billing analysis*
|
||
*Last Updated: January 11, 2026*
|
||
*Data Source: 13,582 Venice.ai transactions* |