Hubexo platform · runebot · v2 proposal

Rune was one. The next ones inherit a foundation.

Today's rune is one tool-using agent and five sibling single-shot LLM features — patterns repeated per-agent, not inherited. We're proposing a foundation — agent runtime, cross-session memory, managed MCP for SQL, eval gates by default. Two months from green-light: two agent classes (rune, sales-card) on the new stack, with sales-card on two runners — interactive and the sales-pitch batch lane. Same definition, two substrates, one eval rubric. The next agent is a schema, a prompt, and an eval rubric — not infrastructure.

Versionv1.0 · proposal
Build duration8 weeks · 4 checkpoints
RegionAWS eu-north-1 (Stockholm)
Reversibility3 green · 5 amber · 1 red
Where rune is today

One tool-using agent. Five sibling LLM features. Shared primitives across both.

Rune is one tool-using agent with five registered tools today — three deterministic Databricks SQL tools, one external ML service (recsys), and one LLM-backed service (company-profiling). In parallel, five standalone single-shot LLM features (sales-card, sales-brief, draft-email, sales-copilot, summary) exist as their own per-project endpoints — they don't go through rune today, but share a project-data fetching layer, the sales playbook layer, FastAPI process, Bedrock client, Databricks connection pool, and MLflow tracing.

The sales-card pipeline already does what v2 wants to generalise: paired-bootstrap CI, k=3 LLM-judge with Holm correction, anti-gaming tripwire, Mechanism A/B human review. The rune agent loop has its own rubric scorer in evals/auto-loop/. v1 also ships shadow-traffic isolation via the pg_ prefix pattern across 3 environments, the /sales-card/evaluate paired-bootstrap pipeline mounted in production, and a runtime tracing kill-switch. v2 builds on these, not over them.

This isn't a failing system — rune works, the patterns are good. It's a system where every next agent re-derives those patterns instead of inheriting them. v2 turns the per-agent patterns into platform primitives.

Where we want to be in two months

A foundation that the next agents inherit by default.

Shared primitives — auth, observability, eval, memory, tool registry — every agent inherits these automatically the moment it's spec'd. Rune itself becomes more sophisticated as a side effect; the system is the point.

North-star metric
Days from PM-approved spec to agent in production, with evals gating promotion.
Today build per-agent primitives
After foundation (target) compose existing parts · per agent

The foundation pays for itself the moment the second agent ships on it. The next agent costs a schema, a prompt, and an eval rubric (plus any new tools the agent needs) — because the rest is inherited. (See §8 — three Q&As on how this works in practice.)

Four things unlock when this foundation lands:

Unlock 1 Eval coverage by default — every agent, every prompt change

MLflow on Databricks is the foundation for prod traces and eval datasets — production tracing already runs there (rune-traces-prod-uc, rune-traces-dev-uc: spans, tool calls, costs, latency, feedback widgets, live today). W3-4 wires the rest: the Prompt Registry (mlflow.genai.register_prompt), mlflow.set_active_model() binding traces to prompt versions, and the first eval dataset. After that wiring lands, every prompt change is scored against the held-out set before promotion to the production alias; a regression caught in prod points straight at the prompt version and input that produced it — no timestamp correlation across deploy logs. Same for model choice — swapping Sonnet 4.6 for an alternate becomes an experiment run, not a deploy.

Unlock 2 Memory across sessions and a headless API so consumers can read it

Two complementary state primitives ship in phase 1. AgentCore Memory provides semantic cross-session persistence — short-term, long-term, per-user scoping — without us building a vector database.

The conversation API exposes the structural side: v1 already stores session_id, turn_index, external_user_id, prompt, answer, and feedback per turn, but no GET endpoint reads it back, so today's embedded UI manages history in-browser. v2 adds Conversation + Turn + Feedback resources with cursor-paginated history, idempotency-keyed turn submission, and explicit cancellation (the 2025-2026 industry pattern). Together: rune remembers semantically, and every consuming team can show users their structural history alongside (see §7 Surface).

Unlock 3 Tools governed where the data lives

MCP is the wire protocol for the data plane. Databricks ships managed MCP servers for both ad-hoc SQL (DBSQL MCP) and predefined typed operations (UC Functions MCP) — same UC governance plane for both. Phase 1 uses DBSQL MCP — closer to today's raw-SQL pattern in rune (databricks-sql-connector), no upfront function definitions needed. UC Functions arrive in phase 2 as common query patterns crystallize into typed, named operations. Non-SQL tools live as native Pydantic AI @agent.tool functions inside the agent runtime. No in-house MCP layer to maintain.

Unlock 4 Batch is a runner, not a lane

One agent definition, two runners. Pydantic AI agent classes run interactively under AgentCore Runtime AND as Bedrock Batch Inference jobs orchestrated from Step Functions — same Pydantic model, same eval rubric, same prompt registry entry, two substrates.

The 200k sales-pitch pre-compute (50k projects × 4 roles) is the first batch citizen: ~$1,350 per regeneration on Bedrock Batch vs ~$2,700 on-demand (at ~2k input + ~500 output per call). Gemini 3 Flash batch would run the same job at ~$250 — roughly 5× cheaper — if accuracy on the sales-pitch rubric holds, which the W5–8 sales-pitch lane can confirm.

Architecture posture

Rent the AWS primitives. Own the data layers.

There are three postures here. Rent everything from AWS (lowest code, deepest lock-in). Own everything ourselves (lowest lock-in, most toil). Hybrid (rent the AWS primitives, own the parts where data residency and open standards matter). We're proposing hybrid — and the trade is laid out below.

Hybrid ★ — Rent the AWS primitives that play to our strengths (we already live in AWS) — including Bedrock Batch for the pre-compute lane (50% discount, AWS-managed queue). Reuse MLflow on Databricks — already provisioned for our ML model registry, already in use, free at the marginal call. Customer data never leaves the EU workspace (europe-prod / europe_prod_catalog) — no DPA, no cross-EEA hop. Own the open formats that make every choice reversible: OTel spans, prompt registry, eval dataset schemas, tool function signatures, batch output schemas in Delta.

Click through to compare. Hybrid is what we're proposing.

Component picks

Layer by layer, today vs. proposed.

Thirteen layers. Each row is one architectural decision with a short justification. Stars mark the components we're adding or replacing.

LayerTodayProposedWhy
InferenceBedrock ConverseBedrock Converse (no change)Model access is already correct.
FrameworkFastAPI + custom tool-use loopPydantic AI inside AgentCore RuntimeType-safe by design (Rust validation core, auto-self-correction on schema mismatch). One primitive for the three agent shapes we use (tool-using, typed-output, multi-turn) — same class handles all of them. ~30-provider matrix for one-line model swaps. v1.0 GA Sept 2025; Thoughtworks Tech Radar 2026; Pydantic-team-maintained.
MemoryAgentCore MemoryCross-session persistence, semantic search, per-user scoping. Closes the conversation-context gap rune has today. Bedrock-native; cross-session semantic recall; W1-2 benchmark validates recall quality (see §8).
IdentityInbound: Hubexo ID Cognito JWT (RS256) + X-API-Key for internal callers. Outbound: Databricks via rune-sp OAuth M2M, external APIs via Secrets Manager keys.Inbound: Cognito unchanged. Outbound: Databricks OAuth M2M + Secrets Manager API keys (no change). Identity vault deferred — see §7 Identity insight.Phase 1 agents don't need per-user RBAC at the Databricks data plane — workload-level OAuth M2M holds. AgentCore Identity becomes the answer when one of three triggers fires: per-user RBAC at Databricks (cross-team commit with the data team, gated on Hubexo's entitlement service shipping — see §10), a new OAuth-based outbound service (MS Graph, Salesforce, or any per-user-OAuth integration), or a SecOps mandate for centralised credential management.
Tool layer5 tools in a typed in-process registry (TOOL_REGISTRY); raw SQL via databricks-sql-connector.Databricks managed MCP — DBSQL (phase 1) + UC Functions (phase 2) + native Pydantic AI tools (non-SQL)Two managed MCP server types on the same UC governance plane. DBSQL MCP in phase 1: the closest match to today's raw-SQL pattern in rune — no upfront function definitions, no decomposing v1's generic query_projects_database before shipping. The agent writes SQL against UC tables; UC ACLs on rune-sp enforce access. UC Functions in phase 2: as common query patterns crystallize, wrap them as typed, named, audited operations. Both share the same UC governance plane (grants, row filters, column masks) — same SP, same scope. The MCP server type is interchangeable; the governance plane isn't. Non-SQL tools are just decorated Python functions in the agent code. No in-house MCP server to operate.
Tool governanceTable-level UC grants on rune-spTable-level UC grants on rune-sp (no change, via DBSQL MCP)Same pattern, same plane — RBAC at the data plane via Unity Catalog grants on the SP. Phase 1 DBSQL MCP runs ad-hoc SQL against tables governed by table-level grants — same as today, exposed via MCP. Phase 2 adds function-level grants alongside table-level grants when UC Functions arrive, both visible in the same UC governance plane. Row filters and column masks (GA in 2026), plus ABAC policies (public preview), become the per-tenant entitlement surface when Hubexo's entitlement service ships (see §10 risk). Databricks AI Gateway is a phase-2 trigger, earned when one of: a SecOps mandate for centralised audit beyond UC's native logging, per-key rate-limit policy (e.g. external partners), or a second tenant pattern arrives. Non-SQL Pydantic AI tools governed by IAM on the AWS side.
Observability + evalMLflow traces in rune-traces-prod-uc + the paired-bootstrap eval pipeline as separate artifacts; outputs joinable by hand, not by shared run-ID.MLflow on Databricks (existing workspace)MLflow 3 has full GenAI eval: mlflow.genai.evaluate(), built-in Safety / RelevanceToQuery / Guidelines scorers, custom LLM judges via make_judge() + Judge Builder UI, eval datasets + labeling sessions + production sampling as first-class objects, prompt registry with version aliases. mlflow.set_active_model() binds prompt version → trace → eval run-ID. Unity Catalog ACLs are available if we introduce role-based prompt promotion in phase 2 — not part of phase-1 scope. Data stays in europe_prod_catalog; no second vendor, no DPA, $0 marginal cost. MLflow 3.0 (June 2025) GenAI eval; OTel-compatible; Apache-2.0.
Batch inferenceBedrock Batch Inference (Sonnet 4.6, EU cross-region)50% discount vs on-demand on 200k+ jobs. Same model family as the interactive lane, so eval rubrics carry over verbatim. JSONL on S3 in/out. 50% on-demand discount; AWS-reference at awslabs/aws-bedrock-batch-stepfn.
Batch orchestrationStep FunctionsAWS-native control flow, retry policies, OTel into MLflow (the same OTel exporter — Databricks ingests OpenTelemetry). (Databricks ai_query() is the documented warehouse-native alternative for the pure-SQL case where input is already in Delta — see §8 and §9.) AWS-native; OTel into MLflow; AWS-reference pattern.
Batch output storageDelta (truth + serving for now)Delta for replay, lineage, joins with the warehouse, and — for phase-1 traffic — backend reads via Databricks SQL warehouse. DynamoDB cache is the phase-2 lever if we observe (a) backend read latency on Delta becoming a measurable share of request budget, or (b) Databricks SQL cost on read-heavy traffic outpacing a DynamoDB single-table. W5–8 batch lane work measures Delta-only read latency on real backend traffic patterns; the cache is wired only if the measurement says it's worth it.
Cache invalidationSource-data version + TTL; prompt/model changes trigger a re-batch decision, not auto-invalidatePrimary trigger: source-row version bump in the upstream Delta tables (project, company, contact) — when the underlying data changes, the cached pitch is stale and the entry is dropped. Secondary: TTL (30d default) for long-tail entries. Prompt-registry and model-card changes are operator-initiated re-batch decisions.
RegionAWS eu-north-1 (Stockholm)AWS eu-north-1 (Stockholm) (no change)All proposed AgentCore primitives are GA in Stockholm by 2026. The earlier AgentCore Evaluations gap is moot because we picked MLflow on Databricks for evals (see §7 Observability insight). runebot, terraform, and Bedrock clients are already eu-north-1 today — no migration. AgentCore GA in Stockholm Q1 2026 with VPC + PrivateLink.
Consumer surface/commercial embedded UI (~1000-line HTML+CSS+JS page rendered server-side, embedded by Commercial Platform; no GET endpoints to list/replay conversations)Headless conversation APIConversation + Turn + Feedback resources, cursor-paginated history, Idempotency-Key headers, explicit POST .../cancel, per-tenant scoping (the 2025-2026 industry pattern). v1 already stores session_id / turn_index / external_user_id per turn; v2 adds the GET endpoints (see §7 Surface). 2025-2026 industry pattern; OpenAI Conversations API + Salesforce Agentforce + LangGraph Threads all converge.
The architecture, end to end

Six layers, top to bottom — plus a parallel batch lane.

One Commercial user. One trace. When Anna uses 'find similar projects' in Commercial, the proxy mints a traceparent and stamps every downstream span with her Cognito user_id, her tenant_id, and the project_id she's working on. Every box below the browser inherits that context — ALB access log, Fargate proxy, AgentCore session, each Pydantic AI tool call, every Bedrock Converse invocation, every Databricks MCP query. In MLflow's rune-traces-prod-uc experiment, Anna's session is one clickable trace: the prompt she saw, the tools the agent chose, the SQL it ran, the tokens we charged (auto-tracked per call), the latency at each hop, and — via the mlflow.set_active_model() binding — the eval scores when we replay it.

Mobile (app)native · SSE consumer
Commercial UIvia Apollo BFF
ALBrouting · TLS
Fargate proxyauth · SSE shim · cancellation · trace-context injection
AgentCore Runtime microVM · 8h sessions · consumption-based pricing in-vm state · per-session RAM
Pydantic AI typed agents · OTel · @agent.tool
Databricks MCPDBSQL · SQL on UC tables
Bedrock Converseclaude · mistral · nova · EU
trace: anna-2026-05-17T14:22Z  user=anna@acme  tenant=acme  project=p-9182
├─ proxy.receive      12ms
├─ agentcore.session  (8h microVM)
│  └─ pydantic_ai.run
│     ├─ tool.databricks_mcp.find_projects   340ms
│     ├─ bedrock.converse (claude-sonnet-4.6) 2.1s · 4,200 tok in / 580 out
│     └─ tool.databricks_mcp.get_company      120ms
└─ proxy.stream_close 2.6s · eval_score=0.84 (MLflow run-ID linked)

That same Pydantic AI agent class also runs as a batch job. Above is the interactive lane that handles Anna's chat. Below is the batch lane — the 200k sales-pitch pre-compute on Step Functions + Bedrock Batch. Two substrates, one definition.

Batch lane · parallel runner · same Pydantic AI agent definitions

Databricks / S3
input rows
Step Functions
orchestrate · retry
Bedrock Batch
50% discount · JSONL
Delta
truth + serving
Backend
reads via Databricks SQL

Same Pydantic AI Agent definition as the interactive lane — same prompt, same typed output schema, same eval rubric. Only the orchestrator changes. 200k sales-pitches in ~$1,350 per regeneration cycle on Sonnet 4.6 batch; the same job on Gemini 3 Flash batch is ~$250 (~5× cheaper) — a cost lever we'd evaluate against rubric scores in the spike. Cache invalidation keyed on (prompt_version, model_version).

How we build this picture in 8 weeks · §6

Build path

Two months. Foundation in W1–4, agents and go-live in W5–8.

The plan proves the foundation in eight weeks across two agent classes — rune (chat) and sales-card (typed-output) — with sales-card running on two runners: interactive (AgentCore Runtime) and batch (Step Functions + Bedrock Batch, called sales-pitch in the batch lane). Same Pydantic AI definition, same eval rubric, two substrates. The remaining four sibling LLM features (sales-brief, draft-email, summary, sales-copilot) are shape-equivalent to one of these two and migrate in phase 2. W1–4 builds the foundation with rune as the first agent on the new stack; W5–8 adds sales-card + sales-pitch, AgentCore Memory, the Conversation API, and v2 going live alongside v1. Architecture conversations are continuous throughout the build; by week 4 we've integrated enough to formally re-scope if needed before committing W5–8.

The 8-week timeline. Two phases, with continuous check-ins throughout and a go-live decision at week 8.

Weeks 1–4 · Foundation
Stack standing, rune as first agent, tracing + first eval dataset

AgentCore Runtime + Pydantic AI deployed across all three environments (playground / dev / prod) via Terraform — same IaC pattern as v1. Rune ported to the new stack as the first agent class. Databricks managed MCP (DBSQL MCP) replaces today's databricks-sql-connector calls — a direct port of v1's raw-SQL pattern, no upfront function definitions needed. Fargate proxy live with auth normalization, SSE shim, cancellation, and trace-context injection. MLflow tracing flowing end-to-end into rune-traces-prod-uc — spans, tool calls, costs, latency. First eval dataset registered: sales-card's existing paired-bootstrap pipeline ported into mlflow.genai.evaluate(), with mlflow.set_active_model() binding traces to model versions.

  • All AWS infra (AgentCore Runtime, ECS Fargate, ALB, Cognito, IAM, Secrets Manager) managed via Terraform in ai-hub-terraform — same IaC pattern as v1.
  • Verifies inbound auth — Hubexo ID Cognito JWT + X-API-Key for internal callers — confirms claim shapes for proxy normalization to (user_id, tenant_id, project_id, product_source, session_id).
Weeks 5–8 · Agents + go-live
Sales-card + sales-pitch, AgentCore Memory, Conversation API, go-live alongside v1

Sales-card (typed-output) lands on the new stack with the existing paired-bootstrap eval pipeline now wired through MLflow's prompt registry + production alias — promotion is gated by the eval score, not by a deploy log. The same sales-card class runs in the batch lane (sales-pitch) via Step Functions + Bedrock Batch, with Delta output and cache invalidation wired to MLflow prompt-alias changes (via Databricks webhook or polling the production alias). AgentCore Memory wired for cross-session persistence. Rune's remaining tools port over, and rune's existing auto-loop rubric scorer (today in evals/auto-loop/) joins MLflow's eval datasets alongside sales-card. v1's FastAPI paths keep running indefinitely — v2 ships alongside, not over. Once shadow A/B is clean, interactive traffic promotes to the v2 stack; v1 stays available as the safety net. Consuming teams migrate to the Conversation API on their own timeline.

  • Conversation API v1 ships alongside: POST/GET /v1/conversations, GET .../turns, POST .../cancel, POST .../feedback with Idempotency-Key support. Consuming teams build against the API on their own timeline (see §7 Surface).
  • SLOs at go-live: p95 latency, eval score on the held-out set.

Agent shapes — what we build now, what we earn the right to do, what we explicitly defer.

Build · now Two Pydantic AI agent classes on two runners

Rune (chat) on AgentCore Runtime, landing in W1–4 as the first agent on the new stack. Sales-card (typed-output) on AgentCore Runtime in W5–8 — and the same sales-card class running batch (sales-pitch) via Step Functions + Bedrock Batch. One framework primitive, two agent classes, two runners.

  • Foundation in W1–4, agents and go-live in W5–8
Earn · phase 2 Remaining 4 sibling LLM features · deal-prep orchestrator · AgentCore Identity

Sales-brief (multi-section strategy brief, deeper than sales-card's 110-160-word snapshot), draft-email, summary, sales-copilot — all shape-equivalent to rune (chat) or sales-card (typed-output) proven in phase 1; these migrate with minimal incremental work once the foundation is solid — prompt and eval-dataset wiring, not net-new infrastructure. One orchestrator (deal-prep) — the canonical multi-agent case, earned after single-agent foundation proves out. AgentCore Identity — earned by any of three triggers (see §4 Identity row): per-user Databricks RBAC, a new OAuth-based outbound service, or a SecOps mandate.

  • Phase 2, after week 8
Defer · explicit Voice · CRM-write · swarm patterns

All explicitly out of scope until we see real demand. Periodic workflow agents (weekly market-intel, daily briefings) also deferred — sales-pitch batch is the one batch citizen in this build.

  • Not in roadmap

The shadow-traffic period in weeks 5–6 means we never run blind — the new stack proves itself against the old one on real requests before we commit. For sales-pitch batch, the first 200k run validates the cost model and cache hit rate before scheduled regeneration kicks in.

Insights · decisions we've made

The fifteen calls that shaped this proposal.

If you only have time to push back on fifteen things in this proposal, push back here. Each row is one architectural call. Flip any one of them and the rest of the proposal moves.

Posture
Hybrid is the unlock — rent the right primitives, own the open formats

Rent the infra primitives AWS has already solved (runtime, memory). Reuse the Databricks workspace we already operate for MLflow + Unity Catalog. Own the open formats that make every choice reversible — OTel spans, prompt registry, eval dataset schemas, tool function signatures.

Consolidation
Eval, tracing, prompts, and datasets all live in the workspace we already operate

MLflow on Databricks is already provisioned, already in use, and already holding our ML model registry. Putting evals + prompt registry + labeling sessions on the same surface means one less SOC 2 review at contract renewal, one less DPA, one less invoice, one less vendor SLA to track. Unity Catalog ACLs are available as a phase-2 lever if we want role-based promotion later — same permission surface that gates production tables.

Modularity
Every layer is an interface, not an implementation

The architecture is a stack of contracts: OTel for observability, OpenAI-compatible API for inference, Pydantic schemas for output, MCP for SQL tools, JSONL on S3 for batch I/O. Each layer's implementation sits behind its contract and is independently swappable — model providers, eval backends, agent frameworks, runtime substrates, batch orchestrators, tool layers, memory stores. The §9 table maps how localized each swap is; most layers are env-var flips or config changes behind a stable contract (env-var flips, one-line config changes). The next time a new model ships, a new eval platform emerges, or a customer requires on-prem — we swap one component, not the architecture.

One pattern
Pydantic AI gives us one pattern across every agent shape we'll ever build

Today rune is a hand-rolled tool-use loop and the five sibling features are bespoke single-shot Bedrock calls — two distinct hand-written code paths. Pydantic AI's Agent class handles tool-using, typed-output, and multi-turn through the same primitive. One mental model, every shape — the next agent is a schema, a prompt, and an eval rubric — not a framework choice. Type-safety catches schema errors at write-time.

Tooling
Repo skills automate the boilerplate

v2 commits to repo-wide AI tooling as part of the codebase. Skills like create-agent (scaffold a Pydantic AI class + prompt registry entry + eval dataset), create-endpoint (scaffold a /v1/... route with auth + tracing + Idempotency-Key), and create-eval (scaffold an eval dataset + LLM judge + production sampling) turn the "next agent is a schema, a prompt, and an eval rubric" claim into something enforced by automation, not discipline. The next contributor invokes a skill instead of copying patterns from existing agents — fewer mistakes, faster onboarding, less drift between agents over time.

Tool layer
Two managed MCP server types on one UC governance plane — DBSQL phase 1, UC Functions phase 2

Databricks ships managed MCP servers for both ad-hoc SQL (DBSQL MCP) and predefined typed operations (UC Functions MCP) — same UC governance plane for both (grants, row filters, column masks). Phase 1 uses DBSQL MCP because rune today writes raw SQL via databricks-sql-connector; DBSQL MCP is the closest match — no upfront function definitions to write, agent ports cleanly. UC Functions arrive in phase 2 as common query patterns crystallize into typed, named, audited operations — fewer attack surfaces, better attribution. For non-SQL tools, native Pydantic AI @agent.tool functions are simpler than running our own MCP server. No in-house MCP server to operate.

Identity
AgentCore Identity is the strategic next platform move, deferred from phase 1

Identity is two problems, not one — only one is phase 1. Inbound stays Cognito (Hubexo ID JWT + X-API-Key). Outbound stays on today's workload-level OAuth M2M to Databricks + Secrets Manager API keys for external services — all the Databricks managed MCP needs in phase 1. AgentCore Identity is the strategic next move — a vault keyed on (workload-id × user-id) that turns the next outbound credential (MS Graph, Salesforce) into a config row and unlocks per-user Databricks RBAC when Hubexo's entitlement service ships and the data team commits to the cross-team work. Posture today: the rune-sp service principal is read-only on customer tables; writes are scoped to internal tracking tables (api_questions, tool_calls, product_users, company_profiles) — prompt-injection through tool calls cannot reach customer data.

Observability
MLflow on Databricks is the system of record. AgentCore Observability is a data source.

MLflow on Databricks is the system of record — same workspace as our ML model registry, already running rune-traces-prod-uc and rune-traces-dev-uc for production tracing today. In W3-4 we wire eval datasets, labeling sessions, the prompt registry, and Judge Builder into the same experiment; mlflow.set_active_model() binds traces to prompt versions. Both phase-1 Pydantic AI agents (rune, sales-card on interactive and batch) and the proxy emit OTel into MLflow with full trace ↔ prompt ↔ eval linkage; phase 2 agents inherit the export. Prod regressions point straight at the prompt version and input that produced them — shared run-IDs by construction. Customer data stays in europe_prod_catalog; Unity Catalog ACLs are a phase-2 lever for role-based prompt promotion. Langfuse Cloud Pro stays the documented escape hatch — OTel-compatible, ~2–3 engineer-days to flip (see §9).

State
Two kinds of state — per-session in microVM RAM, cross-session in AgentCore Memory

Per-session state dies when the microVM session closes. Cross-session recall (prefs, history, semantic search) lives in AgentCore Memory. Two distinct stores, two lifecycles — keep them separate to avoid the bug class.

Batch
Batch is a runner, not a lane — same Pydantic AI agent definition

The agent class, the prompt-registry entry, the eval rubric, the typed output schema — all identical across interactive and batch substrates. Only the executor changes (AgentCore Runtime vs Step Functions + Bedrock Batch). This is why batch doesn't bifurcate the codebase.

Multi-agent
Don't go multi-agent yet — fewer stronger agents, plus workflows

Multi-agent burns far more tokens than equivalent single-agent work, and only pays back when specialists genuinely parallelize. Rune isn't there. Two Pydantic AI agent classes in phase 1 (rune, sales-card — the latter running both interactive and batch), four sibling features queued for phase 2 — each single-purpose. Multi-agent orchestrators earned later when a use case proves it.

Tools
When does a feature become a rune tool? Four questions.

Rune has 5 registered tools today (3 SQL, 1 external ML, 1 LLM-backed service). v2 expands the surface — but not blindly. Each candidate goes through 4 questions: (1) Already called from a non-rune agent? → Pattern B (separate AgentCore Runtime; rune calls via HTTP). (2) Output deterministic per small key, regenerated infrequently, read often? → Pattern C (batch pre-compute; tool just fetches Delta). (3) Per-request unique AND only called inside rune? → Pattern A (in-process @agent.tool with internal Bedrock call). (4) Multi-turn / stateful / has its own UX? → It's an agent, not a tool — peer runtime. Phase 1 patterns emerge from eval-driven prioritization in W3-6.

Region
Stockholm stays home — eu-north-1 is already where we live

All phase-1 components are GA in Stockholm by 2026; the AgentCore Evaluations gap doesn't bind us because eval lives in MLflow. The workspace already holds production data, and EU-DPF + Databricks DPA cover the residency story.

Residency
Customer trace data never leaves the EU Databricks workspace

Traces, evals, prompts, and labels all land in europe_prod_catalog — the customer's own EU Databricks workspace, governed by the existing Databricks DPA. No cross-EEA hop to a SaaS observability backend, no second data processor in the GDPR inventory, no SCCs to maintain for a separate vendor. The architecture's EU posture is now "data stays where it was produced" rather than "data flows to a vendor in Ireland under DPA + SCCs."

Locale
One canonical prompt per agent, locale as a variable — not eight prompt forks

v1 resolves the 8 customer locales (SV/DA/NO/FI/DE/FR/CS/SK) to a language name + Databricks language_id via app/i18n.py; the agent's system prompt gets the correct {response_language} for all 8. But UI string translations via the _i18ns join cover only 4 (SV/DA/NO/FI); DE/FR/CS/SK fall through to English. The plumbing is right; the discipline isn't — sales-card bundles 15 hand-written few-shots across 5 languages into every call (DE/FR/CS/SK get the bundle with no relevant example), and formatting helpers leak hardcoded Swedish labels (Projekt-ID, Titel, Län) into every locale. v2 collapses this to one canonical prompt per agent registered via mlflow.genai.register_prompt with {{user_locale}} as a variable; per-locale few-shots only where Claude Haiku drops on morphology (Finnish, Czech, Slovak); a custom make_judge() language-consistency scorer keyed on expected_language blocks promotion of any prompt that drifts mid-response. Each locale gets a native-authored eval slice, and model routing follows the eval — Haiku 4.5 for the high-resource five, Sonnet 4.6 for FI/CS/SK. A new locale needs a translation tier, an eval slice, and a model-routing decision — same shape, same gate as a new agent.

Surface
Headless API — consumers build their own UI

Today's /commercial is a ~1000-line HTML+CSS+JS page served from runebot, embedded by the Commercial Platform — markdown, tables, feedback UI, in-browser session state, all owned by the agent team. Consumers build their own UI; runebot ships an API, not pixels. v1 already stores the state — every api_questions row carries session_id, turn_index, external_user_id, prompt, answer, feedback — but no GET endpoint exposes it, so today's embedded UI manages history in-browser. v2 adds the canonical 2025-2026 resource model: Conversation + Turn + Feedback, with cursor-paginated list/replay endpoints, streaming SSE submit with Idempotency-Key headers, and explicit POST .../cancel. Cross-session memory stays a separate AgentCore Memory resource. Consumers integrate via the typed SDK generated from the OpenAPI spec — Stainless or Speakeasy auto-publish TypeScript / Python / Go clients on every spec change, so client code stays in sync with the API. Glenigan, ConstructionWire, and Leadmanager get a defined integration contract.

Q&A · questions we wrestled with

Questions we wrestled with. Answers we'd defend.

Two rounds of research (eight agents in total) surfaced the questions below. Each has an answer we'd defend — but several have nuance worth surfacing for the room, especially around the late switch from Langfuse to MLflow, vendor consolidation, and the deferred multi-agent decision.

Why not Bedrock Agent? It exists, AWS owns it, why are we building on AgentCore Runtime + Pydantic AI instead?

Bedrock Agent abstracts the agent loop in ways that make custom routing, custom memory wiring, and multi-tool composition awkward. AgentCore Runtime is the lower-level primitive — it gives us the microVM and the session lifecycle without dictating the agent loop. Pydantic AI sits on top and gives us a type-safe Python agent we can read, fork, and test, with one primitive covering tool-using, typed-output, and multi-turn shapes. The combination is more code we own but materially more control.

Why isn't AgentCore Identity in phase 1?

Identity is two problems — inbound stays on Cognito (Hubexo ID JWT + X-API-Key), no AgentCore Identity needed there. The question is outbound. None of phase-1's two agent classes — rune (chat) and sales-card (typed-output, running both interactive and batch as sales-pitch) — require per-user RBAC at the Databricks data plane. Today's outbound credentials stay simple: Databricks reads use a workload-level rune-sp Service Principal (OAuth M2M, no per-user identity needed); external APIs (recsys, company-profiling in external mode) use static keys in Secrets Manager.

What exactly is Pydantic AI?

An open-source Python agent framework from the Pydantic team — the people behind the validation library underneath FastAPI and most of modern Python. Type-safe by design (Rust validation core), ~30 model providers in one matrix, MIT-licensed, no upsell. Agents are Agent instances with typed inputs, tool functions decorated with @agent.tool, and typed outputs validated by Pydantic. Auto-self-correction on schema mismatch: if the model returns invalid output, the framework re-prompts with the validation error attached.

Strands vs Pydantic AI — why did we pick Pydantic AI?

Strands inside AgentCore Runtime was the runner-up — AWS's own agent framework, with the deepest native AgentCore integration (4-line deploy, AWS-published reference notebooks). We chose Pydantic AI because: (1) type-safety + Rust validation + auto-self-correction matter more when one framework owns every shape we'll ship — two agent classes in phase 1, the four phase-2 siblings on the same primitive — than they would for a one-off; (2) one Pydantic AI primitive handles tool-using, typed-output, and multi-turn shapes as first-class citizens — Strands is most natural at tool-using and treats the others as add-on patterns; (3) reversibility is greener (Pydantic-team trust + native Python tool functions).

Why introduce MCP at all, instead of keeping all tools as native Pydantic AI @agent.tool functions?

For the Databricks SQL surface specifically, MCP is how we use Databricks's managed MCP servers. Phase 1 uses DBSQL MCP (the agent writes ad-hoc SQL against UC tables — the closest match to today's databricks-sql-connector pattern, no upfront function definitions needed). Phase 2 adds UC Functions MCP as common query patterns crystallize into typed, named operations. Both share the same UC governance plane (grants, row filters, column masks) — same rune-sp, same access boundaries. We're not running our own MCP server for non-SQL tools; those live as native Pydantic AI @agent.tool functions inside each agent. No in-house MCP infrastructure.

What's a Unity Catalog function, and when does v2 add them?

A Unity Catalog function is a governed, audited SQL/Python routine registered inside Databricks — same access control as a table, same audit trail, same lineage. Databricks ships a managed MCP server that exposes each UC function as its own MCP tool with its own typed schema — the agent's list_tools call returns find_projects(...), get_project_contacts(...), etc. as distinct tools, auto-discovered with zero per-function Python wrappers on our side. Per-function effort lives on the Databricks side as a CREATE FUNCTION + a descriptive COMMENT (which becomes the LLM's tool description). v2 adds UC Functions in phase 2 — once we've shipped phase 1 on DBSQL MCP and observed which queries become repeated patterns, we wrap those as UC Functions for type safety, easier audit attribution (named call vs SQL string), and prompt-injection resistance (the agent can't write arbitrary SQL through a typed function). Phase 1 stays on DBSQL MCP: the agent writes ad-hoc SQL, governed by UC table-level grants on rune-sp.

Why DBSQL MCP first instead of UC Functions?

DBSQL MCP is the closer match to v1. Today rune writes raw SQL via databricks-sql-connector; DBSQL MCP runs raw SQL against UC tables via MCP — almost a direct port. No upfront function definitions to write, no decomposing v1's generic query_projects_database into typed operations before shipping. The agent moves fast, ports cleanly. UC Functions are a phase-2 layering, not a replacement. Once phase 1 is live and we see which queries get asked repeatedly, we wrap those as UC Functions — typed inputs/outputs, named operations, prompt-injection-resistant (the agent can't write arbitrary SQL through a typed function), easier to audit (a function call attributed to a specific operation, not a SQL string). DBSQL MCP stays as the fallback for ad-hoc queries that don't fit a function. Same governance plane either way. Both share the same UC ACLs on rune-sp — same grants, same row filters and column masks (both GA in 2026), same SP scope (read-only on customer tables; writes scoped to internal tracking tables). The MCP server type is interchangeable; the governance plane isn't. Switching costs are low because nothing about the agent layer changes — the same Pydantic AI agent class points at a different MCP toolset.

How do UC catalog/grant changes reach prod, and how does this evolve when UC Functions arrive in phase 2?

Phase 1 (DBSQL MCP): the catalog, schema, and grants on rune-sp are managed by Terraform (databricks_catalog, databricks_schema, databricks_grant). Promotion is catalog-level: dev_catalog for the dev workspace, europe_prod_catalog for prod. CI runs terraform plan on PR; on merge, GitHub Actions runs terraform apply. No function bodies to ship — the agent writes SQL against tables exposed through DBSQL MCP, governed by table-level grants. Phase 2 (UC Functions): when common query patterns crystallize, function bodies live as .sql files in a Databricks Asset Bundle (DAB) in the runebot repo. CI adds bundle validate; merge runs bundle deploy --target prod alongside the existing terraform apply. Drift defense (added in phase 2): (a) a CI integration test boots the agent against the dev catalog and asserts list_tools contains the expected names — catches renames and signature changes before merge; (b) the eval suite is the second net — a missing function fails the next eval run before the production alias moves. UC functions don't carry MLflow-style version aliases (that's a registered-model feature) — the catalog is the unit of promotion.

Do we need Databricks AI Gateway in phase 1, or can we skip it?

Skip in phase 1. UC grants on rune-sp (table-level for phase 1 DBSQL MCP, function-level added in phase 2 when UC Functions arrive) already enforce RBAC at the data plane — that is the actual security primitive for the Databricks-managed MCP. AI Gateway would add an extra control plane on top (centralised audit beyond UC's native logging, per-key rate limits, key rotation policies) — useful eventually, but none of phase-1's two agent classes drive demand for it. Same pattern as the AgentCore Identity deferral: explicit triggers documented, lever ready when one of (a) SecOps mandate for centralised audit, (b) per-key rate-limit policy for external partners, or (c) a second tenant pattern arrives. Phase-1 traffic runs on UC grants + workload-level OAuth M2M (rune-sp); non-SQL Pydantic AI @agent.tool functions stay governed by IAM on the AWS side.

What do "paired bootstrap", "kappa floor", and "tripwire" actually mean?

Paired bootstrap — run both prompts on the same 200 test cases. Randomly resample those 200 pairs 10,000 times. The gate isn't a fixed threshold; it's whether the 95% confidence interval excludes zero in favor of the new version. Kappa floor — Cohen's / Fleiss' kappa measures inter-judge agreement adjusted for chance. 0.4–0.6 is substantial, 0.7+ is excellent. Below the floor, scores can't be trusted. Tripwire — anti-gaming check. Two patterns block promotion regardless of the headline score: (a) judges suddenly stop agreeing with each other (kappa collapse — the prompt may be confusing them); (b) LLM-judge score goes up while the deterministic oracle score goes down (the prompt is gaming the judges while breaking objective rules).

How does eval-gated promotion work — prompt iteration?

Engineers iterate prompts in the MLflow prompt registry against the agent's held-out dataset, scored via mlflow.genai.evaluate() with the configured judges — fast feedback in a notebook or the MLflow UI, no CI wait. When a candidate clears the gate (paired bootstrap shows improvement vs the prior prompt, tripwires didn't fire, kappa floor cleared), they reassign the production alias to the new version. Code resolves prompts by alias, so the alias change is the deploy — no PR for routine prompt tweaks. Roll back by reassigning the alias to the prior version.

What about changes that touch tool schemas or agent shape?

Code changes (new tool, schema field, agent shape) go through a PR with the eval suite running in CI. Here the eval is the safety net — a regression check, not the optimization target.

Why MLflow over Langfuse (and Braintrust, AgentCore Observability)?

All four — plus Pydantic Logfire, which is general-purpose Python observability rather than LLM-eval-specific — are valid trace sinks. MLflow on Databricks wins on what compounds for us: it's already provisioned and already in use, so the marginal cost is zero and one less vendor needs a SOC 2 review at contract renewal. Unity Catalog ACLs are available if we want role-based promotion in phase 2 — the same permission model that already gates production tables and ML models, ready to extend. mlflow.set_active_model() binds prompt version → trace → eval run-ID, so a prod regression repros in eval by run-ID. And MLflow ships judge alignment (MemAlign / DSPy) for tuning custom make_judge() judges against human-labeled examples — Langfuse has no equivalent. Where Langfuse is genuinely better: prompt-playground UX, A-vs-B paired-bootstrap rendered as a UI panel, and annotation queues with assignment / SLA tracking that are more polished than MLflow's labeling sessions (which cover the same primitives but require Unity Catalog and Judge Builder rather than a dedicated workflow UI). We trade that polish for the four wins above plus $0 marginal cost; the missing stats panels are computed in CI from raw scores (~40 lines total). Braintrust, AgentCore Observability, and AgentCore Evaluations remain documented swap targets in §9.

If we switch to MLflow now, can we go back to Langfuse in six months if we hate it?

Yes, and the cost is bounded. The §9 modularity table marks Eval & tracing backend as Green forward — OTel endpoint flip plus export scripts for prompts and eval datasets, ~2–3 engineer-days. Historical traces stay queryable in MLflow; the CI-side paired-bootstrap script is sink-agnostic.

Lambda vs. Fargate for the proxy layer?

Fargate. The proxy holds long-lived SSE streams (sometimes minutes) and benefits from a warm pool. Lambda's cold-start tail and 15-minute hard ceiling make it the wrong shape for a streaming-proxy role. The Fargate proxy also owns SSE shimming and cancellation propagation — both of which need persistent connection state.

How does AgentCore Memory actually work — what's the API shape we care about?

Three operations matter: write (event-style, indexed automatically), recall (semantic, returns the top-k most relevant prior events for a query), and scope (per-user, per-session, or per-workload). It's not a vector database we manage; we don't think about embeddings or indexes. We write events, we recall when relevant.

AgentCore Memory is the one Red item in §9's reversibility table — how confident are we in its recall quality?

Honestly, less confident than the rest. AgentCore Memory has no third-party benchmarks; competitor Zep wins LongMemEval at 63.8% vs Mem0 at 49.0% on the public benchmark — that's a 15-point gap. We're betting on AWS's recall quality with no independent numbers to anchor it. Mitigation: the W5-8 memory implementation runs a head-to-head benchmark on a 50-conversation eval slice — AgentCore Memory vs Zep Cloud (EU) vs Mem0 vs pgvector + summarizer. Routing decision for the Memory primitive is gated on this benchmark. If AgentCore wins or ties, ships as planned. If Zep wins, the architecture survives the swap because everything else is decoupled (§9 marks Memory Red because the migration is data-plane work, not architecture work).

How does batch fit into the architecture?

Same Pydantic AI agent definition runs under two substrates. Interactive = AgentCore Runtime (microVM session, 8h, request/response or SSE). Batch = Step Functions invoking Bedrock Batch Inference, with input from Delta/S3 and output to Delta (DynamoDB cache is the phase-2 lever). The agent code is identical; only the wrapping changes. The first batch citizen is sales-pitch: 50k projects × 4 roles = 200k pre-computed pitches, served from Delta via Databricks SQL warehouse to the backend in phase 1.

Why not use AgentCore Runtime for batch too?

AgentCore is microVM-per-session — pricing and substrate designed for interactive request/response. Sessions also hard-cap at 8 hours, so a single Runtime session can't carry a full overnight batch — you'd be orchestrating session rotation on top of the cold-start cost, reimplementing what Bedrock Batch already does. 200k jobs in a window is the wrong shape; you'd be paying for microVM cold-starts on every project. Bedrock Batch is purpose-built — 50% discount, AWS-managed queue, scheduled completion. Step Functions → Bedrock Batch is the well-trodden fan-out pattern — managed queue, scheduled completion, no agent-loop overhead per row. AgentCore runs rune (interactive); batch bypasses it. Consistent with the hybrid posture: rent the right primitive for each shape of work.

What's the cache invalidation policy for the 200k pre-computed pitches?

Three triggers. (a) Prompt registry alias change in MLflow (Databricks webhook or polling the production alias) — when the sales-pitch prompt changes, the cache is stale. (b) Bedrock model card change — when Sonnet 4.6 → 4.7 ships, all outputs are stale. (c) TTL — 30 days default for long-tail entries. Cache key is (entity_id, prompt_version, model_version) so partial invalidation works. Backend reads cache first; cache miss falls through to the live agent (which writes back to cache). Hybrid serving — pre-compute the high-traffic 80%, live agent covers the long tail and edits.

Step Functions vs Databricks ai_query() — when each?

Step Functions for AWS-native control flow, retry policies, OTel into MLflow. ai_query() for the pure SQL-shaped case where input is already in Delta and we want warehouse-native parallelism with zero data movement. ai_query() loses Bedrock's 50% batch discount when routing through external Bedrock endpoints — it gains operational simplicity (one SQL statement) at the cost of token rate. W5–8 batch lane work picks one for sales-pitch; the other becomes the documented alternative.

How do we keep model costs bounded as more agents ship on the new stack?

Two layers. Today, no hard cap — interactive Bedrock spend is monitored monthly (historical peak ~$500/mo), but spend can grow with traffic. v2 closes the gap with per-tenant rate limits at the Fargate proxy — token-bucket per consuming product (Commercial, Mobile), separate budgets per resource class (interactive turns vs batch jobs vs session listing). Per-region Bedrock TPM ceilings are respected by the AWS-managed quota system. Cost dashboards land in MLflow alongside trace data so per-agent unit cost is queryable. The v2 controls — per-tenant rate limits + per-agent cost visibility — replace the current "watch the bill" posture.

How does per-tenant data entitlement work — two Swedish customers seeing different data, or Glenigan users seeing different tables entirely?

Phase 1: rune-sp has table-level grants on UC — coarse access (the SP can read what the SP is granted; same access for all callers). This works for phase 1 because rune's queries don't differentiate between tenants today (same SP, same grants, same data surface). Per-tenant fine-grained access requires a Hubexo entitlement service that maps (user, tenant) → allowed data scopes — Hubexo is considering building this but it's not yet available. Where it sits and how it integrates with the agent layer is an org-wide decision still open. When it ships, the architecture plugs in via Databricks-native primitives — UC row filters and column masks (GA), dynamic views referencing current_user() or is_account_group_member(), function-level filter arguments once UC Functions arrive, or ABAC policies attached at catalog/schema level. The agent layer doesn't need to change — entitlement enforcement happens in UC under the hood, regardless of whether the agent calls DBSQL MCP or UC Functions MCP. We're architecturally ready; this becomes a plumbing change when the entitlement service exists and the integration pattern is agreed across the org (see §10 risk).

Glenigan UK has a 2026-09 deadline. How does this proposal handle the timing, data residency, and per-customer entitlement?

Three things to decompose here. Timing: phase-1 go-live (~W8 from kickoff) lands a couple of months before the September Glenigan deadline. Phase-1 foundation work does not directly accelerate the UK regional substrate — UK Databricks tables, UK trace store, and UK eval slice are net-new work, not config flips. The v2 architecture keeps the option open via cross-region Bedrock profiles, per-region Databricks catalogs, and externalized auth — but "keeps the option open" means we can build it after, not that it's free. Data residency: Glenigan UK data needs to land in Databricks somewhere; exactly how that integrates with our EU workspace is an org-wide decision with the data team (not yet made). The architecture doesn't lock us into a particular integration pattern. Per-customer entitlement: distinguishing Glenigan users (UK data access) from European users (RSM data access) requires Hubexo's entitlement service — where the entitlement service sits and how it integrates with the agent layer is a separate org decision (see the entitlement Q&A above and §10 risk). We're architecturally ready (UC row filters, dynamic views, function-level filter args all available), gated on the entitlement service shipping. What we want from this call: alignment on (a) where Glenigan UK data lands in Databricks, (b) entitlement service ownership and timeline, (c) whether to parallel-track Glenigan on v1 stack alongside v2 phase-1, accept slight slip on either, or reorder. Not making this call in the proposal — making it together with product leadership.

How does the agent respond in the right language for each customer locale (Swedish, Danish, Norwegian, Finnish, German, French, Czech, Slovak)?

v1 resolves the 8 customer locales (SV/DA/NO/FI/DE/FR/CS/SK) to a language name + Databricks language_id; the agent's system prompt gets the correct {response_language} for all 8. The frontend passes country_code + ui_language, app/i18n.py maps it, the language name is interpolated into the system prompt with a "respond entirely in this language" directive, and tools join _i18ns translation tables so values come back localized at the data layer — but UI string translations (table headers, error messages) are populated for only 4 (SV/DA/NO/FI); DE/FR/CS/SK currently fall through to English UI labels. v2 keeps the same propagation but tightens the prompt layer: one canonical prompt per agent registered in the MLflow prompt registry with {{user_locale}} as a templated variable (instead of today's mix of single-prompt + 15-example bundles in sales-card and hardcoded Swedish field labels in the formatting helpers), per-locale few-shots injected only for the harder languages (Finnish, Czech, Slovak — where Claude Haiku quality drops on Slavic + Finno-Ugric morphology), and a make_judge() language-consistency scorer keyed on expected_language that fails any prompt drifting mid-response. Each locale gets a small native-authored eval slice (~12 cases) registered alongside the master dataset; promotion to the production alias scores across all 8 slices before the alias moves. Model routing follows the eval — Haiku 4.5 default for SV/DA/NO/DE/FR, Sonnet 4.6 default for FI/CS/SK, with quarterly re-benchmarks. See §7 Locale insight for the full architectural call.

How do consuming apps (Commercial Platform, Mobile, future products) show historical conversations to their users in v2?

v2 ships a headless conversation API. The state already exists — every api_questions row carries session_id, turn_index, external_user_id, prompt, answer, and feedback — but no GET endpoint reads it back, so today the embedded UI manages history in-browser. v2 exposes the 2025-2026 industry-standard resource model: Conversation + Turn + Feedback. Endpoints: POST /v1/conversations (create) · GET /v1/conversations?cursor=... (cursor-paginated list, scoped to the authenticated principal — never trust client-supplied conversation IDs) · GET /v1/conversations/{id} with metadata (auto-generated title from first turn, preview, message-count, updated-at) · GET /v1/conversations/{id}/turns (full replay) · POST /v1/conversations/{id}/turns with Idempotency-Key header (streaming SSE submit) · POST /v1/conversations/{id}/turns/{turn_id}/cancel (explicit cancellation — every reference platform ships this because disconnect detection is unreliable) · POST /v1/turns/{id}/feedback (per-turn thumbs + free-text, the existing feedback flow repositioned). Cross-session memory (preferences, semantic recall) stays a separate AgentCore Memory resource. Consuming teams integrate via a typed SDK generated from the OpenAPI spec (Stainless or Speakeasy) — TypeScript / Python / Go clients regenerated on every spec change, so client code stays in sync with the API automatically. New products call typed functions instead of handcrafting HTTP + SSE plumbing. See §7 Surface insight for the architectural call.

Reversibility

Every layer is swappable. The question is how localized the swap is.

§7's Modularity insight frames every layer of this architecture as an interface, not an implementation. This section makes that concrete: for each strategic layer, we identify the realistic alternative product we'd swap to, and how localized that swap is — config, code, or data plane. Green means the swap is a config or env-var change behind a stable contract. Amber means a localized refactor or one infra component swap with clean boundaries. Red means data migration, schema change, or cross-cutting infra rework. Most everyday changes — model provider, eval backend, consumer surface — are green. The amber items are the ones worth knowing about up front.

LayerToday's pickRealistic swapModularity
Model providerBedrock Sonnet (4.6)Vertex Gemini / Anthropic direct via Pydantic AIGreen — provider string in config; no app code touches the model.
Eval & tracing backendMLflow on DatabricksLangfuse Cloud Pro, Braintrust, or self-host LangfuseGreen forward (OTel endpoint flip + export/import scripts for prompts and eval datasets, ~2–3 engineer-days); historical traces stay in MLflow — dual-read during transition.
Agent frameworkPydantic AILangGraph or AWS StrandsAmber — Pydantic schemas + tool functions are portable; agent-loop wrapper and streaming hooks rewrite per agent.
Interactive runtimeAgentCore RuntimeECS Fargate + self-managed session lifecycleAmber — one infra component swap; rebuild session/IAM/autoscaling, app image unchanged.
Memory storeAgentCore Memorypgvector on RDS, or OpenSearch + summarizerRed — re-implement write/recall + summarization, re-ingest existing memory contents.
SQL tool layerDatabricks managed MCPDirect databricks-sql-connector calls in @agent.tool functionsAmber — every SQL tool body rewrites; UC-grant governance (and phase-2 UC-function governance) regresses to app-level RBAC.
Batch inferenceBedrock BatchVertex AI Batch PredictionAmber — prompts unchanged; S3 ↔ GCS staging + IAM ↔ GCP service accounts is the real cost.
Batch orchestrator + outputStep Functions + Delta-only (DDB deferred)Databricks ai_query() into Delta-onlyAmber — orchestrator is a one-component swap; if/when DDB cache is added, collapsing the split is the data-plane piece.
Consumer surfaceHeadless conversation APIThird-party chat SDK (e.g. assistant-ui, CopilotKit)Green — REST + SSE behind a stable OpenAPI contract.
Risks & non-goals

Risks open, non-goals explicit.

One high-severity item, five medium, one mitigated. Read these through the reversibility analysis in §9 — that's the structural backstop. Risks below cover what could go wrong; reversibility above bounds what it costs us to unwind.

High · live
AgentCore is still maturing — the API surface could move during the build

Several GA/preview drops are queued during our build window for AgentCore Runtime and AgentCore Memory. Mitigation: pin to GA features only; the W1–4 foundation phase re-tests the picks against current APIs before we commit further.

Validates · W1–4
Medium · live
Eval engine location is still undecided — every trace, batch, or scheduled?

Three options: every production trace (most coverage, most cost), out-of-band batch on a sample (cheap, lagging), or MLflow scheduled-job (Databricks Jobs, predictable, gappy). Leaning hybrid. Decision before week 5.

Decide by W5
Medium · live
Pydantic AI is a newer AgentCore deployment target than AWS's first-party framework

Pydantic AI released late 2024 and is widely adopted in 2026, but newer than AWS's first-party framework as an AgentCore deployment target — first-party AWS reference notebooks don't yet exist for Pydantic AI. The W1–4 foundation phase writes our own integration glue and validates the deploy path.

Validates · W1–4
Medium · live
Batch lane introduces new operational primitives — Step Functions failure modes, Delta-read latency on backend traffic, cache staleness during model upgrades

Step Functions retry policies and Delta-read latency calibration on backend traffic (DynamoDB cache is the phase-2 trigger we'd evaluate before adding) both need real-data calibration. Cache staleness when the underlying model upgrades (Sonnet 4.6 → 4.7) requires the version-hash key to invalidate cleanly. Mitigation: W5–8 batch lane work runs a 1k pre-compute end-to-end; go-live gates on observed cache hit rate ≥ 80%; cache invalidation wired to MLflow prompt-alias change (Databricks webhook or alias poll) from week 5.

Validates · W5–8
Medium · open
AI Act compliance — sales-copilot drifts toward high-risk classification

The EU AI Act is in force in 2026 with general-purpose AI obligations applying. Sales-copilot specifically may drift toward "high-risk" if it influences employment-relevant decisions (sales targeting, deal scoring, contact prioritization). Article 50 transparency disclosure ("you are interacting with AI") needed for user-facing agents. Compliance workstream runs parallel to phase 1 — agent-by-agent classification + conformity assessment for any high-risk classification.

Compliance workstream
Medium · open
Per-tenant entitlement requires a Hubexo entitlement service that doesn't exist yet

Phase 1 agents use table-level grants on rune-sp — coarse access works because rune's queries don't differentiate between tenants today (same SP, same data surface). When the proposition expands to per-tenant differentiation (two Swedish customers seeing different data, Glenigan UK vs European customers seeing different table sets), the agent layer needs a tenant-aware filter. Databricks-native primitives are ready (row filters, dynamic views, function-level filter args, ABAC) but require an authoritative entitlement source. Hubexo is considering building this; ownership, scope, and integration pattern are org-wide decisions still open. Mitigation: phase-1 scope explicitly excludes per-tenant differentiation; we plug into the entitlement service when it ships — no agent-layer rework needed.

Org decision pending
Low · mitigated
Vendor lock-in on AgentCore — what if AWS deprecates a primitive?

The AgentCore-managed primitives in §9 (Runtime amber, Memory red) each have a self-managed fallback documented in §9; the §9 table summarizes: three green, five amber, one red across nine strategic layers — most everyday changes are env-var flips or config changes behind a stable contract.

Reversibility · §9

Out of scope for the 8-week build

  • No additional new agents. Phase 1 brings rune (today on ECS Fargate) and sales-card (graduating to a typed-output agent class) onto the new stack, with sales-card as the first batch citizen (sales-pitch in the batch lane). The remaining 4 sibling LLM features (sales-brief, draft-email, summary, sales-copilot) migrate in phase 2. Company-profiling is already a phase-1 rune tool today and continues into v2 unchanged. No other new agents during foundation phase.
  • No multi-agent orchestration. Single-purpose Pydantic AI agents — no supervisor + specialists, no swarm. Multi-agent earns its way in only after this lands.
  • No voice agents. Voice introduces latency budgets and platform integrations that aren't worth doing alongside the foundation build.
  • No daily-briefing agent. Periodic workflow agents (weekly market-intel, daily briefings) are the natural second-wave agent type — out of scope for this build.
  • No CRM-write capabilities. Adding write paths into Salesforce or HubSpot is a different security review entirely.
  • No non-EU regional substrates. Phase 1 ships EU-tenant data in europe_prod_catalog only. Glenigan (UK), ConstructionWire (USA), and Leadmanager (Australia) extend the architecture in phase 2; the design choices (cross-region Bedrock profiles, per-region Databricks catalogs, externalized auth) keep the option open without painting us into a corner now. Glenigan's specific timing is addressed in §8 Q&A.
  • No swarm patterns. Multi-agent debate, voting, or consensus loops — not until we have a single orchestrator working first.
  • No model fine-tuning. Bedrock-hosted base models stay as-is. Custom-trained models are a different proposal.
Hubexo · runebot · v2 proposal
v1.0 · presented to engineering + product leadership
Build window
8 weeks · 4 checkpoints · eu-north-1
fin.