Skip to content

Context & Memory

LLMs have finite context windows. A typical agentic session burns through 100K tokens in 30–60 minutes. Without defense, the agent simply forgets — mid-task, silently, with no warning.

The Math

Context consumption in a real session:

SourceTokens (est.)
System prompt~8K
Conversation history~20K
50 tool results × 2K each~100K
Thinking blocks~15K
Total~143K

Most models have a 200K effective context window. A busy session hits the wall in under an hour. The system needs a layered defense — not a single fallback.

Pre-Flight Token Estimation

Before every API call, the system estimates token count to decide if compaction is needed — preventing 413 errors before they happen.

Deep Dive: Token Estimation Across Providers

Token estimation runs for 3 API providers with different backends:

ProviderMethodFallback
Anthropic APINative token counting endpointCharacter heuristic (~4 chars/token)
AWS BedrockBedrock-specific token estimatorCharacter heuristic
Google Vertex AIVertex token count APICharacter heuristic

Key behaviors:

  • Extended thinking tokens counted separately with configurable budget
  • Beta fields (e.g., tool_search) stripped before counting to prevent overestimation
  • Cache-aware: already-cached prompt tokens counted at reduced weight
  • If estimation fails: falls back to character-based heuristic (~4 chars per token)

The estimation feeds directly into the auto-compact trigger: if estimated tokens > effective context window, Layer 3 fires before the API call — not after a 413 error.

5-Layer Escalating Defense

Each layer activates only when cheaper layers are insufficient. Once a layer runs, it does not retry — it either succeeds or escalates.

flowchart TD CHECK{Context\nusage?} -->|">60%"| L1 CHECK -->|">80%"| L2 CHECK -->|">90%"| L3 CHECK -->|"413 from API"| L4 L4 -->|"drain fails"| L5 L1["Layer 1 — Tool Result Truncation\nPersist full result to disk\nKeep only pointer in conversation\nAgent uses FileRead to retrieve later\nCost: ~0"] L2["Layer 2 — Microcompact\nRemove old tool results at boundaries\nNo LLM call needed\nMarks results stale by tool_use_id\nReplaces with brief summary"] L3["Layer 3 — Auto-compact\nCall LLM to summarize full conversation\nCreates compact boundary message\n20K tokens reserved for output\nTriggers memory extraction in parallel"] L4["Layer 4 — Reactive Compact\nEmergency: API returned HTTP 413\nTries context collapse drain first\nThen full LLM summarize\nCircuit breaker: hasAttemptedReactiveCompact"] L5["Layer 5 — Context Collapse\nRead-time projection only\nDoesn't change stored messages\nChanges what is sent to API\nCheapest recovery before Layer 4"] style L1 fill:#1e293b,color:#86efac,stroke:#334155 style L2 fill:#1e293b,color:#7dd3fc,stroke:#334155 style L3 fill:#1e293b,color:#fcd34d,stroke:#334155 style L4 fill:#1e293b,color:#fda4af,stroke:#334155 style L5 fill:#1e293b,color:#c4b5fd,stroke:#334155

Key principle: Each layer runs at most once per iteration. Failure → escalate. No retries within the same layer. Circuit breakers on every layer.

Layer 3 Detail: Auto-compact

The 20K token reserve for summary output is based on empirical measurement: p99.99 of observed compact summaries = 17,387 tokens. The system over-provisions slightly to guarantee the summary never gets truncated.

When auto-compact fires, memory extraction runs in parallel (non-blocking). The summary replaces everything before the compact boundary. Everything before that boundary is permanently discarded.

Layer 4 Detail: Reactive Compact

Triggered when the API returns HTTP 413 (prompt_too_long) mid-turn — meaning the request was already too large to send. The circuit breaker flag hasAttemptedReactiveCompact prevents infinite loops: if the first attempt fails, the session aborts gracefully rather than spiraling.

Deep Dive: Effective Context Window Calculation

The system calculates how much context it can actually use:

FUNCTION getEffectiveContextWindowSize(model):
maxContext = getContextWindowForModel(model) // e.g., 200,000
// Reserve space for compact summary output
// p99.99 observed compact output = 17,387 tokens
// Over-provision to 20,000 to guarantee no truncation
reserved = min(
getMaxOutputTokensForModel(model),
20_000 // MAX_OUTPUT_TOKENS_FOR_SUMMARY
)
RETURN maxContext - reserved
// Example: 200,000 - 20,000 = 180,000 effective tokens

The 20,000 token reserve is an empirical value: based on production telemetry, 99.99% of compact summaries fit within 17,387 tokens. The system adds a ~15% safety margin. This means out of a 200K context window, only 180K is available for actual conversation — the rest is reserved for the summary if compaction is needed.

Compact Boundary = Controlled Forgetting

A compact boundary is not random truncation. It is a system decision about what to forget and what to keep:

Before boundary (discarded):
turn 1 → turn 2 → ... → turn 47
Boundary message (kept):
"Summary: You were refactoring the auth module.
Completed: token validation, session expiry.
In progress: refresh token rotation.
Key decisions: use httpOnly cookies, 15-min expiry."
After boundary (kept):
turn 48 → turn 49 → ...

The summary captures decisions and state — not a transcript. The agent continues with context, not with history.

Deep Dive: Microcompact Algorithm

Microcompact is the cheapest defense that actually removes content (Layer 1 only truncates). It runs without any LLM call:

FUNCTION microcompact(messages, compactBoundaries):
FOR EACH message IN messages:
IF message.type == "tool_result":
// Check if this tool result is "stale" — older than the newest compact boundary
IF message.turnIndex < latestCompactBoundary.turnIndex:
// Option 1: Replace with brief summary
IF message.resultSize > SUMMARY_THRESHOLD:
message.content = "[Tool result truncated. Original: ${message.toolName} " +
"returned ${message.resultSize} chars. Run tool again if needed.]"
// Option 2: Remove entirely if the tool was a read-only query
ELSE IF message.toolName IN [FileRead, Grep, Glob, WebSearch]:
REMOVE message from conversation
// Create microcompact boundary (marks the cutoff point)
IF anyRemoved:
INSERT microcompactBoundaryMessage(removedCount, freedTokens)

cachedMicrocompact variant (feature-gated): Caches previous microcompact decisions so repeated triggers don’t re-scan the entire conversation. Tracks cache_deleted_input_tokens from API response headers to measure effectiveness.

Memory Lifecycle

sequenceDiagram participant S as Session participant M as Memory Store participant N as New Session S->>S: Conversation + tool calls accumulate S->>M: extractMemories() on session end (background, non-blocking) M->>M: Score by relevance, decay older entries N->>M: memoryScan() + findRelevantMemories() on start M->>N: Inject relevant memories into system prompt Note over N: "Forget the conversation,<br/>remember the lessons"

Memory decay: older memories receive lower relevance scores over time. A memory about “user prefers 2-space indent” stays relevant forever. A memory about “file X had a bug on day 3” decays quickly.

Deep Dive: Memory Extraction and Session Memory Compact

extractMemories() runs at session end as a background task (non-blocking):

FUNCTION extractMemories(sessionMessages):
// Spawn a lightweight agent to extract key facts
// This agent has read-only access and a focused prompt
memories = await agent.extract({
messages: sessionMessages,
prompt: "Extract key facts, decisions, and learnings. Skip transient details.",
maxMemories: 20 // Limit per session
})
FOR EACH memory IN memories:
memory.createdAt = now()
memory.relevanceScore = 1.0 // Starts at full relevance
saveToMemoryStore(memory)

sessionMemoryCompact (630 lines) runs in parallel with auto-compact (Layer 3):

FUNCTION sessionMemoryCompact(messages, compactSummary):
// Extract memories from the messages being compacted
// These memories would otherwise be lost when the compact boundary
// replaces the original messages
emergencyMemories = extractKeyFacts(messages)
// Merge with existing session memories (deduplicate)
FOR EACH memory IN emergencyMemories:
IF NOT isDuplicate(memory, existingMemories):
saveToSessionMemory(memory)

Memory age decay:

FUNCTION calculateRelevance(memory):
daysSinceCreated = (now() - memory.createdAt) / ONE_DAY
// Exponential decay with 30-day half-life
decayFactor = 0.5 ^ (daysSinceCreated / 30)
RETURN memory.baseRelevanceScore * decayFactor
// Day 0: score × 1.0
// Day 30: score × 0.5
// Day 60: score × 0.25
// Day 90: score × 0.125

Memories about stable facts (“user prefers 2-space indent”) are tagged with decay: false and maintain full relevance indefinitely. Only transient memories decay.

Team memory: Agents in the same coordinator team share a memory namespace at .claude/team-memory/{teamName}/. Agent A extracts a memory about a codebase pattern, Agent B in the same team can access it in the next session. This enables organizational knowledge accumulation across agent boundaries.

Background Tasks

The system runs 7 types of background tasks, all writing output to disk for resilience:

TypeDescriptionKey Detail
local_bashShell command in backgroundOutput to file, incremental reads via outputOffset
local_agentAgent with own query loopIsolated context, reports to coordinator
remote_agentAgent on remote serverCross-network, same mailbox protocol
in_process_teammateCoordinator workerShared process, message passing
local_workflowScripted pipelineSequential steps, output chained
monitor_mcpMCP server health checkPeriodic ping, restart on failure
dreamMemory consolidationRuns between sessions

Disk-Based Output Pattern

All background tasks write to files. Readers use outputOffset for incremental reading — like tail -f but crash-safe:

Benefits:
Survives crashes → file persists, state not lost
Unlimited size → not constrained by memory
Multiple readers → coordinator + UI read same file
Incremental reads → no re-reading already-seen output

Stall detection: If no new output appears for 45 seconds, the system checks whether the command is waiting for interactive input — scanning for patterns like Y/n?, Continue?, Press any key. If found, it surfaces the prompt to the user rather than silently hanging.

The Dream Task

Between sessions, a dream agent reviews recent work and consolidates memories. It runs in 4 phases:

PhaseAction
OrientLoad recent session summaries and existing memories
GatherExtract key facts, decisions, and patterns from recent work
ConsolidateMerge redundant memories, resolve contradictions, update scores
PruneRemove stale or low-relevance memories below threshold

Idle time becomes improvement time. The next session starts with sharper context than the last one ended with.

Deep Dive: Stall Detection Patterns

Background shell tasks (local_bash) are monitored for stalls — commands that stop producing output because they’re waiting for interactive input:

FUNCTION checkForStall(task):
// Check every 5 seconds (STALL_CHECK_INTERVAL_MS = 5000)
IF (now() - task.lastOutputTime) > 45_000: // STALL_THRESHOLD_MS
// Read the last 500 bytes of output
tail = readTail(task.outputFile, 500)
// Check against interactive prompt patterns
patterns = [
/\[?[Yy](es)?\/[Nn](o)?\]?\s*$/, // Y/n, Yes/No
/\bContinue\?\s*$/i, // Continue?
/\bProceed\?\s*$/i, // Proceed?
/\bPress\s+(any\s+)?key/i, // Press any key
/\bPassword:\s*$/i, // Password:
/\bpassphrase.*:\s*$/i, // SSH passphrase
/\b(y\/n)\s*$/i, // (y/n)
/>\s*$/, // Generic prompt >
/\$\s*$/, // Shell prompt $
/\boverwrite\b.*\?\s*$/i // Overwrite?
]
IF anyMatch(tail, patterns):
notifyAgent("Task ${task.id} appears to be waiting for input: ${tail}")
// Agent can: send input to stdin, kill task, or escalate to user

The 45-second threshold is calibrated: short enough to catch stalls quickly, long enough to avoid false positives during legitimate pauses (e.g., compilation, large downloads).

Why This Matters to You

  • Use /compact proactively → don’t wait for the emergency layer (Layer 4); trigger Layer 3 yourself at a natural breakpoint
  • Why Claude “forgets” earlier instructions → a compact boundary replaced old messages with a summary; the original instruction is gone
  • How memories persist across sessions → extracted at session end, scored, injected into system prompt at next session start
  • What happens during long-running tasks → output written to disk with outputOffset incremental reads; survives crashes
  • What the “dream” task does → background memory consolidation between sessions; you benefit without any action

See also: The Agent LoopMulti-Agent SystemTool Orchestration