Context & Memory

LLMs have finite context windows. A typical agentic session burns through more than 100K tokens in 30–60 minutes. Without a defense, the agent simply forgets: mid-task, silently, with no warning.

The Math

Context consumption in a real session:

Source	Tokens (est.)
System prompt	~8K
Conversation history	~20K
50 tool results × 2K each	~100K
Thinking blocks	~15K
Total	~143K

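Most models have a 200K context window, and the usable budget is smaller once space is reserved for output. At ~143K, a busy session has consumed over 70% of the window in under an hour and keeps growing. The system needs a layered defense, not a single fallback.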
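The arithmetic in the table can be checked with a few lines of Python. This is only a sketch: the per-source figures are the estimates from the table, and the 200K window is the round number used in this section.

```python
# Estimated token usage per source, taken from the table above.
usage = {
    "system_prompt": 8_000,
    "conversation_history": 20_000,
    "tool_results": 50 * 2_000,  # 50 tool results at ~2K tokens each
    "thinking_blocks": 15_000,
}

CONTEXT_WINDOW = 200_000  # assumed window size, as stated in this section

total = sum(usage.values())
fraction_used = total / CONTEXT_WINDOW

print(f"total={total:,} tokens, {fraction_used:.1%} of the window")
```

Tool results dominate the budget, which is why the cheapest layers described later target them first.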

Pre-Flight Token Estimation

Before every API call, the system estimates token count to decide if compaction is needed — preventing 413 errors before they happen.

Deep Dive: Token Estimation Across Providers

Token estimation runs for 3 API providers with different backends:

Provider	Method	Fallback
Anthropic API	Native token counting endpoint	Character heuristic (~4 chars/token)
AWS Bedrock	Bedrock-specific token estimator	Character heuristic
Google Vertex AI	Vertex token count API	Character heuristic

Key behaviors:

Extended thinking tokens counted separately with configurable budget
Beta fields (e.g., tool_search) stripped before counting to prevent overestimation
Cache-aware: already-cached prompt tokens counted at reduced weight
If estimation fails: falls back to character-based heuristic (~4 chars per token)

The estimation feeds directly into the auto-compact trigger: if estimated tokens > effective context window, Layer 3 fires before the API call — not after a 413 error.
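The fallback behavior described above might look roughly like the following sketch. The names `estimate_tokens`, `needs_compaction`, and `count_remote` are hypothetical; `count_remote` stands in for whichever provider-specific counter is in use, and the 180,000 figure is the effective window derived later in this section.

```python
from typing import Callable, Optional

CHARS_PER_TOKEN = 4  # rough heuristic from the table above


def estimate_tokens(text: str, count_remote: Optional[Callable[[str], int]] = None) -> int:
    """Prefer the provider's counter; fall back to the character heuristic."""
    if count_remote is not None:
        try:
            return count_remote(text)
        except Exception:
            pass  # any provider failure falls through to the heuristic
    # Ceiling division so a short non-empty string never estimates to zero.
    return -(-len(text) // CHARS_PER_TOKEN)


def needs_compaction(estimated_tokens: int, effective_window: int) -> bool:
    """Pre-flight check: compact before sending instead of after a 413."""
    return estimated_tokens > effective_window


def broken_counter(_: str) -> int:
    # Simulates an unavailable counting endpoint.
    raise TimeoutError("counting endpoint unavailable")


prompt = "x" * 800_000
tokens = estimate_tokens(prompt, broken_counter)
print(tokens, needs_compaction(tokens, 180_000))
```

The heuristic is deliberately crude. It only has to be good enough to decide whether compaction should run before the request is sent.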

5-Layer Escalating Defense

Each layer activates only when cheaper layers are insufficient. Once a layer runs, it does not retry — it either succeeds or escalates.

flowchart TD
    CHECK{Context\nusage?} -->|">60%"| L1
    CHECK -->|">80%"| L2
    CHECK -->|">90%"| L3
    CHECK -->|"413 from API"| L4
    L4 -->|"drain fails"| L5
    L1["Layer 1: Tool Result Truncation\nPersist full result to disk\nKeep only pointer in conversation\nAgent uses FileRead to retrieve later\nCost: ~0"]
    L2["Layer 2: Microcompact\nRemove old tool results at boundaries\nNo LLM call needed\nMarks results stale by tool_use_id\nReplaces with brief summary"]
    L3["Layer 3: Auto-compact\nCall LLM to summarize full conversation\nCreates compact boundary message\n20K tokens reserved for output\nTriggers memory extraction in parallel"]
    L4["Layer 4: Reactive Compact\nEmergency: API returned HTTP 413\nTries context collapse drain first\nThen full LLM summarize\nCircuit breaker: hasAttemptedReactiveCompact"]
    L5["Layer 5: Context Collapse\nRead-time projection only\nDoesn't change stored messages\nChanges what is sent to API\nCheapest recovery before Layer 4"]
    style L1 fill:#1e293b,color:#86efac,stroke:#334155
    style L2 fill:#1e293b,color:#7dd3fc,stroke:#334155
    style L3 fill:#1e293b,color:#fcd34d,stroke:#334155
    style L4 fill:#1e293b,color:#fda4af,stroke:#334155
    style L5 fill:#1e293b,color:#c4b5fd,stroke:#334155

Key principle: Each layer runs at most once per iteration. Failure → escalate. No retries within the same layer. Circuit breakers on every layer.
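The escalation rule for the three proactive layers can be sketched as a single pass over an ordered list. This is a simplified model, not the actual implementation: `run_defense` and the handlers are hypothetical, the thresholds are the ones shown in the flowchart, and the savings per layer are made-up numbers. Layers 4 and 5 are reactive (triggered by a 413) and are left out.

```python
from typing import Callable, List, Tuple

# (name, activation threshold as a fraction of the window, handler).
# A handler takes the current usage fraction and returns the new one.
Layer = Tuple[str, float, Callable[[float], float]]


def run_defense(usage: float, layers: List[Layer]) -> Tuple[float, List[str]]:
    """Walk the layers from cheapest to most expensive, each at most once."""
    fired: List[str] = []
    for name, threshold, handler in layers:
        if usage <= threshold:
            continue  # this layer is not needed at the current usage
        usage = handler(usage)  # single attempt, no retry inside the layer
        fired.append(name)
    return usage, fired


# Hypothetical handlers: the savings are illustrative only.
layers: List[Layer] = [
    ("tool_result_truncation", 0.60, lambda u: u - 0.05),
    ("microcompact", 0.80, lambda u: u - 0.10),
    ("auto_compact", 0.90, lambda u: 0.30),
]

print(run_defense(0.70, layers))
print(run_defense(0.95, layers))
```

Because each layer sees the usage left behind by the cheaper ones, the expensive LLM summary only runs when truncation and microcompact did not free enough space.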

Layer 3 Detail: Auto-compact

The 20K token reserve for summary output is based on empirical measurement: p99.99 of observed compact summaries = 17,387 tokens. The system over-provisions slightly to guarantee the summary never gets truncated.

When auto-compact fires, memory extraction runs in parallel and does not block the summary. The summary then replaces every message before the compact boundary; the original messages are permanently discarded from the conversation.

Layer 4 Detail: Reactive Compact

Triggered when the API returns HTTP 413 (prompt_too_long) mid-turn — meaning the request was already too large to send. The circuit breaker flag hasAttemptedReactiveCompact prevents infinite loops: if the first attempt fails, the session aborts gracefully rather than spiraling.
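A minimal sketch of the circuit breaker, assuming the 413 surfaces as an exception. `PromptTooLong`, `send_with_reactive_compact`, `call_api`, and `compact` are hypothetical names; only the flag name comes from the text above.

```python
class PromptTooLong(Exception):
    """Stands in for an HTTP 413 (prompt_too_long) response."""


def send_with_reactive_compact(messages, call_api, compact):
    """Send a request, allowing at most one reactive compaction."""
    has_attempted_reactive_compact = False  # circuit breaker flag
    while True:
        try:
            return call_api(messages)
        except PromptTooLong:
            if has_attempted_reactive_compact:
                # Second 413 in a row: stop instead of looping forever.
                raise RuntimeError("context still too large after reactive compact")
            has_attempted_reactive_compact = True
            messages = compact(messages)


def fake_api(msgs):
    # Pretend anything longer than 3 messages exceeds the window.
    if len(msgs) > 3:
        raise PromptTooLong()
    return f"ok ({len(msgs)} messages)"


print(send_with_reactive_compact(list(range(10)), fake_api, lambda m: m[-2:]))
```

If `compact` fails to shrink the request, the second 413 raises instead of retrying, which is the behavior the flag exists to guarantee.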

Deep Dive: Effective Context Window Calculation

The system calculates how much context it can actually use:

FUNCTION getEffectiveContextWindowSize(model):
  maxContext = getContextWindowForModel(model)  // e.g., 200,000

  // Reserve space for compact summary output
  // p99.99 observed compact output = 17,387 tokens
  // Over-provision to 20,000 to guarantee no truncation
  reserved = min(
    getMaxOutputTokensForModel(model),
    20_000   // MAX_OUTPUT_TOKENS_FOR_SUMMARY
  )

  RETURN maxContext - reserved
  // Example: 200,000 - 20,000 = 180,000 effective tokens

The 20,000 token reserve is an empirical value: based on production telemetry, 99.99% of compact summaries fit within 17,387 tokens. The system adds a ~15% safety margin. This means out of a 200K context window, only 180K is available for actual conversation — the rest is reserved for the summary if compaction is needed.

Compact Boundary = Controlled Forgetting

A compact boundary is not random truncation. It is a system decision about what to forget and what to keep:

Before boundary (discarded):
  turn 1 → turn 2 → ... → turn 47

Boundary message (kept):
  "Summary: You were refactoring the auth module.
   Completed: token validation, session expiry.
   In progress: refresh token rotation.
   Key decisions: use httpOnly cookies, 15-min expiry."

After boundary (kept):
  turn 48 → turn 49 → ...

The summary captures decisions and state — not a transcript. The agent continues with context, not with history.
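In list terms, applying a compact boundary amounts to dropping a prefix and putting one summary message in its place. The sketch below assumes a simple `Message` record and gives the boundary message a `system` role; both are illustrative choices, not details from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    turn: int
    role: str
    content: str


def apply_compact_boundary(messages: List[Message], boundary_turn: int, summary: str) -> List[Message]:
    """Replace every turn before boundary_turn with one summary message."""
    kept = [m for m in messages if m.turn >= boundary_turn]
    boundary = Message(turn=boundary_turn, role="system", content=f"Summary: {summary}")
    return [boundary] + kept


history = [Message(turn=t, role="user", content=f"turn {t}") for t in range(1, 51)]
compacted = apply_compact_boundary(history, 48, "You were refactoring the auth module.")

print(len(history), "->", len(compacted))
print(compacted[0].content)
```

Turns 1 through 47 are gone from the returned list, so anything the summary omits cannot be recovered from the conversation itself.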

Deep Dive: Microcompact Algorithm

Microcompact is the cheapest defense that actually removes content (Layer 1 only truncates). It runs without any LLM call:

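FUNCTION microcompact(messages, compactBoundaries):
  // last(), copy() and estimateTokens() are assumed helpers, not named in the source
  latestCompactBoundary = last(compactBoundaries)
  removedCount = 0
  freedTokens = 0

  // Iterate over a copy so removals do not disturb the loop
  FOR EACH message IN copy(messages):
    IF message.type == "tool_result":
      // A tool result is "stale" if it is older than the newest compact boundary
      IF message.turnIndex < latestCompactBoundary.turnIndex:
        // Option 1: Replace large results with a brief placeholder
        IF message.resultSize > SUMMARY_THRESHOLD:
          freedTokens += estimateTokens(message.content)
          message.content = "[Tool result truncated. Original: ${message.toolName} " +
                           "returned ${message.resultSize} chars. Run tool again if needed.]"
          removedCount += 1
        // Option 2: Remove smaller results entirely if the tool was a read-only query
        ELSE IF message.toolName IN [FileRead, Grep, Glob, WebSearch]:
          freedTokens += estimateTokens(message.content)
          REMOVE message from messages
          removedCount += 1

  // Create microcompact boundary (marks the cutoff point)
  IF removedCount > 0:
    INSERT microcompactBoundaryMessage(removedCount, freedTokens)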

cachedMicrocompact variant (feature-gated): Caches previous microcompact decisions so repeated triggers don’t re-scan the entire conversation. Tracks cache_deleted_input_tokens from API response headers to measure effectiveness.

Memory Lifecycle

sequenceDiagram
    participant S as Session
    participant M as Memory Store
    participant N as New Session
    S->>S: Conversation + tool calls accumulate
    S->>M: extractMemories() on session end (background, non-blocking)
    M->>M: Score by relevance, decay older entries
    N->>M: memoryScan() + findRelevantMemories() on start
    M->>N: Inject relevant memories into system prompt
    Note over N: "Forget the conversation,<br/>remember the lessons"

Memory decay: older memories receive lower relevance scores over time. A memory about “user prefers 2-space indent” stays relevant forever. A memory about “file X had a bug on day 3” decays quickly.

Deep Dive: Memory Extraction and Session Memory Compact

extractMemories() runs at session end as a background task (non-blocking):

FUNCTION extractMemories(sessionMessages):
  // Spawn a lightweight agent to extract key facts
  // This agent has read-only access and a focused prompt
  memories = await agent.extract({
    messages: sessionMessages,
    prompt: "Extract key facts, decisions, and learnings. Skip transient details.",
    maxMemories: 20   // Limit per session
  })

  FOR EACH memory IN memories:
    memory.createdAt = now()
    memory.baseRelevanceScore = 1.0    // Starts at full relevance; read by calculateRelevance
    saveToMemoryStore(memory)

sessionMemoryCompact (630 lines) runs in parallel with auto-compact (Layer 3):

FUNCTION sessionMemoryCompact(messages, compactSummary):
  // Extract memories from the messages being compacted
  // These memories would otherwise be lost when the compact boundary
  // replaces the original messages
  emergencyMemories = extractKeyFacts(messages)

  // Merge with existing session memories (deduplicate)
  FOR EACH memory IN emergencyMemories:
    IF NOT isDuplicate(memory, existingMemories):
      saveToSessionMemory(memory)

Memory age decay:

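FUNCTION calculateRelevance(memory):
  // Stable facts are tagged decay: false and keep their full score
  IF memory.decay == false:
    RETURN memory.baseRelevanceScore

  daysSinceCreated = (now() - memory.createdAt) / ONE_DAY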

  // Exponential decay with 30-day half-life
  decayFactor = 0.5 ^ (daysSinceCreated / 30)

  RETURN memory.baseRelevanceScore * decayFactor
  // Day 0: score × 1.0
  // Day 30: score × 0.5
  // Day 60: score × 0.25
  // Day 90: score × 0.125

Memories about stable facts (“user prefers 2-space indent”) are tagged with decay: false and maintain full relevance indefinitely. Only transient memories decay.
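The decay rule and the stable-fact exemption can be tried out with a small runnable sketch. The `Memory` record and its field names are assumptions made for illustration; the 30-day half-life and the `decay` flag come from the text above.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # from the pseudocode above


@dataclass
class Memory:
    text: str
    base_relevance: float
    age_days: float
    decay: bool = True  # stable facts are stored with decay=False


def relevance(memory: Memory) -> float:
    """Exponential decay with a 30-day half-life; stable memories are exempt."""
    if not memory.decay:
        return memory.base_relevance
    return memory.base_relevance * 0.5 ** (memory.age_days / HALF_LIFE_DAYS)


memories = [
    Memory("file X had a bug on day 3", 1.0, age_days=90),
    Memory("user prefers 2-space indent", 1.0, age_days=90, decay=False),
    Memory("refresh token rotation in progress", 1.0, age_days=30),
]

ranked = sorted(memories, key=relevance, reverse=True)
for m in ranked:
    print(f"{relevance(m):.3f}  {m.text}")
```

At the same age of 90 days, the stable preference still ranks first while the transient bug note has dropped to one eighth of its original score.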

Team memory: Agents in the same coordinator team share a memory namespace at .claude/team-memory/{teamName}/. If Agent A extracts a memory about a codebase pattern, Agent B in the same team can access it in its next session. This enables organizational knowledge accumulation across agent boundaries.

Background Tasks

The system runs 7 types of background tasks, all writing output to disk for resilience:

Type	Description	Key Detail
`local_bash`	Shell command in background	Output to file, incremental reads via `outputOffset`
`local_agent`	Agent with own query loop	Isolated context, reports to coordinator
`remote_agent`	Agent on remote server	Cross-network, same mailbox protocol
`in_process_teammate`	Coordinator worker	Shared process, message passing
`local_workflow`	Scripted pipeline	Sequential steps, output chained
`monitor_mcp`	MCP server health check	Periodic ping, restart on failure
`dream`	Memory consolidation	Runs between sessions

Disk-Based Output Pattern

All background tasks write to files. Readers use outputOffset for incremental reading — like tail -f but crash-safe:

Benefits:
  Survives crashes       → file persists, state not lost
  Unlimited size         → not constrained by memory
  Multiple readers       → coordinator + UI read same file
  Incremental reads      → no re-reading already-seen output
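The incremental-read idea can be sketched with plain file I/O. The function name `read_new_output` and the file layout are hypothetical; only the `outputOffset` concept comes from the source.

```python
import os
import tempfile
from typing import Tuple


def read_new_output(path: str, output_offset: int) -> Tuple[bytes, int]:
    """Return bytes written since output_offset and the offset to use next."""
    with open(path, "rb") as f:
        f.seek(output_offset)
        chunk = f.read()
    return chunk, output_offset + len(chunk)


# Simulate a background task appending to its output file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task.out")
    with open(path, "ab") as f:
        f.write(b"step 1 done\n")
    first, offset = read_new_output(path, 0)

    with open(path, "ab") as f:
        f.write(b"step 2 done\n")
    second, offset = read_new_output(path, offset)

print(first, second, offset)
```

Each reader only has to remember one integer, so several readers can follow the same file independently and resume after a restart.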

Stall detection: If no new output appears for 45 seconds, the system checks whether the command is waiting for interactive input by scanning the tail of the output for patterns like Y/n?, Continue?, and Press any key. If one matches, it surfaces the prompt to the agent, which can answer it, kill the task, or escalate to the user, rather than letting the task hang silently.

The Dream Task

Between sessions, a dream agent reviews recent work and consolidates memories. It runs in 4 phases:

Phase	Action
Orient	Load recent session summaries and existing memories
Gather	Extract key facts, decisions, and patterns from recent work
Consolidate	Merge redundant memories, resolve contradictions, update scores
Prune	Remove stale or low-relevance memories below threshold

Idle time becomes improvement time. The next session starts with sharper context than the last one ended with.
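The Consolidate and Prune phases can be illustrated with a deliberately naive sketch. Everything here is an assumption: memories are modeled as `(text, score)` pairs, duplicates are detected by normalized text only, and `PRUNE_THRESHOLD` is an invented cutoff, since the source gives no value. Resolving real contradictions would need more than string matching.

```python
from typing import Dict, List, Tuple

MemoryEntry = Tuple[str, float]  # (text, relevance score)

PRUNE_THRESHOLD = 0.2  # hypothetical cutoff; the source does not give a value


def consolidate(memories: List[MemoryEntry]) -> List[MemoryEntry]:
    """Merge entries with the same normalized text, keeping the best score."""
    best: Dict[str, MemoryEntry] = {}
    for text, score in memories:
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive
        if key not in best or score > best[key][1]:
            best[key] = (text, score)
    return list(best.values())


def prune(memories: List[MemoryEntry], threshold: float = PRUNE_THRESHOLD) -> List[MemoryEntry]:
    """Drop memories whose score has fallen below the threshold."""
    return [(text, score) for text, score in memories if score >= threshold]


recent = [
    ("Use httpOnly cookies", 0.9),
    ("use  httponly cookies", 0.6),
    ("file X had a bug on day 3", 0.1),
]

merged = consolidate(recent)
remaining = prune(merged)
print(remaining)
```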

Deep Dive: Stall Detection Patterns

Background shell tasks (local_bash) are monitored for stalls — commands that stop producing output because they’re waiting for interactive input:

FUNCTION checkForStall(task):
  // Check every 5 seconds (STALL_CHECK_INTERVAL_MS = 5000)
  IF (now() - task.lastOutputTime) > 45_000:  // STALL_THRESHOLD_MS
    // Read the last 500 bytes of output
    tail = readTail(task.outputFile, 500)

    // Check against interactive prompt patterns
    patterns = [
      /\[?[Yy](es)?\/[Nn](o)?\]?\s*$/,     // Y/n, Yes/No
      /\bContinue\?\s*$/i,                    // Continue?
      /\bProceed\?\s*$/i,                     // Proceed?
      /\bPress\s+(any\s+)?key/i,              // Press any key
      /\bPassword:\s*$/i,                     // Password:
      /\bpassphrase.*:\s*$/i,                 // SSH passphrase
      /\(y\/n\)\s*$/i,                        // (y/n), parentheses escaped to match literally
      />\s*$/,                                // Generic prompt >
      /\$\s*$/,                               // Shell prompt $
      /\boverwrite\b.*\?\s*$/i               // Overwrite?
    ]

    IF anyMatch(tail, patterns):
      notifyAgent("Task ${task.id} appears to be waiting for input: ${tail}")
      // Agent can: send input to stdin, kill task, or escalate to user

The 45-second threshold is calibrated: short enough to catch stalls quickly, long enough to avoid false positives during legitimate pauses (e.g., compilation, large downloads).
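To experiment with the matching logic, a subset of the patterns can be translated to Python's `re` syntax. The function `looks_stalled` is a hypothetical helper that combines the quiet-time threshold with the pattern check; it does not read files or notify anyone.

```python
import re

# A subset of the patterns above, translated to Python's re syntax.
PROMPT_PATTERNS = [
    re.compile(r"\[?[Yy](es)?/[Nn](o)?\]?\s*$"),            # Y/n, [Y/n], Yes/No
    re.compile(r"\bContinue\?\s*$", re.IGNORECASE),         # Continue?
    re.compile(r"\bPress\s+(any\s+)?key", re.IGNORECASE),   # Press any key
    re.compile(r"\bPassword:\s*$", re.IGNORECASE),          # Password:
]

STALL_THRESHOLD_S = 45.0  # same threshold as STALL_THRESHOLD_MS above


def looks_stalled(tail: str, seconds_since_output: float) -> bool:
    """True when output has been quiet long enough and ends in a prompt."""
    if seconds_since_output <= STALL_THRESHOLD_S:
        return False
    return any(p.search(tail) for p in PROMPT_PATTERNS)


print(looks_stalled("Do you want to continue? [Y/n] ", 60))
print(looks_stalled("Compiling module 12 of 40", 60))
print(looks_stalled("Password: ", 10))
```

Requiring both conditions keeps a slow compile from being flagged, and keeps a prompt that was answered seconds ago from triggering a notification.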

Why This Matters to You

Use /compact proactively → don’t wait for the emergency layer (Layer 4); trigger Layer 3 yourself at a natural breakpoint
Why Claude “forgets” earlier instructions → a compact boundary replaced old messages with a summary; the original instruction is gone
How memories persist across sessions → extracted at session end, scored, injected into system prompt at next session start
What happens during long-running tasks → output written to disk with outputOffset incremental reads; survives crashes
What the “dream” task does → background memory consolidation between sessions; you benefit without any action

See also: The Agent Loop — Multi-Agent System — Tool Orchestration