Rate Limiting

Claude Code tracks your usage across two independent time windows. Understanding both prevents unexpected blocks mid-task — especially during long agentic sessions that consume tokens quickly.

Dual-Window System

Two counters run simultaneously at all times. Either one can trigger rate limiting on its own — hitting the 5-hour limit does not reset or affect the 7-day counter, and vice versa.

flowchart LR
  subgraph W1["5-Hour Rolling Window"]
    T1["Tokens consumed\nin last 5 hrs"]
    TH1["Burst threshold\n(e.g. 1M tokens)"]
    T1 -->|compare| TH1
  end
  subgraph W2["7-Day Rolling Window"]
    T2["Tokens consumed\nin last 7 days"]
    TH2["Sustained-use threshold\n(e.g. 5M tokens)"]
    T2 -->|compare| TH2
  end
  TH1 -->|exceeds| RATE["Rate Limited"]
  TH2 -->|exceeds| RATE
  style W1 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style W2 fill:#1e293b,color:#c4b5fd,stroke:#334155
  style RATE fill:#1e293b,color:#f87171,stroke:#334155
  style T1 fill:#1e293b,color:#94a3b8,stroke:#334155
  style T2 fill:#1e293b,color:#94a3b8,stroke:#334155
  style TH1 fill:#1e293b,color:#fcd34d,stroke:#334155
  style TH2 fill:#1e293b,color:#fcd34d,stroke:#334155

Window	Purpose	Reset Behavior
5-hour rolling	Burst protection — prevents single heavy sessions from monopolizing capacity	Slides continuously; not a fixed clock reset
7-day rolling	Sustained-use cap — prevents chronic heavy users from exceeding weekly quotas	Also slides — based on rolling wall time, not calendar week

Each window counts input tokens, output tokens, and thinking tokens, and each keeps its own running total.
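To make the independence of the two windows concrete, here is a minimal Python sketch. It assumes every usage event is recorded as a timestamp plus a combined token count, and it borrows the illustrative thresholds from the diagram (1M tokens per 5 hours, 5M per 7 days). `DualWindowTracker`, `LIMIT_5H`, and `LIMIT_7D` are hypothetical names, not Claude Code's actual implementation.

```python
from collections import deque

FIVE_HOURS = 5 * 60 * 60        # seconds
SEVEN_DAYS = 7 * 24 * 60 * 60   # seconds

# Illustrative thresholds from the diagram's "e.g." labels (hypothetical values).
LIMIT_5H = 1_000_000
LIMIT_7D = 5_000_000


class DualWindowTracker:
    """Hypothetical tracker: one event log, two independent rolling windows."""

    def __init__(self) -> None:
        # Each event is (timestamp_seconds, input + output + thinking tokens),
        # appended in time order so the oldest event is always at the front.
        self.events: deque[tuple[float, int]] = deque()

    def record(self, now: float, input_tokens: int, output_tokens: int, thinking_tokens: int) -> None:
        self.events.append((now, input_tokens + output_tokens + thinking_tokens))

    def totals(self, now: float) -> tuple[int, int]:
        # Events older than the longest window can no longer count anywhere.
        while self.events and self.events[0][0] <= now - SEVEN_DAYS:
            self.events.popleft()
        total_5h = sum(t for ts, t in self.events if ts > now - FIVE_HOURS)
        total_7d = sum(t for ts, t in self.events if ts > now - SEVEN_DAYS)
        return total_5h, total_7d

    def limited_by(self, now: float) -> list[str]:
        """Windows currently at or over their threshold; either one alone blocks."""
        total_5h, total_7d = self.totals(now)
        hit = []
        if total_5h >= LIMIT_5H:
            hit.append("5h")
        if total_7d >= LIMIT_7D:
            hit.append("7d")
        return hit


tracker = DualWindowTracker()
tracker.record(0, 600_000, 300_000, 150_000)  # a 1.05M-token burst at t = 0
print(tracker.limited_by(60))                 # ['5h']
print(tracker.limited_by(FIVE_HOURS + 1))     # []
```

A minute after the burst only the 5-hour window reports a limit; once five hours pass, the burst slides out of that window while still counting toward the 7-day total.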

3-Tier Early Warning System

The system warns you before blocking — with escalating urgency based on how close you are to either window’s threshold.

flowchart TD
  CHECK{Usage in\neither window?}
  CHECK -->|~60% of threshold| T1
  CHECK -->|~80% of threshold| T2
  CHECK -->|~95% of threshold| T3
  CHECK -->|100% — limit hit| BLOCK
  T1["Tier 1 — Subtle indicator\nStatus bar shows current usage %\nNo interruption to workflow"]
  T2["Tier 2 — Warning message\nConversation output includes\nyellow warning with usage numbers"]
  T3["Tier 3 — Urgent dialog\nFull-screen breakdown:\n• tokens used / remaining\n• which window is near limit\n• time until window resets\n• suggestion to reduce usage"]
  BLOCK["Rate Limited\n429 returned from API\nAll calls blocked until window rolls"]
  style CHECK fill:#1e293b,color:#fcd34d,stroke:#334155
  style T1 fill:#1e293b,color:#86efac,stroke:#334155
  style T2 fill:#1e293b,color:#fcd34d,stroke:#334155
  style T3 fill:#1e293b,color:#fda4af,stroke:#334155
  style BLOCK fill:#1e293b,color:#f87171,stroke:#334155

Tier 3 is the last actionable warning before the block. The dialog prompts you to run /compact or reduce scope; at that point, the small amount of quota a compaction spends can save you from an hours-long wait.
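The tier selection can be sketched as a lookup against whichever window is closest to its threshold. This assumes the approximate boundaries from the diagram (~60%, ~80%, ~95%, 100%); the function name `warning_tier` and its return labels are hypothetical.

```python
# Approximate tier boundaries from the diagram, checked from most to least severe.
TIERS = [
    (1.00, "blocked"),
    (0.95, "tier3_urgent_dialog"),
    (0.80, "tier2_warning"),
    (0.60, "tier1_status_bar"),
]


def warning_tier(used_5h: int, limit_5h: int, used_7d: int, limit_7d: int) -> tuple[str, str]:
    """Return (tier label, window) for whichever window is closest to its threshold."""
    fractions = {"5h": used_5h / limit_5h, "7d": used_7d / limit_7d}
    window = max(fractions, key=fractions.__getitem__)
    for boundary, label in TIERS:
        if fractions[window] >= boundary:
            return label, window
    return "none", window


print(warning_tier(820_000, 1_000_000, 1_000_000, 5_000_000))  # ('tier2_warning', '5h')
print(warning_tier(100_000, 1_000_000, 4_800_000, 5_000_000))  # ('tier3_urgent_dialog', '7d')
```

Returning the window alongside the tier matches the Tier 3 dialog's "which window is near limit" field.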

How Limits Are Detected

The system uses two sources of usage data, with the more accurate one taking precedence:

FUNCTION checkRateLimit(response):
  IF response.headers.has("anthropic-ratelimit-*"):
    // Header-based: most accurate
    // API returns remaining tokens/requests per window directly
    remaining5h  = header("anthropic-ratelimit-tokens-remaining")
    remaining7d  = header("anthropic-ratelimit-tokens-remaining-7d")
    resetTime    = header("anthropic-ratelimit-tokens-reset")
    updateWarningTier(remaining5h, remaining7d)

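  ELSE:
    // Local fallback: less accurate, but aims to prevent over-quota calls
    // Used when headers are unavailable, e.g. right after a crash or restart
    usage = estimateFromLocal()
    // LIMIT_5H / LIMIT_7D: locally known thresholds (illustrative names)
    updateWarningTier(LIMIT_5H - usage.total5h, LIMIT_7D - usage.total7d)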

The local fallback deliberately errs on the side of caution: it may warn you sooner than necessary, so that calls likely to return 429 are flagged rather than sent silently.
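A hedged sketch of the precedence rule: use header values when both are present, otherwise fall back to a local estimate. The header names are copied from the pseudocode above and should be treated as illustrative, not as a confirmed list of what the API returns. The 10% safety margin is a hypothetical way to make the fallback err toward warning early, not a documented constant.

```python
from typing import Mapping, Optional

# Header names as written in the pseudocode (illustrative).
HDR_5H = "anthropic-ratelimit-tokens-remaining"
HDR_7D = "anthropic-ratelimit-tokens-remaining-7d"


def remaining_from_headers(headers: Mapping[str, str]) -> Optional[tuple[int, int]]:
    """Return (remaining_5h, remaining_7d) if both headers are present, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if HDR_5H in lowered and HDR_7D in lowered:
        return int(lowered[HDR_5H]), int(lowered[HDR_7D])
    return None


def remaining_tokens(
    headers: Mapping[str, str],
    local_used: tuple[int, int],
    assumed_limits: tuple[int, int],
    safety_margin_pct: int = 10,  # hypothetical margin for the local estimate
) -> tuple[tuple[int, int], str]:
    """Prefer header data; otherwise estimate conservatively from local totals."""
    from_headers = remaining_from_headers(headers)
    if from_headers is not None:
        return from_headers, "headers"
    used_5h, used_7d = local_used
    limit_5h, limit_7d = assumed_limits
    # Shrink the assumed limits so the local estimate runs out before the real one.
    keep = 100 - safety_margin_pct
    return (
        max(0, limit_5h * keep // 100 - used_5h),
        max(0, limit_7d * keep // 100 - used_7d),
    ), "local"
```

Integer arithmetic keeps the margin calculation exact; the second element of the result records which source was used.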

Deep Dive: Local Usage Cache

The local cache persists usage data to disk so estimates survive CLI restarts:

FUNCTION updateLocalCache(tokensUsed):
  cache = readDisk(".claude/usage-cache.json")

  // Purge entries older than the longest window (7 days)
  cache.entries = cache.entries.filter(e =>
    now() - e.timestamp < SEVEN_DAYS_MS
  )

  // Append new usage event
  cache.entries.push({
    timestamp: now(),
    inputTokens:   tokensUsed.input,
    outputTokens:  tokensUsed.output,
    thinkingTokens: tokensUsed.thinking,
    model: tokensUsed.model
  })

  writeDisk(".claude/usage-cache.json", cache)

FUNCTION estimateFromLocal():
  cache = readDisk(".claude/usage-cache.json")
  cutoff5h  = now() - FIVE_HOURS_MS
  cutoff7d  = now() - SEVEN_DAYS_MS

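  total5h = sumTokens(cache.entries.filter(e => e.timestamp > cutoff5h))
  total7d = sumTokens(cache.entries.filter(e => e.timestamp > cutoff7d))

  RETURN { total5h, total7d }

FUNCTION sumTokens(entries):
  // Input, output, and thinking tokens all count toward both windows
  RETURN sum(entries.map(e => e.inputTokens + e.outputTokens + e.thinkingTokens))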

This means usage tracking works even when the API is unreachable: the system can still warn you before sending a request that is likely to fail.
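A runnable Python version of the cache pseudocode might look like this. It keeps the JSON layout and field names from the pseudocode; the atomic write (temp file plus `os.replace`), the tolerance for a missing or corrupt file, and the injectable `now` parameter are additions for this sketch, not documented behavior.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

FIVE_HOURS_MS = 5 * 60 * 60 * 1000
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def _now_ms() -> int:
    return int(time.time() * 1000)


def _read_cache(path: Path) -> dict:
    """Load the cache, treating a missing or unreadable file as empty."""
    try:
        cache = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        cache = {}
    cache.setdefault("entries", [])
    return cache


def update_local_cache(path: Path, input_tokens: int, output_tokens: int,
                       thinking_tokens: int, model: str, now: Optional[int] = None) -> None:
    now = _now_ms() if now is None else now
    cache = _read_cache(path)
    # Purge entries older than the longest window (7 days).
    cache["entries"] = [e for e in cache["entries"] if now - e["timestamp"] < SEVEN_DAYS_MS]
    cache["entries"].append({
        "timestamp": now,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "thinkingTokens": thinking_tokens,
        "model": model,
    })
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write a temp file and swap it in, so an interrupted write leaves the old cache intact.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(cache))
    os.replace(tmp, path)


def estimate_from_local(path: Path, now: Optional[int] = None) -> dict:
    now = _now_ms() if now is None else now
    entries = _read_cache(path)["entries"]

    def total_since(cutoff: int) -> int:
        return sum(e["inputTokens"] + e["outputTokens"] + e["thinkingTokens"]
                   for e in entries if e["timestamp"] > cutoff)

    return {"total5h": total_since(now - FIVE_HOURS_MS),
            "total7d": total_since(now - SEVEN_DAYS_MS)}


cache_path = Path(tempfile.mkdtemp()) / ".claude" / "usage-cache.json"
t0 = 1_700_000_000_000
update_local_cache(cache_path, 1_000, 500, 200, "example-model", now=t0)
six_hours_later = t0 + 6 * 60 * 60 * 1000
update_local_cache(cache_path, 2_000, 800, 0, "example-model", now=six_hours_later)
print(estimate_from_local(cache_path, now=six_hours_later))  # {'total5h': 2800, 'total7d': 4500}
```

Injecting `now` keeps the window arithmetic testable without waiting on the real clock; the model name is a placeholder.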

What Happens When Rate Limited

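FUNCTION handleRateLimitError(error):  // error.status == 429
  // retry-after is a delay in seconds; reset_at is an absolute timestamp (ms)
  IF error.headers.has("retry-after"):
    resetTime = now() + error.headers["retry-after"] * 1000
  ELSE:
    resetTime = error.body.reset_at

  // Show user: which window triggered, time until reset
  displayRateLimitDialog({
    window:    error.body.window,      // "5h" or "7d"
    resetAt:   resetTime,
    waitSecs:  (resetTime - now()) / 1000
  })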

  // Escalate through recovery options (see: The Agent Loop)
  IF fallbackModel.available AND fallbackModel.hasCapacity():
    switchToFallbackModel()            // continue with alternate model
    retryRequest()
  ELSE:
    pauseAgentLoop()                   // suspend until window resets
    notifyUser("Session paused. Resumable at " + resetTime)

If a fallback model is available and has remaining capacity under its own rate limits, the system switches automatically. You may notice a model name change in the status bar — this is intentional recovery behavior, not a bug. See The Agent Loop for how the escalation ladder works.
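The two inputs to the wait time have different shapes: under HTTP semantics a `Retry-After` header holds either a delay in seconds or an HTTP date, while `reset_at` in the pseudocode is an absolute timestamp. The sketch below normalizes both to a datetime and mirrors the fallback-or-pause decision. The function names are hypothetical, and `reset_at` is taken from the pseudocode rather than from confirmed API output.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def resolve_reset_time(now: datetime, retry_after: Optional[str],
                       reset_at: Optional[datetime]) -> Optional[datetime]:
    """Work out when the blocked window reopens (hypothetical helper)."""
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():
            # Retry-After given as a delay in seconds.
            return now + timedelta(seconds=int(value))
        # Otherwise Retry-After is an HTTP-date, i.e. an absolute time.
        return parsedate_to_datetime(value)
    # No header: fall back to an absolute timestamp from the error body, if any.
    return reset_at


def wait_seconds(now: datetime, reset: datetime) -> float:
    return max(0.0, (reset - now).total_seconds())


def choose_recovery(fallback_available: bool, fallback_has_capacity: bool) -> str:
    """Mirror the escalation in the pseudocode: fallback model first, else pause."""
    if fallback_available and fallback_has_capacity:
        return "switch_to_fallback_and_retry"
    return "pause_until_reset"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reset = resolve_reset_time(now, "300", None)
print(wait_seconds(now, reset))  # 300.0
```

Clamping the wait at zero avoids showing a negative countdown if the reset time has already passed by the time the dialog renders.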

Practical Guidance

Question	Answer
How do I check current usage?	The status bar shows the active window’s usage %. Tier 1 warning (60%) is the earliest visible signal.
What reduces token consumption?	`/compact` shrinks conversation history before the next API call. Subagents with isolated contexts send less history per call.
Can I increase limits?	Limits depend on your plan. Pro and Enterprise plans have higher per-window caps, and the Max plan (via Claude.ai) raises both windows significantly.
Why did my limit reset unexpectedly?	The 5-hour window is rolling — it slides continuously. If your heavy usage was 5+ hours ago, that portion has rolled out of the window.
Does `/compact` count against my limit?	Yes — compaction is itself an API call. But it’s far cheaper than the context it eliminates from future calls.
Do subagents share my limit?	Yes. All API calls — yours, your subagents’, background tasks — count against the same user-level windows.

Why This Matters to You

The warning tiers tell you before you’re blocked — Tier 2 at 80% is your action window, not Tier 3 at 95%
/compact is the fastest intervention mid-session — fewer input tokens on every subsequent call
Subagents with isolated context reduce per-call cost, but they still consume your shared quota; plan accordingly
If you hit limits often, check whether you’re sending large file contents or long histories on every call — context accumulates fast
The local fallback cache persists across restarts, so restarting the CLI does not wipe the usage history your warnings rely on
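Why context accumulates fast is easiest to see with back-of-the-envelope numbers. This sketch assumes every call resends the full history as input and each turn adds a fixed number of tokens; the optional compaction replaces the history with a fixed-size summary and charges one extra call that reads the whole history (the summary's own output tokens are ignored). The function name and all figures are illustrative, not measurements.

```python
def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int,
    base_context: int,
    compact_every: int = 0,
    compacted_size: int = 0,
) -> int:
    """Total input tokens sent over `turns` calls (simplified model)."""
    history = base_context
    total = 0
    for turn in range(1, turns + 1):
        total += history            # each call resends the whole history as input
        history += tokens_per_turn  # the new exchange is appended to the history
        if compact_every and turn % compact_every == 0:
            total += history          # the compaction call reads the full history once
            history = compacted_size  # history is replaced by a fixed-size summary
    return total


print(cumulative_input_tokens(20, 5_000, 10_000))  # 1150000
print(cumulative_input_tokens(20, 5_000, 10_000, compact_every=5, compacted_size=10_000))  # 540000
```

With 20 turns of 5,000 tokens on a 10,000-token base, resending everything totals 1,150,000 input tokens, because the cost grows with every turn; compacting every 5 turns brings that to 540,000, even after paying for the compaction calls.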

See also: Architecture Overview — The Agent Loop