
AI Agent Retry Strategy

Build robust retry logic for LLM and tool calls in AI agents

Works with OpenClaude

You are the #1 AI reliability engineer from Silicon Valley — the person companies hire when their agents fail 30% of the time and they can't ship to production. The user wants to add retry logic to AI agent calls so transient failures don't crash the agent.

What to check first

  • Identify which calls need retries: LLM, tool, downstream API
  • Decide retry policy: retry count, backoff strategy, jitter
  • Check what's idempotent — non-idempotent calls need careful handling
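
To make the retry policy concrete, it can help to print the wait schedule a given setting implies before committing to it. The following is a minimal sketch; `backoffSchedule` is a hypothetical helper name, and it shows the delays before jitter is applied.

```typescript
// Hypothetical helper: lists the un-jittered delay before each retry.
// With maxAttempts attempts there are at most maxAttempts - 1 waits.
function backoffSchedule(
  maxAttempts: number,
  baseDelayMs: number,
  maxDelayMs: number
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    // Double the delay each attempt, but never exceed the cap
    delays.push(Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs));
  }
  return delays;
}

// 5 attempts, 1 s base, 30 s cap
console.log(backoffSchedule(5, 1000, 30000)); // [1000, 2000, 4000, 8000]
```

With these numbers the worst-case wait is 15 seconds in total, which is a useful figure to compare against the timeout budget of the agent step that makes the call.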

Steps

  1. Wrap LLM calls in a retry helper with exponential backoff + jitter
  2. Set max retries (3-5 typical) and a max delay cap
  3. Distinguish retryable errors (429, 502, 503, 504, network) from non-retryable (400, 401, validation)
  4. Add a circuit breaker: if at least 50% of calls in a window fail, stop calling the upstream until a cooldown elapses
  5. Log every retry attempt with attempt number and error
  6. For tool calls, check idempotency before retrying
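
Step 6 can be sketched as a guard that decides how many attempts a tool call is allowed. This is an illustration under stated assumptions: the `ToolSpec` and `ToolCall` shapes, the `idempotent` flag, and the `idempotencyKey` field are hypothetical and not part of any SDK. Backoff is left out to keep the focus on the idempotency check.

```typescript
// Hypothetical tool descriptor; the `idempotent` flag is an assumption.
interface ToolSpec {
  name: string;
  idempotent: boolean; // true if repeating the call causes no extra side effects
}

interface ToolCall {
  tool: ToolSpec;
  idempotencyKey?: string; // lets the downstream service deduplicate repeats
}

// A call is safe to repeat if the tool is idempotent by nature,
// or if the request carries a key the receiver can deduplicate on.
function isSafeToRetry(call: ToolCall): boolean {
  return call.tool.idempotent || call.idempotencyKey !== undefined;
}

// Runs the tool, allowing retries only when repeating it is safe.
async function runTool<T>(
  call: ToolCall,
  execute: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  const attempts = isSafeToRetry(call) ? maxAttempts : 1;
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await execute();
    } catch (err) {
      lastError = err; // backoff omitted for brevity
    }
  }
  throw lastError;
}
```

A tool such as a payment or email sender would be declared with `idempotent: false` and would get exactly one attempt unless the caller supplies an idempotency key.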

Code

// TypeScript with proper retry logic
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableErrors: (err: Error) => boolean;
}

const DEFAULT_RETRY: RetryOptions = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableErrors: (err) => {
    if (err.name === 'RateLimitError') return true;
    if (err.name === 'NetworkError') return true;
    if ('status' in err && [429, 502, 503, 504].includes((err as any).status)) return true;
    return false;
  },
};

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: Partial<RetryOptions> = {}
): Promise<T> {
  const opts = { ...DEFAULT_RETRY, ...options };
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;

      if (!opts.retryableErrors(lastError)) {
        throw lastError; // non-retryable
      }

      if (attempt === opts.maxAttempts) {
        throw new Error(`Failed after ${opts.maxAttempts} attempts: ${lastError.message}`);
      }

      // Exponential backoff capped at maxDelayMs; jitter scales the delay
      // down by up to 30% so the cap is never exceeded
      const exponential = Math.min(
        opts.baseDelayMs * Math.pow(2, attempt - 1),
        opts.maxDelayMs
      );
      const delay = exponential * (0.7 + Math.random() * 0.3);

      console.log(
        `Attempt ${attempt}/${opts.maxAttempts} failed (${lastError.message}), retrying in ${delay.toFixed(0)}ms`
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}

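// Usage with LLM call (in a real module, keep the import at the top of the file)
import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment. The SDK retries on its own
// by default; disable that so attempts are not multiplied by retryWithBackoff.
const anthropic = new Anthropic({ maxRetries: 0 });

async function callClaude(prompt: string): Promise<string> {
  return retryWithBackoff(
    async () => {
      const response = await anthropic.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
      // content is a union of block types; only text blocks carry .text
      const block = response.content[0];
      if (!block || block.type !== 'text') {
        throw new Error('Expected a text block in the response');
      }
      return block.text;
    },
    { maxAttempts: 5, baseDelayMs: 2000 }
  );
}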

// Circuit breaker — stop trying when too many failures
class CircuitBreaker {
  private failures = 0;
  private successes = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private openedAt = 0;

  constructor(
    private threshold = 0.5,    // open if >=50% failures
    private windowSize = 20,     // over last 20 calls
    private cooldownMs = 60000,  // try again after 1 min
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.cooldownMs) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

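  private recordSuccess() {
    if (this.state === 'half-open') {
      // Probe succeeded: close the circuit and start a fresh window
      this.state = 'closed';
      this.reset();
      return;
    }
    this.successes++;
    this.evaluateWindow();
  }

  private recordFailure() {
    if (this.state === 'half-open') {
      // Probe failed: reopen immediately and restart the cooldown
      this.open();
      return;
    }
    this.failures++;
    this.evaluateWindow();
  }

  // Once windowSize calls are recorded, open if the failure rate reached the
  // threshold, otherwise start a new window
  // (simplified; a real implementation uses a sliding window)
  private evaluateWindow() {
    const total = this.failures + this.successes;
    if (total < this.windowSize) return;
    if (this.failures / total >= this.threshold) this.open();
    else this.reset();
  }

  private open() {
    this.state = 'open';
    this.openedAt = Date.now();
    this.reset();
    console.error('Circuit breaker opened');
  }

  private reset() {
    this.failures = 0;
    this.successes = 0;
  }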
}

const breaker = new CircuitBreaker();
const result = await breaker.call(() => callClaude('Hello'));

Common Pitfalls

  • Retrying non-idempotent operations — duplicate side effects
  • No max delay cap — exponential growth means hours of waiting
  • No jitter — synchronized retries from many clients DoS the upstream
  • Retrying on permanent 4xx errors (400, 401, 403, 404) wastes time; among 4xx codes only 429 and 408 are worth retrying
  • Ignoring rate limit headers — should respect Retry-After
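
The last pitfall can be addressed by letting the server's hint override the computed backoff. The sketch below assumes the `Retry-After` value has already been read from the failed response; `parseRetryAfterMs` and `chooseDelayMs` are hypothetical helper names. The header carries either a whole number of seconds or an HTTP date, so both forms are handled.

```typescript
// Hypothetical helper: converts a Retry-After header value to milliseconds.
// Returns undefined when the header is missing or unparseable.
function parseRetryAfterMs(
  header: string | null | undefined,
  nowMs: number = Date.now()
): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();

  // Form 1: delay in whole seconds, e.g. "120"
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Hypothetical helper: never retry sooner than the server asked for.
function chooseDelayMs(backoffMs: number, retryAfterMs?: number): number {
  return Math.max(backoffMs, retryAfterMs ?? 0);
}
```

If the server asks for a wait longer than the agent's own time budget, failing fast is usually a better choice than sleeping through it.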

When NOT to Use This Skill

  • For idempotent reads where you can fall back to cache
  • When the upstream is permanently down — fail fast

How to Verify It Worked

  • Test with simulated 429s and 503s — verify retries happen
  • Test non-retryable errors fail immediately
  • Test circuit breaker opens at threshold
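
One way to simulate 429s and 503s without touching a real API is a fake function that fails a fixed number of times and then succeeds. This is a minimal sketch; `HttpError` and `makeFlaky` are hypothetical test helpers, and the error only mimics the `status` field that the retry classifier in the Code section inspects.

```typescript
// Hypothetical error type carrying an HTTP status, for tests only.
class HttpError extends Error {
  status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.name = 'HttpError';
    this.status = status;
  }
}

// Builds a function that throws `failures` times with the given status,
// then resolves with `value`. getCalls() reports how often it was invoked.
function makeFlaky<T>(failures: number, status: number, value: T) {
  let calls = 0;
  const fn = async (): Promise<T> => {
    calls++;
    if (calls <= failures) throw new HttpError(status);
    return value;
  };
  return { fn, getCalls: () => calls };
}

// Example checks against the retryWithBackoff helper from the Code section:
//   const flaky = makeFlaky(2, 429, 'ok');
//   await retryWithBackoff(flaky.fn, { baseDelayMs: 1 });
//   flaky.getCalls() should be 3 (two failures, then one success)
//
//   const bad = makeFlaky(1, 400, 'ok');
//   await retryWithBackoff(bad.fn) should reject, with bad.getCalls() equal to 1
```

Passing a very small `baseDelayMs` keeps the test fast while still exercising the retry path.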

Production Considerations

  • Monitor retry rates per endpoint — high rates indicate upstream issues
  • Set up alerts on circuit breaker opens
  • Use distributed circuit breakers if running multiple instances
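
Monitoring retry rates per endpoint needs little more than two counters per endpoint. The sketch below keeps them in memory; `RetryMetrics` is a hypothetical class, and a production setup would more likely export these numbers to whatever metrics system is already in use.

```typescript
// Hypothetical in-memory tracker for retries per endpoint.
class RetryMetrics {
  private stats = new Map<string, { calls: number; retries: number }>();

  // Record one finished call and how many attempts it needed (1 = no retries).
  record(endpoint: string, attemptsUsed: number): void {
    const entry = this.stats.get(endpoint) ?? { calls: 0, retries: 0 };
    entry.calls += 1;
    entry.retries += Math.max(0, attemptsUsed - 1);
    this.stats.set(endpoint, entry);
  }

  // Average number of retries per call; 0 for endpoints never seen.
  retriesPerCall(endpoint: string): number {
    const entry = this.stats.get(endpoint);
    return entry && entry.calls > 0 ? entry.retries / entry.calls : 0;
  }
}

const metrics = new RetryMetrics();
metrics.record('llm', 1); // succeeded first try
metrics.record('llm', 3); // needed two retries
console.log(metrics.retriesPerCall('llm')); // 1
```

A sustained rise in this ratio for one endpoint is a reasonable trigger for the alerting mentioned above.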

Quick Info

Category: AI Agents
Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: ai-agents, retries, reliability
