Build robust retry logic for LLM and tool calls in AI agents
Transient failures — rate limits, network blips, brief upstream outages — shouldn't crash an AI agent. This skill adds retry logic around LLM and tool calls so that they don't.
What to check first
- Identify which calls need retries: LLM, tool, downstream API
- Decide retry policy: retry count, backoff strategy, jitter
- Check what's idempotent — non-idempotent calls need careful handling
Steps
- Wrap LLM calls in a retry helper with exponential backoff + jitter
- Set max retries (3-5 typical) and a max delay cap
- Distinguish retryable errors (429, 503, network) from non-retryable (400, 401, validation)
- Add circuit breaker — if 50% of calls fail in a window, stop retrying entirely
- Log every retry attempt with attempt number and error
- For tool calls, check idempotency before retrying
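The idempotency check in the last step can be sketched as a flag on each tool's definition. The names here (`ToolSpec`, `executeTool`) are illustrative, not from any particular framework:

```typescript
// Hypothetical tool interface: each tool declares whether it is safe to retry.
interface ToolSpec {
  name: string;
  idempotent: boolean;
  run: (args: unknown) => Promise<string>;
}

// Retry idempotent tools; give non-idempotent tools exactly one shot so a
// retry can never duplicate a side effect (e.g. sending an email twice).
async function executeTool(tool: ToolSpec, args: unknown, maxAttempts = 3): Promise<string> {
  const attempts = tool.idempotent ? maxAttempts : 1;
  let lastError: Error | undefined;
  for (let i = 0; i < attempts; i++) {
    try {
      return await tool.run(args);
    } catch (err) {
      lastError = err as Error;
    }
  }
  throw lastError ?? new Error(`tool ${tool.name} failed`);
}
```

In practice you would combine this with the backoff helper below rather than retrying immediately; the point is that the retry decision belongs to the tool's declaration, not the call site.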
Code
// TypeScript with proper retry logic
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableErrors: (err: Error) => boolean;
}

const DEFAULT_RETRY: RetryOptions = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableErrors: (err) => {
    if (err.name === 'RateLimitError') return true;
    if (err.name === 'NetworkError') return true;
    if ('status' in err && [429, 502, 503, 504].includes((err as any).status)) return true;
    return false;
  },
};

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: Partial<RetryOptions> = {}
): Promise<T> {
  const opts = { ...DEFAULT_RETRY, ...options };
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (!opts.retryableErrors(lastError)) {
        throw lastError; // non-retryable — fail immediately
      }
      if (attempt === opts.maxAttempts) {
        throw new Error(`Failed after ${opts.maxAttempts} attempts: ${lastError.message}`);
      }
      // Exponential backoff with jitter, re-capped so jitter can't push past maxDelayMs
      const exponential = Math.min(
        opts.baseDelayMs * Math.pow(2, attempt - 1),
        opts.maxDelayMs
      );
      const jitter = Math.random() * exponential * 0.3;
      const delay = Math.min(exponential + jitter, opts.maxDelayMs);
      console.log(`Attempt ${attempt} failed, retrying in ${delay.toFixed(0)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // unreachable, but satisfies the compiler
}
// Usage with LLM call
async function callClaude(prompt: string): Promise<string> {
  return retryWithBackoff(
    async () => {
      const response = await anthropic.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
      // Content blocks are a union type — narrow before reading .text
      const block = response.content[0];
      if (block.type !== 'text') throw new Error('Unexpected response block type');
      return block.text;
    },
    { maxAttempts: 5, baseDelayMs: 2000 }
  );
}
// Circuit breaker — stop trying when too many failures
class CircuitBreaker {
  private failures = 0;
  private successes = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private openedAt = 0;

  constructor(
    private threshold = 0.5, // open if >=50% failures
    private windowSize = 20, // over last 20 calls
    private cooldownMs = 60000 // try again after 1 min
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.cooldownMs) {
        this.state = 'half-open'; // let one probe call through
      } else {
        throw new Error('Circuit breaker is open');
      }
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordSuccess() {
    this.successes++;
    if (this.state === 'half-open') this.state = 'closed';
    this.trim();
  }

  private recordFailure() {
    this.failures++;
    if (this.state === 'half-open') {
      // A failed probe means the upstream is still unhealthy — reopen immediately
      this.open();
      return;
    }
    const total = this.failures + this.successes;
    if (total >= this.windowSize && this.failures / total >= this.threshold) {
      this.open();
    }
    this.trim();
  }

  private open() {
    this.state = 'open';
    this.openedAt = Date.now();
    console.error('Circuit breaker opened');
  }

  private trim() {
    const total = this.failures + this.successes;
    if (total > this.windowSize) {
      // Reset for a new window (simplified — a real impl uses a sliding window)
      this.failures = 0;
      this.successes = 0;
    }
  }
}

const breaker = new CircuitBreaker();
const result = await breaker.call(() => callClaude('Hello'));
Common Pitfalls
- Retrying non-idempotent operations — duplicate side effects
- No max delay cap — exponential growth means hours of waiting
- No jitter — synchronized retries from many clients DoS the upstream
- Retrying on 4xx errors — apart from 429 (and arguably 408), they indicate a permanent problem with the request, so retries only waste time and tokens
- Ignoring rate limit headers — should respect Retry-After
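The last pitfall can be avoided with a small helper that prefers the server's Retry-After header over the computed backoff. The `headers` shape on the error is an assumption — adapt it to whatever your HTTP client or SDK actually exposes:

```typescript
// Sketch: derive the retry delay from a Retry-After header when present.
// Per the HTTP spec, Retry-After is either delta-seconds or an HTTP-date.
function delayFromError(err: unknown, fallbackMs: number): number {
  const headers = (err as { headers?: Record<string, string> }).headers;
  const retryAfter = headers?.['retry-after'];
  if (retryAfter !== undefined) {
    const seconds = Number(retryAfter);
    if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
    const date = Date.parse(retryAfter); // HTTP-date form
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  return fallbackMs; // no header — fall back to exponential backoff
}
```

Inside the retry loop, you would pass the computed backoff as `fallbackMs` and sleep for whatever this returns.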
When NOT to Use This Skill
- For reads where serving stale data from a cache is cheaper than waiting out a retry
- When the upstream is permanently down — fail fast
How to Verify It Worked
- Test with simulated 429s and 503s — verify retries happen
- Test non-retryable errors fail immediately
- Test circuit breaker opens at threshold
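One way to run these checks without hitting a real API is a small test helper that fails a fixed number of times with a chosen status before succeeding, then asserts on the call count. The names here are illustrative:

```typescript
// Test helper sketch: wrap a value in a function that throws `k` simulated
// HTTP errors (with the given status) before returning the value.
function failNTimes<T>(k: number, status: number, value: T) {
  let calls = 0;
  const fn = async (): Promise<T> => {
    calls++;
    if (calls <= k) {
      const err = new Error(`simulated ${status}`);
      (err as any).status = status;
      throw err;
    }
    return value;
  };
  return { fn, calls: () => calls };
}
```

Pass `fn` to your retry wrapper with tiny delays, then assert that a 429 produced exactly `k + 1` calls and that a 400 produced exactly one.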
Production Considerations
- Monitor retry rates per endpoint — high rates indicate upstream issues
- Set up alerts on circuit breaker opens
- Use distributed circuit breakers if running multiple instances
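Monitoring retry rates per endpoint can start as simply as a counter keyed by endpoint name; the class below is a hypothetical sketch, with the export to your actual metrics sink (Prometheus, StatsD, etc.) left as an assumption:

```typescript
// Per-endpoint retry-rate tracking: every call beyond the first attempt
// counts as a retry, and the rate is retries / total attempts.
class RetryMetrics {
  private counts = new Map<string, { attempts: number; retries: number }>();

  record(endpoint: string, attempts: number) {
    const c = this.counts.get(endpoint) ?? { attempts: 0, retries: 0 };
    c.attempts += attempts;
    c.retries += Math.max(0, attempts - 1);
    this.counts.set(endpoint, c);
  }

  retryRate(endpoint: string): number {
    const c = this.counts.get(endpoint);
    return c && c.attempts > 0 ? c.retries / c.attempts : 0;
  }
}
```

Alert when `retryRate` for an endpoint stays elevated — a sustained high rate usually means an upstream problem that retries are papering over.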