Build robust retry logic for LLM and tool calls in AI agents
Transient failures — rate limits, network blips, brief upstream outages — shouldn't crash an AI agent. This skill adds retry logic around LLM and tool calls so that they don't.
What to check first
- Identify which calls need retries: LLM, tool, downstream API
- Decide retry policy: retry count, backoff strategy, jitter
- Check what's idempotent — non-idempotent calls need careful handling
Steps
- Wrap LLM calls in a retry helper with exponential backoff + jitter
- Set max retries (3-5 typical) and a max delay cap
- Distinguish retryable errors (429, 503, network) from non-retryable (400, 401, validation)
- Add circuit breaker — if 50% of calls fail in a window, stop retrying entirely
- Log every retry attempt with attempt number and error
- For tool calls, check idempotency before retrying
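The idempotency check in the last step can be sketched as a flag on each tool's definition. The names here (`ToolSpec`, `executeTool`) are illustrative, not from any particular framework:

```typescript
// Hypothetical tool interface: each tool declares whether it is safe to retry.
interface ToolSpec {
  name: string;
  idempotent: boolean;
  run: (args: unknown) => Promise<string>;
}

// Retry idempotent tools; give non-idempotent tools exactly one shot so a
// retry can never duplicate a side effect (e.g. sending an email twice).
async function executeTool(tool: ToolSpec, args: unknown, maxAttempts = 3): Promise<string> {
  const attempts = tool.idempotent ? maxAttempts : 1;
  let lastError: Error | undefined;
  for (let i = 0; i < attempts; i++) {
    try {
      return await tool.run(args);
    } catch (err) {
      lastError = err as Error;
    }
  }
  throw lastError ?? new Error(`tool ${tool.name} failed`);
}
```

In practice you would combine this with the backoff helper below rather than retrying immediately; the point is that the retry decision belongs to the tool's declaration, not the call site.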
Code
// TypeScript with proper retry logic
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableErrors: (err: Error) => boolean;
}

const DEFAULT_RETRY: RetryOptions = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 30000,
  retryableErrors: (err) => {
    if (err.name === 'RateLimitError') return true;
    if (err.name === 'NetworkError') return true;
    if ('status' in err && [429, 502, 503, 504].includes((err as any).status)) return true;
    return false;
  },
};

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: Partial<RetryOptions> = {}
): Promise<T> {
  const opts = { ...DEFAULT_RETRY, ...options };
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (!opts.retryableErrors(lastError)) {
        throw lastError; // non-retryable — fail immediately
      }
      if (attempt === opts.maxAttempts) {
        throw new Error(`Failed after ${opts.maxAttempts} attempts: ${lastError.message}`);
      }
      // Exponential backoff with jitter, re-capped so jitter can't push past maxDelayMs
      const exponential = Math.min(
        opts.baseDelayMs * Math.pow(2, attempt - 1),
        opts.maxDelayMs
      );
      const jitter = Math.random() * exponential * 0.3;
      const delay = Math.min(exponential + jitter, opts.maxDelayMs);
      console.log(`Attempt ${attempt} failed, retrying in ${delay.toFixed(0)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // unreachable, but satisfies the compiler
}
// Usage with LLM call
async function callClaude(prompt: string): Promise<string> {
  return retryWithBackoff(
    async () => {
      const response = await anthropic.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
      // Content blocks are a union type — narrow before reading .text
      const block = response.content[0];
      if (block.type !== 'text') throw new Error('Unexpected response block type');
      return block.text;
    },
    { maxAttempts: 5, baseDelayMs: 2000 }
  );
}
// Circuit breaker — stop trying when too many failures
class CircuitBreaker {
  private failures = 0;
  private successes = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private openedAt = 0;

  constructor(
    private threshold = 0.5, // open if >=50% failures
    private windowSize = 20, // over last 20 calls
    private cooldownMs = 60000 // try again after 1 min
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.cooldownMs) {
        this.state = 'half-open'; // let one probe call through
      } else {
        throw new Error('Circuit breaker is open');
      }
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordSuccess() {
    this.successes++;
    if (this.state === 'half-open') this.state = 'closed';
    this.trim();
  }

  private recordFailure() {
    this.failures++;
    if (this.state === 'half-open') {
      // A failed probe means the upstream is still unhealthy — reopen immediately
      this.open();
      return;
    }
    const total = this.failures + this.successes;
    if (total >= this.windowSize && this.failures / total >= this.threshold) {
      this.open();
    }
    this.trim();
  }

  private open() {
    this.state = 'open';
    this.openedAt = Date.now();
    console.error('Circuit breaker opened');
  }

  private trim() {
    const total = this.failures + this.successes;
    if (total > this.windowSize) {
      // Reset for a new window (simplified — a real impl uses a sliding window)
      this.failures = 0;
      this.successes = 0;
    }
  }
}

const breaker = new CircuitBreaker();
const result = await breaker.call(() => callClaude('Hello'));
Common Pitfalls
- Retrying non-idempotent operations — duplicate side effects
- No max delay cap — exponential growth means hours of waiting
- No jitter — synchronized retries from many clients DoS the upstream
- Retrying on 4xx errors — apart from 429 (and arguably 408), they indicate a permanent problem with the request, so retries only waste time and tokens
- Ignoring rate limit headers — should respect Retry-After
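The last pitfall can be avoided with a small helper that prefers the server's Retry-After header over the computed backoff. The `headers` shape on the error is an assumption — adapt it to whatever your HTTP client or SDK actually exposes:

```typescript
// Sketch: derive the retry delay from a Retry-After header when present.
// Per the HTTP spec, Retry-After is either delta-seconds or an HTTP-date.
function delayFromError(err: unknown, fallbackMs: number): number {
  const headers = (err as { headers?: Record<string, string> }).headers;
  const retryAfter = headers?.['retry-after'];
  if (retryAfter !== undefined) {
    const seconds = Number(retryAfter);
    if (!Number.isNaN(seconds)) return seconds * 1000; // delta-seconds form
    const date = Date.parse(retryAfter); // HTTP-date form
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  return fallbackMs; // no header — fall back to exponential backoff
}
```

Inside the retry loop, you would pass the computed backoff as `fallbackMs` and sleep for whatever this returns.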
When NOT to Use This Skill
- For reads where serving stale data from a cache is cheaper than waiting out a retry
- When the upstream is permanently down — fail fast
How to Verify It Worked
- Test with simulated 429s and 503s — verify retries happen
- Test non-retryable errors fail immediately
- Test circuit breaker opens at threshold
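One way to run these checks without hitting a real API is a small test helper that fails a fixed number of times with a chosen status before succeeding, then asserts on the call count. The names here are illustrative:

```typescript
// Test helper sketch: wrap a value in a function that throws `k` simulated
// HTTP errors (with the given status) before returning the value.
function failNTimes<T>(k: number, status: number, value: T) {
  let calls = 0;
  const fn = async (): Promise<T> => {
    calls++;
    if (calls <= k) {
      const err = new Error(`simulated ${status}`);
      (err as any).status = status;
      throw err;
    }
    return value;
  };
  return { fn, calls: () => calls };
}
```

Pass `fn` to your retry wrapper with tiny delays, then assert that a 429 produced exactly `k + 1` calls and that a 400 produced exactly one.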
Production Considerations
- Monitor retry rates per endpoint — high rates indicate upstream issues
- Set up alerts on circuit breaker opens
- Use distributed circuit breakers if running multiple instances
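Monitoring retry rates per endpoint can start as simply as a counter keyed by endpoint name; the class below is a hypothetical sketch, with the export to your actual metrics sink (Prometheus, StatsD, etc.) left as an assumption:

```typescript
// Per-endpoint retry-rate tracking: every call beyond the first attempt
// counts as a retry, and the rate is retries / total attempts.
class RetryMetrics {
  private counts = new Map<string, { attempts: number; retries: number }>();

  record(endpoint: string, attempts: number) {
    const c = this.counts.get(endpoint) ?? { attempts: 0, retries: 0 };
    c.attempts += attempts;
    c.retries += Math.max(0, attempts - 1);
    this.counts.set(endpoint, c);
  }

  retryRate(endpoint: string): number {
    const c = this.counts.get(endpoint);
    return c && c.attempts > 0 ? c.retries / c.attempts : 0;
  }
}
```

Alert when `retryRate` for an endpoint stays elevated — a sustained high rate usually means an upstream problem that retries are papering over.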