Extract text from PDFs (including scanned ones via OCR) for indexing or analysis
What to check first
- Determine if the PDF is text-based or scanned (image) — affects which library to use
- Check if the PDF has special formatting (tables, columns, forms) you need to preserve
- Identify the language for OCR if needed — affects accuracy significantly
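A minimal Python sketch of the text-vs-scanned check. It assumes you already have the natively extracted text and the file size; the chars-per-KB threshold is a rough assumption you should tune on your own corpus:

```python
def looks_scanned(extracted_text: str, file_size_bytes: int,
                  min_chars_per_kb: float = 10.0) -> bool:
    """Heuristic: scanned PDFs yield almost no extractable text
    relative to their file size."""
    if not extracted_text.strip():
        return True
    chars_per_kb = len(extracted_text) / (file_size_bytes / 1024)
    return chars_per_kb < min_chars_per_kb
```

If this returns True, route the file to the OCR path instead of trusting the empty result.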
Steps
- Try pdf-parse (Node) or PyMuPDF (Python) for text-based PDFs first — fastest
- If text comes back empty or garbled, the PDF is scanned — fall back to OCR
- For OCR, use Tesseract via tesseract.js (Node) or pytesseract (Python)
- Pre-process scanned images: deskew, denoise, increase contrast for better OCR
- For tables, use pdfplumber (Python) — preserves cell structure
- Validate extraction quality on a sample before processing thousands
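The steps above can be sketched as a single dispatch function. `extract_text`, `ocr_text`, and the `min_chars` threshold are placeholders you would wire to the libraries shown below:

```python
from typing import Callable

def extract_pdf_text(path: str,
                     extract_text: Callable[[str], str],  # fast native extractor
                     ocr_text: Callable[[str], str],      # slow OCR fallback
                     min_chars: int = 100) -> str:
    """Try fast native extraction first; fall back to OCR only if the
    result is empty or suspiciously short (likely a scanned PDF)."""
    text = extract_text(path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr_text(path)
```

Injecting the two extractors as callables keeps the routing logic testable without any PDF library installed.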
Code
// Node.js — text-based PDF
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return {
    text: data.text,
    numPages: data.numpages,
    info: data.info,
  };
}

const result = await extractText('document.pdf');
console.log(result.text);
// Detecting if extraction failed (likely scanned PDF)
function isScanned(extractedText, fileSize) {
  const charsPerKB = extractedText.length / (fileSize / 1024);
  return charsPerKB < 10; // Heuristic: text-based PDFs typically yield well over 10 chars per KB
}
// Fallback to OCR for scanned PDFs
import { createWorker } from 'tesseract.js';
import { fromPath } from 'pdf2pic';

async function ocrPdf(filePath) {
  // Convert PDF pages to images
  const converter = fromPath(filePath, {
    density: 300,
    saveFilename: 'page',
    savePath: './tmp',
    format: 'png',
    width: 2000,
  });
  const numPages = (await converter.bulk(-1)).length;
  const worker = await createWorker('eng');
  let fullText = '';
  for (let i = 1; i <= numPages; i++) {
    const imagePath = `./tmp/page.${i}.png`;
    const { data: { text } } = await worker.recognize(imagePath);
    fullText += `\n--- Page ${i} ---\n${text}`;
  }
  await worker.terminate();
  return fullText;
}
# Python with PyMuPDF (much faster for text PDFs)
import fitz  # PyMuPDF

def extract_text(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page in doc:
        pages.append(page.get_text())
    doc.close()
    return "\n".join(pages)
# Python with pdfplumber for tables
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables
# Python OCR fallback
import pytesseract
from pdf2image import convert_from_path
from PIL import ImageEnhance

def ocr_pdf(file_path, language='eng'):
    images = convert_from_path(file_path, dpi=300)
    text_pages = []
    for i, image in enumerate(images):
        # Enhance contrast for better OCR
        enhancer = ImageEnhance.Contrast(image)
        enhanced = enhancer.enhance(2.0)
        text = pytesseract.image_to_string(enhanced, lang=language)
        text_pages.append(f"--- Page {i+1} ---\n{text}")
    return "\n".join(text_pages)
# Layout-aware extraction (preserves position info)
import fitz  # PyMuPDF

def extract_with_positions(file_path):
    doc = fitz.open(file_path)
    blocks = []
    for page_num, page in enumerate(doc):
        page_blocks = page.get_text("dict")["blocks"]
        for block in page_blocks:
            if block.get("type") == 0:  # text block
                for line in block["lines"]:
                    for span in line["spans"]:
                        blocks.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],
                            "font": span["font"],
                            "size": span["size"],
                        })
    doc.close()
    return blocks
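One use of that position data is heading detection. This sketch assumes the span-dict shape produced above; the 1.3 size ratio is a hypothetical cutoff, not a PyMuPDF convention:

```python
from statistics import median

def find_headings(spans: list, ratio: float = 1.3) -> list:
    """Flag spans whose font size is well above the document's
    median size as likely headings."""
    sizes = [s["size"] for s in spans if s["text"].strip()]
    if not sizes:
        return []
    body_size = median(sizes)
    return [s["text"] for s in spans
            if s["text"].strip() and s["size"] >= body_size * ratio]
```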
Common Pitfalls
- Trying OCR on text-based PDFs — wastes 100x the CPU for worse results
- Using low DPI for OCR — Tesseract needs at least 200 DPI, ideally 300+
- Forgetting to specify language for OCR — defaults to English, gets non-Latin scripts wrong
- Not handling multi-column layouts — text comes out interleaved
- Memory issues on huge PDFs — process page by page, not all at once
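For the multi-column pitfall, a rough sketch that reorders spans (in the bbox format from the layout-aware example) into column-aware reading order; the `column_gap` threshold is an assumption to tune per layout. Recent PyMuPDF versions also accept `page.get_text(sort=True)`, which handles simple cases without custom code:

```python
def reading_order(spans: list, column_gap: float = 50.0) -> list:
    """Cluster spans into columns by their left x coordinate, then
    read each column top to bottom, left to right."""
    if not spans:
        return []
    ordered = sorted(spans, key=lambda s: s["bbox"][0])
    columns, current = [], [ordered[0]]
    for span in ordered[1:]:
        # A jump in x0 larger than column_gap starts a new column
        if span["bbox"][0] - current[-1]["bbox"][0] > column_gap:
            columns.append(current)
            current = [span]
        else:
            current.append(span)
    columns.append(current)
    result = []
    for col in columns:
        result.extend(sorted(col, key=lambda s: (s["bbox"][1], s["bbox"][0])))
    return result
```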
When NOT to Use This Skill
- When you control the source — get the original document instead of OCR'ing the PDF
- For one-off extractions — manual copy-paste is faster than coding
How to Verify It Worked
- Sample-test on PDFs with known content and verify extraction matches
- Check for common issues: missing characters, joined words, wrong order
- For OCR, measure character error rate (CER) on a labeled sample
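A self-contained CER helper (plain character-level Levenshtein distance divided by the reference length):

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else float(n > 0)
```

A CER of 0.0 means a perfect match; values above a few percent on your labeled sample usually mean the DPI, preprocessing, or language setting needs attention.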
Production Considerations
- Cache extracted text by file hash — don't re-extract the same PDF
- Use a worker queue (Bull, Celery) for OCR — it's CPU-intensive
- Validate language detection on incoming PDFs to pick the right OCR model
- Set timeouts on extraction — corrupted PDFs can hang forever
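A sketch of hash-keyed caching for the first point; `extract` is whatever extraction function you use, and the one-file-per-digest disk layout is an assumption (a database or object store works the same way):

```python
import hashlib
from pathlib import Path

def extract_cached(pdf_path: str, cache_dir: str, extract) -> str:
    """Key the cache on the file's SHA-256 so identical PDFs are
    extracted once, even when uploaded under different names."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.txt"
    if cache_file.exists():
        return cache_file.read_text()
    text = extract(pdf_path)
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text
```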