Extract text from PDFs (including scanned ones via OCR) for indexing or analysis
What to check first
- Determine if the PDF is text-based or scanned (image) — affects which library to use
- Check if the PDF has special formatting (tables, columns, forms) you need to preserve
- Identify the language for OCR if needed — affects accuracy significantly
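A minimal Python sketch of the text-vs-scanned check. It assumes you already have the natively extracted text and the file size; the chars-per-KB threshold is a rough assumption you should tune on your own corpus:

```python
def looks_scanned(extracted_text: str, file_size_bytes: int,
                  min_chars_per_kb: float = 10.0) -> bool:
    """Heuristic: scanned PDFs yield almost no extractable text
    relative to their file size."""
    if not extracted_text.strip():
        return True
    chars_per_kb = len(extracted_text) / (file_size_bytes / 1024)
    return chars_per_kb < min_chars_per_kb
```

If this returns True, route the file to the OCR path instead of trusting the empty result.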
Steps
- Try pdf-parse (Node) or PyMuPDF (Python) for text-based PDFs first — fastest
- If text comes back empty or garbled, the PDF is scanned — fall back to OCR
- For OCR, use Tesseract via tesseract.js (Node) or pytesseract (Python)
- Pre-process scanned images: deskew, denoise, increase contrast for better OCR
- For tables, use pdfplumber (Python) — preserves cell structure
- Validate extraction quality on a sample before processing thousands
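The steps above can be sketched as a single dispatch function. `extract_text`, `ocr_text`, and the `min_chars` threshold are placeholders you would wire to the libraries shown below:

```python
from typing import Callable

def extract_pdf_text(path: str,
                     extract_text: Callable[[str], str],  # fast native extractor
                     ocr_text: Callable[[str], str],      # slow OCR fallback
                     min_chars: int = 100) -> str:
    """Try fast native extraction first; fall back to OCR only if the
    result is empty or suspiciously short (likely a scanned PDF)."""
    text = extract_text(path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr_text(path)
```

Injecting the two extractors as callables keeps the routing logic testable without any PDF library installed.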
Code
// Node.js — text-based PDF
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return {
    text: data.text,
    numPages: data.numpages,
    info: data.info,
  };
}

const result = await extractText('document.pdf');
console.log(result.text);
// Detecting if extraction failed (likely scanned PDF)
function isScanned(extractedText, fileSize) {
  const charsPerKB = extractedText.length / (fileSize / 1024);
  return charsPerKB < 10; // Heuristic: text-based PDFs typically yield well over 10 chars per KB
}
// Fallback to OCR for scanned PDFs
import { createWorker } from 'tesseract.js';
import { fromPath } from 'pdf2pic';

async function ocrPdf(filePath) {
  // Convert PDF pages to images
  const converter = fromPath(filePath, {
    density: 300,
    saveFilename: 'page',
    savePath: './tmp',
    format: 'png',
    width: 2000,
  });
  const numPages = (await converter.bulk(-1)).length;
  const worker = await createWorker('eng');
  let fullText = '';
  for (let i = 1; i <= numPages; i++) {
    const imagePath = `./tmp/page.${i}.png`;
    const { data: { text } } = await worker.recognize(imagePath);
    fullText += `\n--- Page ${i} ---\n${text}`;
  }
  await worker.terminate();
  return fullText;
}
# Python with PyMuPDF (much faster for text PDFs)
import fitz  # PyMuPDF

def extract_text(file_path):
    doc = fitz.open(file_path)
    pages = []
    for page in doc:
        pages.append(page.get_text())
    doc.close()
    return "\n".join(pages)
# Python with pdfplumber for tables
import pdfplumber

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables
# Python OCR fallback
import pytesseract
from pdf2image import convert_from_path
from PIL import ImageEnhance

def ocr_pdf(file_path, language='eng'):
    images = convert_from_path(file_path, dpi=300)
    text_pages = []
    for i, image in enumerate(images):
        # Enhance contrast for better OCR
        enhancer = ImageEnhance.Contrast(image)
        enhanced = enhancer.enhance(2.0)
        text = pytesseract.image_to_string(enhanced, lang=language)
        text_pages.append(f"--- Page {i+1} ---\n{text}")
    return "\n".join(text_pages)
# Layout-aware extraction (preserves position info)
import fitz  # PyMuPDF

def extract_with_positions(file_path):
    doc = fitz.open(file_path)
    blocks = []
    for page_num, page in enumerate(doc):
        page_blocks = page.get_text("dict")["blocks"]
        for block in page_blocks:
            if block.get("type") == 0:  # text block
                for line in block["lines"]:
                    for span in line["spans"]:
                        blocks.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],
                            "font": span["font"],
                            "size": span["size"],
                        })
    doc.close()
    return blocks
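One use of that position data is heading detection. This sketch assumes the span-dict shape produced above; the 1.3 size ratio is a hypothetical cutoff, not a PyMuPDF convention:

```python
from statistics import median

def find_headings(spans: list, ratio: float = 1.3) -> list:
    """Flag spans whose font size is well above the document's
    median size as likely headings."""
    sizes = [s["size"] for s in spans if s["text"].strip()]
    if not sizes:
        return []
    body_size = median(sizes)
    return [s["text"] for s in spans
            if s["text"].strip() and s["size"] >= body_size * ratio]
```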
Common Pitfalls
- Trying OCR on text-based PDFs — wastes 100x the CPU for worse results
- Using low DPI for OCR — Tesseract needs at least 200 DPI, ideally 300+
- Forgetting to specify language for OCR — defaults to English, gets non-Latin scripts wrong
- Not handling multi-column layouts — text comes out interleaved
- Memory issues on huge PDFs — process page by page, not all at once
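For the multi-column pitfall, a rough sketch that reorders spans (in the bbox format from the layout-aware example) into column-aware reading order; the `column_gap` threshold is an assumption to tune per layout. Recent PyMuPDF versions also accept `page.get_text(sort=True)`, which handles simple cases without custom code:

```python
def reading_order(spans: list, column_gap: float = 50.0) -> list:
    """Cluster spans into columns by their left x coordinate, then
    read each column top to bottom, left to right."""
    if not spans:
        return []
    ordered = sorted(spans, key=lambda s: s["bbox"][0])
    columns, current = [], [ordered[0]]
    for span in ordered[1:]:
        # A jump in x0 larger than column_gap starts a new column
        if span["bbox"][0] - current[-1]["bbox"][0] > column_gap:
            columns.append(current)
            current = [span]
        else:
            current.append(span)
    columns.append(current)
    result = []
    for col in columns:
        result.extend(sorted(col, key=lambda s: (s["bbox"][1], s["bbox"][0])))
    return result
```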
When NOT to Use This Skill
- When you control the source — get the original document instead of OCR'ing the PDF
- For one-off extractions — manual copy-paste is faster than coding
How to Verify It Worked
- Sample-test on PDFs with known content and verify extraction matches
- Check for common issues: missing characters, joined words, wrong order
- For OCR, measure character error rate (CER) on a labeled sample
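A self-contained CER helper (plain character-level Levenshtein distance divided by the reference length):

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else float(n > 0)
```

A CER of 0.0 means a perfect match; values above a few percent on your labeled sample usually mean the DPI, preprocessing, or language setting needs attention.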
Production Considerations
- Cache extracted text by file hash — don't re-extract the same PDF
- Use a worker queue (Bull, Celery) for OCR — it's CPU-intensive
- Validate language detection on incoming PDFs to pick the right OCR model
- Set timeouts on extraction — corrupted PDFs can hang forever
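A sketch of hash-keyed caching for the first point; `extract` is whatever extraction function you use, and the one-file-per-digest disk layout is an assumption (a database or object store works the same way):

```python
import hashlib
from pathlib import Path

def extract_cached(pdf_path: str, cache_dir: str, extract) -> str:
    """Key the cache on the file's SHA-256 so identical PDFs are
    extracted once, even when uploaded under different names."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.txt"
    if cache_file.exists():
        return cache_file.read_text()
    text = extract(pdf_path)
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text
```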