OCR PDF

Extract searchable text from scanned PDFs with text recognition

How to OCR a PDF online

Scanned PDFs are essentially images — you cannot search, copy, or index the text inside them. PDFRift's OCR tool uses Tesseract.js, a battle-tested open-source text recognition engine, to analyse each page and extract readable text. Everything runs in your browser; the scanned document never leaves your device.

Step-by-step

  1. Upload a scanned PDF (or any PDF with image-based pages).
  2. Click "Start OCR". PDFRift renders each page at high resolution and feeds it through Tesseract.js.
  3. Watch progress in real time — page count and recognition percentage update as you wait.
  4. Copy the extracted text or download it as a .txt file.
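The progress display and the final text output from the steps above can be sketched roughly as follows. This is a minimal TypeScript illustration, not PDFRift's actual code: `overallProgress` and `joinPageText` are hypothetical helper names, and the page separator format is an assumption made for the example.

```typescript
// Hypothetical helpers for illustration; not part of PDFRift or Tesseract.js.

// Combine the number of finished pages with the current page's recognition
// fraction (0..1) into a single overall percentage for a progress bar.
function overallProgress(
  pagesDone: number,
  pageCount: number,
  currentPageFraction: number
): number {
  if (pageCount <= 0) return 0;
  const clamped = Math.min(Math.max(currentPageFraction, 0), 1);
  const done = Math.min(Math.max(pagesDone, 0), pageCount);
  const fraction = done >= pageCount ? 1 : (done + clamped) / pageCount;
  return Math.round(fraction * 100);
}

// Assemble per-page OCR results into one plain-text document.
// Blank pages are skipped; the "--- Page N ---" separator is an assumed format.
function joinPageText(pages: string[]): string {
  const parts: string[] = [];
  for (let i = 0; i < pages.length; i++) {
    const text = pages[i].trim();
    if (text.length > 0) {
      parts.push("--- Page " + (i + 1) + " ---\n" + text);
    }
  }
  return parts.join("\n\n");
}

// Two of five pages finished, third page half recognised.
console.log(overallProgress(2, 5, 0.5)); // 50

// The second page is blank, so it is left out of the output.
console.log(joinPageText(["Hello \n", "", "World"]));
```

Keeping these as pure functions means the same logic can be driven by whatever progress events the recognition engine reports, without tying the display code to one OCR library.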

Why use browser-based OCR?

Many cloud OCR services upload your document to their servers, and some retain it after processing. PDFRift keeps your scanned files private: text recognition runs entirely in WebAssembly inside your browser tab. This matters most for legal discovery, medical records, financial statements, and any document you wouldn't email to a stranger.

Common use cases

Legal document discovery

Convert boxes of scanned contracts, depositions, and filings into searchable text for keyword review.

Archiving paper records

Digitise old invoices, receipts, and correspondence. Extract the text so it is indexable and searchable.

Academic research

Pull quotes and data from scanned journal articles, historical documents, and book chapters.

Accessibility

Make the content of scanned PDFs available to screen readers by extracting it as text.

Tips for better OCR results

  • Higher-resolution scans produce more accurate results. 300 DPI is the recommended minimum.
  • Ensure the scanned image is not skewed — crooked text reduces accuracy significantly.
  • Clean black text on a white background gives the best results. Coloured backgrounds or watermarks may interfere.
  • OCR works best on printed text. Handwriting recognition is limited and results will vary.
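As a rough way to check a scan against the 300 DPI guideline above, the effective resolution can be estimated from the image's pixel width and the physical page width. The sketch below assumes the page size is known in PDF points (72 points per inch); `effectiveDpi` and `meetsMinimumDpi` are hypothetical names, not part of PDFRift or Tesseract.js.

```typescript
// Hypothetical helpers for illustration only.

// Effective DPI = pixels across the page / page width in inches.
// The page width is given in PDF points, where 72 points = 1 inch.
function effectiveDpi(imageWidthPx: number, pageWidthPt: number): number {
  if (pageWidthPt <= 0) {
    throw new Error("pageWidthPt must be positive");
  }
  return imageWidthPx / (pageWidthPt / 72);
}

// True when the scan reaches the minimum resolution (300 DPI by default).
function meetsMinimumDpi(
  imageWidthPx: number,
  pageWidthPt: number,
  minimumDpi: number = 300
): boolean {
  return effectiveDpi(imageWidthPx, pageWidthPt) >= minimumDpi;
}

// A US Letter page is 612 points (8.5 inches) wide.
console.log(effectiveDpi(2550, 612)); // 300
console.log(meetsMinimumDpi(2550, 612)); // true
console.log(meetsMinimumDpi(1275, 612)); // false (150 DPI)
```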

Frequently asked questions

What languages does OCR support?

The default engine recognises English text. Tesseract.js supports 100+ languages, but PDFRift currently ships only the English model to keep load times short.

Is OCR 100% accurate?

No OCR engine is perfect. Accuracy depends on scan quality, font, and layout complexity. Always proofread the output for critical documents.
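One common way to put a number on OCR accuracy when proofreading is the character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The sketch below is a generic TypeScript illustration of that measure, not something PDFRift reports; the function names and sample strings are made up for the example.

```typescript
// Illustrative helpers; not part of PDFRift or Tesseract.js.

// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character error rate relative to the reference text.
// By convention here, an empty reference yields 0 if the output is also
// empty and 1 otherwise.
function characterErrorRate(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 0 : 1;
  return editDistance(reference, ocrOutput) / reference.length;
}

// Typical OCR confusions: "I" read as "l", and "0" read as "O".
console.log(editDistance("Invoice 1001", "lnvoice 10O1")); // 2
console.log(characterErrorRate("abcd", "abxd")); // 0.25
```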

Does this create a searchable PDF?

Currently, PDFRift extracts the text as plain text you can copy or download. A future update will embed the text layer back into the PDF.