
Guide

How to extract text from a scanned PDF in C#

Use ZingPDF.OCR when the page text is baked into scanned images instead of embedded in the PDF text layer. The built-in Tesseract engine supports OCR-first workflows as well as workflows that keep embedded text when it is already present.

ZingPDF.OCR is a separate package because OCR has different runtime needs from normal PDF parsing. The core ZingPDF package still handles PDFs that already contain selectable text.

Add the OCR package alongside the core package

OCR lives in ZingPDF.OCR. It builds on the main PDF API rather than replacing it.

The default engine implementation uses Tesseract, so you also need a tessdata folder containing the trained data file for each language you want to recognize (for example, eng.traineddata for English). For the language files or runtime setup details, see the official Tesseract documentation.
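A missing language file is an easy setup mistake, so it can help to check the tessdata folder before constructing the engine. The sketch below uses only System.IO; the helper name FindMissingLanguages is hypothetical and not part of ZingPDF.OCR, and it assumes the standard Tesseract naming of one "<code>.traineddata" file per language.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of ZingPDF.OCR): returns the language codes
// whose "<code>.traineddata" file is missing from the tessdata folder.
static List<string> FindMissingLanguages(string tessdataPath, params string[] languages)
{
    var missingLanguages = new List<string>();
    foreach (var language in languages)
    {
        var modelPath = Path.Combine(tessdataPath, language + ".traineddata");
        if (!File.Exists(modelPath))
        {
            missingLanguages.Add(language);
        }
    }
    return missingLanguages;
}

// Check the same folder and language you intend to pass to the OCR engine.
var notInstalled = FindMissingLanguages("./tessdata", "eng");
if (notInstalled.Count > 0)
{
    Console.Error.WriteLine(
        "Missing Tesseract language data: " + string.Join(", ", notInstalled));
}
```

Running a check like this at startup turns a late OCR failure into an early, readable message about which language files still need to be downloaded.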

Extract text with OCR

This example extracts plain text from a scanned or image-based PDF with the built-in Tesseract engine.

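using System;
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

// Open the scanned PDF; the stream and the document are both disposed at the end of scope.
using var stream = File.OpenRead("scanned-input.pdf");
using var pdf = Pdf.Load(stream);

// Point the engine at the tessdata folder and choose the language ("eng" is English).
var engine = new TesseractOcrEngine("./tessdata", "eng");

// Extract plain text, running OCR on the image-based pages.
var text = await pdf.ExtractPlainTextWithOcrAsync(engine);

Console.WriteLine(text);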

Keep embedded text when it already exists

Some workflows want OCR only for image-based pages, while others want OCR to be the main extraction path. PdfOcrOptions lets you choose how embedded text is treated.

The default options prefer embedded text when it is present, but you can pass options explicitly if you want that behavior to be obvious in code.

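using System;
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

// Open the PDF; the stream and the document are both disposed at the end of scope.
using var stream = File.OpenRead("input.pdf");
using var pdf = Pdf.Load(stream);

var engine = new TesseractOcrEngine("./tessdata", "eng");

// Pass the options explicitly so the embedded-text preference is visible in code.
var text = await pdf.ExtractPlainTextWithOcrAsync(
    engine,
    new PdfOcrOptions
    {
        PreferEmbeddedText = true
    });

Console.WriteLine(text);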
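For the OCR-first case, the same call can presumably be made with the preference turned off. This is a sketch under one assumption: that setting PreferEmbeddedText to false makes OCR the main extraction path even when a page has a text layer. The guide does not spell out the exact behavior of the false value, so check it against the OCR docs before relying on it.

```csharp
using System;
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

using var stream = File.OpenRead("input.pdf");
using var pdf = Pdf.Load(stream);

var engine = new TesseractOcrEngine("./tessdata", "eng");

// Assumption: false turns off the embedded-text preference,
// so OCR output is used even when a page has a text layer.
var text = await pdf.ExtractPlainTextWithOcrAsync(
    engine,
    new PdfOcrOptions
    {
        PreferEmbeddedText = false
    });

Console.WriteLine(text);
```

An OCR-first setup can be useful when the embedded text layer is known to be unreliable, for example when it came from an earlier low-quality OCR pass.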

Current limits

This package is aimed at image-based pages. It looks for the largest supported image on each page and sends that image to the OCR engine.

That works well for scanned PDFs where each page is mostly one scan image. It is not a full page renderer for arbitrary vector content, annotations, or mixed scanned and drawn layouts.
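To make the limit concrete, here is a conceptual sketch of a "largest supported image" selection. It is not ZingPDF's implementation: the tuple shape, the format list, and the use of pixel area as the measure of "largest" are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical set of image formats the OCR step can read.
var supportedFormats = new HashSet<string> { "jpeg", "png" };

// Hypothetical images found on one page.
var pageImages = new List<(string Name, string Format, int Width, int Height)>
{
    ("logo", "png", 200, 80),
    ("scan", "jpeg", 2480, 3508),
    ("stamp", "jbig2", 4000, 4000), // skipped: format not in the supported set
};

// Keep supported images only, then take the one with the largest pixel area.
var ocrInput = pageImages
    .Where(image => supportedFormats.Contains(image.Format))
    .OrderByDescending(image => (long)image.Width * image.Height)
    .FirstOrDefault();

Console.WriteLine(ocrInput.Name ?? "(no supported image)"); // prints "scan"
```

Only one image per page reaches the OCR engine in this model, so text that lives in a smaller image, in vector drawings, or in annotations would never be seen.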

Working with PDFs that already contain text?

Use the normal extraction API when the document already contains a reliable PDF text layer.
