Guide
How to extract text from a scanned PDF in C#
Use ZingPDF.OCR when the page text is baked into scanned images instead of embedded in the PDF
text layer. The built-in Tesseract engine supports OCR-first workflows as well as workflows that keep
embedded text when it is already present.
This is a separate package because OCR has runtime needs that normal PDF parsing does not: the
default engine depends on Tesseract and its language data. The core ZingPDF package still handles
PDFs that already contain selectable text.
Add the OCR package alongside the core package
OCR lives in ZingPDF.OCR. It builds on the main PDF API rather than replacing it.
The default engine implementation uses Tesseract, so you also need a tessdata folder for the
language you want to recognize. If you need the language files or runtime setup details, see the
official Tesseract documentation.
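Because a missing language file is an easy setup mistake, a small pre-flight check can help. The sketch below assumes the usual Tesseract layout, where the data for a single language code such as eng is stored as eng.traineddata inside the tessdata folder. HasLanguageData is a hypothetical helper written for this guide, not part of ZingPDF.OCR, and it uses only System.IO.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of ZingPDF.OCR: returns true when the
// Tesseract data file for one language code exists in the tessdata folder.
static bool HasLanguageData(string tessdataPath, string language)
{
    var file = Path.Combine(tessdataPath, language + ".traineddata");
    return File.Exists(file);
}

// Warn early instead of failing later when the OCR engine is created.
if (!HasLanguageData("./tessdata", "eng"))
{
    Console.Error.WriteLine("Missing ./tessdata/eng.traineddata; download it before running OCR.");
}
```

Running a check like this before constructing the engine keeps setup problems separate from OCR failures on a particular document.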
Extract text with OCR
This example extracts plain text from a scanned or image-based PDF with the built-in Tesseract engine.
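using System;
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

// Load the PDF, then extract its text with the Tesseract engine.
using var pdf = Pdf.Load(File.OpenRead("scanned-input.pdf"));
var engine = new TesseractOcrEngine("./tessdata", "eng"); // tessdata folder, language code
var text = await pdf.ExtractPlainTextWithOcrAsync(engine);
Console.WriteLine(text);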
Keep embedded text when it already exists
Some workflows want OCR only for image-based pages, while others want OCR to be the main extraction path.
PdfOcrOptions lets you choose how embedded text is treated.
The default options prefer embedded text when it is present, but you can pass options explicitly if you want that behavior to be obvious in code.
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

using var pdf = Pdf.Load(File.OpenRead("input.pdf"));
var engine = new TesseractOcrEngine("./tessdata", "eng");

// State the default explicitly: use embedded text when it is present.
var text = await pdf.ExtractPlainTextWithOcrAsync(
    engine,
    new PdfOcrOptions
    {
        PreferEmbeddedText = true
    });
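For the OCR-first workflow, a minimal sketch might set the same option to false. This rests on an assumption: this guide only shows PreferEmbeddedText = true, so treating false as "run OCR even when a page has embedded text" is an inference from the option name rather than confirmed behavior. Check the PdfOcrOptions documentation before relying on it.

```csharp
using System;
using System.IO;
using ZingPDF;
using ZingPDF.OCR;

using var pdf = Pdf.Load(File.OpenRead("input.pdf"));
var engine = new TesseractOcrEngine("./tessdata", "eng");

// Assumption: false asks the extractor to run OCR instead of reusing embedded text.
var text = await pdf.ExtractPlainTextWithOcrAsync(
    engine,
    new PdfOcrOptions { PreferEmbeddedText = false });

Console.WriteLine(text);
```

This could suit documents whose embedded text layer is unreliable, for example one produced by an earlier, low-quality OCR pass.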
Current limits
This package is aimed at image-based pages. It looks for the largest supported image on each page and sends that image to the OCR engine.
That works well for scanned PDFs where each page is mostly one scan image. It is not a full page renderer for arbitrary vector content, annotations, or mixed scanned and drawn layouts.
Working with PDFs that already contain text?
Use the normal extraction API when the document already contains a reliable PDF text layer.