
Pulling Text Out of Scans — An OCR Guide for Real-World Documents

Real-world scans aren’t lab-clean. Here’s how to handle skewed pages, mixed languages, faint scans, and dense formatting.

May 5, 2026 · 2 min read

Demos always show OCR on a perfect 600-dpi scan of a page set in Times Roman. Real documents are fax-quality, half-rotated, and mixed-language, with handwriting in the margins. The fundamentals still work, but knowing the failure modes saves hours.

Five tweaks that materially raise accuracy

  1. Deskew first. A page tilted by more than about 5° loses noticeable accuracy. Most modern OCR engines deskew automatically, but check the result.
  2. Boost contrast on faint scans. Run a “binarise” or “improve contrast” pass before OCR. Pale grey ink on greyish paper is the worst-case input; a quick threshold makes it readable.
  3. Pick the right language. Mixing English and German under “auto-detect” works less well than picking one and accepting some errors in the other.
  4. Use exact layout for forms. Forms with checkboxes and aligned fields need spatial preservation; “flowing text” mode scrambles them.
  5. Re-scan if the source is irrecoverable. No tool fixes a 96-dpi scan of a faded photocopy. Sometimes the right answer is a fresh scan, not a better tool.
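
To make the "quick threshold" in tweak 2 concrete, here is a minimal sketch of Otsu's method, one common way to pick a binarisation cut-off automatically. It is an illustration, not what any particular OCR tool does internally. It assumes the page has already been loaded as a flat list of 0-255 greyscale values; the names `otsu_threshold` and `binarise` and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Pick the 0-255 cut-off that best separates ink from paper (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of the grey levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue            # nothing on the dark side yet
        if w_light == 0:
            break               # everything is on the dark side; stop
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical faint scan: pale grey ink (150) on greyish paper (200).
faint = [150] * 20 + [200] * 80
t = otsu_threshold(faint)
clean = binarise(faint, t)
```

After this pass the ink pixels are pure black and the paper is pure white, which is the high-contrast input OCR engines handle best. In practice an image library would do the same job faster; the sketch only shows the idea.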

Languages and scripts

  • Latin scripts (English, Spanish, German, French, Portuguese, Italian): 99%+ on clean print.
  • CJK (Chinese, Japanese, Korean): 95–98% on clean print, sensitive to font and DPI.
  • Arabic, Hebrew: solid on print; right-to-left layout sometimes confuses output formatting.
  • Cyrillic, Greek: very strong on modern print; shaky on scans of pre-1990 documents, where print quality varies.
  • Indic scripts: improving fast; varies by engine.

What to do with the output

Don’t just grab the .txt and run. Open it next to the source, scroll through the first and last pages, and skim for obvious OCR errors (numbers swapped, common mis-recognitions like “rn” → “m”). A two-minute proofread catches 90% of mistakes that would otherwise propagate downstream.
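
Part of that skim can be pointed at the likeliest trouble spots. The sketch below is a rough heuristic, not a substitute for reading the page: it lists tokens that mix letters and digits, which is where swaps such as "O" for "0" tend to show up. It will also flag legitimate tokens like "A4" or "3rd", and it cannot catch "rn" read as "m", which needs a dictionary check. The function name and the sample text are hypothetical.

```python
import re


def suspicious_tokens(text):
    """Return (line_number, token) pairs for tokens that mix letters and digits."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"\w+", line):
            has_digit = any(c.isdigit() for c in token)
            has_alpha = any(c.isalpha() for c in token)
            if has_digit and has_alpha:
                hits.append((line_no, token))
    return hits


# Hypothetical OCR output with two letter-for-digit swaps.
sample = "Invoice total 1O4.50\nPaid on 12 May 2O26\nAll fine here"
flagged = suspicious_tokens(sample)
```

Each hit carries a line number, so you can jump straight to that spot in the source scan and compare.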

Frequently asked questions

My OCR keeps reading "0" as "O". Can I fix that?

Common in low-DPI scans. Re-scan at 300 dpi if possible, or use an OCR engine with a "numeric context" mode for fields you know contain numbers.
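
If re-scanning isn't possible and your engine has no such mode, a similar effect can be approximated after the fact. The sketch below is a post-processing workaround, not an engine feature: it swaps common letter lookalikes for digits, and it should only be applied to fields you already know are numeric, since it would corrupt ordinary words. The lookalike table and the function name are assumptions chosen for illustration.

```python
# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LOOKALIKES = str.maketrans({
    "O": "0", "o": "0",
    "l": "1", "I": "1",
    "S": "5",
    "B": "8",
})


def fix_numeric_field(value):
    """Replace letter lookalikes with digits in a field known to be numeric."""
    return value.translate(LOOKALIKES)


# Hypothetical field values as they came out of OCR.
amount = fix_numeric_field("1O4.5O")
date = fix_numeric_field("2O26-O5-l2")
```

Punctuation and real digits pass through unchanged, so amounts and dates keep their separators.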

Can OCR run on photos taken with my phone?

Yes — modern engines handle phone photos well. Improve accuracy by holding the phone parallel to the page and ensuring even lighting.

