Extraction

Extract text from PDF

You need just the text from a PDF — no layout, no images, just clean copy you can paste into another document, search, or feed into a script.

Tool

⚡ Open the tool

Free · No account · Files deleted in 1 hour

Why this works

PDFRun extracts the full text content as a plain .txt, Markdown .md, or JSON file. Text-based PDFs come through instantly; scanned PDFs run through OCR automatically when needed.

Text extraction is the simplest PDF operation and the most universally useful. Strip away the formatting, fonts, images, columns, and layout decisions — just give me the words. Common cases: quoting from a research paper into your own writing; piping document content into a script that analyses or indexes text; feeding source material into an AI chat for question-answering; building a search index across a library of PDFs; recovering text from a PDF whose source file is lost.

What extracts versus what doesn't. Born-digital PDFs (exported from Word, Pages, Google Docs, design tools, accounting software): text was always in the file as real text characters and extracts instantly with near-perfect fidelity. Special characters, accented letters, punctuation, and mathematical symbols all survive. Scanned PDFs (image-only sources): text doesn't exist as characters; the page is a picture of text. OCR runs first to recognise the text from page images, then extraction works on the recognised text. Accuracy on clean modern scans is 99%+; marginal-quality scans need proofreading.

Three output formats, three different jobs.

Plain text (.txt): a flat stream of words, paragraphs separated by double newlines, no formatting whatsoever. The right pick when you're piping into a script, building a search index, feeding into an AI model, or copy-pasting into an editor that'll re-format anyway. UTF-8 encoded by default, so it opens in any modern tool.
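If you are piping the .txt output into a script, the double-newline convention described above is enough to recover paragraphs. A minimal sketch, assuming the file follows that convention; the function names are illustrative, not part of PDFRun:

```python
from pathlib import Path


def split_paragraphs(text: str) -> list[str]:
    """Split extracted plain text into paragraphs on blank lines."""
    # Normalise Windows line endings so "\r\n\r\n" also counts as a break.
    normalised = text.replace("\r\n", "\n")
    blocks = normalised.split("\n\n")
    # Collapse line wraps inside each paragraph and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def load_paragraphs(path: str) -> list[str]:
    # The output is UTF-8, so decode explicitly instead of relying on the platform default.
    return split_paragraphs(Path(path).read_text(encoding="utf-8"))
```

Decoding explicitly as UTF-8 avoids mojibake on systems whose default encoding is something else.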

Markdown (.md): text with structural cues preserved. Headings become `#`/`##`/`###`. Bullet lists become `-` items. Numbered lists become `1.` items. Bold and italic emphasis survive via `**` and `_`. Links carry as `[text](url)`. The right pick when you're moving content into a note-taking app (Obsidian, Notion, Bear), a static-site generator (Hugo, Jekyll), or any tool that speaks Markdown natively. Also the right pick for AI chats that handle Markdown better than plain text.
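Because headings arrive as `#`-prefixed lines, a script can cut the Markdown output into sections without a full Markdown parser. The sketch below is one simple way to do that; `split_sections` is a hypothetical helper, and it does not special-case `#` lines inside fenced code blocks:

```python
import re

# An ATX heading: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown: str) -> list[dict]:
    """Group Markdown lines under the nearest preceding heading."""
    # Section 0 collects any text that appears before the first heading.
    sections = [{"level": 0, "title": "", "body": []}]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append({"level": len(match.group(1)), "title": match.group(2), "body": []})
        else:
            sections[-1]["body"].append(line)
    result = []
    for section in sections:
        body = "\n".join(section["body"]).strip()
        # Skip the preamble slot when the document starts with a heading.
        if section["title"] or body:
            result.append({"level": section["level"], "title": section["title"], "body": body})
    return result
```

Each returned section keeps its heading level, which is handy when importing into a note-taking app or a static-site generator that nests content by depth.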

JSON: structured output with text grouped by page, with metadata (page number, source positions, detected language) attached. The right pick when you're processing the content programmatically and need to know which page each fragment came from. That matters for citation tools, document search systems, and any downstream system where source attribution matters.
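To show why page-level grouping helps, here is a small lookup that reports which pages mention a phrase. The exact JSON schema is not documented here, so the shape `{"pages": [{"page": 1, "text": "..."}]}` and its field names are assumptions made for illustration only; adjust them to match the file you actually download.

```python
import json


def find_pages(extraction_json: str, phrase: str) -> list[int]:
    """Return the page numbers whose text contains the phrase, ignoring case."""
    data = json.loads(extraction_json)
    needle = phrase.casefold()
    # ASSUMED schema: {"pages": [{"page": <int>, "text": <str>}, ...]}
    return [p["page"] for p in data["pages"] if needle in p["text"].casefold()]
```

With page numbers in hand, a citation tool can quote a passage and point the reader to its source page.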

Reading-order detection. PDFs don't store text in reading order; they store positioned glyphs. "Hello" at coordinates (120, 440) and "world" at (180, 440) appear as separate fragments to the parser, which has to infer that they're part of one sentence. For most documents this works fine, but multi-column layouts are where naive extractors fail: they read across columns rather than down them, interleaving content from two columns into nonsense paragraphs. Our extractor analyses page layout first to detect columns and follows correct reading order within each column before moving to the next. Newsletter-style two-column documents come out as readable paragraphs, not jumbled fragments.
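The column-first idea can be sketched in a few lines. This is not PDFRun's actual algorithm, only a simplified illustration: it finds vertical gutters by merging the horizontal extents of all fragments, then reads each column top to bottom. The `Fragment` type, `min_gutter`, and `line_tol` are hypothetical names, and the sketch assumes y grows upward, as in PDF page coordinates.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; assumed to grow upward, as in PDF page coordinates
    text: str


def column_bounds(fragments, min_gutter=15.0):
    """Merge horizontal extents; a gap of min_gutter or more starts a new column."""
    bounds = []
    for frag in sorted(fragments, key=lambda f: f.x0):
        if bounds and frag.x0 - bounds[-1][1] < min_gutter:
            bounds[-1][1] = max(bounds[-1][1], frag.x1)
        else:
            bounds.append([frag.x0, frag.x1])
    return bounds


def reading_order(fragments, min_gutter=15.0, line_tol=2.0):
    """Return text lines column by column, top to bottom, left to right."""
    lines = []
    for left, right in column_bounds(fragments, min_gutter):
        column = [f for f in fragments if left <= f.x0 <= right]
        # Bucket baselines into bands of height line_tol so nearby fragments share a line.
        column.sort(key=lambda f: (-round(f.y / line_tol), f.x0))
        current = None
        for frag in column:
            bucket = round(frag.y / line_tol)
            if bucket != current:
                lines.append([])
                current = bucket
            lines[-1].append(frag.text)
    return [" ".join(words) for words in lines]
```

A naive extractor that sorts only by y and then x would emit the first line of the left column followed by the first line of the right column. Splitting on gutters first keeps each column intact. Real pages add complications this sketch ignores, such as full-width headings above a two-column body.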

What to expect on edge cases. Footnotes and endnotes: extract inline with the surrounding text, which can interrupt flow. For critical academic work, accept that you'll need to restructure footnote handling manually. Headers and footers: included by default; can be filtered out via a hint. Page numbers: included as text fragments where the source PDF placed them; usually appear at the bottom of each page's extracted content. Tables: text within tables extracts as flat sequences of cell values, separated by spaces. To preserve table structure, use PDF to Excel instead. Mathematical content: equations represented as embedded images don't extract as math; equations using inline Unicode characters extract as the characters with no semantic interpretation.
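If you prefer to clean headers, footers, and page numbers yourself after extraction, one common heuristic is to drop lines that recur on most pages. The sketch below assumes you already have one string per page; `strip_repeated_lines` and `min_share` are hypothetical names, and this is a post-processing idea rather than a description of PDFRun's own filter.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Replace digit runs so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on at least min_share of pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({_key(line) for line in page.splitlines() if line.strip()})
    # Never treat a line as repeated unless it shows up on at least two pages.
    threshold = max(2, round(min_share * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if _key(line) not in repeated)
        for page in pages
    ]
```

The digit normalisation can also catch body lines that differ only by a number, so check the result on documents with many numbered captions.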

For AI workflows specifically. Plain text into ChatGPT or Claude works, but Markdown produces noticeably better results because the model uses structural cues (headings, lists) to understand document hierarchy. For RAG (retrieval-augmented generation) pipelines, JSON output with page-level chunking is the most usable starting point because you can trace LLM responses back to source pages for citation.
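For a RAG pipeline, the useful property is that every chunk carries its page number. A minimal sketch of page-level chunking, assuming each page is available as a dict with `page` and `text` keys (an assumed shape, not a documented schema); `chunk_pages`, `max_words`, and `overlap` are illustrative names:

```python
def chunk_pages(pages: list[dict], max_words: int = 200, overlap: int = 20) -> list[dict]:
    """Split each page's text into overlapping word windows tagged with the page number."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page in pages:
        words = page["text"].split()
        for start in range(0, len(words), step):
            chunks.append({"page": page["page"], "text": " ".join(words[start:start + max_words])})
            # Stop once the window has reached the end of the page.
            if start + max_words >= len(words):
                break
    return chunks
```

Chunks never cross a page boundary here, which keeps citations unambiguous at the cost of occasionally splitting a sentence that runs over two pages.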

How it works

1. Open the extract tool
Tap the orange button above. Output defaults to plain .txt; switch to Markdown or JSON in the options if you need structured output.

2. Upload your PDF
Drop the file in. Born-digital and scanned PDFs both work; OCR runs automatically when needed.

3. Choose output format
Plain text for scripts and search indexes. Markdown for note-taking apps and AI chats. JSON for programmatic processing with page-level structure.

4. Run the extraction
Born-digital PDFs extract in 1–3 seconds. Scanned PDFs take 1–3 seconds per page because OCR runs first.

5. Download and use
Save the .txt, .md, or .json file. The output is UTF-8 encoded and readable on every modern device. Files auto-delete from our servers within one hour.

Who this is for

Real-world uses

Researchers

Quote and cite text from papers without retyping; build literature-review databases from PDF collections.

Translators

Get clean source text for CAT tools and translation-memory matching, without re-keying from screen.

Developers

Pipe PDF content into scripts, search indexes, or RAG (retrieval-augmented generation) systems.

AI users

Paste cleaned PDF text into ChatGPT, Claude, or Gemini for question-answering, summarisation, and analysis.

Content teams

Migrate legacy PDF content into modern CMS systems that accept Markdown or HTML, not PDF.

Compliance teams

Build keyword-search indexes across regulatory documents for fast retrieval during audits.

FAQ

Common questions

Will column order be preserved?

Yes — we detect column layouts and follow correct reading order within each column before moving to the next. Two-column newsletters and similar layouts come out as logical paragraphs, not interleaved gibberish.

What about scanned PDFs?

OCR runs automatically when the source PDF has no embedded text layer. Pick the source language for best accuracy: English, French, Spanish, German, Arabic, Chinese, and 100+ others are supported.

TXT, Markdown, or JSON — which should I pick?

TXT for scripts and search indexes (flat words, no structure). Markdown for note-taking apps, AI chats, static-site generators (structure preserved via standard syntax). JSON for programmatic processing with page-level structure and metadata.

What encoding does the output use?

UTF-8, the modern standard, which supports every language and character. If your downstream tool only reads Latin-1 or Windows-1252, re-save the file in that encoding from your text editor.
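If you would rather convert in a script than in an editor, Python's built-in codecs can do it. A minimal sketch; the function names are illustrative, and characters the legacy encoding cannot represent are replaced with `?`, so the conversion can lose information:

```python
from pathlib import Path


def to_legacy_bytes(text: str, encoding: str = "cp1252") -> bytes:
    """Encode text for a legacy tool; unrepresentable characters become '?'."""
    return text.encode(encoding, errors="replace")


def reencode_file(src: str, dst: str, encoding: str = "cp1252") -> None:
    """Read a UTF-8 file and write it back out in a legacy encoding."""
    text = Path(src).read_text(encoding="utf-8")
    # write_bytes avoids any second round of newline or encoding translation.
    Path(dst).write_bytes(to_legacy_bytes(text, encoding))
```

Use `"cp1252"` for Windows-1252 and `"latin-1"` for ISO 8859-1; keep the original UTF-8 file, since the replacement step is not reversible.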

Will tables come through readably?

Table content extracts as flat sequences of cell values, separated by spaces. Table structure (which value belongs to which row/column) is lost. To preserve table structure, use PDF to Excel instead.

Can I extract from specific pages only?

Use Extract Pages first to pull the pages you want into a smaller PDF, then run text extraction on that. Two-step but reliable.

Will footnotes appear in the right place?

Footnotes typically extract inline with surrounding text, which can interrupt flow. For academic use with critical footnote handling, expect to manually restructure footnotes after extraction.
