Extract tables from a messy PDF into clean markdown
When to use: You have a PDF whose tables pdftotext mangles, and you don't want to retype them.
Prerequisites
- MCP installed: run uvx kreuzberg-mcp, or add it via claude mcp add
Steps
- Extract
  Use kreuzberg to extract /docs/2025-annual-report.pdf. Give me the tables as markdown and the body text separately.
  → Clean markdown tables with preserved headers
- Verify
  For the "Revenue by Segment" table, reconcile the column totals. Flag any OCR misreads.
  → Arithmetic check with flagged cells
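The arithmetic check in the Verify step can also be reproduced locally on the extracted markdown. The following is a minimal sketch, assuming the table is pipe-delimited, the first column holds row labels, and the totals row is labeled "Total". The function names and the sample figures are hypothetical and do not come from kreuzberg or from any real report.

```python
from decimal import Decimal


def split_row(line: str) -> list[str]:
    """Split one pipe-delimited markdown row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(md: str) -> tuple[list[str], list[list[str]]]:
    """Return (header, data_rows) for a simple markdown table."""
    lines = [ln for ln in md.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2
    rows = [split_row(ln) for ln in lines[2:]]
    return header, rows


def to_number(cell: str) -> Decimal:
    """Parse a numeric cell, ignoring thousands separators and a dollar sign."""
    return Decimal(cell.replace(",", "").replace("$", ""))


def reconcile_totals(md: str, total_label: str = "Total") -> dict[str, tuple[Decimal, Decimal]]:
    """Return {column: (computed_sum, reported_total)} for columns that do not add up."""
    header, rows = parse_markdown_table(md)
    body = [r for r in rows if r[0] != total_label]
    total = next(r for r in rows if r[0] == total_label)
    mismatches = {}
    for i, name in enumerate(header[1:], start=1):
        computed = sum((to_number(r[i]) for r in body), Decimal(0))
        reported = to_number(total[i])
        if computed != reported:
            mismatches[name] = (computed, reported)
    return mismatches


# Hypothetical sample: the Hardware/Q2 cell imitates a digit misread by OCR
SAMPLE = """
| Segment  | Q1  | Q2  |
|----------|-----|-----|
| Hardware | 120 | 126 |
| Software | 80  | 90  |
| Total    | 200 | 218 |
"""

print(reconcile_totals(SAMPLE))  # {'Q2': (Decimal('216'), Decimal('218'))}
```

A mismatch only says that some cell in the column (or the total itself) is wrong; it does not identify which one, so the flagged column still needs a look at the source page.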
Result: Markdown tables you can paste into a doc without rework.
Pitfalls
- Scanned PDF: OCR can mistake 6 for 8. Use the OCR confidence output and recheck low-confidence cells manually against the source.
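Triage of low-confidence cells might look like the sketch below. The source does not show what the OCR confidence output looks like, so the `OcrCell` record, its fields, the 0.0 to 1.0 confidence scale, and the 0.90 threshold are all assumptions made for illustration; adapt them to whatever the tool actually returns.

```python
from dataclasses import dataclass


@dataclass
class OcrCell:
    """Hypothetical per-cell OCR record; not a kreuzberg type."""
    row: int
    column: str
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def cells_to_review(cells: list[OcrCell], threshold: float = 0.90) -> list[OcrCell]:
    """Return cells below the threshold, least confident first."""
    flagged = [c for c in cells if c.confidence < threshold]
    return sorted(flagged, key=lambda c: c.confidence)


# Hypothetical sample data
SAMPLE_CELLS = [
    OcrCell(1, "Q1", "120", 0.99),
    OcrCell(1, "Q2", "126", 0.71),
    OcrCell(2, "Q1", "80", 0.88),
    OcrCell(2, "Q2", "90", 0.97),
]

for cell in cells_to_review(SAMPLE_CELLS):
    print(f"row {cell.row}, column {cell.column}: {cell.text!r} ({cell.confidence:.2f})")
```

Sorting by confidence puts the most doubtful cells at the top of the manual review list, which helps when a long table produces many flags.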