Steps to Repair and OCR a Scanned or Corrupted PDF in Ubuntu
1. Clean or Repair the PDF
Use Ghostscript to rebuild damaged cross-reference tables and fix malformed PDF structure.
sudo apt install ghostscript
gs -o fixed.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH "input.pdf"
What it does:
- Repairs broken references (xref errors)
- Normalizes streams and compression
- Outputs a clean, standards-compliant PDF (fixed.pdf)
If Ghostscript cannot fix the file, try qpdf:
sudo apt install qpdf
qpdf "input.pdf" fixed.pdf
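The two repair routes above can be chained so that qpdf is tried only when Ghostscript fails. The following is a minimal sketch, not a definitive script: `repair_pdf` is a hypothetical helper name, and it assumes qpdf's automatic recovery when reading a damaged file, where exit status 3 means the output was written with warnings.

```shell
# Sketch: try Ghostscript first, then fall back to qpdf.
# repair_pdf is a hypothetical helper name, not part of either tool.
repair_pdf() {
    rp_in=$1
    rp_out=$2

    # Ghostscript rewrite; require a non-empty output file as well as exit 0.
    if gs -o "$rp_out" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$rp_in" >/dev/null 2>&1 \
        && [ -s "$rp_out" ]; then
        echo "ghostscript"
        return 0
    fi

    # qpdf attempts recovery automatically when it reads a damaged file.
    # Exit status 3 means "succeeded with warnings", which is expected here.
    rp_status=0
    qpdf "$rp_in" "$rp_out" >/dev/null 2>&1 || rp_status=$?
    if { [ "$rp_status" -eq 0 ] || [ "$rp_status" -eq 3 ]; } && [ -s "$rp_out" ]; then
        echo "qpdf"
        return 0
    fi

    echo "repair failed for $rp_in" >&2
    return 1
}
```

Calling `repair_pdf input.pdf fixed.pdf` prints the name of the tool that produced fixed.pdf, or returns a non-zero status if neither succeeded.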
2. Run OCR on the Cleaned PDF
Use OCRmyPDF to embed searchable text into the PDF.
sudo apt install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-fil
ocrmypdf --jobs 4 --deskew --clean -l eng+fil fixed.pdf output_ocr.pdf
What it does:
- Performs OCR using Tesseract (English + Filipino)
- Deskews pages and cleans them before OCR (--clean requires the unpaper package)
- Embeds text layer for search and selection
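Before a long OCR run, it can help to confirm that the language data passed to `-l` is actually installed. This is a small sketch using `tesseract --list-langs`; `missing_langs` is a hypothetical helper name.

```shell
# Sketch: print any requested Tesseract language codes that are not installed.
# missing_langs is a hypothetical helper name.
missing_langs() {
    # Older Tesseract releases print the list on stderr, so capture both streams.
    ml_installed=$(tesseract --list-langs 2>&1)
    for ml_lang in "$@"; do
        # -x matches whole lines only, so the header line is never a match.
        printf '%s\n' "$ml_installed" | grep -qxF "$ml_lang" || printf '%s\n' "$ml_lang"
    done
}
```

Running `missing_langs eng fil` prints nothing when both languages are available. Any code it prints is missing, and on Ubuntu the matching package presumably follows the tesseract-ocr-CODE pattern used in the install command above.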
If OCRmyPDF fails on rendering, use an alternate renderer:
ocrmypdf --pdf-renderer sandwich fixed.pdf output_ocr.pdf
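If the PDF is still too broken, or its pages already carry an unusable text layer, force every page to be rasterized and OCRed (this discards any existing text):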
ocrmypdf --force-ocr fixed.pdf output_ocr.pdf
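The three OCRmyPDF invocations above form a natural fallback chain. The sketch below tries them in order and reports which one succeeded. `ocr_with_fallback` is a hypothetical helper name, and passing `-l` to the fallback attempts is an assumption made here so that every attempt uses the same languages.

```shell
# Sketch: try the default OCR run, then the sandwich renderer, then --force-ocr.
# ocr_with_fallback is a hypothetical helper name.
ocr_with_fallback() {
    of_in=$1
    of_out=$2
    of_langs=${3:-eng+fil}   # language list; defaults to the one used above

    # ocrmypdf output is sent to stderr so stdout carries only the mode name.
    if ocrmypdf --jobs 4 --deskew --clean -l "$of_langs" "$of_in" "$of_out" >&2; then
        echo "default"
    elif ocrmypdf --pdf-renderer sandwich -l "$of_langs" "$of_in" "$of_out" >&2; then
        echo "sandwich"
    elif ocrmypdf --force-ocr -l "$of_langs" "$of_in" "$of_out" >&2; then
        echo "force-ocr"
    else
        echo "all OCR attempts failed for $of_in" >&2
        return 1
    fi
}
```

For example, `ocr_with_fallback fixed.pdf output_ocr.pdf` prints `default`, `sandwich`, or `force-ocr` depending on which attempt produced the output file.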
3. Verify OCR Success
Check if text extraction works:
pdftotext output_ocr.pdf - | head
If readable text appears, the OCR succeeded.
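For batch use, the visual check can be replaced by a word count. This is a rough sketch: `has_text_layer` is a hypothetical helper name, and the default threshold of 20 words is an arbitrary assumption that should be tuned to the documents at hand.

```shell
# Sketch: succeed if pdftotext extracts at least a minimum number of words.
# has_text_layer is a hypothetical helper; the default of 20 words is arbitrary.
has_text_layer() {
    ht_min=${2:-20}
    # Count whitespace-separated words; strip padding that some wc builds add.
    ht_words=$(pdftotext "$1" - 2>/dev/null | wc -w | tr -d '[:space:]')
    [ "${ht_words:-0}" -ge "$ht_min" ]
}
```

A typical call is `has_text_layer output_ocr.pdf && echo "OCR text found"`. A word count only shows that some text was embedded, not that it was recognized accurately.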
✅ Summary Workflow
| Step | Tool | Command | Purpose |
|---|---|---|---|
| 1 | Ghostscript | gs -o fixed.pdf -sDEVICE=pdfwrite ... | Clean and repair corrupted PDF |
| 2 | QPDF | qpdf input.pdf fixed.pdf | Alternate PDF repair if GS fails |
| 3 | OCRmyPDF | ocrmypdf --jobs 4 --deskew --clean fixed.pdf output_ocr.pdf | Add searchable text layer |
| 4 | Verify | `pdftotext output_ocr.pdf - \| head` | Confirm OCR success |