How to OCR PDF Files on Linux Using OCRmyPDF
Optical Character Recognition (OCR) is a powerful technique for converting scanned documents or image-based PDFs into searchable and editable text. On Linux, one of the best tools for this task is OCRmyPDF. This open-source software allows you to add a text layer to PDF files, making them searchable while maintaining the original formatting. In this guide, we’ll walk you through installing and using OCRmyPDF to OCR PDF files on Linux.
Why Use OCRmyPDF?
Searchable PDFs: Make scanned PDFs searchable and accessible.
Preserves Formatting: Keeps the original layout intact while adding a text layer.
Multi-Language Support: Works with any language that has a Tesseract language pack. OCRmyPDF does not detect the language on its own; you select it with the -l option.
Open Source and Free: OCRmyPDF is completely free and open-source, making it ideal for Linux users.
Prerequisites
Before you begin, ensure you have the following:
A Linux distribution (e.g., Ubuntu, Fedora, etc.).
Administrator (sudo) access to install packages.
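Python 3 installed on your system. Recent OCRmyPDF releases require a newer Python than 3.6, so check the documentation for the release you plan to install.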
Step 1: Install OCRmyPDF
OCRmyPDF can be installed through various methods depending on your Linux distribution.
Option 1: Installing via Package Manager (for Ubuntu/Debian)
For Ubuntu or Debian-based systems, OCRmyPDF is available in the official repositories. Install it by running the following commands in your terminal:
sudo apt update
sudo apt install ocrmypdf
Option 2: Installing via Pip (for All Linux Distros)
If OCRmyPDF is not available in your package manager or you want the latest version, you can install it via Python’s pip:
sudo apt install python3-pip
pip3 install ocrmypdf
The pip package does not pull in non-Python dependencies. Make sure that tesseract-ocr (the OCR engine) and Ghostscript are installed. You can install them by running:
sudo apt install tesseract-ocr ghostscript
Tesseract supports multiple languages, and you can install language packs as needed:
sudo apt install tesseract-ocr-[language-code]
For example, to install English and Spanish language packs, run:
sudo apt install tesseract-ocr-eng tesseract-ocr-spa
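Before moving on, it can help to confirm that the tools are actually on your PATH. The snippet below is a minimal sketch; check_tools is a hypothetical helper name, and gs is assumed to be the Ghostscript executable name on your distribution.

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    local missing tool
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_tools ocrmypdf tesseract gs prints one line per tool. Running tesseract --list-langs afterwards shows which language packs are installed.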
Step 2: OCR a PDF File Using OCRmyPDF
Once OCRmyPDF is installed, using it is straightforward. The basic syntax is:
ocrmypdf input.pdf output.pdf
Here’s how it works:
input.pdf: The PDF file you want to process.
output.pdf: The searchable PDF that OCRmyPDF writes. The input file is left unchanged.
For example:
ocrmypdf document_scanned.pdf document_searchable.pdf
This command will take the scanned PDF file document_scanned.pdf and output a searchable PDF named document_searchable.pdf.
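To check that the output really contains a text layer, one option is to try extracting text from it. The sketch below assumes the pdftotext utility (usually shipped in the poppler-utils package) is installed; has_text_layer is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if pdftotext can extract any
# non-whitespace text from the given PDF.
has_text_layer() {
    # "-" sends the extracted text to standard output instead of a file.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, has_text_layer document_searchable.pdf && echo "text layer present" prints the message only when some text could be extracted.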
Step 3: Advanced Usage of OCRmyPDF
OCRmyPDF provides several options to control how the OCR is performed. Some of the most useful options include:
1. Specify OCR Language
By default, OCRmyPDF uses English for text recognition. To OCR PDFs in a different language, use the -l option followed by the language code. For example, to process a PDF in French:
ocrmypdf -l fra input.pdf output.pdf
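For documents that mix languages, the codes can be joined with a plus sign, for example -l eng+fra. Every code you pass needs its Tesseract language pack installed. The sketch below checks that up front by reading the output of tesseract --list-langs; langs_installed is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if every language code in a
# spec such as "eng+fra" appears in the output of tesseract --list-langs.
langs_installed() {
    local available lang
    # Older Tesseract versions print the list on stderr, so capture both streams.
    available=$(tesseract --list-langs 2>&1)
    # Split the spec on "+" and look for each code as a whole line.
    for lang in $(printf '%s' "$1" | tr '+' ' '); do
        if ! printf '%s\n' "$available" | grep -qxF -- "$lang"; then
            echo "missing language pack: $lang" >&2
            return 1
        fi
    done
}
```

A typical use would be langs_installed eng+fra && ocrmypdf -l eng+fra input.pdf output.pdf.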
2. Optimize Output File Size
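OCRmyPDF can compress the output PDF to save disk space. Use the --optimize flag followed by a level from 0 to 3 to control how hard it tries:
ocrmypdf --optimize 3 input.pdf output.pdf
Level 3 is the most aggressive setting. It uses lossy image compression, so check the result if image quality matters; the default, level 1, is lossless. Levels 2 and 3 may also require the pngquant tool to be installed.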
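How much space optimization saves depends heavily on the scan, so it is worth measuring on your own files. The sketch below compares the size of two files in bytes; report_savings is a hypothetical helper name, not part of OCRmyPDF.

```shell
# Hypothetical helper: print the size of two files and the
# percentage saved going from the first to the second.
report_savings() {
    local before after
    # The arithmetic expansion strips any padding some wc versions add.
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "$before" -eq 0 ]; then
        echo "first file is empty, nothing to compare" >&2
        return 1
    fi
    echo "before: $before bytes, after: $after bytes, saved: $(( (before - after) * 100 / before ))%"
}
```

Running report_savings input.pdf output.pdf after an OCR pass shows whether a higher optimization level is worth the extra processing time. A negative percentage means the output grew, which can happen because OCR adds a text layer.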
3. Handle Password-Protected PDFs
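OCRmyPDF does not accept a password option and stops with an error when the input PDF is encrypted. Remove the encryption first with a tool such as qpdf, then run OCRmyPDF on the decrypted copy:
qpdf --decrypt --password=mypassword input.pdf decrypted.pdf
ocrmypdf decrypted.pdf output.pdf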
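If OCRmyPDF reports that the input is encrypted, one workaround is to remove the encryption with qpdf before running OCR. The sketch below wraps both steps and cleans up the decrypted temporary copy afterwards. It assumes qpdf is installed; ocr_encrypted is a hypothetical helper name.

```shell
# Hypothetical helper: decrypt a PDF into a temporary directory with qpdf,
# OCR the decrypted copy, then delete the temporary directory.
# Usage: ocr_encrypted PASSWORD INPUT.pdf OUTPUT.pdf
ocr_encrypted() {
    local password input output tmpdir status
    password=$1
    input=$2
    output=$3
    tmpdir=$(mktemp -d) || return 1
    if qpdf --decrypt --password="$password" "$input" "$tmpdir/decrypted.pdf"; then
        ocrmypdf "$tmpdir/decrypted.pdf" "$output"
        status=$?
    else
        # Any non-zero qpdf status is treated as a failure here.
        echo "qpdf could not decrypt $input" >&2
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

Keep in mind that a password given on the command line is visible to other users in the process list while the command runs, and it may end up in your shell history.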
4. Skip Text Already Present in PDF
If some pages of your PDF already contain text, OCRmyPDF stops with an error by default rather than overwrite it. Use the --skip-text option to leave those pages as they are and OCR only the pages without text:
ocrmypdf --skip-text input.pdf output.pdf
Step 4: Automating OCR Tasks
If you frequently deal with scanned PDFs, you can automate the OCR process using a script or cron job. For example, you can set up a cron job that automatically processes PDFs in a specific folder every day.
Create a simple bash script ocr_all_pdfs.sh:
#!/bin/bash
for file in /path/to/pdf/folder/*.pdf; do
    ocrmypdf "$file" "/path/to/output/folder/$(basename "$file" .pdf)_searchable.pdf"
done
Then, schedule it with cron to run daily:
crontab -e
Add the following line to run the script at midnight:
0 0 * * * /path/to/ocr_all_pdfs.sh
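Because a cron job runs unattended and repeats every day, it helps if the script skips files it has already processed and reports failures instead of silently moving on. The following is a more defensive sketch of the same loop; ocr_folder is a hypothetical function name, and the folder paths are placeholders as in the script above.

```shell
# Hypothetical helper: OCR every PDF in one folder into another,
# skipping files whose output already exists.
# Returns non-zero if any file failed.
ocr_folder() {
    local in_dir out_dir file out failed
    in_dir=$1
    out_dir=$2
    failed=0
    mkdir -p "$out_dir" || return 1
    for file in "$in_dir"/*.pdf; do
        # If the folder has no PDFs, the pattern stays literal; skip it.
        [ -e "$file" ] || continue
        out="$out_dir/$(basename "$file" .pdf)_searchable.pdf"
        if [ -e "$out" ]; then
            echo "skip: $file"
            continue
        fi
        if ocrmypdf "$file" "$out"; then
            echo "ok: $file"
        else
            echo "failed: $file" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Inside ocr_all_pdfs.sh, the loop would then be replaced by a single call such as ocr_folder /path/to/pdf/folder /path/to/output/folder. Redirecting the script's output to a log file in the crontab entry keeps a record of each run.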
Step 5: Troubleshooting
Tesseract not found: Ensure that Tesseract is installed and on your PATH. Running tesseract --version should print a version number.
Slow performance: OCR on large PDFs can be slow. Scanning at a lower DPI reduces the work per page, at some cost to recognition accuracy on small text.
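When ocrmypdf fails inside a script, its exit status helps narrow down the cause. The sketch below maps a few of the return codes listed in the OCRmyPDF documentation to readable messages. describe_exit is a hypothetical helper name, and the numbers should be verified against the version you have installed.

```shell
# Hypothetical helper: translate an ocrmypdf exit status into a hint.
# Only a subset of the documented codes is covered; verify them
# against the documentation for your installed version.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        2) echo "input file could not be read as a PDF" ;;
        3) echo "missing dependency (for example Tesseract or Ghostscript)" ;;
        6) echo "file already contains text (consider --skip-text)" ;;
        8) echo "input PDF is encrypted" ;;
        *) echo "other error (exit status $1)" ;;
    esac
}
```

In a script this could follow the OCR call directly: ocrmypdf input.pdf output.pdf; describe_exit $?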
Conclusion
OCRmyPDF is a powerful tool that makes it easy to convert scanned PDFs into searchable documents, all from the Linux terminal. Whether you’re archiving documents or creating a searchable database of PDFs, OCRmyPDF ensures that you can efficiently process and organize your files.
For more advanced usage, you can check out the official OCRmyPDF documentation.