How to Install and Use Tesseract on Linux
Tesseract is one of the most powerful and widely used Optical Character Recognition (OCR) engines. It can extract text from images, making it highly useful for automating the processing of scanned documents, reading text from screenshots, or even processing image-based PDFs. This guide will show you how to install and use Tesseract on a Linux system, specifically focusing on Ubuntu, but the steps can be adapted for other Linux distributions.
1. Why Use Tesseract?
Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, that supports a wide range of languages and scripts. It is flexible, efficient, and easy to integrate into various workflows. Some common use cases include:
Extracting text from scanned documents or images.
Automating data entry from image-based files.
Processing screenshots for text analysis.
2. Installing Tesseract on Linux
Tesseract is available in most Linux distributions’ package repositories, making it easy to install. Below are the installation steps for different Linux distributions.
Installation on Ubuntu/Debian
To install Tesseract on Ubuntu or Debian, follow these steps:
Update the Package List
Begin by updating your system’s package list so that you install the latest version of Tesseract available in your repositories:
sudo apt update
Install Tesseract
After updating the package list, you can install Tesseract using the following command:
sudo apt install tesseract-ocr
Install Language Packs (Optional)
Tesseract supports various languages. You can install additional language packs as needed. For example, to install the English and Spanish language packs:
sudo apt install tesseract-ocr-eng tesseract-ocr-spa
Verify Installation
To verify that Tesseract was installed correctly, run the following command:
tesseract --version
You should see the version information for Tesseract, confirming that it is installed on your system.
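The check above can also be wrapped in a small shell function that additionally lists the installed language data. This is only a sketch: the function name check_tesseract is illustrative and not part of Tesseract, while --version and --list-langs are standard Tesseract options.

```shell
# Report whether the tesseract binary is on PATH and, if it is, print its
# version line followed by the installed language data.
# Usage: check_tesseract
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # Older releases print the version to stderr, so merge both streams.
    tesseract --version 2>&1 | head -n 1
    tesseract --list-langs
}
```

Calling check_tesseract prints an error and returns a non-zero status if the binary is missing, which makes it usable as a guard at the top of a longer script.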
Installation on Fedora/CentOS
For Fedora or CentOS users, use the following commands to install Tesseract:
Install Tesseract on Fedora:
sudo dnf install tesseract
Install Tesseract on CentOS:
Enable the EPEL repository first:
sudo yum install epel-release
sudo yum install tesseract
Installation on Arch Linux
For Arch Linux users, Tesseract can be installed directly from the official repositories:
sudo pacman -S tesseract
3. Using Tesseract for OCR
Once Tesseract is installed, you can use it to extract text from images. Here’s a basic workflow for using Tesseract on Linux.
Step 1: Prepare Your Image
Ensure that your input image is clear and has readable text. Tesseract works best on images with good contrast and little noise. Commonly supported image formats include PNG, JPEG, and TIFF.
Step 2: Perform OCR on an Image
To extract text from an image, use the following syntax:
tesseract image.png output
image.png: The input image file containing the text you want to extract.
output: The base name of the output file. Tesseract appends the .txt extension itself, so the extracted text is saved in output.txt.
English is the default language, so no language option is needed for English text.
Step 3: Specify a Language
If your image contains text in a language other than English, specify the language with the -l option. For example, to process a Spanish text image, use the following command:
tesseract image.png output -l spa
Make sure you have the appropriate language pack installed for the language you want to process.
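Several languages can be combined in one run by joining their codes with a plus sign, for example -l eng+spa. The sketch below checks that each requested language is installed before running OCR. The function names has_lang and ocr_with_langs are illustrative, not part of Tesseract.

```shell
# Succeed if the given language code appears in the list of installed
# languages reported by tesseract.
has_lang() {
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# OCR an image with one or more languages ("spa" or "eng+spa"), refusing
# to run if the data for any of them is missing.
# Usage: ocr_with_langs IMAGE OUTPUT_BASE LANGS
ocr_with_langs() (
    image=$1 base=$2 langs=$3
    for lang in $(printf '%s' "$langs" | tr '+' ' '); do
        if ! has_lang "$lang"; then
            echo "language data not installed: $lang" >&2
            exit 1
        fi
    done
    tesseract "$image" "$base" -l "$langs"
)
```

For example, ocr_with_langs image.png output eng+spa writes output.txt, or stops with a message naming the missing language pack.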
Step 4: View the Output
After running the command, open the output file to view the extracted text:
cat output.txt
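If you only want to look at the text or pipe it into another command, Tesseract can write to standard output when the output base is given as stdout, so no intermediate file is needed. A minimal sketch (the function name ocr_word_count is only illustrative):

```shell
# Send the recognized text to standard output instead of a file and
# count the words in it.
# Usage: ocr_word_count IMAGE
ocr_word_count() {
    tesseract "$1" stdout | wc -w
}
```

The same pattern works with other filters, such as grep to search the recognized text for a keyword.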
Step 5: Extract Text from a PDF (Optional)
Tesseract cannot read PDFs directly. To OCR a PDF, first convert each page to an image with a tool such as pdftoppm (part of the poppler-utils package), then run Tesseract on those images:
Convert PDF to Images:
pdftoppm -png input.pdf output
Perform OCR on Each Image:
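Use Tesseract to extract text from each generated image. pdftoppm names the images output-1.png, output-2.png, and so on, and zero-pads the page number when the PDF has ten or more pages (output-01.png, output-02.png). For example, for a PDF with ten or more pages, the following commands write the text to output-text-01.txt and output-text-02.txt:
tesseract output-01.png output-text-01
tesseract output-02.png output-text-02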
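For longer PDFs, typing one command per page gets tedious. The following sketch converts the PDF and then loops over whatever page images were produced. The function name ocr_pdf, the default prefix page, and the 300 DPI resolution are assumptions chosen for illustration.

```shell
# Convert every page of a PDF to a PNG, OCR each page, and join the
# per-page text into one file named PREFIX.txt.
# Usage: ocr_pdf INPUT.pdf [PREFIX]
ocr_pdf() (
    pdf=$1
    prefix=${2:-page}
    # -r sets the rendering resolution in DPI.
    pdftoppm -png -r 300 "$pdf" "$prefix" || exit 1
    # A glob avoids guessing how pdftoppm padded the page numbers.
    for img in "$prefix"-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.png}"
    done
    # Zero-padded names sort in page order.
    cat "$prefix"-*.txt > "$prefix".txt
)
```

Running ocr_pdf input.pdf scan leaves one scan-N.txt file per page plus a combined scan.txt. Stale scan-N.txt files from an earlier run in the same directory would also be picked up, so a clean working directory is the safer choice.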
4. Advanced Tesseract Features
Page Layout Analysis
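Tesseract can analyze complex page layouts with multiple columns, tables, or non-linear text. The --psm option selects the page segmentation mode. The default, mode 3, performs fully automatic page segmentation and is usually the right starting point for multi-column pages, while mode 6 treats the image as a single uniform block of text, which suits a cropped paragraph or a simple single-column page. For example, to process an image as a single block of text, use the following command:
tesseract image.png output --psm 6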
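Which segmentation mode works best depends on the image, so one practical approach is to run the same image through a few modes and compare the results. A sketch, with compare_psm as an illustrative function name:

```shell
# Run one image through several page segmentation modes, writing each
# result to its own file (for example scan-psm3.txt, scan-psm6.txt).
# Usage: compare_psm IMAGE MODE [MODE...]
compare_psm() (
    image=$1
    shift
    for mode in "$@"; do
        tesseract "$image" "${image%.*}-psm$mode" --psm "$mode"
    done
)
```

For example, compare_psm scan.png 3 4 6 produces three text files to inspect side by side. In Tesseract 4 and later, tesseract --help-psm prints the full list of modes.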
Output Formats
By default, Tesseract outputs plain text, but it also supports other formats such as searchable PDFs and hOCR (HTML-based OCR data). To create a searchable PDF, use:
tesseract image.png output pdf
To output hOCR, use:
tesseract image.png output hocr
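The format names at the end of the command are config files, and the command line accepts more than one, so a single run can produce several outputs at once. The sketch below assumes the standard txt, pdf, and hocr configs shipped with Tesseract are installed. The function name ocr_all_formats is illustrative.

```shell
# Produce plain text, a searchable PDF, and hOCR from one OCR pass.
# Writes OUTPUT_BASE.txt, OUTPUT_BASE.pdf, and OUTPUT_BASE.hocr.
# Usage: ocr_all_formats IMAGE OUTPUT_BASE
ocr_all_formats() {
    tesseract "$1" "$2" txt pdf hocr
}
```

This avoids recognizing the same image three times when you need both a searchable PDF for reading and plain text for further processing.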
5. Automating Tesseract with Scripts
You can automate Tesseract in your workflow by creating a simple Bash script. For instance, the following script processes all .png images in a directory and extracts the text into corresponding .txt files:
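#!/bin/bash
# OCR every PNG in the current directory.
for img in *.png; do
    # Tesseract appends .txt to the output base, so strip only .png here.
    tesseract "$img" "${img%.png}"
done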
This script loops through all PNG images and runs Tesseract, saving the output as a text file with the same base name as the image.
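For larger batches it can help to keep results in a separate directory and skip images that were already processed, so an interrupted run can be resumed. The following is one possible sketch. The function name batch_ocr and the default output directory ocr-output are assumptions made for illustration.

```shell
# OCR every PNG in a source directory, writing text files to a separate
# output directory and skipping images that already have a result.
# Usage: batch_ocr [SOURCE_DIR] [OUTPUT_DIR]
batch_ocr() (
    src=${1:-.}
    out=${2:-ocr-output}
    mkdir -p "$out" || exit 1
    for img in "$src"/*.png; do
        [ -e "$img" ] || continue      # no PNGs: the glob stays literal
        name=$(basename "$img" .png)
        if [ -f "$out/$name.txt" ]; then
            echo "skipping $img (already processed)"
            continue
        fi
        # Tesseract adds the .txt extension to the output base itself.
        tesseract "$img" "$out/$name" || echo "failed: $img" >&2
    done
)
```

Calling batch_ocr scans results processes scans/*.png into results/, and a second call with the same arguments only handles images added since the first run.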
6. Conclusion
Tesseract is a versatile and powerful tool for extracting text from images on Linux. Whether you’re processing scanned documents, PDFs, or screenshots, Tesseract can handle a wide range of tasks with ease. By following this guide, you’ve learned how to install and use Tesseract on Linux, and you can now incorporate it into your own workflows.
Make sure to explore additional options and advanced features, such as language packs and custom page layout analysis, to get the most out of this OCR engine. For more information, you can refer to Tesseract’s official documentation on GitHub.