OCR in PHP: Read Text from Images with Tesseract
Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has all sorts of practical applications, from digitizing printed books and creating electronic records of receipts to number-plate recognition and even circumventing image-based CAPTCHAs.
Tesseract is an open source program for performing OCR. It runs on *nix systems, Mac OS X and Windows, and with the help of a library we can use it from PHP applications. This tutorial shows you how.
To keep things simple and consistent, we’ll use a Virtual Machine to run the application, which we’ll provision using Vagrant. This will take care of installing PHP and Nginx, though we’ll install Tesseract separately to demonstrate the process.
If you want to install Tesseract on your own existing Debian-based system, you can skip this next part. Alternatively, visit the README for installation instructions on other *nix systems, Mac OS X (hint: use MacPorts!) or Windows.
To set up Vagrant so that you can follow along with the tutorial, complete the following steps. Alternatively, you can simply grab the code from GitHub.
Enter the following command to download the Homestead Improved Vagrant configuration to a directory named ocr:
git clone https://github.com/Swader/homestead_improved ocr
We’re not going to be using Laravel, so update the Nginx site mapping in the Homestead configuration file. Change this:
sites:
    - map: homestead.app
      to: /home/vagrant/Code/Laravel/public
to this:
sites:
    - map: homestead.app
      to: /home/vagrant/Code/public
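If you would rather script that change than edit the file by hand, something along these lines should work. This is only a sketch: `remap_site` is a hypothetical helper name, and the path to the configuration file is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: point the site mapping at Code/public instead of
# Code/Laravel/public in the configuration file given as the first argument.
remap_site() {
    config_file=$1
    tmp_file=$(mktemp) || return 1

    # Write the edited copy to a temp file first, then copy it back, so the
    # original is only overwritten if sed succeeded.
    sed 's|/home/vagrant/Code/Laravel/public|/home/vagrant/Code/public|' \
        "$config_file" > "$tmp_file" && cat "$tmp_file" > "$config_file"
    status=$?

    rm -f "$tmp_file"
    return "$status"
}

# Usage (the path is a placeholder for wherever your configuration file lives):
#   remap_site path/to/config-file
```

Copying the temp file's contents back, rather than moving it over the original, keeps the original file's permissions intact.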
You’ll also need to add the following to your hosts file:
Installing the Tesseract Binary
The next step is to install the Tesseract binary.
Because Homestead Improved uses a Debian-based distribution of Linux, we can use apt-get to install it after logging into the VM with vagrant ssh. It’s as simple as running the following command:
sudo apt-get install tesseract-ocr
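Once the package is installed, it is worth checking that the binary is actually on your PATH. The snippet below is a minimal sketch of one way to do that; `have_cmd` and `ocr_to_stdout` are hypothetical helper names, and it assumes your Tesseract build accepts `--version` and the special `stdout` output base.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: OCR a single image and print the recognised text.
# Passing "stdout" as the output base asks tesseract to print the result
# instead of writing it to a .txt file.
ocr_to_stdout() {
    tesseract "$1" stdout
}

if have_cmd tesseract; then
    # Print the installed version as a quick sanity check.
    tesseract --version
else
    echo "tesseract not found on PATH; try the apt-get step again" >&2
fi
```

With that in place, running `ocr_to_stdout sample.png` (where sample.png stands in for any image containing text) should print whatever Tesseract manages to read.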
As I mentioned above, there are instructions for other operating systems in the README.