OCR: Tesseract OCR

From OnnoWiki

Tesseract OCR is free and open source OCR software available for Linux. Its development has been sponsored by Google, and it is maintained by many volunteers. It is one of the most comprehensive open source OCR suites available, and its accuracy can rival that of some paid, proprietary solutions. It provides command line tools as well as an API that you can integrate into your own programs. It can detect text in many languages with good accuracy, and it comes with a set of pre-trained data that can be used to identify and extract text. If you need a custom solution, you can also use your own trained data or get more models from third parties. Tesseract OCR comes with multiple detection engines; which of them are available depends on how Tesseract was installed.


To install Tesseract OCR in Ubuntu, use the command specified below:

$ sudo apt install tesseract-ocr

You can install it on other Linux distributions from the default repositories through the package manager. A universal AppImage file and more installation instructions are available in the official Tesseract documentation.

Tesseract OCR supports detecting English-language content by default. To enable additional languages, you may have to download more language packs. The installation instructions mentioned above explain how to install additional language packs. On Ubuntu, you can find language packages directly by running the command below:

$ apt-cache search tesseract-ocr-

The command above will output package names for different language packs. Just install them by running a command in the following format:

$ sudo apt install <language-package>

You can get a list of all installed language packs by running the command below:

$ tesseract --list-langs
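
The language list can also be checked from a script before running OCR. The following is a minimal sketch, not an official Tesseract feature: the function name has_lang is hypothetical, and it assumes that "tesseract --list-langs" prints one language code per line after a header line.

```shell
# has_lang CODE: succeed if the language pack CODE (e.g. "spa") is installed.
# has_lang is a hypothetical helper name, not part of Tesseract.
has_lang() {
    # Some older releases print the list on stderr, so merge both streams.
    # -x matches a whole line only; -F treats the code as a fixed string.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

For example, "has_lang spa || echo 'Spanish pack missing'" prints a warning when the Spanish pack is not installed. On Ubuntu the missing pack would then typically be installed with "sudo apt install tesseract-ocr-spa".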

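Once the main Tesseract OCR package and additional language packages have been installed, you can start extracting text from images and creating searchable PDF files. Note that Tesseract reads image files as input; PDF is available only as an output format. To extract text, use commands in the following formats: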


$ tesseract image.png output -l eng
$ tesseract image.png output -l eng+spa
$ tesseract image.png output -l eng pdf

The first command extracts text from the file “image.png” using the “eng” language pack and stores it in a file called “output.txt”; Tesseract appends the “.txt” extension to the output name automatically. The second command parses the image using multiple language packs. The third command creates a file called “output.pdf” with a text layer superimposed on the image.
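
The same command format can be applied to a whole directory of images. The sketch below is one possible way to do it, under stated assumptions: the function name ocr_dir is hypothetical, only files ending in ".png" are processed, and the language defaults to "eng" when none is given.

```shell
# ocr_dir DIR [LANGS]: run Tesseract on every PNG file in DIR.
# ocr_dir is a hypothetical helper name; LANGS defaults to "eng".
ocr_dir() {
    dir=$1
    langs=${2:-eng}
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when DIR contains no PNG files.
        [ -e "$img" ] || continue
        # Tesseract appends ".txt" itself, so pass the name without extension.
        tesseract "$img" "${img%.png}" -l "$langs"
    done
}
```

Calling "ocr_dir ./scans eng+spa" would write one text file next to each image, for example "page1.txt" for "page1.png". The directory name "./scans" is only an illustration.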

For more information on command line usage of Tesseract OCR, use the following two commands:

$ tesseract --help
$ man tesseract


