Optical character recognition or optical character reader (OCR) is the process of converting images of text into machine-encoded text. For example, you can take a picture of a book page and then run it through OCR software to extract the text.
In this blog post, we will use the Tesseract OCR library. Tesseract was originally developed at Hewlett-Packard between 1985 and 1994. HP open-sourced the software in 2005. The source code is hosted on GitHub: https://github.com/tesseract-ocr/tesseract
The Tesseract engine is written in C++ and does not run in a browser. To use the C++ engine from a web application, you therefore have to send the picture to a server, run it through Tesseract there, and send the text back.
For a few years now, however, a JavaScript port of the Tesseract C++ engine has existed that runs in a browser and does not depend on any server-side code. The library is called Tesseract.js, and you can find the source code on GitHub (https://github.com/naptha/tesseract.js). The engine was originally compiled to asm.js and has since been ported to WebAssembly.
The following examples use version 5 of the library.
Installation ¶
You can add Tesseract.js to your project by loading it from a CDN:
<script src='https://unpkg.com/tesseract.js@5.0.0/dist/tesseract.min.js'></script>
Or by installing it with npm:
npm install tesseract.js
Basic Usage ¶
The library provides the recognize() method, which takes an image as input and returns an object with the recognized text.
Here is a simple example.
const exampleImage = 'https://tesseract.projectnaptha.com/img/eng_bw.png';
// print additional internal information to the developer console
Tesseract.setLogging(true);
// legacyCore and legacyLang load the legacy engine and model that detect() needs
const workerPromise = Tesseract.createWorker("eng", 1, {
legacyCore: true,
legacyLang: true,
logger: m => console.log(m)
});
workerPromise.then(worker => work(worker));
async function work(worker) {
// detect orientation and script
let result = await worker.detect(exampleImage);
console.log(result.data);
// run the text recognition
result = await worker.recognize(exampleImage);
console.log(result.data);
// release the Web Worker
await worker.terminate();
}
Arguments ¶
The recognize() method expects an image as the first argument. The library supports images in the formats bmp, jpg, png, and pbm.
The image can be supplied to the method as:
- img, video, or canvas element
- File object (from a file <input>)
- Blob object
- path or URL to an image
- base64 encoded image
See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/image-format.md
Before you call the recognize() method, you have to create a worker with createWorker() and pass the languages as an argument. createWorker() returns a Promise.
Progress ¶
The OCR process runs for a few seconds. If you want to display progress information to the user, you can configure a progress listener by passing an options object as the third argument to the createWorker() call.
The logger property of this object expects a function that is called multiple times during the recognition process.
const workerPromise = Tesseract.createWorker("eng", null, {
logger: m => console.log(m)
});
The object that the logger function receives contains the worker id, the job id, and the properties status and progress. status is a string describing the current operation, and progress is a number between 0 and 1 that represents the completed fraction of that operation (multiply it by 100 to get a percentage).
Progress example:
{
"workerId": "Worker-0-eef9f",
"jobId": "Job-0-716d7",
"status": "recognizing text",
"progress": 1,
"userJobId": "Job-4-1d2eb"
}
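To show such a message to the user, the fraction has to be turned into a readable label. The following is a minimal sketch of one way to do that; formatProgress is a hypothetical helper name and not part of Tesseract.js.

```javascript
// Turns a logger message into a short label such as "recognizing text: 50%".
// formatProgress is a hypothetical helper, not part of Tesseract.js.
function formatProgress(message) {
  // progress is a fraction between 0 and 1; clamp defensively before scaling
  const fraction = Math.min(1, Math.max(0, Number(message.progress) || 0));
  const percent = Math.round(fraction * 100);
  return `${message.status}: ${percent}%`;
}

// Example with the status and progress values from the message shown above
console.log(formatProgress({ status: 'recognizing text', progress: 1 }));
// prints "recognizing text: 100%"
```

A function like this could be called from inside the logger callback, for example logger: m => console.log(formatProgress(m)).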
If you need more internal information, set the logging flag to true. The library then prints more information into the developer console:
Tesseract.setLogging(true);
Result ¶
The recognize() method returns a Promise. The object you get from a successful call contains the property data, which holds information about the recognized text. You can either access the text through the text property, which contains the recognized text as one string, or through the lines, paragraphs, words, or symbols properties. Each element of these groups contains a confidence score that tells you how confident the engine is. A score is a number between 0 and 100; higher values signify higher confidence.
{
text: "Mild Splendour of the various-vested Nig ..."
hocr: "<div class='ocr_page' id= ..."
tsv: "1 1 0 0 0 0 0 0 1486 ..."
box: null
unlv: null
osd: null
confidence: 90
blocks: [{...}]
psm: "SINGLE_BLOCK"
oem: "DEFAULT"
version: "4.0.0-825-g887c"
paragraphs: [{...}]
lines: (8) [{...}, ...]
words: (58) [{...}, {...}, ...]
symbols: (295) [{...}, {...}, ...]
}
When you access the text with the properties lines and paragraphs, you get the text grouped by lines and paragraphs. The words property contains an array with every recognized word, and symbols gives you access to each character.
Each element of these properties contains the property bbox, which represents the x/y coordinates of the bounding box. I use this information in the demo application to draw a rectangle around the selected text.
Here is an example of an element in the words array. The text property contains the word, and confidence tells us the confidence score. line and paragraph reference the line and paragraph object where this word is located. symbols is an array that holds each character individually (M, i, l, d).
{
symbols: Array(4)
0: {choices: Array(1), image: null, text: "M", confidence: 99.03752899169922, baseline: {...}, ...}
1: {choices: Array(1), image: null, text: "i", confidence: 98.83988952636719, baseline: {...}, ...}
2: {choices: Array(1), image: null, text: "l", confidence: 99.01886749267578, baseline: {...}, ...}
3: {choices: Array(1), image: null, text: "d", confidence: 99.0378646850586, baseline: {...}, ...}
choices: [{...}]
text: "Mild"
confidence: 91.87923431396484
baseline: {x0: 38, y0: 84, x1: 167, y1: 85, has_baseline: true}
bbox: {x0: 38, y0: 34, x1: 167, y1: 85}
is_numeric: false
in_dictionary: false
direction: "LEFT_TO_RIGHT"
language: "eng"
is_bold: false
is_italic: false
is_underlined: false
is_monospace: false
is_serif: false
is_smallcaps: false
font_size: 17
font_id: -1
font_name: ""
page: ...
block: ...
paragraph: {lines: Array(8), text: "Mild Splendour ...", confidence: 91.35659790039062, ...}
line: {words: Array(6), text: "Mild Splendour ...", confidence: 92.46450805664062, ...}
}
If you want to see the complete result object, visit the URL https://omed.hplar.ch/webocr/basic.html and open the developer console.
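The confidence scores can be used to flag words that a user should double-check. The following sketch assumes an object in the shape of the data property described above; the helper name lowConfidenceWords and the default threshold of 80 are illustrative choices, not values recommended by the library.

```javascript
// Returns the text of every word whose confidence is below `threshold`.
// `data` is expected to look like the `data` property of a recognize() result.
// The helper name and the default threshold are illustrative assumptions.
function lowConfidenceWords(data, threshold = 80) {
  return (data.words || [])
    .filter(word => word.confidence < threshold)
    .map(word => word.text);
}

// Hand-written sample in the shape of a result object (not real OCR output)
const sample = {
  words: [
    { text: 'Mild', confidence: 91.88 },
    { text: 'Nig', confidence: 42.5 }
  ]
};
console.log(lowConfidenceWords(sample)); // logs an array that contains only 'Nig'
```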
detect ¶
Another method that the Tesseract.js library provides is detect(). This method tries to detect the orientation and script.
Like recognize(), this method expects an image as the first argument and returns a Promise.
const result = await worker.detect(image);
The result object contains the property data, which holds the detected script, the orientation information, and the corresponding confidence scores.
{
"tesseract_script_id": 1,
"script": "Latin",
"script_confidence": 39.58333969116211,
"orientation_degrees": 0,
"orientation_confidence": 29.793731689453125
}
See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/api.md#worker-detect
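Because the detection comes with a confidence score, an application may want to ignore results the engine is unsure about. The sketch below shows one way to do this; reliableOrientation is a hypothetical helper, and the cut-off of 10 is an arbitrary value chosen for illustration, since the library does not prescribe one.

```javascript
// Returns the detected orientation in degrees, or null when the engine's
// confidence is below `minConfidence` or the fields are missing.
// reliableOrientation and the default cut-off are illustrative assumptions.
function reliableOrientation(detectData, minConfidence = 10) {
  const degrees = detectData.orientation_degrees;
  const confidence = detectData.orientation_confidence;
  if (degrees == null || confidence == null || confidence < minConfidence) {
    return null; // no trustworthy orientation information
  }
  return degrees;
}

// Values taken from the detect() result shown above
const detected = {
  orientation_degrees: 0,
  orientation_confidence: 29.793731689453125
};
console.log(reliableOrientation(detected));     // prints 0
console.log(reliableOrientation(detected, 50)); // prints null
```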
Cleanup ¶
The library runs the OCR engine in a Web Worker. If your application no longer needs the worker, you should terminate it with worker.terminate().
More examples ¶
Check out the documentation page for more code examples:
https://github.com/naptha/tesseract.js/blob/master/docs/examples.md
Demo Application ¶
This section shows how to incorporate the Tesseract.js library into an Angular/Ionic application.
You can find the source code for the complete application on GitHub:
https://github.com/ralscha/blog2019/blob/master/webocr
The demo application is hosted on my server, and you can access it with this URL: https://omed.hplar.ch/webocr/
The demo application does not depend on any server-side code, and the OCR runs locally in the web browser and does not send any data to a server.
The application is based on the Ionic blank starter template. First, I added the latest version of Tesseract.js with npm install tesseract.js.
In the TypeScript code, I imported the library with:
import {createWorker, RecognizeResult} from 'tesseract.js';
As input, the application uses an input tag of type file (type="file"). Each time the user selects a file, the method onFileChange is called, which extracts the File object from the input tag and passes it to the recognize() method.
<input #fileSelector (change)="onFileChange($event)" accept="image/*" style="display: none;"
type="file">
The selected picture is then also loaded into an Image object. However, the application does not use an img tag to display the picture. Instead, it draws the picture onto a canvas. I use a canvas here because the application draws a rectangle around the text whenever the user clicks on a recognized line, word, or symbol.
async onFileChange(event: Event): Promise<void> {
// eslint-disable-next-line @typescript-eslint/no-explicit-any
this.selectedFile = (event.target as any).files[0];
this.progressStatus = '';
this.progress = null;
this.result = null;
this.words = null;
this.symbols = null;
this.selectedLine = null;
this.selectedWord = null;
this.selectedSymbol = null;
this.image = new Image();
this.image.onload = () => this.drawImageScaled(this.image);
if (this.selectedFile) {
this.image.src = URL.createObjectURL(this.selectedFile);
}
/* download files from 3rd party server
const worker = await createWorker(this.language, 1, {
logger: progress => {
this.progressStatus = progress.status;
this.progress = progress.progress;
this.changeDetectionRef.markForCheck();
}
});
*/
const worker = await createWorker(this.language, 1, {
workerPath: 'tesseract5/worker.min.js',
corePath: 'tesseract5/',
// eslint-disable-next-line @typescript-eslint/no-explicit-any
logger: (progress: any) => {
this.progressStatus = progress.status;
this.progress = progress.progress;
this.changeDetectionRef.markForCheck();
}
});
try {
if (this.selectedFile) {
const recognizeResult = await worker.recognize(this.selectedFile);
this.result = recognizeResult ? recognizeResult.data : null;
}
// clear the progress display only after a successful run
this.progressStatus = null;
this.progress = null;
} catch (e) {
// keep the error message visible instead of resetting it right away
this.progressStatus = "" + e;
this.progress = null;
} finally {
// always release the Web Worker, even when recognition fails
await worker.terminate();
}
// reset file input
if (event.target) {
// eslint-disable-next-line @typescript-eslint/no-explicit-any
(event.target as any).value = null;
}
}
redrawImage(): void {
if (this.image) {
this.drawImageScaled(this.image);
}
}
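The drawImageScaled method and the ratio field it maintains are not shown in this post. As an assumption about how such a ratio could be computed, and not the demo's actual code, the following sketch calculates the factor that fits an image completely inside a canvas while keeping its aspect ratio; fitRatio is a hypothetical name.

```javascript
// Computes the factor by which an image must be scaled so that it fits
// entirely inside a canvas without changing its aspect ratio.
// fitRatio is a hypothetical helper, not taken from the demo application.
function fitRatio(imageWidth, imageHeight, canvasWidth, canvasHeight) {
  return Math.min(canvasWidth / imageWidth, canvasHeight / imageHeight);
}

// A 1000x500 image in a 400x400 canvas is limited by its width
console.log(fitRatio(1000, 500, 400, 400)); // prints 0.4
```

With such a ratio, the picture could be drawn with ctx.drawImage(image, 0, 0, image.width * ratio, image.height * ratio), and the same factor converts OCR coordinates into canvas coordinates.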
To display the result, I'm using the table component from the Angular Material project.
You can add Angular Material to any Angular project with the command ng add @angular/material.
Here is the template for the words list:
@if (words) {
<div class="table-container mat-elevation-z1 ion-margin-top">
<table [dataSource]="words" mat-table>
<ng-container matColumnDef="text">
<th *matHeaderCellDef mat-header-cell>Word</th>
<td *matCellDef="let word" mat-cell> {{word.text}} </td>
</ng-container>
<ng-container matColumnDef="confidence">
<th *matHeaderCellDef mat-header-cell>Confidence</th>
<td *matCellDef="let word" mat-cell> {{word.confidence | number:'1.2-2'}} %</td>
</ng-container>
<tr *matHeaderRowDef="elementColumns; sticky: true" mat-header-row></tr>
<tr (click)="onWordClick(word)" *matRowDef="let word; columns: elementColumns;"
[ngClass]="{'highlight': selectedWord === word}"
mat-row></tr>
</table>
</div>
}
When the user clicks on a table row, the application draws a rectangle around the selected text.
drawBBox(bbox: { x0: number, x1: number, y0: number, y1: number }): void {
if (bbox) {
this.redrawImage();
if (this.ratio === null) {
throw new Error('ratio not set');
}
this.ctx.beginPath();
this.ctx.moveTo(bbox.x0 * this.ratio, bbox.y0 * this.ratio);
this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y0 * this.ratio);
this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y1 * this.ratio);
this.ctx.lineTo(bbox.x0 * this.ratio, bbox.y1 * this.ratio);
this.ctx.closePath();
this.ctx.strokeStyle = '#bada55';
this.ctx.lineWidth = 2;
this.ctx.stroke();
}
}
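The coordinate conversion in this method can also be separated from the drawing. The sketch below, using the hypothetical helper name bboxToRect, converts a bbox from image coordinates into a rectangle in canvas coordinates, which makes the scaling easy to check on its own.

```javascript
// Converts a Tesseract.js bbox (image coordinates) into a rectangle in
// canvas coordinates. `ratio` is the factor the image was scaled by.
// bboxToRect is a hypothetical helper, not part of the demo application.
function bboxToRect(bbox, ratio) {
  return {
    x: bbox.x0 * ratio,
    y: bbox.y0 * ratio,
    width: (bbox.x1 - bbox.x0) * ratio,
    height: (bbox.y1 - bbox.y0) * ratio
  };
}

// bbox of the word "Mild" from the result example, image drawn at half size
const rect = bboxToRect({ x0: 38, y0: 34, x1: 167, y1: 85 }, 0.5);
console.log(rect); // x: 19, y: 17, width: 64.5, height: 25.5
```

The resulting rectangle could then be drawn with a single call to ctx.strokeRect(rect.x, rect.y, rect.width, rect.height) instead of four line segments.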
Self-Host ¶
The library downloads several files when your application creates the worker.
const worker = await createWorker("eng", 1);
By default, these files are not part of the packaged application, and Tesseract.js downloads them from 3rd party servers.
This is not ideal for an in-house application, where downloading files from a local server is faster. You may also want to be in complete control of all application resources and not depend on the availability of these 3rd party servers. If one of these servers is down, your application no longer works.
Whatever the reason, you can easily self-host the files. The Web Worker JavaScript and WebAssembly files are part of the tesseract.js and tesseract.js-core npm packages, so we only have to copy them from the node_modules directory to the build folder. In an Angular application, you can do this by adding the following two entries to the build->options->assets array.
{
"glob": "worker.min.js",
"input": "node_modules/tesseract.js/dist/",
"output": "./tesseract5"
},
{
"glob": "@(tesseract-core.wasm.js|tesseract-core-simd.wasm.js|tesseract-core-lstm.wasm.js|tesseract-core-simd-lstm.wasm.js)",
"input": "node_modules/tesseract.js-core/",
"output": "./tesseract5"
}
The Angular CLI takes care of copying these files to the build folder. In the application, we need to tell Tesseract.js that it must load these files from the local directory instead of fetching them from the 3rd party server.
const worker = await createWorker("eng", 1, {
workerPath: 'tesseract5/worker.min.js',
corePath: 'tesseract5/',
...
});
The machine learning models that the library downloads from tessdata.projectnaptha.com are not available as npm packages. There are different ways to self-host them. For example, if you only need to support a few languages, you could download the files from the Git repository (https://github.com/naptha/tessdata.git) and put them into the build folder during the build process with an npm script.
Be aware that these files are pretty big, and downloading them each time you build your application could take some time.
Instead of adding them to the project, you could download the files you need and copy them onto an HTTP server. If you need to support multiple languages, you could clone the whole repository.
git clone https://github.com/naptha/tessdata.git
Each language is available in three different versions: normal, fast, and best. The best version gives you higher OCR accuracy, with the downside that your application has to download a bigger machine learning model file.
As with the Web Worker files, you need to configure the path to the machine learning files in the object that you pass as an argument to createWorker().
const worker = await createWorker("eng", 1, {
workerPath: 'tesseract5/worker.min.js',
corePath: 'tesseract5/',
langPath: 'https://myserver/5.0.0',
...
});
The library creates the download URL by combining langPath + '/' + language code + '.traineddata.gz'.
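To check where the language files have to be placed on your server, the rule can be written down as a small function. This is a sketch of the rule as described here, not the library's actual code; trainedDataUrl, the trailing-slash handling, and the gzip parameter are assumptions made for illustration.

```javascript
// Builds the URL from which a language model would be requested, following
// the rule described in the text: langPath + language code + '.traineddata.gz'.
// trainedDataUrl is a hypothetical helper, not a Tesseract.js function.
function trainedDataUrl(langPath, lang, gzip = true) {
  const base = langPath.replace(/\/+$/, ''); // drop trailing slashes
  return `${base}/${lang}.traineddata${gzip ? '.gz' : ''}`;
}

console.log(trainedDataUrl('https://myserver/5.0.0', 'eng'));
// prints "https://myserver/5.0.0/eng.traineddata.gz"
```

With the langPath from the example, the English model therefore has to be reachable at https://myserver/5.0.0/eng.traineddata.gz.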
Visit the official documentation page for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md