
OCR in a browser with Tesseract.js

Published: July 30, 2019  •  javascript

Optical character recognition or optical character reading (OCR) is the process of converting images of text into machine-encoded text. For example, you can take a picture of a book page and then run it through OCR software to extract the text.

In this blog post, we are going to use the Tesseract OCR library. Tesseract is written in C/C++ and was originally developed at Hewlett-Packard between 1985 and 1994. In 2005, Tesseract was open-sourced by HP, and since 2006 it has been developed by Google.

The latest version 4, released in October 2018, contains a new OCR engine that uses a neural network system based on LSTM, which should increase accuracy quite significantly. Version 4 supports 123 languages out of the box. The source code is hosted on GitHub: https://github.com/tesseract-ocr/tesseract

As mentioned before, the Tesseract engine is written in C++ and does not run in a browser. The only way to use the C++ engine from a web application is to send the picture to a server, run it through the engine, and send the text back.

For a few years now, however, a JavaScript port of the Tesseract C++ engine has existed that runs in the browser and does not depend on any server-side code. The library is called Tesseract.js, and you find the source code on GitHub: https://github.com/naptha/tesseract.js
The engine was originally compiled to ASM.js and has since been ported to WebAssembly.

We are going to use version 2 of the library, which is currently (July 2019) available as a beta version. Version 2 is a port of Tesseract 4, with the engine compiled to WebAssembly. The library falls back to ASM.js when the browser does not support WebAssembly.

Installation

You add Tesseract.js either by loading it from a CDN

<script src='https://unpkg.com/tesseract.js@v2.0.0-alpha.13/dist/tesseract.min.js'></script>

or by installing it with npm.

npm install tesseract.js@next

Basic Usage

The library provides the recognize method, which takes an image as input and returns an object with the recognized text. Here is a simple example:

<script src="https://unpkg.com/tesseract.js@v2.0.0-alpha.13/dist/tesseract.min.js"></script>
<script>
    const exampleImage = 'https://tesseract.projectnaptha.com/img/eng_bw.png';
    const worker = new Tesseract.TesseractWorker();
    worker.recognize(exampleImage)
        .progress(progress => console.log('progress', progress))
        .then(result => console.log('result', result))
        .finally(() => worker.terminate());
</script>

basic.html


Arguments

The recognize method expects an image as the first argument. The library supports images in the formats bmp, jpg, png, and pbm.

The image can be supplied to the method in several forms: as a URL string or base64-encoded string, as an img, video, or canvas element, or as a File or Blob object.

See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/image-format.md
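For instance, a File object taken from a file input can be passed directly to recognize(). A minimal sketch, using the same TesseractWorker API as in the basic example above:

const fileInput = document.querySelector('input[type="file"]');
fileInput.addEventListener('change', event => {
    // event.target.files[0] is a File object, one of the supported input types
    const worker = new Tesseract.TesseractWorker();
    worker.recognize(event.target.files[0])
        .then(result => console.log(result.text))
        .finally(() => worker.terminate());
});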


As a second (optional) argument, you can specify the language of the text. If you don't specify a language code, the library uses the default value eng for English text. This parameter determines which machine learning model the library downloads.

Here is an example that sets the language to Spanish:

worker.recognize(exampleImage, 'spa')....

You can specify multiple languages separated by '+'. For example, English and Spanish:

worker.recognize(exampleImage, 'eng+spa')....

See this page for a list of all supported languages and the corresponding code:
https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md


As the third argument, which is also optional, you can provide an options object whose properties override the default Tesseract.js parameters.

See a list of all supported parameters on this official documentation page:
https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_parameters.md
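For example, you could override the tessedit_char_whitelist parameter, which restricts the characters the engine may return. A sketch, assuming you only want to recognize digits (be aware that character whitelists are not fully supported by the LSTM engine of Tesseract 4):

worker.recognize(exampleImage, 'eng', {
    tessedit_char_whitelist: '0123456789' // only return digits
}).then(result => console.log(result.text));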


Result

The recognize() method returns a TesseractJob object. This is a Promise-like object that provides the methods then and catch. In addition, it provides the finally method, which is always called regardless of whether the recognize() method fails or succeeds. TesseractJob also provides the progress method, which the recognition engine calls periodically to report the current progress of the operation.

The progress method receives an object with two properties: status and progress. status is a string that describes the current operation, and progress is a number between 0 and 1 that represents the current progress.

For example:

{status: "recognizing text", progress: 0.17142857142857143}
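You can use these events to drive a progress indicator. A minimal sketch, assuming the page contains a <progress> element with the id ocr-progress:

const progressBar = document.getElementById('ocr-progress'); // <progress id="ocr-progress" max="1">
worker.recognize(exampleImage)
    .progress(p => {
      if (p.status === 'recognizing text') {
        progressBar.value = p.progress; // progress is a value between 0 and 1
      }
    })
    .then(result => console.log('result', result));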

The then method receives a result object when recognize() finishes successfully. This result object contains the recognized text. You can either access the text through the text property, which contains the whole text as one string, or through the lines, paragraphs, words, or symbols properties. Each of these elements carries a confidence score that tells you how confident the engine is. The score is a number between 0 and 100; higher values signify higher confidence.

{
  blocks: [{...}],
  box: null,
  confidence: 93,
  files: {},
  hocr: "...",
  oem: "LSTM_ONLY",
  lines: [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}],
  osd: null,
  paragraphs: [{...}],
  psm: "SINGLE_BLOCK",
  symbols: [{...}, {...}, ...],
  text: "Mild Splendour of the various-vested ...",
  tsv: "...",
  unlv: null,
  version: "4.0.0",
  words: [{...}, {...}, ...]
}

When you access the text through the lines and paragraphs properties, you get the text grouped by lines and paragraphs. The words property contains an array with every recognized word, and symbols gives you access to each individual character.
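For example, you can loop over the words array and print each recognized word together with its confidence score. A short sketch based on the result structure shown above:

worker.recognize(exampleImage)
    .then(result => {
      for (const word of result.words) {
        console.log(`${word.text}: ${word.confidence.toFixed(2)}%`);
      }
    })
    .finally(() => worker.terminate());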

Each element of these arrays contains a bbox property with the x/y coordinates of the bounding box. In the demo application, we use this information to draw a rectangle around the text.

Here is an example of an element in the words array. The text property contains the word, and confidence tells us the confidence score. line and paragraph reference the line and paragraph where this word is located, and symbols is an array with each character (M, i, l, d).

{
  baseline: {x0: 38, y0: 84, x1: 167, y1: 85, has_baseline: true},
  bbox: {x0: 38, y0: 34, x1: 167, y1: 85},
  block: {paragraphs: Array(1), text: "Mild Splendour of t...", confidence: 93.52715301513672, ...},
  choices: [{...}],
  confidence: 96.3907699584961,
  direction: "LEFT_TO_RIGHT",
  font_id: -1,
  font_name: "",
  font_size: 74,
  in_dictionary: true,
  is_bold: false,
  is_italic: false,
  is_monospace: false,
  is_numeric: false,
  is_serif: false,
  is_smallcaps: false,
  is_underlined: false,
  language: "eng",
  line: {words: Array(6), text: "Mild Splendour of the various-vested Night!↵", ...},
  page: {files: {...}, text: "Mild Splendour of the ...", ...},
  paragraph: {lines: Array(8), text: "Mild Splendour of the...", ...},
  symbols: (4) [{...}, {...}, {...}, {...}],
  text: "Mild"
}

If you want to see the full result object, visit https://demo.rasc.ch/ocr/basic.html and open the browser console.


detect

The second method that the Tesseract.js library provides is detect(). This method tries to figure out in what language the text is written. Like recognize(), it expects an image as the first argument and returns a TesseractJob object.

    const detectWorker = new Tesseract.TesseractWorker();
    detectWorker.detect(exampleImage)
        .progress(progress => console.log('progress', progress))
        .then(result => console.log('result', result))
        .finally(() => detectWorker.terminate());

The then method receives the result object, which contains these properties:

{
    tesseract_script_id: 12, 
    script: "NULL", 
    script_confidence: 0.9523810744285583, 
    orientation_degrees: 0, 
    orientation_confidence: 19.713287353515625
}

See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/api.md#tesseractworkerdetectimage---tesseractjob


Cleanup

The library runs the OCR engine in a Web Worker. If your application no longer needs the engine, you should terminate it with worker.terminate().

More examples

Check out the documentation page for more code examples:
https://github.com/naptha/tesseract.js/blob/master/docs/examples.md

Demo Application

In this section, I show how I incorporated the Tesseract.js library into an Angular/Ionic application.

You find the source code for the complete application on GitHub:
https://github.com/ralscha/blog2019/blob/master/webocr

The demo application is also hosted on my server, and you can access it with this URL:
https://demo.rasc.ch/ocr/

The demo application does not depend on any server-side code; the OCR runs locally in the web browser, and no data is sent to a server.

The application is based on the Ionic 4 blank starter template. First, I added the beta version of Tesseract.js to the project with npm install tesseract.js@next.

In the TypeScript code, I import the library with

import Tesseract from 'tesseract.js';

As input, the application uses an input tag of type file (type="file"). Whenever the user selects a file, the method onFileChange is called, which extracts the File object from the input tag and passes it to the recognize() method.

  <input #fileSelector (change)="onFileChange($event)" accept="image/*" style="display: none;"
         type="file">

home.page.html

The selected picture is then also loaded into an Image object. The application does not use an img tag to display the picture; instead, it draws the picture into a canvas. I use a canvas here because the application draws rectangles around the recognized text whenever the user clicks on an element.

  onFileChange(event) {
    this.selectedFile = event.target.files[0];

    // create a worker that loads the self-hosted engine files (see the Self-Host section below)
    const worker = new Tesseract.TesseractWorker({
      workerPath: 'tesseract-200alpha13/worker.min.js',
      corePath: 'tesseract-200alpha13/tesseract-core.wasm.js'
    });
    this.progressStatus = '';
    this.progress = null;

    // reset the state from a previous run
    this.result = null;
    this.words = null;
    this.symbols = null;
    this.selectedLine = null;
    this.selectedWord = null;
    this.selectedSymbol = null;

    // run script and orientation detection and log the result
    worker.detect(this.selectedFile).progress(progressEvent => {
      this.progressStatus = progressEvent.status;
      this.progress = progressEvent.progress;
    }).then(result => {
      console.log(result);
    });

    this.progressBar.set(0);

    // run the OCR and update the progress bar while the engine is working
    worker
      .recognize(this.selectedFile, this.language)
      .progress(progressEvent => {
        this.progressStatus = progressEvent.status;
        this.progress = progressEvent.progress;

        this.progressBar.set(progressEvent.progress * 100);
        this.changeDetectionRef.detectChanges();
      })
      .catch(error => {
        this.progressStatus = error;
        this.progress = null;
      })
      .then(result => {
        this.result = result;
        worker.terminate();
      })
      .finally(() => {
        this.progressBar.complete();
        this.progressStatus = null;
        this.progress = null;
      });

    // load the picture into an Image object and draw it into the canvas
    this.image = new Image();
    this.image.onload = async () => {
      this.drawImageScaled(this.image);
    };
    this.image.src = URL.createObjectURL(this.selectedFile);
  }

home.page.ts
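The drawImageScaled method referenced in onFileChange is not shown in this listing. A minimal sketch of how such a method could look, assuming the canvas and its 2D context are stored in this.canvas and this.ctx, and this.ratio is the scale factor that drawBBox (shown below) reuses:

  drawImageScaled(image: HTMLImageElement): void {
    const canvas = this.canvas.nativeElement;
    // scale the image down so it fits into the canvas, keeping the aspect ratio
    this.ratio = Math.min(canvas.width / image.width, canvas.height / image.height, 1);
    this.ctx.clearRect(0, 0, canvas.width, canvas.height);
    this.ctx.drawImage(image, 0, 0, image.width * this.ratio, image.height * this.ratio);
  }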

To display the result, I'm using the table component from the Angular Material project. You can add Angular Material to any Angular project with the command ng add @angular/material.

Here is the template for the words table:

  <div *ngIf="words" class="table-container mat-elevation-z1 ion-margin-top">
    <table [dataSource]="words" mat-table>

      <ng-container matColumnDef="text">
        <th *matHeaderCellDef mat-header-cell>Word</th>
        <td *matCellDef="let word" mat-cell> {{word.text}} </td>
      </ng-container>

      <ng-container matColumnDef="confidence">
        <th *matHeaderCellDef mat-header-cell>Confidence</th>
        <td *matCellDef="let word" mat-cell> {{word.confidence | number:'1.2-2'}} %</td>
      </ng-container>

      <tr *matHeaderRowDef="elementColumns; sticky: true" mat-header-row></tr>
      <tr (click)="onWordClick(word)" *matRowDef="let word; columns: elementColumns;"
          [ngClass]="{'highlight': selectedWord == word}"
          mat-row></tr>
    </table>
  </div>

home.page.html

When the user clicks on a row, the application draws a rectangle around the selected text:

  drawBBox(bbox: { x0: number, x1: number, y0: number, y1: number }) {
    if (bbox) {
      // redraw the image to remove a previously drawn rectangle
      this.redrawImage();

      // scale the bounding box coordinates with the same ratio that was used to draw the image
      this.ctx.beginPath();
      this.ctx.moveTo(bbox.x0 * this.ratio, bbox.y0 * this.ratio);
      this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y0 * this.ratio);
      this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y1 * this.ratio);
      this.ctx.lineTo(bbox.x0 * this.ratio, bbox.y1 * this.ratio);
      this.ctx.closePath();
      this.ctx.strokeStyle = '#bada55';
      this.ctx.lineWidth = 2;
      this.ctx.stroke();
    }
  }

  onLineClick(line) {
    this.words = line.words;

    this.drawBBox(line.bbox);

    this.symbols = null;
    this.selectedLine = line;
    this.selectedWord = null;
    this.selectedSymbol = null;
  }

home.page.ts

Self-Host

When you create the TesseractWorker, the library downloads several files.

const worker = new Tesseract.TesseractWorker();

These files are by default not part of the packaged application; Tesseract.js downloads them from 3rd party servers. For beta 13, the following files are downloaded: the Web Worker script worker.min.js and the WebAssembly engine tesseract-core.wasm.js from unpkg.com, and the machine learning model for the configured language (for example eng.traineddata.gz) from tessdata.projectnaptha.com.

If you call the detect() method, one additional file is downloaded: the orientation and script detection model osd.traineddata.gz.

This is not ideal when you write an in-house application, where downloading the files from a local server would be much faster. You may also want to be in full control of all application files and not have to worry about the availability of these 3rd party servers: if one of them is down, your application no longer works.

Whatever the reason is, you can self-host these files. The Web Worker JavaScript and WebAssembly files are part of the tesseract.js and tesseract.js-core npm packages, so we can copy these two files from the node_modules directory to the build folder. In an Angular application, you can do this by adding the following two entries to the build->options->assets array.

              {
                "glob": "worker.min.js",
                "input": "node_modules/tesseract.js/dist/",
                "output": "./tesseract-200alpha13"
              },
              {
                "glob": "tesseract-core.wasm.js",
                "input": "node_modules/tesseract.js-core/",
                "output": "./tesseract-200alpha13"
              }

angular.json

The Angular CLI takes care of copying these two files to the build folder. In the application, we need to tell Tesseract.js to load these two files instead of fetching them from unpkg.com:

    const worker = new Tesseract.TesseractWorker({
      workerPath: 'tesseract-200alpha13/worker.min.js',
      corePath: 'tesseract-200alpha13/tesseract-core.wasm.js'
    });

The machine learning models that the library downloads from tessdata.projectnaptha.com are not available as an npm package. There are different ways to self-host them. If you only need to support a few languages, you could download the files from the Git repository (https://github.com/naptha/tessdata.git) and put them into the build folder during the build process with an npm script.
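Such an npm script could, for example, download the file before each build. A hypothetical package.json excerpt (the script name and target folder are assumptions; the 4.0.0 folder matches the langPath example later in this post):

"scripts": {
  "fetch-tessdata": "curl -L -o src/assets/tessdata/eng.traineddata.gz https://github.com/naptha/tessdata/raw/master/4.0.0/eng.traineddata.gz",
  "build": "npm run fetch-tessdata && ng build --prod"
}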

Be aware that these files are quite big, and downloading them each time you build your application could take some time.

Instead of adding them to the project, you could download the files you need and copy them onto a server where an HTTP server is installed. If you need to support multiple languages, you could clone the whole repository.

git clone https://github.com/naptha/tessdata.git

Downloading all files requires 4.8 GB of disk space!

Each language is available in three different versions: normal, fast, and best. The best version gives you higher OCR accuracy with the downside that your application has to download a bigger machine learning model file. For the English language, the best file has a size of 12.2 MB, normal 10.4 MB, and fast 1.89 MB.

Like the Web Worker files, you need to configure the path to the machine learning files in the object that you pass to the TesseractWorker constructor.

    const worker = new Tesseract.TesseractWorker({
      workerPath: 'tesseract-200alpha13/worker.min.js',
      corePath: 'tesseract-200alpha13/tesseract-core.wasm.js',
      langPath: 'https://myserver/4.0.0',
    });

The library creates the URL by combining langPath + language code + '.traineddata.gz'. With the configuration above and the language eng, the library requests https://myserver/4.0.0/eng.traineddata.gz.

Visit the official documentation page for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md