
OCR in the browser with Tesseract.js

Published: July 30, 2019  •  Updated: December 30, 2019  •  javascript

Optical character recognition (OCR) is the process of converting images of text into machine-encoded text. For example, you can take a picture of a book page and run it through OCR software to extract the text.

In this blog post, we are going to use the Tesseract OCR library. Tesseract is written in C/C++ and was originally developed at Hewlett-Packard between 1985 and 1994. HP open-sourced the software in 2005. Since then, Google has been developing and maintaining it.

The latest version, 4, released in October 2018, contains a new OCR engine that uses an LSTM-based neural network, which should increase accuracy significantly. Version 4 supports 123 languages out of the box. The source code is hosted on GitHub: https://github.com/tesseract-ocr/tesseract

As mentioned before, the Tesseract engine is written in C++ and does not run in a browser. The only way to use the C++ engine is to send the picture from a web application to a server, run it through the engine, and send the text back.

However, a JavaScript port of the Tesseract C++ engine has existed for a few years; it runs in the browser and does not depend on any server-side code. The library is called Tesseract.js, and you find the source code on GitHub: https://github.com/naptha/tesseract.js
The engine was originally written in asm.js and has recently been ported to WebAssembly.

We are going to use version 2 of the library. Version 2 is a WebAssembly port of Tesseract 4.1. The library falls back to asm.js when the browser does not support WebAssembly.

Installation

You add Tesseract.js to your project by loading it from a CDN

<script src='https://unpkg.com/tesseract.js@2.1.4/dist/tesseract.min.js'></script>

or by installing it with npm.

npm install tesseract.js

Basic Usage

The library provides the recognize method, which takes an image as input and returns a Promise that resolves to an object with the recognized text. Here is a simple example:

    const exampleImage = 'https://tesseract.projectnaptha.com/img/eng_bw.png';

    // create a worker that reports progress to the console
    const worker = Tesseract.createWorker({
      logger: m => console.log(m)
    });
    Tesseract.setLogging(true);
    work();

    async function work() {
      // load the core scripts, the English model, and initialize the API
      await worker.load();
      await worker.loadLanguage('eng');
      await worker.initialize('eng');

      // detect the script and orientation of the text in the image
      let result = await worker.detect(exampleImage);
      console.log(result.data);

      // run the actual text recognition
      result = await worker.recognize(exampleImage);
      console.log(result.data);

      // free the resources when the worker is no longer needed
      await worker.terminate();
    }

basic.html


Arguments

The recognize method expects an image as the first argument. The library supports images in the bmp, jpg, png, and pbm formats.

The image can be supplied to the method in several forms: as a URL or base64-encoded string, a File or Blob object, an img, canvas, or video element, or an ImageData object.

See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/image-format.md
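
Here is a minimal sketch of a few common input forms, assuming a worker that was loaded and initialized as in the basic example; the file input and the canvas element are hypothetical page elements:

// a URL string
await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');

// a File object, e.g. from a hypothetical <input type="file"> element
const fileInput = document.querySelector('input[type=file]');
await worker.recognize(fileInput.files[0]);

// a canvas element whose current content should be recognized
const canvas = document.getElementById('preview');
await worker.recognize(canvas);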


Before you call the recognize method, you have to create a worker with createWorker(), load the tesseract.js-core scripts with load(), load the machine learning models for one or more languages with loadLanguage(...), and finally initialize the Tesseract API with initialize(...). The language you pass to the initialize call can be a subset of the languages you loaded with loadLanguage().

The next example loads the models for English and Spanish in advance, but then only uses English for the next API call.

await worker.loadLanguage('eng+spa');
await worker.initialize('eng');

Later in the application, you simply switch to the other language with another initialize(...) call.
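
For example, after the loadLanguage('eng+spa') call above, switching to Spanish only requires:

await worker.initialize('spa');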


Progress

The OCR process runs for a few seconds, and if you want to display progress information to the user, you configure a progress listener. You do that by passing an object as an argument to the createWorker() call. The logger property expects a function that is called multiple times during the recognition process.

    const worker = Tesseract.createWorker({
      logger: m => console.log(m)
    });

The object passed to the logger function contains the worker and job IDs and the properties status and progress. status is a string describing the current operation, and progress is a number between 0 and 1 that represents the progress of that operation (multiply it by 100 to get a percentage).

Progress example:

{workerId: "Worker-0-4d98d", jobId: "Job-0-a37f4", status: "recognizing text", progress: 0.8285714285714286}

If you need more internal information, enable logging with Tesseract.setLogging(true). The library then prints additional information to the developer console.
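
As a sketch, here is a logger that only reacts to the recognition phase and feeds the value into an HTML progress element; the element with the id ocr-progress is an assumption, not something the library provides:

// hypothetical <progress id="ocr-progress" max="1"></progress> element on the page
const progressElement = document.getElementById('ocr-progress');

const worker = Tesseract.createWorker({
  logger: m => {
    // only update the bar while Tesseract is recognizing text
    if (m.status === 'recognizing text') {
      progressElement.value = m.progress; // between 0 and 1
    }
  }
});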


Result

The recognize() method returns a Promise. The object you get from a successful call contains the property data, which holds information about the recognized text. You either access the text through the text property, which contains the recognized text as one string, or through the lines, paragraphs, words, or symbols properties. Each group element contains a confidence score that tells you how confident the engine is. The score is a number between 0 and 100; higher values signify higher confidence.
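
For example, if you are only interested in the plain text, you can pick the text property directly out of the result:

const {data: {text}} = await worker.recognize(exampleImage);
console.log(text);

The complete data object looks like this: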

{
  text: "Mild Splendour of the various-vested Nig ..."
  hocr: "<div class='ocr_page' id= ..."
  tsv: "1 1 0 0 0 0 0 0 1486 ..."
  box: null
  unlv: null
  osd: null
  confidence: 90
  blocks: [{...}]
  psm: "SINGLE_BLOCK"
  oem: "DEFAULT"
  version: "4.0.0-825-g887c"
  paragraphs: [{...}]
  lines: (8) [{...}, ...]
  words: (58) [{...}, {...}, ...]
  symbols: (295) [{...}, {...}, ...]
}

When you access the text through the lines and paragraphs properties, you get the text grouped by lines and paragraphs. The words property contains an array with every recognized word, and symbols gives you access to each individual character.

Each element of these arrays contains the property bbox, which holds the x/y coordinates of the bounding box. In the demo application, I use this information to draw a rectangle around the selected text.
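
As a sketch, this is how you could loop over all recognized words and print each word together with its confidence score and bounding box, assuming result holds the resolved value of a recognize() call:

for (const word of result.data.words) {
  const {x0, y0, x1, y1} = word.bbox;
  console.log(`${word.text}: ${word.confidence.toFixed(1)}% at (${x0},${y0})-(${x1},${y1})`);
}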

Here is an example of an element in the words array. The text property contains the word, and confidence tells us the confidence score. line and paragraph reference the line and paragraph objects where this word is located. symbols is an array that holds each character individually (M, i, l, d).

{
  symbols: Array(4)
    0: {choices: Array(1), image: null, text: "M", confidence: 99.03752899169922, baseline: {...}, ...}
    1: {choices: Array(1), image: null, text: "i", confidence: 98.83988952636719, baseline: {...}, ...}
    2: {choices: Array(1), image: null, text: "l", confidence: 99.01886749267578, baseline: {...}, ...}
    3: {choices: Array(1), image: null, text: "d", confidence: 99.0378646850586, baseline: {...}, ...}
  choices: [{...}]
  text: "Mild"
  confidence: 91.87923431396484
  baseline: {x0: 38, y0: 84, x1: 167, y1: 85, has_baseline: true}
  bbox: {x0: 38, y0: 34, x1: 167, y1: 85}
  is_numeric: false
  in_dictionary: false
  direction: "LEFT_TO_RIGHT"
  language: "eng"
  is_bold: false
  is_italic: false
  is_underlined: false
  is_monospace: false
  is_serif: false
  is_smallcaps: false
  font_size: 17
  font_id: -1
  font_name: ""
  page: ...
  block: ...
  paragraph: {lines: Array(8), text: "Mild Splendour ...", confidence: 91.35659790039062, ...}
  line: {words: Array(6), text: "Mild Splendour ...", confidence: 92.46450805664062, ...}
}

If you want to see the complete result object, visit the URL https://omed.hplar.ch/webocr/basic.html and open the developer console.


detect

Another method that the Tesseract.js library provides is detect(). This method tries to detect the orientation and the script of the text. Like recognize(), it expects an image as the first argument and returns a Promise.

  const result = await worker.detect(image);

The result object contains the property data, which holds the information about the detected script and orientation together with the corresponding confidence scores.

{
    tesseract_script_id: 1
    script: "Latin"
    script_confidence: 39.58333969116211
    orientation_degrees: 0
    orientation_confidence: 29.793731689453125
}

See the official documentation for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/api.md#worker-detect


Cleanup

The library runs the OCR engine in a Web Worker. If your application no longer needs the worker, you should terminate it with worker.terminate().
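
A try/finally block is a convenient way to make sure the worker is terminated even when recognition fails. A minimal sketch, assuming the worker was created and initialized as shown above:

try {
  const {data} = await worker.recognize(image);
  console.log(data.text);
} finally {
  // release the Web Worker and the WebAssembly memory
  await worker.terminate();
}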


More examples

Check out the documentation page for more code examples:
https://github.com/naptha/tesseract.js/blob/master/docs/examples.md

Demo Application

In this section, I show how I incorporated the Tesseract.js library into an Angular/Ionic application.

You find the source code for the complete application on GitHub:
https://github.com/ralscha/blog2019/blob/master/webocr

The demo application is hosted on my server, and you can access it with this URL:
https://omed.hplar.ch/webocr/

The demo application does not depend on any server-side code; the OCR runs locally in the web browser, and no data is sent to the server.

The application is based on the Ionic blank starter template. First, I added the latest version of Tesseract.js with npm install tesseract.js to the project.

In the TypeScript code, I imported the library with

import {createWorker, RecognizeResult} from 'tesseract.js';

As input, the application uses an input tag of type file (type="file"). Each time the user selects a file, the method onFileChange is called; it extracts the File object from the input tag and passes it to the recognize() method.

  <input #fileSelector (change)="onFileChange($event)" accept="image/*" style="display: none;"
         type="file">

home.page.html

The selected picture is also loaded into an Image object. The application does not use an img tag to display the picture; instead, it draws the picture into a canvas. I use a canvas here because the application draws rectangles around the recognized text whenever the user clicks on a text element.

  async onFileChange(event: Event): Promise<void> {
    // @ts-ignore
    this.selectedFile = event.target.files[0];

    this.progressStatus = '';
    this.progress = null;

    this.result = null;
    this.words = null;
    this.symbols = null;
    this.selectedLine = null;
    this.selectedWord = null;
    this.selectedSymbol = null;

    this.image = new Image();
    this.image.onload = () => this.drawImageScaled(this.image);
    this.image.src = URL.createObjectURL(this.selectedFile);

    const worker = createWorker({
      workerPath: 'tesseract-202/worker.min.js',
      corePath: 'tesseract-202/tesseract-core.wasm.js',
      logger: progress => {
        this.progressStatus = progress.status;
        this.progress = progress.progress;
        this.progressBar.set(progress.progress * 100);
        this.changeDetectionRef.markForCheck();
      }
    });

    await worker.load();
    await worker.loadLanguage(this.language);
    await worker.initialize(this.language);

    this.progressBar.set(0);

    try {
      if (this.selectedFile) {
        const recognizeResult = await worker.recognize(this.selectedFile);
        if (recognizeResult) {
          this.result = recognizeResult.data;
        } else {
          this.result = null;
        }
        await worker.terminate();
      }
    } catch (e) {
      this.progressStatus = e;
      this.progress = null;
    } finally {
      this.progressBar.complete();
      this.progressStatus = null;
      this.progress = null;
    }

    // reset file input
    // @ts-ignore
    event.target.value = null;
  }

home.page.ts

To display the result, I'm using the table component from the Angular Material project. You can add Angular Material to any Angular project with the command ng add @angular/material.

Here is the template for the words table:

  <div *ngIf="words" class="table-container mat-elevation-z1 ion-margin-top">
    <table [dataSource]="words" mat-table>

      <ng-container matColumnDef="text">
        <th *matHeaderCellDef mat-header-cell>Word</th>
        <td *matCellDef="let word" mat-cell> {{word.text}} </td>
      </ng-container>

      <ng-container matColumnDef="confidence">
        <th *matHeaderCellDef mat-header-cell>Confidence</th>
        <td *matCellDef="let word" mat-cell> {{word.confidence | number:'1.2-2'}} %</td>
      </ng-container>

      <tr *matHeaderRowDef="elementColumns; sticky: true" mat-header-row></tr>
      <tr (click)="onWordClick(word)" *matRowDef="let word; columns: elementColumns;"
          [ngClass]="{'highlight': selectedWord === word}"
          mat-row></tr>
    </table>
  </div>

home.page.html

When the user clicks on a table row, the application draws a rectangle around the selected text:

  drawBBox(bbox: { x0: number, x1: number, y0: number, y1: number }): void {
    if (bbox) {
      this.redrawImage();

      if (this.ratio === null) {
        throw new Error('ratio not set');
      }

      this.ctx.beginPath();
      this.ctx.moveTo(bbox.x0 * this.ratio, bbox.y0 * this.ratio);
      this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y0 * this.ratio);
      this.ctx.lineTo(bbox.x1 * this.ratio, bbox.y1 * this.ratio);
      this.ctx.lineTo(bbox.x0 * this.ratio, bbox.y1 * this.ratio);
      this.ctx.closePath();
      this.ctx.strokeStyle = '#bada55';
      this.ctx.lineWidth = 2;
      this.ctx.stroke();
    }
  }

  onLineClick(line: Line): void {
    this.words = line.words;

    this.drawBBox(line.bbox);

    this.symbols = null;
    this.selectedLine = line;
    this.selectedWord = null;
    this.selectedSymbol = null;
  }

home.page.ts

Self-Host

When your application creates and initializes the worker, the library downloads several files.

const worker = createWorker({...});
await worker.load();
await worker.loadLanguage(this.language);
await worker.initialize(this.language);

These files are by default not part of the packaged application; Tesseract.js downloads them from 3rd-party servers. The latest version downloads the Web Worker script (worker.min.js) and the WebAssembly core (tesseract-core.wasm.js) from unpkg.com, and the machine learning model for the configured language from tessdata.projectnaptha.com.

If the application calls the detect() method, one additional file, the osd.traineddata model for orientation and script detection, is downloaded.

This is not the best solution when you write an in-house application, where downloading the files from a local server would be much faster, or when you want to be in full control of all application resources and not worry about the availability of these 3rd-party servers. If one of the servers is down, your application no longer works.

Whatever the reason is, you can easily self-host these files. The Web Worker JavaScript and WebAssembly files are part of the Tesseract.js npm packages, so we can copy these two files from the node_modules directory to the build folder. In an Angular application, you do this by adding the following two entries to the build->options->assets array.

                "glob": "worker.min.js",
                "input": "node_modules/tesseract.js/dist/",
                "output": "./tesseract-202"
              },
              {
                "glob": "tesseract-core.wasm.js",
                "input": "node_modules/tesseract.js-core/",
                "output": "./tesseract-202"
              }
            ],

angular.json

The Angular CLI takes care of copying these two files to the build folder. In the application, we need to tell Tesseract.js that it must load these two files from the local directory instead of fetching them from unpkg.com.

    const worker = createWorker({
      workerPath: 'tesseract-202/worker.min.js',
      corePath: 'tesseract-202/tesseract-core.wasm.js',
      ...
    });

The machine learning models that the library downloads from tessdata.projectnaptha.com are not available as npm packages. There are different ways to self-host them. If you only need to support a few languages, you can download the files from the Git repository (https://github.com/naptha/tessdata.git) and copy them into the build folder during the build process with an npm script, as sketched below.
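
As a sketch, a small Node script could do the copying during the build; the language list, the location of the cloned tessdata repository, and the target folder are assumptions you would adapt to your project:

// copy-tessdata.js - run with "node copy-tessdata.js" from an npm script
const fs = require('fs');

const languages = ['eng', 'spa'];          // languages your application supports
const sourceDir = 'tessdata/4.0.0';        // clone of https://github.com/naptha/tessdata.git
const targetDir = 'src/assets/tessdata';   // folder that ends up in the build output

fs.mkdirSync(targetDir, {recursive: true});
for (const lang of languages) {
  fs.copyFileSync(`${sourceDir}/${lang}.traineddata.gz`,
                  `${targetDir}/${lang}.traineddata.gz`);
}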

Be aware that these files are quite big, and downloading them each time you build your application could take some time.

Instead of adding them to the project, you could download the files you need and copy them onto an HTTP server that is under your control. If you need to support multiple languages, you could clone the whole repository.

git clone https://github.com/naptha/tessdata.git

Downloading all files requires 4.8 GB of disk space!

Each language is available in three different versions: normal, fast, and best. The best version gives you higher OCR accuracy with the downside that your application has to download a bigger machine learning model file. For the English language, the best file has a size of 12.2 MB, normal 10.4 MB, and fast 1.89 MB.

Like the Web Worker files, you need to configure the path to the machine learning files in the object that you pass as an argument to createWorker.

    const worker = createWorker({
      workerPath: 'tesseract-202/worker.min.js',
      corePath: 'tesseract-202/tesseract-core.wasm.js',
      langPath: 'https://myserver/4.0.0',
      ...
    });

The library creates the URL by combining langPath + language code + '.traineddata.gz'. With the configuration above, the English model is therefore loaded from https://myserver/4.0.0/eng.traineddata.gz.

Visit the official documentation page for more information:
https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md