
Speech Recognition in the Browser with Transformers.js

Published: 19. January 2025  •  ionic, angular, llm

In a previous blog post, I showed you an example of the Transformers.js library running an LLM in the browser for text generation.

Transformers.js is not limited to only running models for text generation; it can also run models for vision and audio tasks. In this blog post, I will show you how to run a speech recognition model in the browser.

As a simple example, I wrote a trivial snake game that can be controlled by voice commands. You can find the complete source code on GitHub.

Implementation

Transformers.js can be installed in any JavaScript project via npm:

npm install @huggingface/transformers

Transformers.js can run any model that has been converted to the ONNX format. The speech recognition model I use in this example is the moonshine-tiny model.

The Moonshine models are trained for speech recognition and transcribe English speech audio into English text. The tiny model has 27M parameters. Moonshine is also available in a base version with 61M parameters, which is more accurate. The models were released in October 2024.

The quantized ONNX version of the 27M parameter model, which I use in this demo application, is only about 50MB in size and can be easily downloaded and run in the browser. Transformers.js has a built-in feature that caches downloaded models in the browser's cache, so it doesn't need to download the model again when the user revisits the page.
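For reference, here is a minimal sketch (not taken from the demo application) that shows how the same model can be run with Transformers.js' high-level pipeline API. It assumes that audio is a Float32Array of 16 kHz mono PCM samples:

import {pipeline} from '@huggingface/transformers';

// Downloads the model on first use; later visits load it from the browser cache
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/moonshine-tiny-ONNX');

// audio: Float32Array of mono PCM samples at 16 kHz
const output: any = await transcriber(audio);
console.log(output.text);

The demo application uses the lower-level tokenizer, processor, and model classes instead, as shown in the following sections.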

WebWorker

We could run the Transformers.js model in the main thread of the browser. The problem is that the speech recognition can take anywhere from a few milliseconds to several seconds to complete, and while Transformers.js is processing the audio input, the browser becomes unresponsive. To avoid this, we run the model in a separate thread. To start a separate thread in the browser, we use a WebWorker.

The demo application for this blog post is written in Angular/Ionic. Angular has excellent support for WebWorkers. All you need to do is run this command:

ng generate web-worker <location>

You can then write your WebWorker in TypeScript and also import third-party libraries, like Transformers.js. Angular's build system takes care of transpiling the TypeScript code to JavaScript and bundling the third-party libraries.
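The command generates a <location>.worker.ts file (plus a tsconfig.worker.json) containing a minimal message handler, roughly like the following scaffold, which you then replace with your own logic:

/// <reference lib="webworker" />

addEventListener('message', ({data}) => {
  const response = `worker response to ${data}`;
  postMessage(response);
});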

WebWorkers communicate with the main thread via messages. In this demo application, we have two types of messages the main thread sends to the WebWorker: load and generate. The load message tells the WebWorker to load the speech recognition model. The generate message contains the audio input that should be processed by the speech recognition model and converted to text.
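To make this contract explicit, the messages in this demo can be described with the following type aliases. They are inferred from the code shown below and are not part of the demo source:

// Messages sent from the main thread to the WebWorker
type WorkerRequest =
  | {type: 'load'}
  | {type: 'generate'; data: {audio: Float32Array; language: string}};

// Messages sent from the WebWorker back to the main thread
type WorkerResponse =
  | {status: 'ready'}
  | {status: 'complete'; output: string[]};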

Here is the code of the main event listener in the WebWorker.

self.addEventListener('message', async (e: MessageEvent) => {
  const {type, data} = e.data;

  switch (type) {
    case 'load':
      load();
      break;

    case 'generate':
      generate(data);
      break;
  }
});

app.worker.ts

The load function calls the getInstance() method to load the speech recognition model and waits until the model is loaded. Once loading is complete, the WebWorker sends a ready message to the main thread.

async function load() {
  await AutomaticSpeechRecognitionPipeline.getInstance();
  self.postMessage({status: 'ready'});
}

app.worker.ts

The getInstance() method loads the speech recognition model and returns the tokenizer, processor, and model.

  static async getInstance() {
    this.model_id = 'onnx-community/moonshine-tiny-ONNX';

    this.tokenizer = AutoTokenizer.from_pretrained(this.model_id, {});
    this.processor = AutoProcessor.from_pretrained(this.model_id, {});

    this.model = MoonshineForConditionalGeneration.from_pretrained(this.model_id, {
      dtype: {
        encoder_model: 'fp32',
        decoder_model_merged: 'q4',
      },
      device: 'webgpu',
    });
    return Promise.all([this.tokenizer, this.processor, this.model]);
  }
}

app.worker.ts
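For context, the getInstance() excerpt above is part of a class whose static fields hold the loading promises. A condensed sketch of the surrounding declarations (simplified, not the exact code from the demo) could look like this:

import {
  AutoProcessor,
  AutoTokenizer,
  MoonshineForConditionalGeneration,
} from '@huggingface/transformers';

class AutomaticSpeechRecognitionPipeline {
  static model_id: string;
  static tokenizer: Promise<any>;
  static processor: Promise<any>;
  static model: Promise<any>;

  // getInstance() as shown above
}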

The generate function processes the audio input and generates the text output. It uses the tokenizer, processor, and model created in the getInstance() method.

The processor converts the audio input to a format that the model can process. The model generates the output tokens, which are then converted to text by the tokenizer. The final text output is sent to the main thread within a complete message.

async function generate({audio, language}) {
  if (processing) {
    return;
  }
  processing = true;

  const [tokenizer, processor, model] = await AutomaticSpeechRecognitionPipeline.getInstance();

  const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,
    decode_kwargs: {
      skip_special_tokens: true,
    }
  });

  const inputs = await processor(audio);

  const outputs: any = await model.generate({
    ...inputs,
    max_new_tokens: 64,
    language,
    streamer,
  });

  const outputText = tokenizer.batch_decode(outputs, {skip_special_tokens: true});

  self.postMessage({
    status: 'complete',
    output: outputText,
  });
  processing = false;
}

app.worker.ts

Main program

The constructor in the main program instantiates the WebWorker and sends it a load message, which triggers the loading of the model.

  constructor() {
    this.worker = new Worker(new URL('../app.worker', import.meta.url));
    this.initListener();
    this.worker.postMessage({type: 'load'});
  }

speech.page.ts

The initListener method installs the event listener for the messages sent by the WebWorker.

The ready message tells the main application that the model is loaded and ready to process audio input. This demo application blocks any user input with the modelReady flag until the model is ready.

The main thread receives the complete message whenever the model has processed the audio input and generated the text output.

  initListener(): void {
    const onMessageReceived = (e: MessageEvent) => {
      switch (e.data.status) {
        case 'ready':
          this.modelReady = true;
          this.startListen();
          break;
        case 'complete':
          this.handleTranscriptionResult(e.data.output[0]);
          break;
      }
    };

    this.worker.addEventListener('message', onMessageReceived);
  }

speech.page.ts

The startListen method starts the speech recognition. It uses the MediaRecorder API, a built-in browser API supported by all modern browsers, to record audio from the user's microphone.

The recorder listens continuously, and every 800 milliseconds it captures the audio input and sends it to the WebWorker for processing in a generate message. Because it listens continuously, it also records audio when the user is not speaking. To avoid sending "empty" audio input to the WebWorker, the code checks, with the help of the audioAnalyzer, whether the audio input contains speech. You can find the implementation of the audioAnalyzer here; a simplified sketch of such a check follows the code below.

  async startListen(): Promise<void> {
    this.ctx = this.canvas.nativeElement.getContext('2d');
    if (!navigator.mediaDevices.getUserMedia) {
      console.error('getUserMedia not supported on your browser!');
      return;
    }
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: SpeechPage.SAMPLING_RATE,
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    });

    this.recorder = new MediaRecorder(stream);
    this.recorder.onstart = () => {
    };
    this.recorder.ondataavailable = (e) => {
      const blob = new Blob([e.data], {type: this.recorder!.mimeType});
      const fileReader = new FileReader();
      fileReader.onloadend = async () => {
        try {
          const arrayBuffer = fileReader.result;

          // Decode the recorded chunk into raw PCM samples
          const decoded = await this.audioContext.decodeAudioData(arrayBuffer as ArrayBuffer);
          const channelData = decoded.getChannelData(0);

          // Skip chunks that do not contain speech
          const analysis = this.audioAnalyzer.analyzeAudioData(channelData);
          if (!analysis.hasSpeech) {
            return;
          }

          this.worker.postMessage({type: 'generate', data: {audio: channelData, language: 'english'}});
        } catch (e) {
          console.error('Error decoding audio data:', e);
        }
      };
      fileReader.readAsArrayBuffer(blob);
    };

    this.recorder.onstop = () => {
    };

    // Start recording, then restart every 800 ms so that each chunk
    // is delivered to ondataavailable and sent to the WebWorker
    this.recorder.start();
    this.listenLoop = window.setInterval(() => {
      this.recorder!.stop();
      this.recorder!.start();
    }, 800);
  }

speech.page.ts
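The analyzeAudioData call in ondataavailable comes from the audioAnalyzer linked above, whose implementation is not shown in this post. As an illustration only, a very simple energy-based check could look like the following sketch; the real analyzer in the repository may be more elaborate:

// Hypothetical energy-based speech check: compute the RMS of the samples
// and treat the chunk as speech when it exceeds a threshold.
function analyzeAudioData(samples: Float32Array, threshold = 0.02): {hasSpeech: boolean} {
  let sumOfSquares = 0;
  for (const sample of samples) {
    sumOfSquares += sample * sample;
  }
  const rms = Math.sqrt(sumOfSquares / samples.length);
  return {hasSpeech: rms > threshold};
}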

handleTranscriptionResult is called whenever the model has processed the audio input and generated the text output. The method checks the text output for certain keywords and triggers the corresponding action. In this demo application, these keywords control the snake in the game.

  handleTranscriptionResult(word: string): void {
    word = word.toLowerCase();
    if (word.includes('go')) {
      this.handleGo();
    } else if (word.includes('stop')) {
      this.handleStop();
    } else if (word.includes('left')) {
      this.handleLeft();
    } else if (word.includes('right')) {
      this.handleRight();
    } else if (word.includes('up')) {
      this.handleUp();
    } else if (word.includes('down')) {
      this.handleDown();
    }
  }

speech.page.ts

Conclusion

This concludes this blog post about speech recognition in the browser with Transformers.js. You can find the complete source code of the demo application on GitHub. We have seen that running a speech recognition model in the browser is feasible thanks to the Transformers.js library.

The problem for a fast game like snake is that the latency of the speech recognition model is too high, at least on the computers I tested the demo application on. But you can take the WebWorker code from this blog post and integrate it into any JavaScript application where latency is not an issue. For example, you could write a transcription tool or a voice-controlled chat application. There are many possibilities with speech recognition in the browser.

Running a model locally has the advantage that the audio input never leaves the user's device, which is a big plus for privacy-sensitive applications. The drawback is that these models are smaller because they need to run on resource-constrained devices, and they can't compete with the accuracy of cloud-based models. Depending on the use case, this may or may not be acceptable.

I hope you enjoyed this blog post. If you have any questions or feedback, send me a message.