Adding speech recognition to an Ionic App

Published: December 18, 2017  •  Updated: February 16, 2018  •  ionic

In this blog post I show you three different ways to incorporate speech recognition into a web app, in this case an Ionic app.

As an example, I created a simple movie search database with an Ionic front end and a Java / Spring Boot back end. The user can search for movie titles by speaking the query into the microphone and letting a speech recognition library transcribe it into text. The web app then sends a search request to the Spring Boot application, which looks for matching movies stored in a MongoDB database.

Test data

Before I started with the app I needed some movie data to insert into the database. The go-to address for movie information is the Internet Movie Database (IMDb), which provides a set of raw data files for free.
You find all the information about the data files here:
Note that the data files you download there may only be used for personal and non-commercial purposes.

The files are hosted on Amazon S3. For downloading and importing the data files I wrote a Java application. The program downloads the files from S3 with the Java S3 library, parses them with the Univocity library (the data in the files is stored as tab-separated values) and then inserts the records into a MongoDB database.

You find the complete code of the importer on GitHub: src/main/java/ch/rasc/speechsearch/ To download the IMDb files to your own computer you need an Amazon AWS account and credentials for accessing S3.
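The per-line parsing step is conceptually simple: each record in the IMDb files is one tab-separated line, with `\N` marking a missing value. Here is a minimal, self-contained sketch of that step (the column names come from the title.basics file; in the real importer Univocity handles quoting and malformed rows far more robustly):

```java
import java.util.HashMap;
import java.util.Map;

public class TsvLineParser {

  // Column layout of IMDb's title.basics.tsv; "\N" marks a missing value.
  static final String[] COLUMNS = { "tconst", "titleType", "primaryTitle",
      "originalTitle", "isAdult", "startYear", "endYear", "runtimeMinutes", "genres" };

  // Split one tab-separated line into a column -> value map, mapping "\N" to null.
  // Univocity does this (plus quoting and error handling) in the real importer.
  static Map<String, String> parseLine(String line) {
    String[] values = line.split("\t", -1);
    Map<String, String> record = new HashMap<>();
    for (int i = 0; i < COLUMNS.length && i < values.length; i++) {
      record.put(COLUMNS[i], "\\N".equals(values[i]) ? null : values[i]);
    }
    return record;
  }

  public static void main(String[] args) {
    Map<String, String> movie = parseLine(
        "tt0076759\tmovie\tStar Wars\tStar Wars\t0\t1977\t\\N\t121\tAction,Adventure,Sci-Fi");
    System.out.println(movie.get("primaryTitle") + ", " + movie.get("runtimeMinutes"));
    // prints: Star Wars, 121
  }
}
```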


The client is written with the Ionic framework and based on the blank starter template. The app displays the movies as a list of cards. At the bottom there are three buttons that start the speech search with the three mechanisms I show you in this blog post.



The server is written in Java with Spring Boot. The search controller is a simple RestController that handles the search requests from the client.

@RestController
public class SearchController {
  private final MongoClient mongoClient;
  private final MongoDatabase mongoDatabase;

  public SearchController(AppConfig appConfig) throws IOException {
    this.mongoClient = new MongoClient("localhost");
    this.mongoDatabase = this.mongoClient.getDatabase("imdb");
  }

  @GetMapping("/search")
  public List<Movie> search(@RequestParam("term") List<String> searchTerms) {
    Set<Movie> results = new HashSet<>();
    MongoCollection<Document> moviesCollection = this.mongoDatabase
        .getCollection("movies");
    MongoCollection<Document> actorCollection = this.mongoDatabase
        .getCollection("actors");

    String orQuery = String.join(" ", searchTerms);
    try (MongoCursor<Document> cursor = moviesCollection.find(Filters.text(orQuery))
        .sort(Sorts.metaTextScore("score")).limit(20).iterator()) {
      while (cursor.hasNext()) {
        Document doc =;
        Movie movie = new Movie(); = doc.getString("_id");
        movie.title = doc.getString("primaryTitle"); = doc.getBoolean("adultMovie", false);
        movie.genres = doc.getString("genres");
        movie.runtimeMinutes = doc.getInteger("runtimeMinutes", 0);
        movie.score = doc.getDouble("score");
        movie.actors = getActors(actorCollection, (List<String>) doc.get("actors"));
        results.add(movie);
      }
    }

    Comparator<Movie> comparing = Comparator.comparing(m -> m.score);
    return;
  }


The search method takes a list of search terms and runs a full-text search (Filters.text) against the movies collection. The code limits the results to 20 entries and uses a cursor to iterate over the returned entries. It then converts the matching documents to POJOs (Movie) and returns them in a list to the client.
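Two details the listing glosses over: Filters.text only matches if a text index was created on the collection beforehand (the driver offers Indexes.text(...) for that), and the getActors helper is not shown. Its job is resolving the actor ids stored on a movie document to names. A self-contained sketch of that lookup, with a plain Map standing in for the MongoDB actors collection (the names and fields here are my assumptions, not the project's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ActorLookup {

  // Resolve actor ids to display names. In the real controller this would be a
  // query against the actors collection (e.g. Filters.in("_id", actorIds));
  // here a plain Map stands in for the collection so the logic is runnable.
  static List<String> getActors(Map<String, String> actorNamesById, List<String> actorIds) {
    List<String> actors = new ArrayList<>();
    if (actorIds != null) {
      for (String id : actorIds) {
        String name = actorNamesById.get(id);
        if (name != null) {
          actors.add(name);
        }
      }
    }
    return actors;
  }

  public static void main(String[] args) {
    Map<String, String> names = Map.of("nm0000434", "Mark Hamill", "nm0000402", "Carrie Fisher");
    // unknown ids are simply skipped
    System.out.println(getActors(names, List.of("nm0000434", "nm0000402", "nm9999999")));
    // prints: [Mark Hamill, Carrie Fisher]
  }
}
```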

1 Cordova speech recognition plugin

When you create a Cordova application with Ionic and the browser does not support the functionality you need, you look for a native plugin. Fortunately for us there is a plugin that supports speech recognition: cordova-plugin-speechrecognition.

I also installed the Ionic Native wrapper, which is technically not necessary but makes the integration of Cordova plugins easier.

ionic cordova plugin add cordova-plugin-speechrecognition
npm install --save @ionic-native/speech-recognition

The method that handles the search is straightforward.

async searchCordova() {
  const hasPermission = await this.speechRecognition.hasPermission();
  if (!hasPermission) {
    await this.speechRecognition.requestPermission();
  }

  this.speechRecognition.startListening().subscribe(async terms => this.movieSearch(terms));
}


The method first checks if the app has permission to access the microphone of the device. If not, it requests the permission and the device presents a dialog to the user asking for it.


Then the method calls the startListening method of the plugin. The plugin starts listening, the user can now speak his search request into the microphone, and the plugin automatically transcribes the spoken words into text and returns them to the application.
On Android this functionality needs Internet access because the plugin sends the recorded speech sample to a Google service that transcribes it into text. According to the documentation the plugin works similarly on an iOS device: it records the speech sample and sends it to an Apple server.

After the plugin returns an array of transcribed text strings, the searchCordova method calls movieSearch.

async movieSearch(searchTerms: string[]) {
  this.matches = searchTerms;
  if (searchTerms && searchTerms.length > 0) {
    let queryParams = '';
    searchTerms.forEach(term => {
      queryParams += `term=${term}&`;
    });
    const response = await fetch(`${this.serverUrl}/search?${queryParams}`);
    const movies = await response.json(); => this.movies = movies);
  } else {
    this.movies = [];
  }
}


This method is responsible for sending the search terms to the Spring Boot application and handling the response. When the server application finds matching movies, they are presented to the user.
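One detail worth noting: the query string repeats the same parameter name for every search term (term=star&term=wars), which Spring binds to the @RequestParam("term") List<String> argument on the server. The same construction, sketched as a small Java helper (the method name is mine; the client does the equivalent with string concatenation):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class QueryStringBuilder {

  // Repeat the same parameter name for every value, mirroring what the Ionic
  // client sends and what @RequestParam("term") List<String> expects.
  static String buildQuery(List<String> terms) {
    StringBuilder sb = new StringBuilder();
    for (String term : terms) {
      if (sb.length() > 0) {
        sb.append('&');
      }
      // URL-encode each term; spoken search terms may contain spaces
      sb.append("term=").append(URLEncoder.encode(term, StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildQuery(List.of("star", "wars")));
    // prints: term=star&term=wars
  }
}
```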

Because this example depends on a Cordova plugin it does not work in a desktop browser; you need to run the app on a device or in an emulator to test it.

2 Web Speech API

Next I looked for a way that works without any native plugins. The good news is that there is a specification that provides this functionality, the Web Speech API; the bad news is that at the moment (December 2017) only Google has implemented it, in the Chrome browser.
According to information I found on the Internet, the feature is "under consideration" at Microsoft and Mozilla.

The API is easy to use. It's callback based and the app has to implement a few callback handlers. The searchWebSpeech method first checks if the speech recognition object is available in the window object. Then it instantiates the webkitSpeechRecognition object that handles the speech recognition.

searchWebSpeech() {
  if (!('webkitSpeechRecognition' in window)) {
    return;
  }

  const recognition = new webkitSpeechRecognition();
  recognition.continuous = false;

  recognition.onstart = () => => this.isWebSpeechRecording = true);
  recognition.onerror = event => console.log('error', event);
  recognition.onend = () => => this.isWebSpeechRecording = false);

  recognition.onresult = event => {
    const terms = [];
    if (event.results) {
      for (const result of event.results) {
        for (const ra of result) {
          terms.push(ra.transcript);
        }
      }
    }
    this.movieSearch(terms);
  };

  recognition.start();
}




Because I also wanted to disable the button while the recording is running, the onstart and onend handlers set and reset a flag.

The API automatically recognises when the user stops speaking; it then sends the recorded speech sample to Google, where it is transcribed to text. The result is what the onresult handler receives as parameter. In this handler the code collects all the transcriptions into one array and calls the movieSearch method that sends the request to the server.

I tested this in Chrome on a Windows desktop and it works very well. I don't know if the API will ever be implemented in other browsers, but it's a nice addition to the Web platform tool belt.

3 Recording with WebRTC and sending it to the Google Cloud Speech API

The second approach works well but is limited to Chrome, and I was more interested in a solution that works in most modern browsers without any additional plugin.

With WebRTC it's not that complicated to record an audio stream in almost all modern browsers. There is a very good library available, RecordRTC, that smooths out the different WebRTC implementations and is able to record audio.

The example I present here runs in Edge, Firefox and Chrome on a Windows computer. I haven't tested Safari, but version 11 now has a WebRTC implementation, so I guess the example should work in Apple's browser too.

After the speech is recorded, the app transfers it to the server and from there to the Google Cloud Speech API, a service that transcribes spoken words into text. The first 60 minutes of recordings are free; after that you pay $0.006 for each 15-second snippet you send to the service.
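A quick back-of-the-envelope helper for that pricing model (assuming, as the pricing page described at the time, that each request is billed in 15-second increments, rounded up):

```java
public class SpeechPricing {

  // Cloud Speech bills each request in 15-second increments, rounded up
  // (an assumption based on the pricing page at the time of writing).
  static int billedIncrements(int seconds) {
    return (seconds + 14) / 15;
  }

  public static void main(String[] args) {
    // a 40-second recording counts as 3 increments: 3 * $0.006 = $0.018
    System.out.println(billedIncrements(40));
    // prints: 3
  }
}
```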

Unlike the other two examples, the application has to handle starting and stopping the recording manually. For that the method uses a boolean instance variable and sets it to true while the recording is running. The user has to click the Stop button when he is finished speaking.

async searchGoogleCloudSpeech() {
  if (this.isRecording) {
    if (this.recorder) {
      this.recorder.stopRecording(async audioVideoWebMURL => {
        const recordedBlob = this.recorder.getBlob();

        const headers = new Headers();
        headers.append('Content-Type', 'application/octet-stream');

        const requestParams = {
          method: 'POST',
          body: recordedBlob,
          headers
        };
        const response = await fetch(`${this.serverUrl}/uploadSpeech`, requestParams);
        const searchTerms = await response.json();
        this.movieSearch(searchTerms);
      });
    }
    this.isRecording = false;
  } else {
    this.isRecording = true;
    const stream = await navigator.mediaDevices.getUserMedia({video: false, audio: true});
    const options = {
      mimeType: 'audio/wav',
      recorderType: RecordRTC.StereoAudioRecorder
    };
    this.recorder = RecordRTC(stream, options);
    this.recorder.startRecording();
  }
}



When the user starts the recording, the method accesses the audio stream with getUserMedia and calls the RecordRTC object with the stream as source. In this example I set the audio format to wav and use the stereo recorder. This works fine in all three browsers I tested. When the recording stops, the method receives a blob from RecordRTC that contains the recorded audio in wav format. It then uploads the binary data to the Spring Boot application (/uploadSpeech) and waits for the transcription to return. After that it calls the movieSearch method.

On the server I use the google-cloud-speech Java library to connect the application with the Google Cloud. The project needs this dependency in the pom.xml
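The dependency block did not survive here; a sketch of what it looks like (the artifact coordinates are ``/`google-cloud-speech`; the version shown is a placeholder from around the time of writing, so check Maven Central for the release that is current for your project):

```xml
<dependency>
  <groupId></groupId>
  <artifactId>google-cloud-speech</artifactId>
  <!-- replace with the current release from Maven Central -->
  <version>0.32.0-beta</version>
</dependency>
```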



Before you can access a service in the Google Cloud you need a credentials file. To get one you log in to your Google account and open the Google API Console.
There you either create a new project or select an existing one and add the Google Cloud Speech API to the project. Then open the credentials menu and create a new service account. You can then download a JSON file that you add to the project. Don't commit this file into your Git repository; it contains sensitive information that allows anybody who has the key to access the API. In this application I externalize the path to this credentials file with a configuration property class (AppConfig).

The SearchController needs a client that allows it to access the Google Cloud Speech API. The code reads the credentials file and then creates an instance of the SpeechClient class.

public class SearchController {
  private final MongoClient mongoClient;
  private final MongoDatabase mongoDatabase;
  private final SpeechClient speech;

  public SearchController(AppConfig appConfig) throws IOException {
    this.mongoClient = new MongoClient("localhost");
    this.mongoDatabase = this.mongoClient.getDatabase("imdb");

    // path to the service account json file, externalized in AppConfig
    ServiceAccountCredentials credentials = ServiceAccountCredentials.fromStream(
        Files.newInputStream(Paths.get(appConfig.getCredentials())));
    SpeechSettings settings = SpeechSettings.newBuilder()
        .setCredentialsProvider(FixedCredentialsProvider.create(credentials)).build();
    this.speech = SpeechClient.create(settings);
  }


The last piece of the puzzle is the handler for the /uploadSpeech endpoint. This method receives the bytes of the recorded speech sample in wav format and stores them in a file.

@PostMapping("/uploadSpeech")
public List<String> uploadSpeech(@RequestBody byte[] payloadFromWeb)
    throws Exception {

  String id = UUID.randomUUID().toString();
  Path inFile = Paths.get("./in" + id + ".wav");
  Path outFile = Paths.get("./out" + id + ".flac");

  Files.write(inFile, payloadFromWeb);

  // convert the stereo wav recording into a mono flac file
  FFmpeg ffmpeg = new FFmpeg("./ffmpeg.exe");
  FFmpegBuilder builder = new FFmpegBuilder().setInput(inFile.toString())
      .overrideOutputFiles(true)
      .addOutput(outFile.toString())
      .setAudioChannels(1)
      .done();

  FFmpegExecutor executor = new FFmpegExecutor(ffmpeg);
  executor.createJob(builder).run();

  byte[] payload = Files.readAllBytes(outFile);
  Files.deleteIfExists(inFile);
  Files.deleteIfExists(outFile);

  ByteString audioBytes = ByteString.copyFrom(payload);

  RecognitionConfig config = RecognitionConfig.newBuilder()
      .setEncoding(RecognitionConfig.AudioEncoding.FLAC)
      .setLanguageCode("en-US").build();
  RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(audioBytes)
      .build();

  RecognizeResponse response = this.speech.recognize(config, audio);
  List<SpeechRecognitionResult> results = response.getResultsList();

  List<String> searchTerms = new ArrayList<>();
  for (SpeechRecognitionResult result : results) {
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    searchTerms.add(alternative.getTranscript());
  }

  return searchTerms;
}


The problem I had was that the Cloud Speech API cannot handle the wav file it gets from the web app. One problem is the unsupported format, the other is that the recording is in stereo while the Speech API requires mono recordings. Unfortunately I haven't found a pure Java library that is able to convert sound files. Fortunately there is a way to do it with a native application and still support multiple operating systems.

ffmpeg is a program for handling multimedia files. One task it handles is converting audio files into other formats. On the download page you find builds for many operating systems.

For Windows I downloaded the ffmpeg.exe file. To call the exe from the Java code I found a Java wrapper library that simplifies setting the parameters and calling the executable.
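For reference, stripped of the wrapper library, the conversion corresponds to a plain ffmpeg command line. A small sketch that assembles the equivalent argument list (a helper of my own, just to make the flags explicit):

```java
import java.util.List;

public class FfmpegCommand {

  // The wav -> mono flac conversion as raw ffmpeg arguments: -ac 1 downmixes
  // to a single (mono) channel, and the .flac extension of the output file
  // makes ffmpeg pick the FLAC encoder.
  static List<String> convertToMonoFlac(String inWav, String outFlac) {
    return List.of("ffmpeg", "-i", inWav, "-ac", "1", outFlac);
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", convertToMonoFlac("in.wav", "out.flac")));
    // prints: ffmpeg -i in.wav -ac 1 out.flac
  }
}
```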



In the code above you see how I used this library to specify the configuration parameters that convert the wav file into a mono FLAC file. FLAC is one of the supported audio formats of the Cloud Speech API. Calling the API itself is very easy once you have the recording in a supported format. All it needs is a call to the recognize method with the binary data of the recording and a few configuration parameters, like the description of the format.

RecognizeResponse response = this.speech.recognize(config, audio);

When the service was able to understand some words in the recording, it sends back the text transcription. The uploadSpeech method then returns these strings to the Ionic app.

This concludes our journey into speech recognition land. If you are developing a Cordova app, the speech recognition plugin is the easiest way to implement this functionality.

The third solution is more complicated because it also depends on a server part, but it supports most modern browsers and does not require a plugin.

The Web Speech API looks promising, but as long as only one browser supports it, it's not very useful.