speech ml 100daystooffload

This is an oddly specific question. Let me give some context. When I started working in the speech domain around mid-2018, my company was working on Indian language technologies. At that time, generally available ASR (Automatic Speech Recognition) systems were very poor for Indian languages, including Indian English. To counter this, we built ASRs for Indian languages, which then became our defensibility for some time.

Later we stopped investing in in-house ASRs, reasoning something like: ASRs are commoditized and we should build our defenses at a higher level (conversations). In reality there were other factors and return-on-investment math beyond just ASRs being commoditized. This means that the question of commoditization is still open. In my view, we should be in the following state to claim a positive answer to this question:

  1. Since we have been beating human WER (Word Error Rate) levels for quite some time now, all ASRs should perform very similarly to humans on a context-free benchmark. In the real world, a Language Model (LM) and careful context biasing will save you, but having a solid acoustic base means that you can have a hassle-free off-the-shelf experience as a developer, which is what you should expect from an ASR in 2024.
  2. The ASR APIs should be interoperable and standardized, supporting all common needs (a sketch of what such an interface could look like follows this list). A few of those needs are:
    1. Easy configurability and switching of models
    2. Auxiliary features like sampling rates and format support, diarization, text normalization, etc.
    3. High throughput and reliability
    4. Pricing structure that works on economies of scale
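
To make point 2 concrete, here is a minimal sketch of a provider-agnostic transcription interface in Python. Everything here is hypothetical: none of the class names, parameters, or methods come from a real SDK; the point is only what switching providers would feel like if ASR were truly commoditized.

```python
# Hypothetical sketch of an interoperable ASR interface; no real SDK
# exposes these names. If ASR were commoditized, switching providers
# would be a one-line change at construction time.
from typing import Protocol


class Transcriber(Protocol):
    def transcribe(self, audio: bytes, sample_rate: int, language: str) -> str:
        """Return the transcript of a single utterance."""
        ...


class DeepgramASR:
    def __init__(self, model: str = "nova-2-general") -> None:
        self.model = model

    def transcribe(self, audio: bytes, sample_rate: int, language: str) -> str:
        # Real code would call Deepgram's API here.
        raise NotImplementedError


class GoogleASR:
    def __init__(self, model: str = "chirp_2") -> None:
        self.model = model

    def transcribe(self, audio: bytes, sample_rate: int, language: str) -> str:
        # Real code would call Google's API here.
        raise NotImplementedError


def transcribe_batch(asr: Transcriber, utterances: list[bytes]) -> list[str]:
    # Application code depends only on the interface, never the provider.
    return [asr.transcribe(u, sample_rate=8000, language="en-IN") for u in utterances]
```

In practice, each provider's SDK has its own config objects, response shapes, and quirks, which is exactly the gap point 2 describes.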

While there are many open ASR models for Indian languages, point 2 still means that adopting one is not a single-line code change in an application. I know that my criteria for commoditization might seem too strict to an ML person. But even an open source model still needs you to manage your infra, handle performance updates in the model, and do the audio engineering to pipe data to and from the ASR correctly. And if, say, an Android developer has to think about ASR beyond a few lines of integration code, then it's not commoditized under this definition.
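
To be fair, the happy path with an open model is genuinely short. A minimal sketch with the openai-whisper package looks like this (the audio file name is a placeholder), but serving infra, hardware, and model updates around these lines stay your problem:

```python
# Minimal transcription with the openai-whisper package; the audio
# file path is a placeholder. The lines are few, but everything
# around them (infra, GPUs, updates) is now yours to manage.
import whisper

model = whisper.load_model("large-v3")  # downloads weights on first use
result = model.transcribe("utterance.wav", language="en")
print(result["text"])
```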

Now, let's rephrase the title of this post to: Is Indian English ASR commoditized yet?


While there are many ways to answer this (and there are many general benchmarks), for core acoustic performance we can start with a context-free benchmark like the one we made here¹, which has telephone recordings of spoken number patterns. Even though this is low-quality telephony audio, the utterances are simple enough that you should expect very-close-to-human performance from everyone. But even on something this simple we are not there yet; see the table below:

| Model | Model Spec | SER | Mean WER |
|---|---|---|---|
| Deepgram - Offline | nova-2-general, en-IN | 0.1575 | 0.0240 |
| Deepgram - Offline | nova-2-phonecall, en | 0.1233 | 0.0188 |
| Google - Offline | telephony_short, en-IN | 0.3973 | 0.0567 |
| Google - Offline | chirp_2, en-IN | 0.0342 | 0.0037 |
| Human - Streaming | me | 0.0205 | - |
| Whisper - Offline | large-v3 | 0.2260 | 0.1216 |

SER is Sentence Error Rate, which counts how many utterances had at least one error. Deepgram and Google represent the two common styles of ASR API providers used in applications: speech-focused startups and large enterprises. All models here are used in offline mode. Where a streaming variant is available (there is none for chirp_2), it tends to perform slightly worse than the offline one. I added Whisper large-v3, which is not available as a managed public API (outside of, say, Replicate-style platforms), to get some sense of the performance of open-to-use models. Code for the analysis is here.
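
For concreteness, here is a small sketch of how the two metrics can be computed with the jiwer library, assuming Mean WER in the table means per-utterance WER averaged over the dataset; the strings below are made-up examples, not data from the benchmark:

```python
# Sketch of the SER and mean-WER computation, assuming per-utterance
# averaging; the references and hypotheses are placeholder strings.
import jiwer

references = ["nine eight seven six five", "one two three four"]
hypotheses = ["nine eight seven six five", "one two three foul"]

per_utterance_wer = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]

# SER: fraction of utterances with at least one error.
ser = sum(w > 0 for w in per_utterance_wer) / len(per_utterance_wer)
# Mean WER: average of per-utterance WERs.
mean_wer = sum(per_utterance_wer) / len(per_utterance_wer)

print(f"SER: {ser:.4f}, Mean WER: {mean_wer:.4f}")
```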

In my personal tagging effort, I was confused in around 2% of cases, all of which I count as errors. It's clear that high-performing models like chirp_2² exist, but the notion of commoditization implies consistency across providers. While we are definitely moving towards that, right now the answer to the question stays No.

Practically, you don't need the textbook definition of commoditization to be met before generating a lot of value from third-party ASRs (instead of building something in-house) with some know-how. I believe no one got fired for using Google's Indian English ASR even 3 years ago. But there is still some gap in the ecosystem in terms of pure commoditization for English usage in India.

Footnotes:

¹ Note that this dataset has label noise that I fixed in my analysis repository.