A Speech-To-Text Practitioner’s Criticisms of Industry and Academia

04.Apr.2020

This is a follow-up to our article on building speech-to-text (STT) models, Towards an ImageNet Moment for Speech-to-Text. In the first article we mostly focused on the practical aspects of building STT models. In this article we would like to answer the following questions:

What is the so-called ImageNet moment and why does it matter?
Why has it not yet arrived in speech, and how have academia and industry indirectly contributed to the delay?
What are the deficiencies of the approach we presented in the previous article?

What is the ImageNet Moment?

As discussed in our last piece Towards an ImageNet Moment for Speech-to-Text, in our opinion, the ImageNet moment in a given Machine Learning (ML) sub-field arrives when:

The architectures and model building blocks required to solve 95% of standard "useful" tasks are widely available as standard and tested open-source framework modules;
Most popular models are available with pre-trained weights from large datasets that enable fine-tuning on downstream tasks with relatively little data;
This sort of fine-tuning from standard tasks using pre-trained models to different everyday tasks is solved (i.e. tends to work well);
The compute required to train models for everyday tasks is minimal (e.g. 1-10 GPU days in STT) compared to the compute requirements previously reported in papers (100-1000 GPU days in STT);
The compute for pre-training large models is available to small independent companies and research groups;

If the above conditions are satisfied, one can develop new useful applications at reasonable cost. Democratization also occurs: one no longer has to rely on giant companies such as Google as the only source of truth in the industry.

Why There is No ImageNet Moment Yet

To understand this, let us look at which events and trends led to its arrival in computer vision (CV).

but usually their real applicability is limited by a number of factors:

The majority of such datasets are in English;
Such datasets are great in terms of researching what is possible, but unlike in CV it is hard to incorporate them in real pipelines;
Though it is remarkable how much effort and attention go into building datasets like SQuAD, you cannot really use them in production models;
Stable production-grade NLP models are usually built upon data that is several orders of magnitude larger, or else the task at hand has to be much simpler. For example - it is safe to assume that a neural network can do NER reliably, but as for answering questions or maintaining dialogue - this is just science fiction now. I like an apt analogy - building AGI with transformers is like going to the Moon by building towers;

There is a competing point of view on ML validation and metrics (as opposed to the “the higher the better” mantra) that we endorse. It says that an ML pipeline should be treated as a compression algorithm, i.e. your pipeline compresses the real world into a set of compute graphs and models with memory, compute, and hardware requirements. If you manage to fit a roughly similarly performing model into a 10x smaller weight footprint or compute size, then it is a much better achievement than getting an extra 0.5% on the leaderboard.
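One hypothetical way to make this comparison concrete is sketched below. The `ModelReport` type, the `prefer_smaller` helper, and both thresholds (a 10x size reduction and a 0.5 point WER gap, taken loosely from the numbers in the paragraph above) are illustrative assumptions, not an established metric, and the example figures are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelReport:
    """Hypothetical summary of one trained model (illustrative only)."""
    name: str
    wer: float      # word error rate on a held-out set, as a fraction in [0, 1]
    size_mb: float  # weight footprint in megabytes


def prefer_smaller(big, small, max_wer_gap=0.005, min_shrink=10.0):
    """Return True if `small` is at least `min_shrink` times smaller than `big`
    while its WER is worse by at most `max_wer_gap`.

    Both thresholds are assumptions chosen for illustration.
    """
    shrink = big.size_mb / small.size_mb
    wer_gap = small.wer - big.wer
    return shrink >= min_shrink and wer_gap <= max_wer_gap


# Made-up numbers: the compact model gives up 0.4 WER points for an ~11x smaller footprint.
leaderboard_model = ModelReport("leaderboard", wer=0.100, size_mb=1000.0)
compact_model = ModelReport("compact", wer=0.104, size_mb=90.0)
print(prefer_smaller(leaderboard_model, compact_model))
```

Under this view the compact model counts as the better result even though it would rank lower on a leaderboard sorted by WER alone.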

On the other hand, the good news is people in the industry are starting to think about the efficiency of their approaches and even Google is starting to publish papers on pre-training Transformers efficiently (on Google scale, of course).

Paper Contents and Structure

5-10% errors when transcribing audio, the table is misleading. We read
some of the papers below and noticed a few things:

Newer papers rarely perform ablation tests with smaller models;
ASR papers claiming state-of-the-art performance rarely post convergence curves;
The papers rarely report the amount of compute used for hyperparameter search and model convergence;
Out of the papers we read, only Deep Speech 2 paid attention to how performance on a smaller dataset transfers to real-life data (i.e. out-of-domain validation);
Sample efficiency and scalability to real datasets are not optimized for. Several papers in 2019 did something like this (Time-Depth Separable Convolutions, QuartzNet) - but they focused on reducing the model size, not its training time;

inefficiently, optimize it, then achieve new grounds again), but it
seems that ASR research is a good example of Goodhart's law in practice.

If you read the release notes of pre-trained Deep Speech in PyTorch and saw "Do not expect these models to perform well on your own data!", you may be amazed - it is trained on 1,000 hours of speech and has a very low CER and WER! In practice though, systems fitted on some ideal large 10,000 hour dataset will have WER upwards of 25-30% (instead of 5% on clean speech and 10% on noisy speech as advertised).
Unlike CV research, where better ImageNet performance actually transfers to real tasks with much smaller datasets, in speech, better performance on LibriSpeech does not transfer to real-world data! You cannot "just quickly tune" your network on 1,000 hours of speech like you would train your network on 1,000 images in CV;
All of this means that the academic and corporate worlds have produced more and more elaborate methods of overfitting to LibriSpeech.
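For reference, the WER and CER figures quoted above are conventionally computed as the edit (Levenshtein) distance between the reference transcript and the model output, divided by the length of the reference, over words and characters respectively. A minimal sketch follows, assuming plain whitespace tokenization and no text normalization; the function names and the example sentences are our own illustration, not taken from any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a handful of errors in short real-world utterances can move WER far more than the same number of errors in long, clean read-speech recordings.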

Although it is understandable that researchers want to make progress on their problems and work with the data available, ultimately this shows that an ImageNet-like project to first create a truly large and very challenging dataset would be far more useful.

Over-Reliance on Large Compute

fields that would benefit the society as a whole. For example, the story of the autonomous truck company Starsky perfectly illustrates this point. They delivered a working product, but the market was not ready for it, in part due to “AI-disenchantment”. Borrowing the concept and image from this article, you can visualize society's reaction to a new technology with the curve below. If a technology has reached L1, it will be widely adopted and everyone will benefit. If L2 is reachable, but requires a lot of investment and time, probably only large corporations or state-backed monopolies will reap its fruits. If L3 is the case, then people will probably just revisit this technology in the future.

changed what computing could do. Before relational databases appeared in the late 1970s, if you wanted your database to show you, say, 'all customers who bought this product and live in this city', that would generally need a custom engineering project. Databases were not built with structure such that any arbitrary cross-referenced query was an easy, routine thing to do. If you wanted to ask a question, someone would have to build it. Databases were record-keeping systems; relational databases turned them into business intelligence systems.

This changed what databases could be used for in important ways, and so created new use cases and new billion dollar companies. Relational databases gave us Oracle, but they also gave us SAP, and SAP and its peers gave us global just-in-time supply chains - they gave us Apple and Starbucks. By the 1990s, pretty much all enterprise software was a relational database - PeopleSoft and CRM and SuccessFactors and dozens more all ran on relational databases. No-one looked at SuccessFactors or Salesforce and said "that will never work because Oracle has all the database" - rather, this technology became an enabling layer that was part of everything.

So, this is a good grounding way to think about ML today - it’s a step change in what we can do with computers, and that will be part of many different products for many different companies. Eventually, pretty much everything will have ML somewhere inside and no-one will care. An important parallel here is that though relational databases had economy of scale effects, there were limited network or ‘winner takes all’ effects. The database being used by company A doesn't get better if company B buys the same database software from the same vendor: Safeway's database doesn't get better if Caterpillar buys the same one. Much the same actually applies to machine learning: machine learning is all about data, but data is highly specific to particular applications. More handwriting data will make a handwriting recognizer better, and more gas turbine data will make a system that predicts failures in gas turbines better, but the one doesn't help with the other. Data isn’t fungible.”

Developing his notion that “machine learning = just an enablement stack to answer some questions, like the ubiquitous relational database”, it is up to us to decide the fate of speech technologies. Whether their benefits will be reaped by a select few or by society as a whole remains to be seen. We firmly believe that speech technologies will become a commodity within 2-3 years. The only remaining question is: will they look more like PostgreSQL or Oracle? Or will the two models co-exist?

Alexander Veysov is a Data Scientist at Silero, a small company building NLP / Speech / CV enabled products, and the author of Open STT - probably the largest public Russian spoken corpus (we are planning to add more languages). Silero has recently shipped its own Russian STT engine. Previously he worked at a then Moscow-based VC firm and at Ponominalu.ru, a ticketing startup acquired by MTS (a major Russian telco). He received his BA and MA in Economics from Moscow State University for International Relations (MGIMO). You can follow his channel on Telegram (@snakers41).

Thanks to Andrey Kurenkov and Jacob Anderson for their contributions to this piece.

If you enjoyed this piece and want to hear more, subscribe to the Gradient and follow us on Twitter.
