Jay Taylor's notes

A speech-to-text practitioner’s criticisms of industry and academia | Hacker News

Original source (news.ycombinator.com)
Tags: automatic-speech-recognition speech-to-text news.ycombinator.com
Clipped on: 2020-04-06


I work in a startup doing STT. We have reached state of the art (lowest CER) for our language among industry systems. The main reason we are doing well is not that we have smart engineers tuning fancy models, but that we developed a novel method to collect a tremendous amount of usable data from the internet (crawling speech with text transcripts, using subtitles from movies, etc.). Implementing an interesting paper improves results by 1%; pouring in more data improves them by 10%. I guess this is why the big guys aren't exposing what data they've used. It takes a fortune to collect just 100 hours of clean labeled speech-to-text data, and 100 hours will never meet user expectations in the market.
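For anyone outside the field: CER (character error rate) is the character-level edit distance between the recognizer's output and the reference transcript, divided by the reference length. A minimal sketch for illustration only (normalization conventions, e.g. how spaces are handled, vary between groups):

    # Levenshtein distance between hypothesis and reference characters,
    # divided by the reference length.
    def cer(reference: str, hypothesis: str) -> float:
        r, h = list(reference), list(hypothesis)
        # dp[i][j] = edits needed to turn r[:i] into h[:j]
        dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            dp[i][0] = i
        for j in range(len(h) + 1):
            dp[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(r)][len(h)] / max(1, len(r))

    print(cer("speech to text", "speech two text"))  # 1 insertion / 14 chars ~ 0.071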

Also, we have developed an internal framework that eases the pipeline from pretraining to fine-tuning on subtasks. After months of usage it required a lot of refactoring to match the needs of all forms of DNN models. I would like to hear from other ML engineers whether they have an internal framework that generalizes at least one subfield of DNN (NLP, vision, speech, etc.).
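To make the question concrete, here is a hedged sketch of what such a pretrain/fine-tune interface could look like; the class and method names are hypothetical, not taken from any released framework:

    # Hypothetical interface for a framework that generalizes the
    # pretraining-then-fine-tuning-on-subtasks pipeline across DNN models.
    from abc import ABC, abstractmethod

    class PretrainableModel(ABC):
        @abstractmethod
        def pretrain(self, unlabeled_corpus) -> None:
            """Self- or weakly-supervised pretraining on raw speech/text."""

        @abstractmethod
        def finetune(self, labeled_dataset, task: str) -> None:
            """Adapt the pretrained encoder to a subtask (ASR, keyword spotting, ...)."""

        @abstractmethod
        def export(self, path: str) -> None:
            """Serialize weights for deployment."""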

What's an example of an NLP application that "meets user expectations" in the wild? Google's natural language stuff just annoys me. Facebook's translation and Google Translate don't seem good, even after multiple breakthrough announcements.

Edit: SOTA, state of the art, is invoked incredibly glibly, as if it were enough to make a given application a success. This seems like a massive overstatement.


Lots of things are translated entirely by deep translation systems with some light post-editing in the wild.

As another practitioner, I find it odd that the author mentions frameworks such as OpenNMT but not Kaldi when comparing STT toolkits/frameworks. Overall, I get the feeling that the author hasn't worked with speech data for very long.

I agree with some of the general points the author makes, such as papers having tons of equations just for equations' sake, poor reproducibility, not enough detail to reproduce results (in a conference paper...), and chasing SOTA on some dataset, but these aren't limited to speech processing research. Large companies don't release their internal datasets for voice search (Google Assistant, Alexa, etc.), call center, or clinical applications, since there is no incentive to do so and they likely can't anyway because of licensing and data-rights issues. By the way, what's the situation for the author's OpenSTT corpus? Did the people speaking on the thousands of hours of phone calls release them under a free license?


Yup. Copying my post from reddit:

Seems like the title should really be "criticisms of STT groups in industry"; work from academic groups is ignored.

The original article was kind of silly anyway: it was all engineering work, very little research. Engineering work is important, but you're not going to get an ImageNet moment in ASR by twiddling with hyperparameters or reducing training time. Picking DeepSpeech as a framework was also a bizarre decision; of course your models will take forever to train and won't even be that good in the end. It's end-to-end and hasn't been SOTA for a while.

> It is hard to include the original Deep Speech paper in this list, mainly because of how many different things they tried, popularized, and pioneered.

Had to laugh at this. What exactly did they "popularize and pioneer" other than the practice of training and reporting results on very large private datasets? [This, among others, was a much more important end-to-end paper I think.](https://arxiv.org/pdf/1303.5778.pdf)

There are some good points about using private datasets, overfitting on a read-speech dataset, using very large models, etc., but I really wish they had used their data to train a Kaldi system. I can guess why they didn't: they have no background in speech and found it too hard to use. Still disappointing.

Anyway, I would argue the reason an "ImageNet moment" hasn't arrived for ASR is that vision is universal, while speech splits into different languages, making it much harder to build a single model that everyone else can use as a seed model. I believe multilingual models are the future.


> As another practitioner, I find it odd that the author mentions frameworks such as OpenNMT but not Kaldi when comparing STT toolkits/frameworks.

[unpopular opinion ahead] Maybe that's because Kaldi represents the ultimate culmination of the author's critique:

• the provided scripts are useless to anyone without access to the academic datasets they use

• Kaldi is very cumbersome and overly complex to use and adapt (most of the toolchain relies on exact copies of entire directory trees)

• Kaldi is a research tool by researchers for researchers and not in any way shape or form aimed at practitioners in search of a deployable solution

• Kaldi is very poorly documented (from a user perspective) and focuses on recipes for your own experiments, not on writing applications and rollout

In summary, Kaldi isn't a practical toolkit. It's a framework for R&D on speech-to-text.


> the provided scripts are useless to anyone without access to the academic datasets they use

The Kaldi team shares all datasets when possible on http://openslr.org, a major collection of speech datasets. LibriSpeech and TED-LIUM were major breakthroughs in their time. When everyone in research trained on 80 hours of WSJ and Google trained on a 2,000-hour private dataset, the 1,000 hours of LibriSpeech were a game changer.

> Kaldi is very cumbersome and overly complex to use and adapt (most of the toolchain relies on exact copies of entire directory trees)

On the other hand, the Kaldi API is very stable. You can still run four-year-old recipes with a simple run.sh. TensorFlow software becomes obsolete every three months when the TF API changes.

Kaldi recipe results are usually reproducible down to the last digit, and you can even see the tuning history (which features improved things and which didn't).

> Kaldi is a research tool by researchers for researchers and not in any way shape or form aimed at practitioners in search of a deployable solution

There are hundreds of companies all over the world building practical solutions with Kaldi. The only thing the Kaldi team should do is ask users to mention it.

> Kaldi is very poorly documented (from a user perspective) and focuses on recipes for your own experiments, not on writing applications and rollout

There are also dozens of projects on GitHub that enable the use of Kaldi with a simple pip install or docker pull: kaldi-gstreamer-server, kaldi-active-grammar, vosk, and many others.
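For example, a minimal transcription sketch with vosk (pip install vosk), assuming a downloaded model directory and a 16 kHz, 16-bit mono WAV file:

    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("model")            # path to an unpacked Vosk model directory
    wf = wave.open("test.wav", "rb")  # 16 kHz, 16-bit mono PCM
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    print(rec.FinalResult())          # JSON string containing the transcript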


(Disclaimer: I also work in the field.)

I would disagree with most of your points.

The provided scripts are available for a huge number of datasets. Many of the datasets are also useful in production. And it's very simple to adapt the provided scripts for further data.

Kaldi is complex and big, yes. But you don't have to use all of it. And it is that way because doing STT properly simply is complex, at least for the conventional hybrid HMM/NN models. You could surely go one of the end-to-end routes, but that will lead to other problems in production. And once you need to deal with those problems, it might become even more complicated. (This is a whole topic of its own...)

Kaldi is used in research as well as in production. It can be easily deployed.

Kaldi is very well documented; just check its homepage. In addition, there is a huge community (both academia and industry).


I agree with many points of the article. I do think we’re closer to an English STT ImageNet equivalent than you think. For most other languages I don’t think it’s possible until data is collected/released, or some kind of automatic data collection process becomes standard that can generate 10k hours of usable training data for each of a bunch of arbitrary languages.

Seeing better “state of the art” results on librispeech is far less interesting to me than two recent developments:

- Facebook's streaming convnet wav2letter model/architecture, pretrained on 60k hours of audio, which I've been using for transfer learning with great results. It's fast, not too huge, runs on CPU, and has excellent results without a language model.

- The librilight dataset. I aligned 30k hours of it (on CPU, on a single large server, in about a week) and my resulting models are generalizing very well.


> The librilight dataset. I aligned 30k hours of it (on CPU, on a single large server, in about a week) and my resulting models are generalizing very well.

Was this using the pretrained model from Facebook? If so, how much custom code did you need to write to get the alignment? I've been looking for a way to take an arbitrary position in a given text and look up where in the corresponding audio it appears, but I'd prefer not having to train a speech recognition model to do that.


I train my own models here: https://talonvoice.com/research/

I used my wav2train project, which is based on Mozilla’s DSAlign: https://github.com/talonvoice/wav2train

DSAlign only generates roughly sentence level alignment, which still may be a good start for you. It works by segmenting the audio by pauses, transcribing the segments with a strict language model, then figuring out where in the text the segments are based on the transcript. wav2train then generates audio segments using the alignment, but you could edit it to stop there and just look at the tlog file that tells you where the sentences are in the original file.
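A rough sketch of that last step, locating a transcribed segment inside the reference text with a naive fuzzy match (an illustration of the idea only, not DSAlign's actual algorithm; the file name is a placeholder):

    from difflib import SequenceMatcher

    def locate(segment, reference):
        """Return (word_start, word_end, score) of the best-matching span."""
        words = reference.split()
        n = len(segment.split())
        best = (0, 0, 0.0)
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n])
            score = SequenceMatcher(None, segment.lower(), window.lower()).ratio()
            if score > best[2]:
                best = (i, i + n, score)
        return best

    text = open("book.txt", encoding="utf-8").read()
    print(locate("it was the best of times", text))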

I’ve also used the Gentle Forced Aligner and Montreal Forced Aligner. MFA wasn’t too hard to set up for English, but I don’t recommend running it on large batches of audio, as it was slow and unreliable at large scale for me.


Interesting (possible) additional piece of the puzzle on the industry:

In contrast to the explosion of CV training data, the big players may feel they can't expose tagged raw audio to the public the way tagged raw images were exposed. The privacy backlash over the way some previous public-release datasets have been used was notable.

Has the ecosystem changed so the kind of mass-collaboration that led to ImageNet can't be repeated due to privacy concerns of how the audio could be (ab)used?


I see the same issue in optical flow. Everyone and their grandma is overfitting on Sintel.

But everyday phenomena, like changing light conditions or trees with many small branches, cause the SOTA to fail miserably.

I even approached research teams with my new data set that they could use to measure generalization, but the response has been frosty or hostile. Nobody likes to confirm that their amazing new paper is worthless for real world usage.


This is one of the reasons why I never got myself to finish my master's thesis.

My advisor suggested I use synthetic data, because the work I was supposed to build upon did as well.

I insisted on real data, which I generated myself, and, unsurprisingly, it showed how infeasible the whole thing (using stereo audio data to localise speakers) was with the suggested methods.


As someone who tinkers with text-to-speech (TTS), I can say this applies to TTS as well. Good models such as Tacotron 2 rarely scale beyond clean (good text-speech alignment), large (> 12 hours) datasets.

Very good explanation of the current state of the art in STT. I have also personally gone down this rabbit hole (especially the CTC bit) and agree 100% with what the author says. My personal viewpoints:

1. To have an ImageNet moment, we need an ImageNet. LibriSpeech doesn't cut it; we need an equivalent SpeechNet. The problem is that there is one visual "language" while there are many spoken languages. So should we have an ImageNet for every language? And train for every language?

2. What is an ImageNet moment? Personally, I think that even though ImageNet has contributed a lot, vision applications, much like speech and NLP, have promised more and delivered less in real-world scenarios. And just like in speech and NLP, only the big boys actually provide solutions in vision, which suggests that there too they have access to better data that they don't share.


I'm in no way an expert in any of this, but I'd expect there to be a lot of features common to many languages, akin to the International Phonetic Alphabet [0]. Pre-training on all languages to capture those shared features could make it easier to fine-tune e.g. English on top. Or not, just pondering here.

[0] https://en.wikipedia.org/wiki/International_Phonetic_Alphabe...


Good luck finding recordings of the 6,000+ existing languages...

You can check https://github.com/festvox/datasets-CMU_Wilderness; it has recordings in 700 languages, created from New Testaments on http://www.bible.is/

I think it's worth pointing out that NLP's ImageNet moment will probably be seen as the ELMo/BERT papers, which showed we could get significant performance improvements by pretraining models on large amounts of unlabeled text.

Maybe this is too hard for speech because of its intricacies, but I wanted to point out that if the goal is transfer learning, the recipe doesn't have to be the same.


TL;DR:

- the big boys (Google, Baidu, Facebook, etc., basically FAANG) are getting SOTA results using private data without explicitly stating it

- their toolkits are too complicated and not suited to non-FAANG-scale datasets/problems

- published results aren't reproducible, and academic papers often gloss over important implementation details (e.g. which version of CTC loss is used)

- SOTA/leaderboard chasing is bad, and there's a general over-reliance on large compute, so non-FAANG teams inevitably end up overfitting on publicly available datasets (LibriSpeech)

I'm far from an expert, but having spent the last 6 months familiarising myself with the STT landscape, I would say I mostly agree.

First, the author wants more efforts like Mozilla's Common Voice. Not sure if he/she is aware, but I think the M-AILABS speech dataset (https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) will go a long way toward "democratizing" STT.

Second, I agree that the toolkits are all complicated and weighted in favour of highly parameterized NNs. This works great if you have FAANG-style data (and hardware), but it really isn't pushing the field forward in other respects (compute/training time, parameterization, etc.). In all fairness, though, HMM-based frameworks like Kaldi are notoriously complex and take significant effort to wrap your head around. I think speech is just inherently more complex than image or text.

Third - and this applies to ML in general - agree that papers all too often gloss over the important details. I see no reason why publications shouldn't all be accompanied by code and dataset hashes.

I think we're seeing FAANG all converge on a non-optimal solution, simply because we've effectively removed resource constraints. When your company doesn't blink at spending a few million on creating its own datasets, and you have massive TPU farms at your disposal, there's no incentive to focus on sample- or compute-efficient models.

What's more, they also have the resources to market their successes much more effectively. Think open source frameworks, PR pieces, new product releases, and so on.

This effect is two-fold. First, it creates a lot of noise that crowds out the slower, smaller contributors. Second, it draws a lot of junior research attention: who would want to risk four years of their life on a pathway that might not work, when NNs are here today and there's clear evidence that they work (albeit only for certain problems in certain environments with certain datasets)?


Facebook released a pretty impressive SOTA system built on public data sources a few weeks ago.

Are you talking about wav2letter@anywhere?

If so, I haven't dug deeply yet, but I suspect the author's point stands - inference-time efficiency is one thing, but training compute (and separately, data-efficiency) is another.

I don't think the author is necessarily criticizing anyone, and I don't think he/she is saying that these tools/models/papers/etc aren't welcome. My read is that "pushing STT beyond the FAANG use-case means we should stop confining ourselves to FAANG-scale data and models".


What does "SOTA" mean in this context? All I could find is "Software Over the Air", but I'm unsure what it means in a speech-to-text context.

“State of the art”.

But I think your original question is still great:

What does state of the art mean in this context?


It means "state of the art", which would be roughly equivalent to "as good a result as is currently possible".

A lot of it is solving the wrong problems.

'Text to speech' is good for tapping people's phones. Voice user interfaces won't be good enough until they can hold a conversation, that is, ask for clarification if they don't understand what people say. Superhuman performance at TTS isn't good enough to replace a keyboard with a backspace button.


Are you mixing up STT and TTS here? Maybe I'm misunderstanding what you're trying to say.

Yep, that's the kind of mistake a human can catch... Remember, some of those errors are inserted at the transmitter, some along the channel, and some at the receiver. If a receiver is going to get great accuracy, it has to catch errors introduced anywhere!

What is the hacker's "good enough" tiny lib to take even a small set of single words from my own voice as input through my laptop mic?

Pocketsphinx[1]. I used it, together with pico2wave and some old-school interactive-fiction-style code, to create a "computer" for my open source space game[2], so I can drive the spaceship around by talking to it[3][4].

[1] https://cmusphinx.github.io/wiki/tutorialpocketsphinx/
[2] https://spacenerdsinspace.com
[3] https://www.youtube.com/watch?v=tfcme7maygw
[4] https://scaryreasoner.wordpress.com/2016/05/14/speech-recogn...
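For the original question (a small set of single words from the laptop mic), a minimal keyword-spotting sketch with the pocketsphinx Python package (pip install pocketsphinx); the keyphrase and threshold below are placeholders you would tune for your own words:

    from pocketsphinx import LiveSpeech

    # Listen on the default microphone and report each detection of the
    # keyphrase. For several words, a kws file with one keyword and
    # threshold per line works the same way.
    for phrase in LiveSpeech(lm=False, keyphrase='computer', kws_threshold=1e-20):
        print("heard:", phrase)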


Ooh, thanks for the links. I'll check these out.

I feel like there is a project waiting to happen (or that I'm unaware of) to use movies/TV and captions to extract well-annotated audio.

While for copyright reasons this dataset likely couldn't be released publicly, the method to generate it from the source material could be released, so that IF you have the source, you can produce the annotated audio yourself (see the sketch below).
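A hedged sketch of the idea: slice a film's audio track at the subtitle timestamps and pair each clip with its caption text (using the pydub and srt packages; the file names are placeholders):

    import os
    import srt
    from pydub import AudioSegment  # needs ffmpeg to decode the container

    os.makedirs("clips", exist_ok=True)
    audio = AudioSegment.from_file("movie.mkv")
    subs = list(srt.parse(open("movie.srt", encoding="utf-8").read()))

    for i, sub in enumerate(subs):
        start_ms = int(sub.start.total_seconds() * 1000)
        end_ms = int(sub.end.total_seconds() * 1000)
        audio[start_ms:end_ms].export(f"clips/{i:06d}.wav", format="wav")
        with open(f"clips/{i:06d}.txt", "w", encoding="utf-8") as f:
            f.write(sub.content.replace("\n", " "))

Subtitle timing is only approximate, so a filtering or forced-alignment pass would usually still be needed before training on clips like these.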


Audiobooks could be another resource, but they'll only give you a narrator's intonation. On the other hand, at some point it may become easier to train people to speak to computers in a certain way than to train computers to understand the unclear mumblings of casual speech.

> Audiobooks could be another resource

Yes, that's in fact the source of the commonly cited LibriSpeech dataset. It used publicly available recordings from the LibriVox project and applied some cleaning steps, as well as alignment of the transcripts with the audio, to arrive at 1,000 hours of cleaned-up audio.

The M-AILABS dataset uses LibriVox as well, among other sources, to arrive at 1,000 hours of cleaned-up data in various European languages.

Overall there's still a large untapped potential in LibriVox data: https://gist.github.com/est31/1e195c55fab8f95a72393db1519da1...

But you'll have to give up properties like gender balance in your dataset, which then has to be counteracted during the learning process, e.g. by having gender-specific learning rates.


Facebook’s librilight taps this potential. They have a script for easily fetching an entire language from librivox, and I have a pipeline on top of that for book fetching and transcript alignment that I’ve successfully trained from.

I think the next step here needs to be auto-finding LibriVox source books that aren't in e.g. Gutenberg. I only auto-found books for 75% of the English recordings, and only 20% of the German ones. I don't actually know where to look beyond Gutenberg for books that don't have easily linked text. Common Crawl? The LibriVox forums?

(Or you can take Facebook’s semi-supervised approach and generate a transcript from thin air instead, but imo that will work better for English than languages that don’t have good baseline models yet)


I don't really see any issues with speech-to-text when I use Google Assistant. The real lack is in intelligence, even for basic things. I ask Google to navigate to Wegmans. It always insists on asking me which one. Yes, it cleverly lets me say "the closest one", but that becomes a whole lot less clever when you have to tell it "the closest one" every single time. There has never been a single case where I would prefer to drive further to a store whose branches have exactly identical layouts. But you can't tell it "the closest one, always".

Lucky you, you're probably a native (US) English speaker.

I dare you to try any STT system in more "niche" languages or dialects. US English should not be considered the be-all and end-all indicator of progress in this area of research.


Imagine what you could do for a computer's understanding of speech if you used Brad Pitt's performance in Snatch.

A few years back the BBC did a documentary series on fishermen from roughly the same part of Scotland I grew up in; to aid understanding, it was subtitled!

http://news.bbc.co.uk/1/hi/5244738.stm

What I found amusing was that it was pretty clear the people shown talking were trying really hard to talk "properly"; their real accents would probably be much broader.


This is quite interesting, looking at chip technologies, algorithms, actual papers and industry processes together.

Neither debunking nor selling AI, but giving the big picture.


When this becomes a solved problem, government agencies will have a field day. So it has to be delayed until most of the world has free speech.

That sounds absurd. Government agencies are not holding back an entire industry of researchers.

And surely the countries without free speech would be among the most interested in computer-based eavesdropping.


We live in a world where some people have immense power. We, as in engineers and scientists are the enablers.

On a somewhat related note, why does the Google STT engine have a grudge against the word "o'clock"?

If you say "start at four", it gives you the text "start at 4". So far so good, but if you say "start at four o'clock" you just get "start at 4" again, not "start at 4:00" like you wanted. In fact, you can say the word "o'clock" over and over for several seconds and it will type absolutely nothing.


Having recently experienced a similar issue with "filler" words that mysteriously disappeared in a transliteration pipeline, I'd guess that this is due to an artifact in the training data. Most likely the STT engine was trained on transcriptions created by lazy humans who simply wrote "4" when they heard "four o'clock", so the engine learned that "o'clock" is a speech disfluency just like "umm" that usually doesn't appear in writing.

I just tried this on Google Assistant and didn't get the behavior you describe. When you look at the streaming results, you can clearly see it produce the "o'clock", but it gets normalized to "4:00" at the end.

Are you talking about cloud ASR or something?



