Note that Mozilla's "Project DeepSpeech" is just a training and recognition engine, really an implementation of one specific design for LVCSR (large-vocabulary continuous speech recognition). A full recognition system also requires statistical models, generally trained on very large datasets, which you have to bring yourself.
Pretrained models are very nice for spinning up a system and starting to use it, and you need that because training a model is so hard. But pretrained models are by no means good general models. They are trained under narrow parameters (a limited number and size of datasets), so their ability to generalize is very low. It's not uncommon to train a system like that down to 5% WER on Switchboard (think MS/IBM), then have that same system perform at 40% WER on other audio.
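For anyone reproducing numbers like these: WER is just word-level edit distance divided by reference length. A minimal sketch of how it's computed (my own illustration, not any vendor's scorer, and it assumes both strings are already normalized):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that insertions can push WER above 100%, which is why a system can look fine on its training distribution and fall apart elsewhere.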
You're doing awesome (arduous) work. The text normalization especially is a total bear; I feel your pain. Limiting your text to one file is good in many ways because it lets you scope down the amount of work needed to do a comparison (it's a big systematic risk, but hey, there are only so many hours in the day).
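To make the normalization pain concrete: before you can score anything, every transcript has to be folded into one canonical form. A crude sketch of the easy part (my own example; real pipelines also need numeral expansion like "42" vs "forty two", contractions, fillers, etc.):

```python
import re

def normalize(text: str) -> str:
    """Crude transcript normalization: lowercase, strip punctuation,
    collapse whitespace. The hard cases (numerals, abbreviations,
    "Dr." vs "doctor") are exactly what makes this a bear."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

Every service punctuates and capitalizes differently, so skipping this step makes the WER comparison meaningless.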
Your previous blog post helps in understanding how much work needs to go into comparing speech services. It's super common to undervalue just how much processing a human is doing innately while listening to audio: hearing words, feeling out ideas, resolving ambiguities, etc. So it's awesome to see deep work going into it (besides the speech teams working on these problems at Google, Baidu, Microsoft, Deepgram [btw, I'm a founder of Deepgram]).
I wouldn't be so quick to attribute the differences in WER to how 'modern' each system is. It's more about the areas they play in: what audio type they care about, what training datasets they use, what post-processing they do, and what language models they choose to apply. (Speed/turnaround time gives you a much better indication of how modern a system is.)
Many speech transcription systems focus on a specific type of audio as their target market. There are ~4 main types: phone (customer support/sales), broadcast (news/podcasts/videos), command and control (Siri/Google Assistant), and ambient (meetings/lectures/security).
Google's video model is perfect for what you are doing (broadcast/podcast, 2 dudes talking into probably pretty good mics).
In other instances the results will be very different (if you compared phone calls, for example), and not just in accuracy, but also in speed (throughput and latency), price, and reliability.
It's awesome to see an in depth comparison being discussed broadly. Speech interfacing and understanding is just getting started. We're still at the tip of the Intelligence Revolution and there's still a long way to go. The scale of compute and data is huge, even to bring just one language up to snuff.
Aside: It's a dirty little secret that there actually aren't 20 different speech recognition companies in the world using 20 different systems. There are only a handful (many use Google and tweak the outputs). They are mostly doing one of five things: using old and aged tech, using old but well-oiled tech (like Google; this takes a ton of manpower and no other company spends the money to do it), using an open source spinoff (like Kaldi or Mozilla), building their own from scratch (like Deepgram), or reselling someone else's.
If you care about the current state of things, this is a reasonably good finger in the wind as of Sept. 2018:
Use Google if you are doing command and control or broadcast audio; do not use Google if you are doing meetings or phone calls or you need a reliable system (it's unreliable at scale). DO use Google in all cases (even phone/meetings) for audio in a language other than English (no other company is even close).
Use Google to prototype systems and teach yourself about how to use a speech recognition API and what results to expect as a baseline.
Do not use Google if you need scale and speed and reliability and affordability.
Do not use Google if you need to use your own vocabulary, or if your audio has recurring content with accents or jargon (like call centers). In that case, use a company that can build a true custom acoustic model and vocabulary for you (like Deepgram). There are only a few companies that will consider doing this (and Google is not one of them).
Expect that many more things are going to be addressed.
Think of it like: what can a human do?
A human can jump into a conversation and quickly tell you: there are 3 people, speaking about rebuilding a feature in the main code; two are male, one is female; male1 and female1 do most of the talking in the beginning, then it's the two dudes at the end; it sounds like the recording is of a meeting they are having; they never came to a definitive conclusion or next steps; they spent 80 minutes in the meeting. All of that (and I'm sure more) will be done by machine in the future.
Thanks for the offer. For the evaluation I’ve just written a few tiny Perl scripts. I’ve used the services manually, e.g. via curl or the website. For transcripts in JSON I ran jq to extract the text. I can put the scripts in a repo but there’s not much to it.
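(For anyone following along: the jq step can equally be done in a few lines of Python. This sketch assumes a Google-style response shape, `results` → `alternatives` → `transcript`; other services nest things differently, so adjust the keys per service.)

```python
import json

def extract_transcript(payload: str) -> str:
    """Join transcript fragments from a JSON speech API response.
    Assumes a Google-style shape:
      {"results": [{"alternatives": [{"transcript": ...}]}, ...]}
    Takes the top alternative of each result."""
    data = json.loads(payload)
    parts = [r["alternatives"][0]["transcript"] for r in data.get("results", [])]
    return " ".join(p.strip() for p in parts)
```

This is roughly equivalent to `jq -r '.results[].alternatives[0].transcript'` piped through a whitespace cleanup.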
So many unknown new players, almost all better than Nuance (and thus Siri). Hard to believe, though, that those new companies are approaching Google's accuracy. As much as Google deserves to be shunned for eroding privacy, their TTS always feels miles ahead of anything else.