Codec2: A Whole Podcast on a Floppy Disk

(auphonic.com)

486 points | by ericdanielski 2125 days ago

26 comments

  • dahauns 2125 days ago
    Aside from the seriously impressive WaveNet based results, I think the article doesn't do the codec itself enough justice. I mean, low-bitrate speech codecs have been around for some time (hey, vocoders are the oldest kind of audio codecs in history!), and I grew skeptical when they started to compare with mp3 and opus.

    But looking at this page Codec2 really holds its own when compared to AMBE and especially MELP, two of the most prominent ultra low-bandwidth speech codecs used today: https://www.rowetel.com/?p=5520

  • bcaa7f3a8bbc 2125 days ago
    The article fails to mention the original reason why Codec2 was invented.

    In digital amateur radio communication, the most widely used codec at present is AMBE. But AMBE is a proprietary codec: covered by patents, unhackable, the antithesis of amateur radio. Codec2 was born to bring freedom to digital amateur radio communication, and it is technically even better than AMBE.

    • boomlinde 2124 days ago
      The article does mention why Codec2 was invented, under "Background".
    • a1k0n 2124 days ago
      FWIW the main AMBE patent expired in December, but I was always surprised hams chose to use it.
  • MrRadar 2125 days ago
    Codec2 is also fully open source and patent-free, in contrast to virtually every other ultra-low-bitrate voice codec (which are proprietary and have expensive patent licensing attached). He has a Patreon if you want to support him in the ongoing development of Codec2 and his SDR modems to enable use of it in amateur radio: https://www.patreon.com/drowe67
    • gwern 2125 days ago
      Codec2 might be patent-free, but Codec2 with a WaveNet decoder isn't because WaveNet (convolutional neural networks for generating audio sequence data) is patented: https://patents.justia.com/patent/20180075343
      • merinowool 2125 days ago
        When was it patented? When I was working with AI about 15 years ago, I was experimenting with convolutional neural networks to generate audio. I wouldn't have expected this to be patentable, as it is such an obvious thing to do. It is like patenting 2+2=4 once you discover numbers.
        • ahoka 2125 days ago
          > It is like patenting 2+2=4 once you discover numbers.

          Welcome to software patents.

        • andai 2124 days ago
          [Serious question] Does your prior art invalidate the patent?
          • merinowool 2124 days ago
            I am not a scientist; I was just very interested in that space, and it would be a long way from my experiments to a scientific paper. Since patent law was created for the privileged to reap profits, I wouldn't stand a chance contesting it.
        • zouhair 2124 days ago
          Isn't that specifically what software patents are? Pythagoras could have become a billionaire in his time and don't get me started on Al Khwarizmi.
      • triangleman 2125 days ago
        raises hand

        Question for IP experts: now that I have heard of the existence of WaveNet and a rough idea of how it works (training a neural network to decode low-bitrate speech data with as much fidelity as possible to the original), would I be prohibited from selling a similar product built with the same technique? How about if I had never heard of WaveNet and went about doing the same thing?

        • matt4077 2125 days ago
          Yes, independent implementations of patented works are covered by the patent.

          BUT: patents are far more specific than just "a neural network to decode low-bitrate speech data with as much fidelity as possible to the original". Starting with that goal, you are unlikely to recreate WaveNet's specific patented structure.

          In fact, WaveNet describes a more general method to efficiently work with sound signals, somewhat comparable to convolutions for images. It's also not impossible to work with sound using alternative NN structures that are not patented, which might actually perform better than WaveNet.

        • c3534l 2125 days ago
          WaveNet is actually a bit more complicated than that. But it's still probably recreatable if you read the paper.
        • particleman2 2125 days ago
          Why do you hate me?
    • sp332 2125 days ago
      Is Speex also in that category?
      • MrRadar 2125 days ago
        Speex and Opus bottom out around 6000-8000 bps. Codec2 starts at 3200 bps and goes down to 700 bps. The original target use for Codec2 is real-time transmission in the HF (shortwave) and VHF/UHF amateur radio bands where those are about as much as you can transmit within the same bandwidth as analog voice modes once you factor in error correction.
      • voxadam 2125 days ago
        Speex has actually been superseded by Opus. Both are patent free as well.
  • corruptio 2125 days ago
    Having grown accustomed to MP3 artifacts, it's strange to hear artifacts that sound natural but just aren't quite right. More specifically, in the male voice sample "sold about seventy-seven", I heard "sold about sethenty-seven".
    • jakobegger 2125 days ago
      Yes, and "certificates" sounds like "certiticates".

      Reminds me of a story about a copying machine with an image compression algorithm for scans that changed some numbers on the scanned page to make the compressed image smaller. (Can't remember where I read about that; must have been a couple of years ago on HN.)

      • raphlinus 2125 days ago
        It's the lossy jbig2 compression in Xerox copiers: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

        And yes, I think this is a relevant comparison. As the entropy model becomes more sophisticated, errors are more likely to be plausible texts with different meaning, and less likely to be degraded in ways that human processing can intuitively detect and compensate for.

        • misnome 2125 days ago
          > It's the lossy jbig2 compression in Xerox copiers: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_....

          My understanding of this fault was that it was a bug in their implementation of JBIG2, not the actual compression? Linked article seems to support this.

          • raphlinus 2125 days ago
            I think it was just overly aggressive settings of compression parameters. I don't see any evidence that the jbig2 compressor was implemented incorrectly. Source: [1]

            [1]: https://www.xerox.com/assets/pdf/ScanningQAincludingAppendix...

            • jay-anderson 2125 days ago
              Right. JBIG2 supports lossless compression. I'm not very familiar with the bug, but it could have been a setting somewhere in the scanner/copier that switched it to lossy compression. Or lossy compression was on by default, or misconfigured some other way (probably a bad idea for text documents either way).
              • namibj 2125 days ago
                The bad thing was that it used lossy compression when copying. That was the problem.
                • gsich 2125 days ago
                  No. The bug occurred when using the "Scan to PDF" function. It happened on all quality settings. Copying (scanning+printing in one step, no PDF) was not affected.
          • Dylan16807 2125 days ago
            No compression system in the world forces you to share parts of the image that shouldn't be shared. So that's true in a vacuous sense.

            But the nature of the algorithm means that you have this danger by default. So it's fair to put some blame there.

        • c3534l 2125 days ago
          This is a big rabbit hole of issues I'd never even considered before. Should we strive to hide our mistakes by making our best guess, or make a guess that, if wrong, is easy to detect?
        • tinus_hn 2125 days ago
          The algorithm detected similar patterns and replaced them with references. This led to characters being changed into similar-looking characters that also appeared on the page.
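
          That mechanism can be sketched as a toy in Python (an illustration of the failure mode, not the real JBIG2 algorithm; the bitmaps and threshold are invented):

```python
# Toy sketch of JBIG2-style symbol substitution (not the real
# algorithm): each glyph bitmap either becomes a new table entry or a
# reference to an earlier entry that looks "similar enough". With a
# threshold that is too loose, distinct characters collapse into one.

def hamming(a, b):
    """Count differing pixels between two equally sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def compress(glyphs, threshold):
    """Return (symbol_table, references). Lossy when threshold > 0."""
    table, refs = [], []
    for g in glyphs:
        for i, sym in enumerate(table):
            if hamming(g, sym) <= threshold:
                refs.append(i)          # reuse an existing symbol
                break
        else:
            table.append(g)             # store a new symbol
            refs.append(len(table) - 1)
    return table, refs

# Two 3x5 "digits" (flattened bitmaps) that differ in a single pixel.
six   = (1,1,1, 1,0,0, 1,1,1, 1,0,1, 1,1,1)
eight = (1,1,1, 1,0,1, 1,1,1, 1,0,1, 1,1,1)

strict_table, strict_refs = compress([six, eight], threshold=0)
loose_table,  loose_refs  = compress([six, eight], threshold=2)
```

          With `threshold=0` each digit keeps its own table entry; with the looser threshold both references point at the same stored glyph, which is exactly how one digit can print as another.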
      • eboyjr 2125 days ago
        Xerox copier flaw changes numbers in scanned docs: https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...
    • mrob 2125 days ago
      If we're abandoning accurate reproduction of sound and just making up anything that sounds plausible, there's already a far more efficient codec: plain text.

      Assuming 150 wpm and an average of 2 bytes per word (with lossless compression), we get about 5 bytes per second, or 40 bps, which makes 2400 bps look much less impressive. Add some markup for prosody and it will still be far lower.

      This codec also has the great advantage that you can turn off the speech synthesis and just read it, which is much more convenient than listening to a linear sound file.
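
      The arithmetic, spelled out (mind the units: 5 bytes per second is 40 bits per second):

```python
# Back-of-envelope bitrate for "plain text as a speech codec".
words_per_minute = 150
bytes_per_word = 2                      # assumes lossless text compression

bytes_per_second = words_per_minute * bytes_per_word / 60   # 5.0
bits_per_second = bytes_per_second * 8                      # 40.0

# Codec2's 2400 bps mode still uses ~60x more bits than plain text.
ratio = 2400 / bits_per_second
```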

      • rahimnathwani 2125 days ago
        That codec sounds great, if it exists.

        If you have such a codec, it would be worth testing the word error rate on a long sample of audio. e.g. take a few hours of call centre recordings, pass them through each of {your codec, codec2}, and then have a human transcribe each of:

        - the original recording

        - the audio output from your proposed codec (which presumably does STT followed by TTS)

        - the audio output from CODEC2 at 2048

        Based on the current state of open-source single-language STT models, I would imagine that Codec2 would be much closer to the original. And if the input audio contains two or more languages, I can't imagine the output of your codec being useful at all.
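
        The scoring step of such a comparison is usually word error rate. A minimal sketch (standard word-level Levenshtein distance; the sample strings just reuse the "sethenty" artifact mentioned elsewhere in this thread):

```python
# Word error rate: word-level edit distance (substitutions +
# insertions + deletions) divided by the reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("sold about seventy seven", "sold about sethenty seven"))  # 0.25
```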

      • peterbmarks 2125 days ago
        Speech-to-text is certainly getting better, but it makes mistakes. If the transcribed text were sent over the link and text-to-speech spoke it at the other end, you'd lose one of the great things about codec2: the voice that comes out is recognisable, as it sounds a bit like the person.

        A few of us have a contact on Sunday mornings here in Eastern Australia and it's amazing how the ear gets used to the sound and it quickly becomes quite listenable and easy to understand.

        • andai 2124 days ago
          Could you elaborate on "a contact"?

          Are you using Codec2 over radio?

          • baobrien 2124 days ago
            Yeah, the main use case for codec2 right now is over ham radio. David Rowe, along with a few others, also developed a couple of modems and a GUI program[1]. On Sunday mornings, around 10AM, they do a broadcast of something from the WIA and answer callbacks.

            [1] - https://freedv.org/

      • akvadrako 2125 days ago
      What you might be able to do is use the text codec as the first pass, then augment the audio with Codec2 or similar to capture the extra information (inflections, accent, etc.), for something in between 2 and 700 bps.
    • toadworrier 2125 days ago
      One of the very few things I know about audio codecs is that they at least implicitly embody a "psychoacoustic model". The "psycho" is crucial, because the human mind is the standard that tells us what we can afford to throw away.

      So a codec that aggressively throws away data but still gets good results must somehow embody sophisticated facts about what human minds really care about. Hence "artifacts that are natural".
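
      A crude caricature of that idea in Python: transform a frame, keep only the strongest frequency components, and throw the rest away. Real codecs use an actual masking model rather than raw magnitude, and the tone frequencies below are invented, but the principle of spending bits only where the ear will notice is the same.

```python
import cmath
import math

# Naive DFT/IDFT so the example stays dependency-free.
def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def sparsify(frame, keep):
    """Keep only the `keep` strongest frequency bins of a frame."""
    X = dft(frame)
    loudest = sorted(range(len(X)), key=lambda k: -abs(X[k]))[:keep]
    return idft([X[k] if k in loudest else 0 for k in range(len(X))])

# One loud tone plus one barely audible tone: only the loud one survives.
n = 64
frame = [math.sin(2 * math.pi * 5 * t / n)
         + 0.01 * math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
approx = sparsify(frame, keep=2)   # bin 5 and its mirror, bin n - 5
```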

    • apricot 2125 days ago
      I thought the same thing. Compression artifacts that don't sound like compression artifacts could lead to hard-to-detect mistakes.
    • userbinator 2125 days ago
      I found the artifacts odd too. It sounds like the guy speaking has a bad cold or allergy and stuffed-up sinuses.
    • boomlinde 2123 days ago
      I'd love to hear how it sounds with a 700bps stream.
    • tripzilch 2124 days ago
      Yes, I heard the same artifacts!

      In the normal codec2 decoding it sounds like "seventy" but muffled and crunchy.

      In the wavenet decoding, the voice sounds clearly higher quality and crisp, but the word sounds more like "suthenty". And not because the audio quality makes it ambiguous but it sounds like it's very deliberately pronouncing "suthenty".

      It's as if in trying to enhance and crisp up the sound, it corrected in the wrong direction. It sounds like the compressed data that would otherwise code for a muffled and indistinct "seventy", was interpreted by wavenet but "misheard" in a sense. When wavenet reconstructs the speech, it confidently outputs a much clearer/crisper voice, except it locks onto the wrong speech sounds.

      With the standard "muffled/crunchy" decoding, a listener can sort of "hear" this uncertainty. The speech sound is "clearly" indistinct, and we're prompted to do our own correction (in our heads), but also knowing it might be wrong. When the machine learning net does this correction for us, we don't get the additional information of how its guess is uncertain.

      This is exactly the sort of artifact I'd expect from this kind of system. As soon as I heard the ridiculously good, crisp audio quality of the wavenet decoder, I knew that fidelity just isn't present in the encoded bits; that's impossible. It's a great accomplishment and genuinely impressive, but it has to "make up" some of those details, in a sense very similar to image super-resolution algorithms.

      I'm just thinking we should perhaps be careful not to get into a situation like the children's "telephone" game, if for some reason the speech gets re-encoded and re-decoded more than once. That is of course bad practice, but even if it happens by accident, the wavenet will decode into confident, crisp audio, so it may be hard to notice if you don't expect it.

      If audio is encoded and decoded a few times, it's possible that the wavenet will in fact amplify misheard speech sounds into radically different speech sounds, syllables or even words, changing the meaning. Kind of like the "deep dreaming" networks. Sounds like a particularly bad idea for encoding audio books, because small flourishes in wording really can matter.

      Edit: I just realised that repeated re-encoding and decoding can in fact happen quite easily if this codec is ever implemented and used in real-world phone networks. Many networks use different codecs, and re-encoding simply has to be done if audio is to pass through a particular network.

      But the whole thing is ridiculously cool regardless :) And I wonder if they can improve on this problem.

  • Ambroos 2125 days ago
    That is very impressive! I wonder if a WaveNet decoder could be built for phone calls, as those still sound awful. If it's possible to do this only on the decoder side you don't have to wait for your network to start supporting HD voice or VoLTE to get better quality audio!
    • skykooler 2125 days ago
      I'm curious how fast the WaveNet decoder is. The last time I saw an article on it, it took multiple minutes to generate a second of audio.
      • gwern 2125 days ago
        The original WaveNet repeated a lot of computations; with caching/dynamic programming, it became a lot faster. Other optimizations were also doable. In any case, that was eventually made moot by using model distillation to train a wide flat (not deep) NN, which is 20x realtime: https://deepmind.com/blog/high-fidelity-speech-synthesis-wav... (This was necessary to make it cost-effective to deploy onto Google Assistant.)
    • IshKebab 2125 days ago
      Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent. It makes a huge difference. Unfortunately the place where you really want good quality is call centres - it's often hard to hear people and half of the reason is the shitty POTS quality - and call centres will probably get HDVoice in about 40-50 years. Maybe.

      Edit: nm should have read all of your comment before replying!

      • reaperducer 2125 days ago
        >Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent

        Can confirm. I spend a lot of time in fringe reception areas, but every now and then I get a good, strong signal and the HD Voice kicks in between my iPhone and my wife's and it sounds like she's standing right next to me. It really is something to experience, especially if the previous phone call was over regular tech.

        Back when AT&T was running the "You get what you pay for" ads to combat Sprint and MCI, it had a service you could sign up for that would give your landline phone calls amazing quality.

        Sadly, a majority of people would rather pay less for crap than pay more for quality; even back then.

        • namibj 2125 days ago
          Also why no one really appreciated ISDN over here in central Europe. Yes, there are ways to do better _now_, and it would have been trivial to support channels with better codecs by negotiating something other than u-law 8 kHz PCM, but back then ISDN delivered rather good quality. The issue was that few people got ISDN phones, so they used the analog outputs on an adapter device, which was later incorporated into the internet router, which at some point switched from ISDN to VoIP. And people plug a phone into the router via an analog jack, instead of using a VoIP-capable phone or anything digital at all. While many do use DECT cordless phones, those rarely use the DECT hardware inside the router; instead they use the one in the charging dock, which itself connects to the VoIP router via an analog, POTS-bandpass-filtered phone jack.

          Oh well, we will probably never get that kind of quality, which under congestion is only possible with QoS on the whole path. That is the one thing something like rocket.chat and discord can't provide.

          Edit: the way to do this is to force quality upon people wherever it won't drive them away with the cost it incurs. That way people will associate your brand as a whole with quality; i.e., in this case, people will associate AT&T with quality, not "AT&T premium". Normal people do not even know what kind of plan they are on, except for about an hour before and after they sign the contract.

          • gsich 2125 days ago
            >Also why no one really appreciated ISDN over here in central Europe.

            Except Germany. Which had probably the best telephone system in the world when they deployed ISDN nationwide.

            • namibj 2125 days ago
              I am speaking as a German. Sure, larger businesses used proper ISDN, but your uncle or your mom didn't. The best you could hope for there was DECT compression, a.k.a. ADPCM at 4 bits/8 kHz.
              • gsich 2125 days ago
                Sure they did. The adoption rate was about 30%, possibly even higher at its peak.

                The DECT codec is useless if the call is transmitted over analog lines, as it gets converted to standard 3.4 kHz quality anyway. Except for in-house calls, of course.

          • bewo001 2124 days ago
            The problem with landline voice quality is that people expect a landline phone / VoIP adapter to cost around 20 dollars or euros. At this price point you can't have fancy codecs and audio hardware that delivers a decent signal. Bad audio hardware with a good codec can actually decrease audio quality (the mic's noise no longer gets filtered, as it is with 8 kHz PCM).
        • heywire 2125 days ago
          I think it was AT&T that had a test number you could call to hear a higher quality phone call (I was pretty young at the time, so my memory is fuzzy). I remember it sounding very good, but that test number was the only time I remember hearing that quality over the POTS. The VoLTE and HD voice I occasionally get on my iPhone reminds me of that system.
      • gsich 2125 days ago
        What do you mean by "HDVoice"? On landline connections this usually means G.722. G.711 u-law/a-law is definitely not "HD".
        • ahoka 2125 days ago
          He probably means Adaptive Multi-Rate Wideband (AMR-WB) [1], a.k.a. G.722.2. There is a common misconception that it is VoLTE-only, but it actually works pretty well on 3G too. It is night and day compared to legacy codecs.

          1. https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_Wideband

        • IshKebab 2125 days ago
          I don't know what technology it is specifically, but it's a brand name they used for actual high-quality calls. Think 128 kbps MP3, rather than the standard cups-and-string quality.

          It only seems to work on mobile.

          • gsich 2124 days ago
            I know the difference; I've used G.722. On mobile it's G.722.2, a totally different codec, but with the same ~7 kHz range.

            But there were some companies that advertised a lower frequency range as "HD".

        • namibj 2125 days ago
          I assume the usual wideband codecs used with VoLTE.
    • dredmorbius 2125 days ago
      Chaining codecs arbitrarily tends to create really bad artifacting. Current cell and VOIP systems already utilise multiple compression algos.
  • childintime 2125 days ago
    Everything spoken in a whole life could fit on a 128GB pendrive (assuming 5% talk time). Astounding.
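
    The arithmetic holds up at Codec2's 700 bps floor (assuming an 80-year life and the 5% talk time above):

```python
# A lifetime of speech at Codec2's lowest mode, 700 bits per second.
seconds_alive = 80 * 365.25 * 24 * 3600      # ~2.5 billion seconds
seconds_talking = seconds_alive * 0.05       # talking 5% of the time

gigabytes = seconds_talking * 700 / 8 / 1e9  # ~11 GB, well under 128 GB
```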
    • JetSpiegel 2125 days ago
      Black Mirror is now technically possible.
  • tommoor 2125 days ago
    Make sure you get to the end and listen to the WaveNet samples, amazing stuff.
  • ksec 2124 days ago
    Let's say we have Codec2 with WaveNet; its 3.2 kbps now performs similarly to maybe 16 kbps EVS. (EVS being the codec used in VoLTE, which is slightly better than even Opus for speech.)

    What "value" / "uses" does this bring us?

    It can't be used for podcasts because, as shown, it isn't very good with music, and many podcasts have music in them.

    While Codec2 with WaveNet can give a 2-4x reduction in bitrate, I can't think of an application that benefits from this immediately.

    The other thing I keep turning over in my mind is convolutional neural networks applied to codecs in general: music, movies, etc. What sort of benefits would they bring?

    • perlgeek 2124 days ago
      > What "value" / "uses" does this bring us?

      Maybe not too much for "us" with LTE and 128 GB of storage on our phones, but in cases of low bandwidth (think digital police radio), or when you have little storage available, that's really awesome.

    • krab 2124 days ago
      If you were recording a huge number of phone calls, such a size reduction might bring significant savings.
  • mmastrac 2125 days ago
    Seriously impressive and game-changing results, especially when you take Wavenet into account. I'm curious to see how Wavenet would perform w/Opus.
  • sbr464 2125 days ago
    I've become almost entranced with the concept of comparing things to the size of a floppy disk. I'm actually planning to get a tattoo of one on my right forearm. I've been working on a large business management platform for the last couple of years and noticed that after investing $500k (salaries/etc.) and building a huge amount of functionality, the frontend and backend codebases are still under 1.5 MB. Pretty amazing.
    • calabin 2125 days ago
      I actually got a floppy disk tattoo on my foot in a moment of spontaneity (bottomless mimosas). https://imgur.com/a/slCG519
      • sbr464 2125 days ago
        nice haha
    • p1mrx 2125 days ago
      When we had a 486 running Windows 95, I used to convert CDs to WAV for fun. The GSM 6.10 codec in Sound Recorder (22050 Hz, Mono, 4 KB/s) could fit about 1 song onto a floppy.
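
      The numbers check out: at Sound Recorder's ~4 KB/s, a high-density floppy holds about six minutes, i.e. roughly one song.

```python
# How much 4 KB/s GSM 6.10 audio fits on a high-density floppy?
floppy_bytes = 1474560          # the "1.44 MB" format is 1440 * 1024 bytes
bytes_per_second = 4096         # ~4 KB/s

minutes = floppy_bytes / bytes_per_second / 60   # 6.0
```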
  • jancsika 2125 days ago
    Would be a fun experiment to use something like three, or even one, sine waves to get unintelligible speech, but then pair it with subtitles where each syllable of the text is animated in sync with the speech. (Like the "follow the bouncing ball" song-lyric animations.)

    By pairing the audio with the text, you would almost certainly convince the listener that they can understand it.

    Edit: typo

    • carapace 2125 days ago
      ;-)

      Sine-Wave Speech Demonstration https://youtu.be/EWzt1bI8AZ0?t=74

      > Sine-wave speech is an intelligible synthetic acoustic signal composed of three or four time-varying sinusoids. Together, these few sinusoids replicate the estimated frequency and amplitude pattern of the resonance peaks of a natural utterance (Remez et al., 1981). The intelligibility of sine-wave speech, stripped of the acoustic constituents of natural speech, cannot depend on simple recognition of familiar momentary acoustic correlates of phonemes. In consequence, proof of the intelligibility of such signals refutes many descriptions of speech perception that feature canonical acoustic cues to phonemes. The perception of the linguistic properties of sine-wave speech is said to depend instead on sensitivity to acoustic modulation independent of the elements composing the signal and their specific auditory effects.

      ~ http://www.scholarpedia.org/article/Sine-wave_speech
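
      A minimal synthesizer in the spirit of that description: a few sinusoids whose frequencies and amplitudes follow formant tracks. The tracks below are invented for illustration; a real system would estimate them from a recorded utterance.

```python
import math

RATE = 8000  # samples per second

def synthesize(tracks, duration):
    """tracks: list of (freq_fn, amp_fn) pairs, each a function of time
    in seconds, one pair per sinusoid (typically 3-4 for sine-wave
    speech). Returns a list of float samples."""
    n = int(duration * RATE)
    phases = [0.0] * len(tracks)
    out = []
    for i in range(n):
        t = i / RATE
        sample = 0.0
        for j, (freq, amp) in enumerate(tracks):
            phases[j] += 2 * math.pi * freq(t) / RATE  # phase accumulation
            sample += amp(t) * math.sin(phases[j])
        out.append(sample)
    return out

# Three "formants" gliding over half a second (made-up trajectories).
tracks = [
    (lambda t: 500 + 200 * t, lambda t: 1.0),    # F1
    (lambda t: 1500 - 300 * t, lambda t: 0.5),   # F2
    (lambda t: 2500 + 100 * t, lambda t: 0.25),  # F3
]
audio = synthesize(tracks, 0.5)
```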

      • andai 2124 days ago
        To anyone who listens to this, I recommend rewinding to the segment starting at 1:23 a few times and not letting it reach the spoilers. After a few rounds, my brain adjusted to the distortion and I could make it out perfectly, without ever hearing the original.
      • tototomtoboro 2124 days ago
        Wow, this is amazing. After listening to this a couple of times, the voice became super clear.
    • bmdavi3 2125 days ago
      Or what if you scrunched the audio down to a bandwidth beyond what was still intelligible, but still captured some semblance of the speaker's voice. Use the original audio to compute subtitles and store them alongside the audio. That's your file.

      Then the player uses both as inputs to ai (some hand waving), which now has enough to put the pieces together and produce something intelligible again, in the speaker's voice.

      • gruturo 2125 days ago
        Basically turn the speaker's voice into a "font", and then render text with it. Pretty sure it's been done. Large initial delay while you download the whole "font", then basically just the text to be rendered and the occasional hint to the renderer.
      • rspeer 2125 days ago
        Without the "intelligible" part, this makes me think of what the game Celeste does to give its characters voices without voice acting.

        They make voice-like synth sounds, different for each character, that are about the length of the text they're saying. It adds prosody and intonation to the text-based dialogue of the game.

        https://youtu.be/TZpQH8kSWNU?t=2m50s

        • andai 2124 days ago
          This is how I imagine an intelligent car would sound if it figured out how to produce speech through the antigravity engine.

          Edit: oh sweet there's intonation too. Were these all made manually?

          • rspeer 2124 days ago
            My guess is they're mostly procedurally generated with manual tweaks for particularly significant lines.
  • mwcampbell 2125 days ago
    The WaveNet demos are indeed impressive. But I wonder if the WaveNet decoder needed to be trained for those specific voices.
  • _emacsomancer_ 2125 days ago
    On a related note, I wish more (any!) podcasts were distributed in opus.
    • geofft 2125 days ago
      As far as I know, enough podcast apps require MP3 (and not even VBR!) that you have to use MP3, and you can't have multiple <enclosure>s, so how would you do this? A separate RSS feed for Opus, linked only on the website and not submitted to aggregators?
      • CharlesW 2125 days ago
        > As far as I know, enough podcast apps require MP3 (and not even VBR!) that you have to use MP3…

        Nope! Podcast episodes can be encoded using AAC (which is as ubiquitous as MP3) without issue.

        That won't realistically be possible with Opus until Opus hardware decoding has been available in mobile devices for 5-10 years.

        • Hello71 2125 days ago
          I highly doubt there are any devices that are capable of accessing the modern web, with all its JavaScript bloat, yet cannot decode a simple audio codec. Even when Apple was installing AAC hardware decoders, they were already almost obsolete by modern embedded CPU development (especially the rise of medium-power ARM SoCs). I highly doubt any devices released in the past 5 years have any sort of fixed-function audio decoder. Maybe an encoder, possibly some general-purpose DSPs, but not a format-specific decoder.
          • floatboth 2125 days ago
            Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.

            The N-Gage QD removed the MP3 decoder that was present in the original model. And you could install a software player, and it would struggle with bitrates above 128kbps :D

            Modern phones can decode video in software (sucks for battery life, and framerate/resolution are more limited than with hardware, but it's possible). Audio is nothing for them.

            • CharlesW 2124 days ago
              > Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.

              I guess it's irrelevant, then, unless you care how long your phone can go on a charge. Plus, low-power/low-CPU requirements are an order of magnitude more critical in devices like smartwatches.

          • andai 2125 days ago
            I use Opus on my phone all the time. I was in a place where internet was really expensive, so I'd download and convert things on a Linux server.

            In conjunction with youtube-dl I could listen to pretty much anything I wanted, using almost no data.

            These days I use it mostly for audiobooks, if storage is limited.

            • _emacsomancer_ 2125 days ago
              Opus is awesome for audiobooks at 24 kbps (one could probably go even lower) and for music at 96 kbps. I don't hear any difference in quality. It makes a big difference for my mobile, which is limited to 128 GB.
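
              For a sense of scale (plain arithmetic, nothing codec-specific): at 24 kbps, an hour of audio is about 10.8 MB, so 128 GB holds on the order of 11,000+ hours of audiobooks.

```python
# Storage arithmetic for 24 kbps Opus audiobooks.
kbps = 24
bytes_per_hour = kbps * 1000 / 8 * 3600   # 10.8 MB per hour

hours_in_128_gb = 128e9 / bytes_per_hour  # ~11,850 hours
```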
              • andai 2125 days ago
                Man, that's a big collection! Though to be fair, the 160 GB iPod came out like ten years ago, and I thought we'd have advanced a bit more in that department by now. (Imagine an iPod, but instead of the spinny hard drive it's all microSD! There's just no market for it, I guess.)
        • sitkack 2125 days ago
          Distribute the podcast as an HTML file with a WASM-based decoder: whole file, self-contained, with either a byte stream out or play/pause.
          • yoklov 2125 days ago
            Ignoring other issues, this will have rather poor power usage, which is especially relevant given how many people listen to podcasts on mobile devices.
        • digi_owl 2125 days ago
          Or more like never unless you get Apple to front it...
      • _emacsomancer_ 2125 days ago
        Presumably a separate RSS feed. There are podcasts that have separate Ogg RSS feeds.
    • iagooar 2125 days ago
      All 2,000+ podcasts hosted on Podigee (a podcast hosting company mainly known in German-speaking countries) are distributed in Opus. But it is, and probably always will be, a rather niche distribution format. AAC had its moment, but MP3 is alive and kicking. Even Apple acknowledges its importance by adding support for chapter markers in iOS 12.
      • CharlesW 2125 days ago
        > AAC had its moment…

        That moment is 15 years in with no signs of losing steam[1]. AAC effectively replaced MP3 for most online audio use cases, with podcasting as a notable exception[2]. And of course, AAC is the audio format for basically all online video distribution.

        [1] Apple kicked off the transition in 2003 with the introduction of AAC-based digital music sales.

        [2] Because podcasting is a decentralized medium, and the vast majority of podcasters don't know much (if anything) about media encoding.

        • chungy 2124 days ago
          Perceptions are probably influenced heavily by your own usage and places of consumption. I'm also in the camp of "AAC is very rare, if present at all"...

          Considering also that YouTube uses WebM, which very explicitly is only Vorbis or Opus for audio, "basically all online video distribution" must exclude the web's most popular video distribution site...

          • CharlesW 2123 days ago
            Every YouTube video has had AAC audio from the very beginning. Same goes for every Vimeo video, every Netflix video, every Hulu video, etc. Streaming audio services like Pandora use AAC too.

            That's because AAC is the only format you can count on to work on all devices, and to be hardware decoded on all devices where battery life matters.

      • zdw 2125 days ago
        There are a lot of playback issues: VBR MP3 on older OS releases of both iOS and Android, not to mention car players and the like, all contribute to the problem.

        The post-show of this podcast talks about these and other issues in detail - Marco is on both sides of the issue as a podcast producer and podcast app developer: http://atp.fm/episodes/182

        • CharlesW 2124 days ago
          If you're a podcaster, another benefit of AAC over MP3 is that VBR is not an issue.
  • WhiteNoiz3 2125 days ago
    The Wavenet stuff sounds great, but I'm curious how big the model is. The audio files may be tiny, but you may need a huge neural network to decode them.
  • Apocryphon 2125 days ago
    "The man behind it, David Rowe, is an electronic engineer currently living in South Australia. He started the project in September 2009, with the main aim of improving low-cost radio communication for people living in remote areas of the world. With this in mind, he set out to develop a codec that would significantly reduce file sizes and the bandwidth required when streaming."

    What do you know, it's sort of like Pied Piper without the magical compression or cloud handwaving.

    • LeonM 2125 days ago
      I've been reading David Rowe's blog [0] since 2008, and there are some other really interesting projects and products on it. One of my favorites back then was his home-built electric car.

      [0] https://www.rowetel.com/

    • cyounkins 2124 days ago
      To anyone else confused, I think the statement refers to http://www.piedpiper.com/ and not the folk legend.
  • codedokode 2125 days ago
    I noticed that when you listen to the compressed audio, at first you hear the unnaturalness of the voice and some clicks (probably where one frame's ending doesn't match the next frame's start). But within a few seconds you adapt, and the voice sounds pretty clear.

    It is impressive how far one can compress speech.
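
    Indeed, the headline claim is easy to sanity-check. A back-of-envelope sketch (using Codec2's standard mode bitrates per its docs; container overhead ignored, floppy capacity taken as the usual 1,474,560 bytes):

```python
# How much speech fits on a "1.44 MB" floppy at Codec2's standard bitrates.
FLOPPY_BYTES = 1_474_560  # 3.5" HD floppy capacity

for bps in (3200, 2400, 1300, 700):
    seconds = FLOPPY_BYTES * 8 / bps
    print(f"{bps} bit/s: {seconds / 3600:.1f} hours")
```

    At the 700 bit/s mode, that's more than four and a half hours of speech on a single floppy.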

  • dredmorbius 2125 days ago
    I read, and listen, to this, and am impressed.

    Then I think of the possible negative applications.

    A nation of 100m people, talking an hour per day on the phone or another audio channel, could be stored in 100m * 365 * 1.5 MB of storage annually: 54 PB.

    In raw storage, that's less than $2 million. Far below a nation-state actor's budget.
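
    The arithmetic checks out. A quick sketch, assuming the 3200 bit/s Codec2 mode, which works out to roughly the 1.5 MB per speaker-hour figure above:

```python
# Back-of-envelope storage estimate for mass retention of Codec2-compressed speech.
BITRATE_BPS = 3200                          # Codec2 3200 mode (assumption)
bytes_per_hour = BITRATE_BPS * 3600 / 8     # ~1.44 MB per speaker-hour

speakers, days = 100_000_000, 365
total_pb = speakers * days * bytes_per_hour / 1e15
print(f"{bytes_per_hour / 1e6:.2f} MB/hour -> {total_pb:.0f} PB/year")
```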

  • samps 2124 days ago
    > However, where it starts to get more interesting is the work done by W. Bastiaan Kleijn from Cornell University Library.

    The authors are not from Cornell. I think the author made this mistake because the paper is posted on arXiv, and that's what it says at the top of every page?

  • mr_donk 2125 days ago
    This is amazing! With this codec and enough processing power, you could do this bidirectionally and have enough bandwidth to stream a two way realtime voice chat using 2400bps modems over a standard analog phone line!!! ... Oh... Wait a minute...
  • bitwize 2125 days ago
    The plain Codec2 decoder sounds like a TI-99/4A (and works on somewhat similar principles). If I hook a TI-99/4A to the WaveNet decoder, will it sound natural?
  • gigatexal 2125 days ago
    Buy this guy a beer. What a feat!
  • hatsunearu 2125 days ago
    Side note: I'm still waiting for an open source, cheap way to do FreeDV/Codec2 on VHF either with a dongle that goes between a raspi/SBC or a laptop and a cheap ass radio like a baofeng, or an inexpensive radio with Codec2 support.
    • baobrien 2125 days ago
      I think 2400B support is coming to the FreeDV GUI soon. I've seen some work done on that. That'll let you use a cheap FM radio and a laptop to get on the air with something codec2 based. I'm slowly chipping away at a TDMA mode for SDRs, but that's still probably a ways off.
  • madengr 2124 days ago
    Would be interesting to combine this Codec2 with LoRa modulation. Of course the latter is patented, but it combines both chirped and direct sequence spread spectrum to yield some very resilient modulation.
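
    Whether LoRa has the throughput depends heavily on the spreading factor. A rough feasibility sketch using the standard LoRa raw data-rate formula Rb = SF * (BW / 2^SF) * CR (figures illustrative; real links add preamble and header overhead):

```python
# Which LoRa settings (125 kHz bandwidth, 4/5 coding rate) could carry
# Codec2's 700 bit/s mode? Uses the standard LoRa raw data-rate formula.
def lora_bitrate(sf, bw_hz, cr=4 / 5):
    return sf * (bw_hz / 2 ** sf) * cr

CODEC2_700 = 700  # bit/s, Codec2's lowest standard mode
for sf in range(7, 13):
    rb = lora_bitrate(sf, 125_000)
    verdict = "enough for 700 bit/s speech" if rb >= CODEC2_700 else "too slow"
    print(f"SF{sf}: {rb:.0f} bit/s ({verdict})")
```

    Only the faster settings (roughly SF7 through SF10 at 125 kHz) leave room for the codec, and the long symbol times at high spreading factors would add considerable latency.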
  • danschumann 2125 days ago
    "Enhance" - said every movie guy ever.
  • mockery 2125 days ago
    None of the audio samples play for me (In neither Chrome nor Edge... Other sites play just fine.)

    Makes it very hard to evaluate claims of codec quality, which seems like the primary purpose of the blog post. :(

    • BenjiWiebe 2125 days ago
      Even works in the embedded browser of the "Materialistic" HN reader app, on Android 5.1.
    • fuzzy2 2125 days ago
      Works for me on my iPad.
    • codetrotter 2125 days ago
      Works on iOS 11 for me but I had to press play and then wait a couple of seconds, press pause and then play again and wait another couple of seconds. Try that.
    • terramex 2125 days ago
      I confirm, it doesn't work for me either. Neither in Safari 56 nor Chrome 67 (macOS 10.13).
    • makapuf 2125 days ago
      Fine by me on Firefox/Linux.
    • yitchelle 2125 days ago
      There is an option to download the audio for offline listening.
    • sbierwagen 2125 days ago
      Working on Chrome 67 on Win 10.
    • chapium 2125 days ago
      I have no problems in Chrome.
    • jimnotgym 2125 days ago
      Working on Firefox Android
      • S3raph 2125 days ago
        confirm, works fine on latest Firefox stable Android
    • 0x0 2125 days ago
      Works in Chrome for macOS