After your comment, I muted the audio when the announcer told the audience what the computer would say. I couldn't get it straight away, but when they emphasised different words in the sentence I could usually tell what was being said: "Greetings everybody!"
Very impressive indeed. I dug around and found his tweet: "uses a neural network I trained on choral pieces to generate the harmonization for your melody"
Man, shoot. I compose a bit myself and am really into vocal harmony and I was hoping that he'd hard-coded some theory rules.
I did notice that it seems to vary with the past 3 notes you've moved to. If you continue repeating the same pattern of jumps among the scale, you can replicate the chords the voices slide to.
Yeah, I think I'd find it useful. I'd love to be able to ape a few of the tricks that the four voices use to slide to pleasing chords. The music theory "rules" are certainly available, but only in the same sense that the rules of calculus are available somewhere.
A little while back I used this combined with a physics simulator to make a toy where you throw polygons in the air and they scream: https://ohgodwhathaveidone.stackblitz.io/
Code's here if anyone wants to play: https://stackblitz.com/edit/ohgodwhathaveidone – I did a fairly medium job of abstracting the synthesis engine away from the UI, but it might be a decent starting point if you're looking to make other Trombone-based web silliness.
This reminds me of the Sprechmaschine [1] ("speaking machine") built at the end of the 18th century by Wolfgang von Kempelen (the guy who built the original Mechanical Turk [2]). Here is a YouTube video showing it in action (for example, the machine says "Mama" around 1:14): https://www.youtube.com/watch?v=k_YUB_S6Gpo
Lest we forget that speech synthesis is not just for grotesque but amusing semi-real vocal synths like this, here's a BBC Radio 4 history of speech synthesis as an assistive technology - Klatt's Last Tapes, by Stephen Hawking's daughter, Lucy:
Man this brings back memories. We used this program for my linguistics homework in college. In 1996. Although I think it was an app then not a web page.
And also sometimes referred to as applications. And while the abbreviation “app” only became mainstream around the time of the first iPhone, the warez scene had a habit of referring to software applications as “appz” since way before then, which admittedly is not the exact same word as “app”, appz being an intentional misspelling of a pluralization of the abbreviation, but it’s close to it. They sure loved the letter z. Appz, crackz, mp3z, moviez, gamez, ebookz.
This way of writing gave people some unique words to query search engines for, making it easier to find warez sites, torrent trackers and ftp servers hosting pirated files, but I think it originated long before the web was born, even before any sort of networked computing existed.
Consider the word “phreaking” which was invented at the time of mechanical telephone switching systems. This word came from combining the words “phone” and “freaking”. I think that word could have inspired hackers to use ph in place of f in other words, and once you start making that substitution, other substitutions follow, like using z instead of s.
I dunno, I grew up in the 90s so there is a lot of hacker culture that precedes my time. What I do know is that a lot of early hacker culture spawned other subcultures, and that several influences of the origins remain central in these. For example, the demoscene.
How were you involved in the warez scene? How did things work, and how have they changed? Stories that are part of the history of the internet are always intriguing to me so I'm really interested in hearing about that.
No need. This thing is already a voice synthesizer. This is how modern synthesizers work, more or less: by generating a sine wave and then modifying it in the same way as the vocal tract does.
Not a sine wave. The vocal tract is subtractive: the larynx has to generate a waveform with lots of harmonics, some of which the vocal tract can then remove.
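To make the subtractive point concrete, here's a rough sketch in plain Python. It is not how Pink Trombone itself is implemented; the sample rate, the 100 Hz pitch, the 1/k harmonic rolloff and the single 700 Hz "formant" are all values I picked for illustration. A harmonically rich buzz goes through one resonant filter that leaves the harmonic at 700 Hz alone and attenuates the ones far away from it. A pure sine has nothing at 700 Hz for the filter to keep.

```python
import math

SAMPLE_RATE = 16000  # Hz; arbitrary choice for this sketch


def harmonic_source(f0_hz, n_harmonics, n_samples):
    """Sawtooth-like buzz: harmonic k of f0 has amplitude 1/k."""
    return [
        sum(
            math.sin(2 * math.pi * k * f0_hz * t / SAMPLE_RATE) / k
            for k in range(1, n_harmonics + 1)
        )
        for t in range(n_samples)
    ]


def resonator(signal, centre_hz, bandwidth_hz):
    """Two-pole resonant filter, a crude stand-in for one formant.

    The input is scaled so the gain at centre_hz is exactly 1; components
    far from the centre frequency come out weaker than they went in.
    """
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    theta = 2 * math.pi * centre_hz / SAMPLE_RATE
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    gain = (1 - r) * math.sqrt(1 - 2 * r * math.cos(2 * theta) + r * r)
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def amplitude_at(signal, freq_hz):
    """Amplitude of a single frequency component (one-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


N = 1600  # 0.1 s, a whole number of periods for every harmonic of 100 Hz
buzz = harmonic_source(100, 20, 2 * N)
sine = harmonic_source(100, 1, 2 * N)

# Skip the first N samples so the filter's start-up transient has died away.
shaped = resonator(buzz, 700, 100)[N:]

print("buzz   100 Hz:", amplitude_at(buzz[N:], 100), " 700 Hz:", amplitude_at(buzz[N:], 700))
print("shaped 100 Hz:", amplitude_at(shaped, 100), " 700 Hz:", amplitude_at(shaped, 700))
print("sine   700 Hz:", amplitude_at(sine[N:], 700))
```

A real vocal tract has several formants that move over time, so treat this as the smallest possible version of the idea rather than a voice model.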
Yeah, I know what you meant. You use a tool like the CMU pronunciation dictionary[1] to turn words into phonemes, and then you use a model similar to the pink trombone to turn the phoneme string into sound, including the transitions between different phones (which, it turns out, actually matter more than the phones themselves for making it understandable). This is how TTS works.
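The front half of that pipeline looks roughly like the sketch below. The two dictionary entries are typed out by hand in CMUdict's layout purely for illustration, and the function names are mine, not part of any library. It stops at the list of phones and the transitions between them, because turning those into tract shapes is the part you would still have to build yourself.

```python
import re

# Two entries typed out in CMUdict's "WORD  PHONE PHONE ..." layout for
# illustration; the real file is far larger and should be loaded instead.
SAMPLE_ENTRIES = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
"""


def parse_entries(text):
    """Map each word to its list of ARPAbet phones (stress digits kept)."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):  # blank or comment line
            continue
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon


def to_phones(sentence, lexicon):
    """Look up every word and drop the stress digits to get bare phones."""
    phones = []
    for word in re.findall(r"[A-Za-z']+", sentence):
        for phone in lexicon[word.upper()]:  # raises KeyError for unknown words
            phones.append(phone.rstrip("012"))
    return phones


def transitions(phones):
    """Adjacent phone pairs, i.e. the glides a synthesizer has to produce."""
    return list(zip(phones, phones[1:]))


lexicon = parse_entries(SAMPLE_ENTRIES)
phones = to_phones("Hello, world!", lexicon)
print(phones)  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(transitions(phones))
```

Words missing from the dictionary raise a KeyError here; a real system needs a letter-to-sound fallback for those.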
Reminds me of Xiph's Speex/CELP model of speech as a mix of noise and frequency to achieve high compression, requiring as little as 2.15 kilobits (roughly 270 bytes) per second. It sounds perceptibly similar to the original recording, even though the difference between the input and output sampled data may be high:
Thanks! That's like 80% of the way there. It looks to be missing a lot of state internal to the mouth (understandable given that it's targeting avatar lipsyncing), and appears to discretize the values somewhat, making it less useful for linguistics practice. But I bet the underlying technology could be adapted easily.
Wow! I was able to successfully recreate all sorts of letters and sounds just by imagining how my own mouth works, and then manipulating the different components on the pink trombone in the same way. I'm impressed!
I have actually used this very tool back in the day to help learn how to speak in a male or female voice. One of the five things I do is manipulation of my tongue to change the cavity of my mouth to make the space bigger (more masculine) or smaller (more feminine) which this tool demonstrates very well.
Edit: to be clear, I used to sound male 24/7 and now I sound female 24/7. Rather than thinking you are speaking male or female, it helps if you think you are playing a musical instrument with a number of controls that you control (with your mind whahahaha). Then it is just about learning what each control does and how to play them so you get the result you want.
Your voice is muscle memory so while at the start I had to actively "play" a female voice that is no longer the case and now if I ever want to "play" a male voice I have to actively think about how I am going to speak each word to make it male.
This is also exactly how male countertenor singers produce a female-sounding voice, by making the vocal cavity smaller to adjust the formants upward. If you don’t do that, it just sounds like a male head voice or falsetto.
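A back-of-the-envelope way to see why a smaller cavity pushes the formants up: treat the vocal tract as a uniform tube closed at the glottis and open at the lips, which resonates at odd multiples of a quarter wavelength, F_n = (2n - 1) * c / (4 * L). That is a crude textbook approximation that only fits a neutral vowel, and the tube lengths and speed of sound below are typical values I'm assuming, not measurements.

```python
SPEED_OF_SOUND = 350.0  # m/s; assumed value for warm, humid air


def tube_formants(length_m, count=3):
    """Resonances of a uniform tube closed at one end, open at the other."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND / (4 * length_m)
        for n in range(1, count + 1)
    ]


# Assumed effective tract lengths in metres, for illustration only.
longer = tube_formants(0.175)
shorter = tube_formants(0.145)

for name, formants in (("17.5 cm", longer), ("14.5 cm", shorter)):
    print(name, [round(f) for f in formants])
```

With these numbers the longer tube comes out at roughly 500, 1500 and 2500 Hz, and the shorter one starts at roughly 600 Hz, so shortening the tube raises every formant by the same ratio regardless of pitch.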
You can make it sound female by dragging the voicebox control to the right and down. It requires both higher pitch, and less power in the odd harmonics.
I agree it gets closer, but as with much of speech synthesis, to me it still ends up sounding like "a male talking in a high-pitched voice" and not "a female".
In addition, this would imply that only males can talk in a low register, which is patently false. Low register female voices are fairly common.
It has to do with formants. It's briefly explained in this video, which also demonstrates changing a female voice to male: https://youtu.be/nPAINeIGxMc.
Disable "always voice", and click a bit below hard palate, slightly to the right towards the lip (below the "at" in "palate"), so there will only be a small gap for the air to go through.
The model also lacks mouth width/lip shape, which is crucial for differentiating between Swedish vowels (for example i vs y, which have most of the mouth the same, except that y has the lips protruding and i is more of a smile).
What someone needs to do is put sensors in people's mouths, record them saying known phrases, then stick the sensor data+phrases into some AI and see if we can't get that Trombone talkin'!
That's nice and everything, but until I have a hardware version where I can just switch a button on and do this and that, I'm not really going to be truly satisfied. Please hardware'ify.
I'm fairly sure there is no other meaning than the one to which you refer, so it's perhaps more likely that you are missing the risqué point being made by assuming the prude levels are higher than you might think. A lot of the folks who make these kinds of hacks are perfectly fine with the obscure, obscene, perverse nature of their naming of things ... Those of the anthropological inclination may decide that, in fact, an obscene name for something like this is a requisite.
Like how in jargon.txt they used 69 as an example of a big number:
"69 adj. Large quantity. Usage: Exclusive to MIT-AI. "Go away, I have 69 things to do to DDT before worrying about fixing the bug in the phase of the moon output routine..." (Note: Actually, any number less than 100 but large enough to have no obvious magic properties will be recognized as a "large number". There is no denying that "69" is the local favorite. I don't know whether its origins are related to the obscene interpretation, but I do know that 69 decimal = 105 octal, and 69 hexadecimal = 105 decimal, which is a nice property. - GLS)"
For some reason, dirty names are common practice in audio tools. Some examples: "Rectal Anarchy" is another vocal synth for Buzz, grANALizer (emphasis not mine) is a popular granular audio effect, and even in the commercial world, it gets more subtle but it still exists: Image Line sells a plugin called "Gross Beat", which is a bilingual en/fr dirty joke.
The most interesting thing about this one is the chord progressions it generates.
https://youtu.be/TsdOej_nC1M?t=16
https://en.wikipedia.org/wiki/Voder
[0] https://xferrecords.com/products/cthulhu
[0] https://twitter.com/daviddotli/status/1075068713936830464
Would you find this useful? Is this not already available somewhere? Sounds like a fun project!
Either way I agree, definitely a fun project.
[1] https://de.wikipedia.org/wiki/Wolfgang_von_Kempelen#Die_Spre...
[2] https://en.wikipedia.org/wiki/The_Turk
https://www.youtube.com/watch?v=0rAyrmm7vv0
Literate depot: https://github.com/PaulBatchelor/voc
Actually compiled source is there: https://github.com/PaulBatchelor/Soundpipe/blob/master/modul...
https://www.youtube.com/watch?v=097K1uMIPyQ
I'd also (as an English speaker) like to see/hear Dutch g and Xhosa clicks.
specifically: https://twitter.com/shaunlebron/status/989192507828432896
[1] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Pink Trombone has no phoneme mapping, you'd have to code all that by hand, and that's exactly what my original question was.
https://www.speex.org/docs/manual/speex-manual/node9.html
Bitrate comparison:
https://www.speex.org/comparison/
Samples:
https://www.speex.org/samples/
Maybe higher compression can be achieved with better prediction, aka machine learning.
So they can make it sound more masculine but still high-pitched.
https://www.youtube.com/watch?v=PDn7ygnJUfI
https://www.youtube.com/watch?v=3jcqKnIa8T4
https://www.youtube.com/watch?v=bo5ZEgBEapk
https://github.com/giuliomoro/pink-trombone
... which is because the latter was patterned after the shape of the mouth.
[1] https://www.urbandictionary.com/define.php?term=pink%20tromb...