For your listening pleasure, here's a full-length demo. I decided to use the Jonathan Coulton classic "Re: Your Brains", because I can legally share and modify his music under its Creative Commons license.
While it's a great technology, the result sounds somewhat robotic. On the original recording the voice sounds soft, but after separation it sounds as if it were synthesized or passed through a vocoder; something is missing. The voice contains pieces of strumming sound. The guitar also sounds "blurred", as if someone cut an object from a picture and blurred the cut to make it less visible. The clap sound is distorted: on the original recording it sounds the same each time, but after separation it sounds different every time, as if it were filtered or compressed at a low bitrate.
It is amazing how the ear manages to distinguish all the sounds without distortion.
> While it's a great technology, the result sounds somewhat robotic.
That's like complaining about how badly the pig plays the violin. This is absolutely incredible. The complexity level for this problem is right off the scale and the software does a passable job of it. Given some time, more training data, and a few more people working on it, this has serious potential.
That roboticness is because there's overlap in frequencies between the voice and the instruments.
I have no idea how this tool splits them up at the implementation level but I imagine it tries to split it up based on frequencies and when it lifts out the voice, it's cutting out a ton of frequencies that would normally be in your voice so now it sounds very unnatural, blocky and metallic.
With studio quality headphones I can notice a massive difference for the worse between the original and separated vocal track. It reminds me of when I turned on a noise gate too high when recording audio for my courses. That noise gate clamps down on certain frequencies to help eliminate room noise but it also dampens or removes natural frequencies that occur on the low end of an average male voice. It gives that same very jagged sounding audio waveform.
I don't argue that it's a great technology, but some of the neighbour commenters wrote things like "perfect", and to me it doesn't sound like "perfect" yet.
For example, in the beginning, if you listen to the phrase "from the office ...", in the original recording it sounds smooth, like a single phrase (and the voice is warm and pleasant), but in the separated track it sounds like it is synthesized from pieces that are not properly connected, like vocaloid songs. It sounds a little harsher. Transitions sound unnatural. And the phrase "heya Tom" in the separated track is split into "heya" and "Tom" with some unnatural sound between (or maybe at the beginning of "Tom"). It is like the transitions you can hear in vocaloid tracks, or the kind of artifacts you get if you over-compress an MP3 file.
And the "it's good to see you" part also sounds robotic.
Maybe it's losing some of the harmonics, but in a different way for different syllables, so they don't sound like a single phrase anymore.
People are hard at work on denoising of audio using neural networks also. I would expect that if one trained denoisers for each type of source separated here, and passed the separated audio through them, things would get even better.
Yes, for source separation. Denoising is generally a separate task. A denoising network would take in a noisy signal for a single source and output a cleaned-up one. It would be trained on the specific source type, for example vocals.
The original lead vocal is definitely processed. You can hear that processing clearly if you apply the classic vocal removal (really, center removal) effect. (My favorite implementation of that is the Center Cut DSP for foobar2000.)
Tom's Diner wasn't just a benchmark. Brandenburg listened to it obsessively, over and over, to the exclusion of a lot of other content he might have done well to pay more attention to. Which is what tends to happen when you work on audio processing code, for better or worse.
So that's why the MP3 format mangles male vocals so badly at all but the highest bitrates. Now you know the Rest Of The Story... or at least you've read it on the Internet.
I gave a talk at PyCon this year about DSP, specifically some of the complexities surrounding this. I came across a few other ML projects that claimed to do this as well, and the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly. In the Git repo of this project they also explicitly state you need to train on your own data set, though you can use their models if you like. YMMV. I would love to try this out, as it's definitely a complex bit of audio engineering. That said, I loved learning everything I did preparing for my talk and need to finish up some other parts of the project to get the jukebox working... Maybe this will help :)
Seems like most music (from the 70s on at least) is recorded multi-track and the data is out there, just not accessible to anybody. If you ever watch Rick Beato videos, he takes classic songs and isolates vocal/drum/etc. tracks all the time, I'm not sure how he has access to them: https://www.youtube.com/playlist?list=PLW0NGgv1qnfzb1klL6Vw9...
But you probably don't need to bother with old recordings since there is SO MUCH music being produced via tracking software right now I feel like it should be possible to get a pretty big dataset - the difference being, of course, professional production that affects how all these things sound in the final mix.
Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
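To put a rough number on that idea, here is a minimal sketch. It assumes every song is split into the same four stems (vocals, drums, bass, other) and that all clips are mono arrays of equal length; the names `STEMS`, `count_recombinations` and `random_mix` are made up for illustration. It ignores tempo and key, which is one reason the result could be brittle.

```python
import numpy as np

# Assumed stem layout; real datasets may use different names or counts.
STEMS = ("vocals", "drums", "bass", "other")


def count_recombinations(num_songs, num_stems=len(STEMS)):
    """Number of ways to pick one source song for each stem slot."""
    return num_songs ** num_stems


def random_mix(stem_library, rng, gain_range=(0.5, 1.0)):
    """Build one synthetic training example from a library of stems.

    stem_library maps a stem name to a list of equal-length mono arrays,
    one per source song. Returns (mixture, targets), where targets holds
    the gain-adjusted stem that went into the mixture for each name.
    """
    targets = {}
    for name in STEMS:
        candidates = stem_library[name]
        # Pick this stem from a random song, independently of the others.
        clip = candidates[rng.integers(len(candidates))]
        # Random gain stands in for "adjusting the settings".
        targets[name] = rng.uniform(*gain_range) * clip
    mixture = sum(targets.values())
    return mixture, targets
```

With a dozen songs and four stems, `count_recombinations(12)` gives 12^4 = 20,736 distinct stem combinations before any gain changes, which is in line with the "10,000+" estimate.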
Rick says in one of his videos that he and some of his buddies have got old copies of the original source (separated) tracks, and they kind of pass them around between each other.
I find that pretty amazing given the litigiousness of some in the music industry, but there we are.
Side note: I discovered Rick Beato a few months ago and I've watched heaps of his videos. It's really fascinating hearing old classics torn down to their constituent parts. Here's one of my favourites of his: https://www.youtube.com/watch?v=ynFNt4tgBJ0 (Boston - More than a feeling).
Rick Beato is excellent. Nahre Sol and Adam Neely also do great analyses of things. Adam in a more theory oriented way and Nahre in a more feeling and composition focussed way; "Funk as digested by a classical musician" for example looks at funk to try and find the key structures of the style which illuminates things I might not have noticed otherwise.
Also, 8-bit Music Theory has very solid video essays on varying compositional concepts, illustrated using game music. I actually find his work most consistently satisfying. Neely and Beato are great but have a lower s/n ratio. I haven't watched enough of Nahre to say, but thumbs up for her, too.
Don't forget JazzDuets's channel. His content seems to be the most mature, and he uses actual playing a lot to tune your ear. I find him actually a bit too advanced for my level, but I like his very humble and friendly personal touch a lot.
Given that, I expect that a show titled "What Makes This Song Great" will do fine. Who doesn't love having somebody note the non-obviously good parts of their work? Especially if, as with Weird Al, proper royalties are paid.
The artists and performers are often quite reasonable. When they sign a major label deal, many sign away the rights to and control of their work in an effort to make a living and support their families. They need the money, and maybe also a gold record.
Once they sign, the RIAA and label lawyers get to work, so the creator may not have any influence or own the masters.
Artists have a good chance of getting the point that authentic publicity is gonna garner authentic fans with authentic ticket stubs, but the contract, on page 147, section 14a, under "Rights and Royalties", states ...
OMG yes. I watch very little YouTube, but reading these comments I thought "this Rick guy is probably that one I saw a couple months ago, his separated Boston tune was really amazing." And there it is.
I'm sure you've thought of this, but could the tracks from the Rock Band games be used for training (or have they been already)?
There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes where separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.
FYI The term "stems" to refer to the individual tracks of a piece of recorded music is a lot older than NI's format. I love NI, but I'm annoyed that they chose to appropriate the industry standard term as a proprietary product.
This is better than I originally thought, but it's still a bit confusing. The Stems file spec (available via registration) is basically an MP4 container with some JSON metadata. This seems to have the usual downsides of MP4 patents, but it's actually about as good as any standard a pro audio software company has released.
Ideally, I'd have liked to have seen a completely open audio codec used for both encoding and container, but MP4 is a pretty safe bet for compatibility, and it's not really NI's fault that it has some patent issues.
All in all, I could pedantically argue the "open" status, but I'll instead give credit where it's due, and give kudos to NI for releasing a pretty damn usable file format.
I'm even happy that it's limited to 4 parts. For the purposes of live performance with DJ style gear, this is plenty. If a performer wants more parts then they're probably going to be creating some or all of those parts. Either way, they'll probably be using something more like Ableton rather than Traktor.
As discussed elsewhere in this thread, it's not as I suggested, a proprietary format. It's still a format created by NI which appropriates the industry standard name. For a list of parties using NI's implementation of the file format see https://www.stems-music.com/stems-partners/ .
I'm happier that it's a (mostly) open standard, but I'm still slightly annoyed at the confusion that comes from NI appropriating the industry term. It's like if I released a non-text representation of storing data using a particular subset of technology that was standardized, and then called it "The Binary" format. Technically nothing wrong with it, but it's bound to cause confusion!
The SNES is a 1990s game console. Its music is generally synthesized by the SPC700 chip, from individual instruments stored in 64 kilobytes of RAM (so the instruments often sound synthetic and muffled). The advantage is that it's possible to separate out instruments.
- Programmatically gather a list of all samples used in the song
- Generate many modified .spc files, each of which mutes 1 sample via editing the BRR data.
- Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.
Record the original song to .wav. And for each sample, record "the song with one sample muted", and take (original song - 1 sample muted), to isolate that 1 sample. If the result is not silent, you have isolated 1 instrument from the original song.
The results may not always be perfect, and will need manual labeling of instruments, or manually merging together multiple piano instruments. But I think this process will work.
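The subtraction step could be sketched as follows. This assumes both renders come from a deterministic emulator, so they are sample-aligned, and that they have already been loaded as floating-point arrays; `isolate_sample` and `silence_threshold` are illustrative names, not part of any SPC tool.

```python
import numpy as np


def isolate_sample(original, muted, silence_threshold=1e-4):
    """Return (original - muted), or None if the difference is silent.

    original: render of the full song.
    muted:    render of the same song with one sample ID muted.
    Both are 1-D float arrays; the shorter length is used if they differ.
    """
    n = min(len(original), len(muted))
    diff = original[:n].astype(np.float64) - muted[:n].astype(np.float64)
    # A silent difference means the muted sample never played in this song.
    if np.max(np.abs(diff)) < silence_threshold:
        return None
    return diff
```

If the emulator adds any non-determinism (noise channels, for example), the difference will contain more than the one instrument, which is one way the results could end up imperfect.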
BTW, in the play-along mode in GB where you get pre-recorded accompaniment tracks, you can replace the drummer's kit with a drum machine and hang some filters onto it. Much fun is to be had.
However, this reminds me that filters probably make things much harder for the separation model, with the explosion of possible sounds from an instrument or voice. (Vishudha Kali's music is a nice illustration of that.)
I did come across the person who did a similar project (automating instruments based on previously recorded music), however in one project that was playing live instruments from an NES, the signals were already separated. That said, I'm not following the context of your response to my post.
You mentioned that "the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly." I think using SNES music as training data is a viable way of getting hundreds of songs' worth of training data in a fairly automated fashion. (I'd estimate that each game has 10 to 80 songs which can be used for training, I have 5 to 10 games of OSTs already downloaded, and each song is only 64 kilobytes and takes minimal disk space before rendering to WAV.)
This is very timely. I've been working for about 3 months now on a utility that transforms mp3's to midi files. It's a hard problem and even though I'm making steady progress the end is nowhere in sight. This will give me something to benchmark against with for instance voice accompanied by piano. Thank you for making/posting this.
Yes, it's terrible :) This particular file is the result of the following transformations:
midi file -> wav file (fluidsynth)
wav file -> midi file (my utility)
midi file -> wav file (fluidsynth once more)
wav file -> mp3 file (using lame)
Of course it also works for regular midi files (piano only for now). The reason why I use the workflow above is that it gives me a good idea how well the program works by comparing the original midi file with the output one.
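One way such a round-trip comparison could be scored is with note-level precision and recall. The sketch below is only an illustration, not the utility's actual method: it assumes notes have already been extracted from both MIDI files as `(onset_seconds, midi_pitch)` pairs, and the function name and the 50 ms tolerance are arbitrary choices.

```python
def score_transcription(reference, estimated, onset_tolerance=0.05):
    """Greedy note matching between two lists of (onset_seconds, pitch).

    A note in `estimated` counts as correct if an unmatched reference note
    has the same pitch and an onset within `onset_tolerance` seconds.
    Returns (precision, recall).
    """
    unmatched = list(reference)
    true_positives = 0
    for onset, pitch in estimated:
        for ref in unmatched:
            if ref[1] == pitch and abs(ref[0] - onset) <= onset_tolerance:
                unmatched.remove(ref)  # each reference note matches once
                true_positives += 1
                break
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

Low precision shows up as spurious notes in the output; low recall shows up as notes from the original that went missing.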
But I did not yet have a way to deal with piano/voice which is a very common combination so this might really help me.
Possible applications: automatic music transcription, tutoring, giving regular pianos a midi 'out' port, using a regular piano as an arranger keyboard, instrument transformation and many others.
Interpreting the results is tricky, they are obviously better on 'recall' but that is at the expense of being much less precise which gives a much better result for my code; besides it is nicer to listen to because there are far fewer spurious notes.
My code also runs about 100 times as fast and uses very little in terms of resources. So, rather than being depressed it looks like I'm on to something :)
messed around with the 2stem model for a bit and it's reasonably good. I think phonicmind is still a bit better - phonicmind tends to err on the side of keeping too much, while the 2stem model tries to isolate aggressively and often damages the vocal as a result
(distorting words by losing some harmonics, or losing quiet words entirely)
you can hear spleeter does better at actually taking out the bass drums, but phonicmind never loses or distorts any part of the vocal, while 2stem occasionally sounds like the singing is through a metal tube (harmonics are missing). will try to read instructions more carefully and see if there's some way to fix.
For those who, like me, hadn’t heard of PhonicMind before, it’s an online service at https://phonicmind.com/ that charges between $1.50 and $4 per song to separate out vocals, drums, bass, and the rest of the sounds. You can upload any audio file to that website and get a 30-second preview of separated parts for it.
An interesting alternative approach for instrument sound separation is to use a fused audio + video model. So, given that you also have video of the instruments being played, you can perform this separation with higher fidelity.
I was fascinated by the work done by “The Sound of Pixels” project at MIT.
Gave this a go, it's an easy install with pip, and results are pretty quick even on an old MacBook. Splitting some random songs I chose into 2 stems (vocals/accompaniment) with the provided pretrained models actually works quite well. Of course, ripping the vocals out of the accompaniment takes out a good chunk of the middle frequencies, so some songs sound a bit wonky.
Worth a play if you are interested.
Same thoughts here. I ran Thriller, Alligator by Of Monsters and Men, and In Hell I'll be in Good Company by The Dead South on the 2 / 5 / 4 stems, respectively. Impressive results. Definitely agree that some of the middle frequencies show some error.
Both products use a server which has much larger pre-trained models. The professional one has added features such as sibilance handling, a GUI to edit note following as a guide for the models, and an editor tool for extracting using harmonics.
(Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)
I wonder how it would fare on Pink Floyd's "Sheep", where vocals seamlessly transform into instrumentals and it's impossible to tell where one ends and the other begins. https://www.youtube.com/watch?v=3-oJt_5JvV4 (skip to around 1:40)
There is a patent for Melodyne that describes looking for harmonics vs time in FFTs, then heuristics for deciding which belong to one note and where it starts and ends, then assigning some of the residual energy (e.g. noisy onset) to each note.
That's the second time I've seen someone mention Melodyne for separating vocals from a full song source - I don't think that's something it can do? Melodyne is for tuning vocals / instruments & correcting timing on already isolated tracks.
melodyne's editing interface lets you remove different notes from a polyphonic track. so if it's just vocals + other tonal sounds, you can manually remove the other tonal sounds. example: https://youtu.be/2ZjdDatxTaQ?t=83
Hmm, never tried that with melodyne myself and the video you posted isn't a great example of an accurate vocal extraction - those are more like vocal chops and are already pretty dirty to begin with. Based on my experience with Melodyne, I'd be surprised if you could cleanly extract a plain singing vocal without tons and tons of work.
I look forward to a day I can click a button to watch videos online without any unnecessary and distracting background music (though it would be better if there were an option and precedent to offer unornamented narrative in video players). The next step after this would be to have live 'music cancelling' headphones for the grocery store (if such a thing still exists).
The extracted vocals sound great! But the resulting accompaniment tracks I've heard so far (tried on a handful of songs) aren't of usable quality for most purposes where you'd want an instrumental track – they're too sonically mangled.
Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...
What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.
Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).
Is there anything like this for images? Meaning essentially trying to decompose back into photoshop layers. Wouldn't be feasible for lots of stuff that is completely opaquely covering something, but I'm thinking for things like recoloring a screen print, etc.
Played with it. The quality of the result is mostly dependent on the amount of clipping in the source file. Basically, all post-90s masters produce weird results with orcs singing in the background. And classics from the 60s yield fantastic results.
I gave it a try with Megadeth's Holy Wars, was expecting something like this but got very deep audio. Not sure why, but perhaps it's because bassist David Ellefson uses a pick, which gives the percussive sound that suits Megadeth.
Any parameter I could use with spleeter to get a similar output?
Not only are the results good, but the music is generated decently rapidly. The implications are clear: whoever wants to make a quick fortune on YouTube should start converting and uploading truckloads of songs as fast as possible. The demand is there. I could easily see that bringing in millions of views.
I really disagree. It sounds... awful. On par with other approaches, sure, but the main vocals sound like a case study in digital artifacts and the accompaniment sounds like there's a filter automated over the track.
Very cool. A close friend of mine (and lead singer in our band many years ago) recently died and we have a couple of great recordings of his vocals from two decades ago. When we recorded the rest of the instruments they were DI into a Boss BR8. The vocals sound awesome but the guitar and drums are recorded poorly. This may give us a chance to split the vocals out of the final tracks, and re-record the tracks as a tribute.
The repo's quick start instructions show how to use it with the "2-stems" model, which separates the source audio into two files: output/source/vocals.wav and outputdir/source/accompaniment.wav:
When you have an instrumental version of a song (from the same stems as the vocal version) this is already one way to get the vocals out without any fancy machine learning. The main tricks besides what you can do in Audacity like that are properly time-aligning the tracks (even if they drift a bit) and compensating for phase issues and compression. I wrote a dirty tool that does that and I've been meaning to turn it into some kind of nicer GUI version.
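A minimal sketch of the align-and-subtract idea, under simplifying assumptions: mono float arrays of equal length, one constant time offset (no drift), and a single overall gain difference. It does not attempt the drift, phase, or compression compensation mentioned above, and the function names are made up for illustration.

```python
import numpy as np


def estimate_lag(mix, instrumental):
    """Samples by which `instrumental` must be delayed to line up with `mix`.

    Uses plain cross-correlation, which is O(n^2); for real songs you would
    run it on a short excerpt or use an FFT-based correlation instead.
    """
    corr = np.correlate(mix, instrumental, mode="full")
    return int(np.argmax(corr)) - (len(instrumental) - 1)


def shift(signal, lag):
    """Delay (lag > 0) or advance (lag < 0) a signal, padding with zeros."""
    out = np.zeros_like(signal)
    if lag > 0:
        out[lag:] = signal[:-lag]
    elif lag < 0:
        out[:lag] = signal[-lag:]
    else:
        out[:] = signal
    return out


def subtract_instrumental(mix, instrumental):
    """Estimate the vocals as mix minus the aligned, gain-matched instrumental."""
    aligned = shift(instrumental, estimate_lag(mix, instrumental))
    # Least-squares gain: how loud the instrumental is inside the mix.
    gain = np.dot(mix, aligned) / np.dot(aligned, aligned)
    return mix - gain * aligned
```

In practice the leftover is only as clean as the match between the two masters; any difference in processing between the vocal and instrumental versions ends up in the "vocals".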
I've been doing something like this for a bit in Audition. Center channel extract > invert phase > save as wav > create multi-track project > add original > add modified > up the volume on the vocal extracted modified version until the vocals go away
Or you can do the exact opposite and instead of center channel extracting the vocals you can remove the vocals and use this method to better isolate vocals.
Although if things do fun stuff with stereo it might not be exact.
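For reference, the simplest form of the center-removal trick is a mid/side split. This is a sketch of the classic time-domain version, not of what Audition's center channel extractor does internally; it assumes the audio is already loaded as a float array of shape `(num_samples, 2)`.

```python
import numpy as np


def split_mid_side(stereo):
    """Split a stereo signal into mid and side components.

    stereo: float array of shape (num_samples, 2), columns = left, right.
    mid  = (L + R) / 2 keeps everything, with center-panned parts at full level.
    side = (L - R) / 2 cancels anything that is identical in both channels,
           which is usually where the lead vocal sits.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side
```

Anything on the vocal that differs between the channels, such as stereo reverb or doubling, will not cancel in `side`, which is why the result is not exact on mixes that do fun stuff with stereo.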
I set up a Colab notebook to try spleeter out for myself. You can try picking your favorite mp3, renaming it to "audio_sample.mp3", uploading it to the Colab, and running all the cells in the notebook. Enjoy.
I was just testing it out on a Gazal, and it seems to work perfectly. But it seems to fail with Qawwalis.
Given my understanding of how this works, is it safe to assume that the training data from Deezer lacks enough examples of Qawwalis?
Aren't all the most popular audio formats lossy? Extracting full data from lossy compression requires reconstruction. Even if they are able to completely extract all tracks they would have gaps and be very low quality.
Source separation of speech and speech denoising are well-established fields, more researched in general than music source separation. Intelligence officers very likely have access to a range of well-performing ML tools for extracting speech.