Extract voice, piano, drums, etc. from any music track

(github.com)

1459 points | by dsr12 11 days ago

54 comments

  • mwcampbell 11 days ago

    For your listening pleasure, here's a full-length demo. I decided to use the Jonathan Coulton classic "Re Your Brains", because I can legally share and modify his music under its Creative Commons license.

    First, the original:

    https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

    Now the derived stems:

    Vocals: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

    Accompaniment: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

    Note: I'm not affiliated with this project or Mr. Coulton. I just think this is a cool project and wanted to share.

    • codedokode 11 days ago

      While it's a great technology, the result sounds somewhat robotic. On the original recording the voice sounds soft, but after separation it sounds like it is synthesized or passed through a vocoder, something is missing. The voice contains pieces of strumming sound. Guitar also sounds "blurred", as if someone cut an object from the picture and blurred the cut to make it less visible. Clap sound is distorted, on the original recording it sounds the same, but after separation it sounds different every time, as if it was filtered or compressed with low bitrate.

      It is amazing how the ear manages to distinguish all the sounds without distortion.

      • jacquesm 11 days ago

        > While it's a great technology, the result sounds somewhat robotic.

        That's like complaining about how bad the pig plays the violin. This is absolutely incredible. The complexity level for this problem is right off the scale and the software does a passable job of it. Given some time and more training data and a few more people working on it this has serious potential.

      • nickjj 11 days ago

        That roboticness is because there's overlap in frequencies between the voice and the instruments.

        I have no idea how this tool splits them up at the implementation level but I imagine it tries to split it up based on frequencies and when it lifts out the voice, it's cutting out a ton of frequencies that would normally be in your voice so now it sounds very unnatural, blocky and metallic.

        With studio quality headphones I can notice a massive difference for the worse between the original and separated vocal track. It reminds me of when I turned on a noise gate too high when recording audio for my courses. That noise gate clamps down on certain frequencies to help eliminate room noise but it also dampens or removes natural frequencies that occur on the low end of an average male voice. It gives that same very jagged sounding audio waveform.
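A toy illustration of the frequency-overlap point above. Nothing in this thread establishes how Spleeter actually splits sources, so this is only a sketch of the general idea: it builds a made-up "voice" and "instrument" that share one partial at 440 Hz, then applies an oracle binary mask in the frequency domain (a mask that cheats by looking at the true sources). All signals and frequencies here are assumptions chosen for the example.

```python
import numpy as np

sr = 8000                   # sample rate in Hz (arbitrary for this toy)
t = np.arange(sr) / sr      # exactly one second, so FFT bin k corresponds to k Hz

# Made-up 'voice' and 'instrument' that overlap at 440 Hz.
voice = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
instrument = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
mix = voice + instrument

voice_spec = np.fft.rfft(voice)
inst_spec = np.fft.rfft(instrument)
mix_spec = np.fft.rfft(mix)

# Oracle binary mask: keep a bin only where the voice is the louder source.
mask = np.abs(voice_spec) > np.abs(inst_spec)
estimate = np.fft.irfft(mix_spec * mask, n=len(mix))

# Inspect what survived in the estimated voice.
est_spec = np.abs(np.fft.rfft(estimate))
print("220 Hz partial kept:", bool(est_spec[220] > 1.0))
print("440 Hz partial kept:", bool(est_spec[440] > 1.0))
```

The voice's own 440 Hz partial is discarded because the instrument is louder in that bin, which is one simple way a separated vocal can end up missing harmonics it originally had.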

        • codedokode 10 days ago

          I don't argue that it's a great technology, but some of the neighbour commenters wrote things like "perfect", and to me it doesn't sound like "perfect" yet.

          For example, in the beginning, if you listen to the phrase "from the office ...", in the original recording it sounds smooth, like a single phrase (and the voice is warm and pleasant), but in the separated track it sounds like it is synthesized from pieces that are not properly connected, like vocaloid songs. It sounds a little harsher. Transitions sound unnatural. And the phrase "heya Tom" in the separated track is split into "heya" and "Tom" with some unnatural sound between (or maybe in the beginning of "Tom"). It is like transitions you can hear in vocaloid tracks. Or the kind of artifacts you get if you over-compress an MP3 file.

          And "it's good to see you" part also sounds robotic.

          Maybe it's losing some of the harmonics, but in a different way for different syllables and they don't sound like a single phrase anymore.

          • sjwright 10 days ago

            It’s perfect relative to the most wildly optimistic expectations anyone could have reasonably held beforehand.

            The vocals don’t have any significant residual artefacts from drum hits or any residual bleed-through of instruments playing the same notes. It’s magical.

        • rafa1981 10 days ago

          Maybe the source was compressed audio instead of flac/wav?

          Edit: the source is an mp3, which removes audio frequencies based on perception/masking with other frequencies. It's perfectly normal that it is showing artifacts. A better source is needed.

        • zzzeek 11 days ago

          this is like someone just flew you to the surface of Mars in an hour and your only comment is that the ride was bumpy. The demo above is mind blowing.

          I basically want to run this over every Steve Gadd recording I have.

          • jononor 11 days ago

            People are hard at work on denoising of audio using neural networks also. I would expect that if one trained denoisers for each type of source separated here, and passed the separated audio through them, things would get even better.

            • rrss 10 days ago

              spleeter is using neural networks.

              • jononor 10 days ago

                Yes, for source separation. Denoising is generally a separate task. A denoising network would take in a noisy signal for a single source and outputs a cleaned up one. It would be trained on the specific source type, for example vocals.

                • rrss 4 days ago

                  Gotcha, thanks.

            • MattRix 11 days ago

              It's easy to pick up on lots of little things, but this is still extremely impressive. It's also more than good enough for people who want to practice singing/playing over the song by themselves.

              I think the next step would be to train a network that can un-robotify songs and then run it on this.

              • uryga 11 days ago

                > On the original recording the voice sounds soft, but after separation it sounds like it is synthesized or passed through a vocoder

                funny, the original recording seemed kind of robotic to me! maybe not robotic, but like it's been filtered somehow. but that might just be my not-so-great headphones

                • mwcampbell 10 days ago

                  The original lead vocal is definitely processed. You can hear that processing clearly if you apply the classic vocal removal (really, center removal) effect. (My favorite implementation of that is the Center Cut DSP for foobar2000.)
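For readers unfamiliar with the classic effect mentioned above, here is a minimal sketch of the simplest form of center removal: subtracting one stereo channel from the other. This is not the Center Cut DSP's algorithm (its internals are not described in this thread); the function name and the toy signals are assumptions made for the example.

```python
import numpy as np

def remove_center(stereo: np.ndarray) -> np.ndarray:
    """Return the mono 'side' signal (L - R) / 2 of a (n_samples, 2) array.

    Anything mixed identically into both channels (often the lead vocal)
    cancels; anything panned off-center survives.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    return (left - right) / 2.0

# Toy mix: a 'vocal' panned dead center plus a 'guitar' panned hard left.
sr = 8000
t = np.arange(sr) / sr
vocal = np.sin(2 * np.pi * 220 * t)
guitar = 0.5 * np.sin(2 * np.pi * 330 * t)
stereo = np.stack([vocal + guitar, vocal], axis=1)

side = remove_center(stereo)   # the vocal cancels, the guitar remains at half level
```

Because only the identical part of the two channels cancels, stereo effects applied to a centered vocal (reverb, doubling) are left behind, which is why this trick makes vocal processing easy to hear.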

              • M4v3R 11 days ago

                Wow, I've listened to several attempts at this over the years, but this one is waaay better than anything I've heard. It's almost perfect.

                • 8bitsrule 10 days ago

                  IME, this tool is certainly an order of magnitude closer to the Holy Grail than anything I've ever heard. Kudos to Deezer R&D.

                  https://www.youtube.com/watch?v=KPlmrq_rAzQ

                  • noja 11 days ago

                    So Jonathan Coulton is now the new Suzanne Vega?

                    • BrentOzar 11 days ago

                      > So Jonathan Coulton is now the new Suzanne Vega?

                      Bravo. For people who didn't get the sublime reference, Suzanne Vega's song Tom's Diner was a benchmark test during development of the MP3.[1]

                      [1]: https://en.wikipedia.org/wiki/Tom%27s_Diner#The_"Mother_of_t...

                      • CamperBob2 11 days ago

                        Tom's Diner wasn't just a benchmark. Brandenburg listened to it obsessively, over and over, to the exclusion of a lot of other content he might have done well to pay more attention to. Which is what tends to happen when you work on audio processing code, for better or worse.

                        So that's why the MP3 format mangles male vocals so badly at all but the highest bitrates. Now you know the Rest Of The Story... or at least you've read it on the Internet.

                        • danso 10 days ago

                          This sounds really interesting. Is there a good writeup/oral history about this?

                        • mwcampbell 11 days ago

                          The reference is even more apt because the two of them have collaborated. She sang the lead vocal on his song "Now I Am an Arsonist" (from the album Artificial Heart).

                      • jesuslop 11 days ago

                        Holy cow! The separation is sort of perfect! Thanks for the demo.

                        • eps 11 days ago

                          "Indistinguishable from magic."

                        • savrajsingh 10 days ago

                          Did it just work or did you have to supply something beyond the original JC track?

                        • ropable 10 days ago

                          They need to link these on the project readme as a demo.

                        • voicedYoda 11 days ago

                          I gave a talk at PyCon this year about DSP [1], specifically some of the complexities surrounding this. I came across a few other ML projects that claimed to do this as well, and the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly. In the git repo of this project they also explicitly state you need to train on your own data set, though you can use their models if you like. YMMV. I would love to try this out, as it's definitely a complex bit of audio engineering. That said, I loved learning everything I did preparing for my talk and need to finish up some other parts of the project to get the jukebox working... Maybe this will help :)

                          1. https://m.youtube.com/watch?v=fevxy-s0vo0

                          • lubujackson 11 days ago

                            Seems like most music (from the 70s on at least) is recorded multi-track and the data is out there, just not accessible to anybody. If you ever watch Rick Beato videos, he takes classic songs and isolates vocal/drum/etc. tracks all the time, I'm not sure how he has access to them: https://www.youtube.com/playlist?list=PLW0NGgv1qnfzb1klL6Vw9...

                            But you probably don't need to bother with old recordings since there is SO MUCH music being produced via tracking software right now I feel like it should be possible to get a pretty big dataset - the difference being, of course, professional production that affects how all these things sound in the final mix.

                            Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
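The recombination idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the stem names, the gain range, and the `remix` helper are hypothetical, and random noise stands in for real audio stems.

```python
import numpy as np

def remix(stem_pool, rng, gain_db_range=(-6.0, 6.0)):
    """Build one synthetic training example from a pool of stems.

    stem_pool maps a stem name (e.g. 'vocals') to a list of equal-length
    mono arrays taken from different songs. One stem per name is drawn at
    random, scaled by a random gain, and summed into a new mixture.
    """
    targets = {}
    for name, candidates in stem_pool.items():
        stem = candidates[rng.integers(len(candidates))]
        gain = 10.0 ** (rng.uniform(*gain_db_range) / 20.0)  # dB to linear
        targets[name] = gain * stem
    mixture = sum(targets.values())
    return mixture, targets

# Stand-in data: 12 'songs' with 4 stems each (random noise, not audio).
rng = np.random.default_rng(0)
names = ["vocals", "drums", "bass", "other"]
pool = {n: [rng.standard_normal(1000) for _ in range(12)] for n in names}

mixture, targets = remix(pool, rng)
combinations = 12 ** len(names)   # distinct stem choices, before any gain changes
print(combinations)
```

With a dozen songs and four stems each, the count of distinct stem choices alone is 12^4, consistent with the "10,000+" estimate. Stems drawn from different songs will usually clash in key and tempo, which is one reason such a dataset might give the brittle result the comment anticipates.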

                            • abraae 11 days ago

                              Rick says in one of his videos that he and some of his buddies have got old copies of the original source (separated) tracks, and they kind of pass them around between each other.

                              I find that pretty amazing given the litigiousness of some in the music industry, but there we are.

                              Side note: I discovered Rick Beato a few months ago and I've watched heaps of his videos. It's really fascinating hearing old classics torn down to their constituent parts. Here's one of my favourites of his: https://www.youtube.com/watch?v=ynFNt4tgBJ0 (Boston - More than a feeling).

                              • ehnto 11 days ago

                                Rick Beato is excellent. Nahre Sol and Adam Neely also do great analyses of things. Adam in a more theory oriented way and Nahre in a more feeling and composition focussed way; "Funk as digested by a classical musician" for example looks at funk to try and find the key structures of the style which illuminates things I might not have noticed otherwise.

                                • mushishi 11 days ago

                                  Also 8 bit music theory has very solid video essays on varying compositional concepts that are reflected using game music. I actually find his work most consistently satisfying. Neely and Beato are great but lower s/n ratio. Nahre not enough watches to say but thumbs up for her, too.

                                  Don't forget JazzDuets's channel. His content seems to be most mature and uses actual playing a lot to tune your ear. I find him actually a bit too advanced for my level but I like a lot his very humble and friendly personal touch.

                                • wpietri 11 days ago

                                  Isn't that litigiousness mainly around money? Another important currency in the music world is respect. E.g., look at how Chamillionaire talks about Weird Al: http://yankovic.org/blog/2006/09/13/high-praise-from-chamill...

                                  Given that, I expect that a show titled "What Makes This Song Great" will do fine. Who doesn't love having somebody note the non-obviously good parts of their work. Especially if, as with Weird Al, proper royalties are paid.

                                  • cartoonworld 11 days ago

                                    The artists and performers are often quite reasonable. When you sign a major label deal, many sign away the rights to and control of their work in an effort to make a living and support their families. They need the money, also maybe a gold record.

                                    Once they sign, the RIAA and label lawyers get to work, so the creator may not have any influence or own the masters.

                                    Artists have a good chance of getting the point that authentic publicity is gonna garner authentic fans with authentic ticket stubs, but in the contract, on page 147 section 14a, under "Rights and Royalties" states ...

                                  • phkahler 11 days ago

                                    OMG yes. I watch very little YouTube, but reading these comments I thought "this Rick guy is probably that one I saw a couple months ago, his separated Boston tune was really amazing." And there it is.

                                    • jmpavlec 10 days ago

                                      Thanks for the Rick Beato mention. Just spent a couple hours watching some of his breakdowns. Fascinating stuff and reminded me how much I like this type of analysis.

                                    • pcf 9 days ago

                                      I've personally collected thousands of multitracks, stems and remix kits.

                                      Sources are e.g. multitracks that someone leaked (like original unmixed Madonna sessions), constructed MOGG files from various Rock Band games, stems prepared for remixers etc.

                                      • amelius 11 days ago

                                        I'm wondering if you could even use this to separate unrelated pieces of audio? E.g. instrumental music and someone reading a book out loud. And if you could use this to generate useful training data.

                                      • TheRealSteel 11 days ago

                                        I'm sure you've thought of this, but could/have the tracks from the Rock Band games be used for training?

                                        There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes where separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.

                                        • taxidump 11 days ago

                                          To add, there is a format of music called Stems, designed for DJs and live remixers by Native Instruments, which is a disassembly of the song into its various parts.

                                          https://www.native-instruments.com/en/specials/stems/

                                          • Intermernet 11 days ago

                                            FYI The term "stems" to refer to the individual tracks of a piece of recorded music is a lot older than NI's format. I love NI, but I'm annoyed that they chose to appropriate the industry standard term as a proprietary product.

                                            https://en.wikipedia.org/wiki/Stem_mixing_and_mastering

                                            • crucialfelix 11 days ago

                                              They published it as an open standard. Many apps and stores support it. The purpose was to take an informal industry practice and make it into a formal portable open standard.

                                              https://www.stems-music.com/

                                              • Intermernet 11 days ago

                                                This is better than I originally thought, but it's still a bit confusing. The Stems file spec (available via registration) is basically an MP4 container with some JSON metadata. This seems to have the usual downsides of MP4 patents, but it's actually about as good as any standard a pro audio software company has released.

                                                Ideally, I'd have liked to have seen a completely open audio codec used for both encoding and container, but MP4 is a pretty safe bet for compatibility, and it's not really NI's fault that it has some patent issues.

                                                All in all, I could pedantically argue the "open" status, but I'll instead give credit where it's due, and give kudos to NI for releasing a pretty damn usable file format.

                                                I'm even happy that it's limited to 4 parts. For the purposes of live performance with DJ style gear, this is plenty. If a performer wants more parts then they're probably going to be creating some or all of those parts. Either way, they'll probably be using something more like Ableton rather than Traktor.

                                              • unlinked_dll 11 days ago

                                                idk anyone who uses this as a proprietary format. "stems" is an industry standard term.

                                                • Intermernet 11 days ago

                                                  As discussed elsewhere in this thread, it's not as I suggested, a proprietary format. It's still a format created by NI which appropriates the industry standard name. For a list of parties using NI's implementation of the file format see https://www.stems-music.com/stems-partners/ .

                                                  I'm happier that it's a (mostly) open standard, but I'm still slightly annoyed at the confusion that comes from NI appropriating the industry term. It's like if I released a non-text representation of storing data using a particular subset of technology that was standardized, and then called it "The Binary" format. Technically nothing wrong with it, but it's bound to cause confusion!

                                          • jimbo1qaz 11 days ago

                                            The SNES is a 1990s game console. Its music is generally synthesized by the SPC700 chip, from individual instruments stored in 64 kilobytes of RAM (so the instruments often sound synthetic and muffled). The advantage is that it's possible to separate out instruments.

                                            Either:

                                            - Programmatically gather a list of all samples used in the song

                                            - Generate many modified .spc files, each of which mutes 1 sample via editing the BRR data.

                                            Or

                                            - Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.

                                            Record the original song to .wav. And for each sample, record "the song with one sample muted", and take (original song - 1 sample muted), to isolate that 1 sample. If the result is not silent, you have isolated 1 instrument from the original song.

                                            The results may not always be perfect, and will need manual labeling of instruments, or manually merging together multiple piano instruments. But I think this process will work.
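The subtraction step described above can be sketched as follows. It assumes the two renders are sample-aligned and that the instruments mix linearly; the function name, the silence threshold, and the sine-wave stand-ins for emulator output are all hypothetical.

```python
import numpy as np

def isolate_sample(full_render, muted_render, silence_threshold=1e-6):
    """Return (full - muted), i.e. the contribution of the muted sample.

    Returns None when the difference is effectively silent, meaning the
    muted sample was never audible in this song.
    """
    diff = full_render - muted_render
    if np.max(np.abs(diff)) < silence_threshold:
        return None
    return diff

# Stand-ins for two emulator renders of the same song.
sr = 32000
t = np.arange(sr) / sr
piano = 0.4 * np.sin(2 * np.pi * 262 * t)
bass = 0.3 * np.sin(2 * np.pi * 65 * t)

full = piano + bass          # original song
muted_piano = bass           # same song with the piano sample muted

isolated = isolate_sample(full, muted_piano)   # recovers the piano part
unused = isolate_sample(full, full)            # muting changed nothing: None
```

Any non-linear stage in the real signal path (clipping, for instance) would break the exact cancellation, so the difference signal is best treated as an approximation to check by ear.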

                                            • aasasd 11 days ago

                                              I'd guess this would result in a model for separating SNES music.

                                              • kranner 11 days ago

                                                I would guess Garageband tracks would be more representative of 'real' instruments than chiptunes.

                                                • aasasd 8 days ago

                                                  BTW, in the play-along mode in GB where you get pre-recorded accompaniment tracks, you can replace the drummer's kit with a drum machine and hang some filters onto it. Much fun is to be had.

                                                  However, this reminds me that filters probably make things much harder for the separation model, with the explosion of possible sounds from an instrument or voice. (Vishudha Kali's music is a nice illustration of that.)

                                              • voicedYoda 11 days ago

                                                I did come across the person who did a similar project (automating instruments based on previously recorded music), however in one project that was playing live instruments from an NES, the signals were already separated. That said, I'm not following the context of your response to my post.

                                                • jimbo1qaz 10 days ago

                                                  You mentioned that "the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly." I think using SNES music as training data is a viable way of getting hundreds of songs' worth of training data in a fairly automated fashion. (I'd estimate that each game has 10 to 80 songs which can be used for training, I have 5 to 10 games of OSTs already downloaded, and each song is only 64 kilobytes and takes minimal disk space before rendering to WAV.)

                                            • jacquesm 11 days ago

                                              This is very timely. I've been working for about 3 months now on a utility that transforms mp3's to midi files. It's a hard problem and even though I'm making steady progress the end is nowhere in sight. This will give me something to benchmark against with for instance voice accompanied by piano. Thank you for making/posting this.

                                              For an idea how this project is coming along:

                                              https://jacquesmattheij.com/toccata.mp3

                                              Yes, it's terrible :) This particular file is the result of the following transformations:

                                              midi file -> wav file (fluidsynth)

                                              wav file -> midi file (my utility)

                                              midi file -> wav file (fluidsynth once more)

                                              wav file -> mp3 file (using lame)

                                              Of course it also works for regular midi files (piano only for now). The reason why I use the workflow above is that it gives me a good idea how well the program works by comparing the original midi file with the output one.

                                              But I did not yet have a way to deal with piano/voice which is a very common combination so this might really help me.

                                              Possible applications: automatic music transcription, tutoring, giving regular pianos a midi 'out' port, using a regular piano as an arranger keyboard, instrument transformation and many others.

                                              Having fun!

                                              Edit: I've done a little write-up: https://jacquesmattheij.com/mp3-to-midi/

                                              • IAmGraydon 11 days ago

                                                Just FYI in case you weren't aware - Ableton Live and several other DAWs have this capability built in. It's far from perfect, but great for humming a melody and then quickly turning it into MIDI.

                                                • alez 10 days ago

                                                  There’s a pretty cool library from Googles Magenta team that does piano transcription pretty well. https://magenta.tensorflow.org/onsets-frames

                                                  They say it’s only really good for piano, but I definitely use it for all kinds of samples. Great for inspiration

                                                  • jacquesm 10 days ago

                                                    So, I used it to run the same toccata test, here are the results:

                                                    The Magenta code:

                                                    f_measure 71.56 precision 65.75 recall 78.49 accuracy 55.72

                                                    My little batch of code:

                                                    f_measure 77.74 precision 93.40 recall 66.57 accuracy 63.58

                                                    Interpreting the results is tricky, they are obviously better on 'recall' but that is at the expense of being much less precise which gives a much better result for my code; besides it is nicer to listen to because there are far fewer spurious notes.

                                                    My code also runs about 100 times as fast and uses very little in terms of resources. So, rather than being depressed it looks like I'm on to something :)
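As a sanity check on the figures above, the reported F-measures can be reproduced from the reported precision and recall using the standard harmonic-mean formula. The accuracy formula below is an assumption (TP / (TP + FP + FN), rewritten in terms of precision and recall); the thread does not say which definition the evaluation used, though it matches the reported values to within rounding.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def accuracy_from(precision, recall):
    """Assumed definition: TP / (TP + FP + FN), expressed via P and R."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)

# Precision and recall as reported in the comment above.
results = {
    "Magenta": (0.6575, 0.7849),
    "jacquesm": (0.9340, 0.6657),
}

for name, (p, r) in results.items():
    print(f"{name}: F = {100 * f_measure(p, r):.2f}, "
          f"accuracy = {100 * accuracy_from(p, r):.2f}")
```

The harmonic mean rewards balance, so the higher-precision result wins here only because its precision advantage (about 28 points) is much larger than its recall deficit (about 12 points).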

                                                    • alez 10 days ago

                                                      Oh wow that's extremely promising! Yeah the magenta thing destroys my browser when I run it. Still feels like magic though haha. I would be extremely interested in some other options so good luck!

                                                      • jacquesm 10 days ago

                                                        Thank you! If you have any files you want me to test with then feel free to send them, email is in my profile.

                                                    • jacquesm 10 days ago

                                                      Oh cool, thank you for that, I did not know about this yet. That may come in very handy.

                                                    • jacquesm 11 days ago

                                                      How good is it (% accuracy) for polyphony? I can upload the toccata original if you want.

                                                      • viburnum 10 days ago

                                                        How do you do this in Ableton?

                                                        • smcnally 10 days ago

                                                          https://www.ableton.com/en/manual/converting-audio-to-midi/

                                                          There are four options.

                                                          > Convert Melody to New MIDI Track

                                                          > This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.

                                                          • jacquesm 10 days ago

                                                            > This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.

                                                            Neat. So, the big difference then is that I do full polyphony, but I'm still limited to 'just' piano, and that's already hard enough for now.

                                                            • viburnum 10 days ago

                                                              Thanks so much. I thought I had read through the manual but there's all kinds of stuff I've missed.

                                                      • lreichold 11 days ago

                                                        An interesting alternative approach for instrument sound separation is to use a fused audio + video model. So, given that you also have video of the instruments being played, you can perform this separation with higher fidelity.

                                                        I was fascinated by the work done by “The Sound of Pixels” project at MIT.

                                                        http://sound-of-pixels.csail.mit.edu/

                                                        • renaudg 11 days ago

                                                          That’s quite clever but not really practical : instruments heard in most music produced today aren’t "played" by humans.

                                                        • czr 11 days ago

                                                          messed around with the 2stem model for a bit and it's reasonably good. I think phonicmind is still a bit better - phonicmind tends to err on the side of keeping too much, while the 2stem model tries to isolate aggressively and often damages the vocal as a result (distorting words by losing some harmonics, or losing quiet words entirely)

                                                          example:

                                                          https://files.catbox.moe/wjruiv.mp3 (phonicmind)

                                                          https://files.catbox.moe/uuzot3.mp3 (spleeter 2stem)

                                                          you can hear spleeter does better at actually taking out the bass drums, but phonicmind never loses or distorts any part of the vocal, while 2stem occasionally sounds like the singing is coming through a metal tube (harmonics are missing). will try to read instructions more carefully and see if there's some way to fix.

                                                          • roryokane 10 days ago

                                                            For those who, like me, hadn’t heard of PhonicMind before, it’s an online service at https://phonicmind.com/ that charges $4 to $1.5 per song to separate out vocals, drums, bass, and the rest of the sounds. You can upload any audio file to that website and get a 30-second preview of separated parts for it.

                                                          • ooobo 11 days ago

                                                            Gave this a go, it's an easy install with pip, and results are pretty quick even on an old macbook. The 2stems split (vocals/accompaniment) on some random songs I chose is actually quite good using the pretrained models provided. Of course, ripping the vocals out of the accompaniment takes out a good chunk of the middle frequencies so some songs sound a bit wonky. Worth a play if you are interested.

                                                            • tomrod 11 days ago

                                                              Same thoughts here. I ran Thriller, Alligator by Of Monsters and Men, and In Hell I'll be in Good Company by The Dead South on the 2 / 5 / 4 stems, respectively. Impressive results. Definitely agree that some of the middle frequencies show some error.

                                                              It would be really cool to create "music mappers"/life sounds tracks like what you can do with pictures & art styles (e.g. https://medium.com/tensorflow/neural-style-transfer-creating...)

                                                              • colorincorrect 11 days ago

                                                                knowing nothing about the results, i suspect that mid-ranges are poorer mainly because human frequency response is most sensitive towards mid-range aka vocal-pitch frequency

                                                              • sehugg 11 days ago

                                                                It's really good on the 2-stem stuff. On the 4-stem model, it's a bit shy about the bass part, and parts drift in and out. I'd like to try it on a FLAC.

                                                                • lrobinovitch 11 days ago

                                                                  Same, Rage Against The Machine - Killing In the Name came out sounding great. Very cool.

                                                                • iamchrisle 11 days ago

                                                                  Non-open source products that also separate vocals from music if you need something more "professional".

                                                                  One-click process: Xtrax Stems 2 (https://audionamix.com/technology/xtrax-stems/)

                                                                  Professional: ADX Trax Pro 3 (https://audionamix.com/technology/adx-trax-pro/)

                                                                  Both products use a server that hosts much larger pre-trained models. The professional one has added features such as sibilance handling, a GUI to edit note following as a guide for the models, and an editor tool for extracting using harmonics.

                                                                  (Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)

                                                                  • xamuel 11 days ago

                                                                    I wonder how it would fare on Pink Floyd's "Sheep", where vocals seamlessly transform into instrumentals and it's impossible to tell where one ends and the other begins. https://www.youtube.com/watch?v=3-oJt_5JvV4 (skip to around 1:40)

                                                                    • SemiTom 11 days ago

                                                                      Interesting to read Thomas Dolby's thoughts on music/technology interfaces--particularly with VR https://semiengineering.com/thomas-dolbys-very-different-vie...

                                                                      • Intermernet 11 days ago

                                                                        I'd love to see how this compares with Celemony Melodyne. As far as I've been able to determine, Melodyne doesn't use ML, but it's hard to find out exactly what it does use.

                                                                        Either way, an open source competitor to Melodyne is a welcome addition!

                                                                        • dspig 11 days ago

                                                                          There is a patent for Melodyne that describes looking for harmonics vs time in FFTs, then heuristics for deciding which belong to one note and where it starts and ends, then assigning some of the residual energy (e.g. noisy onset) to each note.
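To make the "looking for harmonics in FFTs" step a bit more concrete, here is a minimal sketch of checking one frame for energy at integer multiples of a fundamental. This is only an illustration under my own assumptions, not Melodyne's actual method: the function names, the synthetic test signal, and the fixed threshold are all made up, and it ignores the time tracking and note-assignment heuristics entirely.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return the magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def harmonic_bins(mags, fundamental_bin, threshold):
    """List the integer multiples of fundamental_bin whose magnitude exceeds threshold."""
    return [k for k in range(fundamental_bin, len(mags), fundamental_bin)
            if mags[k] > threshold]


# Synthetic "note" (hypothetical data): a fundamental at bin 4 plus two weaker harmonics.
N = 64
PARTIALS = ((4, 1.0), (8, 0.5), (12, 0.25))  # (bin, amplitude)
frame = [sum(a * math.sin(2 * math.pi * k * t / N) for k, a in PARTIALS)
         for t in range(N)]

mags = dft_magnitudes(frame)
found = harmonic_bins(mags, fundamental_bin=4, threshold=1.0)
print(found)
```

A real implementation would repeat this over overlapping windowed frames and then decide which peaks belong to the same note, which is where the heuristics in the patent come in.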

                                                                          • SyneRyder 11 days ago

                                                                            That's the second time I've seen someone mention Melodyne for separating vocals from a full song source - I don't think that's something it can do? Melodyne is for tuning vocals / instruments & correcting timing on already isolated tracks.

                                                                            • czr 11 days ago

                                                                              melodyne's editing interface lets you remove different notes from a polyphonic track. so if it's just vocals + other tonal sounds, you can manually remove the other tonal sounds. example: https://youtu.be/2ZjdDatxTaQ?t=83

                                                                              • vonseel 11 days ago

                                                                                Hmm, never tried that with melodyne myself and the video you posted isn't a great example of an accurate vocal extraction - those are more like vocal chops and are already pretty dirty to begin with. Based on my experience with Melodyne, I'd be surprised if you could cleanly extract a plain singing vocal without tons and tons of work.

                                                                            • matchagaucho 11 days ago

                                                                              I’ve always assumed Melodyne uses FFT bins.

                                                                            • bravura 11 days ago

                                                                              Is the paper, "Spleeter: A Fast And State-of-the Art Music Source Separation Tool With Pre-trained Models", available yet? What is the methodology?

                                                                            • davidy123 11 days ago

                                                                              I look forward to a day I can click a button to watch videos online without any unnecessary and distracting background music (though it would be better if there were an option and precedent to offer unornamented narrative in video players). The next step after this would be to have live 'music cancelling' headphones for the grocery store (if such a thing still exists).

                                                                              • Wow. Office background noise mute.

                                                                                The headphones can filter out speech that isn't above a certain threshold. Coworkers nearby can be heard loud and clear.

                                                                                Music can play at volume then quiet itself when it detects a person speaking directly to you.

                                                                                Maybe even a training button to inform it that it is false-positive-ing background noise, or true negative and silencing a coworker you would like to hear.

                                                                              • huskyr 11 days ago

                                                                                This is incredible. I made an example using David Bowie's "Changes". A bit robotic, but even the echo is still present in the vocal track. https://www.youtube.com/watch?v=KPlmrq_rAzQ

                                                                                • iagooar 11 days ago

                                                                                  Does it work with spoken word as well? My use case: improve podcast quality by extracting the vocals only, and leaving out all background and accidental noise.

                                                                                  • ssttoo 11 days ago

                                                                                    Not free or open source, but you can try a plugin called iZotope RX for this purpose.

                                                                                  • clashmeifyoucan 11 days ago

                                                                                    I wonder how this compares to Open Unmix (https://github.com/sigsep/open-unmix-pytorch), that one calls itself state-of-the-art as well and is done in collaboration with Sony from what I see of their paper.

                                                                                  • exogen 9 days ago

                                                                                    The extracted vocals sound great! But the resulting accompaniment tracks I've heard so far (tried on a handful of songs) aren't of usable quality for most purposes where you'd want an instrumental track – they're too sonically mangled.

                                                                                    Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...

                                                                                    What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.

                                                                                    Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).

                                                                                    • alez 10 days ago

                                                                                      Tried it on “Halleluwah” by CAN, had to hear those drums:

                                                                                      https://soundcloud.com/alezzzz/can-halleluwah-drums-extracte...

                                                                                      Finding drum breaks in music is very time consuming. This is gonna be amazing for music production. Think how 90s jungle would’ve been if they had access to every drum take ever

                                                                                      • smrq 10 days ago

                                                                                        Wow, this is the isolated track I never knew I needed to hear.

                                                                                      • BeeBoBub 11 days ago

                                                                                        I am working on a product which makes use of this technology. I generate vocal pitch visualizations for karaoke

                                                                                        http://pitchperfected.io

                                                                                        • noja 11 days ago

                                                                                          Cool - but your website needs some work. It looks like a landing page to gather interest rather than something backed by a real product. Show us some videos and singing, before and after, etc.

                                                                                          • mh- 11 days ago

                                                                                            FYI your email confirmation is going straight to spam on gmail. I'd recommend reaching out to Mailchimp.

                                                                                          • gdsdfe 11 days ago

                                                                                            Wait the implications of this are huge for electronic music DJs

                                                                                            • tomrod 11 days ago

                                                                                              Audio Neural Transfer Learning could be amazing.

                                                                                              • exikyut 11 days ago

                                                                                                The audio (^F soundcloud) sounds a little warbly... if that can be largely mitigated, then yes, remixes will never be the same

                                                                                                • yowlingcat 11 days ago

                                                                                                  While not great, the phase smearing is orders of magnitude better than most vocal isolation plugins I've used. I only expect it to get better. Very cool!

                                                                                              • mettamage 10 days ago

                                                                                                Ah, so this should've been the answer to my ask HN [1].

                                                                                                ;-)

                                                                                                Edit: I see someone added it in as an answer 14 hours ago. Well, you have my vote ^^

                                                                                                [1] https://news.ycombinator.com/item?id=21399838

                                                                                                • cma 11 days ago

                                                                                                  Is there anything like this for images? Meaning essentially trying to decompose back into photoshop layers. Wouldn't be feasible for lots of stuff that is completely opaquely covering something, but I'm thinking for things like recoloring a screen print, etc.

                                                                                                • orloffm 11 days ago

                                                                                                  Played with it. The quality of the result is mostly dependent on the amount of clipping in the source file. Basically, all post-90s masters produce weird results with orcs singing in the background. And classics from the 60s yield fantastic results.
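If you want a rough number for "amount of clipping" before feeding a file in, a crude proxy is the fraction of samples sitting at full scale. This is a sketch under my own assumptions (signed 16-bit PCM samples already decoded into a list of ints; the function name is made up). It will miss masters that were limited hard and then turned down slightly, so treat it as a hint only.

```python
def clipping_ratio(samples, full_scale=32767):
    """Fraction of signed 16-bit samples at or beyond full scale (a crude clipping proxy)."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


# Hypothetical sample data: two of the four samples are pinned at the limits.
ratio = clipping_ratio([0, 32767, -32768, 100])
print(ratio)
```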

                                                                                                • strags 11 days ago

                                                                                                  This is awesome. I now have Guns'n'Roses playing in my office, and Axl Rose is a faint voice coming from the garage.

                                                                                                  • murat124 11 days ago

                                                                                                    I gave it a try with Megadeth's Holy wars[1], was expecting something like this[2] but got very deep audio. Not sure why, but perhaps it's because bassist David Ellefson uses a pick, which gives the percussive sound, and it suits Megadeth.

                                                                                                    Any parameter I could use with spleeter to get a similar output?

                                                                                                    [1] https://www.youtube.com/watch?v=9d4ui9q7eDM

                                                                                                    [2] https://www.youtube.com/watch?v=uWkykQHsJ-Y

                                                                                                    • mothsonasloth 11 days ago

                                                                                                      I'm trying to find something to generate tracks without guitar. Then I can cover them with my guitar. Will this software help me?

                                                                                                      • ksherlock 10 days ago

                                                                                                        depends? It splits it into 2 parts (vocal, everything else), 4 parts (drum, bass, vocal, everything else), or 5 parts (drum, bass, vocal, piano, everything else). piano isolation is the weakest.

                                                                                                        If you want to play along to drum + bass, then yes.

                                                                                                      • sheinsheish 11 days ago

                                                                                                        Can we use sample libraries to write, record and simulate desired stems for training? I guess the more naturally played the better?

                                                                                                        • theLotusGambit 11 days ago

                                                                                                          Not only are the results good, but the music is generated reasonably quickly. The implications are clear: whoever wants to make a quick fortune on YouTube should start converting and uploading truckloads of songs as fast as possible. The demand is there. I could easily see that bringing in millions of views.

                                                                                                          • ErikAugust 11 days ago

                                                                                                            They’d still get tagged for copyright.

                                                                                                          • toptal 11 days ago

                                                                                                            Can someone provide a demo link of source music vs. output?

                                                                                                            • tomrod 11 days ago

                                                                                                              I gave it a test using the project audio sample. Neat stuff. https://soundcloud.com/thomas-roderick-836298141/sets/spleet...

                                                                                                              • semiotagonal 11 days ago

                                                                                                                Holy shit that works way better than I expected. The github project should link to this or a similar example, the technical description doesn't do it justice.

                                                                                                                • jcims 11 days ago

                                                                                                                  Yeah the fact that it got the reverb in the vocal track is pretty impressive!

                                                                                                                  • unlinked_dll 11 days ago

                                                                                                                    I really disagree. It sounds... awful. On par with other approaches, sure, but the main vocals sound like a case study in digital artifacts and the accompaniment sounds like there's a filter automated over the track.

                                                                                                                    Far from useful.

                                                                                                            • jknz 11 days ago

                                                                                                              On iOS, Chord AI [1] gives pretty good results for the guitar chords of any music surrounding the phone.

                                                                                                              [1]: https://apps.apple.com/us/app/chord-ai/id1446177109

                                                                                                              • ohlookabird 11 days ago

                                                                                                                That sounds pretty nice! Anyone know an Android version? I just checked Yamaha Chord Tracker and MyChord, but neither seems to be able to use the microphone.

                                                                                                              • amelius 11 days ago

                                                                                                                > Spleeter is the Deezer source separation library with pretrained models

                                                                                                                Curious, what is Deezer using this for?

                                                                                                                • Zopieux 9 days ago

                                                                                                                  On-demand karaoke parties right inside Deezer?

                                                                                                                  • neohaven 11 days ago

                                                                                                                    Gonna guess beat/mood/song analysis.

                                                                                                                    • amelius 10 days ago

                                                                                                                      Yeah but they can use the raw song for that I suppose.

                                                                                                                      • cbHXBY1D 10 days ago

                                                                                                                        Easier way to match voices?

                                                                                                                  • tomrod 11 days ago

                                                                                                                    This is so neat! I went looking a few months back for something like this, and the best I found was Google's Magenta.

                                                                                                                    It would be really cool to use this to feed into Magenta. Think of the mashups!

                                                                                                                    • esfandia 11 days ago

                                                                                                                      Karaoke with the most obscure songs!

                                                                                                                      • xamuel 11 days ago

                                                                                                                        And even better, karaoke that doesn't suck--most karaoke tracks are covers by cheap bands and you can clearly tell the inferior quality if you're familiar with the song.

                                                                                                                        • megablast 11 days ago

                                                                                                                          Sure, not to mention the awful singing accompanying most karaoke tracks.

                                                                                                                      • hodder 11 days ago

                                                                                                                        Very cool. A close friend of mine (and lead singer in our band many years ago) recently died and we have a couple great recordings from 2 decades ago of his vocals. When we recorded the rest of the instruments they were DI into a Boss BR8. The lyrics sound awesome but the guitar and drums are recorded poorly. This may give us a chance to split the vocals out of the final tracks, and re-record the tracks as a tribute.

                                                                                                                        Much appreciated.

                                                                                                                        • antman 11 days ago

                                                                                                                          How could we extract anything but the voice e.g. karaoke?

                                                                                                                          • danso 10 days ago

                                                                                                                            The repo's quick start instructions [0] show how to use it with the "2-stems" model [1], which separates the source audio into two files: outputdir/source/vocals.wav and outputdir/source/accompaniment.wav:

                                                                                                                                $ spleeter separate -i spleeter/source.mp3 \
                                                                                                                                     -p spleeter:2stems \
                                                                                                                                     -o outputdir
                                                                                                                            
                                                                                                                            
                                                                                                                            [0] https://github.com/deezer/spleeter#quick-start

                                                                                                                            [1] https://github.com/deezer/spleeter/wiki/2.-Getting-started#u...
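If you'd rather script this than type it, the same command can be assembled from Python. This is just a sketch: the `-i`, `-p` and `-o` flags are the ones from the quick start quoted above and may differ in other Spleeter versions, the 4- and 5-stem model names are assumed to follow the same `spleeter:Nstems` pattern, and `build_separate_command` is a made-up helper name.

```python
import shlex

# Model names; 2stems is from the quick start, the other two are assumed
# to follow the same naming pattern for the 4- and 5-part splits.
STEM_MODELS = {2: "spleeter:2stems", 4: "spleeter:4stems", 5: "spleeter:5stems"}


def build_separate_command(input_path, stems=2, output_dir="outputdir"):
    """Build the argument list for a `spleeter separate` call (does not run it)."""
    if stems not in STEM_MODELS:
        raise ValueError(f"unsupported stem count: {stems}")
    return ["spleeter", "separate",
            "-i", input_path,
            "-p", STEM_MODELS[stems],
            "-o", output_dir]


cmd = build_separate_command("spleeter/source.mp3", stems=2)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would actually run it, assuming the `spleeter` executable is installed and on your PATH.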

                                                                                                                            • aasasd 11 days ago

                                                                                                                              I'd guess extract the voice and then subtract it from the rest with something like Audacity. I'm not sure which operation would do that, but I believe that it exists.

                                                                                                                              Also, other comments here speak of “separating into the voice and accompaniment,” so maybe the model/program already do exactly what you need.

                                                                                                                              • marcan_42 11 days ago

                                                                                                                                Invert and mix.

                                                                                                                                When you have an instrumental version of a song (from the same stems as the vocal version) this is already one way to get the vocals out without any fancy machine learning. The main tricks besides what you can do in Audacity like that are properly time-aligning the tracks (even if they drift a bit) and compensating for phase issues and compression. I wrote a dirty tool that does that and I've been meaning to turn it into some kind of nicer GUI version.
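For anyone wondering why "invert and mix" works, here is a toy sketch with made-up integer samples. It assumes the ideal case: the instrumental is sample-for-sample identical to what is inside the full mix and perfectly time-aligned. Real tracks need the alignment and phase/compression compensation described above, which this does not attempt.

```python
def invert(samples):
    """Flip the polarity of every sample."""
    return [-s for s in samples]


def mix(a, b):
    """Sum two equal-length tracks sample by sample."""
    if len(a) != len(b):
        raise ValueError("tracks must have the same length")
    return [x + y for x, y in zip(a, b)]


# Hypothetical sample data standing in for real audio.
vocals = [0, 120, -340, 560, -80]
instrumental = [1000, -2000, 1500, -500, 250]
full_mix = mix(vocals, instrumental)

# Mixing the full track with the inverted instrumental cancels the instrumental.
recovered = mix(full_mix, invert(instrumental))
print(recovered)
```

The same thing in the other direction (invert the extracted vocal and mix it with the original) gives you the karaoke track.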

                                                                                                                                • navjack27 10 days ago

                                                                                                                                  I've been doing something like this for a bit in Audition. Center channel extract > invert phase > save as wav > create multi-track project > add original > add modified > up the volume on the vocal extracted modified version until the vocals go away

                                                                                                                                  Or you can do the exact opposite and instead of center channel extracting the vocals you can remove the vocals and use this method to better isolate vocals.

                                                                                                                                  Although if things do fun stuff with stereo it might not be exact.
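The simplest form of the center-channel trick is just left minus right: anything mixed identically into both channels cancels. The sketch below uses made-up integer samples and a made-up function name, and it is far cruder than what Audition's center channel extractor does, but it shows both the cancellation and why stereo tricks on the vocal break it.

```python
def side_signal(left, right):
    """Left minus right: content that is identical in both channels cancels out."""
    return [l - r for l, r in zip(left, right)]


# Hypothetical sample data: a centered vocal, a guitar panned hard left,
# and keys panned hard right.
vocal = [100, -200, 300, -400]
guitar = [10, 20, 30, 40]
keys = [5, -5, 5, -5]

left = [v + g for v, g in zip(vocal, guitar)]
right = [v + k for v, k in zip(vocal, keys)]
no_vocal = side_signal(left, right)  # only guitar minus keys remains

# If the vocal is louder on one side (2x on the left here), it no longer cancels.
left_wide = [2 * v + g for v, g in zip(vocal, guitar)]
leftover = side_signal(left_wide, right)
print(no_vocal, leftover)
```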

                                                                                                                                  • voicedYoda 11 days ago

                                                                                                                                    If you don't mind sharing (even if it's cmd line), I'd love to explore.

                                                                                                                                • NilsIRL 11 days ago

                                                                                                                                  I'm pretty sure it does that by default.

                                                                                                                                • fdej 11 days ago

                                                                                                                                  Has anyone tested this with Glenn Gould recordings?

                                                                                                                                  • jacquesm 11 days ago

                                                                                                                                    Hehe, you want to split his singing and humming into a separate track?

                                                                                                                                  • jefftk 11 days ago

                                                                                                                                    I have a large number of multitrack recordings of contra dances if anyone wants to try training this on them.

                                                                                                                                    • exikyut 11 days ago

                                                                                                                                      You train this on them! (And then put the results on SoundCloud or YouTube.)

                                                                                                                                    • yoogidoky 10 days ago
                                                                                                                                      • dgreensp 11 days ago

                                                                                                                                        Are there any examples I can listen to?

                                                                                                                                      • nagateja_1995 10 days ago

                                                                                                                                        I was just testing it out on a Gazal, and it seems to work perfectly, but it seems to fail with Qawwalis. Going by my understanding of how this works, is it safe to assume that the training data from Deezer lacks enough examples of Qawwalis?

                                                                                                                                        • seventhtiger 10 days ago

                                                                                                                                          Aren't all the most popular audio formats lossy? Extracting full data from lossy compression requires reconstruction. Even if they are able to completely extract all tracks they would have gaps and be very low quality.

                                                                                                                                        • xiphmont 11 days ago

                                                                                                                                          Just tried it on _Meet The Sniper_. Disappointment :-(

                                                                                                                                          [This was an unreasonable test, and it did really well considering the likely training set. I bet it could do much better with better data. Still... man, I was so hoping for magic.]

                                                                                                                                          • lebuffon 11 days ago

                                                                                                                                            Has someone in the Intelligence community approached the author? Oops that's classified. :)

                                                                                                                                            This has implications for extracting sounds from noisy recordings, or am I off base? Does it only track pitch patterns?

                                                                                                                                            • jononor 11 days ago

                                                                                                                                              Source separation of speech and speech denoising are well-established fields, more researched in general than music source separation. Intelligence officers very likely have access to a range of well-performing ML tools for extracting speech.

                                                                                                                                            • haywirez 11 days ago

                                                                                                                                              Is there a known approach that attempts to separate all distinct sounds (timbres rather than pitch) in a track? Specifically targeted at electronic music, not standard acoustic ensembles.

                                                                                                                                              • dancek 11 days ago

                                                                                                                                                That's an interesting idea. Many instruments have greatly varying timbres, though. Combining the timbres back to instruments would require another level of processing.

                                                                                                                                              • jMyles 11 days ago

                                                                                                                                                Somewhat of a tangent, but does anyone have a recommendation for an open source (ideally Python) program that can make MIDI from piano audio?

                                                                                                                                                • jacquesm 11 days ago

                                                                                                                                                  I've been working on this for the last 3 months.

                                                                                                                                                  Can you send me your audio file? I'll send you back the midi I can generate; it won't be perfect but it might be usable. See comment elsewhere in this thread.

                                                                                                                                                • tianshuo 10 days ago

                                                                                                                                                  Here's an open-source project from Google's Magenta project that does exactly what you want: https://magenta.tensorflow.org/onsets-frames

                                                                                                                                                  • iamchrisle 11 days ago

                                                                                                                                                    There is probably something out there, but I know you can do this in Ableton Live by dragging an audio file onto a MIDI track and it will extract the notes into MIDI for you.

                                                                                                                                                    • jononor 11 days ago

                                                                                                                                                      Automatic music transcription is the technical/academic name of this task. Maybe that can help you in your search?

                                                                                                                                                      • CurryPaste 11 days ago

                                                                                                                                                        MIDI Guitar 2 from Jam Origin (jamorigin.com) works well even for piano.

                                                                                                                                                        • vonseel 11 days ago

                                                                                                                                                          Melodyne or Ableton. I've found Melodyne to be more accurate, but still not perfect.

                                                                                                                                                      • odiroot 11 days ago

                                                                                                                                                        Well, can it extract the bass track from "...And Justice for All"?

                                                                                                                                                        • amelius 11 days ago

                                                                                                                                                          Can this also separate the backing vocals from the lead singer?

                                                                                                                                                          • marsknight 11 days ago

                                                                                                                                                            Thank you!! This is really amazing :)

                                                                                                                                                            • unlinked_dll 11 days ago

                                                                                                                                                              demo links would be helpful

                                                                                                                                                              • jsilence 10 days ago

                                                                                                                                                                Karaoke everything!

                                                                                                                                                                • foobaw 11 days ago

                                                                                                                                                                  This is amazing - so much possible learning for aspiring producers.

                                                                                                                                                                  • nycjobsboard 11 days ago

                                                                                                                                                                    Cool stuff

                                                                                                                                                                    • propercoil 11 days ago

                                                                                                                                                                      PyTorch > TensorFlow