Pitch Detection with Convolutional Networks

(0xfe.blogspot.com)

83 points | by zeroxfe 1524 days ago

10 comments

  • oever 1524 days ago
    Just the other day I discovered what a big research field Music Information Retrieval is.

    Here is the video archive of a recent conference on the topic.

    https://ismir2019.ewi.tudelft.nl/

    There are some FOSS applications (e.g. https://www.sonicvisualiser.org/), but I'm surprised at how bad the results of the analysis are. Intuitively, it seems like such a simple problem.

  • unlinked_dll 1524 days ago
    Using synthesized audio from MIDI seems like an atrocious way to train a neural net for pitch detection. It's also not particularly difficult to detect pitch in such sounds; you should mess with the data a bit to remove the fundamental, add inharmonic tones, noise, vibrato, etc. Just counting zero crossings is enough for most of that data.

    Side note: cepstral processing is going to be a lot more effective than spectrograms, and preprocessing is cheaper than training the ANN.
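
    For reference, the zero-crossing estimate really is a few lines on clean tones (a minimal numpy sketch, not from the article):

      import numpy as np

      def zero_crossing_pitch(signal, sample_rate):
          # Indices where the waveform crosses zero going upward.
          up = np.where((signal[:-1] < 0) & (signal[1:] >= 0))[0]
          if len(up) < 2:
              return None
          # Average period between upward crossings, in samples.
          period = np.mean(np.diff(up))
          return sample_rate / period

    On a clean 440 Hz sine this lands essentially on 440 Hz; it degrades quickly once noise, inharmonicity, or strong overtones are mixed in, which is the point of the suggested augmentations.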

    • zeroxfe 1524 days ago
      > Using synthesized audio from midi seems like an atrocious way to train a neural net for pitch detection.

      Not sure why you think so. Almost everything you suggested (missing fundamentals, noise, vibrato, reverb, velocity, distortion, etc.) can be synthesized with tools like sox. It worked very well for me. :-)
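
      For illustration, most of those corruptions are also easy to script directly -- a hypothetical numpy sketch of that kind of synthesis (not the actual pipeline from the post):

        import numpy as np

        def synth_tone(f0, sr=16000, dur=1.0, n_harmonics=8,
                       drop_fundamental=False, vibrato_hz=5.0,
                       vibrato_depth=0.005, noise_level=0.01):
            t = np.arange(int(sr * dur)) / sr
            # Slow sinusoidal detuning approximates vibrato.
            phase = 2 * np.pi * f0 * t * (1 + vibrato_depth * np.sin(2 * np.pi * vibrato_hz * t))
            start = 2 if drop_fundamental else 1  # optionally strip the fundamental
            x = sum(np.sin(k * phase) / k for k in range(start, n_harmonics + 1))
            x = x + noise_level * np.random.randn(len(t))  # additive noise
            return x / np.max(np.abs(x))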

      > cepstral processing is going to be a lot more effective than spectrograms, and preprocessing is cheaper than training the ANN.

      I did do this in my initial attempts, and found no improvement over spectrograms. Turns out NNs can learn log nonlinearities quite easily. (EDIT: to be precise, I calculated the mel-cepstrum and fed it to the network.)

      • matheist 1523 days ago
        > to be precise, I calculated the mel-cepstrum and fed it to the network

        The mel-scale cepstrum is inappropriate for pitch detection. You want to use the cepstrum without scaling the frequencies: take the Fourier transform, normalize, take logarithms[+], then (inverse) Fourier transform in the frequency domain.

        The advantage of using the cepstrum for pitch detection is that most signals you're looking at will be harmonic — equally spaced overtones — and so when you take a Fourier transform in the frequency domain you'll get a peak corresponding to that equal spacing, which will provide you with the fundamental frequency. (Even if it's missing!)

        Using the mel scale totally wrecks that periodicity and throws away the pitch information. (Which is part of why it's used for speech-to-text! In those cases you want to throw away pitch information. Unless you're processing a tonal language, in which case probably don't use the mel scale.)

        [+] Why logarithms? In the frequency domain, most harmonic sounds look like the product of a high-frequency periodic "signal" (the fundamental and its overtones) with a slowly varying signal (frequencies which are emphasized or de-emphasized, like e.g. formants in the case of speech). Taking the logarithm splits that into the sum of a high-frequency signal (overtones) and a low-frequency signal (formants), and since the Fourier transform is linear, that'll show up as a single peak in the cepstrum corresponding to the gap between overtones (i.e. the fundamental) and some stuff in the low-frequency bins corresponding to the formants.
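
        In code, that recipe is short (a numpy sketch of the above; frame and sample_rate are assumed):

          import numpy as np

          def cepstral_pitch(frame, sample_rate, fmin=50.0, fmax=1000.0):
              spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
              log_mag = np.log(np.abs(spectrum) + 1e-10)
              # Inverse transform of the log spectrum gives the cepstrum, indexed by quefrency (a time lag in samples).
              cepstrum = np.fft.irfft(log_mag)
              # Search only quefrencies corresponding to plausible pitch periods.
              qmin, qmax = int(sample_rate / fmax), int(sample_rate / fmin)
              peak = qmin + np.argmax(cepstrum[qmin:qmax])
              return sample_rate / peak

        Fed the 200/300/400 Hz missing-fundamental example from elsewhere in this thread, the peak should still land at the quefrency of 100 Hz, since the overtone spacing is exactly what the cepstrum measures.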

        • zeroxfe 1523 days ago
          Thanks, that's very helpful, and explains why the mel-cepstrum didn't work so well. So far, I've found that I'm getting the best results with FFT + log scale, particularly against audio with the fundamentals stripped.

  • dentalperson 1524 days ago
    I think this is a nice writeup, but I would argue with the claim [The error is] "Pretty much exactly the resolution of the FFT we used. It's very hard to do better given the inputs."

    The 19 Hz error being explained by the FFT resolution only makes sense for a classification-based loss/error that used the FFT's N/2 frequency bins as classes.

    Since the proposed network uses regression, even though the frequency resolution is 19 Hz you should be able to estimate pitch with finer resolution when using any popular non-rectangular window, because the known shape of the window's main lobe can be fit to the sampled spectrum. You would only expect such a large error at very low frequencies, where there isn't much to interpolate on because the next harmonic would overlap.

    For an example, see figure 4 in PARSHL (one of the original sinusoidal analysis frameworks, in which the frequency of each harmonic is estimated by fitting a parabola): https://ccrma.stanford.edu/~jos/parshl/parshl.pdf

    A neural network should be able to do much better than parabolic fitting.
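
    For reference, the parabolic fit itself is tiny -- a sketch, assuming mag_db is the frame's log-magnitude spectrum and k is the index of the peak bin:

      import numpy as np

      def parabolic_peak(mag_db, k):
          # Fit a parabola through the peak bin and its two neighbours; the vertex gives a sub-bin offset in (-0.5, 0.5).
          a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
          delta = 0.5 * (a - c) / (a - 2 * b + c)
          return k + delta  # fractional bin; multiply by sample_rate / n_fft for Hz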

    • zeroxfe 1524 days ago
      Thanks, yes, this is totally correct -- the FFT uses a Tukey window, which should make it possible to match the main lobe. I got some other feedback that perceptual training tasks have a very long tail, so it's possible that the network will learn better if I run it for a few hours (I only ran it for about 10 minutes.)

      I'll give it a shot (and edit the post.)

  • oever 1524 days ago
    Did you consider using a Constant-Q Transform instead of the STFT?

    mpv and ffmpeg come with CQT visualization:

    mpv --lavfi-complex="[aid1]asplit[ao][a]; [a]showcqt[vo]" "$@"

    You can even get it from the microphone with some piping:

    parec --latency-msec=1 | sox -V --buffer 32 -t raw -b 16 -e signed -c 2 -r 44100 - -r 44.1k -b 16 -e signed -c 2 -t wav - | ffplay -fflags nobuffer -f lavfi 'amovie=pipe\\:0,asplit=2[out1][a],[a]showcqt[out0]'
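
    In Python, librosa also ships a CQT, if that's easier to experiment with -- a minimal sketch, assuming a mono audio.wav:

      import librosa
      import numpy as np

      y, sr = librosa.load("audio.wav", sr=None, mono=True)
      # Seven octaves up from C1, 36 bins per octave = 3 bins per semitone.
      C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C1"), n_bins=252, bins_per_octave=36))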

  • jmwilson 1523 days ago
    Resolution in the frequency domain can be significantly improved over the natural resolution of the DFT (19 Hz in your case). If the fundamental frequency exactly matches one of the DFT bin frequencies, the frame size is an exact multiple of the period, so the DFT of successive frames would look the same and there would be no phase shift at the fundamental component. (Or, if overlapping frames were analyzed, there would be an expected phase shift in proportion to the amount of overlap.)

    In the more likely case that the fundamental doesn't match a bin frequency, there's a phase shift between successive frames that's proportional to the difference between the fundamental and the associated bin frequency. This can be extracted to get a more accurate estimate of the actual fundamental. I wrote an STFT-based chromatic tuner as a hobby project using this technique, and it would easily resolve better than 0.1 Hz changes using 10(ish) ms frame sizes.
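
    A sketch of the idea (not the tuner's actual code; it assumes two frames offset by hop samples and the index k of the peak bin):

      import numpy as np

      def refine_frequency(frame1, frame2, k, sample_rate, hop):
          n = len(frame1)
          w = np.hanning(n)
          p1 = np.angle(np.fft.rfft(frame1 * w)[k])
          p2 = np.angle(np.fft.rfft(frame2 * w)[k])
          # Phase advance expected if the tone sat exactly on bin k.
          expected = 2 * np.pi * k * hop / n
          # Deviation from that, wrapped to (-pi, pi].
          dev = np.angle(np.exp(1j * ((p2 - p1) - expected)))
          # Corrected frequency estimate in Hz.
          return (k + dev * n / (2 * np.pi * hop)) * sample_rate / n
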
  • BookPage 1524 days ago
    Nice post! I'm working on tempo detection with deep networks atm and found the synthesis section very helpful. I'm wondering if you read many of the recent publications on pitch detection + deep learning to guide your model building. At least for tempo detection there's an abundance of material to use as starting points, which can help bootstrap the network build.
    • zeroxfe 1524 days ago
      I didn't look very hard for deep learning approaches to pitch detection, mainly because I was really interested in chord recognition (which is a far more interesting problem, IMO.) I didn't find any good research here (would love pointers.)

      I did, though, spend a lot of time studying non-DL approaches to pitch detection, mainly because I wanted better real-time performance for my game Pitchy Ninja (https://pitchy.ninja).

      • BookPage 1524 days ago
        Yeah, chord recognition is a much meatier problem for sure. Anything where the pitch signal gets blended with others is pretty cool. I haven't dived deep into pitch work yet, but this [1] is a fairly solid recent review paper that I followed with good results for tempo stuff.

        [1] https://arxiv.org/abs/1905.00078

    • jimhefferon 1524 days ago
      Very interesting. I have long wondered: if I have the sheet music, is tempo detection good enough that it can follow along on the sheet?

  • knzhou 1524 days ago
    It would be interesting to see how this performs, compared to simpler methods, on hard cases like a missing fundamental. (If you play a sound with power at 200 Hz, 300 Hz, 400 Hz, ..., i.e. all the multiples of 100 Hz but not 100 Hz itself, humans always perceive a pitch of 100 Hz.)
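
    That test signal is a couple of lines of numpy, if anyone wants to try it against their favourite estimator:

      import numpy as np

      sr = 44100
      t = np.arange(sr) / sr
      # Power at 200, 300, 400, ... Hz -- every multiple of 100 Hz except 100 Hz itself.
      x = sum(np.sin(2 * np.pi * f * t) for f in range(200, 2001, 100))
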
    • zeroxfe 1524 days ago
      I have training data which includes examples with missing fundamentals (synthetically removed), and the network does learn to recognize this. (The heavy regularization also helped a lot.)

      Although I was able to test with real instruments (and my grotesque voice), I didn't find any good live examples of audio with missing fundamentals to test with. It did recognize held-out synthesized data correctly though.

  • amylene 1524 days ago
    Could this be used on human voices to detect tones like anger, openness, etc? If so, it might be monetizable as a sales value add.
    • zeroxfe 1524 days ago
      That's definitely interesting, never thought about it. Prob hard to find "angry voice" training data though :-)
      • keenmaster 1524 days ago
        You can scrape a database of movie scripts, tie each word in the script to a moment in the movie (accurate to the second), and extract recordings that are supposed to demonstrate an emotion. You can even use the modifiers that are in a script, such as “very angrily”, to train on various degrees of each emotion. If emotions are not inscribed into the script, you can use textual affect detection tools. There’s got to be some way to do this without having mTurks label every second of a recording.

  • rkagerer 1524 days ago
    What kind of latency was achieved on the detection side? Could you use this for real-time applications?
    • zeroxfe 1524 days ago
      About 10 ms per sample (including pre-processing inputs and generating spectrograms) on a current-gen GPU. It's a bit too high for real-time detection (compared to well-known pitch estimation approaches.)

  • anonytrary 1524 days ago
    I have no idea why we need to do this when we can just compute the strongest frequency components with a Fourier transform. Can anyone explain the advantages of this seemingly expensive method over simple (and more complex) Fourier analysis?
    • zeroxfe 1524 days ago
      See the section "On Pitch Estimation" which addresses exactly that:

      --- Pitch detection (also called fundamental frequency estimation) is not an exact science. What your brain perceives as pitch is a function of lots of different variables, from the physical materials that generate the sounds to your body's physiological structure.

      One would presume that you can simply transform a signal to its frequency domain representation, and look at the peak frequencies. This would work for a sine wave, but as soon as you introduce any kind of timbre (e.g., when you sing, or play a note on a guitar), the spectrum is flooded with overtones and harmonic partials. ---

      All this said, you don't need deep learning for decent pitch detection -- it's a solved problem, and there are lots of well-known algorithms for it. Deep learning is useful for more advanced music info retrieval such as interval and chord recognition, which was one of my goals with this experiment.
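
      To make that concrete, a toy example (not from the post) where naive peak-picking lands an octave high:

        import numpy as np

        sr, f0 = 44100, 220.0
        t = np.arange(sr) / sr
        # A timbre whose second harmonic is louder than its fundamental.
        x = 0.4 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)
        mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        print(np.argmax(mag) * sr / len(x))  # ~440 Hz: an octave above the perceived pitch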

      • AstralStorm 1523 days ago
        You can use deep networks to get better resolution there, though; the keyword is spectral reassignment. Normal methods use maximum likelihood.

        What you will get is a variant of the spectrogram. The method is related to edge-directed interpolation and the bilateral transform. (You can do the same with the log-cepstrum.)
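
        For the curious, librosa ships a (classical) reassigned spectrogram -- a minimal sketch, assuming a mono audio.wav:

          import librosa

          y, sr = librosa.load("audio.wav", sr=None)
          # Each STFT cell's energy is relocated to its reassigned time and frequency.
          freqs, times, mags = librosa.reassigned_spectrogram(y, sr=sr)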

        I find the whole linked exercise an undergrad-level toy, even in such a basic thing as pitch.

        The real problem is polyphony and instrument segmentation, which this does not begin to touch.