Transformers from Scratch

(peterbloem.nl)

265 points | by stablemap 1707 days ago

10 comments

  • cgearhart 1707 days ago
    This is a _great_ article. One of the things I enjoy most is finding new ways to understand or think about things I already feel like I know. This article helped me do both with transformer networks. I especially liked how explicitly and simply things were explained like queries, keys, and values; permutation equivariance; and even the distinction between learned model parameters and parameters derived from the data (like the attention weights).

    The author quotes Feynman, and I think this is a great example of his concept of explaining complex subjects in simple terms.
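
    For anyone who wants the gist before reading: the query/key/value computation the article builds up to boils down to roughly the following (a minimal single-head sketch assuming PyTorch; the names and sizes are illustrative, not the author's exact code). Note how the Linear projections are learned parameters, while the attention weights are computed from the data itself.

        import torch
        import torch.nn.functional as F

        def self_attention(x, wq, wk, wv):
            # x: (batch, seq, emb); wq/wk/wv are learned projections,
            # while the attention weights below are derived from the input.
            q, k, v = wq(x), wk(x), wv(x)
            scores = torch.bmm(q, k.transpose(1, 2)) / (k.size(-1) ** 0.5)
            weights = F.softmax(scores, dim=2)   # (batch, seq, seq) attention weights
            return torch.bmm(weights, v)         # (batch, seq, emb)

        emb = 16
        wq, wk, wv = (torch.nn.Linear(emb, emb, bias=False) for _ in range(3))
        out = self_attention(torch.randn(2, 10, emb), wq, wk, wv)
        print(out.shape)   # torch.Size([2, 10, 16])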

  • dusted 1707 days ago
    And here I was, excited to learn something about actual transformers, something involving wire and metal...
    • myself248 1707 days ago
      Same. Magnetization current and core saturation are fairly fundamental properties influencing transformer design, but they're barely even mentioned in introductory texts.

      I feel like a lot of modern transformers are just sort of cargo-cult imports of old designs because everyone who knew the salient parameters has retired and the current crew just kinda nudges things until they work. A from-scratch explanation, up to the current state of the art, would be invaluable to anyone who deals with them.

      But nah. This is HN, where headlines are their own code.

    • petemc_ 1707 days ago
      In that case you might enjoy this:

      https://ludens.cl/paradise/turbine/turbine.html

      Was posted here a while back. Fascinating guy.

    • CDSlice 1707 days ago
      I thought it would be about making a Transformers-like toy using 3D printing or something.
      • skohan 1707 days ago
        That would be a really neat project: printing a functional transformer as a single piece.
    • NKCSS 1707 days ago
      Same here; I would have loved a build creating some actual transformers :)
      • megous 1707 days ago
        We did that in school on a coil-winding machine, and also learned the math.

        I don't remember the winding part being that much fun. :)

        It was something old, akin to this: https://www.youtube.com/watch?v=Y-GyMYZ8yTU

        • segfaultbuserr 1707 days ago
          RF transformers (in the form of various coils, chokes, baluns, etc.) are more interesting and complex to analyze. I once wound three of them, and none worked. I thought I had finally found an article that explains the subject; well, not a chance ;-)
      • quickthrower2 1707 days ago
        I was hoping for monad transformers.
  • yamrzou 1707 days ago
    This is the best article I have read so far explaining the transformer architecture. The clear and intuitive explanation can’t be praised enough.

    Note that the author has a machine learning course with video lectures on YouTube, which he references throughout the article: http://www.peterbloem.nl/teaching/machine-learning

  • Gallactide 1707 days ago
    This man was my professor at the VU.

    Honestly, his lectures were fun and easy to look forward to. I'm really glad his post is getting traction.

    If you find his video lectures, they are a really graceful introduction to most ML concepts.

  • isoprophlex 1707 days ago
    Stellar article. I never understood self-attention; this makes it very clear in a few concise lines, with little fluff.

    The author has a gift for explaining these concepts.

  • NHQ 1707 days ago
    This is sweet. I've written conv, dense, and recurrent networks from scratch. Transformers next!

    Plug: I just published this demo that uses gradient descent to find control points for Bezier curves: http://nhq.github.io/beezy/public/
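
    For anyone curious how that works, the idea is to treat the control points as trainable parameters and minimize the distance between the sampled curve and some target points. A toy sketch (assuming PyTorch and a cubic Bezier; this is not the demo's actual code):

        import torch

        # Fit the two interior control points of a cubic Bezier to a target
        # curve by gradient descent (toy illustration only).
        t = torch.linspace(0, 1, 50).unsqueeze(1)                  # curve parameter
        p0, p3 = torch.tensor([0., 0.]), torch.tensor([1., 0.])    # fixed endpoints
        target = torch.stack([t.squeeze(), torch.sin(3.14 * t.squeeze())], dim=1)

        p1 = torch.zeros(2, requires_grad=True)   # interior control points to learn
        p2 = torch.zeros(2, requires_grad=True)
        opt = torch.optim.Adam([p1, p2], lr=0.05)

        for step in range(500):
            curve = ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                     + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)
            loss = ((curve - target) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

        print(loss.item(), p1.detach(), p2.detach())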

  • ropiwqefjnpoa 1707 days ago
    Ah yes, machine learning architecture transformers, I knew that.
  • siekmanj 1707 days ago
    Wow. I have been looking for a good resource on implementing self-attention/transformers on my own for the last week - can't wait to read this through.
  • ccccppppp 1707 days ago
    Noob question: I have a 1D conv net for financial time series prediction. Could a transformer architecture be better for this task? Is it worth a try?
    • hadsed 1707 days ago
      If you think a longer context length might be helpful, consider stacking convolutions to give higher units a bigger receptive field, or try a convolutional LSTM (a rough sketch of the stacked-conv option is below). If that helps, and you have a further argument for why an even larger context window would be helpful, then perhaps try attention; in that case a Transformer would be reasonable. But your stacked conv net would be the fastest and most obvious thing that should work (with the caveat that I know nothing else about your data and its characteristics, which is a really big caveat).

      Consider looking at your errors and judging whether they stem from things your current model doesn't do well but that Transformers do, i.e., correlating two steps in a sequence across a large number of time steps. Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
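
      For concreteness, the stacked-conv option might look roughly like this (a sketch assuming PyTorch and a univariate series; the sizes are made up). With the dilation doubling per layer, the receptive field grows exponentially with depth, which is the cheap way to get a longer context:

          import torch
          import torch.nn as nn

          class DilatedConvNet(nn.Module):
              # Stacked dilated 1D convolutions: later layers "see" a longer
              # window of the series without attention's quadratic cost.
              def __init__(self, channels=32, layers=4, kernel=3):
                  super().__init__()
                  self.convs = nn.ModuleList()
                  in_ch = 1
                  for i in range(layers):
                      d = 2 ** i                               # dilation 1, 2, 4, 8
                      self.convs.append(nn.Conv1d(in_ch, channels, kernel,
                                                  dilation=d, padding=d))
                      in_ch = channels
                  self.head = nn.Conv1d(channels, 1, 1)        # per-step prediction

              def forward(self, x):                            # x: (batch, 1, time)
                  for conv in self.convs:
                      x = torch.relu(conv(x))
                  return self.head(x)

          net = DilatedConvNet()
          print(net(torch.randn(8, 1, 128)).shape)   # torch.Size([8, 1, 128])

      (For actual forecasting you would shift the padding so each step only sees the past; that is left out here for brevity.)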

      • ccccppppp 1707 days ago
        Thanks for the insight, and also for mentioning the convolutional LSTM; I wasn't aware such a thing existed.

        > Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.

        But aren't CNNs also like a memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.

        Anyway, I'm glad to hear that a transformer architecture isn't totally stupid for my task. I will look up the literature; there seems to be a bit on this matter.

        • hadsed 1694 days ago
          Yeah, in some sense any layer is a "memory module". Perhaps more specifically, attention solves the problem of directly correlating two items in a sequence that are very, very far away from each other. I'd generally caution against using attention prematurely as it's extremely slow, meaning you'll waste a lot of your time and resources without knowing if it'll help. Stacking conv layers or using recurrence is an easy middle step that, if it helps, can guide you on whether attention could provide even more gains.
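
          To put rough numbers on "extremely slow": the attention weights alone form an n-by-n matrix per head, so memory and compute grow quadratically with sequence length, whereas a conv layer's activations grow linearly. Back-of-envelope only, with made-up sizes (float32, one head, 64 channels):

              # Attention score matrix vs. one conv layer's activations, in MB.
              for n in (1_000, 10_000, 100_000):         # sequence length
                  attn_mb = n * n * 4 / 1e6              # n x n scores, float32
                  conv_mb = n * 64 * 4 / 1e6             # n steps x 64 channels
                  print(f"n={n:>7}  attention: {attn_mb:10.1f} MB   conv: {conv_mb:8.1f} MB")
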
  • gwbas1c 1707 days ago
    The title is deceiving. I thought this was an article about building your own electrical transformer, or building your own version of the 1980s toy.
    • kranner 1707 days ago
      FWIW I immediately thought of the ML architecture, not the toy or electrical device. HN is very ML-heavy these days, and in that context 'transformers' has a familiar and obvious meaning.
    • Beldin 1707 days ago
      Same here: I was wondering if he had implemented some Autobots in the kids' programming language Scratch.

      Because that would seriously be awesome.

    • SenHeng 1707 days ago
      I, too, thought it was the former, but the toy geek in me hoped for the latter. To my surprise it was neither.
    • coolness 1707 days ago
      I think the top-level domain of the author's blog kinda gives it away :)
      • ChickeNES 1707 days ago
        Huh? The website's TLD is .nl, not .ml. (And I'll also chime in to say that I thought it was about either electrical transformers or 3D-printed Transformers toys.)
        • coolness 1705 days ago
          Oh wow, my sight must be poor since I honestly thought it was .ml.