A Visual Guide to Vision Transformers

(blog.mdturp.ch)

237 points | by md2rp 13 days ago

9 comments

  • md2rp 13 days ago
    A Visual Guide to Vision Transformers. This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. Vision Transformers apply the transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks you through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to help you understand how these models work and how data flows through the model.
    • bArray 13 days ago
      Nice! A small piece of feedback: I would have the dimensions mentioned in the text also annotated on the diagram. It wasn't exactly clear how the input data was flattened, for example.
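
      For readers with the same question, here is a minimal sketch of how an image is typically cut into patches and flattened in a ViT. The 224x224x3 input and 16x16 patch size are assumptions taken from the common ViT-Base configuration, not from the guide itself, and `patchify` is a hypothetical helper name.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (H, W, C) -> (rows, P, cols, P, C) -> (rows, cols, P, P, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each P x P x C patch into one vector: one row per patch.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed sizes: 224x224 RGB image, 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768): 14*14 patches, each 16*16*3 values
```

      Each of the 196 rows is one patch; a learned linear layer then maps every 768-value row to the model's embedding size.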
      • byteknight 13 days ago
        I would also add, as a 100% math idiot, that linear transformations, and how the model performs them, are not explained.

        It's entirely plausible this is intended for someone more "mathematical" than myself, but I appreciate the work regardless.
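
        As a rough illustration of what "linear transformation" means here: each input vector is multiplied by a weight matrix and a bias is added, i.e. y = xW + b. The sketch below uses tiny hand-picked numbers that are purely illustrative assumptions; in a real model the weights and bias are learned, and `linear` is a hypothetical helper name.

```python
import numpy as np

def linear(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Apply y = x @ W + b to every row of x."""
    return x @ weight + bias

# Two input vectors ("tokens"), each with 3 features.
x = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
# Weight matrix mapping 3 input features to 2 output features.
weight = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
bias = np.array([0.5, -0.5])

y = linear(x, weight, bias)
print(y)
# [[4.5 4.5]
#  [0.5 0.5]]
```

        Every output number is a weighted sum of the input numbers plus a constant, which is all the "linear" layers in the model do.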

        • md2rp 13 days ago
          Thanks for the feedback! I left it out intentionally, but it's probably worth thinking about doing a more fundamental guide!
      • md2rp 13 days ago
        Thanks for the feedback! Will add it in the revision!
  • challenger-derp 13 days ago
    Very nice. I wish I could do this sort of scroll story in my digital notes. Is this done with a javascript library?
    • md2rp 13 days ago
      Yes, this was done with a combination of GSAP ScrollTrigger (https://gsap.com/docs/v3/Plugins/ScrollTrigger/) and D3 (https://d3js.org/).
      • TuringTest 13 days ago
        That kind of scroll is OK-ish for a background parallax effect, or maybe some pretty fade-in/out effects while elements scroll into view (without changing their relative position in the page).

        When such effects interfere with the main functionality of the page, namely reading the content, they break accessibility, distract from understanding the difficult topic, make the content brittle against changes in the platform (different browsers or future standard updates), and, as others pointed out, make it difficult or impossible to use alternative presentations.

        With most comments addressing the presentation and not the content, I think it's clear that it detracts from the experience more than it helps.

  • causal 13 days ago
    I like this, but think there is some crucial motivation missing in steps 10.1-10.3 regarding what query/key weights are and why they're needed.
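
    One way to sketch the motivation, assuming standard single-head scaled dot-product attention (the guide's exact steps may differ): the query and key weights are learned projections that let each token ask "what am I looking for?" (query) and advertise "what do I contain?" (key). Without them, the scores would just be the raw dot products of the embeddings, which are symmetric and cannot be tuned by training. The sizes and random weights below are illustrative assumptions, and `self_attention` is a hypothetical helper name.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers for matching
    v = x @ w_v  # values: the content that actually gets mixed
    # Score of token i attending to token j, scaled by sqrt(key dimension).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Assumed toy sizes: 4 tokens, embedding size 8, head size 8.
rng = np.random.default_rng(0)
num_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(num_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

    Because `w_q` and `w_k` are separate matrices, token i can attend strongly to token j without j attending strongly back to i.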
  • lyapunova 13 days ago
    To be honest, I actually really like the visual delivery here. It's especially helpful for understanding what's going on with computer vision problems. Please make more!
  • bilsbie 12 days ago
    This is great. Is it better to combine this with a language model so it can also apply knowledge about relationships between items?
  • SpaceManNabs 13 days ago
    Lucas Beyer has a lot of references and material as well that I recommend.
  • nothrowaways 12 days ago
    Neat
  • Jacob__ 12 days ago
    I like it a lot.
  • tantalor 13 days ago
    Stop scrollytelling! It's awful, nobody should do this.
    • 4chandaily 13 days ago
      Agreed. My scroll wheel should scroll the page, not advance slides or split birds or whatever else. If you need to do this kind of information display, use buttons or a UI widget to control it. Don't hijack the HID devices I use for accessibly operating my computer.

      This goes for Scroll Wheels, Scrollbars, the Back Button, the Right Click Button, or any other standard input paradigm. (please) Don't fuck with these! Some of us make use of accessibility features, and messing with our interfaces makes these break or behave in unexpected ways.

    • elicash 13 days ago
      I'd be annoyed if my bank did this, or airlines, or anything where I just need to get a task done.

      For personal websites, I actually think individuality and fun and creativity are good.

    • observationist 13 days ago
      It's aggressively inaccessible. I don't know if it's an "I'm a web designer, I know better" thing or what.

      Web designers: Don't let form interfere with function. The function of this page is to communicate information about transformers. The form effectively prevents that from happening. Don't do it. No, bad, stop.

    • layer8 13 days ago
      This. You can’t use reader mode, you can’t save the page as a PDF, you can’t use PageUp/PageDown because you’ll miss some in-between state, and the scroll position where a certain image is shown may not be the preferred one for reading the corresponding text. And the JS will invariably break sooner or later.