Generative Modeling with Sparse Transformers

(openai.com)

70 points | by stablemap 1823 days ago

4 comments

  • yorwba 1823 days ago
    Using two attention layers with √N inputs to cover a context of size N = √N × √N is somewhat intuitively understandable for image data, since the decomposition corresponds to rows and columns.

    But it's quite surprising that this also works for text data, especially that the fixed pattern performs better than the strided one, despite there not being anything analogous to image boundaries in the data.

    It'd also be interesting to see what happens for other decompositions, such as 3 layers of ∛N or a logarithmic stack of dilated convolutions.
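    For concreteness, here is a rough sketch of the two sparse patterns under discussion, built as boolean attention masks over a context of N = 16 = 4 × 4. This is my own construction for illustration, not the paper's released kernels (which fuse the pattern into the attention computation rather than materializing a mask):

```python
import numpy as np

def strided_mask(n, stride):
    # Position i attends to the previous `stride` positions (the "row")
    # and to positions a multiple of `stride` behind it (the "column").
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            row = (i - j) < stride          # local row attention
            col = (i - j) % stride == 0     # strided column attention
            m[i, j] = row or col
    return m

def fixed_mask(n, stride):
    # Position i attends within its own block of `stride` positions,
    # plus the last position of every block (the "summary" columns).
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            same_block = (j // stride) == (i // stride)
            summary = (j % stride) == stride - 1
            m[i, j] = same_block or summary
    return m

n, stride = 16, 4  # context N = 16, stride = sqrt(N) = 4
print(strided_mask(n, stride).sum(), fixed_mask(n, stride).sum())
```

    Each row of either mask has O(√N) entries set, which is where the compute savings come from.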

    • AdamDKing 1823 days ago
      It seems that using "fixed attention" for text would encourage the network to periodically summarize the context so far and put it in that fixed column for the rows below to access.

      Maybe the reason "strided attention" didn't work as well is that it would require the network to put this context summary in every column lest the rows below be unable to access it. That would waste features since the summary wouldn't vary much over time but would still be stored in full at each step.

      If this is true, the approach they used for images might actually be inefficient in a similar way.

  • joe_the_user 1823 days ago
    So "Transformers" are part of the attention-based systems, which are an approach for modeling input-output relationships that is an alternative to Recurrent Neural Networks. These are instead based on Convolutional Neural Networks.

    The innovation here is that the transformer is compressed, allowing the system to deal with longer sequences.

    • AdamDKing 1823 days ago
      You seem to be saying this work is based on convolutional neural networks. That's incorrect. It uses the same attention mechanisms from natural language processing which involve no convolution operations.

      Convolutions have a different set of weights for each position offset (with a fixed window size), and reuse those weights across the entire input space.
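      To make that concrete, a 1-D convolution in this sense is just a fixed-size weight vector slid across the input: one weight per position offset, reused at every position, so the parameter count is set by the window size and never grows with the input length. (Illustrative sketch, not code from the paper.)

```python
import numpy as np

def conv1d(x, w):
    # w holds one weight per offset in a fixed window; the same
    # weights are reused at every position of the input.
    window = len(w)
    n = len(x) - window + 1
    return np.array([x[i:i + window] @ w for i in range(n)])

x = np.arange(8.0)
w = np.array([0.25, 0.5, 0.25])  # 3 parameters, regardless of len(x)
print(conv1d(x, w))  # -> [1. 2. 3. 4. 5. 6.]
```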

      Transformer-based networks like this work compute attention functions between the current position's encoding and every previous position, then use the outputs to compute a weighted sum of the encodings at those positions. Hence they can look at an arbitrarily large window and the number of parameters they have is independent of the size of that window.
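      As an illustration of that description, here is a plain dense causal attention sketch (single head, no sparsity, so not OpenAI's actual implementation): every position scores all previous positions and takes a weighted sum of their value encodings, and the parameter matrices depend only on the encoding dimension d, not the sequence length n.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    # x: (n, d) sequence of position encodings.
    # wq, wk, wv: (d, d) projections; parameter count is independent of n.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so position i sees only j <= i.
    causal = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Softmax over the visible positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # weighted sum of value encodings

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.normal(size=(n, d))
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = causal_attention(x, *w)
print(out.shape)  # -> (8, 16)
```

      The sparse-transformer trick is to restrict which j each i may score, rather than changing this basic computation.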

    • skdotdan 1823 days ago
      Are Transformers based on convolutions?
  • skdotdan 1823 days ago
    That's really impressive!

    However, I'm a bit disappointed with the code release. I was expecting the full source code and setup.

    • sgillen 1823 days ago
      It seems OpenAI is getting less and less open. I would like to see the source too, although I think maybe we've been a bit spoiled in the past, expecting them to share all their source with us.

      They have a lot of incentives not to. Keeping the code under wraps allows them to maintain an edge over other researchers and companies in the space, which helps them secure more funding, publish more papers, etc.

      It's also not really fair to them if some other research group uses code from OpenAI to achieve results and then doesn't share their own code or modifications.

  • tezka 1823 days ago
    What is the NLL for 32x32 ImageNet? That's a common benchmark and it's strange that it's missing from this paper. Also, will you release CIFAR-10 samples? Curious what they look like at 2.80.