Build Your Own Imagen Text-to-Image Model

(assemblyai.com)

111 points | by SleekEagle 589 days ago

5 comments

  • MuffinFlavored 589 days ago
    Maybe put a screenshot of a few examples of what this homegrown text-to-image model can produce at the end of the article?
    • SleekEagle 589 days ago
      We'll be training it over the coming weeks and releasing a checkpoint! Right now it is just the model source code
  • neodypsis 589 days ago
    Cool write-up. I'm currently interested in learning about Diffusion Models and found your other article [0] to be a nice introduction to the topic.

    [0] https://www.assemblyai.com/blog/diffusion-models-for-machine...

    • SleekEagle 589 days ago
      That's great! I'm glad you're enjoying them :)
  • isoprophlex 589 days ago
    Very lucid explanation of the internal workings of this Imagen-like model. Thanks, I haven't seen things explained this clearly before!
    • SleekEagle 589 days ago
      Thanks so much! I learned a ton with this project so it's great to hear that I communicated it well :)
  • abriosi 589 days ago
    Great post.

    I'm eager to see what the next 5 years will bring us

    • SleekEagle 589 days ago
      Appreciate it! And I know ... I feel like we're really hitting a watershed moment with ML/DL. It looks like artists on DeviantArt have already objected to AI-generated art being allowed on the website or suggested a mandatory watermark at the very least.

      A very interesting time we're living in

      • O__________O 589 days ago
        Do you have a source for the claim that artists on DeviantArt (in any significant number) are making demands related to ML-generated text-to-image art?

        (Few quick Google searches turned up nothing.)

        • SleekEagle 589 days ago
          Maybe I'm wrong that it was DeviantArt, but I read something about this. Sorry I don't have a source, I should've checked before mentioning it. The closest thing I could find were rules only for specific groups!

          https://www.deviantart.com/rtnightmare/journal/New-Group-Rul...

          • O__________O 589 days ago
            Thanks, it appears one of the nine admins posted a message to a group that has fewer than two hundred members; to me, that's neither significant nor representative of DeviantArt. It also appears they got pushback from the group.

            While I didn’t review much of the “art” from the group, looked like clip art memes with text; if so, little odd that admin would take issue with anything being made using text-to-image generated art.

      • Bluecobra 589 days ago
        Seems a bit daft to me. Who are they to say that AI-generated art isn't art? Couldn't you say the brush has been replaced by keystrokes? Someone still needs to type in inputs and decide what is good/bad. I can understand banning bots that are automatically generating/uploading stuff. Also I wonder how they can prove whether a human or an AI made something if the quality is good enough.
        • SleekEagle 589 days ago
          I think for a creative website like that, which is intended to showcase artists, it makes sense. Unfortunately I don't think there's much they can do about losing in the market to text-to-image models in the long run... the costs are essentially zero
        • rchaud 589 days ago
          Who's to say a monkey picking stocks by throwing darts at a wall can't be a portfolio manager?

          "AI Art" all looks the same to me. Just enough fuzziness in the style so you can't see the hard edges of what they copied, or rather indexed in memory as part of the training dataset, then created a small variation of that.

          That might be good enough for replacing Pexels, Unsplash or any of those stock photo sites that blogs pull from. But not much else.

          • axg11 589 days ago
            You're going to be proven wrong in weeks to months. Just a couple of years ago the consensus was that DALL-E/Imagen/Stable Diffusion quality image generation was impossible. Now it's very real and quality is improving every month.
            • ZetaZero 589 days ago
              Agreed. I liken this to computer chess, and later computer Go. It was long believed both were impossible.
            • SleekEagle 589 days ago
              This is the first time I've seen Stable Diffusion. Is it a new model or diffusion paradigm?
              • lucidrains 589 days ago
                It is from the Heidelberg giants, Patrick Esser and Robin Rombach

                They continued building off their latent diffusion direction (encode with vqgan-vae and then diffusion in latent space)

                All roads lead to Rome

                https://github.com/CompVis/stable-diffusion https://arxiv.org/abs/2112.10752

                • SleekEagle 589 days ago
                  Interesting, thanks for the link. It seems that CLIP encodings aren't as useful as frozen encoders from the textual domain, which is a little unintuitive imo. Can't keep up with all these advancements!
                  • lucidrains 589 days ago
                    Ikr! Hang on, this ride is about to get crazy :)
          • SleekEagle 589 days ago
            You don't think in a few years AI art will be indistinguishable from human-generated art? In 2012, self-driving cars were a funny joke. A decade later, here they are. I think AI is somehow chronically both overestimated and underestimated
            • rchaud 589 days ago
              Everything on Stock photo sites is human generated, free and effectively infinite. In other words, commodified to the point of having its market value be $0.

              I'm sure this can be monetized to generate convincing AI porn, but for non-porn uses, what will it replace? Deep fried memes?

              The only photos worth money are those of real people photographed at real moments in time. Nobody ever bought a Getty subscription for photorealistic clipart.

              • dougmwne 589 days ago
                Stock images are not free. Hiring an artist to create a concept or illustration based on your instructions is also not free. Creating art assets for games is not free. Copyright is a pretty big thing and these models currently seem to sidestep it wonderfully.

                Also I can generate a photo of Leonardo DiCaprio picking his nose with a French fry, so that has some value for me.

            • donkarma 589 days ago
              self driving cars are still a funny joke
              • autoexec 588 days ago
                Yeah, but the joke is on us because we're the ones who are being forced to share the road with them. I sure didn't sign up to be a part of the beta testing.
      • TulliusCicero 589 days ago
        How would you even enforce a mandatory watermark? How do you prove the work was AI-generated, at least for ones with no obvious flaws?
        • SleekEagle 589 days ago
          Exactly, it's just not feasible. Maybe they will train a discriminator to determine which ones are generated, but I don't think that would work very well. Also, DALL-E 2 images come with a watermark but I'm pretty sure there are already tools to remove that...
          • mlsu 589 days ago
            Imagine if they did manage to train a good discriminator!

            That would lead to even higher quality images.

            I support this effort.

            • stu2b50 589 days ago
              Creating a GAN feedback loop in real life
            • SleekEagle 589 days ago
              Haha great point! AI always wins
      • Baeocystin 589 days ago
        I remember the stink-eye cast by (some) artists who grew up with traditional media at how digital painting didn't count, and was somehow 'cheating'. As if the masters of the past wouldn't have loved having layers and control-z!

        Same situation now. I know working artists thrilled with the idea of being able to iterate in hours what previously would have taken weeks to test composition, form, and the like as they focus in on what they have in mind. I can't wait to see what they come up with.

        • acomjean 589 days ago
          I think digital media is OK for most artists. Almost all the commercial science visualization folks are using it (based on a web-mini conference panel), and a fair number of artists in the nonprofit I help out with use digital tools.

          I think this is different than machine generated art.

          I know a few of us who have dabbled in creating procedurally generated art have found it fun and useful for some things, but kinda soulless, and less satisfying. AI art gets around that by using huge training sets of human-generated stuff and mimicking. It's good and getting better, but it's not like you get exactly what you had envisioned (though you get what you asked for).

      • abriosi 589 days ago
        I understand that artists are scared. Personally I look at it as freedom. Having a tool like this in your artist's toolkit will augment performance by a stretch
        • SleekEagle 589 days ago
          And, of course, these models learn in part from the art generated by these artists. If we stop having humans create art, we're no longer generating data to train the next generation of models and so in some sense the "creativity" of these models would seem to necessarily be hindered.
          • Baeocystin 588 days ago
            Don't you think that the selection event that occurs when human minds choose exactly what to use and post for others to see will continue pushing things forward, even if what's driving the brush strokes changes a bit?
    • upupandup 589 days ago
      In 5 years I expect text-to-porn to start replacing xvideos

      - we will surely see sequential image synthesis by then

      - we will surely see matching motion audio synthesis

      - we will surely see single image to 3D reconstruction

      - we will surely see haptic feedback and VR progress

      - we will win.

      • Enginerrrd 589 days ago
        I suspect it will take more like 10 years at least to produce convincing video. The technology isn't too far off except that the compute requirements are pretty extreme without some clever work. Lots of clever stitching needs to be done too. You need models that can take a description of a scene and produce a storyboard-like series of low-res images. (And maybe vice versa.) Then, you need a model that can infer semantics and movement in logical ways between those panels to generate images to fill in the gaps. Then lots and lots of clever cleanup and resolution enhancement, both of individual frames and of the changes between neighboring frames, without introducing all kinds of weird, fuzzy, moving, dream-like artifacts.

        ...Then you've got to somehow add audio that ALSO understands semantics in the same way as the story board. Maybe something that can generate an audio clip to go with the storyboard. ....And then fill in the gaps based on the generated video. Making those match seems like a really hard, but not impossible problem. In the short term, a bunch of moaning at appropriate times to mouths moving and whatnot seems feasible though.

        Although, I expect fairly high quality text-to-image porn is likely only a few months to a year away.

        The technology is there, someone just needs to pay to train the model... and then the cost of compute is what, like 300 grand? A few hundred more should get you enough engineering to apply existing techniques. Say $1 million in costs for a product that seems like incentive enough to get a bunch of members to pay a monthly fee.

        • cercatrova 589 days ago
          At the rate AI image generation is going, I highly doubt it'll take another 10 years. Only 10 years ago did AlexNet come onto the scene and blow away image recognition contests.
        • JayStavis 589 days ago
          For the general case I'd probably agree. But at the same time if you scope it down a bit, it's already here: https://www.synthesia.io/
  • ttul 589 days ago
    Thank you for providing a git repo to go along with the exceptionally detailed commentary!
    • SleekEagle 589 days ago
      My pleasure! Thanks for the kind words :)