> The central intuition in using T5 is that extremely large language models, by virtue of their sheer size alone, may still learn useful representations despite the fact that they are not explicitly trained with any text/image task in mind. [...] Therefore, the central question being addressed by this choice is whether or not a massive language model trained on a massive dataset independent of the task of image generation is a worthwhile trade-off for a non-specialized text encoder. The Imagen authors bet on the side of the large language model, and it is a bet that seems to pay off well.
The way out of this dilemma is to fine-tune T5 on the caption dataset instead of keeping it frozen. The paper notes that they don't do fine-tuning, but does not provide any ablation or other justification. I wonder if it would help or not.
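For concreteness, here's a toy sketch (in PyTorch; the `Linear` layers are illustrative stand-ins, not the real T5 encoder or diffusion model) of what "frozen" means mechanically in training code, and what fine-tuning would change:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the T5 text encoder. The real one would come
# from e.g. transformers' T5EncoderModel; this toy module only illustrates
# the freezing mechanics.
text_encoder = nn.Linear(8, 4)

# Imagen keeps the encoder frozen: stop gradients from flowing into it
# and leave its weights out of the optimizer.
for p in text_encoder.parameters():
    p.requires_grad = False

diffusion_head = nn.Linear(4, 4)  # stand-in for the part that IS trained
opt = torch.optim.Adam(diffusion_head.parameters(), lr=1e-3)

tokens = torch.randn(2, 8)        # fake "token embeddings"
with torch.no_grad():             # frozen encoder: no graph needed
    cond = text_encoder(tokens)

loss = diffusion_head(cond).pow(2).mean()
loss.backward()                   # gradients reach diffusion_head only
opt.step()

# Fine-tuning instead would mean re-enabling requires_grad and adding
# text_encoder.parameters() to the optimizer (typically at a lower lr).
```

The practical upside of freezing is that the caption embeddings can even be precomputed once, which is presumably part of why the authors kept it.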
> is trained on hundreds of millions of images and their associated captions
So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?
Or is something like that only available to the rich with access to lawyers on tap?
I mean I can imagine if a nobody wanted to do something like this, they'd get bankrupted by having to deal with all the photographers / artists spotting a tiny sliver of their art in the image produced by the model.
Furthermore, would something like this work with music? For instance, train the model on all Spotify songs and then generate songs based on "Get me a Bach symphony played on sticks with someone rapping like Dr Dre with a lisp."
Or does the music industry have enough money to bully anyone into not doing that?
Presumably Google's terms of service or fair use laws. The real restriction is that, even if you had the dataset, training costs tens of thousands of dollars. Only corporations can really afford to train these things.
Regarding music - audio generation with Diffusion Models (the main component of Imagen and DALL-E 2) has been done, but I'm not sure about music specifically. We will definitely reach the point, relatively soon, where most pop beats, for example, can be made by AI.
All a producer has to do is generate 100 beats and select the one s/he likes, then potentially interpolate between two of them or fine-tune one.
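Interpolating between two generated samples is typically done in the model's latent or noise space, where spherical interpolation tends to behave better than a straight line through the origin. A minimal pure-Python sketch (the short vectors here are illustrative stand-ins for real latents):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two latent vectors,
    a common way to blend two generative-model samples smoothly."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to guard against floating-point drift outside [-1, 1].
    omega = math.acos(max(-1.0, min(1.0, dot / (norm0 * norm1))))
    so = math.sin(omega)
    if so < 1e-8:  # vectors nearly parallel: fall back to plain lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    return [(math.sin((1 - t) * omega) / so) * a +
            (math.sin(t * omega) / so) * b
            for a, b in zip(v0, v1)]
```

Sweeping `t` from 0 to 1 and decoding each interpolated latent is the standard way to "morph" between two samples.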
This is a real issue, but it's solvable with work.
It's claimed that ML models' output doesn't infringe copyright because training on the inputs is fair use, but that's hard to believe; a large model can easily memorize one of its inputs and output it again verbatim. This is easier to see with text, where GPT and Copilot both do it, but image models can do it too.
> So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?
Build the model out of Creative Commons images only. There's a lot of 'em and it's good enough. You may need to exclude CC-BY since they currently can't follow the attribution requirement.
> Or is something like that only available to the rich with access to lawyers on tap?
More likely companies willing to license a stock photography database.
I have shown imagen (and dalle2) to a number of people now (non-tech, just everyday friends, family, co-workers) and I have been pretty stunned by the response I get from most people:
"Meh, that's kinda cool? I guess?" or "What am I looking at?"..."Ok? So a computer made it? That seems neat"
To me, I am still trying to get my jaw off the floor from two months ago. But the responses have been so muted and shoulder-shrugging that I think either I am missing something or they are. Even really drilling in, practically shaking them ("DO YOU NOT UNDERSTAND THAT THIS IS AN ORIGINAL IMAGE CONSTRUCTED ENTIRELY BY AN AI?!?!"), people just seem to see it as a party trick at best.
I think I can explain this: for most people, the whole world is basically magic anyway. They don't understand any of the details of how digital tech works, so they have no framework for judging which things are impressive and which are not. They just know that computers can do a great many things they know nothing about. “Oh, I can bank online? Ok.” “Oh, I can have the computer write my book report for me? Ok.” “Oh, this McDonald's is fully staffed by sentient robots? Ok.”
A pretty common pattern I've witnessed among non-technical people (even people who are tech-savvy but have no CS background do this) is assuming that a feature which is in reality quite difficult to implement won't take much effort, and vice versa.
A non-technical person in 2014 (when the above was originally published) would likely have the same conception of the difficulty of recognizing a bird from an image as they would in 2022, even though the task itself has gone from near-insurmountable to off-the-shelf-library in eight years.
Even as Imagen and Dall-E 2 amaze us today, these feats will likely be commonplace in a few years. The non-technical may have only a vague sense that their new TikTok filter is doing something that was impossible only a few years prior.
Exactly and I was thinking of that XKCD. Very much case in point, I have the Merlin Bird ID app which can determine species from ridiculously blurry photos and can also identify hundreds of birds from their calls alone in noisy environments. In 2014 I would have sworn this would be impossible.
The tooltip you get when you hover your cursor over the comic:
"In the 60s, Marvin Minsky assigned a couple of undergrads to spend the summer programming a computer to use a camera to identify objects in a scene. He figured they'd have the problem solved by the end of the summer. Half a century later, we're still working on it."
I'm working with his son Henry Minsky and other great people at Leela AI on that same old problem, applying hybrid symbolic-connectionist constructivist AI by combining neat neural networks with scruffy symbolic logic to understand video, and it's mind boggling what is possible now:
>Our AI system, Leela, is motivated by intrinsic curiosity. Leela creates theories about cause and effect in her world, and then conducts experiments to test these theories. Leela can connect all her knowledge and use this network to make plans, reason about goals, and communicate using grounded natural language.
>Leela has at her core a hybrid symbolic-connectionist network. This means that she uses a dynamic combination of artificial neural networks and symbol networks to learn. Hybrid networks open the door to AI agents that can build their own abstractions on the fly, while still taking full advantage of the power of deep learning.
>Neats and scruffies: Neat and scruffy are two contrasting approaches to artificial intelligence (AI) research. The distinction was made in the 70s and was a subject of discussion until the middle 80s. In the 1990s and 21st century AI research adopted "neat" approaches almost exclusively and these have proven to be the most successful.
>"Neats" use algorithms based on formal paradigms such as logic, mathematical optimization or neural networks. Neat researchers and analysts have expressed the hope that a single formal paradigm can be extended and improved to achieve general intelligence and superintelligence.
>"Scruffies" use any number of different algorithms and methods to achieve intelligent behavior. Scruffy programs may require large amounts of hand coding or knowledge engineering. Scruffies have argued that general intelligence can only be implemented by solving a large number of essentially unrelated problems, and that there is no magic bullet that will allow programs to develop general intelligence autonomously.
>The neat approach is similar to physics, in that it uses simple mathematical models as its foundation. The scruffy approach is more like biology, where much of the work involves studying and categorizing diverse phenomena.
We're looking for talented engineers and designers to help, including neats and scruffies working together!
That is exactly what Will Wright (the creator of SimCity and The Sims, and Robot Wars / Battle Bots contestant) was getting at when we made these one-minute robot reality videos about "Empathy" and "Servitude".
His idea was to probe just how much random people on the street (or in a diner) would believe about autonomous intelligent robots operating in the real world.
Of course we were actually hiding behind the scenes tele-operating the robots through hidden cameras and a wireless web interface, listening to what the people said and making the robots respond with a voice synthesizer and sound effects, clicking on pre-written phrases and typing ad-libbed responses.
Empathy (a broken down robot begs for help from passers by on the streets of Oakland):
Here's a more recent video of Will throwing a tantrum about the failure of SimSandwich, destroying his old creations because they're pixely and poorly rendered, then complaining about how those jerks at EA hate him:
I think if you've been paying attention to the space, this generation of image diffusion is shocking in how quickly it has improved on what we had a year ago.
But if you've never considered that a computer could produce an original image, this is just one more new thing computers can do. OTOH, I think there's also a lack of imagination about how useful this is: so far the output has been kind of random, so it seems a little gimmicky. Already "Parti" has gotten much closer to allowing a user to describe exactly what they want in the image, and as people start to see the use cases for themselves, it will hit them that they no longer have to hire someone; they can just type a request into a box.
You can just type a request in the box if you don't particularly care what the result looks like and also don't care that some of the features might be copyrighted (since large models are quite capable of memorizing their training data.)
Asking for two different images in a series that have similar "art styles" is going to be enough work to still need a specialist aka an artist; it'll be most useful in cases you never would've bothered finding one before.
> Asking for two different images in a series that have similar "art styles" is going to be enough work to still need a specialist aka an artist
Running a separate style transfer network on the generated images is currently possible, although it won't achieve the best possible results.
I wouldn't be surprised in the near future to see generation models that can take a text prompt and an image to mimic the style of, which could let it take style into account when generating the image rather than at just the surface level.
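For reference, the "separate style transfer network" route usually means Gatys-style neural style transfer, whose style representation is the Gram matrix of CNN feature maps. A minimal sketch of that representation (the feature shapes here are illustrative, not tied to any particular network):

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (channels, height*width) feature map: the
    channel-correlation statistics used as the 'style' representation
    in Gatys-style neural style transfer."""
    c, hw = feats.shape
    return feats @ feats.T / (c * hw)  # normalized correlations

def style_loss(feats_a, feats_b):
    """Mean squared difference between the two Gram matrices; style
    transfer minimizes this while also matching content features."""
    return float(np.mean((gram_matrix(feats_a) - gram_matrix(feats_b)) ** 2))
```

The point the parent makes is that this statistic lives at the texture level, which is why post-hoc style transfer tends to be more superficial than conditioning the generator on style directly.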
Since “style” isn’t at the surface level, you can’t take it into account with a single input. It means whatever the client wants it to mean. Getting an AI to do what you want there is still going to be a long conversation they won’t want to do.
It might be (and likely will be) easier to use the AI as a storyboard generator and have your in-house artists redraw it.
It's because people have been able to do this for years now, and so have you. You can try right now: go to Google, type "cat on a bicycle", and hit image search. TADA, the computer makes images of cats on bicycles appear! Where's the magic in that?
>THIS IS AN ORIGINAL IMAGE
Yeah, about that. Ask it to draw you a fast inverse square root.
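For anyone missing the reference: the "fast inverse square root" is the famous bit-twiddling routine from the Quake III Arena source, one of the best-known examples of code that Copilot has been shown to reproduce nearly verbatim from its training data. A sketch of the trick (rewritten with `memcpy` to avoid the original's undefined pointer cast; the magic constant is the well-known 0x5f3759df):

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(number) using the Quake III bit hack:
 * reinterpret the float's bits as an integer, shift and subtract
 * from a magic constant, then refine with one Newton-Raphson step. */
float Q_rsqrt(float number) {
    float x2 = number * 0.5f;
    float y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);        /* bit-level reinterpretation      */
    i = 0x5f3759df - (i >> 1);       /* the famous magic constant       */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - x2 * y * y);     /* one Newton-Raphson iteration    */
    return y;
}
```

(The joke being: if you ask an image model to "draw" it, you're really asking whether the model has memorized the Wikipedia rendering of someone else's copyrighted code.)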
I've made perhaps overly absolutist statements like "don't you see! this kills artists' jobs!" and it was shrugged off as if I were insane. I probably could've phrased it differently, but to me this is game-changing in several fields. Granted, it will open up a new field of "generative artists" but, having played with these things, this is a pretty trivial job, and their training nets are only going to get better.
I’ve had a lot of fun playing with Disco Diffusion prompts, but I agree that the people excited about “a generation of prompt artists” are a bit misguided. Soon an AI will emerge that can come up with “better” prompts than you, and the “art” of creating prompts will have a lower skill ceiling.
Like a neural network just for making prompts that result in aesthetically pleasing Imagen images? And then maybe we can come up with a neural net that can decide which pictures are good and which aren't. Then we can just have robots making art for the sake of consumption solely by robots.
I can ask davinci-002 "Vivid description of a painting with dancers:" and get:
The painting is of two dancers in a passionate embrace, their bodies entwined as they move together in a sensual dance. The woman's dress is flowing and reveals her curves, while the man's shirt is open, revealing his muscular chest. They are surrounded by a crowd of people who are watching them with looks of admiration and desire. The painting is full of color and movement, and the dancers seem to be in a world of their own, lost in their passion for each other.
dall-e mini is sadly not quite up to the challenge of rendering these, but the approach gives the image generation a lot more detail to work from. Some other examples:
"The painting is of two dancers in the middle of a dance. They are both wearing white, and their hair is flowing around them as they move. The background is a blur of color, and the light is shining on the dancers, making them look like they are in the spotlight."
"The painting is full of energy and movement, with the dancers leaping and spinning around the stage. They are all wearing brightly coloured costumes, which stand out against the dark background. The light from the stage spotlight is shining on them, making them look even more vibrant. The whole scene is full of life and excitement."
To me, it paves the way for creative prototyping. I don't see this as a zero-sum game between artists and AI. Instead, I could see artists using this for some serious time saving, and leveraging that extra time and energy for creating better results.
You don't need good-looking pictures for propaganda. Old people (the main targets) believe literally anything they see on Facebook, especially if it confirms their priors aka fits their worldview, and prefer it to look bad because that's more authentic. For anyone else, the point is to make them disbelieve everything, not to believe you specifically.
Over a decade ago, Will Wright (of SimCity fame) faked conversational AGI robots in the streets and restaurants of Oakland. It consistently took people 2.4 seconds to go from “Oh look. The robots have arrived.” to “And, I’ll have fries with that.”
Hollywood and the media have taught the public that tech is literally magic and can do literally anything. “Anything” is expected and pedestrian.
I often think a similar thing about aliens. That is, instead of the panicking and hysteria or whatever that fiction imagines might accompany the discovery of aliens I fully expect that people will mostly go "Oh, neat. Aliens." And go on with their lives.
Well, I'm still in awe that I have a bunch of walls around me and can cover my body with clothes, or that I'm still alive after all this time, and that I can even rest most of the day and not spend body energy running after or from animals. Amazing stuff.
I haven't gotten such dismissive responses, but probably only because those I'm inclined to share such things with are the exact kinds of people who'd be blown away by them, and immediately grasp the significance.
Treating Imagen as just an "AI art generator" is extremely short sighted. Sure, you could just try to sell the outputs directly. But the real value is using it to supplement larger works. No need for a stock photo subscription service if you can just generate them automatically. Don't need artists to create textures for your simple games. I can spin up a merch shop powered entirely by AI art and nobody would know. The marginal cost of creation is approaching zero.
And perhaps even more interestingly these things not only exist but there is competition in this space! Essentially unregulated competition as well (and likely for the next 10 years). The cost will be driven into the ground.
The apocryphal Henry Ford quote about the average person wanting better horses comes to mind. People off the street have no concept of the impact this tech and the methods behind it will have. Sure, no one is going to be printing these and hanging them in museums. Very few artists support themselves that way, though. The people diffusion models are coming for are the graphic designers, the concept artists, the marketers, and everyone else with a copy of Photoshop and a Getty subscription. GPT-3 is amazing, but it's also not good enough to be useful. Imagen is industry-destroying.
Although I agree that a somewhat less extreme version of that will happen over the course of this decade, barring a legal decision prohibiting the use of those models, it won't translate to comparable revenues. The companies providing those services will struggle to make even 10% of the displaced workers' salaries in revenue. In fact, this will probably be a GDP-destroying (though not value-destroying) application of technology.
I am willing to bet that the revenue from AI-generated "art" will be smaller than the revenue from human-generated art in 5 years (or even 10 years) despite the former probably being at least 2 orders of magnitude higher in volume.
This is basic supply and demand + acknowledging the fact that humans don't care about AI "achievements".
AI achievements will be indistinguishable from human achievements. Humans will try to pass off AI achievements as their own. The line will become so blurred that it will be impossible to tell the difference.
In general, it's not possible for machines to replace labor - this is the Luddite fallacy. If the machines do exactly what you ask them to do this becomes even more true, because labor has the comparative advantage that they'll do things you don't know to ask for.
It is possible for the labor to quit and find something better to do, as happened to elevator operators, but that's a good thing.
In the case of chess, AIs don't want money and Magnus does, so they're not going to help you find ways to get more of it.
He would only fall from the top if he were not using said method while his competitors were.
This is of course missing the point. If chess enthusiasts knew that grandmasters were using AI to aid them when playing in tournaments, the interest (and the revenues) would simply plummet.
No - something that's been causing a lot of confusion in AI art is that people stand up quick implementations that roughly match the general description in the paper, but they're not really investing in training them. Then people see "imagen-pytorch" on GitHub and get confused, thinking either that it's Imagen itself or that it's a suitable replica of it.
There are like 3 projects named DALL-E, and then the 2 real DALL-Es... frustrating.
People are really thirsty to play with this tech, you can't blame them. Just search for dataset creators on Hugging Face. I'd link directly to several of them running but it would just overwhelm the creators. If you want to be in early you'll find them. The beautiful thing is open source is going to make this stuff available for everyone and in very short timeframe. It's crazy how fast it moves.
Eh, it exists, and it's an inevitability that it will eventually be used in terrible ways. OpenAI and Google people just want the CV boost for having created it, but want to pretend it's not their fault when it's used to do a racism or a sexism.
I mean DALL-E 2 was the first time my jaw really hit the floor, although in fairness GPT-3 probably should've done that, but it's easier to do with images.
And then for this to drop just a month later? Insane. It makes you wonder if they're actually releasing cutting-edge work, or if Google decided to write this paper just because of the publication of DALL-E 2. Maybe they've had this model in the bag for a year.
Unfortunately it seems like it's greater than 0...
If we ignore the procedurally generated NFTs created by mixing and matching various assets and go with ones where AI is the selling point, we're left with a few notable examples: Sophia, a robot with some low-level AI, sold a single piece for 689k USD [^1]. Botto, a VQGAN-based algorithm, sold a single piece for 430k USD and has sold multiple other pieces for tens to hundreds of thousands of dollars. Slightly more modest are projects like Metascapes [^3] and Eponym [^4], which produced some really tedious pieces that managed to sell for 3.5k USD and 10k USD respectively. That said, the Eponym piece seems to be some sort of self-promotion, so maybe the actual prices for these collections are somewhere in the fraction-of-an-ETH range, if they can be sold at all.
Honestly, only the Botto piece is remotely interesting to look at, and even then I feel as if the blurred, "dreamy" aesthetic that shows up in so many different AI painting approaches (style transfer, VQGANs, DALL-E, maybe others I'm not aware of) has worn thin. I think it was more interesting back when we could pretend that these were the electric sheep at the fringes of some deep-sleeping latent intelligence, but now they just feel kind of arbitrary and lacking in deliberation. I absolutely love the field and think these researchers have done tremendous work, but I feel as though all the lay news attention is on the art, and not on the algorithm that generated it. The fascinating thing is that we have a machine that can produce something novel from words or basic ideas, and that the output's content retains those ideas; it's not so much that the art itself has compositional or stylistic merit.