I'm probably missing something obvious here, maybe someone can explain the following to me.
- Their approach is a composition of 2 steps, what they call "stylization" and "smoothing".
- Top left of 2nd page they claim: "Both of the steps have closed-form solutions"
- Equation 5 is the closed-form solution for the "smoothing" step.
My question: Where's the closed-form solution for the stylization step that they're claiming?
Are they calling equation 3 a closed-form expression? If so, the title and the claim in the introduction are rather misleading, because computing equation 3 requires you to train autoencoders.
You don't train it for every image. In this sense, a neural network often is a "closed-form solution": it gives you an equation, admittedly a very convoluted one, which can be used to obtain its solution (admittedly usually an approximation) in a finite amount of time. The usual approach to this problem (according to the paper) is an iterative technique "to solve an optimization problem of matching the Gram matrices of deep features extracted from the content and style photos", whereas this one is simply two passes: stylization and smoothing.
Previous stylisation was slow because it needed SGD optimisation for each image to be stylised. This uses a NN trained once. Once you've trained a NN, it is precisely a closed-form solution, in the style of y = max(0, 3x + 4). However, they are normally a little longer to write down :P
Ah okay, right, this is the answer. Previous approaches [1] are deep generative models that you have to optimize for each input, whereas here you just run a forward evaluation on a model that you've trained beforehand.
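To make the distinction concrete, here's a toy sketch (plain Python, made-up numbers, deliberately much simpler than the paper's actual whitening-coloring transform): "stylization" reduced to matching the mean and standard deviation of some 1-D "features". The closed-form route is a single formula reusable for any input; the iterative route re-runs an optimization for every content/style pair.

```python
from statistics import mean, pstdev

content = [0.2, 0.4, 0.9, 1.3, 0.7]  # stand-in for content features
style = [2.0, 3.5, 5.0, 2.5, 4.0]    # stand-in for style features

mc, sc = mean(content), pstdev(content)
ms, ss = mean(style), pstdev(style)

# Closed form: y = a*x + b, with a and b given directly by the statistics.
a_cf = ss / sc
b_cf = ms - a_cf * mc

# Per-input iterative route: gradient descent on
# loss(a, b) = (a*mc + b - ms)^2 + (a*sc - ss)^2
a, b, lr = 1.0, 0.0, 0.05
for _ in range(2000):
    grad_a = 2 * (a * mc + b - ms) * mc + 2 * (a * sc - ss) * sc
    grad_b = 2 * (a * mc + b - ms)
    a -= lr * grad_a
    b -= lr * grad_b

stylized = [a_cf * x + b_cf for x in content]
print(a_cf, b_cf)  # one formula evaluation
print(a, b)        # same numbers, reached after 2000 update steps
```

Both routes land on the same transform; the paper's point is that its two steps admit the formula route, so no per-image optimization loop is needed.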
I would still argue that the term closed-form is misleading here, because:
- Even during training, at any given moment you can read off a "closed-form expression" for a neural network of this type, so closed-form in this broad sense really doesn't mean much. Furthermore, any result of any numerical computation ever is also a closed-form solution by this standard, on the grounds that it results from a computation that completed in a finite number of steps. So really, whenever you ask a grad student to run some numerical simulation, expect them to come back saying "Hey, I found a closed-form expression!"
- The reason the above is absurd is that these trained NNs aren't really solutions to the optimization problem, but approximations. So this really amounts to saying: I have a problem, I don't know how to solve it, but I can produce an infinite sequence of approximations. Now I'm going to truncate this sequence of approximations and call that a closed-form solution.
The analogy in high-school math would be an infinite sum that doesn't converge: instead of evaluating it, we just add up to some large N and call that a closed-form solution.
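The divergent-sum analogy is easy to check numerically. Below (my own illustration, not from the paper), the partial sums of the harmonic series track ln(N) + γ, so picking a big N just picks a point on an unbounded climb rather than a value the sum "equals":

```python
import math

def partial_harmonic(n):
    # Truncation of the divergent sum 1 + 1/2 + 1/3 + ...
    return sum(1.0 / k for k in range(1, n + 1))

gamma = 0.5772156649015329  # Euler-Mascheroni constant
for n in (10**2, 10**4, 10**6):
    # The "answer" keeps growing like ln(n) + gamma; it never converges.
    print(n, partial_harmonic(n), math.log(n) + gamma)
```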
Actually, I agree with you. Initially you seemed to object to the term "closed form"; this highlights the more pertinent point: these models are 100% closed form, but 0% "solution" in the formal sense.
Someone correct me if I'm wrong, but I believe this refers to the fact that it can be expressed in terms of certain simple mathematical operations like addition, subtraction, multiplication, powers, roots etc.—and as a consequence, the execution is very efficient. My understanding is that 'closed form' solution is essentially something that resembles a polynomial (again, accepting corrections!).
Closed form just means you can do it in a finite number of operations. So just "run X" rather than the previous versions of this kind of thing which are "repeat X until measure Y is lower than the limit I care about". (my basic understanding)
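A minimal illustration of that "run X" vs "repeat X" split, using √2 (my own toy example, nothing from the paper): the formula is a fixed finite recipe, while Newton's method loops until a tolerance is met.

```python
import math

# Closed form: a fixed, finite sequence of operations ("run X").
root_closed = math.sqrt(2)  # x^2 - 2 = 0 solved by formula

# Iterative: "repeat X until measure Y is below a limit" (Newton's method).
x, steps = 2.0, 0
while abs(x * x - 2) > 1e-12:
    x -= (x * x - 2) / (2 * x)  # Newton update for f(x) = x^2 - 2
    steps += 1

print(root_closed, x, steps)  # both ~1.41421356..., Newton takes a few steps
```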
I checked the Wikipedia article, and the sorts of operations involved do appear to be a part of the definition: https://en.wikipedia.org/wiki/Closed-form_expression —though it sounds like it's a somewhat loosely defined term.
I really don't think it's fair to call neural networks closed-form solutions. The term immediately makes me assume that it enables you to bypass the training stage altogether.
Notice that all of the examples illustrated in the paper contain similar scenes. The content image is a building, while the style image is also a building. Or an image of trees is styled using another image of trees.
But how well does it fare when you give it an image of a house and an image of something completely different, like a dog or a slipper?
The interesting question, then, is how far off can this be and still work? Is the limit "reasonable", or is there room for improvement of the algorithm?
Did you have to do anything extra to get it working? I've set things up according to the documentation (I think), but I get dimension size errors when running it.
Haha, yes, I had to rewrite their code a bit. All the .unsqueeze(1).expand_as(...) in photo_wct.py need to be replaced by just .expand_as(...) and the return value of __feature_wct needs to be wrapped in torch.autograd.Variable.
I'm going to submit a PR, but it took me a bit of experimentation to fix these errors, so the code is a bit messier than I'd like.
Ahh that looks like the error I was hitting, thanks. I might try replacing the bits as well, though I just upgraded pytorch from 0.1.12 to 0.3 and it became much slower (I killed it after 5-6 minutes of setup).
Those interested in this technology: I made two videos 18 months ago. No optical flow, and YouTube compression kills everything, but they're still decent if watched in 4K on a big screen :)
Consider artists. There's a tremendous potential in using technology like this in art, and preventing someone from selling their works will often put them off of using it at all.
What does the license of the product have to do with the output of the product? You can use GIMP and GCC commercially, for example, and libraries used with GCC often have runtime exemptions for their output.
Hmmmm. Does the licence of the tool affect the output from the tool? Photoshop is proprietary, but Adobe doesn't have to explicitly grant me rights to the work I create with it.
You only need a license for the copyright, though; in the worst case you waive your right to distribute your derivative code if it has been used for commercial applications (which would be a weird interpretation, but I can't find a precise explanation of what the 'non-commercial' license covers).
Unlike modern software developers, artists are used to the notion that tools developed by other people are worth some kind of compensation, even if found at some flea market.
I wonder whether this could be applied in a real-time scenario. Modern real-time renderers for games often have a tone-mapping step that lets artists color grade the final output. The paper cites an 11+ second runtime for 1K inputs, which is orders of magnitude off what it would need to be (a 60 fps budget is roughly 16 ms per frame), but perhaps a simpler version run on the GPU is feasible.
Nvidia is pretty big in the machine learning space in general, not game specific these days - GPUs are pretty general purpose highly multithreaded number crunchers and Nvidia's been making moves further in this direction with their own CUDA-based training tools, the DGX-1, the Jetson and other products.
(a) This problem has long been known as color/contrast transfer, and it was solved more than 10 years ago; (b) the results shown in this paper aren't objectively or subjectively better or more photo-realistic than the far simpler work of Kokaram et al.; and (c) I question whether this task even requires deep learning at all.
Only tangentially related, but has anyone ever tried to apply style-transfer on human faces for artificial aging or rejuvenation? Like for the movie industry or something?
The only machines with decent GPUs in them I have access to run Windows and Windows Subsystem for Linux doesn't allow GPU access. Other than dual-booting or running Linux in VirtualBox - is there any way I can try this?
None of the dependencies seemed to be Linux-specific at a quick glance. You might be able to install all that on Windows (not sure how pleasant an experience it'll be).
Virtualbox won't help you, because you can't give proper access to the GPU for the VM guest unless you set up PCI-e passthrough and dedicate your whole GPU to the VM guest (and use your integrated graphics for the host). Not sure if this is even possible if Windows is the host.
If you don't feel like setting up a Linux install on your box, you could try some of the GPU cloud services.
Also I am told the proprietary nvidia drivers have a software lock that prevents you from using GPU passthrough unless you buy certain more expensive models.
There is a workaround. A number of GeForce cards have the exact same chipset as a Quadro card, but with a resistor pulling down an external pin. That resistor can be changed to make the card identify as a Quadro.
This is just spoofing the PCI VID:PID numbers to the driver and relying on driver bugs(?) to function. You could do the same with a few lines of kernel hacks, far more easily than by soldering. It does not enable any features that are fused off in the hardware. This setup is not reliable.
Also, these posts are from 2008 and 2013 (10 and 5 years old, respectively). These hacks probably don't work any more.
Is it really all that hard to put up a demo site for these things? It would be a lot of fun to play with crossing pictures. I'm guessing it's because GPU access from the browser isn't good enough yet?
I'm not sure how fast their FastPhotoStyle approach is, but a TensorFlow implementation of the original neural style transfer can take upwards of 20 minutes to create the final stylized image. If someone had the pre-trained model, plus neural-net code in JS to read it, you could do it all client side, but it would still be very slow.
The tech has come a long long way since the original, even before this FastPhotoStyle project.
A few months ago, there was TensorFire [0] that was able to do it in the browser. Quick google also gives other results [1]. There's also many apps that can do it in seconds. Speed definitely isn't an issue anymore, but getting it to work in browser can be tricky.
Conda is basically an alternative to pip and virtualenv, used by the Anaconda python distribution that's really popular in the data science and machine learning community. The easiest way to get it is to install miniconda: https://conda.io/docs/user-guide/install/linux.html
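If it helps, a typical setup looks roughly like this (the installer filename is whatever conda.io currently ships, so treat it as illustrative; the last line is the command from the project's USAGE.md):

```shell
# Install Miniconda (grab the current Linux installer from conda.io)
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Then, in a fresh shell, install the dependencies listed in USAGE.md
conda install pytorch torchvision cuda90 -y -c pytorch
```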
Looks like it's a lot faster. They compare their approach to the Luan et al. approach, and for a 1024x512 image, they are about 30-60x faster. They also seem to be more accurate with better results.
[1] e.g. https://arxiv.org/pdf/1508.06576.pdf
> But how well does it fare when you give it an image of a house and an image of something completely different, like a dog or a slipper?
What is the correct answer to a question that's not well-formed?
E.g. I think most humans would say taking this content picture:
https://wallpapershome.com/images/pages/pic_hs/10150.jpg
(a red crab on a beach in front of the bright blue ocean, with a blue sky and white clouds)
and styling it with this picture:
https://c2.staticflickr.com/4/3499/3876547311_c2e32759d9_z.j...
is a pretty well-posed operation. How does that look using this algorithm? I guess the transfer of the wooden house amidst yellow fields with a reddening sky might lead to a wooden crab on a yellow field in front of a reddish-yellow ocean with red sky and clouds, or something.
> Did you have to do anything extra to get it working? I've set things up according to the documentation (I think), but I get dimension size errors when running it.
I was using the pytorch 0.1.12 installed with conda (following their USAGE.md) and it took ~30s total for the transfer.
For some reason it's taking me about 4-5 minutes for the transfer, but the code now runs and the rest of the runtime is only a few seconds.
https://www.youtube.com/watch?v=2YRVt80g2Ek
https://www.youtube.com/watch?v=i69cBYI6f-w
Seems fine to me. If you want to develop something commercial you'd roll your own anyway. Nothing else is restricted by this license.
The license of GCC doesn't affect the license of your binaries.
The license of python doesn't affect the license of your software.
etc.
It's really great that NVIDIA is releasing code for their deep learning research.
https://francois.pitie.net/colour/
> you can't give proper access to the GPU for the VM guest unless you set up PCI-e passthrough

This requires dedicating the whole GPU and the PCI-e slot to the virtual machine guest. For more flexible virtualization setups, you need the professional-quality cards.
http://www.eevblog.com/forum/chat/hacking-nvidia-cards-into-...
Apparently this can also be done from software
http://archive.techarp.com/showarticleefc1.html
[0] https://tenso.rs/demos/fast-neural-style/
[1] https://reiinakano.github.io/fast-style-transfer-deeplearnjs...
https://www.youtube.com/watch?v=I3l4XLZ59iw
Images and speech require different architectures (CNNs vs RNNs).
> conda install pytorch torchvision cuda90 -y -c pytorch
What is conda? How do I install it on Ubuntu 16.04?