1) The input dataset from Memegenerator is a bit weird. More importantly, it does not distinctly identify top and bottom texts (some captions use a capital letter to signify the start of the bottom text, but that convention isn't applied consistently). A good technique when encoding text for this kind of task is to use a control token (e.g. a newline) to mark the break. (The conclusion notes this problem: "One example would be to train on a dataset that includes the break point in the text between upper and lower for the image. These were chosen manually here and are important for the humor impact of the meme.")
2) The use of GloVe embeddings doesn't make as much sense here, even as a base. Generally, pretrained embeddings work best on text that follows real-world word usage, which memes do not. (In this case, it's better to let the network train the embeddings from scratch.)
3) A 512-cell LSTM might be too big for a word-level model of that size; since the text follows rules, a 256-cell Bidirectional might work better.
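To make point 1 concrete, here is a minimal sketch of the control-token idea. The function names, the `<break>` token string, and the choice to lowercase are all assumptions for illustration, not anything taken from the paper's code.

```python
from typing import List, Tuple

# Hypothetical control token marking the top/bottom break. A newline is the
# example suggested above; any symbol absent from the captions would work.
BREAK = "\n"


def encode_caption(top: str, bottom: str) -> str:
    """Join the two halves of a caption with the control token.

    Lowercasing is an assumption: it removes the unreliable
    "capital letter starts the bottom text" signal.
    """
    return f"{top.strip().lower()}{BREAK}{bottom.strip().lower()}"


def decode_caption(text: str) -> Tuple[str, str]:
    """Split text back into (top, bottom) at the first control token."""
    top, _, bottom = text.partition(BREAK)
    return top.strip(), bottom.strip()


def tokenize(text: str) -> List[str]:
    """Word-level tokens, with the break emitted as its own token."""
    tokens: List[str] = []
    for i, half in enumerate(text.split(BREAK)):
        if i > 0:
            tokens.append("<break>")
        tokens.extend(half.split())
    return tokens


encoded = encode_caption("One does not simply", "Train a meme model")
print(tokenize(encoded))
```

With the break as an ordinary token in the vocabulary, the model can learn where the split goes instead of someone choosing it by hand afterwards.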
Question: This is one of the pieces of neural nets that has always seemed like completely opaque voodoo to me. What estimation are you doing to suggest a 512-cell LSTM could stand to be swapped out for a 256-cell bidirectional one? What constraints are you optimizing for?
Not a constraint per se, but having too big of a neural network (or any statistical model) can cause it to overfit and generalize poorly; of course, generalizing better is a good objective for text generation.
You can use 512-cell LSTMs if you have a lot of text, though.
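One rough way to put a number on "too big" is to count trainable weights. The sketch below uses the standard LSTM layout (four gates, each with an input matrix, a recurrent matrix, and one bias vector). The 300-wide input is an assumption for illustration, not a figure from the paper, and some frameworks keep two bias vectors per gate, which would shift the totals slightly.

```python
def lstm_params(input_dim: int, units: int, bidirectional: bool = False) -> int:
    """Trainable weights in one LSTM layer (single-bias formulation).

    Each of the 4 gates has an input matrix (units x input_dim),
    a recurrent matrix (units x units), and a bias vector (units).
    """
    per_direction = 4 * (units * (input_dim + units) + units)
    # A bidirectional layer is two independent LSTMs, one per direction.
    return 2 * per_direction if bidirectional else per_direction


EMBED_DIM = 300  # assumption: embedding width fed into the LSTM

big = lstm_params(EMBED_DIM, 512)
small_bi = lstm_params(EMBED_DIM, 256, bidirectional=True)
print(big, small_bi)  # prints: 1665024 1140736
```

Under that assumption the 256-cell bidirectional layer has roughly 30% fewer weights than the 512-cell one, while also reading each caption in both directions. Whether that actually generalizes better still has to be checked against held-out data.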
I'd put $100 on the researcher coming up with the title and working from there. "Dank Learning"? Come on, it's a meme in itself. That said, worth publishing? Sure, it's at the top of HN. Groundbreaking results? Nah. Though I admit I am impressed with the applied solution; using deep learning and some a priori direction to derive context from images is neat.
I think the image needs to be an input somehow. I imagine running an image classifier (e.g., YOLO9000) to extract “pretrained” features and making those values inputs into a modified LSTM could allow learning to synthesize text and perception. I’d suggest learning new image embeddings (training a neural network to extract image features from scratch), but it’d be difficult to get enough images/enough different images.
I was expecting this to use some formats that aren't from 2012. It would be interesting to see a neural network that could decide text for more complex meme formats that trend on twitter and instagram.
Who knows, maybe they already are? I mean, I'm confident there's a ton of content farms out there already that just run a cronjob every couple of minutes to pluck the top ten images off of a subreddit, check if they've been published on their own channel yet, and republish them.
If not, I'll brb, need to set up some websites / facebook accounts.
9gag was caught out a few years back for automatically harvesting images off the front page of reddit, then posting them to 9gag as if they came from a "real user", and artificially inflating the upvotes.
You could tell it was automated, because every once in a while, a very reddit specific meme would appear on the 9gag front page, with a bunch of confused comments from 9gag users who didn't understand it. Here's a writeup from a couple of years ago on it 
I don't doubt that other clickbait sites like BoredPanda do exactly the same thing.
All their generated examples look like Markov chain generated captions. Pretty random and generally unfunny. I completely disagree with the claim that you can't differentiate between these generated memes and real memes. None of these would make the front page of reddit, for example.
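For reference, this is roughly what a word-level Markov chain baseline looks like. It is only a sketch of the comparison point, not the paper's method, and the three captions are made-up placeholders rather than real training data.

```python
import random
from collections import defaultdict


def build_chain(captions):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for caption in captions:
        words = ["<start>"] + caption.lower().split() + ["<end>"]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=12):
    """Walk the chain from <start> until <end> or the length cap."""
    word, out = "<start>", []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == "<end>":
            break
        out.append(word)
    return " ".join(out)


# Made-up placeholder captions, for illustration only.
CAPTIONS = [
    "one does not simply train a meme model",
    "one does not simply walk into production",
    "not sure if overfitting or just a meme",
]

chain = build_chain(CAPTIONS)
sample = generate(chain, random.Random(0))
print(sample)
```

Each word depends only on the one before it, which is why this kind of output tends to be locally plausible but aimless over a whole caption.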
In this case, yes, the memes are a subset of image macros. However, that's because the algorithm only produces images. Not all memes are images: "hit F to pay respect", "the old $pun -aroo", "Zoop", "and my axe", and "we did it reddit" are all examples of memes that aren't image macros.