15 comments

  • asavinov 1896 days ago
    Deep feature extraction is important not only for image analysis but also in other areas where specialized tools might be useful, such as those listed below:

    o https://github.com/Featuretools/featuretools - Automated feature engineering with main focus on relational structures and deep feature synthesis

    o https://github.com/blue-yonder/tsfresh - Automatic extraction of relevant features from time series

    o https://github.com/machinalis/featureforge - Creating and testing machine learning features, with a scikit-learn compatible API

    o https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last! The workflow engine allows for integrating feature training and data wrangling tasks with conventional ML

    o https://github.com/xiaoganghan/awesome-feature-engineering - Other resources related to feature engineering (video, audio, text)

    • mlucy 1896 days ago
      Definitely. There's been a lot of exciting work recently for text in particular, like https://arxiv.org/pdf/1810.04805.pdf .
    • psandersen 1895 days ago
      This is a great resource, thanks for sharing!

      I'd be interested to hear what kind of experience people are having with these frameworks in production.

  • kieckerjan 1896 days ago
    As the author acknowledges, we might be living in a window of opportunity where big data firms are giving something away for free that may yet turn out to be a big part of their future IP. Grab it while you can.

    On a tangent, I really like the tone of voice in this article. Wide eyed, optimistic and forward looking while at the same time knowledgeable and practical. (Thanks!)

    • gmac 1895 days ago
      > big data firms are giving something away for free

      On that note, does anyone know if state-of-the-art models trained on billions of images (such as Facebook's model trained via Instagram tags/images, mentioned in the post) are publicly available and, if so, where?

      Everything I turn up with a brief Google seems to have been trained on ImageNet, which the post leads me to believe is now small and sub-par ...

      • hamilyon2 1894 days ago
        Have you found anything?
        • gmac 1891 days ago
          Afraid not — I was hoping for some replies here!
    • chasely 1895 days ago
      I also found the writing to be engaging and informative. Not many product websites have posts that make me go back through their archive.
  • bobosha 1896 days ago
    This is very interesting and timely for my work. I had been struggling to train a MobileNet CNN to classify human emotions ("in the wild") and couldn't get the model to converge. I tried reducing the multiclass problem to binary models, e.g. angry|not_angry, but couldn't get past the 60-70% accuracy range.

    I switched to extracting features from an ImageNet-pretrained model and trained an XGBoost binary classifier, and boom... right out of the box I'm seeing ~88% accuracy.

    Also, the author's points about speed of training and flexibility are a major plus for my work. Hope this helps others.
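
    For the curious, roughly that kind of pipeline looks like the sketch below (not my exact setup; X_train/y_train are placeholder arrays of 224x224 RGB images and labels, and ResNet50 is just one choice of ImageNet-pretrained backbone):

      import numpy as np
      import xgboost as xgb
      from tensorflow.keras.applications import ResNet50
      from tensorflow.keras.applications.resnet50 import preprocess_input

      # Frozen ImageNet-pretrained backbone; pooling="avg" yields one
      # 2048-dim feature vector per 224x224 RGB image.
      backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

      # Run the images through the big net once and cache the features.
      features = backbone.predict(preprocess_input(X_train), batch_size=32)

      # Binary classifier (e.g. angry vs not_angry) on top of the features.
      clf = xgb.XGBClassifier(n_estimators=200)
      clf.fit(features, y_train)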

    • mlucy 1896 days ago
      Yeah, I think this pattern is pretty common. (Basilica's main business is an API that does deep feature extraction as a service, so we end up talking to a lot of people with tasks like yours -- and there are a lot of them.)

      We're actually working on an image model specialized for human faces right now, since it's such a common problem and people usually don't have huge datasets.

  • fouc 1896 days ago
    >But in the future, I think ML will look more like a tower of transfer learning. You'll have a sequence of models, each of which specializes the previous model, which was trained on a more general task with more data available.

    He's almost describing a future where we might buy/license pre-trained models from Google/Facebook/etc that are trained on huge datasets, and then extend that with more specific training from other sources of data in order to end up with a model suited to the problem being solved.

    It also sounds like we can feed a model's learnings back into new models with new architectures as we discover better approaches later.

    • mlucy 1896 days ago
      > He's almost describing a future where we might buy/license pre-trained models from Google/Facebook/etc that are trained on huge datasets, and then extend that with more specific training from other sources of data in order to end up with a model suited to the problem being solved.

      Yup, that's basically it. (Although I think there might be more than two parties involved; I think probably there will be one giant pretrained image model that everyone in the world starts from, then someone will specialize it for some domain, then someone will specialize that for some subdomain, all the way down to an individual person's problem, which might only have a few thousand data points.)

    • XuMiao 1896 days ago
      What do you think of a life-long learning scenario where models are trained incrementally forever? For example, I train a model with 1000 examples and it sucks. The next guy picks it up and trains a new one by putting a regularizer over mine. It might still suck. But after maybe 1000 people, the model begins to get significantly better. Now I can pick up where I left off and improve it by leveraging the current best. This continues forever. Imagine that this community is supported by a blockchain. Eventually we won't be relying on big companies any more.
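
      A minimal sketch of the "regularizer over mine" part, assuming both models share an architecture (new_model, prev_model and task_loss are placeholders):

        import torch

        def continual_loss(new_model, prev_model, task_loss, lam=0.01):
            # Penalize the new weights for drifting away from the inherited
            # ones, so each generation builds on (rather than forgets) the last.
            penalty = sum(((p - q.detach()) ** 2).sum()
                          for p, q in zip(new_model.parameters(),
                                          prev_model.parameters()))
            return task_loss + lam * penalty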
      • jacquesm 1896 days ago
        What is it with the word 'blockchain' that will make people toss it into otherwise completely unrelated text?
        • Varcht 1896 days ago
          You know we love our overloaded terms. In this context it means "decentralized storage". Keeps people on their toes, keeps the AI guessing.
        • oehpr 1896 days ago
          Nothing; they're describing a series of content-addressable blocks that link back to their ancestors, which is a good application of a blockchain. Think IPFS.

          It's not cryptocurrency. Though cryptocurrency definitely popularized the technique.

          • fwip 1896 days ago
            IPFS isn't a blockchain just like git isn't a blockchain. "Blockchain" has semantic meaning that "a chain of blocks" does not.
        • SiempreViernes 1896 days ago
          That recent period in time when it was a license to print money?
      • Terr_ 1896 days ago
        > What do you think of life-long learning scenario that models are trained incrementally forever?

        The same as the "life-long" coding scenario where monoliths are tweaked incrementally forever.

        They may have niches but they'll kinda suck, because the underlying problem-space evolves too. Code loses value with age.

    • gipp 1896 days ago
      Not sure if you were just being cheeky, but this is pretty much exactly what GCP's AutoML offerings are.
  • stared 1896 days ago
    A few caveats here:

    - It works (that well) only for vision (for language it sort-of-works only at the word level: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)

    - "Do Better ImageNet Models Transfer Better?" https://arxiv.org/abs/1805.08974

    And if you want to play with transfer learning, here is a tutorial with a working notebook: https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/
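
    The word-level case is easy to try with off-the-shelf pretrained vectors, e.g. via gensim's downloader (a sketch; the GloVe model name is just one of the bundled options):

      import gensim.downloader as api

      # Pretrained 100-dim GloVe vectors (word-level transfer learning).
      vectors = api.load("glove-wiki-gigaword-100")

      # king - man + woman ~= queen, the analogy from the linked post.
      print(vectors.most_similar(positive=["king", "woman"],
                                 negative=["man"], topn=1))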

  • mlucy 1896 days ago
    Hi everyone! Author here. Let me know if you have any questions, this is one of my favorite subjects in the world to talk about.
    • skybrian 1896 days ago
      What do you think of the BagNet paper? It sounds like the important thing for image recognition is just coming up with local features?

      https://openreview.net/forum?id=SkfMWhAqYQ

      • mlucy 1896 days ago
        I hadn't read it before! That's a fascinating result, actually. They emphasize interpretability in the paper, but I find it more interesting that you can do so well with only local information.

        My first thought is that it makes sense that averaging together a bunch of local predictions would work well on the ImageNet task, since the different classes tend to have obviously different local textures, and class-relevant information makes up a large part of the image. I would be very curious to see if the technique is as competitive for other tasks.
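
        Roughly, something like this toy sketch (patch_model is a placeholder for any patch-level classifier; this isn't the actual BagNet code):

          import numpy as np

          def bag_of_local_features_predict(image, patch_model, patch=33, stride=8):
              # Score every small patch independently, then average: the
              # image-level prediction uses only local information.
              h, w, _ = image.shape
              scores = []
              for y in range(0, h - patch + 1, stride):
                  for x in range(0, w - patch + 1, stride):
                      scores.append(patch_model(image[y:y + patch, x:x + patch]))
              return np.mean(scores, axis=0)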

        • skybrian 1896 days ago
          Yeah, it seems like it would be useful for debugging to replace some part of the architecture with a simple linear sum and see if it does just about as well?
    • yazr 1896 days ago
      I come from deep reinforcement learning. When considering simulated environments (such as AlphaZero or AlphaStar), can feature engineering dramatically reduce the CPU requirements or improve sample efficiency?

      Or are low-level features the "easiest" part for the network to learn?

      Edit1 : I understand of course the academic purity of working from raw data.

      Edit2: So "simulated" means lots of samples and on-policy learning, but also very CPU intensive.

    • fouc 1896 days ago
      What do you think are the most interesting types of problems to solve with this?
      • mlucy 1896 days ago
        I think if you have a small to medium sized dataset of images or text, deep feature extraction would be the first thing I'd try.

        I'm not sure what the most interesting problems with that property are. Maybe making specialized classifiers for people based on personal labeling? I've always wanted e.g. a twitter filter that excludes specifically the tweets that I don't want to read from my stream.

        • fouc 1896 days ago
          One problem that intrigues me is Chinese-to-English machine translation. Specifically for a subset of Chinese Martial Arts novels (especially given there's plenty of human translated versions to work with).

          So Google/Bing/etc have their own pre-trained models for translations.

          How would I access that in order to develop my own refinement w/ the domain specific dataset I put together?

          • mlucy 1896 days ago
            I don't think you could get access to the actual models that are being used to run e.g. Google Translate, but if you just want a big pretrained model as a starting point, their research departments release things pretty frequently.

            For example, https://github.com/google-research/bert (the multilingual model) might be a pretty good starting point for a translator. It will probably still be a lot of work to get it hooked up to a decoder and trained, though.

            There's probably a better pretrained model out there specifically for translation, but I'm not sure where you'd find it.
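
            If you go that route, pulling contextual features out of the multilingual model is straightforward with, e.g., the Hugging Face transformers wrapper (a sketch; the decoder and training loop are the parts you'd still have to build):

              import torch
              from transformers import AutoModel, AutoTokenizer

              # Multilingual BERT from the google-research/bert release.
              tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
              model = AutoModel.from_pretrained("bert-base-multilingual-cased")

              batch = tok(["一代宗师"], return_tensors="pt")
              with torch.no_grad():
                  # One contextual vector per token -- the "deep features"
                  # a translation decoder could be trained on top of.
                  features = model(**batch).last_hidden_state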

      • asavinov 1896 days ago
        IMHO (deep) feature engineering is important in these cases:

        o the lower the level of representation, the more important it is to increase the level of abstraction by learning new features or defining them manually

        o in the presence of (fine-grained) raster data, (automated) feature engineering is especially important; hence it matters in audio analysis (1D raster) and video analysis (2D raster)

    • julius_set 1896 days ago
      Great article. I have a question pertaining to time series data: would this work well on a smaller dataset of pre-processed sensor readings for HAR (human activity recognition)?
      • mlucy 1896 days ago
        I don't work with time series data much myself. I would imagine you can get at least some transfer learning, since there are patterns that show up across different domains. It looks like there's been a little bit of work done on this: https://arxiv.org/pdf/1811.01533.pdf .

        According to them, transfer learning can improve a time series model if you pick the right dataset to transfer from, but they don't seem to be getting the same unbelievably strong transfer results that you'd see on images and text.

    • jewelthief91 1896 days ago
      Considering the rate of change in this field, what would be beneficial to learn for people who don't actually get to use machine learning in their day to day job? I'd love to dive in and learn more about machine learning but I don't want to waste time learning something that will be totally irrelevant in a couple years.
  • jfries 1896 days ago
    Very interesting article! It answered some questions I've had for a long time.

    I'm curious about how this works in practice. Is it always good enough to take the outputs of the next-to-last layer as features? When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step? And the inputs to the net you're training are the features? Does the new net always only need 1 layer?

    What are some examples of where this worked well (except for the flowers mentioned in the article)?

    • mlucy 1896 days ago
      > Is it always good enough to take the outputs of the next-to-last layer as features?

      It usually doesn't matter all that much whether you take the next-to-last or the third from last, it all performs pretty similarly. If you're doing transfer to a task that's very dissimilar from the pretraining task, I think it can sometimes be helpful to take the first dense layer after the convolutional layers instead, but I can't seem to find the paper where I remember reading that, so take it with a grain of salt.

      > When doing quick iterations, I assume the images in the data set have been run through the big net as a preparation step?

      Yep. (And, crucially, you don't have to run them through again every iteration.)

      > And the inputs to the net you're training are the features? Does the new net always only need 1 layer?

      Yeah, you take the activations of the late layer of the pretrained net and use them as the input features to the new model you're training. The new model you're training can be as complicated as you like, but usually a simple linear model performs great.

      > What are some examples of where this worked well (except for the flowers mentioned in the article)?

      The first paper in the post (https://arxiv.org/abs/1403.6382) covers about a dozen different tasks.
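
      Concretely, the whole loop looks roughly like the sketch below (X_train/y_train are placeholder arrays, and VGG16's "fc1" is just one choice of late layer):

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from tensorflow.keras import Model
        from tensorflow.keras.applications import VGG16
        from tensorflow.keras.applications.vgg16 import preprocess_input

        # Pick a late layer to read features from -- here the first dense
        # layer after the conv stack ("fc1"); "fc2" would be next-to-last.
        vgg = VGG16(weights="imagenet")
        extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc1").output)

        # Preparation step: run every image through the big net exactly once
        # and cache the activations; iteration after this is cheap.
        features = extractor.predict(preprocess_input(X_train), batch_size=32)
        np.save("features.npy", features)

        # The new model is just a simple linear classifier on those features.
        clf = LogisticRegression(max_iter=1000).fit(features, y_train)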

  • mikekchar 1896 days ago
    It's hard to ask my question without sounding a bit naive :-) Back in the early nineties I did some work with convolutional neural nets, except that at that time we didn't call them "convolutional". They were just the neural nets that were not provably uninteresting :-) My biggest problem was that I didn't have enough hardware, so I put that kind of stuff on a shelf waiting for hardware to improve (which it did, but I never got back to that shelf).

    What I find a bit strange is the excitement that's going on. I find a lot of these results pretty expected. Or at least this is what I and anybody I talked to at the time seemed to think would happen. Of course, the thing about science is that sometimes you have to do the boring work of seeing if it does, indeed, work like that. So while I've been glancing sidelong at the ML work going on, it's been mostly a checklist of "Oh cool. So it does work. I'm glad".

    The excitement has really been catching me off guard, though. It's as if nobody else expected it to work like this. This in turn makes me wonder if I'm being stupidly naive. Normally I find when somebody thinks, "Oh it was obvious" it's because they had an oversimplified view of it and it just happened to superficially match with reality. I suspect that's the case with me :-)

    For those doing research in the area (and I know there are some people here), what have been the biggest discoveries/hurdles that we've overcome in the last 20 or 30 years? In retrospect, what were the biggest worries you had in terms of wondering if it would work the way you thought it might? Going forward, what are the most obvious hurdles that, if they don't work out might slow down or halt our progression?

    • aabajian 1896 days ago
      If you haven't, you should take a few moments to read the original AlexNet paper (only 11 pages):

      https://papers.nips.cc/paper/4824-imagenet-classification-wi...

      What you're saying is true: it should have worked in theory, but it just wasn't working for decades. The AlexNet team made several critical optimizations to get it to work: (a) a big network, (b) training on GPUs, and (c) using ReLU instead of tanh.

      In the end, it was the hardware that made it possible, but up until their paper it really wasn't for sure. A good analogy is the invention of the airplane: you can speculate all you want about the curvature of a bird's wing and lift, but until you actually build a wing that flies, it's all speculation.

    • dchichkov 1896 days ago
      We've learned to learn cost functions instead of hardcoding them. Discriminative models are always worrisome. Nuclear war, unexpected deterioration of democracy, or unexpected and rapid change of climate.
    • pwbdecker 1896 days ago
      I feel the same way. I was working on NN research in the 2000s and spent all my time trying to optimize performance to deal with non-trivial data sets. I got as far as working on GPU implementations around 2008 before I moved on to other subjects, but seeing these results now is incredibly validating. There was no shortage of profs and grad students who scoffed at me at the time. I still kick myself for not keeping up with it.
  • al2o3cr 1896 days ago
    Compare this with a similar writeup on some interesting observations about solving ImageNet with a network that only sees small patches (the largest is 33px on a side):

    https://medium.com/bethgelab/neural-networks-seem-to-follow-...

  • purplezooey 1896 days ago
    The question to me is: can you do this with, e.g., a Random Forest too, or is it specific to NNs?
  • gdubs 1895 days ago
    This is probably naive, but I’m imagining something like the US Library of Congress providing these models in the future. E.g., some federally funded program to procure / create enormous data sets / train.
  • CMCDragonkai 1895 days ago
    I'm wondering how this compares to transfer learning applied to the same model. That is, compare deep feature extraction plus a linear model at the end vs. transferring the weights to the same model and retraining on your specific dataset.
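
    For reference, the two variants differ by roughly one flag in Keras (a sketch; num_classes, X_train and y_train are placeholders):

      from tensorflow.keras import layers, models, optimizers
      from tensorflow.keras.applications import ResNet50

      base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

      # Variant A (feature extraction): freeze the backbone, train only the head.
      # Variant B (fine-tuning): set trainable=True and retrain everything,
      # usually with a much smaller learning rate.
      base.trainable = False

      model = models.Sequential([
          base,
          layers.Dense(num_classes, activation="softmax"),
      ])
      model.compile(optimizer=optimizers.Adam(1e-4),
                    loss="sparse_categorical_crossentropy")
      model.fit(X_train, y_train, epochs=5)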
  • zackmorris 1896 days ago
    From the article:

    > Where are things headed?

    > There's a growing consensus that deep learning is going to be a centralizing technology rather than a decentralizing one. We seem to be headed toward a world where the only people with enough data and compute to train truly state-of-the-art networks are a handful of large tech companies.

    This is terrifying, but it's the same conclusion I've come to.

    I'm starting to feel more and more dread that this isn't how the future was supposed to be. I used to be so passionate about technology, especially about AI as the last solution in computer science.

    But these days, the most likely scenario I see for myself is moving out into the desert like Obi-Wan Kenobi. I'm just so weary. So unbelievably weary, day by day, in ever increasing ways.

    • coffeemug 1896 days ago
      Hey, I hope you don't take it the wrong way -- I'm coming from a place where I hope you start feeling better -- but what you're experiencing might be depression/mood affiliation. I.e. you feel weary and bleak, so the world seems weary and bleak.

      There are enormous problems for humanity to solve, but that has always been the case. From plagues and famines, to world wars, to now climate change, AI risk, and maybe technology centralization. We've solved massive problems before at unbelievable odds, and I want to think we'll do it again. And if not, what of it? What else is there to do but work tirelessly at attempting to solve them?

      I hope you feel better, and find help if you need it -- don't mean to presume too much. My e-mail is in my profile if you (or anyone else) needs someone to talk to.

    • guelo 1896 days ago
      It seems kind of obvious in retrospect. I used to envision "the singularity" as somehow organically emerging from distributed technology and that would make it benevolent. But that was so naive. The singularity was always going to require massive investments of a scale that only monopolies or militaries can provide. That it currently looks like it will come out of ad-tech monopolies comfortable with psychological manipulation at a global scale is the most terrifying possibility of all.
    • existencebox 1896 days ago
      I'm torn.

      On one hand, I absolutely see the logic, feel the occasional despair, and tend to agree with you, especially when it comes to economies of scale. I'll never write algos that detect the alpha that hedge funds can. I'll never write the NLP that my own employer can leverage trivially.

      On the other hand, do I really need to? In 90% of the use cases where I want to solve a problem, with some pile of hacks and heuristics I've gotten "more than good enough." And the big companies will keep investing on ways to scale up and optimize these algos, which will only benefit us tiny users too. I did both my last publication and patent using a CPU-bound model and not an ounce of deep learning, with a corpus you could fit on a thumbdrive.

      I've watched a bigCo spend _months_ of some of the best engineers I know on optimizing a tiny subproblem of a subproblem (object similarity detection). Meanwhile I had to solve a nearly isomorphic problem for my home camera system, threw together a prototype in a few hours with OpenCV and _really_ rudimentary bit-array hacks, declared it "WORKABLE", and have been using it for the last 3 years. There are some areas where what's in open source is pretty much what I'd use given any options (Pandas, Spark, Postgres) and some areas where it's not (pgAdmin :P and OS UX (looking at you, Canonical), to name two). This isn't a one-sided battle, to the strength and credit of non-big-corps.

      Maybe it's the eternal rebel in me, but I'm a fan of desert kenobi, it's the start of a journey. Stick it to the man!

    • patcon 1896 days ago
      I recall this was the top-voted comment (rightfully imho) until shortly after someone suggested that the author was depressed instead of being reasonable... Now it's at the bottom, and lots of "wow, this is super-interesting" comments are above.

      fwiw, this quoted bit also jumped out at me as perhaps the most important note

      I never know what to make of the psychology of this/my community (tech, specifically), but the dynamics here on HN always provide me lots of "food for thought" to overfit :)

    • rstuart4133 1891 days ago
      I took the opposite conclusion from it.

      After doing a back-of-the-envelope calculation and concluding that AlphaGo Zero used roughly 250 MWh of power to train, I concluded we were going to see AI monopolies. Someone would develop the best image recognition, corner the market, and get a stream of profits that allowed them to pour more money into training, and round and round we go. As you say, it was a depressing thought.

      If deep feature learning / embedding is really this effective, it turns that upside down. We could end up with AIs being constructed from layers you buy off the shelf: lots of parts coming from different vendors, all competing, not unlike the software stacks we use now.

      It might even go the MPEG / AV1 way: large companies get fed up with paying someone for the layers and combine their resources to build a better layer than any one of them could on their own, and they can do it because they are all going to do additional training on top and put it to a different use.

      This is impossibly speculative, but thinking you had to invest 250 MWh to build a decent Go machine was equally speculative and not a nice prospect. No open-source group was going to come up with the next killer Go-playing box if it turned out that way. Now there's a possibility it may not.