14 comments

  • twotwotwo 1926 days ago
    Interesting 'cause once you think of it as a lossy compression task, lots of possible tricks come to mind. One is that you could "cheat" to correct the most important errors: look at which input words have high (distance from original point * frequency of use), and save the (in their example) 300-element error vector (original embedding - estimate) for those. Since we use the most common words a lot, you might be able to recover a noticeable chunk of accuracy without using much additional space.

    Really, you want to correct the errors that are important, and importance to the task is a different thing from error magnitude * word frequency. You might run the network on real tasks with the uncompressed table and the compressed table, and add correction vectors for the words to which you can attribute the most mistakes made by the compressed net. And you don't necessarily have to correct the whole vector or correct it perfectly; you could round values in the correction vector down to zero when they're small enough, saving space and computation, or save your corrections with low precision, or try another round of SVD on some error vectors and only store 30 more weights for correcting important errors (and one 30x300 matrix) not 300, or some combo of tricks like that.
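
    A rough numpy sketch of the basic version of this (importance = error magnitude * frequency, full correction vectors for the top N words; all the shapes, the frequency model, and the cutoff below are made-up placeholders, not anything from the paper):

      import numpy as np

      V, d, N = 50000, 300, 1000                             # vocab size, embedding dim, words to correct
      original = np.random.randn(V, d)                       # stand-in for the uncompressed table
      compressed = original + 0.01 * np.random.randn(V, d)   # stand-in for the SVD estimate
      freq = np.random.zipf(1.5, V).astype(float)            # stand-in word frequencies

      # importance = reconstruction error magnitude * how often the word is used
      importance = np.linalg.norm(original - compressed, axis=1) * freq
      top = np.argsort(importance)[-N:]                      # the N words most worth fixing

      # store only these correction vectors alongside the compressed table
      corrections = {i: original[i] - compressed[i] for i in top}

      def lookup(i):
          return compressed[i] + corrections.get(i, 0.0)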

    (I think this paper is saying that after you get the SVD-compressed lookup table, you do more NN training that tweaks the embedding and possibly the rest of the net. The links and weights for correcting important errors could be tweaked in the same process.)

    It's tempting to treat neural networks as their own black box in a world apart from other CS sometimes, when in fact there can be lots of interesting connections. Now where all those connections ought to be and their ideal weights I don't know :P

  • azhenley 1926 days ago
    The article's title is actually "New Method for Compressing Neural Networks Better Preserves Accuracy".

    Summary of a key result: "In one set of experiments, for instance, we showed that our system could shrink a neural network by 90 percent while reducing its accuracy by less than 1%. At the same compression rate, the best prior method reduced the accuracy by about 3.5%."

    Link to the 2019 AAAI paper: https://arxiv.org/pdf/1811.00641.pdf

    • sctb 1926 days ago
      We've reverted the title from the submitted “Method shrinks NNs by 90% with sub-1% accuracy dropoff, vs. 3.5% for prior SOA”. Submitters: there's no need to try to pack the article into 80 characters! Titles and articles are two separate things.
    • ralusek 1926 days ago
      Do they mean compressing the number of neurons involved in the computation, or do they mean compressing the output model?
      • irodov_rg 1926 days ago
        It's the lookup table that takes up most of the space. This work is about breaking it into two layers and continuing to train to regain accuracy. The output model ends up 90% smaller than the original model.
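
        Roughly, the factorization looks like this (a toy numpy sketch; the vocab size and rank here are arbitrary, not the paper's settings):

          import numpy as np

          V, d, k = 100000, 300, 30          # vocab, embedding dim, reduced rank
          E = np.random.randn(V, d)          # stand-in for the trained lookup table

          U, S, Vt = np.linalg.svd(E, full_matrices=False)
          A = U[:, :k] * S[:k]               # V x k: the new, much smaller lookup table
          B = Vt[:k, :]                      # k x d: small dense layer stacked on top

          print(E.size, A.size + B.size)     # 30,000,000 vs 3,009,000 weights (~90% smaller)
          # A @ B approximates E; both pieces keep training to regain accuracy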
    • trhway 1926 days ago
      It would be interesting if they also showed whether those 90% of weights were needed at all to get that last 1% back. We kind of suspect they weren't; the interesting part would be showing a proof, or at least some good foundation for it.
      • irodov_rg 1926 days ago
        This is shown in one of the plots: the less you compress, the less you lose. While there is some analysis in the paper of how the computation is reduced, etc., the results are mostly empirical.
  • jknz 1926 days ago
    The goal of the post is to compress an already trained network.

    A more ambitious related goal is to perform the training of the network with at most 10% (or some other small percentage) of the network weights being nonzero at any time during training. If that were possible without a great loss in accuracy, it could be particularly useful for saving memory during training. The paper https://arxiv.org/abs/1711.05136 proposes a method to achieve this.
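
    (Not that paper's actual method, just a naive illustration of the constraint: re-apply a top-10%-magnitude mask to every weight matrix after each optimizer step. The real memory savings come from never materializing the dense weights, which this toy version doesn't attempt.)

      import torch

      def enforce_sparsity(model, density=0.1):
          # call after optimizer.step(): zero out all but the largest-magnitude weights
          with torch.no_grad():
              for p in model.parameters():
                  if p.dim() < 2:            # skip biases and the like
                      continue
                  k = max(1, int(density * p.numel()))
                  # threshold = k-th largest absolute value in this tensor
                  threshold = p.abs().flatten().kthvalue(p.numel() - k + 1).values
                  p.mul_((p.abs() >= threshold).float())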

  • arnocaj 1926 days ago
    I would not consider word embeddings state of the art anymore. Word embeddings today are what TF-IDF was when word embeddings first came out. Have a look at the BERT model, which was just recently published and is outperforming on all kinds of NLP tasks with one main architecture. I would put BERT's language model two levels above word embeddings, since it produces fully context-sensitive embeddings that depend on the text to both the left and right of the word at once.
    • mlucy 1926 days ago
      I would second this; sentence embeddings outperform word embeddings on basically all tasks where you actually have sentences to work with. The only downside is that they're significantly more computationally intensive, especially for Transformer models like BERT.

      (Note: I'm fairly biased, since I work on https://www.basilica.ai, which among other things makes sentence embeddings available over a REST interface.)

    • yazr 1926 days ago
      Is BERT computationally (and sample-wise) equivalent to previous SOTA?

      (I do DRL but not NLP)

      I sometimes read these DL papers and the requirements are not really feasible if you have to re-implement them in a modified domain.

      • irodov_rg 1926 days ago
        BERT is more computationally expensive. It might end up giving better results on the tasks in the paper, but we don't know; at the time of writing, all of the contextual word embedding techniques were fairly new and we hadn't tried them.
    • DoctorOetker 1926 days ago
      I can see utility in demonstrating breakthroughs with a simpler compatible technique rather than a more complicated state-of-the-art one. The goal of scientific communication is to convey the rationale and effect of a proposal with the simplest examples possible; then anyone using more advanced frameworks can understand it and consider implementing it in theirs.
  • anishacharya 1926 days ago
    Just to clarify a bunch of questions below. The idea is: start from a full embedding layer and train the network; after training stabilizes, break the first layer into two using the SVD formulation, in an online fashion, and continue training. After only a few more epochs you gain back almost all of the accuracy.

    This is particularly interesting and efficient compared to quantization/pruning methods, where there is no way to regain the lost accuracy. Quantized distillation can be an option but is too slow, as the authors point out. Offline embedding compression like FastText.zip compresses the embedding matrix offline and seems to be less information-preserving than this method, leading to worse performance.

    Another important thing to note: this method shows significant improvement over quantization/pruning/hashing when the network is deep. The deeper the network, the larger the quantization error, due to more FLOPs in low precision, whereas this algorithm, being principled from an information-theoretic perspective, maintains steady performance regardless of network depth; deeper nets actually give better gains. In many of our experiments on smaller tasks this method also acts as cheap regularization, leading to boosted performance. It's also worth appreciating the mathematical elegance and ease of implementation, and the numerical analysis that provides bounds, compared to quantization methods.
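
    (If it helps to see the swap concretely, here is roughly how I'd sketch it in PyTorch, assuming a recent version; the names and the rank are illustrative only, not our exact implementation:)

      import torch
      import torch.nn as nn

      def split_embedding(emb: nn.Embedding, rank: int) -> nn.Sequential:
          # Factor the trained V x d table W into A (V x rank) and B (rank x d), W ~ A @ B,
          # then continue training both pieces along with the rest of the network.
          W = emb.weight.data                                # V x d
          U, S, Vt = torch.linalg.svd(W, full_matrices=False)
          A = U[:, :rank] * S[:rank]                         # fold singular values into A
          B = Vt[:rank, :]

          small = nn.Embedding(W.shape[0], rank)             # the new, much smaller lookup
          small.weight.data.copy_(A)
          proj = nn.Linear(rank, W.shape[1], bias=False)     # rank -> d projection
          proj.weight.data.copy_(B.t())                      # Linear stores weight as (out, in)
          return nn.Sequential(small, proj)                  # drop-in replacement; resume training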
  • nestorD 1926 days ago
    So they took a NN taking embeddings as input, added a first layer going from one-hot encodings to embeddings (you just use the embedding matrix as the first layer), did an SVD on it, and retrained the network on top.

    Using SVD plus retraining to compress a neural network is a common idea; their gain comes mostly from applying it to the embeddings, which are usually considered separately from the network (but do take up space on your device).

    • anishacharya 1926 days ago
      The idea is: start from a full embedding layer and train the network; after training stabilizes, break the first layer into two using the SVD formulation, in an online fashion, and continue training. After only a few more epochs you gain back almost all of the accuracy.

      This is particularly interesting and efficient compared to quantization/pruning methods, where there is no way to regain the lost accuracy. Quantized distillation can be an option but is too slow, as the authors point out. Offline embedding compression like FastText.zip compresses the embedding matrix offline and seems to be less information-preserving than this method, leading to worse performance.

      Another important thing to note: this method shows significant improvement over quantization/pruning/hashing when the network is deep. The deeper the network, the larger the quantization error, due to more FLOPs in low precision, whereas this algorithm, being principled from an information-theoretic perspective, maintains steady performance regardless of network depth; deeper nets actually give better gains. In many of our experiments on smaller tasks this method also acts as cheap regularization, leading to boosted performance. It's also worth appreciating the mathematical elegance and ease of implementation, and the numerical analysis that provides bounds, compared to quantization methods.
  • mlucy 1926 days ago
    Interesting. It took me a while to figure out what the main contribution here is, since doing dimensionality reduction on embeddings is fairly common. I think the main contribution is an empirical measure of A) how little is lost by reducing the dimensionality of the embeddings, and B) how much the dimensionality-reduced embeddings benefit from being fine-tuned on the final dataset.

    It looks like fine-tuning the lower-dimensionality embeddings helps a lot (1% degradation vs. 2% for the offline dimensionality reduction according to the paper).

    This makes me pretty happy, since I run a startup that trains and hosts embeddings (https://www.basilica.ai), and we made the decision to err on the side of providing larger embeddings (up to 4096 dimensions). We made this decision based on the efficacy of offline dimensionality reduction, so if fine-tuned dimensionality reduction works even better, that's great news.

    • DoctorOetker 1926 days ago
      My background is physics, but I have been reading and following AI papers for many years without actually applying this knowledge.

      I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings. By minimalist I don't mean simplest to use, but simplest to get acquainted with at the level of the underlying codebase.

      I have no interest in becoming a proficient user of a commercially relevant library, since machine learning is not how I earn money. But I can imagine one of the more theoretical courses on machine learning has a codebase for students to tinker with, i.e. one where the educational facet of how the word embedding algorithms work is prioritized over commercial relevance and computational efficiency. Do you know of any such minimal implementations?

      The idea concerns objective bias removal (potentially at the cost of loss of factual knowledge), so if it works it may be interesting outside of word embeddings as well...

      EDIT: The only functional requirement I have of the codebase is that, given a corpus, it can generate an embedding such that the classic "king" + "woman" - "man" ≈ "queen" relation is satisfied. The only computational requirement is that it can generate this embedding on a single-core laptop within a day for a roughly billion-word corpus. Any additional functionality is, for my purposes, overcomplicating. So I am looking for the simplest (well-documented, preferably in paper form) codebase that achieves this.
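
      For concreteness, the level of usage I'm after is roughly this (gensim shown only as an example of the interface; I haven't checked whether the parameter names match the current version):

        from gensim.models import Word2Vec

        # corpus: an iterable of tokenized sentences, read here from a plain-text file
        corpus = [line.split() for line in open("corpus.txt")]

        model = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=5, workers=1)

        # the analogy check I care about; on a large enough corpus "queen" should rank near the top
        print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))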

      • speedplane 1926 days ago
        >I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings.

        Using AI on natural language is still in its infancy. You're not going to find a clean off-the-shelf solution if you're doing something unique.

        Google Cloud has a natural language AI module that makes solving certain problems pretty easy (little to no coding). The next step is to look at packages like NLTK, which is more complex but can handle a larger set of natural language processing tasks (you need to be comfortable with Python). If neither of those does the job, you're going to need to dive much deeper, learn the fundamentals of neural nets and natural language processing, and become familiar with tools like Tensorflow.

        Natural language processing is a much more open field than image processing, despite seemingly much lower bandwidth and data sizes. As far as data size goes, all of Shakespeare's works can fit into a single iPhone image, yet understanding what's going on in natural language is often a far more difficult task than understanding what's going on in an image.

        • DoctorOetker 1926 days ago
          I apologize for my late edit, and thank you for your response. I just need a minimal codebase that generates a word embedding satisfying these quasi-linear relationships.
  • karmakaze 1926 days ago
    This leads in a very interesting direction. Previously I read that a DNN is just as content to fit random input as it is to fit meaningful categories of input, which makes it seem that a DNN is just a memorization trick and can't possibly be extracting any meaning.

    Now that the network can be compressed after being trained, it's evident that not all the nodes were needed/used. A good experiment would be to train a DNN on random input samples and verify that it is far less compressible. If so, the degree of compressibility could be used as a proxy estimate of how much 'conceptualization' is being done.

    • anishacharya 1926 days ago
      The overcomplete nature of DNNs is a common phenomenon, often advocated as a way to convert saddle points into local minima; understanding the actual loss landscape is difficult due to the inherent nonlinearity. This is indeed an interesting research direction, and many prominent research groups have started investing heavily in it.
    • visarga 1926 days ago
      > Previously I read that DNN is just as content to fit random input as it is to fit meaningful categories of input.

      That's not exactly right. DNNs can memorise randomly assigned labels, but it's harder than learning real data, because real data has more regularities.

  • atrudeau 1926 days ago
    Very cool idea. How does fine tuning the SVD initialization compare to training from random initialization using the same architecture? I couldn't find this in the paper.
  • ppod 1926 days ago
    Something I don't understand:

    I thought that when using an LSTM for classification, the input for the sentence was a single vector of length 300, recurrently updated with each word in the text; and for the averaging method, it was a vector of length 300, the average of the length-300 vectors for the words in the sentence. At what stage in the standard process would a matrix of size V*300 be used to represent a sentence or document?

    • twotwotwo 1926 days ago
      I think the big matrix in the post is the "dictionary" used to map words to NN input vectors, not a representation of one input sentence or document. So its size is (total words understood by NN * dimensions of embedding).
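
      In code terms, something like this (shapes made up for illustration):

        import numpy as np

        vocab_size, dim = 20000, 300
        lookup = np.random.randn(vocab_size, dim)  # the big V x 300 matrix: one row per known word

        sentence = [17, 4052, 391]                 # word indices for one input sentence
        vectors = lookup[sentence]                 # len(sentence) x 300

        avg_input = vectors.mean(axis=0)           # single 300-d vector for the averaging method
        # for the LSTM, the rows are instead fed in one at a time to update the hidden state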
      • ppod 1926 days ago
        Yes, sorry that's actually quite clear in the text now that I read it again.
  • drewm1980 1926 days ago
    "NLP" would be welcome in the title since this seems specific to natural language processing.
  • darawk 1926 days ago
    They're just SVDing the embeddings? Is that really new?
    • twotwotwo 1926 days ago
      I think the new things are integrating the SVD output into the NN as a couple of layers and then training it more, and adjusting the learning rate differently from how it was done before. The training step will tend to tune the compression to work well for the specific task.

      Getting good results on the task is different from making the compressed representation of the embedding better on all-purpose metrics (like average magnitude of difference from the original embedding). For example, you might end up with changes that improve representation of common words, or words where errors actually mess up results, at the cost of making the decompression worse by simple metrics.

      (It wasn't clear to me if the training adjusts the whole net's weights or just the two layers they add up front. If it adjusts everything, you could also get joint changes to the original net + decompression machinery that only work well together.)

      • irodov_rg 1926 days ago
        Training adjusts the whole net's weights. And you are right, we get joint changes to the original net and the decompression machinery that work together.
  • josteink 1926 days ago
    Can someone translate the title into non-obfuscated language please?

    This is ridiculous.

    • rococode 1926 days ago
      NN = Neural Network

      SOA = State of the Art

      The rest should be clear: it's a new method that shrinks the size of neural nets by 90% while only dropping the accuracy <1%. The previous state of the art for doing the same task of shrinking neural nets dropped the accuracy by 3.5%.

      Compressing/shrinking neural nets is an important task because most edge devices (e.g. smart devices like thermostats, security cams, etc.) don't have the compute power to run large neural networks and need more compact ways to do the same tasks. Currently the solution is to use internet connectivity and query an API, but that's not ideal for a variety of reasons.

      • visarga 1926 days ago
        Not neural nets in general, just the word embedding layer. This layer tends to be large, e.g. 1,000,000 x 300, so there are problems deploying it to production, especially in mobile environments. They can reduce the inner dimension to something like 30 using SVD, replacing the large layer with two much smaller ones, and then retraining.
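
        Back-of-the-envelope with those numbers (assuming the small layer left on top is 30 x 300):

          original = 1_000_000 * 300               # 300,000,000 weights
          factored = 1_000_000 * 30 + 30 * 300     # 30,009,000 weights
          print(1 - factored / original)           # ~0.90, i.e. the ~90% reduction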

        Note that whole networks can also be compressed to similar levels with a variety of techniques; this one is just slightly better for a specific type of layer.

    • CardenB 1926 days ago
      This isn't a productive way to criticize something. Please state what you were confused about and suggest an alternative.