Deep Learning Breakthrough Made by Rice University Scientists

(arstechnica.com)

112 points | by Tomte 1592 days ago

7 comments

  • primitivesuave 1592 days ago
    I would hardly characterize this as a breakthrough: https://openreview.net/forum?id=r1RQdCg0W
    • Audoenus 1592 days ago
      A good rule of thumb I always use: if a science article title has the word "breakthrough" in it, then it's probably not a breakthrough.

      If the title does nothing to describe the actual discovery made and consists solely of "Breakthrough in [Field]", then it's definitely not a breakthrough.

      • jvm_ 1592 days ago
        Sounds like the rule where if the headline ends in a question mark, the answer to the question is No.
      • t_mann 1591 days ago
        A good rule that could be refined by applying it only to topics that the general public cares about. A breakthrough in analytic number theory or international accounting standards is probably genuine, one in AI or battery technology probably not.
      • 19f191ty 1592 days ago
        Unless it's in Math. If a Math breakthrough makes it to the popular media then it's most likely a very big breakthrough.
        • remarkEon 1592 days ago
          Why is that? I suppose complex maths is harder for the science journalist to understand, and doesn't get as many clicks, so if they are reporting on it, it's because it's substantial?
      • smaddox 1592 days ago
        Any rule that reduces the posterior of a breakthrough is generally going to be an improvement. Unless your definition of "breakthrough" is extremely generous.
    • allovernow 1592 days ago
      >"[its] training times are about 7-10 times faster, and... memory footprints are 2-4 times smaller" than those of previous large-scale deep learning techniques.

      Which matches the abstract. If this has general applications, it's a pretty big leap to shrink model sizes several-fold and speed up training by nearly an order of magnitude, especially at a time when many SOTA models are only feasible for well-funded groups because of their size.

      • primitivesuave 1590 days ago
        As the reviewers noted, this is not a strong result, and among other things the authors have not demonstrated state-of-the-art performance. Those of us who did ML in academia also look down on using the media to bolster one's claims before a thorough peer review of the research.
    • billconan 1592 days ago
      Thank you for the link. I couldn't understand it as explained by Ars; I found the open reviews were better.
  • m0zg 1592 days ago
    Word to the wise, from someone who actually works in the field: trust NO claims until you can verify them with real code.

    Papers very often contain the very uppermost bound of what's _theoretically_ possible when it comes to benchmarks. Researchers rarely have the skill to realize those gains in practice, so any performance numbers in papers should be assumed theoretical and unverified unless you can actually download the code and benchmark it yourself, or unless they come from a research organization known for competent benchmarking (e.g. Google Brain). In particular, any "sparse" approach is deeply suspect as far as its practical performance or memory-efficiency claims: current hardware does not deal with sparsity well unless things are _really_ sparse (on the order of 1/10th of the values non-zero, or fewer) and the sparsity is able to outweigh architectural inefficiencies.
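
    To make the "benchmark it yourself" point concrete, here is a minimal, hedged sketch (assuming numpy and scipy are installed; sizes and densities are arbitrary, not from the paper): time a dense matmul against a CSR sparse one at a few densities. On typical CPUs the sparse path only starts to win once the matrix is very sparse.

    ```python
    # Dense vs. sparse matmul at varying density. Illustrative only:
    # sizes and densities are arbitrary stand-ins, not from the paper.
    import time

    import numpy as np
    import scipy.sparse as sp

    def bench(density, n=2000, repeats=5):
        rng = np.random.default_rng(0)
        dense = rng.random((n, n)) * (rng.random((n, n)) < density)  # zero out entries
        sparse = sp.csr_matrix(dense)                                # same matrix, CSR format
        x = rng.random((n, 64))

        t0 = time.perf_counter()
        for _ in range(repeats):
            dense @ x
        t_dense = (time.perf_counter() - t0) / repeats

        t0 = time.perf_counter()
        for _ in range(repeats):
            sparse @ x
        t_sparse = (time.perf_counter() - t0) / repeats
        return t_dense, t_sparse

    for density in (0.5, 0.1, 0.01):
        d, s = bench(density)
        print(f"density={density:.2f}  dense={d * 1000:.1f} ms  sparse={s * 1000:.1f} ms")
    ```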

    • ganzuul 1592 days ago
      • m0zg 1592 days ago
        https://github.com/Tharun24/MACH/blob/master/amazon_670k/src...

        Run on a single machine by logically partitioning GPUs. Don't get me wrong, I'm not disputing that this could work or that it could be a "breakthrough". I'm just saying that unless it's independently replicated and confirmed, it's just a paper like a million others.

        • ganzuul 1592 days ago
          It's an interesting premise nonetheless. Perhaps a similar approach could borrow from mathematical manifolds, which have charts and atlases; I believe the atlas is built from overlapping charts.
  • mpoteat 1592 days ago
    Not a full-time ML researcher, but I thought batching was already an extremely common practice. I don't see the novelty here.
    • dnautics 1592 days ago
      Bigger batches are good, but they result in locking. Picking a good batch size relative to how much data you have is important. This new technique lets you, effectively, get a "meta batch" for free (that's a terrible analogy, but it's the best I can do).

      As batches get bigger and can't fit inside a single GPU or a single compute node, your challenge becomes data transport. So anything that can decouple your computational agents can be a win.

      In this case, it's a cleverer way of decoupling your agents. Normally asynchronous batching is awful, but this is a way of making asynchronous batching of your data work well.
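
      As a hedged toy illustration of "decoupled agents" (not the paper's algorithm; the data, model, and learning rate are made up), here is Hogwild!-style asynchronous SGD: each worker updates shared parameters from its own shard without locking or waiting on the others.

      ```python
      # Toy asynchronous SGD on a linear model: workers update shared
      # parameters without locks, so their "batches" interleave freely.
      # Purely illustrative; not the MACH approach from the article.
      import threading

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.random((10_000, 20))
      true_w = rng.random(20)
      y = X @ true_w + 0.01 * rng.standard_normal(10_000)

      w = np.zeros(20)   # shared parameters, updated asynchronously
      lr = 0.01

      def worker(shard):
          global w
          for i in shard:
              grad = (X[i] @ w - y[i]) * X[i]   # per-example squared-error gradient
              w = w - lr * grad                 # no lock: updates may interleave

      shards = np.array_split(rng.permutation(len(X)), 4)
      threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
      for t in threads:
          t.start()
      for t in threads:
          t.join()

      print("distance from true weights:", np.linalg.norm(w - true_w))
      ```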

      If I may opine on the matter, I think we're reaching a point where machine learning researchers should start thinking about abandoning Python as a programming medium. For example, the other decoupling strategy (decoupled neural net backpropagation) doesn't really seem like something I would want to write in Python, much less debug in someone else's code. Python is really not an appropriate framework for tackling difficult problems in distributed computation and network coordination.

      • comicjk 1592 days ago
        As long as the big ML libraries support these strategies, people will use them. The choice of user language is not critical. Tensorflow/PyTorch are basically an ML-specific programming model with a Python interface.
      • allovernow 1592 days ago
        What's the performance difference between C++ and Python that wraps C++ for the critical sections?

        Pretty much all array operations in numpy, as I understand it, call into C++ libraries for CPU and GPU operations.
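
        That premise is easy to check directly (a rough sketch; timings will vary by machine): the same reduction done in a pure-Python loop versus dispatched once to numpy's compiled backend differs by a large constant factor, which is why the Python-level overhead only matters outside the hot loop.

        ```python
        # Same reduction, pure Python vs. one call into numpy's compiled code.
        import time

        import numpy as np

        a = np.random.default_rng(0).random(1_000_000)

        t0 = time.perf_counter()
        total = 0.0
        for x in a:               # the interpreter executes a million iterations
            total += x
        t_loop = time.perf_counter() - t0

        t0 = time.perf_counter()
        total_np = a.sum()        # one Python call; the loop runs in compiled code
        t_np = time.perf_counter() - t0

        print(f"python loop: {t_loop:.3f}s   numpy: {t_np:.4f}s")
        ```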

        • dnautics 1591 days ago
          Did you read what I wrote? I'm not making any claims about numerical performance. I'm saying there are better choices (in terms of being easy for the programmer to write and debug) for programming other aspects, like network, asynchronous coordination, etc.
      • Random_ernest 1591 days ago
        Which other programming language would you suggest?
  • ganzuul 1592 days ago
    This seems to be their latest work: https://arxiv.org/abs/1910.13830
  • gambler 1592 days ago
    >Instead of training on the entire 100 million outcomes—product purchases, in this example—Mach divides them into three "buckets," each containing 33.3 million randomly selected outcomes.

    So, uh, they're doing what random forests were doing for decades? What is the key difference?

    • overlords 1592 days ago
      Random forests split the features; this splits the outcomes.

      So each tree in RF only looks at a few features. In this, each model looks at all the features.

      RF can handle multiclass problems with tens to hundreds (maybe thousands) of classes. This MACH algo can handle multiclass problems with millions/billions of classes (extreme classification).
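
      A minimal sketch of that outcome-bucketing idea as described in the article (hypothetical sizes and models, not the authors' code; the data here is random, so only the mechanics are meaningful): hash the large label space into a few small bucket problems, train one ordinary classifier per hash, and score a label by combining its bucket scores across the repetitions.

      ```python
      # Toy version of "divide the outcomes into buckets": R independent
      # random mappings send each of the many labels to one of B buckets,
      # one small B-way classifier is trained per mapping, and a label's
      # score is the sum of its bucket probabilities over the R models.
      # Hypothetical sizes; random data, so accuracy is not the point.
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      n_labels, n_buckets, n_reps = 1000, 32, 3    # toy stand-ins for 100M labels

      X = rng.random((5000, 50))                   # 5000 examples, 50 features
      y = rng.integers(0, n_labels, size=5000)     # labels in [0, n_labels)

      hashes = [rng.integers(0, n_buckets, size=n_labels) for _ in range(n_reps)]
      models = [LogisticRegression(max_iter=200).fit(X, h[y]) for h in hashes]

      def predict(x):
          scores = np.zeros(n_labels)
          for h, m in zip(hashes, models):
              probs = np.zeros(n_buckets)
              probs[m.classes_] = m.predict_proba(x.reshape(1, -1))[0]
              scores += probs[h]                   # each label inherits its bucket's score
          return int(np.argmax(scores))

      print("predicted", predict(X[0]), "actual", y[0])
      ```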

  • m3kw9 1592 days ago
    Looks like any advancement can be called a breakthrough; even an Onion-paper "breakthrough" can be a breakthrough.
  • deadens 1592 days ago
    Umm... Here's an obvious idea: what if you don't store the entire model in memory and instead use a message-passing architecture to distribute the model, kinda like how HPC people have been doing it this entire time? Non-distributed models are a dead end anyway.
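
    A toy sketch of that HPC-style setup (hypothetical layer widths, forward pass only, not a real training loop): the layers live in separate processes and activations move between them by explicit message passing, so no single process ever holds all of the weights.

    ```python
    # Pipeline-style model partitioning over processes with message passing.
    # Illustrative only: hypothetical layer widths, forward pass only.
    import numpy as np
    from multiprocessing import Pipe, Process

    def stage(conn_in, conn_out, in_dim, out_dim, seed):
        """One pipeline stage: holds only its own weights, forwards activations."""
        w = np.random.default_rng(seed).random((in_dim, out_dim)) * 0.01
        while True:
            x = conn_in.recv()
            if x is None:                           # shutdown signal, pass it along
                conn_out.send(None)
                break
            conn_out.send(np.maximum(x @ w, 0.0))   # linear layer + ReLU

    if __name__ == "__main__":
        dims = [128, 256, 256, 10]                  # hypothetical layer widths
        pipes = [Pipe() for _ in range(len(dims))]
        procs = []
        for i in range(len(dims) - 1):
            p = Process(target=stage,
                        args=(pipes[i][1], pipes[i + 1][0], dims[i], dims[i + 1], i))
            p.start()
            procs.append(p)

        pipes[0][0].send(np.random.default_rng(42).random((4, dims[0])))  # batch of 4
        print("output shape:", pipes[-1][1].recv().shape)

        pipes[0][0].send(None)                      # shutdown propagates down the pipeline
        pipes[-1][1].recv()
        for p in procs:
            p.join()
    ```
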
    • derision 1592 days ago
      Latency between GPUs kills performance
      • sudosysgen 1592 days ago
        It depends on just how huge the model is. Some models take multiple seconds to run/backpropagate and might take hundreds of gigabytes of memory, in which case it could be useful.
        • strbean 1592 days ago
          Also seems like a problem that could be partially solved by tailoring the NN architecture. Does that make sense?
          • ganzuul 1592 days ago
            Do you mean like Stochastic Gradient Descent does?