How the heck did this end up on the homepage again?[a]
This paper makes broad grandiose claims, backed only by tiny toy experiments on tiny toy datasets with tiny toy neural networks.
By the same logic, I could claim that neural networks are equivalent to lookup tables, and then run a few experiments showing how look up tables are a better choice for tiny toy problems in which the answer for every possible combination of inputs is known in advance. That would be silly, wouldn't it?
I bet the author's polynomial regression library would do horrendously bad compared to SOTA deep models in cognitive tasks such as language translation, photo generation, unsupervised language modeling, playing Go, Dota2, or a similarly difficult game in a reinforcement learning setting, and so on.
There are dozens of papers "explaining" neural networks, using basic math, highly abstract math and even theoretical quantum physics models. I don't see any of them used in constructing, improving or interpreting results of actual neural networks. Moreover, while most of them have math that would take me weeks to fully understand, they rarely make concrete, testable predictions.
Most of the recent advances in ANN learning came from using more data and using different types of layers. A real theory of neural nets would explain why architecture A works better than architecture B for a particular problem. Or why a certain architecture plateaus after working with X samples, while other can learn from many more.
The way I understand it, the core problem of understanding deep learning is that it does not scale down. You can make a simple framework of 3-4 neurons and observe how it works. Then you make it bigger and it works differently. No one properly explained why this happens. People who do applications seem to just state their intuitions and then throw a couple formulas on top to make it seem scientific. There is a great essay on that by Piekniewski. No research I've seen lists all the parameters/architectures/datasets they tried that didn't work.
 - Maybe with the exception of the paper on neural networks as ODEs.
I mean what is the difference between like using 50 layers versus 75 layers? Ok, yeah we don't know, but to presume that we don't understand anything beyond 3-4 neurons is just nonsense.
By comparison. Go look at any synthetic chemistry paper. There's going to be crazy procedures. One time Phil Baran used Xenon Tetrafluoride as an oxidizing agent. And yeah, for some reason none of the other standard oxidizing agents worked. We don't know why. Hell it could have been that the grad student had bad hands with one of the standard ones. We can make a SWAG an to why XeF4 worked. But we'd be fooling ourselves to claim "we know", so we don't say we do. Does that discredit synthetic chemistry as an enterprise? Does that mean synthetic chemistry is not backed by theory?
That video is an introduction to the concept of CNNs for beginners. It explains in informal terms the structure and mechanics of a CNN, and probably gets around to discussing why it might work in very broad terms.
This is extremely far from what academically inclined people are talking about when they are using the word "theory".
People are hoping to see, ultimately, mathematical proofs that neural networks can e.g. efficiently learn target functions given a certain amount of training data and a certain number of trainable parameters.
To address your chemistry analogy: no, it doesn't discredit synthetic chemistry as an enterprise, it just means there is a gap in the theory. Similarly, the people bemoaning gaps in the deep learning theory aren't trying to discredit deep learning as an enterprise: the practical successes are very real; the theory is just lagging, as it often has in application-led fields.
> It explains in informal terms the structure and mechanics of a CNN
It actually gets quite formal. you could implement a (non-performant) CNN from everything in that video. The point is, that someone came up with a hypothesis "hey imposing this structure should yield faster convergence to usable results" is totally inductive science, and that is able to be explained quite well and simply from simple primitives that you don't need a math degree to understand really indicates some level of maturity in the field.
>People are hoping to see, ultimately, mathematical proofs that neural networks can e.g. efficiently learn target functions given a certain amount of training data and a certain number of trainable parameters.
yeah I don't dispute that, but to characterize our understanding of machine learning as not going beyond 3 or 4 neurons is just plain crazy.
Moreover, for some RNNs, e.g. character-based LSTMs trained on language models, we can extract feature details. I recall one case (sorry not bothered to find it) where there was a memory line dedicated to detetcting opening and closing quotation marks. Now this is a feature that one would expect to be obvious, or at least more so than other language features, but it's still a rather high level (and successful) attempt to understand how these machines function.
>Can you honestly say with a straight face that there is no theory behind why convnets are the right choice?
There is theory of convnets, but it's more of a post-factum analysis, really, and I'm not sure it's behind much of anything at this point. Researchers use convnets because they work in ways that work for them.
At risk of embarrassing the rest of the thread, please note that the paper was released with an R package  which performs the polynomial regression algorithm, and additionally an entire chapter of the paper is dedicated to running this package against two other R packages which implement neural networks, with competitive results.
I don't know if it's at all embarrassing (though there are some HNers that definitely did not at least skim the paper, which is a whole other thing).
The examples they gave in the paper were all tiny: MNIST was the biggest with 784 features (Crossfit has <40, Forest has <60, Taxi has ~11), so this is an absurd comparison. You can get better results than all of the models presented, in much less time, using random kitchen sinks (also sometimes known as 'random features') .
Plus, it's a little weird to me that their polynomial regression "runs out of memory"—they're solving a least squares problem for which we have a ton of good algorithms for. More specifically, there's no need to generate the data all at once (e.g., why they mention the "polynomial matrix would be huge") much less run QR factorization on this super dense matrix... rather, you would do this in a streaming fashion (e.g., generate the points, one by one, and proceed via SGD, or a nice pre-conditioned version of it since we have a good approximation of the Hessian via some simple streaming dimensionality reduction).
At the risk of sounding a bit harsh, this paper reads like someone recently found out about least squares and tried a few of the bases until they realized that simple things do pretty well in small cases when compared to fancy models like NNs (which is something almost every researcher that I work with knows—though I don't work directly in ML).
Indeed, there is the unfortunate fact that directly pointing out that a particular person has not read the paper is against HN's rules. Let us not violate the Prime Directive further.
If I understand the origin of this paper correctly, it is mainly the idea and work of an undergraduate student. From this perspective, several problems become more understandable. The limited computation done in support of the paper may have been done by a setup as meager as a lone student's laptop. The naïve view of machine learning as a field could be explained by undergraduate eagerness; indeed, perhaps they did "recently [find] out about least squares".
I agree with your thoughts on SGD. The authors did do PCA for some of their tests, but this entire setup should be buildable in a more efficient and incremental way.
I think, however, that the reason that this paper got the other three co-authors on it, despite its controversial nature, is that the core of the original idea is interesting and has yielded related fruit like UMAP  in the past. We should be more comfortable exploring it.
I like your book reference. It's informative and direct, without mumbo-jumbo. To accelerate the lookup for the next follower, start at p279.
When I was an AI grad student back in the late 80's and early 90's logic was all the rage. People published papers about things like non-monotonic logics, intentional vs extensional logics, higher-order logics, the frame problem, circumscription... One thing that always bothered me about the field was that the whole premise behind using logic as the basis for AI seemed to me to be obviously flawed because neither the inputs to nor the outputs from my brain were logical propositions, they were images and sounds on the input side, and sounds and gestures on the output side. Even my internal reality didn't seem to consist of logical propositions. When I went to open the refrigerator to make a sandwich, I didn't feel like I needed to prove any theorems in order to do it. And I think subsequent developments have vindicated my skepticism.
Well, I have a similar objection to neural networks. Human learning is fundamentally qualitatively different from contemporary neural network learning. It is simply not the case that I learn what a cat looks like by being shown a bunch of images of cats and having some meta-input that says "label:cat". Instead, I see a cat -- and a whole bunch of other things, including the environment that the cat happens to be in -- and I hear some sounds, and I somehow manage to make sense of the whole melange and sort out what a cat is and what the sound of my mother's voice is when she says, "That's a cat!, and the fact that the "cat" part of the audio stream is a label, and that it stands for this furry purry thing moving around in front of me, and that by producing some outputs that cause me to produce the same sounds from my own body I get the reward of hearing mommy say, "That's right!"
What we call neural networks today are interesting and useful, but I don't think they are even remotely close to anything interesting that the human brain does, and so quibbling over whether or not they are "essentially polynomial regression" or not is, IMHO, missing the point.
[EDIT: I had closed with "missing the point rather badly" but subsequent comments have convinced me that this extra emphasis was unwarranted.]
I disagree, fundamentally understanding what neural nets are from any angle might be useful to debug them better.
I do agree with that learning happens differently in the human brain. Your comment points out that humans sub-consciously cluster a lot of things, and do a lot of other things at the same time (e.g. you also mention reinforcement training) in order to make sense of the world.
If only children between 0 and 3 would be self-aware enough of what they're going through, would've made this question so much more easy to answer. I wonder whether their form of learning and our form of learning is the same, I don't think so (pruning and all that).
I would say that we were too eager to jump on "Learning" term. It implies that those things are somehow close to us, humans. In reality NN don't learn anything, it is just a optimization problem on the simplified model. And all the issue with them just proves that we are mostly brute-forcing it.
Basically, we have yet (from what I know - I am not a researcher, and this field moves fast) to find any form or method by which biological neural networks (aka brains) perform "backpropagation" - ie, neurons seem to be feedforward only systems.
Furthermore, we (people) don't need thousands or millions of labeled examples in order to learn something. Now, I know someone else replied to you about how you see millions of cat images based on some flawed reasoning that the eyes are seeing at 60 Hz - which is untrue, because it's a continuous analog system - plus there's the thing of neurons being spiking potential systems (vs what common ANNs model); but the idea is intriguing that we do see much more than one or two "images". But there are other situations where we learn things that don't require this at all, and there aren't such vast amounts of "labeled input" to learn a concept.
These two issues are problems from mainly a conceptual point; they obviously don't stop current ANNs from functioning or being useful. But they do call into question how our brains are actually doing things, and especially doing this with such low amounts of power (vs the huge amounts of power required for deep learning systems, for what would be basic tasks for humans).
Another wrench that may come into play: I recently learned (I think it was from an HN post) that biologists and others are coming around to the idea that genes don't work like we once thought they actually did, and that cells - internally and externally - have some kind of communications system and other "stuff" going on that makes the whole system much more complex than the simplification most people learn about.
I tend to wonder how such new information and realizations and "mechanics" will play into how we understand how neurons actually work and communicate biologically - whether any of that plays into "how brains learn". There's long been questions about chemical pathways and other things that aren't captured by common (and even uncommon) ANN models. Then there's also been questions as to what roles other cells in the brain that aren't neurons (glial cells) in the process, if any.
While neural networks are responsible for recent breakthroughs in problems like computer vision, machine translation and time series prediction, they can also combine with reinforcement learning algorithms.
Reinforcement learning refers to goal-oriented algorithms, which learn how to attain a complex objective (goal) or maximize along a particular dimension over many steps; for example, maximize the points won in a game over many moves. They can start from a blank slate, and under the right conditions they achieve superhuman performance. Like a child reinforced by its mother, these algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones – this is reinforcement. While that may sound trivial, it's a vast improvement over their previous accomplishments, and the state of the art is progressing rapidly.
Reinforcement learning solves the difficult problem of correlating immediate actions with the delayed returns they produce. Like humans, reinforcement learning algorithms sometimes have to wait a while to see the fruit of their decisions. They operate in a delayed return environment, where it can be difficult to understand which action leads to which outcome over many time steps.
Reinforcement learning algorithms can be expected to perform better and better in more ambiguous, real-life environments while choosing from an arbitrary number of possible actions, rather than from the limited options of a video game. That is, with time we expect them to be valuable to achieve goals in the real world.
> One thing that always bothered me about the field was that the whole premise behind using logic as the basis for AI seemed to me to be obviously flawed because neither the inputs to nor the outputs from my brain were logical propositions
Well, calculus is a bunch of logic propositions and it explains physics pretty well. So clearly some good things can be achieved by formal logic alone.
Decision trees are very much "logic" and they work pretty well.
Logic-based AI is pretty cool and would be great for fixing a lot of current problems in software engineering. Too bad it's mostly abandoned.
>Well, calculus is a bunch of logic propositions and it explains physics pretty well. So clearly some good things can be achieved by formal logic alone.
The idea to use logical rules to infer things about nature was arrived at through the non-logical "gestures and senses" methods that the parent was talking about. If all you had was logic you'd have no way to choose your axioms, or even your rules of inference. The axioms and rules of inference were, ultimately, selected from the vast sea of alternative options based on our observations, which rest on our senses.
> Well, calculus is a bunch of logic propositions and it explains physics pretty well.
I mean, it is true that you can construct a Turing machine using predicate logic, and thus all computer algorithms can be expressed in this way, but what matters is the relevant level of abstraction, and thinking about things using predicate logic is not always the best.
> It is simply not the case that I learn what a cat looks like by being shown a bunch of images of cats and having some meta-input that says "label:cat".
How do you know this? It seems to me this is likely how it works. If you spend an hour playing with a cat, and you look at the cat with your eyes at 60hz, you see about 5 million labeled augmented images of a cat.
Intuitively I feel like there is vastly more context involved in any biological neural network learning, but particularly in the human brain. You might be inclined to reduce the learning process to just a sequence of images, but the internal reference system brings in an enormous amount of other information that can get attached: sounds, smells, touch, taste, feelings, desires and so on. These all shape not just the way the new information is pre-processed but also how it is stored and in turn that newly stored memory influences future learning. The amount of feedback loops and indirect inputs is staggering.
Perhaps I'm naive, but the difference in apparent complexity and capability between artificial neural networks as we use and understand them today, and their biological counterparts seem vast.
I'm certainly not doing research into new forms of ML and AI algorithms, but I have used the existing concepts to some degree and just intuitively it feels like what we have today is still a very primitive approximation of what nature has achieved over the course of evolution. It feels like it's not merely a question of more data and more computational power, but we are missing fundamental building blocks to actually go beyond fancy curve fitting.
Note that I'm not making an assumption that fancy curve fitting cannot eventually give rise to human level AI, but certainly at the moment I see zero evidence that it can.
For sample efficiency, there are many tests where people are shown a projection of a novel 3d object for a few seconds and then are able to pick it out from a collection of other projections of various objects (from different poses). This is not simple memorization (since a person must solve the inverse-rendering problem and then compare the 3d representations).
that's not how it works. Presuming you don't already know chinese, I can do something like show you a new chinese character that you've never seen, once, and you will be able to pick it out of a group of ten random chinese characters. Almost every time.
Ok. Presuming your example is correct (it's not, neural impulses travel at roughly half the speed of sound) I could describe a Chinese character for you and you could identify it, so that's zero shot learning. Even so, if you've ever actually trained a neural networks you would know what happens when you feed the same data (n) times
The essence of this is that humans aren't blank slates. We have 500 million years of ancestors helping us tune our functioning in a world that contains large moving self willed discrete lifeforms that are pertinent to our survival.
Neural Networks seem to be capable of both too. Although I'm just guessing what do you mean precisely by deduction. Would you agree that solving Sudoku requires deduction?
"Finally, we show how recurrent relational networks can learn to solve Sudoku puzzles from supervised training data, a challenging task requiring upwards of 64 steps of relational reasoning. We achieve state-of-the-art results amongst comparable methods by solving 96.6% of the hardest Sudoku puzzles."
"We present NeuroSAT, a message passing neural network that learns to solve SAT problems after only being trained as a classifier to predict satisfiability. Although it is not competitive with state-of-the-art SAT solvers, NeuroSAT can solve problems that are substantially larger and more difficult than it ever saw during training by simply running for more iterations. Moreover, NeuroSAT generalizes to novel distributions; after training only on random SAT problems, at test time it can solve SAT problems encoding graph coloring, clique detection, dominating set, and vertex cover problems, all on a range of distributions over small random graphs."
If I'm reading this right, the core of the argument is that since any continuous function is can be approximated via a taylor series expansion, then activation functions can be seen, in-effect, as polynomial in nature, and since a neuron layer is a linear transformation followed by an activation function, the the whole system is polynomial.
That's "technically" correct, but it feels like an academic cop-out. Interesting/useful transfer functions tend to be functions that take very large expansions to be approximated with any accuracy.
That is a serious practical problem that we find even in this paper. The authors were unable to fit a degree 3 polynomial to a subset (just 26 components) of MNIST data (handwritten digits) due to a "memory issue".
But mathematical theory need not be practical. The relation between NNs and polynomial regression might be a fruitful theoretical observation even if the equivalent polynomial regression is incalculable.
I don't think it is even fruitful. We already know that mappings that don't contain poles can be approximated in various ways: Taylor expansion, piecewise linear, fourier transforms.. Taylor expansion is the polynomial fitting for authors, piecewise linear is NN with relu activation.
Not to mention that as a practical matter, the ability to train a neural network with backpropagation is important to get results that actually converge in a reasonable amount of time. It's not useful to say "but you could just use a polynomial regression" if you can't actually generate the equivalent polynomial regression in the same amount of time that you can train a neural network.
I'm not sure that's actually correct. In fact I'm sure it is incorrect for a polynomial of degree 1. Otherwise, there's nothing special about relu or tanh that you can't use sequential/backprop on polynomial regression in general.
> any continuous function is can be approximated via a taylor series expansion
We can get a good polynomial approximation of any continuous function but just on bounded set. Wouldn't such assumption (restriction of activation function domain) be problematic?
I think that's a very good point. Yes, you can approximate any given clasical NN with a polynomial, but how does the number of terms in the polynomial scale with the network size and the desired accuracy? There might be a very good paper there.
I haven't read the paper, but how can that be true when the polynomials produced by polynomial regression always converge at the "edges" (taking x far left or far right) to either positive or negative infinity (unless the polynomial is a constant), while DNNs can be built to have a sigmoid (or gaussian) function at the output, which doesn't have the same problem (it constrains the output to a range)?
Polynomials (without "activation functions" at the end) make for poor classifiers.
(This is a practical problem I've actually had when trying to replace DNNs with polynomial regression.)
It's not true. They make their main claim as follows:
"For general activation functions and implementations, we can at least say that the
function is at close to a polynomial, by appealing to the famous Stone-Weierstrass
Theorem [Katznelson and Rudin(1961)], which states that any continuous function on a compact set can be approximated uniformly by polynomials.5
event, for practical purposes here, we see that most activation functions can be
approximated by a polynomial. Then apply the same argument as above, which
Note that it's only true on a compact set, implying bounded set, implying it will not be valid out to infinity, only on some bounded, closed set.
Ok, but then that makes this a purely theoretical paper about something already pretty well known. Almost all interesting problems that DNNs are solving nowdays are some kind of classification problem (or with finite discrete outputs), which polynomials are really unsuited for (unless there are an infinite number of terms).
The wordpress page doesn't mention it but with "Neural Networks" they appear to mean "only" traditional deep neural networks, so no CNNs for example, which they categorize as "preprocessing".
Which is a legitimate view, I've seen it expressed in other places as well, but if you hear "neural networks" then you usually mean the entire thing and not just the fully connected and softmax readout layers at the end.
Conv layers are strictly special cases of FC layers (with respect to expressive power).
> For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
In general, if you can make assumptions about your problem and the form the solution takes, you can find a good solution with fewer parameters and fewer data, at the cost of suboptimal performance on problems that violate your assumptions. Conv layers are an example of this.
> The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
It's correct that the inferrence mode of a convolutional neural network can be reduced to the inference mode of a FC network. However, during training, you need to make sure that the weights are the same, and FC networks don't ensure that. So you need to train CNNs-embedded-in-FC slightly differently from normal FCs, otherwise what you're getting is not an CNN any more because the weights are different. What you need to do is to a) initialize all kernel entries with the same weights and b) instead of applying the weight changes directly, average the weight changes over all offsets and only then apply them to the specific kernel positions.
Absolutely–if you wanted to get an FC layer that satisfies Conv layer constraints, you would need to perform gradient descent subject to those constraints; normal gradient descent won't do that unless the actual optimum is of convolutional form. That's why I said (with respect to expressive power) :)
Classifying CNNs as just a form of preprocessing is not a legitimate view IMO. The breakthrough with CNNs is that they provide stronger gradient signals because of weight sharing. Essentially, CNNs made the first deep neural networks possible, they were trainable in areas where plain fully connected layers were not.
Hmm interesting, in the well-known book Neural Networks and Deep Learning  by Michael Nielsen, it is shown that neural networks can approximate any function .
This leaves the question: are neural networks able to produce more functions than simple polynomials? Apparently the answer of the summary of the mentioned paper is no (I didn't read the actual paper, only the summary).
I simply only know the basics of neural networks (I fully understand chapter 1 and 2 of ), so I wouldn't know understand why the two are practically equivalent. I mean, I can imagine why but I'd be making the fallacy of simply understanding that a neural net is able to approximate any polynomial and then think "yup, that's it, no clue how to generate counter examples, QED for all I care O:)"
So with this in mind is anyone able to argue why neural nets are essentially only polynomial regression?
They are not "equivalent". Polynomials can only produce polynomials, but these polynomials can be made to be as close as you want to any continuous function (see Stone-Weierstrass theorem). For example, a NN with a ReLU activation function produce a function that is not a polynomial.
Now I understand that this not the main point; both polynomials and NN can approximate any function that is not artificially nasty. The only thing is that a NN is easier to optimize and generalizes better for high dimensional problems such as image classification.
The issue is more complicated than simply expressiveness. The author claim that "one should use PR instead" seems dubious at best, but I'm now going to read the paper and figure it out.
Very interesting topic. Indeed, one way to interpret the neural networks is to design the deep architecture interpretable at the first place. Namely, one can design the structure and activation functions of each layer to mimic one step of gradient descent. Then, K layer of this tailored DNN resembles K iterations of an optimization algorithm, e.g., see [1-3]. But still, one need to prove the convergence of such networks (empirical results show very promising convergence speed).
Back before deep learning was all the rage textbooks used to argue that two layers were enough because a two layer network can approximate anything arbitrarily well.
They were right and wrong. Two layer networks CAN indeed approximate any function arbitrarily well. They just do a piss poor job of it. It can take exponentially more para meters than a better formulation.
The lesson is that representation matters a lot. There are lots of ways to construct universal approximators but they are not made equal.
I actually always describe neural networks as a robust taylor series. What we are trying to do by calculating the weights are matching the "plane" of your data space. In the case of classification, you may have multiple taylor series generated, where each one represents a class, such that the highest value output of all the functions concatenated is your classification.
Once you understand that a neural network "learns" the underlying data space, it is easy to understand why/how you can just change the output layer of a trained network in many cases to solve a different problem (on the same dataset).
PR is quite similar to a neural network with a single extremely wide layer. Unless you are hashing the polynomials, you will quickly run into an intractible problem, while properly setup NNs automatically learn which polynomials are needed for solving the problem, giving the budget of the network architecture (number of layers, number of neurons, etc.).
Nobody in industry will abandon NNs over PRs if they are looking to making it easier to handle. I doubt on most industrial problems, that PR even comes close to NNs in performance.
I think what you literally need is a network that is "narrow" but also has many layers. This is because what PR features over simple linear regression is interaction terms, and multiple layers are an alternate way to get those. This also hints at one problem with true PR - in a multi-variate context, the amount of coefficients you'll need to fit is going to scale very badly given the number of variables and the degree of your polynomial. NN layers have natural dimenionality-reduction properties that make them more flexible.
The old seems to be new again. Advancements in hardware are dusting off the ANN genre, making it shiny and bright, and sparking reinventions an order of magnitude greater than the previous revolution circa the 80's.