Alice's adventures in a differentiable wonderland

(sscardapane.it)

235 points | by tosh 20 days ago

7 comments

  • xanderlewis 20 days ago
    > Stripped of anything else, neural networks are compositions of differentiable primitives

    I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

    I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mysticising models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.

    • jonas21 20 days ago
      I feel like this statement is both obvious after spending a few minutes working with neural networks and completely useless in helping you build better neural networks.

      It's kind of like saying, "Stripped of anything else, works of literature are compositions of words"

      • Horffupolde 20 days ago
        Well, I’d argue that could also be a bit enlightening. It’s like taking a moment to appreciate that forests are composed of single trees. It takes a certain level of insight to appreciate systems at various depths.
      • falcor84 20 days ago
        Let me try to one-up that: "Stripped of anything else, works of literature are compositions of the uppercase letter I"
    • captainclam 20 days ago
      Ugh, exactly, it's so cool. I've been a deep learning practitioner for ~3 years now, and I feel like this notion has really been impressed upon me only recently.

      I've spent an awful lot of mental energy trying to conceive of how these things work, when really it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9.

      And the cool part is that this yields such rich, interesting, sometimes even useful, structures!

      I like to think of this cognitive primitive as the analogue to the idea that thermodynamics is just the sum of particles bumping into each other. At the end of the day, that really is just it, but the collective behavior is something else entirely.

      • xanderlewis 20 days ago
        > At the end of the day, that really is just it, but the collective behavior is something else entirely.

        Exactly. It’s not to say that neat descriptions like this are the end of the story (or even the beginning of it). If they were, there would be no need for this entire field of study.

        But they are cool, and can give you a really clear conceptualisation of something that can appear more like a sum of disjoint observations and ad hoc tricks than a discipline based on a few deep principles.

      • JackFr 20 days ago
        NAND gates by themselves are kind of dull, but it's pretty cool what you can do with a billion of them.
      • kadushka 20 days ago
        it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9

        This is not how gradient based NN optimization works. What you described is called "random weight perturbation", a variant of evolutionary algorithms. It does not scale to networks larger than a few thousand parameters for obvious reasons.

        NNs are optimized by directly computing a gradient, which tells us which direction to move in to reduce the loss on the current batch of training data. There's no trying up or down and seeing if it worked - we always know which direction to go.

        SGD and RWP are two completely different approaches to learning optimal NN weights.
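
        A rough sketch of the difference on a toy least-squares problem (plain NumPy, all names and sizes illustrative): the SGD step uses the analytically computed gradient, while the RWP step tries a random nudge and keeps it only if the loss improved.

          import numpy as np

          rng = np.random.default_rng(0)
          X, y = rng.normal(size=(64, 10)), rng.normal(size=64)   # toy data
          loss = lambda w: np.mean((X @ w - y) ** 2)

          def sgd_step(w, lr=0.01):
              grad = 2 * X.T @ (X @ w - y) / len(y)    # exact descent direction: no guessing
              return w - lr * grad

          def rwp_step(w, sigma=0.01):
              trial = w + sigma * rng.normal(size=w.shape)     # random perturbation
              return trial if loss(trial) < loss(w) else w     # keep only if it helped

          w = np.zeros(10)
          for _ in range(200):
              w = sgd_step(w)   # or rwp_step(w), which needs far more iterations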

        • xanderlewis 20 days ago
          I don’t think the author literally meant tweaking the parameters and seeing what happens; it’s probably an analogy meant to give a sense of how the gradient indicates what direction and to what degree the parameters should be tweaked. Basically, substitute ‘the gradient is positive’ for ‘increasing this parameter decreases performance’ and vice versa and it becomes correct.
          • p1esk 20 days ago
            That substitution is the main difference between SGD and RWP.

            It’s like describing bubble sort when you meant to describe quick sort. Would not fly on an ML 101 exam, or in an ML job interview.

            • xanderlewis 20 days ago
              It’s not like that at all. You couldn’t accidentally sound like you’re describing quick sort when describing bubble sort, or vice versa. I can’t think of any substitution of a few words that would do that.

              The meaning of the gradient is perfectly adequately described by the author. They weren’t describing an algorithm for computing it.

            • a_random_canuck 20 days ago
              I don’t think anyone is trying to pass an exam here, but just give an understandable overview to a general audience.
        • captainclam 20 days ago
          I guess you could say I don't know RWP from Adam! :D

          My og comment wasn't to accurately explain gradient optimization, I was just expressing a sentiment not especially aimed at experts and not especially requiring details.

          Though I'm afraid I subjected you to the same "cringe" I experience when I read pop sci/tech articles describe deep learning optimization as "the algorithm" being "rewarded" or "punished," haha.

          • kadushka 20 days ago
            No worries, we're all friends here!

            it's just you happened to accidentally describe the idea behind RWP, which is a gradient-free optimization method, so I thought I should point it out.

    • naasking 20 days ago
      > one shouldn’t fall into the trap of anthropomorphising and mysticising models based on the ‘neural’ name

      One also shouldn't fall into the dual trap of assuming that just because one understands how a model works, it cannot have any bearing on the ever-mysterious operation of the brain.

    • SkyBelow 20 days ago
      Before the recent AI boom, I was mystified by the possibility of AI and emulating humans (in no small part thanks to works of fiction showing AI-powered androids). Then I created and trained some neural networks. Small ones, doing nothing particularly special. That was enough to break the mysticism. To realize it was just multiplying matrices. Training them was a bit more advanced, but still applied mathematics.

      Only recently have I begun to appreciate that the simplicity of the operation, applied to large enough matrices, may still capture enough of the nature of intelligence and sentience. In the end we can be broken down into (relatively) simple chemical reactions, and it is the massive scale of these reactions that creates real intelligence and sentience.

      • naasking 20 days ago
        Exactly, the people who are derisive of those who consider ML models to exhibit glimmers of true intelligence because it's only matrix multiplications always amuse me. It's like they don't even realize the contradiction in holding the position that seemingly complex and intelligent outward behaviour should not be used as an indication of actual complexity and intelligence.
        • citizen_friend 20 days ago
          If you study this historically you will see that every generation thinks they have the mechanism to explain brains (gear systems, analog control/cybernetics, perceptrons).

          My conclusion is we tend to overestimate our understanding and the power of our inventions.

          • naasking 20 days ago
            The difference is that we now actually have a proof of computational power and computational universality.
            • citizen_friend 20 days ago
              Analog circuits have the same computational power. Piecewise linear functions have the same computational universality.
              • naasking 20 days ago
                Except we didn't know any of that, nor did we know how to construct physical analogs in order to achieve universal computation. At best, we had limited task-specific computation, like clocks and planetary motion.
                • citizen_friend 19 days ago
                  We have known about universal function approximators like polynomials and trig functions since the 1700s. Turing and Gödel were around 1910 and 1920. The cybernetics movement was big in the '30s and '40s, and perceptrons in the '50s and '60s.
                  • naasking 19 days ago
                    Taylor expansions for all functions do not exist. Furthermore, our characterization of infinity was still poor, so we didn't even have a solid notion of what it would mean for a formalism to be able to compute all computable functions. The notion of a universal computer arguably didn't exist until Babbage.

                    I stand by my position that having a mathematical proof of computational universality is a significant difference that separates today from all prior eras that sought to understand the brain through contemporaneous technology.

                    • citizen_friend 19 days ago
                      > Taylor expansions

                      That’s not what I’m talking about. This is a basic analysis topic:

                      https://en.m.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_th...

                      At least the mid-1800s for a proof. The 1700s also explored Fourier series.

                      > stand by my position

                      And you’re still ignoring the cybernetics and perceptron movements I keep referring to, which were more than 100 years ago, and informed by Turing.

                      • naasking 19 days ago
                        > That’s not what I’m talking about. This is a basic analysis topic:

                        It's the same basic flaw: requiring continuous functions. Not all functions are continuous, therefore this is not sufficient.

                        > And you’re still ignoring the cybernetics and perceptron movements I keep referring to, which were more than 100 years ago, and informed by Turing.

                        What about them? As long as they're universal, they can all simulate brains. Anything after Church and Turing is just window dressing. Notice how none of these new ideas claimed to change what could in principle be computed, only how much easier or more natural this paradigm might be for simulating or creating brains.

                        • citizen_friend 19 days ago
                          This implies it works piecewise. That’s also true of neural nets lol. You have to keep adding more neurons to get the granularity of whatever your discontinuities are.

                          It’s also a different argument than the Taylor series one, which relies on differentiability.

                          You do not understand this subject. Please read before repeating this: https://en.m.wikipedia.org/wiki/Universal_approximation_theo...

                          > what about them

                          Then you seem to have lost the subject of the thread.

      • mistermann 20 days ago
        Next step, in case you get bored: why (in fact, not a minor distinction) does such a simple approach work so well?
    • sideshowb 20 days ago
      > deep learning is simply the application of sequences of operations that are nonlinear but nonetheless differentiable

      Though other things fit this description which are not deep learning. Like (shameless plug) my recent paper here https://ieeexplore.ieee.org/document/10497907

    • gessha 20 days ago
      It is soothing to the mind because it conveys that it’s understandable but it doesn’t take away from the complexity. You still have to read through math and pytorch code and debug nonsensical CUDA errors, comb through the data, etc etc
      • whimsicalism 20 days ago
        the complexity is in the values learned from the optimization. even the pytorch code for a simple transformer is not that complex, attention is a simple mechanism, etc.
        • gessha 20 days ago
          Complexity also comes from the number of papers that work out how different elements of a network work and how to intuitively change them.

          Why do we use conv operators, why do we use attention operators, when do we use one over the other? What augmentations do you use, how big of a dataset do you need, how do you collect the dataset, etc etc etc

          • whimsicalism 20 days ago
            idk, just using attention and massive web crawls gets you pretty far. a lot of the rest is more product-style decisions about what personality you want your LM to take.

            I fundamentally don't think this technology is that complex.

            • gessha 20 days ago
              No? In his recent tutorial, Karpathy showed just how much complexity there is in the tokenizer.

              This technology has been years in the making, with many small advances pushing the performance ever so slightly. There have been theoretical and engineering advances that contributed to where we are today. And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

              Also, the post is generally about neural networks and not just LMs.

              When making design decisions about an ML system you shouldn’t just choose the attention hammer and hammer away. There’s a lot of design constraints you need to consider which is why I made the original reply.

              • whimsicalism 20 days ago
                Are there micro-optimizations that eke out small advancements? Yes, absolutely - the modern tokenizer is a good example of that.

                Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenized by words and replaced unknown words with <unk>. This is extremely simple to implement and I've trained transformers like this. It (of course) makes a perplexity difference but the core of the technology is not changed and is quite simple. Most of the complexity is in the hardware, not the software innovations.
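
                Something like this, roughly (a toy sketch with a made-up vocabulary, not any production setup):

                  # Naive word-level tokenizer: split on whitespace, map unknown words to <unk>.
                  vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

                  def tokenize(text):
                      return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

                  print(tokenize("The cat sat on the mat"))   # [1, 2, 3, 0, 1, 0]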

                > And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

                I think the current technology is useable.

                > you shouldn’t just choose the attention hammer and hammer away

                It's a good first choice of hammer, tbph.

    • jimbokun 20 days ago
      I always get the impression that even the people who championed these algorithms back when they didn't seem so promising are shocked at the capabilities demonstrated by models built with such a relatively simple procedure.
    • zackmorris 20 days ago
      Ya so far this is the best introduction to neural networks from first principles that I've seen.

      Quickly skimming the draft pdf at https://arxiv.org/pdf/2404.17625 I can grok it instantly, because it's written in familiar academic language instead of gobbledygook. Anyone with an undergrad math education in engineering, computer science, etc or a self-taught equivalent understanding of differential equations should be able to read it easily. It does a really good job of connecting esoteric terms like tensors with arrays, gradients with partial derivatives, Jacobians with gradients and backpropagation with gradient descent in forward/reverse mode automatic differentiation. Which helps the reader to grasp the fundamentals instead of being distracted by the implementation details of TensorFlow, CUDA, etc. Some notable excerpts:

      Introduction (page 4):

        By viewing neural networks as simply compositions of differentiable primitives we can ask two basic questions (Figure F.1.3): first, what data types can we handle as inputs or outputs? And second, what sort of primitives can we use? Differentiability is a strong requirement that does not allow us to work directly with many standard data types, such as characters or integers, which are fundamentally discrete and hence discontinuous. By contrast, we will see that differentiable models can work easily with more complex data represented as large arrays (what we will call tensors) of numbers, such as images, which can be manipulated algebraically by basic compositions of linear and nonlinear transformations.
      
      Chapter 2.2 Gradients and Jacobians (page 23):

        [just read this section - it connects partial derivatives, gradients, Jacobians and Taylor’s theorem - wow!]
      
      Chapter 4.1.5 Some computational considerations (page 59):

        In general, we will always prefer algorithms that scale linearly both in the feature dimension c and in the batch size n, since super-linear algorithms will become quickly impractical (e.g., a batch of 32 RGB images of size 1024×1024 has c ≈ 1e7). We can avoid a quadratic complexity in the equation of the gradient by computing the multiplications in the correct order, i.e., computing the matrix-vector product Xw first. Hence, pure gradient descent is linear in both c and n, but only if proper care is taken in the implementation: generalizing this idea is the fundamental insight for the development of reverse-mode automatic differentiation, a.k.a. back-propagation (Section 6.3).
      
      Chapter 6 Automatic differentiation (page 87):

        We consider the problem of efficiently computing gradients of generic computational graphs, such as those induced by optimizing a scalar loss function on a fully-connected neural network, a task called automatic differentiation (AD) [BPRS18]. You can think of a computational graph as the set of atomic operations (which we call primitives) obtained by running the program itself. We will consider sequential graphs for brevity, but everything can be easily extended to more sophisticated, acyclic computational graphs.
        
        The problem may seem trivial, since the chain rule of Jacobians (Section 2.2, (E.2.22)) tells us that the gradient of function composition is simply the matrix product of the corresponding Jacobian matrices. However, efficiently implementing this is the key challenge, and the resulting algorithm (reverse-mode AD or backpropagation) is a cornerstone of neural networks and differentiable programming in general [GW08, BR24]. Understanding it is also key to understanding the design (and the differences) of most frameworks for implementing and training such programs (such as TensorFlow or PyTorch or JAX). A brief history of the algorithm can be found in [Gri12].
      
      Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter 2.2 Gradients and Jacobians (page 23) for better context.
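
      The computational point in the Chapter 4.1.5 excerpt is easy to check numerically; a small sketch (NumPy, toy sizes of my own choosing) of why the multiplication order matters for the least-squares gradient:

        import numpy as np

        n, c = 10_000, 512                        # batch size and feature dimension (toy values)
        X, w, y = np.random.randn(n, c), np.random.randn(c), np.random.randn(n)

        # Linear in n and c: only matrix-vector products.
        grad_fast = 2 * X.T @ (X @ w - y) / n     # compute Xw first, then X^T(residual)

        # Same result, but forming X^T X first costs O(n * c^2).
        grad_slow = 2 * ((X.T @ X) @ w - X.T @ y) / n

        assert np.allclose(grad_fast, grad_slow)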
      • barrenko 20 days ago
        Thank you for the summary, and thanks to the OP/author of the book.

        I started self-studying programming some time ago, then pivoted to AI/ML and (understandably) ended up mostly studying math; these resources are a boon to my folk.

    • andoando 20 days ago
      What does "differentiable primitives" mean here?
      • xanderlewis 20 days ago
        I think it’s referring to ‘primitive functions’ in the sense that they’re the building blocks of more complicated functions. If f and g are differentiable, f+g, fg, f/g (as long as g is never zero)… and so on are differentiable too. Importantly, f composed with g is also differentiable, and so since the output of the whole network as a function of its input is a composition of these ‘primitives’ it’s differentiable too.

        The actual primitive functions in this case would be things like the weighted sums of activations in the previous layer to get the activation of a given layer, and the actual ‘activation functions’ (traditionally something like a sigmoid function; these days a ReLU) associated with each layer.

        ‘Primitives’ is also sometimes used as a synonym for antiderivatives, but I don’t think that’s what it means here.

        Edit: it just occurred to me from a comment below that you might have meant to ask what the ‘differentiable’ part means. See https://en.wikipedia.org/wiki/Differentiable_function.
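
        A tiny PyTorch sketch of that picture (shapes and names are just illustrative): each step is one of those primitives, and because each one is differentiable, autograd differentiates the whole composition in one call.

          import torch

          x = torch.randn(4)
          W1, b1 = torch.randn(3, 4, requires_grad=True), torch.zeros(3, requires_grad=True)
          W2, b2 = torch.randn(1, 3, requires_grad=True), torch.zeros(1, requires_grad=True)

          h = torch.relu(W1 @ x + b1)     # primitive 1: weighted sum, then activation
          out = (W2 @ h + b2).sum()       # primitive 2: another weighted sum

          out.backward()                  # gradients for every parameter of the composition
          print(W1.grad.shape, W2.grad.shape)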

        • andoando 20 days ago
          Is this function composition essentially lambda calculus then?
          • xanderlewis 20 days ago
            Composition here just means what it does for any two functions: the value of the ‘composition’ of f and g at x is defined to be f applied to g applied to x. In symbols, it’s: (f∘g)(x) := f(g(x)) for each x in the domain of g. It may seem obvious, but the fact that this new thing is also a function (that is, its value is well-defined for every input) is actually a very useful thing indeed and leads to… well, most of mathematics.

            You can certainly do function composition in lambda calculus: in fact, the act of composition itself is a higher order function (takes functions and returns a function) and you can certainly express it formally with lambda terms and such. It’s not really got anything to do with any particular language or model of computation though.
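
            In code the higher-order-function point is tiny (a throwaway Python sketch, nothing to do with any particular formalism):

              def compose(f, g):
                  return lambda x: f(g(x))     # composition takes two functions, returns one

              double = lambda x: 2 * x
              square = lambda x: x * x

              h = compose(square, double)      # h(x) = (2x)^2
              print(h(3))                      # 36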

            • andoando 20 days ago
              I didn't form my question too well. I understand all that. What I am asking is: are these function compositions equivalent/similar to functions in lambda calculus?

              I guess my question, is what are the primitive functions here doing?

              • xanderlewis 20 days ago
                Well, yes, to the extent that functions are functions are functions (they’re just associations or mappings or whatever you want to call them).

                Maybe your question boils down to asking something more general like: what’s the difference between functions to a computer scientist (or a programmer) and functions to a mathematician? That is, are ‘functions’ in C (or lambda calculus), say, the same ‘functions’ we talk about in calculus?

                The answer to that is: in this case, because these are quite simple functions (sums and products and compositions thereof) they’re the same. In general, they’re a bit different. The difference is basically the difference between functional programming and ‘traditional’ programming. If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.

                Since you’re asking about lambda calculus in particular, the answer is that they’re the same because lambda calculus doesn’t have state. It’s ‘purely functional’ in that sense.

                >I guess my question, is what are the primitive functions here doing?

                I’m not really sure what you mean. They’re doing what functions always do. Every computer program is abstractly a (partial) function.

                Does that help, or have I misunderstood?

                • andoando 20 days ago
                  So when I think of functions in lambda calculus, I think of the I,S,K functions which when composed can produce functions like "copy", "add", "remove", "if", etc which then can do different computations like "copy every other symbol if the symbol is true", "multiply 5 times then add 2". Since lambda calculus is complete, any computation/program can be composed.

                  When I think of functions in a traditional mathematical sense, I think about transformations of numbers: x->2x, x->2x^2, etc. I completely understand composition of functions here, e.g. x->2(x->2x)^2, but it's unclear how these transformations relate to computation. For a regression problem, I can totally understand how finding the right compositions of functions can lead to better approximations. So I am wondering, in an LLM architecture, what computations do these functions actually represent? I assume it has something to do with what path to take through the neural layers. I probably just need to take the time to study it deeper.

                  >If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.

                  Totally understood from the perspective of functions in, say, Java. Though fundamentally I don't think there is a distinction between functions in computer science and mathematics. The program as a whole is effectively a function. The "global" state is, from another reference point, just local variables of the encompassing function. If a function is modifying variables outside of the "function block" (in, say, Java), the "input" to the function isn't just the parameters of the function. Imo, this is more of an artifact of the implementation of some languages rather than a fundamental difference. Python, for example, requires declaring global args in the function block. Go one step further and require putting global args into the parameters list and you're pretty close to satisfying this.

                  • xanderlewis 20 days ago
                    I think you’re actually massively overthinking it.

                    The state of a neural network is described entirely by its parameters, which usually consist of a long array (well, a matrix, or a tensor, or whatever…) of floating point numbers. What is being optimised when a network is trained is these parameters and nothing else. When you evaluate a neural network on some input (often called performing ‘inference’), that is when the functions we’re talking about are used. You start with the input vector, and you apply all of those functions in order and you get the output vector of the network.

                    The training process also uses these functions, because to train a network you have to perform evaluation repeatedly in between tweaking those parameters to make it better approximate the desired output for each input. Importantly, the functions do not change. They are constant; it’s the parameters that change. The functions are the architecture — not the thing being learned.

                    Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated — it’s not doing anything like that.
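
                    A minimal sketch of that separation (PyTorch, sizes picked arbitrarily): the forward function below never changes during training; the loop only nudges the numbers in params.

                      import torch

                      def forward(params, x):                       # the fixed architecture
                          W1, b1, W2, b2 = params
                          return W2 @ torch.relu(W1 @ x + b1) + b2

                      params = [torch.randn(8, 4, requires_grad=True),
                                torch.zeros(8, requires_grad=True),
                                torch.randn(1, 8, requires_grad=True),
                                torch.zeros(1, requires_grad=True)]  # the learned state

                      x, target = torch.randn(4), torch.tensor([1.0])
                      for _ in range(100):
                          loss = (forward(params, x) - target).pow(2).sum()
                          loss.backward()
                          with torch.no_grad():
                              for p in params:
                                  p -= 0.01 * p.grad                 # tweak parameters, not functions
                                  p.grad.zero_()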

                    > Though fundamentally I don't think there is distinction between functions in computer science and mathematics. The program as a whole is effectively a function.

                    You’re pretty much right about that, but there are two important problems/nitpicks:

                    (1) We can’t prove (in general) that a given program will halt and evaluate to something (rather than just looping forever) on a given input, so the ‘entire program’ is instead what’s called a partial function. This means that it’s still a function on its domain — but we can’t know what its precise domain is. Given an input, it may or may not produce an output. If it does, though, it’s well defined because it’s a deterministic process.

                    (2) You’re right to qualify that it’s the whole program that is (possibly) a function. If you take a function from some program that depends on some state in that same program, then clearly that function won’t be a proper ‘mathematical’ function. Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function. You have to remember that in mathematics, unlike in programming, a function consists essentially of three pieces of data: a domain, a codomain, and a ‘rule’. If you want to be set-theoretic and formal about it, this rule is just a subset of the cartesian product of its domain and codomain (it’s a set of pairs of the form (x, f(x))). If you change either of these sets, it’s technically a different function and there are good reasons for distinguishing between these. So it’s not right to say that mathematical functions and functions in a computer program are exactly the same.

                    • andoando 20 days ago
                      I appreciate your responses; sorry, I hope I don't seem like I'm arguing for the sake of arguing.

                      >Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated — it’s not doing anything like that.

                      I'm a little confused. The discussion thus far has been about how neural networks are essentially just compositions of functions, but you are now saying that the functions are static and only the parameters change.

                      But that aside, if these parameters change which neurons are activated, and this activation affects which neurons are activated in the next layer, are these parameters effectively not changing the path taken through the layers?

                      >Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function.

                      So say we have this program: "let c = 2; function sum3(a, b) { return a + b + c; } let d = sum3(3, 4);"

                      I believe you are saying, if we had constructed this instead as

                      "function(a,b,c) { return a+b+c } let d = 3sum(3,4,2) "

                      then, this is a different function.

                      Certainly, these are different in a sense, but at a fundamental level, when you compile this all down and run it, there is an equivalency in the transformation that is happening. That is, the two functions equivalently take some input state A (composed of a, b, c) and return the same output state B, while applying the same intermediary steps (add a to b, then add c to the result). Really, in the first case where c is defined outside the scope of the function block, the interpreter is effectively producing the function sum3(x, y, c), as it has to, at some point, one way or another, inject c into a + b + c.

                      Similarly, I won't argue that the current formal definitions of functions in mathematics are exactly those of functions as they're generally defined in programming.

                      Rather, what I'm saying is that there is an equivalent way to think about and study functions that applies equally to both fields. That is, a function is simply a transformation from A to B, where A and B can be anything, whether that is bits, numbers, or any other construction in any system. The only primitive distinction to make here is whether A and B are the same thing or different.

          • OJFord 20 days ago
            Function composition is just f(g(x)), considered as a single function that's the composition of f and g; it has the domain of g and the range of f.

            In lambda calculus terminology it's an 'application' (with a function argument).

      • esafak 20 days ago
        functions, which you can compose to increase their expressiveness, and run gradient descent on to train.

        The success of deep learning is basically attributable to composable (expressive), differentiable (learnable) functions. The "deep" moniker alludes to the compositionality.

      • CobrastanJorji 20 days ago
        Continuous mathematical functions which have derivatives.
      • dirkc 20 days ago
        When I did "AI" it would have meant the sigmoid function; these days it's something like ReLU.
    • jxy 20 days ago
      > > Stripped of anything else, neural networks are compositions of differentiable primitives

      > I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

      And I hate inaccurate statements like this. It pretends to be rigorously mathematical, but really just propagates erroneous information, and makes the whole article so much more amateur in only a single sentence.

      The simple relu is continuous but not differentiable at 0, and its derivative is discontinuous at 0.
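
      In practice, autograd frameworks just assign a value to the (sub)gradient at the kink; a quick check of my own in PyTorch shows it picks 0 there:

        import torch

        x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
        torch.relu(x).sum().backward()
        print(x.grad)   # tensor([0., 0., 1.]) -- the kink at 0 is given gradient 0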

      • xanderlewis 20 days ago
        It’s not ‘inaccurate’. The mark of true mastery is an ability to make terse statements that convey a huge amount without involving excessive formality or discussion of by-the-by technical details. If ever you’ve spoken to world-renowned experts in pure mathematics or other highly technical and pedantic fields, you’ll find they’ll say all sorts of ‘inaccurate’ things in conversation (or even in written documents). It doesn’t make them worthless; far from it.

        If you want to have a war of petty pedantry, let’s go: the derivative of ReLU can’t be discontinuous at zero, as you say, because continuity (or indeed discontinuity) of a function at x requires the function to have a value at x (which is the negation of what your first statement correctly claims).

        • newrotik 20 days ago
          Lack of differentiability is actually a very important feature of the underlying optimization problem.

          You might think that it doesn't matter because ReLU is, e.g., non-differentiable "only at one point".

          Gradient based methods (what you find in pytorch) generally rely on the idea that gradients should taper to 0 in the proximity of a local optimum. This is not the case for non-differentiable functions, and in fact gradients can be made to be arbitrarily large even very close to the optimum.

          As you may imagine, it is not hard to construct examples where simple gradient methods that do not properly take these facts into account fail to converge. These examples are not exotic.
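
          A concrete, non-exotic instance of that failure (a few lines of my own): fixed-step gradient descent on f(x) = |x| never settles at the minimum, because the gradient stays at magnitude 1 no matter how close you get.

            def grad(x):
                return 1.0 if x > 0 else -1.0       # (sub)gradient of |x| away from 0

            x, lr = 0.3, 0.25                       # fixed step size
            for _ in range(10):
                x -= lr * grad(x)
                print(round(x, 3))                  # hops between 0.05 and -0.2 forever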

        • makerdiety 20 days ago
          Invoking excessive formality and discussions of minute technical details leads to a cathedral of knowledge built on autistic pedantry. The chosen rabbit hole to get lost in needs to be the correct one. And human science is riddled with the paths that have naive or childish fundamentals.
          • kragen 20 days ago
            > a cathedral of knowledge built on autistic pedantry

            this is certainly true, but more often we use its short name, 'math'. it turns out to be far more effective than so-called common sense

          • mistermann 20 days ago
            This comment makes me want to both upvote and downvote with extreme enthusiasm/fury!

            The sign of a truly good conversation?

        • kragen 20 days ago
          my experience with world-renowned experts in pure mathematics is that they are much more careful than the average bear to explicitly qualify inaccurate things as inaccurate, because their discipline requires them to be very clear about precisely what they are saying

          discontinuity of a function at x does not, according to the usual definition of 'continuity', require the function to have a value at x; indeed, functions that fail to have a value at x are necessarily discontinuous there, precisely because (as you say) they are not continuous there. https://en.wikipedia.org/wiki/Continuous_function#Definition...

          there are other definitions of 'discontinuous' in use, but i can't think of one that would give the result you claim

          • xanderlewis 20 days ago
            > they are much more careful than the average bear to explicitly qualify inaccurate things as inaccurate

            Sure. But what part of this very short statement, worded entirely in natural language, made you think it was a technical, formal statement? I think you’re just taking an opportunity to flex your knowledge of basic calculus, and deliberately attributing intent to the author that isn’t there in order to look clever.

            Regarding a function being discontinuous at a point outside its domain: if you take a completely naive view of what ‘discontinuous’ means, then I suppose you can say so. But discontinuity is just the logical negation of continuity. Observe:

            To say that f: X —> Y (in this context, a real-valued function of real numbers) is continuous means precisely

            ∀p∈X ∀ε>0 ∃δ>0 ∀x∈X: |x - p| < δ ⇒ |f(x) - f(p)| < ε

            and so its negation looks like

            ∃p∈X ¬( … )

            that is, there is a point p in X, the domain of f, where continuity fails.

            For example, you wouldn’t talk about a function defined on the integers being discontinuous at pi, would you? That would just be weird.

            To prove the point further, observe that the set of discontinuities (according to your definition) of any given function would actually include every number… in fact every mathematical object in the universe — which would make it not even a set in ZFC. So it’s absurd.

            Even more reasons to believe functions can only be discontinuous at points of their domain: a function is said to be discontinuous if it has at least one discontinuity. By your definition, every function is discontinuous.

            …anyway, I said we were going to be petty. I’m trying to demonstrate this is a waste of time by wasting my own time.

            • kragen 20 days ago
              you have an interesting point of view, and some of the things you have said are correct, but if you try to use gradient descent on a function from, say, ℤ → ℝ, you are going to be a very sad xanda. i would indeed describe such a function as being discontinuous not just at π but everywhere, at least with the usual definition of continuity (though there is a sense in which such a function could be, for example, scott-continuous)

              even in the case of a single discontinuity in the derivative, like in relu', you lose the intermediate value theorem and everything that follows from it; it's not an inconsequential or marginally relevant fact

              • jj3 20 days ago
                Note that any function ℤ → ℝ is continuous on its domain but nowhere differentiable.

                A Scott-continuous function ℤ → ℝ must be monotone. So not every such function is Scott-continuous.

                • kragen 19 days ago
                  aha, thanks!
            • laingc 20 days ago
              Because memes aren't allowed on HN, you're not allowed to reply with the "akssshuallllyyy" meme, so you had to go to these lengths.

              ¯\_(ツ)_/¯

              • xanderlewis 20 days ago
                You’re actually not far off. I’m somewhat embarrassed by the above, but I think it makes the point.
      • whimsicalism 20 days ago
        it's pretty close to accurate, the lack of differentiability at 0 for relu doesn't really come into play in practice
      • kmmlng 19 days ago
        Eh, it really doesn't matter much in practice. Additionally, there are many other activation functions without this issue.
    • phkahler 20 days ago
      >> one shouldn’t fall into the trap of anthropomorphising and mysticising models based on the ‘neural’ name

      And yet, artificial neural networks ARE an approximation of how biological neurons work. It is worth noting that they came out of neurobiology and not some math department - well at least in the forward direction, I'm not sure who came up with the training algorithms (probably the math folks). Should they be considered mystical? No. I would also posit that biological neurons are more efficient and probably have better learning algorithms than artificial ones today.

      I'm confused as to why some people seem to shun the biological equivalence of these things. In a recent thread here I learned that physical synaptic weights (in our brains) are at least partly stored in DNA or its methylation. If that isn't fascinating I'm not sure what is. Or is it more along the lines of intelligence can be reduced to a large number of simple things, and biology has given us an interesting physical implementation?

      • xanderlewis 20 days ago
        As the commenter below mentions, the biological version of a neuron (i.e. a neuron) is much more complicated than the neural network version. The neural network version is essentially just a weighted sum, with an extra layer of shaping applied afterwards to make it nonlinear. As far as I know, we still don’t understand all of the complexity about how biological neurons work. Even skimming the Wikipedia page for ‘neuron’ will give you some idea.

        The original idea of approximating something like a neuron using a weighted sum (which is a fairly obvious idea, given the initial discovery that neurons become ‘activated’ and they do so in proportion to how much the neurons they are connected to are) did come from thinking about biological brains, but the mathematical building blocks are incredibly simple and are hundreds of years old, if not thousands.
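
        To make the ‘weighted sum plus shaping’ description concrete (a toy sketch; tanh is an arbitrary choice of squashing function):

          import numpy as np

          def neuron(inputs, weights, bias):
              return np.tanh(np.dot(weights, inputs) + bias)   # weighted sum, then shaping

          print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), 0.05))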

        • naasking 20 days ago
          > the biological version of a neuron (i.e. a neuron) is much more complicated than the neural network version

          This is a difference of degree, not of kind, because neural networks are Turing complete. Whatever additional complexity the neuron has can itself be modelled as a neural network.

          Edit: meaning that if the greater complexity of a biological neuron is relevant to its information processing component, then that just increases the number of artificial neural network neurons needed to describe it; it does not need any computation of a different kind.

          • andoando 20 days ago
            And assembly is also Turing complete, so if two models both being Turing complete means they are equivalent, there would be no need for coding neural networks at all. Would you consider LLMs a different kind of computation than writing assembly code?

            Perhaps fundamentally they are not, but it's also true that just writing more and more random assembly code isn't going to lead to an LLM.

            • naasking 20 days ago
              LLMs aren't randomly generated though, they are shaped by training data. This means there would, in principle, be a comparable way to synthesize an equivalent assembly program from that same training data.

              The difference here is that it's just more obvious how to do this in one case than the other.

              My point was only that 1) neural networks are sufficient, even if real neurons have additional complexity, and 2) whatever that additional complexity, artificial neural networks can learn to reproduce it.

              • andoando 20 days ago
                I understand that; what I am saying, though, is that the fact that they can doesn't mean that they will simply by scaling their number. It still entirely depends on how they are trained/arranged, meaning it may take a completely different way of composing/gluing neurons together to simulate any additional complexity. It's like saying a NAND gate is Turing complete, so I put 1000000000 of them in a series, but it's not doing anything; what gives, do I need to add a billion more?

                Just as modeling and running a single neuron takes x transistors configured in a very specific way, for example, it may take y neurons arranged in some very specific, unknown way to model something that has extra properties.

                And it's not clear either whether neurons are fundamentally the correct approach to reach this higher-level construction, rather than some other kind of node.

          • xanderlewis 20 days ago
            PowerPoint is Turing complete. Does that mean PowerPoint should be regarded as being biological or at least neuroscience-inspired?
            • naasking 20 days ago
              No, but neural networks literally were inspired by biology so I'm not sure what your point is.
              • xanderlewis 20 days ago
                My point is that you seem to think neurons in the sense of artificial neural networks and neurons in the human brain are equivalent because:

                (1) Neural networks are Turing complete, and hence can do anything brains can. [debatable anyway; We don’t know this to be the case since brains might be doing more than computation. Ask a philosopher or a cognitive scientist. Or Roger Penrose.]

                (2) Neural networks were very loosely inspired by the idea that the human brain is made up of interconnected nodes that ‘activate’ in proportion to how other related nodes do.

                I don’t think that’s nearly enough to say that they’re equivalent. For (1), we don’t yet know (and we’re not even close), and anyway: if you consider all Turing complete systems to be equivalent to the point of it being a waste of time to talk about their differences then you can say goodbye to quite a lot of work in theoretical computer science. For (2): so what? Lots of things are inspired by other things. It doesn’t make them in any sense equivalent, especially if the analogy is as weak as it is in this case. No neuroscientist thinks that a weighted sum is an adequate (or even remotely accurate) model of a real biological neuron. They operate on completely different principles, as we now know much better than when such things were first dreamed up.

                • naasking 20 days ago
                  The brain certainly could be doing super-Turing computation, but that would overturn quite a bit of physics seeing as how not even quantum computers are more powerful than Turing machines (they're just faster on some problems). Extraordinary claims and all that.

                  As for equivalency, that depends on how that's defined. Real neurons would not feature any more computational power than Turing machines or artificial neural networks, but I never said it would be a waste of time to talk about their differences. I merely pointed out that the artificial neural network model is still sufficient, even if real neurons have more complexity.

                  > No neuroscientist thinks that a weighted sum is an adequate (or even remotely accurate) model of a real biological neuron

                  Fortunately that's not what I said. If the neuron indeed has more relevant complexity, then it wouldn't be one weighted sum = one biological neuron, but one biological neuron = a network of weighted sums, since such a network can model any function.

                  • xanderlewis 20 days ago
                    The original comment you were in defence of was suggesting that artificial neurons were somehow very close to biological ones, since supposedly that’s where their inspiration came from.

                    If you’re interested in pure computational ‘power’, then if the brain is nothing more than a Turing machine (which, as you agree, it might not be), fine. You can call them ‘equivalent’. It’s just not very meaningful.

                    What’s interesting about neural nets has nothing to do with what they can compute; indeed they can compute anything any other Turing machine can, and nothing more. What’s interesting is how they do it, since they can ‘learn’ and hence allow us to produce solutions to hard problems without any explicit programming or traditional analysis of the problem.

                    > that would overturn quite a bit of physics

                    Our physics is currently woefully incomplete, so… yes. That would be welcome.

          • srean 20 days ago
            > This is a difference of degree not of kind

            Nope.

            Neurons in our brain operate fundamentally differently. They work by transient spikes, and information is carried not by the intensity of the spike voltage but by the frequency of spiking. This is a fundamentally different phenomenon from ANNs, where the output (voltage) is a squashing transform applied to the aggregated input values (voltages).

            • phkahler 19 days ago
              >> Neurons in our brain operate fundamentally differently. They work by transient spikes and information is carried not by the intensity of the spike voltage, but by the frequency of spiking.

              I thought they worked like accumulators where the spike "energy" accumulates until the output "fires". If that's the case then the artificial NNs are still an approximation of that process. I agree that this is a significant difference, but the mathematical version is still a rough approximation inspired by the biological one.
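
              That accumulate-until-it-fires picture is roughly the classic leaky integrate-and-fire model; a very rough sketch (all constants arbitrary, not a claim about real biology):

                def simulate(current, steps=100, leak=0.9, threshold=1.0):
                    v, spikes = 0.0, []
                    for t in range(steps):
                        v = leak * v + current      # leak a little, accumulate the input
                        if v >= threshold:
                            spikes.append(t)        # fire...
                            v = 0.0                 # ...and reset
                    return spikes

                print(len(simulate(0.15)), len(simulate(0.3)))   # stronger input -> higher spike rate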

              • srean 18 days ago
                Sandpile mathematics, something that is studied by computer scientists and mathematicians, would be an approximation. It's not so much the level of the spike that matters; how often they spike is what conveys the signal. The temporal behavior is supremely important. It used to be believed that the rate is all that matters, but no longer.

                There are ANN models that model these spike trains (that's what these 'avalanches' are called); these do work similarly to real neurons, but they are not part of the deep neural network wave of popularity [0, 1]. Besides, backpropagation is not what goes on in the brain; it's known to be biologically infeasible.

                So, all in all, traditional ANNs are nothing like real neural networks. That's ok; aeroplanes do not fly like birds, but they do still 'fly'.

                [0] https://en.wikipedia.org/wiki/Spiking_neural_network

                [1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9313413/

      • kragen 20 days ago
        anns originated in hypotheses about how neurobiology might work in the 01940s but diverged completely from neurobiology in the 01960s; they contain nothing we've learned about neurons in the last 50 years, and not much from before that either (they don't, for example, do hebbian learning). current anns use training methods like gradient descent with momentum and activation functions like relu which have no plausible biological realization

        artificial neural networks are an approximation of biological neural networks in the same way that a submarine is an approximation of a fish

      • srean 20 days ago
        > And yet, artificial neural networks ARE an approximation of how biological neurons work

        For a non-vapid/non-vacuous definition of 'approximation' this is not true at all. It is well understood that (i) back-propagation is biologically infeasible in the brain, and (ii) the idea that the output 'voltage' is a transformed weighted average of the input 'voltages' is simply not how neurons operate; (ii) is in the 'not even wrong' category.

        Neurons operate in terms of spikes, and the frequency and quiescence of spiking. If you are interested, any undergrad text in neurobiology will help correct the wrong notions.

      • chriswarbo 20 days ago
        > And yet, artificial neural networks ARE an approximation of how biological neurons work.

        Only if you limit yourself to "sums of weighted inputs, sent through a 1D activation function".

        However, the parent said "differentiable primitives": these days people have built networks that contain differentiable ray-tracers, differentiable physics simulations, etc. Those seem like crazy ideas if we limit ourselves to the "neural" analogy; but are quite natural for a "composition of differentiable primitives" approach.

    • m463 20 days ago
      chess is pattern matching.
  • p1esk 20 days ago
    And then you learn about binary or ternary networks where gradients don’t really exist anywhere, and you start to wonder about the importance of this differentiability.
    • ubj 20 days ago
      ...And then you start learning about generalizations of the notion of "gradient" to scenarios where the classical gradient doesn't exist :)
    • whimsicalism 20 days ago
      binary networks don't really work well unless you do a relaxation first
  • seanhunter 20 days ago
    Wow, just skimmed a bit, but this book looks amazing so far. Really understandable but with an intuitive presentation of the underlying maths that invites the reader to go deeper if they want to by giving them what they need to get started.
  • gfaure 20 days ago
    In the literature, they're usually called convolutional layers (I think you can pretty much search and replace all uses of "convolutive" in the text).
  • glonq 20 days ago
    I wonder if the usage of Alice & Wonderland takes inspiration from Douglas Hofstadter's "Gödel, Escher, Bach: an Eternal Golden Braid" ?
  • Eduard 20 days ago
    better remove all the Disney-based Alice in Wonderland character intellectual property from the book.
    • astrodust 20 days ago
      I was just thinking that's "cease and desist" bait right there.
      • iainmerrick 20 days ago
        Alice in Wonderland (the book) is in the public domain. The old Disney movie is still in copyright, and the cover image does look very much like it's from the movie, but that character design is from John Tenniel's illustrations which are also in the public domain.
        • astrodust 20 days ago
          The character design is. The image, however, is clearly Disney flavoured if not traced directly.

          Tenniel's version, for example, does not have the distinctive bow. The art style is also completely different.

          • iainmerrick 20 days ago
            True - it would be a good idea to use a Tenniel piece instead.

            Edit to add: I was mostly trying to push back on the implication that Disney owns Alice in Wonderland (and Peter Pan, Winnie the Pooh, etc). Now I re-read the original comment, they did specify "Disney-based", so maybe I'm over-reacting!
