One concern with the study is that the authors generated the objects specifically from skeletons rather than deriving them from shapes, either natural or human-made, covered by skin, metal, or other materials that people encounter in their day-to-day life. “The shapes that they generated are directly related to the hypothesis they’re testing and the conclusions they’re drawing,” says James Elder, a professor of human and computer vision at York University in Toronto. “If we’re interested in how important skeletons are to shape and object perception, we can’t really answer that question by only looking at the perception of skeleton-generated shapes. Because obviously in a world of skeleton-generated shapes, skeletons are probably fairly important because that’s the way those shapes were made.”
I looked into the paper first and thought: yeah, well, it's really not surprising that the skeleton models are most predictive for the kind of objects they tested. Their skeleton really is all that defines them.
The only thing they tested and proved is: Skeleton models are predictive for human decision when recognizing objects made just from skeletons with little flesh and hardly any texture whatsoever.
Nevertheless, I think skeleton models are a good thing for object recognition.
Humans are much better at noise removal than computers. Many people can look at an object and see what's extraneous to the basic form--what's left is the skeleton. Computers, so far, don't have the context to do this, and instead try to recognize objects based on visual patterns, etc.
Perhaps "weighting" models, allowing algorithms to look for centers of gravity and mechanical behavior would help. Humans exist in a 3d world, but we also interact with a simplified 3d world.
We don't worry about the plastic bag in the street because we can feel how our car will respond. It's trivial. There's no "weight" attached to the object.
Weight and balance are incredibly important psychologically (see the burgeoning popularity of weighted blankets), and that's a thing that's missing for computers. Having a tangible sense of the world in our minds gives us a huge leg up when relating to it.
I'd like to argue that it's rather the extreme connection density and feedback loops that connect all these different concepts. Taken on their own, each of these models that the brain (and perhaps artificial neural networks) construct are weak predictors. This is compensated for by their sheer number and the plasticity of the feedback loops between them.
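To put a rough number on "weak predictors compensated for by their sheer number", here is a minimal sketch. It assumes something much simpler than what I'm describing: independent binary predictors combined by a plain majority vote (the Condorcet jury theorem setting), with no feedback loops at all. The function name and the 0.6 accuracy are made up for illustration.

```python
from math import comb


def majority_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is right.

    Each voter is correct with probability p. Use an odd n so ties cannot occur.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical weak predictors: each is right only 60% of the time.
    for n in (1, 3, 11, 101):
        print(n, round(majority_accuracy(n, 0.6), 3))
```

Real models in the brain are certainly correlated with each other, so the gain would be smaller than this idealized case suggests, but the direction of the effect is the point.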
As you say, when a human observes a plastic bag, a vast number of different models and transformations aggregate their predictions in a highly nonlinear fashion:
The bag has a plasticky look, it seems to be flopping around, it is slightly see-through, it produces a certain sound that implies a hollow cavity etc... these primary observations are processed by the sensory neurons, which do a first pass filter to remove noise completely subconsciously. If they don't get enough feedback, perhaps it was just a mirage of a bag - not real - and you do a double take and realize it was just a play of shadows.
But let's assume first pass feedback confirms that it is likely a real sensation. The primary inputs are then confirmed by secondary predictions:
The bag is carried by the wind, implying low density, the sound it makes is common for empty thin plastic materials, it has a matte surface that lets through some amount of light, etc... the subconscious thus makes the conclusion that it is indeed probably made of thin plastic and therefore of low density and low hardness and therefore not a threat in terms of high velocity impact. Your swerve reflex thus doesn't kick in and you drive straight.
But this reasoning requires an in-depth model of the world. It isn't enough to just recognize the shape of a bag, because that could be a myriad of other things. Only by having a model and thus understanding of all these different aspects of reality can one make a prediction as robustly as a human. And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
>> But this reasoning requires an in-depth model of the world. It isn't enough to just recognize the shape of a bag, because that could be a myriad of other things. Only by having a model and thus understanding of all these different aspects of reality can one make a prediction as robustly as a human.
This is a great summary of why I think current deep-learning based methods will never lead to 'intelligence' that is good enough to e.g. navigate the real world like humans do. They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification.
>> And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
This observation I don't really follow, I would say the bar to match human reasoning abilities is extremely high for exactly the reasons you described yourself.
>> This observation I don't really follow, I would say the bar to match human reasoning abilities is extremely high for exactly the reasons you described yourself.
Sorry, I should've phrased it better. I was trying to imply that just matching human reasoning abilities is indeed an undertaking of incomprehensible complexity, _and_yet_ it is still highly error prone. I believe a system that replaces humans will be under close scrutiny, and just being on par won't be enough.
>> This is a great summary of why I think current deep-learning based methods will never lead to 'intelligence' that is good enough to e.g. navigate the real world like humans do. They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification.
I'm not a neurologist or cutting edge ML researcher by any measure, but this is my viewpoint as well. The astounding amount of information and internal models, and the astounding complexity of these models in terms of connections and feedback loops (and their plasticity) implies to me that our current pedestrian attempts at AI are nowhere near what is required for GAI, let alone human level GAI.
It seems to me like a lot of hubris to suggest (as I've seen people do) that in just a couple of years we could get there. Currently we have not even a clue how consciousness arises. We have evidence that it is physically possible, but that's it.
The leading enterprise in the area, Google/YouTube, routinely fails to identify objects and sounds in videos.
My prediction is that what we have currently is a local optimum that expands our capabilities a lot, compared to what we had before, but in terms of genuine insight into human level AI, it will prove to be a dead end.
Regarding models of the world: isn't it conceivable that a computer could have a smaller, specialized, model of the world specific to its task?
A car could have a model of reality whose scope is only encompassed by the context of roads and driving. It is conceivable to me that a car could have an in-depth model of the "driving-world" that would allow it to make multi-sensory, tiered observations and predictions akin to human cognition.
> They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification
Deep learning is more than just imagenet classification or object detection.
There are many approaches that require more understanding, such as future video prediction, captioning, question answering, reinforcement learning requiring an implicitly learned model of how the environment works beyond mere appearances, image generation, structure extraction, anomaly detection, 3d reasoning, external memory, few/one/zero shot learning, meta-learning, etc etc.
The field is huge and whatever "obvious shortcomings of deep learning" non-specialists come up with after reading popular articles are probably being tackled already in many groups and have several lines of approaches and papers already.
> Computers, so far, don't have the context to do this
As someone who did their Ph.D thesis on the statistics of shape using models based on the medial axis (i.e., a skeleton), I would beg to differ.
Whether these models are as easy to apply (computationally and conceptually) as the currently in-vogue techniques is another question, but there is nothing magical here that computers are incapable of.
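For anyone who hasn't seen a medial axis computed, here is a toy sketch. It is not the method from my thesis, just a brute-force reading of the textbook definition: a point of the shape is on the medial axis if it has two or more equally close boundary points. The grid format (`#` for shape, anything else for background), the function names, and the rectangle are all assumptions for illustration. The grid must contain at least one background cell, and on staircase-like diagonal boundaries this naive tie test will produce spurious points.

```python
from math import hypot


def medial_axis(grid):
    """Return the (row, col) shape cells that have 2+ equally near background cells."""
    fg = [(r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch == "#"]
    bg = [(r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch != "#"]
    axis = set()
    for r, c in fg:
        # Brute force: Euclidean distance from this cell to every background cell.
        dists = [hypot(r - br, c - bc) for br, bc in bg]
        nearest = min(dists)
        ties = sum(1 for d in dists if d - nearest < 1e-9)
        if ties >= 2:
            axis.add((r, c))
    return axis


def render(grid, axis):
    """Redraw the grid with medial-axis cells marked as 'o'."""
    return [
        "".join("o" if (r, c) in axis else ch for c, ch in enumerate(row))
        for r, row in enumerate(grid)
    ]


# Hypothetical test shape: a 3x7 rectangle padded with background.
RECT = [
    ".........",
    ".#######.",
    ".#######.",
    ".#######.",
    ".........",
]

if __name__ == "__main__":
    print("\n".join(render(RECT, medial_axis(RECT))))
```

On the rectangle this marks the center line plus one cell toward each corner, which is the familiar shape of a rectangle's skeleton. Practical implementations use distance transforms or thinning rather than this quadratic search.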
I think one of the most important aspects of human vision that everyone seems to overlook is that it's active. We aren't just sitting in a dark room looking through a video feed our whole lives, we actually live in and interact with the world.
Our eyes are active in that they move freely and can focus at different distances. We also happen to have two of them and our brains have a model for how far apart they are. These two features (active focusing and binocular vision) give us incredible depth perception.
Our brains use this depth information to separate objects from the background, something a machine learning algorithm cannot do if you're just feeding it a billion photo labeled training set.
The brain also makes decisions very early and updates them as it has time to reconsider the data. We've all probably had cases where we saw a person sitting down, then realised it was just a jacket draped over a chair.
At least from my own personal experience, it's very biased too. It seems the more tired we are, the more likely we are to incorrectly recognise immobile objects as people or animals at a glance.
I know you meant it as a general statement, but I think it depends on the kind of noise, and obviously the type of signal. It's trivial for a computer to look past 'fixed pattern noise' to find the data in an audio-visual signal. For certain noisy signals, a compute device could amplify/scale the data (and then perform noise removal) to retrieve the signal, etc.
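As a rough illustration of why fixed pattern noise is the easy case: since the pattern is the same in every frame, you can estimate it once and subtract it. This is a minimal sketch assuming a purely additive per-pixel offset and frames flattened into plain lists; the function names and numbers are made up, and real sensors also need gain correction.

```python
def estimate_fixed_pattern(dark_frames):
    """Average several dark frames (no signal) to estimate the per-pixel offset."""
    n = len(dark_frames)
    return [sum(pixel_values) / n for pixel_values in zip(*dark_frames)]


def remove_fixed_pattern(frame, pattern):
    """Subtract the estimated per-pixel offset from a captured frame."""
    return [value - offset for value, offset in zip(frame, pattern)]


if __name__ == "__main__":
    pattern = [1, 5, 0, 3]      # hypothetical per-pixel sensor offsets
    signal = [10, 10, 20, 20]   # the scene we actually want
    captured = [s + p for s, p in zip(signal, pattern)]

    estimated = estimate_fixed_pattern([pattern, pattern])
    print(remove_fixed_pattern(captured, estimated))
```

Random noise is a different story, because it changes from frame to frame and cannot simply be subtracted.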
>Here we tested whether skeletal structures provide an important source of information for object recognition when compared with other models of vision. Our results showed that a model of skeletal similarity was most predictive of human object judgments when contrasted with models based on image-statistics or neural networks, as well as another model of structure based on coarse spatial relations. Moreover, we found that skeletal structures were a privileged source of information when compared to other properties thought to be important for shape perception, such as object contours and component parts. Thus, our results suggest that not only does the visual system show sensitivity to the skeletal structure of objects [32, 36, 37], but also that perception and comparison of object skeletons may be crucial for successful object recognition.
I think it's telling that even young children are exceptionally good at object recognition, and if you ask them to draw an object, they'll typically give you a "skeleton" with basically no ability to reconstruct the textural components.
I think the real interesting question is: what is the internal representation of this skeleton? A graph? A forest of graphs? Some kind of field that's graph-like?
Well, it's not at all obvious that a line is the naive representation of a limb rather than a particularly intelligent encoding of it.
For example: CNNs, though pretty good at detecting limbs (and miscellaneous other things) have only a very limited ability to encode structural information in this way. An interesting open question in the field is what is the "right way" to encode this sparse, graph-like structural data (hence capsule networks).
Absolutely, but that requires advanced fine motor control, understanding of how the instrument lays down color and what multiple layers of color look like when on top of each other, and so on.
The naive way to use the instrument is to run it over the area one or a few times. The simplest way to do that in terms of motor control (e.g. fewest turns) is to run it up and down the longest axis one or more times. That's exactly what a child does.
Machines are taught from flat images. How can they be expected to create 3D from this?
Humans learn from binocular vision, and from multiple angles as we move around an object, making it a lot easier to get an idea of its shape.
My daughter aged 18 months could already recognise abstract signs like the mother and baby or disabled sign just from knowing the real object. Which must say something about the way she stored the representations of them.
They don't see flat static images, but a continuous stream of input that changes view angle constantly as they and the object make subtle movements. Moreover, they can interact with things and gather more visual information where needed. (Anything too big to interact directly with is probably too far away for binocular depth perception to be of much use.) See a big list of monocular depth cues here: https://en.wikipedia.org/wiki/Depth_perception
Because they’re using existing data. You need thousands, maybe millions of images to train an AI to recognise something well, and only recognise the right characteristics. No-one has the resources to go take all those photos themselves.
Anyone know of a visual recognition AI being trained also with depth data? Would be interested to see what difference it makes.
This relates to something else I noticed about how my daughter learns. You can show her one photo of a lion, from one angle, and she will recognise other lions later on, at different angles. I think she must have seen enough animals already from many angles to have generalised their shape and then be able to presume the new animal is similar and just see the new characteristics like a mane. Something very different is happening in human brains!
I'd say (from experience) that people do not recognize objects by visualizing their skeletons; they recognize objects by a generalization of their shape.
In case of recognizing other animals, the generalization takes the form of a 'tree' of objects connected via nodes, which is actually what a skeleton does to a body.
But that does not happen with other objects, e.g. cars. For cars, the generalization is that of a box with circles at the bottom (for the wheels).
It should also be noted that the details of objects are not really lost; they are remembered up to a certain degree, which allows us to tell a person with fat body parts from a person with thin body parts of the same height and otherwise the same general appearance.
The degree of generalization is also responsible for not being able to remember a new face that strongly resembles a face we already know, until we recognize some special attributes in the new face that the old face does not have. In this case, the degree of generalization is such that it does not allow us to immediately tell the old face from the new one.
I'd say that recognition works in a step-like fashion:
- we first recognize a generic abstraction of the object at hand: whether the object is inanimate or not.
- then we recognize which category of inanimate or living objects the object belongs to (for example, is it a human? an animal? etc).
- then we recognize more details: is the person tall, fat or blond, for example?
- then we recall our connections to that person, resulting in choosing a response.
I don't have data to back the above up, it's all from intuition and personal experience, but that's how I think objects are recognized by brains.
I don't think the study can be used to draw the conclusions that the article is trying to draw. The study presents new objects, derived from skeletons, for people to learn and identify. IMO people learn differently in the short term vs the long term. Short term, we try to reduce the dimensionality of the input to things we can hold in working memory. In this study that would be the skeleton of the object. That doesn't mean that the pattern holds up for long-term learning (which is mostly how we visually identify things, because we've seen them many times already). The main reason I bring this up is that it seems to be in direct contrast with studies which show the opposite (i.e. that humans do operate like machines in identifying objects). One such study compared the brain regions which activated when a person was exposed to visual input and found a consistent location which was activated by seeing a horizontal / vertical line.
This is what fascinates me about machine learning. You can train an algorithm to self-adjust in ways that a human doesn't have the capacity to understand, and it can then perform human-like tasks.
If we get good at allowing programs to generate programs that find new ways of learning, is it still behaving in the way that humans programmed it, or has it shifted to a law of nature that is fundamentally out of our control?
When it's all said and done, we decide if it was us, or a force beyond humans.
I think that object recognition is hard because humans have much more data than computers. People see with two eyes which can focus at different distances, so our brain has 3D data to learn from. We later learn to recognise the same objects in pictures.
Computers usually start from flat pictures, and that trips the learning process.
No, but they cannot see depth very well because they can't use stereoscopic vision (triangulate). However, there are other cues that are used to infer depth, such as covered edges (if one object partially covers another, then it is closer to you), perspective (if you know two objects are similar in size but one appears smaller, then it is farther away), etc.
A friend of mine cannot see with one eye, and yet he is a painter. One thing I know he cannot do is drive a car.
That's specific to your friend, not true in general. Lots of people drive with only one functional eye. At the visual distances involved, the depth perception provided by stereoscopic vision doesn't matter much. Especially with all the relative motion. My dad has been driving successfully for 65 years with only one working eye.
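A back-of-the-envelope sketch of why stereo matters so little at driving distances, using the standard pinhole camera model. The focal length (800 px) and baseline (0.065 m, roughly the spacing of human eyes) are made-up illustrative numbers for a camera, not measurements of human vision, and the function names are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Stereo triangulation under a pinhole model: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_from_known_size(focal_px, real_height_m, image_height_px):
    """Monocular 'relative size' cue: Z = f * H / h for an object of known height."""
    return focal_px * real_height_m / image_height_px


def depth_error_per_pixel(focal_px, baseline_m, depth_m):
    """First-order stereo depth error for a one-pixel disparity error: Z^2 / (f * B)."""
    return depth_m**2 / (focal_px * baseline_m)


if __name__ == "__main__":
    f, b = 800.0, 0.065  # assumed focal length (px) and baseline (m)
    for z in (2.0, 10.0, 50.0):
        print(z, round(depth_error_per_pixel(f, b, z), 2))
    # A 1.5 m tall object appearing 100 px vs 50 px tall.
    print(depth_from_known_size(f, 1.5, 100), depth_from_known_size(f, 1.5, 50))
```

With these numbers a one-pixel disparity error costs a few centimetres at arm's length and tens of metres at 50 m, whereas the known-size cue keeps working at any range as long as you know roughly how big the object is.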
All those other cues are present in images from cameras as well. The only one I can think of that typically isn't used much for computer vision is focus distance, but for objects far away I don't think that helps us much in object recognition, since all of the objects are in focus anyway.
TL;DR the objects are grouped into categories which determine the "Key points" on the objects (similar to this 'skeleton') which the robot knows how to interact with in order to bring about the intended manipulation.