Remember when Apple created a “Machine Learning journal”? Well, it seems like they’ve stopped publishing to it and now have gone back to introducing stuff at presentations, if at all: https://machinelearning.apple.com/
I'm not in ML, but in PL and HCI almost all conferences have proceeded on-schedule, just in a virtual format.
The only exception I'm aware of is HOPL (History of Programming Languages). They still published the papers/proceedings as usual, but postponed the physical gathering rather than meeting virtually, because the conference convenes only once every 10-15 years.
I think it's still wild that neither TensorFlow nor PyTorch works on the GPUs in Apple's MacBook Pros - AMD's ROCm doesn't run on anything but Linux, and NVIDIA drivers aren't supported on macOS if you want to use an external GPU.
Combine this with Microsoft's roadmap for WSL with CUDA GPU support, and it's going to cost Apple a lot of ML/AI/HPC developer mindshare. Yes, we do a lot of our work on remote machines, but that's not always the most convenient way to experiment. I doubt my next machine will be a MacBook.
There seems to be ongoing work on Vulkan Compute support for TensorFlow. But the MLIR repo moved at the end of 2019 and I don't see where (or whether) the discussion and PR continued, because the new repo doesn't even use GitHub Issues.
There might be other reasons for shipping different models for the iPad vs. the iPhone. E.g., if the iPad is used indoors more often than outdoors, you could ship a version of your big CNN fine-tuned to that smaller set of classes.
Image augmentations are hard to add to training. It may seem easy, but they require a lot of thought.
(To back up a bit: image augmentations are how you answer the question "How do I make my model robust across different cameras?" It might be tempting to gather labeled data from a variety of cameras, but that doesn't necessarily result in a model that can handle newer, higher-resolution cameras. So one solution is to distort the training data with augmentations so that the model can't tell which resolution the input images came from.)
The other way to deal with it is to just downscale the camera's image to, say, 416x416. But that raises a question: can different cameras produce images that look different even when downscaled to 416x416? Sure they can! Cameras have a dizzying array of features, and they perform differently in different lighting conditions.
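To make that concrete, here's a minimal numpy sketch (entirely my own toy setup, not anyone's actual pipeline): two hypothetical "cameras" capture the same scene, one with more sensor noise and a different gamma curve, and a naive box-filter downscale to 416x416 still doesn't erase the difference between them.

```python
import numpy as np

def downscale(img, size=416):
    """Naive box-filter downscale to size x size (assumes a square
    image whose side is a multiple of `size`)."""
    f = img.shape[0] // size
    return img.reshape(size, f, size, f, 3).mean(axis=(1, 3))

rng = np.random.default_rng(0)
scene = rng.uniform(0.2, 0.8, size=(832, 832, 3))

# Two hypothetical cameras: same scene, different sensor noise and gamma.
cam_a = np.clip(scene + rng.normal(0, 0.01, scene.shape), 0, 1)
cam_b = np.clip(scene + rng.normal(0, 0.08, scene.shape), 0, 1) ** 0.8

small_a = downscale(cam_a)
small_b = downscale(cam_b)

# Even after downscaling, the two cameras' outputs differ measurably,
# e.g. in their mean brightness -- so a model can still tell them apart.
print(small_a.shape, abs(small_a.mean() - small_b.mean()) > 0.01)
```

The point being: downscaling normalizes resolution, but not gamma, noise profile, white balance, etc., so it doesn't solve the camera-robustness problem on its own.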
To return to the point about image augmentations being hard to add: it's easy to say what your training code should do ("just distort the hue a bit"), and there seem to be operations explicitly for that: https://www.tensorflow.org/api_docs/python/tf/image/adjust_h... But when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.
I've been trying to build an equivalent of Kornia https://github.com/kornia/kornia for TensorFlow - Kornia is a wonderful library that implements image augmentations using nothing but differentiable primitives. Work is a bit slow, but I hope to release it in Mel https://github.com/shawwn/mel (which will hopefully look less like a TODO soon).
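For a flavor of what "augmentations from differentiable primitives" means, here's one well-known trick (a sketch in numpy; I'm not claiming this is what Kornia does internally): a hue shift can be expressed as a rotation of the chroma plane in YIQ space, which is nothing but matrix multiplies and sin/cos - all of which autodiff frameworks can backprop through, unlike a table-based HSV round-trip.

```python
import numpy as np

# Standard RGB <-> YIQ conversion matrices.
RGB2YIQ = np.array([[0.299,  0.587,  0.114],
                    [0.596, -0.274, -0.322],
                    [0.211, -0.523,  0.312]])
YIQ2RGB = np.linalg.inv(RGB2YIQ)

def adjust_hue(img, theta):
    """Rotate hue by `theta` radians. img: (..., 3) float RGB.
    Built entirely from matmuls and sin/cos, so the same code written
    with TF or PyTorch ops would be differentiable end to end."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[1, 0,  0],
                    [0, c, -s],
                    [0, s,  c]])
    m = YIQ2RGB @ rot @ RGB2YIQ
    return img @ m.T

img = np.random.default_rng(1).uniform(size=(4, 4, 3))
out = adjust_hue(img, 0.1)
# theta = 0 is (numerically) the identity transform:
assert np.allclose(adjust_hue(img, 0.0), img)
```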
Training a model per camera isn't necessarily a terrible idea, either. In the future I predict that we'll see more and more "on-demand" models: models that are JIT optimized for a target configuration (in this case, a specific camera).
Robustness often comes at the cost of quality / accuracy (https://arxiv.org/abs/2006.14536 recently highlighted this). In situations where that last 2% of accuracy is crucial, there are all kinds of tricks; training separate models is but one of many.
> To return to the point about image augmentations being hard to add: It's so easy to explain what your training code should do "Just distort the hue a bit" and there seem to be operations explicitly for that: https://www.tensorflow.org/api_docs/python/tf/image/adjust_h.... but when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.
Why not do the data augmentation during preprocessing, so the transformations don't have to be differentiable? I.e., map over a tf.Dataset with the transformation and append the result to the original dataset.
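In case it helps, the pattern I mean looks roughly like this - a numpy stand-in for `ds.concatenate(ds.map(augment))` in tf.data, with a hypothetical per-channel color jitter as the augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(batch):
    """Offline augmentation: runs in the input pipeline, before the
    model, so it never needs to be differentiable. (Hypothetical
    color jitter via random per-channel scaling.)"""
    scale = rng.uniform(0.9, 1.1, size=(1, 1, 1, 3))
    return np.clip(batch * scale, 0.0, 1.0)

images = rng.uniform(size=(8, 32, 32, 3))   # stand-in for a dataset

# Analogue of ds.concatenate(ds.map(augment)): originals + augmented copies.
training_set = np.concatenate([images, augment(images)], axis=0)
print(training_set.shape)  # (16, 32, 32, 3)
```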
Why are you trying to backpropagate over data augmentations? I've never done that (or heard about it being done). Usually I just do the augmentations on the input samples and then feed the augmented samples to the network.
Differentiable augmentations aren't necessary unless the augmentations are midstream (so you have to propagate gradients to parameters upstream of the augmentations, which is weird) or have learnable parameters (at which point you aren't learning how to work on different views of the same sample; you're learning how to modify a sample to be more learnable, which is a different problem).
Don't get me wrong, augmenting samples to reduce device bias is a hard problem, but you might be making it harder than it needs to be.
The data augmentations we're interested in are in fact 'midstream': they augment the examples before they're passed to the D or to the classification loss, but you must backprop from there through the augmentation into the original model, because you don't want the augmentations to 'leak' - the G is not supposed to generate augmented samples; the augmentation is there to regularize the D and reduce its ability to memorize real datapoints. It would probably be better to think of them as a kind of consistency or metric loss along the lines of SimCLR (which helped inspire these very new GAN data augmentation techniques). It's a bit weird, which is perhaps why, despite its simplicity (indicated by no fewer than 4 simultaneous inventions of it in the past few months), it hasn't been done before. You really should read the linked GitHub thread if you're interested.
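The forward structure is easier to see in code. Here's a toy numpy sketch (linear G and D, additive brightness shift as the augmentation - all hypothetical stand-ins, forward pass only): both real and generated samples go through the same augmentation before the D, so the G's gradient would flow D -> augment -> G, which is exactly why the augmentation has to be differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z, w_g):        # toy generator: linear map from latent to "image"
    return z @ w_g

def augment(x, t):    # differentiable augmentation, e.g. brightness shift
    return x + t      # (built only from differentiable ops)

def D(x, w_d):        # toy discriminator: one scalar score per sample
    return (x @ w_d).squeeze(-1)

z,  w_g = rng.normal(size=(4, 8)),  rng.normal(size=(8, 16))
w_d, t  = rng.normal(size=(16, 1)), 0.1
real    = rng.normal(size=(4, 16))

# Both real and fake samples pass through the SAME augmentation before D,
# so D never sees un-augmented data and can't memorize the real points.
# Backprop to G goes D -> augment -> G; because D only ever scores the
# augmented view, G is never pushed to emit augmented images itself.
score_real = D(augment(real, t), w_d)
score_fake = D(augment(G(z, w_g), t), w_d)
print(score_real.shape, score_fake.shape)  # (4,) (4,)
```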
> Training a model per camera isn't necessarily a terrible idea, either. In the future I predict that we'll see more and more "on-demand" models: models that are JIT optimized for a target configuration (in this case, a specific camera).
Meta-learning, or perhaps learning camera embeddings to condition on, would be one way. Although that might all be implicit if you use a deep enough NN and train on a sufficiently diverse corpus of phones and photos.
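The camera-embedding idea can be sketched in a few lines of numpy (a hypothetical setup of my own, not from any paper: in practice the embedding table would be trained jointly with the rest of the network):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cameras, emb_dim, feat_dim = 5, 4, 16

# One learned row per camera model; each row is that camera's embedding.
camera_emb = rng.normal(size=(n_cameras, emb_dim))

def conditioned_features(image_feats, camera_id):
    """Concatenate the camera's embedding onto the image features so
    downstream layers can specialize their behavior per camera."""
    emb = np.broadcast_to(camera_emb[camera_id],
                          (image_feats.shape[0], emb_dim))
    return np.concatenate([image_feats, emb], axis=-1)

feats = rng.normal(size=(8, feat_dim))          # a batch of image features
out = conditioned_features(feats, camera_id=2)
print(out.shape)  # (8, 20)
```

At inference time you'd look up the embedding for whatever camera produced the photo; an unseen camera could fall back to a mean embedding or be fit from a handful of its photos.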