Ask HN: Does machine learning always need huge data sets?

5 points | by gtirloni 13 days ago


  • gorg_dc 12 days ago
    ML expert here. Good points above, especially on training vs. inference.

    To be honest, yes, it is possible. Most models I have built could run on a mobile device. In practice they would not, mostly because they were written in Python, and since compute was available and cheap, I did not care much about RAM or efficiency for a training job.

    I think dataset size is overrated because of things like Kaggle or news about deep learning models for image recognition. Bigger datasets are better, but if your data quality is good, a few hundred rows (like a CSV file) can be enough for many applications!

    Most data challenges are not image recognition or NLP either, so you could do them on smaller devices. I think the main issue would be 'support', though. Small devices typically do not run Python (or R/Julia), so you need to port your inference code to some binary format (like WebAssembly) or rewrite it in C/C++. Fortunately, inference code is much smaller than training and experimentation code.
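
    To give a feel for how small inference code can be, here is a minimal sketch, assuming a logistic regression whose parameters were exported from a training job. The weights, bias, and function names are hypothetical values chosen for illustration, not taken from a real model.

```python
import math

# Hypothetical parameters exported from a training job (illustrative values only).
WEIGHTS = [0.8, -1.2, 0.05]
BIAS = -0.3


def predict_proba(features):
    """Logistic-regression inference: sigmoid of a weighted sum of the features."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, threshold=0.5):
    """Return True when the predicted probability reaches the threshold."""
    return predict_proba(features) >= threshold
```

    The whole thing is one dot product and one sigmoid, which is why porting it to C/C++ or WebAssembly is usually much less work than porting the training pipeline.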

    • Jugurtha 11 days ago
      There's a difference between training and inference (using the model to achieve a goal).

      Training is when you show examples (instances of something) to an entity and it learns to recognize them. Example: doing homework and math exercises.

      Inference is when you show the entity an example it has not seen before and ask it to draw from its training experience to make something of that example.

      Example: a seasoned cop has had many more interactions than a green cop. However, some people are more intuitive than others and don't need that many years to read situations. Their learning algorithm is different, or they're looking at things others are not looking at (features).

      You are probably using machine learning inference on your mobile device when you text and it recommends the next word. This application does not query a server because it needs to be low latency. You type fast, and you need the model to be right there. The same case can be made for self-driving. This poses several challenges and relies on several techniques to run models in constrained environments, or to get them there in the first place.

      Second, for training: it depends on the problem you are solving. Are you trying to predict something so rare that even with a year's worth of data, it has only happened twice? Is there a lag between the influencing factors and the phenomenon? Say, changing nutrients for a plant and its state not changing instantaneously. This depends on the problem.

      • RicoElectrico 13 days ago
        Depends on which part you want to do on the smartphone. Training, inference, or both.

        Training is quite expensive computationally, but inference needn't be. We have many models that can run on a smartphone, after all.

        However, you can do some limited training on the smartphone by leveraging pre-trained models. Usually the internal representation at the very end of the network can be used as an input to train a simpler algorithm on top of it.
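
        As a rough sketch of that idea: the `embed` function below is a hypothetical stand-in for the frozen pre-trained network (a real app would call the on-device model there), and a nearest-centroid classifier plays the role of the simpler algorithm trained on top. All names are assumptions made for illustration.

```python
def embed(x):
    # Hypothetical stand-in for the last hidden layer of a frozen pre-trained
    # network. A real app would run the on-device model here instead.
    return [x[0] + x[1], x[0] - x[1]]


def fit_centroids(examples, labels):
    """Train the simple model: average the embeddings of each class."""
    sums, counts = {}, {}
    for x, y in zip(examples, labels):
        e = embed(x)
        if y not in sums:
            sums[y] = [0.0] * len(e)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], e)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def classify(centroids, x):
    """Predict the class whose centroid is closest to the embedding of x."""
    e = embed(x)
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(e, centroids[y])),
    )
```

        Only the centroids are learned on the device; the expensive network stays fixed, which is what keeps the training cheap.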

        But all of the above depends on what actually needs to be done, which you have not specified. Classical, non-deep, ML models could easily be trained on a smartphone provided the datasets can fit.

        The keyword for you to seek would be "edge AI", if that helps.

        • i_like_apis 13 days ago
          Non ML expert here.

          Training an ML model typically takes large datasets and compute power.

          Using a model that has already been trained requires less compute power, and some ML apps (trained models) certainly exist on smartphones.

          One example is in some motion and gesture apps that detect if you are walking / running / riding a bike. Some use ML. The classifiers were trained on a large set of data external to the app, but they run on the device afterward.

          • Jensson 13 days ago
            You can do a lot of useful things, like predicting the price of a phone on eBay, using around a thousand data points and basically no compute. You can't do things like train an image recognition model, but most machine learning isn't that expensive to do; it's just that problems requiring expensive models are what make the news.
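
            For a sense of scale, here is a minimal sketch of fitting a price model on a thousand points with ordinary least squares in plain Python. The data is synthetic and generated in the snippet (an assumed price of 600 minus 8 per month of age, plus noise), not real eBay listings, and the variable names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (an assumption for illustration, not real listings):
# phone age in months and a noisy price around 600 - 8 * age.
rng = random.Random(0)
ages = [rng.uniform(0, 48) for _ in range(1000)]
prices = [600.0 - 8.0 * age + rng.gauss(0, 20) for age in ages]

slope, intercept = fit_line(ages, prices)
predicted_price = intercept + slope * 12  # estimate for a 12-month-old phone
```

            Training here is two passes over a list of a thousand numbers, which any phone can handle easily.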
            • high_byte 13 days ago
              it needs enough data. often "enough" is a measurable amount, so technically speaking, if you had a specific use case, you could answer it.

              but in practical terms, there is absolutely no reason training would be done on a smartphone. or any pc for that matter. you only need to train once, then you can use anywhere.