Dear AI startups: Your ML models are dying quietly

(sanau.co)

132 points | by jimmyechan 1824 days ago

15 comments

  • ohazi 1824 days ago
    I don't see how this has anything to do with AI or ML. It's a great description of why you might want to prefer strongly typed languages, avoid "the data is the schema" key-value monstrosities like mongodb, and maybe think about writing some sanity-check-tests that need to pass before deployment, though.

    No system should ever fail silently if a required field suddenly goes missing or has the wrong type/unit.

    • nabla9 1824 days ago
      The problem with numerical programming is that all data has the same type. If you access the wrong row, multiply across the wrong axis, or call a function with the wrong parameters, the result is semantically wrong, but the code can still run and the system can still learn something.
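
      A toy example of what I mean (made-up numbers): numpy will happily broadcast against the wrong axis and never complain.

          import numpy as np

          X = np.random.rand(100, 3)         # 100 samples, 3 features
          col_means = X.mean(axis=0)         # shape (3,): per-feature means
          row_means = X.mean(axis=1)         # shape (100,): per-sample means

          X_ok = X - col_means               # correct per-feature centering
          X_wrong = X - row_means[:, None]   # also runs fine, but centers each
                                             # sample instead of each feature
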
      • amelius 1824 days ago
        Isn't there a typesystem that can help with this?

        E.g., give rows a type, give columns a type, and then if you multiply two matrices, see if the type of the columns of the first matrix and the type of the rows of the second matrix match.

        Also, the typesystem could use units, just as they are used in physics.
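
        Even without compiler support you could approximate it at runtime; a rough sketch (made-up Python, not from any existing library):

            from dataclasses import dataclass
            import numpy as np

            @dataclass
            class TypedMatrix:
                data: np.ndarray
                row_type: str   # e.g. "sample"
                col_type: str   # e.g. "feature"

                def __matmul__(self, other):
                    # columns of the left factor must match rows of the right
                    if self.col_type != other.row_type:
                        raise TypeError(f"{self.col_type} @ {other.row_type} mismatch")
                    return TypedMatrix(self.data @ other.data,
                                       self.row_type, other.col_type)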

        • chpatrick 1824 days ago
          We use this at our startup. We have a Haskell math library where every vector/matrix/rotation knows the spaces they deal with (World, Camera, etc), and using them with incorrect spaces is a type error. This is enormously useful because in 3D computer vision/graphics almost everything is 3 or 4-dimensional, so it's very easy to use them the wrong way. When you read C++ computer vision code a datatype often has a Matrix3 or worse yet a double[] in it and you have no idea what coordinate system (or even matrix order) it represents. With this, it's all self-documenting and difficult to get wrong.
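
          The same idea can be approximated outside Haskell too; here's a rough Python sketch (made-up names, with the mismatch caught by a checker like mypy rather than at runtime):

              from typing import Generic, TypeVar
              import numpy as np

              class World: ...
              class Camera: ...

              Frame = TypeVar("Frame")
              Src = TypeVar("Src")
              Dst = TypeVar("Dst")

              class Point(Generic[Frame]):
                  def __init__(self, xyz: np.ndarray) -> None:
                      self.xyz = xyz

              class Transform(Generic[Src, Dst]):
                  """A matrix mapping Src coordinates to Dst coordinates."""
                  def __init__(self, m: np.ndarray) -> None:
                      self.m = m

                  def apply(self, p: Point[Src]) -> Point[Dst]:
                      return Point(self.m @ p.xyz)

              world_to_cam: Transform[World, Camera] = Transform(np.eye(3))
              p_world: Point[World] = Point(np.array([1.0, 2.0, 3.0]))
              p_cam = world_to_cam.apply(p_world)   # ok: a Point[Camera]
              # world_to_cam.apply(p_cam)           # type error: expects Point[World]
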
        • em500 1824 days ago
          In pandas you could/should use DataFrame index and column labels that way; it automatically aligns by column and index names before arithmetic, unlike raw numpy arrays, which align by position. It takes more discipline and some performance hit, though.
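
          For example (toy data), label-based alignment catches a reordered column that positional numpy arithmetic would silently get wrong:

              import pandas as pd

              train = pd.DataFrame({"age": [30, 40], "income": [50, 60]})
              live = pd.DataFrame({"income": [55, 65], "age": [31, 41]})  # reordered

              diff = live - train                    # aligns on column names: correct
              diff_np = live.values - train.values   # positional: mixes age/income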

          You could even do it with SQL field (column) names and natural joins. The performance hit is bigger still, and so is the discipline/retraining required, as few SQL users are used to thinking about column names (rather than types) as type definitions.

          Further, this would only catch a class of bugs that are relatively easy to spot manually. The difficult case is when the processes that generated the numbers have slowly changed in time (e.g., you have a very different mix of visitors/customers now than during model training).

        • nabla9 1824 days ago
          You would like to have a 'unit system' that keeps track of units of measure in general, and the ability to use concepts like dimensional homogeneity to avoid semantic errors while programming. Alas, such systems don't exist as far as I know.
          • alkonaut 1824 days ago
            F# has units of measure. A frontend is (usually) not F#, but even for a more loosely typed system such as your typical web frontend, it should be possible to represent scalars as tuples of a number and an identifier, e.g. { qty: 12, unit: "kgs" }.
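
            In a dynamic language the same idea becomes a runtime check; a rough Python sketch (hypothetical Quantity type, not a real library):

                from dataclasses import dataclass

                @dataclass(frozen=True)
                class Quantity:
                    value: float
                    unit: str   # e.g. "kg", "m", "s"

                    def __add__(self, other):
                        if self.unit != other.unit:
                            raise ValueError(f"cannot add {self.unit} to {other.unit}")
                        return Quantity(self.value + other.value, self.unit)
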
          • AKrumbach 1824 days ago
            A search on Google turns up this Common Lisp code, which appears to define a number of SI units (both base and derived) in the sort of manner described.

            http://www-ksl.stanford.edu/knowledge-sharing/lib/unit-conve...

            Given that this appears to define some sort of class-based hierarchy, with the right type requirements it should perform exactly the kind of unit/type checking the grandparent post wanted. I didn't go digging into the rest of the project files to confirm that things are defined correctly for that to work, though.

          • alehander42 1824 days ago
            Nim also has it
        • dual_basis 1824 days ago
          There is a namedtensor library that tries to fix this.
    • ayazhan 1823 days ago
      Thanks for your feedback ohazi! You are right that “No system should ever fail silently” and there are many ways to address this. However, we’ve seen cases where not enough attention is paid to input data pipelines, and similar mistakes are being made by new and even well-funded AI startups. One of the reasons this is happening is that some of these startups are launched by people with little experience in machine learning / data science. For example, we’ve seen doctors, web developers, lawyers and accountants moving into the AI space. They can learn the basics of machine learning and maybe hire someone to help them, but often don’t know about the best practices or things to watch out for when building a data-driven company. Unfortunately, there are few guidelines or best practices available online. With this article, we were hoping to help these new AI startups by bringing awareness to the importance of data pipelines.
    • bartimus 1824 days ago
      Good point. Or perhaps it was that nobody had ownership over the recommendation system. There should've been a performance indicator on a dashboard somewhere.
      • jimmyechan 1824 days ago
        Yes! It could actually be both. Ownership over the recommendation system needs to include not just designing it and training it but also making sure that the upstream data that feeds the model remains good. And I agree that a performance indicator is a potential solution; it just needs to be able to catch these kinds of problems.
  • luckyt 1824 days ago
    There are definitely a lot of opportunities for technical debt in machine learning projects that don't exist in usual software development, which makes careful design decisions even more important. Reminds me of this paper, which talks about these issues and ways to avoid them: https://research.google.com/pubs/archive/43146.pdf
    • jimmyechan 1823 days ago
      That's true! Thanks for sharing the paper. We'll take a look.
  • jimmyechan 1824 days ago
    Hey HN, this is an article we wrote about things to watch out for as you develop your machine learning models and deploy them in production.

    We realize that it's really important for data science, product management and engineering teams to discuss and ideally monitor anything new in, or any changes to, the data capture and processing that feeds into a machine learning model.

    • ayazhan 1824 days ago
      In this article, we showed a scenario where a small change to the front-facing interface could dramatically reduce the accuracy/performance of a machine learning model powering the application.
  • distant_hat 1824 days ago
    This has nothing to do with startups or established companies. Any time the distribution of your data changes, your models need to be retrained. The model can degrade even if the input data improves; e.g., if your geolocation feed had a high error rate before but has suddenly gotten much better, you need to retrain the models.
  • j0057 1824 days ago
    The domain is flagged as malicious by ESET:

    https://www.virustotal.com/#/url/e4accf1e046c8266168b9038763...

  • nitrogen 1824 days ago
    Is it common for an ML model to be designed to make product recommendations based on name and email, as in the example? That seems... problematic.
    • nostrademons 1824 days ago
      I figured that it was just for illustration, because the author couldn't think of a better example. Some real-life examples that turn up stupidly often:

      1. The model uses click-through data as an input. Your frontend engineer moves the UI element being clicked upon to a different portion of the page for a certain category of results. This changes the baseline click-through rate. The model assumed this feature had a constant baseline across all results, so the new feature value now needs to be scaled to account for the different user behavior. Nobody thinks to do this.

      2. The frontend engineer removes a seemingly-wasted HTTP fetch to reduce latency. This fetch was actually being used to calibrate latency across different datacenters, and was a crucial input to a data pipeline to a system of servers (feeding the ML model) that the frontend team didn't control and wasn't aware of.

      3. The frontend engineer accidentally triggers a browser bug in IE7 (gimme a break, it was 9 years ago) that prevents clicks from registering when RTL text is mixed with LTR. Click-through rates decline precipitously in Arabic-speaking nations. This is interpreted by an ML model as all results being poorly performing in Arabic countries, so it promptly starts cycling through results, killing ones that had shown up before with no clicks.

      4. A fiber cable is cut across the Pacific. This results in high latency for all Chinese users, which makes them abandon their sessions. An ML model interprets this as Chinese people being less interested in the news headlines of that day.

      5. An ML model for detecting abusive traffic uses spikes in the volume of searches for any single query over short periods of time as a signal. Michael Jackson dies. The model flags everyone searching for him as a bot.

      6. An ML model for search suggestions uses follow-up queries as a signal. The NYTimes crossword puzzle comes out. Everybody goes down the list of clues and Googles them. Suddenly, [houston baseball player] suggests [bird sound] as related.

      • ayazhan 1824 days ago
        Thanks nostrademons, these are great examples. You're right, name and email are just for illustration. Would you mind if we use your feedback and some of your examples to improve the article? If yes, should we credit your HN account?
        • nostrademons 1823 days ago
          I'd actually rather that you keep them general (e.g. just talk about clickthrough data or changing latency conditions) and don't credit my account. The past employer in question is relatively easy to look up from my past comment history, and while there's nothing really confidential in the examples, stories about how they do things or how things go wrong tend to blow up in the news, and they like the publicity only when it's positive.
          • ayazhan 1823 days ago
            Ok, sounds good. We'll keep it generic and won't mention the source. Thank you for sharing! We think this is something AI companies can benefit from in the future.
      • massaman_yams 1823 days ago
        Good examples, but re: #1, do a lot of places really deploy models with a static distribution? It should be relatively trivial to calculate this directly from the data within most ML libraries/systems - using a static distribution seems like such an obvious novice mistake.
        • nostrademons 1823 days ago
          It's not that the distribution is static, it's that the distribution is computed once when the model is built and then becomes outdated if the UI is changed as the model is being used. Many places have different timelines for updating machine-learned models vs. deploying new frontend code; anywhere from "a month" to "we'll rebuild it manually when we need to" is typical for the former, while good engineering shops do weekly, biweekly, daily, or sometimes even hourly for frontend changes.
          • massaman_yams 1823 days ago
            Ah, I suppose I come from a world where most models are automatically rebuilt at least daily, and sometimes as often as a few times an hour.
  • PeterisP 1823 days ago
    It's not really specific to machine learning - everything in the big corporate pre-ML reporting, data analytics, business intelligence and management information systems domain (which is a mainstream field of IT systems with decades of history and lots of accumulated experience) has the same issues. Any pipeline of business data collection and analysis tends to be reliant on lots and lots of factors external to that system, and dependent on the particular details of every business process involved.

    It's just well-known things being rediscovered because people are treating this as a new field - but everything mentioned in this article would be the same for a company tracking some other sales-efficiency metric twenty years ago, except that the "change in front-end framework" would be a "change in preorder sales reporting templates", causing very similar problems in your data analysis.

  • eoinmurray92 1824 days ago
    Sanau seems like a new version of Sagemaker https://aws.amazon.com/sagemaker/ where you write code in Jupyter and it auto-converts it to endpoints.

    I've used these solutions, but while Jupyter is amazing (my startup https://kyso.io started as a way to share notebooks), I'm not sure deducing which cells to convert into an endpoint is the way to go - especially since you will also need to host model files and extra data.

    • tixocloud 1824 days ago
      Your startup looks quite interesting and I can actually see potential for commercial usage. How's the growth rate been?
  • massaman_yams 1823 days ago
    It's not just data format changes; model accuracy can be impacted by changes in the distributions of values, even if their types remain the same.

    That's why production ML systems should monitor for data drift, model accuracy, and a host of other factors that may not be obvious at first. That's part of what TensorFlow Extended does.
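
    A toy version of such a drift check (just the idea, not TFX's actual API):

        from scipy.stats import ks_2samp

        def check_drift(train_values, live_values, alpha=0.01):
            """Two-sample KS test: flag a feature whose live distribution
            looks significantly different from its training distribution."""
            stat, p_value = ks_2samp(train_values, live_values)
            if p_value < alpha:
                print(f"possible drift: KS stat={stat:.3f}, p={p_value:.4f}")
            return p_value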

    See also: "What’s your ML Test Score? A rubric for ML production systems" https://ai.google/research/pubs/pub45742

  • oli5679 1824 days ago
    When you deploy a machine learning model, you need to monitor its performance in production.

    If the model's ROC-AUC falls by 0.1, the mean/s.d. of one of its inputs changes by more than 50%, the number of NAs for an input increases suddenly, or the monitoring report itself dies, then the model owner should get an alert quickly.
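
    A minimal sketch of that kind of check (made-up column names and thresholds; the "report dies" case is covered by alerting when this job itself stops running):

        import numpy as np

        def input_alerts(train_stats, live_batch, auc_train, auc_live):
            """Return a list of alert messages; an empty list means all checks passed."""
            alerts = []
            if auc_train - auc_live > 0.1:
                alerts.append("ROC-AUC dropped by more than 0.1")
            for col, (mu, sd) in train_stats.items():
                x = np.asarray(live_batch[col], dtype=float)
                if abs(np.nanmean(x) - mu) > 0.5 * abs(mu):
                    alerts.append(f"{col}: mean shifted by more than 50%")
                if abs(np.nanstd(x) - sd) > 0.5 * sd:
                    alerts.append(f"{col}: s.d. shifted by more than 50%")
                if np.isnan(x).mean() > 0.05:   # arbitrary NA-rate threshold
                    alerts.append(f"{col}: NA rate above 5%")
            return alerts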

  • tumanian 1824 days ago
    “Data pipelines die quietly” is a more appropriate name for the article based on the example. And it's true, all data pipelines need counters monitored continuously. A simple <number of records that didn't parse> metric on Grafana would prevent this error.
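
    Something as small as this (Python Prometheus client, made-up metric name) would make that failure visible on a dashboard:

        import json
        from prometheus_client import Counter

        records_unparsed = Counter(
            "records_failed_to_parse_total",
            "Input records the pipeline could not parse")

        def parse_record(raw):
            try:
                return json.loads(raw)
            except ValueError:
                records_unparsed.inc()   # graph / alert on this counter in Grafana
                return None
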
  • anotheryou 1824 days ago
    Sounds like you should manage your metrics and refine your model using false detections/classifications (which you should try to catch and measure).
  • debaserab2 1824 days ago
    Garbage in, garbage out. This has nothing to do with ML models.
  • leowoo91 1824 days ago
    That article could be a good example of modern spam.
  • BloodyLobster 1824 days ago
    Boring story. Can't believe it's one of today's hottest.