• Eridrus 252 days ago

    I am skeptical of anomaly detection since in my experience anomalies are common and diverse and don't actually matter, so I expect these systems to basically inundate people with false positives.

    Their offline training accuracy is garbage: 16% precision, so all of the real work is basically being done in the online training portion, which gets it to a respectable 82%+ precision.

    But they don't tell you how many alerts they had to label to get those numbers. Maybe over the long run you get those numbers, but you really want to know if it takes 10 or 10,000 examples to get there.

    Also, their dataset distribution is very different to reality: they have 7% of their dataset annotated as real anomalies; I don't think anyone in the real world wants 5% of their log entries to get flagged as anomalies. So I expect their precision numbers to be far worse on more realistically distributed logs.
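
    A back-of-the-envelope sketch of that last point, assuming the detector's true- and false-positive rates stay fixed while only the share of real anomalies changes. That assumption and the 1% and 0.1% prevalences are illustrative and not from the paper; the 7% and 82% figures are the ones quoted above:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)


# Figures quoted in the comment: 82% precision on a dataset with 7% anomalies.
REPORTED_PRECISION = 0.82
REPORTED_PREVALENCE = 0.07

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
# The likelihood ratio (TPR / FPR) is assumed constant across datasets.
likelihood_ratio = odds(REPORTED_PRECISION) / odds(REPORTED_PREVALENCE)


def precision_at(prevalence):
    """Precision the same detector would show at a different anomaly rate."""
    return prob(odds(prevalence) * likelihood_ratio)


for prevalence in (0.07, 0.01, 0.001):  # the last two are hypothetical
    print(f"{prevalence:.1%} anomalous -> precision {precision_at(prevalence):.1%}")
```

    Under that assumption the 82% precision at 7% prevalence falls to roughly 38% at 1% and to under 6% at 0.1%.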

    • jstarfish 252 days ago

      It's a good time to make some money baffling people with bullshit in the cybersecurity space.

      Of course if you let an ML-powered "anomaly detection" engine run rampant on your logs, it's going to find anomalies...just like if you hire a ghost hunter, you'll be informed that your house is haunted. In the end, ghost chasing is all this anomaly nonsense turns out to be-- the justifications for conclusions by ML practitioners and ghost hunters alike tend to be equally mumbly and hand-wavy.

      Me working from home is technically an anomaly, and one these systems are all too eager to flag. We get random logins from overseas VPSes-- it's an anomaly! Oh, wait, no, we onboarded a client application. Oh, look, a random login from China for a US-based employee with no history of foreign logins! Yeah, that guy just started in a new position with travel requirements. Hey, this IP just tried to log into 5000 user accounts! Congratulations, you just alerted me to the existence of carrier NAT.

      None of this saves any time and usually wastes it, since it stirs up paranoia where none was otherwise warranted. It's a fun toy that gives the appearance of being productive when all it's actually doing is generating literally endless busywork. Good for justifying your SOC budget I suppose.

      But in the end nobody wants to pay a quarter-million dollars for a black box that just sits there quietly-- if it's not constantly drawing attention to itself and all the badness it's pretending to find, you're not going to have any reason to renew the license.

      "Renew it? Why? This thing didn't find anything at all last year."

      • Eridrus 252 days ago

        Oh I know, I spent a decade in Security and work in ML now, and I can see how badly people want to put the two together, but it's basically 90% bullshit and 10% same old shit of varying effectiveness.

        • noir-york 252 days ago

          So what does your organisation use for intrusion detection? Humans eyeballing logs doesn't scale. Rule-based approaches?

            • russh 252 days ago

              Mostly user complaints...

          • stephengillie 252 days ago

            At a previous tech support position, I collaborated with a data scientist to create a predictive alert system based on system notification data. It would monitor the quantity of noise from each interface on the network and alert on anomalies in Slack. The only issue was that it didn't work - we saw only false positive noise, and it sat quiet during actual incidents. It would be interesting to see another team's attempts, and what different design choices they make.

            • asavinov 252 days ago

              Another problem with anomaly detection systems is that they do not provide any (domain-specific) explanation for why they consider something an anomaly. They also do not say what to do about it, which means such anomalies are not actionable findings. Therefore I think anomaly detection should be used as a pre-processing step that generates input for other components of the system.

              • dimitry12 252 days ago

                Are they doing supervised training for anomalies?

              • pilooch 252 days ago

                I do, with others, a lot of ML anomaly detection in the cyber security context. DeepLog has interesting ideas, especially encoding the logs via an LSTM. The work was presented at a workshop at NIPS 2017.

                One of the interesting facts we've been able to measure empirically over the past few years is that the magnitude of a statistical anomaly score, measured as reconstruction error, is uncorrelated with the criticality of the anomaly in terms of security / threat.

                This means that in practice SOC operators need to label on top of the anomaly detection and a supervised model can do the reranking after a while.

                • thaumaturgy 252 days ago

                  This is an interesting paper, but it sort of sidesteps one of the harder problems in generalized machine learning for log analysis:

                  > As shown by several prior work [9, 22, 39, 42, 45], an effective methodology is to extract a “log key” (also known as “message type”) from each log entry. The log key of a log entry e refers to the string constant k from the print statement in the source code which printed e during the execution of that code.

                  So if you're looking for a way to apply this to log data that varies wildly, like site access logs, you still have the difficult problem of converting the URIs to the numeric vectors needed by ML algorithms without losing the significant parts of the input.
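
                  One rough way to approximate log keys for access logs is to mask the variable parts of each URI and keep only its shape. The sketch below is a heuristic, not the paper's log parser; the masking rules and the `uri_key` and `encode` helpers are hypothetical and would need tuning per site:

```python
import re

# Hypothetical masking rules, applied in order. Real traffic would need
# rules tuned to the site's own URL scheme.
MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b[0-9a-f]{16,}\b", re.I), "<hex>"),
    (re.compile(r"\d+"), "<num>"),
]


def uri_key(uri):
    """Reduce a URI to a template: masked path plus sorted parameter names."""
    path, _, query = uri.partition("?")
    for pattern, token in MASKS:
        path = pattern.sub(token, path)
    # Keep the parameter names and drop their values.
    names = sorted(pair.split("=", 1)[0] for pair in query.split("&") if pair)
    return path + ("?" + "&".join(names) if names else "")


def encode(uris):
    """Map each URI to the integer id of its template, building the vocabulary as it goes."""
    vocab = {}
    ids = [vocab.setdefault(uri_key(u), len(vocab)) for u in uris]
    return ids, vocab


ids, vocab = encode(["/users/1", "/users/2", "/login?next=/home"])
print(ids, vocab)
```

                  This deliberately throws information away (every numeric id collapses to `<num>`), so deciding which parts of the input are significant is still left to whoever writes the rules.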

                  • asavinov 252 days ago

                    Here is another generic approach to anomaly detection from event data which has been used for analyzing logs received from automatic lawn mowers:


                    It allows for using different algorithms like one class SVM or MDS (including custom algorithms). It also allows for defining custom domain specific features as integral part of its analysis engine. In particular, for log analysis, frequencies of various event types have been generated.

                    • kthielen 252 days ago

                      Once you discard type structure, it's a fruitless task to try to reconstruct it.

                      It's much easier to make sense of logs when we don't discard that type structure.


                      • StreamBright 250 days ago

                        Would you mind explaining what is the primary use of Hobbes?

                      • corneliu_p 252 days ago
                        • lindig 252 days ago

                          The authors are using their own Spell[1] tool to parse syslog files into patterns that represent the fixed part of a printf-like log statement. Is the source of that available? At the heart of this is a tree-based construction that is not well explained.

                          [1] https://www.cs.utah.edu/~lifeifei/papers/spell.pdf

                          • boltzmannbrain 252 days ago

                            Would be interested to see the results on a benchmark dataset for online anomaly detection, comparing to those approaches used in practice: https://github.com/numenta/NAB#the-numenta-anomaly-benchmark...

                            • ram_rar 252 days ago

                              Has anyone running real production systems benefited from anomaly detection on logs? I have usually converted some of the important events in logs to metrics and alerted users based on simple moving averages, spikes, etc. I have usually started with alerts from system-level metrics and then checked the logs. Applying anomaly detection to logs directly hasn't worked for me yet.
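
                              A minimal sketch of the moving-average style of alert described above, assuming the events have already been counted per time bucket. The window size, threshold, and the floor on the standard deviation are hypothetical choices, not recommendations:

```python
from collections import deque
from statistics import mean, stdev


def spike_alerts(counts, window=5, threshold=3.0, min_sigma=1.0):
    """Yield (index, count) for buckets far above the preceding window's mean.

    A bucket alerts when count > mean + threshold * stdev of the previous
    `window` buckets. `min_sigma` floors the stdev so that a perfectly flat
    history does not turn every small wobble into an alert.
    """
    history = deque(maxlen=window)
    for i, count in enumerate(counts):
        if len(history) == window:
            mu = mean(history)
            sigma = max(stdev(history), min_sigma)
            if count > mu + threshold * sigma:
                yield i, count
        history.append(count)


# Made-up per-minute counts of some event, e.g. failed logins.
per_minute = [4, 5, 6, 5, 4, 5, 40, 5, 6]
print(list(spike_alerts(per_minute)))
```

                              Note that the spike itself enters the window afterwards and inflates the baseline for the next few buckets, which is one reason these simple rules miss closely spaced incidents.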

                              • gesman 252 days ago

                                Oh yes.

                                Applying K-Means clustering across different features of online traffic always shows some weird and often malicious stuff:


                                • slv77 249 days ago

                                  Care to share more about what kinds of features you cluster on?

                              • bhnmmhmd 252 days ago

                                I was wondering, has anyone here applied cluster analysis techniques for anomaly detection?

                                I read a paper that used it for insurance fraud detection, but I don't know which other fields use clustering to detect fraud and abnormalities.

                                I'd be grateful if someone can help.

                                • gesman 252 days ago

                                  Yes, tons of that.

                                  See this - using K-Means clustering for anomaly detection in web traffic:


                                  Using DBSCAN clustering for anomaly detection in healthcare claims data (detecting doctors who were anomalously prescribing opioids), based on the public CMS data set from 2015.

                                  4 out of the 8 top anomalies (doctors) were later convicted of crimes or ran into all sorts of trouble with the DOJ:



                                  (Splunk Enterprise + free apps were used to ingest data and build all this logic and dashboards)
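
                                  For anyone who wants to see the clustering idea outside Splunk, here is a rough pure-Python sketch of DBSCAN where the points left as noise are treated as the anomalies. It is not the setup described above; the two-feature data, `eps`, and `min_pts` are all made up for illustration:

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical two-feature records: two dense groups and one stray point.
data = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]
labels = dbscan(data, eps=0.5, min_pts=3)
anomalies = [p for p, label in zip(data, labels) if label == -1]
print(labels, anomalies)
```

                                  The results depend heavily on feature scaling and on `eps`, so whatever comes out as noise is a lead to investigate rather than a finding.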

                                  • bhnmmhmd 252 days ago

                                    Thank you so much, it really was helpful.

                                • cphoover 252 days ago

                                  Is there a github for DeepLog?

                                  • mino 252 days ago

                                    I had contacted the first author in March and the answer was that "our source code is currently not available because of a pending patent application".

                                      • cphoover 252 days ago

                                        that's lame...

                                    • sscarduzio 252 days ago

                                      Elastic.co X-Pack has machine learning for log anomalies and people buy and use that stuff. Does anybody have direct experience with that?

                                      • dimitry12 252 days ago

                                        I don't but I was researching the space and https://www.anodot.com/ has the most feature-rich product - though they only discover anomalies in numeric time-series.

                                        • ygur 251 days ago

                                          Check out www.loomsystems.com for a spot on AI log analysis

                                      • matachuan 252 days ago

                                        Pure trash

                                        • dang 252 days ago

                                          This breaks the HN guidelines, which ask you not to post shallow dismissals. Better options would be either to factually explain what the problems are, so that people can learn something, or not to post.