I am skeptical of anomaly detection since in my experience anomalies are common and diverse and don't actually matter, so I expect these systems to basically inundate people with false positives.
Their offline training accuracy is garbage: 16% precision, so all of the real work is basically being done in the online training portion, which gets it to a respectable 82%+ precision.
But they don't tell you how many alerts they had to label to get those numbers. Maybe over the long run you get those numbers, but you really want to know if it takes 10 or 10,000 examples to get there.
Also, their dataset distribution is very different to reality: they have 7% of their dataset annotated as real anomalies; I don't think anyone in the real world wants 5% of their log entries to get flagged as anomalies. So I expect their precision numbers to be far worse on more realistically distributed logs.
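To put rough numbers on the base-rate concern, here is a back-of-the-envelope sketch. It takes the 7% anomaly rate and 82% precision quoted above, assumes a recall of 0.9 (my own made-up figure, not from the paper), solves for the false-positive rate that would produce that precision, and then holds the detector fixed while the anomaly rate shrinks. The function names are hypothetical.

```python
def precision(prevalence, recall, false_positive_rate):
    """Precision implied by Bayes' rule for a given anomaly base rate."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def implied_fpr(prevalence, recall, target_precision):
    """False-positive rate that yields target_precision at this prevalence."""
    true_pos = prevalence * recall
    false_pos = true_pos * (1.0 - target_precision) / target_precision
    return false_pos / (1.0 - prevalence)


# Assumption: RECALL is invented for illustration. The 7% and 82% figures
# are the ones quoted in the comment above.
RECALL = 0.9
fpr = implied_fpr(prevalence=0.07, recall=RECALL, target_precision=0.82)

# Same detector (same recall and false-positive rate), rarer anomalies.
for base_rate in (0.07, 0.01, 0.001, 0.0001):
    p = precision(base_rate, RECALL, fpr)
    print(f"anomaly rate {base_rate:.2%}: precision {p:.1%}")
```

Under these assumptions precision falls quickly as anomalies get rarer, because the false positives scale with the volume of normal entries while the true positives scale with the shrinking anomaly rate.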
It's a good time to make some money baffling people with bullshit in the cybersecurity space.
Of course if you let an ML-powered "anomaly detection" engine run rampant on your logs, it's going to find anomalies...just like if you hire a ghost hunter, you'll be informed that your house is haunted. In the end, ghost chasing is all this anomaly nonsense turns out to be-- the justifications for conclusions by ML practitioners and ghost hunters alike tend to be equally mumbly and hand-wavy.
Me working from home is technically an anomaly, and one these systems are all too eager to flag. We get random logins from overseas VPSes-- it's an anomaly! Oh, wait, no, we onboarded a client application. Oh, look, a random login from China for a US-based employee with no history of foreign logins! Yeah, that guy just started in a new position with travel requirements. Hey, this IP just tried to log into 5000 user accounts! Congratulations, you just alerted me to the existence of carrier NAT.
None of this saves any time and usually wastes it, since it stirs up paranoia where none was otherwise warranted. It's a fun toy that gives the appearance of being productive when all it's actually doing is generating literally endless busywork. Good for justifying your SOC budget I suppose.
But in the end nobody wants to pay a quarter-million dollars for a black box that just sits there quietly-- if it's not constantly drawing attention to itself and all the badness it's pretending to find, you're not going to have any reason to renew the license.
"Renew it? Why? This thing didn't find anything at all last year."
At a previous tech support position, I collaborated with a data scientist to create a predictive alert system based on system notification data. It would monitor the quantity of noise from each interface on the network and alert on anomalies in Slack. The only issue was that it didn't work - we saw only false positive noise, and it sat quiet during actual incidents. It would be interesting to see another team's attempts, and what different design choices they make.
Another problem with anomaly detection is that these systems do not provide any (domain-specific) explanation for why something is considered an anomaly. The system also does not say what to do in this situation, which means that such anomalies are not actionable findings. Therefore I think anomaly detection should be used as a pre-processing step which generates input for other components of the system.
I do, with others, a lot of ML anomaly detection in the cyber security context. DeepLog has interesting ideas, especially encoding the logs via an LSTM. The work was presented at a workshop at NIPS 2017.
One of the interesting facts we've been able to measure empirically over the past few years is that the magnitude of a statistical anomaly score, measured as reconstruction error, is uncorrelated with the criticality of the anomaly in terms of security / threat.
This means that in practice SOC operators need to label the output of the anomaly detection, and after a while a supervised model can do the reranking.
This is an interesting paper, but it sort of sidesteps one of the harder problems in generalized machine learning for log analysis:
> As shown by several prior work [9, 22, 39, 42, 45], an effective methodology is to extract a “log key” (also known as “message type”) from each log entry. The log key of a log entry e refers to the string constant k from the print statement in the source code which printed e during the execution of that code.
So if you're looking for a way to apply this to log data that varies wildly, like site access logs, you still have the difficult problem of converting the URIs to the numeric vectors needed by ML algorithms without losing the significant parts of the input.
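One common workaround, sketched below, is the hashing trick: split each URI into tokens, mask the parts that look like IDs, and hash the tokens into a fixed-width count vector. Everything here is an assumption for illustration (the token scheme, the ID regex, the 32-bucket width, the names `uri_tokens` and `hash_vector`), and it is not what the paper does.

```python
import re
import zlib

NUM_BUCKETS = 32  # assumed vector width; real systems would use far more


def uri_tokens(uri):
    """Split a URI into path-segment and query-key tokens, masking IDs."""
    path, _, query = uri.partition("?")
    tokens = []
    for segment in path.strip("/").split("/"):
        if not segment:
            continue
        # Treat purely numeric or long hex segments as opaque identifiers.
        if re.fullmatch(r"\d+|[0-9a-fA-F]{8,}", segment):
            segment = "<id>"
        tokens.append("path:" + segment.lower())
    for pair in query.split("&"):
        key = pair.partition("=")[0]
        if key:
            # Keep only the parameter name; the value is dropped.
            tokens.append("param:" + key.lower())
    return tokens


def hash_vector(uri, buckets=NUM_BUCKETS):
    """Map a URI to a fixed-length vector of hashed token counts."""
    vec = [0] * buckets
    for token in uri_tokens(uri):
        vec[zlib.crc32(token.encode("utf-8")) % buckets] += 1
    return vec


print(uri_tokens("/users/42/orders?page=3&sort=asc"))
print(hash_vector("/users/42/orders?page=3&sort=asc"))
```

This illustrates the trade-off rather than solving it: `/users/42/orders` and `/users/977/orders` collapse to the same vector, which is usually what you want, but query values are thrown away entirely, so an injection payload in a parameter value would be invisible to whatever model consumes these vectors.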
It allows for using different algorithms like one-class SVM or MDS (including custom algorithms). It also allows for defining custom domain-specific features as an integral part of its analysis engine. In particular, for log analysis, frequencies of various event types have been generated.
The authors are using their own Spell tool to parse syslog files into patterns that represent the fixed part of a printf-like log statement. Is the source of that available? At the heart of this is a tree-based construction that is not well explained.
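To make the idea of a "pattern" concrete, here is a crude stand-in. This is explicitly not Spell and has nothing to do with its tree-based construction; it is just the regex-masking baseline, where anything that looks like a variable (IP address, hex literal, integer) is replaced with a wildcard so that lines from the same print statement collapse to the same key. The patterns and the `log_key` name are my own assumptions.

```python
import re

# Assumed patterns for the variable parts of a message. Order matters:
# IP addresses must be masked before the plain-integer pattern runs.
VARIABLE_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4, optional port
    re.compile(r"\b0x[0-9a-fA-F]+\b"),                     # hex literals
    re.compile(r"\b\d+\b"),                                # plain integers
]


def log_key(message):
    """Replace variable-looking fields with <*> to approximate the log key."""
    for pattern in VARIABLE_PATTERNS:
        message = pattern.sub("<*>", message)
    return message


print(log_key("Accepted password for root from 10.0.0.5 port 22"))
print(log_key("Connection from 192.168.1.9 closed after 7 seconds"))
```

The obvious weakness is that variables which don't look like numbers (usernames, file paths, hostnames) stay in the key, which is presumably part of why a dedicated parser is used instead.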
Has anyone running real production systems benefited from anomaly detection on logs? I have usually converted some of the important events in logs to metrics and alerted users based on simple moving averages / spikes etc. I have usually started with alerts from system-level metrics and then checked the logs. Applying anomaly detection to logs directly hasn't worked for me yet.
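For anyone who hasn't done the metrics version, it can be as small as the sketch below: count an event per time bucket, then alert when a count exceeds the trailing mean by some number of standard deviations. The window size, the threshold, and the floor on the standard deviation are arbitrary choices for illustration, and `spike_alerts` is a hypothetical name.

```python
from collections import deque
from statistics import mean, pstdev


def spike_alerts(counts, window=5, threshold=3.0):
    """Yield (index, count) for counts that exceed the trailing mean of the
    previous `window` points by `threshold` standard deviations."""
    history = deque(maxlen=window)
    for i, count in enumerate(counts):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Floor sigma at 1.0 so a nearly flat history doesn't alert
            # on tiny fluctuations.
            if count > mu + threshold * max(sigma, 1.0):
                yield i, count
        history.append(count)


# Per-minute counts of some event extracted from the logs (made-up data).
errors_per_minute = [4, 5, 6, 5, 4, 5, 40, 5, 6]
print(list(spike_alerts(errors_per_minute)))
```

Note that the spike itself enters the trailing window, so the threshold stays inflated for the next few points. That is sometimes desirable (it suppresses repeat alerts) and sometimes not.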
This breaks the HN guidelines, which ask you not to post shallow dismissals. Better options would be either to factually explain what the problems are, so that people can learn something, or not to post.