I've long been intrigued by pure FP, advanced type systems and proof assistants from an academic point of view, but I haven't yet been convinced that the effort involved in learning such systems/methods pays off in most cases (even setting aside the costs that remain once you already know the methods well). I will still try to learn more about this, but without any strong conviction that it needs to be immediately useful.
By contrast, it's only more recently that I've started hearing more and more about model-checking tools such as TLA+. The case studies in this blog post give good motivation for why such systems might be useful (as well as honestly presenting drawbacks, something that advocates of e.g. PFP rarely do). I would love to use something like that one day.
As an aside, I still think it's a disgrace how irrelevant empirical research is to most aspects of today's software development. Another blog post by the author cites a paper claiming that very simple testing, especially of error conditions, could prevent a huge chunk of production bugs, but things like that are rarely talked about. The whole profession is unfortunately amateurish and always obsessed with fads and trends instead of taking a more analytic and empirical approach, which makes me sad.
The Redex modeling tool is an interesting point in this space. Its specialty is making executable specifications of domain-specific languages. Redex models can be tested with the unit-testing framework of the Racket dialect of Scheme. There's plenty of excellent documentation on Redex, and it has some neat visualization tools. I've gotten plenty of value from using it for my research over the years, but I suspect it'd be useful outside of academic research too.
This article talks about design specification and its model-checking; seL4 used a completely different formal method.
Formal Methods (for software assurance) is the name of a research discipline studying the following methods, as well as programming language semantics:
* Formal specification of either code or design
* Model checking
* Sound static analysis
* Machine-checked/assisted deductive proof systems and dependent type systems
* Concolic testing and automatic test generation
* Program synthesis
Some of the verification methods above (like some model checkers and sound static analyzers) can be used without any explicit specification whatsoever (e.g. they're used to verify the absence of undefined behavior in C programs). The name "Formal Methods" covers such a wide range of tools and techniques, with widely varying goals, costs and appropriate domains, that it doesn't really mean much to talk of "formal methods" as if it were one specific thing. Rather, it is a general approach used in constructing widely differing tools and techniques. That's why it's called formal methods, not a formal method.
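To make the "no explicit specification" point concrete, here's a toy sketch (my own illustration in Python, not any real analyzer) of a sound check that flags possible division by zero using nothing but the code itself, by tracking signs abstractly:

```python
# Toy sound static analysis over a tiny expression language: warn about any
# division whose denominator *might* be zero. No user-written spec needed;
# the property ("no division by zero") is baked into the analyzer.
NEG, ZERO, POS, ANY = "neg", "zero", "pos", "any"

def sign_of(n):
    return NEG if n < 0 else ZERO if n == 0 else POS

def analyze(expr, env, warnings):
    """expr: ('const', n) | ('var', name) | ('div', e1, e2)"""
    kind = expr[0]
    if kind == "const":
        return sign_of(expr[1])
    if kind == "var":
        return env.get(expr[1], ANY)   # unknown input: could be anything
    if kind == "div":
        analyze(expr[1], env, warnings)
        den = analyze(expr[2], env, warnings)
        if den in (ZERO, ANY):         # sound: warn whenever zero is possible
            warnings.append("possible division by zero")
        return ANY                     # quotient sign not tracked precisely
    raise ValueError(f"unknown node: {kind}")

warnings = []
analyze(("div", ("var", "x"), ("const", 3)), {}, warnings)  # provably safe
analyze(("div", ("const", 3), ("var", "y")), {}, warnings)  # y might be zero
```

Because it over-approximates (an unknown variable "might" be zero), it can raise false alarms but never misses a real division by zero, which is the defining trade-off of sound static analysis.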
I assume you mean "full" code verification in some sense. Partial verification, especially of properties that are "local" and where verification can be done in a way that's "based" on the language syntax itself, is relatively common. This kind of verification is often called "type checking" and is often not even thought of as a means of code verification.
One can of course object that requiring the use of a language where verification is "baked in" like this might be undesirable; but either way, it seems clear even to formal-methods proponents that well-chosen language semantics are going to be very important if code verification is to be feasible. So it's six of one, half a dozen of the other.
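A toy example of that "local, syntax-directed" flavor of verification (my own sketch, not from the article): a checker that never runs the program, only walks its syntax tree, and rejects ill-typed programs before execution:

```python
# Minimal syntax-directed type checker for a tiny expression language.
# It verifies a property (no adding strings to ints) by looking only at
# the syntax, which is why this kind of verification stays cheap and local.

def typecheck(expr):
    kind = expr[0]
    if kind == "int":
        return "Int"
    if kind == "str":
        return "Str"
    if kind == "add":
        left, right = typecheck(expr[1]), typecheck(expr[2])
        if left == right == "Int":
            return "Int"
        raise TypeError(f"cannot add {left} and {right}")
    raise ValueError(f"unknown node: {kind}")

ok = typecheck(("add", ("int", 1), ("int", 2)))  # accepted, type "Int"
```

Any program the checker accepts can't hit that class of error at runtime. That's code verification, even if nobody thinks to call it that.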
I would suspect that (subsets of) Ada, Rust, or maybe OCaml would be more fitting while staying on the industrial side. (Haskell or Idris would likely be even easier targets, but their industrial adoption is quite limited.)
The seL4 code didn't start out as C. First they prototyped the kernel in Haskell, derived the formal specification in Isabelle/HOL from that, and then hand-wrote the C and proved it corresponds. Verifying that code is correct is Just Plain Hard.
It probably depends on your trust assumptions and requirements. Ideally we'd write formally proven code for a formally proven compiler running on a formally proven OS/BIOS/hardware stack.
But some use cases require safe client code that will nevertheless run on unsafe OS and hardware. Client-side and some server-side cryptocurrency apps are one example that comes to mind. Probably banking apps too. It's still useful to write them in strongly typed languages amenable to formal verification, even though they'll be running on unverified Windows, Linux, MacOS and smartphones.
Also, we really need OSes, firmware, etc. written in languages amenable to formal verification, but given the massive lock-in of the mainstream options, that's a pipe dream for now. Still a worthy long-term goal, though.
I'm not sure that there are languages more or less amenable to full formal verification. Formal verification of a system is such a complex task that the language-level details are unlikely to be a significant proportion of the effort.
You can also look at existing formally verified software and probably confirm my assumption: there is likely as much formally verified C code (a weakly typed, memory-unsafe language) as formally verified Haskell code.
And I do not agree at all that a formally verified banking app is a good idea. It would likely be so difficult to do that by the time it is finished it no longer runs on any current version of any OS.
Formal verification is just too hard to be worth it. There are better methods of ensuring software correctness that are far less costly. Maybe something will be discovered in the future that will make it easier, but for now formal verification of any piece of software more complex than a few hundred lines of code is a multi-year research-level project.
The (famously robust) SQLite has 92 million lines of tests, for 139 thousand lines of code. How many person-years did that take to create?
And then there are the hundreds of software libraries I've seen throughout my career that I never got to use because they were just too buggy.
From where I stand, C doesn't scale to significantly-sized codebases. Nobody I know uses it except when they need to glue to some legacy C code that's been tested in production for the past few decades.
It's easy to say something is a lot of work, but in this case every one of the alternatives looks worse. I don't have the time or resources to write 1000x my program in test code, to debug it in production for 10 years, or to throw it all away because I never achieve a minimum usable level of quality. Formal methods aren't looking so bad to me right now!
Yup. This shows why we need languages that are more amenable to formal methods. Proving stuff requires you to write short code that is obviously correct.
For most programs, an 'informal' proof would already be a huge improvement. The effort you have to put into informal proofs is much lower, but the main drawback is that they are not checked, so you might make mistakes in the proof (on the other hand, you can make mistakes in the specification as well).
The holy grail would be formal methods that work mostly like informal proofs. With today's methods you would only prove the smallest, most critical parts of the code correct.
If you're going to write a formal spec in the first place, you can then "fuzz" the code against the spec, or parts of the spec against itself. I.e. try to find an automated refutation as opposed to a proof. Model checkers work according to this principle, and I assume that the combination of "informal" proof and automated model checking can be especially helpful for suitably-sized systems. Either one of these without the other seems a bit harder to take seriously.
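A minimal sketch of that refutation idea (my own toy in Python, not a real model checker): throw random inputs at an implementation, check each result against a declarative spec, and report any counterexample found:

```python
import random

def buggy_max(xs):
    # Intentional bug: starts from 0, so it's wrong for all-negative lists.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

def spec_holds(xs, result):
    # Declarative spec: the result is in xs and no element exceeds it.
    return result in xs and all(x <= result for x in xs)

def refute(impl, spec, trials=1000, seed=0):
    # Refutation, not proof: search for one input violating the spec.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 5))]
        if not spec(xs, impl(xs)):
            return xs          # counterexample found
    return None                # no refutation (which proves nothing)

counterexample = refute(buggy_max, spec_holds)
```

A model checker does the same job more cleverly: instead of sampling executions, it systematically enumerates the state space, so "no counterexample" actually means something within the model.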
I don't know why the idea that formal methods are some kind of unattainable utopian ideal continues to have currency. This line of thinking is exactly what Dijkstra prophesied a flawed CS education would lead us to: the belief that testing software is a viable way to ensure its correctness.
It's possible to use TLA+ for complex serial systems, and it can definitely help there. The thing is that TLA+ is designed to be good at modeling concurrent problems, problems that have a lot of specific challenges you don't see in serial systems. So if you know your system is serial and nondeterministic, then you're probably better off with something that focuses on that part of the design space.
To paraphrase Edmund Clarke's Turing lecture, the problems of verifying software include "floats, strings, user-defined types, procedures, concurrency, generics, storage, libraries…" If you're not worried about verifying concurrency in your tool, you've got more resources to devote to verifying the other things.
That's also a space I don't have as much familiarity with, so I don't know what the best tools are there. My vague understanding is that at that stage the problem "simplifies" enough that it's a lot easier to verify the code itself. Like how CompCert isn't just a model of a compiler, it's an actual correct compiler. So you can apparently work much closer to the code when verifying serial things.
At one point I got a bunch of people in FM to write provably correct versions of Leftpad. Maybe one of those would work best for you? Note also that one of the examples proves Java code correct via inline JML annotations.
Do you really mean "nondeterministic"? AIUI, what people call 'TLA' was designed by adding a bunch of modalities (you can think of these as "monads" if you like) for nondeterminism and "time" (state transitions) to standard propositional logic. The rest of it is really a matter of what's idiomatic, such as using the nondeterminism modality to account for spec refinement as well. So it would seem that TLA+ should be enough to deal with these cases.
I think he meant to write deterministic. The exact nature of the TLA logic is not the relevant point here. TLA (and therefore TLA+) relies extensively on nondeterminism for things like specifying code at arbitrary levels of detail (e.g. you can say: after this step, the list is sorted somehow, without specifying how), interaction with the environment (the user performs one of the possible operations, or some arbitrary machine fails), or concurrency in its programming sense (the OS will schedule one of these several operations next).
If you're interested, I've written a rather detailed explanation of the TLA+ logic. Part 1 is an introduction that explains the design from a UX perspective, part 2 is about the + side of the logic, and parts 3 and 4 are about TLA: https://pron.github.io/tlaplus
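To illustrate the scheduler-nondeterminism case in a model-checking style (a toy Python sketch of my own, not TLA+ itself): enumerate every interleaving of two non-atomic increments and check an invariant in each final state:

```python
# Model-checker style exploration: each "thread" is a sequence of atomic
# steps on shared state; we enumerate every interleaving (the scheduler's
# nondeterministic choices) and check an invariant at the end of each.

def interleavings(a, b):
    # All ways to merge two step sequences, preserving each one's order.
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def thread_steps(tid):
    # A non-atomic increment: read the shared counter, then write it back + 1.
    def read(state): state[tid] = state["n"]
    def write(state): state["n"] = state[tid] + 1
    return [read, write]

def check():
    violations = []
    for schedule in interleavings(thread_steps("t1"), thread_steps("t2")):
        state = {"n": 0}
        for step in schedule:
            step(state)
        if state["n"] != 2:   # invariant: both increments took effect
            violations.append(state["n"])
    return violations

violations = check()          # lost updates found by exhaustive exploration
```

Every violating schedule is one where both reads happen before either write, which is exactly the kind of lost-update race a checker like TLC surfaces from a spec rather than from luck in testing.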
Our mobile apps drive actual revenue and customer satisfaction, with budgets that are already hard to justify (high 7 and low to mid 8 figures), and they always live in a highly complex service and operational environment. If you added the cost of formal methods, nothing would ever get approved at all. The alternative of not doing formal methods and having stuff that mostly works, makes us money, and keeps customers reasonably happy is good enough (not CRUD apps; I'm not even sure wtf that is in a real business. Is Facebook a CRUD app? (Maybe a crap app, but still.))
I could see formal methods being possible for things that might kill people, but even there, spending all of your money to deliver perfect code vs. still being in business with something good enough is easy math for most CEOs.
I actually think the opposite is true. TLA+'s strength in a business context is precisely because it is not production code.
This considerably derisks TLA+: you cannot end up depending on TLA+, or, put another way, any deficiencies in TLA+ or its toolchain will not become production blockers for your team. You cannot lose more time than the time you spent writing your spec.
You can use TLA+ to specify as much or as little of your system as you want. You can upgrade your system independently of TLA+. You don't have to wait for TLA+ to support or integrate with your language.
I've said this elsewhere, but the closest competitor to TLA+ on most teams is decent high-level documentation, which is precisely what most teams lack.
Software tools, like many other things, suffer from an uncanny valley effect. If you're not tempted at all to depend on a tool to generate code for you, you know the limits of the tool and can be quite happy. If the tool integrates smoothly with the rest of your code, even better. However, if your tool integrates in a somewhat bumpy way with the rest of your code, then it becomes worse than either of the two previous alternatives.
The TLA+ community already has limited resources that I think can be better spent on making TLA+ an even better design language before trying to cross the much bigger and more arduous gap of trying to make TLA+ executable (a goal which I suspect most of the community would actually oppose anyways).
Our (sound) formal verification tools and knowledge are nowhere near anything that could do this affordably for programs of common sizes, and it's unclear they ever will be (the gap between the programs we can verify soundly and the size of the programs we write has stayed roughly constant for decades). The hottest approach these days in formal methods that operate directly on the code of real-world-sized programs is to reduce the soundness of verification, with approaches like concolic testing.
Humans don't do this affordably. Programs proven end-to-end correct using human provers and semi-automated proof assistants have required extreme simplification of the program so that it could be verified at all (so no clever efficient algorithms), teams of graduate-student proof-monkeys working for years, and even then the biggest programs we could prove that way were ~10KLOC. The greatest achievement of this semi-automated approach required years by highly-trained specialists to prove simplified programs that amount to 1/5th of jQuery.
We have some very interesting advances in formal verification, but that is probably not the way.
If you see my comment, it proposes using large infosys-style teams. You don't need graduates per se, just people who have an intuition for logic and math, of whom there are plenty. Plus you can prove isolated pieces, e.g. TCP/IP stacks, instead of end-to-end. It would not be as beneficial as full formal verification, but it is still a large step toward securing the most vulnerable targets such as IoT, industrial control systems, financial services, etc.
The problem is, first, that the training required for "proof monkeys" is higher than for programmers, so they'll be even more expensive (plus you need to know the specified C semantics to prove C code, something even many/most C programmers don't know), and second, that the proving effort is not something that can usually be done separately and after the fact. The program needs to be written in a way informed by the proving effort. Moreover, every time the program is changed, the amount of proof that needs to be rewritten could be large, especially if the programmers are unfamiliar with the proofs.
Hopefully, there are more efficient formal methods than the one that is currently the least efficient of them all.
Your program may be able to formally prove that the code corresponds to the TLA+ spec. It may be able to generate tests to show that the code does what the TLA+ spec says it should do. But a program can't prove that the TLA+ spec is correct. The most it could prove is that it is not self-inconsistent.
So now you move from debugging the production code to debugging the TLA+. That's still an improvement.
Unless you take the TLA+ code to be the definition of "correct". That's a possible position, but I don't adhere to it. It seems more reasonable to me to say that the spec is wrong if it doesn't correspond to what is actually needed, and the TLA+ is wrong if it doesn't accurately encode the spec. (For example, the MCAS software did exactly what the spec said. But the spec was wrong.)
But how do you unambiguously specify "what is needed"? I would think you need some high-level formal language for doing that.
So I wonder: instead of proving that some hand-written code corresponds to a specification, wouldn't it be better to think of the specification as a high-level program and then somehow translate it into some lower-level conventional programming language?
> But how do you unambiguously specify "what is needed"?
You don't. You follow the scientific method of building a model specification for "what is needed", throwing it into the world, and tweaking and improving it where you find that some of its properties are not a good fit for the problem it's intended to solve - primarily by gathering feedback from the people that are using it and finding its flaws.
That's not so much a business case as a contrived example. I absolutely would expect system testing and QA to find such a bug or the testing isn't up to scratch.
What you're proposing is that all engineers now effectively learn a new language and translate requirements into it. This expects requirements to be fully fleshed out themselves (a luxury few have these days) and relatively static (or everything needs to be reworked).
That's not what the article is claiming though, and in fact when the code is written is irrelevant to the business case. We're talking about spotting production issues before they arise.
The article makes the very strong claim that QA aren't going to catch the bug that arises from this design omission and it's going to cost your company money as everything goes wrong in the field.
I don't believe it would get that far IRL as any dev with a brain is going to spot it, and even so QA should certainly have test cases for this sort of situation if they're doing their job right.
You can argue this may be a better approach and catch potential problems earlier; have at it. You can argue that there are classes of error this will catch that normally you wouldn't (some of the preamble and larger-scale stuff does seem to show this). But the illustrated example doesn't hold up to scrutiny and is overstated. As such, I think it undermines the point the author is trying to make, because it looks to me like something we'd catch anyway.
> But the illustrated example doesn't hold up to scrutiny and is overstated.
The illustrated example is intended to be comparatively trivial. It's just clarifying what this kind of verification actually involves in practice. It makes no sense to describe it as "overstated", since the article actually mentions several case studies as demonstrating the usefulness of this approach; it's just that these cannot be examined in detail in an introductory article.
It's attempting, AFAICT, to demonstrate TLA+ catching a class of very dangerous problem that traditional approaches won't/can't catch. It fails and in doing so it undermines the business case the wider article attempts to make.
I think perhaps the example has just been simplified too far.
I think the argument is if you've been around distributed systems long enough you will encounter race conditions. Sure, it's ok to say "Well, the testing infrastructure isn't up to snuff, so we just need to fix it" but at scale this is impractical.
Check out the fallacies of distributed computing. If your testing system can simulate all of those edge cases, it probably looks a lot like TLA+.
I agree, once you've been around parallel and distributed computing for a while you do notice this stuff, which is why I think the example given in the article isn't a particularly good one - that's a really noticeable design omission, and I would expect it to be caught pretty early in any dev/QA process as the developer implements the functionality and thinks "Hey, what if... don't we need to invalidate the other offers?"
I'm sure there are good cases for using TLA+, I'm sure there are situations where it's not only useful for catching errors before they even happen, but in which this more than offsets the upfront costs of the exercise.
I guess I just came away from the article not feeling that such had been demonstrated, in fact I came away with the feeling that the example was contrived to fit the agenda and didn't actually show much.
I think it's a great example. It boils down to simple code whose behavior within a degraded network may not be obvious. Your dev/QA could even have the thought and actually test it, but have it work under ideal circumstances, giving them false confidence.
I'll go out on a limb and say it is nearly impossible to design and build a test environment that can simulate all network conditions, so that even in trivial cases where a dev might know for a fact that there is an issue, it'll be incredibly hard to reproduce it.
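A toy sketch of why that's so hard (a hypothetical model of my own, loosely inspired by the article's stale-data flavor of bug): a dropped cache-invalidation message causes a stale read, and sampled network behavior can easily miss what exhaustive enumeration of the network's behaviors finds immediately:

```python
import random

# Toy two-node system: node A updates a value and sends an invalidation to
# node B's cache. The (modeled) network may drop that invalidation.

def run(drop_invalidation):
    store = {"v": 1}
    cache_b = {"v": 1}
    store["v"] = 2                        # A writes the new value
    if not drop_invalidation:
        cache_b.pop("v", None)            # invalidation delivered to B
    return cache_b.get("v", store["v"])   # B reads, cache first

def random_testing(trials, drop_rate, seed=0):
    # A "realistic" test environment: sample network behavior at random.
    rng = random.Random(seed)
    return any(run(rng.random() < drop_rate) != 2 for _ in range(trials))

def exhaustive():
    # Model-checker style: try *every* network behavior, not a sample.
    return [drop for drop in (False, True) if run(drop) != 2]

found_by_enumeration = exhaustive()       # always finds the stale read
found_by_luck = random_testing(20, 0.01)  # usually misses a rare drop
```

With one boolean choice this is trivial, but the gap only widens as the number of drop/delay/reorder choices grows: the sampled test has to get lucky, while the enumeration doesn't.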
Maybe put another way, formal methods give cover to dev/QA to avoid shipping known but hard-to-prove buggy code. Bugs they will ultimately be held responsible for.
Keep in mind, there are different approaches to designing distributed systems. You can do pretty much everything without sharing or concurrently accessing any resources, and without relying on the Paxos family of algorithms, etc. It requires you to make sure that your algorithms converge, can be retried and so on, but that is local reasoning, easily testable, and it doesn't look like TLA+ even remotely, nor can it benefit from it. If you need to model 35 high-level steps, your system is definitely not designed that way; it is too complex, and even philosophically it is not in the camp of making distributed algorithms so simple that they can be easily reasoned about. I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.
> I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.
Agree to disagree. If your system is distributed it requires you to share data and act concurrently. Otherwise it isn't a distributed system. No amount of retries will cover all of the screwed up things that can happen on an unreliable network.
Sharing data the way you imply is an anti-pattern in distributed systems; you are probably just stuck in a shared-memory multithreading mindset. Instead we use specific consistency models. What I described is about models that don't assume data is up to date on every node, but rather operate under the assumption of stale data. And asynchronous networks fail relatively simply too; you are making it a bigger deal than it is. Well, they don't actually fail, but rather this is how they normally operate: they are just unreliable, so you have to assume that data sometimes won't be delivered to destinations in time, or at all. If this screws things up for you, then you are definitely doing it wrong. Even if you need low latency, multiple destinations and multiple independent paths is all there is to it, which, again, works well if you don't use consistency models that require consensus. It's like a different world from imperative algorithms with many steps.
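A toy sketch of the kind of design I mean (my own illustration): a state-based grow-only counter, where each node reasons only locally and replicas converge by merging, no matter how sync messages are reordered, duplicated or delayed:

```python
# Toy state-based grow-only counter (a classic CRDT): each node tracks its
# own contribution, and merge takes per-node maxima. Replicas converge
# without consensus, by purely local reasoning.

def fresh():
    return {}

def increment(state, node):
    s = dict(state)
    s[node] = s.get(node, 0) + 1
    return s

def merge(a, b):
    # Commutative, associative, idempotent: safe under reordering and replays.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas diverge, then sync in different orders, with a duplicate.
a = increment(increment(fresh(), "A"), "A")   # A counted 2
b = increment(fresh(), "B")                   # B counted 1
a_view = merge(a, b)
b_view = merge(merge(b, a), a)                # duplicate delivery is harmless
```

Because merge is commutative, associative and idempotent, "the network is unreliable" stops being a correctness concern and becomes purely a latency concern.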
The article is implicitly asking you to make a small imaginative leap – that the bug expressed is in fact subtle enough it would have never been caught otherwise – in the service of providing a concise, digestible example. It might have been better for the author to spell this out but I don't think it's really necessary.
I like how it emphasized the "debuggable" aspects of FMs.
Most people believe FMs are not resilient due to their coupling with the spec. What if the spec changes?
Saying that FMs are a debuggable form of specification, i.e. very close to debugging the design itself, turns that criticism on its head and shows that it is trivially the case that you can iterate on or prototype with FMs.
You're right in a limited sense, except that it's not any language; it's particular languages that can count as a Formal Method in the sense of the OP (i.e. can do model checking in some kind of temporal logic).
The problem is that explaining the upfront cost to 'the business people' is an issue. In many cases, budgets for dev are yearly, and that is kind of a horizon for most people (even some who should know better). So even if I say that your TCO is probably (which is a problem in itself, but probably statistically) going to be lower with formal methods than without, that assumption is spread over many years. There are enough systems that require vast refactoring every year because they were badly done in the first place, but that's the concern of each year's budget/manager, not of the one sitting in the current year.
That goes for verifying code as well; however, that is even harder to sell as it's so much more expensive, and even though I 'feel' it is a good idea, I cannot really estimate whether the TCO would be lower than just patching with bandaids over the years.