We could get a >50% performance boost by ignoring security.
Think about that. 1.5x - 2x boost, on a single core. That's like doubling a CPU's clock speed, except more powerful since it involves caching memory (which is usually the bottleneck, not raw horsepower).
What would the TempleOS of CPUs look like, I wonder?
We ought to have the option of running workloads on such CPUs. Yes, they're completely compromised, but if you're running a game server you really don't have to care very much. Most data is ephemeral, and the data that isn't, isn't usually an advantage to know.
Being able to run twice the workload as your competitors on a single core is a big advantage. You saw what happened when performance suddenly dropped thanks to the patch.
This brings to mind articles and discussions I read in the early 90s about differences in performance between running in real (or "unreal") mode and protected mode --- yes, the extra permission checks, paging, segmentation, etc. definitely add some overhead:
Thus they decided to do permissions checking in parallel with other operations, leading to Spectre. It's interesting to note that the original intent of protected mode was not as a real secure "security feature" immune from all attacks, but more to provide some isolation to guard against accidents and expand the available address space. The entire Win9x series of OSs embody this principle.
What would the TempleOS of CPUs look like, I wonder?
x86 running in unreal mode might be close. I wonder if anyone has done any benchmarking of recent x86 CPUs in that mode...
Spectre variant 2 is not fundamental to speculative execution. It can be fixed by tagging the indirect branch predictions with the full virtual address and the ASID, and flushing it together with the TLB.
you could certainly extend the unused transactional support to do a chandy-lamport thing, with the difference that when you run out of space isolated cache versions, you just can't speculate any more.
it would be a lot of machinery
you could also do latency hiding with smt instead of trying to fight it head-on with speculation. ultimately probably more productive, but either the compiler or the programming model or the user has to expose that concurrency.
> This brings to mind articles and discussions I read in the early 90s about differences in performance between running in real (or "unreal") mode and protected mode
I remember a 10 years old Microsoft Research project that implemented an OS that would use the .NET managed runtime to implement security. IIRC, they had interesting differences with CPU memory isolation off.
I like the idea that you don't need hardware barriers to isolate programs when they are lobotomized.
You're speaking of Singularity and it's "software isolated processes", which amounted to static verification of IL before AOT compiling it to x86. From the perspective of Sing#, the only way to express IPC was in the form of protocols, which were essentially formalized function call dances between two processes.
Singularity would be just as vulnerable to the recent bugs as contemporary OSes are, possibly more so because there is even less timing uncertainty when crossing privilege domains, making the attacks even easier
Not when the bootloader is an UEFI application. UEFI puts the processor into 32 bit protected mode or typically long mode (64 bit). So when your bootloader that is implemented as an UEFI application is started, you are already out of real mode.
Naturally, protecting an indirect branch means that no
prediction can occur. This is intentional, as we are
“isolating” the prediction above to prevent its abuse.
Microbenchmarking on Intel x86 architectures shows that our
converted sequences are within cycles of an native indirect
branch (with branch prediction hardware explicitly
For optimizing the performance of high-performance binaries,
a common existing technique is providing manual direct
branch hints. I.e., Comparing an indirect target with its
known likely target and instead using a direct branch when a
match is found.
Example of an indirect jump with manual prediction
cmp %r11, known_indirect_target
One example of an existing implementation for this type of
transform is profile guided optimization, which uses run
time information to emit equivalent direct branch hints.
Note the parenthetical in the first paragraph. The way I read the above, they're merely saying that the performance is the same as disabling branch prediction for the section of code. Which would be really slow. But, presumably, faster overall than _actually_ disabling branch prediction as there's probably no good way to explicitly do that in as localized a manner.
And as I interpret this, the reference to FDO is unfortunately confusing and perhaps better placed under the heading of correctness. They're merely saying that the transformation of indirect branches to direct branches is something compilers already do to steer speculative execution down the correct path. Except in the case of a retpoline it's steering speculative execution into a trap.
 I'm no expert in this area (nor do I do much assembly programming) but AFAIU branch predictors these days have substantial caches of their own. Disabling branch prediction or flushing the predictor's caches is probably much more costly than simply steering the predictor into a trap. The single pipeline bubble you create is better than the many bubbles that would be created if the branch predictor had to warm up again for later code.
So they lost 5% of performance, then applied FDO, a generic optimization technique, and gained back about that 5%. In other words, they could be ~5% ahead without the bug (at least 2%, if you consider there are some helpful interactions between FDO and retpoline mitigations).
Not to my understanding. They were already applying FDO to all of their performance sensitive programs. They found that after FDO, the retpolines and the other work done to mitigate meltdown and spectre had negligible impact. This is because FDO replaces indirect branches with direct branch hints, so only in uncommon cases do you even have to execute the retpoline. That's my understanding, at least.
Why limit yourself? Just run everything in ring 0. No kernel calls overhead at all. Lightning-fast network stack. Use a unikernel, run your trusted code only, don't store secrets.
Beyond game servers, this could be the mode of operation of e.g. compute cluster nodes, relying on external firewall for security, and running zero untrusted code. Of course they won't care about meltdown or spectre either.
I’ve heard some rumblings that this is what some HFT shops were doing years ago to minimize latency and to reduce context switches. Because they already run bare metal and don’t share anything besides a physical facility with their competitors (if even that) the attack vectors are not the same as those running on shared machines either.
I don't think this requires a 1000x slowdown but I'm not super familiar with this area of the spec. It should be possible to cache the tainted state of the canvas in a single bit, only updating it when adding new images to the canvas, and checking the bit when converting the canvas to an array.
For some reason every single pixel access must be authorized through CORS. I am not sure if this is in the standard or just shoddy implementation in all browsers. Try it yourself, read pixels in one non-CORSed image, and then in CORSed and wonder what went wrong on the drawing board. Some people bypass it via WebGL.
There's a trade-off, but I seriously doubt it's anywhere near that. For one, if you could get super fast by ignoring safety, that would surely be found in consoles. Then, a good old 486 or early Pentiums have none of these issues, having a multiple of z80 speed without some decades of progress we had in the interim.
I meant you could execute every instruction with cryptographic check, i.e. AES check for every single step if this instruction could be executed. That would bring speeds under Z80 and will be overwhelmingly secure if protocol is correct. I hope nobody would try to add it to normal CPUs in the future, but security guys are sometimes too idealistic.
Doesn't it also mean Intel kind of cheated their way to the performance crown by ignoring security while AMD does not seem to have these issues ? It certainly brings Ryzen closer again, however future improvements to the patch might make it negligible in the end.
Meltdown looks pretty much like cheating. Intel simply took security verification out of the critical path, and their public image ought to suffer for that.
Spectre is a much more complex issue. It is more of a "I would never thing of that" thing. Looks like there were people at the security community talking about it even 10 years ago, but it is certainly hard for knowledge to spread to chip designers - most of what security people study would be noise for them, and even the security people mostly ignore those.
The toy example referenced in section 6.4 is actually closer to Spectre. It's missing the crucial distinction that the speculatively loaded address should belong to a page inaccessible from the current ring level.
To see why that toy example is insufficient, consider that you could simply execute the load directly without putting an exception in front of it and you would be able to read the value.
No, it just means that it's better to design chips and software architectures with security in mind from day one. If you don't, then every single patch you add afterwards will bloat-up the whole system and the (software) performance will drop overtime.
It's just another way to say you're building technical debt into your chip/software if you don't consider security from day one. Because sooner or later you will need that security, if you have tens of millions or more people using that product. So it's not like you can just ignore it. You're just postponing it for later.
Some Amiga people still believe that "virtual memory" is a passing fad, and when the Amiga comes again in glory we'll all realize how futile and self-defeating it was to prevent people from accessing the raw hardware and OS.
It's more than that even - given that a high portion of the code being run now is using virtual machines, a lot of that protection is redundant. If all code is run inside VMs and zero 'native' code is allowed, then you could run without needing protection rings, system call overhead, memory mapping, etc - which in theory could more than make up for the virtual machine overhead.
This was being explored with Microsoft's Singularity and Midori projects but seems to be a dormant idea.
I don’t think it works like that. You still need protection between the rings inside the VM and the CPU is providing that protection. The VM is not only a user process for the host, it is executing its kernel code in the virtualized ring 0 on the CPU (where its still the CPU that provides protection).
As i understand it, with this approach you wouldn't have a separation between kernel code and the virtual machine - everything runs in a single address space and you rely on the virtual machine verifying the code as it JITs it.
"What would the TempleOS of CPUs look like, I wonder?"
In the old days, probably Burroughs B5000 for it being high-level with security and programming in large support built in before those were a thing. If about hackability, the LISP machines like Genera OS that you still can't quite imitate with modern stuff. If a correct CPU and you want it today, then I've been recommending people build on VAMP that was formally-verified for correctness in Verisoft project. Just make it multicore, 64-bit, and with microcode. Once that's done, use the AAMP7G techniques for verifying firmware and priveleged assembly on top of it. Additionally, microcode will make it easier to turn it into a higher-level CPU for cathedral-style CPU+OS combos such as Oberon, a LISP/Scheme machine, or Smalltalk machines. Heck, maybe even native support for Haskell runtime for House OS.
I assume those save language would still fail to defend against spectre. The problem is that the „code path with data“ executed in a speculative branch is not considered by the compiler but „fantasized“ by the cpu.
Nice thoughts but the vulnerabilities of Spectre and Meltdown do not directly threaten code which runs on the vulnerable CPU. Instead, the vulnerabilities allow malicious code running on the vulnerable CPUs to attack the system and other apps. A subtle difference, but makes the job to identify the applications which are not trustworthy and avoid running them on the performant but potentially vulnerable cores.
If every application were signed (like on the apple store), does that help reduce the risk here? If it were a game server then all apps run would presumably be provided by some official third party who could be relied upon to sign the binary.
If it were all signed, then is the risk of the box becoming compromised fairly small and worst case, replacing the entire box (rebuild) if there were an issue could address problems quickly?
It is my understanding that any sort of cloud setup would be completely out. Also, anything with a web-browser or other platform that runs third party code. But that still leaves an awful lot of servers that are under the control of a single company or individual and don't run any untrusted code (sandboxed or otherwise), doesn't it?
Yeah, you split your business into those who's customers need >1 physical machine, and those that need < 1 physical machine. Drop security for the former, increase for the latter. Both can phrase their application as a unikernel.
The performance boost is not about ignoring security, but about ignoring edge case of one particular security check on i386 which ended up being the only critical one for "modern" unix-derived (which for purposes of this discussion includes Windows NT) OSes.
In original x86 protected mode model (ie. real security flags on descriptor level and page protection bits only used for scheduling of paging) this would be non-issue.
If you weren't considering Intel alternatives before this, I'd argue that's a real failure of imagination and risk management. I'm sure some of the really small cloud providers weren't, but all the big players keep tabs on the path to and pain of migration, at a minimum. Just because they weren't actually using PowerPC/ARM/AMD does not mean they did not know how.
A lot of server software is actually open source and runs on most commonly available hardware architectures, including ARM and PPC. You are right that everybody develops assuming x86, though, and there is some friction in that transition regardless of software portability.
In the sense that those processors can compute things which are computable, yes, they are a real alternative. The correct question is, "How much worse would the situation have to be with Intel before the cost of these alternatives was less than the benefit of switching to one of them." The answer, I guess, is: "much worse than things are now." However, if you're running at billions in revenue per quarter, you can afford to spend a few million to limit your downside risk by keeping yourself apprised of what the costs would be of moving to another platform.
More important is, are PPC or ARM competitive at the amount of computation per watt, and amount of computation per dollar. Power 8 and 9 are pretty good at number-crunching but also pretty expensive, and I don't know about power efficiency. ARMs used to be pretty power-efficient, but likely cannot offer very high single-threaded performance (must still be great for IO-heavy applications that are hit hardest on Intel).
The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz---keep that in mind if you are mixing AVX-512 and other code.
Though not an unquestionable and not ultimately reliable one, using comparably rare hardware and software usually is a valid security measure. Now as Linux has gained solid market share the "no viruses for Linux" era comes to its end and Linux-capable ransomware is emerging I am considering moving to OpenBSD not only because security is among the top priorities of its design but because it's a way more exotic too.
I don't think throwing intel out entirely is a solution, but rather, diversification. What if there later comes out another vulnerability granting arbitrary code exec given a specific bytestring in memory, which only intel processors are immune to? Better to have as many as possible of as many different CPUs as possible. Just like, e.g. backblaze has done with HDD manufacturers.
I mean, we know they are all saying "arm" and "amd" both as negotiating tactic as well as strategically diversifying microarchitectures. That said, I'm not sure it's like amd can deliver more instructions per second per dollar.
I wonder if it would make sense for some loads to prepare for Itanium usage, or even older Atom architectures? Does any one deploy Itanium in the cloud?
It's worth noting that only pre-2013 Itanium CPUs are not vulnerable to meltdown (The same as with Atom CPUs). Intel has also said that the Itanium chips released in 2017 would be the last Itanium chips they will develop, so I don't think there's any reason to bother switching to Itanium. I would wager it's guaranteed you'll be better-off with the latest AMD instead of a 2013 Itanium, and AMD supports x86-64 so it doesn't require making your full software stack support Itanium (Which it likely doesn't).
I am sure you didn’t mean it this way, but it struck me as funny saying AMD supports x86-64. Of course AMD supports x86-64, they invented it. That is why Microsoft and Linux refer to that architecture as AMD64.
That's not entirely accurate. There are a lot of ARM CPUs that are not vulnerable to Spectre, and that includes a lot of them that are in use in actual devices. For example, my phone happens to use a Cortex-A53, which is not vulnerable. It is however easy to miss this detail from their 'security update'  because the table doesn't list CPUs that aren't affected.
Also, I don't believe the single ARM CPU that's vulnerable to meltdown (Cortex-A75) has actually been included in any devices at this point, so for now it is safe to assume any ARM-based device you have is not vulnerable to meltdown.
There won't be a hardware fix, you can't patch CPUs (aside from microcode which is just more software). Since this is a pretty fundamental issue wit how certain things work, yes, you'll have to wait for a new generation, and I couldn't expect anything in less than a year at the absolute earliest. It's a multi-year process from design to tape-out and fabrication. Honestly, I'm not sure how far back in the design process fixing this will take, it is something that can be added in as permanent microcode after fabrication/during packaging? Can current designs be "patched" before moving on to tape out? Do we need to go back to square one and rethink a couple key premises? I don't know. But it's not quick.
I was coming at it more from a "having alternatives is good" comment instead of saying it's not vulnerable to the new problems. The specs look interesting enough to benchmark on, and perhaps if it actually does perform better, then even with patches it will stand out.
In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754. Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754
My funniest thought of this whole affair that whatever performance upsizing happens in the front is probably going to be with manufacturer discounts from Intel from Amazon Google complaining and the cloud providers may end up with a windfall in the gap.