Benchmarks like these still have their place, in part because many contemporary benchmarks are too specific. A non-trivial number of Phoronix benchmarks, for instance, can't be run on ten- or twenty-year-old OSes, so they're not the best for comparing advances in OS efficiency.
What'd be nice is a collection of ByteBench results for machines and OSes going back to the '80s. It's interesting to see what has progressed by leaps and bounds versus what has progressed more gradually. For instance, a few comparisons from the early '80s to now:
Memory size: 64K → 64GB, factor of 1,000,000
Memory bandwidth: 1MB/sec → 50GB/sec, factor of 50,000
Instructions/sec: 1 million → 10 billion per core, factor of 10,000 per core
Disk size: 5M → 10TB, factor of 2,000,000
Magnetic disk speed: 150K/sec → 250MB/sec, factor of just 1,666
Very general, but we can see that the amount of increase varies wildly.
A nit:
If we're saying memory size is 64GB, that's in HEDT and workstation territory, so let's look at the memory bandwidth of, e.g., an M1 Max or Ultra at 400-800GB/s (a factor of 400,000-800,000).
"Whereas M1 Max had 16 LPDDR5-6400 channels for a total of 408GB/second of memory bandwidth, M1 Ultra doubles that to 32 LPDDR5 channels and 800GB/second of memory bandwidth." [1]
Realistically, there were plenty of higher-end systems in the early 80s with more than 64K of RAM as well, so maybe it's more fair to compare the specs of, say, an Apple IIe to an M1 Ultra Mac Studio as a sort of common denominator (similarly priced, given inflation). The problem then is that systems in the early 80s didn't include a hard drive, so you're looking at a $3,500 add-on (in 2022 dollars) just to get to parity in terms of computer features we now consider basic requirements. With that extra $3,500 we're in territory looking something more like this:
Memory size: 64K → 128GB, factor of 2,000,000
Memory bandwidth: 1MB/sec → 800GB/sec, factor of 800,000
Instructions/sec: 500,000 → 10 billion per core, factor of 20,000 per core
Disk size: 5M → 8TB, factor of 1,600,000
Magnetic disk speed: 150K/sec → 500MB/sec, factor of just 3,300 [2]
Actual drives compared: 150K/sec → 7.5GB/sec, factor of 50,000
1: https://www.anandtech.com/show/17306/apple-announces-m1-ultra-combining-two-m1-maxes-for-even-more-performance
2: https://www.storagereview.com/news/seagate-exos-2x14-hdd-delivers-524mb-s
I like your example - Apple II to Apple ARM makes for a very interesting path :)
Personally, I'm running NetBSD on m68030 Mac systems (and Amiga), so comparing them with Apple ARM gives some interesting direct comparisons. I'll have to write them up some day.
Didn't Byte also use the Sieve of Eratosthenes, or am I mis-remembering that?
I ask because I've been fooling with that a bit recently, and extending it to very large primes (for some definition of very large) is a bit of a challenge. Marching through 4GB of RAM 1) requires 4GB of RAM if byte-size bools are employed and 2) absolutely thrashes the processor caches.
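For scale, the flat version of the sieve looks something like this (a minimal Python sketch; the function name is mine): one byte per candidate means sieving up to 4 billion needs a 4GB array, and every prime's inner loop strides across the whole thing.

```python
def naive_sieve(limit):
    """Flat Sieve of Eratosthenes: one byte per candidate, so sieving up
    to N takes N bytes (4 GB of bools for N = 4e9), and each prime's
    striking loop strides across the entire array, missing the caches."""
    is_composite = bytearray(limit + 1)
    for p in range(2, int(limit ** 0.5) + 1):
        if not is_composite[p]:
            for m in range(p * p, limit + 1, p):
                is_composite[m] = 1
    return [n for n in range(2, limit + 1) if not is_composite[n]]
```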
I've been thinking about how to do an efficient Sieve of Eratosthenes for a while. One approach that had occurred to me was to sieve one block at a time (say, 16 KiB) and "park" one "rider" for each prime in a priority queue, maybe a binary heap, ordered by the next composite they were going to eliminate.
Also, you can special-case even numbers, so 16 KiB covers 262'144 numbers using bit-size bools.
So, after going through the first block, your rider for 3 is parked in the heap at 262'149, your rider for 5 is parked at 262'145, your rider for 7 is parked at 262'157 (because 262'150 is even and thus not considered), your rider for 11 is parked at 262'163, your rider for 13 is also parked at 262'145, and so on. So you start by picking 5 off the heap, marking off all the multiples of 5 in the block (262'145, 262'155, etc.) and when it runs off the end of the block you park it at 524'295, at which point it tumbles to near the bottom of the heap. Then you pick 13 off the heap, mark off all the multiples of 13, etc. Once you're done striking off all the composites in the block, you iterate over its primes, inserting a rider for each of them into the heap (at their square, so for example the rider for 262'147 starts running at 68'721'049'609), maybe outputting them to a file, and then you can discard the block.
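A minimal Python sketch of that block-plus-riders scheme (function and parameter names are mine, and I've used `heapq` rather than a hand-rolled heap): each block covers `span` odd candidates, riders are (next odd composite, stride) pairs in a min-heap, and a freshly found prime strikes the rest of its own block before parking its rider.

```python
import heapq

def sieve_riders(limit, span=262_144):
    """Segmented sieve over odd numbers only. Each block holds `span`
    odd candidates; a min-heap of (next_odd_multiple, 2*p) riders is
    drained at the start of each block to strike off composites."""
    if limit >= 2:
        yield 2
    heap = []            # riders parked at the next odd composite they hit
    lo = 3               # first odd candidate of the current block
    while lo <= limit:
        hi = lo + 2 * span            # block covers odds in [lo, hi)
        bits = bytearray(span)        # 0 = not yet struck off
        # Run every parked rider whose next stop falls inside this block.
        while heap and heap[0][0] < hi:
            nxt, step = heapq.heappop(heap)
            while nxt < hi:
                bits[(nxt - lo) // 2] = 1
                nxt += step
            heapq.heappush(heap, (nxt, step))    # re-park past the block
        # Survivors are prime; each strikes the rest of its own block
        # starting at p*p, then parks a rider for later blocks.
        for i in range(span):
            if bits[i]:
                continue
            p = lo + 2 * i
            if p > limit:
                return
            yield p
            nxt = p * p
            while nxt < hi:
                bits[(nxt - lo) // 2] = 1
                nxt += 2 * p
            heapq.heappush(heap, (nxt, 2 * p))
        lo = hi
```

Striking from p² is safe because any smaller composite in the block already has a smaller prime factor whose rider (or in-block loop) got there first.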
That means you only need to keep the heap of known primes in RAM, not the bitmap of all the numbers you contemplate testing for primality. This not only allows you to change your mind about how many primes you want, it also reduces your space by a factor of the natural logarithm. But this doesn't beat the constant factor for a long time; https://en.wikipedia.org/wiki/Prime-counting_function says π(10¹¹) = 4'118'054'813, so 64 GiB of RAM should be enough to calculate all the primes up to 10¹¹ if each heap element is two 64-bit integers; while, without a heap of riders, with one bit each, you would only need 12 GiB. Well, 6 GiB skipping the evens.
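The space arithmetic above checks out; a quick back-of-envelope using the π(10¹¹) figure quoted from Wikipedia:

```python
# Back-of-envelope check of the RAM figures.
PI_1E11 = 4_118_054_813                   # pi(10**11), from the quoted table
heap_bytes = PI_1E11 * 16                 # two 64-bit integers per heap element
assert round(heap_bytes / 2**30) == 61    # ~61 GiB, so 64 GiB suffices
bitmap_bytes = 10**11 // 8                # one bit per candidate, no heap
assert round(bitmap_bytes / 2**30) == 12  # ~12 GiB; ~6 GiB skipping evens
```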
If, instead of discarding the already finished blocks, you lazily add riders for their primes when you reach those primes' squares, you can do a lot better up to a much higher sieve size: at 10¹¹ you've just passed 316'271², so your heap only needs to contain the 27'296 primes p such that 2 < p ≤ 316'271, thus occupying only 436'736 bytes. Moreover, if you do know ahead of time where you're going to stop, you can still discard the finished blocks whose primes will never spawn riders in the heap. This approach should allow you to sieve out the 201'467'286'689'315'906'290 prime numbers up to 10²² in those same 64 GiB of RAM, but an earlier obstacle is that striking off 10²² composites at one per nanosecond would take you several hundred millennia.
You don't want to use a standard bin-heap at gigabyte sizes on a cached machine because you'll have one cache miss per access once you get outside the top few levels of the heap. Instead, you probably want to block it into a heap of heaps, staggered so their roots aren't a large power of 2 apart. Or possibly use a Fibonacci heap, if you can understand it.
It might make sense to bundle the first few primes into a "wheel", with, say, period 15'015 (3 · 5 · 7 · 11 · 13 · 17), where you just OR an appropriately shifted bitmap into the block to strike off all their multiples at once (9255 out of every 15'015 numbers in this case), instead of striking off their multiples one by one with the rider mechanism. Better still would be if you can work out an efficient addressing scheme which doesn't even assign a bit index to numbers that are divisible by one of these small primes, a generalization of the just-skip-the-evens trick.
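A sketch of that precomputed-mask idea with a smaller wheel of 3 · 5 · 7 (period 105 odd slots, i.e. 210 integers), using Python ints as bitmaps; the names and the rotation arithmetic are mine:

```python
WHEEL_PERIOD = 105  # 3*5*7 odd slots per period; slot i stands for 2*i + 1

def wheel_mask(period=WHEEL_PERIOD):
    """Bitmask (a Python int) over one period of odd slots, where a set
    bit means the slot's number is divisible by 3, 5, or 7."""
    mask = 0
    for i in range(period):
        n = 2 * i + 1
        if n % 3 == 0 or n % 5 == 0 or n % 7 == 0:
            mask |= 1 << i
    return mask

def strike_wheel(block_bits, lo, nslots, mask, period=WHEEL_PERIOD):
    """OR the wheel mask into a block of `nslots` odd slots whose first
    slot is the odd number `lo`. The mask is rotated so that it lines up
    with lo's position in the wheel, then tiled across the block."""
    offset = ((lo - 1) // 2) % period
    rotated = ((mask >> offset) | (mask << (period - offset))) \
              & ((1 << period) - 1)
    out, pos = 0, 0
    while pos < nslots:                 # tile one period at a time
        out |= rotated << pos
        pos += period
    return block_bits | (out & ((1 << nslots) - 1))
```

For this smaller wheel, 57 of every 105 odd slots get struck in a single pass of OR operations, with no per-prime rider traffic at all.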
These are just some things that have occurred to me (though I'm pretty sure I read about the wheel trick somewhere), so probably there are already more efficient implementations known of the Sieve of Eratosthenes, and I just haven't heard of them.
Edit: Darius Bacon wrote a version of this approach in https://github.com/darius/cant/blob/master/library/factoring... where the "frontier" of active riders is stored in a hash table rather than a bin heap, which is almost certainly a more efficient approach. But he's not doing the bitmaps, and the overhead of Cant makes it hard to measure the absolute efficiency of the approach.
https://twitter.com/AnachronistJohn/status/15012572783218278...
My thoughts too, though you've clearly thought about it a lot more than I have.