stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA.
PDS: A long time ago, I thought that Forth was "just another computer programming language", but back then I had no appreciation for it, because I didn't understand where it fits in the abstraction hierarchy that we call modern-day computing...
I didn't have that appreciation because, at the time, I hadn't yet worked on writing compilers...
Now, let me explain to you why Forth is so gloriously wonderful!
You see, in terms of abstraction, where transistors are the lowest level, and Lisp, and Lisp-like languages are the highest, and most compilers are somewhat mid-level,
Forth is the next step up above machine language -- but a step below most compilers!
You'd know this, and could derive it yourself, if you ever tried to write a compiler from the bottom up (starting from assembly language and working toward a compiler) rather than from the top down (the way compiler construction is taught in most university CS classes, i.e., the "Dragon Book", etc.)
See, Forth is sort of what happens when you add a runtime symbol (AKA "keyword") lookup table to your assembler code, such that addresses (AKA "labels", AKA "functions", "procedures", "methods") can be looked up and executed dynamically at runtime, and you build an interpreter / stack machine around this concept.
There's a little bit more to that of course, but that's the basic idea.
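As a hedged illustration of that idea (all names here are made up for the sketch, not taken from any real Forth), a tiny dictionary-plus-stack interpreter might look like:

```python
# Minimal sketch: an interpreter built around a data stack plus a runtime
# name -> code lookup table (the "dictionary" described above).

stack = []
dictionary = {}


def word(name):
    """Register a Python function in the dictionary under a Forth-style name."""
    def register(fn):
        dictionary[name] = fn
        return fn
    return register


@word("+")
def add():
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


@word("dup")
def dup():
    stack.append(stack[-1])


@word(".")
def dot():
    print(stack.pop())


def interpret(source):
    """The outer interpreter: known words are looked up and executed
    dynamically; anything else is treated as a number and pushed."""
    for token in source.split():
        if token in dictionary:
            dictionary[token]()
        else:
            stack.append(int(token))


interpret("2 3 + dup + .")  # prints 10
```

Everything else in a real Forth (defining new words with `:` ... `;`, the return stack, compilation) grows out of this same lookup-and-execute loop.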
Then you derive Forth!
That's why Forth is so gorgeous and brilliant!
And so aesthetically pleasing, because of its small size, yet amazing functionality!
And why you don't need big, complex CPUs to run it!
It exactly follows the "Einsteinian mantra" of
"As simple as possible, but not simpler!"
Anyway, thanks again for your great work with the J1 Forth CPU!
I think GP may have worded that poorly, and "most compilers" in your quoted portion would be better read as "most other languages".
Taken this way, the sentence becomes something like: For computing, transistors are the lowest level of abstraction (they are literally the things doing the computing), Lisp and kin languages are at the extreme end of abstraction, and most other languages fit somewhere in between.
Though not entirely true (Lisps let you get pretty low level if you want), it's a reasonable perspective if you look not at what the languages permit, but what they encourage. Forth encourages building your own abstractions from the bottom up. Lisp encourages starting with an abstract machine model and building up even higher levels of abstractions (same with most functional and declarative languages). C++ (for something in between) encourages understanding the underlying machine, but not necessarily to the degree that Forth does. Forths and Forth programs are often closely tied to a specific CPU or instruction set. C++ can be seen as abstracted across a particular machine model that's typical of many modern CPUs but not with a direct tie to a specific instruction set (doing so constrains many programs more than necessary). And Lisp's abstract model is (or can be) even more disconnected from the particular physical machine.
I'd put Haskell, SQL, or Prolog (and languages at their level) at the extreme end of abstraction, ahead of Lisp. Their abstract machines are even further from the hardware (for most users of the languages) than Lisp's.
In college, one of our assignments was to design a CPU. I started with a plain register-based one, but ended up moving to a stack-based design. It was one of the best decisions I made in the project, simplified the microcode enormously and made programming it easy and fun.
Machine code felt very close to Forth (I worked with GraFORTH on the Apple II prior to that), or an HP calculator (ironically, now I finally own one). The CPU was never implemented and large parts of the microcode were never written, but I have written some simple programs for it.
The printouts and the folder are in my mom's house in Brazil, so I can't really look them up. Not sure the (5.25") floppy is still readable or what could interpret the netlist.
For some ops I cheated a bit. The top of the stack could be referenced as registers, R0 to R7, but I don't think I used those shortcuts in sample code.
Tangential question about FPGAs: Is there any work on compiling code to a combination of hardware and software? I'm imagining that the "outer loop" of a program is still fairly standard ARM instructions, or similar, but the compiler turns some subroutines into specialised circuits. Even more ambitiously you could JIT-compile hot loops from machine instructions to hardware.
We already kind of do this manually over the long term (eg things like bfloat16, TF32 and hardware support for them in ML, or specialised video decoders). With mixed compilation you could do things like specify a floating point format on-the-fly, or mix and match formats, in software and still get high performance.
The thing is, this is not just another step up in complexity as another poster wrote here, but several.
Because it requires partial dynamic reconfiguration, which works only with RAM-based FPGAs (the ones that load their bitstream, containing their initial configuration, from somewhere on startup), not flash-based ones, which are "instant on" in their fixed configuration.
Regardless of that, partial dynamic reconfiguration takes time. The larger the reconfigured parts, the more time.
This is all made very annoying by vendor lock-in: proprietary tools, IP protection, and so much more.
The few fpgas which have open source tool chains are unsuitable because they are all flash based AFAIK, and partial reconfiguration doesn't seem to be on the radar of the people developing those toolchains anyway, because why would it be, if the parts are flash-based?
> The few fpgas which have open source tool chains are unsuitable because they are all flash based AFAIK...
Not true at all. The flagship open-source FPGAs are the Lattice iCE40 series, which are SRAM-based. There's also been significant work towards open-source toolchains for Xilinx FPGAs, which are also SRAM-based.
The real limitation is in capabilities. The iCE40 series is composed of relatively small FPGAs which wouldn't be particularly useful for this type of application.
OK? I didn't follow the Lattice efforts because those parts have insufficient resources for my needs. I'm aware of the efforts for Xilinx, but they don't cover the SKUs/models I'm working with. Is there anything for Altera/Intel now?
I'm not aware of any significant reverse-engineering efforts for Intel FPGAs. QUIP might be an interesting starting point, but there may be significant IP licensing restrictions surrounding that data.
Out of curiosity, which Xilinx models are you hoping to see support for?
The challenge is that reformulating problems to parallel computation steps is something we're in general still really bad at.
We're struggling with taking full advantage of GPUs and many-core CPUs as it is.
FPGAs are one step up in complexity.
I'd expect JIT'ing to FPGA acceleration to show up as anything other than very limited research prototypes only after people have first done a lot more research on JIT'ed auto-parallelisation to multiple CPU cores or GPUs.
The "execution model" is so vastly different that it's hard to know what "JIT-compile hot loops from machine instructions to hardware" even means. I wouldn't call HDLs "execution" at all: they describe how to interconnect electronic circuits, and if those can be said to "execute", it's that everything runs in parallel, processing signals across all circuits to the beat of a clock (usually, though not always).
FPGAs can generally achieve much lower latencies than systems like the Pi (but so can microcontrollers). For some use cases they can be more power efficient. If you need some oddball I/O that isn't widely available in existing systems you can implement it in gateware.
In general I would say there aren't a whole lot of cases where it makes sense to use an FPGA over a microcontroller or SBC, given how cheap and fast those are these days. Of course, like a lot of hobbyist tech stuff, people will choose a certain technology for fun/learning.
Note that the Pi didn't exist yet when this CPU was designed, the embedded world was very different 10 years ago.
Seconding folks re: some of the advantages on latency, efficiency, etc.
I also wouldn't describe it as "just" an emulator, or think of it as something interpreting / emulating digital logic. It's much lower level than that: it actually implements your design in real hardware (lookup tables / logic gates / memory cells, etc., all "wired" together).
As such, they can be useful not only for implementing custom bits of glue logic (e.g. interfaces which need high-speed serialization / deserialization of data) or accelerating a particular calculation, but also anywhere you need really deterministic performance in a way that isn't easy, or even possible, in software / an emulator alone.
While FPGAs are often used for implementing CPUs, that's really not the best use of an FPGA in my opinion. Of course it's very useful when prototyping a CPU, but if you only want a CPU... you can just buy one.
I think a more interesting use is for hardware that you can't buy. Like the balls in this project:
It can be more efficient. It depends on how sophisticated an emulator you are writing and how well the emulated CPU's semantics match the host CPU's. For a simple emulator of a CPU which is vastly different from the host, an FPGA will likely be much faster and consume less power: although FPGAs are generally slower than dedicated silicon, they are really good at simulating arbitrary digital logic very fast, while CPUs are quite poor at it.
Contrary to much common misunderstanding, sure, no real problem.
The semantics of each instruction are: consume the two top stack elements and replace them with the result. You handle this by keeping a rename stack of physical registers (with additional complication for handling under- and overflow). Say the current stack is
pr3 pr4 pr5
and the first free register is pr56
Then a "+" instruction is translated to "add pr56, pr4, pr5"; pr56 is consumed from the free list, and pr4 and pr5 are marked to be freed when this commits.
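The renaming step above can be sketched in a few lines (an illustrative toy, not any real front end; the register names and free-list handling are simplified, and under/overflow handling is omitted):

```python
# Toy sketch of renaming stack instructions into register form: the
# architectural stack is tracked as a list of physical-register names, and
# each "+" pops two sources, allocates a fresh physical register for the
# result, and pushes it back as the new top of stack.

def rename(instructions, initial_stack, first_free):
    stack = list(initial_stack)   # e.g. ["pr3", "pr4", "pr5"], top at the end
    next_free = first_free        # index of the next free physical register
    out = []
    for op in instructions:
        if op == "+":
            src2 = stack.pop()    # top of stack
            src1 = stack.pop()    # next on stack
            dst = f"pr{next_free}"
            next_free += 1
            out.append(f"add {dst}, {src1}, {src2}")
            stack.append(dst)     # result becomes the new top of stack
        else:
            stack.append(op)      # treat anything else as a push of a value
    return out, stack


ops, final = rename(["+"], ["pr3", "pr4", "pr5"], 56)
print(ops)    # ['add pr56, pr4, pr5']
print(final)  # ['pr3', 'pr56']
```

This matches the worked example: with pr3 pr4 pr5 on the stack and pr56 free, "+" becomes "add pr56, pr4, pr5", leaving pr3 pr56.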
Because stack machines inherently introduce a lot of tight dependencies you will need to use dynamic scheduling (OoOE) to go super-scalar, but it's not a problem.
The upside is incredible instruction density. The downside: slightly harder to do good code generation, but not by much.
I'm assuming partly inertia, partly the code density not being important enough to do this.
To be clear, while you _can_ go OoO superscalar with a stack machine, it's more work than with an ISA that exposes the dependencies, like VLIW or EDGE.
Don't take my word for it, design, model, simulate, and implement it yourself. It's a small matter of coding.
EDIT: Reduceron is an example of a super-scalar stack machine, though not dynamically scheduled. It's very difficult to write code by hand though.
It's possible to pipeline a stack CPU, but there is an upper limit on the number of pipeline stages that make sense. A Forth-friendly dual-stack (data and return) CPU can be very simple and small, yet achieve a throughput close to one instruction per cycle.
Going superscalar on a single data stack is hard. You basically have to rename every stack access, and most compute instructions consume two arguments to produce a single result. You've just trapped yourself in a fate about as bad as having to implement a superscalar x87 FPU.
One interesting design to break the 1-instruction-per-clock barrier is a multistack VLIW design. In this design the dense encoding of stack-based instructions compensates for the low code density common in VLIW instruction sets. See https://bernd-paysan.de/4stack.html for an example of this out-of-the-box approach.
It should also be noted that "double" in this context has nothing to do with floating-point numbers; Forth implementations often don't even have the functions (called "words") to manipulate FP numbers. Instead I believe it refers to the double-number word set, words like "d+" and "d-". See
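The double-number idea can be sketched like this (a hedged illustration: the 16-bit cell width is my assumption, chosen to match a small CPU like the J1, and keeping the most-significant cell on top follows common Forth practice):

```python
# A "double" number occupies two cells on the stack (low cell below, high
# cell on top), and a word like "d+" adds two such pairs, propagating the
# carry from the low cell into the high cell.

CELL_BITS = 16
CELL_MASK = (1 << CELL_BITS) - 1


def d_plus(stack):
    """Pop two double numbers (each a low/high cell pair) and push their sum."""
    hi2, lo2 = stack.pop(), stack.pop()
    hi1, lo1 = stack.pop(), stack.pop()
    lo = lo1 + lo2
    hi = hi1 + hi2 + (lo >> CELL_BITS)   # carry out of the low cell
    stack.extend([lo & CELL_MASK, hi & CELL_MASK])


stack = [0xFFFF, 0x0001, 0x0001, 0x0000]  # 0x1FFFF + 0x00001
d_plus(stack)
print([hex(c) for c in stack])  # ['0x0', '0x2'], i.e. 0x20000
```

So on a 16-bit machine, doubles simply give you 32-bit integer arithmetic without any floating point in sight.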
Forth often uses different, or slightly odd, terminology for common computer-science concepts, because the language evolved independently of universities and research labs. For example, "meta-compiler" is used where "cross-compiler" would be more appropriate; instead of functions, Forth has "words", which are combined into a "dictionary"; and because the term "word" is already taken, "cell" is used instead of "word" when referring to a machine's word width.
If you look at e.g. x86 ASM manuals, you have WORD (16 bits), double word (DWORD, 32 bits), and quadword (QWORD, 64 bits). So even if it's nowadays a 64-bit CPU, the nomenclature from the 16-bit days sticks.
Double precision usually refers to 64-bit floating point, like you say.
Not usually; `float` is 32 bits and `double` is 64 bits on virtually every common platform (maybe not on some DSP chips or certain embedded chips?). But the C++ standard (and probably the C one too) only requires that `double` have at least as much precision as `float`, so it's conceivable you could have an implementation with 32-bit `float` and `double`, or 16-bit `float` and 32-bit `double`.
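If you want to sanity-check this on your own machine, Python's ctypes module exposes the host C ABI's types (this checks one platform, it's not proof for every platform):

```python
# Report the host C ABI's sizes for float and double via ctypes.
import ctypes

print(ctypes.sizeof(ctypes.c_float))   # typically 4 (32 bits)
print(ctypes.sizeof(ctypes.c_double))  # typically 8 (64 bits)
```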