The J1 Forth CPU

(excamera.com)

214 points | by Cieplak 113 days ago

10 comments

  • peter_d_sherman 112 days ago
    >"J1 is a

    small (200 lines of Verilog)

    stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA.

    PDS: A long time ago, I thought that Forth was "just another computer programming language", but back then I had no appreciation for it, because I didn't understand where it fits into the abstraction hierarchy that we call modern-day computing...

    I didn't have this appreciation because, at the time, I had not yet worked on writing compilers...

    Now, let me explain to you why Forth is so gloriously wonderful!

    You see, in terms of abstraction, where transistors are the lowest level, Lisp and Lisp-like languages are the highest, and most compilers are somewhat mid-level,

    Forth is the next step up above machine language -- but a step below most compilers!

    You'd know this, you'd derive this, if you ever tried to write a compiler from the bottom up (going from assembly language to a compiler) rather than from the top down (what is taught about compiler construction in most university CS classes, i.e., the "Dragon Book", etc.).

    See, Forth is sort of what happens when you add an at-runtime symbol (AKA "keyword") lookup table to your assembler code, such that addresses (AKA "labels", AKA "functions", AKA "procedures", AKA "methods") can be dynamically executed at runtime, and you create an interpreter / stack machine around this concept.

    There's a little bit more to that of course, but that's the basic idea.
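
    As a minimal sketch of that basic idea in C (a hypothetical toy, not any real Forth -- the names here are made up for illustration): a dictionary maps names to routines, and the interpreter looks up each token and executes it against a data stack, pushing anything it doesn't recognize as a number.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* A data stack, plus a dictionary mapping names to routines. */
        int stack[64], sp = 0;

        void push(int v) { stack[sp++] = v; }
        int pop(void)    { return stack[--sp]; }

        void add(void) { int b = pop(); push(pop() + b); }
        void dot(void) { printf("%d\n", pop()); }   /* Forth's "." */

        struct { const char *name; void (*code)(void); } dict[] = {
            { "+", add }, { ".", dot },
        };

        /* Look each token up; execute it if found, else push it as a number. */
        void interpret(const char *word) {
            for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
                if (strcmp(word, dict[i].name) == 0) { dict[i].code(); return; }
            push(atoi(word));
        }

        int main(void) {
            const char *program[] = { "2", "3", "+", "." };
            for (int i = 0; i < 4; i++) interpret(program[i]);
            return 0;   /* prints 5 */
        }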

    Then you derive Forth!

    That's why Forth is so gorgeous and brilliant!

    And so aesthetically pleasing, because of its small size yet amazing functionality!

    And why you don't need big, complex CPUs to run it!

    It exactly follows the "Einsteinian mantra" of

    "As simple as possible, but not simpler!"

    Anyway, thanks again for your great work with the J1 Forth CPU!

    • fmakunbound 112 days ago
      Agree with all that!

      Except:

      > Lisp, and Lisp-like languages are the highest, and most compilers are somewhat mid-level

      Most Lisps have compilers (except the toy ones), and have had compilers for several decades.

      • Jtsummers 112 days ago
        I think GP may have worded that poorly, and "most compilers" in your quoted portion would be better read as "most other languages".

        Taken this way, the sentence becomes something like: For computing, transistors are the lowest level of abstraction (they are literally the things doing the computing), Lisp and kin languages are at the extreme end of abstraction, and most other languages fit somewhere in between.

        Though not entirely true (Lisps let you get pretty low level if you want), it's a reasonable perspective if you look not at what the languages permit, but what they encourage. Forth encourages building your own abstractions from the bottom up. Lisp encourages starting with an abstract machine model and building up even higher levels of abstraction (same with most functional and declarative languages).

        C++ (for something in between) encourages understanding the underlying machine, but not necessarily to the degree that Forth does. Forths and Forth programs are often closely tied to a specific CPU or instruction set. C++ can be seen as abstracted across a machine model that's typical of many modern CPUs, but without a direct tie to a specific instruction set (having one would constrain many programs more than necessary). And Lisp's abstract model is (or can be) even more disconnected from the particular physical machine.

        I'd put Haskell, SQL, and Prolog (and languages at their level) at the extreme end of abstraction, before Lisp. Their abstract machines are even further from the hardware (for most users of the languages) than Lisp's.

      • tkinom 112 days ago
        Each "J1" seems to be able to map well(easily) to a tensor.

        Thousands (or a lot more) of them can be mapped to a TensorFlow network....

      • Cieplak 113 days ago
        Also worth checking out James’s simplified version designed for the Lattice iCE40 [1][2].

        [1] https://www.excamera.com/sphinx/article-j1a-swapforth.html

        [2] https://youtube.com/watch?v=rdLgLCIDSk0

        • UncleOxidant 112 days ago
          Will the non-simplified J1 fit/run in a larger iCE40 FPGA?
          • avhon1 112 days ago
            Probably not.

            iCE40 FPGAs come with up to 7,680 logic cells.

            The J1 was designed for a board with a Xilinx XC3S1000, which has 17,280 logic cells.

            • sitkack 112 days ago
              There are RISC-V cores that fit in about 1k LUTs. One could build a NoC of RISC-V cores using an XC3S1000.
              • UncleOxidant 112 days ago
                This is probably because the RAM is internal to the FPGA:

                > A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA.

                I'd guess the CPU itself would easily fit into an iCE40 (given that RISC-Vs fit and the J1 should be simpler) with the RAM external. Several of the iCE40 boards have external RAM.

                • kelu124 111 days ago
                  Even the iCE40 UP5K has a whopping 1 Mbit of internal RAM.
            • tieze 113 days ago
              Oh, thanks so much for this pointer :)
            • rbanffy 112 days ago
              In college, one of our assignments was to design a CPU. I started with a plain register-based one, but ended up moving to a stack-based design. It was one of the best decisions I made in the project, simplified the microcode enormously and made programming it easy and fun.
              • loa_in_ 112 days ago
                Did you roll your own language for this, use an existing one, or did you program it in machine code?
                • rbanffy 112 days ago
                  Machine code felt very close to Forth (I had worked with GraFORTH on the Apple II before that), or to an HP calculator (ironically, I now finally own one). The CPU was never implemented and large parts of the microcode were never written, but I have written some simple programs for it.

                  The printouts and the folder are in my mom's house in Brazil, so I can't really look them up. Not sure the (5.25") floppy is still readable or what could interpret the netlist.

                  For some ops I cheated a bit. The top of the stack could be referenced as registers, R0 to R7, but I don't think I used those shortcuts in sample code.

              • one-more-minute 113 days ago
                Tangential question about FPGAs: Is there any work on compiling code to a combination of hardware and software? I'm imagining that the "outer loop" of a program is still fairly standard ARM instructions, or similar, but the compiler turns some subroutines into specialised circuits. Even more ambitiously you could JIT-compile hot loops from machine instructions to hardware.

                We already kind of do this manually over the long term (eg things like bfloat16, TF32 and hardware support for them in ML, or specialised video decoders). With mixed compilation you could do things like specify a floating point format on-the-fly, or mix and match formats, in software and still get high performance.

                • LargoLasskhyfv 112 days ago
                  There was, but for MIPS, by Microsoft, using NetBSD.

                  https://www.microsoft.com/en-us/research/project/emips/

                  https://www.microsoft.com/en-us/research/publication/multico...

                  http://blog.netbsd.org/tnf/entry/support_for_microsoft_emips...

                  The thing is, this is not just another step up in complexity, as another poster wrote here, but several.

                  Because it requires partial dynamic reconfiguration, which works only with RAM-based FPGAs (the ones which load their bitstream, containing their initial configuration, from somewhere on startup), not flash-based ones, which are "instant on" in their fixed configuration.

                  Regardless of that, partial dynamic reconfiguration takes time. The larger the reconfigured parts, the more time.

                  This is all very annoying because of vendor lock-in: proprietary tools, IP protection, and so much more.

                  The few FPGAs which have open-source toolchains are unsuitable because they are all flash-based, AFAIK, and it doesn't seem to be on the radar of the people developing those toolchains. Why would it be, if everything is flash anyway?

                  • duskwuff 112 days ago
                    > The few fpgas which have open source tool chains are unsuitable because they are all flash based AFAIK...

                    Not true at all. The flagship open-source FPGAs are the Lattice iCE40 series, which are SRAM-based. There's also been significant work towards open-source toolchains for Xilinx FPGAs, which are also SRAM-based.

                    The real limitation is in capabilities. The iCE40 series is composed of relatively small FPGAs which wouldn't be particularly useful for this type of application.

                    • nereye 112 days ago
                      Lattice ECP5 is an SRAM-based FPGA with up to 84K LUTs (vs ~5K for iCE40), and it is supported by an open-source toolchain. E.g. see https://www.crowdsupply.com/radiona/ulx3s.
                      • LargoLasskhyfv 112 days ago
                        OK? I didn't follow the efforts for Lattice because of insufficient resources for my needs. I'm aware of efforts for Xilinx, but they don't cover the SKUs/models I'm working with. Is there anything for Altera/Intel now?
                  • vidarh 113 days ago
                    The challenge is that reformulating problems as parallel computation steps is something we're, in general, still really bad at.

                    We're struggling with taking full advantage of GPUs and many-core CPUs as it is.

                    FPGAs are one step up in complexity.

                    I'd expect JIT'ing to FPGA acceleration to show up, other than as very limited research prototypes, only after people have first done a lot more research on JIT'ed auto-parallelisation to multiple CPU cores or GPUs.

                    • rwmj 113 days ago
                      The "execution model" is so vastly different it's hard to even know what "JIT-compile hot loops from machine instructions to hardware" even means. I wouldn't even call HDLs "execution" - they describe how to interconnect electronic circuits, and if they can be said to "execute" at all, it's that everything runs in parallel, processing signals across all circuits to the beat of a clock (usually, not always).
                      • ekiwi 112 days ago
                        You might be interested in this work which integrates a programmable fabric directly with a MIPS core in order to speed up inner loops: http://brass.cs.berkeley.edu/garp.html
                      • jandrese 112 days ago
                        > one’s complement addition

                        That's going to catch some people off guard, especially on a 16-bit system, where it's so easy to overflow.
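
                        For anyone who hasn't met it, a rough sketch of end-around-carry (one's complement) addition in C -- illustrative only, not the J1's actual Verilog:

                            #include <stdint.h>
                            #include <stdio.h>

                            /* 16-bit one's-complement add: fold the carry out of
                               bit 15 back into bit 0 (end-around carry). */
                            uint16_t oc_add16(uint16_t a, uint16_t b) {
                                uint32_t s = (uint32_t)a + b;
                                return (uint16_t)((s & 0xFFFF) + (s >> 16));
                            }

                            int main(void) {
                                /* 0xFFFF is "negative zero"; adding 1 wraps to 0x0001. */
                                printf("%04X\n", oc_add16(0xFFFF, 0x0001));
                                return 0;
                            }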

                        • fctorial 113 days ago
                          What are the advantages of an FPGA over an emulator running on a board like the RPi? Isn't an FPGA just a hardware-level emulator?
                          • opencl 113 days ago
                            FPGAs can generally achieve much lower latencies than systems like the Pi (but so can microcontrollers). For some use cases they can be more power efficient. If you need some oddball I/O that isn't widely available in existing systems you can implement it in gateware.

                            In general I would say there aren't a whole lot of cases where it makes sense to use an FPGA over a microcontroller or SBC, given how cheap and fast they are these days. Of course, like a lot of hobbyist tech stuff, people will choose to use a certain technology for fun/learning.

                            Note that the Pi didn't exist yet when this CPU was designed, the embedded world was very different 10 years ago.

                            • TFortunato 112 days ago
                              Seconding folks re: some of the advantages on latency, efficiency, etc.

                              I also wouldn't describe it as "just" an emulator, or think of it as something interpreting / emulating digital logic. It's much lower level than that, in that it is actually implementing your design in real hardware (consisting of lookup tables / logic gates / memory cells, etc., all "wired" together).

                              As such, they can be useful not only for implementing custom bits of glue logic (e.g. interfaces which need high-speed serialization / deserialization of data) or for accelerating a particular calculation, but also anywhere you need really deterministic performance in a way that isn't easy, or even possible, in software / an emulator alone.

                              • progre 112 days ago
                                While FPGAs are often used for implementing CPUs, that's really not the best use of an FPGA in my opinion. Of course it's very useful when prototyping a CPU, but if you only want a CPU... you can just buy one.

                                I think a more interesting use is for hardware that you can't buy. Like the balls in this project:

                                https://en.wikipedia.org/wiki/IceCube_Neutrino_Observatory

                                Those balls contain frequency analyzers implemented on FPGAs, essentially doing Fourier transforms using lookup-table maths. This means they can do transforms much, much faster than in software.

                                • rcxdude 112 days ago
                                  It can be more efficient. It depends on how sophisticated an emulator you are writing and how well the emulated CPU semantics match the host CPU's. For a simple emulator of a CPU which is vastly different from the host, an FPGA will likely be much faster and use less power: although FPGAs are generally slower than dedicated silicon, they are really good at simulating arbitrary digital logic very fast, while CPUs are quite poor at it.
                                • pkaye 112 days ago
                                  I've always wondered if it is feasible to do a pipelined/superscalar stack-based CPU.
                                  • FullyFunctional 112 days ago
                                    Contrary to much common misunderstanding, sure, no real problem.

                                    The semantics of each instruction are: consume the two top stack elements and replace them with the result. You handle this by keeping a rename stack of physical registers (with additional complication for handling under- and overflows). Assume the current stack is

                                      pr3 pr4 pr5

                                    and the first free register is pr56.

                                    Then a "+" instruction is interpreted as "add pr56, pr4, pr5"; pr56 is consumed from the free list, and pr4 and pr5 are marked to be freed when this commits.

                                    Because stack machines inherently introduce a lot of tight dependencies, you will need dynamic scheduling (OoOE) to go superscalar, but that's not a problem.

                                    Upside: incredible instruction density. Downside: slightly harder to do good code generation, but not really.
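
                                    To make the renaming step concrete, here's a toy sketch in C (purely illustrative, not any real core; the register numbers follow the example above):

                                        #include <stdio.h>

                                        /* Toy stack renaming: the architectural stack holds physical
                                           register ids; an op pops its sources and pushes a freshly
                                           allocated physical register for its result. */
                                        int stk[64], depth = 0, next_free = 56;

                                        void push(int pr) { stk[depth++] = pr; }
                                        int pop(void)     { return stk[--depth]; }

                                        void rename_add(void) {
                                            int b = pop(), a = pop();       /* pr5, pr4 */
                                            int dst = next_free++;          /* consume pr56 from the free list */
                                            push(dst);
                                            printf("add pr%d, pr%d, pr%d ; free pr%d, pr%d at commit\n",
                                                   dst, a, b, a, b);
                                        }

                                        int main(void) {
                                            push(3); push(4); push(5);      /* current stack: pr3 pr4 pr5 */
                                            rename_add();                   /* prints: add pr56, pr4, pr5 ... */
                                            return 0;
                                        }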

                                    • klelatti 112 days ago
                                      Thank you for an interesting answer. This raises the question, of course: why have they never caught on?
                                      • FullyFunctional 112 days ago
                                        I'm assuming partly inertia, partly the code density not being important enough to do this. To be clear, while you _can_ go OoO superscalar with a stack machine, it's more work than with an ISA that exposes the dependencies, like VLIW or EDGE.

                                        Don't take my word for it, design, model, simulate, and implement it yourself. It's a small matter of coding.

                                        EDIT: Reduceron is an example of a super-scalar stack machine, though not dynamically scheduled. It's very difficult to write code by hand though.

                                        • klelatti 112 days ago
                                          Thanks - the Reduceron looks really interesting.
                                    • crest 112 days ago
                                      It's possible to pipeline a stack CPU, but there is an upper limit on the number of pipeline stages that make sense. A Forth-friendly dual-stack (data and return) CPU can be very simple and small yet achieve a throughput close to one instruction per cycle. Going superscalar on a single data stack is hard: you basically have to rename every stack access, and most compute instructions consume two arguments while producing a single result. You've trapped yourself in a fate about as bad as having to implement a superscalar x87 FPU.

                                      One interesting design to break the one-instruction-per-clock barrier is a multi-stack VLIW design, in which the dense encoding of stack-based instructions compensates for the low code density common in VLIW instruction sets. See https://bernd-paysan.de/4stack.html for an example of this out-of-the-box approach.

                                    • moonchild 113 days ago
                                      > double precision (i.e. 32 bit) math

                                      Is this standard nomenclature anywhere? IME 'double precision' generally refers to 64-bit floating-point values, and 32-bit is called 'single precision'.

                                      • howerj 113 days ago
                                        It should also be noted that "double" in this context has nothing to do with floating-point numbers; Forth implementations often do not have the functions (called "words") to manipulate FP numbers at all. Instead, I believe it refers to the double-number word set, words like "d+" and "d-". See <http://lars.nocrew.org/forth2012/double.html>.
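
                                        As a sketch of what that word set does on a 16-bit system (illustrative C, not taken from any particular Forth): a double number is two cells, and "d+" adds the low cells first, then carries into the high cells.

                                            #include <stdint.h>
                                            #include <stdio.h>

                                            /* A double number as two 16-bit cells; "d+" propagates
                                               the carry from the low cells into the high cells. */
                                            typedef struct { uint16_t lo, hi; } dcell;

                                            dcell d_add(dcell a, dcell b) {
                                                uint32_t lo = (uint32_t)a.lo + b.lo;
                                                dcell r = { (uint16_t)lo,
                                                            (uint16_t)(a.hi + b.hi + (lo >> 16)) };
                                                return r;
                                            }

                                            int main(void) {
                                                dcell a = { 0xFFFF, 0x0001 };   /* 0x0001FFFF */
                                                dcell b = { 0x0001, 0x0000 };   /* 0x00000001 */
                                                dcell r = d_add(a, b);
                                                printf("%04X%04X\n", r.hi, r.lo);  /* prints 00020000 */
                                                return 0;
                                            }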

                                        Forth often uses different, or slightly odd, terminology for common computer-science terms because of how the language evolved independently of universities and research labs. For example, "meta-compiler" is used where "cross-compiler" would be more appropriate; instead of functions, Forth has "words", which are combined into a "dictionary"; and because the term "word" is already taken, "cell" is used when referring to a machine's word width.

                                        Edit: Grammar.

                                        • jabl 113 days ago
                                          If you look at e.g. x86 ASM manuals, you have WORD (16 bits), doubleword (DWORD, 32 bits), and quadword (QWORD, 64 bits). So even though it's nowadays a 64-bit CPU, the nomenclature from the 16-bit days sticks.

                                          Double precision usually refers to 64-bit floating point, like you say.

                                          I would agree that this usage is not standard.

                                          • jimktrains2 112 days ago
                                            > Double precision usually refers to 64-bit floating point, like you say.

                                             Is it? Doesn't `double` in C refer to a 32-bit value?

                                            EDIT:

                                             So, it seems I've not dealt with this in much too long, and am misremembering and therefore wrong.

                                                 #include <stdio.h>
                                                 int main() {
                                                   printf("sizeof(float) = %zu\n", sizeof(float));
                                                   printf("sizeof(double) = %zu\n", sizeof(double));
                                                   return 0;
                                                 }
                                            
                                            yields

                                                sizeof(float) = 4
                                                sizeof(double) = 8
                                            
                                             on an Intel(R) Core(TM) i5-6200U (more-or-less a run-of-the-mill 64-bit x86-family core). I don't have a 32-bit processor handy to test, but I don't believe it'd change the results.
                                            • froydnj 112 days ago
                                              Not usually; `float` is 32 bits and `double` is 64 bits on virtually every common platform (maybe not on some DSP chips or certain embedded chips?). But the C++ standard (and probably the C one) only requires that `double` have at least as much precision as a `float`, so it's conceivable you could have a C++ implementation with 32-bit `float` and `double`, or 16-bit `float` and 32-bit `double`.
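
                                              If you want to rely on the common 32/64-bit sizes, you can check them at compile time (a small C11 sketch; the standard itself only guarantees the precision ordering):

                                                  #include <assert.h>

                                                  /* Fails to compile on a platform with unusual sizes. */
                                                  static_assert(sizeof(float) == 4, "float is not 32 bits here");
                                                  static_assert(sizeof(double) == 8, "double is not 64 bits here");

                                                  int main(void) { return 0; }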
                                              • coliveira 112 days ago
                                                C never required that the size of int or double be the same across compilers. Even on the same machine they can have different sizes.
                                                • zokier 112 days ago
                                                  "F.2 Types

                                                  The C floating types match the IEC 60559 formats as follows:

                                                  — The float type matches the IEC 60559 single format.

                                                  — The double type matches the IEC 60559 double format.

                                                  — The long double type matches an IEC 60559 extended format, else a non-IEC 60559 extended format, else the IEC 60559 double format.

                                                  Any non-IEC 60559 extended format used for the long double type shall have more precision than IEC 60559 double and at least the range of IEC 60559 double.

                                                  Recommended practice

                                                  The long double type should match an IEC 60559 extended format."

                                                  ISO/IEC 9899:1999, Annex F

                                                • Something1234 112 days ago
                                                  float == 32 bits, double == 64 bits.
                                              • chalst 113 days ago
                                                It used to be the case that double precision meant two words, which fits this 16-bit CPU. That usage is fairly rare these days, now that we care more about portability.
                                              • sanjarsk 113 days ago
                                                Since it's a 16-bit CPU, double precision implies 32 bits.
                                              • bmitc 112 days ago
                                                Forth is fun. I haven’t learned much of it or used it much, but it has been enjoyable.
                                                • cbmuser 113 days ago
                                                  The name is a bit confusing, as there is already a J2 CPU, which is the open-source variant of the SuperH SH-2.
                                                  • jecel 112 days ago
                                                    The J1 was published in 2010 and the J2 in 2015, so it was up to the J2's authors to avoid the confusion.