> Google will have the cloud TPU ... to handle training models for various machine learning-driven tasks, and then run the inference from that model on a specialized chip that runs a lighter version of TensorFlow that doesn’t consume as much power ... dramatically reduce the footprint required in a device that’s actually capturing the data ... Google will be releasing the chip on a kind of modular board not so dissimilar to the Raspberry Pi ... it’ll help entice developers who are already working with TensorFlow as their primary machine learning framework with the idea of a chip that’ll run those models even faster and more efficiently.
If you're interested in playing with Chisel, the "Chisel Bootcamp" is now hosted on Binder, which means you can work through a fair amount of the learning content in a browser [1,2].
As a longer, elaborating point: Chisel is much closer to the LLVM compiler infrastructure project than a new hardware description language. Chisel is a front end targeting the FIRRTL circuit IR. There's a FIRRTL compiler that optimizes the IR with built-in and user-added transforms. A Verilog emitter then takes "lowered" FIRRTL and emits Verilog.
Consequently, Chisel is only the tip of the iceberg of the infrastructure the Edge TPU was built on. The speakers in the video mention this explicitly when explaining the "Chisel Learning Curve" slide and when discussing automated CSR insertion.
As a further elaboration, Chisel is, strictly speaking, not High-Level Synthesis (HLS). You write parameterized circuit generators, not an algorithm that gets optimized down to Verilog.
So my guess is that Chisel is one of the many responses to the two horrors that are VHDL and Verilog.
Unfortunately, Chisel is built on Scala, and I have no interest in learning Scala. Though I'm intrigued by the claim of using generators and not instances, and would be interested in a white paper that explains it in PL-agnostic terms (PL: programming language).
Also on my to-do list is MyHDL, a Python solution to the same problem. (Has anyone tried it and found it to be better than VHDL/Verilog?)
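In the meantime, a PL-agnostic sketch of the "generators, not instances" idea in plain Python (hypothetical; real Chisel or MyHDL generators elaborate to an in-memory circuit rather than emitting Verilog text, but the principle is the same):

```python
def register(name, width):
    """Hypothetical generator: one function that can elaborate a
    Verilog register module for ANY width, instead of one
    hand-written module per width."""
    return "\n".join([
        f"module {name} (",
        "  input clk,",
        f"  input [{width - 1}:0] d,",
        f"  output reg [{width - 1}:0] q",
        ");",
        "  always @(posedge clk) q <= d;",
        "endmodule",
    ])

# One generator, many instances:
print(register("reg8", 8))
print(register("reg32", 32))
```

The point is that the generator is a first-class value in the host language, so you can parameterize, compose, and test it like any other function.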
I get the impression that people who talk about the "horrors" that are VHDL and Verilog for hardware design are software developers who have little to no knowledge about hardware design processes.
There are reasons why VHDL/Verilog are still in use in the industry and why high-level synthesis hasn't taken off.
VHDL/Verilog for hardware design is not broken. I won't claim that there isn't space for improvement (because there is) but there isn't anything fundamentally broken in them. They are fit for the purpose and they fulfill all of the needs we have.
What could be massively improved is actually the functional verification languages we use, SystemVerilog for verification is in serious need of an overhaul.
OK. I'll bite. I only have experience with verilog, but it's basically uncomfortable to work with in the sense that there are absolutely no developer ergonomics. We're well into the 21st century, and you'd think that our HDLs would have learned from everything the software world has learned.
1) The syntax is very finicky (slightly more so than C, I'd say). Most software languages (thanks to more experience with parsers and compilers) have moved on from things like requiring semicolons; verilog has not.
2) Writing tests is awful. Testbenches are crazy confusing. Much better would be some sort of unit testing system that does a better job of segregating what constitutes "testing code" from the "language of the gates". You would have a hard time doing something like, say, property testing using verilog.
3) There isn't a consistent build/import story with verilog. I once worked with an engineer who literally used perl as a verilog metaprogramming language. His codebase had a hard-to-find perl frankenbug which sometimes inserted about 10k lines of nonsense (which somehow still assembled a correct netlist!) but caused gate timings to severely miss and the footprint to be bloated. It took the other hardware developers a week to track down the error.
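For what it's worth, property testing against a pure-Python reference model might look like the sketch below (hypothetical; a real flow would drive a simulator, e.g. via cocotb, rather than a Python model of the DUT):

```python
import random

def ripple_carry_add(a, b, width=8):
    """Software model of an n-bit ripple-carry adder (stand-in for
    the DUT; carry-out is discarded, i.e. result is mod 2**width)."""
    carry, result = 0, 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry
        carry = (abit & bbit) | (carry & (abit ^ bbit))
        result |= s << i
    return result

def test_adder_property(trials=1000, width=8):
    # Property: the adder matches integer addition modulo 2**width,
    # checked on random input samples rather than hand-picked vectors.
    for _ in range(trials):
        a = random.randrange(2 ** width)
        b = random.randrange(2 ** width)
        assert ripple_carry_add(a, b, width) == (a + b) % (2 ** width)

test_adder_property()
```

The appeal is that the property ("matches integer addition mod 2^n") is stated once, and the framework hammers it with inputs, instead of a testbench enumerating cases by hand.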
None of these things have anything to do with the fundamental difference between software and hardware development.
> 2) Writing tests is awful. Testbenches are crazy confusing. Much better would be some sort of unit testing system that does a better job of segregating what constitutes "testing code" from the "language of the gates". You would have a hard time doing something like, say, property testing using verilog.
SystemVerilog makes this distinction between RTL (language of the gates) and verification environment code (wrt. testing, they are different things in my experience) very clearly. SystemVerilog inherits much of what people dislike about Verilog, but it makes writing large verification environments much easier. Again, not without lots of potential pain points, but you can do an awful lot that way.
On a slight aside: the different approaches to functional verification that come from the software world vs. the hardware world worry me (though perhaps unreasonably).
Software verification seems to (generally) be a much more continuous affair, while for hardware there is an extremely intense period of verification before the product is delivered to a customer (as IP) or physically manufactured. This arises because fixing software bugs is cheap compared to fixing hardware (again, please accept my generalising!).
It makes me shiver a little to hear people applying software "testing" strategies and terms to verifying actual hardware. I don't know if this is reflected by their actual practice, of course. There is a lot of potential for the hardware community to make use of many software development practices in their verification environments (big SystemVerilog testbenches are giant class hierarchies which are far more akin to straight-up software), but I'm yet to be convinced about hardware itself. The development constraints are so different, and the possibility of continuous development is hindered by the hard cut-off point (manufacture).
I am a software engineer who's been involved in the tapeout of a few ASICs (although none of the TPUs). Particularly when you plan to build a series of chips, the continuous approach taken by software is massively preferable. X v2 does what X v1 did, plus some additional things, and with all of the errata fixed. Also, you find the errata in X v1 after tapeout but before your driver team does, saving them an enormous amount of work trying to track down a driver bug that's actually a HW bug (maybe even one with a simple workaround).
> Particularly when you plan to build a series of chips, the continuous approach taken by software is massively preferable.
For sure. I think continuous integration and cataloging of things like coverage collection is something hardware development really benefits from.
The things that hardware development can learn best from the software world are (in my opinion) mainly down to developing and maintaining verification environments, because they are (mostly) just big software projects. The constrained-random variety are, anyway.
Lisp does use too many parens. There's a Gunning-fog cost associated with debugging Lisp, and that's one reason why I don't code in it even though, professionally, I have my choice of languages and Scheme was one of the first I learned.
Maybe nitpicking, but languages like Chisel and MyHDL aren't really HLS. There is a straightforward mapping between the written language and the rendered result, and there should be little surprise in what logic is actually generated.
I am convinced that some specimen of this class of languages will eventually overtake verilog.
One feature I'm eagerly waiting for is an equivalent of Option/Maybe types, which makes it impossible to access some signals unless they are signaled as valid by a qualifier signal.
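Chisel's util library does ship a Valid bundle that pairs a payload with a qualifier bit, but nothing stops you from reading the payload while valid is low. The behavior being asked for could look roughly like this Python sketch (hypothetical API, not any existing HDL):

```python
class ValidSignal:
    """Hypothetical sketch of an Option/Maybe-style signal wrapper:
    the payload can only be read when the 'valid' qualifier is set."""
    def __init__(self, bits, valid):
        self._bits = bits
        self.valid = valid

    @property
    def bits(self):
        # Here an invalid read is a runtime error; a real HDL would
        # ideally reject such an access statically, at elaboration.
        if not self.valid:
            raise ValueError("payload read while valid is deasserted")
        return self._bits

def consume(sig, default=0):
    """Consumers are forced to handle the invalid case explicitly."""
    return sig.bits if sig.valid else default
```

The design point is that the type system, rather than designer discipline, guarantees no logic ever acts on a stale payload.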
I'm curious about what improvements you would like to see in SystemVerilog?
It doesn't look like you're open to any serious criticisms of the two Vs, but the readers of your comment and mine deserve to look at the arguments and make up their own mind. Therefore, I'm linking the pages regarding rationale for some of the recent HDLs:
Chisel isn't high level synthesis. It's not a good name to describe it when the overwhelming majority of "HLS" projects are C/C++ compilers and are completely different beasts in design and theory. Honestly, almost every experienced hardware engineer I meet who's only heard of these languages thinks this, so I partially think it's a marketing failure, but I also get the impression HW engineers think literally anything that is not Verilog is "high level" which is just simply untrue. (If I had any say in the matter, probably the only real "high level synthesis" language that isn't just a tagline for "Compile C++ to Hardware" I've experienced is BlueSpec Verilog.)
I haven't used Chisel personally, but from my experience with Clash, it is better to think of them as structural RTLs that have vastly better abstraction capabilities than VHDL/Verilog. And I don't mean whatever weird things hardware designers think up when they say "abstraction" and chuckle about software programmers (before writing a shitload of tedious verification tests or using Perl to generate finite state machines, because most don't know the difference between a "macro" and a "preprocessor", and no, I am not venting). I mean real abstraction capabilities: for example, parametric types alone can drastically reduce the amount of boilerplate you need for many tedious tasks, and those parametric types inline and are statically elaborated in much the same way you expect "static elaboration" of RTL modules to work. Types are far more powerful than module parameters and inherently higher order, so you get lots of code reuse. In Clash, it's pretty easy to write stateful, 'behavioral'-looking code that is statically elaborated to structural code, using things like State monads, so there's a decent range of abstraction capabilities, but the language is generally very close to structural design. These languages are overall simply more concise and let you express things more clearly for a number of reasons, and often compare favorably (IMO) even to more behavioral models (among other things, functions are closer to the unit of modularity and are vastly briefer than Verilog modules, which are just crap). Alternative RTLs like MyHDL are more behavioral, in contrast.
The biggest problem with these languages are that the netlists are harder to work with, in my experience. But the actual languages and tools are mostly pretty good. And yes, they do make verification quite nice -- Clash for example can be tested easily with Haskell and all Clash programs are valid Haskell programs that you can "simulate", so you have thousands of libraries, generators, frameworks etc to use to make all of those things really nice.
(This is all completely separate from what a lot of hardware designers do, which is stitch together working IP and verify it, as you note with the verification comment. That's another big problem, arguably the much more important one, and it is larger than the particular choice of RTL in question but isn't the focus here.)
If you'd like something for your to-do list: I built this once, but it was a long time ago, Julia has gotten a lot better, and it only does combinational logic (not sequential logic). I had an idea for how to do sequential logic using lambda closures in Julia, but I never got around to it.
Am I missing something, or are they not only not documenting the chip, they are also not even releasing the compiler, but requiring you to use a cloud based compiler:
> You need to create a quantized TensorFlow Lite model and then compile the model for compatibility with the Edge TPU. We will provide a cloud-based compiler tool that accepts your .tflite file and returns a version that's compatible with the Edge TPU.
This seems like a new low in software freedom, and pretty risky to depend on as Google is known to shutter services pretty often and could just decide to turn off their cloud-based compiler at any time they feel like.
Chips without a public toolchain are not worth investing your time in. It is bad enough if your work is tied to specific hardware for which there may at some point not be a replacement but to not even have the toolchain under your control makes it a negative.
Seriously, Intel is already struggling after buying Nervana.
I went to their shindig and they were working their butts off to wow the developers who were invited. When I asked for hard numbers they were very mum and evasive.
The timeline for the Nervana chip has always sat on some mystical horizon, never solidified into a real date, just over yonder.
Google is going to pull this crap? They have better software expertise than Intel, so they may be able to do it. But after that fiasco with Angular 1 to 2, I wouldn't trust Google with any early version number.
Nervana had a lot of other issues. It was trying to produce an ASIC with 50 employees. When they got acquired by Intel, the first thing they had to tackle was hiring the engineers necessary to actually produce an ASIC, which inevitably slows down production, and then on top of that they got caught in the Intel 10nm bear trap.
AI is too powerful a technology to let it out there to the masses. People might use it for killer drones after all. All users of AI must be tightly controlled and registered with the authorities!
This is the problem with certain kinds of technology that are bumping up against the edge of innovation. They're too powerful and if these technologies get in the hands of the DIY set, governments will lose control so they have to DRM and regulate everything. Heck, it's a problem with old technology. Many weapons aren't that complicated technologically, but their production and use are tightly regulated.
Edit: I'm not saying this is a good thing; I'm just deconstructing their thought process for tight control over AI tech going forward.
For some reason drones are perceived to be completely different from all weapons that have existed before them. Those killer drones have existed for half a century. They are called missiles. Also, the reason why UAV-based fighter jets are not viable is that a cruise missile can be launched from 1000 miles away, and for the cost of a Global Hawk you can send out more than a hundred of them.
If terrorists have access to explosives, then it doesn't matter how they deliver them, because most lucrative targets (= lots of people in a small area) are stationary or predictable. A simple backpack filled with explosives was more than enough to injure hundreds of people during the Boston Marathon.
You can make an unguided, explosive-filled rocket that can harm people for cheap from scrap. Insurgents throughout the world have done so for the past 40 years. That may not be as simple as Add To Cart, but it is well within the economic means of almost everyone.
It doesn't require a cloud-based compiler; the quote above shows that you use TF-Lite, which is an open source project, or a cloud-based tool for people who don't want/need/have the ability to work with TF-Lite.
[UPDATE] I misread and assumed the previous case (where no cloud tool was required) was still true (I worked with previous versions of this device).
Do you have a source for this, or are you just reading the statement I quoted differently than I am?
The way I read the quote, you use TF-Lite to produce a quantized TF-Lite model, and then use a cloud based compiler to compile it for the actual chip.
This is why I asked "am I missing something." Do you have a reference for where the compiler exists in the open source TensorFlow project?
Mostly, what I'm interested in is learning what capabilities their TPU provides, to see if it would be useful for other similar kinds of kernels like DSP (which, like machine learning kernels, also involves a lot of convolution).
So I'm interested in looking at what the capabilities of the chip are, seeing what could be compiled to it. But I haven't found those docs, or found a compiler that could be studied. But maybe I'm not looking in the right place.
Here's an overview of the architecture of their Cloud TPUs, which has some good architectural details but doesn't document the instruction set:
Well, they decided not only to make it closed source but also to lock it up behind an HTTP frontend so you can't even reverse engineer it. Criticizing Google for suddenly shuttering things that companies depend on is entirely justified.
Google has mastered the art of using open source to crush competition, like they did with Chrome and Android. They never reveal their main moneymakers. For example, they opened up their MapReduce technique only after they had moved on from it.
This is a long time coming. I'm normally not a big fan of large companies building products in the embedded space that could potentially destroy competition and future innovation but this is needed.
Nvidia's embedded boards are EXPENSIVE. So expensive it limits the applications dramatically. They also require a different skillset in people to set up which drives up the cost.
We did an analysis for a security project that required visual inference. It turned out all the extra costs to setup with TX boards meant it actually made more sense to have mini desktops with consumer gtx cards.
I am excited to see the performance of the inference module. If it's decent at a good price, that opens up so many pi/beagle/arduino applications that were limited by both cost and form factor of existing options.
Nvidia provides a line of embedded systems for accelerated compute called Tegra. It's a pretty awesome kit but costs $150-500, depending on the compute necessary. Probably a new one will be announced in a month's time, hence Google is trying to get ahead.
Currently the only real options for amateur off-the-shelf (accelerated) edge ML are the Nvidia boards (but small carrier boards for the TX2 cost more than the module itself) or the Intel NCS which inexplicably blocks every other USB port on the host device due to its poorly designed case. There is the Movidius chip itself, but Intel won't sell you one unless you're a volume customer. The NCS also does bizarre things: the setup script will clobber an existing installation of opencv with no warning, for example.
There are various optimised machine learning frameworks for ARM, but I'm only counting hardware-accelerated boards here. I'm also not including the various Kickstarter or Indiegogo boards, which might as well be vapourware.
There are no good, cheap, embedded boards with USB3 that I can find. There are a few Chinese boards with USB3, but none of them have anywhere near the quality of support that the Pi has.
Then camera support. The Pi has a CSI port, but it's undocumented and only works with the Pi camera. The TX2 is pretty good, but you need to dig through the documentation to figure things out. USB is fine, but CSI is typically faster and frees up a valuable port.
Finally another issue is fast storage. It's difficult to capture raw video on the Pi because you can't store anything faster than about 20MB/s. There are almost no boards that support SATA or similar (the TX2 does), so the ability to use USB3 storage would be welcome too.
If this is offered at a reasonable price point, it could be a really nice tool for hobbyists. It looks like they're trying to keep GPIO pin compatibility with the Pi too.
The Raspberry Pi provides documentation for their GPU architecture, so it would be possible to provide support for that within open source machine learning frameworks. It would involve quite a bit of work, though, and the RPi is not really competitive with modern hardware in performance-per-watt terms, even when using GPU compute.
I believe Idein did that. At least they regularly post impressively (for the Pi) fast examples to /r/raspberry_pi, like https://redd.it/a5o6ou. It seems the result isn't available individually or as open source, but only in the form of a service (https://actcast.io/).
There are some well-optimised libraries, for example a port of darknet that uses NNPACK and some other NEON goodies. You can do about 1fps with Tiny YOLO. Not sure if it uses anything on the GPU though.
Yes, I know. My point was that CPU-only deep learning is possible on the Pi if you don't need real-time inference. What I wasn't sure of is whether that specific port does anything on the GPU at all, or if it's only using NEON intrinsics.
> You need to create a quantized TensorFlow Lite model and then compile the model for compatibility with the Edge TPU. We will provide a cloud-based compiler tool that accepts your .tflite file and returns a version that's compatible with the Edge TPU.
I seriously hope that's not the only way they're expecting people to compile models for this particular TPU.
I'd rather they enable independent development of models on the hardware they're selling. This is about as useful as a high performance electric car you can only charge at authorised dealerships. Dealerships which have the unfortunate habit of closing down after a few years or so.
This has been around for a while but has been stuck at 'Coming Soon' forever. Does anyone know what the status of this project actually is? I suspect that it has been stalled for some reason or the other.
Google has been marketing the TPUs for a long time, but they were not even using them much, as they were still on the Nvidia stack. Not sure if that has changed in the past 9 months, but my guess is that they are now able to run TensorFlow on them.
What is the use of 100fps vision models other than as input to a controller (e.g. driving, flying, etc.)? A Raspberry Pi can manage about 3fps with standard open source frameworks, and this is enough for many applications, e.g. construction site surveillance. Not criticizing; this is a genuine interest in understanding the edge ML vision market.
I worked on an optical sorting machine: you have a stream of fruit on a very fast conveyor belt, and a machine vision system scans the passing fruit, detects each object, and rejects (by firing a stream of air) those that don't pass: moldy fruit, odd colors, or foreign material like rocks or leaves. 100 fps might be enough, but the faster you go, the faster your conveyor belt can be.
The 100fps model is also much more efficient in W/FLOP, J/FLOP, or W/fps, which is very important for embedded and mobile applications. You can design your construction site surveillance system to record 10 frames a minute while the ML accelerator sleeps, and then process 100 frames all at once in a few seconds, which reduces the duty cycle tremendously, improving battery and device life.
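To put rough numbers on the duty-cycling argument (a back-of-envelope sketch; the power figures are made-up placeholders, not measurements of any real accelerator):

```python
ACTIVE_W = 2.0   # accelerator active power in watts (assumed)
SLEEP_W = 0.01   # accelerator sleep power in watts (assumed)
FPS = 100        # inference throughput when awake

def average_power(frames_per_minute):
    """Average power when batching frames at full speed and
    sleeping the accelerator for the rest of each minute."""
    active_s = frames_per_minute / FPS    # seconds awake per minute
    sleep_s = 60 - active_s
    return (active_s * ACTIVE_W + sleep_s * SLEEP_W) / 60
```

With these placeholder numbers, 10 frames a minute keeps the accelerator awake for only 0.1 s/minute, so the average draw is dominated by the sleep power rather than the active power.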
Movidius is not a TPU. It's more like a GPU, but with SIMD, DSP and even VLIW capabilities and with a _very_ wide memory bus (and massive throughput). It's rather impressive actually, but probably serious overkill for what really needs to be done during inference: https://en.wikichip.org/wiki/movidius/microarchitectures/sha.... Whereas a TPU is highly specialized for just, you guessed it, processing tensors, which basically means matrix and vector multiply. It's a systolic architecture, so it also (purportedly, since I don't have insider knowledge) stores the weights for the computation for the duration of the computation.
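For intuition, the multiply-accumulate that a weight-stationary systolic array performs reduces to the following functional sketch (not cycle-accurate; a real array also skews activations diagonally so every PE stays busy each cycle):

```python
def systolic_matvec(W, x):
    """Toy model of y = W @ x as a weight-stationary systolic array
    computes it. PE (r, c) permanently holds W[r][c]; the inner loop
    stands in for partial sums hopping through one PE per cycle."""
    y = []
    for r in range(len(W)):
        acc = 0                      # partial sum entering the row
        for c, xc in enumerate(x):   # one "cycle" per PE hop
            acc += W[r][c] * xc      # MAC performed at PE (r, c)
        y.append(acc)
    return y
```

The weights never move during the computation, which is exactly why the architecture is so power-efficient for inference: only activations and partial sums flow, while the (large) weight matrix stays put.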
As far as I understand "TPU" is Google's brand name, so of course competing products are not TPUs. There is an overlap in what you can do with the devices, so a comparison of their strengths would be useful.
I don't know enough about this, but how do these devices compare to the Sipeed MAIX devices I saw mentioned on HN the other day? They seem to both support TensorFlow Lite, but that's where my ability to understand their capabilities ends.
The risk with crowd funded chips is support and longevity. The reason that the Raspberry Pi wins every time against technically superior competition is that it's well supported and there are reasonable supply guarantees.
The same goes for big companies of course. Intel has a habit of releasing Iot platforms and then killing them. Let's hope the TPU lasts a bit longer.
As for a comparison, it's impossible to say until Google releases benchmark information on the edge TPU, or some kind of datasheet for the SOM.
Given Google's tendency to kill products and shift priorities rapidly, I think building a product or service dependent on a supply of their hardware is probably a pretty risky choice.
I definitely have been shocked how fast Intel maker boards have come and gone though. It feels like Intel has written them off before anyone's tried to build a project using one. I have one sitting around here somewhere that's never so much as been powered on.
It's very hard to beat the traction that the Pi has. I think because it's explicitly targeted towards people without any embedded experience, there's been a lot of pressure to make things work and to make the documentation somewhat organised.
Intel made some nice little boards, but there wasn't much publicity and actually getting started with them wasn't easy at all because the docs were buried. They were usually modules designed for integration, not standalone devices.
With the Pi you can buy a kit, plug in the SD card, and boot to desktop in minutes.
The NXP® i.MX 8MQuad board is available for $150 and has USB3 and PCIe. The TPU would probably be attached through one of those buses. I would bet around $250 with the TPU, which is pretty good and puts it at around half the price of the Jetson TX2 and 1/5 of the Xavier. I wonder if the TPU could be used for SLAM, not just object identification; now that would be useful.
The competitors don't really keep chips around for longer. Intel isn't manufacturing Skylake anymore. Nvidia isn't manufacturing Maxwell GPUs anymore. (Incidentally, Apple did appear to be using their 4-year old A8 SoC in the first HomePods, released in 2018, though.)
Hardware and software are different things. We are all sad that Google Reader doesn't exist anymore, but every silicon product has basically been a flash in the pan. They make it, you buy it, and by the time it's shipped to you, it's announced as obsolete. That's the pace of that industry. Maybe with Google's attention span, they should have been a hardware company all along. They will fit right in.
My example was inaccurate, because at least dead Replicants can be replaced with newer models, whereas dead Google products have no follow-up model and require completely replacing what you had created around that product. That's something seemingly unique to either vaporware start-ups, or Google.
At least you can still write software for those. If Google decides your particular flavor of chip is no longer supported then good luck. Besides that, good luck to acquire those chips in the first place, 'coming soon' without a stated delivery date may well translate into 'never'.
I'll stick to the usual suspects before I get roped into some cloud based development system. Why does Google need access to my IP to begin with?