The Looming Battle Over AI Chips

(barrons.com)

124 points | by poster123 2197 days ago

16 comments

  • oneshot908 2197 days ago
    If you're FB, GOOG, AAPL, AMZN, BIDU, etc, this strategy makes sense because much like they have siloed data, they also have siloed computation graphs for which they can lovingly design artisan transistors to make the perfect craft ASIC. There's big money in this.

    Or you can be like BIDU, buy 100K consumer GPUs, and put them in your datacenter. In response, Jensen altered the CUDA 9.1 licensing agreement and the EULA for Titan V such that going forward you cannot deploy Titan V in a datacenter for anything but mining cryptocurrency, and his company reserves the right to audit your use of their SW and HW at any time to force compliance with whatever rules Jensen pulled out of his butt that day after his morning weed. And that's a shame, because there's no way any of these companies can beat the !/$ of consumer GPUs, and NVDA is lying out of its a$$ to say you can't do HPC on them.

    But beyond NVDA shenanigans, I think it's incredibly risky to second guess those siloed computation graphs from the outside in the hopes of anything but an acqui-hire for an internal effort. Things ended well for Nervana even if their HW didn't ship in time, but when I see a 2018 company (http://nearist.ai/k-nn-benchmarks-part-wikipedia) comparing their unavailable powerpoint processor to GPUs from 2013, and then doubling down on doing so when someone rightly points out how stupid that is, I see a beached fail whale in the making, not a threat to NVDA's Deepopoly.

    • zantana 2197 days ago
      I just had an image of pranksters "swatting" datacenters by feeding false information to Nvidia that they are using consumer cards.
    • joshuamorton 2197 days ago
      I'm a little confused here, are you saying that ML ASICs can't beat compute per $ of GPUs? That seems, on its face, to be a ridiculous assertion, so I'm confused where I'm misunderstanding you.
      • oneshot908 2197 days ago
        No, they can't when you can get tens of TFLOPS per GPU for <$1000, and it comes with a solid software ecosystem for all the major AI frameworks out of the box. That's the power of the gaming/industrial complex: NVDA can fund the development of AI chips and software from the nearly bottomless pockets of millions of gamers and cryptocurrency miners. ASIC startups figuratively have one bullet in their gun, and they have to hit the bullseye to live.

        Now when a Tesla GPU costs $9-10K instead of <$1000, that's a somewhat different story, but even then, NVDA tapes out a brand new architecture almost annually. Good luck keeping up with that pace, ASIC startups. And that's exactly what happened to Nervana: their ASIC was better than a Pascal GP100, but it was clobbered by the Volta V100. So at best you get a 6-12 month lead on them.
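
        A rough $/TFLOPS comparison with the ballpark figures above (assumed list prices and roughly similar raw FP32 throughput, purely illustrative):

          # Assumed ballpark figures, not quotes: a high-end consumer card vs a Tesla part.
          consumer_price, consumer_tflops = 1_000, 15    # ~$1K, ~15 FP32 TFLOPS (assumed)
          tesla_price, tesla_tflops = 9_500, 15          # ~$9-10K, similar raw FP32 (assumed)

          print(f"Consumer: ${consumer_price / consumer_tflops:.0f} per TFLOPS")   # ~$67
          print(f"Tesla:    ${tesla_price / tesla_tflops:.0f} per TFLOPS")         # ~$633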

        In contrast, if you can right-size the transistor count for expensive and very specific workloads across 100K+ machines like companies with big datacenters and big data sets can do, I see an opportunity for all of the BigCos to build custom ASICs for their siloed data sets and graphs. That's what GOOG is doing and it seems to be working for them so far. FB is now hiring to do the same thing I suspect.

        • joshuamorton 2197 days ago
          Yes, but over the lifetime of a GPU, you'll spend more on power draw than on the physical hardware. That's where the savings come from, or at least that's what I've been told.

          A V100 costs ~$300/yr in electricity. If you are buying at the scale of 100k units and can cut power per operation even by just 10% (for example, by dropping features you don't care about), that's millions of dollars of electricity over the lifetime of your hardware.
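
          Back-of-the-envelope on that, using the ~$300/yr figure and an assumed 3-year lifetime (all illustrative):

            # Assumed: ~$300/yr electricity per V100-class card, 100k cards, 3-year life.
            cards = 100_000
            electricity_per_card_per_year = 300   # USD, assumed
            lifetime_years = 3                    # assumed
            power_cut = 0.10                      # 10% more efficient silicon

            baseline = cards * electricity_per_card_per_year * lifetime_years
            print(f"Baseline electricity: ${baseline:,.0f}")              # $90,000,000
            print(f"Savings from 10% cut: ${baseline * power_cut:,.0f}")  # $9,000,000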

      • emcq 2197 days ago
        At a high level there is a design tradeoff where you put your transistors for a given chip. For a dense linear algebra/tensor processor, it basically comes down to using your transistors for memory or compute.

        GPUs (and DSPs) historically are way on the compute side. You get kilobytes of on-chip memory and really fat parallel buses to off-chip RAM.

        On the other end, you have some chips that put more memory near the compute. It means you have less compute but way better power efficiency. Each hop from a register to cache to off-chip to between boards to between nodes is roughly a 2-10x power hit, so you get orders of magnitude here.
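
        To make the 2-10x-per-hop point concrete, here is a toy table with assumed ballpark energies per access (the real numbers vary a lot by process and vendor; only the ratios matter):

          # Assumed, illustrative energies per access.
          energy_pj = {
              "register":       1,
              "on-chip SRAM":   5,
              "off-chip DRAM":  100,
              "board-to-board": 1_000,
              "node-to-node":   10_000,
          }
          for level, pj in energy_pj.items():
              print(f"{level:15s} ~{pj:6d} pJ/access ({pj / energy_pj['register']:.0f}x a register)")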

        In terms of training, it's really hard to fit your training data on chip, in which case you end up with an architecture very similar to a GPU or DSP/TPU. NVIDIA is no slouch here. A couple of years ago the big trick was reducing precision - you don't need single or double floats during training, so an architecture more specialized for fp16 or int8 would get some big savings in power and throughput. NVIDIA is playing this game too. At Google/FB scale, tweaks to improve cost may make sense, but architecturally speaking it doesn't seem like there is some major design decision being left on the table anymore (I'd love to hear a specific counterpoint though!).

        In terms of inference, there can be some big power savings putting the weights next to compute. This isn't rocket science and folks have been doing this for a while. In a strange way it can sometimes be more efficient to use a CPU, with fat caches.

        • joshuamorton 2197 days ago
          >At Google/FB scale, tweaks to improve cost may make sense, but architecturally speaking it doesn't seem like there is some major design decision being left on the table anymore (I'd love to hear a specific counterpoint though!)

          I'd expect if I knew any, they'd be under NDA. I'll just point out that a GPU, even one with specific "ML cores" as NVIDIA calls them, is going to have a bunch of silicon that is being used inefficiently (for more "conventional" GPU uses). There's room for cost saving there. Perhaps NVIDIA eventually moves into that space and produces ML-only chips, but they don't appear to be heading in that direction yet.

          • emcq 2196 days ago
            There are many researchers not bound by NDAs making novel chip architectures, some of whom are on HN.

            Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU, I'm not sure you'd be making the same point :P At a company like MS/FB/Google, where you don't need to leverage selling the same chip for gaming/VR/ML/mining for economy of scale, like you said you can reduce your transistor count and keep the same compute fabric. This would reduce your idle power consumption through leakage, but you wouldn't expect a huge drop in power during active compute. Because lower-precision compute ends up increasing the compute per byte, you either need to find more parallelism to RAM, get faster RAM, reduce the clock rate of the compute, or reduce the number of compute elements to find a balanced architecture with lower precision. If you just shrink the number of compute elements - voila! - you're close to what NVIDIA is doing with ML cores.
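
            A toy balance check of that compute-per-byte point, with assumed numbers (the ratios are what matter, not the absolutes):

              # How many MACs/s a fixed memory system can feed at each precision,
              # and how many MAC units that is worth building. Figures assumed.
              dram_bandwidth = 900e9   # bytes/s, roughly HBM2-class (assumed)
              reuse = 50               # MACs per operand fetched from DRAM (assumed)
              clock = 1.5e9            # Hz (assumed)

              for name, bytes_per_operand in [("fp32", 4), ("int8", 1)]:
                  sustainable_macs = dram_bandwidth / bytes_per_operand * reuse
                  units = sustainable_macs / clock
                  print(f"{name}: ~{sustainable_macs / 1e12:.0f} TMAC/s sustainable, "
                        f"~{units:,.0f} MAC units at {clock / 1e9:.1f} GHz")

            In this toy model, int8 lets the same memory system feed roughly 4x the MAC throughput, which is why the balanced design ends up with either much more RAM parallelism or a different compute-element count.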

            • joshuamorton 2196 days ago
              > Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU I'm not sure you'd be making the same point :P

              I think this is semantics. Modern GPUs do have a lot of hardware that isn't specialized for machine learning. My limited knowledge says that very recent NVIDIA GPUs have some things that vaguely resemble TPU cores ("Tensor cores"), but they also have a lot of silicon for classic CUDA cores, which I was calling "graphics" hardware but which might better be described as "silicon that isn't optimized for the necessary memory bandwidth for DL". So it's still used non-optimally.

              To be clear, you can still use the CUDA cores for DL. We did that just fine for a long time; they're just decidedly less efficient than Tensor cores.

          • sanxiyn 2196 days ago
            NVIDIA is heading in that direction: see NVDLA. http://nvdla.org/
    • mattnewton 2197 days ago
      s /Jensen / nvidia ?
      • guipsp 2197 days ago
        Jensen refers to NVDA's President & CEO, Jensen Huang
    • lostgame 2197 days ago
      FYI I think your comment is informative and I understood a lot of it, but that's a shitton of acronyms for the uninitiated.
      • hueving 2197 days ago
        FB: Facebook

        GOOG: Google

        AAPL: Apple

        AMZN: Amazon

        BIDU: Baidu

        ASIC: Application specific integrated circuit

        GPU: Graphics processing unit

        CUDA: Compute Unified Device Architecture

        EULA: End-user license agreement

        weed: marijuana

        HPC: High-performance computing

        NVDA: Nvidia

        HW: hardware

      • bytematic 2197 days ago
        Most are stock ticker symbols from the NASDAQ
  • Nokinside 2197 days ago
    Nvidia will almost certainly respond to this challenge with its own specialized machine learning and inference chips. It's probably what Google, Facebook and others hope; forcing Nvidia to work harder is enough for them.

    Developing a new high-performance microarchitecture for a GPU or CPU is a complex task. A clean-sheet design takes 5-7 years even for teams that have been doing it constantly for decades at Intel, AMD, ARM or Nvidia. This includes optimizing the design for the process technology, yield, etc. and integrating memory architectures. Then there are economies of scale and price points.

    Nvidia's Volta microarchitecture design started in 2013; the launch was in December 2017.

    AMD's Zen CPU architecture design started in 2012, and the CPU was out in 2017.

    • osteele 2197 days ago
      Google’s gen2 TPU was announced May 2017, and available in beta February 2018. That 2018.02 date is probably the appropriate comparison to Volta’s 2017.12 and Zen’s 2017 dates.

      EDIT: I’m trying to draw a comparison between the availability dates (and where the companies are now), not the start of production (and their development velocity). Including the announcement date was probably a red herring.

    • jacksmith21006 2197 days ago
      I do not see why Google would care what Nvidia is doing. Why would they?
      • jacksmith21006 2197 days ago
        > I do not see why Google would care what Nvidia is doing. Why would they?

        But without the data, and without doing the higher layers of the stack, how can Nvidia create a product that is competitive with the TPUs for inference?

      • m3kw9 2197 days ago
        If they do it better and cheaper they care
        • jacksmith21006 2196 days ago
          Google is doing the entire stack, so it is kind of hard for Nvidia to have a product that is competitive with the TPUs.

          But also, Google is already on the 2nd generation of the TPUs, and I would expect a third.

          The days of chips coming from 3rd parties are probably numbered.

          The dynamics of the industry have changed: now the companies that are buying the chips have skin in the game with the cost of running the chips, versus the old days when Intel sold a chip to Dell, who then sold it to someone else.

          It just makes sense for Google to do their own silicon at this point, as it saves them a ton of money.

          But also, Google has the data to improve the chip that Nvidia just does not have.

  • joe_the_user 2197 days ago
    Nvidia, moreover, increasingly views its software for programming its chips, called CUDA, as a kind of vast operating system that would span all of the machine learning in the world, an operating system akin to what Microsoft (MSFT) was in the old days of PCs.

    Yeah, nVidia throwing its weight around by requiring that data centers pay more to use cheap consumer gaming chips may turn out to backfire, and it certainly has an abusive-monopoly flavor to it.

    As I've researched the field, Cuda really does seem to provide considerable value to the individual programmer. But making maneuvers of this sort may show the limits of that sort of advantage.

    https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/

    • oneshot908 2197 days ago
      It probably won't. For every oppressive move NVDA has made so far, there has been a swarm of low-information technophobe MBA sorts who eat their computational agitprop right up, some of them even fashion themselves as data scientists. More likely, NVDA continues becoming the Oracle of AI that everyone needs and everyone hates.
      • arca_vorago 2197 days ago
        So is OpenCL dead? Because that's how everyone is talking. The tools you choose, and their licensing, matters!
        • keldaris 2196 days ago
          OpenCL isn't dead; if you write your code from scratch, you can use it just fine and match CUDA performance. In my experience, OpenCL has two basic issues.

          The first is the ecosystem. Nvidia went to great lengths to provide well optimized libraries built on top of CUDA that supply things people care about - deep learning stuff, dense as well as sparse linear algebra, etc. There's nothing meaningfully competitive on the OpenCL side of things.

          The second is the user friendliness of the API and the implementations. OpenCL is basically analogous to OpenGL in terms of design: it's a verbose, annoying C API with huge amounts of trivial boilerplate. By contrast, CUDA supports most of the C++ convenience features relevant in this problem space, has decent tools, IDE and debugger integration, etc.

          Neither of these issues is necessarily a dealbreaker if you're willing to invest the effort, but choosing OpenCL over CUDA requires prioritizing portability over user friendliness, available libraries and tooling. As a consequence, not many people choose OpenCL, and the dominance of CUDA continues to grow. Unfortunately, I don't see that changing in the near future.

        • joe_the_user 2197 days ago
          When I looked at OpenCl, my impression was that it was simply a kind of bundle of functionality that happened to be shared by different GPUs (a thin layer of syntax on top of each manufacturer's chip), whereas Cuda is a library that actually shields the programmer from the complexity of programming a GPU. I think AMD is working on an open-source system somewhat equivalent to Cuda, which might be nice to see develop. But it seems like the OpenCl consortium is the kind of organization that could never care about the individual developer or meet their needs. I'd love it if someone could prove me wrong.
        • oneshot908 2197 days ago
          Almost as dead as its original creator, Steve Jobs, who only summoned it into existence because Jensen Huang leaked a deal between AAPL and NVDA ahead of him.

          OpenCL could have been killer on mobile, and it could have delivered low-power machine learning and computation there all the way back in 2012, but both AAPL and GOOG went out of their way to cripple its use despite many of the mobile GPUs having hardware support for it. We all lost there IMO.

        • droidist2 2196 days ago
          It seems like people are placing their hopes into HIP more than OpenCL now when it comes to using chips from AMD and others.
      • nl 2196 days ago
        This is complete nonsense, and since you’ve said it twice in this thread I’ll respond.

        The reason NVidia is used in deep learning is that the frameworks give the best performance on it.

        The reason they give good performance on it is that CUDA and cuDNN (which is even more important, but which I don't see mentioned) work really well and give better performance than anything else.

        There are no MBAs, just lots of grad students trying anything to get better speed.

    • sofaofthedamned 2197 days ago
      NVidia has used similar dodgy tactics in the past, including blocking their GPUs from being used in a VM via IOMMU. For my use case (where I had Windows in a VM on a Linux host) this was a massive PITA.
  • deepnotderp 2197 days ago
    Do people think that nobody at nVidia has ever heard of specialized deep learning processors?

    1. Volta GPUs already have little matmul cores, basically a bunch of little TPUs.

    2. The graphics dedicated silicon is an extremely tiny portion of the die, a trivial component (source: Bill Dally, nVidia chief scientist).

    3. Memory access power and performance is the bottleneck (even in the TPU paper), and will only continue to get worse.

    • oneshot908 2197 days ago
      Never overestimate the intelligence of the decision makers at big bureaucratic tech companies. Also, it is not in the best interest of any of them to be reliant on NVDA or any other single vendor for any critical workload whatsoever. Doubly not so for NVDA's mostly closed source and haphazardly optimized libraries.

      All that said, Bill Dally rocks, and NVDA is a hardened target. But the DL frameworks have enormous performance holes once one stops running Resnet-152 and other popular benchmark graphs, in the same way that 3DMark performance is not necessarily representative of actual gaming performance unless NVDA took it upon themselves to make it so.

      And since DL is such a dynamic field (just like game engines), I expect this situation to persist for a very, very long time.

      • Dibes 2197 days ago
        > Never overestimate the intelligence of the decision makers at big bureaucratic tech companies.

        See Google and anything chat related after 2010

    • mtgx 2197 days ago
      And Intel knew that mobile chips would one day become very popular, too - two decades ago. Much good that knowledge did the company.

      It's not about them being ignorant of it. It's about them making decisions in spite of that knowledge - decisions that Make Sense™ for the advancement and increased profitability of the incumbent cash cow, but that often contradict, or have a negative impact on, investment in the new tech.

      Here's one main reason why Nvidia will not go "full TPU" with its chips - it wants "scalability". That means it wants an architecture that can be "flexible and serve different markets".

      The companies that specialize in AI chips will likely beat them in performance because they only care about winning one market at a time (and the AI market is a pretty big one).

      Intel's AI strategy is even more of a mess, because it has no clue what it can use to beat Nvidia, so its investments and developer ecosystems are all over the place.

      • pm90 2197 days ago
        Great point. Intel tried to break into the mobile and graphics market for sooo long with no success.

        I do hope the AI-specific TPUs that come out in the future will follow the ARM model instead of being siloed into proprietary architectures. Fucking hate the vendor lock-in of NVIDIA with CUDA.

    • ergothus 2197 days ago
      There is the question of incentives though. The non-GPU companies want perfectly aligned performance that they can both build for and designate in advance. The GPU companies want to make said companies buy and rebuy as much as they can.

      As the areas of expertise and manufacturing grow closer, the advantages of paying someone to do it for you decrease.

      I know much too little to have an opinion on who is likely correct, but I can understand the two sides each having positions that don't assume the other side is an idiot.

    • zantana 2197 days ago
      Is there an easy way that Nvidia can cripple their graphics-targeted cards so they can't be used for GPGPU?

      I'm thinking back to strategies like the 486SX https://en.wikipedia.org/wiki/Intel_80486SX

      • oneshot908 2197 days ago
        Yes, they've been trying to do that in various forms since the 2009 Fermi GPU GTX 480, which crippled FP64. It never works. But it does create technical debt to work around their nonsense.

        So unless they cripple CUDA altogether, there will always be efficient workarounds (arguably DirectX or OpenGL programmable shaders in the very worst-case scenario). They even gave up on doing so for the GTX Titan Black and then resumed with Maxwell. Currently, I would not be surprised if the lack of a true consumer Volta GPU is their only play at crippling consumer Volta: making it effectively nonexistent, or $3000 for the Titan V.

        What they could do across the board is ham-fistedly disable the deep learning frameworks on GeForce. That would probably stop 90% of amateur-hour data science on GeForce. But the remaining 10% would just recompile them without the cripple code, in violation of some sort of scary EULA clause against doing so and requiring such cripple code in all HPC/AI applications. I would love to see them try this - they'll pry my FP32 MADs (the core operation of AI/ML as well as of vertex and pixel shaders) from my cold dead consumer GPU desktop.

        I don't think they'll do that though. They know the low-end is the entry point to their ecosystem. They just want to force people to graduate into the high-end after hooking them. Not that you have to: multiplication and addition want to be free.
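
        For the uninitiated, a fully connected layer really does boil down to FP32 multiply-adds. A toy numpy sketch with illustrative sizes:

          import numpy as np

          rng = np.random.default_rng(0)
          x = rng.standard_normal(256).astype(np.float32)         # input activations
          W = rng.standard_normal((256, 128)).astype(np.float32)  # weights
          y = np.zeros(128, dtype=np.float32)

          for j in range(128):
              acc = np.float32(0.0)
              for i in range(256):
                  acc += x[i] * W[i, j]   # one FP32 multiply-add
              y[j] = acc

          assert np.allclose(y, x @ W, atol=1e-3)   # same thing, vectorized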

  • alienreborn 2197 days ago
    Non paywall link: https://outline.com/FucjTm
  • etaioinshrdlu 2197 days ago
    It would be interesting to try to emulate a many-core CPU as a GPU program and then run an OS on it.

    This sounds like a dumb idea, and it probably is. But consider a few things:

    * NVIDIA GPUs have exceptional memory bandwidth, and memory can be a slow resource on CPU based systems (perhaps limited by latency more than bandwidth)

    * The clock speed isn't that slow; it's in the GHz. Still, one's clocks per emulated instruction may not be great.

    * You can still do pipelining, maybe enough to get the clocks-per-instruction down.

    * Branch prediction can be done with ample resources. RNN-based predictors are a shoo-in.

    * communication between "cores" should be fast

    * a many-core emulated CPU might not do too bad for some workloads.

    * It would have good SIMD support.

    Food for thought.

    • Symmetry 2197 days ago
      Generally speaking emulating special purpose hardware in software slows things down a lot so I don't think that relying on a software branch predictor is going to result in performance anywhere close to what you'd see in, say, an ARM A53. And since you have to trade off clock cycles used in your branch predictor with clock cycles in your main thread I think it would be a net loss. Remember that even though NVidia calls each execution port a "Core" it can only execute one instruction across all of them at a time. The advantage over regular SIMD is that each shader processor tracks its own PC and only executes the broadcast instruction if it's appropriate - allowing diverging control flows across functions in ways that normal SIMD+mask would have a very hard time with except in the lowest level of a compute kernel.

      That also means that you can really only emulate as many cores as the NVidia card has streaming multiprocessors, not as many as it has shader processors or "cores".

      Also, it's true that GPUs have huge memory bandwidth, but they achieve that by trading off against memory latency. You can actually think of GPUs as throughput-optimized compute devices and CPUs as latency-optimized compute devices and not be very misled.

      So I expect the single-threaded performance of an NVidia general-purpose computer to be very low in cases where the memory and branch patterns aren't obvious enough to be predictable to the compiler. Not unusably slow, but something like the original Raspberry Pi.

      Each emulated core would certainly have very good SIMD support but at the same time pretending that they're just SIMD would sacrifice the extra flexibility that NVidia's SIMT model gives you.
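
      A toy numpy sketch of the plain SIMD+mask model being contrasted with SIMT here: every lane steps through both sides of a branch, and a mask picks which results survive (illustrative only):

        import numpy as np

        x = np.array([1.0, -2.0, 3.0, -4.0])
        mask = x > 0                       # lanes that take the "then" side

        then_result = np.sqrt(np.abs(x))   # executed for every lane regardless
        else_result = x * x                # also executed for every lane
        y = np.where(mask, then_result, else_result)
        print(y)   # lanes 0 and 2 got sqrt(|x|); lanes 1 and 3 got x*x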

      • joe_the_user 2197 days ago
        Remember that even though NVidia calls each execution port a "Core" it can only execute one instruction across all of them at a time.

        There are clever ways around this limitation; see the links in my post in this thread.

        https://news.ycombinator.com/item?id=16892107

        • Symmetry 2197 days ago
          Those are some really clever ways to make sure that all the threads in your program are executing the same instruction, but it doesn't get around the problem. Thanks for linking that video, though.
          • joe_the_user 2197 days ago
            The key of the Dietz system (MOG) is that the native code that the GPU runs is a bytecode interpreter. The bytecode "instruction pointer", together with other data, is just data in registers and memory that's interpreted by the native-code interpreter. So for each thread, the instruction pointer can point at a different command - the interpreter runs the same instructions, but the results are different. So effectively you are simulating a general-purpose CPU running a different instruction on each thread. There are further tricks required to make this efficient, of course. But you are effectively running a different general-purpose instruction per thread (it actually runs MIPS assembly, as I recall).
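
            A toy numpy sketch of that per-lane-PC trick, with a made-up three-opcode ISA (illustrative of the interpreter idea only, not of MOG's actual implementation):

              import numpy as np

              # 0 = ADD1, 1 = DOUBLE, 2 = HALT; each lane has its own bytecode program.
              programs = np.array([[0, 0, 2, 2],    # lane 0: add1, add1, halt
                                   [1, 2, 2, 2],    # lane 1: double, halt
                                   [0, 1, 0, 2]])   # lane 2: add1, double, add1, halt
              pc   = np.zeros(3, dtype=int)         # per-lane program counters (just data)
              acc  = np.array([10.0, 10.0, 10.0])   # per-lane accumulators
              live = np.ones(3, dtype=bool)

              while live.any():
                  op = programs[np.arange(3), pc]   # each lane fetches its own opcode
                  # one interpreter step runs on all lanes; masks select its effect
                  acc  = np.where(live & (op == 0), acc + 1, acc)
                  acc  = np.where(live & (op == 1), acc * 2, acc)
                  live = live & (op != 2)
                  pc   = np.where(live, pc + 1, pc)

              print(acc)   # [12. 20. 23.]
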
            • etaioinshrdlu 2196 days ago
              This is more or less what I'm talking about. I wonder what possibilities lie in applying the huge numerical compute available on a GPU to the predictive parts of a CPU, such as memory-prefetch prediction, branch prediction, etc.

              Not totally dissimilar to the thinking behind NetBurst, which seemed to be all about having a deep pipeline and keeping it fed with quality predictions.

              • joe_the_user 2196 days ago
                I'm not sure if your idea in particular is possible, but who knows. There may be fundamental limits to speeding up computation based on speculative look-ahead, no matter how many parallel tracks you have, and it may run into memory throughput issues.

                But take a look at the MOG code and see what you can do.

                Check out H. Dietz' stuff. Links above.

    • joe_the_user 2197 days ago
      Support for specialized CPU functions won't happen and doesn't make sense.

      However, it is quite feasible to emulate, on a GPU, a networked group of general-purpose CPUs (i.e., run MIMD[1] programs on a SIMD[2] architecture). This, MOG[3], has been a project of Henry G. Dietz of the University of Kentucky. Unfortunately, the project seems to have stalled at a "rough" level. He claims that he can run MIMD programs at 1/4 efficiency while also running SIMD programs at near full efficiency. His video is instructive [4].

      Edit: Note that this isn't intended for deep learning applications as such, but rather for traditional supercomputing applications (weather prediction, other physics simulations, etc.).

      [1] https://en.wikipedia.org/wiki/MIMD [2] https://en.wikipedia.org/wiki/SIMD [3] http://aggregate.org/MOG/ [4] https://www.youtube.com/watch?v=FZ6efZFlzRQ

    • ianai 2197 days ago
      How would that work when the individual cores have to be running the same instruction at a time? Where does the ability to emulate a CPU come from?
  • BooneJS 2197 days ago
    Pretty soft article. General purpose processors no longer have the performance or energy efficiency that’s possible at scale. Further, if you have a choice to control your own destiny, why wouldn’t you choose to?
    • jacksmith21006 2195 days ago
      Great post. It is like mining going to ASICs. We have hit limits and you now have to do your own silicon.

      A perfect example is Google's new speech synthesis. Doing 16k samples a second through a NN is not going to be possible without your own silicon.

      https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

      Listen to the samples. Then think about the joules required to do it this way versus the old way, and about trying to create a price-competitive product with the improved results.
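
      Back-of-the-envelope on why 16k samples/s through a NN is so expensive, using an assumed per-sample cost (illustrative only, not Google's real numbers):

        sample_rate = 16_000          # audio samples generated per second
        macs_per_sample = 100e6       # multiply-adds per sample, assumed for illustration

        per_stream = sample_rate * macs_per_sample
        print(f"One real-time stream: {per_stream / 1e12:.1f} TMAC/s")   # ~1.6 TMAC/s

        seconds_of_audio_per_day = 1e6   # serving load, assumed
        sustained = per_stream * seconds_of_audio_per_day / 86_400
        print(f"At {seconds_of_audio_per_day:.0e} s of audio/day: {sustained / 1e12:.0f} TMAC/s sustained")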

  • bogomipz 2197 days ago
    The article states:

    >"LeCun and other scholars of machine learning know that if you were starting with a blank sheet of paper, an Nvidia GPU would not be the ideal chip to build. Because of the way machine-learning algorithms work, they are bumping up against limitations in the way a GPU is designed. GPUs can actually degrade the machine learning’s neural network, LeCun observed.

    “The solution is a different architecture, one more specialized for neural networks,” said LeCun."

    Could someone explain to me what exactly are the limitations of current GPGPUs such as those sold by Nvidia when used in machine learning/AI contexts? Are these limitations only experienced at scale? If someone has resources or links they could share regarding these limitations and better designs, I would greatly appreciate it.

    • maffydub 2197 days ago
      I went to a talk from the CTO of Graphcore (https://www.graphcore.ai/) on Monday. They are designing chips targeted at machine learning. As I understood it, their architecture comprises:

      - lots of "tiles" - small processing cores with collocated memory (essentially DSPs)

      - a very high bandwidth (90TB/s!) switching fabric to move data between tiles

      - "Bulk Synchronous Parallel" operation, meaning that the tiles do their work and then the switching fabric moves the data, and then we repeat
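
      A toy simulation of that BSP pattern - each "tile" computes on its local memory, then the fabric exchanges data, then the next superstep begins (sizes and the compute/exchange steps are made up):

        import numpy as np

        n_tiles, local_size = 4, 8
        local_mem = np.random.default_rng(0).standard_normal((n_tiles, local_size))

        for superstep in range(3):
            # compute phase: each tile works only on its own collocated memory
            local_mem = np.tanh(local_mem)
            # exchange phase: the fabric moves data between tiles
            # (a ring shift stands in for an arbitrary exchange pattern)
            local_mem = np.roll(local_mem, shift=1, axis=0)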

      The key challenge he pointed to was power - both in terms of getting energy in (modern CPUs/GPUs take similar current to your car starter motor!) and also getting the heat out. Logic gates take a lot more power than RAM, so he argued that collocating small chunks of RAM right next to your processing core was much better from a power perspective (meaning you could then pack yet more into your chip) as well as obviously being better from a performance perspective.

      https://www.youtube.com/watch?v=Gh-Tff7DdzU isn't quite the presentation I saw, but it has quite a lot of overlap.

      Hope that helps!

      • bogomipz 2196 days ago
        Thanks for the detailed response and link! Cheers.
  • emcq 2197 days ago
    There is certainly a lot of hype around AI chips, but I'm very skeptical of the reward. There are several technical concerns I have with any "AI" chip that ultimately leave you with something more general purpose (and not really an "AI" chip, but good at low precision matmul):

    * For inference, how do you efficiently move your data to the chip? In general most of the time is spent in matmul, and there are lots of exciting DSPs, mobile GPUs, etc. that require a fair amount of jumping through hoops to get your data to the ML coprocessor. If you're doing anything low-latency, good luck, because you need tight control of the OS (or to bypass it entirely). Will this lead to a battle between chip makers? It seems more likely to be a battle between end-to-end platforms.

    * For training, do you have an efficient data flow with distributed compute? For the foreseeable future any large model (or small model with lots of data) needs to be distributed. The bottlenecks that come from this limit the improvements from your new specialized architecture without good distributed computing. Again, better chips don't really solve this; the solution comes from the platform. I've noticed many training loops have terrible GPU utilization, particularly with Tensorflow and V100s. Why does this happen? The GPU is so fast, but things like summary ops add to CPU time, limiting perf. Bad data pipelines not actually pipelining transformations (see the sketch below). Slow disks bottlenecking transfers. Not staging/pipelining transfers to the GPU. And then there is a bit of an open question of how best to pipeline transfers from the GPU. Is there a simulator feeding data? Then you have a whole new can of worms to train fast.

    * For your chip architecture, do you have the right abstractions to train the next architecture efficiently? Backprop trains some wonderful nets, but for the cost of a new chip ($50-100M) and the time it takes to build (18 months min), how confident are you that the chip will still be relevant to the needs of your teams? This generally points you towards something more general purpose, which may leave some efficiency on the table. Eventually you end up at a low precision matmul core, which is the same thing everyone is moving towards or already doing, whether you call yourself a GPU, DSP, or TPU (which is quite similar to DSPs).

    Coming at this as an HPC/graphics engineer turned deep learning engineer - I've worked with GPUs since 2006 and neural net chips since 2010 (before even AlexNet!!) - I'm a bit of an outlier here, having seen so many perspectives. From my point of view the computational fabric exists; we're just not using it well :)
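
    On the "bad data pipelines not actually pipelining" point, a minimal tf.data sketch of the fix - overlap CPU preprocessing and host-to-device transfer with the training step (the data and the preprocessing step here are placeholders):

      import numpy as np
      import tensorflow as tf

      images = np.random.rand(1_000, 64, 64, 3).astype(np.float32)   # stand-in dataset
      labels = np.random.randint(0, 10, size=1_000)

      dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
                 .shuffle(1_000)
                 .map(lambda x, y: (x * 2.0 - 1.0, y),   # stand-in preprocessing
                      num_parallel_calls=4)              # parallel CPU work
                 .batch(256)
                 .prefetch(2))                           # overlap input with the training step

      for batch_images, batch_labels in dataset.take(2):
          pass   # training step would run here while the next batch is prepared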

    • jacksmith21006 2196 days ago
      Here is a great paper that might help.

      https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

      But also, this is a great example of why these new "AI" chips matter. This would NOT be possible without the Google chips.

      https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

      I do wish Google shared more detail, specifically how they are doing 16k samples a second through a NN. That has a lot of applications beyond speech.

      Some things Google will not share, and this might be one of them, but we can hope.

    • justicezyx 2197 days ago
      Most top-tier tech companies have their own working solutions for these. It's a matter of turning them into products and moving the industry mindset.
      • jacksmith21006 2195 days ago
        There is nothing for the buyer to see. They are buying a service or capability, and what silicon that runs on is neither here nor there to them.

        A simple example is Google's new speech synthesis service. It is done using a NN on their TPUs, but nobody needs to know any of that.

        https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

        What the buyer knows is the cost and the quality of the service.

        Now, Google had to do their own silicon to offer this, as otherwise the cost would have been astronomical. The compute needed to do 16k samples a second with a NN is enormous.

        If I had not seen it myself, I would say what Google did was not possible.

        I just hope they share the details in a paper. If we can get 16k passes per second through a NN at a reasonable cost, that opens up a lot of interesting applications.

  • MR4D 2196 days ago
    Back in the day, there was a 386. And also a 387 coprocessor to handle the tougher math bits.

    Then came a 486 and it got integrated again.

    But during that time, the GPU split off. Companies like ATI and S3 began to dominate, and anyone wanting a computer with decent graphics had one of these chips in their computer.

    Fast forward several years, and Intel again would bring specialized circuitry back into their main chips, although this time for video.

    Now we are just seeing the same thing again, but this time it’s an offshoot of the GPU instead of the CPU. Seems like the early 1990’s again, but the acronyms are different.

    Should be fun to watch.

  • davidhakendel 2197 days ago
    Does anyone have a non-paywall but legitimate link to the story?
    • trimbo 2197 days ago
      Incognito -> search for headline -> click
      • bogomipz 2197 days ago
        Thank you for this tip. Out of curiosity why does this trick work?
        • _delirium 2197 days ago
          Websites like this want traffic from Google. To get indexed by Googlebot they have to show the bot the article text, and Google's anti-blackhat-SEO rules mean that you have to show a human clicking through from Google the same text that you show Googlebot. So they have to show people visiting through that route the article text too.
  • Barjak 2197 days ago
    If I were better credentialed, I would definitely be looking to get into semiconductors right now. It's an exciting time in terms of manufacturing processes, and I think some of the most interesting and meaningful optimization problems ever formulated come from semiconductor design and manufacturing, not to mention the growing popularity of specialized hardware.

    I would tell a younger version of myself to focus their education on some aspect of the semiconductor industry.

  • jacksmith21006 2197 days ago
    The new Google speech solution is the perfect example of why Google had to do their own silicon.

    Doing speech at 16k samples a second through a NN and keeping it at a reasonable cost is really, really difficult.

    The old way was far more power efficient, so if you are going to use this new technique, which gets you a far better result, and do it at a reasonable cost, you have to go all the way down into the silicon.

    Here listen to the results.

    https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

    Now I am curious about the cost difference Google was able to achieve. It is still going to be more than the old way, but how close did Google come?

    But my favorite new thing with these chips is the Jeff Dean paper.

    https://www.arxiv-vanity.com/papers/1712.01208v1/

    Can't wait to see the cost difference using Google TPUs and this technique versus traditional approaches.

    Plus, this approach supports multi-core inherently. How would you ever do a tree search with multiple cores?

    Ultimately to get the new applications we need Google and others doing the silicon. We are getting to extremes where the entire stack has to be tuned together.

    I think Google vision for Lens is going to be a similar situation.

    • taeric 2197 days ago
      This somewhat blows my mind. Yes, it is impressive. However, the work that Nuance and similar companies used to do is still competitive, just not getting near the money and exposure.

      I remember over a decade ago, they even had mood analysis they could apply to listening to people. Far from new. Is it truly more effective or efficient nowadays? Or just getting marketed by companies you've heard of?

      • jacksmith21006 2196 days ago
        "Nuance and similar companies used to do are still competitive"

        Surprised. Curious if you can compare the inference per joules of Google 1st gen TPUs compared? Google shared a paper and the numbers are pretty impressive and was not aware of anyone else close to the gen 1 TPUs?

        Here is the paper that you can use for the TPU side. Love so see someone else in the ball park? We really do not want just one company but competition.

        https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

        • taeric 2196 days ago
          This seems to be comparing them on their own terms. I'm more curious about features. Dragon NaturallySpeaking and some other products have been really impressive for years now. Far beyond what my phone is capable of.

          Not to say that the likes of the Echo and others aren't impressive. Just that the speech recognition is the least of those products. Fully transcribed voicemail was available for years with Google Voice (even before it was Google Voice), yet that seems to happen less now than it did when I first got the product.

          So what changed? And why?

          • jacksmith21006 2195 days ago
            What? You are comparing the processing of a NN. How is that comparing on their "own terms"?
            • taeric 2195 days ago
              Did the old methods use neural networks? I wouldn't be surprised if they did, but I would be surprised if they were as deep of networks as what people use today.

              That is, I am interested in comparing them on speed of transcription, speech synthesis, error rates, etc. Not on speed of network execution.

              • jacksmith21006 2194 days ago
                No, the old method did NOT use a NN. I hope Google writes a paper and shares more details.

                It is hard to believe they are able to do 16k samples a second through a NN, even with the TPUs.

                So I'd be curious to see whether they reduced it, and by how much.

                If they really do have the ability to do 16k a second at scale, that opens the door for all kinds of other applications.

      • sanxiyn 2196 days ago
        It is truly better. Objective metrics (such as word error rate) don't lie. You can argue whether it makes sense to use, say, 100x compute to get 2x less error, but that's a different argument; I don't think anyone is really disputing improved quality.
        • taeric 2196 days ago
          Do you have a good comparison point? And not, hopefully, comparing to what they could do a decade ago. I'm assuming they didn't sit still. Did they?

          I question whether it is just 100x compute. It feels like more, since NaturallySpeaking and friends didn't hog the machine. Again, over a full decade ago.

          More, the resources that Google has to throw at training are ridiculous. Well over 100x what was used to build the old models.

          None of this is to say we should pack up and go back to a decade ago. I just worry that we do the opposite; where we ignore progress that was made a decade ago in favor of the new tricks alone.

          • jacksmith21006 2196 days ago
            The thing is, it is not simply the training; the inference aspect would have required an incredible amount of compute compared to the old way of doing it.

            I hope Google will do a paper like they did with the gen 1 TPUs. I would love to see the difference in terms of joules per word spoken.

        • jacksmith21006 2195 days ago
          Speech synthesis, not recognition.
          • taeric 2195 days ago
            Yeah, I noticed we were skirting between those topics. I think mostly the points still stand. On both sides. :)
  • jacksmith21006 2196 days ago
    The dynamics of the chip industry have completely changed. It used to be that a chip company like Intel sold its chips to a company like Dell, which then sold the server with the chip to a business, which ran the chip and paid the electric bill.

    So the company that made the chip had no skin in the game with running the chip or the cost of the electricity to run it.

    Today we have massive clouds at Google and Amazon, and lowering the cost of running their operations goes a long way, unlike in the days of the past.

    This is why we will see more and more companies like Google create their own silicon, which has already started and is well on its way.

    Not only the TPUs: Google has also created its own network processors, having quietly hired away the Lanai team years ago.

    https://www.informationweek.com/data-centers/google-runs-cus...?

    Also this article helps explain why Google built the TPUs.

    https://www.wired.com/2017/04/building-ai-chip-saved-google-...

  • willvarfar 2197 days ago
    I just seem to bump into a paywall.

    The premise from the title seems plausible, although NVIDIA seems to be catching up again fast.

    • madengr 2197 days ago
      I was impressed enough with their CES demo to buy some stock. Isn’t the Volta at 15E9 transistors? It’s at the point only the big boys can play in that field due to fab costs, unless it’s disrupted due to some totally new architecture.

      First time on HN I can read a paywalled article, as I have a Barron’s print subscription.

      • twtw 2197 days ago
        21e9 transistors.
    • rosege 2197 days ago
      use the web button at the top - worked for me
      • randcraw 2197 days ago
        Doesn't work on an iPad. Barron's fades out the text, trivializing itself to non-subscribers.
  • mtgx 2197 days ago
    Alphabet has already made its AI chip... it's on its second generation already.
    • jacksmith21006 2197 days ago
      Plus Google has the data and the upper layers of the AI stack to keep well ahead.