Intel Distribution for Python


139 points | by EntICOnc 13 days ago


  • Rd6n6 12 days ago
    > the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string. If the vendor string says "GenuineIntel" then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version.[1]

    I’ve been a little shy about using Intel software since reading about this years ago.


    • danieldk 12 days ago
      This has gotten a bit better. Last time I checked, MKL now uses Zen-specific kernels for sgemm/dgemm. Unfortunately, these kernels are slower than the AVX2 kernels. But at least, it does not use the pre-modern SIMD kernels for AMD Zen anymore.

      Edit, comparison:

          $ perf record target/release/gemm-benchmark  -d 1024
          Threads: 1
          Iterations per thread: 1000
          Matrix shape: 1024 x 1024
          GFLOPS/s: 96.36
          $ perf report --stdio -q | head -n3
              97.18%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_kernel_0_zen
               1.94%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_down16_bdz
               0.78%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_right4_bdz
      After disabling Intel CPU detection:

          $ perf record target/release/gemm-benchmark  -d 1024
          Threads: 1
          Iterations per thread: 1000
          Matrix shape: 1024 x 1024
          GFLOPS/s: 129.12
          $ perf report --stdio -q | head -n3
              97.02%  gemm-benchmark        [.] mkl_blas_avx2_sgemm_kernel_0
               1.77%  gemm-benchmark        [.] mkl_blas_avx2_sgemm_scopy_down24_ea
               1.02%  gemm-benchmark        [.] mkl_blas_avx2_sgemm_scopy_right4_ea
      Benchmarked using gemm-benchmark and oneMKL 2021.3.0.
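
To put the two runs above in proportion, here is a small arithmetic sketch. It assumes the benchmark counts the conventional 2*n^3 floating-point operations per n x n matrix multiply (the thread does not say how gemm-benchmark counts them), and the function names are made up for illustration.

```python
def gemm_flops(n: int) -> int:
    """Conventional operation count for one n x n by n x n matrix multiply."""
    return 2 * n ** 3


def seconds_for_run(n: int, iterations: int, gflops: float) -> float:
    """Wall time implied by a reported GFLOPS figure (assumes 2*n^3 per multiply)."""
    return gemm_flops(n) * iterations / (gflops * 1e9)


zen_kernel = 96.36    # GFLOPS reported with the default (Zen) kernel
avx2_kernel = 129.12  # GFLOPS reported with Intel CPU detection disabled

speedup = avx2_kernel / zen_kernel
print(f"AVX2 kernel speedup: {speedup:.2f}x")
print(f"Zen kernel run:  {seconds_for_run(1024, 1000, zen_kernel):.1f} s")
print(f"AVX2 kernel run: {seconds_for_run(1024, 1000, avx2_kernel):.1f} s")
```

Under that assumption the AVX2 path comes out roughly a third faster than the Zen-specific kernel on this workload.
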
      • Royi 10 days ago
        How could one do your trick on Windows?
      • SavantIdiot 12 days ago
        That's just plain sinister.

        I'm really surprised popular numerical computing Python packages don't already have optimized hardware back-ends for things like NumPy... similar to ORC (OIL), which has been around for quite some time.

        But I don't know that much about Python under the hood, and I'm willing to bet that, since so many academics work on this, there are already optimized FFIs. I've used TensorFlow and it can offload tensor math to GPUs, but only NVIDIA's AFAIK.

        • danieldk 12 days ago
          > I'm really surprised popular numerical computing Python packages don't already have optimized hardware back-ends for things like NumPy

          I think it is hard to beat modern BLAS implementations for common operations. E.g. Apple Accelerate (which also implements the BLAS/LAPACK APIs) uses undocumented AMX instructions for large speedups compared to an ARM NEON implementation.

      • mushufasa 12 days ago
        There is a longstanding issue around MKL and OpenBLAS optimization flags making Intel systems artificially faster than AMD ones for NumPy computations.

        If there are true optimizations to be had, wonderful. But those should be added to the core binaries on PyPI / conda. I am worried that Intel here may be trying to again artificially segment their optimization work on their math libraries for business rather than technical reasons.

        • vitorsr 12 days ago
          • gnufx 12 days ago
            At least single-threaded "large" OpenBLAS GEMM has always been similar to MKL once it has the micro-architecture covered. If there's some problem with the threaded version (which one?), has it been reported like it would be for use in Julia? Anyway, on AMD, why wouldn't you use AMD's BLAS (just a version of BLIS)? That tends to do well multi-threaded, though I'm normally only interested in single-threaded performance. I don't understand why people are so obsessed with MKL, especially when they don't measure and understand the measurements.
            • thunkshift1 12 days ago
              What do you mean by ‘artificially faster’?
              • jchw 12 days ago
                Intel libraries whitelist their own CPUs for using certain extension instruction sets, instead of using the relevant CPU ID feature flag for that feature as their own documentation tells you to.
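
The difference between the two dispatch policies discussed here can be sketched in a few lines. This is an illustration only, not MKL's actual dispatcher: the kernel names, the `CpuInfo` type, and both functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CpuInfo:
    vendor: str                      # e.g. "GenuineIntel" or "AuthenticAMD"
    features: frozenset = field(default_factory=frozenset)


# Hypothetical kernels, ordered from fastest to slowest, with the
# instruction-set feature each one requires (None = no requirement).
KERNELS = [("avx2_kernel", "avx2"), ("sse2_kernel", "sse2"), ("generic_kernel", None)]


def dispatch_by_feature(cpu: CpuInfo) -> str:
    """Pick the fastest kernel whose required feature the CPU reports."""
    for name, required in KERNELS:
        if required is None or required in cpu.features:
            return name
    return "generic_kernel"


def dispatch_by_vendor(cpu: CpuInfo) -> str:
    """Vendor-gated variant: non-Intel CPUs always get the slowest path."""
    if cpu.vendor != "GenuineIntel":
        return "generic_kernel"
    return dispatch_by_feature(cpu)


amd = CpuInfo("AuthenticAMD", frozenset({"sse2", "avx2"}))
print(dispatch_by_feature(amd))  # avx2_kernel
print(dispatch_by_vendor(amd))   # generic_kernel
```

The same AVX2-capable CPU gets a different code path depending only on which policy is used, which is the behavior being objected to.
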
                • jeffbee 12 days ago
                  CPUID is insufficient. CPUID can tell you that a CPU has a working PDEP/PEXT, but it can't tell you that a CPU's PDEP sucks like the one on all AMD processors prior to Zen3.
                  • jchw 12 days ago
                    This argument crops up every time but it’s irrelevant; MKL does and always has worked absolutely fine on AMD processors with the checks disabled, and no, reproducibility is not a feature of MKL that is enabled by default and it never was. Intel even had to add a disclaimer that MKL doesn’t work properly on non-Intel processors after legal threats, and they still ran with that for literally years despite knowing it could just be fixed.

                    When this first cropped up, I was using Digg.

                    Edit: removed note that they fixed the cripple AMD function; they didn’t, they actually just removed the workaround that made it easier to disable the checks; I was misinformed. Apparently now some software does runtime patching to fix it, including Matlab...

                    • gnufx 12 days ago
                      Recent MKL will generate reasonable code for Zen if you set a magic environment variable, but it was very limited (possibly only sgemm and/or dgemm when I looked). Once you've generated AVX2 with a reasonable block size, you're most of the way there. But why not just use a free BLAS which has been specifically tuned for your AMD CPU (and probably your Intel one)?
                      • user5994461 12 days ago
                        Nope, they removed support for the magic environment variable in the latest MKL release.
                        • gnufx 11 days ago
                          I stand corrected. No loss, anyway. (I probably saw it in oneapi from sometime last year, and was surprised.)
                      • jeffbee 12 days ago
                        Yeah I don't think all the hacks are out, yet. But my point is only that the availability of some feature is not the only input to the decision to use that feature at runtime. Some of these conditions may look suspiciously like shorthand for IsGenuineIntel(), even if they are legit, like blacklisting BMI2 on AMD, because BMI2 on AMD was useless over most of its history.
                      • sitkack 12 days ago
                        The real answer is to do feature probing and benchmarking the underlying implementation. In the cloud you never really know the hardware backing your instance.
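
A minimal sketch of the "benchmark the underlying implementation" idea: time each candidate on representative input at startup and keep the fastest. The two dot-product candidates and the `pick_fastest` helper are hypothetical stand-ins; a real library would time its actual kernels and would probably cache the result.

```python
import timeit
from typing import Callable, Dict, Sequence


def dot_loop(a: Sequence[float], b: Sequence[float]) -> float:
    """Candidate 1: explicit loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_sum(a: Sequence[float], b: Sequence[float]) -> float:
    """Candidate 2: generator expression fed to sum()."""
    return sum(x * y for x, y in zip(a, b))


def pick_fastest(candidates: Dict[str, Callable], *args, repeat: int = 5, number: int = 20) -> str:
    """Time every candidate on the given arguments and return the fastest one's name."""
    timings = {
        name: min(timeit.repeat(lambda fn=fn: fn(*args), repeat=repeat, number=number))
        for name, fn in candidates.items()
    }
    return min(timings, key=timings.get)


a = [float(i) for i in range(1000)]
b = [float(i) for i in range(1000)]
candidates = {"loop": dot_loop, "sum": dot_sum}
print("selected:", pick_fastest(candidates, a, b))
```

Which candidate wins depends on the machine, which is the point: the decision is driven by measurement rather than by a vendor string.
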
                  • pletnes 12 days ago
                    From a practical perspective you have to use some BLAS library. If there is a working alternative from AMD, it would be great if you share it. They did have one in the past although I don’t recall its name.
                    • dsign 12 days ago
                      Thanks for bringing out that link, I had had that nagging question about how specific Intel performance libraries were to Intel hardware. At least in this case, it seems not much.
                      • jxy 12 days ago
                        That SO performance benchmark would be so much more useful if the OP had also run OpenBlas on the xeon.
                        • mistrial9 12 days ago
                          what, no Debian/Ubuntu ? sigh
                        • mhh__ 12 days ago
                          Do AMD even have optimized packages available? Don't get me wrong, I'm not a huge fan of what Intel get up to but AMD's profiling software is dreadful so I'm not exactly surprised that Intel don't even entertain the option.
                        • bananaquant 12 days ago
                          Quite unsurprisingly, this distribution has no support for ARM.

                          I once was excited about Intel releasing their own Linux distro (Clear Linux), but it has the same problem. It looks like Intel is trying to make custom optimized versions of popular open-source projects just to get people to use their CPUs, as they lose their leadership in hardware.

                          • mumblemumble 12 days ago
                            I'm not sure I see why you would expect anything different? The entire point of this framework is to provide a bunch of tools for squeezing the most you can out of SSE, which is specific to x86.

                            I don't know if there's an ARM-specific equivalent, but, if you want to use TensorFlow or PyTorch or whatever on ARM, they'll work quite happily with the Free Software implementations of BLAS & friends. If you code at an appropriately high level, the nice thing about these libraries is that you get to have vendor-specific optimizations without having to code against vendor-specific APIs. Which is great. I sincerely wish I had that for the vector-optimized code I was writing 20 years ago. In any case, if ARM Holdings or a licensee wants to code up their own optimized libraries that speak the same standard APIs (and assuming they haven't already), that would be awesome, too. The more the merrier. How about we all get in on the vendor-optimized libraries for standard APIs bandwagon. Who doesn't want all the vendor-specific optimizations without all the vendor lock-in?

                            Alternatively, if you would rather get really good and locked in to a specific vendor, you could opt instead to spam the CUDA button. That's a popular (and, as far as I'm concerned, valid, if not necessarily suited to my personal taste) option, too.

                          • smoldesu 12 days ago
                            "Their" CPUs meaning x86 platforms, in this case.

                            Plus, who's surprised? This is how Intel makes money. The consumer segment is a plaything for them, the real high-rollers are in the server segment, where they butter them up with fancy technology and the finest digital linens. Is it dumb? A little, but it's hardly a "problem" unless you intended to ship this software on first-party hardware which, hint-hint, the license forbids in the first place.

                            At the end of the day, this doesn't really irk me. I can buy a compatible processor for less than $50, that's accessible enough.

                            • stonemetal12 12 days ago
                              No, "their" CPUs as in the ones from Intel. Intel has long done a thing in their compilers where they detect the CPU model, and run less optimized code if it isn't Intel. They claim it is because they can't be sure "Other" processors have correctly implemented SSE and other extensions. So Intel Linux is going to run faster on an Intel CPU because it was compiled with ICC.
                            • klelatti 12 days ago
                              Link says Core Gen 10 or Xeon so you may be out of luck on AMD or at less than $50.

                              I think this is more likely aimed at AMD than Arm - don't think Arm is yet a threat in this space - and whilst they're entitled to do what they want it does make me less enthused about Intel and frankly more likely to support their competitors.

                              • mumblemumble 12 days ago
                                AMD has their own equivalent.

                                I'm not sure it's a sin for hardware manufacturers to support their products? In the days of yore, we even expected it of them.

                                • klelatti 12 days ago
                                  Not a sin, but it's not really just about supporting (or optimising) their products; it's about doing so whilst trying to increase the lock-in beyond what is achieved on performance grounds alone.

                                  I may be wrong, but my experience is that AMD has been a bit better on this in the past, e.g. their OpenCL libraries supported both Intel and AMD whereas Intel's were Intel only.

                                  • mumblemumble 12 days ago
                                    I would assume that's not entirely a fair comparison, though. Intel's 3D acceleration hardware only ever appears in Intel-manufactured chipsets, which only ever contain Intel-manufactured CPUs.

                                    AMD, on the other hand, also supplies Radeon GPUs for use with Intel CPUs. For example, that's the setup in the computer on which I'm typing this.

                                    So I have a hard time seeing anything nefarious there. The one is obviously a business necessity, while the other would obviously be silly. Perhaps that changes with the new Xe GPUs?

                                    • klelatti 12 days ago
                                      Sorry, should have been clearer - Intel's CPU OpenCL drivers only supported Intel and not AMD, whereas AMD's CPU OpenCL drivers supported both - so GPUs are not relevant in this case.

                                      I can see how, if you've invested a lot in software, you'd like to get a competitive advantage over your nearest rival, so maybe that's a price we have to pay.

                                  • gnufx 12 days ago
                                    Yes. The difference is that may be "theirs", but I think it's all free software. At least the linear algebra stuff is. They supply changes for BLIS (which seem not to get included for ages). Their changes may well be relevant to Haswell, for instance. I don't remember what the difference in implementation was for Zen and Haswell, but they were roughly the same code at one time.
                                  • vel0city 12 days ago
                                    I wonder what features are missing from a Comet Lake generation Pentium; those can be had for ~$70 these days. Other than the box saying "Core" on it instead of "Pentium".

                                    EDIT: Ah, I found it, AVX2.

                                  • mistrial9 12 days ago
                                    the capital model for cost recovery and earnings is one thing, but in the modern times, the amount of money that flows through Intel Inc. is not the same thing. Intel played dirty for long years to crush competitors, not "make money" like they need it.. "Greed is good" - remember that ? so, no.. apologists count your quarterly dividends but you have no platform for social advocacy here IMO
                                  • gnufx 12 days ago
                                    Clear Linux looked unconvincing to me. When I looked at their write-up, the example of what they say they do with vectorization was FFTW. That depends on hand-coded machine-specific stuff for speed, and the example was actually for the testing harness, i.e. quite irrelevant. I did actually run the patching script for amusement.
                                    • mhh__ 12 days ago
                                      Alder Lake looks seriously impressive if the rumoured performance is even close to accurate, so I wouldn't count them out just yet - that being said, they will never get a run like they did over the last 10 years again.
                                    • vitorsr 12 days ago
                                      You can easily try it yourself [1]:

                                          conda create -n intel -c intel intel::intelpython3_core
                                      Or [2]:

                                          docker pull intelpython/intelpython3_core
                                      Note that it is quite bloated but includes many high-quality libraries.

                                      You can think of it as a recompilation in addition to a collection of patches to make use of their proprietary libraries.

                                      Other useful links to reduce the noise in this thread: [3], [4], [5], [6].

                                      • tkinom 12 days ago
                                        Any benchmarks comparison data?

                                        For example: benchmarks with this Python are XXX% higher than ... (standard Python, AMD, ARM)
                                        • mumblemumble 12 days ago
                                          I haven't done a comparison in a long time, and, even then, it wasn't very thorough, so take this with a grain of salt.

                                          But, 6 years ago, when I was in grad school, just swapping to the Intel build of numpy was an instant ~10x speedup in the machine learning pipeline I was working on at the time.

                                          No idea if that's typical or specific to what I was doing at the time. I don't use MKL anymore because ops doesn't want to deal with it and the standard packages are already plenty good enough for what I'm doing nowadays. If you forced me to guess, I guess I'd have to guess that my experience was atypical.

                                      • _joel 12 days ago
                                        Why are they making their own distro and not putting code back into mainline if it's useful? Do they have some particular IP that makes this impossible?
                                        • LeifCarrotson 12 days ago
                                          Here's the list of CPUs which incorporate the AVX2 instructions that enable some of these optimizations:


                                          You could write your distro to check whether or not you have these, using the flags from /proc/cpuinfo. Or you could check whether it's in the Intel half of the list or the AMD half of the list. Or you could write your own distro that only runs on the first half of the list.

                                          I get that Intel's contributions aren't purely altruistic. There are likely to be subtle tuning problems that require slight changes to optimize on different platforms, and they can't really be expected to do free work for AMD. But it looks to me like they're being unnecessarily anticompetitive.
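
The feature-flag check described above can be sketched as follows. This assumes x86 Linux, where /proc/cpuinfo has a `flags` line (other architectures use different keys); the function names are hypothetical. Note that the vendor is never consulted.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags listed on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_feature(feature: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """True if the running CPU advertises the feature; False if it cannot be determined."""
    path = Path(cpuinfo_path)
    if not path.exists():  # not Linux, or /proc not mounted
        return False
    return feature in parse_cpu_flags(path.read_text())


sample = "vendor_id\t: AuthenticAMD\nflags\t\t: fpu sse sse2 avx avx2 bmi2\n"
print("avx2" in parse_cpu_flags(sample))  # True
print(has_feature("avx2"))                # depends on the machine
```
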

                                          • falcor84 12 days ago
                                            >being unnecessarily anticompetitive

                                            Isn't setting up barriers to entry generally considered to be a part of healthy competition? I'd hazard to say that as long as a company is playing within the boundaries of what's allowed, there's nothing they could do that's anticompetitive; at the most, you could accuse them of being somewhat unsportsmanlike.

                                            • dec0dedab0de 12 days ago
                                              > Isn't setting up barriers to entry generally considered to be a part of healthy competition?

                                              No, it is not. This is better described as vendor lock-in, than a barrier to entry. But vendor lock-in is also against healthy competition.

                                              Healthy competition means that users choose your product because it suits their needs the best, not because they are somehow forced to choose your product.

                                              • DasIch 12 days ago
                                                Competition is desirable because it aligns with society’s goals of innovation and progress which also imply increased productivity and lower prices.

                                                Artificial barriers to entry are contrary to that and if they’re not illegal they should be.

                                                • falcor84 12 days ago
                                                  Where do you define this line of barriers becoming 'artificial'?
                                                  • LeifCarrotson 12 days ago
                                                    It's artificial when the vendor expends additional time, effort, or funds to construct a barrier, or chooses an equally-priced non-interoperable design that a rational, informed consumer with a choice would reject. If you're expending great effort to write custom DRM or to reinvent open industry standards that you could have installed cheaply, that's artificial.

                                                    I fully admit that there are natural barriers that occur at times. I don't think that you should be expected to reverse-engineer your competitor's products and bend over backwards to make them work better.

                                                    Here, for a concrete example, Intel had a clear choice to test whether a processor supported a feature by checking a feature flag - It's in the name, they're literally implemented for that exact purpose - or they could expend extra effort in building their own feature flag database by checking manufacturer and part number. They could have either expended extra effort to launch and distribute their own entire custom Python distribution, or submitted pull requests to the existing main distribution. For another example, Apple could have used industry-standard Phillips or Torx screws in their hardware: Manufacturers had lines to produce them, distributors had inventory of the fasteners, users had tools to turn them. Instead, they went to great expense to build their own incompatible tri-lobe screws, requiring probably millions of dollars in investment in custom tooling and production lines, all for the sake of creating an artificial barrier.

                                                    • Sanguinaire 12 days ago
                                                      We could start with something similar to the concept of Pareto optimality; Intel could have delivered their maximum performance without preventing optimizations from being applied equally on AMD hardware, but instead they choose to disadvantage AMD without providing anything extra on top of what they could do while remaining "neutral".
                                              • SkipperCat 12 days ago
                                                I think there is a pretty big base of people who do big data work using Numpy and Pandas (Fintech, etc). They want to squeeze every bit of computing power out of the specific Intel chipset, GPUs, etc and Intel's distro really helps them out.

                                                A 10% speed improvement on 1000's of jobs could in theory save you a nice chunk of time. This becomes very important in the financial market where you need batch jobs to be finished before markets open, or you just want to save 10% on your EC2 bill.

                                                • gnufx 12 days ago
                                                  10% is around the noise level for HPC, especially for throughput depending on scheduling. I rather doubt you couldn't do the same as free software.
                                                  • pjmlp 12 days ago
                                                    Yet plenty of HPC installations would rather use IBM's xl or NVidia's PGI compiler suites.

                                                    So they definitely don't agree you can do the same with free software.

                                                    • gnufx 11 days ago
                                                      They may or may not disagree on the basis of measured performance in different cases. (The US labs are investing heavily in free software LLVM.) Often not with ifort, at least. However, I was talking about the Intel stuff, and partly from experience with R versus the Microsoft version.

                                                      However, I recently ran the Polyhedron Fortran benchmarks with the compilers to hand (~2020 vintage). XL (on POWER) was the only one that gave a significantly better bottom line; obviously IBM know how to compile Fortran well by now (or by Fortran-H). As far as I remember, that was essentially due to the treatment (vectorization) of maths intrinsics, which probably aren't so dominant in typical HPC code. One bad case -- fmod inlining -- has since been fixed in GCC. Without GCC's unfortunate longstanding failure to vectorize sincos (or equivalent), gfortran should have beaten ifort significantly in the bottom line, and at least got close to XLF. PGI was distinctly worse than GCC, but may do better at OpenACC/GPU offload, for instance. XL may win on OpenMP, since some of the current standard was for Sierra's needs. I should find time for the NAS benchmarks.

                                                    • Sanguinaire 12 days ago
                                                      You are correct, nothing Intel provides in their Python distro cannot be obtained elsewhere - this is just a nice wrapper.
                                                • TOMDM 12 days ago
                                                  To me this just looks like Intel saw what Nvidia has accomplished with CUDA, locking in large portions of the scientific computing community with a hardware-specific API, and went "yeah me too thanks".

                                                  Thankfully, accelerated math libraries already exist for Python without the vendor lock-in.

                                                  • bostonsre 12 days ago
                                                    Intel has been releasing mkl/math kernel libraries for Java for a really long time. Hopefully core python devs can learn a few tricks and similar changes can make it upstream.
                                                  • rshm 12 days ago
                                                    Looks like recompilation. I am guessing gains are on NumPy and SciPy. For a Python-heavy code base, I doubt it can be more performant than PyPy.
                                                    • ciupicri 12 days ago
                                                      Python 3.7.4 when 3.10 is just around the corner.
                                                      • amelius 12 days ago
                                                        Maybe I'm missing something but it seems to me that this can only cause fragmentation in the Python space.

                                                        Why not use the original distributions?

                                                        • lbhdc 12 days ago
                                                          There are a number of alternate interpreters available. The selling point typically is that they are faster, and that seems to be the value proposition of Intel's.

                                                          One use might be improving throughput of a compute-bound system, like an ETL written in Python, with little effort. Ideally just downloading the new interpreter.

                                                          • amelius 12 days ago
                                                            Ok. If they offer Python without the GIL then I'm all ears :)
                                                            • gautamdivgi 12 days ago
                                                              I don't think Python is ever going to get rid of the GIL. I haven't looked, but there are two things that may speed things up quite a bit:
                                                              - Use native types
                                                              - Provide the ability to turn "off" the GIL if you know you will not be using multi-threading within a process.

                                                              I guess that is my naive wish list for a short term speed up :)

                                                              • dec0dedab0de 12 days ago
                                                                Jython doesn't have a GIL, but it doesn't support Python 3, and I've never used it.
                                                                • gautamdivgi 12 days ago
                                                                  Jython would also have issues with the many C libraries that Python code relies on today.
                                                              • shepardrtc 12 days ago
                                                                Numba might be what you're looking for.
                                                                • gautamdivgi 12 days ago
                                                                  That looks really interesting. I'll definitely be trying it out. Thanks!!!
                                                                • TOMDM 12 days ago
                                                                  A pythonic language that included something analogous to Golang's channels/goroutines would be my ideal.
                                                                  • borodi 12 days ago
                                                                    Julia does have channels similar to those of Go, although whether you want to call it pythonic or not is up to you.
                                                                    • TOMDM 12 days ago
                                                                      I've seen hype for Julia over and over, but this is the first piece of information that's made me genuinely interested.

                                                                      Thanks for the heads up!

                                                                      EDIT: Oh god it's 1 indexed

                                                                      • borodi 12 days ago
                                                                        While people discuss a lot about it, in the end 1 indexing doesn't really matter. I think it comes from fortran/matlab.
                                                                        • TOMDM 12 days ago
                                                                          I agree, it doesn't really matter, but I've been programming long enough that I can see it being that top step that's always half an inch too tall that I'm going to stub my toe on.
                                                                          • borodi 12 days ago
                                                                            For sure, I switch between Python, C/C++ and Julia a lot and, well, let's say bounds errors are pretty common for me.
                                                                            • snicker7 11 days ago
                                                                              The "idiomatic" way to access the first element in an array/sequence in julia is to use the `first` function, e.g. `first(arr)` vs. `arr[1]`. This works across a larger number of array types, including OffsetArrays with 0-based index offsets.
                                                                              • oscardssmith 12 days ago
                                                                                My advice would be to use begin and end. Then you don't have to think about the indexing.
                                                                • gnufx 12 days ago
                                                                  Mystique (PR)?
                                                                • gnufx 12 days ago
                                                                  I don't know what Intel did for the proprietary version, but the first thing you should do for Python is to compile with GCC's -fno-semantic-interposition. I don't know if there's a benefit from vectorization, for instance, in parts of the interpreter, or whether -Ofast helps generally if so, but I doubt there's anything Intel CPU-specific involved if there is. I've never looked at it; has the interpreter not been well-profiled and such optimizations provided? Anyway, if you want speed, don't use Python.

                                                                  It's obviously not relevant to Python per se, but you get basically equivalent performance to MKL with OpenBLAS or, perhaps, BLIS, possibly with libxsmm on x86. BLIS may do better on operations other than {s,d}gemm, and/or threaded, than OpenBLAS, but they're both generally competitive.
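
As a rough way to see whether a given interpreter was built with the flag mentioned at the start of this comment, one can inspect the build variables that CPython records. This is a sketch under assumptions: `built_with_flag` is a hypothetical helper, the variables are typically only populated on POSIX builds, and a flag could in principle have been applied in a way these variables do not capture.

```python
import sysconfig


def built_with_flag(flag: str) -> bool:
    """Check whether the running interpreter's recorded build flags mention `flag`."""
    recorded = []
    for var in ("CFLAGS", "PY_CFLAGS", "CONFIG_ARGS"):
        value = sysconfig.get_config_var(var)  # None where the variable is not recorded
        if value:
            recorded.append(value)
    # Plain substring match: good enough for a distinctive flag name.
    return flag in " ".join(recorded)


print("-fno-semantic-interposition:", built_with_flag("-fno-semantic-interposition"))
```
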

                                                                  • black_puppydog 12 days ago
                                                                    So I see Intel and Microsoft both like naming things the Wrong(TM) way around? This name makes about as much sense as WSL... :D
                                                                    • hallgrim 12 days ago
                                                                      We tried using Intel Python in one of my previous data science jobs, and ultimately gave up because compatibility with some packages from pip was a nightmare. Alas, I can’t quite remember exactly what went wrong.
                                                                      • RocketSyntax 12 days ago
                                                                        Is there a pip package?
                                                                        • agloeregrets 12 days ago
                                                                          I wonder who the person is who saw python and was like "You know what this needs? INTEL."