Datashader: turns even the largest data into images, accurately


237 points | by luu 189 days ago


  • mhalle 188 days ago

    Looks like a great project. Contrary to other comments, rendering != visualization. This project seems to have paid attention to lots of the seemingly little but critical details of this type of visualization that are a pain to handle yourself (anti-aliasing of multi-scale data, terrain shading, large- and out-of-core visualization).

    Any one of these topics can bring a visualization project to a screeching halt, or make the results look misleading or bad.

    Even better that they built a tool that works with existing libraries, rather than replacing them. Good work!

    • BubRoss 188 days ago

      > anti-aliasing of multi-scale data, terrain shading, large- and out-of-core visualization

      Webgl will basically do all of that for you, including the out of core if you can stream the data in.

      • mhalle 188 days ago

        No, it can't. WebGL is at a completely different level of the stack than Datashader. WebGL is even lower level than what most people use for 3D graphics (hence threejs and babylonjs).

        Data visualization is at a higher semantic level than rendering; ideally you don't want to deal with pixels and polygons. D3, for instance, binds graphical primitives (usually in SVG) with data representations but requires more programming to do actual data visualization (and that's why a bunch of software layers on top of D3). Bokeh deals with still higher level primitives closer to the data set level (plotting and charts).

        And Datashader carves out a niche where there's too much data to have a 1:1 ratio of data element to graphical primitive on the screen. It does that by rasterizing, but then also handling the hard part of mapping backwards from image to data for selection and interactivity (I hope that's right; I got it from watching the 2016 video).

        Anyone who has had to do this stuff for a living knows it is hard to do right, and that good modular tools are always welcome.

        I don't see how this repeated "it's just like X" line of responses is benefiting the discussion. Datashader is not just like WebGL or basic low-level rendering, any more than D3 is just SVG or web apps are just TCP connections. Completely different levels of abstraction with lots of value add in between (and hard work, I'm sure).

        • BubRoss 188 days ago

          I didn't say that webgl would do everything this can do, I said it would do the things I copied from the post I replied to.

          Again, my issue is that they seem to put a lot of focus on this having some sort of sophisticated new rendering when it seems to be marketing of trivial techniques. People seem to like what this library does, but they didn't invent new rendering algorithms and their buzzwords and clever names just show a lack of awareness of what they are doing in the rendering department.

          • jbednar 188 days ago

            I'm not sure if you're objecting to the name "Datashader", but surely every library needs a name, and this one is accurate in that it allows the sort of shading that one does for 3D rendering to be applied to 2D data plotting. Or are there other buzzwords used in the docs you find objectionable?

            • BubRoss 188 days ago

              If I said I was an expert in 'big data visualization with billions of points' and had written my own 'out of core' rendering library that I dubbed 'data shader', complete with a paper where I coined the term 'Abstract Rendering' or 'AR' for short, then you found out that I was just reading points from disk and drawing them with opengl's draw points function, what would you think?

              The term 'out of core' rendering comes from raytracing, where you really do need all the geometry available. They are applying it to trivial accumulation where it was never a problem in the first place. That's like me writing a paper on how to make a balloon air tight. That's how it has always worked, why would I take credit for something that was never a problem?

              • jbednar 187 days ago

                Sigh. Datashader is not a paper, it's an actual usable piece of software, so it should be compared to other tools and libraries for rendering data. Unlike nearly every other 2D plotting library available for Python, it can operate in core or out of core, so it's entirely appropriate to advertise that fact (why hide it?). Unlike OpenGL's point drawing functions and nearly every other 2D plotting library available for Python, it avoids overplotting and z-ordering issues that make visualizations misleading (so why hide that?). Unlike NumPy's histogram2d, it allows you to define what it means to aggregate the contents of each bin (mean, min, std, etc.), to focus on different aspects of your data. It's a mystery to me why you think Datashader should somehow fail to advertise what it's useful for!
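
                A minimal sketch of what "defining the aggregation per bin" can look like, in plain Python. It is not Datashader's implementation or API; the function name `aggregate` and the sample rows are made up for illustration, and pixel indices are assumed to be precomputed. Only running statistics are kept per bin, never the points themselves.

```python
import math

def aggregate(rows, width, height):
    """Reduce (col, row, value) samples to per-bin count, mean and min.

    Hypothetical helper: pixel indices are assumed to be already computed.
    """
    count = [[0] * width for _ in range(height)]
    total = [[0.0] * width for _ in range(height)]
    lowest = [[math.inf] * width for _ in range(height)]
    for col, row, value in rows:
        count[row][col] += 1
        total[row][col] += value
        lowest[row][col] = min(lowest[row][col], value)
    # Empty bins get NaN so they can be told apart from a true mean of zero.
    mean = [[total[r][c] / count[r][c] if count[r][c] else math.nan
             for c in range(width)] for r in range(height)]
    return count, mean, lowest

rows = [(0, 0, 2.0), (0, 0, 4.0), (1, 0, 10.0)]
count, mean, lowest = aggregate(rows, width=2, height=1)
print(count, mean, lowest)
```

                The same set of points yields three different images depending on which reduction is displayed, which is the difference from a plain 2D count histogram.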

                • BubRoss 187 days ago

                  > Datashader is not a paper


                  You keep defending the project as a whole while not confronting the fact that they are touting rendering breakthroughs, while I have given a lot of explanation of why there are no rendering breakthroughs and the actual rendering, no matter where it is done and no matter how much data is used, is trivial. I'm not sure what can help you focus in on the point I'm making here, I haven't strayed from it. This isn't about the workflow or the language used or anything else. It is about false claims and buzzwords to make people think that it is solving rendering problems that have never existed like 'accuracy' and 'big data' ( in the context of these visualizations ).

                  • eggie 187 days ago

                    They are touting it specifically in the context of the visualization of very large datasets.

                    The fact that their software exists is itself a breakthrough. It enabled me to do things that other equivalent tools (such as in statistical packages) could not allow. I would have been reduced to directly implementing my rendering pipelines, and I would also have had to make many of the same design decisions they made, such as doing things out of core.

                • saltcured 187 days ago

                  I do not know the data shader work well enough to defend it, nor to even know if it deserves defense, but I can at least respond to your argument.

                  You imply that accumulative rendering into a framebuffer solves large statistical integration problems. But the framebuffer is not implemented using abstract math over the real or integer domain. You need to consider the numerical effects of adding the smallest value (one sample) into a running sum.

                  If you use an integer/fixed precision buffer for the running sums, you need enough bits to avoid overflow even if billions of points land in one bin. You might think to use floating point, but that has worse problems for running sums. You are effectively limited to the number of bits in the mantissa when continuously adding small increments.

                  So, you cannot scale up the naive approach of zeroing the framebuffer and blending/accumulating points from a stream. You need to do some hierarchical aggregation to accurately represent sub-populations and combine them in a numerically robust manner. Most likely, you would also like to precompute some of these results to support better interactive performance, much like mip-mapping is used to provide more accurate texture sampling at multiple rendering scales.
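
                  The mantissa limit can be checked directly. This is a small sketch using only the standard library; `f32` and `naive_f32_count` are hypothetical helpers that emulate a float32 accumulator by rounding after every addition. The chunked variant at the end is one simple form of hierarchical aggregation, assuming the partial sums are combined in a wider type (float64 here).

```python
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_f32_count(n, start=0.0):
    """Add 1.0 to a float32 running sum n times."""
    total = f32(start)
    for _ in range(n):
        total = f32(total + 1.0)
    return total

# float32 has a 24-bit significand, so integers above 2**24 are not all representable.
limit = float(2 ** 24)
stuck = naive_f32_count(1000, start=limit)  # every +1.0 is rounded away

# Count sub-populations separately, then combine the partial sums in float64.
chunks = [naive_f32_count(1000) for _ in range(3)]
combined = limit + sum(chunks)

print(stuck, combined)
```

                  Once a bin holds 2**24 samples, a float32 counter stops moving, while the chunked sums stay exact because each partial count is far below that threshold.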

                  • pwang 187 days ago

                    I think you are deliberately trying to misunderstand what is being done in this project.

                    It's not about what APIs are being used to render whatever. At that level of analysis, all that anybody is ever doing is just memcpy and bitblt. Rather, datashader provides a framework for applying semantically meaningful, mathematical transformations on datasets as they're being accumulated, as those accumulations are converted into aesthetic/geom primitives, and as those primitives are rendered into colors. It really is "renderman for data", along with arbitrary vertex/texture shaders, driven by a dynamic rasterizer that can use whatever bins in data-space (not merely physical pixels).

                    BTW "Out of core" does NOT come from raytracing; in fact its history in computing is a term for anything that exceeds physical memory. We use it all the time in scientific/HPC and data science because datasets are frequently much larger than available memory.


                    • BubRoss 187 days ago

                      I think you are misrepresenting what is being done in this project. People seem to like it. They say it has workflow refinements. That's great, but there isn't anything new being done in rendering here unless doing something trivial in a pointlessly complex way and renaming fundamental techniques counts as a breakthrough.

                      Focus on the workflow refinements; saying there are rendering breakthroughs here is snake oil.

            • jbednar 188 days ago

              Datashader is server-side rendering, and thus not in any way comparable to WebGL in its usage. With Datashader, only the final rendered/rasterized image-like object is sent to the client, which lets it handle arbitrarily large datasets (anything your remote servers can process). With WebGL the dataset is sent to the browser for rendering, which has some advantages but is a very different process than what Datashader does.

              • detaro 188 days ago

                Somehow I feel like WebGL in the browser isn't going to just handle it for things like the main image on the project page, where the compressed dataset is > 1 GB already.

                • BubRoss 188 days ago

                  Why wouldn't it? There isn't anything special that needs to be done, you can stream in whatever you want and render it to the existing buffer. You can even have it anti aliased practically for free.

                  • el_dev_hell 188 days ago

                    > Why wouldn't it? There isn't anything special that needs to be done, you can stream in whatever you want and render it to the existing buffer.

                    1GB to the browser for rendering... Sure, it's doable. So is eating 1KG of wings for lunch. Doable, but very far from being a good idea.

                    • BubRoss 188 days ago

                      Once again, there is no need for very much to be in memory at one time. This is accumulation and buffers can be reused. I'm not sure how much more I can simplify this other than to say that addition doesn't care about ordering, you can just add continually without needing access to any of the other data that has been rasterized.
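
                      The order-independence claim is easy to check for integer counts. A minimal sketch with made-up data; `accumulate` is a hypothetical helper, and the 4x4 grid and chunk size of 100 are arbitrary choices for illustration.

```python
import random

def accumulate(grid, chunk):
    """Add one chunk of (row, col) pixel indices into a count buffer, in place."""
    for row, col in chunk:
        grid[row][col] += 1

random.seed(0)
pixels = [(random.randrange(4), random.randrange(4)) for _ in range(1000)]

# One pass over the whole dataset.
full = [[0] * 4 for _ in range(4)]
accumulate(full, pixels)

# Same data, shuffled and streamed in chunks; only one chunk is needed at a time.
shuffled = pixels[:]
random.shuffle(shuffled)
streamed = [[0] * 4 for _ in range(4)]
for i in range(0, len(shuffled), 100):
    accumulate(streamed, shuffled[i:i + 100])

print(streamed == full)
```

                      This holds exactly for integer counters. With floating-point sums the result can depend on the order of the additions, so the guarantee is weaker there.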

                      • jbednar 187 days ago

                        I think this was already said above, but it still seems to be getting confused, so to repeat: Datashader renders everything out of core, in the server. So it doesn't matter whether a client could successfully accumulate results for a large dataset incrementally; to use WebGL directly one still has to send all of the data to the client eventually. With Datashader the dataset is never sent to the client in the first place; it stays on the server, which could be a remote HPC system with thousands of cores processing petabytes. Datashader renders the data into an image-shaped array on the server, then sends that (much smaller) array to the client, so that the client never sees any data larger than the available screen resolution. This is no claim that doing so is unprecedented or some crazy new idea, just that Datashader lets you render datasets regardless of their size, completely independently of any client (browser) limitations, and without having to serialize the data over an internet connection.

            • IanCal 188 days ago

              Datashader is a great project. Very fast, very easy to use. You can throw a lot of data at it in a notebook and get back a zoomable interactive pane.

              Here's a 2016 talk on it:

              There's likely a lot of improvements since then, but that should help show some of the core parts and explain why it's a useful tool.

              • 24gttghh 188 days ago


                I wish more people were outraged at this kind of election tampering. Great visualizations though! Zoom in on some of those tight masses of black outlines. The shapes are ridiculous. Maryland 3rd? Come on.

              • itodd 188 days ago

                I've used datashader for plotting NGS (Next Generation Sequencing) enrichments. At the time I had to hack together the ability to use the polygon select tool on the data, but it worked and blew my mind.

                Very elegant solution to a difficult problem (overplotting).

                • abcc8 188 days ago

                  Do you have any examples of this you could point to online? I am looking at different visualization tools for various NGS-based analyses currently.

              • tokyodude 188 days ago

                > Turns even the largest data into images, accurately

                The first image, the image of the USA, seems really mis-representative to me. LA and NYC should be way way way more bright in relation to everything else than the entire area east of the Mississippi.

                At least to my eyes that map makes it look like parts of Denver, Kansas City, Salt Lake City, Atlanta, and the San Joaquin Valley are just as dense as Manhattan.

                Atlanta's population density 630 per square mile

                Manhattan's population density 70826 per square mile

                It seems like an accurate data image would have Atlanta's brightness 1/100th of Manhattan's. Basically it looks like they saturated out at around 250 people, so anything over 250 people is the same brightness.

                • jbednar 188 days ago

                  By default, Datashader accurately conveys the shape of the distribution in a way that the human visual system can process. If you want a linear representation, you can do that easily; see the first plot in , but you'll quickly see that the resulting plot completely fails to show that there are any patterns anywhere besides the top few population hotspots, which is highly unrepresentative of the actual patterns in this data. There is no saturation here; what it's doing in the homepage image is basically a rank-order encoding, where the top brightness value is indeed shared by several high-population pixels, the next brightness value is shared by the next batch of populations, etc. Given only 256 possible values, there has to be some grouping, but it's not saturating.
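
                  To make the difference concrete, here is a rough sketch comparing a linear mapping with a simple rank-based one. It is a simplified stand-in, not Datashader's actual transfer function; `linear_levels` and `rank_levels` are hypothetical names. The values 630 and 70826 are the densities quoted upthread, and the other counts are made up.

```python
def linear_levels(counts, levels=256):
    """Map counts to brightness proportionally to the largest count."""
    top = max(counts)
    return [round(c / top * (levels - 1)) for c in counts]

def rank_levels(counts, levels=256):
    """Map counts to brightness by their rank among the distinct values."""
    distinct = sorted(set(counts))
    if len(distinct) == 1:
        return [levels - 1] * len(counts)
    rank = {value: i for i, value in enumerate(distinct)}
    return [round(rank[c] / (len(distinct) - 1) * (levels - 1)) for c in counts]

counts = [1, 10, 100, 630, 70826]
linear = linear_levels(counts)
ranked = rank_levels(counts)
print(linear)
print(ranked)
```

                  With the linear mapping, everything below the hotspot collapses into the darkest few levels. The rank-based mapping keeps all five values visually distinct, at the cost of no longer showing their ratios.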

                  • pwang 187 days ago

                    Yes, datashader actually gives you the ability to dial-in as much gamma compensation as you want, to account for the human visual system's nonlinear response to luminance.

                  • johnmarinelli 188 days ago

                    Looks like a really cool project. One thing that I would be interested in seeing would be using Datashader as a dynamic visualisation library - for example, generative art projects. Probably not the main interest of data visualisation practitioners but hey, if you've got a sweet pipeline to render all those points, why not?

                  • whoisjuan 188 days ago

                    What license is this project using? The repo has a license but besides the provisions listed there I don't see any standard license.

                    • tnvaught 188 days ago

                      The repo has a standard 3-clause BSD license.

                    • burtonator 188 days ago

                      This actually gave me an interesting idea regarding bitcoin passphrase mnemonics.

                      Instead of text we could use the same algorithm to generate images.

                      So you could have an index of images and generate them. I'm actually wondering if you could use nouns and verbs to maybe make stories if you could mutate the nouns reliably.

                      Like 'bird flying' vs 'bird sleeping' ...

                      This could help to remember long passphrases visually which people seem to be better at.

                      • simplyinfinity 188 days ago

                        Is there anything similar for network graphs?

                        • lmeyerov 188 days ago

                          Yep -- at, we started by making millions of nodes/edges interactive. If you use notebooks, can signup on our site and get going. The trick is we connect GPUs in the browser to GPUs in the cloud, and encapsulate it enough that you can stick to writing standard SQL/pandas/etc.

                          We've been curious about server-side static tile rendering for larger graphs, but has been on the back-burner. (We already connect to GPUs on the server, so not rocket science.) Currently, we're actively increasing how much can be ingested + computed on, such as for finding influencers, communities & rings, etc. However, visualizing that hasn't been an operational priority for our users. More useful to generate the communities, and then either inspect individual ones, or see how communities stitch together: quickly run out of pixels otherwise due to too many edges. Likewise, we're building connectors to gigascale-petascale graph DBs: titan, janus, aws neptune, tigergraph, spark graphx, etc.

                          We still are interested, but more for when we start supporting geographic maps: you can see that is the primary use for datashader. Also, because data art is fun :)

                          • jbednar 188 days ago

                            Geographic maps aren't the primary use for datashader; those are just easy examples that people can appreciate without a lot of explanation. In practice we use it for any large datasets that we don't want to subsample before visualizing them.

                            • lmeyerov 188 days ago

                              Yes, definitely more capable. I've seen them primarily used in scatterplots (x, y, maybe z) + maps. Curious where else you're seeing the 80/20 breakdown if not there...

                          • jbednar 188 days ago

                            Datashader itself renders networks:

                            • jointpdf 188 days ago

                              Perhaps Gephi? It was used for those Graphcore neural network visualizations (millions of edges and nodes) and they look stunning.



                              • simplyinfinity 188 days ago

                                I've used both gephi and cytoscape, i have about 2 million nodes with about 3 million edges, it's hard to visualise at once. however i've reduced the graphs to only sub networks with more than 5 connections, the remainder is about 600k, which is still painfully slow to visualise

                              • maliker 188 days ago

                                My team has had luck rendering an SVG from the graph and sending it to a browser. It works well for about 10k vertices and edges. Above that scale we use datashader, and we're investigating a potential to move to QGIS. We tried Gephi a few years ago and it had trouble at these scales.

                                • IanCal 188 days ago

                                  You can render networks in datashader, there's a line primitive.

                                  I added edge bundling (probably the slowest thing in datashader!) but I know there's examples of flight path rendering in the video I linked in another comment.

                                • BubRoss 189 days ago

                                  This is basically visualization 101. 'Turning data into images' isn't exactly a new concept. Sometimes I think the next generation of programmers has gotten too good at thinking up fancy names for reinventing the wheel and hasn't spent enough time looking at what has been done in the past 50 years so they can build off of it.

                                  Also there are links to 'visualizing big data' and 'visualizing billions of points'. These too have been problems solved a long time ago with a lot less resources. The book Advanced RenderMan is a good place to start, and of course, opengl uses basic z-buffer rendering.

                                  Unless you need sorted transparency (and accumulating points for brightness definitely does not), then you can rasterize as much as you want in any order, without holding anything in memory except for the image buffer. It doesn't matter how many point/particles you have, you don't have to do anything to make it scale.

                                  • jameskilton 189 days ago

                                    And this is how you push people out of our field.

                                    How about instead of starting with an insult (I can't believe you didn't already know this) you instead congratulate them on putting together a full working library, with pretty, easy-to-grasp examples then offer up some research links that they could use to further refine and improve their system. It's our job to teach people, you can't expect everyone to suddenly know everything.

                                    To the Datashader team: I apologize for the above comment. Good job in building and launching a tool for others to use, and great choices for examples!

                                    • corysama 189 days ago

                                      “Dropbox is just rsync on a cron job” and “OP’s project is not completely novel” are unfortunately common tropes on HN...

                                      Like you said. Starting with “Great job launching. What are the benefits of using Dropbox over, say just rsync + cron?” would go a long way towards improving the environment around here.

                                      • BubRoss 189 days ago

                                        My problem isn't that it isn't novel or new, it is that it is presented as novel. Visualization is great (and the visualization here looks good), but saying things like 'datashader: turn data into images' when that is literally what rendering is, is a nonsense way to approach a valid topic.

                                        Saying 'visualize big data and billions of points' when the buzzwords are just there to sugar coat an accumulation buffer, gets into a territory of reinventing the wheel but naming it 'the flattened infinite curvature hypersphere'.

                                        So, before you get too self righteous at least realize that it is the delivery and lack of context and precedent, not the actual work that is the problem.

                                        • johnmarinelli 188 days ago

                                          I suspect marketing Datashader as an "accumulation buffer" wouldn't have the same effect on its target audience (data visualisation developers) as presenting it simply as a way to "Turn data into images".

                                          I'm also curious (as a fledgling graphics programmer) - what leads you to believe that Datashader uses an accumulation buffer internally? I would think that they use some magic to draw all the points in a single draw call using instanced rendering, but I am very naive :)

                                          • mhalle 188 days ago

                                            If you watch the video, one of the creators explains this point, which makes datashader different from, say, Bokeh. (Pipeline explained around 7:20 in the video)


                                            D3 and Bokeh and other web-based visualization tools, in general, plot HTML or CSS primitives to the browser. This approach works great for smaller datasets, but doesn't scale to millions/billions.

                                            Datashader aggregates (accumulates) graphical representations of data into images, then provides a way to get those to the browser and work well with the other libraries. That high level description leaves out 95% of the critical practical details of visualization, which the creators of datashader handle.

                                            • jbednar 188 days ago

                                              Datashader's approach is a bit different from an accumulation buffer, though similar in principle. It's not 3D rendering, and has no need for a z ordering; instead it's essentially 2D histogramming. For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel. The key benefit over something like NumPy's histogram2d function is in how it is implemented and used -- highly optimized, and highly integrated with viz tools so that it can let you just interact with your data naturally as if you had infinite resolution. Try it and see!
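
                                              The per-point binning step described above can be sketched in a few lines of plain Python. This only illustrates the idea and is not Datashader's optimized implementation; `bin_points`, the 2x2 grid and the sample points are assumptions made for the example.

```python
def bin_points(points, x_range, y_range, width, height):
    """Count how many (x, y) points land in each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point is outside the viewport
        # Map data coordinates to a pixel index; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid

points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (0.5, 0.5), (2.0, 2.0)]
counts = bin_points(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(counts)  # [[2, 0], [0, 2]]
```

                                              Memory use depends only on the grid size, not on the number of points, and zooming amounts to calling the function again with a narrower `x_range` and `y_range`.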

                                              • BubRoss 188 days ago

                                                > For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel.

                                                That is literally what opengl does. If you mean a histogram per pixel in depth, that's literally voxels in perspective space.

                                                If there are usability benefits here, that's great, but everything seems to be centered around there being new rendering techniques here, when not only are they not new, they're completely trivial, with solidified names and formalized math.

                                              • pwang 188 days ago

                                                It's actually not merely an accumulation buffer. It's a shader pipeline that allows for arbitrary Python code to be executed at each stage of data processing. It's actually very much like "renderman for data", but with Python (via Numba, Dask for performance).

                                                The pipeline is also built in such a way that it permits front-end JS viewers like Bokeh to drive a very dynamic experience.

                                              • mikepurvis 188 days ago

                                                It is novel in the sense that it combines the interactivity of a D3/Bokeh/whatever JS based visualization (typically limited to a few thousand points) with the massive data display capability of offline rendering.

                                                • p49k 188 days ago

                                                  “Turn data into images” is a much better phrase than “rendering.” Anyone who is not intentionally being pedantic for dramatic effect would understand the purpose of the product based on reading that phrase, whereas “Datashader: a renderer” is less clear and could refer to products with an entirely different scope/purpose.

                                                  Their product page is well-written and accurate. It sounds like you want them to purposely describe their product as something that is inferior to what it actually is.

                                                  • jbednar 188 days ago

                                                    If you want one word, Datashader is a rasterizer. It takes data of many types (points, lines, grids, meshes) and creates a regular grid where each grid cell's value is a well defined function of the incoming data. Not sure anyone would be any happier with "rasterizer" than "renderer" or "shader" or any other single word...

                                                  • detaro 188 days ago

                                                    I can't see a single reference to this being somehow something you couldn't do before in the linked page. It describes what it does, it doesn't make claims of superiority over other approaches, ...

                                                  • RodericDay 189 days ago

                                                    As a newcomer to the field, parent post is far more welcoming than the bevy of trash Ninja flavor-of-the-week bullshit that makes it seem impossible to catch up.

                                                    I only actually got my bearings and self-confidence as a programmer when I realized that most of the people pushing blogs with subscriptions about "cutting-edge" tech were literally snake-oil salesmen and shovel merchants.

                                                    That coding wasn't actually different from anything else I had learned in my life, and that there were some fundamentals I could latch onto, and grow from there upwards. All this nonsense about the field experiencing a revolution that upends all existing knowledge year-after-year is far more mentally taxing.

                                                  • pwang 188 days ago

                                                    You're missing the point of this project. It's not about the feasibility of throwing a billion points at a pile of software, to get an image. I can do that with a simple Python script. It's about doing so to create a meaningful and accurate data visualization, and not just a picture of, say, shiny spheres or a scene from Avatar.

                                                    I actually have a background in 3D computer graphics, and it's precisely because of my detailed knowledge of raytracing, rasterization, OpenGL, BMRT, photon maps, computational radiometry, BRDFs, computational geometry, and statistical sampling, etc... that when I came to the field of data science & specifically the problem of visualizing large datasets, I realized the total lack of tooling in this space.

                                                    The field of information visualization lags behind general "computer-generated imagery" by decades. When I first presented my ideas around Abstract Rendering (which became Datashader) to my DARPA collaborators, even to famous visualization people like Bill Cleveland or Jeff Heer, it was clear that I was thinking about the problem in an entirely different way. I recall our DARPA PM asking Hanspeter Pfister how he would visualize a million points, and he said, "I wouldn't. I'd subsample, or aggregate the data."

                                                    Datashader eats a million points for breakfast.

                                                    Since you're clearly a computer graphics guy, the way to think about this problem is not one of naive rendering, but rather one of dynamically generating correct primitives & aesthetics at every image scale, so that the viewer has the most accurate understanding of what's actually in the dataset. So it's not just a particle cloud, nor is it nurbs with a normal & texture map; rather, it's a bunch of abstract values from which a data scientist may want to synthesize any combination of geometry and textures.

                                                    I chose the name "datashader" for a very specific and intentional reason: we are dynamically invoking a shader - usually a bunch of Python code for mathematical transformation - at every point, within a sampling volume (typically a square, but it doesn't have to be). One can imagine drawing a map of the rivers of the US, with the shading based on some function of all industrial plants in its watershed. Both the domain of integration and the function to evaluate are dynamic for each point in the view frustum.

                                                    • BubRoss 188 days ago

                                                      > Datashader eats a million points for breakfast.

                                                      So does opengl on a decade old computer.

                                                    • IanCal 188 days ago

                                                      > thinking up fancy names for reinventing the wheel

                                                      They're not claiming to have reinvented the wheel, they're just explaining what it is.

                                                      > 'Turning data into images' isn't exactly a new concept.

                                                      No, but doing so on large data accurately (the last word, which you cut off, is important) is not something I know I can easily achieve faster in a different Python library. I'd like to know if I could.

                                                      • CyberDildonics 188 days ago

                                                        > but doing so on large data accurately

                                                        What do you mean by 'large' or 'accurate'? Where would accuracy be lost in any approach?

                                                        • pwang 188 days ago

                                                          This paper on our original ideas of "Abstract Rendering" talks about the kinds of accuracy problems that plague the visualization of large datasets:

                                                          We renamed from Abstract Rendering to Datashader for affordances of human cognition.

                                                          This is a great paper from Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization, as part of an effort to come up with an algebraic process for visual design:

                                                          Using their metrics around "confusers" and "hallucinators", Datashader came out as one of the few things that doesn't suffer from such intrinsic limitations.

                                                          • CyberDildonics 188 days ago

                                                            There are a lot of red flags in the abstract of that paper alone.

                                                            > Rendering techniques are currently a major limiter since they tend to be built around central processing with all of the geometric data present.

                                                            This is completely untrue - OpenGL and virtually all real time rendering is done using z-buffer techniques that were originally used because they don't need all the geometry present. These techniques date back to the 70s and were some of the first hidden surface rendering algorithms.

                                                            > This paper presents Abstract Rendering (AR), a technique for eliminating the centralization requirement while preserving some forms of interactivity.

                                                            Interactivity might be novel here so that is what should really be focused on, if anything. I don't think coining a new term and acronym that don't seem to relate to what is happening is going to be a good choice to communicate the techniques.

                                                            > AR is based on the observation that pixels are fundamentally bins, and that rendering is essentially a binning process on a lattice of bins.

                                                            This observation was made in the early 80s and has been the backbone of renderman renderers for almost 40 years. Renderman calls them 'buckets'.

                                                            > This approach enables: (1) rendering on large datasets without requiring large amounts of working memory,

                                                            Renderman originally rendered film resolution images with high resolution textures with only 10MB of memory.

                                                            > (3) a direct means of distributing the rendering task across processes,

                                                            Giving different threads their own buckets is standard for any non-toy renderer. Distributing buckets across multiple computers is part of many toolsets.

                                                            > high-performance interaction techniques on large datasets

                                                            This is the only part that has a chance of being novel, but the paper only shows basic accumulation of density for adjacency matrices. The visualizations are timed in the multiple seconds but look extremely simple, and for some reason are rendered 'out-of-core' on a computer with 144GB of memory even though it seems very unclear that these images couldn't be made with z-buffer rendering in OpenGL.

                                                            > This is a great paper from Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization

                                                            It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
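
The properties being argued over here (pixels as bins, bounded working memory, distributing the work) all follow from the fact that per-pixel counts are additive. A minimal sketch, assuming NumPy; `count_chunk` and `count_streaming` are hypothetical names, and this is not the paper's or Datashader's implementation:

```python
import numpy as np

def count_chunk(x, y, width, height, x_range, y_range):
    """Count the points of one chunk that fall into each pixel-sized bin."""
    # histogram2d bins its first argument along axis 0, so y (rows) goes first.
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts

def count_streaming(chunks, width, height, x_range, y_range):
    """Accumulate counts chunk by chunk; only one chunk is in memory at a time."""
    total = np.zeros((height, width))
    for x, y in chunks:
        total += count_chunk(x, y, width, height, x_range, y_range)
    return total

rng = np.random.default_rng(0)
x = rng.random(10_000)
y = rng.random(10_000)

# One pass over everything versus ten independent chunks.
whole = count_chunk(x, y, 64, 48, (0, 1), (0, 1))
pieces = ((x[i:i + 1000], y[i:i + 1000]) for i in range(0, 10_000, 1000))
streamed = count_streaming(pieces, 64, 48, (0, 1), (0, 1))
```

Because each chunk's grid is computed independently, the chunks could equally be handed to separate processes and summed at the end. Reductions that are not simple sums (a median, say) do not combine this easily.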

                                                            • pwang 187 days ago

                                                              > It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.

                                                              Actually, no. The paper may not have been explicitly clear about this, but the ENTIRE point of a "data visualization" system is to transform potentially high-dimensional datasets, with a large number of columns, into meaningful images by a series of steps. You seem to be interpreting this narrowly, and imagining that geometry is already pre-defined in the dataset, so then of course this looks like a fairly trivial 2D accumulator.

                                                              That is not the intent, nor is the common use case.

                                                              For data visualization, the question of "how do I accurately aggregate or accumulate the 25 to 1 million points in this bucket" is a deep one. There is NO data visualization system that programmatically gives access to this step of the viz pipeline to a data scientist or statistician. Most "infoviz" tools gloss over this problem - they do simple Z buffering, or cheesy automatic histograms of color/intensity, etc. These are almost always "wrong" and produce unintended hallucinators.

                                                              Your first comment - about "not needing all the geometry present" - indicates that you are not understanding the nature of the problem datashader was designed to solve. There is no simple "cull" function for data science; there is no simple "Z" axis on which to sort, smush, blend, etc. At best, your data points can be projected into some kind of Euclidean space on which you can implement a fast spatial subdivision or parallel aggregation algorithm. But once that's done, you're still left holding millions of partitions of billions of points or primitives, each with dozens of attributes.... what then?
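
A small sketch of why the per-bucket aggregation step matters, assuming NumPy. The function names are hypothetical, and the log mapping is just one possible choice made for this example, not a claim about what Datashader does by default. Opaque overplotting only records whether a pixel was hit, so a bucket with one point and a bucket with a thousand look the same; keeping the counts and choosing the mapping afterwards preserves the difference.

```python
import numpy as np

def opaque_plot(counts):
    """Mimic opaque overplotting: a pixel is either hit or not."""
    return (counts > 0).astype(float)

def log_shade(counts):
    """Map counts to [0, 1] on a log scale so sparse and dense bins both show."""
    shaded = np.log1p(counts)
    peak = shaded.max()
    return shaded / peak if peak > 0 else shaded

# Per-pixel point counts for a tiny 2x2 image (made-up numbers).
counts = np.array([[1, 1000],
                   [0, 10]])

flat_view = opaque_plot(counts)   # 1-point and 1000-point pixels are identical
graded_view = log_shade(counts)   # the three occupied pixels stay distinguishable
```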

                                                              • CyberDildonics 187 days ago

                                                                > You seem to be interpreting this narrowly

                                                                I'm not sure why you would coin a term 'Abstract Rendering' and talk about 'out of core rendering' then turn around and say that transforming high dimensional data sets is part of rendering. Rendering is well defined and very established; coming up with transformations and calling that part of rendering is nonsense. You made this mess yourself by trying to stretch the truth.

                                                          • IanCal 188 days ago

                                                            If you want to discuss this properly, can you please finish the sentence?

                                                            What library should I use to do this faster in python without subsampling the data for a start?

                                                        • candu 188 days ago

                                                          Ah, the good old "MapReduce is basically functional programming 101" trope, usually resulting from a fundamental misunderstanding of the problem the framework / tool in question solves.

                                                          • TeMPOraL 188 days ago

                                                            Well, in that particular case, Google has brought this on themselves by naming their sort-of-a-product after a common FP idiom, and then hyping the hell out of it.

                                                          • douglaswlance 189 days ago

                                                            It doesn't matter if something has been done before if the new way hooks into new systems. Tools do not exist in a vacuum. All tools are a part of a system of tools that should always be considered when evaluating any component piece.

                                                            • sevensor 188 days ago

                                                              I also was expecting something new, but in their defense they've made a very appealing version of something old. I'm sure there are a lot of people out there who haven't thought about saturation with "large" data sets before.

                                                              • Eli_P 189 days ago

                                                                What are those fractals in the 3rd picture? They remind me of Lissajous curves used in oscilloscopes.