Looks like a great project. Contrary to other comments, rendering != visualization. This project seems to have paid attention to lots of the seemingly little but critical details of this type of visualization that are a pain to handle yourself (anti-aliasing of multi-scale data, terrain shading, large- and out-of-core visualization).
Any one of these topics can bring a visualization project to a screeching halt, or make the results look misleading or bad.
Even better that they built a tool that works with existing libraries, rather than replacing them. Good work!
No, it can't. WebGL is at a completely different level of the stack than Datashader. WebGL is even lower level than what most people use for 3D graphics (hence threejs and babylonjs).
Data visualization is at a higher semantic level than rendering; ideally you don't want to deal with pixels and polygons. D3, for instance, binds graphical primitives (usually in SVG) with data representations but requires more programming to do actual data visualization (and that's why a bunch of software layers on top of D3). Bokeh deals with still higher level primitives closer to the data set level (plotting and charts).
And Datashader carves out a niche where there's too much data to have a 1:1 ratio of data element to graphical primitive on the screen. It does that by rasterizing, but then also handling the hard part of mapping backwards from image to data for selection and interactivity (I hope that's right; I got it from watching the 2016 video).
Anyone who has had to do this stuff for a living knows it is hard to do right, and that good modular tools are always welcome.
I don't see how this repeated "it's just like X" line of responses is benefiting the discussion. Datashader is not just like WebGL or basic low-level rendering, any more than D3 is just SVG or web apps are just TCP connections. Completely different levels of abstraction with lots of value add in between (and hard work, I'm sure).
I didn't say that webgl would do everything this can do, I said it would do the things I copied from the post I replied to.
Again, my issue is that they seem to put a lot of focus on this having some sort of sophisticated new rendering when it seems to be marketing of trivial techniques. People seem to like what this library does, but they didn't invent new rendering algorithms and their buzzwords and clever names just show a lack of awareness of what they are doing in the rendering department.
I'm not sure if you're objecting to the name "Datashader", but surely every library needs a name, and this one is accurate in that it allows the sort of shading that one does for 3D rendering to be applied to 2D data plotting. Or are there other buzzwords used in the docs you find objectionable?
If I said I was an expert in 'big data visualization with billions of points' and had written my own 'out of core' rendering library that I dubbed 'data shader', complete with a paper where I coined the term 'Abstract Rendering' or 'AR' for short, then you found out that I was just reading points from disk and drawing them with opengl's draw points function, what would you think?
The term 'out of core' rendering comes from raytracing, where you really do need all the geometry available. They are applying it to trivial accumulation where it was never a problem in the first place. That's like me writing a paper on how to make a balloon air tight. That's how it has always worked, why would I take credit for something that was never a problem?
Sigh. Datashader is not a paper, it's an actual usable piece of software, so it should be compared to other tools and libraries for rendering data. Unlike nearly every other 2D plotting library available for Python, it can operate in core or out of core, so it's entirely appropriate to advertise that fact (why hide it?). Unlike OpenGL's point drawing functions and nearly every other 2D plotting library available for Python, it avoids overplotting and z-ordering issues that make visualizations misleading (so why hide that?). Unlike NumPy's histogram2d, it allows you to define what it means to aggregate the contents of each bin (mean, min, std, etc.), to focus on different aspects of your data. It's a mystery to me why you think Datashader should somehow fail to advertise what it's useful for!
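To make the per-bin aggregation point concrete, here is a minimal pure-Python sketch. It is not Datashader's API or implementation; the function name `aggregate` and its parameters are hypothetical, chosen only for illustration. It keeps running state per bin (count, sum, min) instead of storing the points, and derives the mean at the end.

```python
def aggregate(points, width, height, x_range, y_range):
    """Bin (x, y, value) points into a height x width grid.

    Hypothetical helper for illustration only. Keeps running
    accumulators per bin, never the points themselves.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    count = [[0] * width for _ in range(height)]
    total = [[0.0] * width for _ in range(height)]
    lowest = [[None] * width for _ in range(height)]
    for x, y, value in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # point falls outside the viewport
        # Map data coordinates to a bin index, clamped against rounding.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        count[row][col] += 1
        total[row][col] += value
        if lowest[row][col] is None or value < lowest[row][col]:
            lowest[row][col] = value
    # Empty bins have no mean, so they are reported as None.
    mean = [[total[r][c] / count[r][c] if count[r][c] else None
             for c in range(width)] for r in range(height)]
    return {"count": count, "mean": mean, "min": lowest}


result = aggregate([(1, 1, 2.0), (2, 2, 4.0), (9, 9, 10.0)],
                   width=2, height=2, x_range=(0, 10), y_range=(0, 10))
print(result["count"], result["mean"], result["min"])
```

The same three points give a different picture depending on whether you look at the count, the mean, or the min of each bin, which is the choice a plain 2D histogram does not offer.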
You keep defending the project as a whole while not confronting the fact that they are touting rendering breakthroughs, while I have given a lot of explanation of why there are no rendering breakthroughs and the actual rendering, no matter where it is done and no matter how much data is used, is trivial. I'm not sure what can help you focus in on the point I'm making here, I haven't strayed from it. This isn't about the workflow or the language used or anything else. It is about false claims and buzzwords to make people think that it is solving rendering problems that have never existed like 'accuracy' and 'big data' (in the context of these visualizations).
They are touting it specifically in the context of the visualization of very large datasets.
The fact that their software exists is itself a breakthrough. It enabled me to do things that other equivalent tools (such as in statistical packages) could not allow. I would have been reduced to directly implementing my rendering pipelines, and I would also have had to make many of the same design decisions they made, such as doing things out of core.
I do not know the data shader work well enough to defend it, nor to even know if it deserves defense, but I can at least respond to your argument.
You imply that accumulative rendering into a framebuffer solves large statistical integration problems. But the framebuffer is not implemented using abstract math over the real or integer domain. You need to consider numerical effects of adding the smallest value (one sample) into a running sum.
If you use an integer/fixed precision buffer for the running sums, you need enough bits to avoid overflow even if billions of points land in one bin. You might think to use floating point, but that has worse problems for running sums. You are effectively limited to the number of bits in the mantissa when continuously adding small increments.
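The mantissa limit can be seen in a few lines. This is a sketch that assumes a 32-bit float accumulator (the comment does not name a buffer format); the helper names are hypothetical, and float32 is emulated with the standard `struct` module.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_float32(total, increment):
    """Emulate one accumulation step in a float32 buffer."""
    return to_float32(total + increment)


# float32 has a 24-bit significand, so integers are exact only up to 2**24.
limit = float(2 ** 24)
stuck = add_float32(limit, 1.0)  # the +1 is rounded away

# Start just below the limit and add ten more samples.
total = float(2 ** 24 - 5)
for _ in range(10):
    total = add_float32(total, 1.0)

exact = (2 ** 24 - 5) + 10  # what an integer counter would hold
print(stuck, total, exact)
```

Once the running sum reaches 2**24, each further `+ 1.0` rounds back to the same value, so the float32 counter stops growing while the integer count keeps going.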
So, you cannot scale up the naive approach of zeroing the framebuffer and blending/accumulating points from a stream. You need to do some hierarchical aggregation to accurately represent sub-populations and combine them in a numerically robust manner. Most likely, you would also like to precompute some of these results to support better interactive performance, much like mip-mapping is used to provide more accurate texture sampling at multiple rendering scales.
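A small sketch of the difference between naive accumulation and combining partial aggregates, again assuming a float32 bin emulated with `struct`. The function names are hypothetical, and this is only a two-level illustration, not a full hierarchical scheme or anything Datashader is confirmed to do.

```python
import struct


def f32(x):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum(values, start=0.0):
    """Add values one at a time into a float32 running sum."""
    total = f32(start)
    for v in values:
        total = f32(total + v)
    return total


existing = float(2 ** 24)   # a bin that already holds 2**24 samples
new_points = [1.0] * 1000   # 1000 more samples land in the same bin

# Naive: every +1.0 is rounded away against the large running sum.
naive = naive_sum(new_points, start=existing)

# Two-level: aggregate the new sub-population on its own, then merge once.
partial = naive_sum(new_points)
merged = f32(existing + partial)

print(naive, partial, merged)
```

The merged result happens to be exact here because 2**24 + 1000 is representable in float32. In general the merge step still rounds, but it rounds once per partial aggregate rather than discarding every individual increment.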
I think you are deliberately trying to misunderstand what is being done in this project.
It's not about what APIs are being used to render whatever. At that level of analysis, all that anybody is ever doing, is just doing memcpy and bitblt. Rather, datashader provides a framework for applying semantically meaningful, mathematical transformations on datasets as they're being accumulated, as those accumulations are converted into aesthetic/geom primitives, and as those primitives are rendered into colors. It really is "renderman for data", along with arbitrary vertex/texture shaders, driven by a dynamic rasterizer that can use whatever bins in data-space (not merely physical pixels).
BTW "Out of core" does NOT come from raytracing; in fact its history in computing is a term for anything that exceeds physical memory. We use it all the time in scientific/HPC and data science because datasets are frequently much larger than available memory.
I think you are misrepresenting what is being done in this project. People seem to like it. They say it has workflow refinements. That's great, but there isn't anything new being done in rendering here unless doing something trivial in a pointlessly complex way and renaming fundamental techniques counts as a breakthrough.
Focus on the workflow refinements; saying there are rendering breakthroughs here is snake oil.
Datashader is server-side rendering, and thus not in any way comparable to WebGL in its usage. With Datashader, only the final rendered/rasterized image-like object is sent to the client, which lets it handle arbitrarily large datasets (anything your remote servers can process). With WebGL the dataset is sent to the browser for rendering, which has some advantages but is a very different process than what Datashader does.
Once again, there is no need for very much to be in memory at one time. This is accumulation and buffers can be reused. I'm not sure how much more I can simplify this other than to say that addition doesn't care about ordering, you can just add continually without needing access to any of the other data that has been rasterized.
I think this was already said above, but it still seems to be getting confused, so to repeat: Datashader renders everything out of core, in the server. So it doesn't matter whether a client could successfully accumulate results for a large dataset incrementally; to use WebGL directly one still has to send all of the data to the client eventually. With Datashader the dataset is never sent to the client in the first place; it stays on the server, which could be a remote HPC system with thousands of cores processing petabytes. Datashader renders the data into an image-shaped array on the server, then sends that (much smaller) array to the client, so that the client never sees any data larger than the available screen resolution. This is no claim that doing so is unprecedented or some crazy new idea, just that Datashader lets you render datasets regardless of their size, completely independently of any client (browser) limitations, and without having to serialize the data over an internet connection.
I wish more people were outraged at this kind of election tampering. Great visualizations though! Zoom in on some of those tight masses of black outlines. The shapes are ridiculous. Maryland 3rd? Come on.
I've used datashader for plotting NGS (Next Generation Sequencing) enrichments. At the time I had to hack together the ability to use the polygon select tool on the data, but it worked and blew my mind.
Very elegant solution to a difficult problem (overplotting).
I do. I remember posting this to the mailing list. I don't have an example calculating enrichments though. We simply group by read and divide using the frequencies and then plot one enrichment vs another. This way we can see how one sequence enriches between conditions. There's more to it than that but this will produce a plot similar to the one in the attachment in the thread linked below.
> Turns even the largest data into images, accurately
The first image, the image of the USA, seems really mis-representative to me. LA and NYC should be way way way more bright in relation to everything else than the entire area east of the Mississippi.
At least to my eyes that map makes it look like parts of Denver, Kansas City, Salt Lake City, Atlanta, and the San Joaquin Valley are just as dense as Manhattan.
Atlanta's population density 630 per square mile
Manhattan's population density 70826 per square mile
It seems like an accurate data image would have Atlanta's brightness 1/100th of Manhattan's. Basically it looks like they saturated out at around 250 people, so anything over 250 people is the same brightness.
By default, Datashader accurately conveys the shape of the distribution in a way that the human visual system can process. If you want a linear representation, you can do that easily; see the first plot in http://datashader.org/topics/census.html , but you'll quickly see that the resulting plot completely fails to show that there are any patterns anywhere besides the top few population hotspots, which is highly unrepresentative of the actual patterns in this data. There is no saturation here; what it's doing in the homepage image is basically a rank-order encoding, where the top brightness value is indeed shared by several high-population pixels, the next brightness value is shared by the next batch of populations, etc. Given only 256 possible values, there has to be some grouping, but it's not saturating.
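The contrast between a linear mapping and a rank-order mapping can be sketched as follows. This is a simplified stand-in for the rank-order encoding described above, not Datashader's actual code, which may differ in detail. The function names are hypothetical; the sample counts reuse the 630 and 70826 figures quoted earlier in the thread plus a few made-up smaller values.

```python
def linear_levels(counts, levels=256):
    """Scale counts linearly so the maximum maps to the top level."""
    top = max(counts)
    return [round(c / top * (levels - 1)) for c in counts]


def rank_levels(counts, levels=256):
    """Assign brightness by rank among the distinct count values."""
    distinct = sorted(set(counts))
    rank = {value: i for i, value in enumerate(distinct)}
    span = max(len(distinct) - 1, 1)  # avoid dividing by zero
    return [round(rank[c] / span * (levels - 1)) for c in counts]


# 1, 10 and 100 are illustrative; 630 and 70826 come from the thread.
counts = [1, 10, 100, 630, 70826]
linear = linear_levels(counts)
ranked = rank_levels(counts)
print(linear)
print(ranked)
```

With the linear mapping, the three smallest counts all collapse to level 0 and 630 is barely above them, so only the hotspot is visible. The rank mapping spreads the five values across the full brightness range, at the cost of no longer showing ratios between them.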
Looks like a really cool project. One thing that I would be interested in seeing would be using Datashader as a dynamic visualisation library - for example, generative art projects. Probably not the main interest of data visualisation practitioners but hey, if you've got a sweet pipeline to render all those points, why not?
Yep -- at https://github.com/graphistry/pygraphistry, we started by making millions of nodes/edges interactive. If you use notebooks, you can sign up on our site and get going. The trick is we connect GPUs in the browser to GPUs in the cloud, and encapsulate it enough that you can stick to writing standard SQL/pandas/etc.
We've been curious about server-side static tile rendering for larger graphs, but it has been on the back burner. (We already connect to GPUs on the server, so not rocket science.) Currently, we're actively increasing how much can be ingested + computed on, such as for finding influencers, communities & rings, etc. However, visualizing that hasn't been an operational priority for our users. It's more useful to generate the communities, and then either inspect individual ones, or see how communities stitch together: you quickly run out of pixels otherwise due to too many edges. Likewise, we're building connectors to gigascale-petascale graph DBs: titan, janus, aws neptune, tigergraph, spark graphx, etc.
We still are interested, but more for when we start supporting geographic maps: you can see that is the primary use for datashader. Also, because data art is fun :)
Geographic maps aren't the primary use for datashader; those are just easy examples that people can appreciate without a lot of explanation. In practice we use it for any large datasets that we don't want to subsample before visualizing them.
I've used both Gephi and Cytoscape. I have about 2 million nodes with about 3 million edges, and it's hard to visualise at once. However, I've reduced the graphs to only sub networks with more than 5 connections; the remainder is about 600k, which is still painfully slow to visualise.
My team has had luck rendering an SVG from the graph and sending it to a browser. It works well for about 10k vertices and edges. Above that scale we use datashader, and we're investigating a potential to move to QGIS. We tried Gephi a few years ago and it had trouble at these scales.
This is basically visualization 101. 'Turning data into images' isn't exactly a new concept. Sometimes I think the next generation of programmers has gotten too good at thinking up fancy names for reinventing the wheel and hasn't spent enough time looking at what has been done in the past 50 years so they can build off of it.
Also there are links to 'visualizing big data' and 'visualizing billions of points'. These too have been problems solved a long time ago with a lot less resources. The book Advanced RenderMan is a good place to start, and of course, opengl uses basic z-buffer rendering.
Unless you need sorted transparency (and accumulating points for brightness definitely does not), then you can rasterize as much as you want in any order, without holding anything in memory except for the image buffer. It doesn't matter how many point/particles you have, you don't have to do anything to make it scale.
How about instead of starting with an insult (I can't believe you didn't already know this) you instead congratulate them on putting together a full working library, with pretty, easy-to-grasp examples then offer up some research links that they could use to further refine and improve their system. It's our job to teach people, you can't expect everyone to suddenly know everything.
To the Datashader team: I apologize for the above comment. Good job in building and launching a tool for others to use, and great choices for examples!
My problem isn't that it isn't novel or new, it is that it is presented as novel. Visualization is great (and the visualization here looks good), but saying things like 'datashader: turn data into images' when that is literally what rendering is, is a nonsense way to approach a valid topic.
Saying 'visualize big data and billions of points' when the buzzwords are just there to sugar coat an accumulation buffer, gets into a territory of reinventing the wheel but naming it 'the flattened infinite curvature hypersphere'.
So, before you get too self righteous at least realize that it is the delivery and lack of context and precedent, not the actual work that is the problem.
I suspect marketing Datashader as an "accumulation buffer" wouldn't have the same effect on its target audience (data visualisation developers) as presenting it simply as a way to "Turn data into images".
I'm also curious (as a fledgling graphics programmer) - what leads you to believe that Datashader uses an accumulation buffer internally? I would think that they use some magic to draw all the points in a single draw call using instanced rendering, but I am very naive :)
D3 and Bokeh and other web-based visualization tools, in general, plot HTML or CSS primitives to the browser. This approach works great for smaller datasets, but doesn't scale to millions/billions.
Datashader aggregates (accumulates) graphical representations of data into images, then provides a way to get those to the browser and work well with the other libraries. That high level description leaves out 95% of the critical practical details of visualization, which the creators of datashader handle.
Datashader's approach is a bit different from an accumulation buffer, though similar in principle. It's not 3D rendering, and has no need for a z ordering; instead it's essentially 2D histogramming. For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel. The key benefits over something like NumPy's histogram2d function are in how it is implemented and used -- highly optimized, and highly integrated with viz tools so that it can let you just interact with your data naturally as if you had infinite resolution. Try it and see!
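As a rough illustration of the "as if you had infinite resolution" idea, the sketch below re-bins the raw points for whatever viewport is requested, so zooming produces a fresh aggregate rather than an enlarged copy of the old image. The function name `count_grid` and the sample points are hypothetical, and this pure-Python loop is not how Datashader is implemented.

```python
def count_grid(points, width, height, x_range, y_range):
    """Count how many (x, y) points land in each cell of the viewport."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # outside the current viewport
        # Map data coordinates to a cell index, clamped against rounding.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


points = [(1, 1), (2, 1), (3, 3), (80, 90)]

# Full extent: the three nearby points share a single cell.
overview = count_grid(points, 2, 2, (0, 100), (0, 100))

# Zoomed in: the same raw points are re-binned and separate out.
zoomed = count_grid(points, 2, 2, (0, 4), (0, 4))

print(overview)
print(zoomed)
```

Because each view is computed from the data rather than from a previous image, the output is always a grid of the same small size no matter how many points feed into it.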
> For points, it simply takes each point, calculates which pixel it would land in, and aggregates per pixel, without ever storing all the data points per pixel.
That is literally what opengl does. If you mean a histogram per pixel in depth, that's literally voxels in perspective space.
If there are usability benefits here, that's great, but everything seems to be centered around there being new rendering techniques here, when not only are they not new, they're completely trivial, with solidified names and formalized math.
It's actually not merely an accumulation buffer. It's a shader pipeline that allows for arbitrary Python code to be executed at each stage of data processing. It's actually very much like "renderman for data", but with Python (via Numba, Dask for performance).
The pipeline is also built in such a way that it permits front-end JS viewers like Bokeh to drive a very dynamic experience.
It is novel in the sense that it combines the interactivity of a D3/Bokeh/whatever JS based visualization (typically limited to a few thousand points) with the massive data display capability of offline rendering.
“Turn data into images” is a much better phrase than “rendering.” Anyone who is not intentionally being pedantic for dramatic effect would understand the purpose of the product based on reading that phrase, whereas “Datashader: a renderer” is less clear and could refer to products with an entirely different scope/purpose.
Their product page is well-written and accurate. It sounds like you want them to purposely describe their product as something that is inferior to what it actually is.
If you want one word, Datashader is a rasterizer. It takes data of many types (points, lines, grids, meshes) and creates a regular grid where each grid cell's value is a well defined function of the incoming data. Not sure anyone would be any happier with "rasterizer" than "renderer" or "shader" or any other single word...
As a newcomer to the field, parent post is far more welcoming than the bevy of trash Ninja flavor-of-the-week bullshit that makes it seem impossible to catch up.
I only actually got my bearings and self-confidence as a programmer when I realized that most of the people pushing blogs with subscriptions about "cutting-edge" tech were literally snake-oil salesmen and shovel merchants.
That coding wasn't actually different from anything else I had learned in my life, and that there were some fundamentals I could latch onto, and grow from there upwards. All this nonsense about the field experiencing a revolution that upends all existing knowledge year-after-year is far more mentally taxing.
You're missing the point of this project. It's not about the feasibility of throwing a billion points at a pile of software, to get an image. I can do that with a simple Python script. It's about doing so to create a meaningful and accurate data visualization, and not just a picture of, say, shiny spheres or a scene from Avatar.
I actually have a background in 3D computer graphics, and it's precisely because of my detailed knowledge of raytracing, rasterization, OpenGL, BMRT, photon maps, computational radiometry, BRDFs, computational geometry, and statistical sampling, etc... that when I came to the field of data science & specifically the problem of visualizing large datasets, I realized the total lack of tooling in this space.
The field of information visualization lags behind general "computer-generated imagery" by decades. When I first presented my ideas around Abstract Rendering (which became Datashader) to my DARPA collaborators, even to famous visualization people like Bill Cleveland or Jeff Heer, it was clear that I was thinking about the problem in an entirely different way. I recall our DARPA PM asking Hanspeter Pfister how he would visualize a million points, and he said, "I wouldn't. I'd subsample, or aggregate the data."
Datashader eats a million points for breakfast.
Since you're clearly a computer graphics guy, the way to think about this problem is not one of naive rendering, but rather one of dynamically generating correct primitives & aesthetics at every image scale, so that the viewer has the most accurate understanding of what's actually in the dataset. So it's not just a particle cloud, nor is it nurbs with a normal & texture map; rather, it's a bunch of abstract values from which a data scientist may want to synthesize any combination of geometry and textures.
I chose the name "datashader" for a very specific and intentional reason: we are dynamically invoking a shader - usually a bunch of Python code for mathematical transformation - at every point, within a sampling volume (typically a square, but it doesn't have to be). One can imagine drawing a map of the rivers of the US, with the shading based on some function of all industrial plants in its watershed. Both the domain of integration and the function to evaluate are dynamic for each point in the view frustum.
There are a lot of red flags in the abstract of that paper alone.
> Rendering techniques are currently a major limiter since they tend to be built around central processing with all of the geometric data present.
This is completely untrue - OpenGL and virtually all real time rendering is done using z-buffer techniques that were originally used because they don't need all the geometry present. These techniques date back to the 70s and were some of the first hidden surface rendering algorithms.
> This paper presents Abstract Rendering (AR), a technique for eliminating the centralization requirement while preserving some forms of interactivity.
Interactivity might be novel here so that is what should really be focused on, if anything. I don't think coining a new term and acronym that don't seem to relate to what is happening is going to be a good choice to communicate the techniques.
> AR is based on the observation that pixels are fundamentally bins, and that rendering is essentially a binning process on a lattice of bins.
This observation was made in the early 80s and has been the backbone of renderman renderers for almost 40 years. Renderman calls them 'buckets'.
> This approach enables: (1) rendering on large datasets without requiring large amounts of working memory,
Renderman originally rendered film resolution images with high resolution textures with only 10MB of memory.
> (3) a direct means of distributing the rendering task across processes,
Giving different threads their own buckets is standard for any non-toy renderer. Distributing buckets across multiple computers is part of many toolsets.
> high-performance interaction techniques on large datasets
This is the only part that has a chance of being novel, but the paper only shows basic accumulation of density for adjacency matrices. The visualizations are timed in the multiple seconds but look extremely simple, and for some reason are rendered 'out-of-core' on a computer with 144GB of memory even though it seems very unclear that these images couldn't be made with z-buffer rendering in opengl.
> This is a great paper from Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization
It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
> It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
Actually, no. The paper may not have been explicitly clear about this, but the ENTIRE point of a "data visualization" system is to transform potentially high-dimensional datasets, with a large number of columns, into meaningful images by a series of steps. You seem to be interpreting this narrowly, and imagining that geometry is already pre-defined in the dataset, so then of course this looks like a fairly trivial 2D accumulator.
That is not the intent, nor is it the common use case.
For data visualization, the question of "how do I accurately aggregate or accumulate the 25 - 1million points in this bucket" is a deep one. There is NO data visualization system that programmatically gives access to this step of the viz pipeline to a data scientist or statistician. Most "infoviz" tools gloss over this problem - they do simple Z buffering, or cheesy automatic histograms of color/intensity, etc. These are almost always "wrong" and produce unintended hallucinators.
Your first comment - about "not needing all the geometry present" - indicates that you are not understanding the nature of the problem datashader was designed to solve. There is no simple "cull" function for data science; there is no simple "Z" axis on which to sort, smush, blend, etc. At best, your data points can be projected into some kind of Euclidean space on which you can implement a fast spatial subdivision or parallel aggregation algorithm. But once that's done, you're still left holding millions of partitions of billions of points or primitives, each with dozens of attributes.... what then?
I'm not sure why you would coin a term 'Abstract Rendering' and talk about 'out of core rendering' then turn around and say that transforming high dimensional data sets is part of rendering. Rendering is well defined and very established, coming up with transformations and calling that part of rendering is nonsense. You made this mess yourself by trying to stretch the truth.
It doesn't matter if something has been done before if the new way hooks into new systems. Tools do not exist in a vacuum. All tools are a part of a system of tools that should always be considered when evaluating any component piece.
I also was expecting something new, but in their defense they've made a very appealing version of something old. I'm sure there are a lot of people out there who haven't thought about saturation with "large" data sets before.