Serious question: what on earth have supercomputers got to do with this?
My impression is supercomputers exist mainly for incredibly intensive large-scale simulation/calculation that can't be subdivided into parts -- e.g. weather or nuclear explosion simulation.
3D map processing feels like the literal opposite of that -- trivially parallelizable, with a much higher ratio of data/IO to calculation.
Are they running out of tasks for Blue Waters to do, or trying to find a high-profile project to justify it politically or something? I really can't imagine for the life of me why you wouldn't just run this in any enterprise-grade cloud.
I used what was then a top-10 system on the Top 500 when I worked at a national laboratory in the early 2000s. An embarrassing number of jobs in its job queue would have run well on much smaller clusters with less expensive hardware. Only once in a while would we run a single job that used more than half of the entire system and achieved decent scaling.
I suspect that the mismatch is worse nowadays. Although software and interconnects have improved, core counts and node counts have gone up even faster.
IMO simulation-guided research would probably have gone faster at the lab if the money for the top-10 system had been spent on a bunch of smaller clusters with less exotic hardware, divvied up according to the actual lines of research scientists were pursuing. But there's prestige and budgetary room for a new Grand Challenge system that may not be there for a bunch of more affordable, less exotic systems. And once in a while somebody does have a job that only runs well on the big machine.
This is also why I don't much worry about China building systems that rank higher on the Top 500 than American systems. Until Chinese research groups start churning out Gordon Bell Prize-winning software to go with the giant systems, they're probably just misallocating even more money than American labs.
EDIT: well that was arrogant and foolish of me to dismiss Chinese HPC. I looked up recent Gordon Bell Prize winners and Chinese researchers won in 2016 and 2017. It looks like they're making good progress in using those really big systems.
Creating a 3D model out of 2D images requires computer vision to extract objects from the images and estimate their dimensions (including elevation). That would most likely mean implementing an end-to-end deep learning model that needs training, validation, and testing. Given the amount of data it'll have to deal with (hundreds of thousands to millions of images), it'll need to load (high-dimensional?) images in batches for processing. This can arguably still be done on AWS or Azure (or...) with TensorFlow and HPC, but two things here: HPC brings a bit more overhead to the table, and a supercomputer could do better, since none of the current cloud service providers have supercomputers that can compete, especially in terms of CPU performance.
There's no reason it needs a DL model. There's a lot of software that calculates tie points and creates point clouds from pictures, which is almost certainly what they're going to do here. DL to go from orthoimages to point clouds, if it's a thing at all, is probably still in the feasibility stage.
The steps are all fairly easily parallelizable until you get to a final large-scale nonlinear least-squares refinement step, and even then there are tricks to make the decomposition tractable. It usually involves just single images or pairs of images, with no need for communication between processes until the last bit.
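The two-stage structure described above (independent per-pair work, then one coupled solve) can be sketched with a toy problem. This is not the actual EarthDEM pipeline, just an illustration of the shape of it: each "image" gets an unknown 1D offset, pairwise "tie point" measurements are computed independently, and a single linear least-squares solve ties everything together at the end. All names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_offsets = np.array([0.0, 1.0, 2.5, 4.0])  # unknown per-image offsets

# Stage 1: independent pairwise measurements m_ij ~ x_j - x_i.
# Each of these could run on a separate worker with no communication.
pairs = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
measurements = [true_offsets[j] - true_offsets[i] + rng.normal(0, 0.01)
                for i, j in pairs]

# Stage 2: one global least-squares solve couples all images
# (the "refinement" step). Fix image 0 at offset 0 to remove the
# gauge freedom, i.e. the fact that a constant shift is unobservable.
A = np.zeros((len(pairs) + 1, len(true_offsets)))
b = np.zeros(len(pairs) + 1)
for row, ((i, j), m) in enumerate(zip(pairs, measurements)):
    A[row, i], A[row, j], b[row] = -1.0, 1.0, m
A[-1, 0] = 1.0  # anchor: x_0 = 0

est, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(est, 2))
```

In the real thing the refinement is a large sparse nonlinear problem (bundle adjustment), but the communication pattern is the same: nothing until the final joint solve.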
If you look at georeference systems, they may fit a parametric equation with coefficients to a large set of Earth data. The geoid in the link below is a refinement over an ellipsoid (think lumpy potato vs. smooth pebble) that gives you a 3D fit with some level of accuracy from a very compact equation, compared to carrying around the raw data. I'm thinking a supercomputer might be pretty useful for redoing the fit as the data is updated, as well as for trying equations with different costs-to-fit on large data sets, or providing better fits at finer granularity.
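A tiny synthetic version of that idea: fit a handful of coefficients to scattered height samples so that a compact equation stands in for the raw data. Real geoid models use spherical harmonics with thousands of coefficients; here a small polynomial basis in (lat, lon) is a stand-in, and all the data is fabricated.

```python
import numpy as np

rng = np.random.default_rng(1)
lat = rng.uniform(-90, 90, 500)
lon = rng.uniform(-180, 180, 500)

# Synthetic "lumpy" heights: a smooth bump plus measurement noise.
h = (10 * np.sin(np.radians(lat)) ** 2
     + 0.05 * np.cos(np.radians(lon))
     + rng.normal(0, 0.1, lat.size))

# Design matrix for a small basis: five coefficients end up
# summarizing 500 raw samples.
s, c = np.sin(np.radians(lat)), np.cos(np.radians(lon))
A = np.column_stack([np.ones_like(s), s, s ** 2, c, s * c])
coeffs, *_ = np.linalg.lstsq(A, h, rcond=None)

residual = h - A @ coeffs
print(f"max residual: {np.abs(residual).max():.2f} "
      f"(vs height range {np.ptp(h):.1f})")
```

The appeal is exactly what the comment says: once fitted, you carry around the coefficients, not the point cloud, and refitting as new data arrives is where the big compute budget would go.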
Good question! This does seem like an embarrassingly parallel problem. Whenever I've used the big HPC centers, the secret sauce has been a fast low latency network interconnect. The fast interconnect is useful for PDE solvers which need a lot of processor-to-processor communication (i.e. for sending data back/forth periodically along the boundaries of the grid cells you're solving for).
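A back-of-the-envelope calculation shows why those boundary exchanges stress the interconnect: each subdomain exchanges only its boundary ("halo") cells, but it must do so every timestep, so latency dominates when subdomains are small. The subdomain sizes below are purely illustrative.

```python
# For a cubic subdomain of n cells per side, compare the work done
# locally (all n^3 cells) against the data exchanged with neighbors
# (the six one-cell-deep faces). The ratio grows only linearly in n,
# so small subdomains spend proportionally more time communicating.
for n in (16, 64, 256):
    interior = n ** 3      # cells computed locally each step
    halo = 6 * n ** 2      # boundary cells sent to neighbors each step
    print(f"n={n:4d}  compute/communicate ratio ~ {interior / halo:6.1f}")
```

An embarrassingly parallel photogrammetry job has no equivalent of the halo exchange, which is why the fast interconnect buys it little.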
Supercomputers are computers with tens of thousands of cores of consumer-like CPUs. So it is the opposite of your impression: they only pay off on tremendously parallelizable tasks. I don't know the exact details of weather or nuclear explosion simulation, but they would have to be parallelized to run on HPC systems. Even if a computation is not parallelizable, scientists can still leverage supercomputers by running the simulation with randomized parameters on every node and reaching a consensus result.
There is a difference between parallelizable and embarrassingly parallel. The former means that you can get better performance by dividing the work among different processors, while the latter usually implies that the work can be divided into independent units that don’t need to communicate with each other.
A supercomputer typically means that those thousands of cores are connected with fast and expensive interconnects so that the cores can communicate with low latency. A large portion of the budget is usually spent on this interconnect. If you have an embarrassingly parallel problem and you run it on a supercomputer then that expensive interconnect is sitting idle - you would get the same performance on AWS or a more standard compute cluster.
Well, your impression is just wrong. Simple as that. No shame in it.
Today, what is called a supercomputer is usually just a cluster (i.e., multiple connected normal-spec computers). It is normally connected with a high-speed interconnect, though (100 Gbit/s and more), which is its most defining capability.
Why are they using this cluster? My speculation: probably because it is available and no longer has much use for real scientific computing (because it is old: https://bluewaters.ncsa.illinois.edu/hardware-summary), and the intelligence agency prefers to support academia rather than feed some commercial entity.
I don't think it's anywhere near "meaningless". The fact that a team of software engineers led by someone extremely experienced in HPC (since the early '90s, multiple awards won, etc.) can do this does not mean it is easy for a typical university or engineering company to do the same.
Will HPC migrate towards the cloud? Maybe yes, but we need several major overhauls to tooling before that is anywhere close to happening.
Just think about how much work it would be today to configure a Packer image that has several MPI libraries, a scheduler like SLURM, various Python versions + required packages, C and Fortran compilers, BLAS/LAPACK/etc, VCS systems, integration with some sort of user authentication system including support for SSH login to each node as well as linked for each user to the accounting in the scheduling system, and to have confidence that it will be highly performant for the application you work with on the AWS allocation that you have requested. Not many people could pull that off in a reasonable amount of time, if at all.
> calculation that can't be subdivided into parts -- e.g. weather
I'm not sure how weather simulations work, but I've always wondered why they aren't performed cellular-automaton style, bottom-up rather than top-down. At each time step, a given cell's state is computed from the state of its neighbors at t-1. This should be parallelizable. I feel it's also how it works in the real world anyway.
That is exactly what weather simulations do. Same for most other kinds of hydrodynamic simulations (simulations of gas or fluid flows).
The thing is: you need the state at t-1 of all your neighbors. Then you can do a small timestep to get from t-1 to time t. And then you need the NEW state of your neighbors. That requires a fast interconnect. Which HPC machines have, unlike most clouds or commodity clusters.
In other words, yes it is parallelizable, but not trivially so, because the different grid points are coupled.
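A serial toy version of the update described above makes the coupling concrete: each cell's new value depends on its neighbors at the previous timestep, so a distributed version would have to exchange boundary values before every single step. This uses a simple 1D heat-diffusion stencil, purely for illustration.

```python
import numpy as np

def step(u, alpha=0.25):
    """One timestep: each interior cell averages with its neighbors
    at t-1. Boundary cells are held fixed at zero."""
    new = u.copy()
    new[1:-1] = u[1:-1] + alpha * (u[:-2] - 2 * u[1:-1] + u[2:])
    return new

u = np.zeros(11)
u[5] = 1.0              # a single hot cell in the middle
for _ in range(10):
    u = step(u)          # in a distributed run, halo values would be
                         # exchanged with neighbor processes right here
print(np.round(u, 3))
```

Splitting the array across processes is easy; the catch is that the cells at each split line need the other side's latest values every iteration, which is exactly the low-latency communication HPC interconnects exist for.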
I’ve seen that Azure has some nodes with InfiniBand connections (the same interconnect often used in supercomputers).
I did my PhD in physics simulations (molecular dynamics), and had the same problem there. I tried running these simulations on Google Cloud without any good performance results, due to high latency (compared to HPC). I’m no GCP expert though, so it should be possible to improve on what I did.
There are already multiple global DEM data sources. It would be good to understand how this is better and what the planned accuracy is. It's a very light article.
Their referenced ArcticDEM gives 2m resolution; however, on accuracy it notes: 'Without ground control points absolute accuracy is approximately 4 meters in horizontal and vertical planes'.
This is much better than most global data (SRTM and ASTER are both at 30m resolution).
However it is not as high resolution as many existing free models for individual parts of the globe. As an example here in Australia I can get free 1m resolution DEMs of cities with accuracy noted at "0.3m (95% Confidence Interval) vertical and 0.8m (95% Confidence Interval) horizontal".
Seems like EarthDEM is the same. So all the advantages and disadvantages that come with stereo photogrammetry. Very high resolution, but potential for inaccuracy and quite difficult to check. Probably they used SRTM or something similar to validate their models.
>IceSAT altimetry data points are used to improve the vertical accuracy of both the DEM strips and mosaic files. IceSAT data points are filtered to exclude points in areas of high relief and over hydrographic features. Additional filtering is applied to remove altimetry points collected outside the temporal window of the source imagery acquisition date.
>An xyz translation is calculated for each strip and the offset is added to the metadata file. The individual DEM strips are not translated before distribution. Users can apply their own corrections to the strip if they do not agree with the one originally provided.
>Where available, additional control information such as LiDAR or surveyed GPS points have been applied.
They've changed it quite a bit since I last used it for my CS senior project a few years ago, but I figured out you can see DEM segments for download by clicking the icon with the popup name "Layer List", then "3DEP Elevation - Index", then "DEM Product Index". You will need a GIS program like QGIS or ArcGIS to view the raw DEM data.
The article misses the point a bit. There are 3D maps of the world; or rather, you can make your own in a weekend. Topographic data is out there for most things; the problem is just that it's very inconsistent, and of course resolution. And I guess for the whole world you do need a decent computer.
To prove my point, here is some art I made from open data (I think NASA) with Blender and QGIS, meaning a free software stack on very much a normal computer. My model was the German state of Schleswig-Holstein, but as said, you can find data for pretty much everything. The resolution is not astonishing, but it's enough to spot Germany's only high-sea island; the "Wattenmeer", where the tide causes some spots to lie below water level on average; the mouth of the Elbe; and more cool stuff when you know where to look.
My point isn't to downplay the effort of EarthDEM, I just want to make more people aware of what is already very much out there :)
> And I guess for the whole world you do need a decent computer.
Nope, resolution is the problem, not scale. You need low-altitude stereo photos, like from an airplane, without clouds. With satellites you only get 30m; with airplanes you get under 1m. Cities usually rent a plane once a year to update their maps. Countries usually don't; they just rely on the cheap satellite photos.
I wish every game depicting the real world contributed to a common 3D world repository. And maybe allow crowd-sourcing that repository too. Imagine what we would get to play in flight simulators or driving simulators with such a repository.
If you want to drive real circuits or fly across actual countries then real-world data would be awesome, but for anything else it probably wouldn't be. Compressed game-world environments (e.g. mountains next to open grasslands next to coastline next to forests within a few miles of each other) are designed that way because that variety contributes to the game being more fun. You don't get that in the real world.
Depends on the country. There are places with "mountains next to open grass lands next to coastline next to forests within a few miles of each other" (where "few" means, e.g., 100 miles). The world is not all like Utah's US-50 or whatever huge expanse.
But even so,
(1) there's something exciting on its own for driving in real-mapped as opposed to some fake designed terrain
(2) for certain games, it can be the main point: e.g. I can imagine a "Route 66 hot rod race" or a "drive the Monaco Rally" (and of course things like MS Flight Simulator, combat games, etc.)
(3) There's absolutely no reason why a game couldn't use real-world data and cherry-pick different terrains from it for variety...
But you could take a mountainous area, stretch the height difference and plonk your favourite building/village/city block in the middle of it, then crop the whole thing and stick it in a lake, you know, or whatever - seems like fun.
And imagine when some internet forum decides to put a phallic object on top of some significant building, and everyone goes and votes it as legitimate. I'm not claiming it's impossible, but there will certainly be obstacles (see Wikipedia). This has to be one data-set we get right; with self-driving cars coming, it could cause injury or death if we don't.
>This has to be one data-set we get right; with self-driving cars coming, it could cause injury or death if we don't.
Self-driving cars can't (and don't) rely on maps to prevent injury or death. They only use them to know how to get from A to B in the most general way -- and they'd still need to check for bypasses, closed roads, etc. (on top of the real-time processing of obstacles, vehicles, lanes, traffic lights, weather, and so on).
Right, but Wikipedia still has lots of issues with vandalism. They do a good job, but it's just not realistic to keep something perfect at that scale. Here's the difference: Wikipedia gets vandalized and people laugh. Mapping data gets vandalized and people could die.
I hope so too. There should be a service that allows game developers to just import the real world. On top of that, they could customize specific locations according to the story or gameplay. Especially valuable would be an algorithm that automatically creates interiors (ideally inferred from the shape of the building).
Supercomputers are exactly the wrong sort of tool to use for this: nearly every supercomputer has crappy disk IO and tons of fast CPU, RAM, and network. Using a supercomputer for this would leave the expensive elements like the GPUs nearly idle, and the IO subsystem would be the bottleneck.
I'm confused by this comment, especially since you seem to work with supercomputers. One of the biggest challenges in ECP (the Exascale Computing Project) is disk IO.
So much so that they've been inventing complicated heterogeneous architectures. In fact, many teams are just skipping disk altogether and using in situ methods, only saving results. I would think that would be needed here.
But the question is, what heavy-processing computer doesn't have IO issues? Also, Blue Waters isn't really a GPU supercomputer like Summit and Sierra are (or the upcoming Aurora and Frontier). It has 4,228 nodes (out of ~27,000) with GPUs, each node has only one, and they're Keplers. Those aren't great GPUs and aren't going to do very well in parallel either; there's a big bottleneck in GPU IO. I think this program will not be utilizing GPUs very heavily. Worse, they don't have many CPU cores per node: 8-16 cores and 32-64 GB of memory per node. There's going to be a lot of time spent in communication.
I'll admit that BW doesn't seem like the best computer for the job, but you use what you've got. I'll buy the argument that this is the wrong computer for this specific job, but what would you use besides a supercomputer? (I think Summit would be a good computer for this job.)
I'm a supercomputer expert - that's part of my job - who works on data processing full-time. I know these codes, I know the hardware architecture, and my statement above is technically correct. I'm not missing anything.
I haven't read the details of the implementation, but I used to be a developer on Google Earth, and you are exactly right about disk IO being the main bottleneck for building large terrain datasets for 3D globes.
You may find especially interesting an article I published about an embarrassingly parallel computing system I built, which ran on Google's internal infrastructure (not a supercomputer), in response to my codes not running well on supercomputers.
If you wish to have a substantive discussion about my statements (rather than flinging insults), I'd be happy to.
There's not really a huge difference between supercomputers and a large number of ordinary computers these days. The computational hardware is basically the same. Only the communication interconnects are different.
I strongly suspect that they don't need a supercomputer, but they had one available and it sounds cool. I sit one building over from the ~13th largest supercomputer in the world and I've definitely used it when I needed a bit more computing power than my laptop could provide. I used the supercomputer because I could, not because I needed to.
Did you have an embarrassingly parallel code (like, "map"), or a code that was ported to supercomputers?
I've worked on supercomputers for years and generally, one could not run code at scale unless it scaled- in terms of parallel performance- and also used the network in a non-trivial way. Supercomputers aren't just speedups for normal codes (I'm pretty sure you know this).
The real issue with modern supercomputers is that basically none of them have decent disk IO. That's what differs between modern Internet/cloud clusters and supercomputers: cloud clusters emphasize very high connectivity between durable storage and worker nodes. None of the supercomputers have a decent disk IO stack (mostly GPFS and Lustre), and this ends up being no end of pain for application developers. The only recent improvement in this area is "burst buffers", but that's really just accelerated data staging.
Seems like they are using photogrammetry, most likely with a large number of high-resolution images. Photogrammetry under these conditions can be fairly computationally expensive, especially if they aim for high-quality reconstructions (which they probably should, given this is a one-time or few-time cost). Of course, it depends on what algorithms they're using, of which I have no idea, but there are plenty of fairly fancy methods, many involving deep learning in the pipeline.
I am wondering what resolution they work with, because the Earth's crust moves continuously, and sometimes suddenly, like 6 feet at once: https://www.latimes.com/california/story/2019-07-22/ridgecre...
I like the idea to survey the whole planet, I would be curious to see what resolution they end up with, and how often it is updated.
Google Earth uses older publicly available NASA data as a base and then improves it with better, often commercial, data in some interesting places. One problem with this approach is that the quality and accuracy differ wildly from place to place.
For any place in at least the western world there almost certainly exists better (and sometimes open) data than this project will provide. The big win for this project isn't that the data will always be the best available, but that it will be a single source of open data of consistent quality and format for the entire earth.
Google Earth was not strictly a private effort. Back in the early 2000s, Keyhole (then the company that owned the software that would become Google Earth after the acquisition) was bankrolled by In-Q-Tel, a venture capital firm run by the CIA. I'm sure Google Earth in particular has had other relationships with the government as well.