I recall many years back a website, I think a Microsoft project, that linked together photos in a 3D space of tourist destinations. It created something of a point cloud, but nothing this advanced. You could click through the points/photos to jump into each photo's perspective of the space.
Anyone remember what that project was called, or if it is even still around?
After watching that video:
Why did no one build the application that tied your personal photos with a global database of all the other public photos taken in the same place?
Seems like it would have been an amazing application.
I had this vision that one day we'll be able to reconstruct memories from our past by taking old photos and having a ML model collate everything together to form a 3D rendering of that point in time. It seems like you have gotten most of the way there.
The next step would be to have the user grab a VR headset and immerse themselves in their favorite childhood moment. One could even add avatars for loved ones using again ML-generated audio based on recordings of their voices.
Your project made me think that it wouldn't be that interesting for me to view your memories, so perhaps the best initial step for a proof-of-concept that would allow the technology to mature would be to recreate historical moments so everyone could relive them – and they could do so entirely virtually, from the comfort of their own couch. Side note: it feels like this technology can disrupt traditional museums with the added bonus of being pandemic-proof.
Anyway, I don't really have a question... Just wanted to compliment you on this amazing work and throw this idea out there in case others want to think about it, as I'm in an entirely different field and don't have the skills and resources to make it real, and I do strongly feel this will inevitably come to life.
That's a really cool idea! This technology does a fantastic job at reconstructing static scenes. The moving objects -- people, cars, even flora -- are out of scope here. Why? It's really hard to build a 3D model of something you only see from one direction.
I was amazed at the scene at the time, and thought it was unbelievable. But then I read they had NSA advisors, and maybe the US govt had access to some sort of primitive photogrammetry at the time?
But if we know what a car ought to look like in 3D, can't we take the one photo we have from one direction and just fill in the blanks with that a priori 3D knowledge?
I pitched a similar idea in an interview years ago (http://www.wearegamedevs.com/2016/01/20/scott-anderson-rende...) with the added complexity of forward simulating past events with different choices. I was asked what I would do with infinite time and money though. To this day my elderly parents tell me to quit my job and work on this idea :-D.
I do believe that past time travel to reconstructed and recorded events will be one of the stickiest use cases for VR.
Can this technique reconstruct good geometry for the visible parts if only part of the structure is ever imaged?
That's a precursor to: could this technique be used to enhance Street View? There are times when I would really like to be able to "walk around" outdoor scenes in finer steps. Current Street View smears between photos taken some distance apart. (I don't know if that is a limitation of the public interface or if the original data capture is really that coarse.) It would be nice to have a real 3D space to explore, but I certainly don't expect the un-imaged parts to be defined correctly.
Finally, does this also work for reconstructing interior spaces seen from the inside? Like the geometry of a cave, from pictures of the cave interior?
At this point, this method is only good at reconstructing parts of the scene that are well-photographed. You'll notice that our video for Sacre Coeur has some blurry bits, particularly the staircase in front of the Basilica. That's because we learn to reconstruct what was seen, but aren't yet able to imagine what wasn't!
Does this work in reconstructing indoor spaces? Give it a shot and find out!
Have you released the code? (Or did you mean I could try re-implementing the work you published from the paper? That's a reasonable response too.) I didn't see a link to source in the github.io page or in the arXiv paper. The only source code link I saw was to https://github.com/bmild/nerf which I thought was earlier work than this paper.
Our code is not released yet. If your data is captured without occluders and in RAW format, you don't need the enhancements we propose :). NeRF can do amazing things with clean data!
I just saw the NeRF demo video, it's amazing. If its results are this good, why is photogrammetry software not basically perfect yet? It looks like they can generate models with vast amounts of detail.
Not the author (but I read the previous papers and the code), the simple answer is that it's still very costly in processing power. Think a few hours on good hardware for a set of pictures.
Note that the results are still very impressive imo, this is still early research phase.
Great work! I’m an ex googler currently working on a farming robot we hope to make open source. I’m particularly interested in neural reconstruction of plants in a field. I want to capture the 3D structure of the plants as well as semantics like plant species. I’ve found that normal photogrammetry produces poor reconstructions due to movement of the plants in the wind.
We’re potentially about to start a non profit and formally kick off our whole robot as open source. I’m interested in finding research partners who would like to help produce a research paper on 3D reconstruction of plants. I can produce a high quality geo located dataset with 2cm accurate GPS tags, but I have no experience with neural rendering. This is work I want to do over the course of the next year.
Do you know anyone interested in helping with that kind of work? Thanks!
Indeed. I am seeing some generative approaches that know what the object should look like in 3D and use that knowledge to imagine a model that matches a photo. I think such a technique would be useful for good approximation of plant models. Such a project would require some new datasets I would think, but seems like a good approach.
Looks awesome! I'm not an ML guy and haven't read the paper, just watched the video - one thing isn't clear to me from it: is this fully automatic/unattended, you just throw images into it and out come magic rainbows of 3d structures? or do you need to somehow help it, e.g. to disentangle the structure from the "transient" elements? In other words, I don't really understand what the "Appearance Embedding" even means... Or is the "input" that you mention in the video fed into a model that is already trained on a set of photos of a particular scene? I.e. the "input" + "appearance embedding" basically encodes just a choice of a framing & "atmosphere/lighting"?
It's a little hard to describe from scratch, but let me do my best.
The method is unattended, in the sense that it's photos + camera parameters in and scene representation out. The photos should all be of the same scene (e.g. the Trevi Fountain). Once you have a scene representation, you can ask what the scene would look like from new camera angles with your choice of lighting.
Choosing camera angles is straightforward. You tell me where and what direction the camera is facing. The question then becomes, how do you specify your choice of lighting? The answer is, you can't do so directly. Instead, you provide a picture with the lighting you want, and with a little magic, we can find a way to imitate that lighting. The way we do so is by finding a corresponding "appearance embedding" via numerical optimization.
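To give a feel for "finding an embedding via numerical optimization", here is a toy sketch. It is not the paper's method: the real embedding is a learned latent vector fed through a neural network, whereas here a made-up two-number embedding (a gain and an offset) and a hypothetical `render` function stand in for it. The only point is that the scene stays frozen while gradient descent adjusts the embedding until the rendering matches the reference photo.

```python
def render(base_pixels, embedding):
    """Hypothetical frozen renderer: the embedding acts as a gain and an offset."""
    gain, offset = embedding
    return [gain * p + offset for p in base_pixels]


def mse(a, b):
    """Mean squared error between two equally long pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fit_embedding(base_pixels, target_pixels, steps=2000, lr=0.1, eps=1e-5):
    """Gradient descent on the embedding only; the scene itself stays fixed."""
    embedding = [1.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(len(embedding)):
            # Central finite difference for the i-th embedding component.
            bumped = list(embedding)
            bumped[i] += eps
            loss_hi = mse(render(base_pixels, bumped), target_pixels)
            bumped[i] -= 2 * eps
            loss_lo = mse(render(base_pixels, bumped), target_pixels)
            grad.append((loss_hi - loss_lo) / (2 * eps))
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


scene = [0.1, 0.4, 0.7, 0.9]                      # fixed scene content
evening_photo = [0.5 * p + 0.2 for p in scene]    # same scene, different lighting
embedding = fit_embedding(scene, evening_photo)
print(embedding)
```

The recovered embedding can then be reused to render the same "lighting" from any other camera angle, which is the workflow the comment describes.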
What is the precision required (or used in your datasets) for camera position and angles? Is the geotagging in the images from common cellphones and smart cameras enough? Were they back-calculated using some other method from non- or poorly-georeferenced images?
Why did you use neural networks? There are faster techniques in analytical geometry that can extract surface contours from color gradients from images, and they do this faster and directly.
1) My bread and butter for the last 10 years has been machine learning. When all you have is a hammer...
2) We don't extract surface contours, we learn a volumetric radiance field! To oversimplify, we learn a (smooth) function that, given a position in space, produces the differential opacity and color at that space. To render an image from a camera viewpoint, we approximately integrate along rays emitted from each pixel of the camera.
Check out NeRF and our paper to learn more about this representation!
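For readers who want to see what "approximately integrate along rays" can look like, here is a minimal sketch of the usual front-to-back compositing of samples along one ray. It assumes the densities and colors have already been produced by some model; the function name `render_ray` and the sample values are made up for illustration, and this is not the authors' code.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: non-negative opacity density at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    length of the ray segment each sample covers
    Returns the pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Empty space, then a dense red surface, then something green hidden behind it.
pixel, weights = render_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.5, 0.5, 0.5],
)
print(pixel)
```

In this example the pixel comes out essentially pure red: the empty sample contributes nothing and the green sample is almost fully occluded.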
Neural networks are better compared to classical methods.
One of the best non-classical methods is this one (https://grail.cs.washington.edu/projects/sq_rome_g1/), and our method significantly improves upon it. We do not compare directly with it, but Neural Rerendering in the Wild does, and we improve upon it.
this looks really cool.
I'm not an ML chap, but always wondered: Can these kinds of algorithms also give you dimensional data? For example, can I 3D-print one of these models with any accuracy?
For that, you'll need to convert the representation we have (volumetric radiance field) to one your 3D printer can understand (a mesh?). The NeRF authors use the marching cubes algorithm to do just that. Check out their website: https://www.matthewtancik.com/nerf.
how many pictures or angles are needed to produce good results? I get that landmarks have an abundance of source material, but what's a reasonable amount of data to reconstruct scenes?
On the order of hundreds to low-digit thousands worked well for us. These photos contain a lot of occluders like tourists, and we needed to have enough views of the subject in question to build a good 3D scene representation.
can you elaborate on the key variables for the data? for instance, is it safe to assume 360 photos from the same angle would yield a worse model than 1 photo from 360 different angles?
what does the ideal minimal data set look like (eg, 5 photos from each 15-degree offset)?
NeRF's (and all of photogrammetry's) bread and butter is 3D consistency -- that is, seeing the same object from multiple angles. A 360 degree photo from a fixed position just won't do. As to how to select the best camera angles...I'm not sure. I believe there is research in this area for classical photogrammetry techniques, but I'm not familiar enough to point you to a body of work.
The model does not explicitly learn to segment images. The answer is unfortunately more difficult to explain than a HN comment bears. I encourage you to read the paper for more details.
Is the model able to capture the underlying geometry? E.g., if I have a pillar, part of which was not visible at any training point, is it able to reconstruct that part?
The model is trained to reconstruct what is observed, but not what is obscured. If you look closely at our videos, you'll notice some parts of the scene are blurry -- those parts weren't seen often enough to learn well. If you look at parts of the scene not observed at all, I'm not sure what you'd find.
awww! Figured dolly shots and steady cam shots would fit perfect into something like that. Esp 24 frames per second and usually known locations. Course it would probably drag a lot of the net into being biased towards that time spot I guess?
It definitely can work, and even has some additional benefits (1), but requires special considerations. You can deblur using global motion vectors (2), or additional hardware like accelerometer reading embedded in the video feed (3).
1) can't find the paper now, but by exploiting predictable rolling shutter you get additional temporal resolution
This looks amazing! Congratulations. What was the most challenging aspect of this? I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly. For some reason, the paper isn't loading, so feel free to say it's explained in detail there.
Wow, that's hard to say! Our work truly stands on the shoulders of giants (Mildenhall et al, 2020). I can list off a few challenges:
figuring out if an idea "kinda works" or "definitely works" or has a bug,
figuring out how to measure progress,
coordinating a group of 6 researchers living 9 hours apart,
and assembling everything together for a simultaneous paper-website-video release!
> I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly.
We don't! All of this comes by the magic of machine learning :). We train our model to attenuate the importance of "difficult" pixels that aren't 3D-consistent in the training set. We also partition the scene into "static" and "transient" volumetric radiance fields without explicit supervision. We do so by regularizing the latter to be empty unless necessary, and providing it with access to a learned, image-specific latent embedding. We discard the transient radiance field when rendering these videos, thus removing tourists and other moving objects.
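As a rough illustration of those two ideas (down-weighting "difficult" pixels and pushing the transient field towards empty), here is a sketch of an uncertainty-weighted per-pixel loss. It is an assumption-laden toy, not the paper's exact objective: the function name `pixel_loss`, the weight `lam`, and all numbers are hypothetical, and the paper should be consulted for the real formulation.

```python
import math


def pixel_loss(observed, rendered, beta, transient_densities, lam=0.01):
    """Loss for one ray/pixel.

    observed, rendered:   (r, g, b) colors
    beta:                 predicted uncertainty for this pixel (> 0)
    transient_densities:  transient-field densities sampled along the ray
    lam:                  hypothetical weight of the sparsity penalty
    """
    squared_error = sum((o - r) ** 2 for o, r in zip(observed, rendered))
    # A large beta shrinks the color term, so "difficult" pixels count for less...
    data_term = squared_error / (2.0 * beta ** 2)
    # ...but the log term stops the model from calling every pixel difficult.
    uncertainty_penalty = math.log(beta ** 2) / 2.0
    # Encourage the transient field to stay empty unless it is needed.
    sparsity = lam * sum(transient_densities) / len(transient_densities)
    return data_term + uncertainty_penalty + sparsity


# A pixel where a tourist blocks the view, so the static scene cannot explain it.
tourist = dict(observed=(0.9, 0.1, 0.1), rendered=(0.2, 0.2, 0.2))
confident = pixel_loss(beta=0.1, transient_densities=[0.0, 0.0], **tourist)
uncertain = pixel_loss(beta=1.0, transient_densities=[0.0, 0.0], **tourist)
print(confident, uncertain)
```

With these numbers the loss is much lower when the model admits uncertainty about the occluded pixel, which is how such pixels end up mattering less during training.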
> For some reason, the paper isn't loading, so feel free to say it's explained in detail there.
Well that won't do. Download it here: https://arxiv.org/abs/2008.02268. It's 40 MB, so your download bar may indicate it's almost done, but it actually has a good bit left to go.
> Do you plan on releasing code?
I hope so! As with most code, what ran on our machines may not run on yours. Migrating the code to open source will be a big effort. I hope what we describe in the paper is sufficient to build something like you see here.
You are assuming there that all their code is non-proprietary, doesn't belong to someone else, already has an open licence, and is of good enough quality to be shared with the public.
GPT-3 has some pretty interesting demos which are unfortunately rather disappointing once outside a carefully crafted environment. Said otherwise, a paper is nothing if it cannot be reproduced.
Not too familiar with this area. IIUC, training a model would require photos with corresponding information about where and at what angle the photo is taken.
Is this information just available from the dataset?
It is provided by the dataset we use, but given a new dataset, you can use off-the-shelf tools to obtain it yourself! Check out COLMAP, it's super duper cool: https://colmap.github.io/.
so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?
There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.
Why does the network need a direction? Why can't we get a density (opacity) and a color given a position?
Answered most of your other questions below in another comment.
> and what are z(t) r(t) in equation 5,6?
r(t) is a position in 3D space along a camera ray of the form, r(t) = origin + t * direction.
z(t) is the output of our first MLP. Think of it as a 256-dimensional vector of uninterpretable numbers that represent the input position r(t) in a useful way.
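The ray formula above is easy to try out. This is a small sketch that evaluates r(t) = origin + t * direction at evenly spaced depths; the helper names, the near/far bounds and the sample count are illustrative assumptions, not values from the paper.

```python
def ray_point(origin, direction, t):
    """Position r(t) = origin + t * direction."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def sample_ray(origin, direction, near, far, n_samples):
    """Evenly spaced points along the ray between depths near and far.

    Requires n_samples >= 2 so that both endpoints are included.
    """
    step = (far - near) / (n_samples - 1)
    return [ray_point(origin, direction, near + i * step) for i in range(n_samples)]


# A camera at the world origin looking down the +z axis.
points = sample_ray(origin=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 1.0),
                    near=2.0, far=6.0, n_samples=5)
print(points)
```

Each of these positions is what gets fed to the network; z(t) would then be the feature vector the first MLP returns for that position.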
The former. Once the scene is synthesized, I figure that is where any dynamic output would occur. Although that raises an interesting thought of using the NeRF modeling to paint out certain things in potentially live video.
The work we build off of, NeRF, does. While there's nothing preventing NeRF-W from also representing reflections, we find it captures a more matte picture of the object.
Those are some very cool 3D visualizations generated, but it's a bit difficult to understand what the form of the dataset they generated it from is. They say "in-the-wild" photography, but of course don't really give you a great sense.
The light->dark transitions having consistent geometry is clean though.
We use images from the Image Matching Challenge 2020 dataset. If you look at the Appendix, we list how many images we use and the process by which they were chosen.
> They say "in-the-wild" photography, but of course don't really give you a great sense.
Flickr user photos. Citation shows up in the lower right hand corner during the video.
This appears to be a substantial improvement on current open photogrammetry/structure from motion work [1]. I hope Google supports this making its way into cultural preservation efforts [2].
I'm still looking for a program that takes a video and turns it into an animated 3D scene. All the stuff I've seen is on static scenery, besides some neural nets that can tweak camera angles.
MIT actually does not give an explicit patent grant. So if "any way you like" is your goal, you should choose something different like Apache License 2.0
A while back I stitched together a "hyperlapse" of Stanford's Hoover Tower using lots of Flickr-scraped images. Everything was aligned using "classical" CV tricks and I was really happy with the results. I wonder how NeRF-w would fare on this data?
After going to one of the early Maker Faires, and seeing so many interesting exhibits and projects, I had this same idea, of course with absolutely no clue about how to implement it. If enough people take pictures of the exhibits from a variety of angles, and make them available online, a virtual Maker Faire could be created. Thanks for sharing this!
Great work!
Having tried the code from the original NeRF paper I found the inference time (generation of new views) to be rather slow because the network had to be queried multiple times per ray (pixel). The paper said that there is still potential to speed this up. Did you improve inference speed and do you think that it will be possible to get it to real-time (>30 fps) in the foreseeable future?
We did not aim to speed this part of NeRF up. Check out Neural Sparse Voxel Fields (https://arxiv.org/abs/2007.11571) for some effort in that direction. It's 10x faster, but there's still another 10x to go till you get video frame rates :)
This sort of work will both allow for digital forensics (imagine reconstructing a scene from multiple socially shared images or video), as well as to create even better "deep fakes" (putting people in scenes they never actually went to; or at different times of day/night, or with different weather effects).
Is there a reason why the skies do not appear to be picked up by their "transient" filter of the scene? You end up with the skies constantly changing when moving in 3D point of view, which looks strange.
This is really cool and IMHO an area where ML truly shines: being able to disentangle the base geometric signal from lighting / crowds / occlusion via learning is truly amazing.
Wow that is fantastic work! And so quick since NeRF debuted. This is exactly the kind of work I have been waiting for to reconstruct some old photos I have.
The magic of this method is that we don't construct a "geometry" the same way one might think. There are no triangles or textures here. Instead, we train a machine learning model to predict the derivative of the color and opacity at every point in 3D space. We then integrate along rays emitted from the camera to render an image. It's similar to what's used in CT scans!
That's very cool, but also makes it sound more challenging to integrate into the existing 3D modeling ecosystem vs, say, photogrammetry approaches. Is it possible to generate approximate textured meshes from the color and opacity information?
There is depth information, just not in the form of a mesh.
The model learns to compute a function that takes an XYZ position within a volume as input, and returns color and opacity. You can then render images by tracing rays through this volume. You can pretty easily compute the distance to the first sufficiently-opaque region, or the "average" depth (weighted by each sample's contribution to the final pixel color), at the same time.
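Here is a small sketch of both depth estimates mentioned above, computed during the same pass that would accumulate color. It assumes densities have already been sampled along a ray; `ray_depths` and the 0.5 opacity threshold are made-up names and values for illustration.

```python
import math


def ray_depths(densities, ts, opacity_threshold=0.5):
    """Return (first_hit_depth, expected_depth) for one ray.

    densities: non-negative density at each sample
    ts:        increasing depths of the samples along the ray
    """
    transmittance = 1.0
    accumulated = 0.0      # total opacity gathered so far
    weighted_depth = 0.0
    first_hit = None
    for i, (sigma, t) in enumerate(zip(densities, ts)):
        # Use the gap to the next sample; reuse the previous gap for the last one.
        delta = ts[i + 1] - t if i + 1 < len(ts) else t - ts[i - 1]
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha  # this sample's share of the pixel
        accumulated += weight
        weighted_depth += weight * t
        transmittance *= 1.0 - alpha
        if first_hit is None and accumulated >= opacity_threshold:
            first_hit = t
    expected = weighted_depth / accumulated if accumulated > 0 else None
    return first_hit, expected


# Two empty samples, then a dense surface starting at depth 3.
first_hit, expected = ray_depths(densities=[0.0, 0.0, 40.0, 40.0],
                                 ts=[1.0, 2.0, 3.0, 4.0])
print(first_hit, expected)
```

Running this per pixel gives a depth map, which is one possible starting point for turning the volume into something a mesh-based tool can use.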
> so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?
According to Wikipedia, "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)." We're being more specific about what we use :)
> There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input. Why does the network need a direction? Why can't we get a density and a color given a position?
Volume data of this form is unable to express the idea of view-dependent reflections. I admit, we don't make much use of that here, but it does help! See NeRF for where it makes a big, big difference: https://www.matthewtancik.com/nerf
According to Wikipedia, "Photogrammetry is the science and technology of obtaining reliable information about physical objects and the environment through the process of recording, measuring and interpreting photographic images and patterns of electromagnetic radiant imagery and other phenomena."
I'd say that this research is in the field of photogrammetry.
Anyone remember what that project was called, or if it is even still around?
Edit: Found it. http://phototour.cs.washington.edu/ It later became Photosynth, since discontinued: https://en.wikipedia.org/wiki/Photosynth
https://www.ted.com/talks/blaise_aguera_y_arcas_how_photosyn...
I highly doubt that demo was JS / could run in multiple browsers without some proprietary runtime.
[0] https://web.archive.org/web/20191231213153/http://phototour....
Jon Voight: "Can the computer take us around to the other side?"
Jack Black: "It can HYPOTHESIZE"
:)
Would the logical next step be to use GPT-3 to create a 3D world? :)
We're hiring too. Looking for an engineer with some game dev experience.
The method presented here wouldn't do well with your problem either. 3D reconstruction of moving objects is an unsolved problem!
Just came across this which is neat: https://github.com/AljazBozic/DeepDeform
Also this might be useful: https://github.com/paschalidoud/hierarchical_primitives
For now I’m just beginning to collect data but I hope to contribute more to the field in time!
also they're way higher quality than traditional techniques
Someone made a “camera” which tracks location & direction, and “takes” a picture by selecting the closest picture found taken from that spot & angle.
Use this new tech for a next generation of that “camera”, generating the Platonic frame which should occur from that location.
You can use the PhotoSynth tech described above.
thanks for being so active on this thread.
https://arxiv.org/abs/2008.02268
My follow up question would be: are you able to compare your results to actual photogrammetry data to see how good your reconstruction performs?
1) can't find the paper now, but by exploiting predictable rolling shutter you get additional temporal resolution
2) http://users.ece.northwestern.edu/~sda690/MfB/Motion_CVPR08....
3) http://neelj.com/projects/imudeblurring/imu_deblurring.pdf
Do you plan on releasing code?
Congrats again. This is very cool research.
Thank you!
> Do you plan on releasing code?
I hope so! As with most code, what ran on our machines may not run on yours. Migrating the code to open source will be a big effort. I hope what we describe in the paper is sufficient to build something like you see here.
No, it is not. It's 5 shell commands at most.
$ git init
$ git add .
$ git commit -m 'Initial import'
$ git remote add origin git://...
$ git push origin master
Then say "we can't share for legal reasons", not "we're planning to". This is just a bs corporate answer.
> is of good enough quality to be shared with the public.
This is a petty excuse. There is plenty of open-source code utterly crappy/barely functional out there.
Mine included ;)
and what are z(t) r(t) in equation 5,6?
If the question is, "can you reconstruct a (static) scene from the frames in a video?", the answer is yes!
If the question is, "can you reconstruct a scene with people and other moving objects, and model them moving around too?", the answer is not yet.
Download and have a look! https://vision.uvic.ca/image-matching-challenge/data/
[1] https://github.com/mapillary/OpenSfM (developed by Mapillary, now part of Facebook)
[2] https://www.nytimes.com/2015/12/28/arts/design/using-laser-s... (Using Lasers to Preserve Antiquities Threatened by ISIS)
I saw in the paper their citation [13] pointed to https://arxiv.org/pdf/2003.01587.pdf, which in section 3 says the following:
We thus build on 25 collections of popular landmarks originally selected in [48,101], each with hundreds to thousands of images.
So hundreds to thousands of photos are used, which is a decent quantity, but definitely makes the quality of the result very impressive.
https://github.com/kach/hootow-hyperlapse
When will you be sharing some code?!
Note that it generates a light field, which is not exactly like a polygonal mesh ... YMMV
[Edit] After a little Googling I do see this has been done, using marching cubes (https://www.matthewtancik.com/nerf).
Another recent Google project figured out a way to approximate these radiance fields with layered, partially transparent images for efficient rendering: https://augmentedperception.github.io/deepviewvideo/
It looks like this generates a light field, which is not something that traditional 3D software handles directly.