NeRF in the Wild: reconstructing 3D scenes from internet photography

(nerf-w.github.io)

218 points | by tambourine_man 1371 days ago

18 comments

mey 1371 days ago
I recall many years back a website, I think a Microsoft project, that linked together photos in a 3D space of tourist destinations. It created something of a point cloud, but nothing this advanced. You could click through the points/photos to jump into each photo's perspective of the space.
Anyone remember what that project was called, or if it is even still around?
Edit: Found it. http://phototour.cs.washington.edu/ It later became Photosynth, which has since been discontinued: https://en.wikipedia.org/wiki/Photosynth
[-]
- ur-whale 1371 days ago
  Was it PhotoSynth?
  https://www.ted.com/talks/blaise_aguera_y_arcas_how_photosyn...
  [-]
  - mey 1371 days ago
    Watching this demo makes it very stark how much our single page webapps are regressions in fluid performance.
    [-]
    - fastball 1371 days ago
      Is this not a Silverlight app or something?
      I highly doubt that demo was JS / could run in multiple browsers without some proprietary runtime.
      [-]
      - shakna 1370 days ago
        It was, in point of fact, a Java applet. [0] Older, less secure, but more powerful.
        [0] https://web.archive.org/web/20191231213153/http://phototour....
  - oillio 1370 days ago
    After watching that video: Why did no one build the application that tied your personal photos with a global database of all the other public photos taken in the same place? Seems like it would have been an amazing application.
  - AtomicOrbital 1370 days ago
    yes ... impressive TED talk back then and even now
duckworthd 1371 days ago
Original author here. AMA!
[-]
- airstrike 1371 days ago
  I had this vision that one day we'll be able to reconstruct memories from our past by taking old photos and having a ML model collate everything together to form a 3D rendering of that point in time. It seems like you have gotten most of the way there.
  The next step would be to have the user grab a VR headset and immerse themselves in their favorite childhood moment. One could even add avatars for loved ones using, again, ML-generated audio based on recordings of their voices.
  Your project made me think that it wouldn't be that interesting for me to view your memories, so perhaps the best initial step for a proof-of-concept that would allow the technology to mature would be to recreate historical moments so everyone could relive them – and they could do so entirely virtually, from the comfort of their own couch. Side note: it feels like this technology can disrupt traditional museums with the added bonus of being pandemic-proof.
  Anyway, I don't really have a question... Just wanted to compliment you on this amazing work and throw this idea out there in case others want to think about it, as I'm in an entirely different field and don't have the skills and resources to make it real, and I do strongly feel this will inevitably come to life.
  [-]
  - duckworthd 1371 days ago
    That's a really cool idea! This technology does a fantastic job at reconstructing static scenes. The moving objects -- people, cars, even flora -- are out of scope here. Why? It's really hard to build a 3D model of something you only see from one direction.
    [-]
    - trilinearnz 1370 days ago
      Anyone remember this scene from Enemy of The State (1998)? https://youtu.be/3EwZQddc3kY?t=45
      Jon Voight: "Can the computer take us around to the other side?"
      Jack Black: "It can HYPOTHESIZE"
      :)
      [-]
      - sprafa 1369 days ago
        I was amazed at the scene at the time, and thought it was unbelievable. But then I read they had NSA advisors and maybe the US govt might have had access to some sort of primitive photogrammetry at the time?
    - airstrike 1371 days ago
      But if we know what a car ought to look like in 3D, can't we take the one photo we have from one direction and just fill in the blanks with that a priori 3D knowledge?
      [-]
      - withjive 1371 days ago
        Similar to how GPT-3 can be applied not only to create Text, but also fill in missing pieces of Images (ie. complete the missing half of a face).
        Would the logical next step be to use GPT-3 to create a 3D world? :)
        [-]
        airstrike 1370 days ago
        GPT-3D rolls off the tongue nicely
      - jhurliman 1370 days ago
        Getting there. This is one part of the puzzle: https://arxiv.org/pdf/2007.11965
  - Impossible 1371 days ago
    I pitched a similar idea in an interview years ago (http://www.wearegamedevs.com/2016/01/20/scott-anderson-rende...) with the added complexity of forward simulating past events with different choices. I was asked what I would do with infinite time and money though. To this day my elderly parents tell me to quit my job and work on this idea :-D.
    I do believe that past time travel to reconstructed and recorded events will be one of the stickiest use cases for VR.
    [-]
    - david_at 1370 days ago
      It's funny you say that... https://news.ycombinator.com/item?id=19529921
      We're hiring too. Looking for an engineer with some game dev experience.
      [-]
      - stallmanite 1370 days ago
        I wish I had a more substantive comment beyond “wow” but this is really impressive. I’ve wanted something like this for a long time.
    - airstrike 1370 days ago
      Hah! Loved the interview, thanks for sharing.
  - rasz 1369 days ago
    "I Built a REAL-LIFE Time Machine! " by Lucas Builds The Future https://www.youtube.com/watch?v=aHyNYfFfXlg
- philipkglass 1371 days ago
  Can this technique reconstruct good geometry for the visible parts if only part of the structure is ever imaged?
  That's a precursor to: could this technique be used to enhance Street View? There are times when I would really like to be able to "walk around" outdoor scenes in finer steps. Current Street View smears between photos taken some distance apart. (I don't know if that is a limitation of the public interface or if the original data capture is really that coarse.) It would be nice to have a real 3D space to explore, but I certainly don't expect the un-imaged parts to be defined correctly.
  Finally, does this also work for reconstructing interior spaces seen from the inside? Like the geometry of a cave, from pictures of the cave interior?
  [-]
  - duckworthd 1371 days ago
    At this point, this method is only good at reconstructing parts of the scene that are well-photographed. You'll notice that our video for Sacre Coeur has some blurry bits, particularly the staircase in front of the Basilica. That's because we learn to reconstruct what was seen, but aren't yet able to imagine what wasn't!
    Does this work in reconstructing indoor spaces? Give it a shot and find out!
    [-]
    - philipkglass 1371 days ago
      > Does this work in reconstructing indoor spaces? Give it a shot and find out!
      Have you released the code? (Or did you mean I could try re-implementing the work you published from the paper? That's a reasonable response too.) I didn't see a link to source in the github.io page or in the arXiv paper. The only source code link I saw was to https://github.com/bmild/nerf which I thought was earlier work than this paper.
      [-]
      - duckworthd 1371 days ago
        Our code is not released yet. If your data is captured without occluders and in RAW format, you don't need the enhancements we propose :). NeRF can do amazing things with clean data!
        [-]
        StavrosK 1370 days ago
        I just saw the NeRF demo video, it's amazing. If its results are this good, why is photogrammetry software not basically perfect yet? It looks like they can generate models with vast amounts of detail.
        [-]
        m3at 1370 days ago
        Not the author (but I read the previous papers and the code), the simple answer is that it's still very costly in processing power. Think a few hours on good hardware for a set of pictures.
        Note that the results are still very impressive imo, this is still early research phase.
        duckworthd 1370 days ago
        NeRF was first published in March! Give us time :)
        [-]
        toomuchtodo 1370 days ago
        Thanks so much for taking the time to do this impromptu AMA. The excitement over the tech is clearly palpable :)
    - idontevengohere 1371 days ago
      Love how you explained this! And I thought I was the only one using exclamation marks for everything
- TaylorAlexander 1371 days ago
  Great work! I’m an ex googler currently working on a farming robot we hope to make open source. I’m particularly interested in neural reconstruction of plants in a field. I want to capture the 3D structure of the plants as well as semantics like plant species. I’ve found that normal photogrammetry produces poor reconstructions due to movement of the plants in the wind.
  We’re potentially about to start a non profit and formally kick off our whole robot as open source. I’m interested in finding research partners who would like to help produce a research paper on 3D reconstruction of plants. I can produce a high quality geo located dataset with 2cm accurate GPS tags, but I have no experience with neural rendering. This is work I want to do over the course of the next year.
  Do you know anyone interested in helping with that kind of work? Thanks!
  [-]
  - duckworthd 1371 days ago
    I'm a bit new to the field myself, so I'm afraid I can't provide any contacts. Ask me again in a couple of years.
    The method presented here wouldn't do well with your problem either. 3D reconstruction of moving objects is an unsolved problem!
    [-]
    - TaylorAlexander 1371 days ago
      Indeed. I am seeing some generative approaches that know what the object should look like in 3D and use that knowledge to imagine a model that matches a photo. I think such a technique would be useful for good approximation of plant models. Such a project would require some new datasets I would think, but seems like a good approach.
      Just came across this which is neat: https://github.com/AljazBozic/DeepDeform
      Also this might be useful: https://github.com/paschalidoud/hierarchical_primitives
      For now I’m just beginning to collect data but I hope to contribute more to the field in time!
- akavel 1371 days ago
  Looks awesome! I'm not an ML guy and haven't read the paper, just watched the video - one thing isn't clear to me from it: is this fully automatic/unattended, you just throw images into it and out come magic rainbows of 3d structures? or do you need to somehow help it, e.g. to disentangle the structure from the "transient" elements? In other words, I don't really understand what the "Appearance Embedding" even means... Or is the "input" that you mention in the video fed into a model that is already trained on a set of photos of a particular scene? I.e. the "input" + "appearance embedding" basically encodes just a choice of a framing & "atmosphere/lighting"?
  [-]
  - duckworthd 1371 days ago
    It's a little hard to describe from scratch, but let me do my best.
    The method is unattended, in the sense that it's photos + camera parameters in and scene representation out. The photos should all be of the same scene (e.g. the Trevi Fountain). Once you have a scene representation, you can ask what the scene would look like from new camera angles with your choice of lighting.
    Choosing camera angles is straightforward. You tell me where and what direction the camera is facing. The question then becomes, how do you specify your choice of lighting? The answer is, you can't do so directly. Instead, you provide a picture with the lighting you want, and with a little magic, we can find a way to imitate that lighting. We do so by finding a corresponding "appearance embedding" via numerical optimization.
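    A toy sketch of that last step, under loud assumptions: the trained model is frozen and only the embedding is adjusted until the rendering matches the reference photo. The `render` function below is a hypothetical stand-in (a two-number embedding acting as gain and bias on fixed pixel values), not the NeRF-W network, and `fit_appearance` is an illustrative name.

```python
def render(base_pixels, embedding):
    """Hypothetical frozen renderer: the embedding only changes appearance."""
    gain, bias = embedding
    return [gain * p + bias for p in base_pixels]


def fit_appearance(base_pixels, target_pixels, steps=2000, lr=0.1):
    """Gradient descent on mean squared error, updating only the embedding."""
    gain, bias = 1.0, 0.0  # initial guess for the embedding
    n = len(base_pixels)
    for _ in range(steps):
        rendered = render(base_pixels, (gain, bias))
        errors = [r - t for r, t in zip(rendered, target_pixels)]
        # Analytic gradients of the MSE with respect to gain and bias.
        grad_gain = 2.0 / n * sum(e * p for e, p in zip(errors, base_pixels))
        grad_bias = 2.0 / n * sum(errors)
        gain -= lr * grad_gain
        bias -= lr * grad_bias
    return gain, bias


# The "scene" stays fixed; the reference photo is the same scene, dimmer and hazier.
scene = [0.1, 0.4, 0.7, 0.9]
reference_photo = [0.5 * p + 0.2 for p in scene]
embedding = fit_appearance(scene, reference_photo)
```

    The geometry (here, `scene`) is never touched; only the small embedding vector moves, which is why the same scene can be re-rendered under the lighting of any reference photo.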
    [-]
    - Ut_Pwnsim 1371 days ago
      What is the precision required (or used in your datasets) for camera position and angles? Is the geotagging in the images from common cellphones and smart cameras enough? Were they back-calculated using some other method from non- or poorly-georeferenced images?
      [-]
      - duckworthd 1371 days ago
        It's hard for me to say how precise camera position and direction need to be. We use COLMAP to estimate both via structure-from-motion.
- riotman 1371 days ago
  Why did you use neural networks? There are faster techniques in analytical geometry that can extract surface contours from color gradients from images, and they do this faster and directly.
  [-]
  - duckworthd 1371 days ago
    1) My bread and butter for the last 10 years has been machine learning. When all you have is a hammer...
    2) We don't extract surface contours, we learn a volumetric radiance field! To oversimplify, we learn a (smooth) function that, given a position in space, produces the differential opacity and color at that space. To render an image from a camera viewpoint, we approximately integrate along rays emitted from each pixel of the camera.
    Check out NeRF and our paper to learn more about this representation!
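    To make the "approximately integrate along rays" step concrete, here is a minimal sketch of the standard volume rendering quadrature for a single ray. It is an illustration, not the authors' code: `render_ray` and its argument names are made up for this example, and the densities and colors are hand-written rather than produced by a trained network.

```python
import math


def render_ray(densities, colors, deltas):
    """Approximate the volume rendering integral along one camera ray.

    densities: differential opacity (sigma) at each sample along the ray
    colors:    (r, g, b) tuple at each sample
    deltas:    distance covered by each sample
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Two samples of empty space, then a dense red surface.
pixel, weights = render_ray(
    densities=[0.0, 0.0, 50.0, 50.0],
    colors=[(0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0)],
    deltas=[1.0, 1.0, 1.0, 1.0],
)
```

    Empty samples get zero weight, so the blue values assigned to them never reach the pixel; the first dense sample absorbs nearly all of the light and the pixel comes out red.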
  - rmbrualla 1371 days ago
    Neural networks are better compared to classical methods.
    One of the best classical methods is this one (https://grail.cs.washington.edu/projects/sq_rome_g1/), and our method significantly improves upon it. We do not compare directly with it, but Neural Rerendering in the Wild does, and we improve upon it.
  - randyrand 1371 days ago
    these nerf models are like 5MB large and have a ton of directional lighting support. speculars, caustics, refraction, mirrors, you name it!
    also they're way higher quality than traditional techniques
- ctdonath 1371 days ago
  Random thought:
  Someone made a “camera” which tracks location & direction, and “takes” a picture by selecting the closest picture found taken from that spot & angle.
  Use this new tech for a next generation of that “camera”, generating the Platonic frame which should occur from that location.
  [-]
  - oillio 1370 days ago
    Then instead of a camera, display it in VR goggles. Allows you to walk around and see a landmark without all those pesky people ruining your view.
    You can use the PhotoSynth tech described above.
- sramam 1371 days ago
  this looks really cool. I'm not an ML chap, but always wondered: Can these kinds of algorithms also give you dimensional data? For example, can I 3D-print one of these models with any accuracy?
  [-]
  - duckworthd 1371 days ago
    For that, you'll need to convert the representation we have (volumetric radiance field) to one your 3D printer can understand (a mesh?). The NeRF authors use the marching cubes algorithm to do just that. Check out their website: https://www.matthewtancik.com/nerf.
- pferdone 1371 days ago
  how many pictures or angles are needed to produce good results? I get that landmarks have an abundance of source material, but what's a reasonable amount of data to reconstruct scenes?
  [-]
  - duckworthd 1371 days ago
    On the order of hundreds to low-digit thousands worked well for us. These photos contain a lot of occluders like tourists, and we needed to have enough views of the subject in question to build a good 3D scene representation.
    [-]
    - panabee 1371 days ago
      can you elaborate on the key variables for the data? for instance, is it safe to assume 360 photos from the same angle would yield a worse model than 1 photo from 360 different angles?
      what does the ideal minimal data set look like (eg, 5 photos from each 15-degree offset)?
      thanks for being so active on this thread.
      [-]
      - duckworthd 1370 days ago
        NeRF's (and all of photogrammetry's) bread and butter is 3D consistency -- that is, seeing the same object from multiple angles. A 360 degree photo from a fixed position just won't do. As to how to select the best camera angles...I'm not sure. I believe there is research in this area for classical photogrammetry techniques, but I'm not familiar enough to point you to a body of work.
    - narrationbox 1371 days ago
      How do you remove tourists? Is the network trained to segment and ignore humans?
      [-]
      - duckworthd 1371 days ago
        The model does not explicitly learn to segment images. The answer is unfortunately more difficult to explain than a HN comment bears. I encourage you to read the paper for more details.
        https://arxiv.org/abs/2008.02268
    - pferdone 1371 days ago
      Just gotta say: amazing!
      My follow up question would be: are you able to compare your results to actual photogrammetry data to see how good your reconstruction performs?
      [-]
      - duckworthd 1371 days ago
        I'm actually quite new to the field, and I'm not even sure what to compare against nor how to compare it. What's typically measured and how?
    - _visgean 1371 days ago
      Is the model able to capture the underlying geometry? E.g. If I have a pillar part of which was not visible at any training point is it able to reconstruct that part?
      [-]
      - duckworthd 1371 days ago
        The model is trained to reconstruct what is observed, but not what is obscured. If you look closely at our videos, you'll notice some parts of the scene are blurry -- those parts weren't seen often enough to learn well. If you look at parts of the scene not observed at all, I'm not sure what you'd find.
    - baq 1371 days ago
      would a sufficiently long video in motion, say from a drone, car or even a walking person, work instead?
      [-]
      - duckworthd 1371 days ago
        Pictures are pictures, even as video frames :)
    - sumtechguy 1371 days ago
      Did you consider using movies as a source too?
      [-]
      - duckworthd 1371 days ago
        Consider? Yes. Try? Nope!
        [-]
        sumtechguy 1370 days ago
        awww! Figured dolly shots and steady cam shots would fit perfect into something like that. Esp 24 frames per second and usually known locations. Course it would probably drag a lot of the net into being biased towards that time spot I guess?
        [-]
        rasz 1370 days ago
        There are problems associated with using video: motion blur, rolling shutter.
        [-]
        sumtechguy 1370 days ago
        Oh I agree. In my head it seems like it should work. I could be wildly wrong though. I am every day :)
        [-]
        rasz 1369 days ago
        It definitely can work, and even has some additional benefits (1), but requires special considerations. You can deblur using global motion vectors (2), or additional hardware like accelerometer reading embedded in the video feed (3).
        1) can't find the paper now, but by exploiting predictable rolling shutter you get additional temporal resolution
        2) http://users.ece.northwestern.edu/~sda690/MfB/Motion_CVPR08....
        3) http://neelj.com/projects/imudeblurring/imu_deblurring.pdf
- panabee 1371 days ago
  This looks amazing! Congratulations. What was the most challenging aspect of this? I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly. For some reason, the paper isn't loading, so feel free to say it's explained in detail there.
  Do you plan on releasing code?
  Congrats again. This is very cool research.
  [-]
  - duckworthd 1371 days ago
    > This looks amazing! Congratulations.
    Thank you!
    > What was the most challenging aspect of this?
    Wow, that's hard to say! Our work truly stands on the shoulders of giants (Mildenhall et al, 2020). I can list off a few challenges: figuring out if an idea "kinda works" or "definitely works" or has a bug, figuring out how to measure progress, coordinating a group of 6 researchers living 9 hours apart, and assembling everything together for a simultaneous paper-website-video release!
    > I'm curious to see how you performed edge detection on the transient objects and were able to isolate them so cleanly.
    We don't! All of this comes by the magic of machine learning :). We train our model to attenuate the importance of "difficult" pixels that aren't 3D-consistent in the training set. We also partition the scene into "static" and "transient" volumetric radiance fields without explicit supervision. We do so by regularizing the latter to be empty unless necessary, and providing it with access to a learned, image-specific latent embedding. We discard the transient radiance field when rendering these videos, thus removing tourists and other moving objects.
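    A rough sketch of how "attenuating difficult pixels" and "empty unless necessary" can be expressed as a per-pixel loss, loosely modelled on the description in the paper. Everything named here is an assumption for illustration: `pixel_loss`, the per-pixel uncertainty `beta`, and the arbitrary `reg_weight` value are not taken from the authors' code.

```python
import math


def pixel_loss(predicted, observed, beta, transient_densities, reg_weight=0.01):
    """Uncertainty-weighted color loss plus a sparsity penalty.

    predicted, observed: (r, g, b) colors for one pixel
    beta:                predicted uncertainty for this pixel (must be > 0)
    transient_densities: non-empty list of transient-field densities on the ray
    """
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # A large beta shrinks the color error, so hard pixels matter less...
    data_term = squared_error / (2.0 * beta ** 2)
    # ...but beta is penalized, so the model cannot call every pixel hard.
    uncertainty_penalty = math.log(beta ** 2) / 2.0
    # Encourage the transient field to stay empty unless it pays for itself.
    sparsity = reg_weight * sum(transient_densities) / len(transient_densities)
    return data_term + uncertainty_penalty + sparsity


empty_field = [0.0, 0.0]
# Pixel covered by a tourist: large color error, so higher uncertainty is cheaper.
confident = pixel_loss((1, 1, 1), (0, 0, 0), beta=1.0, transient_densities=empty_field)
uncertain = pixel_loss((1, 1, 1), (0, 0, 0), beta=2.0, transient_densities=empty_field)
```

    For a pixel with a large color error, raising `beta` lowers the total loss, while for a well-explained pixel it only adds the penalty. No segmentation labels appear anywhere in the objective.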
    > For some reason, the paper isn't loading, so feel free to say it's explained in detail there.
    Well that won't do. Download it here: https://arxiv.org/abs/2008.02268. It's 40 MB, so your download bar may indicate it's almost done, but it actually has a good bit left to go.
    > Do you plan on releasing code?
    I hope so! As with most code, what ran on our machines may not run on yours. Migrating the code to open source will be a big effort. I hope what we describe in the paper is sufficient to build something like you see here.
    [-]
    - alacombe 1370 days ago
      > Migrating the code to open source will be a big effort.
      No, it is not. It's 5 shell commands at the most.
      $ git init
      $ git add .
      $ git commit -m 'Initial import'
      $ git remote add origin git://...
      $ git push origin master
      [-]
      - Jaruzel 1370 days ago
        You are assuming there that all their code is non-proprietary, doesn't belong to someone else, already has an open licence, and is of good enough quality to be shared with the public.
        [-]
        alacombe 1370 days ago
        > their code is non-proprietary, doesn't belong to someone else, already has a open-licence
        Then say "we can't share for legal reasons", not "we're planning to". This is just a bs corporate answer.
        > is of good enough quality to be shared with the public.
        This is a petty excuse. There is plenty of open-source code utterly crappy/barely functional out there.
        [-]
        Jaruzel 1369 days ago
        > There is plenty of open-source code utterly crappy/barely functional out there.
        Mine included ;)
- alacombe 1370 days ago
  Are you planning to share the code?
  GPT-3 has some pretty interesting demos which are unfortunately rather disappointing once outside a carefully crafted environment. Said otherwise, a paper is nothing if it cannot be reproduced.
- nialv7 1371 days ago
  Not too familiar with this area. IIUC, training a model would require photos with corresponding information about where and at what angle the photo is taken.
  Is this information just available from the dataset?
  [-]
  - duckworthd 1371 days ago
    It is provided by the dataset we use, but given a new dataset, you can use off-the-shelf tools to obtain it yourself! Check out COLMAP, it's super duper cool: https://colmap.github.io/.
- billconan 1371 days ago
  so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?
  There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.
  Why does the network need a direction? Why can't we get a density (opacity) and a color given a position?
  and what are z(t) r(t) in equation 5,6?
  [-]
  - duckworthd 1371 days ago
    Answered most of your other questions below in another comment.
    > and what are z(t) r(t) in equation 5,6?
    r(t) is a position in 3D space along a camera ray of the form, r(t) = origin + t * direction.
    z(t) is the output of our first MLP. Think of it as a 256-dimensional vector of uninterpretable numbers that represent the input position r(t) in a useful way.
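    The ray formula above, r(t) = origin + t * direction, can be sketched as follows. The helper name `ray_points` and the evenly spaced choice of t are assumptions for illustration; each returned point is what would be fed to the first MLP to obtain z(t).

```python
def ray_points(origin, direction, near, far, num_samples):
    """Return points r(t) = origin + t * direction for evenly spaced t.

    t runs from `near` to `far` inclusive; num_samples must be at least 2.
    """
    step = (far - near) / (num_samples - 1)
    points = []
    for i in range(num_samples):
        t = near + i * step
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
    return points


# A camera at the world origin looking down the +z axis.
samples = ray_points(origin=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 1.0),
                     near=2.0, far=6.0, num_samples=5)
```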
- mey 1371 days ago
  Would it be possible to integrate video into the model?
  [-]
  - duckworthd 1371 days ago
    I'm not sure I understand your question.
    If the question is, "can you reconstruct a (static) scene from the frames in a video?", the answer is yes!
    If the question is, "can you reconstruct a scene with people and other moving objects, and model them moving around too?", the answer is not yet.
    [-]
    - mey 1371 days ago
      The former. Once the scene is synthesized, I figure that is where any dynamic output would occur. Although that raises an interesting thought of using the NeRF modeling to paint out certain things in potentially live video.
- hi41 1371 days ago
  Very cool! Congratulations!
  [-]
  - duckworthd 1371 days ago
    Thank you :)
- dougabug 1370 days ago
  Do you need posed images?
  [-]
  - duckworthd 1370 days ago
    Yes, the images need to be posed. We use COLMAP to obtain camera pose.
- AtomicOrbital 1370 days ago
  cool project ... now spin up a public server so we can feed up our own set of images and get back the 3D synth object scene
- ThisIsMyPasswrd 1371 days ago
  Does it take reflections into consideration?
  [-]
  - duckworthd 1370 days ago
    The work we build off of, NeRF, does. While there's nothing preventing NeRF-W from also representing reflections, we find it captures a more matte picture of the object.
- nullsmack 1370 days ago
  Is there a code release somewhere?
nawgz 1371 days ago
Those are some very cool 3D visualizations generated, but it's a bit difficult to understand what the form of the dataset they generated it from is. They say "in-the-wild" photography, but of course don't really give you a great sense.
The light->dark transitions having consistent geometry is clean though.
[-]
- duckworthd 1371 days ago
  We use images from the Image Matching Challenge 2020 dataset. If you look at the Appendix, we list how many images we use and the process by which they were chosen.
  Download and have a look! https://vision.uvic.ca/image-matching-challenge/data/
  [-]
  - nawgz 1371 days ago
    Thanks, that's a clean reference.
- toomuchtodo 1371 days ago
  > They say "in-the-wild" photography, but of course don't really give you a great sense.
  Flickr user photos. Citation shows up in the lower right hand corner during the video.
  This appears to be a substantial improvement on current open photogrammetry/structure from motion work [1]. I hope Google supports this making its way into cultural preservation efforts [2].
  [1] https://github.com/mapillary/OpenSfM (developed by Mapillary, now part of Facebook)
  [2] https://www.nytimes.com/2015/12/28/arts/design/using-laser-s... (Using Lasers to Preserve Antiquities Threatened by ISIS)
  [-]
  - nawgz 1371 days ago
    Yes, I mostly meant that I don't get a great sense of "how many photos there are" in these datasets.
    I saw in the paper their citation [13] pointed to https://arxiv.org/pdf/2003.01587.pdf, which in section 3 says the following:
    > We thus build on 25 collections of popular landmarks originally selected in [48,101], each with hundreds to thousands of images.
    So hundreds to thousands of photos are used, which is a decent quantity, but definitely makes the quality of the result very impressive.
Mathnerd314 1371 days ago
I'm still looking for a program that takes a video and turns it into an animated 3D scene. All the stuff I've seen is on static scenery, besides some neural nets that can tweak camera angles.
[-]
- johanneskopf 1371 days ago
  Check this. Code coming (very) soon :) https://roxanneluo.github.io/Consistent-Video-Depth-Estimati...
  [-]
  - ThisIsMyPasswrd 1371 days ago
    Do you happen to know how intellectual property works when someone wants to use the algorithm/code?
    [-]
    - johanneskopf 1370 days ago
      I think we're going to use the MIT license. So, you'll be able to use it in almost any way you like...
      [-]
      - lostmsu 1367 days ago
        MIT actually does not give an explicit patent grant. So if "any way you like" is your goal, you should choose something different like Apache License 2.0
- duckworthd 1371 days ago
  There is currently no way I'm aware of to accurately reconstruct a moving 3D scene. Sorry! Ask us again in a few years :)
hardmath123 1371 days ago
A while back I stitched together a "hyperlapse" of Stanford's Hoover Tower using lots of Flickr-scraped images. Everything was aligned using "classical" CV tricks and I was really happy with the results. I wonder how NeRF-w would fare on this data?
https://github.com/kach/hootow-hyperlapse
flyingcircus3 1371 days ago
After going to one of the early Maker Faires, and seeing so many interesting exhibits and projects, I had this same idea, of course with absolutely no clue about how to implement it. If enough people take pictures of the exhibits from a variety of angles, and make them available online, a virtual Maker Faire could be created. Thanks for sharing this!
brookman64k 1371 days ago
Great work! Having tried the code from the original NeRF paper I found the inference time (generation of new views) to be rather slow because the network had to be queried multiple times per ray (pixel). The paper said that there is still potential to speed this up. Did you improve inference speed and do you think that it will be possible to get it to real-time (>30 fps) in the foreseeable future?
[-]
- duckworthd 1370 days ago
  We did not aim to speed this part of NeRF up. Check out Neural Sparse Voxel Fields (https://arxiv.org/abs/2007.11571) for some effort in that direction. It's 10x faster, but there's still another 10x to go till you get video frame rates :)
PeterCorless 1371 days ago
This sort of work will both allow for digital forensics (imagine reconstructing a scene from multiple socially shared images or video), as well as to create even better "deep fakes" (putting people in scenes they never actually went to; or at different times of day/night, or with different weather effects).
ekianjo 1371 days ago
Is there a reason why the skies do not appear to be picked up by their "transient" filter of the scene? You end up with the skies constantly changing when moving in 3D point of view, which looks strange.
[-]
- duckworthd 1370 days ago
  A good question! And a problem yet to be solved!
ur-whale 1371 days ago
This is really cool and IMHO an area where ML truly shines: being able to disentangle the base geometric signal from lighting / crowds / occlusion via learning is truly amazing.
nla 1370 days ago
Amazing work! Reminds me of something I saw at SIGGRAPH back in '95 called 'Tour into Picture'. I think the work came out of Japan.
When will you be sharing some code?!
randyrand 1371 days ago
Wow that is fantastic work! And so quick since NeRF debuted. This is exactly the kind of work I have been waiting for to reconstruct some old photos I have.
[-]
- ur-whale 1371 days ago
  > reconstruct some old photos I have.
  Note that it generates a light field, which is not exactly like a polygonal mesh ... YMMV
schemescape 1371 days ago
Is the geometry from each of the examples available in some format? It would be fun to look more closely. Apologies if I missed a link somewhere!
[-]
- duckworthd 1371 days ago
  The magic of this method is that we don't construct a "geometry" the same way one might think. There are no triangles or textures here. Instead, we train a machine learning model to predict the derivative of the color and opacity at every point in 3D space. We then integrate along rays emitted from the camera to render an image. It's similar to what's used in CT scans!
  [-]
  - heyitsguay 1371 days ago
    That's very cool, but also makes it sound more challenging to integrate into the existing 3D modeling ecosystem vs, say, photogrammetry approaches. Is it possible to generate approximate textured meshes from the color and opacity information?
    [Edit] After a little Googling I do see this has been done, using marching cubes (https://www.matthewtancik.com/nerf).
  - schemescape 1370 days ago
    Thanks. So are the trained models for the examples available with code to generate 2D images from them?
  - jayd16 1371 days ago
    This means you don't have an occlusion mesh or any other depth information, correct?
    [-]
    - teraflop 1371 days ago
      There is depth information, just not in the form of a mesh.
      The model learns to compute a function that takes an XYZ position within a volume as input, and returns color and opacity. You can then render images by tracing rays through this volume. You can pretty easily compute the distance to the first sufficiently-opaque region, or the "average" depth (weighted by each sample's contribution to the final pixel color), at the same time.
      Another recent Google project figured out a way to approximate these radiance fields with layered, partially transparent images for efficient rendering: https://augmentedperception.github.io/deepviewvideo/
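      Both depth estimates mentioned above can be sketched in a few lines. This is an illustration under assumptions, not code from the paper: `ray_depths` is a made-up name, the 0.5 opacity threshold is arbitrary, and normalizing the average by the total weight is one possible convention.

```python
import math


def ray_depths(densities, distances, deltas, opacity_threshold=0.5):
    """Return (weighted average depth, first sufficiently opaque depth).

    densities: differential opacity at each sample along the ray
    distances: distance from the camera to each sample
    deltas:    distance covered by each sample
    Either value is None when the ray never hits anything.
    """
    transmittance = 1.0
    weighted_depth = 0.0
    total_weight = 0.0
    first_hit = None
    for sigma, t, delta in zip(densities, distances, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha  # same weight used for the pixel color
        weighted_depth += weight * t
        total_weight += weight
        transmittance *= 1.0 - alpha
        # Accumulated opacity so far is 1 - transmittance.
        if first_hit is None and 1.0 - transmittance >= opacity_threshold:
            first_hit = t
    average = weighted_depth / total_weight if total_weight > 0.0 else None
    return average, first_hit


# Empty space for two samples, then a solid surface three units away.
average_depth, first_hit = ray_depths(
    densities=[0.0, 0.0, 100.0, 100.0],
    distances=[1.0, 2.0, 3.0, 4.0],
    deltas=[1.0, 1.0, 1.0, 1.0],
)
```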
      [-]
      - duckworthd 1371 days ago
        Another related project by our friends in NYC: https://twitter.com/Jimantha/status/1289184432553734144
    - duckworthd 1371 days ago
      We have depth! Check out our depth video: https://youtu.be/yPKIxoN2Vf0?t=146
billconan 1371 days ago
so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?
There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input.
Why does the network need a direction? Why can't we get a density and a color given a position?
[-]
- duckworthd 1371 days ago
  > so why does it use multi-layer perceptron? is it the same as ANN? why not calling it ANN? Does it have activation?
  According to Wikipedia, "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)." We're being more specific about what we use :)
  > There is another thing I don't understand. Traditional volume data is a map of position to density (or color). There doesn't seem to be the need for a direction as an input. Why does the network need a direction? Why can't we get a density and a color given a position?
  Volume data of this form is unable to express the idea of view-dependent reflections. I admit, we don't make much use of that here, but it does help! See NeRF for where it makes a big, big difference: https://www.matthewtancik.com/nerf
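  A toy sketch of why direction is an input, and of where the activations sit. Density is computed from position alone, while color also sees the viewing direction, so the same point can look different from different angles. The weights below are invented and the layers are tiny; the real networks are far larger and this is not the authors' architecture.

```python
import math

# Made-up weights: 3 position inputs -> 4 hidden units.
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9], [0.2, 0.2, 0.2]]
B_HIDDEN = [0.0, 0.1, -0.1, 0.05]
# Density head reads only the position features.
W_DENSITY = [[0.7, -0.3, 0.5, 0.2]]
B_DENSITY = [0.1]
# Color head reads the 4 position features plus the 3 direction components.
W_COLOR = [
    [0.2, 0.1, -0.3, 0.4, 0.9, 0.0, 0.0],
    [-0.1, 0.3, 0.2, 0.1, 0.0, 0.9, 0.0],
    [0.4, -0.2, 0.1, 0.3, 0.0, 0.0, 0.9],
]
B_COLOR = [0.0, 0.0, 0.0]


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer without its activation."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def radiance_field(position, direction):
    """Return (density, [r, g, b]) for a 3D position and viewing direction."""
    hidden = [relu(v) for v in dense(position, W_HIDDEN, B_HIDDEN)]
    density = relu(dense(hidden, W_DENSITY, B_DENSITY)[0])  # position only
    color_inputs = hidden + list(direction)                  # position + view
    color = [sigmoid(v) for v in dense(color_inputs, W_COLOR, B_COLOR)]
    return density, color


point = (0.5, 0.2, 1.0)
density_a, color_a = radiance_field(point, (0.0, 0.0, 1.0))
density_b, color_b = radiance_field(point, (1.0, 0.0, 0.0))
```

  Here `density_a` equals `density_b` because geometry does not depend on where you look from, while `color_a` and `color_b` differ, which is the hook that lets a model express view-dependent effects such as reflections.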
woko 1371 days ago
The videos are very impressive. I wish you let us move the 3D scene with the mouse in the browser to get a better idea of the result.
nla 1370 days ago
Also, can you share any details of the compute requirement for this?
mrfusion 1371 days ago
How does this compare to photogrammetry?
[-]
- duckworthd 1371 days ago
  According to Wikipedia, "Photogrammetry is the science and technology of obtaining reliable information about physical objects and the environment through the process of recording, measuring and interpreting photographic images and patterns of electromagnetic radiant imagery and other phenomena."
  I'd say that this research is in the field of photogrammetry.
  [-]
  - mrfusion 1371 days ago
    I guess I meant how it compares to the commercial photogrammetry software out there.
    [-]
    - ur-whale 1371 days ago
      As mentioned in another comment, traditional photogrammetry software typically generates a 3D polygonal mesh.
      It looks like this generates a light field, which is not something that traditional 3D software handles directly.
    - duckworthd 1371 days ago
      I don't know! I'm not familiar enough with the field to say.
Proven 1370 days ago
I clicked on the link because I misread that as “pornography”. Oh well