Global Latent Neural Rendering

CVPR 2024

Huawei Noah’s Ark Lab

Interactive results viewer: Sparse DTU (3 views) · Sparse RFF (3 views) · Generalizable DTU (unknown scenes) · Generalizable RFF (known scenes) · ILSH (ICCV 2023 challenge)

We present a method that can render novel views (1) of unknown scenes, (2) from sparse inputs, (3) with high fidelity, (4) in less than a second per frame (at 375×512 resolution).

Abstract

A recent trend among generalizable novel view synthesis methods is to learn a rendering operator acting over single camera rays.

This approach is promising because it removes the need for explicit volumetric rendering, but it effectively treats target images as collections of independent pixels. Here, we propose to learn a global rendering operator acting over all camera rays jointly. We show that the right representation to enable such rendering is a 5-dimensional plane sweep volume consisting of the projection of the input images onto a set of planes facing the target camera. Based on this understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR), an efficient convolutional architecture that performs the rendering operation globally in a low-resolution latent space.
Experiments on various datasets under sparse and generalizable setups show that our approach consistently outperforms existing methods by significant margins.
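
To make the plane sweep volume concrete, here is a minimal sketch of how such a 5D PSV can be built, assuming PyTorch and pinhole cameras. The function name `build_psv` and its signature are illustrative assumptions, not the paper's API: each source image is warped onto fronto-parallel planes of the target camera via the standard plane-induced homography.

```python
# Minimal PSV construction sketch (assumptions: pinhole cameras, PyTorch).
import torch
import torch.nn.functional as F

def build_psv(src_imgs, K_src, K_tgt, R, t, depths, hw):
    """Warp N source images onto D depth planes facing the target camera.

    src_imgs: (N, 3, H, W) source images
    K_src:    (N, 3, 3) source intrinsics; K_tgt: (3, 3) target intrinsics
    R, t:     (N, 3, 3) / (N, 3, 1) target-to-source relative poses
    depths:   (D,) depths of the sweep planes in the target frame
    hw:       (H_t, W_t) target resolution
    returns:  (N, D, 3, H_t, W_t) five-dimensional plane sweep volume
    """
    H_t, W_t = hw
    N, D = src_imgs.shape[0], depths.shape[0]
    # Homogeneous pixel grid of the target view, shape (3, H_t*W_t).
    ys, xs = torch.meshgrid(
        torch.arange(H_t, dtype=torch.float32),
        torch.arange(W_t, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1).to(src_imgs)
    n = torch.tensor([[0.0, 0.0, 1.0]]).to(src_imgs)  # plane normal (target frame)
    psv = src_imgs.new_zeros(N, D, 3, H_t, W_t)
    for k, d in enumerate(depths):
        # Plane-induced homography H = K_src (R - t n^T / d) K_tgt^{-1},
        # mapping target pixels onto each source image.
        Hmat = K_src @ (R - (t @ n) / d) @ torch.linalg.inv(K_tgt)
        p = Hmat @ pix                                    # (N, 3, H_t*W_t)
        p = p[:, :2] / p[:, 2:].clamp(min=1e-6)           # perspective divide
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * p[:, 0] / (src_imgs.shape[-1] - 1) - 1.0
        gy = 2.0 * p[:, 1] / (src_imgs.shape[-2] - 1) - 1.0
        grid = torch.stack([gx, gy], -1).reshape(N, H_t, W_t, 2)
        psv[:, k] = F.grid_sample(src_imgs, grid, align_corners=True)
    return psv
```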

Method

architecture.png

Overview of ConvGLR. The 4D grouped PSV $\boldsymbol{X}$ is turned into a latent volumetric representation $\boldsymbol{Y}$, then rendered into a latent novel view $\boldsymbol{Z}$ and finally upsampled into the novel view $\boldsymbol{\tilde{I}}_{\!\ast}$. The dark gray blocks are implemented with 2D convolutions and resblocks.
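
As an illustration of the caption's $\boldsymbol{X} \rightarrow \boldsymbol{Y} \rightarrow \boldsymbol{Z} \rightarrow \boldsymbol{\tilde{I}}_{\!\ast}$ pipeline, below is a hedged PyTorch sketch. All module names, channel widths, block counts, and downsampling factors are illustrative assumptions; the actual ConvGLR architecture is deeper and differs in detail. The input is the grouped PSV, e.g. the 5D volume from the earlier sketch reshaped so that each of the $D/G$ depth slices stacks $G \times N \times 3$ channels.

```python
# ConvGLR-style pipeline sketch (assumptions: layer sizes are ours, not the paper's).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain 2D residual block."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class ConvGLRSketch(nn.Module):
    """Grouped PSV X: (B, S, C, H, W) -> novel view (B, 3, H, W).

    S = D/G grouped depth slices, C = G * N * 3 stacked channels.
    H and W are assumed divisible by 4 (the latent-space scale factor here).
    """
    def __init__(self, n_slices, in_ch, latent=64):
        super().__init__()
        # X -> Y: shared low-resolution 2D encoder over each grouped slice.
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, latent, 3, stride=2, padding=1),
            nn.Conv2d(latent, latent, 3, stride=2, padding=1),
            ResBlock(latent), ResBlock(latent))
        # Y -> Z: global rendering, collapsing all depth slices jointly.
        self.render = nn.Sequential(
            nn.Conv2d(n_slices * latent, latent, 1),
            ResBlock(latent), ResBlock(latent))
        # Z -> image: upsample the latent novel view back to full resolution.
        self.upsample = nn.Sequential(
            nn.Conv2d(latent, 3 * 16, 3, padding=1), nn.PixelShuffle(4))

    def forward(self, x):
        b, s, c, h, w = x.shape
        y = self.encode(x.reshape(b * s, c, h, w))       # latent volume Y
        y = y.reshape(b, -1, y.shape[-2], y.shape[-1])   # stack slice features
        z = self.render(y)                               # latent novel view Z
        return self.upsample(z)                          # novel view
```

Note the design point the paper emphasizes: the reduction over depth (the rendering step proper) happens once, over all rays jointly, in a 4×-downsampled latent space, rather than per camera ray.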

Results

ILSH (ICCV 2023 challenge)

Generalizable RFF (known scenes)

Generalizable DTU (unknown scenes)