Implicit 3D scene representations can encode hundreds of high-resolution images in a compact format, enabling applications such as 3D reconstruction and photorealistic synthesis of new views. For example, a Neural Radiance Fields (NeRF) model is a neural network that learns to represent a 3D object from a set of images that capture the object from different perspectives. Unlike more conventional explicit representations such as meshes and point clouds, this approach can learn high-fidelity photometric characteristics (e.g., reflection and refraction).
Unfortunately, NeRFs have several drawbacks that prevent the use of implicit representations as a standard 3D data format for perception tasks. First, training an implicit network can be very slow. Second, inference is too slow to be integrated into a real-time application. Finally, the features learned implicitly by a NeRF model are scene-specific and cannot be transferred to other scenes. Recently, variants of NeRF, such as Plenoxels, have been proposed to overcome these problems. Plenoxels support fast training while maintaining a consistent representation of features across scenes. Specifically, Plenoxels represent a scene as a sparse 3D grid whose voxels store spherical harmonics coefficients, as shown in the following figure.
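To make the idea of a sparse grid with spherical harmonics more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the Plenoxels implementation: the class `SparseVoxelGrid`, its methods, the use of degree-2 harmonics (9 coefficients per color channel), and the `+ 0.5` offset with clamping are all choices made for this example. Real systems also interpolate between neighboring voxels, which is omitted here.

```python
import math

# Real spherical harmonics constants for degrees 0, 1, and 2 (9 basis functions).
C0 = 0.28209479177387814
C1 = 0.4886025119029199
C2 = (1.0925484305920792, -1.0925484305920792, 0.31539156525252005,
      -1.0925484305920792, 0.5462742152960396)


def sh_basis(direction):
    """Evaluate the 9 real SH basis functions for a viewing direction."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    return [
        C0,
        -C1 * y, C1 * z, -C1 * x,
        C2[0] * x * y,
        C2[1] * y * z,
        C2[2] * (2.0 * z * z - x * x - y * y),
        C2[3] * x * z,
        C2[4] * (x * x - y * y),
    ]


class SparseVoxelGrid:
    """Hypothetical sparse grid: only occupied voxels are stored in a dict."""

    def __init__(self):
        # (i, j, k) -> (density, [red_coeffs, green_coeffs, blue_coeffs])
        self.voxels = {}

    def set_voxel(self, index, density, sh_rgb):
        self.voxels[index] = (density, sh_rgb)

    def query(self, index, view_dir):
        """Return (density, rgb) for a voxel, or None if the voxel is empty."""
        entry = self.voxels.get(index)
        if entry is None:
            return None
        density, sh_rgb = entry
        basis = sh_basis(view_dir)
        # Assumption: color = clamp(sum(coeff * basis) + 0.5, 0, 1).
        color = tuple(
            min(1.0, max(0.0, sum(c * b for c, b in zip(channel, basis)) + 0.5))
            for channel in sh_rgb
        )
        return density, color


grid = SparseVoxelGrid()
# Only the constant (degree-0) red coefficient is set, so the color
# does not change with the viewing direction.
grid.set_voxel((1, 2, 3), 5.0, [[1.0] + [0.0] * 8, [0.0] * 9, [0.0] * 9])
print(grid.query((1, 2, 3), (0.0, 0.0, 1.0)))
print(grid.query((0, 0, 0), (0.0, 0.0, 1.0)))  # empty voxel
```

Because empty space is never stored, memory grows with the number of occupied voxels rather than with the full grid resolution, and the higher-order coefficients are what let a voxel's color vary with the viewing direction.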
Implicit representations like Plenoxels could be used for perception tasks such as classification and segmentation. Until now, however, there was no large-scale implicit representation dataset for such tasks. For this reason, using Plenoxels as a data format, a group of researchers from POSTECH, NVIDIA, and Caltech created the first two large-scale implicit datasets. Specifically, the authors converted the Common Objects in 3D (CO3D) and ScanNet datasets to Plenoxels, resulting in two new datasets: PeRFception-CO3D and PeRFception-ScanNet. These two datasets cover object-centric and scene-centric scenarios, respectively. In addition, they encode both 2D and 3D information in a unified form while exhibiting a high compression ratio compared to the original datasets.
CO3D is an object-centric dataset with 1.5 million camera-annotated images across 50 classes. It also includes reconstructed point cloud versions of the objects. The following figure compares some examples from the original dataset with PeRFception-CO3D. The Plenoxels version of the dataset achieves a compression rate of 6.94%.
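The article does not define how the compression rate is computed. One common reading, assumed here, is the percentage of storage saved relative to the original dataset. The helper `compression_rate` and the sizes below are hypothetical and chosen only to show the arithmetic; they are not the real dataset sizes.

```python
def compression_rate(original_size, compressed_size):
    """Percentage of storage saved, assuming rate = 100 * (1 - compressed / original)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (1.0 - compressed_size / original_size)


# Hypothetical sizes in GB, for illustration only.
print(round(compression_rate(1000.0, 36.0), 1))  # 96.4
```

Under this reading, a higher percentage means a smaller converted dataset relative to the original.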
The ScanNet dataset contains over 1.5K 3D scans of indoor scenes. However, a number of its images suffer from motion blur, so blurry images are removed before conversion to Plenoxels. The resulting PeRFception-ScanNet dataset achieves a significant compression ratio of 96.4%, highlighting the accessibility of the Plenoxels representation.
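The article does not say how blurry images are detected. A widely used heuristic, shown here purely as an assumption and not as the authors' method, is the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurry ones do not. The function names and the threshold value are hypothetical, and images are plain 2D lists of grayscale values to keep the sketch dependency-free.

```python
def laplacian_variance(image):
    """Variance of a 4-neighbor Laplacian over the interior pixels of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    responses = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            responses.append(
                image[r - 1][c] + image[r + 1][c]
                + image[r][c - 1] + image[r][c + 1]
                - 4 * image[r][c]
            )
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)


def keep_sharp(frames, threshold):
    """Keep the names of frames whose Laplacian variance reaches the threshold."""
    return [name for name, img in frames if laplacian_variance(img) >= threshold]


# Toy data: a checkerboard (many edges) and a flat gray image (no edges).
sharp = [[255 if (r + c) % 2 == 0 else 0 for c in range(5)] for r in range(5)]
flat = [[128] * 5 for _ in range(5)]
print(keep_sharp([("sharp", sharp), ("flat", flat)], threshold=100.0))
```

In practice the threshold depends on image resolution and content, so it would need to be tuned on a sample of frames.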
Finally, the authors also conducted several experiments, including 2D image classification, 3D object classification, and semantic segmentation of 3D scenes. These experiments showed that the two new datasets could efficiently encode 2D and 3D information into a unified, compressed data format.
This article is a research summary written by Marktechpost Staff based on the research paper 'PeRFception: Perception using Radiance Fields'. All credit for this research goes to the researchers on this project.
Luca is a Ph.D. student at the Computer Science Department of the University of Milan. His interests are machine learning, data analytics, IoT, mobile programming, and indoor positioning. His current research focuses on pervasive computing, context awareness, explainable AI, and human activity recognition in intelligent environments.