
Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool
We present GaussianVLM, the first 3D vision-language model (VLM) operating on Gaussian splats. Each Gaussian in the scene is enriched with language features, forming a dense, scene-centric representation. A novel dual sparsifier reduces ~40k language-augmented Gaussians to just 132 tokens while retaining task-relevant and location-relevant information. This enables open-vocabulary, detector-free reasoning and yields state-of-the-art performance on both scene-centric and object-centric embodied benchmarks.
Find more information on our project page: https://insait-institute.github.io/gaussianvlm.github.io/
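
To make the token budget concrete, the sketch below illustrates one way a two-branch sparsifier could reduce ~40k per-Gaussian feature vectors to 132 tokens. It is not the GaussianVLM architecture: the function name `dual_sparsify`, the 128 + 4 split between the two branches, the cosine-similarity scoring, and the nearest-neighbor mean pooling are all assumptions chosen for illustration, and the input features are random placeholders.

```python
import numpy as np

def dual_sparsify(features, positions, task_embedding, query_location,
                  num_task_tokens=128, num_location_tokens=4, group_size=64):
    """Hypothetical two-branch sparsifier (illustrative, not the paper's method).

    features:       (N, D) per-Gaussian language features
    positions:      (N, 3) Gaussian centers
    task_embedding: (D,)   embedding of the task prompt
    query_location: (3,)   point of interest in the scene
    """
    # Task branch (assumed): keep the Gaussians whose features are most
    # similar to the task embedding under cosine similarity.
    feats_norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    task_norm = task_embedding / (np.linalg.norm(task_embedding) + 1e-8)
    scores = feats_norm @ task_norm
    top_idx = np.argsort(-scores)[:num_task_tokens]
    task_tokens = features[top_idx]

    # Location branch (assumed): take the Gaussians nearest to the query
    # location, split them into distance-ordered groups, and mean-pool each.
    dists = np.linalg.norm(positions - query_location, axis=1)
    nearest = np.argsort(dists)[: num_location_tokens * group_size]
    groups = np.array_split(nearest, num_location_tokens)
    location_tokens = np.stack([features[g].mean(axis=0) for g in groups])

    # Concatenate both branches into a single short token sequence.
    return np.concatenate([task_tokens, location_tokens], axis=0)

# Placeholder scene: 40,000 Gaussians with random 32-dimensional features.
rng = np.random.default_rng(0)
num_gaussians, dim = 40_000, 32
features = rng.normal(size=(num_gaussians, dim))
positions = rng.uniform(-5.0, 5.0, size=(num_gaussians, 3))

tokens = dual_sparsify(features, positions, rng.normal(size=dim), np.zeros(3))
print(tokens.shape)  # (132, 32)
```

With these defaults the output length is fixed at 128 + 4 = 132 tokens regardless of how many Gaussians the scene contains, which is the property that keeps the downstream language model's input short.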