3D scene understanding is an active area of computer vision research, with applications ranging from augmented reality to autonomous navigation. This PhD position focuses on foundation models for 3D scene understanding. 3D scenes can be reconstructed from image collections using Structure from Motion (SfM) or Simultaneous Localisation and Mapping (SLAM). Because these methods link image pixels to 3D points, foundation-model representations extracted from the images can be transferred to the 3D domain through pixel-to-point correspondences and then queried via language-model prompting. However, these representations have been optimised for 2D reasoning. The PhD candidate will explore novel approaches to disentangle object-level information in the 2D domain and fuse it in 3D, thereby enabling 3D reasoning capabilities.
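To make the pixel-to-point transfer concrete, the following is a minimal sketch rather than the method to be developed in the project. It assumes a pinhole camera model, known per-view intrinsics and world-to-camera poses, and dense per-pixel feature maps; the function name `lift_features_to_points` and all of its parameters are hypothetical. Correspondences are obtained here by reprojecting each 3D point into every view (an SfM pipeline could instead reuse its feature tracks), occlusion is ignored, and features are fused by simple averaging.

```python
import numpy as np


def lift_features_to_points(points, feature_maps, intrinsics, world_to_cam):
    """Average 2D features over every view in which a 3D point projects inside the image.

    points:        (N, 3) array of 3D points in world coordinates
    feature_maps:  list of (H, W, C) per-pixel feature arrays, one per view
    intrinsics:    list of (3, 3) pinhole camera matrices, one per view
    world_to_cam:  list of (4, 4) rigid transforms, one per view
    Returns the (N, C) fused features and an (N,) mask of points seen at least once.
    """
    n_points = points.shape[0]
    n_channels = feature_maps[0].shape[2]
    feature_sum = np.zeros((n_points, n_channels))
    view_count = np.zeros(n_points)
    homogeneous = np.hstack([points, np.ones((n_points, 1))])

    for fmap, K, T in zip(feature_maps, intrinsics, world_to_cam):
        height, width, _ = fmap.shape
        cam = (T @ homogeneous.T).T[:, :3]  # points in the camera frame
        in_front = cam[:, 2] > 1e-6  # discard points behind the camera
        depth = np.where(in_front, cam[:, 2], 1.0)  # avoid dividing by zero
        pix = (K @ cam.T).T
        u = np.round(pix[:, 0] / depth).astype(int)  # pixel column
        v = np.round(pix[:, 1] / depth).astype(int)  # pixel row
        # Assumption: no occlusion test, so any in-bounds projection counts.
        visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        feature_sum[visible] += fmap[v[visible], u[visible]]
        view_count[visible] += 1

    seen = view_count > 0
    feature_sum[seen] /= view_count[seen, None]
    return feature_sum, seen
```

Averaging treats every view equally and mixes features from different objects whenever a point is occluded in a view, which is one simple illustration of why representations optimised for 2D reasoning do not transfer cleanly to 3D.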