Foundation models have shown remarkable capabilities in generating text, images, and videos, along with rich world knowledge and some complex “reasoning” skills. However, these models remain passive (i.e., action-oriented aspects of intelligence are not leveraged), static, and data- and computation-expensive; they often confabulate and are difficult to align with human values.
This PhD project aims to make multimodal foundation models capable of interacting with other agents (e.g., enabling cooperative capabilities) and of interactions grounded in the world, endowing them with counterfactual reasoning and causal abilities as well as strengthening their alignment with human intentions and values. The PhD candidate is required to have prior experience with deep learning algorithms and a strong interest in transformer architectures, graph neural networks, multimodality, and cooperative and embodied AI. The selected student will be able to collaborate with the ELLIS network and, if selected, join the ELLIS PhD program, as well as collaborate with top universities and research centers such as MIT and Max Planck.