In recent years, Multi-Modal Learning (MML) has attracted significant attention, driven by the growing availability of large multimodal datasets and the spread of robust internet services accessible from a wide range of devices. Current research largely focuses on leveraging deep learning techniques, building on existing foundation models that apply across diverse domains and adapting them to specific tasks within the MML framework.
The primary objective of this thesis is to advance the state of the art in MML by investigating cutting-edge neural architectures and learning approaches. This involves exploring innovative methods that potentially combine different modalities, such as voice and gestures, with the aim of improving the performance of existing single-modal systems, particularly in Automatic Speech Recognition (ASR) and Natural Language Processing (NLP).
A pivotal focus will be on investigating recent audio foundation models, specifically those designed for multilingual speech recognition, voice conversion, and speech generation for data augmentation. The overarching goal is to develop models that remain effective across various speech tasks, even when confronted with limited data or constrained computational resources. Furthermore, by integrating advanced audio generation techniques, the study seeks to strengthen the multimodal capabilities of the overall system, enabling more effective creation of synthetic data for model training and development.
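To make the intended use of speech generation for data augmentation concrete, the following minimal Python sketch builds synthetic (transcript, waveform) pairs for ASR training. It is an illustration under stated assumptions, not a description of the method developed in this thesis: the SpeechGenerator interface and its synthesize method are hypothetical placeholders for any speech-generation or voice-conversion foundation model, and the simple speed perturbation is a generic augmentation used only to show how synthetic utterances can be further diversified.

"""Minimal sketch of synthetic-data augmentation for ASR training.

Assumptions (not taken from the thesis text): `SpeechGenerator` is a
hypothetical interface standing in for a speech-generation model, and the
speed-perturbation step is a generic augmentation example.
"""
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple

import numpy as np


class SpeechGenerator(Protocol):
    """Hypothetical interface for a speech-generation foundation model."""

    def synthesize(self, text: str) -> np.ndarray:
        """Return a mono waveform (float32, fixed sample rate) for `text`."""
        ...


def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed by `factor` via linear interpolation
    (a cheap stand-in for proper resampling)."""
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, int(len(waveform) / factor))
    return np.interp(new_idx, old_idx, waveform).astype(np.float32)


def make_synthetic_pairs(
    generator: SpeechGenerator,
    transcripts: Iterable[str],
    speed_factors: Tuple[float, ...] = (0.9, 1.0, 1.1),
) -> List[Tuple[str, np.ndarray]]:
    """Create (transcript, waveform) training pairs: one synthesized
    utterance per transcript, each perturbed at several speeds."""
    pairs: List[Tuple[str, np.ndarray]] = []
    for text in transcripts:
        clean = generator.synthesize(text)
        for factor in speed_factors:
            pairs.append((text, speed_perturb(clean, factor)))
    return pairs


@dataclass
class DummyGenerator:
    """Toy generator used only to make the sketch runnable: emits a sine
    tone whose duration depends on the transcript length."""
    sample_rate: int = 16_000

    def synthesize(self, text: str) -> np.ndarray:
        duration = 0.05 * max(len(text), 1)
        t = np.linspace(0, duration, int(self.sample_rate * duration), endpoint=False)
        return np.sin(2 * np.pi * 220.0 * t).astype(np.float32)


if __name__ == "__main__":
    pairs = make_synthetic_pairs(DummyGenerator(), ["hello world", "multi-modal learning"])
    print(f"generated {len(pairs)} synthetic training pairs")

In practice, the dummy generator would be replaced by an actual multilingual speech-generation or voice-conversion model, and the resulting pairs would be mixed with real recordings when training the ASR component.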
The outcomes of this research are anticipated to contribute significantly to the development of highly competitive services tailored to the demands of evolving multimodal application scenarios.