As next-generation sequencing and multi-omics technologies produce ever larger volumes of data, the scope for developing transfer learning approaches and neural networks applicable across diverse fields grows. A central remaining challenge is the creation of embeddings that represent this information in a biologically meaningful space. Engineering and characterizing a model that represents this information comprehensively and robustly could have an impact on the field comparable to that of word2vec on natural language processing.
The prospective PhD candidate is expected to develop a model capable of recapitulating the entire biology of selected organisms, and to characterize the resulting manifold in terms of gene ontologies and pathway enrichments. The resulting encoding capabilities should be usable by the scientific community at large and should enable significant advances in the field, such as image-to-RNA inference models and vice versa.
Through this project, we aim to take a significant step toward unifying diverse data types under a shared, biologically meaningful representational framework. The work will not only advance our understanding of biological systems but also herald a new wave of AI applications in bioinformatics.