Event

PhD defence of Joseph Szymborski – Addressing Pervasive Challenges in Generalizable Machine Learning Models of Protein-Protein Interaction

Monday, November 24, 2025, 13:00 to 15:00
McConnell Engineering Building Room 603, 3480 rue University, Montreal, QC, H3A 0E9, CA

Abstract

Protein-protein interactions (PPIs) underpin critical cellular functions, from metabolism to DNA repair, and their dysregulation drives diseases like cancer and viral infections. Traditional experimental methods for characterizing PPIs (e.g., yeast two-hybrid, affinity chromatography) are resource-intensive, requiring specialized equipment, expertise, and weeks of labour per experiment. Computational approaches offer scalable alternatives, enabling rapid inference of putative PPIs across entire proteomes. However, despite leveraging large PPI databases and deep learning advances, current models fail to generalize to out-of-distribution (OOD) proteins, limiting real-world applicability in disease research and biologic therapy design.

This thesis addresses persistent OOD generalization challenges through three interconnected studies. First, Chapter 3 introduces RAPPPID, a regularization-optimized PPI inference model that achieves state-of-the-art accuracy by the judicious application of regularization and SentencePiece tokenization, while exposing critical failures of contemporaneous models on unseen proteins. Second, Chapter 4 describes INTREPPPID, which leverages evolutionary orthology to enhance cross-species generalization. By embedding an "orthologous locality" loss term, it reshapes protein latent spaces to cluster evolutionarily related proteins, outperforming prior methods when models trained on human data predict PPIs in distant organisms (e.g., S. cerevisiae, D. melanogaster). Third, Chapter 5 identifies a pervasive, previously uncharacterized data leakage source in PPI models incorporating pre-trained protein language models (pLLMs). It establishes rigorous dataset curation protocols to eliminate leakage while maintaining performance and reveals fundamental generalization barriers between pLLM-based and non-pLLM architectures.

Collectively, these contributions advance robust PPI inference by directly confronting OOD limitations, introducing orthology-aware generalization, and addressing pervasive sources of data leakage, all critical for translating computational predictions into biological insights and therapeutic innovations.
