On improving experimental binding affinity predictions with synthetic data
Published in bioRxiv, 2026
Recommended citation: Ryczko, Kevin; Zin, Phyo Phyo; Crivelli-Decker, Jordan; Le, Ly; Jha, Punit K.; Shields, Benjamin J.; Lemos, Pablo; Bandi, Sasaank; van Damme, Maarten; Sood, Amogh; Huntington, Lee; Pitman, Mary; Ganahl, Martin; Bortolato, Andrea. "On improving experimental binding affinity predictions with synthetic data", bioRxiv, 2026. https://doi.org/10.64898/2026.03.02.708607
The success of deep learning binding affinity prediction models depends critically on expanding experimental data with reliable synthetic data. We extend the Structurally Augmented IC50 Repository (SAIR) with approximately 80K absolute free energy perturbation (AFEP) calculations and present two distinct data splits, SAIR-FEP and SAIR-OOD (out-of-distribution), to simulate realistic drug discovery scenarios. We show that filtering for high-confidence co-folded complexes improves performance predictably, whereas training indiscriminately on all complexes yields no performance gain.
