Boltz-2: Two Months In - Hype vs. Reality From a Drug Discovery Perspective
Figure: Comparison of a variety of methods on the PL-REX binding-affinity benchmark.

If you've been following computational drug discovery over the past two months, you've undoubtedly witnessed the remarkable surge of excitement, debate, and content surrounding Boltz-2. The launch of this groundbreaking AI cofolding model didn't merely create ripples - it unleashed an unprecedented deluge of posts, articles, and benchmarks across the scientific community. LinkedIn feeds became so inundated with expert analyses and preliminary findings that I found myself resorting to the most fitting solution: deploying an AI agent to systematically process the overwhelming discourse - an appropriately recursive solution for an era-defining tool.
This extraordinary response was ignited when MIT and Recursion unveiled their open-source model with audacious claims: it represented the first AI system to rival the precision of the industry's gold-standard Free Energy Perturbation (FEP) simulations for predicting protein-ligand binding affinity, while delivering results over 1,000 times faster. This breakthrough promised to eliminate a fundamental bottleneck in pharmaceutical research, potentially transforming how scientists identify and prioritize promising drug candidates.
The adoption was both immediate and comprehensive. According to Recursion's LinkedIn post, Boltz-2 had been downloaded over 170,000 times by more than 41,500 users, while leading platforms, including Rowan Scientific, NVIDIA, Tamarind Bio, and DeepMirror, swiftly incorporated it into their systems, effectively democratizing access to this sophisticated technology. Now, two months after its debut, the initial euphoria has encountered rigorous independent evaluations, real-world testing, and measured scientific scrutiny. The moment has arrived for an honest assessment: what insights have emerged regarding Boltz-2's genuine capabilities?
What got everyone excited?
The launch of Boltz-2 by MIT and Recursion was met with considerable excitement, not merely because it was another entry in the field of structural biology AI, but because it promised to tackle a long-standing, critical challenge in drug discovery. It was heralded as the first major open-source tool to holistically unify the prediction of a protein-ligand complex's 3D structure with its binding affinity. And, of course, binding affinity is a crucial metric that directly correlates with a drug's potency and therapeutic effectiveness. The initial buzz was fueled by several key innovations and bold claims:
- A novel affinity prediction module: At its core was a new affinity prediction component. Trained on millions of curated binding measurements from public databases and augmented with data from molecular dynamics (MD) simulations, this module was engineered to predict binding affinity with high accuracy directly from the generated molecular structures. This was a significant leap, as previous models like AlphaFold3 and Boltz-1, while powerful at structure prediction, lacked a reliable, built-in mechanism to quantify this vital functional property.
- Unprecedented speed: Boltz-2 promised to deliver FEP-level accuracy at a fraction of the computational cost. The developers claimed it was over 1,000 times faster than traditional Free Energy Perturbation (FEP) calculations, capable of producing a result in about 20 seconds on a single GPU, compared to the hours or even days required for physics-based simulations. This dramatic acceleration made large-scale virtual screening of millions of compounds computationally feasible for the first time.
- Enhanced controllability and physical realism: The model introduced a suite of features giving scientists more granular control over predictions. This included "method conditioning," which allows users to bias predictions toward conformations typical of specific experimental techniques like X-ray crystallography or NMR. It also supported the use of templates and user-defined geometric constraints to guide the folding process. Furthermore, a feature called "Boltz-Steering" was integrated to apply physics-based potentials during inference, reducing the likelihood of steric clashes and other chemically implausible artifacts in the final structures.
- A commitment to open-source science: In a landscape where leading models were becoming increasingly proprietary, Boltz-2 was released under a permissive MIT license (code available at the jwohlwend/boltz GitHub repository). This meant the model, its weights, and the training code were made freely available for both academic and commercial use, fostering rapid, widespread adoption and inviting community-led validation and extension.
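For orientation, a minimal affinity-prediction run can be sketched as follows. The YAML schema below is based on the examples documented in the jwohlwend/boltz repository at the time of writing; the protein sequence and SMILES string are arbitrary placeholders, and field names and CLI flags should be checked against the current README before use.

```python
# Sketch of a Boltz-2 affinity-prediction input file, following the
# schema shown in the jwohlwend/boltz repository examples. The sequence
# and SMILES below are placeholders, not a real drug-discovery system.
input_yaml = """\
version: 1
sequences:
  - protein:
      id: A
      sequence: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
  - ligand:
      id: B
      smiles: CC(=O)Oc1ccccc1C(=O)O
properties:
  - affinity:
      binder: B
"""

with open("complex.yaml", "w") as fh:
    fh.write(input_yaml)

# The prediction itself would then be launched from the command line,
# e.g. `boltz predict complex.yaml --use_msa_server` (verify flags
# against the repository README); it is not executed here.
print("wrote complex.yaml")
```

The key point of the schema is the `properties` block: listing an `affinity` entry with the ligand's chain id as `binder` is what requests a binding-affinity prediction in addition to the structure.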
The initial results presented by the developers were compelling, showcasing strong performance on the standard FEP+ benchmark and outperforming all other methods in the blind CASP16 affinity prediction challenge, further cementing its status as a potential game-changer.
The reality check: independent evaluations and community feedback
Once Boltz-2 was released into the wild, the scientific community began to put it through its paces, moving beyond the curated benchmarks of the initial publication. As independent academic labs, industry consortia, and individual researchers subjected the model to real-world drug discovery challenges, a more nuanced and granular understanding of its capabilities began to emerge, separating the initial hype from the on-the-ground reality.
Structural and pose prediction: a tale of two target types
A clear consensus has formed that Boltz-2's accuracy in predicting the three-dimensional structure of a protein-ligand complex is highly contingent on the intrinsic properties of the protein target. This has created a distinct performance dichotomy.
- Strengths: In scenarios involving structurally stable, well-characterized proteins with rigid binding pockets, Boltz-2 has demonstrated excellent performance. Independent tests, such as a comprehensive analysis by DeepMirror, found that for targets like KRAS and the SARS-CoV-2 main protease, the model can accurately recapitulate the overall protein fold and correctly place the ligand within the catalytic pocket. In these examples, where high-resolution experimental data is often abundant in public databases, Boltz-2 reliably generates high-quality poses that are useful for structure-based design.
- Weaknesses: However, the model's performance falters significantly when confronted with the complex phenomenon of "induced fit," where a protein must undergo substantial conformational changes to accommodate a ligand. DeepMirror indicates that in evaluations on more flexible or allosteric targets like PI3K-α, cGAS, and WRN helicase, Boltz-2 often failed to predict these crucial structural rearrangements. Instead of molding the protein to the ligand, the model tended to revert to a known unbound conformation, resulting in the ligand being misplaced or forced into chemically implausible poses. Furthermore, a persistent challenge relates to stereochemical fidelity. While an improvement over its predecessors, independent benchmarks reveal that Boltz-2 can still generate structures with incorrect chirality at key stereocenters and non-trivial deviations in bond lengths and angles, necessitating careful manual inspection by experienced chemists.
Binding affinity prediction: good at ranking, but not yet a ruler
Arguably, the most heralded feature of Boltz-2 was its ability to predict binding affinity. Here, the reality has proven to be multifaceted, highlighting its utility for certain tasks while underscoring its limitations for others.
- Strengths: A consistent finding across multiple analyses is Boltz-2's utility in qualitative ranking. It can often effectively discern which analogues in a chemical series are likely to be more or less potent, making it a valuable tool for prioritizing compounds during hit-to-lead campaigns. An evaluation using the large-scale Uni-FEP benchmark from Atombeat demonstrated robust performance across 15 diverse protein families, including historically challenging targets like GPCRs and kinases. Perhaps most interestingly, the model demonstrated an ability to overcome known data labeling errors within the public ChEMBL database, correctly predicting affinities despite being fed noisy data. This may suggest some degree of genuine learning of underlying biophysical principles, rather than pure data memorization.
- Weaknesses: Despite its ranking capabilities, its performance in predicting absolute, quantitative binding affinities has been met with more critical appraisal. An analysis by Semen Yesylevskyy (RECEPTOR.AI), on the PL-REX dataset, designed to test fine-grained ranking, concluded that Boltz-2 represents "only an incremental improvement" over existing methods, not a revolutionary leap. His findings showed it was only 5-7% better than the closest ML competitor. Another general trend observed by Xi Chen and coworkers is Boltz-2's tendency to predict binding affinities within a narrow range, often within 2 kcal/mol. They found that for numerous targets where the experimental binding affinities spanned a much wider range, Boltz-2's predictions would cluster near the mean, a behavior known as "regressing to the center." A significant and well-documented blind spot is its inability to account for systems involving critical "buried" water molecules, which often mediate essential hydrogen-bond networks within a binding site. As the Uni-FEP benchmark analysis noted, this remains the "'final stronghold of FEP'," vividly illustrating a fundamental gap where physics-based methods still hold a decisive advantage.
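The "regressing to the center" behavior is worth unpacking, because it explains how a model can rank compounds well while remaining a poor absolute ruler. The toy numbers below are purely illustrative, not taken from any Boltz-2 benchmark: predictions that compress the true dynamic range toward the mean can preserve rank order perfectly, yet still carry a large absolute error.

```python
import math

# Illustrative toy data (kcal/mol), not real benchmark values:
# six true binding free energies spanning a 6 kcal/mol range.
true_dg = [-12.0, -10.0, -9.0, -8.0, -7.0, -6.0]
mean_dg = sum(true_dg) / len(true_dg)

# A "center-regressing" model: predictions retain only 25% of each
# compound's deviation from the mean, compressing the dynamic range.
pred_dg = [mean_dg + 0.25 * (dg - mean_dg) for dg in true_dg]

def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_dg, true_dg)) / len(true_dg))

print(f"Spearman rho: {spearman(true_dg, pred_dg):.2f}")  # ranking is preserved
print(f"RMSE: {rmse:.2f} kcal/mol")                       # absolute error is large
```

With these numbers the rank correlation is a perfect 1.0 while the RMSE is roughly 1.5 kcal/mol, which is why ranking metrics alone can make a range-compressed predictor look deceptively strong.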
Generalization or memorization?
Beyond the specific technical merits and limitations, perhaps the most critical and far-reaching discussion within the community centers on a fundamental question of AI: the dichotomy between genuine generalization and mere memorization. The practical utility of any predictive model hinges on its ability to perform accurately on new, unseen data, and it is on this point that Boltz-2 faces its most significant scrutiny.
This concern, articulated by experienced computational chemists like Pat Walters (Relay Therapeutics) and John Taylor (Cancer Research Horizons), revolves around the potential for data leakage, i.e., a subtle but significant issue where cryptic overlaps between the vast public training corpus (drawn from sources like PDBbind and ChEMBL) and the supposedly independent benchmark sets could artificially inflate performance metrics. The fear is that the model may not be learning the fundamental physics of binding, but rather recognizing molecular fragments, protein pockets, or entire complexes it has effectively seen during its training phase. John Parkhill (Terray) echoed this sentiment, noting a specific bug his team had previously fixed in their internal models that appears to persist in Boltz-2:
"because of the vast data imbalance in the public data, inferring across uniform chemical space yields an obviously unrealistic distribution..."
The most compelling evidence lending credence to this concern is the observation that Boltz-2’s performance deteriorates markedly when evaluated on private, internal datasets from pharmaceutical companies compared to its results on public benchmarks. These proprietary datasets represent a truer test of generalization, as they contain novel chemotypes and target variations that are not in the public domain. This performance drop suggests that the model may be functioning more as a sophisticated pattern-recognition engine, adept at interpolating within the familiar chemical and structural space of its training data, rather than truly generalizing the underlying principles of molecular recognition required to extrapolate to genuinely novel targets.
This issue is compounded by a starkly practical limitation revealed in early user reports: a high false-positive rate, estimated to be around 40%. In a drug discovery context, this is a substantial figure. It means that for every ten compounds the model flags as a promising "hit," four are likely to be duds upon experimental testing. This places a heavy burden of validation on discovery teams and tempers the notion of Boltz-2 as a flawless filter.
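The practical cost of a roughly 40% false-positive rate is easy to quantify. The back-of-the-envelope calculation below uses assumed numbers, for illustration only, to show what that rate means for an experimental follow-up budget:

```python
import math

# Back-of-the-envelope cost of a ~40% false-positive rate among
# flagged hits. All numbers here are illustrative assumptions.
fp_rate = 0.40           # fraction of flagged "hits" that turn out to be duds
flagged = 200            # compounds the model flags for experimental follow-up

true_hits = flagged * (1 - fp_rate)
duds = flagged * fp_rate
print(f"Of {flagged} flagged compounds: ~{true_hits:.0f} real hits, ~{duds:.0f} duds")

# Conversely: how many compounds must be assayed to confirm 50 real hits?
wanted = 50
to_assay = math.ceil(wanted / (1 - fp_rate))
print(f"To confirm {wanted} hits, expect to assay ~{to_assay} compounds")
```

Even with a generous 60% precision, the wet-lab workload inflates by roughly two thirds relative to a perfect filter, which is why early users treat flagged hits as an enriched pool to validate, not a final answer.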
Ultimately, these factors converge on a single, unequivocal conclusion: the predictions generated by Boltz-2 cannot be taken at face value and necessitate a rigorous, non-negotiable process of experimental validation. The model is not a substitute for the laboratory bench but rather a tool to guide it, underscoring that trust in its output must be earned through empirical evidence, not simply assumed based on benchmark performance.
The verdict after two months: a powerful tool, not a magic bullet
Two months post-launch, the initial fervor surrounding Boltz-2 has subsided, giving way to a more sober and pragmatic assessment from the scientific community. The verdict is now clear: unquestionably, Boltz-2 represents a landmark engineering achievement and a valuable addition to the computational chemist's toolkit. However, the emerging consensus is that it is a powerful instrument within a larger orchestra, not a solo performer capable of rendering the ensemble obsolete. And, of course, it is not the "FEP-killer" that some of the initial, more sensational headlines had proclaimed. The chasm between the initial hype and the current reality can be summarized as follows:
- The hype: The narrative at launch positioned Boltz-2 as a revolutionary, paradigm-shifting instrument that would effectively make gold-standard methods like Free Energy Perturbation obsolete. It was presented as having definitively cracked the binding affinity problem, offering a one-shot solution for accurate prediction at unprecedented speeds.
- The reality: The reality, as borne out by extensive independent testing, is more nuanced yet arguably more practical. Boltz-2 is a powerful, fast, and significant incremental advance. Its primary value lies in its function as a high-throughput screening and prioritization engine. It excels at rapidly sifting through vast virtual libraries to identify promising candidates, but its predictive fidelity is intrinsically linked to the biophysical characteristics of the target and, crucially, its resemblance to the data upon which the model was trained.
Consequently, the scientific community has largely converged on a vision where Boltz-2 complements, rather than substitutes, more computationally intensive, physics-based methods. This has given rise to a highly practical and increasingly popular workflow concept known as "Affinity funneling". In this tiered approach, Boltz-2’s unparalleled speed is leveraged at the top of the funnel to rapidly screen immense virtual libraries, potentially containing millions of compounds. This initial stage filters the vast chemical space down to a more manageable, enriched set of several thousand promising candidates. It is only at this stage, on this curated subset, that the more resource-intensive "gold standard" methods like NES or FEP are deployed to provide a final, rigorous assessment of binding affinity. This synergistic workflow ensures that precious computational resources are allocated with maximum efficiency, combining the breadth and speed of AI with the depth and physical rigor of traditional simulations.
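The tiered idea can be sketched in a few lines. In the toy funnel below, `cheap_score` stands in for a fast ML scorer like Boltz-2 and `expensive_score` for a rigorous physics-based method like FEP; both are hypothetical stand-ins with made-up scores, and the library size and cutoff are arbitrary:

```python
import random

def cheap_score(compound):
    """Stand-in for a fast ML scorer (e.g., a cofolding model).
    Here: a deterministic pseudo-score derived from the compound id."""
    random.seed(compound)
    return random.uniform(-12.0, -4.0)

def expensive_score(compound):
    """Stand-in for a rigorous physics-based method (e.g., FEP).
    In a real funnel each call is hours, not microseconds."""
    random.seed(compound + 1)
    return random.uniform(-12.0, -4.0)

def affinity_funnel(library, keep_top):
    """Score everything cheaply, keep the best `keep_top` compounds,
    then rescore only that enriched subset with the expensive method."""
    triaged = sorted(library, key=cheap_score)[:keep_top]
    return sorted((expensive_score(c), c) for c in triaged)

library = list(range(100_000))   # toy "virtual library" of compound ids
shortlist = affinity_funnel(library, keep_top=100)
print(f"Expensive method run on {len(shortlist)} of {len(library)} compounds")
```

The design point is simply that the expensive method's cost scales with the size of the shortlist, not the library: in a real campaign the cheap stage might reduce millions of compounds to a few thousand before any FEP calculation is launched.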
A future trajectory defined by critical evaluation
The open and rigorous scrutiny that Boltz-2 has undergone is not a sign of its failure, but rather a testament to the robust, self-correcting nature of the scientific community. The constructive discourse has illuminated a clear trajectory for future development, with research efforts now coalescing around addressing the model's well-documented limitations. The key frontiers for progress include:
- Improving generalization and escaping the training set. This is arguably the most critical challenge. Overcoming it will demand the curation of more diverse, higher-quality, and multi-modal training data. This includes not only incorporating more examples of highly flexible proteins and allosteric systems but also finding ways to integrate proprietary data from industrial sources, perhaps through novel federated learning or data-sharing consortia (OpenBind? Target2035?), to ensure the model can perform reliably on more exotic chemical matter.
- Enhancing biophysical realism: To unlock utility across a wider spectrum of biological targets, future iterations must move beyond a simplistic protein-ligand view. Explicitly modeling the influence of essential cofactors, metallic ions, and, most importantly, structurally significant water molecules will be indispensable. These elements are often critical determinants of molecular recognition and binding affinity, and their current omission represents a fundamental ceiling on the model's accuracy.
- Implementing hybrid methodologies: A highly promising avenue of research involves the synergistic integration of AI speed with physical rigor. As with docking (see my previous article), rather than viewing ML and physics-based simulations as competitors, the focus is shifting to hybrid workflows. Early studies are demonstrating that using Boltz-2 to generate initial structural hypotheses, which are then refined and re-scored using MD simulations, can substantially mitigate the model's physical implausibilities and improve predictive accuracy.
Boltz-2 has indeed democratized access to structure-informed drug design and has established a new, formidable benchmark for open-source AI tools. While the initial, unbridled hype has been rightly tempered by a healthy dose of empirical reality, its release has catalyzed a global conversation and tangibly accelerated progress in the field. It stands as a powerful instrument that, when wielded with prudence and an acute awareness of its operational boundaries, will help to streamline the long and arduous path of discovering new medicines.
References
[1] Dozens of LinkedIn posts published in June-August 2025.
[2] DeepMirror. "Evaluating Boltz-2 on Real Drug Targets: Does it work?" DeepMirror Blog, June 17, 2025. https://www.deepmirror.ai/post/boltz-2-real-drug-targets.
[3] S. Passaro, G. Corso, J. Wohlwend, M. Reveiz, S. Thaler, V. R. Somnath, N. Getz, T. Portnoi, J. Roy, H. Stark, D. Kwabi-Addo, D. Beaini, T. Jaakkola, and R. Barzilay, "Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction," bioRxiv, June 18, 2025. DOI: 10.1101/2025.06.14.659707.
[4] R. Zou, L. Wang, X. Wang, Y. Ding, and H. Zheng, "Breaking Barriers in FEP Benchmarking: A Large-Scale Dataset Reflecting Real-World Drug Discovery Challenges," ChemRxiv, July 15, 2025. DOI: 10.26434/chemrxiv-2025-rf8mf.
[5] C. Wagen and A. Wagen, "The Boltz-2 FAQ," Rowan Scientific Blog, June 9, 2025. https://rowansci.com/blog/boltz2-faq.