Benchmarking acceleration in materials discovery with AI

I’ve been talking with a lot of people recently about the best approaches to benchmarking “acceleration” in materials discovery with AI.

There are few examples of this in the literature. Baselines are often either easy to establish but not highly relevant, such as a human with random sampling, or more relevant but requiring intensive effort, such as comparing a human with design of experiments (DOE) vs. a human with AI.

We have some ideas internally at the Acceleration Consortium about how to assess the value of the tools we build to accelerate time-to-discovery (time to advance the state of the art), such as:

  • Put the effort in to do proper benchmark studies against DOE and other methods
  • Track human time savings
  • Develop community challenge calls
  • Focus on showing maximum value-per-experiment with a given process and extrapolate that to accelerating time-to-discovery (a rough back-of-the-envelope sketch follows this list)
  • Just keep making cool discoveries that advance the state of the art and people won’t care about your benchmarking :slight_smile:
  • Discoveries with highly unusual pathways from ideation to realization (something a human would never really do in an experimental campaign)
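
As a rough illustration of the value-per-experiment point above, here is a back-of-the-envelope sketch. Every number below is made up for illustration; a real study would pull campaign lengths and per-experiment times from actual records.

```python
# Hypothetical extrapolation from "experiments saved" to "time saved"
# (all numbers are made up for illustration)
baseline_experiments = 600   # e.g., human + DOE campaign to reach a target property
ai_guided_experiments = 150  # e.g., human + AI campaign to reach the same target
hours_per_experiment = 4.0   # synthesis + characterization wall-clock time

acceleration_factor = baseline_experiments / ai_guided_experiments
hours_saved = (baseline_experiments - ai_guided_experiments) * hours_per_experiment
print(f"~{acceleration_factor:.1f}x fewer experiments, ~{hours_saved:.0f} instrument-hours saved")
```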

Would love to hear what the community thinks about this subject.


best approaches to benchmarking “acceleration” in materials discovery with AI

I think it’s worth separating the notion of discovering high-performing materials from discovering new mechanisms (somewhat of an exception is non-performance-focused discovery of new crystal structures or molecules, which could loosely be categorized as “novelty-focused”). I tend to be application-oriented, and it’s easier for me to talk about discovery from a performance-property perspective. Benchmarking “mechanism discovery” (which I’m broadly categorizing as more knowledge-driven than application-driven) is perhaps even more ambiguous and difficult to do robustly (the main, somewhat lukewarm examples that come to mind are symbolic regression studies that try to rediscover some known equation).

Baselines are often either easy to establish but not highly relevant, such as a human with random sampling, or more relevant but requiring intensive effort, such as comparing a human with design of experiments (DOE) vs. a human with AI

I really like the GA vs. BO interactive figure from physics x. This kind of comparison is rare to see, because most people (myself included) assume and accept that it’s unlikely for a genetic algorithm to outperform Bayesian optimization on a black-box optimization problem with a low budget (hundreds of experiments) and expensive experiments (minutes, days, weeks); a toy BO-vs-random benchmarking loop is sketched after the list below. I would also add:

  • People often benchmark using a single performance property, but that’s not realistic for materials discovery
  • When benchmarking, it’s usually against a carefully curated search space/dataset, without considering things like failure conditions
  • Sometimes random search is better than human intuition (DOI)
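
To make the “low budget, expensive experiments” point concrete, here is a minimal toy benchmarking loop comparing BO against random search, assuming scikit-optimize is available; the objective function, search space, and budget are made-up stand-ins for a real (much more expensive) experiment:

```python
# Toy "BO vs. random search" benchmark under a tight experiment budget
# (assumes scikit-optimize; the objective is a stand-in for a real experiment)
import numpy as np
from skopt import dummy_minimize, gp_minimize

def black_box(x):
    # Placeholder for an expensive experiment (minutes/days/weeks in reality)
    return (x[0] - 0.3) ** 2 + (x[1] + 0.1) ** 2 + 0.01 * np.random.randn()

space = [(-1.0, 1.0), (-1.0, 1.0)]
budget = 30  # even "hundreds of experiments" is generous for many campaigns

bo = gp_minimize(black_box, space, n_calls=budget, random_state=0)
rnd = dummy_minimize(black_box, space, n_calls=budget, random_state=0)

# Best-so-far traces are what most published benchmarks report
print("BO best-so-far:    ", np.minimum.accumulate(bo.func_vals)[-1])
print("Random best-so-far:", np.minimum.accumulate(rnd.func_vals)[-1])
```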

Track human time savings

It struck me that experimental throughput and human time savings aren’t completely separable from benchmarking the “smart” part of a self-driving lab (SDL); there’s a common tie through equipment uptime and utilization. Equipment may go idle because human intuition/planning takes time, whereas an AI algorithm is ready and willing to put that equipment to use (usually within minutes for most existing SDL campaigns, I think).

I’d also add:

  • People often benchmark using physics-based simulations OR experimental data, not both (and especially not on demand)
    The smart part of an SDL also can’t be fully disentangled from (high-throughput) physics-based simulations. Results from the simulations can be used directly in the decision-making algorithm, typically as ML features, or they can be used to help inform human intuition or act as a screening tool for experimentalists.
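
As a minimal sketch of the “simulation results as ML features” idea, assuming scikit-learn and NumPy are available; the “simulation” below is just a linear function and the “experiment” is that function plus noise, both fabricated for illustration:

```python
# Toy example of feeding a cheap physics-based simulation output into a
# data-driven model as an extra feature (all data/functions are fabricated)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
composition = rng.random((50, 3))                      # e.g., mole fractions
sim_output = composition @ np.array([1.0, -0.5, 0.2])  # stand-in for a DFT/FEM result
measured = sim_output + 0.1 * rng.standard_normal(50)  # "experimental" target

# Concatenate raw descriptors with the simulation result as model features
X = np.column_stack([composition, sim_output])
model = RandomForestRegressor(random_state=0).fit(X, measured)
print("train R^2:", model.score(X, measured))
```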

Just keep making cool discoveries that advance the state of the art and people won’t care about your benchmarking

i.e., allude to superiority implicitly by generating real-world value (high-risk, high-reward) vs. demonstrating superiority explicitly on known examples. I still lean towards the idea that “if you have the resources to benchmark something robustly, it isn’t a real-world problem.”


More concretely, I lean towards:

  • Judging optimization performance based on “thresholded multi-objective hypervolume improvement as a function of running cost” (not “best so far” single-objective traces as a function of iterations); see the sketch after the centaur quote below
  • Literature and database studies that span decades and try to create some sort of quantified baseline for the “rate of scientific discovery” and the quality thereof, with individual labels that describe the hardware and software tools involved relative to the domain expertise of the authors (i.e., are the authors expert practitioners of the tools who are also experts in DFT, design of experiments, high-throughput experimentation, etc., or are they “black box users” who are more likely to impede their own progress by attempting to apply them)
  • Learning about this kind of benchmarking from adjacent domains (you may have heard of centaur chess; there’s a nice paragraph related to the debate of human vs. AI vs. human+AI in Advanced chess - Wikipedia, and I took the liberty of replacing “chess” with “science” and “player/operator/etc.” with “scientist” below)

Centaur [science] is sometimes invoked to argue that humans will continue to remain relevant as AI progresses. U.S. Deputy Defense Secretary Robert O. Work invoked in 2016 the concept of “centaur warfighting”, extending the centaur concept beyond the [scientific] world.[19][20] Tyler Cowen and others assessed in 2013 that, due to [science AI] advances, it was getting difficult to see any major advantage to centaurs over computers by themselves in science, and that it seemed unlikely that centaurs would retain a significant advantage for much longer.[21] In contrast, as recently as 2017, Kasparov has stated that, given an appropriate [scientist], he is confident that a centaur team could outperform the top AI, while James Bridle states in 2018 that “an average [scientist] paired with an average AI is capable of beating the most sophisticated AI”.[22][23] A recent study has shown that AI in centaur [science] both substitutes traditional human skills and enables new complementary capabilities, providing suggestive evidence of how AI reshapes competitive dynamics in organizations.[24]
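
Circling back to the “thresholded hypervolume improvement vs. running cost” bullet above, here is a minimal sketch of what such a trace could look like, assuming BoTorch is available; the objectives, thresholds, and per-experiment costs below are all made up:

```python
# Toy trace of thresholded hypervolume vs. cumulative running cost
# (assumes BoTorch; the objectives, thresholds, and costs are all fabricated)
import torch
from botorch.utils.multi_objective.hypervolume import Hypervolume
from botorch.utils.multi_objective.pareto import is_non_dominated

torch.manual_seed(0)
n = 40
Y = torch.rand(n, 2, dtype=torch.double)         # two objectives, both maximized
cost = torch.rand(n, dtype=torch.double) * 10.0  # per-experiment cost ($, hours, ...)

thresholds = torch.tensor([0.2, 0.3], dtype=torch.double)  # minimum acceptable values
hv = Hypervolume(ref_point=thresholds)  # the reference point doubles as the threshold

for i in range(1, n + 1):
    Y_i = Y[:i]
    Y_i = Y_i[(Y_i > thresholds).all(dim=-1)]              # drop sub-threshold points
    pareto = Y_i[is_non_dominated(Y_i)] if len(Y_i) else Y_i
    volume = hv.compute(pareto) if len(pareto) else 0.0
    print(f"cumulative cost: {cost[:i].sum().item():7.2f}   hypervolume: {volume:.4f}")
```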


Some other random asides

  • Preference ranking via pairwise comparisons is a good way to assess/encode human intuition (i.e., the “score” becomes implicitly defined as an “intrinsic utility function”, with the idea that you can model the “latent utility function”). Copying the first paragraph of Bayesian optimization with pairwise comparison data | BoTorch below:

In many real-world problems, people are faced with making multi-objective decisions. While it is often hard to write down the exact utility function over those objectives, it is much easier for people to make pairwise comparisons. Drawing from utility theory and discrete choice models in economics, one can assume the user makes comparisons based on some intrinsic utility function and model the latent utility function using only the observed attributes and pairwise comparisons. In machine learning terms, we are concerned with object ranking here. This book has some more general discussions on this topic.
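
As a minimal illustration of modeling a latent utility from comparisons alone, here is a sketch using BoTorch’s PairwiseGP (assuming a recent BoTorch; the “true” utility and the random comparisons below are fabricated, whereas a real campaign would use rankings elicited from a human expert):

```python
# Sketch of fitting a latent utility from pairwise comparisons with BoTorch's
# PairwiseGP (the "true" utility and random comparisons below are fabricated)
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models.pairwise_gp import PairwiseGP, PairwiseLaplaceMarginalLogLikelihood

def latent_utility(X):
    # Hypothetical utility a scientist "feels" but would never write down
    return 1.0 - (X - 0.3).pow(2).sum(dim=-1)

X = torch.rand(20, 2, dtype=torch.double)   # candidate formulations / process settings
util = latent_utility(X)

# Build comparisons: each row [i, j] means "candidate i preferred over candidate j"
idx = torch.randint(0, X.shape[0], (30, 2))
idx = idx[idx[:, 0] != idx[:, 1]]           # drop self-comparisons
comps = torch.where((util[idx[:, 0]] > util[idx[:, 1]]).unsqueeze(-1), idx, idx.flip(-1))

# Fit a GP to the latent utility from comparisons alone (no scores ever observed)
model = PairwiseGP(X, comps)
mll = PairwiseLaplaceMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)

# The posterior mean of the latent utility can now rank unseen candidates
X_test = torch.rand(5, 2, dtype=torch.double)
print(model.posterior(X_test).mean.squeeze(-1))
```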

I can imagine an “AI for science value demonstration benchmark” that uses preference ranking to capture human intuition in a more quantifiable way (e.g., creating personas for different scientists through large preference-ranking datasets). I remember chatting with Abdoulatif Cisse from Andy Cooper’s group while he presented a poster (at Accelerate '24, I think?) that used preference BO to leverage human expertise within their BO campaign ([2308.11787] HypBO: Accelerating Black-Box Scientific Experiments Using Experts' Hypotheses). Preference BO is a rich field outside of AI for materials, but there have been a few examples here or there (Felix had a similar paper recently: [2501.15554] BoTier: Multi-Objective Bayesian Optimization with Tiered Composite Objectives).

There’s also been a recent upsurge in LLM+BO work. At Meta’s Adaptive Experimentation Workshop, I became less skeptical as I saw some compelling examples from top-notch folks.

Someone working with me over the summer is helping me set up an optimization benchmarking competition on Kaggle, which I’m excited about.

This is a topic I enjoy. Thanks for posting! Likewise curious to hear what others think.