A question posed by @BRSutherland that I thought warranted surfacing here:
Is there some level of dimensionality where BO starts to fail? If data regime stays small and dimensionality continues to increase, at some point there are no optimization algorithms that will work efficiently and you should redesign your campaign.
My response:
As you pointed out, the curse of dimensionality is a general problem, not something specific to BO. My perhaps dissatisfying answer is that I recommend choosing a model that tends not to overfit in high dimensions. As a recent example, there is a project optimizing a chemical reaction. There are four ingredients, one processing parameter (time), and one output. I jumped in recently to support, and they’ve collected ~40 datapoints so far. For this specific dataset, if I choose a model that tends not to overfit, the next batch of suggestions does a better job of exploring the boundaries of the space, which were largely unexplored in the original 40 datapoints. An alternative would be to simply run quasi-random search to get the same behavior, but the former is likely to be more sample-efficient.
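To make "a model that tends not to overfit" concrete, here is a minimal numpy sketch (all numbers hypothetical, loosely mirroring the ~40-point, 5-input setup above). The same kernel-ridge model is fit twice: a rigid version with a long lengthscale and real regularization, and a flexible version with a short lengthscale and near-zero regularization that interpolates the training noise and collapses away from the data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 5, 40, 200          # ~40 points, 5 inputs (hypothetical)

X = rng.uniform(0, 1, (n_train, d))
X_test = rng.uniform(0, 1, (n_test, d))
w = rng.normal(size=d)
truth = lambda Z: Z @ w                   # assumed smooth ground truth
y = truth(X) + 0.1 * rng.normal(size=n_train)

def rbf(A, B, lengthscale):
    """Squared-exponential kernel matrix between row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale**2))

def kernel_ridge_rmse(lengthscale, lam):
    """Fit kernel ridge on (X, y), return test RMSE against the truth."""
    K = rbf(X, X, lengthscale)
    alpha = np.linalg.solve(K + lam * np.eye(n_train), y)
    pred = rbf(X_test, X, lengthscale) @ alpha
    return float(np.sqrt(((pred - truth(X_test)) ** 2).mean()))

rigid = kernel_ridge_rmse(lengthscale=2.0, lam=1e-2)     # resists overfitting
flexible = kernel_ridge_rmse(lengthscale=0.1, lam=1e-8)  # fits the noise

print(f"rigid test RMSE:    {rigid:.3f}")
print(f"flexible test RMSE: {flexible:.3f}")  # much worse out of sample
```

The rigid model's suggestions near unexplored boundaries come from a fit that actually generalizes; the flexible model's posterior is essentially meaningless away from the 40 observed points.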
The rule of thumb is that BO handles up to a couple dozen parameters, but for problems with certain structure (especially when only a subset of parameters truly matters), it’s been shown to do OK with a hundred or more dimensions. For example: https://arxiv.org/pdf/2103.00349.
128 hyperparameters is a lot – way more than our standard GPEI algorithm can handle comfortably
Within aluminum alloys, there are ~23 unique periodic elements represented in common mixtures. Would it make sense to expand the search space to all non-radioactive elements? Probably not. Could a high-dimensional algorithm quickly determine that those other elements aren’t relevant? Maybe. Would that be a moot point because many of those configurations wouldn’t be synthesizable or measurable (i.e., you’d never get to the point of running a tensile test)? Probably.
However, I would lean away from restricting the space to fewer than those 23 elements, and would instead focus on finding suitable lower and upper bounds for them and homing in on things like processing conditions and purity. Likewise, I would spend the extra computation on “fully Bayesian” models and more computationally expensive models like SAASBO (I would rather wait a couple of hours for an algorithm to suggest a higher-quality experiment before running something in the lab; experiments carry a lot of hidden costs compared with running an algorithm).
Great points from sgbraid. I’d add some framing that might help ground the discussion in a machine learning perspective.
“Bayesian” here means models that output distributions, not point estimates. This is what enables the acquisition function to balance exploration vs. exploitation. Bayesian models do tend to overfit slightly less than their point-estimate counterparts, but honestly, the difference isn’t the main story.
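A toy illustration of why distributional output matters: Expected Improvement (one common acquisition function, the "EI" in GPEI) needs both a posterior mean and a posterior spread, and it rewards either a better mean or a larger uncertainty. A minimal sketch, using a minimization convention and made-up Gaussian posterior values:

```python
import math

def expected_improvement(mu, sigma, best):
    """EI for minimization, given a Gaussian posterior N(mu, sigma^2)
    at a candidate point and the best observed value so far."""
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))      # P(improve on best)
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * Phi + sigma * phi

best = 1.0
# exploration: same predicted mean, but more posterior uncertainty wins
print(expected_improvement(1.0, 0.5, best), ">", expected_improvement(1.0, 0.1, best))
# exploitation: same uncertainty, but a better predicted mean wins
print(expected_improvement(0.5, 0.2, best), ">", expected_improvement(0.9, 0.2, best))
```

A point-estimate model has no `sigma` to feed in, so the exploration half of this trade-off simply isn't available to it.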
The deeper issue is the bias-variance tradeoff:
Variance (overfitting): the model is too flexible, fits noise in the training data, and generalizes poorly to new points
Bias (underfitting): the model is too rigid, misses real structure in the data
In high dimensions with limited data, variance dominates — you simply don’t have enough coverage of the space
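The coverage problem is easy to see numerically: hold the budget fixed and the nearest neighbor of each point drifts out toward the typical inter-point distance as dimensionality grows, i.e. no point has any nearby data left. A quick numpy sketch with a hypothetical 40-point budget:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40                                    # a small, fixed experimental budget

ratios = {}
for d in (2, 5, 20, 100):
    X = rng.uniform(0, 1, (n, d))
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)           # exclude self-distances
    nearest = D.min(axis=1).mean()        # avg nearest-neighbor distance
    typical = D[np.isfinite(D)].mean()    # avg pairwise distance
    ratios[d] = float(nearest / typical)
    print(f"d={d:3d}  nearest/typical = {ratios[d]:.2f}")
```

As the ratio approaches 1, "interpolate from nearby observations" stops being meaningful, which is exactly when a flexible model's variance blows up.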
The dimensionality at which overfitting becomes a problem depends on the problem, the amount of data, how well that data covers the space, and the model choice. There are no absolute rules without making these factors precise.
That said, some general trends hold:
More rigid models reduce variance. Fitting a linear model, adding regularization (L1/L2), or restricting kernel lengthscales are all ways to impose structure and resist overfitting.
The risk is imposing the wrong bias. A line is only a good model if the underlying response is approximately linear.
Physics-informed / domain-informed models are a powerful middle ground — they introduce bias in the form of known mechanistic structure, which is much more likely to be “correct” bias than generic regularization.
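The "wrong bias" risk in numbers: fit a line to a curved response and the error never goes away no matter how clean the data; add the right structural term and a few points suffice. A minimal numpy sketch on a made-up quadratic response:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
y = x**2 + 0.05 * rng.normal(size=30)     # curved ground truth (hypothetical)

def fit_rmse(features):
    """Least-squares fit on the given feature map, RMSE vs the true curve."""
    A = np.column_stack(features(x))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    x_test = np.linspace(-1, 1, 200)
    pred = np.column_stack(features(x_test)) @ coef
    return float(np.sqrt(((pred - x_test**2) ** 2).mean()))

linear_err = fit_rmse(lambda t: [np.ones_like(t), t])        # wrong bias
quad_err = fit_rmse(lambda t: [np.ones_like(t), t, t**2])    # right structure

print(f"linear model RMSE:    {linear_err:.3f}")   # irreducible bias remains
print(f"quadratic model RMSE: {quad_err:.3f}")
```

The linear model's error is dominated by bias it can never fit away; the structured model's error is just the (small) noise floor. Domain-informed structure is the same move, with physics choosing the feature map.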
Practical example — Arrhenius kinetics:
If your system obeys the Arrhenius relationship, you already know something powerful: reaction rate roughly doubles for every 10°C rise in temperature. It’s established physics, and it’s enormously valuable prior knowledge.
You can exploit this directly:
Encode it in your model — parameterize the GP mean function or kernel around the Arrhenius form rather than leaving it as a black box
Need less data — the functional form is already baked in, so you’re fitting a handful of parameters rather than learning the shape from scratch
Extrapolate more reliably — a black-box model has no reason to respect the physics outside its training range; an Arrhenius-informed model does
A plain GP trying to rediscover this relationship from data alone might need dozens of points just to approximate what one equation captures exactly. That’s the sample efficiency argument in concrete terms.
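A concrete sketch of "encode the form, fit a handful of parameters" (all kinetic constants below are made up): Arrhenius, ln k = ln A − Ea/(R·T), is linear in 1/T, so five noisy experiments and a least-squares fit recover the activation energy and reproduce the roughly-2× per 10 °C rule of thumb:

```python
import numpy as np

R = 8.314                                  # gas constant, J/(mol*K)
Ea_true, lnA_true = 55e3, 20.0             # hypothetical kinetics
rng = np.random.default_rng(3)

T = np.array([300.0, 310.0, 320.0, 330.0, 340.0])   # just five experiments
ln_k = lnA_true - Ea_true / (R * T) + 0.05 * rng.normal(size=T.size)

# Arrhenius is linear in 1/T: ln k = ln A - (Ea/R) * (1/T)
A = np.column_stack([np.ones_like(T), 1.0 / T])
coef, *_ = np.linalg.lstsq(A, ln_k, rcond=None)
lnA_hat, Ea_hat = float(coef[0]), float(-coef[1] * R)

# implied rate ratio for a 10 K rise near 300 K (the "doubles every 10 C" rule)
q10 = float(np.exp(Ea_hat / R * (1.0 / 300.0 - 1.0 / 310.0)))

print(f"recovered Ea: {Ea_hat / 1e3:.1f} kJ/mol (true 55.0)")
print(f"rate ratio per 10 K near 300 K: {q10:.2f}")
```

Two fitted numbers pin down the whole temperature response, including extrapolation beyond 340 K, which is exactly the data the black-box model would have to spend dozens of runs rediscovering.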
Arrhenius kinetics is just one example of prior knowledge, and it shows up across many high-value domains:
Battery degradation vs. temperature
Pharmaceutical shelf-life and stability testing
Semiconductor reliability (HTOL testing)
Corrosion and material aging
Enzyme kinetics in biological systems
The takeaway: don’t just reduce dimensionality or reach for a more flexible model — think carefully about what structure you already know, and encode it. That’s where the real leverage is.