
Blog

Short notes on forecasting, pricing, and decision-making under uncertainty.

Posts

Simple Models Win Where It Matters: Example SES

AI is advancing at incredible speed. It is tempting to assume that advanced models lead to better outcomes. This assumption often fails in demand forecasting. The best model in operations isn't the smartest one. It's the one that stays competitive while remaining stable, explainable, and governable at portfolio scale. The score difference between simple and complex can be minute. The operational difference is huge.

Let us take the case of univariate models: they lack demand drivers (ML features) like price, location, promotions, weather, marketing pushes, etc. It is easy to assume that such models will not perform well. In practice, one is pleasantly surprised by how well they perform without any bells and whistles.

When transitioning from a manual forecasting process to AI-assisted planning, it is useful to leverage such models because they need only each SKU's own sales history. That gets the system running while the organization transitions to driver-based ML models. In practice, driver data is rarely ready at the start: it is unclean, messy, or too unreliable to operationalize.

In demand forecasting, modern portfolios have hundreds or thousands of SKUs, each behaving somewhat differently — which is exactly where univariate models remain competitive.

Benchmark setup

I benchmarked 18 univariate forecasting models on a 297-SKU subset from FreshRetailNet-50K (daily, intermittent demand, perishable context).

Dataset: https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K/tree/main

The benchmark included:

  • Classical methods: SES, Holt, Holt-Winters, Theta, Croston variants, Naïve
  • ML model (univariate formulation): LightGBM
  • Foundation model: Chronos2
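A hedged sketch of the classical part of this setup, assuming Nixtla's statsforecast and the Hugging Face datasets library; the split name, column mapping, season lengths, and forecast horizon are my assumptions, and the Chronos2 and LightGBM runs are omitted:

```python
# Minimal benchmark sketch, assuming statsforecast + HF datasets.
# Split name and column mapping are assumptions, not the dataset's
# documented schema; Chronos2 and LightGBM are run separately.
from datasets import load_dataset
from statsforecast import StatsForecast
from statsforecast.models import (
    CrostonClassic,
    Holt,
    HoltWinters,
    Naive,
    SimpleExponentialSmoothingOptimized,
    Theta,
    WindowAverage,
)

raw = load_dataset("Dingdong-Inc/FreshRetailNet-50K", split="train")
df = raw.to_pandas()
# ...rename the raw columns here to statsforecast's expected schema:
# unique_id (SKU id), ds (date), y (daily demand)

sf = StatsForecast(
    models=[
        SimpleExponentialSmoothingOptimized(),
        Holt(),
        HoltWinters(season_length=7),  # weekly seasonality for daily data (assumed)
        Theta(season_length=7),
        CrostonClassic(),              # intermittent-demand specialist
        Naive(),
        WindowAverage(window_size=7),
    ],
    freq="D",
    n_jobs=-1,
)
forecasts = sf.forecast(df=df, h=7)    # 7-day-ahead forecasts per SKU (assumed horizon)
```

statsforecast fits each model independently per series (unique_id), which is what makes an 18-model sweep over 297 SKUs cheap to run and repeat.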

Result

In the plot below, one such model, Simple Exponential Smoothing Optimized (SES), scores 64.28, sitting right next to the top-performing model, Chronos2, at 64.25 (lower is better). SES (Optimized) is effectively tied with the best portfolio model. The score difference is minute; the operational difference is huge.

[Plot: Portfolio score by model — Executive view (SES highlighted)]

A model that is slightly worse on paper but stable and governable can outperform a fragile system in real operations. Simple classical models like SES may not beat every other model, but they remain close to the top while being extremely easy to operate.
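To make "easy to operate" concrete: the entire state of SES is one exponentially smoothed level per SKU. A minimal sketch (the function and the fixed alpha are illustrative; the benchmarked "Optimized" variant instead fits alpha per series by minimizing in-sample error):

```python
def ses_forecast(y, alpha=0.3):
    """One-step SES: maintain an exponentially weighted level of the series."""
    level = y[0]                                   # initialize at the first observation
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level  # weigh new info against the old level
    return level                                   # flat forecast for any horizon

# e.g. a short daily demand series for one SKU
print(ses_forecast([12, 9, 14, 11, 10]))
```

One parameter and one number of state per SKU: that is essentially the whole governance surface.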

Takeaway

In this case, SES is not a solution for everything; it is a high-quality operational baseline. Chosen as such, it provides portfolio-wide stability, minimal governance overhead, and predictable behavior under noise.

PS: A more technical plot is shown below.

[Plot: Portfolio score by model — Technical view (SES highlighted)]

Why the Most Frequent SKU Winner Can Be the Wrong Portfolio Model

In demand forecasting, choosing portfolio model(s) is hard: there are many candidates, each Stock Keeping Unit (SKU) exhibits its own type of behavior (regime), and demand drivers vary considerably across products and time periods.

Evaluating models (statistical or machine learning) at the SKU level shows which ones win more often, and it is tempting to promote the most frequent winner for global portfolio use. That can mislead global model selection: a model that wins frequently at the SKU level does not automatically make the best portfolio model.

Restricting the portfolio to one or a few global models remains solidly grounded in operational reality: more models mean more maintenance, more monitoring, more serving complexity, and more governance overhead. This leads to the inevitable question: how do we select such model(s)?

First, one has to understand that, at the portfolio level, frequency of wins does not imply stability. A model can be great on one range of SKUs and perform very poorly on another. The choice then pivots to the model that does not perform too badly on average while still remaining competitive.

To validate this, I benchmarked 18 different types of univariate models on a 297-SKU subset of data from the FreshRetailNet-50K dataset (daily, intermittent demand, perishable context):
https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K/tree/main

[Plot: Model decision map]

The plot demonstrates the following:

  1. The WindowAverage model has the highest win share, at 13.8%.
  2. Its portfolio mean score is 77.86 (lower is better).
  3. The best-performing model is Chronos2, with a portfolio score of 64.25.

The most frequent winner is about 21% worse than the portfolio leader: (77.86 - 64.25) / 64.25 ≈ 21%. That gap matters to planning leaders, as it can increase working-capital pressure and disturb service-level stability.

At the portfolio level, stability with operational feasibility remains the top requirement. Model selection should therefore combine both signals, win share and portfolio mean score, as sketched below.
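A minimal sketch of that combination, assuming per-SKU benchmark results are collected in a DataFrame; the name `scores` and its columns (sku, model, score) are hypothetical, not the benchmark's actual schema:

```python
import pandas as pd

def portfolio_summary(scores: pd.DataFrame) -> pd.DataFrame:
    """Combine win share and portfolio mean score per model.

    `scores` has one row per (sku, model) with a `score` column where
    lower is better; column names are hypothetical.
    """
    # Win share: fraction of SKUs on which each model has the lowest score.
    winners = scores.loc[scores.groupby("sku")["score"].idxmin(), "model"]
    win_share = winners.value_counts(normalize=True).rename("win_share")
    # Portfolio mean: average score per model across all SKUs.
    mean_score = scores.groupby("model")["score"].mean().rename("mean_score")
    out = pd.concat([win_share, mean_score], axis=1)
    out["win_share"] = out["win_share"].fillna(0.0)  # models that never win
    # Rank by mean score: a frequent winner with a poor mean is fragile.
    return out.sort_values("mean_score")
```

Ranking by mean score while keeping win share visible is exactly how a frequent SKU-level winner with a poor portfolio mean gets flagged instead of promoted.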

A more technical plot is shown below:

[Plot: Model decision map (technical view)]