
Simple models remain operationally viable — until demand stops being smooth

in Notes, Forecasting, Operations · 6 min read


In previous posts, I showed that simple models like SES Optimized remain competitive and can beat other univariate models, performing well at the portfolio level. However, choosing a single model without considering demand structure leaves much of that benefit uncaptured.


What forecasting models are for

Forecasting models are not built to win benchmark scores on a portfolio.

They are used to make better decisions for replenishment, purchasing, inventory positioning, working capital allocation, and service-level stability.

Global use of one model is enticing to the practitioner. It reduces complexity. It simplifies explanation. It lowers governance overhead. It makes deployment and maintenance easier.

However, this decision comes with a cost: uneven performance across different types of demand.

A model may look strong at the overall portfolio level and still be a weak default for large parts of it.


Demand is not uniform

Demand patterns are not the same across a portfolio. A portfolio generally contains fundamentally different underlying demand patterns — smooth, erratic, intermittent, and lumpy. Each makes different demands on forecasting models.

A commonly used framework for classifying such patterns is the ADI/CV² framework. It checks two things: how frequently demand occurs (ADI, the average inter-demand interval) and how variable it is when it does occur (CV², the squared coefficient of variation of the non-zero demand sizes).
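To make the lens concrete, here is a minimal sketch of the classification in Python. The cutoffs (ADI at 1.32, CV² at 0.49) are the commonly cited Syntetos-Boylan values; the exact implementation behind this benchmark may differ.

```python
import numpy as np

def classify_regime(demand, adi_cut=1.32, cv2_cut=0.49):
    """Classify one SKU's demand history via the ADI/CV^2 lens.

    ADI: average inter-demand interval (periods per non-zero demand).
    CV^2: squared coefficient of variation of non-zero demand sizes.
    Cutoffs are the commonly cited Syntetos-Boylan values.
    """
    demand = np.asarray(demand, dtype=float)
    nonzero = demand[demand > 0]
    if nonzero.size == 0:
        return "no demand"
    adi = demand.size / nonzero.size
    cv2 = (nonzero.std() / nonzero.mean()) ** 2
    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"
```

Run over every SKU, this yields the smooth/erratic/intermittent/lumpy segmentation used in the benchmark below.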

Most portfolios contain all regimes simultaneously. A model that performs well in one regime may not perform well in others. Most forecasting policies do not account for this — they try to minimise the number of models used instead.

SES Optimized is a recency-based model. It performs reasonably well when the underlying signal is stable. In such cases, recent history still carries meaningful information about the near future. The model is simple, efficient, and often surprisingly competitive.

However, if the signal is not inherently stable, recent observations carry less information about the near future, and performance degrades.
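To see why recency drives SES, here is a minimal sketch; a library such as statsforecast optimizes the smoothing parameter properly, so treat the grid search below as illustration only.

```python
import numpy as np

def ses_forecast(y, alpha):
    """Flat SES forecast: the level is an exponentially weighted
    average of history, so recent observations dominate."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level  # same value for every step of the horizon

def ses_optimized(y, grid=np.linspace(0.01, 0.99, 99)):
    """Crude stand-in for "SES Optimized": pick the alpha that
    minimizes in-sample one-step-ahead squared error."""
    def sse(alpha):
        level, total = y[0], 0.0
        for obs in y[1:]:
            total += (obs - level) ** 2
            level = alpha * obs + (1 - alpha) * level
        return total
    return ses_forecast(y, min(grid, key=sse))
```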


The benchmark

To make this concrete, I benchmarked 18 univariate forecasting models on a 297-SKU subset from FreshRetailNet-50K and grouped SKUs using the ADI/CV² regime lens. Lower score is better.

SES vs Best Model per Regime

The plot above compares SES Optimized against the best-performing model in each regime. The gap is the performance cost of keeping SES as the default across that regime.

Demand regime    Best model score    SES score    Gap
Smooth           53.5                54.3         0.8
Erratic          104.1               111.1        7.0
Lumpy            101.1               116.6        15.5


In smooth demand, SES Optimized remains very close to the best-performing model. In erratic and lumpy demand, it is no longer the optimal default.

What this shows. Without the regime analysis, one would likely choose SES Optimized as the single operational default — it performs well at portfolio level. The demand-based analysis shows exactly where this model holds and where it fails.

A note on scope. These results are specific to this dataset — a daily, perishable, intermittent-demand context. SES being competitive in smooth demand here does not mean it will be competitive in smooth demand on a different dataset, a different category, or a different operating context. The regime lens is the transferable insight. The specific numbers are not.


What this means in practice

Regime     Verdict
Smooth     SES is defensible here. Governance simplicity earns the marginal accuracy concession.
Erratic    SES is no longer optimal. Stronger alternatives are worth evaluating.
Lumpy      SES is not a sound default. The performance cost is too large to absorb.

Make the model policy as simple as possible — but not simpler than the demand structure allows.
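As a sketch of what such a policy can look like, reusing the classify_regime sketch from above; the per-regime choices here are placeholders to be filled from your own benchmark, not recommendations:

```python
# Hypothetical per-regime defaults; fill from your own benchmark results.
REGIME_DEFAULTS = {
    "smooth": "SES Optimized",
    "erratic": "<best erratic model>",
    "intermittent": "<best intermittent model>",
    "lumpy": "<best lumpy model>",
    "no demand": "Naive",
}

def pick_model(demand_series):
    """Route each SKU to the default model for its regime."""
    return REGIME_DEFAULTS[classify_regime(demand_series)]
```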

That said, this is not always a day-one decision. Regime-aware model selection requires data infrastructure, classification logic, and governance for multiple model types. That cost is real.

For many organisations, the right first move is still a simple model deployed globally — not because it is optimal across all regimes, but because getting it running cleanly, with proper monitoring and bias tracking, is already a meaningful operational capability. It is the foundation that makes the next step possible.

The regime analysis then becomes the natural next question: now that we have a baseline, where is it costing us? That is a better question to ask from a position of operational stability than from the start of a forecasting programme.

Simple models are not the final destination. But for many teams, they are the right place to begin — and understanding exactly where they hold is what makes the transition to regime-aware policy a deliberate step rather than a reactive one.


Model choice should follow demand structure — not hype toward newer models, and not habit toward familiar ones.

That is what separates a model that looks good in a benchmark from a policy that performs well in a real portfolio.


Benchmark: 297 SKUs · FreshRetailNet-50K · 18 univariate models · Daily perishable retail data · ADI/CV² segmentation · Lower score = better forecast accuracy. Regime-level results; individual SKU variance exists within each category.

Your best forecast model might be your biggest operational risk

Forecasts are not made to win benchmarks. They are made for decision-making.

Running a business means dealing with volatility, especially in demand and supply. The better a business can estimate future demand, the better it can plan inventory, negotiate procurement, allocate working capital, and maintain service levels. Better forecasts do not remove uncertainty. They enable better decisions under uncertainty.

Better estimation of demand means that a large chunk of inventory decisions is made far ahead of time. That improves purchasing, product availability, and customer satisfaction, and reduces the cost of understocking or overstocking. In that sense, forecasting is not just a modeling problem. It is a business control system.

Due to the inherent volatility of the world, even great forecasts are meaningfully off. Most forecast models are optimized to reduce the gap between forecast and reality without caring about direction. That is what absolute error captures: how far you were from reality, regardless of whether you were too high or too low.

Direction, however, matters over an operational horizon.

That is what bias captures: whether a forecasting system consistently leans in one direction over time. A system that consistently over-forecasts creates one kind of operational damage. A system that consistently under-forecasts creates another. In both cases, the damage compounds quietly.

For any given horizon, two numbers define forecast performance:

  • MAE % — how wrong you are on average, in magnitude
  • Average Bias — how wrong your system has become, directionally

MAE tells you how wrong you are. Bias tells you how wrong your system has become.
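The exact metric definitions behind this benchmark are not spelled out here; a common formulation, sketched below, scales both by total actual demand. The toy series at the end shows why the two numbers are different lenses: identical MAE %, very different bias.

```python
import numpy as np

def mae_pct(actual, forecast):
    """Magnitude of error, as a percentage of total actual demand."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.abs(forecast - actual).sum() / actual.sum()

def avg_bias_pct(actual, forecast):
    """Signed error, as a percentage of total actual demand.
    Positive = systematic over-forecast, negative = under-forecast."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * (forecast - actual).sum() / actual.sum()

actual   = np.array([10, 10, 10, 10])
noisy    = np.array([12, 8, 12, 8])    # errors cancel out
drifting = np.array([12, 12, 12, 12])  # every error leans high

print(mae_pct(actual, noisy), avg_bias_pct(actual, noisy))        # 20.0 0.0
print(mae_pct(actual, drifting), avg_bias_pct(actual, drifting))  # 20.0 20.0
```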

Both metrics are important. But they operate on different timescales and carry different risk profiles. MAE reflects day-to-day noise. Bias reflects directional drift — systematic error. A business can often absorb more day-to-day fluctuation. But bias becomes visible only after the damage is done. A model that is marginally worse on MAE but holds lower bias can still be the stronger operational choice.

The right operating lens is two-dimensional — not a single leaderboard number.

To make this concrete, I benchmarked 18 univariate forecasting models on a 297-SKU subset from FreshRetailNet-50K: daily retail data, intermittent demand, perishable operating context.

Executive view

The executive view makes the tradeoff visible immediately: Chronos2 leads on MAE %, while SES Opt holds lower absolute bias.

Executive view: MAE vs Bias

Two models ended up very close:

Model       MAE %    |Bias| %    Composite score
Chronos2    42.05    22.20       64.25
SES Opt     43.32    20.96       64.28
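The composite in the table behaves like an equal-weight sum of MAE % and |Bias| %; that weighting is inferred from the numbers, not a stated formula:

```python
# Composite consistent with the table: MAE % + |Bias| %
# (equal weights inferred from the numbers, not a stated formula).
scores = {"Chronos2": (42.05, 22.20), "SES Opt": (43.32, 20.96)}
for model, (mae, abs_bias) in scores.items():
    print(f"{model}: {mae + abs_bias:.2f}")
# Chronos2: 64.25
# SES Opt: 64.28
```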

This was much more nuanced than a simple Foundation Model (Chronos2) vs Classical Model (SES Opt) story.

Chronos2 leads on MAE % by 1.27 points, or roughly 2.9%.
SES Opt leads on absolute bias by 1.24 points, which means about 5.6% less directional drift.
On composite score, they are essentially tied: 64.25 vs 64.28.

That is the real lesson.

If you optimize only for MAE, you would likely pick Chronos2. But that decision also comes with higher directional drift. Over time, that drift can become slow-motion inventory distortion — the kind that often does not show up in a standard forecasting review until it has already become a business problem.

So the better question is not:

Which model has the lowest error?

The better question is:

Which model gives acceptable error with the least directional risk?

That question should change model selection.

Operational excellence is not achieved by minimizing forecast error alone. It is achieved by jointly managing error magnitude and directional drift — and by making both visible in the decision process.

A model with slightly better headline accuracy can still be the weaker operational choice if its bias is allowed to accumulate unchecked.

Technical view

The full benchmark view shows that this is not a cherry-picked comparison. The Chronos2 vs SES Opt tradeoff sits inside a broader portfolio of 18 evaluated models.

Technical view: MAE vs Bias

That is why the best forecast model on paper can still be your biggest operational risk.

Simple Models Win Where It Matters: Example SES

AI is advancing at incredible speed. It is tempting to assume that advanced models lead to better outcomes. This assumption often fails in demand forecasting. The best model in operations isn't the smartest one. It's the one that stays competitive while remaining stable, explainable, and governable at portfolio scale. The score difference between simple and complex can be minute. The operational difference is huge.

Let us take the case of univariate models: they use no demand drivers (ML features) such as price, location, promotions, weather, or marketing pushes. It is easy to assume that such models will not perform well. In practice, one is pleasantly surprised by how well they perform without any bells and whistles.

When transitioning from a manual forecasting process to AI-assisted planning, it is useful to start with such models, since they need only the past demand history of each SKU. That gets the system running while the organisation builds toward driver-based ML models. And in practice, driver data is rarely ready at the start: it is unclean, messy, or too unreliable to operationalize.

In demand forecasting, modern portfolios have hundreds or thousands of SKUs, each behaving somewhat differently — which is exactly where univariate models remain competitive.

Benchmark setup

I benchmarked 18 univariate forecasting models on a 297-SKU subset from FreshRetailNet-50K (daily, intermittent demand, perishable context).

Dataset: https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K/tree/main

The benchmark included:

  • Classical methods: SES, Holt, Holt-Winters, Theta, Croston variants, Naïve
  • ML model (univariate formulation): LightGBM
  • Foundation model: Chronos2
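For the classical side, such a benchmark can be wired up with statsforecast in a few lines. The sketch below is illustrative only: the model list is abbreviated, the file path and horizon are placeholders, and the settings of the actual benchmark may differ.

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import (
    Naive, WindowAverage, CrostonClassic, CrostonOptimized,
    SimpleExponentialSmoothingOptimized, Holt, HoltWinters, Theta,
)

# Long format expected: unique_id (SKU), ds (date), y (demand).
df = pd.read_parquet("fresh_retail_subset.parquet")  # hypothetical path

sf = StatsForecast(
    models=[
        Naive(),
        WindowAverage(window_size=7),
        CrostonClassic(),
        CrostonOptimized(),
        SimpleExponentialSmoothingOptimized(),
        Holt(),
        HoltWinters(season_length=7),
        Theta(season_length=7),
    ],
    freq="D",
)
forecasts = sf.forecast(df=df, h=7)  # forecast horizon is illustrative
```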

Result

In the top plot, one can clearly see Simple Exponential Smoothing Optimized (SES), with a score of 64.28, sitting right next to the top-performing model, Chronos2 (64.25); lower score is better. SES Optimized is effectively tied with the best portfolio model. The score difference is minute, but the operational difference is huge.

Portfolio score by model — Executive view (SES highlighted)

A model that is slightly worse on paper but stable and governable can outperform a fragile system in real operations. Simple classical models like SES may not beat every other model, but they remain close to the top while being extremely easy to operate.

Takeaway

In this case, SES is not a solution for everything but a high-quality operational baseline. When chosen, it provides portfolio-wide stability, minimal governance overhead, and predictable behavior under noise.

PS: A more technical plot is shown below.

Portfolio score by model — Technical view (SES highlighted)

Why the Most Frequent SKU Winner Can Be the Wrong Portfolio Model

In demand forecasting, choosing portfolio model(s) is hard: there are many candidates, each Stock Keeping Unit (SKU) shows a different type of behavior (regime), and demand drivers vary considerably across products and time periods.

SKU-level comparisons of models (statistical or machine learning) show which models win more often. It is tempting to promote the most frequent winner to global portfolio use, but that can mislead portfolio model selection: a model that wins frequently at the SKU level does not automatically become the best portfolio model.

The choice of one or a few models as global model(s) remains solidly grounded in operational reality. More models lead to more maintenance, more monitoring, more serving complexity, and more governance overhead. This leads to the inevitable question: how do we select such model(s)?

First, one has to understand that, at the portfolio level, frequency of wins does not represent stability: a model can be great on a range of SKUs and still perform very poorly on others. The choice then pivots to the model that remains competitive while never performing too badly on average.

To validate this, I benchmarked 18 different types of univariate models on a 297-SKU subset of data from the FreshRetailNet-50K dataset (daily, intermittent demand, perishable context):
https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K/tree/main

Model Decision Map Placeholder

The plot demonstrates the following:

  1. WindowAverage model has the highest win share of 13.8%.
  2. WindowAverage model has a mean score of 77.86 (lower is better).
  3. The best-performing model is Chronos2 with a score of 64.25.
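Both numbers can be read straight off a long table of per-SKU scores. A sketch, with column names that are illustrative rather than the benchmark's actual schema:

```python
import pandas as pd

# One row per (SKU, model) with that model's score on that SKU.
results = pd.read_csv("benchmark_scores.csv")  # columns: sku, model, score

# Win share: fraction of SKUs on which a model has the lowest score.
winners = results.loc[results.groupby("sku")["score"].idxmin(), "model"]
win_share = winners.value_counts(normalize=True)

# Portfolio mean score per model (lower is better).
mean_score = results.groupby("model")["score"].mean()

summary = pd.DataFrame({
    "win_share": win_share.reindex(mean_score.index, fill_value=0.0),
    "mean_score": mean_score,
}).sort_values("mean_score")
print(summary)
```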

The most frequent winner is about 21% worse than the portfolio leader (77.86 vs 64.25). This difference matters significantly to planning leaders, as it may increase working capital pressure and disturb service-level stability.

At the portfolio level, stability with operational feasibility remains the top requirement. Such stability is identified by looking at win share and portfolio mean score together.

A more technical plot is below:

Model Decision Map Placeholder