Navigating the Limits of AI in Extreme Weather Forecasting: A Practical Guide
Overview
Extreme weather events—heatwaves, cold snaps, and storms—cause billions in damages annually. Accurate forecasting saves lives and reduces economic losses. For decades, we relied on physics-based numerical weather prediction (NWP) models. Recently, artificial intelligence (AI) models have emerged as faster, sometimes more accurate alternatives. But a 2024 study published in Science Advances reveals a critical blind spot: AI models underperform when predicting record-breaking extremes. This guide walks you through the key differences, the practical steps to evaluate model performance, and common pitfalls to avoid when choosing between traditional and AI-based forecasts for extreme events.

Prerequisites
Before diving in, you should be familiar with:
- Basic concepts of weather forecasting (e.g., ensemble models, resolution)
- Fundamentals of machine learning (training data, overfitting, generalization)
- Statistical metrics like root mean square error (RMSE) and bias
- Access to a dataset of historical extreme events (e.g., from NOAA or ECMWF archives)
No programming experience is strictly required, but the examples use short, runnable Python snippets for clarity.
Step-by-Step Guide to Assessing AI vs Traditional Models for Extreme Weather
Step 1: Understand the Data Diet of Each Model
Physics-based models (e.g., ECMWF's IFS) solve equations of fluid dynamics and thermodynamics. They can simulate conditions never seen before because the laws of physics are universal. In contrast, AI models (e.g., Google's GraphCast, Huawei's Pangu) are trained on historical reanalysis data. They learn patterns but are limited by the range of that data. For record-breaking events—values far outside the training set—AI models tend to regress toward the mean.
Action: When evaluating an AI model, always check the distribution of the training dataset. Does it include extremes beyond, say, three standard deviations from the mean? If not, expect underestimation.
Code Example: Checking Training Data Extremes
# Assumes an xarray-readable reanalysis file; the variable name 't2m'
# (2 m temperature) depends on your dataset.
import numpy as np
import xarray as xr

train_data = xr.open_dataset('reanalysis.nc')['t2m'].values
extreme_threshold = np.nanpercentile(train_data, 99.9)
train_max = np.nanmax(train_data)  # reused in Step 3's plot
print(f'99.9th percentile in training: {extreme_threshold:.1f} °C')
print(f'Training maximum: {train_max:.1f} °C')
# If a test event exceeds these values, expect the AI model to underestimate it
Step 2: Define a Robust Evaluation Metric for Extremes
Standard metrics like RMSE or anomaly correlation coefficient (ACC) can mask poor extreme-event performance. Instead, use targeted metrics:
- Extreme Probability Score (EPS): How often does the model correctly predict an event above the 99th percentile? (A sketch follows this list.)
- Intensity Bias: Mean error for events in the top 1% of historical records.
- Return Period Accuracy: For a 100-year event, how far off is the modeled magnitude? (A sketch follows the intensity bias example below.)
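Of these, the extreme probability score is the simplest: it reduces to a hit rate over observed exceedances. A minimal sketch, assuming observed and predicted are aligned 1-D NumPy arrays; the function name is illustrative.
Code Example: Extreme Probability Score
import numpy as np

def extreme_hit_rate(observed, predicted, percentile=99):
    """Fraction of observed extremes the model also predicts as extreme."""
    threshold = np.percentile(observed, percentile)
    event = observed > threshold   # an extreme was actually observed
    hit = predicted > threshold    # the model also called an extreme
    return np.mean(hit[event])
# A hit rate well below 1.0 means the model misses real extremes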
The study cited in the introduction tested thousands of record-breaking hot, cold, and windy events from 2018 and 2020, and found that the AI models underestimated both their frequency and their intensity.
Action: Create a test set of extreme events that were not in the training data. Compare predictions from a physics-based model and an AI model using these extreme-specific metrics.
Code Example: Extreme Intensity Bias
import numpy as np

def extreme_bias(observed, predicted, percentile=99):
    """Mean error (predicted - observed) for events above the given percentile."""
    mask = observed > np.percentile(observed, percentile)
    return np.mean(predicted[mask] - observed[mask])

# test_obs, physics_pred, ai_pred: aligned 1-D arrays over held-out events
bias_physics = extreme_bias(test_obs, physics_pred)
bias_ai = extreme_bias(test_obs, ai_pred)
print(f'Physics bias: {bias_physics:.2f}  AI bias: {bias_ai:.2f}')
# A strongly negative AI bias signals systematic underestimation in the tail
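Return period accuracy takes more machinery. One common approach, sketched here, fits a generalized extreme value (GEV) distribution to annual maxima and compares the implied 100-year return levels; obs_annual_max and model_annual_max are assumed, illustrative arrays.
Code Example: Return Period Accuracy
from scipy.stats import genextreme

def return_level(annual_max, period=100):
    """Return level for the given period from a GEV fit to annual maxima."""
    shape, loc, scale = genextreme.fit(annual_max)
    return genextreme.ppf(1 - 1 / period, shape, loc, scale)

error = return_level(model_annual_max) - return_level(obs_annual_max)
print(f'100-year return level error: {error:.2f}')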
Step 3: Analyze Why AI Falls Short
As study author Prof. Sebastian Engelke explains, AI models are 'constrained to the range of the training dataset.' This creates a distributional shift problem: test events are outside the training distribution. Physics models, meanwhile, can simulate new extremes because their equations are not bounded by historical statistics.
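A toy experiment makes the constraint concrete. The sketch below trains a tree-based regressor, which like most purely statistical learners cannot extrapolate, on values capped at 35, then asks it to predict beyond that range. It illustrates the mechanism only; it is not a weather model.
Code Example: Saturation at the Training Maximum
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 35, size=(1000, 1))  # 'temperatures' up to 35
y_train = X_train.ravel()                     # identity target, for clarity
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_test = np.array([[30.0], [40.0], [45.0]])   # 40 and 45 exceed the training max
print(model.predict(X_test))                  # the last two saturate near 35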
Action: Plot the observed vs. predicted values for extreme events in a scatter plot. Color points by whether they exceed the training maximum. You'll likely see that AI predictions for the most extreme points cluster near the training maximum, while physics model predictions spread more realistically.

Visualization Tip
# Uses test_obs, ai_pred, physics_pred from Step 2 and train_max from Step 1
import matplotlib.pyplot as plt

plt.scatter(test_obs, ai_pred, label='AI', alpha=0.5)
plt.scatter(test_obs, physics_pred, label='Physics', alpha=0.5)
plt.axhline(y=train_max, color='r', linestyle='--', label='Training max')
# Reference diagonal: a perfect forecast would fall on this line
lims = [test_obs.min(), test_obs.max()]
plt.plot(lims, lims, 'k:', label='Perfect forecast')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.legend()
plt.show()
# AI points tend to flatten near the training max; physics points track the diagonal
Step 4: Interpret the Results in Real-World Context
If your evaluation shows AI models systematically underpredicting extremes, the implications are serious. Early warnings based on AI could miss the severity of a once-in-a-century heatwave, leading to inadequate preparation. For routine forecasts (e.g., next week's temperature within the normal range), AI may perform as well as, or better than, physics-based models. The decision to replace physics-based models should therefore weigh the cost of missing extremes.
Action: For a given forecast scenario (e.g., a heatwave warning system), weigh the probability of extremes versus the computational savings of AI. The study's authors call this a 'warning shot' against too-rapid replacement of traditional models.
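A back-of-envelope expected-cost comparison can frame that trade-off. Every number below is an illustrative assumption, not a figure from the study.
Code Example: Weighing Missed Extremes Against Compute Savings
p_extreme = 0.02          # assumed annual probability of an extreme event
extra_miss_rate = 0.30    # assumed additional AI miss rate on extremes
cost_missed = 1e9         # assumed damage from an unprepared extreme, in dollars
compute_savings = 1e6     # assumed annual savings from running AI instead of NWP

expected_extra_loss = p_extreme * extra_miss_rate * cost_missed
print(f'Expected extra loss, AI only: ${expected_extra_loss:,.0f}')
print(f'Compute savings:              ${compute_savings:,.0f}')
# If the expected extra loss dwarfs the savings, keep physics in the loop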
Common Mistakes
Mistake 1: Ignoring the Training Data Distribution
Many assume that because AI models excel on standard benchmarks, they will perform equally well on rare events. This is false. Always check whether your test extremes fall within the span of the training data.
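Continuing from the sketches in Steps 1 and 2 (test_obs and train_max as defined there), the check takes only a few lines:
import numpy as np
outside = test_obs > train_max
print(f'{int(outside.sum())} of {outside.size} test extremes exceed the training max')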
Mistake 2: Using Only Aggregate Metrics
Relying on RMSE or MAE can hide severe underperformance in the tails. Always include extreme-specific metrics like the ones in Step 2.
Mistake 3: Overlooking Physics-Based Constraints
AI models learn statistical patterns but have no understanding of physical laws. For example, they might predict a temperature that violates energy balance. Always use a physics-based model as a sanity check for AI outputs when extremes are forecast.
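That sanity check can be automated. A sketch, assuming aligned forecast arrays and an extreme-event threshold; the function name and the 3-degree tolerance are arbitrary assumptions.
Code Example: Physics-Based Sanity Check
import numpy as np

def flag_divergent_extremes(ai_pred, physics_pred, threshold, tol=3.0):
    """Flag cases where physics forecasts an extreme but AI disagrees strongly."""
    physics_extreme = physics_pred > threshold
    ai_much_lower = (physics_pred - ai_pred) > tol
    return physics_extreme & ai_much_lower  # candidates for human review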
Mistake 4: Assuming New AI Models Have Solved This
Even state-of-the-art AI weather models like Pangu and GraphCast show this limitation. The problem is inherent: they are trained on historical data. As of 2025, no AI model can reliably extrapolate to unprecedented extremes without incorporating physical constraints.
Summary
AI weather models offer speed and skill for everyday forecasts, but they struggle with record-breaking extremes because their training data limits their range. Traditional physics-based models remain essential for accurate prediction of rare, high-impact events. When evaluating a forecasting system, always test on unseen extremes, use tail-specific metrics, and remember the training data distribution. The prudent path is hybrid: use AI for routine forecasts but fall back on physics-based simulations when the probability of extremes rises.