Navigating the Limits of AI in Extreme Weather Forecasting: A Practical Guide
Overview
Extreme weather events—heatwaves, cold snaps, and storms—cause billions in damages annually. Accurate forecasting saves lives and reduces economic losses. For decades, we relied on physics-based numerical weather prediction (NWP) models. Recently, artificial intelligence (AI) models have emerged as faster, sometimes more accurate alternatives. But a 2024 study published in Science Advances reveals a critical blind spot: AI models underperform when predicting record-breaking extremes. This guide walks you through the key differences, the practical steps to evaluate model performance, and common pitfalls to avoid when choosing between traditional and AI-based forecasts for extreme events.

Prerequisites
Before diving in, you should be familiar with:
- Basic concepts of weather forecasting (e.g., ensemble models, resolution)
- Fundamentals of machine learning (training data, overfitting, generalization)
- Statistical metrics like root mean square error (RMSE) and bias
- Access to a dataset of historical extreme events (e.g., from NOAA or ECMWF archives)
No programming experience is strictly required, but the examples use short, runnable Python snippets for clarity.
Step-by-Step Guide to Assessing AI vs Traditional Models for Extreme Weather
Step 1: Understand the Data Diet of Each Model
Physics-based models (e.g., ECMWF's IFS) solve equations of fluid dynamics and thermodynamics. They can simulate conditions never seen before because the laws of physics are universal. In contrast, AI models (e.g., Google's GraphCast, Huawei's Pangu) are trained on historical reanalysis data. They learn patterns but are limited by the range of that data. For record-breaking events—values far outside the training set—AI models tend to regress toward the mean.
Action: When evaluating an AI model, always check the distribution of the training dataset. Does it include extremes beyond, say, three standard deviations from the mean? If not, expect underestimation.
Code Example: Checking Training Data Extremes
# Assumes an xarray-readable reanalysis file; the variable name 't2m'
# (2 m temperature) depends on your dataset.
import numpy as np
import xarray as xr

train_data = xr.open_dataset('reanalysis.nc')['t2m'].values
extreme_threshold = np.nanpercentile(train_data, 99.9)
train_max = np.nanmax(train_data)  # reused in Step 3's plot
print(f'99.9th percentile in training: {extreme_threshold:.1f} °C')
print(f'Training maximum: {train_max:.1f} °C')
# If a test event exceeds these values, expect the AI model to underestimate it
Step 2: Define a Robust Evaluation Metric for Extremes
Standard metrics like RMSE or anomaly correlation coefficient (ACC) can mask poor extreme-event performance. Instead, use targeted metrics:
- Extreme Probability Score (EPS): How often does the model correctly predict an event above the 99th percentile? (A sketch follows this list.)
- Intensity Bias: Mean error for events in the top 1% of historical records.
- Return Period Accuracy: For a 100-year event, how far off is the modeled magnitude? (A sketch follows the intensity bias example below.)
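Of these, the extreme probability score is the simplest: it reduces to a hit rate over observed exceedances. A minimal sketch, assuming observed and predicted are aligned 1-D NumPy arrays; the function name is illustrative.
Code Example: Extreme Probability Score
import numpy as np

def extreme_hit_rate(observed, predicted, percentile=99):
    """Fraction of observed extremes the model also predicts as extreme."""
    threshold = np.percentile(observed, percentile)
    event = observed > threshold   # an extreme was actually observed
    hit = predicted > threshold    # the model also called an extreme
    return np.mean(hit[event])
# A hit rate well below 1.0 means the model misses real extremes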
The study cited in the introduction tested thousands of record-breaking hot, cold, and windy events from 2018 and 2020, and found that the AI models underestimated both their frequency and their intensity.
Action: Create a test set of extreme events that were not in the training data. Compare predictions from a physics-based model and an AI model using these extreme-specific metrics.
Code Example: Extreme Intensity Bias
import numpy as np

def extreme_bias(observed, predicted, percentile=99):
    """Mean error (predicted - observed) for events above the given percentile."""
    mask = observed > np.percentile(observed, percentile)
    return np.mean(predicted[mask] - observed[mask])

# test_obs, physics_pred, ai_pred: aligned 1-D arrays over held-out events
bias_physics = extreme_bias(test_obs, physics_pred)
bias_ai = extreme_bias(test_obs, ai_pred)
print(f'Physics bias: {bias_physics:.2f}  AI bias: {bias_ai:.2f}')
# A strongly negative AI bias signals systematic underestimation in the tail
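Return period accuracy takes more machinery. One common approach, sketched here, fits a generalized extreme value (GEV) distribution to annual maxima and compares the implied 100-year return levels; obs_annual_max and model_annual_max are assumed, illustrative arrays.
Code Example: Return Period Accuracy
from scipy.stats import genextreme

def return_level(annual_max, period=100):
    """Return level for the given period from a GEV fit to annual maxima."""
    shape, loc, scale = genextreme.fit(annual_max)
    return genextreme.ppf(1 - 1 / period, shape, loc, scale)

error = return_level(model_annual_max) - return_level(obs_annual_max)
print(f'100-year return level error: {error:.2f}')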
Step 3: Analyze Why AI Falls Short
As study author Prof. Sebastian Engelke explains, AI models are 'constrained to the range of the training dataset.' This creates a distributional shift problem: test events are outside the training distribution. Physics models, meanwhile, can simulate new extremes because their equations are not bounded by historical statistics.
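A toy experiment makes the constraint concrete. The sketch below trains a tree-based regressor, which like most purely statistical learners cannot extrapolate, on values capped at 35, then asks it to predict beyond that range. It illustrates the mechanism only; it is not a weather model.
Code Example: Saturation at the Training Maximum
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 35, size=(1000, 1))  # 'temperatures' up to 35
y_train = X_train.ravel()                     # identity target, for clarity
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_test = np.array([[30.0], [40.0], [45.0]])   # 40 and 45 exceed the training max
print(model.predict(X_test))                  # the last two saturate near 35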
Action: Plot the observed vs. predicted values for extreme events in a scatter plot. Color points by whether they exceed the training maximum. You'll likely see that AI predictions for the most extreme points cluster near the training maximum, while physics model predictions spread more realistically.

Visualization Tip
# Uses test_obs, ai_pred, physics_pred from Step 2 and train_max from Step 1
import matplotlib.pyplot as plt

plt.scatter(test_obs, ai_pred, label='AI', alpha=0.5)
plt.scatter(test_obs, physics_pred, label='Physics', alpha=0.5)
plt.axhline(y=train_max, color='r', linestyle='--', label='Training max')
# Reference diagonal: a perfect forecast would fall on this line
lims = [test_obs.min(), test_obs.max()]
plt.plot(lims, lims, 'k:', label='Perfect forecast')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.legend()
plt.show()
# AI points tend to flatten near the training max; physics points track the diagonal
Step 4: Interpret the Results in Real-World Context
If your evaluation shows AI models systematically underpredicting extremes, the implications are serious. Early warnings based on AI could miss the severity of a once-in-a-century heatwave, leading to inadequate preparation. For routine forecasts (e.g., next week's temperature within the normal range), AI may perform as well as, or better than, physics-based models. The decision to replace physics-based models should therefore weigh the cost of missing extremes.
Action: For a given forecast scenario (e.g., a heatwave warning system), weigh the probability of extremes versus the computational savings of AI. The study's authors call this a 'warning shot' against too-rapid replacement of traditional models.
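A back-of-envelope expected-cost comparison can frame that trade-off. Every number below is an illustrative assumption, not a figure from the study.
Code Example: Weighing Missed Extremes Against Compute Savings
p_extreme = 0.02          # assumed annual probability of an extreme event
extra_miss_rate = 0.30    # assumed additional AI miss rate on extremes
cost_missed = 1e9         # assumed damage from an unprepared extreme, in dollars
compute_savings = 1e6     # assumed annual savings from running AI instead of NWP

expected_extra_loss = p_extreme * extra_miss_rate * cost_missed
print(f'Expected extra loss, AI only: ${expected_extra_loss:,.0f}')
print(f'Compute savings:              ${compute_savings:,.0f}')
# If the expected extra loss dwarfs the savings, keep physics in the loop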
Common Mistakes
Mistake 1: Ignoring the Training Data Distribution
Many assume that because AI models excel on standard benchmarks, they will perform equally well on rare events. This is false. Always check whether your test extremes fall within the span of the training data.
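Continuing from the sketches in Steps 1 and 2 (test_obs and train_max as defined there), the check takes only a few lines:
import numpy as np
outside = test_obs > train_max
print(f'{int(outside.sum())} of {outside.size} test extremes exceed the training max')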
Mistake 2: Using Only Aggregate Metrics
Relying on RMSE or MAE can hide severe underperformance in the tails. Always include extreme-specific metrics like the ones in Step 2.
Mistake 3: Overlooking Physics-Based Constraints
AI models learn statistical patterns but have no understanding of physical laws. For example, they might predict a temperature that violates energy balance. Always use a physics-based model as a sanity check for AI outputs when extremes are forecast.
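That sanity check can be automated. A sketch, assuming aligned forecast arrays and an extreme-event threshold; the function name and the 3-degree tolerance are arbitrary assumptions.
Code Example: Physics-Based Sanity Check
import numpy as np

def flag_divergent_extremes(ai_pred, physics_pred, threshold, tol=3.0):
    """Flag cases where physics forecasts an extreme but AI disagrees strongly."""
    physics_extreme = physics_pred > threshold
    ai_much_lower = (physics_pred - ai_pred) > tol
    return physics_extreme & ai_much_lower  # candidates for human review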
Mistake 4: Assuming New AI Models Have Solved This
Even state-of-the-art AI weather models like Pangu and GraphCast show this limitation. The problem is inherent: they are trained on historical data. As of 2025, no AI model can reliably extrapolate to unprecedented extremes without incorporating physical constraints.
Summary
AI weather models offer speed and skill for everyday forecasts, but they struggle with record-breaking extremes because their training data limits their range. Traditional physics-based models remain essential for accurate prediction of rare, high-impact events. When evaluating a forecasting system, always test on unseen extremes, use tail-specific metrics, and remember the training data distribution. The prudent path is hybrid: use AI for routine forecasts but fall back on physics-based simulations when the probability of extremes rises.