
5. Temporal Statistics and Derived Variables

Learning objectives

By the end of this notebook you will be able to:

  • Use earthkit-transforms to compute daily and monthly means from sub-daily data

  • Compute a climatological mean and derive anomaly fields

  • Compute rolling statistics for smoothed features

  • Understand why pre-computing derived variables improves training efficiency

  • Write temporal aggregates to Zarr


Why pre-compute temporal features?

Raw ERA5 data is hourly. But most ML models do not learn directly from individual hours — they need features like:

  • Daily means — remove the diurnal cycle

  • Monthly means — seasonal context

  • Anomalies — deviation from climatology, isolating the weather signal from the seasonal cycle

  • Rolling means — smoothed trends, useful for medium-range forecasting inputs

Computing these once, at preprocessing time, and storing them in Zarr means they are available instantly during training — no recomputation every epoch.
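The saving is easy to see with a toy example. The sketch below (plain NumPy, names invented for illustration) contrasts recomputing a 7-day moving average every epoch with computing it once and reusing it — which is exactly what storing the derived variable in Zarr gives us:

```python
import numpy as np

rng = np.random.default_rng(0)
daily = rng.normal(280.0, 5.0, size=365)  # one year of fake daily values

def rolling_mean_7d(x):
    # Simple valid-mode moving average (output length len(x) - 6)
    return np.convolve(x, np.ones(7) / 7, mode="valid")

# Recompute per epoch (wasteful — same result every time):
per_epoch = [rolling_mean_7d(daily) for _ in range(3)]

# Precompute once and reuse (what a Zarr store of derived variables gives us):
precomputed = rolling_mean_7d(daily)

print(precomputed.shape)  # (359,)
```

The arithmetic here is trivial, but for terabyte-scale ERA5 the per-epoch recomputation cost is dominated by I/O and aggregation, so the one-off precompute pays for itself quickly.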

Data note for this notebook

This notebook requires a multi-timestep dataset to demonstrate temporal aggregation meaningfully. Options:

  1. CDS request (recommended): fetch several months of daily or hourly ERA5

  2. Provided sample: a multi-timestep GRIB file (see the data note in the setup cell below)

The cells below show both paths. The CDS request is commented out by default.

Setup

import earthkit.data as ekd
import earthkit.transforms as ekt
import earthkit.plots as ekp
import xarray as xr
import os

ekd.settings.set({"cache-policy": "user"})
os.makedirs("data", exist_ok=True)

print("earthkit.transforms version:", ekt.__version__)

Load multi-timestep data

Fetch a year of daily 2m temperature over Europe.

# Uncomment to fetch from CDS (requires ~/.cdsapirc)

# ds_cds = ekd.from_source(
#     "cds",
#     "reanalysis-era5-single-levels",
#     request=dict(
#         variable=["2t"],
#         product_type="reanalysis",
#         date="2020-01-01/2020-12-31",  # full year of daily 00:00 fields
#         time="00:00",
#         area=[72, -25, 30, 45],   # Europe: N, W, S, E
#         grid=[1, 1],
#         format="grib",
#     ),
# )
# xr_daily = ds_cds.to_xarray()
# xr_daily = ds_cds.to_xarray()

print("CDS option commented out — see Option B below.")

Option B — Sample / provided data

# DATA: This notebook requires a multi-timestep ERA5 GRIB file.
#
ds = ekd.from_source("sample", "era5_europe_20200101_20201231_2t_1deg.grib")
xr_daily = ds.to_xarray(add_earthkit_attrs=False)

print("Loaded dataset:")
print(xr_daily)

#
# For demonstration purposes, we can construct a synthetic multi-day xarray
# Dataset that mimics ERA5 structure so all cells below run without error.

# import numpy as np
# import pandas as pd

# # Synthetic full-year daily dataset: Jan–Dec 2020, Europe 1° grid
# time = pd.date_range("2020-01-01", "2020-12-31", freq="D")
# lat = np.arange(72, 29, -1, dtype=float)
# lon = np.arange(-25, 46, dtype=float)

# # Temperature: seasonal cycle + random noise
# day_of_year = np.array([t.day_of_year for t in time])
# seasonal = 260 + 20 * np.sin(2 * np.pi * day_of_year / 365 - np.pi / 2)
# t_data = seasonal[:, None, None] + np.random.normal(0, 3, (len(time), len(lat), len(lon)))

# xr_daily = xr.Dataset(
#     {
#         "2t": xr.DataArray(
#             t_data.astype(np.float32),
#             dims=["forecast_reference_time", "latitude", "longitude"],
#             coords={"forecast_reference_time": time, "latitude": lat, "longitude": lon},
#             attrs={"units": "K", "long_name": "2 metre temperature"},
#         )
#     }
# )
# print("Synthetic dataset:")
# print(xr_daily)

Temporal aggregation with earthkit-transforms

earthkit-transforms provides a clean API for standard temporal reductions. It preserves xarray metadata and coordinate attributes.

# Daily means (useful when input is hourly; here input is already daily)
# With real hourly data:
#   daily_mean = ekt.temporal.daily_mean(xr_hourly)
# The function detects the time coordinate automatically.

monthly_mean = ekt.temporal.monthly_mean(xr_daily)
print("Monthly mean dataset:")
monthly_mean
fig = ekp.quickplot(monthly_mean, figsize=(12, 7))
fig.title("ERA5 {variable_name}: monthly means for {forecast_reference_time:%Y}")
fig.subplot_titles("{forecast_reference_time:%Y-%m}");
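As a cross-check, the same monthly aggregation can be expressed in plain xarray with `resample`. This sketch uses a small synthetic series (the coordinate name `forecast_reference_time` mirrors the ERA5 data above, but the values are invented):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three months of synthetic daily values, warming steadily through spring
time = pd.date_range("2020-01-01", "2020-03-31", freq="D")
da = xr.DataArray(
    np.linspace(260.0, 280.0, len(time)),
    dims=["forecast_reference_time"],
    coords={"forecast_reference_time": time},
    name="2t",
)

# Month-start resampling gives one mean per calendar month
monthly = da.resample(forecast_reference_time="MS").mean()
print(monthly.sizes["forecast_reference_time"])  # 3
```

`ekt.temporal.monthly_mean` is the more convenient route for real ERA5 data, since it detects the time coordinate and preserves metadata; the `resample` form is useful when you want full control over the grouping frequency.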

Climatological mean and anomalies

A climatology is the long-term mean for each calendar day (or month) — the expected value given the time of year. The anomaly is the departure from that expectation:

$$\text{anomaly}(t) = T(t) - \overline{T}_{\text{clim}}(\text{month}(t))$$

Anomalies are often more informative than raw fields for ML — the seasonal cycle is predictable and adds noise relative to the weather signal the model should learn.

# Climatological mean: average over the time dimension
# (With a real multi-year dataset this would be averaged over the same
#  calendar month across many years; here we use all available months)

climatology = ekt.climatology.mean(xr_daily, frequency="month")

print("Climatological mean shape:", climatology['2t'].shape)
print(f"Mean temperature: {float(climatology['2t'].mean()):.1f} K")
# Anomalies: subtract the climatological mean from each timestep
anomalies = (xr_daily - climatology)
anomalies['2t'].attrs["long_name"] = "2m temperature anomaly"
anomalies['2t'].attrs["units"] = "K"

print(f"Anomaly range: [{float(anomalies['2t'].min()):.2f}, {float(anomalies['2t'].max()):.2f}] K")
print(f"Anomaly mean (should be ~0): {float(anomalies['2t'].mean()):.4f} K")
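The same climatology-and-anomaly logic can be written in plain xarray with `groupby`, which is the natural approach for a multi-year record (see activity 3 below). This sketch uses a synthetic two-year series — the names and values are illustrative, not earthkit API:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic daily temperatures with a seasonal cycle
time = pd.date_range("2019-01-01", "2020-12-31", freq="D")
doy = time.dayofyear.values
seasonal = 280 + 10 * np.sin(2 * np.pi * doy / 365.25)
da = xr.DataArray(seasonal, dims=["time"], coords={"time": time})

# Monthly climatology: mean over the same calendar month across both years
clim = da.groupby("time.month").mean()

# Anomaly: subtract the matching month's climatology from each timestep
anom = da.groupby("time.month") - clim

print(float(anom.mean()))  # ~0 by construction
```

The `groupby`/subtract pattern aligns each timestep with its calendar-month climatology automatically, which is exactly the anomaly formula above.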

Rolling statistics

A rolling (moving window) mean smooths short-term noise, which can be useful as an input feature for extended-range forecasting.

# Daily mean via daily_reduce — the input is already daily, so this is a
# no-op included only to demonstrate the function
rolling_1d = ekt.temporal.daily_reduce(xr_daily, how="mean")

# 7-day rolling mean (centred window)
rolling_7d = ekt.temporal.rolling_reduce(xr_daily, window_length=7, how="mean")

# Monthly mean (an aggregate per calendar month, not a rolling window)
rolling_monthly = ekt.temporal.monthly_reduce(xr_daily, how="mean")

print(f"Raw shape:        {xr_daily['2t'].shape}")
print(f"7-day rolling:    {rolling_7d['2t'].shape}  (same shape, NaN at edges)")
fig = ekp.Figure(1,1)
subplot = fig.add_timeseries(0, 0)

lat_coord, lon_coord = 47, 4

subplot.line(xr_daily['2t'].sel(latitude=lat_coord, longitude=lon_coord), label="Daily")
subplot.line(rolling_1d['2t'].sel(latitude=lat_coord, longitude=lon_coord), label="Daily mean (daily_reduce)")
subplot.line(rolling_7d['2t'].sel(latitude=lat_coord, longitude=lon_coord), label="7-day rolling mean")
subplot.line(rolling_monthly['2t'].sel(latitude=lat_coord, longitude=lon_coord), label="Monthly mean")

subplot.legend()

fig.title("ERA5 daily {variable_name} in {location:%c}, {location:%C}")
fig.xlabel()
fig.ylabel("{variable_name} ({units})");
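For reference, the 7-day centred rolling mean computed above with `rolling_reduce` can also be reproduced with xarray's built-in `rolling`. A minimal sketch on a synthetic series (values invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=30, freq="D")
da = xr.DataArray(np.arange(30.0), dims=["time"], coords={"time": time})

# Centred 7-day window; edges where the window is incomplete become NaN
smooth = da.rolling(time=7, center=True).mean()

print(smooth.isnull().sum().item())  # 6 edge values are NaN (3 each side)
print(float(smooth[3]))              # mean of days 0..6 -> 3.0
```

This makes the "same shape, NaN at edges" behaviour noted above concrete: the output keeps the input length, with `window // 2` NaN values at each end.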

Compare data volumes

# Memory footprint comparison
def mb(arr):
    return arr.nbytes / 1e6

print(f"Daily data    : {mb(xr_daily['2t'].values):.1f} MB  ({xr_daily.sizes['forecast_reference_time']} timesteps)")
print(f"Monthly mean  : {mb(monthly_mean['2t'].values):.3f} MB  ({monthly_mean.sizes['forecast_reference_time']} timesteps)")
print(f"Climatology   : {mb(climatology['2t'].values):.3f} MB  ({climatology.sizes['month']} timesteps)")
print(f"Anomalies     : {mb(anomalies['2t'].values):.1f} MB  ({anomalies.sizes['forecast_reference_time']} timesteps)")

Write to Zarr

temporal_ds = xr.Dataset(
    {
        "2t":           xr_daily["2t"],
        "2t_anomaly":   anomalies['2t'],
        "2t_rolling7d": rolling_7d['2t'],
    }
)

temporal_ds.chunk('auto').to_zarr("data/era5_temporal.zarr", mode="w")
print("Written to data/era5_temporal.zarr")
print(temporal_ds)

Summary

You have:

  • Used ekt.temporal.monthly_mean() to aggregate daily data

  • Computed a climatological mean and anomaly field

  • Applied a rolling window mean with ekt.temporal.rolling_reduce()

  • Written the processed features to data/era5_temporal.zarr


Activity

  1. Compute a seasonal mean for DJF (December–January–February). Hint: select months 12, 1, 2.

  2. Plot the anomaly field for a single day and compare it to the raw temperature field. What patterns are highlighted by the anomaly?

  3. Modify the synthetic data to include a 2-year record and compute a proper climatology by grouping on month.

# Seasonal DJF mean (solution sketch for activity 1, using plain xarray)
djf = xr_daily.sel(
    forecast_reference_time=xr_daily["forecast_reference_time"].dt.month.isin([12, 1, 2])
)
djf_mean = djf.mean("forecast_reference_time")