Learning objectives
By the end of this notebook you will be able to:
Explain why Zarr is better than GRIB for ML training
Convert an earthkit GRIB dataset to a Zarr store
Open a Zarr store with xarray and inspect its structure
Verify round-trip fidelity by comparing original and reloaded data
Compare storage sizes between GRIB and Zarr
Why Zarr for ML?¶
GRIB is the standard archival format for weather data. It was designed for transmission and long-term storage — not for the random-access read patterns of training loops.
Zarr is a chunked, compressed, cloud-native array format designed for exactly this:
| Property | GRIB | Zarr |
|---|---|---|
| Layout | Sequential messages | Chunked N-D arrays |
| Random access | Slow (scan whole file) | Fast (read one chunk) |
| Cloud-native | No | Yes (S3, GCS, ADLS) |
| Lazy loading | No | Yes |
| Parallel reads | No | Yes |
| Framework integration | Custom parsers | xarray, PyTorch, TF |
A DataLoader reading one timestep at a time from a time-chunked Zarr store reads exactly one chunk per sample — no wasted I/O.
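That read pattern can be sketched without any I/O. Below, a NumPy array stands in for a time-chunked Zarr variable, and `TimestepDataset` is a hypothetical minimal dataset class (not part of PyTorch or earthkit), just to show the one-sample-one-chunk idea:

```python
import numpy as np

# NumPy stand-in for a Zarr variable of shape (time, lat, lon),
# conceptually chunked as (1, n_lat, n_lon): one timestep == one chunk.
toy_store = np.arange(10 * 4 * 8, dtype="float32").reshape(10, 4, 8)

class TimestepDataset:
    """Minimal dataset sketch: one sample = one timestep = one chunk."""
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return self.array.shape[0]

    def __getitem__(self, t):
        # With chunks=(1, n_lat, n_lon) this slice touches exactly one chunk.
        return self.array[t]

toy_ds = TimestepDataset(toy_store)
sample = toy_ds[3]
print(len(toy_ds), sample.shape)  # 10 (4, 8)
```

With a real store you would slice the opened Zarr array the same way; only the chunk containing timestep `t` is read from disk.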
Setup¶
import earthkit.data as ekd
import xarray as xr
import zarr
import numpy as np
import os
ekd.settings.set({"cache-policy": "user"})
os.makedirs("data", exist_ok=True)
print("zarr version:", zarr.__version__)
print("xarray version:", xr.__version__)Load ERA5 data¶
We load the same sample data as notebook 1 and also save a local copy of the GRIB file for size comparison.
# DATA: era5-2t-msl-1985122512.grib — 2t and msl, single timestep
ds = ekd.from_source("sample", "era5-2t-msl-1985122512.grib").to_fieldlist()
print(f"FieldList: {len(ds)} field(s)")
ds.ls()
# Save the GRIB for size comparison
grib_path = "data/era5_sample.grib"
ds.to_target("file", grib_path)
print(f"GRIB saved: {grib_path}")
Convert to Zarr¶
earthkit-data provides a to_target() method that handles the GRIB → xarray → Zarr conversion in one call. Under the hood it:
Decodes GRIB metadata and values
Builds a CF-convention xarray Dataset
Encodes and writes to the Zarr format
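Step 2, the CF-convention Dataset, can be sketched by hand. Everything below is illustrative: the variable name `t2m`, the grid, and the attribute values are assumptions for the sketch, not what earthkit actually emits:

```python
import numpy as np
import xarray as xr

# Hypothetical 2 m temperature field on a tiny lat/lon grid (made-up values)
values = 280.0 + np.random.rand(1, 4, 8).astype("float32")

cf_ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), values)},
    coords={
        "time": np.array(["1985-12-25T12:00"], dtype="datetime64[ns]"),
        "latitude": np.linspace(90, -90, 4),
        "longitude": np.linspace(0, 315, 8),
    },
)
# CF-style metadata lives in per-variable attrs
cf_ds["t2m"].attrs.update(units="K", long_name="2 metre temperature")
print(cf_ds)
```

A Dataset in this shape is exactly what `to_zarr()` can serialise: named dimensions, coordinate variables, and attribute metadata all map directly onto the Zarr store.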
zarr_path = "data/era5.zarr"
ds.to_target(
    "zarr",
    xarray_to_zarr_kwargs={"store": zarr_path, "mode": "w"},
)
print(f"Zarr store written to: {zarr_path}")
Alternatively, you can go via xarray directly, which gives you full control over encoding:
# Alternative route: earthkit -> xarray -> zarr
# xr_ds = ds.to_xarray()
# xr_ds.to_zarr(zarr_path, mode="w")
Inspect the Zarr store¶
A Zarr store is a directory of small binary files — one per chunk, plus metadata. Let’s look inside.
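One way to see those files is to walk the directory with the standard library. The helper below is generic; the placeholder file names it is demonstrated on merely mimic a store layout (which varies between Zarr format versions) and are not what earthkit writes. In the notebook you would call it on `zarr_path` instead:

```python
import tempfile
from pathlib import Path

def list_store_files(path):
    """Return every file under a Zarr store directory as sorted relative paths."""
    root = Path(path)
    return sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())

# Demonstrate on a throwaway directory with placeholder names
# (in the notebook: list_store_files(zarr_path))
tmp = tempfile.mkdtemp()
for name in ["zarr.json", "2t/c/0/0/0", "msl/c/0/0/0"]:
    f = Path(tmp) / name
    f.parent.mkdir(parents=True, exist_ok=True)
    f.write_bytes(b"\x00")
print(list_store_files(tmp))
```

Metadata files are tiny JSON documents; everything else is one compressed binary blob per chunk.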
# Open with the zarr library to inspect the raw structure
store = zarr.open(zarr_path, mode="r")
print(store.tree())
# Open with xarray — this is how downstream code will use it
zarr_ds = xr.open_dataset(zarr_path, engine="zarr")
print(zarr_ds)
# Inspect chunk layout
for var in zarr_ds.data_vars:
    encoding = zarr_ds[var].encoding
    print(f"{var}: chunks={encoding.get('chunks')}, dtype={zarr_ds[var].dtype}")
Round-trip verification¶
Confirm that the values stored in Zarr match the original GRIB data.
original = ds.to_xarray()
for var in original.data_vars:
    orig_vals = original[var].values
    zarr_vals = zarr_ds[var].values
    max_diff = np.abs(orig_vals - zarr_vals).max()
    print(f"{var}: max absolute difference = {max_diff:.6f}")
Visualise from Zarr¶
Plotting directly from the Zarr store confirms the data is intact.
import earthkit.plots as ekp
# Visualise from the Zarr store (via xarray) — confirms round-trip fidelity
ekp.quickplot(zarr_ds, mode="overlay");
File size comparison¶
def directory_size_kb(path):
    """Total size of a directory tree in kilobytes."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for f in filenames:
            total += os.path.getsize(os.path.join(dirpath, f))
    return total / 1024
grib_kb = os.path.getsize(grib_path) / 1024
zarr_kb = directory_size_kb(zarr_path)
print(f"GRIB size : {grib_kb:.1f} KB")
print(f"Zarr size : {zarr_kb:.1f} KB")
print(f"Ratio     : {zarr_kb / grib_kb:.2f}x")
Note on compression
Zarr uses chunk-level compression (Blosc by default). For small test files the overhead from metadata files can make Zarr appear larger. At real ERA5 scales (gigabytes to terabytes) Zarr is typically more compact than GRIB and dramatically faster for random access.
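Chunk-level compression itself is easy to illustrate with the standard library's `zlib` (a stand-in for Zarr's default Blosc codec; the field values here are made up):

```python
import zlib
import numpy as np

# Illustrative 64x64 float32 field
field = np.linspace(250.0, 300.0, 64 * 64).reshape(64, 64).astype("float32")

# Split into four 32x32 "chunks" and compress each one independently,
# as Zarr does per chunk
chunks = [field[i:i + 32, j:j + 32] for i in (0, 32) for j in (0, 32)]
compressed = [zlib.compress(c.tobytes()) for c in chunks]

print(f"raw: {field.nbytes} B, chunk-compressed: {sum(len(c) for c in compressed)} B")

# Reading one chunk only requires decompressing that one chunk
first = np.frombuffer(zlib.decompress(compressed[0]), dtype="float32").reshape(32, 32)
```

Because each chunk is compressed on its own, a reader can fetch and decode a single chunk without touching the rest of the array, which is exactly what makes random access cheap.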
Summary¶
You have:
Converted a GRIB file to a Zarr store at data/era5.zarr
Verified the round-trip is lossless
Confirmed the Zarr store can be opened with xarray and earthkit.plots
This Zarr store is the input for all subsequent notebooks.
Activity
Re-run the to_target() call with an explicit chunk size: add earthkit_to_xarray_kwargs={"chunks": {"latitude": 5, "longitude": 5}} and inspect the new chunk layout with store.tree().
What happens to the Zarr directory structure when you change chunk sizes?
How does the file count in the Zarr directory relate to the number of chunks?
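As a starting point for the second question: the chunk count is just arithmetic. Each dimension contributes ceil(size / chunk_size) chunks, and typically each chunk is stored as one file per variable. A small helper, applied to an illustrative grid:

```python
from math import ceil, prod

def n_chunks(shape, chunks):
    """Number of chunks = product of ceil(dim_size / chunk_size) over dimensions."""
    return prod(ceil(s / c) for s, c in zip(shape, chunks))

# Illustrative grid: 1 timestep on a 10 x 20 lat/lon grid, chunked (1, 5, 5)
print(n_chunks((1, 10, 20), (1, 5, 5)))  # 1 * 2 * 4 = 8 chunks per variable
```

Compare this number against the file count you observe in the store directory, remembering that metadata files add a few extra entries on top.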
# Your code here