
2. Writing to Zarr

Learning objectives

By the end of this notebook you will be able to:

  • Explain why Zarr suits ML training workloads better than GRIB

  • Convert an earthkit GRIB dataset to a Zarr store

  • Open a Zarr store with xarray and inspect its structure

  • Verify round-trip fidelity by comparing original and reloaded data

  • Compare storage sizes between GRIB and Zarr


Why Zarr for ML?

GRIB is the standard archival format for weather data. It was designed for transmission and long-term storage — not for the random-access read patterns of training loops.

Zarr is a chunked, compressed, cloud-native array format designed for exactly this:

| Property | GRIB | Zarr |
| --- | --- | --- |
| Layout | Sequential messages | Chunked N-D arrays |
| Random access | Slow (scan whole file) | Fast (read one chunk) |
| Cloud-native | No | Yes (S3, GCS, ADLS) |
| Lazy loading | No | Yes |
| Parallel reads | No | Yes |
| Framework integration | Custom parsers | xarray, PyTorch, TF |

A DataLoader reading one timestep at a time from a time-chunked Zarr store reads exactly one chunk per sample — no wasted I/O.
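The chunk arithmetic behind that claim can be sketched in a few lines (a toy model, not earthkit or zarr API): with one timestep per chunk, each sample read touches exactly one chunk, while larger time chunks force the loader to decompress neighbouring timesteps it does not need.

```python
def chunks_read(time_indices, time_chunk):
    """Chunk indices along the time axis touched by reading the given timesteps."""
    return sorted({t // time_chunk for t in time_indices})

# One chunk per sample when the store has one timestep per chunk:
print(chunks_read([0, 1, 2], time_chunk=1))  # -> [0, 1, 2]

# With 8 timesteps per chunk, a single sample still loads a whole 8-step chunk:
print(chunks_read([3], time_chunk=8))  # -> [0]
```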

Setup

import earthkit.data as ekd
import xarray as xr
import zarr
import numpy as np
import os

ekd.settings.set({"cache-policy": "user"})
os.makedirs("data", exist_ok=True)

print("zarr version:", zarr.__version__)
print("xarray version:", xr.__version__)

Load ERA5 data

We load the same sample data as notebook 1 and also save a local copy of the GRIB file for size comparison.

# DATA: era5-2t-msl-1985122512.grib — 2t and msl, single timestep

ds = ekd.from_source("sample", "era5-2t-msl-1985122512.grib").to_fieldlist()
print(f"FieldList: {len(ds)} field(s)")
ds.ls()
# Save the GRIB for size comparison
grib_path = "data/era5_sample.grib"
ds.to_target("file", grib_path)
print(f"GRIB saved: {grib_path}")

Convert to Zarr

earthkit-data provides a to_target() method that handles the GRIB → xarray → Zarr conversion in one call. Under the hood it:

  1. Decodes GRIB metadata and values

  2. Builds a CF-convention xarray Dataset

  3. Encodes and writes to the Zarr format

zarr_path = "data/era5.zarr"

ds.to_target(
    "zarr",
    xarray_to_zarr_kwargs={"store": zarr_path, "mode": "w"},
)

print(f"Zarr store written to: {zarr_path}")

Alternatively you can go via xarray directly, which gives you full control over encoding:

# Alternative route: earthkit -> xarray -> zarr
# xr_ds = ds.to_xarray()
# xr_ds.to_zarr(zarr_path, mode="w")

Inspect the Zarr store

A Zarr store is a directory of small binary files — one per chunk, plus metadata. Let’s look inside.

# Open with the zarr library to inspect the raw structure
store = zarr.open(zarr_path, mode="r")
print(store.tree())
# Open with xarray — this is how downstream code will use it
zarr_ds = xr.open_dataset(zarr_path, engine="zarr")
print(zarr_ds)
# Inspect chunk layout
for var in zarr_ds.data_vars:
    encoding = zarr_ds[var].encoding
    print(f"{var}: chunks={encoding.get('chunks')}, dtype={zarr_ds[var].dtype}")
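The number of chunk files you will see on disk follows directly from the array shape and the chunk sizes. A small helper (hypothetical, not part of zarr) makes the relationship explicit:

```python
import math

def n_chunks(shape, chunks):
    """Number of chunks (and hence chunk files) for an array of `shape` split into `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# e.g. a 721 x 1440 lat/lon grid split into 361 x 720 tiles -> 2 x 2 = 4 chunks
print(n_chunks((721, 1440), (361, 720)))  # -> 4
```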

Round-trip verification

Confirm that the values stored in Zarr match the original GRIB data.

original = ds.to_xarray()

for var in original.data_vars:
    orig_vals = original[var].values
    zarr_vals = zarr_ds[var].values
    max_diff = np.abs(orig_vals - zarr_vals).max()
    print(f"{var}: max absolute difference = {max_diff:.6f}")
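As a reusable variant of the check above, here is a minimal helper (hypothetical name `assert_roundtrip`) that raises if the maximum absolute difference exceeds a tolerance, rather than just printing it:

```python
import numpy as np

def assert_roundtrip(original, restored, atol=0.0):
    """Raise if restored values differ from the originals by more than atol."""
    diff = float(np.abs(np.asarray(original) - np.asarray(restored)).max())
    if diff > atol:
        raise AssertionError(f"max absolute difference {diff} exceeds atol={atol}")
    return diff
```

You could call it per variable inside the loop above, e.g. `assert_roundtrip(orig_vals, zarr_vals)`.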

Visualise from Zarr

Plotting directly from the Zarr store confirms the data is intact.

import earthkit.plots as ekp

# Visualise from the Zarr store (via xarray) — confirms round-trip fidelity
ekp.quickplot(zarr_ds, mode="overlay");

File size comparison

def directory_size_kb(path):
    """Total size of a directory tree in kilobytes."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for f in filenames:
            total += os.path.getsize(os.path.join(dirpath, f))
    return total / 1024

grib_kb = os.path.getsize(grib_path) / 1024
zarr_kb = directory_size_kb(zarr_path)

print(f"GRIB size : {grib_kb:.1f} KB")
print(f"Zarr size : {zarr_kb:.1f} KB")
print(f"Ratio     : {zarr_kb / grib_kb:.2f}x")

Note on compression

Zarr uses chunk-level compression (Blosc by default). For small test files the overhead from metadata files can make Zarr appear larger. At real ERA5 scales (gigabytes to terabytes) Zarr is typically more compact than GRIB and dramatically faster for random access.
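The overhead effect can be demonstrated with plain zlib (standing in for Blosc, which is not assumed here): each compressed chunk pays a fixed header cost, so splitting small data into many tiny chunks produces a larger total than compressing the whole array once.

```python
import zlib
import numpy as np

data = np.zeros(1000, dtype="float64").tobytes()  # 8000 bytes, highly compressible

whole = len(zlib.compress(data))

chunk_size = 80  # deliberately tiny chunks to exaggerate the per-chunk overhead
chunked = sum(
    len(zlib.compress(data[i:i + chunk_size]))
    for i in range(0, len(data), chunk_size)
)

print(f"whole-array: {whole} bytes; {len(data) // chunk_size} tiny chunks: {chunked} bytes")
```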


Summary

You have:

  • Converted a GRIB file to a Zarr store at data/era5.zarr

  • Verified the round-trip is lossless

  • Confirmed the Zarr store can be opened with xarray and earthkit.plots

This Zarr store is the input for all subsequent notebooks.


Activity

  1. Re-run the to_target() call with an explicit chunk size: add earthkit_to_xarray_kwargs={"chunks": {"latitude": 5, "longitude": 5}} and inspect the new chunk layout with store.tree().

  2. What happens to the Zarr directory structure when you change chunk sizes?

  3. How does the file count in the Zarr directory relate to the number of chunks?

# Your code here