Downloading PECDv4.2 data from the CDS via cdsapi

Downloading PECDv4.2 data from the CDS via cdsapi#

This notebook provides a practical introduction to retrieving data from the Copernicus Climate Change Service (C3S) through the Climate Data Store (CDS) Application Program Interface (API), a service providing programmatic access to CDS.

The tutorial will demonstrate how to access climate and energy-related variables from the Pan-European Climate Database (PECDv4.2) derived from reanalysis and climate projections. At the following link you can find an overview of the dataset, the technical documentation and the interface to download the data.

For illustration purposes, this notebook uses a short temporal subsample of the PECDv4.2 dataset. Likewise, only a subset of the available CMIP6 climate projection models and emission scenarios is downloaded and analyzed. This approach is intended to demonstrate the workflow and methodology rather than to produce robust, policy-relevant impact assessments. It is important to emphasize that for rigorous climate impact studies, the use of such limited data is not sufficient. Reliable analyses should be based on longer time series (typically at least 30 years) to capture interannual climate variability and trends, as well as on multiple climate models and scenarios to quantify uncertainty and assess model spread.

In this example we will at first download aggregated data in CSV format for one energy variable, Solar Photovoltaic Capacity Factor (or SPV), covering both a historical window (2011-2014) reconstructed using as input ERA5 reanalysis climate data, and a future window (2031-2034) computed using as input 3 different CMIP6 climate projection models for one of the available scenarios, the SSP245.
Afterwards, we will see how to download the same data, but retaining only strictly necessary information (e.g., a limited list of countries). This approach is useful when storage is constrained and you want to discard unneeded data early. Here, we will keep this subselection of countries: Italy, France, and Germany.

Note
ERA5 is the fifth-generation atmospheric reanalysis program developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) in collaboration with the Copernicus Climate Change Service (C3S). It operates on a global scale and has a spatial resolution of \(0.25° \times \ 0.25°\) (latitude and longitude), which corresponds to approximately 31 km; estimates of atmospheric variables are provided hourly throughout a temporal coverage of about eight decades, from 1940 to today.

Note
CMIP6 (Coupled Model Intercomparison Project Phase 6) is an international effort that brings together climate models from research institutions worldwide. Its goal is to standardize and compare climate simulations to better understand past and future climate behavior. The results are widely used in scientific research and reports like those from the IPCC.

Note
SSP245 (or SSP2-4.5) climate scenario is one of the plausible future pathways that combine assumptions about human development (like population growth, energy use, and policy) with projections of greenhouse gas emissions. Climate models use these scenarios to simulate how the Earth’s climate might respond under different conditions. The SSP2-4.5 is a “middle-of-the-road” scenario that assumes a moderate population and economic growth, a slow and uneven progress toward sustainability, and some mitigation of emissions (though not aggressive climate policies). The “4.5” refers to the projected radiative forcing — the extra energy trapped in the Earth system — of 4.5 W/m² by the year 2100.

Learning objectives 🧠#

In this notebook, you will learn how to use the cdsapi to download data from the Climate Data Store (CDS). You will then learn how to send an API request using Python code through cdsapi. You will see how to split a large request, which won’t be accepted on the CDS, into smaller chunks and send them through a for loop, or using parallel calls, in Python. Finally, you will understand how to retain only the needed information from the downloaded CSV files to reduce the amount of stored data.

Target Audience 🎯#

Anyone interested in learning how to efficiently download data from the PECDv4.2 dataset.

Prepare your environment#

Warning
It is possible to retrieve data directly from the PECD CDS download form by ticking the boxes of interest. Once all the required information is selected manually, you can start the download by clicking on “Submit form”. If you would like to try this method, visit the CDS download form. Please ensure you have an ECMWF account and that you have accepted the Terms and Conditions for each dataset you intend to download.

In this tutorial, we will use Python and the CDS API instead. This method can save time by allowing you to easily send multiple API requests simultaneously. The API calls shown here have the same structure as those displayed at the bottom of the aforementioned download form under “Show API request.”

Import libraries#

We begin by importing the required libraries: these include the os module, which provides a way to interact with the operating system and it is used here to create a folder to store the downloaded data; the glob library, which finds all the pathnames matching a specified pattern according to the rules used by the Unix shell; the pandas library, which is one of the most common and easy to use tools for data analysis and manipulation; cdsapi, which provides programmatic access to the Copernicus Climate Data Store (CDS), allowing you to download data; multiprocessing, which enables the use of multiple processors on your machine and is used here to handle parallel API requests.

import os
import glob
import pandas as pd
import cdsapi
from multiprocessing import Pool

Set up the CDS API and your credentials#

As said above, we will need to exploit the cdsapi library in order to be able to download data. To learn how to use the CDS API, see the official guide. If you have already set up your .cdsapirc file locally, you can upload it directly to your home directory.

Alternatively, you can replace None in the following code cell with your API Token as a string (i.e. enclosed in quotes, like "your_api_key"). Your token can be found on the CDS portal at: https://cds.climate.copernicus.eu/profile (you will need to log in to view your credentials). Remember to agree to the Terms and Conditions of every dataset you intend to download.

# If you have already setup your .cdsapirc file you can leave this as None
cdsapi_key = None
cdsapi_url = "https://cds.climate.copernicus.eu/api"

Download PECDv4.2 data#

Submit a CDS download request#

With everything set up, we can submit our first CDS download request. In the following example we will retrieve data from PECD version “PECDv4.2” across both the “Historical” and “Future Projections” temporal streams. The selected variable is “Solar photovoltaic generation capacity factor” and the chosen technology is “60 (SPV industrial rooftop)”, i.e. industrial rooftop installations with Si modules.

The origins of the data include:

“ERA5 reanalysis” (for historical observations);
“CMCC-CM2-SR5”, “EC-Earth3”, and “MPI-ESM1-2-HR” climate models (for future projections).

The emission scenario selected is “SSP2-4.5”; data is retrieved at the country level (“nuts_0” spatial resolution) and for the years 2011, 2012, 2013, 2014, 2031, 2032, 2033, 2034 (find more about NUTS on Eurostat). To see the corresponding API request, select these options in the CDS download form and click “Show API request”.

In general, these kinds of requests can be submitted in a “standard” way, by following ECMWF guidelines. However, if the download request is large, it might exceed the CDS request cost limit, which can lead to a failure of the download process.

For this reason, we here show a simple method to split large requests into smaller ones using Python code, and, after this, how to submit them individually to the CDS.

Create a function to handle the data download#

First, let’s make sure there is a dedicated folder ready to host the downloaded data.

# create folder to store downloaded data
folder = "cds_data/dowload_data_from_cds"
os.system(f"mkdir -p {folder}")

We will now build a function to send a single API request; later we will call it using the multiprocessing library. This allows us to split a very large request into many smaller ones that can be processed in parallel.

def retrieve_cds_data(
    dataset: str,  # the name of the dataset to download from
    pecd_version: str,  # the version of the Pan-European Climate Database (PECD)
    temporal_period: list[str],  # time period of the data ("historical", "future_projections")
    origin: list[str],  # indicates the source of the data (e.g. ERA5)
    variable: list[str],  # the specific climate or energy variable you want to download
    technology: list[str],  # specifies a technology related to the energy variables
    spatial_resolution: list[str],  # the geographical resolution of the data (e.g. "nuts_0")
    year: list[int],  # the years for which you want to retrieve data
    emissions: list[str] = None,  # if applicable, the emissions scenario (e.g. "ssp2_4_5")
):

    # dictionary of the API request
    request = {
        "pecd_version": pecd_version,
        "temporal_period": temporal_period,
        "origin": origin,
        "variable": variable,
        "technology": technology,
        "spatial_resolution": spatial_resolution,
        "year": year,
    }

    # build the file path to the downloaded data
    file_path = (
        f"{folder}/"
        f"{pecd_version}_{temporal_period[0]}_{origin[0]}_{variable[0]}_"
        f"{technology[0]}_{spatial_resolution[0]}_{year[0]}"
    )

    # add emissions field if needed
    if emissions is not None:
        request["emission_scenario"] = emissions
        file_path += f"_{emissions[0]}"

    file_path += ".zip"

    # initialize Client object
    client = cdsapi.Client(cdsapi_url, cdsapi_key)
    # call retrieve method that downloads the data
    client.retrieve(dataset, request, file_path)

Generate a list of API requests#

This section focuses on splitting the above request by creating a list of smaller requests that will be called by our retrieve_cds_data function. Each item in the list will represent a specific data download request.

First of all, we define some variables that will be used to specify the data to be downloaded from the CDS. These variables act as parameters for the API requests that will be made later.

# define our dataset
dataset = "sis-energy-pecd"

# constants
pecd_version = "pecd4_2"
emissions = ["ssp2_4_5"]
technology = ["60"]
spatial_resolution = ["nuts_0"]

# list of variables to download
varnames = ["solar_photovoltaic_generation_capacity_factor"]

# dictionary of origins - projection models
origins = {
    "historical": ["era5_reanalysis"],
    "future_projections": ["cmcc_cm2_sr5", "ec_earth3", "mpi_esm1_2_hr"],
}

We then create a list of years both for historical data and projection data, and divide those lists into 2 years chunks. This is the core of our splitting strategy: by chunking by year, we submit two years at a time instead of four. This lets us cover longer periods safely by iterating over the chunks while keeping each individual request below the limit.

# list of years to download
# define historical and projection year limits to download
hist_start, hist_end = 2011, 2014
proj_start, proj_end = 2031, 2034
# create the lists using a for loop
hist_years = [str(i) for i in range(hist_start, hist_end + 1)]
proj_years = [str(i) for i in range(proj_start, proj_end + 1)]
# print the lists to clearly understand their structure
print(f"hist_years: {hist_years}")
print(f"proj_years: {proj_years}")

# divide our list of years into 2 groups of 2 years each
n = 2
hist_years_list = [hist_years[n * i: n * (i + 1)] for i in range(0, len(hist_years) // n)]
proj_years_list = [proj_years[n * i: n * (i + 1)] for i in range(0, len(proj_years) // n)]
# print the new divided lists
print(f"hist_years_list: {hist_years_list}")
print(f"proj_years_list: {proj_years_list}")

hist_years: ['2011', '2012', '2013', '2014']
proj_years: ['2031', '2032', '2033', '2034']
hist_years_list: [['2011', '2012'], ['2013', '2014']]
proj_years_list: [['2031', '2032'], ['2033', '2034']]

As you can see from the output of the previous cell, we have eventually obtained two lists, hist_years_list and proj_year_list, made by two elements. Each element of these lists is a list itself, containing two elements (e.g. ‘2011’ and ‘2012’).

We are able now to create our list of “small” API requests. To do so, we will create a nested loop structure. The outer loop iterates through each variable defined in the varnames list above. For each variable, the code will generate requests for both historical and future projection data, contained in a tuple object. The inner loop iterates through each group of years in the corresponding years list. This list of tuples are necessary in order to call the starmap method of multiprocessing.

requests = []

# outer loop through variables
for var in varnames:  # one-step loop, since our varnames list contains just one element
    period = "historical"
    # loop through historical years
    for year in hist_years_list:
        request = (
            dataset,
            pecd_version,
            [period],
            origins[period],
            [var],
            technology,
            spatial_resolution,
            year,
        )
        requests.append(request)
    period = "future_projections"

    # loop through projection years
    for year in proj_years_list:
        for origin in origins[period]:
            request = (
                dataset,
                pecd_version,
                [period],
                [origin],
                [var],
                technology,
                spatial_resolution,
                year,
                emissions,
            )
            requests.append(request)

# print requests
print(f"total requests: {len(requests)}")
for request in requests:
    print(request)

total requests: 8
('sis-energy-pecd', 'pecd4_2', ['historical'], ['era5_reanalysis'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2011', '2012'])
('sis-energy-pecd', 'pecd4_2', ['historical'], ['era5_reanalysis'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2013', '2014'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['cmcc_cm2_sr5'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2031', '2032'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['ec_earth3'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2031', '2032'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['mpi_esm1_2_hr'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2031', '2032'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['cmcc_cm2_sr5'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2033', '2034'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['ec_earth3'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2033', '2034'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['mpi_esm1_2_hr'], ['solar_photovoltaic_generation_capacity_factor'], ['60'], ['nuts_0'], ['2033', '2034'], ['ssp2_4_5'])

These requests can be parallelized with multiprocessing (option 1), but you might as well choose to create a simple for loop over the requests list (option 2). Both ways, the result will be the same.

Download the data#

Option 1: in this example we initialize the Pool object with 8 processes and call the starmap method, passing as arguments the function previously defined and the list of tuples created before.

# Option 1
with Pool(8) as p:
    p.starmap(retrieve_cds_data, requests)

Option 2: as mentioned above, the same can be achieved with a for loop.

# Option 2
for request in requests:
    retrieve_cds_data(*request)

Unzip downloaded files#

Now that we have successfully downloaded our data from the CDS, the only remaining step before analysis is to extract the ZIP files. After that, we will be ready to work with the selected PECDv4.2 data.

Note The final CSV files will have a name according to the naming conventions of Pan-European Climate Database. You can find the explanation of the different fields in the production guide of the PECD.

# Unzipping every file in our folder
for fname in os.listdir(folder):
    if fname.endswith(".zip"):
        subfolder = fname.split(".")[0]
        os.system(f"unzip {folder}/{fname} -d {folder}")

Download PECDv4.2 data and retain only a specific subsample of countries#

As mentioned above, in some cases it may be preferable to download the data and immediately keep only the subset you need, discarding information that isn’t required for subsequent analysis. This can significantly reduce disk usage. Below, we repeat the same download procedure as in the first example, but retaining data for only the information pertaining to three countries. The steps are nearly identical, with a few small changes to achieve this new goal.

First of all, we create a new folder to store the data.

# create new folder to store downloaded data
folder = "cds_data/dowload_subsample_data_from_cds"
os.system(f"mkdir -p {folder}")

As in the first download example, the next step is to define a function that sends a single API request. Then we will build a list of feasible requests and execute them in parallel by calling that function across the list.

Create a new function to handle the data download#

The following function retrieve_sel_cds_csv_data works in a slightly different way from retrieve_cds_data: it downloads and unzips the requested files into a temporary folder, then for each CSV file it writes a new CSV that contains only the columns for the selected countries, listed in reg_list. Afterwards, it deletes the original CSVs and the downloaded ZIP archives, keeping only the filtered files that are needed for subsequent analysis.

def retrieve_sel_cds_csv_data(
    dataset: str,  # the name of the dataset to download from
    pecd_version: str,  # the version of the Pan-European Climate Database (PECD)
    temporal_period: list[str],  # time period of the data ("historical", "future_projections")
    origin: list[str],  # indicates the source of the data (e.g. ERA5)
    variable: list[str],  # the specific climate or energy variable you want to download
    spatial_resolution: list[str],  # the geographical resolution of the data (e.g. "nuts_0")
    year: list[int],  # the years for which you want to retrieve data
    reg_list: list[str],  # NEW: list of the selected countries
    technology: list[str] = None,  # specifies a technology related to the energy variables
    emissions: list[str] = None,  # if applicable, the emissions scenario (e.g. "ssp2_4_5")
):
    # dictionary of the api request
    request = {
        "pecd_version": pecd_version,
        "temporal_period": temporal_period,
        "origin": origin,
        "spatial_resolution": spatial_resolution,
        "year": year,
        "variable": variable,
    }

    # build the file path to the downloaded data
    id_string = (
        f"{pecd_version}_{temporal_period[0]}_{origin[0]}_"
        f"{variable[0]}_{spatial_resolution[0]}_{year[0]}"
    )
    folder_i = f"{folder}/{id_string}"
    os.system(f"mkdir -p {folder_i}")
    file_path = f"{folder_i}/{id_string}"

    # add emissions and technology fields if needed
    if emissions is not None:
        request["emission_scenario"] = emissions
        file_path += f"_{emissions[0]}"
    if technology is not None:
        request["technology"] = technology
        file_path += f"_{technology[0]}"
    file_path += ".zip"

    # initialize Client object
    client = cdsapi.Client(cdsapi_url, cdsapi_key)
    # call retrieve method that downloads the data
    client.retrieve(dataset, request, file_path)  # .download()

    # unzipping files to temporary folder
    os.system(f"unzip {file_path} -d {folder_i}/temp")

    # listing all newly downloaded files
    fpaths = sorted(glob.glob(os.path.join(folder_i, "temp", "*")))

    # Selecting regions from list
    for fpath in fpaths:
        df = pd.read_csv(fpath, comment="#", index_col=["Date"], parse_dates=["Date"])
        # checking if regions are present in CSV file
        for reg in reg_list:
            if reg not in df.columns:
                print(f"MIND: Region {reg} not available in dataframe. Skipping this region.")
                reg_list.remove(reg)
        if not reg_list:
            print("None of provided regions were in downloaded file.")
            continue
        # selecting needed regions from CSV file
        df = df[reg_list]
        # saving new CSV file
        df.to_csv(os.path.join(folder, os.path.basename(fpath)))
    # deleting unnecessary original CSVs
    for f in glob.glob(f"{folder_i}/temp/*.csv"):
        os.remove(f)

    # deleting unnecessary .zip files
    for f in glob.glob(f"{folder_i}/*.zip"):
        os.remove(f)

Generate a list of API requests#

Some variables that will be used when calling retrieve_sel_cds_csv_data, needed to specify the data to download, are defined below. Notice that the same variable, emissions, origins and technology of the first example are selected here. Furthermore, this time we have to define the new variable reg_list.

# define our dataset
dataset = "sis-energy-pecd"

# constants
pecd_version = "pecd4_2"
emissions = ["ssp2_4_5"]
spatial_resolution = ["nuts_0"]
reg_list = ["IT", "FR", "DE"]  # list of regions (based on spatial resolution)

# list of variables to download
varnames = ["solar_photovoltaic_generation_capacity_factor"]
technology = ["60"]

# dictionary of origins - projection models
origins = {
    "historical": ["era5_reanalysis"],
    "future_projections": ["cmcc_cm2_sr5", "ec_earth3", "mpi_esm1_2_hr"],
}

After that, we proceed in defining eight two-year requests that the CDS can handle (see the section “Generate a list of API requests” in the first download example).

# list of years to download
hist_start, hist_end = 2011, 2014
proj_start, proj_end = 2031, 2034
hist_years = [str(i) for i in range(hist_start, hist_end + 1)]
proj_years = [str(i) for i in range(proj_start, proj_end + 1)]

# divide our list of years into 2 groups of 2 years each
n = 2
hist_years_list = [hist_years[n * i: n * (i + 1)] for i in range(0, len(hist_years) // n)]
proj_years_list = [proj_years[n * i: n * (i + 1)] for i in range(0, len(proj_years) // n)]

We are now ready to create a list of API requests via the same nested loop structure exploited in the first example.

requests = []
# outer loop through variables
for var in varnames:
    period = "historical"
    # loop through historical years
    for year in hist_years_list:
        request = (
            dataset,
            pecd_version,
            [period],
            origins[period],
            spatial_resolution,
            [var],
            year,
            reg_list,
            technology,
        )
        requests.append(request)
    period = "future_projections"
    # loop through projection years
    for year in proj_years_list:
        for origin in origins[period]:
            request = (
                dataset,
                pecd_version,
                [period],
                [origin],
                spatial_resolution,
                [var],
                year,
                reg_list,
                technology,
                emissions,
            )
            requests.append(request)

# print requests
print(f"total requests: {len(requests)}")
for request in requests:
    print(request)

Download the data#

Finally, we parallelize the requests with multiprocessing. Remember that the same result can be achieved with a simple for loop over the requests (check the first example, not shown here).

We initialize the Pool object with 4 processes and call the starmap method, passing as arguments the function previously defined and the list of tuples created before.

# Running several processes
with Pool(4) as p:
    p.starmap(retrieve_sel_cds_csv_data, requests)

As mentioned above, the function retrieve_sel_cds_csv_data produces only the CSVs containing the selected countries, so the retrieved data is immediately ready for analysis.

Take home messages 📌#

To download the data from the CDS (climate data store), you can build your own API request using the CDS interface.
There is a size limit for each request, so if you need to download several variables/models or years, you need to split the request into smaller requests; you can do this very easily with Python code, and you can also parallelize them.
If you’re concerned about storage and you are interested in keeping only some specific information and dropping unnecessary data, you can build a function that does it for you straight after each request goes through, so that the final files will contain just what you need.

Downloading PECDv4.2 data from the CDS via cdsapi

Contents

Downloading PECDv4.2 data from the CDS via cdsapi#

Learning objectives 🧠#

Target Audience 🎯#

Prepare your environment#

Import libraries#

Set up the CDS API and your credentials#

Download PECDv4.2 data#

Submit a CDS download request#

Create a function to handle the data download#

Generate a list of API requests#

Download the data#

Unzip downloaded files#

Download PECDv4.2 data and retain only a specific subsample of countries#

Create a new function to handle the data download#

Generate a list of API requests#

Download the data#

Take home messages 📌#