RCC - UChicago, 2025¶

Geospatial Python Part 1: Satellite & Climate Raster Analysis¶

Instructors:¶

  • Hamid Dashti (hdashti@uchicago.edu)
  • Parmanand Sinha (pnsinha@uchicago.edu)

Commands¶

  • Shift + Enter: Run and move to the next cell.
  • Ctrl + Enter: Run the cell in place.
  • Alt + Enter: Run and insert a new cell below.
  • To delete a cell: Press Esc to enter command mode, then press D twice.
  • To insert a new cell below: Press Esc, then B.
  • To insert a new cell above: Press Esc, then A.

Topics¶

Part 1:

  • Introduction to common geospatial file formats

  • Introduction to xarray:

    • Data exploration
    • Indexing
  • Visualizing data with xarray:

    • Static plots
    • Interactive plotting
  • Computation with Xarray:

    • Built-in functions
    • Custom functions
  • Exercise

  • Some advanced features

Geospatial Data Formats for Climate and Satellite Data¶

Key Formats¶

Common geospatial file formats frequently used for climate and satellite data:

NetCDF (Network Common Data Form; Our focus today)¶

Purpose:

  • Storing and sharing multidimensional scientific array-based data with comprehensive metadata.

Key Features:

  • Self-describing with rich metadata for efficient algorithm development.
  • Scalable for efficient access to subsets of large datasets, even remotely.
  • Appendable for data addition without structure redefinition.
  • Sharable with support for multi-user access.

Common Applications:

  • Gridded climate data
  • Satellite images
  • Earth system model outputs

CF Conventions (Climate and Forecast):

  • Standardized metadata for self-description and interoperability.
  • Ensures variables have associated descriptions, physical units, and spatiotemporal coordinates.
  • Enables software tools to work effectively with minimal user intervention.
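
For example, a CF-compliant variable carries attributes such as standard_name and units. A minimal sketch with xarray (the data values here are made up):

In [ ]:
# Illustrative CF-style metadata on a DataArray (values are made up)
import numpy as np
import xarray as xr

temp = xr.DataArray(
    np.random.rand(2, 3),
    dims=["lat", "lon"],
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]},
    attrs={"standard_name": "air_temperature", "units": "K"},
)
temp.attrs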

Links:

  • NetCDF Website
  • CF Conventions

HDF5 (Hierarchical Data Format version 5)¶

Purpose:

  • Storing complex heterogeneous datasets.

Key Features:

  • Hierarchical data organization with groups and datasets.
  • Self-describing with metadata within the file.
  • Multiple data type support (integers, floats, strings, complex numbers).
  • Chunking and compression for efficient storage and access.
  • Large file and dataset support (terabytes to petabytes).
  • Parallel processing capabilities.
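
As an illustration, HDF5 files can be explored in Python with the h5py package (a minimal sketch; "example.h5" is a hypothetical file name):

In [ ]:
# Sketch: walk the group/dataset hierarchy of an HDF5 file with h5py
# ("example.h5" is a hypothetical file name)
import h5py

with h5py.File("example.h5", "r") as f:
    f.visit(print)                    # print every group/dataset path
    # data = f["group/dataset"][:]    # read a dataset into a NumPy array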

Common Applications:

  • Many satellite products (e.g., MODIS) are distributed in HDF formats.
  • Earth system model outputs.

Links:

  • HDF5 Website

Zarr¶

Purpose:

  • Efficient parallel processing and cloud storage of large datasets.

Key Features:

  • Stores data in chunks across multiple files, making reading and writing large datasets faster.
  • Optimized for high-performance computing (HPC), high-throughput computing (HTC), and cloud environments.
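
xarray can write and read Zarr stores directly (a minimal sketch; requires the zarr package, and "example.zarr" is an arbitrary store name):

In [ ]:
# Sketch: round-trip an xarray Dataset through a Zarr store
import numpy as np
import xarray as xr

ds = xr.Dataset({"temp": (("y", "x"), np.random.rand(4, 4))})
ds.to_zarr("example.zarr", mode="w")    # write a chunked store to disk
ds_back = xr.open_zarr("example.zarr")  # lazy, chunked read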

Common Applications:

  • Large-scale scientific datasets
  • Cloud-based data analysis

Links:

  • Zarr Website

STAC (SpatioTemporal Asset Catalog)¶

Purpose:

  • Simplify search and discovery of geospatial data across different providers and platforms.
  • Enable interoperability between various tools and applications working with geospatial data.
  • Facilitate easier cloud storage and access for large datasets.

Key Features:

  • Uses JSON files to describe assets, providing information like location, time, data type, metadata, and availability.
  • Flexible and extensible, allowing customization for specific data types and needs.
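
For instance, STAC APIs can be searched from Python with the pystac-client package (a minimal sketch; the Earth Search endpoint and collection name are illustrative and not part of this workshop's data):

In [ ]:
# Sketch: search a public STAC API (requires pystac-client)
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-87.7, 41.7, -87.5, 41.9],  # Chicago area
    datetime="2023-06-01/2023-06-30",
)
items = list(search.items())
print(len(items))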

Links:

  • STAC Website

Basics of Xarray¶

Xarray: Handling NetCDF¶

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays.

Most scientific data are multidimensional arrays. We can add labels to these arrays to make them easier to work with.

Pandas is a powerful Python package for this purpose, but it is limited to 2-dimensional (e.g., tabular) data.

You can think of xarray as a generalized version of Pandas that can handle n-dimensional data.

With xarray, we can read, write, and process NetCDF files.

Dataset Diagram

Some key terminology:

  • DataArray: A labeled multi-dimensional array.
  • Dataset: A collection of DataArray objects.
  • Variable: A NetCDF-like variable consisting of dimensions, data, and attributes which describe a single array.
  • Dimension: The name of an axis of the data. E.g., in math we speak of the 'x' dimension to describe values along the x-axis; in climate data, a 'time' dimension indicates that the data has a temporal aspect.
  • Coordinates: An array that labels a dimension or set of dimensions of another DataArray. E.g., just as the 'x' axis in math carries tick values (0, 1, 2, ...), the latitude dimension can carry labels.
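
To make these terms concrete, here is a tiny hand-built example (all values are illustrative):

In [ ]:
# A DataArray (labeled array) wrapped in a Dataset (collection of DataArrays)
import numpy as np
import pandas as pd
import xarray as xr

precip = xr.DataArray(
    np.random.rand(3, 2),                                # the raw data
    dims=("time", "lat"),                                # dimension names
    coords={"time": pd.date_range("2025-01-01", periods=3, freq="MS"),
            "lat": [41.5, 42.0]},                        # coordinate labels
    attrs={"units": "mm"},                               # attributes (metadata)
)
ds = xr.Dataset({"precip": precip})  # a Dataset holding one variable
ds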

1. Get Data¶

The data are located in /project/rcc/workshops/hamid/spring2025/data. Use the cp command to copy them into your local data directory.
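
For example, from a notebook cell (a minimal sketch; adjust the destination path to your setup):

In [ ]:
# Copy the workshop files into a local ./data directory (run once)
!mkdir -p ./data
!cp /project/rcc/workshops/hamid/spring2025/data/* ./data/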

Import libraries:

In [ ]:
# import libraries
import matplotlib.pyplot as plt
import xarray as xr
import numpy as np

2. Open the data and explore¶

In [ ]:
# Open the Landsat and air temperature datasets
landsat = xr.open_dataset("./data/small_landsat.nc")
air_temp = xr.open_dataset("./data/air.mon.mean.nc")
In [ ]:
# Explore landsat dataset
landsat
In [ ]:
# Explore air temperature dataset
air_temp
In [ ]:
# Extract the air temperature variable
ta = air_temp["air"]
ta

If we already know that there is only one variable in the dataset, we can load it directly with xr.open_dataarray().

In [ ]:
ta = xr.open_dataarray("./data/air.mon.mean.nc")
ta

Extract the values of the air temperature DataArray as a NumPy array:

In [ ]:
# ta.values
data = ta.data
print(data)
print(data.shape)

Get the dimension names. Note these are only names.

In [ ]:
ta.dims

Get the coordinates. Note these are arrays of labels (lat/lon values, timestamps).

In [ ]:
ta.coords

You can extract the time, latitude, and longitude coordinates as follows:

In [ ]:
ta.coords["time"]

Get the metadata (attributes):

In [ ]:
ta.attrs
# Similar to coordinates, you can access individual attributes using .attrs["key"]:
# print(ta.attrs['units'])

Many NumPy functions and attributes can be used here as well:

In [ ]:
# For example get the shape of the data
ta.shape

Indexing and selecting data and simple plotting¶

Labeling data (as xarray does) makes indexing and selecting data very flexible and intuitive. We can index data by position or by label.

Index with Python's standard [] syntax:

In [ ]:
# For example index data at the first time step and all latitudes and longitudes.
ta[0, :, :]

Save NetCDF files to disk using to_netcdf():

In [ ]:
# ta.to_netcdf("ta.nc")

Quick Practice: Index the data for the Northern Hemisphere at time index 100, plot it, and save it to disk.

Hint: Use functions like range() or np.arange() for efficient indexing. For plotting, extract the data using .values and plot it with plt.imshow().

Bonus: If you have other software, such as QGIS, installed on your computer, try opening the exported file to explore its contents.

In [ ]:
# Your codes here

Solution:

In [ ]:
# First find the indices for latitudes from 0 to 90
# print(ta.coords["lat"][0:180])
upper_hemisphere = ta[100, 0:180, :]
plt.imshow(upper_hemisphere.values, cmap="coolwarm")
# Save to disk:
# upper_hemisphere.to_netcdf("upper_hemisphere.nc")

Index with dimension names using the isel() and sel() methods. Much more intuitive!

  • isel() is integer-based selection, much like iloc[] in pandas or Python's standard [].
  • sel() is label-based selection, like loc[] in pandas.
In [ ]:
# Integer-based selection of the first 12 time steps; the same works along lat and lon
data_isel = ta.isel(time=range(0, 12))
data_isel
In [ ]:
data_sel = ta.sel(time=["2023-08-01", "2023-09-01"])
data_sel

You can select a range of labels:

In [ ]:
ta.sel(time=slice("1984-01-01", "1984-12-01"))

Nearest-neighbor selection. Sometimes we don't know the exact label (e.g., an exact latitude value), so we use the nearest-neighbor method.

Example: here we want to find the air temperature nearest to a specific lat and lon.

But the longitude of this file runs from 0 at its left edge to 360 at its right edge. Let's convert it to -180 to 180.

In [ ]:
# fixing longitude
# We will soon see more on basic calculations like below:
ta.coords["lon"] = (ta.coords["lon"] + 180) % 360 - 180
ta = ta.sortby(ta.lon)

Now find the nearest neighbor

In [ ]:
nearest_neighbor = ta.sel(lat=43, lon=89, method="nearest")
plt.plot(nearest_neighbor.values, "-")
nearest_neighbor

Create Mask with where()¶

The where() method comes in handy in many situations.

For example, we can find where in the world the temperature was above/below a certain threshold, create masks, find missing data, and much more.

In [ ]:
da = ta.sel(time="2023-05-01")
da_cold = da.where(da < 273.15)
plt.imshow(da_cold.values, cmap="coolwarm")

We can also replace the masked regions with data of our choice.

In the code below we find grid cells where the temperature is less than 273.15 K and replace them with 500 K (just because we can!), keeping the original values elsewhere.

In [ ]:
# Replace the below-freezing values with 500 (just for the sake of example)
da_hot = xr.where(da < 273.15, 500, da)
plt.imshow(da_hot.values, cmap="coolwarm")

We can also replace masked values with another DataArray:

In [ ]:
da_tmp = xr.where(da < 273.15, ta.sel(time="1985-08-01"), da)

3. Plotting¶

One dimensional plot¶

We can simply use the plot() method to plot the data.

In [ ]:
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot()

Under the hood, xarray uses the matplotlib library. Any argument that matplotlib accepts can be passed through xarray:

In [ ]:
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot.line("-*")
In [ ]:
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot(
    aspect=2, size=3
)
plt.title("Temperature at lat/lon index (100, 150)")
plt.xlabel("Monthly Time")
plt.ylabel("Temperature (K)")
plt.tight_layout()

Two dimensional plots¶

xarray is smart enough that in many cases it can interpret your data and choose an appropriate plot type.

In [ ]:
ta.isel(time=0).plot(robust=True, cbar_kwargs={"label": "Temperature (°C)"})
plt.title("Temperature in January 1948")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

Faceting¶

It's handy when we want to make multiple plots along a dimension (e.g., time):

In [ ]:
ta_sub = ta.sel(time=slice("2010-01-01", "2010-06-30"))
ta_sub.plot(col="time", col_wrap=3)

More advanced plots¶

In general, xarray's basic plotting is great for data exploration (a quick look).

However, other plotting packages integrate with xarray to produce more sophisticated plots.

For example, the code below shows how to plot the data in various projections, add coastlines, etc.

In [ ]:
import cartopy.crs as ccrs
import cartopy.feature as cfeature

p = ta.isel(time=10).plot(
    subplot_kws=dict(projection=ccrs.Orthographic(-80, 35), facecolor="gray"),
    transform=ccrs.PlateCarree(),
    robust=True,
)

p.axes.coastlines()
p.axes.add_feature(cfeature.BORDERS)

Different projections

In [ ]:
p = ta.isel(time=10).plot(
    subplot_kws=dict(projection=ccrs.Robinson(), facecolor="gray"),
    figsize=(10, 5),
    transform=ccrs.PlateCarree(),
    robust=True,
)

p.axes.coastlines()
p.axes.add_feature(cfeature.BORDERS)

Interactive plotting¶

Hover your mouse over the plot to see the temperature at different locations.

In [ ]:
# import hvplot.xarray

# ta.isel(time=10).hvplot(
#     width=800,
#     height=400,
#     cmap="fire",
#     projection=ccrs.Mollweide(),
# )

You may run into issues with the next interactive figure; if you do, upgrade the ipywidgets library using the following command:

python -m pip install --upgrade ipywidgets

Make a simple animation:

In [ ]:
# # !pip install --upgrade ipywidgets
# ta.isel(time=range(12)).hvplot(
#     width=800,
#     height=400,
#     cmap="fire",
#     # projection=ccrs.Orthographic(-90, 30),
#     coastline=True,
#     groupby="time",
#     widget_type="scrubber",
#     widget_location="bottom",
# )

4. Computations¶

Xarray integrates seamlessly with NumPy, allowing you to apply many familiar NumPy functions directly to xarray objects (DataArrays and Datasets) while preserving their labeled structure and coordinates.

In [ ]:
# For example, let's calculate the log and sine of the air temperature, whatever they mean!
ta_sel = ta.sel(time="2023-08-01")

log_ta = np.log(ta_sel)
sin_ta = np.sin(ta_sel)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
log_ta.plot(ax=axes[0])
sin_ta.plot(ax=axes[1])

axes[0].set_title("Log of Air Temperature!")
axes[1].set_title("Sine of Air Temperature!")
plt.tight_layout()

You can also do operations on multiple variables:

In [ ]:
### Let's calculate the Normalized Difference Vegetation Index (NDVI) from the Landsat data
# NDVI = (NIR - RED) / (NIR + RED)
NIR = landsat["SR_B5"]
RED = landsat["SR_B4"]
NDVI = (NIR - RED) / (NIR + RED)
NDVI = NDVI.squeeze()

NIR.sel(time="2023-05-04").plot(
    cmap="viridis",
    cbar_kwargs={"label": "NIR"},
    figsize=(8, 4),
)

We can use isnull(), notnull(), fillna(), dropna(), and a few more methods to deal with missing data.

Plot grids where the data is NaN and where it is not NaN:

In [ ]:
plt.subplot(1, 2, 1)
ta_sel.isnull().plot()
plt.subplot(1, 2, 2)
ta_sel.notnull().plot()

Fill the missing values with 0

In [ ]:
ta_sel.fillna(0).plot(robust=True)

Taking the mean/max/min etc.

In [ ]:
ta_sel_mean = ta_sel.mean()
ta_sel_mean
# Try to sum, std, min, max, median, quantile, etc.

Including labels (dimensions) in our calculations.

For example, we can take the mean over the lat and lon dimensions to get the global mean temperature.

In [ ]:
ta_mean = ta.mean(dim=["lat", "lon"])
ta_mean.sel(time=slice("2020-01-01", "2022-12-01")).plot()

Practice: compute the mean temperature map from 1990 to 2010 and plot the result¶

In [ ]:
# Your codes here

Solution

In [ ]:
ta.sel(time=slice("1990-01-01", "2010-12-01")).mean(dim="time").plot()

Binning and grouping data using groupby and resample.

In [ ]:
ta_group = ta.groupby("time.year").mean()
ta_resample = ta.resample(time="Y").mean()
# Notice the dimension of the DataArray is now year instead of time

Note:

Grouping:

  • Uses the groupby function to group the data by the "year" component of the "time" coordinate. Each group represents a year's worth of data.
  • The mean() function is then applied within each group, calculating the average value for each year.
  • Flexibility: This method offers more flexibility if you need to perform different operations on each year's data (e.g., standard deviation, percentiles). You can iterate over the groups or apply further calculations within the groupby object (see the sketch after these notes).

Resampling:

  • Uses the resample function to directly change the sampling frequency of the data to "Y" (yearly). This essentially averages all values within each year automatically.
  • For simple calculations like computing averages, this approach can be more efficient because it avoids explicit grouping and iteration.
  • Applying further operations on individual years requires additional steps.
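
As a quick illustration of that flexibility, the same groupby machinery works with any datetime component of the time coordinate (a minimal sketch using the ta DataArray from above):

In [ ]:
# groupby works with any datetime component, e.g. a seasonal climatology
ta_season = ta.groupby("time.season").mean()  # dims become (season, lat, lon)
ta_month = ta.groupby("time.month").mean()    # monthly climatology
ta_season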

Calculate a moving average. (Drop the year 2025 from the plot because we only have a few months of data.)

In [ ]:
annual_ta = ta.groupby("time.year").mean()  # Group the data by year
window_size = 5  # Define the window size for the moving average

# Calculate the moving average
moving_avg = annual_ta.rolling(year=window_size, center=True).mean()
In [ ]:
# plot the results of moving average

# your codes here
In [ ]:
# Solution
fig, ax = plt.subplots(figsize=(10, 5))
annual_ta.mean(dim=["lat", "lon"]).plot(label="Annual Mean")
moving_avg.mean(dim=["lat", "lon"]).plot(label="5-Year Moving Average")
ax.legend()

What if we want to resample the data to a different time frequency?

In [ ]:
ta_resample = ta.resample(time="5Y").mean(dim="time")
ta_resample

Apply a custom function to the data along dimension(s)¶

Xarray's apply_ufunc() is great for applying your custom functions along any dimension of your data.

Let's calculate the trend in annual temperature:

In [ ]:
# Define a function that computes the linear trend of a time series (using numpy.polyfit())
def linear_trend(y):
    if np.isnan(y).any():
        return np.nan
    x = np.arange(len(y))
    pf = np.polyfit(x, y, 1)
    # return the slope (the first coefficient from polyfit)
    return pf[0]

Let's try some basic usage:

In [ ]:
# Usage of the function on synthetic data
x = np.arange(30)
y = 2 * x + 3 + np.random.randn(30) * 5
plt.plot(x, y, "o")
trend = linear_trend(y)
# Plot the fitted line (the intercept 3 is known from how we generated y)
plt.plot(x, trend * x + 3, "-r")
plt.title("Linear Trend")

Apply the function to each pixel of the air temperature data:

In [ ]:
# Calculate the annual mean
ta_annual = ta.resample(time="Y").mean()
# Calculate the linear trend
trend_result = xr.apply_ufunc(
    linear_trend,  # The function to apply (linear_trend in this case)
    ta_annual,  # The input data (ta_annual in this case)
    input_core_dims=[
        ["time"]
    ],  # Specifies the core dimensions of the input data (in this case, "time" is the core dimension)
    vectorize=True,  # Vectorize the function (apply element-wise operations)
)

Plot the trend results:

In [ ]:
trend_result.plot(robust=True)

Problem 2: Calculating Annual Summer Temperature Anomaly¶

Objective:¶

To calculate the annual summer temperature anomaly for the entire time period and to visualize the anomaly trend and anomaly map for the year 2023.

Hint:¶

A simple anomaly index is obtained by subtracting the long-term mean from the data.

Procedure:¶

  1. Summer Month Mask:

    • Create a mask for summer months (June, July, August) using ta['time.month'] and &.
  2. Apply Mask:

    • Apply the mask to select only summer months using ta.where(mask). Note: Use the argument drop=True in where() to exclude months other than summer.
  3. Calculate Climatological Mean:

    • Compute the annual mean of summer temperatures using groupby(), then take the climatological mean over all years.
  4. Calculate Anomalies:

    • Determine anomalies by subtracting the climatological mean from the summer temperature data.
  5. Visualization:

    • Plot the anomaly for the year 2023 and the global mean (mean(["lat", "lon"])) for the entire time period. Use plt.subplots() to display them side by side.

Note:¶

Keep an eye out for any unusual patterns or trends in the anomalies.

Solution:

A. Calculate the summer anomaly

In [ ]:
# Calculate the mask for summer months (June, July, August)
mask = (ta["time.month"] >= 6) & (ta["time.month"] <= 8)

# Apply the mask to select only summer months data
ta_summer = ta.where(mask, drop=True)

# Calculate the annual mean of summer temperatures
ta_summer_annual = ta_summer.groupby("time.year").mean()

# Calculate the climatological mean of summer temperatures
clima_mean = ta_summer_annual.mean(dim="year")

# Calculate the summer temperature anomalies by subtracting the climatological mean
summer_anomalies = ta_summer_annual - clima_mean

B. Plot the anomaly trend and anomaly map for the year 2023

In [ ]:
fig, axs = plt.subplots(ncols=2, figsize=(12, 4))

# Plot the summer anomalies for the year 2023
summer_anomalies.sel(year=2023).plot(robust=True, cmap="coolwarm", ax=axs[0])

# Plot the global mean of summer temperature anomalies
summer_anomalies.mean(["lat", "lon"]).plot(ax=axs[1])

# Add an arrow pointing to the anomaly value in the year 2023
axs[1].annotate(
    "Approaching values projected for 2100 :(",
    xy=(2023, summer_anomalies.sel(year=2023).mean(["lat", "lon"])),
    xytext=(1960, summer_anomalies.sel(year=2023).mean(["lat", "lon"]) - 0.5),
    arrowprops=dict(facecolor="black", arrowstyle="->"),
)

plt.tight_layout()

More on 1.5 °C and its importance: IPCC report, Chapter 1.

P.S. To be more accurate, we would need to do a bit more:

  • We need to take the area of grid cells into account when calculating the global mean (see the sketch below).
  • We need to deal with outliers.
  • This dataset is one of many; other datasets may show different results.
  • And anomalies are not distributed uniformly; for example, the Arctic is warming about twice as fast as the rest of the world, as we saw in the earlier trend analysis.
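
For the first point, here is a minimal sketch of an area-weighted global mean using cosine-of-latitude weights and xarray's weighted() method (a common approximation for regular lat/lon grids):

In [ ]:
# Sketch: approximate grid-cell area weighting with cos(latitude)
weights = np.cos(np.deg2rad(ta.lat))
weights.name = "weights"
ta_weighted = ta.weighted(weights).mean(dim=["lat", "lon"])
ta_weighted.sel(time=slice("2020-01-01", "2022-12-01")).plot()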

Calculate Landsat Enhanced Vegetation Index (EVI)¶

EVI = G * ((NIR - R) / (NIR + C1 * R - C2 * B + L))

For Landsat 8-9: EVI = 2.5 * ((Band 5 - Band 4) / (Band 5 + 6 * Band 4 - 7.5 * Band 2 + 1))

Plot NDVI and EVI side by side for comparison.

In [ ]:
# Your codes here
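
A possible solution (a minimal sketch; it assumes SR_B2 is the blue band, as in Landsat 8-9 surface reflectance, and reuses the unscaled NIR, RED, and NDVI variables from the NDVI cell above):

In [ ]:
# Sketch: EVI from the same (unscaled) surface-reflectance bands used for NDVI
BLUE = landsat["SR_B2"]  # assumption: SR_B2 (blue) exists in this file
EVI = 2.5 * ((NIR - RED) / (NIR + 6 * RED - 7.5 * BLUE + 1))
EVI = EVI.squeeze()  # assumes a single time step, as in the NDVI example

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
NDVI.plot(ax=axes[0], cbar_kwargs={"label": "NDVI"})
EVI.plot(ax=axes[1], cbar_kwargs={"label": "EVI"})
plt.tight_layout()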

Large Data and Speed: The Next Challenge¶

We've covered the basics of climate data processing in Python. Now, let's tackle the real challenges:

  1. Out-of-memory data: What happens when datasets are too large to fit in memory?
  2. Speeding up computations: How can we analyze data faster, especially with large datasets?

In the next session, we'll introduce Dask, a powerful tool for handling these issues.