Commands¶
Shift
+Enter
: Run and move to the next cell.Ctrl
+Enter
: Run the cell in place.Alt
+Enter
: Run and insert a new cell below.- To delete a cell: Press
esc
to enter command mode, then pressDD
. - To insert a new cell below: Press
esc
thenb
. - To insert a new cell above: Press
esc
thena
.
Topics¶
Part 1:
Introduction to common geospatial file formats
Introduction to xarray:
- Data exploration
- Indexing
Visualizing data with xarray:
- Static plots
- Interactive plotting
Computation with Xarray:
- Built-in functions
- Costume functions
Exercise
Some advanced features
Geospatial Data Formats for Climate and Satellite Data¶
Key Formats¶
Common geospatial file formats frequently used for climate and satellite data:
NetCDF (Network Common Data Form; Our focus today)¶
Purpose:
- Storing and sharing multidimensional scientific array-based data with comprehensive metadata.
Key Features:
- Self-describing with rich metadata for efficient algorithm development.
- Scalable for efficient access to subsets of large datasets, even remotely.
- Appendable for data addition without structure redefinition.
- Sharable with support for multi-user access.
Common Applications:
- Gridded climate data
- Satellite images
- Earth system model outputs
CF Conventions (Climate and Forecast):
- Standardized metadata for self-description and interoperability.
- Ensures variables have associated descriptions, physical units, and spatiotemporal coordinates.
- Enables software tools to work effectively with minimal user intervention.
Links:
HDF5 (Hierarchical Data Format version 5)¶
Purpose:
- Storing complex heterogeneous datasets.
Key Features:
- Hierarchical data organization with groups and datasets.
- Self-describing with metadata within the file.
- Multiple data type support (integers, floats, strings, complex numbers).
- Chunking and compression for efficient storage and access.
- Large file and dataset support (terabytes to petabytes).
- Parallel processing capabilities.
Common Applications:
- Many satellite data (e.g. MODIS) is HDF5.
- Earth system model outputs.
Links:
Zarr¶
Purpose:
- Efficient parallel processing and cloud storage of large datasets.
Key Features:
- Stores data in chunks across multiple files. Makes reading and writing large datasets faster.
- Optimized for high-performance computing (HPC), high-throughput computing (HTC), and cloud environments.
Common Applications:
- Large-scale scientific datasets
- Cloud-based data analysis
Links:
STAC (SpatioTemporal Asset Catalog)¶
Purpose::
- Simplify search and discovery of geospatial data across different providers and platforms.
- Enable interoperability between various tools and applications working with geospatial data.
- Facilitate easier cloud storage and access for large datasets.
Key Features:
- Uses JSON files to describe assets, providing information like location, time, data type, metadata, and availability.
- Flexible and extensible, allowing customization for specific data types and needs.
Links:
Basics of Xarray¶
Xarray: Handling NetCDF¶
Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays.
Most scientific data are multidimensional arrays. We can add labels to this matrices so we can make them easier to work with.
The Pandas is a powerful Python package for this purpose but its limitted to 2-dimensional (e.g. tablular) data.
You can think of xarray as a generalized version of Pandas that can handle n-dimensional data.
With Xarray, we can read, write and process netcdf files.
Some key terminology:
- DataArray: A labeled multi-dimensional array.
- Dataset: A collection of DataArray objects.
- Variable: A NetCDF-like variable consisting of dimensions, data, and attributes which describe a single array.
- Dimension: The name of the axis of the data. E.g. in math we say 'x' dimension to describe values on x axis, in climate data we can say 'time' dimension to describe that data has a temporal aspect.
- Coordinates: An array that labels a dimension or set of dimensions of another DataArray. E.g. for 'x' axis in math we have (0,1,2,...), similarly we can have labels for the latitude dimension.
1. Get Data¶
The data is located in /project/rcc/workshops/hamid/spring2025/data
. use the command cp and copy them to the data directory
Import libraries:
# import libraries
import matplotlib.pyplot as plt
import xarray as xr
import numpy as np
2. Open the data and explore¶
# Open the landsat and air temrperature datasets
landsat = xr.open_dataset("./data/small_landsat.nc")
air_temp = xr.open_dataset("./data/air.mon.mean.nc")
# Explore landsat dataset
landsat
# Explore air temperature dataset
air_temp
# Extract the air temprature variables
ta = air_temp["air"]
ta
If we already know that there is only one variable in the dataset, we can use the shortcut to load the data using xr.open_dataarray()
.
ta = xr.open_dataarray("./data/air.mon.mean.nc")
ta
Extract the values of the air temprature DataArray as numpy array
# ta.values
data = ta.data
print(data)
print(data.shape)
Get dimensions names. Note its only names.
ta.dims
Get the coordinates. Note these are array of labels (lat/lon values, timestamps).
ta.coords
you can extract the time, latitude, and longitude coordinates as follows:
ta.coords["time"]
Get the metadata (attribute):
ta.attrs
# Similar to coordinates, you can access the attributes of a DataArray using the .attrs[""]:
# print(ta.attrs['units'])
Many functions used in numpy can be used here as well:
# For example get the shape of the data
ta.shape
Indexing and selecting data and simple plotting¶
Labeling data (as xarray does) make indexing and selecting data very flexible and intuitive. We can index data based on position or labels.
index like python standard method []
:
# For example index data at the first time step and all latitudes and longitudes.
ta[0, :, :]
Save NetCDF files to the disk using to_netcdf()
# ta.to_netcdf("ta.nc")
Quick Practice: Index Data Inclusive of the Upper Hemisphere for Time=100, plot it and Save to Disk
Hint: Utilize functions like range()
or np.arange()
for efficient indexing. For plotting extract data using .values
and plot it using plt.imshow()
Bonus: If you have other software, such as QGIS, installed on your computer, try opening the exported file to explore its contents.
# Your codes here
Solution:
# first find what is the index for latitude from 0 to 90
# print(ta.coords["lat"][0:180])
upper_hemsphere = ta[100, 0:180, :]
plt.imshow(upper_hemsphere.values, cmap="coolwarm")
Index with dimension names using isel()
and sel
methods. Much more intuitive!
isel()
is integer-based selection much likeiloc()
inpandas
or python's[]
.sel()
is label-based selection like'loc()
in pandas.
# you can do same thing along lat and lon dimensions
data_isel = ta.isel(time=range(0, 12))
data_isel
data_sel = ta.sel(time=["2023-08-01", "2023-09-01"])
data_sel
You can select a range of labels:
ta.sel(time=slice("1984-01-01", "1984-12-01"))
Nearest neighbor selection. Sometime we don' know the exact index (e.g. latitude), so we use nearest neighbor method.
Example: Here we want find closes air temperature to a specific lat and lon.
But the longitude of this file starts from 0 on its most left to 360 on its most right. Lets make it from -180 to 180.
# fixing longitude
# We will soon see more on basic calculations like below:
ta.coords["lon"] = (ta.coords["lon"] + 180) % 360 - 180
ta = ta.sortby(ta.lon)
Now find the nearest neighbor
nearest_neighbor = ta.sel(lat=43, lon=89, method="nearest")
plt.plot(nearest_neighbor.values, "-")
nearest_neighbor
Create Mask with where()
¶
The where()
can come handy in many many situations.
For example, we can tell where in the world temperature was above/below a certain degree, create masks, find missing data and many other applications.
da = ta.sel(time="2023-05-01")
da_cold = da.where(da < 273.15)
plt.imshow(da_cold.values, cmap="coolwarm")
We can replace the masked regions and replace it with desired data.
In the code below we say find grids where temp is less than 273.15 K and replace them with temp=500 K (just because we can!), and for other region keep the original value.
# Replace the below zero values with 100 (just for the sake of example)
da_hot = xr.where(da < 273.15, 500, da)
plt.imshow(da_hot.values, cmap="coolwarm")
We can replace it with other DataArrays...
da_tmp = xr.where(da < 273.15, ta.sel(time="1985-08-01"), da)
3. Plotting¶
One dimensional plot¶
We can simply use the plot()
method to plot the data.
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot()
xarray under the hood used the matplotlib
library. Any argument that goes into matplotlib can be passed to xarray
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot.line("-*")
ta.sel(time=slice("2010-01-01", "2020-12-31")).isel(lat=100, lon=150).plot(
aspect=2, size=3
)
plt.title("Temperature at 100N, 150E")
plt.xlabel("Monthly Time")
plt.ylabel("Temperature (K)")
plt.tight_layout()
Two dimensional plots¶
xarray is smart enough that in many cases it can interpret your data and choose an appropriate type.
ta.isel(time=0).plot(robust=True, cbar_kwargs={"label": "Temperature (°C)"})
plt.title("Temperature in January 1948")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
Faceting¶
Its handy when we want to plot multiple plots along a dimension (e.g. time)
ta_sub = ta.sel(time=slice("2010-01-01", "2010-06-30"))
ta_sub.plot(col="time", col_wrap=3)
More advanced plots¶
In general the xarray basic plotting is great for data exploration (quick test).
However, it other more plotting packages were integrated with xarray to produce more complicated plots.
For example code below is how to plot it in various projections, adding coastlines etc.
import cartopy.crs as ccrs
import cartopy.feature as cfeature
p = ta.isel(time=10).plot(
subplot_kws=dict(projection=ccrs.Orthographic(-80, 35), facecolor="gray"),
transform=ccrs.PlateCarree(),
robust=True,
)
p.axes.coastlines()
p.axes.add_feature(cfeature.BORDERS)
Different projections
p = ta.isel(time=10).plot(
subplot_kws=dict(projection=ccrs.Robinson(), facecolor="gray"),
figsize=(10, 5),
transform=ccrs.PlateCarree(),
robust=True,
)
p.axes.coastlines()
p.axes.add_feature(cfeature.BORDERS)
Interactive plotting¶
Hover your mouse over the plot to see the temperature at different locations.
# import hvplot.xarray
# ta.isel(time=10).hvplot(
# width=800,
# height=400,
# cmap="fire",
# projection=ccrs.Mollweide(),
# )
You may run into issue with the next interactive figure, if you do, you can upgrade the ipywidgets library using the following command:
python -m pip install --upgrade ipywidgets
Make a simple animations
# # !pip install --upgrade ipywidgets
# ta.isel(time=range(12)).hvplot(
# width=800,
# height=400,
# cmap="fire",
# # projection=ccrs.Orthographic(-90, 30),
# coastline=True,
# groupby="time",
# widget_type="scrubber",
# widget_location="bottom",
# )
4. Computations¶
Xarray integrates seamlessly with NumPy, allowing you to apply many familiar NumPy functions directly to xarray objects (DataArrays and Datasets) while preserving their labeled structure and coordinates.
# For example let calculate log and sin of the air temprature, whatever they mean!
ta_sel = ta.sel(time="2023-08-01")
log_ta = np.log(ta_sel)
sin_ta = np.sin(ta_sel)
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
log_ta.plot(ax=axes[0])
sin_ta.plot(ax=axes[1])
axes[0].set_title("Log of Air Temperature!")
axes[1].set_title("Sine of Air Temperature!")
plt.tight_layout()
you can also do operations on the multiple variables
### Let calcualte Normalized Difference Vegetation Index (NDVI) from landsat data
# NDVI = (NIR - RED) / (NIR + RED)
NIR = landsat["SR_B5"]
RED = landsat["SR_B4"]
NDVI = (NIR - RED) / (NIR + RED)
NDVI = NDVI.squeeze()
NIR.sel(time="2023-05-04").plot(
cmap="viridis",
cbar_kwargs={"label": "NIR"},
figsize=(8, 4),
)
we can use isnull()
, notnull()
, fillna()
, dropna()
and a few more to deal with missing data
Plot grids where it is NaN
and its not NaN
plt.subplot(1, 2, 1)
ta_sel.isnull().plot()
plt.subplot(1, 2, 2)
ta_sel.notnull().plot()
Fill the missing values with 0
ta_sel.fillna(0).plot(robust=True)
Taking the mean/max/min etc.
ta_sel_mean = ta_sel.mean()
ta_sel_mean
# Try to sum, std, min, max, median, quantile, etc.
Including labels (dimensions) in our calculations.
For example we can take the mean over lat and lon dimensions to get the global mean of temperature.
ta_mean = ta.mean(dim=["lat", "lon"])
ta_mean.sel(time=slice("2020-01-01", "2022-12-01")).plot()
Practice: plot the map of global temp mean from the year 1990 to 2010 and plot the results¶
# Your codes here
Solution
ta.sel(time=slice("1990-01-01", "2010-12-01")).mean(dim="time").plot()
Binning and grouping data using groupby
and resample
.
ta_group = ta.groupby("time.year").mean()
ta_resample = ta.resample(time="Y").mean()
# Notice the dimension of the DataArray is now year instead of time
Note:
Grouping:
- Uses the groupby function to group the data by the "year" component of the "time" coordinate. Each group represents a year's worth of data.
- The
mean()
function is then applied within each group, calculating the average value for each year. - Flexibility: This method offers more flexibility if you need to perform different operations on each year's data (e.g., standard deviation, percentiles). You can iterate over the groups or apply further calculations within the groupby object.
Resampling:
- Uses the resample function to directly change the sampling frequency of the data to "Y" (yearly). This essentially averages all values within each year automatically.
- For simple calculations like computing averages, this approach can be more efficient because it avoids explicit grouping and iteration.
- Applying further operations on individual years requires additional steps.
Calculate moving average: (Drop year 2025 from plot because we only have a few months of data)
annual_ta = ta.groupby("time.year").mean() # Group the data by year
window_size = 5 # Define the window size for the moving average
# # Calculate the moving average
moving_avg = annual_ta.rolling(year=window_size, center=True).mean()
# plot the results of moving average
# your codes here
# Solution
fig, ax = plt.subplots(figsize=(10, 5))
annual_ta.mean(dim=["lat", "lon"]).plot()
moving_avg.mean(dim=["lat", "lon"]).plot(label="5-Year Moving Average")
What if we want to resample the data to a different time frequency?
ta_resample = ta.resample(time="5Y").mean(dim="time")
ta_resample
Apply a costume function to the data along dimension(s)¶
Xarray's apply_ufunction()
is great for applying your costume functions along any dimension of your data.
Lets calculate the trend in annual temperature:
# define a function to compute a linear trend of a time series (we use numpy.polyfit())
def linear_trend(y):
if np.isnan(y).any():
return np.nan
x = np.arange(len(y))
pf = np.polyfit(x, y, 1)
# need to return an xr.DataArray for groupby
return pf[0]
Lets do some basic usage
# usage of the function
x = np.arange(30)
y = 2 * x + 3 + np.random.randn(30) * 5
plt.plot(x, y, "o")
trend = linear_trend(y)
plt.plot(x, trend * x + 3, "-r")
plt.title("Linear Trend")
Apply the function to each pixel of the air temprature data
# Calculate the annual mean
ta_annual = ta.resample(time="Y").mean()
# Calculate the linear trend
trend_result = xr.apply_ufunc(
linear_trend, # The function to apply (linear_trend in this case)
ta_annual, # The input data (ta_annual in this case)
input_core_dims=[
["time"]
], # Specifies the core dimensions of the input data (in this case, "time" is the core dimension)
vectorize=True, # Vectorize the function (apply element-wise operations)
)
Plot the trend resutls
trend_result.plot(robust=True)
Problem 2: Calculating Annual Summer Temperature Anomaly¶
Objective:¶
To calculate the annual summer temperature anomaly for the entire time period and to visualize the anomaly trend and anomaly map for the year 2023.
Hint:¶
A simple index for anomaly is to subtract the long term mean from data.
Procedure:¶
Summer Month Mask:
- Create a mask for summer months (June, July, August) using
ta['time.month']
and&
.
- Create a mask for summer months (June, July, August) using
Apply Mask:
- Apply the mask to select only summer months using
ta.where(mask)
. Note: Use the argumentdrop=True
inwhere()
to exclude months other than summer.
- Apply the mask to select only summer months using
Calculate Climatological Mean:
- Compute the climatological mean of summer temperatures (i.e., annual mean). Use
groupby()
- Compute the climatological mean of summer temperatures (i.e., annual mean). Use
Calculate Anomalies:
- Determine anomalies by subtracting the climatological mean from the summer temperature data.
Visualization:
- Plot the anomaly for the year 2023 and the global mean (
mean["lat","lon"]
) for the entire time period. Utilizematplotlib.subplot()
to display them side-by-side.
- Plot the anomaly for the year 2023 and the global mean (
Note:¶
Keep an eye out for any unusual patterns or trends in the anomalies.
Solution:
A. Calculate the summer anomaly
# Calculate the mask for summer months (June, July, August)
mask = (ta["time.month"] >= 6) & (ta["time.month"] <= 8)
# Apply the mask to select only summer months data
ta_summer = ta.where(mask, drop=True)
# Calculate the annual mean of summer temperatures
ta_summer_annual = ta_summer.groupby("time.year").mean()
# Calculate the climatological mean of summer temperatures
clima_mean = ta_summer_annual.mean(dim="year")
# Calculate the summer temperature anomalies by subtracting the climatological mean
summer_anomalies = ta_summer_annual - clima_mean
B. Plot the anomaly trend and anomaly map for the year 2023
fig, axs = plt.subplots(ncols=2, figsize=(12, 4))
# Plot the summer anomalies for the year 2023
summer_anomalies.sel(year=2023).plot(robust=True, cmap="coolwarm", ax=axs[0])
# Plot the global mean of summer temperature anomalies
summer_anomalies.mean(["lat", "lon"]).plot(ax=axs[1])
# Add an arrow pointing to the anomaly values in the year 2021
axs[1].annotate(
"Approaching values projected for 2100 :(",
xy=(2023, summer_anomalies.sel(year=2023).mean(["lat", "lon"])),
xytext=(1960, summer_anomalies.sel(year=2023).mean(["lat", "lon"]) - 0.5),
arrowprops=dict(facecolor="black", arrowstyle="->"),
)
plt.tight_layout()
More on 1.5 C and its importance in IPCC report: chapter 1.
P.S. To be more accurate we need to do a bit more:
- We need to take the area of cells into account when calculating the global mean.
- We need to deal with outliers.
- This dataset is one of the many, other datasets may show different results.
- And anomalies are not distributed uniformly, for example arctic is warming twice as fast the rest of the world, as we have seen it in the last trend analyses.
Calculate Landsat Enhanced Vegetation Index (EVI)¶
EVI = G * ((NIR - R) / (NIR + C1 * R – C2 * B + L)) In Landsat 8-9, EVI = 2.5 * ((Band 5 – Band 4) / (Band 5 + 6 * Band 4 – 7.5 * Band 2 + 1)).
Plot NDVI and EVI side by side for comparison
# Your codes here
Large Data and Speed: The Next Challenge¶
We've covered the basics of climate data processing in Python. Now, let's tackle the real challenges:
- Out-of-memory data: What happens when datasets are too large to fit in memory?
- Speeding up computations: How can we analyze data faster, especially with large datasets?
In the next session, we'll introduce Dask
, a powerful tool for handling these issues.