Downloading cloud-free satellite image time series using Python (the efficient way)
For a recent project, I wanted to analyse satellite images from the European Space Agency’s Sentinel-2 mission over specific areas and over time. The Sentinel-2 mission is pretty great: it provides optical imagery at 10-meter resolution, revisits almost every point on Earth’s landmass every five days, and is completely free to access. My areas of interest were quite small (low single-digit square kilometres), whereas Sentinel-2 images are usually published in huge 100 × 100 km tiles. This post describes how I went about this in what seems to me a fairly efficient way. All data and software used here is open access/open source.
We use a two-step process to touch as little unnecessary data as possible and keep the analysis relatively fast. First, we identify images that might be relevant by looking at image metadata only (see section 2). Second, we look at the actual image data over our area of interest, check for clouds, and download an image only if it is actually useful to us (see section 3). The analysis uses Sentinel-2 images in COG format hosted on AWS S3.
1. Define Areas of Interest
First, we need to define the areas that we are interested in. Here, I picked a number of parks in Berlin. For each park we define a name and a polygon describing the location and shape of the park in Well-Known-Text (WKT) format. I used this web-based tool to draw polygons on a map around my areas of interest and export them as WKT.
We also import a bunch of packages here - the important ones are described in the following sections.
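As a sketch, the areas of interest can be stored as a small name-to-WKT mapping and parsed with shapely. The polygon below is a rough, made-up outline of Tiergarten for illustration, not the exact one I drew:

```python
from shapely import wkt

# Name -> WKT polygon for each area of interest.
# The coordinates are a rough illustrative outline of Berlin's
# Tiergarten, not the exact polygon used in this post.
areas_of_interest = {
    "tiergarten": (
        "POLYGON ((13.3501 52.5167, 13.3893 52.5104, "
        "13.3893 52.5185, 13.3501 52.5167))"
    ),
}

# Parse the WKT strings into shapely geometries once, up front.
geometries = {name: wkt.loads(poly) for name, poly in areas_of_interest.items()}

# Each geometry exposes a (min_lon, min_lat, max_lon, max_lat)
# bounding box that we can hand to the STAC search later.
bounding_boxes = {name: geom.bounds for name, geom in geometries.items()}
```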
2. Identifying Suitable Satellite Images Based on STAC Metadata
Sentinel-2 data can be accessed in a number of ways. Here, I use the catalogue hosted on AWS S3 in the wonderful COG format (see next section). This data is searchable through a STAC API. The STAC, or SpatioTemporal Asset Catalog, specification provides a unified way to query the data and lets us filter the many available images by spatial, temporal, or other metadata criteria. In Python, we can use the pystac-client library to identify images that seem relevant to us. We just point it at the catalogue and we are ready to query.
An image that is useful to our analysis should fulfil three properties. First, it should overlap one of the areas of interest we defined in the previous step. Second, it should be taken in the timeframe that we are interested in. Here, we are searching for images taken in 2022. Finally, it should not contain clouds over the areas of interest.
Whereas filtering by area and time is straightforward, the question of cloud cover is more difficult. The Sentinel-2 STAC metadata includes an attribute eo:cloud_cover that gives the percentage of clouds in the whole image. However, remember that Sentinel-2 images are 100 × 100 km in size. This limits the usefulness of the image-wide cloud cover value: an average cloud cover of 70% could still leave our area of interest cloud-free, or conversely the clouds covering 4% of a different image could all be concentrated over our area of interest. An example of such a scenario is given in the image below. Based on the cloud-cover metadata alone, we cannot know with certainty whether an image is useful to us or not.
Sentinel-2 image from 2022-11-06 showing clouds in the west, but not over Berlin’s Tiergarten park.
To illustrate this further, I analysed all 272 images Sentinel-2 took in 2022 of Berlin’s Tiergarten park and plotted the fraction of clouds in the whole 100 × 100 km image versus the park area only (following the methodology described in the next section). As the park is quite small, we can see that in most cases it is either fully covered in clouds or fully cloud-free. We also see that there were days where the cloud cover in the whole image was as high as 80%, but our area of interest was completely uncovered. You can find the code used to generate this plot at the end of the notebook on my GitHub here.
Cloud cover in the whole image versus our area of interest for 272 Sentinel-2 images containing Berlin’s Tiergarten park.
We consequently address the cloud problem in two steps. In this first querying step, we filter by a leniently high cloud cover percentage (e.g. 80%) to discard the images that are unlikely to contain useful data. In the next step (section 3), we then check precisely whether there are clouds over our area of interest, applying a much stricter threshold (e.g. 2%).
At the end of this step, we have identified 1034 images in the year 2022 with less than 50% cloud cover that we want to examine in greater detail in the next step. The median park in our collection has 109 images. Note that so far we have not downloaded any imagery, only queried metadata.
3. Checking if Clouds Cover Our Area of Interest And Downloading the Relevant Color Image Sections
Now that we have the list of images that fulfil our query criteria, we need to check, for each image, whether clouds cover our area of interest.
As we are only interested in a small area within a huge image, it would be unfortunate if we had to download the whole image every time only to use a few hundred pixels within it. Luckily, we can take a more efficient route because the data is available on AWS in Cloud-Optimized GeoTIFF (COG) format. Every COG is a GeoTIFF file, essentially meaning that pixel values map to points on Earth. The useful part about COGs is that the file organizes its pixel values in regular internal tiles and describes this layout in its header. Given our area of interest, software designed for handling COGs (such as GDAL, which we use through Rasterio) can read the header, compute where within the huge file the pixels that interest us are located, and use an HTTP range request to download only those pixels. This drastically reduces the amount of data we need to download.
To check for clouds, we use the Scene Classification Layer (SCL) produced by ESA as part of the Level-2A processing algorithm for each image. This is published as a COG file, where each pixel indicates which of eleven classes it most likely belongs to, including water, vegetation, and different types of cloud. You can read more about this here.
The main advantage of this approach is its simplicity, but anecdotally the performance of the SCL cloud detection algorithm seems to vary quite a bit. I will leave a proper exploration of this for a future post. For our current purposes, however, we continue with this simple approach.
Using Rasterio, we download the part of the scene classification COG over our area of interest and determine the fraction of pixels that contain clouds or cloud shadows. If this is over our threshold (e.g. 2%), we discard the image. If the image is sufficiently free of clouds, we download the true colour image that overlaps our area of interest and save it locally.
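The cloud check itself reduces to counting SCL pixels in the cloud-related classes. A sketch with NumPy; the class values are my reading of ESA’s Level-2A scene classification table:

```python
import numpy as np

# SCL values treated as "cloudy": 3 = cloud shadow,
# 8 = cloud medium probability, 9 = cloud high probability,
# 10 = thin cirrus (per ESA's Level-2A scene classification).
CLOUDY_SCL_CLASSES = (3, 8, 9, 10)

def cloud_fraction(scl_pixels):
    """Fraction of SCL pixels classified as cloud or cloud shadow."""
    scl = np.asarray(scl_pixels)
    return float(np.isin(scl, CLOUDY_SCL_CLASSES).mean())

# Example: in this 2x2 patch, one cloud-shadow pixel (3) and one
# high-probability cloud pixel (9) give a cloud fraction of 0.5.
patch = [[4, 5], [9, 3]]
assert cloud_fraction(patch) == 0.5

# An image passes if the AOI is (almost) cloud-free, e.g.:
# keep = cloud_fraction(scl_over_aoi) < 0.02
```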
And we are done!
You can find a notebook with all the code from this post on my GitHub here.
If someone reading has some experience with better ways of approaching any of these steps (which I am sure exist!), I would love to hear about it.