Ingesting a Sentinel Image

Sentinel-2 is an Earth observation mission developed by the European Space Agency to monitor the surface of the Earth (official website). Sets of images are taken of the surface, with each image corresponding to a specific wavelength. These images can provide useful data for a wide variety of industries; however, the format in which they are stored can prove difficult to work with. That format is JPEG 2000 (file extension .jp2), an image compression standard that offers better quality and compression ratios than the original JPEG.

Few programs can work with jp2 files, which can make processing large amounts of them difficult. With GeoPySpark, though, we can leverage the Python tools that do work with jp2 and use them to format the Sentinel data so that it can be ingested.

Note: This guide covers how to use jp2 files with GeoPySpark; the ingest process itself is discussed in more detail in Greyscale Ingest Code Breakdown.

Getting the Data

Before we can start this tutorial, we will need to get the Sentinel images. All Sentinel data can be found on Amazon's S3, and we will be downloading it straight from there.

The way the data is stored on S3 will not be discussed here; instead, a general overview of the data will be given. We will be downloading three different jp2 files that represent the same area and time at different wavelengths: Aerosol detection (443 nm), Water vapor (945 nm), and Cirrus (1375 nm). Why these three bands? Because of the resolution we will be working at, 60 m, which is the resolution at which these bands are best represented. As for what is in the photos, it is the eastern coast of Corsica taken on January 4th, 2017.

All of the above helps create the following base url for this image set, which we will assign to the baseurl variable in the terminal:

baseurl="http://sentinel-s2-l1c.s3.amazonaws.com/tiles/32/T/NM/2017/1/4/0/"

To download the bands, we just have to wget each one and then move the resulting jp2 files to /tmp.

wget ${baseurl}B01.jp2 ${baseurl}B09.jp2 ${baseurl}B10.jp2
mv B01.jp2 B09.jp2 B10.jp2 /tmp

Now that we have our data, we can begin the ingest process.

The Code

import numpy as np
import rasterio

from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis.constants import SPATIAL, ZOOM
from geopyspark.geotrellis.catalog import write
from geopyspark.geotrellis.rdd import RasterRDD


geopysc = GeoPyContext(appName="sentinel-ingest", master="local[*]")

jp2s = ["/tmp/B01.jp2", "/tmp/B09.jp2", "/tmp/B10.jp2"]
arrs = []

# Reading the jp2s with rasterio
for jp2 in jp2s:
    with rasterio.open(jp2) as f:
        arrs.append(f.read(1))

data = np.array(arrs, dtype=arrs[0].dtype)

# saving the max and min values of the tile
with open('/tmp/sentinel_stats.txt', 'w') as stats_file:
    stats_file.writelines([str(data.max()) + "\n", str(data.min())])

# f still refers to the last jp2 opened in the loop above; the 60 m bands
# cover the same area, so its georeferencing applies to the whole set
if f.nodata:
    no_data = f.nodata
else:
    no_data = 0

bounds = f.bounds
epsg_code = int(f.crs.to_dict()['init'][5:])

# Creating the RasterRDD
tile = {'data': data, 'no_data_value': no_data}

extent = {'xmin': bounds.left, 'ymin': bounds.bottom, 'xmax': bounds.right, 'ymax': bounds.top}
projected_extent = {'extent': extent, 'epsg': epsg_code}

rdd = geopysc.pysc.parallelize([(projected_extent, tile)])
raster_rdd = RasterRDD.from_numpy_rdd(geopysc, SPATIAL, rdd)

metadata = raster_rdd.collect_metadata()
laid_out = raster_rdd.tile_to_layout(metadata)
reprojected = laid_out.reproject("EPSG:3857", scheme=ZOOM)

pyramided = reprojected.pyramid(start_zoom=12, end_zoom=1)

for tiled in pyramided:
    write("file:///tmp/sentinel-catalog", "sentinel-benchmark", tiled)

Running the Code

Running the code is simple, and there are two ways to do it.

The first is to copy and paste the code into a console, such as iPython, and run it.

The second is to save the code in a Python file. To run it, go to the directory the file is in and run this command:

python3 file.py

Just replace file.py with whatever name you gave the file.

Breaking Down the Code

Let’s now see what’s going on in the code by going through each step of the process. Note: As mentioned in the opening, this section only covers reading in and formatting the data. For a guide through each ingest step, please see Greyscale Ingest Code Breakdown.

The Imports

The one note to make here is:

import rasterio
import numpy as np

We will need rasterio to read in the jp2 files and numpy to format the data so that it can be used with GeoPySpark.

Reading in the JPEG 2000s

jp2s = ["/tmp/B01.jp2", "/tmp/B09.jp2", "/tmp/B10.jp2"]
arrs = []

# Reading the jp2s with rasterio
for jp2 in jp2s:
    with rasterio.open(jp2) as f:
        arrs.append(f.read(1))

data = np.array(arrs, dtype=arrs[0].dtype)

rasterio, being backed by GDAL, allows us to read in the jp2 files. Because each image represents a different wavelength, the order in which they appear in the jp2s list determines the band order when they are merged into a multiband raster. After the reading step, the list of numpy arrays is stacked into a single array, which represents our multiband raster.
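
As a quick sanity check (a minimal sketch, assuming the data array and jp2s list from the script above are in scope), we can confirm that the stacked array has one band per file, in the order given by the jp2s list:

# data is stacked in the order of the jp2s list:
# band 0 is B01 (Aerosol), band 1 is B09 (Water vapor), band 2 is B10 (Cirrus)
assert data.shape[0] == len(jp2s)

print(data.shape)  # (3, rows, cols)
print(data.dtype)  # the dtype read from the jp2s (typically uint16 for Sentinel-2)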

Saving the Whole Image Stats

# saving the max and min values of the tile
with open('/tmp/sentinel_stats.txt', 'w') as stats_file:
    stats_file.writelines([str(data.max()) + "\n", str(data.min())])

When we create the tile server for our Sentinel images, the data in the numpy arrays will need to be converted to the uint8 data type in order to be rendered as an RGB image. To do that, we will need to normalize each array so that all of the values fall between 0 and 255. This poses a problem: because only a section of the original image is read in and rendered at a time, there is no way to normalize it correctly, as we do not know the full range of values in the original image. This is why we must save the max and min values of the whole image to a separate file to read in later.
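
To illustrate how those saved stats would be used later (a minimal sketch, not part of this ingest script; the normalize function and its name are our own), a tile server could read the values back and rescale any section of the image to the 0-255 range before casting it to uint8:

import numpy as np

# read back the whole-image max and min saved by the ingest script
with open('/tmp/sentinel_stats.txt', 'r') as stats_file:
    image_max, image_min = [float(line) for line in stats_file.readlines()]

def normalize(section):
    # rescale a section of the image to 0-255 using the whole-image range
    scaled = (section.astype(float) - image_min) / (image_max - image_min)
    return (scaled * 255).astype(np.uint8)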

Formatting the Data

# f still refers to the last jp2 opened in the loop above; the 60 m bands
# cover the same area, so its georeferencing applies to the whole set
if f.nodata:
    no_data = f.nodata
else:
    no_data = 0

bounds = f.bounds
epsg_code = int(f.crs.to_dict()['init'][5:])

tile = {'data': data, 'no_data_value': no_data}

extent = {'xmin': bounds.left, 'ymin': bounds.bottom, 'xmax': bounds.right, 'ymax': bounds.top}
projected_extent = {'extent': extent, 'epsg': epsg_code}

rdd = geopysc.pysc.parallelize([(projected_extent, tile)])
raster_rdd = RasterRDD.from_numpy_rdd(geopysc, SPATIAL, rdd)

GeoPySpark is a Python binding of GeoTrellis and, because of that, requires the data to be in a certain format. Please see Core Concepts to learn what each of these variables represents.

The main take-away from this section of code: if you wish to produce either a RasterRDD or a TiledRasterRDD in Python, the data must be in the correct format.
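
As a quick structural check (a sketch; these keys simply mirror the dictionaries built in the script above), we can verify that a record has the expected shape before handing it to from_numpy_rdd:

# one record of the RDD is a (projected_extent, tile) pair
record = (projected_extent, tile)

assert set(record[0].keys()) == {'extent', 'epsg'}
assert set(record[0]['extent'].keys()) == {'xmin', 'ymin', 'xmax', 'ymax'}
assert set(record[1].keys()) == {'data', 'no_data_value'}
assert record[1]['data'].ndim == 3  # (bands, rows, cols)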

Ingesting the Data

All that remains now is to ingest the data. These steps can be followed at Greyscale Ingest Code Breakdown.