Working With Layers

Before begining, all examples in this guide need the following boilerplate code:

curl -o /tmp/cropped.tif https://s3.amazonaws.com/geopyspark-test/example-files/cropped.tif
import datetime
import numpy as np
import pyproj
import geopyspark as gps

from pyspark import SparkContext
from shapely.geometry import box, Point

conf = gps.geopyspark_conf(master="local[*]", appName="layers")
pysc = SparkContext(conf=conf)

How is Data Stored and Represented in GeoPySpark?

All data that is worked with in GeoPySpark is at some point stored within an RDD. Therefore, it is important to understand how GeoPySpark stores, represents, and uses these RDDs throughout the library.

GeoPySpark does not work with PySpark RDDs, but rather, uses Python classes that are wrappers for Scala classes that contain and work with a Scala RDD. Specifically, these wrapper classes are RasterLayer and TiledRasterLayer, which will be discussed in more detail later.

Layers Are More Than RDDs

We refer to the Python wrapper classes as layers and not RDDs for two reasons: first, neither RasterLayer or TiledRasterLayer actually extends PySpark’s RDD class; but more importantly, these classes contain more information than just the RDD. When we refer to a “layer”, we mean both the RDD and its attributes.

The RDDs contained by GeoPySpark layers contain tuples which have type (K, V), where K represents the key, and V represents the value. V will always be a Tile, but K differs depending on both the wrapper class and the nature of the data itself. More on this below.

RasterLayer

The RasterLayer class deals with untiled data—that is, the elements of the layer have not been normalized into a single unified layout. Each raster element may have distinct resolutions or sizes; the extents of the constituent rasters need not follow any orderly pattern. Essentially, a RasterLayer stores “raw” data, and its main purpose is to act as a way station on the path to acquiring tiled data that adheres to a specified layout.

The RDDs contained by RasterLayer objects have key type, K, of either ProjectedExtent or TemporalProjectedExtent, when the layer type is SPATIAL or SPACETIME, respectively.

TiledRasterLayer

TiledRasterLayer is the complement to RasterLayer and is meant to store tiled data. Tiled data has been fitted to a certain layout, meaning that it has been regularly sampled, and it has been cut up into uniformly-sized, non-overlapping pieces that can be indexed sensibly. The benefit of having data in this state is that now it will be easy to work with. It is with this class that the user will be able to, for example, perform map algebra, create pyramids, and save the layer. See below for the definitions and specific examples of these operations.

In the case of TiledRasterLayer, K is either SpatialKey or SpaceTimeKey.

RasterLayer

Creating RasterLayers

There are just two ways to create a RasterLayer: (1) through reading GeoTiffs from the local file system, S3, or HDFS; or (2) from an existing PySpark RDD.

From PySpark RDDs

The first option is to create a RasterLayer from a PySpark RDD via the from_numpy_rdd() class method. This step can be a bit more involved, as it requires the data within the PySpark RDD to be formatted in a specific way (see How is Data Stored and Represented in GeoPySpark for more information).

The following example constructs an RDD from a tuple. The first element is a ProjectedExtent because we have decided to make the data spatial. If we were dealing with spatial-temproal data, then TemporalProjectedExtent would be the first element. A Tile will always be the second element of the tuple.

arr = np.ones((1, 16, 16), dtype='int')
tile = gps.Tile.from_numpy_array(numpy_array=np.array(arr), no_data_value=-500)

extent = gps.Extent(0.0, 1.0, 2.0, 3.0)
projected_extent = gps.ProjectedExtent(extent=extent, epsg=3857)

rdd = pysc.parallelize([(projected_extent, tile), (projected_extent, tile)])
multiband_raster_layer = gps.RasterLayer.from_numpy_rdd(layer_type=gps.LayerType.SPATIAL, numpy_rdd=rdd)
multiband_raster_layer

From GeoTiffs

The get() function in the geopyspark.geotrellis.geotiff module creates an instance of RasterLayer from GeoTiffs. These files can be located on either your local file system, HDFS, or S3. In this example, a GeoTiff with spatial data is read locally.

raster_layer = gps.geotiff.get(layer_type=gps.LayerType.SPATIAL, uri="file:///tmp/cropped.tif")
raster_layer

Using RasterLayer

This next section goes over the methods of RasterLayer. It should be noted that not all methods contained within this class will be covered. More information on the methods that deal with the visualization of the contents of the layer can be found in the Visualizing Data in GeoPySpark.

Converting to a Python RDD

By using to_numpy_rdd(), the base RasterLayer will be serialized into a Python RDD. This will convert all of the first values within each tuple to either ProjectedExtent or TemporalProjectedExtent, and the second value to Tile.

python_rdd = raster_layer.to_numpy_rdd()
python_rdd
python_rdd.first()

SpaceTime Layer to Spatial Layer

If you’re working with a spatial-temporal layer and would like to convert it to a spatial layer, then you can use the to_spatial_layer`() method. This changes the keys of the RDD within the layer by converting TemporalProjectedExtent to ProjectedExtent.

# Creating the space time layer

instant = datetime.datetime.now()
temporal_projected_extent = gps.TemporalProjectedExtent(extent=projected_extent.extent,
                                                        epsg=projected_extent.epsg,
                                                        instant=instant)

space_time_rdd = pysc.parallelize([temporal_projected_extent, tile])
space_time_layer = gps.RasterLayer.from_numpy_rdd(layer_type=gps.LayerType.SPACETIME, numpy_rdd=space_time_rdd)
space_time_layer
# Converting the SpaceTime layer to a Spatial layer

space_time_layer.to_spatial_layer()

Collecting Metadata

The Metadata of a layer contains information of the values within it. This data pertains to the layout, projection, and extent of the data found within the layer.

collect_metadata() will return the Metadata of the layer that fits the layout given.

# Collecting Metadata with the default LocalLayout()
metadata = raster_layer.collect_metadata()
metadata
# Collecting Metadata with the default GlobalLayout()
raster_layer.collect_metadata(layout=gps.GlobalLayout())
# Collecting Metadata with a LayoutDefinition
extent = gps.Extent(0.0, 0.0, 33.0, 33.0)
tile_layout = gps.TileLayout(2, 2, 256, 256)
layout_definition = gps.LayoutDefinition(extent, tile_layout)

raster_layer.collect_metadata(layout=layout_definition)

Reproject

reproject() will change the projection of the rasters within the layer to the given target_crs. This method does not sample past the tiles’ boundaries.

# The CRS of the layer before reprojecting
metadata.crs
# The CRS of the layer after reprojecting
raster_layer.reproject(target_crs=3857).collect_metadata().crs

Tiling Data to a Layout

tile_to_layout() will tile and format the rasters within a RasterLayer to a given layout. The result of this tiling is a new instance of TiledRasterLayer. This output contains the same data as its source RasterLayer, however, the information contained within it will now be orginized according to the given layout.

During this step it is also possible to reproject the RasterLayer. This can be done by specifying the target_crs to reproject to. Reprojecting using this method produces a different result than what is returned by the reproject method. Whereas the latter does not sample past the boundaries of rasters within the layer, the former does. This is important as anything with a GlobalLayout needs to sample past the boundaries of the rasters.

From Metadata

Create a TiledRasterLayer that contains the layout from the given Metadata.

Note: If the specified target_crs is different from what’s in the metadata, then an error will be thrown.

raster_layer.tile_to_layout(layout=metadata)
From LayoutDefinition
raster_layer.tile_to_layout(layout=layout_definition)
From LocalLayout
raster_layer.tile_to_layout(gps.LocalLayout())
From GlobalLayout
tiled_raster_layer = raster_layer.tile_to_layout(gps.GlobalLayout())
tiled_raster_layer
From A TiledRasterLayer

One can tile a RasterLayer to the same layout as a TiledRasterLayout.

Note: If the specifying target_crs is different from the other layer’s, then an error will be thrown.

raster_layer.tile_to_layout(layout=tiled_raster_layer)

TiledRasterLayer

Creating TiledRasterLayers

For this guide, we will just go over one initialization method for TiledRasterLayer, from_numpy_rdd. However, there are other ways to create this class. These additional creation strategies can be found in the [map algebra guide].

From PySpark RDD

Like RasterLayers, TiledRasterLayers can be created from RDDs using from_numpy_rdd(). What is different, however, is that Metadata must also be passed in during initialization. This makes creating TiledRasterLayers this way a little bit more arduous.

The following example constructs an RDD from a tuple. The first element is a SpatialKey because we have decided to make the data spatial. See How is Data Stored and Represented in GeoPySpark for more information.

data = np.zeros((1, 512, 512), dtype='float32')
tile = gps.Tile.from_numpy_array(numpy_array=data, no_data_value=-1.0)
instant = datetime.datetime.now()

layer = [(gps.SpaceTimeKey(row=0, col=0, instant=instant), tile),
         (gps.SpaceTimeKey(row=1, col=0, instant=instant), tile),
         (gps.SpaceTimeKey(row=0, col=1, instant=instant), tile),
         (gps.SpaceTimeKey(row=1, col=1, instant=instant), tile)]

rdd = pysc.parallelize(layer)

extent = gps.Extent(0.0, 0.0, 33.0, 33.0)
layout = gps.TileLayout(2, 2, 512, 512)
bounds = gps.Bounds(gps.SpaceTimeKey(col=0, row=0, instant=instant), gps.SpaceTimeKey(col=1, row=1, instant=instant))
layout_definition = gps.LayoutDefinition(extent, layout)

metadata = gps.Metadata(
    bounds=bounds,
    crs='+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +a=6378137 +b=6378137 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs ',
    cell_type='float32ud-1.0',
    extent=extent,
    layout_definition=layout_definition)

space_time_tiled_layer = gps.TiledRasterLayer.from_numpy_rdd(layer_type=gps.LayerType.SPACETIME,
                                                             numpy_rdd=rdd, metadata=metadata)
space_time_tiled_layer

Using TiledRasterLayers

This section will go over the methods found within TiledRasterLayer. Like with RasterLayer, not all methods within this class will be covered in this guide. More information on the methods that deal with the visualization of the contents of the layer can be found in Visualizing Data in GeoPySpark; and those that deal with map algebra can be found in the [map algebra guide].

Converting to a Python RDD

By using to_numpy_rdd(), the base TiledRasterLayer will be serialized into a Python RDD. This will convert all of the first values within each tuple to either SpatialKey or SpaceTimeKey, and the second value to Tile.

python_rdd = tiled_raster_layer.to_numpy_rdd()
python_rdd.first()

SpaceTime Layer to Spatial Layer

If you’re working with a spatiotemporal layer and would like to convert it to a spatial layer, then you can use the to_spatial_layer() method. This changes the keys of the RDD within the layer by converting SpaceTimeKey to SpatialKey.

# Converting the SpaceTime layer to a Spatial layer

space_time_tiled_layer.to_spatial_layer()

Repartitioning

While not an RDD, TiledRasterLayer does contain an underlying RDD, and thus, it can be repartitioned using the repartition() method.

# Repartition the internal RDD to have 120 partitions
tiled_raster_layer.repartition(num_partitions=120)

Lookup

If there is a particular tile within the layer that is of interest, it is possible to retrieve it as a Tile using the lookup() method.

min_key = tiled_raster_layer.layer_metadata.bounds.minKey

# Retrieve the Tile that is located at the smallest column and row of the layer
tiled_raster_layer.lookup(col=min_key.col, row=min_key.row)

Masking

By using mask() method, the TiledRasterRDD can be masekd using one or more Shapely geometries.

layer_extent = tiled_raster_layer.layer_metadata.extent

# Polygon to mask a region of the layer
mask = box(layer_extent.xmin,
           layer_extent.ymin,
           layer_extent.xmin + 20,
           layer_extent.ymin + 20)

tiled_raster_layer.mask(geometries=mask)
mask_2 = box(layer_extent.xmin + 50,
             layer_extent.ymin + 50,
             layer_extent.xmax - 20,
             layer_extent.ymax - 20)

# Multiple Polygons can be given to mask the layer
tiled_raster_layer.mask(geometries=[mask, mask_2])

Normalize

normalize() will linearly transform the data within the layer such that all values fall within a given range.

# Normalizes the layer so that the new min value is 0 and the new max value is 60000
tiled_raster_layer.normalize(new_min=0, new_max=60000)

Pyramiding

When using a layer for a TMS server, it is important that the layer is pyramided. That is, we create a level-of-detail hierarchy that covers the same geographical extent, while each level of the pyramid uses one quarter as many pixels as the next level. This allows us to zoom in and out when the layer is being displayed without using extraneous detail. The pyramid() method will produce an instance of Pyramid that will contain within it multiple TiledRasterLayers. Each layer corresponds to a zoom level, and the number of levels depends on the zoom_level of the source layer. With the max zoom of the Pyramid being the source layer’s zoom_level, and the lowest zoom being 0.

For more information on the Pyramid class, see the Pyramid section of the visualization guide.

# This creates a Pyramid with zoom levels that go from 0 to 11 for a total of 12.
tiled_raster_layer.pyramid()

Reproject

This is similar to the reproject method for RasterLayer where the reprojection will not sample past the tiles’ boundaries. This means the layout of the tiles will be changed so that they will take on a LocalLayout rather than a GlobalLayout (read more about these layouts here). Because of this, whatever zoom_level the TiledRasterLayer has will be changed to 0 since the area being represented changes to just the tiles.

# The zoom_level and crs of the TiledRasterLayer before reprojecting
tiled_raster_layer.zoom_level, tiled_raster_layer.layer_metadata.crs
reprojected_tiled_raster_layer = tiled_raster_layer.reproject(target_crs=3857)

# The zoom_level and crs of the TiledRasterLayer after reprojecting
reprojected_tiled_raster_layer.zoom_level, reprojected_tiled_raster_layer.layer_metadata.crs

Stitching

Using stitch() will produce a single Tile by stitching together all of the tiles within the TiledRasterLayer. This can only be done with spatial layers, and is not recommended if the data contained within the layer is large, as it can cause a crash due to the size of the resulting Tile.

# Creates a Tile with an underlying numpy array with a size of (1, 6144, 1536).
tiled_raster_layer.stitch().cells.shape

Saving a Stitched Layer

The save_stitched() method both stitches and saves a layer as a GeoTiff.

# Saves the stitched layer to /tmp/stitched.tif
tiled_raster_layer.save_stitched(path='/tmp/stitched.tif')

It is also possible to specify the regions of layer to be saved when it is stitched.

layer_extent = tiled_raster_layer.layer_metadata.layout_definition.extent

# Only a portion of the stitched layer needs to be saved, so we will create a sub Extent to crop to.
sub_exent = gps.Extent(xmin=layer_extent.xmin + 10,
                       ymin=layer_extent.ymin + 10,
                       xmax=layer_extent.xmax - 10,
                       ymax=layer_extent.ymax - 10)

tiled_raster_layer.save_stitched(path='/tmp/cropped-stitched.tif', crop_bounds=sub_exent)
# In addition to the sub Extent, one can also choose how many cols and rows will be in the saved in the GeoTiff.
tiled_raster_layer.save_stitched(path='/tmp/cropped-stitched-2.tif',
                                 crop_bounds=sub_exent,
                                 crop_dimensions=(1000, 1000))

Tiling Data to a Layout

This is similar to RasterLayer’s tile_to_layout method, except for one important detail. If performing a tile_to_layout() on a TiledRasterLayer that contains a zoom_level, that zoom_level could be lost or changed depending on the layout and/or target_crs chosen. Thus, it is important to keep that in mind in retiling a TiledRasterLayer.

# Original zoom_level of the source TiledRasterLayer
tiled_raster_layer.zoom_level
# zoom_level will be lost in the resulting TiledRasterlayer
tiled_raster_layer.tile_to_layout(layout=gps.LocalLayout())
# zoom_level will be changed in the resulting TiledRasterLayer
tiled_raster_layer.tile_to_layout(layout=gps.GlobalLayout(), target_crs=3857)
# zoom_level will reamin the same in the resulting TiledRasterLayer
tiled_raster_layer.tile_to_layout(layout=gps.GlobalLayout(zoom=11))

Getting Point Values

get_point_values() takes a collection of shapely.geometry.Points and returns the value(s) that are at the given point in the layer. The number of values returned depends on the number of bands the values have, as there will be one value per band.

It is also possible to pass in a ResampleMethod to this method, but not all are supported. The following are all of the ResampleMethods that can be used to calculate point values:

  • ResampleMethod.NEAREST_NEIGHBOR
  • ResampleMethod.BILINEAR
  • ResampleMethod.CUBIC_CONVOLUTION
  • ResampleMethod.CUBIC_SPLINE
Getting the Point Values From a SPATIAL Layer

When using get_point_values on a layer with a LayerType of SPATIAL, the results will be paired as (shapely.geometry.Point, [float]). Where each given Point will be paired with the values it intersects.

# Creating the points
extent = tiled_raster_layer.layer_metadata.extent

p1 = Point(extent.xmin, extent.ymin + 0.5)
p2 = Point(extent.xmax , extent.ymax - 1.0)
Giving a [shapely.geometry.Point] to get_point_values

When points is given as a [shapely.geometry.Point], then the ouput will be a [(shapely.geometry.Point, [float])].

tiled_raster_layer.get_point_values(points=[p1, p2])
Giving a {k: shapely.geometry.Point} to get_point_values

When points is given as a {k: shapely.geometry.Point}, then the ouput will be a {k: (shapely.geometry.Point, [float])}.

tiled_raster_layer.get_point_values(points={'point 1': p1, 'point 2': p2})
Getting the Point Values From a SPACETIME Layer

When using get_point_values on a layer with a LayerType of SPACETIME, the results will be paired as (shapely.geometry.Point, [(datetime.datetime, [float])]). Where each given Point will be paired with a list of tuples that contain the values it intersects and those values’ corresponding timestamps.

st_extent = space_time_tiled_layer.layer_metadata.extent

p1 = Point(st_extent.xmin, st_extent.ymin + 0.5)
p2 = Point(st_extent.xmax , st_extent.ymax - 1.0)
Giving a [shapely.geometry.Point] to get_point_values

When points is given as a [shapely.geometry.Point], then the ouput will be a [(shapely.geometry.Point, [(datetime.datetime, [float])])].

space_time_tiled_layer.get_point_values(points=[p1, p2])
Giving a {k: shapely.geometry.Point} to get_point_values

When points is given as a {k: shapely.geometry.Point}, then the ouput will be a {k: (shapely.geometry.Point, [(datetime.datetime, [float])])}.

space_time_tiled_layer.get_point_values(points={'point 1': p1, 'point 2': p2})

Aggregating the Values of Each Cell

aggregate_by_cell() will compute an aggregate summary for each cell of all values for each key. Thus, if there are multiple copies of the same key in the layer, then the resulting layer will contain just a single instance of that key with its corresponding value being the aggregate summary of all the values that share that key.

Not all Operations are supported. The following ones can be used in aggregate_by_cell:

  • Operation.SUM
  • Operation.MIN
  • Operation.MAX
  • Operation.MEAN
  • Operation.VARIANCE
  • Operation.STANDARD_DEVIATION
unioned_layer = gps.union(layers=[tiled_raster_layer, tiled_raster_layer + 1])

# Sum the values of the unioned_layer
unioned_layer.aggregate_by_cell(operation=gps.Operation.SUM)

# Get the max value for each cell
unioned_layer.aggregate_by_cell(operation=gps.Operation.MAX)

General Methods

There exist methods that are found in both RasterLayer and TiledRasterLayer. These methods tend to perform more general analysis/tasks, thus making them suitable for both classes. This next section will go over these methods.

Note: In the following examples, both RasterLayers and TiledRasterLayers will be used. However, they can easily be subsituted with the other class.

Unioning Layers Togther

To combine the contents of multiple layers together, one can use the union() method. This will produce either a new RasterLayer or TiledRasterLayer that contains all of the elements from the given layers.

Note: The resulting layer can contain duplicate keys.

gps.union(layers=[tiled_raster_layer, tiled_raster_layer])

Selecting a SubSection of Bands

To select certain bands to work with, the bands method will take either a single or collection of band indices and will return the subset as a new RasterLayer or TiledRasterLayer.

Note: There could high performance costs if operations are performed between two sub-bands of a large dataset. Thus, if you’re working with a large amount of data, then it is recommended to do band selection before reading them in.

# Selecting the second band from the layer
multiband_raster_layer.bands(1)
# Selecting the first and second bands from the layer
multiband_raster_layer.bands([0, 1])

Combining Bands of Two Or More Layers

The combine_bands() method will concatenate the bands of values that share a key between two or more layers. Thus, the resulting layer will contain a new Tile for each shared key where the Tile will contain all of the bands from the given layers.

The order in which the layers are passed into combine_bands matters. Where the resulting values’ bands will be ordered based on their position of their respective layer.

# Setting up example RDD
twos = np.ones((1, 16, 16), dtype='int') + 1
twos_tile = gps.Tile.from_numpy_array(numpy_array=np.array(twos), no_data_value=-500)

twos_rdd = pysc.parallelize([(projected_extent, twos_tile)])
twos_raster_layer = gps.RasterLayer.from_numpy_rdd(layer_type=gps.LayerType.SPATIAL, numpy_rdd=twos_rdd)
# The resulting values of the layer will have 2 bands: the first will be all ones,
# and the last band will be all twos
gps.combine_bands(layers=[multiband_raster_layer, twos_raster_layer])
# The resulting values of the layer will have 2 bands: the first will be all twos and the
# other band will be all ones
gps.combine_bands(layers=[twos_raster_layer, multiband_raster_layer])

Collecting the Keys of a Layer

To collect all of the keys of a layer, use the collect_keys method.

# Returns a list of ProjectedExtents
multiband_raster_layer.collect_keys()

# Returns a list of a SpatialKeys
tiled_raster_layer.collect_keys()

# Returns a list of SpaceTimeKeys
space_time_tiled_layer.collect_keys()

Filtering a Layer By Times

Using the filter_by_times method will produce a layer whose values fall within the given time interval(s).

Filtering By a Single Instant

A single datetime.datetime instance can be used to filter the layer. If that is the case then only exact matches with the given time will be kept.

space_time_layer.filter_by_times(time_intervals=[instant])

Filtering By Intervals

Various time intervals can also be given as well, and any keys whose instant falls within the time spans will be kept in the layer.

end_date_1 = instant + datetime.timedelta(days=3)
end_date_2 = instant + datetime.timedelta(days=5)

# Will filter out any value whose key does not fall in the range of
# instant and end_date_1
space_time_layer.filter_by_times(time_intervals=[instant, end_date_1])

# Will filter out any value whose key does not fall in the range of
# instant and end_date_1 OR whose key does not match end_date_2
space_time_layer.filter_by_times(time_intervals=[instant, end_date_1, end_date_2])

Converting the Data Type of the Rasters’ Cells

The convert_data_type method will convert the types of the cells within the rasters of the layer to a new data type. The noData value can also be set during this conversion, and if it’s not set, then there will be no noData value for the resulting rasters.

# The data type of the cells before converting
metadata.cell_type
# Changing the cell type to int8 with a noData value of -100.
raster_layer.convert_data_type(new_type=gps.CellType.INT8, no_data_value=-100).collect_metadata().cell_type
# Changing the cell type to int32 with no noData value.
raster_layer.convert_data_type(new_type=gps.CellType.INT32).collect_metadata().cell_type

Reclassify Cell Values

reclassify changes the cell values based on the value_map and classification_strategy given. In addition to these two parameters, the data_type of the cells also needs to be given. This is either int or float.

# Values of the first tile before being reclassified
multiband_raster_layer.to_numpy_rdd().first()[1]
# Change all values greater than or equal to 1 to 10
reclassified = multiband_raster_layer.reclassify(value_map={1: 10},
                                                 data_type=int,
                                                 classification_strategy=gps.ClassificationStrategy.GREATER_THAN_OR_EQUAL_TO)
reclassified.to_numpy_rdd().first()[1]

Merging the Values of a Layer Together

By using the merge method, all values that share a key within the layer will be merged together to form a new, single value. This is accomplished by replacing the cells of one value with another’s. However, not all cells, if any, may be replaced. When merging the cell of values, the following steps are taken to determine if a cell’s value should be changed:

  1. If the cell contains a NoData value, then it will be replaced.
  2. If no NoData value is set, then a cell with a vlue of 0 will be replaced.
  3. if neither of the above are true, then the cell retains its value.
# Creating the layers
no_data = np.full((1, 4, 4), -1)
zeros = np.zeros((1, 4, 4))

def create_layer(no_data_value=None):
    data_tile = gps.Tile.from_numpy_array(numpy_array=no_data, no_data_value=no_data_value)
    zeros_tile = gps.Tile.from_numpy_array(numpy_array=zeros, no_data_value=no_data_value)

    layer_rdd = pysc.parallelize([(projected_extent, data_tile), (projected_extent, zeros_tile)])
    return gps.RasterLayer.from_numpy_rdd(layer_type=gps.LayerType.SPATIAL, numpy_rdd=layer_rdd)

# Resulting layer has a no_data_value of -1
no_data_layer = create_layer(-1)

# Resutling layer has no no_data_value
no_no_data_layer = create_layer()
# The resulting merged value will be all zeros since -1 is the noData value
no_data_layer.merge()

# The resulting merged value will be all -1's as ``no_data_value`` was set.
no_no_data_layer.merge()

Mapping Over the Cells

It is possible to work with the cells within a layer directly via the map_cells method. This method takes a function that expects a numpy array and a noData value as parameters, and returns a new numpy array. Thus, the function given would have the following type signature:

def input_function(numpy_array: np.ndarray, no_data_value=None) -> np.ndarray

The given function is then applied to each Tile in the layer.

Note: In order for this method to operate, the internal RDD first needs to be deserialized from Scala to Python and then serialized from Python back to Scala. Because of this, it is recommended to chain together all functions to avoid unnecessary serialization overhead.

def add_one(cells, _):
    return cells + 1

# Mapping with a single funciton
raster_layer.map_cells(add_one)
def divide_two(cells, _):
    return (add_one(cells) / 2)

# Chaning together two functions to be mapped
raster_layer.map_cells(divide_two)

Mapping Over Tiles

Like map_cells, map_tiles maps a given function over all of the Tiles within the layer. It takes a function that expects a Tile and returns a Tile. Therefore, the input function’s type signature would be this:

def input_function(tile: Tile) -> Tile

Note: In order for this method to operate, the internal RDD first needs to be deserialized from Scala to Python and then serialized from Python back to Scala. Because of this, it is recommended to chain together all functions to avoid unnecessary serialization overhead.

def minus_two(tile):
    return gps.Tile.from_numpy_array(tile.cells - 2, no_data_value=tile.no_data_value)

raster_layer.map_tiles(minus_two)

Calculating the Histogram for the Layer

It is possible to calculate the histogram of a layer either by using the get_histogram or the get_class_histogram method. Both of these methods produce a Histogram, however, the way the data is represented within the resulting histogram differs depending on the method used. get_histogram will produce a histogram whose values are floats. Whereas get_class_histogram returns a histogram whose values are ints.

For more informaiton on the Histogram class, please see the Histogram [guide].

# Returns a Histogram whose underlying values are floats
tiled_raster_layer.get_histogram()
# Returns a Histogram whose underlying values are ints
tiled_raster_layer.get_class_histogram()

Finding the Quantile Breaks for the Layer

If you wish to find the quantile breaks for a layer without a Histogram, then you can use the get_quantile_breaks method.

tiled_raster_layer.get_quantile_breaks(num_breaks=3)

Quantile Breaks for Exact Ints

There is another version of get_quantile_breaks called get_quantile_breaks_exact_int that will count exact integer values. However, if there are too many values within the layer, then memory errors could occur.

tiled_raster_layer.get_quantile_breaks_exact_int(num_breaks=3)

Finding the Min and Max Values of a Layer

The get_min_max method will find the min and max value for the layer. The result will always be (float, float) regardless of the data type of the cells.

tiled_raster_layer.get_min_max()

Converting the Values of a Layer to PNGs

Via the to_png_rdd method, one can convert each value within a layer to a PNG in the form of bytes. In order to convert each value to a PNG, one needs to supply a ColorMap. For more information on the ColorMap class, please see the ColorMap section of the docs.

In addition to converting each value to a PNG, the resulting collection of (K, V)s will be held in a Python RDD.

hist = tiled_raster_layer.get_histogram()
cmap = gps.ColorMap.build(hist, 'viridis')

tiled_raster_layer.to_png_rdd(color_map=cmap)

Converting the Values of a Layer to GeoTiffs

Similar to to_png_rdd, only to_geotiff_rdd will return a Python RDD[(K, bytes)] where the bytes represent a GeoTiff.

Selecting a StorageMethod

There are two different ways the segments of a GeoTiff can be formatted: StorageMethod.STRIPED or StorageMethod.TILED. This is represented by the storage_method parameter. By default, StorageMethod.STRIPED is used.

Selecting the Size of the Segments

There are two different parameters that control the size of each segment: rows_per_strip and tile_dimensions. Only one of these values needs to be set, and that is determined by what the storage_method is.

If the storage_method is StorageMethod.STRIPED, then rows_per_strip will be the parameter to change. By default, the rows_per_strip will be calculated so that each strip is 8K or less.

If the storage_method is StorageMethod.TILED, then tile_dimensions can be set. This is given as a (int, int) where the first value is the number of cols and the second is the number of rows`. By default, the tile_dimensions is (256, 256).

Selecting a CompressionMethod

The two types of compressions that can be chosen are: Compression.NO_COMPRESSION or Compression.DEFLATE_COMPRESSION. By default, the compression parameter is set to Compression.NO_COMPRESSION.

Selecting a ColorSpace

The color_space parameter determines how the colors should be organized in each GeoTiff. By default, it’s ColorSpace.BLACK_IS_ZERO.

Passing in a ColorMap

A ColorMap instance can be passed in so that the resulting GeoTiffs are in a different gradiant. By default, color_map is None. To learn more about ColorMap, see the ColorMap section of the docs.

# Creates an RDD[(K, bytes)] with the default parameters
tiled_raster_layer.to_geotiff_rdd()

# Creates an RDD whose GeoTiffs are tiled with a size of (128, 128)
tiled_raster_layer.to_geotiff_rdd(storage_method=gps.StorageMethod.TILED, tile_dimensions=(128, 128))

RDD Methods

As mentioned in the section on TiledRasterLayer’s repartition method, TiledRasterLayer has methods to work with its internal RDD. This holds true for RasterLayer as well.

The following is a list of RDD with examples that are supported by both classes.

Cache

raster_layer.cache()

Persist

# If no level is given, then MEMORY_ONLY will be used
tiled_raster_layer.persist()

Unpersist

tiled_raster_layer.unpersist()

getNumberOfPartitions

raster_layer.getNumPartitions()

Count

raster_layer.count()

isEmpty

raster_layer.isEmpty()