RasterRDD and TiledRasterRDD¶
This section seeks to explain how to create and use RasterRDD and
TiledRasterRDD. Before continuing this example, it is suggested that
you read How Data is Stored in RDDs.
Creating RasterRDD and TiledRasterRDD¶
RasterRDD¶
Of the two different RDD classes, RasterRDD has the least number of ways
to be initialized. There are just two: through reading GeoTiffs from the local
file system, S3, or HDFS; or from an existing PySpark RDD.
From GeoTiffs¶
The get() method in
geopyspark.geotrellis.geotiff_rdd creates an instance of RasterRDD from
GeoTiffs.
from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis.constants import SPATIAL
from geopyspark.geotrellis.geotiff_rdd import get
geopysc = GeoPyContext(appName="rasterrdd-example", master="local")
raster_rdd = get(geopysc=geopysc, rdd_type=SPATIAL, "path/to/your/geotiff.tif")
Note: If you have multiple GeoTiffs, you can just specify the directory where
they’re all stored. Or if the GeoTiffs are spread out in multiple locations,
you can give get a list of the places to read in the GeoTiffs.
From PySpark RDDs¶
The second option is to create a new RasterRDD from a PySpark RDD via the
from_numpy_rdd() class method.
This step is a bit more involved than the last, as it requires the data within
the PySpark RDD to be formatted in a specific way.
from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis import Extent
from geopyspark.geotrellis.constants import SPATIAL
from geopyspark.geotrellis.rdd import RasterRDD
import numpy as np
geopysc = GeoPyContext(appName="rasterrdd-example", master="local")
arr = np.ones((1, 16, 16), dtype=int)
# The raster data that will be contained in this RasterRDD will be 16x16,
# and will have a noData value of -500.
tile = {'no_data_value': -500, 'data': arr}
extent = Extent(0.0, 1.0, 2.0, 3.0)
# Since the RasterRDD will be SPATIAL, a ProjectedExtent is constructed.
projected_extent = {'extent': extent, 'epsg': 3857}
# Create a PySpark RDD that contains a single tuple, (projected_extent, tile)
# Note: The order of the values in the tuple is important. ProjectedExtent
# or TemporalProjectedExtent MUST Be the first element.
rdd = geopysc.pysc.parallelize([(projected_extent, tile)])
raster_rdd = RasterRDD.from_numpy_rdd(geopysc=geopysc, rdd_type=SPATIAL, numpy_rdd=rdd)
TiledRasterRDD¶
Unlike RasterRDD, TiledRasterRDD has multiple ways of being created.
From PySpark RDDs¶
TiledRasterRDD also has the class method,
from_numpy_rdd().
from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis import Extent, TileLayout, Bounds, LayoutDefinition
from geopyspark.geotrellis.constants import SPATIAL
from geopyspark.geotrellis.rdd import TiledRasterRDD
import numpy as np
geopysc = GeoPyContext(appName="tiledrasterrdd-example", master="local")
data = np.array([[
[1.0, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 0.0]]])
# Data to be placed within the TiledRasterRDD.
# Each value is a tuple where the first value is either a SpatialKey or a
# SpaceTime. With the second being the tile.
layer = [({'row': 0, 'col': 0}, {'no_data_value': -1.0, 'data': data}),
({'row': 1, 'col': 0}, {'no_data_value': -1.0, 'data': data}),
({'row': 0, 'col': 1}, {'no_data_value': -1.0, 'data': data}),
({'row': 1, 'col': 1}, {'no_data_value': -1.0, 'data': data})]
# Creating the PySpark RDD.
rdd = BaseTestClass.geopysc.pysc.parallelize(layer)
# All TiledRasterRDDs have metadata that describes the layout of data within
# it. Therefore, in order to create it from a PySpark RDD, the metadata must
# be either created, or taken from elsewhere.
extent = Extent(0.0, 0.0, 33.0, 33.0)
layout = TileLayout(2, 2, 5, 5)
bounds = Bounds({'col': 0, 'row': 0}, {'col': 1, 'row': 1})
layout_definition = LayoutDefinition(extent, layout)
metadata = Metadata(
bounds=bounds,
crs='+proj=longlat +datum=WGS84 +no_defs ',
cell_type='float32ud-1.0',
extent=extent,
layout_definition=layout_definition)
tiled_rdd = TiledRasterRDD.from_numpy_rdd(geopysc=geopysc, rdd_type=SPATIAL,
numpy_rdd=rdd, metadata=metadata)
Through Rasterization¶
Another means of producing TiledRasterRDD is through rasterizing a Shapely
geometry via the rasterize()
method.
from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis import Extent
from geopyspark.geotrellis.constants import SPATIAL
from geopyspark.geotrellis.rdd import TiledRasterRDD
from shapely.geometry import Polygon
geopysc = GeoPyContext(appName="tiledrasterrdd-example", master="local")
extent = Extent(0.0, 0.0, 11.0, 11.0)
polygon = Polygon([(0, 11), (11, 11), (11, 0), (0, 0)])
# Creates a TiledRasterRDD from a Shapely Polygon. The resulting raster will
# be 256x256 and all values within it are 1.
tiled_rdd = TiledRasterRDD.rasterize(geopysc=geopysc, rdd_type=SPATIAL,
geometry=polygon, extent=extent,
cols=256, rows=256, fill_value=1)
Through Euclidean Distance¶
The final way to create TiledRasterRDD is by calculating the Euclidean of
a Shapely geometry. euclidean_distance()
is the class method which does this. While you can use any geometry to perform
Euclidean distance, it is recommended not to use Polygons if they cover
many cells of the resulting raster. As this can impact performance in a
negative way.
from geopyspark.geopycontext import GeoPyContext
from geopyspark.geotrellis import Extent
from geopyspark.geotrellis.constants import SPATIAL
from geopyspark.geotrellis.rdd import TiledRasterRDD
from shapely.geometry import MultiPoint
import pyproj
geopysc = GeoPyContext(appName="tiledrasterrdd-example", master="local")
# Shapely produces points in LatLng by default. However, GeoPySpark tends to
# work with values in WebMercator, so we must reproject the geometries.
latlong = pyproj.Proj(init='epsg:4326')
webmerc = pyproj.Proj(init='epsg:3857')
points = MultiPoint([pyproj.transform(latlong, webmerc, 1, 1),
pyproj.transform(latlong, webmerc, 2, 2)])
# Makes a TiledRasterRDD from the Euclidean distance calculation.
# The resulting TiledRasterRDD will have a zoom level of 7.
tiled_rdd = TiledRasterRDD.euclidean_distance(geopysc=geopysc,
geometry=points,
source_crs=3857,
zoom=7)
Using RasterRDD and TiledRasterRDD¶
After initializing RasterRDD and/or TiledRasterRDD, it is now time to
use them.
Common Methods¶
While different in terms of functionality, RasterRDD and TiledRasterRDD
both share some methods.
Converting to a PySpark RDD¶
If you wish to you convert to a PySpark RDD, it can be done via the
to_numpy_rdd method.
# RasterRDD
raster_rdd.to_numpy_rdd()
# TiledRasterRDD
tiled_rdd.to_numpy_rdd()
Reclassifying Values¶
reclassify can reclassify values in either RasterRDD or
TiledRasterRDD. This is done by binning each value in the RDD.
The boundary_startegy will determine how each value will be binned. These
are the strategies to choose from: GREATERTHAN, GREATERTHANOREQUALTO,
LESSTHAN, LESSTHANOREQUALTO, and EXACT.
If a value does not fall within the boundary, then it’s given the
no_data_value. A different replacement can be used instead
with replace_nodata_with.
from geopyspark.geotrellis.constants import EXACT, LESSTHAN
value_map = {1: 0}
# All values less than or equal to 1 will now become zero.
# Any other number is now whatever the no_data_value is for this
# TiledRasterRDD
tiled_rdd.reclassify(value_map=value_map, data_type=int)
value_map = {5.0: 10.0, 15.0: 20.0}
# Only 5.0 and 15.0 will be reclassified. Everything else will become -1000.0
tiled_rdd.relcassify(value_map=value_map, data_type=float, boundary_strategy=EXACT,
replace_no_data_with=-1000.0)
Min and Max¶
get_min_max will produce the min and max values of the RDD. They always be
returned as floats. Regardless of the type of the values.
tiled_rdd.get_min_max()
RasterRDD¶
The purpose of RasterRDD is store and format data to produce a
TiledRasterRDD. Thus, this class lacks the methods needed to perform any
kind of spatial analysis. It can be thought of as something of an “organizer”.
Which sorts and lays out the data so that TiledRasterRDD can perform
operations on the data.
Collecting Metadata¶
In order to convert a RasterRDD to a TiledRasterRDD the
Metadata must first be collected; as it
contains the information on how the data should be formatted and laid out in
the TiledRasterRDD. collect_metadata()
is used to obtain the metadata, and it can accept to different types of inputs
depending on how one wishes to layout the data.
The first option is to specify an Extent and a
TileLayout for the Metadata. Where the
Extent is the area that will be covered by the Tiles and the
TileLayout describes the Tiles and the grid they’re arranged on.
from geopyspark.geotrellis import Extent, TileLayout
extent = Extent(0.0, 0.0, 33.0, 33.0)
tile_layout = TileLayout(2, 2, 256, 256)
# The Metadata that will be returned will conform to the extent and tile
# layout that was given. In this case, the rasters will be tiled into a 2x2
# grid with each Tile having 256 cols and rows. This grid will cover the
# area within the extent.
md = raster_rdd.collect_metadata(extent=extent, layout=tile_layout)
The other option is to simply give collect_metadata the tile_size
that each Tile should be in the resulting grid. Extent and
TileLayout will be calculated from this size. Using this method will ensure
that the native resolutions of the rasters are kept.
# tile_size has a default value of 256. If this works for your case, then
# you can just do this
md = raster_rdd.collect_metadata()
# Otherwise, you can specify your own tile_size.
md = raster_rdd.collect_metadata(tile_size=512)
Formatting the Data to a Layout¶
Once Metadata has been obtained, RasterRDD will be able to format the
data, which will result in a new TiledRasterRDD instance. There are two
methods to do this: cut_tiles() and
tile_to_layout().
Both of these methods have the same inputs and similar outputs, however, there is one key
difference between the two. cut_tiles will cut the rasters to the given
layout, but will not fix any overlap that may occur. Whereas tile_to_layout
will cut and then merge together areas that are overlapped. This matters as
each Tile is referenced by a key, and if there’s overlap than there could
be duplicate keys.
Therefore, it is recommended to use tile_to_layout to ensure there is no
duplication.
md = raster_rdd.collect_metadata()
tiled_rdd = raster_rdd.tile_to_layout(layer_metadata=md)
# resample_method can be set when doing the formatting. For this example,
# BILINEAR will be used. The defatul method is NEARESTNEIGHBOR.
from geopyspark.geotrellis.constants import BILINEAR
tiled_rdd = raster_rdd.tile_to_layout(layer_metadata=md, resample_method=BILINEAR)
A Quicker Way to TiledRasterRDD¶
to_tiled_layer() allows the user to
layout their data and produce a TiledRasterRDD in just one step. This
method is collect_metadata and tile_to_layout combined, and is used to
save a little time when writing.
# Using Extent and TileLayout
from geopyspark.geotrellis import Extent, TileLayout
extent = Extent(0.0, 0.0, 33.0, 33.0)
tile_layout = TileLayout(2, 2, 256, 256)
tiled_rdd = raster_rdd.to_tiled_layer(extent=extent, layout=tile_layout)
# Or using tile_size instead
tiled_rdd = raster_rdd.to_tiled_layer()
TiledRasterRDD¶
TiledRasterRDD will be the class that will see the most use. It provides
all the methods needed to perform a computations and analysis on the data. When
reading and saving layers, this class will be used.
A Note on Using Geometries¶
Before doing operations that involve geometries, it is important to check to
make sure that the geometry is in the correct projection. Geometries created
through Shapely are in LatLong. Unless the data in TiledRasterRDD is also
in this projection, the geometry being used will need to be reprojected.
from functools import partial
from shapely.geometry import Polygon
from shapely.ops import transform
import pyproj
polygon = Polygon([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)])
# Reporjects the geometry to WebMercator so that it will intersect with
# the TiledRasterRDD.
project = partial(
pyproj.transform,
pyproj.Proj(init='epsg:4326'),
pyproj.Proj(init='epsg:3857'))
reprojected_polygon = transform(project, geom)
Reprojecting¶
Often the tiles within TiledRasterRDD will have to be reprojected. There is
a method to do this aptly named, reproject().
If you wish to create a TMS server from this data, then this method should be
used to ensure that the layout will work when pyramiding (more on that in a
bit).
If you do not wish to create a TMS server, and just want to reproject the data, then there are two different ways to different ways to do so.
# Using Extant and TileLayout
from geopyspark.geotrellis import Extent, TileLayout
extent = Extent(0.0, 0.0, 33.0, 33.0)
tile_layout = TileLayout(2, 2, 256, 256)
reprojected_rdd = tiled_rdd.reproject(target_crs=3857, extent=extent,
layout=tile_layout)
# Using tile_size
reprojected_rdd = tiled_rdd.reproject(target_crs=3857)
If you want to make a TMS server, then there is only one option available for reprojecting.
from geopyspark.geotrellis.constants import ZOOM
reprojected_rdd = tiled_rdd.reproject(target_crs=3857, scheme=ZOOM)
# Reprojecting with different tile_size
reprojected_rdd = tiled_rdd.reproject(target_crs=3857, scheme=ZOOM, tile_size=512)
What is the difference between using and not using ZOOM? It has to do with
how GeoTrellis represents the layout of the data in the RDD. There are three
different classes GeoTrellis uses: LayoutDefinition,
FloatingLayoutScheme and ZoomedLayoutScheme. The exact nature and
differences between these classes will not be discussed here, rather, a brief
explanation will be given.
Because the resolution of images changes as one zooms in and out when using
a TMS server, the layout of the tiles changes. Neither LayoutDefinition or
FloatingLayoutScheme have the ability to adjust the layout from a zoom.
Only ZoomedLayoutScheme can do this, which is why it must be set when
reprojecting.
Retiling¶
It is possible to change the layout of the tiles within TiledRasterRDD
via tile_to_layout().
from geopyspark.geotrellis import Extent, TileLayout, LayoutDefinition
extent = Extent(100.0, 100.0, 250.0, 250.0)
tile_layout = TileLayout(5, 5, 256, 256)
layout_definition = TileDefinition(extent, tile_layout)
retiled_rdd = tiled_rdd.tile_to_layout(layout=layout_definition)
Masking¶
By using mask(), the
TiledRasterRDD can be masekd using one or more Shapely geometries.
from shapely.geometry import Polygon
polygon = Polygon([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)])
# The resulting TiledRasterRDD will only contain values that were interested
# by this Polygon
masked_rdd = tiled_rdd.mask(geometries=polygon)
Stitching¶
Using stitch() will produce
a single raster by stitching together all of the tiles within the
TiledRasterRDD. This can only be done with SPATIAL RDDs, and is not
recommended if the data contained within is large. As it can cause crashes due
to its size.
raster = tiled_rdd.stitch()
Pyramiding¶
Before creating a TMS server, a TiledRasterRDD needs to be pyramided first.
pyramid() will create a new
TiledRasterRDD for each zoom level, and the resulting list can then be
either be accessed to fetch specific tiles or can be saved for later use.
# Creates 12 new TiledRasterRDDs where each one has a different layout
# depending on its zoom level.
pyramided_rdds = tiled_rdd.pyramid(start_zoom=12, end_zoom=1)
Why is start_zoom greater than end_zoom? This is because start_zoom
represents the lowest or most zoomed level of the pyramid. And the pyramiding
process starts with the greatest zoom and works its way up to the most zoomed
out.
Operations¶
TiledRasterRDDs can perform both local and focal operations.
Local¶
Performing local operations with TiledRasterRDDs can be performed with
ints, floats, or other TiledRasterRDDs.
# All values will have one added to them
tiled_rdd + 1
# Find the average of two TiledRasterRDDs
(tiled_rdd_1 + tiled_rdd_2) / 2
# The position of TiledRasterRDD in the operation doesn't matter, so it can
# be used on either side of of the operation.
1 / (5 - tiled_rdd)
Focal¶
Focal operations are done by selecting both a neighborhood and a
operation. Because the inputs must be sent over to Scala, the operation
must be entered in the form of a constant.
The values used to represent operation are: SUM, MIN, MAX,
MEAN, MEDIAN, MODE, STANDARDDEVIATION, ASPECT, and
SLOPE. These are all of the current available focal operations that can be
done in GeoPySpark.
neighborhood can be specified with either a
Neighboorhod sub-class, or a
constant.
from geopyspark.geotrellis.neighborhoods import Square
from geopyspakr.geotrellis.constants import SLOPE
# Creates a Square neighborhood. Setting extent to 1 will mean that only one
# cell past the focus of the bounding box will be included in the
# neighborhood. Thus it creates a neighborhood that is 3x3 cells in size.
square_neighborhood = Square(extent=1)
# Calculate the slope for each neighborhood in the TiledRasterRDD
slope_rdd = tiled_rdd.focal(operation=SLOPE, neighborhood=square_neighborhood)
# To perform a focal operation with creating a Neighborhood class.
from geopyspark.geotrellis.constants import SQUARE
# Since a class wasn't initialized, the parameters to make the neighborhood
# must be passed in to the method. Square only requires one parameter, so
# only param_1 needs to be set.
slope_rdd = tiled_rdd.focal(operation=SLOPE, neighborhood=SQUARE, param_1=1)
Polygonal Summary Methods¶
In addition to local and focal methods, TiledRasterRDD can also perform
polygonal summary methods. Using Shapely geometries, one can find the min, max,
sum, and mean of all of the values intersected by the geometry.
from shapely.geometry import Polygon
polygon = Polygon([(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)])
# Finds the min value that falls inside the Polygon. The data type of the
# values within the Tiles must be stated. For this example, they are ints.
tiled_rdd.polygonal_min(geometry=polygon, data_type=int)
# Finds the max value that falls inside the Polygon.
tiled_rdd.polygonal_max(geometry=polygon, data_type=float)
# Finds the sum of the values that fall inside the Polygon.
tiled_rdd.polygonal_sum(geometry=polygon, data_type=int)
# polygonal_mean will always return a float, so there's no need to set
# data_type.
tiled_rdd.polygonal_mean(geometry=polygon)
Cost Distance¶
It’s possible to calculate the cost distance of a TiledRasterRDD via
cost_distance().
from shapely.geometry import Point
points = [Point(0, 0), Point(1, 2)]
tiled_rdd.cost_distance(geometries=points, max_distance=144000)