Reading and Rasterizing Open Street Map Data

This tutorial shows how to read in Open Street Map (OSM) data, and then rasterize it using GeoPySpark.

Note: This guide is aimed at users who are already familiar with GeoPySpark.

Getting the Data

To start, let’s first grab an orc file, which a special file type that is optimized for Hadoop operations. The following command will use curl to download the file from S3 and move it to the /tmp directory.

curl -o /tmp/boyertown.orc https://s3.amazonaws.com/geopyspark-test/example-files/boyertown.orc

A side note: Files can be retrieved directly from S3. However, this could not be done in this instance due to permission requirements needed to access the file.

Reading in the Data

Now that we have our data, we can now read it in and begin to work with.

import geopyspark as gps
from pyspark import SparkContext

conf = gps.geopyspark_conf(appName="osm-rasterize-example", master="local[*]")
pysc = SparkContext(conf=conf)

features = gps.osm_reader.from_orc("/tmp/boyertown.orc")

The above code sets up a SparkContext and then reads in the boyertown.orc file as features, which is an instance of FeaturesCollection.

When OSM data is read into GeoPySpark, each OSM Element is turned into single or multiple different geometries. With each of these geometries retaining the metadata from the derived OSM Element. These geometry metadata pairs are referred to as a Feature. These features are grouped together by the type of geometry they contain. When accessing features from a FeaturesCollection, it is done by geometry.

There are four different types of geometries in the FeaturesCollection:

  • Point
  • Line
  • Polygon
  • MultiPolygon

Selecting the Features We Want

For this example, we’re interested in rasterizing the Lines and Polygons from the OSM data, so we will select those Features from the FeaturesCollection that contain them. The following code will create a Python RDD of Features that contains all Line geometries (lines), and a Python RDD that contains all Polygon geometries (polygons).

lines = features.get_line_features_rdd()
polygons = features.get_polygon_features_rdd()

Looking at the Tags of the Features

When we rasterize the Polygon Features, we’d like for schools to have a different value than all of the other Polygons. However, we are unsure if any schools were contained within the original data, and we’d like to see if any are. One method we could use to determine if there are schools is to look at the tags of the Polygon Features. The following code will show all of the unique tags for all of the Polygons in the collection.

features.get_polygon_tags()

Which has the following output:

{'NHD:ComID': '25964412',
 'NHD:Elevation': '0.00000000000',
 'NHD:FCode': '39004',
 'NHD:FDate': '2001/08/16',
 'NHD:FTYPE': 'LakePond',
 'NHD:Permanent_': '25964412',
 'NHD:ReachCode': '02040203004486',
 'NHD:Resolution': 'High',
 'addr:city': 'Gilbertsville',
 'addr:housenumber': '1100',
 'addr:postcode': '19525',
 'addr:state': 'PA',
 'addr:street': 'E Philadelphia Avenue',
 'amenity': 'school',
 'area': 'yes',
 'building': 'yes',
 'leisure': 'pitch',
 'name': 'Boyertown Area Junior High School-West Center',
 'natural': 'water',
 'railway': 'platform',
 'smoking': 'outside',
 'source': 'Yahoo',
 'sport': 'baseball',
 'tourism': 'museum',
 'wikidata': 'Q8069423',
 'wikipedia': "en:Zern's Farmer's Market"}

So it appears that there are schools in this dataset, and that we can continue on.

Assigning Values to Geometries

Now that we have our Features, it’s time to assign them values. The reason we need to do so is because when a vector becomes a raster, its cells need to have some kind of value. When rasterizing Features, each geometry contained within it will be given a single value, and all cells that intersect that shape will have that value. In addition to value of the actual cells, there’s another property that we will want to set for each Feature, Z-Index.

The Z-Index of a Feature determines what value a cell will be if more than one geometry intersects it. With a higher Z-Index taking priority over a lower one. This is important as there may be cases where multiple geometries are present at a single cell, but that cell can only contain one value.

For this example, we are going to want all Polygons to have a higher Z-Index than the Lines. In addition, since we’re interested in schools, Polygons that are labeled as schools will have a greater Z-Index than other Polygons.

mapped_lines = lines.map(lambda feature: gps.Feature(feautre.geometry, gps.CellValue(value=1, zindex=1)))

def assign_polygon_feature(feature):
    tags = feature.properties.tags.values()

    if 'school' in tags.values():
        return gps.Feature(feature.geometry, gps.CellValue(value=3, zindex=3))
    else:
        return gps.Feature(feature.geometry, gps.CellValue(value=2, zindex=2))

mapped_polygons = polygons.map(assign_polygon_feature)

We create the mapped_lines variable that contains an RDD of Features, where each Feature has a CellValue with a value and zindex of 1. The assign_polygon_feature function is then created which will test to see if a Polygon is a school or not. If it is, then the resulting Feature will have a CellValue with a value and zindex of 3. Otherwise, those two values will be 2.

Rasterizing the Features

Now that the Features have been given CellValues, it is now time to rasterize them.

unioned_features = pysc.union((mapped_lines, mapped_polygons))

rasterized_layer = gps.rasterize_features(features=unioned_features, crs=4326, zoom=12)

The rasterize_features function requires a single RDD of Features. Therefore, we union together mapped_lines and mapped_polygons which gives us unioned_features. Along with passing in our RDD, we must also set the crs and zoom of the resulting Layer. In this case, the crs is in LatLng, so we set it to be 4326. zoom varies between use cases, so it was just chosen arbitrarily for this example. The resulting rasterized_layer is a TiledRasterLayer that we can now analyze and/or ingest.