Imputer
In this notebook, we’re tackling the Imputer module, your best bet for sorting out missing geospatial data. Let’s see it in action with some sample data!
Data source used:
- PLUTO data from NYC Open Data. https://www.nyc.gov/content/planning/pages/resources/datasets/mappluto-pluto-change
import urban_mapper as um
# Fire up UrbanMapper
mapper = um.UrbanMapper()
Loading Sample Data
First, let’s grab some sample CSV data. It might have a few gaps in the coordinates, but we’ll sort that out in a jiffy!
Note that:
- A loader example is available in examples/Basics/loader.ipynb, which is especially useful for loading your own data.
# Load data
# Note: For the interactive documentation, we only query 20000 records from the dataset. Feel free to remove this limit for a more realistic analysis.
data = (
    mapper
    .loader  # From the loader module
    .from_huggingface("oscur/pluto", number_of_rows=20000, streaming=True)  # From the OSCUR HuggingFace datasets hub
    .with_columns("longitude", "latitude")  # With the `longitude` and `latitude` columns
    .load()
)
Applying the Imputer
Now, let’s bring in the SimpleGeoImputer to patch up any missing longitude or latitude values. We’ll tell it which columns to focus on.
SimpleGeoImputer naively imputes missing values if either the longitude or latitude is missing. Other imputers are available; see the documentation for details.
# Create an urban layer (needed for the imputer)
# See further in the urban_layer example at examples/Basics/urban_layer.ipynb
layer = (
    mapper
    .urban_layer  # From the urban layer module
    .with_type("streets_intersections")  # With the type streets_intersections
    .from_place("Downtown Brooklyn, New York City, USA")  # From place
    .build()
)
print(f"[Before Impute] Number of missing values in the longitude column: {data['longitude'].isnull().sum()}")
print(f"[Before Impute] Number of missing values in the latitude column: {data['latitude'].isnull().sum()}")
# Apply the imputer
imputed_data = (
    mapper
    .imputer  # From the imputer module
    .with_type("SimpleGeoImputer")  # With the type SimpleGeoImputer
    .on_columns(longitude_column="longitude", latitude_column="latitude")  # On the columns longitude and latitude
    .transform(data, layer)  # All imputers require access to the urban layer in case they need to extract information from it.
)
print(f"[After Impute] Number of missing values in the longitude column: {imputed_data['longitude'].isnull().sum()}")
print(f"[After Impute] Number of missing values in the latitude column: {imputed_data['latitude'].isnull().sum()}")
imputed_data
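To make the before/after check concrete without UrbanMapper installed, here is a minimal pandas-only sketch. It assumes one naive strategy, dropping any row where either coordinate is missing; that is an illustrative assumption, not a confirmed description of what SimpleGeoImputer does internally. The helper name `drop_incomplete_coordinates` and the toy values are hypothetical.

```python
import pandas as pd


def drop_incomplete_coordinates(
    df: pd.DataFrame,
    longitude_column: str = "longitude",
    latitude_column: str = "latitude",
) -> pd.DataFrame:
    """Hypothetical helper: keep only rows where both coordinates are present."""
    return df.dropna(subset=[longitude_column, latitude_column]).reset_index(drop=True)


# Toy records with gaps in either coordinate (made-up values).
toy = pd.DataFrame(
    {
        "longitude": [-73.99, None, -73.98, -73.97],
        "latitude": [40.69, 40.70, None, 40.68],
    }
)

# Count missing values per column before and after the naive clean-up.
missing_before = toy[["longitude", "latitude"]].isnull().sum()
cleaned = drop_incomplete_coordinates(toy)
missing_after = cleaned[["longitude", "latitude"]].isnull().sum()

print(f"[Before] missing longitude: {missing_before['longitude']}, latitude: {missing_before['latitude']}")
print(f"[After] missing longitude: {missing_after['longitude']}, latitude: {missing_after['latitude']}")
print(f"Rows kept: {len(cleaned)} of {len(toy)}")
```

Counting nulls before and after, as the notebook cell does, is a cheap way to confirm that whichever imputer you pick actually changed the data.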
Preview Your Imputer's Instance
Additionally, you can preview your imputer's instance to see which columns you've specified and which imputer type you've used. This is pretty useful when you load an urban analysis shared by someone else.
print(mapper.imputer.preview())
Provide many different datasets to the same imputer
You can load many datasets and feed the imputer with a dictionary. In that case, the output will also be a dictionary. See the simple example below.
If you want to apply the imputer to a specific dataset in the dictionary, add .with_data(data_id=...) to the imputer.
# Load the PLUTO data
data1 = (
    mapper
    .loader  # From the loader module
    .from_huggingface("oscur/pluto", number_of_rows=1000, streaming=True)  # From the OSCUR HuggingFace datasets hub
    .with_columns("longitude", "latitude")  # With the `longitude` and `latitude` columns
    .load()
)
# Load the taxi data
data2 = (
    mapper
    .loader
    .from_huggingface("oscur/taxisvis1M", number_of_rows=1000, streaming=True)  # Replace with your own dataset
    .with_columns("pickup_longitude", "pickup_latitude")  # Specify your longitude and latitude columns
    .load()
)
data = {
"pluto_data": data1,
"taxi_data": data2,
}
# Apply the imputer.
# If the same imputer is applied to all datasets and the longitude/latitude columns are named differently in each dataset,
# you can use loader.with_map to map the columns and standardize their names.
imputed_data = (
    mapper
    .imputer  # From the imputer module
    .with_type("SimpleGeoImputer")  # With the type SimpleGeoImputer
    .on_columns(longitude_column="longitude", latitude_column="latitude")  # On the columns longitude and latitude
    .with_data(data_id="pluto_data")  # On a specific dataset from the dictionary
    .transform(data, layer)  # All imputers require access to the urban layer in case they need to extract information from it.
)
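The dictionary workflow can be pictured with a small stand-alone sketch. It is not UrbanMapper's implementation: `apply_to_datasets` and `drop_incomplete` are hypothetical helpers, records are plain dictionaries rather than data frames, and it assumes that datasets not selected by `data_id` are passed through unchanged.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]
Dataset = List[Record]


def drop_incomplete(records: Dataset) -> Dataset:
    """Keep only records where both coordinates are present (illustrative stand-in)."""
    return [
        r for r in records
        if r.get("longitude") is not None and r.get("latitude") is not None
    ]


def apply_to_datasets(
    datasets: Dict[str, Dataset],
    transform: Callable[[Dataset], Dataset],
    data_id: Optional[str] = None,
) -> Dict[str, Dataset]:
    """Hypothetical dispatcher: dictionary in, dictionary out.

    With data_id=None the transform is applied to every dataset; otherwise only
    the dataset stored under that key is transformed (assumed behaviour).
    """
    if data_id is not None and data_id not in datasets:
        raise KeyError(f"Unknown data_id: {data_id!r}")
    return {
        key: transform(records) if data_id in (None, key) else records
        for key, records in datasets.items()
    }


# Made-up records; each dataset has one incomplete row.
datasets = {
    "pluto_data": [
        {"longitude": -73.99, "latitude": 40.69},
        {"longitude": None, "latitude": 40.70},
    ],
    "taxi_data": [
        {"longitude": -73.98, "latitude": None},
        {"longitude": -73.97, "latitude": 40.68},
    ],
}

only_pluto = apply_to_datasets(datasets, drop_incomplete, data_id="pluto_data")
everything = apply_to_datasets(datasets, drop_incomplete)

print({key: len(records) for key, records in only_pluto.items()})
print({key: len(records) for key, records in everything.items()})
```

Keeping the output keyed the same way as the input means downstream steps can keep referring to each dataset by its identifier.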
More Geo Imputer primitives?
Yes! We also provide AddressGeoImputer, which geocodes the missing coordinates based on a given address attribute in your dataset.
Want more? Come shout that out on https://github.com/VIDA-NYU/UrbanMapper/issues/4
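The idea behind address-based imputation can be sketched as follows. This is a toy illustration only, not AddressGeoImputer's code: the `geocode` function is a hypothetical stand-in backed by a hard-coded lookup table (a real workflow would call a geocoding service), and the addresses and coordinates are made up.

```python
from typing import Optional, Tuple

import pandas as pd

# Hypothetical lookup table standing in for a real geocoding service.
KNOWN_ADDRESSES = {
    "1 Example Plaza": (-73.99, 40.69),
    "2 Sample Street": (-73.98, 40.70),
}


def geocode(address: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for a known address, else None."""
    return KNOWN_ADDRESSES.get(address)


def fill_from_address(df: pd.DataFrame, address_column: str = "address") -> pd.DataFrame:
    """Fill missing coordinates from the address column where a lookup succeeds."""
    out = df.copy()  # Leave the caller's frame untouched.
    missing = out["longitude"].isnull() | out["latitude"].isnull()
    for idx in out.index[missing]:
        hit = geocode(out.at[idx, address_column])
        if hit is not None:
            out.at[idx, "longitude"], out.at[idx, "latitude"] = hit
    return out


toy = pd.DataFrame(
    {
        "address": ["1 Example Plaza", "2 Sample Street", "Unknown Road"],
        "longitude": [None, -73.98, None],
        "latitude": [None, 40.70, 40.71],
    }
)

filled = fill_from_address(toy)
still_missing = int(filled[["longitude", "latitude"]].isnull().any(axis=1).sum())

print(filled)
print(f"Rows still missing a coordinate: {still_missing}")
```

Rows whose address cannot be resolved stay incomplete in this sketch, so a follow-up step (for example, dropping them) may still be needed.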
Wrapping Up
Brilliant! 🎉 You’ve patched up those missing coordinates like a champ. Your data’s looking spick and span!