Loader¶
This is where your urban data journey begins. Whether you’ve got CSV, Parquet, or Shapefile data, or want to pull from Hugging Face datasets, we’ll get them loaded up and ready to explore. UrbanMapper provides two main ways to load data:
- Manual Loading of Local Datasets: You can load datasets available locally in various formats like CSV, Parquet, and Shapefiles. This is the default approach for working with your own data.
- Integration with the Hugging Face Datasets Library: UrbanMapper also supports loading datasets hosted on Hugging Face, via the from_dataframe() and from_huggingface() methods. This broadens the possibilities for integrating external data sources seamlessly.
Data sources used:
- PLUTO data from NYC Open Data: https://www.nyc.gov/content/planning/pages/resources/datasets/mappluto-pluto-change
- Taxi trip record data from the NYC Taxi & Limousine Commission (TLC): https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
The OSCUR Hugging Face Dataset Source: The OSCUR Hugging Face organization hosts all datasets associated with OSCUR: Open-Source Cyberinfrastructure for Urban Computing, a research initiative focused on enabling reproducible, scalable, and accessible data-driven analysis for urban environments. Using the OSCUR datasets lets you skip downloading data locally from Google Drive or official portals; they are ready to use in all subsequent notebook examples, making your workflow more efficient and seamless.
Ready? Let’s dive in! 🚀
import urban_mapper as um
# Start up UrbanMapper
mapper = um.UrbanMapper()
Loading CSV Data¶
First up, let’s load a CSV file with PLUTO data. We’ll tell UrbanMapper where to find the longitude and latitude columns so it knows what’s what and can make sure those columns are well formatted prior to any analysis.
Note that below we use a specific CSV, but you can substitute your own path; try it out!
csv_loader = (
mapper
.loader # From the loader module
.from_file("<path_to>/pluto.csv") # To update with your own path
.with_columns(longitude_column="longitude", latitude_column="latitude") # Inform your long and lat columns
)
gdf = csv_loader.load() # Load the data and create a geodataframe's instance
# gdf stands for GeoDataFrame, like df in pandas for dataframes.
gdf
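Since load() returns a GeoPandas GeoDataFrame, all the usual GeoPandas tooling applies right away. Here is a quick sketch of a few standard checks; these are plain GeoPandas calls, nothing UrbanMapper-specific:
print(gdf.shape)  # (number of rows, number of columns)
print(gdf.crs)  # Coordinate reference system attached by the loader
print(gdf.geometry.head())  # Point geometries built from the longitude/latitude columns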
Loading Parquet Data¶
Next, let’s grab a Parquet-based dataset for the example. Same workflow as for the CSV.
parquet_loader = (
    mapper
    .loader  # From the loader module
    .from_file("<path_to>/taxisvis5M.parquet")  # To update with your own path
    .with_columns(longitude_column="pickup_longitude", latitude_column="pickup_latitude")  # Inform your long and lat columns
)
gdf = parquet_loader.load() # Load the data and create a geodataframe's instance
gdf
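A handy sanity check after loading coordinate data is to look at the bounding box of the geometries. This sketch uses plain GeoPandas (total_bounds returns [min_x, min_y, max_x, max_y]); the expected ranges in the comments are rough NYC bounds, purely illustrative:
min_x, min_y, max_x, max_y = gdf.total_bounds  # Bounding box of all geometries
print(f"Longitude range: {min_x:.4f} to {max_x:.4f}")  # Roughly -74.3 to -73.7 for NYC pickups
print(f"Latitude range: {min_y:.4f} to {max_y:.4f}")  # Roughly 40.5 to 41.0 for NYC pickups
# Values far outside these ranges usually point at dirty rows worth filtering or imputing.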
Loading Shapefile Data¶
Finally, let’s load a Shapefile-based dataset. Shapefiles have geometry built in, so no need to specify columns — UrbanMapper sorts it out for us!
shp_loader = (
mapper
.loader # From the loader module
.from_file("<path_to>/MapPLUTO.shp") # To update with your own path
)
gdf = shp_loader.load() # Load the data and create a geodataframe's instance
gdf
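Since the Shapefile carries its own geometry and CRS, it is worth checking what you got. MapPLUTO is typically distributed in a projected CRS (we assume EPSG:2263, the New York State Plane, here; check your own file), and reprojecting is plain GeoPandas:
print(gdf.crs)  # Likely a projected CRS such as EPSG:2263 (assumption, verify against your file)
print(gdf.geom_type.value_counts())  # e.g. Polygon / MultiPolygon tax-lot shapes
gdf_wgs84 = gdf.to_crs(epsg=4326)  # Reproject to WGS84 longitude/latitude if a downstream step expects it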
Loading Data from Hugging Face¶
UrbanMapper provides two ways to load datasets from Hugging Face:
- Using from_dataframe(): load the dataset into a pandas DataFrame first, giving you the flexibility to preprocess or explore the data before loading it into UrbanMapper.
- Using from_huggingface(): directly load the dataset into UrbanMapper, skipping the intermediate DataFrame step for simplicity.
Method 1: Using from_dataframe()¶
This code loads the "oscur/pluto" dataset from Hugging Face, selects the training split, and converts the first 1,000 rows into a pandas DataFrame for efficient analysis and exploration. The resulting DataFrame can then be loaded into UrbanMapper using from_dataframe().
from datasets import load_dataset
import pandas as pd
# Retrieve the dataset from Hugging Face
dataset = load_dataset("oscur/pluto")
# Select the training split
train_ds = dataset["train"]
# Convert the first 1000 rows to a DataFrame
df = pd.DataFrame(train_ds[:1000])
# Load the dataset using UrbanMapper
df_loader = (
mapper
.loader # From the loader module
.from_dataframe(df) # To update with your dataframe
.with_columns(longitude_column="longitude", latitude_column="latitude") # Inform your long and lat columns
)
gdf = df_loader.load() # Load the data and create a geodataframe's instance
# gdf stands for GeoDataFrame, like df in pandas for dataframes.
gdf
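The main benefit of this route is that you can clean or filter the DataFrame before UrbanMapper sees it. Below is a minimal sketch with plain pandas, reusing the df from above; the latitude bounds are rough NYC values, purely illustrative:
clean_df = df.dropna(subset=["longitude", "latitude"])  # Drop rows with missing coordinates
clean_df = clean_df[clean_df["latitude"].between(40.4, 41.0)]  # Keep rows within rough NYC latitudes
clean_loader = (
    mapper
    .loader
    .from_dataframe(clean_df)  # Same workflow as before, but with the cleaned DataFrame
    .with_columns(longitude_column="longitude", latitude_column="latitude")
)
gdf_clean = clean_loader.load()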
Method 2: Using from_huggingface()¶
This method directly loads the "oscur/pluto" dataset into UrbanMapper, skipping the intermediate DataFrame step. It's a simpler and faster way to load datasets hosted on Hugging Face.
# Load a dataset directly from Hugging Face (first 100 rows)
loader = (
    mapper
    .loader
    .from_huggingface("oscur/pluto", number_of_rows=100)
    .with_columns(longitude_column="longitude", latitude_column="latitude")
)
gdf = loader.load()
gdf # Next steps: analyze or visualize the data
Previewing Your Loader’s Instance¶
Additionally, you can preview your loader’s instance to see which columns you’ve specified and the file path you’ve loaded from. This is pretty useful when you load an urban analysis shared by someone else and want to check which columns the analysis uses.
print(loader.preview())  # Preview the loader instance, not the GeoDataFrame
Loading multiple datasets to feed an end-to-end UrbanMapper process (step-by-step or pipeline)¶
# Load datasets directly from Hugging Face
pluto_data = (
    mapper
    .loader
    .from_huggingface("oscur/pluto", number_of_rows=100)
    .with_columns(longitude_column="longitude", latitude_column="latitude")
    .load()
)
taxi_data = (
mapper
.loader
.from_huggingface("oscur/taxisvis1M", number_of_rows=100)
.with_columns(longitude_column="pickup_longitude", latitude_column="pickup_latitude")
.with_map({"pickup_longitude": "longitude", "pickup_latitude": "latitude"}) ## Routines like layer.map_nearest_layer need datasets with the same longitude_column and latitude_column
.load()
)
## ... load any other dataset
data = {
"pluto_data": pluto_data,
"taxi_data": taxi_data,
## ... add any other dataset
}
## Invoke any other UrbanMapper module passing data as parameter
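Before handing data to other modules, it can be worth double-checking that the with_map() renaming worked and that every dataset now exposes the same coordinate columns. A small sanity-check sketch:
# Sanity check: every dataset should now share the same coordinate columns
for name, dataset_gdf in data.items():
    missing = {"longitude", "latitude"} - set(dataset_gdf.columns)
    assert not missing, f"{name} is missing coordinate columns: {missing}"
print("All datasets share the same coordinate columns.")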
Wrapping Up¶
And that’s that! 🎈 You’ve loaded data like a pro from four different sources: CSV, Parquet, Shapefile, and Hugging Face datasets. Now you’re all set to play with modules like urban_layer or imputer.