Philly Stat 360 · Residential Vacancy Risk Model

Finding the vacant homes the city hasn't found yet.

A machine learning model that scores all 520,000 Philadelphia parcels for vacancy risk — surfacing properties likely to be vacant that don't appear in current city records.

0.940
AUC: area under the ROC curve, a measure of how well the model separates vacant from occupied parcels
84.0%
Sensitivity: share of actually vacant parcels the model correctly flags
89.8%
Specificity: share of occupied parcels the model correctly leaves unflagged
~436k
Parcels scored: every residential parcel in Philadelphia receives a calibrated probability score
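For readers who want to see what these three headline numbers measure, here is a minimal sketch in plain Python. The labels, scores, and the 0.5 threshold are toy values invented for illustration; they are not the model's data, and the function names are hypothetical rather than taken from the project's code.

```python
def confusion_rates(labels, scores, threshold):
    """Return (sensitivity, specificity) when parcels scoring >= threshold are flagged."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)  # vacant, flagged
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)   # vacant, missed
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)   # occupied, unflagged
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)  # occupied, flagged
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Chance that a random vacant parcel outscores a random occupied one (ties count half)."""
    vacant = [s for y, s in zip(labels, scores) if y == 1]
    occupied = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((v > o) + 0.5 * (v == o) for v in vacant for o in occupied)
    return wins / (len(vacant) * len(occupied))


# Toy data, invented for illustration only (1 = vacant, 0 = occupied).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.91, 0.62, 0.35, 0.40, 0.22, 0.10, 0.08, 0.05]

sensitivity, specificity = confusion_rates(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} auc={area:.3f}")
```

Sensitivity and specificity depend on where the flagging threshold is set, while AUC summarizes ranking quality across all thresholds.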

Philadelphia has a vacancy problem. The records don't show all of it.

Vacant properties are one of the most visible signs of disinvestment in a neighborhood. They attract illegal dumping, reduce property values for surrounding owners, create fire hazards, and signal to residents that a block is being left behind.

The city's official vacancy count, the Vacant Property Indicator, is compiled from Department of Licenses and Inspections (L&I) records, and it has a known gap. A building can sit empty for years before an inspector flags it or a neighbor files a complaint. The data reflects enforcement history, not ground truth.

That gap matters, because L&I can't inspect what it doesn't know about. Community development organizations, housing courts, and city planners deciding where to direct resources end up working from an incomplete picture.

This model was built to close part of that gap. It combines dozens of signals from public administrative data: code violation history, clean and seal actions, unsafe and imminently dangerous orders, business license records, building permits, parcel characteristics from OPA, and deed transfer history. The result is a probability score for every residential parcel in the city. A higher score means a property looks more like other properties that turned out to be vacant.
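As a rough illustration of what combining signals at the parcel level can look like, the sketch below rolls a few record types up to one row per parcel. Every field name, identifier, and record here is hypothetical and invented for the example; the real extracts and the model's actual features are described in the methodology report.

```python
from collections import Counter

# Hypothetical extracts; real field names and identifiers will differ.
parcels = [
    {"parcel_id": "A1", "building_age": 95},
    {"parcel_id": "B2", "building_age": 40},
    {"parcel_id": "C3", "building_age": 110},
]
violations = [
    {"parcel_id": "A1", "year": 2021},
    {"parcel_id": "A1", "year": 2023},
    {"parcel_id": "C3", "year": 2022},
]
clean_and_seal = [{"parcel_id": "A1"}]
permits = [{"parcel_id": "B2"}]


def build_features(parcels, violations, clean_and_seal, permits):
    """Aggregate event records to one feature row per parcel."""
    violation_counts = Counter(v["parcel_id"] for v in violations)
    permit_counts = Counter(p["parcel_id"] for p in permits)
    sealed = {r["parcel_id"] for r in clean_and_seal}
    rows = []
    for parcel in parcels:
        pid = parcel["parcel_id"]
        rows.append({
            "parcel_id": pid,
            "building_age": parcel["building_age"],
            "violation_count": violation_counts[pid],  # Counter gives 0 when absent
            "permit_count": permit_counts[pid],
            "ever_cleaned_and_sealed": pid in sealed,
        })
    return rows


features = build_features(parcels, violations, clean_and_seal, permits)
for row in features:
    print(row)
```

Parcels with no matching records still get a row, with zero counts, so that every parcel can be scored.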

The score is not a final determination of vacancy, not a lien or seizure trigger, and not a substitute for field judgment. It gives the people doing the work a calibrated starting point: a prioritized list of addresses worth a second look, based on data rather than chance or proximity to the last complaint.

Why the records undercount

From raw administrative data to a ranked list of addresses.

1. Data assembly
Six city datasets (Violations, Real Estate Transfer, OPA Properties, Spatial Lag, Clean & Seal, and Business Licenses) are joined at the parcel level.
2. Feature engineering
Raw fields are transformed into 34 predictive signals.
3. Model training
The pipeline trains four base learners (logistic regression, random forest, XGBoost, and LightGBM) on 34 features across 352K residential parcels, then blends the calibrated logistic regression and random forest into a 50/50 ensemble validated by ZIP- and tract-grouped spatial cross-validation.
4. Probability scoring
The ensemble outputs a calibrated probability of vacancy for every residential parcel, expressed as a 0 to 100 risk score, a top-one-percent flag, and a five-tier rank bucket for dashboard display.
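The blending and scoring steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: model fitting and calibration are not shown, the two probability lists are synthetic, the function names are hypothetical, and the five tiers are assumed here to be equal-sized rank fifths with tier 1 as the highest risk.

```python
def blend(p_logistic, p_forest, weight=0.5):
    """Weighted average of two calibrated probability lists (0.5 gives a 50/50 blend)."""
    return [weight * a + (1 - weight) * b for a, b in zip(p_logistic, p_forest)]


def risk_scores(probabilities):
    """Express probabilities on a 0 to 100 scale, rounded to one decimal place."""
    return [round(100 * p, 1) for p in probabilities]


def rank_flags_and_tiers(probabilities, n_tiers=5):
    """Flag the top one percent by rank and assign rank-based tiers.

    Assumption: tiers are equal-sized rank buckets, tier 1 = highest risk.
    """
    n = len(probabilities)
    order = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    n_top = max(1, n // 100)  # at least one parcel is always flagged
    top_flag = [False] * n
    tier = [0] * n
    for rank, i in enumerate(order):
        top_flag[i] = rank < n_top
        tier[i] = rank * n_tiers // n + 1
    return top_flag, tier


# Synthetic stand-ins for the two calibrated model outputs (200 parcels).
p_logistic = [i / 199 for i in range(200)]
p_forest = [(i / 199) ** 2 for i in range(200)]

ensemble = blend(p_logistic, p_forest)
scores = risk_scores(ensemble)
top_flag, tier = rank_flags_and_tiers(ensemble)
print(scores[199], top_flag[199], tier[199])
```

Ranking on the blended probability rather than on the rounded score keeps near-ties from being reordered by rounding.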

Start with the data. Go deep with the methodology.

The model produces two finished artifacts. Use the interactive dashboard to explore parcel-level scores across the city. Read the methodology report to understand how the model was built and validated, and what its limitations are.

About Philly Stat 360

Philly Stat 360 is the City of Philadelphia's performance management initiative. We track how city government is doing — across every department, in plain language — and publish the results for every resident to see.

This vacancy risk model is part of a broader effort to use data to make city services more proactive — finding problems before they become crises, and doing it fairly.

Get in touch