Real-world analytics challenges I've solved. All details are NDA-safe and anonymized to protect client confidentiality.
End-to-end ML solution combining property and neighborhood risk data
A regional real estate investment firm needed to systematically evaluate residential property values across the Chicago metropolitan area. Their existing approach relied on manual comparable analysis, which couldn't scale to their deal flow and failed to account for neighborhood-level risk factors that significantly influence property values. The business question was clear: how can we predict property value with enough precision to make faster, more confident acquisition decisions?
The project required integrating multiple large-scale datasets that didn't naturally align:
~50,000 residential sale records with property attributes (beds, baths, square footage, property type, year built) and sale prices across multiple years.
900,000+ crime incident logs spanning four years with latitude/longitude coordinates, incident type, and timestamp—requiring geographic aggregation by neighborhood.
The key engineering challenge was linking crime data (reported at latitude/longitude) to property records (reported by address and neighborhood name). This required building a neighborhood-to-community-area mapping layer and aggregating crime statistics at that level.
Using Python (pandas, NumPy) and SQL for transformation, I cleaned and standardized property records, handled missing values through MICE imputation, and created derived features like price-per-square-foot and property age. For crime data, I aggregated incidents by community area and crime type to create neighborhood risk indices.
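As a minimal sketch of that step, the snippet below uses scikit-learn's IterativeImputer as the MICE-style imputer and pandas for the derived features. The file and column names (parcel_id, sqft, sale_date, and so on) are illustrative stand-ins, not the client's actual schema.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

props = pd.read_csv("sales.csv")  # stand-in for the ~50,000 residential sale records

# Basic standardization before imputing.
props = props.drop_duplicates(subset="parcel_id")
props["property_type"] = props["property_type"].str.strip().str.lower()

# MICE-style imputation of numeric attributes (IterativeImputer is
# scikit-learn's round-robin multivariate imputer).
num_cols = ["beds", "baths", "sqft", "year_built", "sale_price"]
props[num_cols] = IterativeImputer(random_state=42).fit_transform(props[num_cols])

# Derived features used downstream.
props["price_per_sqft"] = props["sale_price"] / props["sqft"]
props["property_age"] = pd.to_datetime(props["sale_date"]).dt.year - props["year_built"]
```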
Built a mapping dictionary to translate 77 Chicago community area codes to neighborhood names, enabling the join between crime statistics and property records. This step alone unlocked the ability to incorporate external risk factors into the valuation model.
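A condensed illustration of that join is below. It assumes each incident already carries a community-area code (or has been spatially assigned to one) and that the property table has a matching neighborhood column; only three of the 77 map entries are shown.

```python
import pandas as pd

# Abbreviated code-to-name map; the real dictionary covers all 77 community areas.
community_areas = {8: "Near North Side", 32: "Loop", 33: "Near South Side"}

crime = pd.read_csv("crimes.csv")  # ~900K incidents, assumed tagged with a community-area code
crime["neighborhood"] = crime["community_area"].map(community_areas)

# Neighborhood risk index: incident counts per crime type, per area.
risk = (crime.groupby(["neighborhood", "primary_type"])
             .size()
             .unstack(fill_value=0)
             .add_prefix("crime_")
             .reset_index())

# Join the risk features onto the cleaned property table from the previous step.
props = props.merge(risk, on="neighborhood", how="left")
```

Aggregating at the community-area level keeps the join deterministic and avoids fuzzy matching on free-text neighborhood names.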
Tested multiple regression approaches including Ridge Regression, Random Forest, and Gradient Boosting. Used GridSearchCV for hyperparameter tuning and cross-validation to prevent overfitting. The final ensemble model combined property-level features with neighborhood crime density scores.
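The model comparison looked roughly like the sketch below; the parameter grids shown are illustrative, not the full search space used in the project.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X = props.drop(columns=["sale_price"]).select_dtypes("number")
y = props["sale_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestRegressor(random_state=42),
                      {"n_estimators": [200, 500], "max_depth": [None, 20]}),
    "gbm": (GradientBoostingRegressor(random_state=42),
            {"n_estimators": [200, 500], "learning_rate": [0.05, 0.1]}),
}

# Cross-validated grid search per candidate, scored on held-out R^2.
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.score(X_test, y_test), 3))
```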
Top predictive features from Random Forest model (importance scores normalized)
85%+ variance explained by the final model
Top 5 features drove 60% of predictive power
+12% accuracy lift from crime features
Crime density at the neighborhood level proved to be a significant predictor—properties in higher-crime areas showed predictable valuation discounts even after controlling for property characteristics. This validated the business intuition and provided a quantifiable risk adjustment factor.
The model enabled the firm to screen acquisition targets at scale, prioritize site visits based on predicted value-to-price ratios, and quantify neighborhood risk in investment memos. The approach reduced due diligence time per property and provided a defensible, data-driven basis for pricing negotiations.
Languages: Python (pandas, NumPy, scikit-learn), SQL | Methods: Random Forest, Ridge Regression, GridSearchCV, MICE Imputation, TF-IDF | Tools: AWS, Jupyter, Folium (geospatial visualization)
Selected projects demonstrating range across analytics disciplines.
Leadership lacked real-time visibility into revenue cycle performance across service lines.
Unified KPI framework with automated Power BI dashboards and stakeholder alignment on definitions.
Reporting cycle reduced from 2 weeks to real-time. Identified $1.2M in recoverable revenue within 90 days.
High claim denial rates were eroding margins, with no systematic root-cause analysis.
Built denial classification model (Python/SQL) with Pareto-style prioritization by payer and denial code.
Top 5 drivers identified (60% of lost revenue). Process changes reduced denials by 18%.
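Below is a simplified sketch of the Pareto prioritization behind that result; the table and column names (payer, denial_code, billed_amount) are stand-ins, not the client's claims schema.

```python
import pandas as pd

denials = pd.read_csv("denials.csv")  # one row per denied claim line (illustrative extract)

# Rank payer / denial-code combinations by dollars at risk, then keep the head
# of the list that accounts for the bulk of lost revenue.
pareto = (denials.groupby(["payer", "denial_code"])["billed_amount"]
                 .sum()
                 .sort_values(ascending=False)
                 .to_frame("lost_revenue"))
pareto["cum_share"] = pareto["lost_revenue"].cumsum() / pareto["lost_revenue"].sum()

priority = pareto[pareto["cum_share"] <= 0.60]  # the handful of drivers worth fixing first
print(priority)
```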
Marketing couldn't determine which ad content features drove email engagement across 1,400+ unique campaigns.
Engineered CTR as target variable; built regression models with NLP-derived content features from email metadata.
Identified that amenity offers increased CTR while unbranded content hurt performance. 50%+ variance explained.
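A stripped-down version of that approach is sketched below, using TF-IDF features from the subject line feeding a ridge regression; the file and column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

campaigns = pd.read_csv("campaigns.csv")  # ~1,400 rows: subject_line, clicks, sends
campaigns["ctr"] = campaigns["clicks"] / campaigns["sends"]  # engineered target

# Text features feed a regularized linear model; cross-validated R^2 estimates
# how much CTR variance the content features explain.
model = make_pipeline(TfidfVectorizer(min_df=5, ngram_range=(1, 2)), Ridge(alpha=1.0))
scores = cross_val_score(model, campaigns["subject_line"], campaigns["ctr"],
                         cv=5, scoring="r2")
print(round(scores.mean(), 2))
```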
Reactive staffing led to overtime costs and service delays during peak periods.
Time-series forecasting model (Python) projecting demand 4-6 weeks ahead, integrated with ops dashboards.
Staffing accuracy improved 25%, overtime costs fell, and recurring bottlenecks were eliminated.
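One way to sketch that forecast is below, using Holt-Winters exponential smoothing from statsmodels as a stand-in for the production model; the data source and daily granularity are assumptions.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Daily case volume, reindexed to a continuous daily frequency.
demand = (pd.read_csv("volume.csv", parse_dates=["date"])
            .set_index("date")["cases"]
            .asfreq("D")
            .fillna(0))

# Additive trend + weekly seasonality; project 6 weeks ahead for the staffing plan.
fit = ExponentialSmoothing(demand, trend="add", seasonal="add",
                           seasonal_periods=7).fit()
forecast = fit.forecast(42)
print(forecast.tail())
```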
Ad hoc data requests overwhelmed the analytics team, creating backlogs for both analysts and business stakeholders.
Self-service semantic layer with governed models, user training, and request triage protocols.
Reduced ad hoc requests by 40%. Freed analyst time for strategic work.
Conflicting metric definitions eroded trust in data across departments.
Cross-functional workshops, canonical definitions, governed metric catalog integrated with BI tools.
Single source of truth across 5 departments. Fewer metric disputes, faster decisions.
I partner with organizations on analytics leadership, consulting, and advisory engagements — from full-time roles to project-based work.
Start a Conversation