🗽 Decoding NYC: A Machine Learning Approach to Public Sector Salary Prediction

🎥 Presentation Video

📌 Project Overview

This repository hosts a comprehensive data science project aimed at decoding the complex structure of the New York City public sector payroll. By moving from raw administrative records to advanced predictive modeling, we developed a high-accuracy system that bridges the gap between institutional data and fiscal forecasting.

The Result: A transition from a weak 0.036 R2 baseline to a high-performing 0.623 R2 Random Forest model—a 1,600% relative improvement in predictive accuracy.

🔍 Research Question

To what extent can we predict the Base Salary of a NYC employee based on their organizational affiliation, geographic location, and professional seniority, and which of these factors has the most significant impact on public sector wage determination?

🛠️ Phase 1: Dataset & Strategic Cleaning

📦 Dataset Description

Source: NYC Citywide Payroll Data (Fiscal Year 2014-2023).
Size: The original dataset contains ~6.77 million rows.
Sampling: For this assignment, we utilized a representative sample of 15,000 rows to ensure high performance while comfortably exceeding the academic requirements.
Target Variable: Base Salary (Continuous numeric value representing the employee's contractual annual or hourly rate).

🧹 Data Cleaning Pipeline

To protect privacy and ensure quality, we implemented the following:

Privacy Protection: Removed personal identifiers (First Name, Last Name, Mid Init) and system IDs (Payroll Number).
Date Parsing: Converted Agency Start Date to datetime objects to calculate precise Seniority.
Standardization: Normalized categorical columns (Boroughs, Agencies) to uppercase and ensured unique records.
Salary Normalization: Converted hourly rates into estimated annual wages (Hourly Rate x 2,080 standard hours) to create a unified economic scale across all employment types.

⚖️ Strategic Outlier Handling: The Business-Logic Filter

Instead of a purely statistical approach, we applied a filter based on the professional reality of NYC:

The Floor ($30,000): We filtered out records with an annual salary below $30,000. These typically represent seasonal staff or data errors that don't reflect career-level employment.
The Ceiling ($400,000): We intentionally preserved high-earning salaries. While statistically flagged as outliers, they represent the city's executive hierarchy. Deleting them would prevent the model from understanding the full organizational ladder.
Integrity Check: Filtered out records with 0 regular hours to eliminate administrative artifacts.

📊 Phase 2: Exploratory Data Analysis & Research Insights

💡 Key Findings

The Seniority Paradox: We found a correlation of only 0.01 between Seniority and Base Salary. This proves that in the NYC public sector, longevity is not a primary driver of pay. Salary is strictly tied to Job Title and Agency entry points.
Geographic Wage Gap: A clear Metropolitan Premium exists. Manhattan and Queens lead the city with the highest average salaries (~$71,000), confirming location as a high-impact factor.
The Multimodal Structure: The salary distribution reveals three distinct peaks (around $45K, $60K, and $78K), suggesting rigid, contract-based pay grades.

🧠 Phase 3: Unsupervised Learning & Feature Engineering

To overcome the limitations of linear modeling, we implemented Contextual Intelligence:

Unsupervised Clustering (K-Means): We segmented the workforce into 4 distinct archetypes (Entry-Level, High-Value Veterans, Overtime Specialists, and Standard Admin).
PCA Validation: Principal Component Analysis confirmed that our clusters represent distinct pay structures rather than random noise.
Advanced Features:
- Polynomial Seniority: Added Seniority_Squared to capture the curved nature of career growth.
- Distance to Centroid: A continuous metric measuring how typical or unique an employee is within their organizational group.

🚀 Phase 4: Regression Modeling & Performance Leap

We moved from a linear performance floor to sophisticated ensemble algorithms.

📈 Performance Comparison

Model	R2 Score	MAE (USD)	Relative Improvement
Baseline (Linear)	0.036	$18,462	-
Engineered Linear	0.154	$14,156	+327%
Decision Tree	0.620	$9,852	+1,600%
Random Forest (Winner)	0.623	$9,824	Top Performer

🏆 Feature Importance

The Random Forest successfully used branching logic to isolate specific pay-rule boundaries.

Top Predictor: Salary_Cluster (Validating the unsupervised learning phase).
Secondary Predictor: Institutional drivers like Agency name and Contract type.

⚖️ Phase 5: Salary Tier Classification

We reframed the task as a Classification Problem to assist in organizational grading using 3 Classes: Low, Mid, and High Salary Tiers.

🎯 Metric Strategy: Prioritizing Recall

In a public sector context, the cost of error is asymmetrical:

The Critical Error (False Negative): Predicting a low tier for a high-tier employee. This leads to severe under-budgeting and financial instability.
The Safety Margin (False Positive): Predicting a higher tier results in a conservative, padded budget, which is a manageable scenario for government agencies.
Result: Our Random Forest Classifier achieved a 0.69 Recall for the High-Salary class, ensuring fiscal responsibility.

🚀 Extra Work & Deep Insights

Beyond basic prediction, we uncovered structural anomalies that define NYC public compensation:

The OT Survival Mechanism: Regression analysis revealed a high density of overtime among lower earners. OT acts as a survival mechanism to supplement income, a need that vanishes as base salary increases toward the $100K threshold.
The Hourly Consultant Anomaly: Normalized analysis revealed that Per Hour roles often represent highly specialized consultants earning premium annualized rates (~$135K), debunking the myth that hourly roles are entry-level.
Fiscal Safety Net: By choosing a 3-class quantile split instead of a binary one, we provided a more nuanced tool for detecting Mid-Tier salary shifts.

📝 Final Reflections

This project demonstrates that a NYC employee's wage is primarily a function of where they work (Borough) and how they are contracted (Pay Basis), rather than how long they have been in the system (Seniority). By leveraging Unsupervised Learning and Ensemble methods, we transformed a dataset characterized by high variance and non-linearity into a robust predictive tool.

Developed by: תומר דריאל
Academic Institution: Reichman University
Date: May 2026

Downloads last month: -