How to use Tomerd88/nyc-salary-predictor-tomer with Scikit-learn:
```python
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download("Tomerd88/nyc-salary-predictor-tomer", "sklearn_model.joblib")
)
# only load pickle files from sources you trust
# read more about it here https://skops.readthedocs.io/en/stable/persistence.html
```
Decoding NYC: A Machine Learning Approach to Public Sector Salary Prediction
Presentation Video
Project Overview
This repository hosts a data science project that models the structure of the New York City public sector payroll. Starting from raw administrative records, we cleaned the data, engineered features with unsupervised learning, and trained regression and classification models intended to support fiscal forecasting.
The Result: R² rose from a weak linear baseline of 0.036 to 0.623 with a Random Forest model, a relative improvement of roughly 1,600%.
Research Question
To what extent can we predict the Base Salary of a NYC employee based on their organizational affiliation, geographic location, and professional seniority, and which of these factors has the most significant impact on public sector wage determination?
Phase 1: Dataset & Strategic Cleaning
Dataset Description
- Source: NYC Citywide Payroll Data (Fiscal Year 2014-2023).
- Size: The original dataset contains ~6.77 million rows.
- Sampling: For this assignment, we used a representative sample of 15,000 rows, which keeps processing fast while comfortably exceeding the academic requirements.
- Target Variable: Base Salary (Continuous numeric value representing the employee's contractual annual or hourly rate).
Data Cleaning Pipeline
To protect privacy and ensure quality, we implemented the following:
- Privacy Protection: Removed personal identifiers (First Name, Last Name, Mid Init) and system IDs (Payroll Number).
- Date Parsing: Converted Agency Start Date to datetime objects to calculate precise Seniority.
- Standardization: Normalized categorical columns (Boroughs, Agencies) to uppercase and removed duplicate records.
- Salary Normalization: Converted hourly rates into estimated annual wages (Hourly Rate × 2,080 standard hours) to put all employment types on a single annual scale.
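The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code. The column names are assumed to match the public NYC Citywide Payroll export, and the helper `clean_payroll`, the `Annual Salary` column, the reference date `as_of`, and the definition of seniority as years elapsed up to that date are all assumptions made for the example.

```python
import pandas as pd

ID_COLUMNS = ["First Name", "Last Name", "Mid Init", "Payroll Number"]
HOURS_PER_YEAR = 2080  # standard full-time hours, as stated in the list above


def clean_payroll(df: pd.DataFrame, as_of: str = "2023-06-30") -> pd.DataFrame:
    """Hypothetical helper mirroring the listed cleaning steps."""
    # Privacy: drop personal identifiers and system IDs if present.
    out = df.drop(columns=ID_COLUMNS, errors="ignore").copy()

    # Date parsing: unparseable values become NaT instead of raising.
    out["Agency Start Date"] = pd.to_datetime(out["Agency Start Date"], errors="coerce")
    # Assumption: seniority is the number of years up to a fixed reference date.
    out["Seniority"] = (pd.Timestamp(as_of) - out["Agency Start Date"]).dt.days / 365.25

    # Standardization: trim and uppercase so "queens" and "QUEENS " match.
    for col in ["Agency Name", "Work Location Borough"]:
        out[col] = out[col].str.strip().str.upper()

    # Salary normalization: annualize hourly rates, leave other pay bases as-is.
    is_hourly = out["Pay Basis"].str.strip().str.lower() == "per hour"
    out["Annual Salary"] = out["Base Salary"].where(
        ~is_hourly, out["Base Salary"] * HOURS_PER_YEAR
    )

    return out.drop_duplicates().reset_index(drop=True)
```

Under these assumptions, a $20 hourly rate becomes $41,600 per year (20 × 2,080), which places it on the same scale as annual salaries before any filtering.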
Strategic Outlier Handling: The Business-Logic Filter
Instead of a purely statistical approach, we applied a filter based on the professional reality of NYC:
- The Floor ($30,000): We filtered out records with an annual salary below $30,000. These typically represent seasonal staff or data errors that don't reflect career-level employment.
- The Ceiling ($400,000): We intentionally preserved high-earning salaries. While statistically flagged as outliers, they represent the city's executive hierarchy. Deleting them would prevent the model from understanding the full organizational ladder.
- Integrity Check: Filtered out records with 0 regular hours to eliminate administrative artifacts.
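As a rough sketch, the three rules above reduce to a single boolean mask. The function name `apply_business_filter` and the column names `Annual Salary` and `Regular Hours` are assumptions for illustration. The sketch also assumes the $400,000 ceiling is an inclusive upper bound, which the description above does not state explicitly.

```python
import pandas as pd

SALARY_FLOOR = 30_000     # records below this are dropped
SALARY_CEILING = 400_000  # assumed inclusive upper bound


def apply_business_filter(
    df: pd.DataFrame,
    salary_col: str = "Annual Salary",
    hours_col: str = "Regular Hours",
) -> pd.DataFrame:
    """Keep career-level records with a non-zero number of regular hours."""
    in_range = df[salary_col].between(SALARY_FLOOR, SALARY_CEILING)  # inclusive on both ends
    has_hours = df[hours_col] != 0
    return df.loc[in_range & has_hours].reset_index(drop=True)
```

Unlike an IQR or z-score rule, this filter never removes a record simply for being far from the mean, so executive salaries stay in the training data.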
Phase 2: Exploratory Data Analysis & Research Insights
Key Findings
- The Seniority Paradox: We found a correlation of only 0.01 between Seniority and Base Salary. This indicates almost no linear relationship between longevity and pay in the NYC public sector, and suggests that salary is tied far more closely to Job Title and Agency entry points than to years of service.
- Geographic Wage Gap: A Metropolitan Premium is visible in the data. Manhattan and Queens have the highest average salaries (~$71,000), which points to location as a high-impact factor.
- The Multimodal Structure: The salary distribution reveals three distinct peaks (around $45K, $60K, and $78K), suggesting rigid, contract-based pay grades.
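The first two findings come down to a correlation coefficient and a grouped mean. A minimal sketch of how they might be computed is shown below; the function names are hypothetical and the column names are assumed to match the payroll export. Note that `Series.corr` computes the Pearson coefficient by default, which measures only linear association, so a value near zero does not rule out a curved relationship.

```python
import pandas as pd


def seniority_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between seniority and base salary."""
    return df["Seniority"].corr(df["Base Salary"])


def mean_salary_by_borough(df: pd.DataFrame) -> pd.Series:
    """Average base salary per borough, highest first."""
    return (
        df.groupby("Work Location Borough")["Base Salary"]
        .mean()
        .sort_values(ascending=False)
    )


# Toy data: salary rises and then falls with seniority.
toy = pd.DataFrame({
    "Seniority": [0, 1, 2, 3, 4],
    "Base Salary": [50_000, 60_000, 70_000, 60_000, 50_000],
    "Work Location Borough": ["QUEENS", "QUEENS", "MANHATTAN", "BRONX", "BRONX"],
})
print(seniority_correlation(toy))      # essentially zero despite a clear pattern
print(mean_salary_by_borough(toy))
```

In the toy data the relationship is perfectly symmetric around year 2, so the linear correlation cancels out even though seniority clearly matters.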
Phase 3: Unsupervised Learning & Feature Engineering
To overcome the limitations of linear modeling, we implemented what we call Contextual Intelligence: features derived from unsupervised learning that describe each employee's context.
- Unsupervised Clustering (K-Means): We segmented the workforce into 4 distinct archetypes (Entry-Level, High-Value Veterans, Overtime Specialists, and Standard Admin).
- PCA Validation: Principal Component Analysis confirmed that our clusters represent distinct pay structures rather than random noise.
- Advanced Features:
- Polynomial Seniority: Added Seniority_Squared to capture the curved nature of career growth.
- Distance to Centroid: A continuous metric measuring how typical or unique an employee is within their organizational group.
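The feature engineering described above could look roughly like the following scikit-learn sketch. `Salary_Cluster` and `Seniority_Squared` are names used in this card; the function `add_cluster_features`, the column name `Distance_To_Centroid`, the choice of input columns, and the scaling step are assumptions, since the card does not list which variables were clustered.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def add_cluster_features(
    df: pd.DataFrame,
    feature_cols: list,
    n_clusters: int = 4,
    random_state: int = 42,
) -> pd.DataFrame:
    """Hypothetical helper adding cluster-based and polynomial features."""
    out = df.copy()
    out["Seniority_Squared"] = out["Seniority"] ** 2

    # Scale first so no single column dominates the Euclidean distances.
    X = StandardScaler().fit_transform(out[feature_cols])

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    out["Salary_Cluster"] = km.labels_

    # transform() returns the distance to every centroid; keep each row's own.
    all_distances = km.transform(X)
    out["Distance_To_Centroid"] = all_distances[np.arange(len(out)), km.labels_]

    # 2-D PCA projection, useful for plotting the clusters.
    coords = PCA(n_components=2).fit_transform(X)
    out["PC1"], out["PC2"] = coords[:, 0], coords[:, 1]
    return out
```

A call such as `add_cluster_features(df, ["Seniority", "Regular Hours", "OT Hours"])` would add the new columns; a scatter plot of `PC1` against `PC2` colored by `Salary_Cluster` is one way to inspect how well the groups separate.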
Phase 4: Regression Modeling & Performance Leap
We moved from a linear performance floor to sophisticated ensemble algorithms.
Performance Comparison
| Model | R² Score | MAE (USD) | Relative Improvement |
|---|---|---|---|
| Baseline (Linear) | 0.036 | $18,462 | - |
| Engineered Linear | 0.154 | $14,156 | +327% |
| Decision Tree | 0.620 | $9,852 | +1,600% |
| Random Forest (Winner) | 0.623 | $9,824 | Top Performer |
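The Relative Improvement column can be checked directly from the R² values in the table. The short sketch below assumes the improvement is measured against the baseline R² of 0.036; the function name is illustrative.

```python
BASELINE_R2 = 0.036

r2_scores = {
    "Engineered Linear": 0.154,
    "Decision Tree": 0.620,
    "Random Forest": 0.623,
}


def relative_improvement(r2: float, baseline: float = BASELINE_R2) -> float:
    """Percentage gain in R² relative to the baseline model."""
    return (r2 - baseline) / baseline * 100


for name, r2 in r2_scores.items():
    print(f"{name}: +{relative_improvement(r2):.0f}%")
```

This yields about +328%, +1,622%, and +1,631% respectively, which agrees with the table up to rounding and shows that the Random Forest's edge over the single Decision Tree is small.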
Feature Importance
Tree-based models split on feature thresholds, which lets the Random Forest isolate discrete pay-rule boundaries that a linear model cannot represent.
- Top Predictor: Salary_Cluster (Validating the unsupervised learning phase).
- Secondary Predictor: Institutional drivers like Agency name and Contract type.
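A ranking like the one above is typically read from the fitted model's `feature_importances_` attribute. The sketch below is a generic illustration on numeric features, not the project's code; the helper name and hyperparameters are assumptions. These impurity-based importances sum to 1 and tend to favor features with many distinct values, so they are best treated as a rough guide.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_feature_importances(X: pd.DataFrame, y, random_state: int = 42) -> pd.Series:
    """Fit a Random Forest and return importances sorted from high to low."""
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(
        ascending=False
    )


# Toy check: the target depends only on "signal", so it should rank first.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y_demo = 3 * X_demo["signal"]
print(rank_feature_importances(X_demo, y_demo))
```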
Phase 5: Salary Tier Classification
We reframed the task as a classification problem with three classes (Low, Mid, and High salary tiers) to assist in organizational grading.
Metric Strategy: Prioritizing Recall
In a public sector context, the cost of error is asymmetrical:
- The Critical Error (False Negative): Predicting a low tier for a high-tier employee. This leads to severe under-budgeting and financial instability.
- The Safety Margin (False Positive): Predicting a higher tier results in a conservative, padded budget, which is a manageable scenario for government agencies.
- Result: Our Random Forest Classifier achieved a recall of 0.69 for the High-Salary class, meaning it correctly flags about 69% of high-tier employees and still misses roughly 31%.
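The tiering and the per-class recall could be reproduced along these lines. This is a sketch under assumptions: it uses an equal-frequency three-way split via `pandas.qcut`, a stratified 80/20 hold-out, and illustrative hyperparameters, none of which are confirmed by this card. The function names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

TIERS = ["Low", "Mid", "High"]


def make_salary_tiers(salary: pd.Series) -> pd.Series:
    """Split salaries into three equal-frequency tiers."""
    return pd.qcut(salary, q=3, labels=TIERS)


def per_tier_recall(X: pd.DataFrame, salary: pd.Series, random_state: int = 42) -> dict:
    """Train a classifier and report recall separately for each tier."""
    y = make_salary_tiers(salary).astype(str)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    clf.fit(X_train, y_train)
    # average=None returns one recall value per label, in the order of `labels`.
    scores = recall_score(y_test, clf.predict(X_test), labels=TIERS, average=None)
    return dict(zip(TIERS, scores))
```

Recall for the High tier is the share of truly high-tier employees that the model labels as High, so it directly measures the under-budgeting error described above.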
Extra Work & Deep Insights
Beyond basic prediction, we uncovered structural anomalies that define NYC public compensation:
- The OT Survival Mechanism: Regression analysis revealed a high density of overtime among lower earners. OT appears to act as a survival mechanism to supplement income, a need that fades as base salary approaches the $100K threshold.
- The Hourly Consultant Anomaly: Normalized analysis revealed that Per Hour roles often represent highly specialized consultants earning premium annualized rates (~$135K), which challenges the assumption that hourly roles are entry-level.
- Fiscal Safety Net: By choosing a 3-class quantile split instead of a binary one, we provided a more nuanced tool for detecting Mid-Tier salary shifts.
Final Reflections
This project demonstrates that a NYC employee's wage is primarily a function of where they work (Borough) and how they are contracted (Pay Basis), rather than how long they have been in the system (Seniority). By leveraging Unsupervised Learning and Ensemble methods, we transformed a dataset characterized by high variance and non-linearity into a robust predictive tool.
Developed by: Tomer (Tomerd88)
Academic Institution: Reichman University
Date: May 2026








