email-spam-classifier

A binary email spam classifier that concatenates frozen DistilBERT CLS embeddings (768 dimensions) with 39 structured features derived from email headers and bodies, then feeds the combined vector to a LightGBM binary classifier.

This model is used as the spam pre-filter in mailsieve.

Architecture

Text encoder: distilbert-base-uncased (frozen weights, downloaded separately from the Hugging Face Hub)
Structured features: 39 (sender, auth headers, HTML structure, temporal)
Classifier: LightGBM binary (gbm_model.joblib)
Feature normalizer: StandardScaler (scaler.joblib)
Total input dimensionality: 807 (768 CLS + 39 structured)
Output: binary label (0 = ham, 1 = spam) + confidence score
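
The way the two feature blocks combine into the 807-dimensional input can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the CLS embedding comes first, that only the 39 structured features pass through the scaler, and that the scaler exposes a scikit-learn style `transform` method. `build_input` and `IdentityScaler` are hypothetical names introduced here.

```python
import numpy as np

CLS_DIM = 768        # DistilBERT CLS embedding size (from the architecture list)
N_STRUCTURED = 39    # structured feature count (from the architecture list)


class IdentityScaler:
    """Hypothetical stand-in for the fitted scaler; returns its input unchanged."""

    def transform(self, x):
        return x


def build_input(cls_embedding, structured, scaler):
    """Assemble one classifier row: [CLS embedding | scaled structured features]."""
    cls_embedding = np.asarray(cls_embedding, dtype=np.float32).reshape(1, -1)
    structured = np.asarray(structured, dtype=np.float32).reshape(1, -1)
    if cls_embedding.shape[1] != CLS_DIM:
        raise ValueError(f"expected {CLS_DIM} CLS dims, got {cls_embedding.shape[1]}")
    if structured.shape[1] != N_STRUCTURED:
        raise ValueError(f"expected {N_STRUCTURED} features, got {structured.shape[1]}")
    # Assumption: the scaler is applied to the structured block only,
    # and the CLS block is placed first in the concatenated row.
    scaled = np.asarray(scaler.transform(structured), dtype=np.float32)
    return np.hstack([cls_embedding, scaled])


row = build_input(np.zeros(CLS_DIM), np.ones(N_STRUCTURED), IdentityScaler())
print(row.shape)
```

In real use the scaler would be the object stored in scaler.joblib, and the resulting row would be passed to the classifier stored in gbm_model.joblib. How the label and confidence score are read from that classifier depends on how it was saved.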

Structured Features (39)

Feature
`sender_domain_depth`
`has_display_name`
`display_name_has_digits`
`name_domain_mismatch`
`local_part_is_noreply`
`local_part_has_digits`
`local_part_length`
`auth_score`
`is_html_only`
`html_to_text_ratio`
`link_count`
`unique_link_domain_count`
`has_tracking_pixels`
`has_attachments`
`has_executable_attachment`
`body_word_count`
`subject_word_count`
`subject_is_all_caps`
`subject_exclamation_count`
`hour_of_day`
`day_of_week`
`is_weekend`
`is_outside_business_hours`
`local_part_entropy`
`local_part_consonant_run_ratio`
`local_part_vowel_ratio`
`local_part_digit_ratio`
`local_part_is_random`
`display_name_domain_token_overlap`
`display_name_looks_like_brand`
`name_local_part_char_overlap`
`display_name_has_unicode`
`display_name_special_char_ratio`
`domain_name_length_norm`
`domain_has_digits`
`domain_tld_is_risky`
`domain_is_free_provider`
`domain_is_private_relay`
`sender_contains_recipient_fragment`
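
As an illustration of how a few of the sender local-part features might be computed, here is a hedged sketch. The model card does not define these features, so the definitions below are assumptions: entropy is taken as Shannon entropy over characters, in bits, and the ratios are counts divided by the local-part length. `local_part_features` is a hypothetical helper name.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def local_part_features(address):
    """Compute illustrative local-part features for an address like 'user@host'."""
    local = address.split("@", 1)[0].lower()
    n = len(local)
    if n == 0:
        # Empty local part: return zeros rather than dividing by zero.
        return {
            "local_part_length": 0,
            "local_part_has_digits": 0,
            "local_part_digit_ratio": 0.0,
            "local_part_vowel_ratio": 0.0,
            "local_part_entropy": 0.0,
        }
    counts = Counter(local)
    # Assumed definition: Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    digits = sum(ch.isdigit() for ch in local)
    vowels = sum(ch in VOWELS for ch in local)
    return {
        "local_part_length": n,
        "local_part_has_digits": int(digits > 0),
        "local_part_digit_ratio": digits / n,
        "local_part_vowel_ratio": vowels / n,
        "local_part_entropy": entropy,
    }


print(local_part_features("ab12@example.com"))
```

The values fed to the classifier must come from the same definitions used at training time, so this sketch is only a reading aid for the feature names.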

Files

File	Description
`gbm_model.joblib`	LightGBM binary classifier
`scaler.joblib`	StandardScaler for structured features
`metadata.json`	`bert_model_id`, `feature_columns`, `version`
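
Loading these artifacts from a local directory could look roughly like the sketch below. It assumes the three files sit side by side in one directory, that `feature_columns` is a list naming the 39 structured features, and that both .joblib files can be read with `joblib.load`. `load_metadata` and `load_artifacts` are hypothetical helper names.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("bert_model_id", "feature_columns", "version")


def load_metadata(model_dir, expected_features=39):
    """Read metadata.json and check the keys listed in the Files table."""
    path = Path(model_dir) / "metadata.json"
    meta = json.loads(path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise KeyError(f"metadata.json is missing keys: {missing}")
    # Assumption: feature_columns lists the structured features only.
    if len(meta["feature_columns"]) != expected_features:
        raise ValueError(
            f"expected {expected_features} feature columns, "
            f"got {len(meta['feature_columns'])}"
        )
    return meta


def load_artifacts(model_dir):
    """Load classifier, scaler, and metadata from one directory."""
    import joblib  # third-party; imported here so load_metadata works without it

    model_dir = Path(model_dir)
    meta = load_metadata(model_dir)
    model = joblib.load(model_dir / "gbm_model.joblib")
    scaler = joblib.load(model_dir / "scaler.joblib")
    return model, scaler, meta
```

Checking `feature_columns` before inference helps catch a mismatch between the feature extraction code and the scaler, which is the kind of error the PATCH level in the versioning scheme is meant to cover.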

Versioning

Git tags follow vMAJOR[.MINOR[.PATCH]][-PRE]:

MAJOR: feature columns change, different BERT model, or input dimensionality change
MINOR: same architecture, retrained on new/expanded dataset
PATCH: artifact hotfix (e.g. wrong scaler uploaded)
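
A tag in this scheme can be parsed with a short regular expression. The sketch below makes two assumptions the scheme does not state: omitted MINOR and PATCH components default to 0, and the PRE suffix consists of letters, digits, dots, and hyphens. `parse_tag` is a hypothetical helper name.

```python
import re

# vMAJOR[.MINOR[.PATCH]][-PRE]
TAG_RE = re.compile(
    r"^v(?P<major>\d+)"
    r"(?:\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?)?"
    r"(?:-(?P<pre>[0-9A-Za-z.-]+))?$"
)


def parse_tag(tag):
    """Return (major, minor, patch, pre) for a tag, or raise ValueError."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    return (
        int(match["major"]),
        int(match["minor"] or 0),  # assumption: missing MINOR means 0
        int(match["patch"] or 0),  # assumption: missing PATCH means 0
        match["pre"],              # None when there is no pre-release suffix
    )


print(parse_tag("v1.3.1-rc1"))
```

Because the returned tuple starts with three integers, comparing the first three elements of two parsed tags orders releases by MAJOR, then MINOR, then PATCH.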
