email-spam-classifier
A binary email spam classifier that concatenates frozen DistilBERT CLS embeddings (768 dimensions) with 39 structured email header/body features and feeds the combined vector to a LightGBM binary classifier.
This model is used as the spam pre-filter in mailsieve.
Architecture
- Text encoder: distilbert-base-uncased (frozen weights, downloaded separately from the Hugging Face Hub)
- Structured features: 39 (sender, auth headers, HTML structure, temporal)
- Classifier: LightGBM binary (gbm_model.joblib)
- Feature normalizer: StandardScaler (scaler.joblib)
- Total input dimensionality: 807 (768 CLS + 39 structured)
- Output: binary label (0 = ham, 1 = spam) + confidence score
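The flow described above can be sketched roughly as follows. This is an illustration under several assumptions, not the project's actual inference code: the function names `build_input_row` and `predict_spam` are hypothetical, the CLS vector is assumed to come first in the 807-dimensional row, the scaler is assumed to apply only to the 39 structured features, the saved model is assumed to be a scikit-learn style `LGBMClassifier` exposing `predict_proba`, and the 0.5 decision threshold is a placeholder.

```python
from __future__ import annotations

from typing import Sequence

CLS_DIM = 768        # DistilBERT hidden size
N_STRUCTURED = 39    # structured header/body features


def build_input_row(cls_embedding: Sequence[float],
                    structured_scaled: Sequence[float]) -> list[float]:
    """Concatenate the two feature groups into one 807-d row.

    Assumption: CLS embedding first, scaled structured features second.
    """
    if len(cls_embedding) != CLS_DIM:
        raise ValueError(f"expected {CLS_DIM} CLS values, got {len(cls_embedding)}")
    if len(structured_scaled) != N_STRUCTURED:
        raise ValueError(
            f"expected {N_STRUCTURED} structured values, got {len(structured_scaled)}"
        )
    return list(cls_embedding) + list(structured_scaled)


def predict_spam(text: str, structured: Sequence[float],
                 artifact_dir: str = ".") -> tuple[int, float]:
    """Return (label, spam probability) for one email (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the helper above stays
    # usable without them.
    import joblib
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encoder = AutoModel.from_pretrained("distilbert-base-uncased").eval()

    with torch.no_grad():  # the encoder is frozen; no gradients needed
        batch = tokenizer(text, truncation=True, max_length=512,
                          return_tensors="pt")
        # Hidden state of the first token ([CLS]) for the single input.
        cls = encoder(**batch).last_hidden_state[0, 0].tolist()

    scaler = joblib.load(f"{artifact_dir}/scaler.joblib")
    model = joblib.load(f"{artifact_dir}/gbm_model.joblib")

    # Assumption: the scaler was fitted on the 39 structured columns only.
    scaled = scaler.transform(np.asarray([structured], dtype=float))[0].tolist()
    row = np.asarray([build_input_row(cls, scaled)])

    # Assumption: sklearn-style classifier; a raw lightgbm.Booster would
    # use model.predict(row) instead.
    proba = float(model.predict_proba(row)[0, 1])
    return int(proba >= 0.5), proba  # 0.5 threshold is a placeholder
```

The values in `structured` must follow the same column order the scaler and classifier were trained with.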
Structured Features (39)
| Feature |
|---|
| sender_domain_depth |
| has_display_name |
| display_name_has_digits |
| name_domain_mismatch |
| local_part_is_noreply |
| local_part_has_digits |
| local_part_length |
| auth_score |
| is_html_only |
| html_to_text_ratio |
| link_count |
| unique_link_domain_count |
| has_tracking_pixels |
| has_attachments |
| has_executable_attachment |
| body_word_count |
| subject_word_count |
| subject_is_all_caps |
| subject_exclamation_count |
| hour_of_day |
| day_of_week |
| is_weekend |
| is_outside_business_hours |
| local_part_entropy |
| local_part_consonant_run_ratio |
| local_part_vowel_ratio |
| local_part_digit_ratio |
| local_part_is_random |
| display_name_domain_token_overlap |
| display_name_looks_like_brand |
| name_local_part_char_overlap |
| display_name_has_unicode |
| display_name_special_char_ratio |
| domain_name_length_norm |
| domain_has_digits |
| domain_tld_is_risky |
| domain_is_free_provider |
| domain_is_private_relay |
| sender_contains_recipient_fragment |
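This card lists feature names but not their definitions. As an illustration only, the sketch below shows one plausible reading of a few of the `local_part_*` features. Every formula here is an assumption: Shannon entropy over characters, ratios taken over the lowercased local part, and a small hypothetical set of no-reply spellings. The values used in training may have been computed differently.

```python
from __future__ import annotations

import math
from collections import Counter

VOWELS = set("aeiou")
# Hypothetical set of spellings treated as "no-reply" after removing - and _.
NOREPLY_FORMS = {"noreply", "donotreply"}


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def local_part_features(address: str) -> dict[str, float]:
    """Compute a handful of local-part features (assumed definitions)."""
    # Everything before the last "@", lowercased.
    lp = address.rsplit("@", 1)[0].lower()
    n = len(lp)
    letters = [ch for ch in lp if ch.isalpha()]
    digits = sum(ch.isdigit() for ch in lp)
    return {
        "local_part_length": float(n),
        "local_part_has_digits": float(digits > 0),
        "local_part_digit_ratio": digits / n if n else 0.0,
        "local_part_vowel_ratio": (
            sum(ch in VOWELS for ch in letters) / len(letters) if letters else 0.0
        ),
        "local_part_entropy": shannon_entropy(lp),
        "local_part_is_noreply": float(
            lp.replace("-", "").replace("_", "") in NOREPLY_FORMS
        ),
    }
```

Under these definitions, a repetitive local part such as `aaaa` has zero entropy, while a long random-looking string scores high.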
Files
| File | Description |
|---|---|
| gbm_model.joblib | LightGBM binary classifier |
| scaler.joblib | StandardScaler for structured features |
| metadata.json | bert_model_id, feature_columns, version |
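Before building feature rows, a consumer could sanity-check metadata.json. The sketch below assumes the three keys named in the table are top-level JSON keys and that `feature_columns` lists exactly the 39 structured features in training order. The card does not confirm either point, and the helper names `validate_metadata` and `load_feature_columns` are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path

REQUIRED_KEYS = ("bert_model_id", "feature_columns", "version")
N_STRUCTURED = 39  # structured feature count stated in this card


def validate_metadata(meta: dict) -> list[str]:
    """Check the expected keys and return the feature column order."""
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise KeyError(f"metadata is missing keys: {missing}")

    columns = list(meta["feature_columns"])
    # Assumption: feature_columns covers only the structured features.
    if len(columns) != N_STRUCTURED:
        raise ValueError(f"expected {N_STRUCTURED} columns, got {len(columns)}")
    if len(set(columns)) != len(columns):
        raise ValueError("feature_columns contains duplicates")
    return columns


def load_feature_columns(path: str = "metadata.json") -> list[str]:
    """Read metadata.json from disk and validate it."""
    with Path(path).open(encoding="utf-8") as fh:
        return validate_metadata(json.load(fh))
```

Using the returned list to order a feature dictionary keeps the values aligned with what the scaler and classifier expect.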
Versioning
Git tags follow vMAJOR[.MINOR[.PATCH]][-PRE]:
- MAJOR: feature columns change, different BERT model, or input dimensionality change
- MINOR: same architecture, retrained on new/expanded dataset
- PATCH: artifact hotfix (e.g. wrong scaler uploaded)
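A downstream service could parse these tags to decide whether a new release is a drop-in replacement. The sketch below is one possible reading of the scheme: the pre-release suffix is assumed to consist of alphanumerics, dots, and hyphens, omitted MINOR and PATCH are treated as 0, and the names `parse_tag` and `same_input_layout` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# vMAJOR[.MINOR[.PATCH]][-PRE]; the PRE character set is an assumption.
TAG_PATTERN = re.compile(r"v(\d+)(?:\.(\d+)(?:\.(\d+))?)?(?:-([0-9A-Za-z.-]+))?")


class ModelVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    pre: Optional[str]


def parse_tag(tag: str) -> ModelVersion:
    """Parse a git tag; missing MINOR/PATCH default to 0 (assumption)."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    major, minor, patch, pre = match.groups()
    return ModelVersion(int(major), int(minor or 0), int(patch or 0), pre)


def same_input_layout(tag_a: str, tag_b: str) -> bool:
    """True when both tags share MAJOR, i.e. the same features and encoder."""
    return parse_tag(tag_a).major == parse_tag(tag_b).major
```

Because feature columns, the BERT model, and input dimensionality only change on a MAJOR bump, comparing MAJOR alone is enough to tell whether existing feature extraction code still applies.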
Model tree for andrew-fink/email-spam-classifier
Base model
distilbert/distilbert-base-uncased