email-spam-classifier

Binary email spam classifier that combines frozen DistilBERT CLS embeddings (768 dimensions) with 39 structured email header and body features, and feeds the combined 807-dimensional vector into a LightGBM binary classifier.

This model is used as the spam pre-filter in mailsieve.

Architecture

  • Text encoder: distilbert-base-uncased (frozen weights, downloaded separately from the Hugging Face Hub)
  • Structured features: 39 (sender, auth headers, HTML structure, temporal)
  • Classifier: LightGBM binary (gbm_model.joblib)
  • Feature normalizer: StandardScaler (scaler.joblib)
  • Total input dimensionality: 807 (768 CLS + 39 structured)
  • Output: binary label (0 = ham, 1 = spam) + confidence score
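
The list above implies a simple assembly step before classification: standardize the 39 structured features, then join them with the 768-dimensional CLS embedding. The following is a minimal sketch of that step, not the project's actual code. It assumes the CLS embedding comes first in the 807-dimensional vector and that only the structured features are scaled; `assemble_input` and its `mean`/`scale` arguments are hypothetical names standing in for the fitted scaler's `mean_` and `scale_` arrays.

```python
import numpy as np

CLS_DIM = 768        # DistilBERT CLS embedding size, from the model card
N_STRUCTURED = 39    # number of structured features, from the model card


def assemble_input(cls_embedding, structured, mean, scale):
    """Build one 807-d classifier input row (hypothetical helper).

    Assumption: CLS embedding first, then the scaled structured features.
    """
    cls_embedding = np.asarray(cls_embedding, dtype=np.float64)
    structured = np.asarray(structured, dtype=np.float64)
    if cls_embedding.shape != (CLS_DIM,):
        raise ValueError(f"expected {CLS_DIM} CLS values, got {cls_embedding.shape}")
    if structured.shape != (N_STRUCTURED,):
        raise ValueError(f"expected {N_STRUCTURED} features, got {structured.shape}")
    # Same arithmetic as StandardScaler.transform: (x - mean_) / scale_
    scaled = (structured - np.asarray(mean)) / np.asarray(scale)
    return np.concatenate([cls_embedding, scaled])
```

With the real artifacts, `mean` and `scale` would presumably be read from the object stored in scaler.joblib, and the resulting row passed to the model stored in gbm_model.joblib.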

Structured Features (39)

Feature
sender_domain_depth
has_display_name
display_name_has_digits
name_domain_mismatch
local_part_is_noreply
local_part_has_digits
local_part_length
auth_score
is_html_only
html_to_text_ratio
link_count
unique_link_domain_count
has_tracking_pixels
has_attachments
has_executable_attachment
body_word_count
subject_word_count
subject_is_all_caps
subject_exclamation_count
hour_of_day
day_of_week
is_weekend
is_outside_business_hours
local_part_entropy
local_part_consonant_run_ratio
local_part_vowel_ratio
local_part_digit_ratio
local_part_is_random
display_name_domain_token_overlap
display_name_looks_like_brand
name_local_part_char_overlap
display_name_has_unicode
display_name_special_char_ratio
domain_name_length_norm
domain_has_digits
domain_tld_is_risky
domain_is_free_provider
domain_is_private_relay
sender_contains_recipient_fragment
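
The model card lists feature names only, so the exact definitions are not known. As an illustration, a few of the sender-derived features could be computed as sketched below. Every formula here is an assumption inferred from the feature name (for example, Shannon entropy over characters for local_part_entropy, and label count for sender_domain_depth), and `sender_features` is a hypothetical helper.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def local_part_entropy(local):
    """Shannon entropy (bits per character) of the local part. Assumed definition."""
    if not local:
        return 0.0
    n = len(local)
    return -sum((c / n) * math.log2(c / n) for c in Counter(local).values())


def sender_features(address):
    """Compute a handful of the listed features from a sender address."""
    # Split on the last "@" so the domain is everything after it.
    local, _, domain = address.lower().rpartition("@")
    n = len(local)
    digits = sum(ch.isdigit() for ch in local)
    vowels = sum(ch in VOWELS for ch in local)
    return {
        # Assumed: number of dot-separated labels in the domain.
        "sender_domain_depth": domain.count(".") + 1 if domain else 0,
        "local_part_length": n,
        "local_part_has_digits": int(digits > 0),
        "local_part_digit_ratio": digits / n if n else 0.0,
        "local_part_vowel_ratio": vowels / n if n else 0.0,
        "local_part_entropy": local_part_entropy(local),
        "domain_has_digits": int(any(ch.isdigit() for ch in domain)),
    }
```

For example, `sender_features("ab12@mail.example.com")` yields a domain depth of 3, a digit ratio of 0.5, and an entropy of 2.0 bits under these assumed definitions.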

Files

File              Description
gbm_model.joblib  LightGBM binary classifier
scaler.joblib     StandardScaler for structured features
metadata.json     bert_model_id, feature_columns, version
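
Because metadata.json records the feature columns, it can serve as a sanity check before inference. The sketch below assumes that `feature_columns` is a JSON list of the 39 feature names and that the three keys sit at the top level of the file; the card does not confirm the file's layout. `check_metadata` and `load_metadata` are hypothetical helpers.

```python
import json

EXPECTED_FEATURES = 39  # from the model card
REQUIRED_KEYS = ("bert_model_id", "feature_columns", "version")


def check_metadata(meta):
    """Validate a parsed metadata dict and return its feature column list."""
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise KeyError(f"metadata is missing keys: {missing}")
    columns = list(meta["feature_columns"])
    if len(columns) != EXPECTED_FEATURES:
        raise ValueError(
            f"expected {EXPECTED_FEATURES} feature columns, got {len(columns)}"
        )
    if len(set(columns)) != len(columns):
        raise ValueError("feature_columns contains duplicate names")
    return columns


def load_metadata(path="metadata.json"):
    """Read metadata.json from disk and validate it."""
    with open(path, encoding="utf-8") as fh:
        meta = json.load(fh)
    check_metadata(meta)
    return meta
```

Using the returned column list to order the structured features at inference time guards against silently feeding features to the scaler in the wrong order.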

Versioning

Git tags follow vMAJOR[.MINOR[.PATCH]][-PRE]:

  • MAJOR: feature columns change, different BERT model, or input dimensionality change
  • MINOR: same architecture, retrained on new/expanded dataset
  • PATCH: artifact hotfix (e.g. wrong scaler uploaded)
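
The tag scheme above can be parsed with a small regular expression. This is a sketch under assumptions: omitted MINOR and PATCH components are treated as 0, and the pre-release part is limited to letters, digits, and dots. `parse_tag` and `same_input_schema` are hypothetical helpers.

```python
import re

# vMAJOR[.MINOR[.PATCH]][-PRE]
TAG_RE = re.compile(r"^v(\d+)(?:\.(\d+)(?:\.(\d+))?)?(?:-([0-9A-Za-z.]+))?$")


def parse_tag(tag):
    """Return (major, minor, patch, pre) for a tag such as 'v1.2.3-rc1'."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    major, minor, patch, pre = match.groups()
    # Assumption: omitted MINOR/PATCH default to 0.
    return int(major), int(minor or 0), int(patch or 0), pre


def same_input_schema(tag_a, tag_b):
    """Two tags share feature columns and encoder when MAJOR matches."""
    return parse_tag(tag_a)[0] == parse_tag(tag_b)[0]
```

Since MAJOR changes whenever the feature columns, BERT model, or input dimensionality change, comparing MAJOR alone is enough to tell whether a feature pipeline built for one tag can feed a model from another.
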