OpenMed PII SuperClinical Small — ONNX
ONNX export of OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 — format-conversion derivative under Apache 2.0; no weights or labels modified. See the upstream model card for training data, intended use, and limitations.
Why an ONNX export?
The upstream model ships PyTorch weights. ONNX enables:
- Inference outside the Python/PyTorch stack (Rust, Go, C++, .NET, mobile, browser via onnxruntime-web).
- Embedding in self-contained binaries without a transformers / torch runtime dependency.
- A smaller deployment footprint via FP16 (270 MB vs 566 MB FP32) with no observed loss in PII span extraction on the verification battery.
Model details
- Architecture: DebertaV2ForTokenClassification (DeBERTa-v3-small backbone, ~44M parameters)
- Task: Token classification for PII / PHI detection in clinical text
- Format: ONNX (FP32 reference + FP16 production variant)
- Max sequence length: 512
- Tokenizer: SentencePiece (DeBERTa-v3 unigram, vocab 128k); fast-tokenizer tokenizer.json included
- Number of labels: 106 (BIO scheme: 54 entity types, 51 with both B- and I-, 3 with B- only, plus O)
Labels
54 entity types, BIO-tagged:
| Category | Entity types |
|---|---|
| Identity | first_name, last_name, user_name, gender, age, race_ethnicity, religious_belief, sexuality, political_view |
| Contact | email, phone_number, fax_number, street_address, city, county, state, country, postcode, coordinate |
| Government / financial IDs | ssn, tax_id, account_number, bank_routing_number, swift_bic, credit_debit_card, cvv, pin, customer_id, unique_id |
| Healthcare-specific | medical_record_number, health_plan_beneficiary_number, blood_type |
| Credentials / device | password, api_key, http_cookie, mac_address, ipv4, ipv6, url, device_identifier, biometric_identifier |
| Documents / vehicles | certificate_license_number, license_plate, vehicle_identifier, employee_id |
| Demographics | occupation, education_level, employment_status, language, company_name |
| Temporal | date, date_of_birth, date_time, time |
O is the outside-any-entity label. See config.json for the canonical 106-entry id2label map.
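The label arithmetic above (51 × 2 + 3 + 1 = 106) can be checked against config.json. The sketch below assumes only that id2label maps ids to strings of the form O, B-<type>, or I-<type>; the helper name summarize_labels is hypothetical and is not part of this repo.

```python
from collections import defaultdict


def summarize_labels(id2label):
    """Count labels and group entity types by which BIO prefixes they carry."""
    prefixes = defaultdict(set)
    for label in id2label.values():
        if label == "O":
            continue
        prefix, entity_type = label.split("-", 1)
        prefixes[entity_type].add(prefix)
    return {
        "labels": len(id2label),
        "entity_types": len(prefixes),
        "both": sorted(t for t, p in prefixes.items() if p == {"B", "I"}),
        "b_only": sorted(t for t, p in prefixes.items() if p == {"B"}),
    }


# Usage against the shipped config (assumes config.json is in the working directory):
#   import json
#   with open("config.json") as f:
#       print(summarize_labels(json.load(f)["id2label"]))
```

If the counts stated in this card hold, the summary reports 106 labels, 54 entity types, 51 types under "both", and 3 under "b_only".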
Usage
Python (optimum + transformers)
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForTokenClassification
REPO = "sidupadhyay/OpenMed-PII-SuperClinical-Small-44M-v1-ONNX"
tokenizer = AutoTokenizer.from_pretrained(REPO)
# FP16 (production, ~270 MB):
model = ORTModelForTokenClassification.from_pretrained(REPO, file_name="model_fp16.onnx")
# Or FP32 (accuracy reference, ~566 MB):
# model = ORTModelForTokenClassification.from_pretrained(REPO, file_name="model.onnx")
inputs = tokenizer(
"The patient's date of birth is March 15, 1982 and her SSN is 123-45-6789.",
return_tensors="pt",
)
outputs = model(**inputs)
predicted_label_ids = outputs.logits.argmax(-1)[0].tolist()
predicted_labels = [model.config.id2label[i] for i in predicted_label_ids]
Direct onnxruntime + tokenizers (no transformers / torch)
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("model_fp16.onnx", providers=["CPUExecutionProvider"])
enc = tok.encode("Patient John Doe's SSN is 123-45-6789.")
# onnxruntime expects NumPy arrays: int64, with a leading batch dimension.
input_ids = np.array([enc.ids], dtype=np.int64)
attention_mask = np.ones_like(input_ids)
logits, = sess.run(["logits"], {"input_ids": input_ids, "attention_mask": attention_mask})
The fast tokenizer ships with character offsets (offsets) — use them with the per-token argmax to extract PII spans for redaction or annotation.
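One way to combine offsets with the per-token argmax is sketched below. It is a minimal illustration and not the upstream project's decoder. The function name extract_spans is hypothetical, and the code makes three assumptions: special tokens carry zero-width offsets, SentencePiece offsets may include the leading space, and a stray I- tag opens a new span.

```python
def extract_spans(text, labels, offsets):
    """Merge per-token BIO labels into (entity_type, start, end, surface) spans."""
    spans = []
    current = None  # [entity_type, start, end] of the span being built

    def close():
        nonlocal current
        if current is not None:
            entity_type, start, end = current
            # Assumption: offsets may cover the space before a word; trim it.
            while start < end and text[start].isspace():
                start += 1
            spans.append((entity_type, start, end, text[start:end]))
            current = None

    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # zero-width offset: special token such as [CLS] / [SEP]
        if label == "O":
            close()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "I" and current is not None and current[0] == entity_type:
            current[2] = end  # extend the open span
        else:
            close()  # B- tag, or an I- tag with no matching open span
            current = [entity_type, start, end]
    close()
    return spans
```

With the direct onnxruntime example, labels would come from mapping logits[0].argmax(-1) through id2label in config.json, and offsets from enc.offsets. The returned character ranges can then be replaced in the original text for redaction.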
Original model
- Model card: OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1
- Paper: arXiv:2508.01630
A per-record training-data inventory has not been published; consult the paper and contact the upstream authors for training-data due diligence.
Conversion
- Source: OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 (Apache 2.0)
- Date: 2026-04-27
- Tooling: optimum[onnxruntime]==2.1.0, transformers==4.57.6, onnx==1.21.0, onnxruntime==1.25.0, torch==2.11.0, onnxconverter-common==1.16.0. Python 3.11.
FP32 (model.onnx):
from optimum.onnxruntime import ORTModelForTokenClassification
m = ORTModelForTokenClassification.from_pretrained(SRC, export=True)
m.save_pretrained(OUT)
FP16 (model_fp16.onnx):
import onnx
from onnxconverter_common import float16
g = onnx.load("model.onnx")
g_fp16 = float16.convert_float_to_float16(g, keep_io_types=True)
onnx.save(g_fp16, "model_fp16.onnx")
keep_io_types=True preserves int64 input_ids / attention_mask inputs and the FP32 logits output, so tokenizer-side and downstream-consumer code paths are unchanged. No op_block_list entries were required.
A small post-processing pass aligns the "to" attribute of each Cast node with the converter's updated value_info dtypes: onnxconverter_common updates downstream tensor types but doesn't always rewrite the "to" attribute on existing Cast nodes. 40 nodes (including casts inside an If subgraph used by DeBERTa's relative-position bucket logic) needed patching to match their declared output dtype; otherwise onnxruntime rejects the graph at session creation. The fix is a graph walk over model.graph.node (and recursively over If / Loop subgraphs in attribute.g) that, for each Cast node, sets the "to" attribute to the dtype declared on the node's output value_info. This is a known sharp edge with onnxconverter_common on transformer graphs containing explicit Cast nodes.
The exact conversion + verification script ships in this repo as verify.py.
License: Apache 2.0, inherited from the upstream model. Format conversion is permitted; no weights are altered.
Verification
verify.py reproduces the conversion-verification battery: 20 synthetic PII-containing sentences (no real personal data) compared against the upstream PyTorch model.
| Metric | FP32 threshold | FP32 result | FP16 threshold | FP16 result |
|---|---|---|---|---|
| Max absolute logit drift (per token) | < 1e-4 | 2.10e-05 | informational | 5.86e-03 |
| Per-token argmax disagreement | 0% | 0 / 299 tokens | < 0.5% | 0 / 299 tokens |
| BIO span agreement (per sample) | 100% | 20 / 20 samples | 100% | 20 / 20 samples |
The FP32 export is lossless within fp32 numerical noise. The FP16 variant matches FP32 argmax decisions on every token in the battery, so all extracted PII spans are identical; its maximum logit drift (5.86e-03) is of the order expected for half precision and does not cross any decision boundary on this battery.
python verify.py # default: any model.onnx / model_fp16.onnx siblings
python verify.py --model model.onnx --model model_fp16.onnx # explicit
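The first two metrics in the table can be computed from two logit arrays in a few lines. The sketch below illustrates the definitions and is not necessarily how verify.py implements them. The function name compare_logits is hypothetical, and it assumes logits of shape (batch, tokens, labels) with a matching 0/1 attention mask so padding is ignored.

```python
import numpy as np


def compare_logits(reference, candidate, attention_mask):
    """Return (max absolute drift, disagreeing tokens, total tokens) over real tokens."""
    mask = np.asarray(attention_mask).astype(bool)
    # Largest absolute logit difference on any non-padding token.
    drift = float(np.abs(reference - candidate)[mask].max())
    # Tokens whose predicted label id differs between the two models.
    differs = reference.argmax(-1) != candidate.argmax(-1)
    return drift, int(differs[mask].sum()), int(mask.sum())
```

Span agreement would follow by decoding both argmax sequences into BIO spans and comparing them per sample.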
INT8 quantization (not shipped)
Dynamic INT8 quantization (onnxruntime.quantization.quantize_dynamic, QuantType.QInt8) was attempted to produce a smaller deployment artifact. Results on the same 20-sample battery:
| Metric | INT8 (all ops) | INT8 (MatMul-only) | INT8 acceptance threshold |
|---|---|---|---|
| Max absolute logit drift | 5.48e+00 | 4.24e+00 | (informational) |
| Per-token argmax disagreement | 10.37% | 10.14% | < 5% |
| Span agreement | 9 / 20 | 4 / 20 | 20 / 20 |
Both dynamic INT8 variants exceeded the argmax-disagreement budget by about 2× (10.37% and 10.14% against a 5% threshold) and broke span-level predictions on 11 and 16 of the 20 samples, respectively. DeBERTa-v3's heavy-tailed activation distributions are known to defeat dynamic per-tensor scales. The quantized artifact is not shipped. The FP16 variant achieves a comparable size reduction (270 MB) without the accuracy loss.
If a smaller artifact than FP16 is required, viable next steps are (a) static (calibration-based) INT8 with a representative PII corpus, or (b) QuantType.QUInt8 with QDQ pre-processing.
File inventory
| File | Size | Purpose |
|---|---|---|
| model.onnx | 566 MB | FP32 ONNX graph + weights (accuracy reference) |
| model.onnx.sha256 | 77 B | Integrity hash for model.onnx |
| model_fp16.onnx | 270 MB | FP16 ONNX graph (production variant; FP32 I/O preserved) |
| model_fp16.onnx.sha256 | 81 B | Integrity hash for model_fp16.onnx |
| config.json | 6.3 KB | HF model config (architecture, id2label, label2id) |
| tokenizer.json | 8.6 MB | Fast tokenizer (Rust-compatible, with offsets) |
| tokenizer_config.json | 1.4 KB | Tokenizer settings |
| special_tokens_map.json | 970 B | Special-token map ([CLS], [SEP], [PAD], [MASK], [UNK]) |
| spm.model | 2.4 MB | SentencePiece model (slow tokenizer / interop) |
| added_tokens.json | 23 B | Added-token map |
| LICENSE | 11 KB | Apache License 2.0 |
| verify.py | 9 KB | Reproducible conversion-verification script |
After download, sha256sum -c model.onnx.sha256 and sha256sum -c model_fp16.onnx.sha256 should both pass.
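Where sha256sum is unavailable (for example on a stock Windows install), the same check can be approximated with the Python standard library. The sketch assumes the .sha256 sidecars use the usual sha256sum layout of a hex digest followed by the filename; the helper names sha256_of and check_sidecar are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sidecar(sidecar_path):
    """Return True if the file named in the sidecar matches the recorded digest."""
    sidecar = Path(sidecar_path)
    # Assumed layout: "<hex digest>  <filename>" ("*" marks binary mode in some tools).
    expected, name = sidecar.read_text().split(maxsplit=1)
    target = sidecar.parent / name.strip().lstrip("*")
    return sha256_of(target) == expected.lower()


# Usage (assumes the files sit in the current directory):
#   print(check_sidecar("model.onnx.sha256"), check_sidecar("model_fp16.onnx.sha256"))
```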
Citation
Please cite the upstream work — see the model card and arXiv:2508.01630 for the canonical citation.