OpenMed PII SuperClinical Small — ONNX

ONNX export of OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 — format-conversion derivative under Apache 2.0; no weights or labels modified. See the upstream model card for training data, intended use, and limitations.

Why an ONNX export?

The upstream model ships PyTorch weights. ONNX enables:

Inference outside the Python/PyTorch stack (Rust, Go, C++, .NET, mobile, browser via onnxruntime-web).
Embedding in self-contained binaries without a transformers/torch runtime dependency.
A smaller deployment footprint via FP16 (270 MB vs 566 MB FP32) with no observed loss in PII span extraction on the verification battery.

Model details

Architecture: DebertaV2ForTokenClassification (DeBERTa-v3-small backbone, ~44M parameters)
Task: Token classification for PII / PHI detection in clinical text
Format: ONNX (FP32 reference + FP16 production variant)
Max sequence length: 512
Tokenizer: SentencePiece (DeBERTa-v3 unigram, vocab 128k); fast-tokenizer tokenizer.json included
Number of labels: 106 (BIO scheme — 54 entity types: 51 with both B- and I-, 3 with B- only, plus O)

Labels

54 entity types, BIO-tagged:

Category	Entity types
Identity	`first_name`, `last_name`, `user_name`, `gender`, `age`, `race_ethnicity`, `religious_belief`, `sexuality`, `political_view`
Contact	`email`, `phone_number`, `fax_number`, `street_address`, `city`, `county`, `state`, `country`, `postcode`, `coordinate`
Government / financial IDs	`ssn`, `tax_id`, `account_number`, `bank_routing_number`, `swift_bic`, `credit_debit_card`, `cvv`, `pin`, `customer_id`, `unique_id`
Healthcare-specific	`medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
Credentials / device	`password`, `api_key`, `http_cookie`, `mac_address`, `ipv4`, `ipv6`, `url`, `device_identifier`, `biometric_identifier`
Documents / vehicles	`certificate_license_number`, `license_plate`, `vehicle_identifier`, `employee_id`
Demographics	`occupation`, `education_level`, `employment_status`, `language`, `company_name`
Temporal	`date`, `date_of_birth`, `date_time`, `time`

O is the outside-any-entity label. See config.json for the canonical 106-entry id2label map.
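The label arithmetic can be checked mechanically against an id2label mapping. The sketch below assumes labels are spelled `B-<type>` / `I-<type>` / `O`; the helper name `summarize_bio` and the four-entry demo mapping are illustrative only. In practice the mapping would be read from the `id2label` field of config.json.

```python
from collections import defaultdict

def summarize_bio(id2label):
    """Group BIO labels by entity type and report which prefixes each type has."""
    prefixes = defaultdict(set)
    for label in id2label.values():
        if label == "O":
            continue
        # Split on the first hyphen only: "B-first_name" -> ("B", "-", "first_name")
        prefix, _, entity = label.partition("-")
        prefixes[entity].add(prefix)
    both = sorted(e for e, p in prefixes.items() if p == {"B", "I"})
    b_only = sorted(e for e, p in prefixes.items() if p == {"B"})
    return both, b_only

# Toy mapping for illustration only; the real map has 106 entries.
demo = {0: "O", 1: "B-ssn", 2: "I-ssn", 3: "B-gender"}
both, b_only = summarize_bio(demo)
print("B- and I-:", both, "| B- only:", b_only)

# Totals stated in this card: 51 types with both tags, 3 with B- only, plus O.
assert 51 * 2 + 3 * 1 + 1 == 106
```

Run against the real mapping, `len(both)` and `len(b_only)` should come out as 51 and 3 if the counts above hold.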

Usage

Python (`optimum` + `transformers`)

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForTokenClassification

REPO = "sidupadhyay/OpenMed-PII-SuperClinical-Small-44M-v1-ONNX"

tokenizer = AutoTokenizer.from_pretrained(REPO)

# FP16 (production, ~270 MB):
model = ORTModelForTokenClassification.from_pretrained(REPO, file_name="model_fp16.onnx")
# Or FP32 (accuracy reference, ~566 MB):
# model = ORTModelForTokenClassification.from_pretrained(REPO, file_name="model.onnx")

inputs = tokenizer(
    "The patient's date of birth is March 15, 1982 and her SSN is 123-45-6789.",
    return_tensors="pt",
)
outputs = model(**inputs)
predicted_label_ids = outputs.logits.argmax(-1)[0].tolist()
predicted_labels = [model.config.id2label[i] for i in predicted_label_ids]

Direct `onnxruntime` + `tokenizers` (no `transformers` / `torch`)

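import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("model_fp16.onnx", providers=["CPUExecutionProvider"])

enc = tok.encode("Patient John Doe's SSN is 123-45-6789.")
# onnxruntime expects NumPy arrays rather than Python lists; the graph takes int64 inputs.
input_ids = np.array([enc.ids], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# logits has shape (1, seq_len, 106) and is FP32 for both variants.
logits, = sess.run(["logits"], {"input_ids": input_ids, "attention_mask": attention_mask})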

The fast tokenizer returns character offsets for every token (`enc.offsets`). Combine them with the per-token argmax over the logits to extract PII spans for redaction or annotation.
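One way to turn per-token labels and character offsets into spans is a single pass that opens a span on `B-`, extends it on a matching `I-`, and closes it otherwise. This is a minimal sketch, not code from this repo: the helper name `bio_spans`, the hand-written labels, and the offsets in the demo are all illustrative, and real inputs would come from `enc.offsets` and the argmax over `logits[0]` mapped through `id2label`.

```python
def bio_spans(text, labels, offsets):
    """Merge per-token BIO labels into (entity_type, start, end, surface) tuples."""
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            # Special tokens such as [CLS] / [SEP] carry empty offsets: skip them.
            continue
        prefix, _, entity = label.partition("-")
        if prefix == "I" and current and current[0] == entity:
            current[2] = end  # extend the open span
            continue
        if current:
            spans.append(current)
            current = None
        if prefix in ("B", "I"):
            # A stray I- with no matching open span is treated as a span start.
            current = [entity, start, end]
    if current:
        spans.append(current)
    return [(e, s, t, text[s:t]) for e, s, t in spans]

# Hand-written tokenization for illustration; real offsets come from the tokenizer.
text = "SSN is 123-45-6789."
offsets = [(0, 0), (0, 3), (4, 6), (7, 10), (10, 13), (13, 18), (18, 19), (0, 0)]
labels = ["O", "O", "O", "B-ssn", "I-ssn", "I-ssn", "O", "O"]
spans = bio_spans(text, labels, offsets)
print(spans)
```

Redaction then amounts to replacing `text[start:end]` for each span, working from the last span to the first so earlier offsets stay valid.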

Original model

Model card: OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1
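Paper: OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets (arXiv:2508.01630)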

A per-record training-data inventory has not been published; consult the paper and contact the upstream authors for training-data due diligence.

Conversion

Source: OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 (Apache 2.0)
Date: 2026-04-27
Tooling: optimum[onnxruntime]==2.1.0, transformers==4.57.6, onnx==1.21.0, onnxruntime==1.25.0, torch==2.11.0, onnxconverter-common==1.16.0. Python 3.11.

FP32 (model.onnx):

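from optimum.onnxruntime import ORTModelForTokenClassification

SRC = "OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1"  # upstream PyTorch checkpoint
OUT = "."  # placeholder output directory; model.onnx is written here
m = ORTModelForTokenClassification.from_pretrained(SRC, export=True)
m.save_pretrained(OUT)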

FP16 (model_fp16.onnx):

import onnx
from onnxconverter_common import float16
g = onnx.load("model.onnx")
g_fp16 = float16.convert_float_to_float16(g, keep_io_types=True)
onnx.save(g_fp16, "model_fp16.onnx")

keep_io_types=True preserves int64 input_ids / attention_mask inputs and the FP32 logits output, so tokenizer-side and downstream-consumer code paths are unchanged. No op_block_list entries were required.

A small post-processing pass aligns each Cast node's `to` attribute with the converter's updated value_info dtypes: onnxconverter_common updates downstream tensor types but does not always rewrite the `to` attribute on existing Cast nodes. In this export, 40 nodes (including casts inside an If subgraph used by DeBERTa's relative-position bucket logic) needed patching to match their declared output dtype; otherwise onnxruntime rejects the graph at session creation. The fix is a graph walk over model.graph.node (recursing into If / Loop subgraphs via attribute.g) that, for each Cast node, sets `to` to the dtype declared on the node's output value_info. This is a known sharp edge with onnxconverter_common on transformer graphs containing explicit Cast nodes.
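The graph walk described above can be sketched as follows. This is an illustration written from the description, not the code in verify.py: the function name `patch_casts` is hypothetical, and the two integer constants are the `onnx.AttributeProto.GRAPH` and `GRAPHS` enum values, written out so the function only relies on the fields it reads.

```python
GRAPH, GRAPHS = 5, 10  # onnx.AttributeProto.GRAPH / onnx.AttributeProto.GRAPHS

def patch_casts(graph):
    """Set each Cast node's `to` to its declared output dtype; return patch count."""
    # Declared element types for intermediate tensors and graph outputs.
    declared = {
        vi.name: vi.type.tensor_type.elem_type
        for vi in list(graph.value_info) + list(graph.output)
    }
    patched = 0
    for node in graph.node:
        if node.op_type == "Cast":
            want = declared.get(node.output[0])  # None or 0 means "not declared"
            for attr in node.attribute:
                if attr.name == "to" and want and attr.i != want:
                    attr.i = want
                    patched += 1
        # Recurse into control-flow bodies (If branches, Loop bodies).
        for attr in node.attribute:
            if attr.type == GRAPH:
                patched += patch_casts(attr.g)
            elif attr.type == GRAPHS:
                patched += sum(patch_casts(g) for g in attr.graphs)
    return patched
```

With the onnx package, this would be applied as `model = onnx.load("model_fp16.onnx")`, then `patch_casts(model.graph)`, then `onnx.save(model, "model_fp16.onnx")`. Cast nodes whose output has no declared type are left untouched.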

The exact conversion + verification script ships in this repo as verify.py.

License: Apache 2.0, inherited from the upstream model. Format conversion is permitted; no weights are altered.

Verification

verify.py reproduces the conversion-verification battery: 20 synthetic PII-containing sentences (no real personal data) compared against the upstream PyTorch model.

Metric	FP32 threshold	FP32 result	FP16 threshold	FP16 result
Max absolute logit drift (per token)	< 1e-4	2.10e-05	informational	5.86e-03
Per-token argmax disagreement	0%	0 / 299 tokens	< 0.5%	0 / 299 tokens
BIO span agreement (per sample)	100%	20 / 20 samples	100%	20 / 20 samples

The FP32 export is lossless within fp32 numerical noise. The FP16 variant matches FP32 argmax decisions on every token in the battery, so all extracted PII spans are identical; logit drift is the expected ~1e-3 magnitude for half-precision and does not cross any decision boundary on this battery.
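The first two rows of the table reduce to simple array operations on the logits of a reference model and a candidate model. The sketch below is an assumption about how such metrics can be computed, not an excerpt from verify.py; `compare_logits` is a hypothetical helper and the arrays are made-up values for two tokens and three labels.

```python
import numpy as np

def compare_logits(ref, test):
    """Return (max abs drift, tokens whose argmax differs, token count)."""
    ref = np.asarray(ref, dtype=np.float32)
    test = np.asarray(test, dtype=np.float32)
    max_drift = float(np.abs(ref - test).max())
    # A token "disagrees" when the highest-scoring label changes.
    disagree = int((ref.argmax(-1) != test.argmax(-1)).sum())
    return max_drift, disagree, ref.shape[0]

# Made-up logits of shape (tokens, labels); the perturbation mimics FP16-sized drift.
ref = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
test = ref + np.array([[0.004, -0.002, 0.001], [-0.003, 0.005, 0.0]])
max_drift, disagree, n_tokens = compare_logits(ref, test)
print(f"max drift {max_drift:.2e}, argmax disagreement {disagree} / {n_tokens}")
```

Drift of this size leaves the argmax unchanged whenever the gap between the top two logits is larger than twice the drift, which is why the FP16 row can show nonzero drift and zero disagreement.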

python verify.py                                       # default: any model.onnx / model_fp16.onnx siblings
python verify.py --model model.onnx --model model_fp16.onnx   # explicit

INT8 quantization (not shipped)

Dynamic INT8 quantization (onnxruntime.quantization.quantize_dynamic, QuantType.QInt8) was attempted to produce a smaller deployment artifact. Results on the same 20-sample battery:

Metric	INT8 (all ops)	INT8 (MatMul-only)	INT8 acceptance threshold
Max absolute logit drift	5.48e+00	4.24e+00	(informational)
Per-token argmax disagreement	10.37%	10.14%	< 5%
Span agreement	9 / 20	4 / 20	20 / 20

Both dynamic INT8 variants exceeded the 5% argmax-disagreement budget by roughly 2× and broke span-level predictions on more than half the samples (11 of 20 for all-ops, 16 of 20 for MatMul-only). DeBERTa-v3's heavy-tailed activation distributions are known to defeat dynamic per-tensor scales. The quantized artifact is not shipped. The FP16 variant achieves a comparable size reduction (270 MB) without the accuracy loss.
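The effect of a heavy tail on a single per-tensor scale can be seen in a toy example. The sketch below uses a simplified symmetric INT8 round trip, which is an assumption for illustration and not onnxruntime's exact dynamic-quantization kernel; the data is synthetic, and `quantize_roundtrip` is a hypothetical helper.

```python
import numpy as np

def quantize_roundtrip(x):
    """Symmetric per-tensor INT8: one scale for the whole tensor, then dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
body = rng.normal(0.0, 1.0, size=1024).astype(np.float32)  # well-behaved values
tailed = body.copy()
tailed[0] = 200.0  # a single extreme outlier stretches the scale for everything

# Compare reconstruction error on the ordinary values only.
err_body = float(np.abs(quantize_roundtrip(body) - body).mean())
err_tailed = float(np.abs(quantize_roundtrip(tailed)[1:] - tailed[1:]).mean())
print(f"mean abs error without outlier: {err_body:.4f}")
print(f"mean abs error with outlier:    {err_tailed:.4f}")
```

With the outlier present, the step size grows to cover it and most ordinary values collapse onto a handful of integer levels, which is the failure mode that calibration-based or finer-grained schemes try to avoid.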

If a smaller artifact than FP16 is required, viable next steps are (a) static (calibration-based) INT8 with a representative PII corpus, or (b) QuantType.QUInt8 with QDQ pre-processing.

File inventory

File	Size	Purpose
`model.onnx`	566 MB	FP32 ONNX graph + weights (accuracy reference)
`model.onnx.sha256`	77 B	Integrity hash for `model.onnx`
`model_fp16.onnx`	270 MB	FP16 ONNX graph (production variant; FP32 I/O preserved)
`model_fp16.onnx.sha256`	81 B	Integrity hash for `model_fp16.onnx`
`config.json`	6.3 KB	HF model config (architecture, `id2label`, `label2id`)
`tokenizer.json`	8.6 MB	Fast tokenizer (Rust-compatible, with offsets)
`tokenizer_config.json`	1.4 KB	Tokenizer settings
`special_tokens_map.json`	970 B	Special-token map (`[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`, `[UNK]`)
`spm.model`	2.4 MB	SentencePiece model (slow tokenizer / interop)
`added_tokens.json`	23 B	Added-token map
`LICENSE`	11 KB	Apache License 2.0
`verify.py`	9 KB	Reproducible conversion-verification script

After download, sha256sum -c model.onnx.sha256 and sha256sum -c model_fp16.onnx.sha256 should both pass.
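On platforms without `sha256sum`, the same check can be done with Python's standard library. This is a minimal sketch that assumes the sidecar files use the usual `sha256sum` layout (hex digest, whitespace, file name); the helper names `sha256_of` and `check` are illustrative.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large models are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check(sidecar):
    """Verify a sha256sum-style sidecar file: '<hex digest>  <filename>'."""
    sidecar = Path(sidecar)
    expected, name = sidecar.read_text().split(maxsplit=1)
    # sha256sum marks binary mode with a leading '*' on the file name.
    target = sidecar.parent / name.strip().lstrip("*")
    return sha256_of(target) == expected.lower()
```

Calling `check("model.onnx.sha256")` and `check("model_fp16.onnx.sha256")` from the download directory should both return `True`.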

Citation

Please cite the upstream work — see the model card and arXiv:2508.01630 for the canonical citation.

