Fix smart quote handling in tokenizer normalizer

#3 opened by RickRossTN (ONNX Community org)

The Problem:

The current tokenizer normalizer strips Unicode smart quotes instead of converting them to their ASCII equivalents. This causes contractions like "I’m" to become "Im", which sounds wrong when spoken.

Many LLMs (Qwen, GPT, etc.) generate text with Unicode punctuation:

  • \u2018 (‘) and \u2019 (’) instead of ASCII '
  • \u201C (“) and \u201D (”) instead of ASCII "

The current normalizer's final regex [^ -"$-.0-;?A-Za-z£́] strips these characters entirely rather than converting them.

Current behavior:

Input: "I'm right here" (with \u2019)
Output: "Im right here" (apostrophe deleted)

Expected behavior:

Input: "I'm right here" (with \u2019)
Output: "I'm right here" (apostrophe converted to ASCII)

Proposed Fix:

Add two Replace rules before the final strip rule in tokenizer.json:

{
  "type": "Replace",
  "pattern": { "Regex": "[\u2018\u2019]" },
  "content": "'"
},
{
  "type": "Replace",
  "pattern": { "Regex": "[\u201C\u201D]" },
  "content": "\""
}

Full normalizer section:

"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "NFKD"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "\\s+" },
      "content": " "
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u2013\u2014]" },
      "content": "-"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u2018\u2019]" },
      "content": "'"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u201C\u201D]" },
      "content": "\""
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[^ -\"$-.0-;?A-Za-z£́]" },
      "content": ""
    }
  ]
}

This follows the same pattern the normalizer already uses for en/em dashes ([\u2013\u2014] → -).
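
If editing tokenizer.json by hand is inconvenient, the same change can be made programmatically and re-saved. This is only a sketch, assuming the file loads with the tokenizers library; prepending the two rules is equivalent to inserting them just before the final strip rule here, because NFKD leaves these quote code points unchanged, so they are already ASCII by the time the strip rule runs:

from tokenizers import Tokenizer, Regex
from tokenizers.normalizers import Sequence, Replace

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Map smart quotes to ASCII before the existing pipeline runs,
# so the final strip rule no longer sees (and deletes) them.
tok.normalizer = Sequence([
    Replace(Regex("[\u2018\u2019]"), "'"),
    Replace(Regex("[\u201C\u201D]"), '"'),
    tok.normalizer,
])

print(tok.normalizer.normalize_str("I\u2019m right here"))  # -> "I'm right here"
tok.save("tokenizer.json")

Either way, the saved file ends up with the same Replace rules shown above.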
