Fix smart quote handling in tokenizer normalizer

#3 opened by RickRossTN (ONNX Community org)

The Problem:

The current tokenizer normalizer strips Unicode smart quotes instead of converting them to their ASCII equivalents. This causes contractions like "I’m" to become "Im", which sounds wrong when spoken.

Many LLMs (Qwen, GPT, etc.) generate text with Unicode punctuation:

  • \u2018 (‘) and \u2019 (’) instead of ASCII '
  • \u201C (“) and \u201D (”) instead of ASCII "

The current normalizer's final regex [^ -"$-.0-;?A-Za-z£́] strips these characters entirely rather than converting them.

Current behavior:

Input: "I'm right here" (with \u2019)
Output: "Im right here" (apostrophe deleted)

Expected behavior:

Input: "I'm right here" (with \u2019)
Output: "I'm right here" (apostrophe converted to ASCII)

Proposed Fix:

Add two Replace rules before the final strip rule in tokenizer.json:

{
  "type": "Replace",
  "pattern": { "Regex": "[\u2018\u2019]" },
  "content": "'"
},
{
  "type": "Replace",
  "pattern": { "Regex": "[\u201C\u201D]" },
  "content": "\""
}

Full normalizer section:

"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "NFKD"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "\\s+" },
      "content": " "
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u2013\u2014]" },
      "content": "-"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u2018\u2019]" },
      "content": "'"
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[\u201C\u201D]" },
      "content": "\""
    },
    {
      "type": "Replace",
      "pattern": { "Regex": "[^ -\"$-.0-;?A-Za-z£́]" },
      "content": ""
    }
  ]
}

This follows the same pattern the normalizer already uses for en/em dashes ([\u2013\u2014] → -).
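
If editing tokenizer.json by hand is inconvenient, the same change can be made programmatically and re-saved. This is only a sketch, assuming the file loads with the tokenizers library; prepending the two rules is equivalent to inserting them just before the final strip rule here, because NFKD leaves these quote code points unchanged, so they are already ASCII by the time the strip rule runs:

from tokenizers import Tokenizer, Regex
from tokenizers.normalizers import Sequence, Replace

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Map smart quotes to ASCII before the existing pipeline runs,
# so the final strip rule no longer sees (and deletes) them.
tok.normalizer = Sequence([
    Replace(Regex("[\u2018\u2019]"), "'"),
    Replace(Regex("[\u201C\u201D]"), '"'),
    tok.normalizer,
])

print(tok.normalizer.normalize_str("I\u2019m right here"))  # -> "I'm right here"
tok.save("tokenizer.json")

Either way, the saved file ends up with the same Replace rules shown above.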
