Fails to load with transformers v4.57+ #14
by qgallouedec - opened
I tried with 4.57.3 and with v5; in both cases I get:
```
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/qgallouedec/transformers/src/transformers/models/auto/tokenization_auto.py", line 1131, in from_pretrained
    return _try_load_tokenizer_with_fallbacks(tokenizer_class, pretrained_model_name_or_path, inputs, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/qgallouedec/transformers/src/transformers/models/auto/tokenization_auto.py", line 821, in _try_load_tokenizer_with_fallbacks
    raise ValueError(
ValueError: Could not load tokenizer from bigscience/bloomz-560m. No tokenizer class could be determined and no SentencePiece model found.
```
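A possible workaround, untested here (and the class may have moved or been renamed in v5), is to bypass AutoTokenizer's class detection, which is what fails above, and name the Bloom tokenizer class explicitly:

```python
from transformers import BloomTokenizerFast

# Skip AutoTokenizer's auto-detection and load the
# model-specific fast tokenizer class directly.
tok = BloomTokenizerFast.from_pretrained("bigscience/bloomz-560m")
print(tok("Hello world")["input_ids"])
```

If the regression sits deeper in the loading path rather than in the class detection, this will fail the same way.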
qgallouedec changed discussion title from "Fails to load with transformers v5" to "Fails to load with transformers v4.57+"
The tokenizer itself seems fine:

```
>>> from tokenizers import Tokenizer
>>> Tokenizer.from_pretrained("bigscience/bloomz-560m")
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"<unk>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":1, "content":"<s>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":2, "content":"</s>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":3, "content":"<pad>", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=None, pre_tokenizer=Sequence(pretokenizers=[Split(pattern=Regex(" ?[^(\s|[.,!?…。,、।۔،])]+"), behavior=Isolated, invert=False), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=False)]), post_processor=ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=False), decoder=ByteLevel(add_prefix_space=True, trim_offsets=True, use_regex=False), model=BPE(dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={"<unk>":0, "<s>":1, "</s>":2, "<pad>":3, "!":4, """:5, "#":6, "$":7, "%":8, "&":9, "'":10, "(":11, ")":12, "*":13, "+":14, ",":15, "-":16, ".":17, "/":18, "0":19, "1":20, "2":21, "3":22, "4":23, "5":24, "6":25, "7":26, "8":27, "9":28, ":":29, ";":30, "<":31, "=":32, ">":33, "?":34, "@":35, "A":36, "B":37, "C":38, "D":39, "E":40, "F":41, "G":42, "H":43, "I":44, "J":45, "K":46, "L":47, "M":48, "N":49, "O":50, "P":51, "Q":52, "R":53, "S":54, "T":55, "U":56, "V":57, "W":58, "X":59, "Y":60, "Z":61, "[":62, "\":63, "]":64, "^":65, "_":66, "`":67, "a":68, "b":69, "c":70, "d":71, "e":72, "f":73, "g":74, "h":75, "i":76, "j":77, "k":78, "l":79, "m":80, "n":81, "o":82, "p":83, "q":84, "r":85, "s":86, "t":87, "u":88, "v":89, "w":90, "x":91, "y":92, "z":93, "{":94, "|":95, "}":96, "~":97, "¡":98, ...}, merges=[("à", "¤"), ("à", "¦"), ("Ġ", "à"), ("à", "¥"), ("Ġ", "Ġ"), ("à", "®"), ("Ġ", "d"), ("Ġ", "à¤"), ("à", "²"), ("à", "°"), ("a", "n"), ("e", "n"), ("à", "´"), ("e", "r"), ("Ø", "§"), ("Ġ", "t"), ("e", "s"), ...]))
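```

Since the underlying tokenizer loads correctly, one way to keep a transformers-style API in the meantime is to wrap the raw object in PreTrainedTokenizerFast. A minimal sketch, not verified against v4.57+; the special tokens below are read off the added_tokens list in the repr above:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the raw tokenizer straight from the Hub (shown above to work).
raw = Tokenizer.from_pretrained("bigscience/bloomz-560m")

# Wrap it to get the usual transformers tokenizer API back.
# Special tokens mirror the added_tokens entries in the repr above.
tok = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
)

print(tok("Hello world")["input_ids"])
```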
Try using the tokenizers library directly, without the transformers wrapper, or downgrade to a version below 4.57.
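For example, encoding and decoding with tokenizers directly (a minimal sketch):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bigscience/bloomz-560m")

# encode() returns an Encoding object; the token ids live in .ids.
enc = tok.encode("Hello, world!")
print(enc.ids)
print(tok.decode(enc.ids))
```

The downgrade route would be pinning transformers below 4.57, e.g. `pip install "transformers<4.57"`.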
@stefmat I think Quentin's goal is to fix or document the issue in transformers, rather than to find a way to use the tokenizer.
Sorry, the documentation for using bloomz-560m-tokenizer.json only covers the upcoming transformers v5.