Unable to get accurate infilling

by narphorium - opened Oct 8, 2022

Discussion

narphorium

Oct 8, 2022

edited Oct 8, 2022

According to the model card, the way to do infilling is to pass in the input as:

<SUF> {some text following cursor} <PRE> {some prelude text here} <MID>

In the example code, the special token IDs are specified as:

<SUF> = 50253
<PRE> = 50254
<MID> = 50255
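For reference, here is a minimal sketch of how that layout turns into a list of token IDs. `build_fim_input` is a hypothetical helper, not something from the model card, and the IDs in the example are toy values chosen only to make the ordering visible.

```python
def build_fim_input(prefix_ids, suffix_ids, suf_id, pre_id, mid_id):
    """Arrange token IDs as <SUF> suffix <PRE> prefix <MID>."""
    return [suf_id, *suffix_ids, pre_id, *prefix_ids, mid_id]


# Toy IDs for illustration only; real ones come from the tokenizer.
example = build_fim_input(
    prefix_ids=[10, 11], suffix_ids=[20], suf_id=1, pre_id=2, mid_id=3
)
print(example)  # [1, 20, 2, 10, 11, 3]
```

The sentinel IDs are passed in as arguments rather than hard-coded, so the same helper works whichever IDs the tokenizer actually assigns.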

However, when I generate completions using those tokens I haven't been able to get any accurate results. For example:

prefix = "def top_k(values):\n"
suffix = "  return results"

... infills as:

def top_k(values):
return results.count(values  return results

This looks like the suffix is being ignored and the model is just completing after the prefix.

When I decode the special tokens back to text I get:

50253 = ' Outcomes'
50254 = 24 spaces
50255 = 23 spaces

So I'm wondering if those are really the correct tokens to separate the FIM inputs?
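A quick way to run this kind of check is to ask the tokenizer which ID it assigns to each sentinel string and compare that with the documented value. The sketch below assumes a Hugging Face tokenizer (it only relies on `convert_tokens_to_ids`); `find_bad_sentinels` is a hypothetical helper, and the sentinel strings and IDs you pass in are whatever the model card lists.

```python
def find_bad_sentinels(tok, expected):
    """Return {token: (expected_id, actual_id)} for each sentinel whose ID differs.

    `tok` is any object with a `convert_tokens_to_ids(str)` method, such as a
    Hugging Face tokenizer. `expected` maps sentinel strings to documented IDs.
    """
    bad = {}
    for token, expected_id in expected.items():
        actual_id = tok.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            bad[token] = (expected_id, actual_id)
    return bad
```

An empty result means the documented IDs match the tokenizer. Anything else points at a mismatch between the model card and the uploaded tokenizer files.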

fjenett

Oct 8, 2022

hails

CarperAI org Oct 8, 2022

Thanks for bringing this to our attention! We're looking into this and will get back to you ASAP.

LouisCastricato

CarperAI org Oct 8, 2022

Thank you for raising this concern. It looks like an issue with the tokenizer. Unfortunately, all of our engineers are OOO for the long weekend, but we should have a patch out Tuesday or Wednesday. Thanks.

hails

CarperAI org Oct 11, 2022

There was an issue where the sentinel <|SUF|>, <|PRE|>, and <|MID|> tokens had incorrect IDs in the uploaded tokenizer and model card! Please try clearing the Hugging Face cache and redownloading the model :))

This is what I get when trying open-ended generation on a simple function:

def score(x,y) -> int:
    """

and also infilling with

def score(x,y) -> int:
    """
    <|MID|> (infill here)
    """

    score = x + y
    return score

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/

# infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n    score = x + y\n    return score'

# Layout: <|SUF|> (50277), suffix, <|PRE|> (50278), prefix, <|MID|> (50279)
model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])

print(output)

'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'
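To turn a decoded string like this back into a complete function, one option is to take everything after <|MID|> up to <|endoftext|> and splice it between the prefix and the suffix. A minimal sketch, assuming the sentinel strings appear in the decoded text exactly as shown above; `splice_infill` is a hypothetical helper, not part of transformers.

```python
def splice_infill(decoded, prefix, suffix, mid="<|MID|>", eot="<|endoftext|>"):
    """Return prefix + generated middle + suffix from a decoded FIM output."""
    # Everything after the <|MID|> sentinel is what the model generated.
    # If the sentinel is missing, the middle is treated as empty.
    _, _, middle = decoded.partition(mid)
    # Drop the end-of-text marker and anything after it, if present.
    middle, _, _ = middle.partition(eot)
    return prefix + middle + suffix
```

Calling `splice_infill(output, prefix, suffix)` with the variables from the infilling demo would give the original function with the generated text in place of the docstring body.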

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# non-infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
model_input = [*tok(prefix)["input_ids"]]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
print(output)

'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'

Hope this helps! I will also update the model card with this example :)
