Struggling with reproducing paper results
Hi,
I have been struggling with this for some time, which is why I wanted to reach out.
I am trying to generate a library of proteins with properties similar to those discussed in the paper, to make sure I am calling the model correctly.
However, I keep getting much longer proteins than the ones generated in the paper.
My 1000 generations average around 300 AA once I discard those that finish without an EOS token.
But the 10,000 ProtGPT2 generations from the paper average around 145 AA, which is similar to the 135 AA of the pretraining data.
Other properties don't match either, but length is the easiest to measure.
This happens with my fine-tuned models but also with the unmodified model downloaded from Hugging Face, which makes me question my method of generation.
Could you provide more information on how you generated the library of proteins that the paper was based on?
Code snippet to reproduce my issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

input_ids = torch.tensor([tokenizer("<|endoftext|>")["input_ids"]])
outputs = model.generate(
    input_ids,
    max_new_tokens=250,
    temperature=1,
    top_k=950,
    top_p=1,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=100,
)
sequences = tokenizer.batch_decode(outputs, skip_special_tokens=False)
```
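For reference, this is roughly how I discard truncated generations (a sketch; `drop_truncated` is my own helper, and it assumes the sequences were decoded with `skip_special_tokens=False`, so a generation that finished naturally ends with the EOS token):

```python
EOS = "<|endoftext|>"

def drop_truncated(decoded):
    # Assumption: a generation that produced EOS ends with it after decoding,
    # while one cut off at max_new_tokens does not.
    return [s for s in decoded if s.rstrip().endswith(EOS)]
```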
Is it possible that max_new_tokens was set to 100? That does give me the same rate of dropped truncated sequences and the same average resulting length.
I cannot reconcile this with the statement in the publication that a context window of 250 tokens was used, though.
Hi Lukaas,
Thanks for reaching out! Let me see if we can fix this. Could you compute the perplexity of the generated sequences? Do you observe that the ones with lower perplexity are shorter? If so, I'd take the top 25-35% and proceed with those. It's been a while, so I don't remember all the details of the manuscript, but if that does not give you something closer to what you'd expect, we'll need to dig a lot deeper to see what is going on. Let me know how this goes!
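In case it helps, one common way to score a sequence looks roughly like this (a sketch only; `perplexity` is an illustrative helper, `model` and `tokenizer` are the ProtGPT2 objects from your snippet, and the paper's exact scoring procedure may differ):

```python
import math

import torch

def perplexity(model, tokenizer, sequence):
    # Perplexity as exp of the mean token cross-entropy; calling a Hugging Face
    # causal LM with labels=ids returns that mean cross-entropy as .loss.
    ids = torch.tensor([tokenizer(sequence)["input_ids"]])
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())
```

You could then sort your generations by this value and keep the lowest-perplexity fraction.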
Hi nferruz. Thanks for responding!
When I set max_new_tokens to 100 instead of 250, I do get the length distribution and the rate of dropped sequences (those without an EOS token after 100 tokens) described in the paper. But I was confused by the statement in the paper that the proteins were generated with a context window of 250 tokens. Can you shed some light on this?
Three sequence datasets were produced to compare their properties. The ProtGPT2 dataset was generated by sampling 1000 batches of 100 sequences, each with the selected inference parameters and a window context of 250 tokens. This step produced 100,000 sequences. We filtered from this set those sequences whose length had been cut due to the window context, giving a total of 29,876 sequences. From this set, we randomly selected 10,000 sequences. Their average length is 149.2 ± 50.9 amino acids.
Hi Lukaas,
I am confused too; let me see if I can help. The max_new_tokens argument should not change the length of the natural generation. What it does is truncate the sequence once it reaches 100 tokens if it has not yet produced the EOS token. But library versions have changed wildly over the last five years, and I am not entirely sure we can reproduce the same results given what you are observing. The code you describe is, to my knowledge, correct. What I'd recommend is to generate sequences and rank them by perplexity. Those with very high perplexity will most likely (please correct me if you find this is not the case) be longer. Then select the top-N sequences from that ordered list. Those should have lengths in line with the training set. Let me know if this helps :)
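Concretely, the selection step could look like this (a sketch; `select_low_perplexity` is an illustrative helper, and the 30% default is just one choice within the suggested 25-35% range):

```python
def select_low_perplexity(scored, keep_fraction=0.30):
    # `scored` is a list of (sequence, perplexity) pairs; lower perplexity
    # is better, so sort ascending and keep the first fraction.
    ranked = sorted(scored, key=lambda pair: pair[1])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [seq for seq, _ in ranked[:n_keep]]
```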