# Introducing Genstruct
Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. These approaches are limited in quality and diversity, and the pairs they produce lack explicit reasoning.

Two previous methods aimed to improve upon this naive prompting approach:
- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.
- [Ada-Instruct](https://arxiv.org/abs/2310.04484) trains a custom model to generate instructions instead of relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.

Genstruct is a new method that combines and extends these previous approaches. Like Ada-Instruct, it uses a custom-trained model rather than prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.

Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses.

## Generating instruction pairs
Genstruct is based on Mistral. Specifically, it is trained on top of the [MetaMath-Mistral-7B](meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning on math-heavy topics.

Like any other Mistral model, it can be loaded from the Hugging Face Hub as follows:

In [1]:
!pip install -U bitsandbytes --no-deps
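from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = 'NousResearch/Genstruct-7B'

# Load in 8-bit via a quantization config; the bare `load_in_8bit` argument is deprecated
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map='cuda',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)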
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:
```
[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]
```

The model then completes from `[[[User]]]`, generating an instruction and a response.
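
For pipelines that do not go through the tokenizer, the same prompt can be assembled by hand. The helper below is a minimal sketch: `build_genstruct_prompt` is a hypothetical name, and the exact whitespace between the sections is an assumption copied from the example above rather than a confirmed specification.

```python
def build_genstruct_prompt(title: str, content: str) -> str:
    """Assemble a Genstruct-style prompt that ends at the [[[User]]] tag.

    The spacing between sections is an assumption based on the example prompt.
    """
    return (
        f"[[[Title]]] {title.strip()}\n"
        f"[[[Content]]] {content.strip()}\n\n"
        "The following is an interaction between a user and an AI assistant "
        "that is related to the above text.\n\n"
        "[[[User]]] "
    )


prompt = build_genstruct_prompt(
    "p-value",
    "The p-value is used in the context of null hypothesis testing.",
)
print(prompt)
```

If the output of this helper and the model's expected format ever disagree, the format shipped with the model should be treated as authoritative.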


To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict with the keys 'title' and 'content', which hold the title and content of the context to generate from:

In [2]:
msg = [{
    'title': 'p-value',
    'content': "The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis."
}]
inputs = tokenizer.apply_chat_template(msg, return_tensors='pt').cuda()

Generation can then be performed with `model.generate()`, as follows (or with vLLM or whatever other pipeline you prefer):

In [3]:
gen = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0]).split(tokenizer.eos_token)[0]
print(gen)

[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]  The share prices of two rival companies, A and B, have been monitored for many years, allowing a large number of data points for rigorous statistical analysis. This year's summer, which is known to affect share prices, had two distinct sub-periods, A and B, which were roughly equal in length. The company 'A's share price, 

Note that the model is optimized for single-paragraph extracts from Wikipedia articles. Results may vary with other input types.
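
Since single paragraphs work best, a longer article can be split into paragraphs before generation. The following is a minimal sketch under stated assumptions: `paragraph_messages` and the `min_chars` threshold are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
def paragraph_messages(title: str, article_text: str, min_chars: int = 200):
    """Split an article into paragraphs and wrap each one as a chat-template input.

    Paragraphs are assumed to be separated by blank lines; `min_chars` is an
    illustrative cutoff that drops very short fragments such as headings.
    """
    paragraphs = [p.strip() for p in article_text.split("\n\n")]
    return [
        [{"title": title, "content": p}]
        for p in paragraphs
        if len(p) >= min_chars
    ]
```

Each element of the returned list has the same single-dict shape as `msg` above, so it can be passed to `tokenizer.apply_chat_template` one paragraph at a time.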

## Filtering outputs using a reward model
The model may occasionally generate incorrect or improperly formatted output. The likelihood of this can be reduced with sampling methods such as rejection sampling using a reward model, or even with simple regex filtering.
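
The regex-filtering option can be sketched as below. This is an illustration, not the authors' filter: it assumes the response is introduced by an `[[[Assistant]]]` tag mirroring `[[[User]]]`, and `is_well_formed` and `min_chars` are hypothetical names.

```python
import re

# Assumed layout: "[[[User]]] <instruction> [[[Assistant]]] <response>".
# The [[[Assistant]]] tag is an assumption mirroring the [[[User]]] tag.
PAIR_PATTERN = re.compile(
    r"\[\[\[User\]\]\](?P<instruction>.+?)\[\[\[Assistant\]\]\](?P<response>.+)",
    re.DOTALL,
)


def is_well_formed(generation: str, min_chars: int = 20) -> bool:
    """Return True if the generation holds exactly one instruction/response pair."""
    match = PAIR_PATTERN.search(generation)
    if match is None:
        return False  # missing tags, e.g. a truncated generation
    instruction = match.group("instruction").strip()
    response = match.group("response").strip()
    if "[[[" in response:
        return False  # a stray tag leaked into the response
    return len(instruction) >= min_chars and len(response) >= min_chars
```

Generations that fail the check can simply be discarded and resampled, which is cheaper than scoring them with a reward model.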

For instance, we might consider `OpenAssistant/reward-model-deberta-v3-large-v2` as a reward model, and perform best-of-n sampling:

In [4]:
import torch
from transformers import AutoModelForSequenceClassification

N = 4

rm_tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2')
rm_model = AutoModelForSequenceClassification.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2', torch_dtype=torch.bfloat16)

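def extract_pair(resp):
    # Drop the title/content preamble, keeping only the generated interaction
    interaction = resp.split('[[[User]]]', 1)[1]
    # Assumption: the response is introduced by an '[[[Assistant]]]' tag
    inst, _, answer = interaction.partition('[[[Assistant]]]')
    return inst.strip(), answer.strip()

def score(resp):
    inst, answer = extract_pair(resp.split(tokenizer.eos_token)[0])

    with torch.no_grad():
        rm_inputs = rm_tokenizer(inst, answer, return_tensors='pt')
        # Single logit: a higher value means the reward model prefers this pair
        return float(rm_model(**rm_inputs).logits[0].cpu())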

gens = tokenizer.batch_decode(model.generate(inputs, max_new_tokens=256, num_return_sequences=N, do_sample=True))
print(max(gens, key=score))

[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]  Two medical procedures were compared by flipping 2 coins, procedure A assumed to be better and so it was labeled head, while procedure B was labeled as tail for a flip. The coins where then flipped 25 times, with the following results:[{'Tails', 12}, {'Heads', 13}]

Which procedure had better results with statistical signi