Instructions to use elinas/Llama-3-15B-Instruct-ft-v2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use elinas/Llama-3-15B-Instruct-ft-v2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="elinas/Llama-3-15B-Instruct-ft-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("elinas/Llama-3-15B-Instruct-ft-v2")
model = AutoModelForCausalLM.from_pretrained("elinas/Llama-3-15B-Instruct-ft-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use elinas/Llama-3-15B-Instruct-ft-v2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "elinas/Llama-3-15B-Instruct-ft-v2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elinas/Llama-3-15B-Instruct-ft-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/elinas/Llama-3-15B-Instruct-ft-v2

SGLang

How to use elinas/Llama-3-15B-Instruct-ft-v2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "elinas/Llama-3-15B-Instruct-ft-v2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elinas/Llama-3-15B-Instruct-ft-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "elinas/Llama-3-15B-Instruct-ft-v2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elinas/Llama-3-15B-Instruct-ft-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use elinas/Llama-3-15B-Instruct-ft-v2 with Docker Model Runner:
```
docker model run hf.co/elinas/Llama-3-15B-Instruct-ft-v2
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Llama-3-15B-Instruct-zeroed-ft-v2

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on a "zeroed" passthrough merge of Llama-3-15B-Instruct-zeroed

This was primarily an experiment to see how a passthrough merge will respond to further finetuning of all LoRA modules.

The model was finetuned on 8192 context length and it can possibly be extended using RoPE up to 32k.

v3 of the model will contain significantly more data, primarily human focused, aimed to excel at writing as well as maintaining logic, coherency, and continuity.

GGUF Quants provided by @gelukuMLG

Datasets

Chat-Error/Pure-dove-sharegpt

A small, high quality, curated dataset was used as a PoC / validation on stabilizing the model after the original passthrough merge.

Finetuning details

This is a QLoRA model and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets. the first version of this model only targeted o_proj and up_proj

lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head

The model is coherent even with training the "zeroed" layers plus the additional layers, as this was the recommendation from Charles Goddard (mergekit developer) - thank you for sharing the method of merging as well as Toasty Pigeon for bringing it to my attention!

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1

Optimizer paged_adamw_8bit and Deepspeed ZeRO 3 was used at a LR of 1e-5 using the cosine scheduler for 1 epoch on 3x3090s taking 4 hours total.

Unsloth was used for speed and memory savings.

Sample packing and padding was disabled to reduce VRAM consumption significantly at the cost of speed.

W&B Run Summary

wandb:                eval/loss 0.90895
wandb:             eval/runtime 463.4688
wandb:  eval/samples_per_second 0.833
wandb:    eval/steps_per_second 0.278
wandb:               total_flos 8270790524928.0
wandb:              train/epoch 1.0
wandb:        train/global_step 1157
wandb:          train/grad_norm 7.3847
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.8702
wandb:               train_loss 0.87814
wandb:            train_runtime 16425.2713
wandb: train_samples_per_second 0.211
wandb:   train_steps_per_second 0.07