Instructions to use elinas/Llama-3-15B-Instruct-ft-v2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use elinas/Llama-3-15B-Instruct-ft-v2 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="elinas/Llama-3-15B-Instruct-ft-v2")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("elinas/Llama-3-15B-Instruct-ft-v2")
    model = AutoModelForCausalLM.from_pretrained("elinas/Llama-3-15B-Instruct-ft-v2")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    # Decode only the newly generated tokens, skipping the prompt
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use elinas/Llama-3-15B-Instruct-ft-v2 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "elinas/Llama-3-15B-Instruct-ft-v2"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "elinas/Llama-3-15B-Instruct-ft-v2",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
    docker model run hf.co/elinas/Llama-3-15B-Instruct-ft-v2
- SGLang
How to use elinas/Llama-3-15B-Instruct-ft-v2 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "elinas/Llama-3-15B-Instruct-ft-v2" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "elinas/Llama-3-15B-Instruct-ft-v2",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "elinas/Llama-3-15B-Instruct-ft-v2" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "elinas/Llama-3-15B-Instruct-ft-v2",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use elinas/Llama-3-15B-Instruct-ft-v2 with Docker Model Runner:
    docker model run hf.co/elinas/Llama-3-15B-Instruct-ft-v2
Llama-3-15B-Instruct-zeroed-ft-v2
This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.
The model is based on Llama-3-15B-Instruct-zeroed, a "zeroed" passthrough merge.
This was primarily an experiment to see how a passthrough merge will respond to further finetuning of all LoRA modules.
The model was finetuned at a context length of 8192 tokens and can possibly be extended up to 32k using RoPE scaling.
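As a rough illustration of what extending the context would involve, the sketch below derives the scaling factor needed to stretch the 8192-token training context to 32k. The dictionary layout (`type` plus `factor`) follows the `rope_scaling` format that Llama configs accepted in Transformers 4.40; the choice of `"dynamic"` scaling and the helper name `rope_scaling_config` are assumptions made for this example, not settings taken from the model card.

```python
TRAINED_CONTEXT = 8192   # context length used during finetuning (from the card)
TARGET_CONTEXT = 32768   # the "32k" upper bound mentioned in the card


def rope_scaling_config(trained_ctx, target_ctx, scaling_type="dynamic"):
    """Build a rope_scaling-style dict (hypothetical helper).

    Returns None when no extension is needed, otherwise a dict whose
    factor is the ratio of the target context to the trained context.
    """
    if target_ctx <= trained_ctx:
        return None
    return {"type": scaling_type, "factor": target_ctx / trained_ctx}


config = rope_scaling_config(TRAINED_CONTEXT, TARGET_CONTEXT)
print(config)  # {'type': 'dynamic', 'factor': 4.0}
```

A dict like this could in principle be supplied as the `rope_scaling` entry of the model config at load time. Whether output quality holds at the extended length has not been verified here.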
v3 of the model will contain significantly more data, primarily human focused, aimed to excel at writing as well as maintaining logic, coherency, and continuity.
GGUF Quants provided by @gelukuMLG
Datasets
A small, high-quality, curated dataset was used as a proof of concept (PoC) to validate that the model could be stabilized after the original passthrough merge.
Finetuning details
This is a QLoRA model, and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets. The first version of this model only targeted o_proj and up_proj.
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_modules_to_save:
- embed_tokens
- lm_head
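To give a sense of how much more is trained when all seven projections are targeted instead of only o_proj and up_proj, the sketch below counts LoRA adapter parameters per decoder layer. The projection shapes are those of the Llama-3-8B base architecture (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128), assumed here to carry over to the merged model. The LoRA rank is not stated in this card, so it stays a free parameter, and `lora_params_per_layer` is a hypothetical helper. The fully trained embed_tokens and lm_head modules are not counted.

```python
HIDDEN = 4096         # Llama-3-8B hidden size (assumed unchanged by the merge)
INTERMEDIATE = 14336  # Llama-3-8B MLP intermediate size
KV_DIM = 8 * 128      # 8 key/value heads x head dimension 128

# (input features, output features) of each linear projection in a decoder layer
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

V1_TARGETS = ["o_proj", "up_proj"]
V2_TARGETS = ["gate_proj", "down_proj", "up_proj", "q_proj", "v_proj", "k_proj", "o_proj"]


def lora_params_per_layer(target_modules, rank):
    """LoRA adds an (in x rank) and a (rank x out) matrix per targeted module."""
    return sum(rank * (PROJ_SHAPES[m][0] + PROJ_SHAPES[m][1]) for m in target_modules)


rank = 1  # hypothetical; counts scale linearly with the real rank
v1 = lora_params_per_layer(V1_TARGETS, rank)
v2 = lora_params_per_layer(V2_TARGETS, rank)
print(v1, v2, round(v2 / v1, 2))  # 26624 81920 3.08
```

Under these assumptions, targeting every projection trains roughly three times as many adapter parameters per layer as the first version's two-module setup, independent of the rank chosen.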
The model remains coherent even when the "zeroed" layers are trained along with the additional layers, as recommended by Charles Goddard (mergekit developer). Thank you for sharing the merging method, and thanks to Toasty Pigeon for bringing it to my attention!
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
The paged_adamw_8bit optimizer and DeepSpeed ZeRO 3 were used at a learning rate of 1e-5 with the cosine scheduler for 1 epoch on 3x 3090 GPUs, taking roughly 4.5 hours in total (train_runtime of about 16,425 s).
Unsloth was used for speed and memory savings.
Sample packing and padding were disabled, which significantly reduced VRAM consumption at the cost of speed.
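The learning-rate settings above (peak 1e-5, 25 warmup steps, cosine decay over one epoch) can be sketched as a plain function. This is a minimal illustration that assumes the common shape of linear warmup followed by a half-cosine decay to zero. The total of 1157 steps is taken from train/global_step in the run summary, and `lr_at` is a hypothetical name, not the trainer's own code.

```python
import math

BASE_LR = 1e-5      # learning_rate from the hyperparameter list
WARMUP_STEPS = 25   # lr_scheduler_warmup_steps
TOTAL_STEPS = 1157  # train/global_step from the W&B run summary


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup:
        # Linear ramp from 0 up to the peak learning rate
        return base_lr * step / max(1, warmup)
    # Half-cosine decay from the peak down to 0 at the final step
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 591, 1157):
    print(step, lr_at(step))
```

Under this assumed shape the rate reaches its peak at step 25 and falls to zero at the last step, which is consistent with the final train/learning_rate of 0.0 reported in the run summary.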
W&B Run Summary
wandb: eval/loss 0.90895
wandb: eval/runtime 463.4688
wandb: eval/samples_per_second 0.833
wandb: eval/steps_per_second 0.278
wandb: total_flos 8270790524928.0
wandb: train/epoch 1.0
wandb: train/global_step 1157
wandb: train/grad_norm 7.3847
wandb: train/learning_rate 0.0
wandb: train/loss 0.8702
wandb: train_loss 0.87814
wandb: train_runtime 16425.2713
wandb: train_samples_per_second 0.211
wandb: train_steps_per_second 0.07
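The throughput figures in the summary can be cross-checked against the step count and batch size with a little arithmetic. The sketch below assumes no gradient accumulation (one optimizer step consumes total_train_batch_size samples, which matches 1 per device on 3 devices) and that every step used a full batch. The variable names are illustrative.

```python
TRAIN_RUNTIME_S = 16425.2713  # wandb: train_runtime
GLOBAL_STEPS = 1157           # wandb: train/global_step
TOTAL_TRAIN_BATCH_SIZE = 3    # 1 per device x 3 devices

# Assumes every step processed a full batch of 3 samples
samples_seen = GLOBAL_STEPS * TOTAL_TRAIN_BATCH_SIZE

steps_per_second = GLOBAL_STEPS / TRAIN_RUNTIME_S
samples_per_second = samples_seen / TRAIN_RUNTIME_S
runtime_hours = TRAIN_RUNTIME_S / 3600

print(round(steps_per_second, 2))    # 0.07, matches train_steps_per_second
print(round(samples_per_second, 3))  # 0.211, matches train_samples_per_second
print(round(runtime_hours, 2))       # 4.56
```

Both recomputed rates agree with the logged values, which suggests the run covered about 3,471 training samples in roughly four and a half hours.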
Framework versions
- PEFT 0.10.0
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
Model Evaluation
TBD
If you have any questions or comments on the model, feel free to open a discussion in the community tab.
Model tree for elinas/Llama-3-15B-Instruct-ft-v2
Base model
meta-llama/Meta-Llama-3-8B-Instruct