ARCHIVED.
Download from original repo: https://huggingface.co/openlm-research/open_llama_3b_600bt_preview
I made a few PRs to the original repo to include my changes!
The original model is at https://huggingface.co/openlm-research/open_llama_3b_600bt_preview. The example below is edited from https://github.com/openlm-research/open_llama:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Weights come from the original repo; the fast tokenizer comes from this repo.
model_name = "openlm-research/open_llama_3b_600bt_preview"
fast_model_name = "danielhanchen/open_llama_3b_600bt_preview"
tokenizer = AutoTokenizer.from_pretrained(fast_model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype = torch.float16, device_map = "auto")

prompt = "Q: What is the largest animal?\nA:"
# Move the input ids to the model's device, since device_map = "auto" may place the model on a GPU.
input_ids = tokenizer(prompt, return_tensors = "pt").input_ids.to(model.device)
print( tokenizer.decode( model.generate( input_ids, max_new_tokens = 32).ravel() ) )
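The example above loads the tokenizer from this repo rather than the original one. A rough way to measure how long such a load takes is a small timing wrapper like the sketch below. The helper name `time_load` is hypothetical and not part of `transformers`; it times any zero-argument callable.

```python
import time
from typing import Any, Callable, Tuple


def time_load(loader: Callable[[], Any]) -> Tuple[Any, float]:
    """Call `loader` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = loader()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example usage (downloads files from the Hub, so it is left commented out):
# from transformers import AutoTokenizer
# tok, secs = time_load(
#     lambda: AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b_600bt_preview")
# )
# print(f"tokenizer loaded in {secs:.1f} s")
```

Timings depend on the machine, the cache state and the installed `transformers` version, so treat any single measurement as indicative only.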
This repo includes:
- Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code. Loading via `AutoTokenizer` used to take 4 to 5 minutes. Now, a few seconds! Essentially the porting is done via the below code:
# from huggingface_hub import notebook_login
# notebook_login()
from transformers import LlamaTokenizerFast
from tokenizers import AddedToken

tokenizer = LlamaTokenizerFast.from_pretrained(
    "openlm-research/open_llama_3b_600bt_preview",
    add_bos_token = True,
    add_eos_token = False, # Original LLaMA is False -> add </s> during processing.
    bos_token = AddedToken("<s>", single_word = True),
    eos_token = AddedToken("</s>", single_word = True),
    unk_token = AddedToken("<unk>", single_word = True),
    pad_token = AddedToken("<unk>", single_word = True)
)
tokenizer.push_to_hub("open_llama_3b_600bt_preview")
- `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. Weirdly, `<unk>` (i.e. the 0 token) was added instead of the `<s>` or `</s>` token.
- Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) also being the `<unk>` token.
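To confirm that a loaded tokenizer carries the special tokens listed above, a check along the following lines may help. This is a minimal sketch: the helper `special_token_mismatches` and the `EXPECTED_SPECIAL_TOKENS` table are hypothetical names introduced here, and the stub object stands in for a real tokenizer so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# Expected special tokens, taken from the list above (PAD reuses <unk>).
EXPECTED_SPECIAL_TOKENS = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<unk>",
}


def special_token_mismatches(tokenizer, expected=EXPECTED_SPECIAL_TOKENS):
    """Return {attribute: (expected, actual)} for every special token that differs."""
    mismatches = {}
    for name, want in expected.items():
        got = getattr(tokenizer, name, None)  # None if the attribute is missing
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


# Stand-in object with a missing PAD token, used only for illustration.
stub = SimpleNamespace(bos_token="<s>", eos_token="</s>", unk_token="<unk>", pad_token=None)
print(special_token_mismatches(stub))  # {'pad_token': ('<unk>', None)}
```

With a real tokenizer, pass the object returned by `AutoTokenizer.from_pretrained(...)` instead of the stub; an empty dict means all four tokens match.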