Instructions for using danielhanchen/open_llama_3b_600bt_preview with libraries and local apps.

Libraries

How to use danielhanchen/open_llama_3b_600bt_preview with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="danielhanchen/open_llama_3b_600bt_preview")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b_600bt_preview")
model = AutoModelForCausalLM.from_pretrained("danielhanchen/open_llama_3b_600bt_preview")
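
Once the pipeline exists, generation is a single call. The sketch below wraps that call in a small helper so the calling convention is visible. `generate_text` and `main` are hypothetical names used for illustration, not part of Transformers. The helper assumes the usual return shape of a text-generation pipeline: a list of dicts with a `generated_text` key. The model download is kept inside `main()` because it fetches the full set of weights.

```python
from typing import Any, Callable, Dict, List

MODEL_ID = "danielhanchen/open_llama_3b_600bt_preview"


def generate_text(pipe: Callable[..., List[Dict[str, Any]]],
                  prompt: str,
                  max_new_tokens: int = 32) -> str:
    """Run a text-generation pipeline on one prompt and return the text.

    `pipe` is anything callable like a Transformers text-generation
    pipeline: it returns a list of dicts with a "generated_text" key.
    """
    outputs = pipe(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]


def main() -> None:
    # Imported here so the helper above can be used without Transformers.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=MODEL_ID)
    print(generate_text(pipe, "Q: What is the largest animal?\nA:"))
```

Calling `main()` downloads the model and prints one completion; the returned text includes the prompt followed by the generated continuation.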

Local Apps

vLLM

How to use danielhanchen/open_llama_3b_600bt_preview with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "danielhanchen/open_llama_3b_600bt_preview"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "danielhanchen/open_llama_3b_600bt_preview",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
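
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000` and returns an OpenAI-style completions response with the text under `choices[0].text`. The function names `build_completion_request` and `complete` are hypothetical helpers, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "danielhanchen/open_llama_3b_600bt_preview"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the same POST request as the curl command above."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style responses carry the text in choices[0].text.
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated continuation. For the SGLang server on this page, pass `base_url="http://localhost:30000"`.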

SGLang

How to use danielhanchen/open_llama_3b_600bt_preview with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "danielhanchen/open_llama_3b_600bt_preview" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "danielhanchen/open_llama_3b_600bt_preview",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "danielhanchen/open_llama_3b_600bt_preview" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "danielhanchen/open_llama_3b_600bt_preview",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use danielhanchen/open_llama_3b_600bt_preview with Docker Model Runner:
```
docker model run hf.co/danielhanchen/open_llama_3b_600bt_preview
```

ARCHIVED.

Download from original repo: https://huggingface.co/openlm-research/open_llama_3b_600bt_preview

I made a few PRs to the original repo to include my changes!

The original model is at https://huggingface.co/openlm-research/open_llama_3b_600bt_preview. The example below is adapted from https://github.com/openlm-research/open_llama:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "openlm-research/open_llama_3b_600bt_preview"     # original weights
fast_model_name = "danielhanchen/open_llama_3b_600bt_preview"  # fast tokenizer from this repo

tokenizer = AutoTokenizer.from_pretrained(fast_model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype = torch.float16, device_map = "auto")

prompt = "Q: What is the largest animal?\nA:"
# Move the inputs to the model's device; device_map = "auto" may place the model on a GPU.
input_ids = tokenizer(prompt, return_tensors = "pt").input_ids.to(model.device)
output_ids = model.generate(input_ids, max_new_tokens = 32)
print(tokenizer.decode(output_ids[0]))

This repo includes:

A port of LlamaTokenizer to LlamaTokenizerFast, done in a few lines of code. Loading the original tokenizer via AutoTokenizer took 4 to 5 minutes; the fast tokenizer loads in a few seconds. The port is essentially the code below:

# from huggingface_hub import notebook_login
# notebook_login()
from transformers import LlamaTokenizerFast
from tokenizers import AddedToken
tokenizer = LlamaTokenizerFast.from_pretrained(
    "openlm-research/open_llama_3b_600bt_preview",
    add_bos_token = True,
    add_eos_token = False, # Original LLaMA is False -> add </s> during processing.
    bos_token = AddedToken("<s>",   single_word = True),
    eos_token = AddedToken("</s>",  single_word = True),
    unk_token = AddedToken("<unk>", single_word = True),
    pad_token = AddedToken("<unk>", single_word = True)
)
tokenizer.push_to_hub("open_llama_3b_600bt_preview")

A fix for special tokens. AutoTokenizer did not recognize the BOS, EOS and UNK tokens. Oddly, <unk> (token id 0) was added in place of the <s> or </s> token.
Manually added BOS <s>, EOS </s> and UNK <unk> tokens, with PAD (padding) also set to the <unk> token.
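
Both claims above (the faster load and the corrected special tokens) can be checked with a short script. The sketch below is an illustration under stated assumptions, not part of the original repo: `time_call`, `special_token_mismatches` and `main` are hypothetical helpers, and the expected token strings are taken from the porting code above. Loading is kept inside `main()` because it needs network access.

```python
import time

REPO_ID = "danielhanchen/open_llama_3b_600bt_preview"

# Expected values, taken from the porting code above.
EXPECTED_SPECIAL_TOKENS = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<unk>",
}


def time_call(loader):
    """Call `loader()` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = loader()
    return result, time.perf_counter() - start


def special_token_mismatches(tokenizer):
    """Return {attribute: (expected, actual)} for tokens that differ."""
    mismatches = {}
    for attribute, expected in EXPECTED_SPECIAL_TOKENS.items():
        actual = getattr(tokenizer, attribute, None)
        if actual != expected:
            mismatches[attribute] = (expected, actual)
    return mismatches


def main():
    # Imported here so the helpers above work without Transformers installed.
    from transformers import AutoTokenizer

    tokenizer, seconds = time_call(lambda: AutoTokenizer.from_pretrained(REPO_ID))
    print(f"Loaded in {seconds:.1f} s")
    print("Mismatched special tokens:", special_token_mismatches(tokenizer))
```

An empty dict from `special_token_mismatches` means all four special tokens match the values set during the port.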

Downloads last month: 1,029

Safetensors: model size 3B params; tensor types F32, F16
