Instructions to use us4/fin-llama3.1-8b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use us4/fin-llama3.1-8b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="us4/fin-llama3.1-8b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("us4/fin-llama3.1-8b")
model = AutoModelForCausalLM.from_pretrained("us4/fin-llama3.1-8b")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- llama-cpp-python
How to use us4/fin-llama3.1-8b with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="us4/fin-llama3.1-8b",
    filename="model-q4_0.gguf",
)
```
```python
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use us4/fin-llama3.1-8b with llama.cpp:
Install from brew
```shell
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf us4/fin-llama3.1-8b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf us4/fin-llama3.1-8b:Q4_K_M
```
Install from WinGet (Windows)
```shell
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf us4/fin-llama3.1-8b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf us4/fin-llama3.1-8b:Q4_K_M
```
Use pre-built binary
```shell
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf us4/fin-llama3.1-8b:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf us4/fin-llama3.1-8b:Q4_K_M
```
Build from source code
```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf us4/fin-llama3.1-8b:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf us4/fin-llama3.1-8b:Q4_K_M
```
Use Docker
```shell
docker model run hf.co/us4/fin-llama3.1-8b:Q4_K_M
```
- LM Studio
- Jan
- vLLM
How to use us4/fin-llama3.1-8b with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "us4/fin-llama3.1-8b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "us4/fin-llama3.1-8b",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```

Use Docker
```shell
docker model run hf.co/us4/fin-llama3.1-8b:Q4_K_M
```
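The vLLM server speaks an OpenAI-compatible chat API, so it can also be called from plain Python. A minimal sketch using only the standard library (the endpoint and payload mirror the curl call; the final request line is left commented so nothing runs without a live server):

```python
import json
import urllib.request

# Build the same OpenAI-compatible chat payload as the curl example.
payload = {
    "model": "us4/fin-llama3.1-8b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # uncomment with the server running
#     print(json.load(resp)["choices"][0]["message"]["content"])
```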
- SGLang
How to use us4/fin-llama3.1-8b with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "us4/fin-llama3.1-8b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "us4/fin-llama3.1-8b",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "us4/fin-llama3.1-8b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "us4/fin-llama3.1-8b",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```

- Ollama
How to use us4/fin-llama3.1-8b with Ollama:
```shell
ollama run hf.co/us4/fin-llama3.1-8b:Q4_K_M
```
- Unsloth Studio
How to use us4/fin-llama3.1-8b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for us4/fin-llama3.1-8b to start chatting.
```
Install Unsloth Studio (Windows)
```shell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for us4/fin-llama3.1-8b to start chatting.
```
Using HuggingFace Spaces for Unsloth
```shell
# No setup required.
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# and search for us4/fin-llama3.1-8b to start chatting.
```
- Pi
How to use us4/fin-llama3.1-8b with Pi:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf us4/fin-llama3.1-8b:Q4_K_M
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {"id": "us4/fin-llama3.1-8b:Q4_K_M"}
      ]
    }
  }
}
```

Run Pi
```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use us4/fin-llama3.1-8b with Hermes Agent:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf us4/fin-llama3.1-8b:Q4_K_M
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default us4/fin-llama3.1-8b:Q4_K_M
```
Run Hermes
```shell
hermes
```
- Docker Model Runner
How to use us4/fin-llama3.1-8b with Docker Model Runner:
```shell
docker model run hf.co/us4/fin-llama3.1-8b:Q4_K_M
```
- Lemonade
How to use us4/fin-llama3.1-8b with Lemonade:
Pull the model
```shell
# Download Lemonade from https://lemonade-server.ai/
lemonade pull us4/fin-llama3.1-8b:Q4_K_M
```
Run and chat with the model
```shell
lemonade run user.fin-llama3.1-8b-Q4_K_M
```
List all available models
```shell
lemonade list
```
Model Card for Fin-LLaMA 3.1 8B
This is the model card for Fin-LLaMA 3.1 8B, a version of LLaMA 3.1 8B fine-tuned on financial news data. The model is built to generate coherent, relevant financial, economic, and business text. The repository also provides multiple quantized GGUF formats for resource-efficient deployment.
Model Details
Model Description
The Fin-LLaMA 3.1 8B model was fine-tuned using the Unsloth library, employing LoRA adapters for efficient training, and is available in various quantized GGUF formats. The model is instruction-tuned to generate text in response to finance-related queries.
- Developed by: us4
- Model type: Transformer (LLaMA 3.1 architecture, 8B parameters)
- Languages: English
- License: [More Information Needed]
- Fine-tuned from model: LLaMA 3.1 8B
Files and Formats
The repository contains multiple files, including safetensors and GGUF formats for different quantization levels. Below is the list of key files and their details:
- adapter_config.json (778 Bytes): Configuration for the adapter model.
- adapter_model.safetensors (5.54 GB): Adapter model in safetensors format.
- config.json (978 Bytes): Model configuration file.
- generation_config.json (234 Bytes): Generation configuration file for text generation.
- model-00001-of-00004.safetensors (4.98 GB): Part 1 of the model in safetensors format.
- model-00002-of-00004.safetensors (5.00 GB): Part 2 of the model in safetensors format.
- model-00003-of-00004.safetensors (4.92 GB): Part 3 of the model in safetensors format.
- model-00004-of-00004.safetensors (1.17 GB): Part 4 of the model in safetensors format.
- model-q4_0.gguf (4.66 GB): Quantized GGUF format (Q4_0).
- model-q4_k_m.gguf (4.92 GB): Quantized GGUF format (Q4_K_M).
- model-q5_k_m.gguf (5.73 GB): Quantized GGUF format (Q5_K_M).
- model-q8_0.gguf (8.54 GB): Quantized GGUF format (Q8_0).
- model.safetensors.index.json (24 KB): Index file for the sharded safetensors model.
- special_tokens_map.json (454 Bytes): Special tokens mapping file.
- tokenizer.json (9.09 MB): Tokenizer configuration for the model.
- tokenizer_config.json (55.4 KB): Additional tokenizer settings.
- training_args.bin (5.56 KB): Training arguments used for fine-tuning.
GGUF Formats and Usage
The GGUF formats are optimized for memory-efficient inference, especially for edge devices or deployment in low-resource environments. Here’s a breakdown of the quantized GGUF formats available:
- Q4_0: 4-bit quantized model for high memory efficiency with some loss in precision.
- Q4_K_M: 4-bit quantized with optimized configurations for maintaining precision.
- Q5_K_M: 5-bit quantized model balancing memory efficiency and accuracy.
- Q8_0: 8-bit quantized model for higher precision with a larger memory footprint.
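The listed file sizes follow directly from the parameter count and the effective bits per weight of each scheme. A rough sanity-check sketch (the effective bits-per-weight figures below are illustrative approximations that fold in quantization block overhead such as per-block scales; they are not official values):

```python
# Approximate GGUF file size: parameters * effective bits per weight / 8.
PARAMS = 8.03e9  # LLaMA 3.1 8B parameter count

# Effective bits per weight, including block overhead (approximations).
EFFECTIVE_BITS = {
    "q4_0": 4.6,
    "q4_k_m": 4.9,
    "q5_k_m": 5.7,
    "q8_0": 8.5,
}

def estimated_size_gb(quant: str) -> float:
    """Rough on-disk size of a quantized GGUF file, in GB."""
    return PARAMS * EFFECTIVE_BITS[quant] / 8 / 1e9

for quant in EFFECTIVE_BITS:
    print(f"{quant}: ~{estimated_size_gb(quant):.2f} GB")
```

These estimates land close to the 4.66 GB to 8.54 GB range of the files in the repository.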
GGUF files available in the repository:
- model-q4_0.gguf (4.66 GB)
- model-q4_k_m.gguf (4.92 GB)
- model-q5_k_m.gguf (5.73 GB)
- model-q8_0.gguf (8.54 GB)
To load and use these GGUF models for inference:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="us4/fin-llama3.1-8b",
    max_seq_length=2048,
    load_in_4bit=True,             # Set to False for the Q8_0 format
    quantization_method="q4_k_m",  # Change to the required format (e.g., "q5_k_m" or "q8_0")
)
```
Model Sources
- Repository: Fin-LLaMA 3.1 8B on Hugging Face
- Paper: LLaMA: Open and Efficient Foundation Language Models
Uses
The Fin-LLaMA 3.1 8B model is designed for generating business, financial, and economic-related text.
Direct Use
The model can be directly used for text generation tasks, such as generating financial news summaries, analysis, or responses to finance-related prompts.
Downstream Use
The model can be further fine-tuned for specific financial tasks, such as question-answering systems, summarization of financial reports, or automation of business processes.
Out-of-Scope Use
The model is not suited for use in domains outside of finance, such as medical or legal text generation, nor should it be used for tasks that require deep financial forecasting or critical decision-making without human oversight.
Bias, Risks, and Limitations
The model may inherit biases from the financial news data it was trained on. Since financial reporting can be region-specific and company-biased, users should exercise caution when applying the model in various international contexts.
Recommendations
Users should carefully evaluate the generated text in critical business or financial settings. Ensure the generated content aligns with local regulations and company policies.
Training Details
Training Data
The model was fine-tuned on a dataset of financial news articles, consisting of titles and content from various financial media sources. The dataset has been pre-processed to remove extraneous information and ensure consistency across financial terms.
Training Procedure
Preprocessing
The training data was tokenized using the LLaMA tokenizer, with prompts formatted to include both the title and content of financial news articles.
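The exact prompt template is not published in this card; a minimal sketch of the kind of title-plus-content formatting described above (the field labels and layout are assumptions for illustration) might look like:

```python
def format_example(title: str, content: str) -> str:
    """Combine a news title and body into one training prompt.

    The template used for the actual fine-tuning run is not documented;
    this layout is an illustrative assumption.
    """
    return f"### Title:\n{title}\n\n### Content:\n{content}"

prompt = format_example(
    "Fed holds rates steady",
    "The Federal Reserve left its benchmark rate unchanged on Wednesday...",
)
print(prompt)
```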
Training Hyperparameters
- Training regime: Mixed precision (FP16), gradient accumulation steps: 8, max steps: 500.
- Learning Rate: 5e-5 for fine-tuning, 1e-5 for embeddings.
- Batch size: 8 per device.
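Taken together, these settings imply the following effective batch size per optimizer step (assuming a single GPU, consistent with the A100 hardware listed below):

```python
per_device_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1  # single A100, per the compute infrastructure section

# Sequences consumed per optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # → 64
```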
Speeds, Sizes, Times
The model training took place over approximately 500 steps on an A100 GPU. Checkpoint files range from 4.98 GB to 8.54 GB depending on quantization.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was tested on unseen financial news articles from the same source domains as the training set.
Factors
Evaluation focused on the model’s ability to generate coherent financial summaries and responses.
Metrics
Common text-generation metrics such as perplexity, accuracy in summarization, and human-in-the-loop evaluations were used.
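Of these metrics, perplexity is the most mechanical: it is the exponential of the mean per-token negative log-likelihood. A small self-contained illustration with toy numbers (the NLL values are made up for the example):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy per-token NLLs a model might assign to a held-out sentence.
nlls = [2.1, 0.4, 1.3, 0.9]
print(round(perplexity(nlls), 2))  # → 3.24
```

Lower perplexity means the model assigns higher probability to the held-out financial text.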
Results
The model demonstrated strong performance in generating high-quality financial text. It maintained coherence over long sequences and accurately represented financial data from the prompt.
Model Examination
No interpretability techniques have yet been applied to this model, but explainability is under consideration for future versions.
Environmental Impact
Training carbon emissions can be estimated using the Machine Learning Impact calculator.
- Hardware Type: A100 GPU
- Hours used: Approximately 72 hours for fine-tuning
- Cloud Provider: AWS
- Compute Region: US-East
- Carbon Emitted: Estimated at 43 kg of CO2eq
Technical Specifications
Model Architecture and Objective
The Fin-LLaMA 3.1 8B model is based on the LLaMA 3.1 architecture and uses LoRA adapters to efficiently fine-tune the model on financial data.
Compute Infrastructure
The model was trained on A100 GPUs using PyTorch and the Hugging Face 🤗 Transformers library.
Hardware
- GPU: A100 (80GB)
- Storage Requirements: Around 20GB for the fine-tuned checkpoints, depending on quantization format.
Software
- Library: Hugging Face Transformers, Unsloth, PyTorch, PEFT
- Version: Unsloth v1.0, PyTorch 2.0, Hugging Face Transformers 4.30.0
Citation
If you use this model in your research or applications, please consider citing:
BibTeX:
@article{touvron2023llama,
title={LLaMA: Open and Efficient Foundation Language Models},
author={Touvron, Hugo and others},
journal={arXiv preprint arXiv:2302.13971},
year={2023}
}
@misc{us4_fin_llama3_1,
title={Fin-LLaMA 3.1 8B - Fine-tuned on Financial News},
author={us4},
year={2024},
howpublished={\url{https://huggingface.co/us4/fin-llama3.1-8b}},
}
More Information
For any additional information, please refer to the repository or contact the authors via the Hugging Face Hub.
Model Card Contact
[More Information Needed]