Instructions to use PleIAs/Pleias-RAG-350M-gguf with libraries, inference providers, notebooks, and local apps.

Libraries

How to use PleIAs/Pleias-RAG-350M-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

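# Download the GGUF file from the Hugging Face Hub and load it
llm = Llama.from_pretrained(
    repo_id="PleIAs/Pleias-RAG-350M-gguf",
    filename="350m_test.gguf",
)

# Run a plain text completion; echo=True includes the prompt in the output
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)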

Local Apps

llama.cpp

How to use PleIAs/Pleias-RAG-350M-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf PleIAs/Pleias-RAG-350M-gguf
# Run inference directly in the terminal:
llama-cli -hf PleIAs/Pleias-RAG-350M-gguf

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf PleIAs/Pleias-RAG-350M-gguf
# Run inference directly in the terminal:
llama-cli -hf PleIAs/Pleias-RAG-350M-gguf

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf PleIAs/Pleias-RAG-350M-gguf
# Run inference directly in the terminal:
./llama-cli -hf PleIAs/Pleias-RAG-350M-gguf

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf PleIAs/Pleias-RAG-350M-gguf
# Run inference directly in the terminal:
./build/bin/llama-cli -hf PleIAs/Pleias-RAG-350M-gguf

Use Docker

docker model run hf.co/PleIAs/Pleias-RAG-350M-gguf

Ollama
How to use PleIAs/Pleias-RAG-350M-gguf with Ollama:
```
ollama run hf.co/PleIAs/Pleias-RAG-350M-gguf
```

Unsloth Studio

How to use PleIAs/Pleias-RAG-350M-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for PleIAs/Pleias-RAG-350M-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for PleIAs/Pleias-RAG-350M-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for PleIAs/Pleias-RAG-350M-gguf to start chatting

Docker Model Runner
How to use PleIAs/Pleias-RAG-350M-gguf with Docker Model Runner:
```
docker model run hf.co/PleIAs/Pleias-RAG-350M-gguf
```

Lemonade

How to use PleIAs/Pleias-RAG-350M-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull PleIAs/Pleias-RAG-350M-gguf

Run and chat with the model

lemonade run user.Pleias-RAG-350M-gguf-{{QUANT_TAG}}

List all available models

lemonade list

Pleias-RAG-350M-gguf

Full model report

Pleias-RAG-350M-gguf is a 350-million-parameter Small Reasoning Model trained for retrieval-augmented generation (RAG), search and source summarization, and converted to the GGUF format so it can run on CPU and other constrained infrastructure. Along with Pleias-RAG-1B, it belongs to the first generation of Pleias specialized reasoning models.

Pleias-RAG-350M outperforms most SLMs (4 billion parameters and below) on standardized benchmarks for retrieval-augmented generation (HotPotQA, 2wiki) and is a highly cost-effective alternative to popular larger models, including Qwen-2.5-7B, Llama-3.1-8B and Gemma-3-4B. It is the only SLM to date to maintain consistent RAG performance across leading European languages and to ensure systematic reference grounding for statements.

Due to its size, ease of deployment on constrained infrastructure (including mobile phones) and built-in support for factual and accurate information, Pleias-RAG-350M unlocks a range of new use cases for generative AI.

This version is unquantized but, thanks to the small size of the original model, it still runs on most CPU infrastructure in reasonable time without any loss of performance.

Features

Pleias-RAG-350M is a specialized language model using a series of special tokens to process a structured input (query and sources) and generate a structured output (reasoning sequence and answer with sources). For easier implementation, we encourage using the associated API library.
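To make the idea of a structured input concrete, here is a minimal sketch of how a query and a list of sources could be serialized into one prompt string. The delimiters (`[QUERY]`, `[SOURCE n]`, and so on) and the names `PromptMarkers` and `build_prompt` are placeholders invented for this illustration; they are not the model's actual special tokens, which should be taken from the tokenizer or the associated API library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptMarkers:
    """Placeholder delimiters. These are NOT the model's real special tokens."""

    query_open: str = "[QUERY]"
    query_close: str = "[/QUERY]"
    source_open: str = "[SOURCE {id}]"
    source_close: str = "[/SOURCE]"


def build_prompt(
    query: str,
    sources: list[str],
    markers: PromptMarkers = PromptMarkers(),
) -> str:
    """Serialize a query and numbered sources into a single prompt string."""
    if not query.strip():
        raise ValueError("query must not be empty")
    parts = [f"{markers.query_open}{query.strip()}{markers.query_close}"]
    # Sources are numbered from 1 so that citations can refer back to them.
    for index, text in enumerate(sources, start=1):
        opening = markers.source_open.format(id=index)
        parts.append(f"{opening}{text.strip()}{markers.source_close}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_prompt(
            "Who wrote Hamlet?",
            ["Hamlet is a tragedy by William Shakespeare.", "Macbeth is set in Scotland."],
        )
    )
```

Keeping the delimiters in one small object means they can be swapped for the real token strings without touching the serialization logic.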

Citation support

Pleias-RAG-350M natively generates grounded answers on the basis of excerpts and citations extracted from the provided sources, using a custom syntax inspired by Wikipedia (<ref></ref>). It is one of a handful of open-weights models to date to have been developed with this feature and the first one designed for actual deployment.

In contrast with Anthropic's approach (Citation mode), citations are generated entirely by the model and are not the product of external chunking. As a result, we can provide another desirable feature to simplify source checking: citation shortening for longer excerpts (using "(…)").
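Because the quotes are generated by the model rather than copied by an external chunker, an application may want to check them against the sources. The sketch below assumes citations are emitted as Wikipedia-style `<ref name="...">quoted excerpt</ref>` tags; the exact tag and attribute names are an assumption, as are the helper names. A shortened quote is treated as grounded when every fragment around "(…)" appears in the source, in order.

```python
import re

# Assumed citation shape: <ref name="source_1">quoted excerpt</ref>
REF_PATTERN = re.compile(r'<ref name="(?P<name>[^"]+)">(?P<quote>.*?)</ref>', re.DOTALL)
SHORTENING_MARK = "(…)"


def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return (source name, quoted excerpt) pairs found in the answer."""
    return [(m.group("name"), m.group("quote").strip()) for m in REF_PATTERN.finditer(answer)]


def quote_is_grounded(quote: str, source_text: str) -> bool:
    """True if every fragment of the quote occurs in the source, in order."""
    fragments = [part.strip() for part in quote.split(SHORTENING_MARK)]
    fragments = [part for part in fragments if part]
    if not fragments:
        return False
    position = 0
    for fragment in fragments:
        found = source_text.find(fragment, position)
        if found == -1:
            return False
        # Continue searching after this fragment to enforce ordering.
        position = found + len(fragment)
    return True


def check_answer(answer: str, sources: dict[str, str]) -> list[tuple[str, str, bool]]:
    """Check each citation in the answer against the source it names."""
    results = []
    for name, quote in extract_citations(answer):
        source_text = sources.get(name, "")
        results.append((name, quote, quote_is_grounded(quote, source_text)))
    return results
```

Exact substring matching is deliberately strict; a real deployment might normalize whitespace or punctuation before comparing.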

RAG reasoning

Pleias-RAG-350M generates specific reasoning sequences incorporating several proto-agentic abilities for RAG applications. The model is able to make a series of decisions directly:

Assessing whether the query is understandable.
Assessing whether the query is trivial enough to not require a lengthy pre-analysis (adjustable reasoning).
Assessing whether the sources contain enough information to generate a grounded answer.

The structured reasoning trace includes the following steps:

Language detection of the query. The model will always strive to answer in the language of the original query.
Query analysis and associated query report. The analysis can lead to a standard answer, a shortened reasoning trace/answer for trivial questions, a reformulated query or a refusal (which, in the context of the application, could be transformed into a request for further user input).
Source analysis and associated source report. This step evaluates the coverage and depth of the provided sources with regard to the query.
Draft of the final answer.
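An application consuming the reasoning trace could route on these decisions. The following sketch assumes the trace has already been parsed into a small record; the type names, field names and the action labels returned by `next_action` are all hypothetical and describe one possible application policy, not behavior built into the model.

```python
from dataclasses import dataclass
from enum import Enum


class QueryOutcome(Enum):
    """The four query-analysis outcomes described above."""

    STANDARD = "standard"
    TRIVIAL = "trivial"
    REFORMULATED = "reformulated"
    REFUSAL = "refusal"


@dataclass
class ReasoningTrace:
    """Hypothetical parsed form of the model's structured reasoning trace."""

    language: str
    query_outcome: QueryOutcome
    sources_sufficient: bool
    draft: str = ""
    reformulated_query: str = ""


def next_action(trace: ReasoningTrace) -> str:
    """Map a parsed trace to an application-level action (illustrative policy)."""
    if trace.query_outcome is QueryOutcome.REFUSAL:
        # A refusal can be turned into a request for more user input.
        return "ask_user_to_clarify"
    if trace.query_outcome is QueryOutcome.REFORMULATED:
        # Assumed policy: rerun retrieval with the reformulated query.
        return "retrieve_again"
    if not trace.sources_sufficient:
        return "report_insufficient_sources"
    return "show_answer"
```

For example, a trace with `query_outcome=QueryOutcome.STANDARD` and `sources_sufficient=True` maps to `"show_answer"`, while a refusal maps to `"ask_user_to_clarify"` regardless of the sources.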

Multilinguality

Pleias-RAG-350M is able to read and write in the main European languages: French, German, Italian, Spanish and, to a lesser extent, Polish, Latin and Portuguese.

To date, it is the only small language model with negligible loss of performance in leading European languages for RAG-related tasks. On a translated set of HotPotQA, we observed a significant drop in performance for most SLMs, ranging from 10% up to 30-35% for sub-1B models.

We expect the results of any standard English evaluation on Pleias RAG models to be largely transferable to the main European languages, limiting the costs of evaluation and deployment in multilingual settings.

Training

Pleias-RAG-350M is trained on a large synthetic dataset emulating retrieval of a wide variety of multilingual open sources from Common Corpus. The Pleias RAG models provide native support for citation and grounding with literal quotes. Following the latest trends of agentification, the models reintegrate multiple features associated with RAG workflows, such as query routing, query reformulation and source reranking.

Evaluation

Pleias-RAG-350M has been evaluated on three standard RAG benchmarks: 2wiki, HotpotQA and MuSiQue.

All the benchmarks only assess the "trivial" mode on questions requiring some form of multi-hop reasoning over sources (the answer is spread across different sources) as well as discrimination of distractor sources.

Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.
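The complementarity argument can be checked on any benchmark run with simple set arithmetic. This sketch assumes each model's results are available as the set of question identifiers it answered correctly; the function names and the toy identifiers in the usage note are made up for illustration and are not actual benchmark results.

```python
def uniquely_solved(results: dict[str, set[str]], model: str) -> set[str]:
    """Questions answered correctly by `model` and by no other model."""
    others = set().union(*(ids for name, ids in results.items() if name != model))
    return results[model] - others


def oracle_coverage(results: dict[str, set[str]], total_questions: int) -> float:
    """Share of questions solved by at least one model (upper bound for an ensemble)."""
    if total_questions <= 0:
        raise ValueError("total_questions must be positive")
    solved_by_any = set().union(*results.values())
    return len(solved_by_any) / total_questions
```

With toy data such as `{"small": {"q1", "q2", "q5"}, "large_a": {"q1", "q3"}, "large_b": {"q1", "q4"}}`, `uniquely_solved(..., "small")` returns `{"q2", "q5"}`, the questions only the small model solved. A large uniquely-solved set is what makes a multi-model setup worthwhile.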

Deployment

With only 350 million parameters, Pleias-RAG-350M is classified among the phone-sized SLMs, a niche with very few alternatives (SmolLM, Qwen-0.5) and none that currently works well for retrieval-augmented generation.

This is an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are currently acceptable for most uses, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces on 8 GB of RAM and below. Since the model is unquantized, the quality of text generation should be identical to the original model.
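Waiting times depend heavily on the hardware, so it is worth measuring them on the target machine. The helper below is a minimal sketch that times any text-generation callable; `time_generation` is a name invented for this example, and it makes no assumption about the backend.

```python
import time
from typing import Callable


def time_generation(
    generate: Callable[[str], str],
    prompt: str,
    runs: int = 3,
) -> dict[str, float]:
    """Call `generate(prompt)` several times and report wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        durations.append(time.perf_counter() - start)
    return {
        "best": min(durations),
        "mean": sum(durations) / len(durations),
        "worst": max(durations),
    }
```

With llama-cpp-python, the callable could be something like `lambda p: llm(p, max_tokens=512)["choices"][0]["text"]`, where `llm` is a loaded `Llama` instance. The first run often includes warm-up costs, so the best and mean values are usually more representative than the worst.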

Once integrated into a RAG system, Pleias-RAG-350M can also be used in a broader range of non-conversational use cases, including user support or educational assistance. Through this release, we aim to make tiny models workable in production by relying systematically on an externalized memory.