Pleias-RAG-350M-gguf
Pleias-RAG-350M-gguf is a 350-million-parameter Small Reasoning Model trained for retrieval-augmented generation (RAG), search and source summarization, converted to the GGUF format to run on CPUs and other constrained infrastructure. Along with Pleias-RAG-1B, it belongs to the first generation of Pleias specialized reasoning models.
Pleias-RAG-350M outperforms most SLMs (4 billion parameters and below) on standardized benchmarks for retrieval-augmented generation (HotPotQA, 2wiki) and is a highly cost-effective alternative to popular larger models, including Qwen-2.5-7B, Llama-3.1-8B and Gemma-3-4B. It is the only SLM to date to maintain consistent RAG performance across leading European languages and to ensure systematic reference grounding for statements.
Due to its size, ease of deployment on constrained infrastructure (including mobile phones) and built-in support for factual and accurate information, Pleias-RAG-350M unlocks a range of new use cases for generative AI.
This version is unquantized but, thanks to the small size of the original model, it still runs on most CPU infrastructure in reasonable time without any loss of performance.
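To give a sense of why an unquantized model of this size remains practical on CPU, here is a back-of-the-envelope estimate of the memory taken by the weights alone. The bytes-per-weight values are generic data type sizes and an assumption on our part: the text does not state which precision the GGUF file uses, and runtime overhead (context cache, buffers) is not counted.

```python
# Rough weight-memory estimate for an unquantized 350M-parameter model.
# Assumption: the dtype sizes below are generic, not a statement about
# how this particular GGUF file is stored.
PARAMS = 350_000_000

BYTES_PER_WEIGHT = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params: int, bytes_per_weight: int) -> float:
    """Return the approximate size of the weights in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_weight / 1e9


for dtype, size in BYTES_PER_WEIGHT.items():
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, size):.1f} GB")
```

Under either assumption the weights fit comfortably within a few gigabytes of RAM.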
Features
Pleias-RAG-350M is a specialized language model that uses a series of special tokens to process a structured input (query and sources) and generate a structured output (reasoning sequence and answer with sources). For easier implementation, we encourage using the associated API library.
Citation support
Pleias-RAG-350M natively generates grounded answers on the basis of excerpts and citations extracted from the provided sources, using a custom syntax inspired by Wikipedia (). It is one of a handful of open-weight models to date developed with this feature and the first designed for actual deployment.
In contrast with Anthropic's approach (Citation mode), citations are generated entirely by the model and are not the product of external chunking. As a result, we can provide another desirable feature that simplifies source checking: citation shortening for longer excerpts (using "(…)").
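Because citations are literal excerpts, an application can check them mechanically. The sketch below is one possible way to do so, not part of the Pleias tooling: it assumes the quoted text has already been extracted from the answer, and the function names are hypothetical. A shortened quote is split on the "(…)" marker and each fragment must appear in the source, in order.

```python
import re

ELLIPSIS = "(…)"  # shortening marker named in the text


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat matching."""
    return re.sub(r"\s+", " ", text).strip()


def quote_matches_source(quote: str, source: str) -> bool:
    """Check that every fragment of a (possibly shortened) quote occurs
    in the source, in order and without overlapping earlier fragments."""
    haystack = normalize(source)
    fragments = [normalize(part) for part in quote.split(ELLIPSIS)]
    fragments = [part for part in fragments if part]
    if not fragments:
        return False  # an empty quote cannot ground anything
    position = 0
    for fragment in fragments:
        found = haystack.find(fragment, position)
        if found == -1:
            return False
        position = found + len(fragment)
    return True


source = (
    "The Eiffel Tower was completed in 1889. It was built for the "
    "World's Fair and stands 330 metres tall."
)
quote = "The Eiffel Tower was completed in 1889 (…) stands 330 metres tall."
print(quote_matches_source(quote, source))
```

Exact substring matching is deliberately strict; a production system might tolerate differences in punctuation or casing.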
RAG reasoning
Pleias-RAG-350M generates specific reasoning sequences incorporating several proto-agentic abilities for RAG applications. The model is able to make a series of decisions directly:
- Assessing whether the query is understandable.
- Assessing whether the query is trivial enough not to require a lengthy pre-analysis (adjustable reasoning).
- Assessing whether the sources contain enough information to generate a grounded answer.
The structured reasoning trace includes the following steps:
- Language detection of the query. The model will always strive to answer in the language of the original query.
- Query analysis and associated query report. The analysis can lead to a standard answer, a shortened reasoning trace and answer for trivial questions, a reformulated query, or a refusal (which the application could turn into a request for further user input).
- Source analysis and associated source report. This step evaluates the coverage and depth of the provided sources with regard to the query.
- Draft of the final answer.
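On the application side, the four possible results of the query analysis listed above can be mapped to follow-up actions. The sketch below is purely illustrative: the enum, the action names and the mapping are hypothetical, and how the outcome is parsed from the model's output is not shown because the output syntax is not described here.

```python
from enum import Enum


class QueryOutcome(Enum):
    """The four results of query analysis named in the text (labels are hypothetical)."""
    STANDARD_ANSWER = "standard_answer"
    SHORT_ANSWER = "short_answer"
    REFORMULATED_QUERY = "reformulated_query"
    REFUSAL = "refusal"


# Hypothetical application-side actions for each outcome.
ACTIONS = {
    QueryOutcome.STANDARD_ANSWER: "show_answer",
    QueryOutcome.SHORT_ANSWER: "show_answer",
    QueryOutcome.REFORMULATED_QUERY: "retrieve_again",  # rerun retrieval with the new query
    QueryOutcome.REFUSAL: "ask_user",  # turn the refusal into a request for clarification
}


def next_action(outcome: QueryOutcome) -> str:
    """Return the follow-up action the application takes for an outcome."""
    return ACTIONS[outcome]


for outcome in QueryOutcome:
    print(f"{outcome.value} -> {next_action(outcome)}")
```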
Multilinguality
Pleias-RAG-350M is able to read and write in the main European languages: French, German, Italian, Spanish and, to a lesser extent, Polish, Latin and Portuguese.
To date, it is the only small language model with negligible loss of performance in leading European languages for RAG-related tasks. On a translated set of HotPotQA, we observed a significant drop in performance in most SLMs, ranging from 10% to as much as 30-35% for sub-1B models.
We expect the results of any standard English evaluation of Pleias RAG models to be largely transferable to the main European languages, limiting the costs of evaluation and deployment in multilingual settings.
Training
Pleias-RAG-350M is trained on a large synthetic dataset emulating retrieval of a wide variety of multilingual open sources from Common Corpus. The model provides native support for citation and grounding with literal quotes. Following the latest trends in agentification, it reintegrates multiple features associated with RAG workflows, such as query routing, query reformulation and source reranking.
Evaluation
Pleias-RAG-350M has been evaluated on three standard RAG benchmarks: 2wiki, HotpotQA and MuSique.
All the benchmarks only assess the "trivial" mode on questions requiring some form of multi-hop reasoning over sources (the answer is spread across different sources) as well as discrimination of distractor sources.
Pleias-RAG-350M is not simply a cost-effective version of larger models. We found that it correctly answered several hundred questions from HotPotQA that neither Llama-3-8b nor Qwen-2.5-7b could solve. Consequently, we encourage its use as part of multi-model RAG systems.
Deployment
With only 350 million parameters, Pleias-RAG-350M is classified among the phone-sized SLMs, a niche with very few alternatives (Smollm, Qwen-0.5), none of which currently works well for retrieval-augmented generation.
This is an unquantized GGUF version for deployment on CPU. Our internal performance benchmarks suggest that waiting times are currently acceptable for most uses, even under constrained RAM: about 20 seconds for a complex generation including reasoning traces with 8 GB of RAM or less. Since the model is unquantized, the quality of text generation should be identical to that of the original model.
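To check waiting times on your own hardware, a small timing wrapper is enough. The sketch below is a minimal example rather than an official benchmark: `timed_generation` works with any text-generation callable, and `load_llama_generate` shows one way to back it with llama-cpp-python. The file name `350m_test.gguf` is assumed and should be checked against the repository's file list, and the prompt must follow the model's structured input format, which is not reproduced here.

```python
import time
from typing import Callable, Tuple


def timed_generation(generate: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run generate(prompt) and return the text with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return text, elapsed


def load_llama_generate(max_tokens: int = 512) -> Callable[[str], str]:
    """Build a generate() function backed by llama-cpp-python.

    The import is done lazily so the timing helper can be used without it.
    Requires: pip install llama-cpp-python huggingface-hub
    """
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="PleIAs/Pleias-RAG-350M-gguf",
        filename="350m_test.gguf",  # assumed file name, verify before use
    )

    def generate(prompt: str) -> str:
        output = llm(prompt, max_tokens=max_tokens)
        return output["choices"][0]["text"]

    return generate


# Usage (downloads the model on first run):
#   generate = load_llama_generate()
#   text, seconds = timed_generation(generate, my_structured_prompt)
#   print(f"{seconds:.1f} s")
```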
Once integrated into a RAG system, Pleias-RAG-350M can also be used in a broader range of non-conversational use cases, including user support or educational assistance. Through this release, we aim to make tiny models workable in production by relying systematically on an externalized memory.
Model tree for PleIAs/Pleias-RAG-350M-gguf
Base model: PleIAs/Pleias-350m-Preview