Instructions for using distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with libraries and local apps.

Libraries

How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
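
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and the default URL assumes the server started above is listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B"


def build_chat_request(user_text, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, **kwargs):
    """Send the request and return the first choice's message dict."""
    request = build_chat_request(user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]
```

With the server running, `chat("What is the capital of France?")` should return the assistant message. Passing `base_url="http://localhost:30000"` would target an SGLang server on its default port from this page instead.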

Use Docker

docker model run hf.co/distillabs/tft-benchmark-s2-direct-Qwen3-1.7B

SGLang

How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with Docker Model Runner:

```
docker model run hf.co/distillabs/tft-benchmark-s2-direct-Qwen3-1.7B
```

tft-benchmark-s2-direct-Qwen3-1.7B

A Qwen3-1.7B model fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark.

Pipeline: Direct Training
Scenario: S2 Noisy Labels
LLM-as-a-judge score: 0.721
staged_tool_call score: 0.731

For full benchmark details, see our blog post: Why Training on Production Traces Fails (and What to Do Instead)

Benchmark Overview

This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:

TFT Pipeline: trace filtering + committee relabeling + synthetic data generation + fine-tuning
Direct Training: train directly on raw/corrupted traces (no filtering, no relabeling, no synth gen)

Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).

Scenario: S2 Noisy Labels

327 Restaurants_1 traces in which 50% of assistant tool calls are corrupted. The corruption types target tool timing: a service tool call may be swapped, given wrong parameters, or replaced with a respond_to_user call (and vice versa). 52% of the corruptions change the tool choice itself.
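
To make the three corruption types concrete, here is a purely illustrative sketch of how such label noise could be injected into a list of tool calls. It is not the benchmark's actual corruption procedure: the call format (`name` / `arguments` dicts), the function names, the uniform choice between corruption types, and the placeholder values are all assumptions.

```python
import copy
import random

SERVICE_TOOLS = ("FindRestaurants", "ReserveRestaurant")


def corrupt_call(call, rng):
    """Return a corrupted copy of one tool call (illustrative only)."""
    if call["name"] == "respond_to_user":
        # "Vice versa" case: a user-facing reply becomes a service call.
        return {"name": rng.choice(SERVICE_TOOLS), "arguments": {}}
    noisy = copy.deepcopy(call)
    kind = rng.choice(["swap_tool", "wrong_params", "to_respond"])
    if kind == "swap_tool":
        # Swap to the other service tool, keeping the arguments.
        noisy["name"] = next(t for t in SERVICE_TOOLS if t != call["name"])
    elif kind == "wrong_params":
        # Overwrite one argument, or add a bogus one if there are none.
        keys = sorted(noisy["arguments"]) or ["bogus"]
        noisy["arguments"][rng.choice(keys)] = "CORRUPTED"
    else:
        # Replace the service call with a premature reply to the user.
        noisy = {"name": "respond_to_user",
                 "arguments": {"message": "Could you repeat that?"}}
    return noisy


def corrupt_calls(calls, rate=0.5, seed=0):
    """Corrupt a fixed fraction of the calls, chosen at random."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(calls)), round(len(calls) * rate)))
    return [corrupt_call(c, rng) if i in picked else copy.deepcopy(c)
            for i, c in enumerate(calls)]


clean = [
    {"name": "FindRestaurants", "arguments": {"city": "San Jose"}},
    {"name": "respond_to_user", "arguments": {"message": "I found 3 places."}},
    {"name": "ReserveRestaurant", "arguments": {"party_size": 2}},
    {"name": "respond_to_user", "arguments": {"message": "Booked."}},
]
noisy = corrupt_calls(clean)
changed = sum(a != b for a, b in zip(clean, noisy))
```

With `rate=0.5`, half of the calls in `clean` end up altered, mirroring the 50% corruption rate described above.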

Training Details

Trained using Direct Training: the student model is fine-tuned directly on the raw production traces (expanded into per-turn training examples) with no filtering, relabeling, or synthetic data generation.
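
The per-turn expansion mentioned above can be pictured as follows. This is a minimal sketch under an assumed trace format (a list of role/content message dicts); the function name `expand_to_per_turn` and the example messages are hypothetical, and the benchmark's actual preprocessing may differ.

```python
def expand_to_per_turn(conversation):
    """Return one training example per assistant message.

    Each example pairs the assistant message (the target) with every
    message that precedes it in the conversation (the context).
    """
    examples = []
    for index, message in enumerate(conversation):
        if message["role"] == "assistant":
            examples.append({
                "context": conversation[:index],
                "target": message,
            })
    return examples


trace = [
    {"role": "user", "content": "Find me an Italian place in San Jose."},
    {"role": "assistant", "content": "FindRestaurants(cuisine=Italian, city=San Jose)"},
    {"role": "tool", "content": "3 results"},
    {"role": "assistant", "content": "respond_to_user(message=I found 3 places.)"},
]
examples = expand_to_per_turn(trace)
```

Here the four-message trace yields two examples, one for each assistant turn. In Direct Training the targets are used as-is, so any corrupted tool call in a trace becomes a corrupted training label.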

Configuration

Base model: Qwen3-1.7B
Task: multi-turn-tool-calling-closed-book
Teacher / synth gen model: zai.glm-5
Judge model: openai.gpt-oss-120b
Committee (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
Training: LoRA fine-tuning, merged weights

Target Tools

Based on the Schema-Guided Dialogue (SGD) dataset — restaurant search and reservation:

respond_to_user — send text messages to the user
FindRestaurants — search restaurants by cuisine, city, price range, live music, alcohol
ReserveRestaurant — reserve a table (restaurant name, city, time, date, party size)
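
For illustration, the three tools could be written as OpenAI-style function schemas as sketched below. The tool names come from this card; the parameter names, types, and required fields are assumptions based on the descriptions above, and the exact schema the model was trained with is not stated here.

```python
import json


def make_tool(name, description, properties, required):
    """Wrap a parameter spec in the OpenAI-style function-tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Parameter names below are assumed, not taken from the model card.
TOOLS = [
    make_tool(
        "respond_to_user", "Send a text message to the user.",
        {"message": {"type": "string"}},
        ["message"],
    ),
    make_tool(
        "FindRestaurants", "Search for restaurants.",
        {
            "cuisine": {"type": "string"},
            "city": {"type": "string"},
            "price_range": {"type": "string"},
            "has_live_music": {"type": "boolean"},
            "serves_alcohol": {"type": "boolean"},
        },
        ["cuisine", "city"],
    ),
    make_tool(
        "ReserveRestaurant", "Reserve a table at a restaurant.",
        {
            "restaurant_name": {"type": "string"},
            "city": {"type": "string"},
            "time": {"type": "string"},
            "date": {"type": "string"},
            "party_size": {"type": "integer"},
        },
        ["restaurant_name", "city", "time"],
    ),
]

TOOL_NAMES = [tool["function"]["name"] for tool in TOOLS]
TOOLS_JSON = json.dumps(TOOLS, indent=2)
```

A list in this shape can be supplied through the `tools` argument of `tokenizer.apply_chat_template` in Transformers, or as the `tools` field of an OpenAI-compatible chat completions request.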

Full Benchmark Results

| Scenario | TFT | Direct | Delta (TFT - Direct) |
|---|---|---|---|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | 0.844 | 0.721 | +12.3pp |
| S3 Schema Drift | 0.844 | 0.585 | +25.9pp |
| S4 Low Data | 0.852 | 0.649 | +20.3pp |
| S5 Trace Mixing | 0.858 | 0.694 | +16.4pp |

TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.
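
The deltas are TFT minus Direct, expressed in percentage points on the 0-1 judge scale. A short sketch that recomputes them from the reported scores (the variable and function names are illustrative):

```python
# (TFT score, Direct score) per scenario, copied from the results above.
RESULTS = {
    "S1 Baseline": (0.866, 0.864),
    "S2 Noisy Labels": (0.844, 0.721),
    "S3 Schema Drift": (0.844, 0.585),
    "S4 Low Data": (0.852, 0.649),
    "S5 Trace Mixing": (0.858, 0.694),
}


def delta_pp(tft, direct):
    """Difference between two 0-1 scores, in percentage points."""
    return round((tft - direct) * 100, 1)


deltas = {name: delta_pp(*scores) for name, scores in RESULTS.items()}

# Range of the gap across the corrupted scenarios (everything except S1).
corrupted = [value for name, value in deltas.items() if name != "S1 Baseline"]
gap_range = (min(corrupted), max(corrupted))

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}pp")
```

The smallest and largest gaps on the corrupted scenarios are the S2 and S3 rows, which is where the 12-26 percentage point range comes from.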

Model tree for distillabs/tft-benchmark-s2-direct-Qwen3-1.7B

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Finetuned: distillabs/tft-benchmark-s2-direct-Qwen3-1.7B (this model)