- Libraries
- Transformers
How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("distillabs/tft-benchmark-s2-direct-Qwen3-1.7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Apply the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker
```shell
docker model run hf.co/distillabs/tft-benchmark-s2-direct-Qwen3-1.7B
```
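The same OpenAI-compatible endpoint can also be called from Python. The following is a minimal standard-library sketch, assuming a server started as above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the prompt and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` returns the reply text. For a server on a different port, pass the matching `base_url`.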
- SGLang
How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "distillabs/tft-benchmark-s2-direct-Qwen3-1.7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use distillabs/tft-benchmark-s2-direct-Qwen3-1.7B with Docker Model Runner:
```shell
docker model run hf.co/distillabs/tft-benchmark-s2-direct-Qwen3-1.7B
```
tft-benchmark-s2-direct-Qwen3-1.7B
A Qwen3-1.7B model fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark.
- Pipeline: Direct Training
- Scenario: S2 Noisy Labels
- LLM-as-a-judge score: 0.721
- staged_tool_call score: 0.731
For full benchmark details, see our blog post: Why Training on Production Traces Fails (and What to Do Instead)
Benchmark Overview
This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:
- TFT Pipeline: trace filtering + committee relabeling + synthetic data generation + finetuning
- Direct Training: train directly on raw/corrupted traces (no filtering, no relabeling, no synth gen)
Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
Scenario: S2 Noisy Labels
327 Restaurants_1 traces with 50% of assistant tool calls corrupted. Corruption types attack tool timing: service tools may be swapped, have wrong parameters, or be replaced with respond_to_user calls (and vice versa). 52% of corruptions change the tool choice itself.
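To make the corruption types concrete, here is an illustrative sketch of how such label noise could be injected. It is not the benchmark's actual corruption code (see the benchmark repository for that). The function names, the tool-call dict layout, the `message` argument name, and the equal split between corruption types are all assumptions.

```python
import copy
import random

SERVICE_TOOLS = ["FindRestaurants", "ReserveRestaurant"]


def corrupt_call(call, rng):
    """Return a corrupted copy of one tool call (hypothetical corruption rules)."""
    if call["name"] == "respond_to_user":
        # "Vice versa" case: a text reply is replaced by a service call.
        return {"name": rng.choice(SERVICE_TOOLS), "arguments": {}}

    call = copy.deepcopy(call)
    kind = rng.choice(["swap_tool", "wrong_params", "replace"])
    if kind == "swap_tool":
        # Swap to the other service tool, keeping the arguments.
        call["name"] = next(t for t in SERVICE_TOOLS if t != call["name"])
    elif kind == "wrong_params" and call["arguments"]:
        # Overwrite one argument value with a wrong one.
        key = rng.choice(sorted(call["arguments"]))
        call["arguments"][key] = "CORRUPTED"
    else:
        # Replace the service call with a text reply.
        call = {"name": "respond_to_user", "arguments": {"message": "..."}}
    return call


def corrupt_calls(calls, rate, rng):
    """Corrupt each call independently with probability `rate`."""
    return [
        corrupt_call(c, rng) if rng.random() < rate else copy.deepcopy(c)
        for c in calls
    ]
```

Calling `corrupt_calls(calls, 0.5, random.Random(0))` would corrupt roughly half of the calls, mirroring the 50% rate described above.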
Training Details
Trained using Direct Training: the student model is fine-tuned directly on the raw production traces (expanded into per-turn training examples) with no filtering, relabeling, or synthetic data generation.
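The per-turn expansion can be pictured as follows. This is a minimal sketch, assuming each trace is a list of chat messages and that every assistant turn becomes one training example whose input is the preceding conversation. The message layout and the `expand_trace` name are illustrative, not taken from the benchmark code.

```python
from typing import Any


def expand_trace(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn one multi-turn trace into one example per assistant turn."""
    examples = []
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            # Input: everything before this turn. Target: the assistant's tool call.
            examples.append({"context": messages[:i], "target": message})
    return examples


# Hypothetical two-turn trace
trace = [
    {"role": "user", "content": "Find me an Italian place in San Jose."},
    {"role": "assistant", "tool_call": {
        "name": "FindRestaurants",
        "arguments": {"cuisine": "Italian", "city": "San Jose"},
    }},
    {"role": "tool", "content": "[search results]"},
    {"role": "assistant", "tool_call": {
        "name": "respond_to_user",
        "arguments": {"message": "I found a few options."},
    }},
]

examples = expand_trace(trace)
```

Because the targets are copied as-is, any corrupted tool call in a trace becomes a corrupted training label, which is the failure mode this scenario measures.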
Configuration
- Base model: Qwen3-1.7B
- Task: multi-turn-tool-calling-closed-book
- Teacher / synth gen model: zai.glm-5
- Judge model: openai.gpt-oss-120b
- Committee (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- Training: LoRA fine-tuning, merged weights
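For reference, "merged weights" means the low-rank LoRA update has been folded into the base weights, so no adapter is needed at inference time. In the standard LoRA formulation, with rank $r$ and scaling factor $\alpha$ (the values used for this model are not stated in this card), each adapted weight matrix becomes:

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
```

Here $W_0 \in \mathbb{R}^{d \times k}$ is the frozen base weight, and $A$ and $B$ are the trained low-rank factors.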
Target Tools
Based on the Schema-Guided Dialogue (SGD) dataset — restaurant search and reservation:
- `respond_to_user`: send text messages to the user
- `FindRestaurants`: search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant`: reserve a table (restaurant name, city, time, date, party size)
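As a rough illustration, the three tools could be described in an OpenAI-style function-calling schema like the one below. The parameter names, types, and required fields are assumptions modelled on the descriptions above, not the exact schema used in training. `check_tool_call` is a hypothetical helper for sanity-checking model output.

```python
def _tool(name, description, properties, required):
    """Build one OpenAI-style function tool entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in properties},
                "required": required,
            },
        },
    }


# Assumed parameter names; check the benchmark repository for the real schema.
TOOLS = [
    _tool("respond_to_user", "Send a text message to the user.",
          ["message"], ["message"]),
    _tool("FindRestaurants", "Search restaurants.",
          ["cuisine", "city", "price_range", "has_live_music", "serves_alcohol"],
          ["cuisine", "city"]),
    _tool("ReserveRestaurant", "Reserve a table.",
          ["restaurant_name", "city", "time", "date", "party_size"],
          ["restaurant_name", "city", "time"]),
]


def check_tool_call(call):
    """Return a list of problems with a proposed tool call (empty if none)."""
    specs = {t["function"]["name"]: t["function"]["parameters"] for t in TOOLS}
    if call["name"] not in specs:
        return [f"unknown tool: {call['name']}"]
    spec = specs[call["name"]]
    problems = [f"unexpected argument: {k}"
                for k in call["arguments"] if k not in spec["properties"]]
    problems += [f"missing argument: {k}"
                 for k in spec["required"] if k not in call["arguments"]]
    return problems
```

A list like `TOOLS` is the shape that chat-completion APIs typically accept in a `tools` field.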
Full Benchmark Results
| Scenario | TFT | Direct | Delta |
|---|---|---|---|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | 0.844 | 0.721 | +12.3pp |
| S3 Schema Drift | 0.844 | 0.585 | +25.9pp |
| S4 Low Data | 0.852 | 0.649 | +20.3pp |
| S5 Trace Mixing | 0.858 | 0.694 | +16.4pp |
TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.
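The Delta column is the TFT score minus the Direct score, expressed in percentage points. A short sketch that recomputes it from the table values:

```python
# (TFT score, Direct score) per scenario, copied from the table above
results = {
    "S1 Baseline": (0.866, 0.864),
    "S2 Noisy Labels": (0.844, 0.721),
    "S3 Schema Drift": (0.844, 0.585),
    "S4 Low Data": (0.852, 0.649),
    "S5 Trace Mixing": (0.858, 0.694),
}


def delta_pp(tft, direct):
    """Difference between two 0-1 scores, in percentage points."""
    return round((tft - direct) * 100, 1)


deltas = {name: delta_pp(*scores) for name, scores in results.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta}pp")
```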
Links
- Blog post: Why Training on Production Traces Fails (and What to Do Instead)
- Benchmark data & code: https://github.com/distil-labs/distil-tft-benchmarking
- Dataset: Schema-Guided Dialogue (SGD)