Instructions to use leeminwaan/qwen_3_4B_ViO_LR with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use leeminwaan/qwen_3_4B_ViO_LR with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="leeminwaan/qwen_3_4B_ViO_LR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("leeminwaan/qwen_3_4B_ViO_LR")
model = AutoModelForCausalLM.from_pretrained("leeminwaan/qwen_3_4B_ViO_LR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use leeminwaan/qwen_3_4B_ViO_LR with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "leeminwaan/qwen_3_4B_ViO_LR"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leeminwaan/qwen_3_4B_ViO_LR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
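
The same OpenAI-compatible call can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the default base URL assumes the vLLM server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="leeminwaan/qwen_3_4B_ViO_LR"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the reply text. For the SGLang server described further down, pass `base_url="http://localhost:30000"`.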

Use Docker

docker model run hf.co/leeminwaan/qwen_3_4B_ViO_LR

SGLang

How to use leeminwaan/qwen_3_4B_ViO_LR with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "leeminwaan/qwen_3_4B_ViO_LR" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leeminwaan/qwen_3_4B_ViO_LR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "leeminwaan/qwen_3_4B_ViO_LR" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leeminwaan/qwen_3_4B_ViO_LR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use leeminwaan/qwen_3_4B_ViO_LR with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeminwaan/qwen_3_4B_ViO_LR to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeminwaan/qwen_3_4B_ViO_LR to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for leeminwaan/qwen_3_4B_ViO_LR to start chatting

Load model with FastModel

# Install Unsloth first (shell): pip install unsloth
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="leeminwaan/qwen_3_4B_ViO_LR",
    max_seq_length=2048,
)

Docker Model Runner
How to use leeminwaan/qwen_3_4B_ViO_LR with Docker Model Runner:
```
docker model run hf.co/leeminwaan/qwen_3_4B_ViO_LR
```

(I was too lazy to rename the repo =)

Qwen-3.5-4B-ViO-LR-LoRA

Developed by: leeminwaan
Finetuned from model: unsloth/Qwen3.5-4B
Methodology: Latent Regularization (LR) via Contextual Representation Alignment.
Project Designation: ViO (Vague is Objective)

This model uses a non-standard training objective designed to mitigate Sequential Semantic Drift, a failure mode in smaller models where autoregressive generation progressively diverges from the input context.

Methodology: Latent Regularization (LR)

The standard Cross-Entropy (CE) loss was augmented with an Auxiliary Contextual Penalty. During training, the model's internal representations were constrained using a custom objective function targeting the latent space.

Technical Implementation

Contextual Reference Vector: A reference anchor is derived by mean-pooling the penultimate-layer hidden states over the input prompt tokens.
Temporal Latent Smoothing: To isolate semantic signals from syntactic fluctuations, a 1D average-pooling filter (window size: 8) is applied across the sequence dimension of the generated hidden states.
Cosine Margin Constraint: A hinge-style regularization loss penalizes the Smoothed Latent States whenever their cosine similarity to the Reference Vector falls below the margin $\tau$.
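
The three steps above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the training code used for this model: the function name `latent_regularization_loss`, the default `tau=0.5`, the pooling stride of 1, and the absence of padding-mask handling are all assumptions. The inputs are assumed to be penultimate-layer hidden states (for a Transformers model called with `output_hidden_states=True`, this would typically be `hidden_states[-2]`), already split into prompt and generated positions.

```python
import torch
import torch.nn.functional as F


def latent_regularization_loss(prompt_hidden, gen_hidden, tau=0.5, window=8):
    """Auxiliary penalty sketch.

    prompt_hidden: (batch, prompt_len, dim) hidden states of the prompt tokens
    gen_hidden:    (batch, gen_len, dim) hidden states of the generated tokens,
                   with gen_len >= window
    tau:           similarity margin (hypothetical default)
    window:        smoothing window size (8 in the model card)
    """
    # 1. Contextual reference vector: mean-pool over prompt positions.
    z_ref = prompt_hidden.mean(dim=1)  # (batch, dim)

    # 2. Temporal smoothing: avg_pool1d expects (batch, channels, length),
    #    so the hidden dimension is moved to the channel axis and back.
    smoothed = F.avg_pool1d(gen_hidden.transpose(1, 2), kernel_size=window, stride=1)
    smoothed = smoothed.transpose(1, 2)  # (batch, gen_len - window + 1, dim)

    # 3. Cosine margin constraint: penalize only similarity below tau.
    cos = F.cosine_similarity(smoothed, z_ref.unsqueeze(1), dim=-1)
    return torch.clamp(tau - cos, min=0.0).mean()
```

In a training step the result would be added to the usual language-modeling loss, for example `loss = ce_loss + lam * latent_regularization_loss(prompt_hidden, gen_hidden)`, where `lam` is a hypothetical name for the weight $\lambda$.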

Objective Function:

$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda \cdot \mathbb{E} \left[ \max(0, \tau - \cos(z_{ref}, \Phi(H_{gen}))) \right]$$

where $\Phi$ denotes the 1D temporal filter, $\tau$ the similarity margin, $z_{ref}$ the contextual reference vector, $H_{gen}$ the generated hidden states, and $\lambda$ the weight of the auxiliary penalty.
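
As a worked illustration of the hinge term, take hypothetical values $\tau = 0.5$, $\lambda = 0.1$, and $\mathcal{L}_{CE} = 2.0$, with two smoothed positions whose cosine similarities to $z_{ref}$ are $0.8$ and $0.3$. None of these numbers come from the actual training run; they only show how the terms combine.

$$
\begin{aligned}
\max(0,\ 0.5 - 0.8) &= 0 \\
\max(0,\ 0.5 - 0.3) &= 0.2 \\
\mathbb{E}\left[\max(0, \tau - \cos)\right] &= \tfrac{1}{2}(0 + 0.2) = 0.1 \\
\mathcal{L}_{total} &= 2.0 + 0.1 \cdot 0.1 = 2.01
\end{aligned}
$$

Positions that already sit above the margin contribute nothing, so the penalty acts only on states that have drifted away from the reference.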

Research Findings: Semantic Compression (SC³)

During the fine-tuning process, a novel behavioral pattern emerged, termed Semantic Compression via Cosine Constraint (SC³). When subjected to latent alignment penalties, the model converged toward a specific mathematical optimum.

Stochastic Convergence to Semantic Centroids

The model demonstrates a preference for semantically central tokens (high-level abstractions) over high-variance, specific tokens. In high-dimensional latent space, specific terms exist as outliers with high directional variance. Vague tokens function as "centroids": they maintain higher cosine similarity across a broader range of the manifold, and so minimize the regularization penalty while still satisfying the Cross-Entropy objective.
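
The geometric intuition can be checked with a toy example. The sketch below uses random unit vectors as hypothetical stand-ins for "specific" token directions; it does not use the model's actual embeddings. It shows only that the normalized mean of a set of directions has a higher average cosine similarity to that set than any single member has to the others.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical "specific" directions: 64 random unit vectors in 256 dimensions.
specific = F.normalize(torch.randn(64, 256), dim=-1)

# The "centroid" direction: the normalized mean of those vectors.
centroid = F.normalize(specific.mean(dim=0), dim=-1)

# Average cosine similarity of the centroid to every specific vector.
centroid_sim = (specific @ centroid).mean()

# Average cosine similarity of each specific vector to all the others
# (the self-similarity on the diagonal is excluded).
pairwise = specific @ specific.T
n = specific.shape[0]
mean_to_others = (pairwise.sum(dim=1) - pairwise.diagonal()) / (n - 1)

print(f"centroid vs. set:        {centroid_sim.item():.3f}")
print(f"best single vector vs. others: {mean_to_others.max().item():.3f}")
```

Under a cosine-margin penalty, a direction near the centroid is therefore the cheapest place to stay, which is consistent with the preference for vague tokens described above.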

Emergent Latent Variable Creation

The model demonstrates an ability to perform Symbolic Shorthand. It frequently maps complex, multi-token structures from the prompt to abstract internal variables (e.g., utilizing "expression" or "result" as persistent semantic pointers). This reduces the cumulative "latent walk" required to maintain logical continuity, leading to:

Reduced Sequence Entropy: Highly focused reasoning chains.
Implicit Abstraction: Automatic categorization of specific data into abstract classes to maintain proximity to the similarity margin.

Precision-Stability Equilibrium

Observed training dynamics indicate a marginal increase in Cross-Entropy loss in exchange for a significant reduction in Auxiliary Loss. The model optimizes for Stability over Precision.

Usage

This model is optimized for long-form reasoning where structural consistency and contextual anchoring are prioritized.

from unsloth import FastLanguageModel

# Load the model in 4-bit to reduce GPU memory use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "leeminwaan/qwen_3_4B_ViO_LR_lora",
    load_in_4bit = True,
)

# Switch Unsloth to inference mode
FastLanguageModel.for_inference(model)

# Standard inference
messages = [
    {"role": "user", "content": "Analyze the following expression and solve for x: (x + 2)^2 = 0"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))