HinoMoto-100M-v4
An experiment in stepping onto the same playing field as world-class research labs, armed with a personal GPU, ideas, and implementation speed.
Model Description
HinoMoto-100M-v4 is a fully self-built Japanese LM: architecture, tokenizer, weights, corpus, and benchmark are all developed in-house. It is the series' aggregate SOTA model.
- Developed by: Project HinoMoto (solo developer)
- Architecture: Llama-style decoder-only (12 layers, 8 heads, d_model 512, RoPE base 10000)
- Parameters: 43.4M trainable (excl. embed) / 100M total
- Vocabulary: 9,506 (byte-level BPE, custom-trained on balanced corpus)
- Context length: 512 (training) / extensible via RoPE
- License: Apache-2.0
- Date: 2026-04-26
- Repository: https://github.com/FIshota/hinomoto-model
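For orientation, the architecture hyperparameters above are collected into a plain dict below; the key names are illustrative and may not match the repo's config.json / HinoMotoConfig fields.

```python
# Illustrative summary of the architecture spec above (key names assumed).
arch = {
    "n_layers": 12,
    "n_heads": 8,          # head dim = 512 / 8 = 64
    "d_model": 512,
    "rope_base": 10000,
    "vocab_size": 9506,
    "max_seq_len": 512,    # training context; extensible via RoPE
}
```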
Performance (HinoMotoBench-ja v0.5)
| Axis | Score | vs base 7B GGUF |
|---|---|---|
| family (n=110) | **6.36/12 (53%)** | base: 0% |
| keigo (n=70) | 16% | base: - |
| silence (n=50) | 16% | base: - |
| degenerate (family) | 0% | base: 96% |
The 100M model, trained in 26 minutes on an RTX 3090, outperforms the base 7B GGUF on every axis.
Statistical significance (paired t-test)
- v7 vs v4: family p=0.61 (n.s.), statistically equivalent
- v7 vs v6: silence p<0.05, significant improvement
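As a minimal sketch of how such p-values are computed, assuming one score per benchmark question for each model version (the lists below are placeholders, not the actual HinoMotoBench-ja scores):

```python
# Paired t-test over per-question scores of two checkpoints on the same items.
from scipy.stats import ttest_rel

scores_a = [7, 6, 8, 5, 6, 7]   # e.g. v7, one score per question
scores_b = [6, 6, 7, 5, 7, 6]   # e.g. v4, same questions in the same order
t, p = ttest_rel(scores_a, scores_b)
print(f"t = {t:.2f}, p = {p:.3f}")   # p < 0.05 -> significant difference
```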
Training
| Item | Value |
|---|---|
| Hardware | 1× RTX 3090 (24GB VRAM) |
| Wallclock | 13 min |
| Throughput | 12,200 tok/s (fp32 + seq 512) |
| Optimizer | AdamW (β1=0.9, β2=0.95), weight decay 0.1, grad clip 1.0 |
| LR Schedule | warmup 400 steps, cosine decay (min ratio 0.1), peak LR 3e-4 |
| Batch | 2 × grad_accum 4 = 8 sequences × 512 tokens = 4096 toks/step |
| Steps | 10,000 |
| Total tokens | ~41M |
| Stability | EMA decay 0.999, Loss spike detector (ratio 4.0, floor 1.5) |
| Seed | 0 |
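A minimal PyTorch sketch of the optimizer and LR schedule described in the table (linear warmup for 400 steps, cosine decay to 10% of the 3e-4 peak over 10,000 steps); the helper below is reconstructed from the table, not taken from the HinoMoto codebase:

```python
import math
import torch

def lr_ratio(step, warmup=400, total=10_000, min_ratio=0.1):
    # Linear warmup, then cosine decay down to min_ratio * peak LR.
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(512, 512)   # stand-in for the actual model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_ratio)
# Per optimizer step: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0);
# opt.step(); sched.step(); opt.zero_grad()
```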
Training Data
| Source | License | Share | Size |
|---|---|---|---|
| 国会会議録 (Diet records, 1995-2024) | Public domain | 53% | 39 MB |
| 青空文庫 (Aozora Bunko, 610 works) | Copyright expired | 25% | 19 MB |
| Dolly-15k-ja + in-house family conversations | CC-BY-SA | 21% | 16 MB |
| Total (balanced corpus v6) | | | 75 MB |
Tokenizer: tokenizer_v3_32k_clean.json (vocab 9,506, byte-level BPE).
Decontamination report: 8-gram overlap with HinoMotoBench-ja questions averages 12-16% (overlap from formal set phrases is unavoidable).
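A rough sketch of what such an 8-gram overlap check can look like (character 8-grams here; the report's exact tokenization may differ):

```python
def char_ngrams(text, n=8):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(question, corpus_text, n=8):
    # Fraction of the question's 8-grams that also occur in the corpus text.
    q = char_ngrams(question, n)
    return len(q & char_ngrams(corpus_text, n)) / max(1, len(q))

# Formal set phrases (common in Diet records) overlap almost completely,
# which is why some benchmark overlap is unavoidable.
print(overlap_ratio("本日の会議を開会いたします。", "それでは、本日の会議を開会いたします。"))
```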
Usage
Quick inference
import torch, json
from huggingface_hub import hf_hub_download
# Download weights + config + tokenizer
repo = "FiShota/hinomoto-100m-v4"
ckpt_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo, filename="config.json")
tok_path = hf_hub_download(repo_id=repo, filename="tokenizer.json")
# Need the HinoMoto codebase for model class
# git clone https://github.com/FIshota/hinomoto-model && cd hinomoto-model && pip install -e .
from hinomoto.tokenizer import ByteBPETokenizer
from hinomoto.model.hinomoto_model import HinoMotoConfig, HinoMotoModel
from hinomoto.infer.generate import generate_ids
# Load
cfg = HinoMotoConfig(**json.load(open(config_path)))
model = HinoMotoModel(cfg).to("cuda")
model.load_state_dict(torch.load(ckpt_path, weights_only=False))
model.eval()
tok = ByteBPETokenizer.load(tok_path)
# Generate
prompt = "今日もいい天気"
ids = tok.encode(prompt, add_bos=True)
inp = torch.tensor([ids], dtype=torch.long, device="cuda")
with torch.no_grad():
    out_ids = generate_ids(model, inp, max_new_tokens=80,
                           temperature=0.7, top_p=0.9)
print(tok.decode(out_ids[0].tolist()))
Recommended generation params
- temperature: 0.7 (family conversation) / 0.5 (keigo) / 0.3 (factual)
- top_p: 0.9
- max_new_tokens: 80 (recommended), upper limit 256
- eos_boost: 2.0 (biases EOS at sentence endings)
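The eos_boost value is interpreted below as an additive bias on the EOS logit once the generated text already ends in a sentence-final character; this is an assumption about the mechanism, not the repo's verified implementation:

```python
def boost_eos_logit(logits, eos_id, generated_text, boost=2.0):
    # Hypothetical sketch: nudge the model toward emitting EOS once a
    # sentence has visibly ended, so short answers stop cleanly.
    if generated_text.endswith(("。", "?", "!")):
        logits = logits.clone()
        logits[eos_id] += boost
    return logits
```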
Stop-token cleaning (recommended post-processing)
import re
SENTENCE_END = re.compile(r"[。?!\n]")
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")
def clean(text, max_sentences=2):
    # 1. Loop detection (repeated n-gram): truncate at the first repetition
    m = LOOP_RE.search(text)
    if m:
        text = text[:m.start() + len(m.group(1))]
    # 2. Truncate at sentence boundaries, keeping at most max_sentences
    parts = []
    last = 0
    for m in SENTENCE_END.finditer(text):
        parts.append(text[last:m.end()])
        last = m.end()
        if len(parts) >= max_sentences:
            break
    return "".join(parts).rstrip() if parts else text
→ family score +3 pt, degenerate outputs 5% → 0% (paired t-test, p=0.021)
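For illustration, applying clean() to a made-up degenerate output:

```python
raw = "今日はいい天気ですね。今日はいい天気ですね。今日はいい天気ですね。散歩に行きましょう。"
print(clean(raw))   # -> "今日はいい天気ですね。" (loop truncated, then sentence-limited)
```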
Limitations
- Scale: 100M is a research smoke model, NOT production
- Domain bias: Diet record corpus introduces formal/political vocabulary
- No alignment: No DPO/RLHF; raw pretraining outputs
- Not safety-tuned: Outputs may include unfiltered language
- Single seed (0): Variance not reported, statistical significance limited
Bias and Risks
- Diet records: predominantly male, formal, politically diverse but Japan-centric
- Aozora skews toward Meiji-Showa era (modern usage gaps)
- SFT family data ~500 samples (intimate context less reliable)
Security
Model audit (2026-04-26):
- 🟢 Adversarial inputs: 7/7 survived
- 🟢 PII leakage: 0/6 prompts
- 🟢 Toxicity: 0/6 prompts
- 🟢 Memorization: 0% (excluding SFT system prompt template, see audit doc)
- 🟢 Tokenizer fuzz: 5/5 lossless
- 🟢 Resource exhaustion: 0.4s for 200 tokens
Full report: SECURITY_AUDIT_MODEL_v3.md
Citation
@misc{hinomoto2026v7,
title = {HinoMoto-100M-v4: A Solo-Built Japanese Family-Conversation LM},
author = {{Project HinoMoto}},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/FiShota/hinomoto-100m-v4},
}
Acknowledgements
- National Diet Library (Diet records API, public domain)
- Aozora Bunko volunteers
- kunishou/databricks-dolly-15k-ja
- HuggingFace ecosystem
- PyTorch SDPA flash attention
Related Models
- HinoMoto-100M-v4: Baseline 10k step, family 53%/keigo 17%. https://huggingface.co/FiShota/hinomoto-100m-v4
- HinoMoto-Family-3B v2: QLoRA SFT on Sarashina2.2-3B, family 67%/keigo 32% (separate repo, TBA)
Changelog
- v0.2.0 (2026-04-26): Initial public release. v4 baseline (see v7 for the aggregate SOTA) + stability infrastructure (EMA, spike detector) + security audit.