HinoMoto-100M-v4

An experiment in stepping onto the same playing field as world-class research labs with nothing but a personal GPU, ideas, and implementation speed.


Model Description

HinoMoto-100M-v4 is a fully self-built Japanese LM (architecture, tokenizer, weights, corpus, and benchmark all developed in-house), the project's consolidated SOTA model.

  • Developed by: Project HinoMoto (solo developer)
  • Architecture: Llama-style decoder-only (12 layers, 8 heads, d_model 512, RoPE base 10000)
  • Parameters: 43.4M trainable (excl. embed) / 100M total
  • Vocabulary: 9,506 (byte-level BPE, custom-trained on balanced corpus)
  • Context length: 512 (training) / extensible via RoPE (see the sketch after this list)
  • License: Apache-2.0
  • Date: 2026-04-26
  • Repository: https://github.com/FIshota/hinomoto-model
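
For reference, the standard RoPE formulation behind the base-10000 setting above; a minimal sketch of the usual math, not HinoMoto's exact code:

import torch

def rope_freqs(d_head: int, base: float = 10000.0) -> torch.Tensor:
    # Per-pair rotation frequencies: theta_i = base^(-2i / d_head)
    return base ** (-torch.arange(0, d_head, 2).float() / d_head)

def rope_angles(seq_len: int, d_head: int, base: float = 10000.0) -> torch.Tensor:
    # angle[p, i] = p * theta_i; context extension typically rescales
    # the positions or the base rather than retraining from scratch
    pos = torch.arange(seq_len).float()
    return torch.outer(pos, rope_freqs(d_head, base))

d_head = 512 // 8          # d_model / heads = 64
angles = rope_angles(512, d_head)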

Performance (HinoMotoBench-ja v0.5)

| Axis | Score | Base 7B GGUF |
|------|-------|--------------|
| family (n=110) | 6.36/12 (53%) | 0% |
| keigo (n=70) | 16% | - |
| silence (n=50) | 16% | - |
| degenerate (family, lower is better) | 0% | 96% |

A 100M model trained in 26 minutes on an RTX 3090 outperforms the base 7B GGUF on every axis.

Statistical significance (paired t-test)

  • v7 vs v4: family p=0.61 (n.s.), comparable
  • v7 vs v6: silence p<0.05, a significant improvement (test sketched below)
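
As a rough illustration of the test (with dummy scores standing in for the unpublished per-question results):

import numpy as np
from scipy import stats

# Dummy per-item scores; the real per-question results are not published here.
rng = np.random.default_rng(0)
scores_v7 = rng.normal(6.4, 1.0, size=110)
scores_v4 = rng.normal(6.3, 1.0, size=110)

# Paired test: the same 110 benchmark items scored under both versions.
t, p = stats.ttest_rel(scores_v7, scores_v4)
print(f"t={t:.2f}, p={p:.3f}")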

Training

| Item | Value |
|------|-------|
| Hardware | 1× RTX 3090 (24 GB VRAM) |
| Wallclock | 13 min |
| Throughput | 12,200 tok/s (fp32, seq 512) |
| Optimizer | AdamW (β=0.9, 0.95), weight decay 0.1, grad clip 1.0 |
| LR schedule | warmup 400 steps, cosine decay (min ratio 0.1), peak 3e-4 |
| Batch | 2 × grad_accum 4 = 8 sequences × 512 tokens = 4,096 tokens/step |
| Steps | 10,000 |
| Total tokens | ~41M |
| Stability | EMA decay 0.999, loss-spike detector (ratio 4.0, floor 1.5) |
| Seed | 0 |
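
A minimal sketch of the LR schedule and loss-spike detector above. The hyperparameters come from the table; the function shapes, and exactly how ratio and floor combine, are assumptions rather than the actual HinoMoto code.

import math

PEAK_LR = 3e-4
WARMUP_STEPS = 400
TOTAL_STEPS = 10_000
MIN_RATIO = 0.1  # decay floor: LR never drops below 0.1 * peak

def lr_at(step: int) -> float:
    # Linear warmup to the peak over the first 400 steps
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from peak down to MIN_RATIO * peak
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_RATIO + (1.0 - MIN_RATIO) * cosine)

def is_loss_spike(loss: float, ema_loss: float,
                  ratio: float = 4.0, floor: float = 1.5) -> bool:
    # Flag a step whose loss jumps well above the EMA (decay 0.999);
    # the floor suppresses false positives while the EMA is still high.
    return loss > max(floor, ratio * ema_loss)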

Training Data

| Source | License | Share | Size |
|--------|---------|-------|------|
| 国会会議録 (Diet records, 1995-2024) | Public domain | 53% | 39 MB |
| 青空文庫 (Aozora Bunko, 610 works) | Copyright expired | 25% | 19 MB |
| Dolly-15k-ja + in-house family conversations | CC-BY-SA | 21% | 16 MB |
| Total (balanced corpus v6) | | | 75 MB |

Tokenizer: tokenizer_v3_32k_clean.json (vocab 9,506, byte-level BPE).

Decontamination report: 8-gram overlap with HinoMotoBench-ja questions averages 12-16% (some overlap from common formal phrases is unavoidable).
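
A minimal sketch of such an overlap check, assuming character-level 8-grams (the report does not state whether the n-grams are character- or token-level):

def char_ngrams(text: str, n: int = 8) -> set[str]:
    # All character-level n-grams of the text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(question: str, corpus_ngrams: set[str]) -> float:
    # Fraction of the question's 8-grams that also appear in the corpus
    grams = char_ngrams(question)
    return len(grams & corpus_ngrams) / len(grams) if grams else 0.0

# Usage: build corpus_ngrams once over the training text, then average
# overlap_ratio across all benchmark questions.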

Usage

Quick inference

import torch, json
from huggingface_hub import hf_hub_download

# Download weights + config + tokenizer
repo = "FiShota/hinomoto-100m-v4"
ckpt_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo, filename="config.json")
tok_path = hf_hub_download(repo_id=repo, filename="tokenizer.json")

# Need the HinoMoto codebase for model class
# git clone https://github.com/FIshota/hinomoto-model && cd hinomoto-model && pip install -e .
from hinomoto.tokenizer import ByteBPETokenizer
from hinomoto.model.hinomoto_model import HinoMotoConfig, HinoMotoModel
from hinomoto.infer.generate import generate_ids

# Load
with open(config_path) as f:
    cfg = HinoMotoConfig(**json.load(f))
model = HinoMotoModel(cfg).to("cuda")
# weights_only=False can execute pickled code; only load checkpoints you trust
model.load_state_dict(torch.load(ckpt_path, map_location="cuda", weights_only=False))
model.eval()
tok = ByteBPETokenizer.load(tok_path)

# Generate
prompt = "今日もいい天気"
ids = tok.encode(prompt, add_bos=True)
inp = torch.tensor([ids], dtype=torch.long, device="cuda")
with torch.no_grad():
    out_ids = generate_ids(model, inp, max_new_tokens=80,
                           temperature=0.7, top_p=0.9)
print(tok.decode(out_ids[0].tolist()))

Recommended generation params

  • temperature: 0.7 (family conversation) / 0.5 (keigo) / 0.3 (factual)
  • top_p: 0.9
  • max_new_tokens: 80 (recommended), 256 max
  • eos_boost: 2.0 (biases EOS at sentence-final positions; see the sketch after this list)
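
The eos_boost knob does not appear in the generate_ids call shown above; as a minimal sketch of what such a bias can look like (an assumption, not the actual HinoMoto implementation), one can add log(boost) to the EOS logit whenever the decoded text sits at a sentence boundary:

import math
import torch

def apply_eos_boost(logits: torch.Tensor, eos_id: int,
                    text_so_far: str, boost: float = 2.0) -> torch.Tensor:
    # Multiply the EOS probability by `boost` (add log(boost) to its logit)
    # whenever the generated text currently ends a sentence.
    if text_so_far.endswith(("。", "?", "!", "\n")):
        logits = logits.clone()
        logits[..., eos_id] += math.log(boost)
    return logits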

Stop-token cleaning (recommended post-processing)

import re
SENTENCE_END = re.compile(r"[。?!\n]")
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")

def clean(text, max_sentences=2):
    # 1. Loop detection (repeated n-gram): keep one copy, drop the repeats
    m = LOOP_RE.search(text)
    if m:
        text = text[:m.start() + len(m.group(1))]
    # 2. Truncate at sentence boundaries (keep at most max_sentences)
    parts = []
    last = 0
    for m in SENTENCE_END.finditer(text):
        parts.append(text[last:m.end()])
        last = m.end()
        if len(parts) >= max_sentences:
            break
    return "".join(parts).rstrip() if parts else text

→ family score +3 pt, degenerate 5% → 0% (paired t-test, p=0.021); a usage example follows.
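
For example, applied to the output of the quick-inference snippet above:

text = tok.decode(out_ids[0].tolist())
print(clean(text, max_sentences=2))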

Limitations

  • Scale: 100M parameters is a research smoke-test model, NOT production-grade
  • Domain bias: Diet record corpus introduces formal/political vocabulary
  • No alignment: No DPO/RLHF; raw pretraining outputs
  • Not safety-tuned: Outputs may include unfiltered language
  • Single seed (0): run-to-run variance is not reported, so statistical significance is limited

Bias and Risks

  • Diet records: predominantly male, formal, politically diverse but Japan-centric
  • Aozora Bunko skews toward Meiji-Showa era texts (gaps in modern usage)
  • The SFT family data is only ~500 samples (intimate-context outputs are less reliable)

Security

Model audit (2026-04-26):

  • 🟢 Adversarial inputs: 7/7 survived
  • 🟢 PII leakage: 0/6 prompts
  • 🟢 Toxicity: 0/6 prompts
  • 🟢 Memorization: 0% (excluding SFT system prompt template, see audit doc)
  • 🟢 Tokenizer fuzz: 5/5 lossless
  • 🟢 Resource exhaustion: 0.4s for 200 tokens

Full report: SECURITY_AUDIT_MODEL_v3.md

Citation

@misc{hinomoto2026v4,
  title  = {HinoMoto-100M-v4: A Solo-Built Japanese Family-Conversation LM},
  author = {{Project HinoMoto}},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/FiShota/hinomoto-100m-v4},
}

Acknowledgements

  • 国立国会図書館 (Diet records API, public domain)
  • 青空文庫 volunteers
  • kunishou/databricks-dolly-15k-ja
  • HuggingFace ecosystem
  • PyTorch SDPA flash attention

Changelog

  • v0.2.0 (2026-04-26): Initial public release. v4 baseline (see v7 for the consolidated SOTA) + stability infra (EMA, spike detector) + security audit.