HinoMoto-100M-v4

An experiment in stepping onto the same playing field as world-class research labs with nothing but a personal GPU, ideas, and implementation speed.


Model Description

HinoMoto-100M-v4 is a fully self-built Japanese LM (architecture, tokenizer, weights, corpus, and benchmark all developed in-house), the project's consolidated SOTA model.

  • Developed by: Project HinoMoto (solo developer)
  • Architecture: Llama-style decoder-only (12 layers, 8 heads, d_model 512, RoPE base 10000)
  • Parameters: 43.4M trainable (excl. embed) / 100M total
  • Vocabulary: 9,506 (byte-level BPE, custom-trained on balanced corpus)
  • Context length: 512 (training) / extensible via RoPE (see the sketch after this list)
  • License: Apache-2.0
  • Date: 2026-04-26
  • Repository: https://github.com/FIshota/hinomoto-model
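
For reference, the standard RoPE formulation behind the base-10000 setting above; a minimal sketch of the usual math, not HinoMoto's exact code:

import torch

def rope_freqs(d_head: int, base: float = 10000.0) -> torch.Tensor:
    # Per-pair rotation frequencies: theta_i = base^(-2i / d_head)
    return base ** (-torch.arange(0, d_head, 2).float() / d_head)

def rope_angles(seq_len: int, d_head: int, base: float = 10000.0) -> torch.Tensor:
    # angle[p, i] = p * theta_i; context extension typically rescales
    # the positions or the base rather than retraining from scratch
    pos = torch.arange(seq_len).float()
    return torch.outer(pos, rope_freqs(d_head, base))

d_head = 512 // 8          # d_model / heads = 64
angles = rope_angles(512, d_head)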

Performance (HinoMotoBench-ja v0.5)

| Axis | Score | Base 7B GGUF |
|------|-------|--------------|
| family (n=110) | 6.36/12 (53%) | 0% |
| keigo (n=70) | 16% | - |
| silence (n=50) | 16% | - |
| degenerate (family, lower is better) | 0% | 96% |

A 100M model trained in 26 minutes on an RTX 3090 outperforms the base 7B GGUF on every axis.

Statistical significance (paired t-test)

  • v7 vs v4: family p=0.61 (n.s.), comparable
  • v7 vs v6: silence p<0.05, a significant improvement (test sketched below)
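
As a rough illustration of the test (with dummy scores standing in for the unpublished per-question results):

import numpy as np
from scipy import stats

# Dummy per-item scores; the real per-question results are not published here.
rng = np.random.default_rng(0)
scores_v7 = rng.normal(6.4, 1.0, size=110)
scores_v4 = rng.normal(6.3, 1.0, size=110)

# Paired test: the same 110 benchmark items scored under both versions.
t, p = stats.ttest_rel(scores_v7, scores_v4)
print(f"t={t:.2f}, p={p:.3f}")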

Training

| Item | Value |
|------|-------|
| Hardware | 1× RTX 3090 (24 GB VRAM) |
| Wallclock | 13 min |
| Throughput | 12,200 tok/s (fp32, seq 512) |
| Optimizer | AdamW (β=0.9, 0.95), weight decay 0.1, grad clip 1.0 |
| LR schedule | warmup 400 steps, cosine decay (min ratio 0.1), peak 3e-4 |
| Batch | 2 × grad_accum 4 = 8 sequences × 512 tokens = 4,096 tokens/step |
| Steps | 10,000 |
| Total tokens | ~41M |
| Stability | EMA decay 0.999, loss-spike detector (ratio 4.0, floor 1.5) |
| Seed | 0 |
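
A minimal sketch of the LR schedule and loss-spike detector above. The hyperparameters come from the table; the function shapes, and exactly how ratio and floor combine, are assumptions rather than the actual HinoMoto code.

import math

PEAK_LR = 3e-4
WARMUP_STEPS = 400
TOTAL_STEPS = 10_000
MIN_RATIO = 0.1  # decay floor: LR never drops below 0.1 * peak

def lr_at(step: int) -> float:
    # Linear warmup to the peak over the first 400 steps
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from peak down to MIN_RATIO * peak
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_RATIO + (1.0 - MIN_RATIO) * cosine)

def is_loss_spike(loss: float, ema_loss: float,
                  ratio: float = 4.0, floor: float = 1.5) -> bool:
    # Flag a step whose loss jumps well above the EMA (decay 0.999);
    # the floor suppresses false positives while the EMA is still high.
    return loss > max(floor, ratio * ema_loss)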

Training Data

| Source | License | Share | Size |
|--------|---------|-------|------|
| 国会会議録 (Diet records, 1995-2024) | Public domain | 53% | 39 MB |
| 青空文庫 (Aozora Bunko, 610 works) | Copyright expired | 25% | 19 MB |
| Dolly-15k-ja + in-house family conversations | CC-BY-SA | 21% | 16 MB |
| Total (balanced corpus v6) | | | 75 MB |

Tokenizer: tokenizer_v3_32k_clean.json (vocab 9,506, byte-level BPE).

Decontamination report: 8-gram overlap with HinoMotoBench-ja questions averages 12-16% (some overlap from common formal phrases is unavoidable).
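
A minimal sketch of such an overlap check, assuming character-level 8-grams (the report does not state whether the n-grams are character- or token-level):

def char_ngrams(text: str, n: int = 8) -> set[str]:
    # All character-level n-grams of the text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(question: str, corpus_ngrams: set[str]) -> float:
    # Fraction of the question's 8-grams that also appear in the corpus
    grams = char_ngrams(question)
    return len(grams & corpus_ngrams) / len(grams) if grams else 0.0

# Usage: build corpus_ngrams once over the training text, then average
# overlap_ratio across all benchmark questions.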

Usage

Quick inference

import torch, json
from huggingface_hub import hf_hub_download

# Download weights + config + tokenizer
repo = "FiShota/hinomoto-100m-v4"
ckpt_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo, filename="config.json")
tok_path = hf_hub_download(repo_id=repo, filename="tokenizer.json")

# Need the HinoMoto codebase for model class
# git clone https://github.com/FIshota/hinomoto-model && cd hinomoto-model && pip install -e .
from hinomoto.tokenizer import ByteBPETokenizer
from hinomoto.model.hinomoto_model import HinoMotoConfig, HinoMotoModel
from hinomoto.infer.generate import generate_ids

# Load
with open(config_path) as f:
    cfg = HinoMotoConfig(**json.load(f))
model = HinoMotoModel(cfg).to("cuda")
# weights_only=False can execute pickled code; only load checkpoints you trust
model.load_state_dict(torch.load(ckpt_path, map_location="cuda", weights_only=False))
model.eval()
tok = ByteBPETokenizer.load(tok_path)

# Generate
prompt = "今日もいい天気"
ids = tok.encode(prompt, add_bos=True)
inp = torch.tensor([ids], dtype=torch.long, device="cuda")
with torch.no_grad():
    out_ids = generate_ids(model, inp, max_new_tokens=80,
                           temperature=0.7, top_p=0.9)
print(tok.decode(out_ids[0].tolist()))

Recommended generation params

  • temperature: 0.7 (family conversation) / 0.5 (keigo) / 0.3 (factual)
  • top_p: 0.9
  • max_new_tokens: 80 (recommended), 256 max
  • eos_boost: 2.0 (biases EOS at sentence-final positions; see the sketch after this list)
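
The eos_boost knob does not appear in the generate_ids call shown above; as a minimal sketch of what such a bias can look like (an assumption, not the actual HinoMoto implementation), one can add log(boost) to the EOS logit whenever the decoded text sits at a sentence boundary:

import math
import torch

def apply_eos_boost(logits: torch.Tensor, eos_id: int,
                    text_so_far: str, boost: float = 2.0) -> torch.Tensor:
    # Multiply the EOS probability by `boost` (add log(boost) to its logit)
    # whenever the generated text currently ends a sentence.
    if text_so_far.endswith(("。", "?", "!", "\n")):
        logits = logits.clone()
        logits[..., eos_id] += math.log(boost)
    return logits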

Stop-token cleaning (recommended post-processing)

import re
SENTENCE_END = re.compile(r"[。?!\n]")
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")

def clean(text, max_sentences=2):
    # 1. Loop detection (repeated n-gram): keep one copy, drop the repeats
    m = LOOP_RE.search(text)
    if m:
        text = text[:m.start() + len(m.group(1))]
    # 2. Truncate at sentence boundaries (keep at most max_sentences)
    parts = []
    last = 0
    for m in SENTENCE_END.finditer(text):
        parts.append(text[last:m.end()])
        last = m.end()
        if len(parts) >= max_sentences:
            break
    return "".join(parts).rstrip() if parts else text

→ family score +3 pt, degenerate 5% → 0% (paired t-test, p=0.021); a usage example follows.
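
For example, applied to the output of the quick-inference snippet above:

text = tok.decode(out_ids[0].tolist())
print(clean(text, max_sentences=2))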

Limitations

  • Scale: 100M parameters is a research smoke-test model, NOT production-grade
  • Domain bias: Diet record corpus introduces formal/political vocabulary
  • No alignment: No DPO/RLHF; raw pretraining outputs
  • Not safety-tuned: Outputs may include unfiltered language
  • Single seed (0): run-to-run variance is not reported, so statistical significance is limited

Bias and Risks

  • Diet records: predominantly male, formal, politically diverse but Japan-centric
  • Aozora Bunko skews toward Meiji-Showa era texts (gaps in modern usage)
  • The SFT family data is only ~500 samples (intimate-context outputs are less reliable)

Security

Model audit (2026-04-26):

  • 🟢 Adversarial inputs: 7/7 survived
  • 🟢 PII leakage: 0/6 prompts
  • 🟢 Toxicity: 0/6 prompts
  • 🟢 Memorization: 0% (excluding SFT system prompt template, see audit doc)
  • 🟢 Tokenizer fuzz: 5/5 lossless
  • 🟢 Resource exhaustion: 0.4s for 200 tokens

Full report: SECURITY_AUDIT_MODEL_v3.md

Citation

@misc{hinomoto2026v4,
  title  = {HinoMoto-100M-v4: A Solo-Built Japanese Family-Conversation LM},
  author = {{Project HinoMoto}},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/FiShota/hinomoto-100m-v4},
}

Acknowledgements

  • 国立国会図書館 (Diet records API, public domain)
  • 青空文庫 volunteers
  • kunishou/databricks-dolly-15k-ja
  • HuggingFace ecosystem
  • PyTorch SDPA flash attention

Changelog

  • v0.2.0 (2026-04-26): Initial public release. v4 baseline (see v7 for the consolidated SOTA) + stability infra (EMA, spike detector) + security audit.