---
base_model:
- NiuTrans/LMT-60-1.7B-Base
datasets:
- NiuTrans/LMT-60-sft-data
language:
- en
- zh
- ar
- es
- de
- fr
- it
- ja
- nl
- pl
- pt
- ru
- tr
- bg
- bn
- cs
- da
- el
- fa
- fi
- hi
- hu
- id
- ko
- nb
- ro
- sk
- sv
- th
- uk
- vi
- am
- az
- bo
- he
- hr
- hy
- is
- jv
- ka
- kk
- km
- ky
- lo
- mn
- mr
- ms
- my
- ne
- ps
- si
- sw
- ta
- te
- tg
- tl
- ug
- ur
- uz
- yue
license: apache-2.0
metrics:
- bleu
- comet
pipeline_tag: translation
library_name: transformers
---
## LMT
- Paper: [Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs](https://arxiv.org/abs/2511.07003)
- Github: [LMT](https://github.com/NiuTrans/LMT)
**LMT-60** is a suite of **Chinese-English-centric** MMT models trained on **90B tokens** of mixed monolingual and bilingual data, covering **60 languages across 234 translation directions** and achieving **SOTA performance** among models with similar language coverage.
We release both the CPT and SFT versions of LMT-60 in four sizes (0.6B/1.7B/4B/8B). All checkpoints are available:
| Models | Model Link |
|:------------|:------------|
| LMT-60-0.6B-Base | [NiuTrans/LMT-60-0.6B-Base](https://huggingface.co/NiuTrans/LMT-60-0.6B-Base) |
| LMT-60-0.6B | [NiuTrans/LMT-60-0.6B](https://huggingface.co/NiuTrans/LMT-60-0.6B) |
| LMT-60-1.7B-Base | [NiuTrans/LMT-60-1.7B-Base](https://huggingface.co/NiuTrans/LMT-60-1.7B-Base) |
| LMT-60-1.7B | [NiuTrans/LMT-60-1.7B](https://huggingface.co/NiuTrans/LMT-60-1.7B) |
| LMT-60-4B-Base | [NiuTrans/LMT-60-4B-Base](https://huggingface.co/NiuTrans/LMT-60-4B-Base) |
| LMT-60-4B | [NiuTrans/LMT-60-4B](https://huggingface.co/NiuTrans/LMT-60-4B) |
| LMT-60-8B-Base | [NiuTrans/LMT-60-8B-Base](https://huggingface.co/NiuTrans/LMT-60-8B-Base) |
| LMT-60-8B | [NiuTrans/LMT-60-8B](https://huggingface.co/NiuTrans/LMT-60-8B) |
Our supervised fine-tuning (SFT) data is released at [NiuTrans/LMT-60-sft-data](https://huggingface.co/datasets/NiuTrans/LMT-60-sft-data).
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NiuTrans/LMT-60-8B"

# Left padding is required for batched generation with decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = """Translate the following text from English into Chinese.
English: The concept came from China where plum blossoms were the flower of choice.
Chinese:"""
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Deterministic beam-search decoding.
generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)

# Strip the prompt tokens and decode only the newly generated continuation.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
outputs = tokenizer.decode(output_ids, skip_special_tokens=True)
print("response:", outputs)
```
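The prompt in the Quickstart follows a fixed template: an instruction line naming the source and target languages, the source text prefixed by its language name, and a trailing target-language label. A minimal helper to build that prompt for any language pair is sketched below; `build_translation_prompt` is a hypothetical convenience function, not part of the released model or library.

```python
def build_translation_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Build the instruction-style translation prompt used in the Quickstart.

    src_lang / tgt_lang are full language names as listed in the
    "Supported Languages" table (e.g. "English", "Chinese").
    """
    return (
        f"Translate the following text from {src_lang} into {tgt_lang}.\n"
        f"{src_lang}: {text}\n"
        f"{tgt_lang}:"
    )

# Reproduces the prompt from the Quickstart example above.
prompt = build_translation_prompt(
    "English",
    "Chinese",
    "The concept came from China where plum blossoms were the flower of choice.",
)
print(prompt)
```

The resulting string is passed as the user message to `tokenizer.apply_chat_template`, exactly as in the Quickstart.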
## Supported Languages
| Resource Tier | Languages |
| :---- | :---- |
| High-resource Languages (13) | Arabic(ar), English(en), Spanish(es), German(de), French(fr), Italian(it), Japanese(ja), Dutch(nl), Polish(pl), Portuguese(pt), Russian(ru), Turkish(tr), Chinese(zh) |
| Medium-resource Languages (18) | Bulgarian(bg), Bengali(bn), Czech(cs), Danish(da), Modern Greek(el), Persian(fa), Finnish(fi), Hindi(hi), Hungarian(hu), Indonesian(id), Korean(ko), Norwegian(nb), Romanian(ro), Slovak(sk), Swedish(sv), Thai(th), Ukrainian(uk), Vietnamese(vi) |
| Low-resource Languages (29) | Amharic(am), Azerbaijani(az), Tibetan(bo), Modern Hebrew(he), Croatian(hr), Armenian(hy), Icelandic(is), Javanese(jv), Georgian(ka), Kazakh(kk), Central Khmer(km), Kirghiz(ky), Lao(lo), Chinese Mongolian(mn_cn), Marathi(mr), Malay(ms), Burmese(my), Nepali(ne), Pashto(ps), Sinhala(si), Swahili(sw), Tamil(ta), Telugu(te), Tajik(tg), Tagalog(tl), Uighur(ug), Urdu(ur), Uzbek(uz), Yue Chinese(yue) |
## Citation
If you find our work useful for your research, please cite our paper:
```bibtex
@misc{luoyf2025lmt,
title={Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs},
author={Yingfeng Luo and Ziqiang Xu and Yuxuan Ouyang and Murun Yang and Dingyang Lin and Kaiyan Chang and Tong Zheng and Bei Li and Peinan Feng and Quan Du and Tong Xiao and Jingbo Zhu},
year={2025},
eprint={2511.07003},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.07003},
}
``` |