arxiv:2606.11786

Lius: Translation Model Based Instructional Linguistic Using Continual Instruction Tuning In Kupang Malay

Published on Jun 10

Submitted by Joanito Agili Lopo on Jun 11

Abstract

Continual Instruction Tuning enables effective fine-tuning of large language models for low-resource language translation, achieving superior performance compared to standard instruction tuning and multilingual models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Models (LLMs) offer new potential for translation tasks but often experience performance degradation when handling low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions that leverage explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, outperforms standard instruction-tuned models by 4-6 points and surpasses both Neural Machine Translation (NMT) models and multilingual LLMs by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to reduce the reliance on large-scale parallel data in low-resource language translation.

Community

joanitolopo

Paper submitter 1 day ago

We introduce Lius, an Indonesian → Kupang Malay translation model designed for low-resource machine translation.

Kupang Malay is a Malay-based creole spoken in East Nusa Tenggara, Indonesia, but it remains underrepresented in current NLP resources and commercial MT systems. In this work, we propose Instructional Linguistic, a linguistically informed instruction design strategy, and Continual Instruction Tuning (CIT), where the model is trained iteratively with multiple instruction types for the same translation target.

Our approach uses four instruction families: context-based, semantic mapping-based, phonetic-based, and list-group-label-based prompts. We train three Cendol-mT5 variants: small, base, and large. The best model, Lius-Large-MT, improves over standard instruction tuning and outperforms several multilingual LLM and NMT baselines on Indonesian → Kupang Malay translation.
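The staged setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `Example` type, the helper names, and the toy sentence pair and dictionary entries are all assumptions made for the example, and the actual training step is left as a callback (`train_one_stage`) that a real pipeline would replace with a fine-tuning call. The only idea taken from the description is that each source/target pair is rendered once per instruction family, and the model is trained on those renderings one family after another while the translation target stays the same.

```python
from dataclasses import dataclass

# Hypothetical prompt templates, one per instruction family named in the text.
# The wording is an assumption for illustration, not the prompts used for Lius.
TEMPLATES = {
    "context": "Translate the Indonesian sentence into Kupang Malay: {src}",
    "semantic_mapping": (
        "Dictionary mappings: {hints}\n"
        "Use them to translate into Kupang Malay: {src}"
    ),
    "phonetic": (
        "Translate into Kupang Malay, paying attention to how the words "
        "are pronounced: {src}"
    ),
    "list_group_label": (
        "List the key words, group them by meaning, label each group, "
        "then translate into Kupang Malay: {src}"
    ),
}


@dataclass
class Example:
    family: str   # which instruction family produced this prompt
    prompt: str   # instruction plus source sentence
    target: str   # reference translation, identical across families


def dictionary_hints(src, lexicon):
    """Collect word-level hints for source words found in the bilingual dictionary."""
    words = src.lower().split()
    return "; ".join(f"{w} -> {lexicon[w]}" for w in words if w in lexicon)


def build_stages(pairs, lexicon, order):
    """Render every parallel pair once per family; each family becomes one stage."""
    stages = []
    for family in order:
        stage = [
            Example(
                family,
                TEMPLATES[family].format(src=src, hints=dictionary_hints(src, lexicon)),
                tgt,
            )
            for src, tgt in pairs
        ]
        stages.append(stage)
    return stages


def continual_instruction_tuning(stages, train_one_stage, state=None):
    """Train stage by stage, carrying the model state forward between stages."""
    for stage in stages:
        state = train_one_stage(state, stage)
    return state


# Toy data for illustration only (assumed, not taken from the paper's dataset).
pairs = [("saya tidak tahu", "beta sonde tau")]
lexicon = {"saya": "beta", "tidak": "sonde"}
order = ["context", "semantic_mapping", "phonetic", "list_group_label"]

stages = build_stages(pairs, lexicon, order)


def record_families(state, stage):
    """Stand-in for a real fine-tuning step: just records which family was seen."""
    return (state or []) + [stage[0].family]


trained = continual_instruction_tuning(stages, record_families)
```

In a real run, `record_families` would be replaced by a function that continues fine-tuning a checkpoint (for example a Cendol-mT5 variant) on the prompts of one stage and returns the updated model. The order of the families is a free choice in this sketch; the source does not state which order was used.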

Models are available on Hugging Face:

Code:
https://github.com/joanitolopo/instructional-linguistic-llm

Models citing this paper 3
