Model Card for SEA-LION-ModernBERT-600M
Last update: 2026-03-16
SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and fine-tuned for the Southeast Asia (SEA) region.
Leveraging the encoder-only ModernBERT architecture combined with the Gemma 3 SentencePiece tokenizer, this model achieves efficient, culturally nuanced text processing, with improved tokenization fertility and compression rates for complex regional scripts. This enables it to handle longer context windows and cross-lingual tasks with greater computational efficiency.

The model was developed through a multi-stage training pipeline. Its foundation was established through extensive pre-training on 2 trillion (2T) tokens, followed by a mid-training phase on an additional 1 trillion (1T) tokens; both phases cover code and 13 languages: Burmese, Chinese, English, Filipino, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tamil, Thai, and Vietnamese. To enhance cross-lingual alignment, the model then underwent contrastive pre-training using 245 million text pairs (EN-EN and EN-SEA), and was finally instruction-tuned using a diverse dataset of 8 million text pairs (spanning EN-EN, CN-CN, EN-SEA, and SEA-SEA) to create the final instruction-tuned model.
Model Details
Model Description
The SEA-LION-ModernBERT-600M model is built on the ModernBERT-large architecture and has a vocabulary size of 262K.
For tokenization, the model employs our custom Gemma 3 tokenizer, which has excellent performance for SEA languages, ensuring optimal model performance.
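Tokenization fertility, mentioned above, is commonly measured as the average number of tokens produced per word; lower fertility indicates better compression. A minimal sketch of the metric, using a hypothetical character-trigram tokenizer as a stand-in (in practice the real tokenizer would be loaded via `transformers.AutoTokenizer`):

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a list of tokens.
    Lower fertility means the tokenizer compresses the text better.
    """
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Illustrative stub tokenizer: splits each word into 3-character chunks.
def char_trigram_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

print(fertility(["photosynthesis in plants"], char_trigram_tokenize))
```

Comparing fertility across tokenizers on the same SEA-language corpus is a quick way to quantify the compression advantage claimed above.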
- Developed by: AI Products Pillar, AI Singapore
- Funded by: Singapore NRF
- Shared by: AI Products Pillar, AI Singapore
- Model type: Encoder
- Context length: 8k
- Languages: Burmese, Chinese, English, Filipino, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tamil, Thai, and Vietnamese
- License: MIT
Model Sources
- Repository: We are releasing the weights for every training stage to promote transparency and research, and to support a wide range of downstream applications. The final fine-tuned variant, SEA-LION-Embedding-600M, can be accessed at: Link to HF Repo
Uses
This model card details one of the variants available within this collection.
| Model Variant | Model Repository | Suggested Applications & Use Cases |
|---|---|---|
| Fine-tuned Embedding Models | aisingapore/SEA-LION-E5-Embedding-600M<br>aisingapore/SEA-LION-ModernBERT-Embedding-300M<br>aisingapore/SEA-LION-ModernBERT-Embedding-600M | Retrieval-Augmented Generation (RAG)<br>Information retrieval and search<br>Similarity comparisons |
| Pre-trained Encoder Models | aisingapore/SEA-LION-ModernBERT-300M<br>aisingapore/SEA-LION-ModernBERT-600M | Fill-mask<br>Text classification<br>Fine-tuning for downstream tasks (e.g., sentiment analysis, classification) |
| Pre-trained Model Checkpoints | aisingapore/SEA-LION-ModernBERT-300M-checkpoints<br>aisingapore/SEA-LION-ModernBERT-600M-checkpoints | Continued Pre-Training (CPT)<br>Fine-tuning for downstream tasks (e.g., sentiment analysis, classification) |
Bias, Risks, and Limitations
The model has not been tested for robustness against adversarial use. Users should be aware that it exhibits certain limitations that warrant consideration, and should validate its outputs before relying on them, as responses may be inconsistent.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model.
```shell
pip install -U "transformers>=4.48.0"
```

```python
# Example code adapted from https://huggingface.co/docs/transformers/main/en/model_doc/modernbert#modernbert
import torch
from transformers import pipeline

fill_mask = pipeline(
    task="fill-mask",
    model="aisingapore/SEA-LION-ModernBERT-600M",
    torch_dtype=torch.float16,
    device=0,
)
fill_mask("Plants create [MASK] through a process known as photosynthesis.")
```
Training Details
Training Data
This model was tuned using a multi-stage training pipeline with the following datasets:
- Contrastive Pre-training: 245 million text pairs (EN-EN and EN-SEA) to enhance cross-lingual alignment.
- Fine-tuning: 8 million diverse text pairs (spanning EN-EN, CN-CN, EN-SEA, and SEA-SEA) to create the final fine-tuned model.
| Language Pair | Percentage |
|---|---|
| EN-EN | 20% |
| CN-CN | 20% |
| EN-SEA | 10% |
| SEA-SEA | 50% |
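Contrastive pre-training on text pairs is typically driven by an InfoNCE-style objective, in which each pair's two sides are pulled together while the other examples in the batch act as negatives. A minimal pure-Python sketch under that assumption (the exact SEA-LION training loss is not specified here):

```python
import math

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch InfoNCE: queries[i] should match docs[i]; every other
    document in the batch serves as a negative. Embeddings are lists
    of floats; in real training these would be model outputs."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        return math.sqrt(dot(a, a))

    losses = []
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity against every document.
        sims = [dot(q, d) / (norm(q) * norm(d)) / temperature for d in docs]
        # Cross-entropy with the matching document as the positive class.
        log_z = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_z - sims[i])
    return sum(losses) / len(losses)
```

When aligned pairs already score much higher than in-batch negatives, the loss approaches zero; mismatched pairs drive it up, which is what pushes EN and SEA translations toward shared embeddings.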
Training Procedure
Preprocessing
Following the foundational training, cross-lingual alignment was enhanced through contrastive pre-training on 245 million text pairs, focusing on English-to-English and English-to-Southeast Asian language mappings (EN-EN and EN-SEA). Finally, to ensure the model could effectively follow user instructions and handle complex interactions, it was instruction-tuned on a diverse dataset of 8 million text pairs spanning EN-EN, CN-CN, EN-SEA, and SEA-SEA, producing the final instruction-tuned model.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model is evaluated across three primary benchmark suites to provide a comprehensive assessment of embedding quality across Southeast Asian, Chinese, and English contexts:
- SEA-BED (Southeast Asia Embedding Benchmark): The primary testing suite, consisting of 169 datasets across 10 Southeast Asian languages (Burmese, Filipino, Indonesian, Khmer, Malay, Lao, Tamil, Tetum, Thai, and Vietnamese). Notably, 71% of these datasets are native-authored or human-curated to preserve regional linguistic properties.
- CMTEB (Chinese Massive Text Embedding Benchmark): A specialised subset of MTEB focused on Chinese language tasks, used to evaluate performance in one of the region's most prominent scripts.
- MTEB (Massive Text Embedding Benchmark): The industry-standard global benchmark used to gauge general-purpose English embedding performance across a wide array of tasks.
Factors
Evaluation factors are categorised by task type and linguistic diversity to ensure the model's "fertility" and "nuance" are captured accurately:
Linguistic Coverage: Evaluation spans across 10+ languages, including complex Brahmic scripts (Burmese, Khmer, Lao, Tamil, Thai) and Latin-based SEA scripts (Indonesian, Filipino, Malay, Tetum, Vietnamese).
Task Modality:
- Retrieval/Reranking: Efficiency in finding relevant documents within a large corpus.
- Semantic Textual Similarity (STS): Precision in sentence-level semantic alignment.
- Clustering & Classification: Ability to group or categorize text based on latent semantic meaning.
- Summarisation & Bitext Mining: High-level semantic matching and cross-lingual alignment.
Architecture Efficiency: Performance is measured in the context of the ModernBERT architecture and Gemma 3 tokenizer to assess computational efficiency versus embedding quality.
Metrics
To provide a standardized view of performance, we report the following metrics across the benchmark suites:
- Classification: F1-score.
- Multi-label Classification: F1-score.
- Pair Classification: Average Precision (AP).
- Semantic Textual Similarity (STS): Cosine similarity scores.
- Clustering: V-Measure score.
- Bitext Mining: F1-score.
- Retrieval & Reranking: NDCG@10 (primary) and MAP.
- Instruction Retrieval: NDCG@5.
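NDCG@k, the primary retrieval metric above, can be sketched as follows. This is the standard log2-discounted formulation commonly used by embedding benchmarks, not SEA-LION-specific code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances:
    each relevance is divided by log2(rank + 2), so relevant documents
    ranked lower contribute less."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal
    (descending-relevance) ranking. Returns 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; placing a relevant document behind irrelevant ones reduces the score in proportion to the discount.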
Results
Performance comparison of embedding models on SEA-BED (https://leaderboard.sea-lion.ai/embedding/SEA), captured on 13 March 2026, 2:50 PM.
Environmental Impact
Carbon emission was estimated using the fact sheet from TRG Datacenters.
- Hardware Type: Nvidia H200 140GB GPUs
- Hours used: 14 hrs
- Cloud Provider: SMC H200
- Compute Region: Singapore
- Carbon Emitted: approx. 252.13 kg CO₂e
Technical Specifications
Model Architecture and Objective
SEA-LION-ModernBERT is an encoder model using the ModernBERT architecture.
| Parameter | SEA-LION-ModernBERT-Embedding |
|---|---|
| Layers | 22 in 300M / 28 in 600M |
| d_model | 768 in 300M / 1024 in 600M |
| Attention Heads | 12 in 300M / 16 in 600M |
| Vocabulary | 262144 |
| Sequence Length | 128 / 8k |
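The table above can be expressed as a ModernBERT configuration. A sketch for the 600M variant, assuming the standard `ModernBertConfig` fields from the transformers library; values come from the table, and all other hyperparameters are left at their library defaults:

```python
from transformers import ModernBertConfig

# 600M variant per the table above; remaining fields use library defaults.
config = ModernBertConfig(
    num_hidden_layers=28,
    hidden_size=1024,
    num_attention_heads=16,        # 16 heads -> head dimension 1024 / 16 = 64
    vocab_size=262144,
    max_position_embeddings=8192,  # 8k context length
)
```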
Compute Infrastructure
Hardware
- Hardware Type: Nvidia H200 140GB GPUs
- Cloud Provider: SMC H200
Software
SEA-LION was trained using the ModernBERT code base which is powered by the Composer training framework from MosaicML.
Glossary
- SEA-BED: Southeast Asia Embedding Benchmark – a comprehensive evaluation suite for embedding models on SEA languages.
- Asymmetric Retrieval: Retrieval tasks where query and document formulations differ.
- Mean Pooling: Aggregating token embeddings by averaging (weighted by attention mask) to produce a fixed-size sentence representation.
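The mean-pooling operation defined above can be sketched in plain Python, with token embeddings as lists of vectors and the attention mask as 0/1 flags. In practice this runs over the model's last hidden state tensor, with padding positions masked out:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, counting only positions where the
    attention mask is 1, to produce one fixed-size sentence vector."""
    dim = len(token_embeddings[0])
    n = sum(attention_mask)
    pooled = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                pooled[j] += vec[j]
    return [x / n for x in pooled]
```

Dividing by the number of unmasked tokens (rather than the padded length) is what makes the representation independent of padding.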
More Information
While this model supports masked language modeling, it is primarily optimized via contrastive fine-tuning for downstream tasks such as sequence classification, token classification, or question answering. Please note that these weights have not been specifically aligned for safety; therefore, developers should implement their own safety evaluations and security measures. The authors disclaim all liability for any claims, damages, or other liabilities arising from the use of the released code or weights.
For more info, please contact us at sealion@aisingapore.org
Team
Ahmed Dabeer, Ahn Jeongmi, Antonyrex Sajeban, Chan Hok Teng Adwin, Cheng Zi Yi Nicholas, Choa Hsueh Mei Esther, Heng Jonathan, Huang Yuli, Jann Railey Estrada Montalan, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Liew Rachel, Limkonchotiwat Peerat, Muhammad Ridzuan Bin Mokhtar, Nagarajan Karthik, Ng Boon Cheong Raymond, Ngee Chia Tai, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Tat-Wee David, Ong Zhi Hao, Pereira Mark, Poon Joseph, Rengarajan Hamsawardhini, Siow Wei Kang Bryan, Susanto Yosephine, Sutaveephamochanon Anocha, Tan Choon Meng, Tan Chor Phin Evelyn, Tan Siao Wei Jessica, Tan Yixian, Tee Jun Yun, Teng Kok Wai Walter, Teo Eng Sipp Leslie, Tjhi William, Wu Donghang, Yeo Yeow Tong, Yong Xianbin, Zhang Haoyang, Zhang Zhou
Acknowledgement
This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.
Contact
sealion@aisingapore.org