Vietnamese Language Models and AI Sovereignty: PhoGPT, VinaLLaMA, KiLM, and the Race to Master the Mother Tongue
An overview of large language models (LLMs) for Vietnamese — PhoGPT, VinaLLaMA, VNG's KiLM, models from Viettel, GreenMind — and the role of domestic models, Vietnamese data, and digital sovereignty in the AI era.
Vietnamese Language Models and AI Sovereignty
Vietnamese is spoken by nearly 100 million people, yet in AI it is still treated as a “low-resource language” compared with English or Chinese. Developing large language models (LLMs) that deeply understand Vietnamese is not just a technical challenge; it is also a matter of digital sovereignty, meaning a nation’s autonomy over its data, culture, and knowledge infrastructure. This article reviews prominent Vietnamese models and explains why domestic models matter.
1. Why Does Vietnamese Need Its Own Models?
International models like GPT, Gemini, and Llama all support Vietnamese, but Vietnamese typically constitutes only a small fraction of their training data. The consequences include:
- Limited cultural context understanding: Vietnamese idioms, history, laws, and customs are easily misunderstood or hallucinated.
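- Complex tones and orthography: Vietnamese marks its six tones with diacritics, on top of the diacritics that distinguish vowels; a model trained on too little Vietnamese can drop or misplace a mark, which changes the word (for example, “ma” means ghost while “mã” means horse or code).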
- Reliance on foreign infrastructure: Using third-party APIs raises questions about data privacy, cost, and long-term stability.
These factors are the impetus for Vietnam to develop “made in Vietnam” LLMs.
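The point about tone marks can be made concrete with a minimal sketch using only Python's standard `unicodedata` module. The helper name `strip_marks` and the word list are illustrative assumptions, not part of any model's pipeline; the sketch only shows that six distinct words become indistinguishable once their marks are lost.

```python
import unicodedata


def strip_marks(word: str) -> str:
    """Remove combining marks (tones and vowel diacritics) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


# Six different Vietnamese words that differ only in tone.
words = {
    "ma": "ghost",
    "má": "mother / cheek",
    "mà": "but",
    "mả": "grave",
    "mã": "horse / code",
    "mạ": "rice seedling",
}

for word, gloss in words.items():
    print(f"{word} ({gloss}) -> {strip_marks(word)}")

# Every word collapses to the same unmarked string.
collapsed = {strip_marks(word) for word in words}
print(collapsed)
```

A model that drops or swaps a single mark therefore does not produce a typo; it produces a different word.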
2. Prominent Vietnamese Language Models
PhoGPT (VinAI Research)
PhoGPT-4B is a monolingual Vietnamese language model, pre-trained from scratch on a Vietnamese corpus of approximately 102 billion tokens, with a context length of 8192. PhoGPT was released by VinAI as an open-source research project, marking one of the first systematic efforts to build a Vietnamese LLM from the ground up. (Note: VinAI’s generative AI division was acquired by Qualcomm in April 2025.)
VinaLLaMA (Independent Research Group)
VinaLLaMA is an open-weight model built upon LLaMA-2 and further trained on an additional 800 billion tokens. The VinaLLaMA-7B-chat version, trained on 1 million high-quality synthetic samples, achieved leading results on benchmarks such as VLSP, VMLU, and the Vietnamese version of the Vicuna Benchmark. VinaLLaMA’s strengths lie in its fluency in Vietnamese and its understanding of Vietnamese culture.
KiLM (VNG / Zalo)
VNG developed KiLM from scratch, placing Vietnam among the Southeast Asian nations with their own LLMs. The 7B-parameter KiLM model was launched in late 2023 at the Zalo AI Summit. By late 2024, the 13B-parameter version was reported to surpass several international models (GPT-4, Gemma2-9B, Phi-3-small) on the VMLU Vietnamese benchmark, trailing only Meta’s Llama-3-70B. KiLM serves as the foundation for Zalo’s Kiki voice assistant.
Models from Viettel and GreenNode
Viettel AI developed VT-Super-120B-A12B (~120 billion parameters), which is reported to lead its segment in accuracy, and the Llama 3 ViettelSolution 8B model, whose training data was cleaned with NVIDIA NeMo Curator. GreenNode’s GreenMind-Medium-14B-R1 became the first open-source Vietnamese reasoning LLM integrated with NVIDIA NIM. It can run on a single NVIDIA H100 GPU, which makes it suitable for enterprise assistants, chatbots, and Vietnamese document retrieval.
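A rough estimate of weight memory shows why a 14B-parameter model can fit on one GPU. The sketch below assumes 16-bit weights (2 bytes per parameter) and the 80 GB H100 variant; neither figure comes from the vendors' announcements, and the estimate ignores activations and the KV cache, so real deployments need extra headroom.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


H100_MEMORY_GB = 80  # assumed: the 80 GB H100 variant
PARAMS_14B = 14e9    # GreenMind-Medium-14B-R1, taking "14B" at face value

needed = weight_memory_gb(PARAMS_14B)
print(f"Weights at 16-bit precision: {needed:.0f} GB")
print(f"Headroom on one GPU: {H100_MEMORY_GB - needed:.0f} GB")
```

Under these assumptions the weights take roughly a third of the card, leaving the rest for context and batching.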
ViGPT (VinBigData)
VinBigData’s ViGPT-1.6B-v1 model is among the notable Vietnamese models, aimed at virtual assistant applications and language processing within the Vingroup ecosystem.
3. The Role of International Models
Global LLMs remain important for Vietnamese users: GPT (OpenAI) and Gemini (Google) offer relatively good Vietnamese support thanks to their massive data scale and are popular tools for daily tasks. Meta’s open-weight Llama model family has become a base that many Vietnamese teams fine-tune rather than train from scratch, which saves significant cost. Vietnam’s practical strategy is therefore a hybrid approach: use international open models as a base, then fine-tune them with local data and knowledge.
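One way to see where the savings from fine-tuning come from is to count trainable parameters. The article does not say which method Vietnamese teams use; the sketch below assumes low-rank adaptation (LoRA), a common parameter-efficient technique, and uses illustrative dimensions that roughly match a 7B-parameter Llama-style model (32 layers, hidden size 4096, adapters on two square projection matrices per layer, rank 8). All of these numbers are assumptions for the sake of the arithmetic.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, illustrative configuration (not taken from any specific Vietnamese model).
HIDDEN = 4096
LAYERS = 32
MATRICES_PER_LAYER = 2
RANK = 8

adapted = LAYERS * MATRICES_PER_LAYER
lora_params = adapted * lora_trainable_params(HIDDEN, HIDDEN, RANK)
full_params = adapted * HIDDEN * HIDDEN

print(f"LoRA trainable parameters: {lora_params:,}")
print(f"Same matrices, full update: {full_params:,}")
print(f"Ratio: 1/{full_params // lora_params}")
```

Fewer trainable parameters mean less optimizer state and gradient memory, which is a large part of why adapting an existing base is cheaper than pre-training.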
4. Vietnamese Data — The “Oil” of Domestic AI
The quality of LLMs directly depends on the quality of their data. This is both a bottleneck and a strategic advantage:
- Scarcity of large-scale clean data: High-quality digitized Vietnamese texts (books, newspapers, legal documents, conversations) are still scarce compared to English.
- Data cleaning tools: Viettel’s use of NVIDIA NeMo Curator to curate Vietnamese data indicates that data processing is being standardized.
- Population-scale data: In 2026, NVIDIA announced the development of a population-scale dataset with FPT — a significant step for national data infrastructure.
Whoever controls high-quality Vietnamese data will have a decisive advantage in model development.
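As a minimal sketch of what “cleaning” involves at the simplest level, the snippet below normalizes Unicode, collapses whitespace, drops very short texts, and removes exact duplicates. It uses only the Python standard library and is not NVIDIA NeMo Curator's API; the function names and the `min_chars` threshold are hypothetical. Production pipelines add language identification, fuzzy deduplication, and quality filtering.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Compose diacritics (NFC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_corpus(docs, min_chars=20):
    """Return normalized documents, dropping short ones and exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if len(text) < min_chars:
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


sentence = "Tiếng Việt có gần 100 triệu người nói."
raw_docs = [
    unicodedata.normalize("NFC", sentence),         # precomposed characters
    unicodedata.normalize("NFD", sentence) + "  ",  # decomposed, trailing spaces
    "Quá ngắn",                                     # too short to keep
]
cleaned = clean_corpus(raw_docs)
print(len(cleaned))
```

Normalizing before hashing matters for Vietnamese in particular: the same sentence can be encoded with precomposed or decomposed diacritics, and without normalization the two byte sequences would be counted as different documents.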
5. AI Sovereignty and Digital Sovereignty
“Sovereign AI” is a central concept in Vietnam’s strategic direction: achieving autonomy over models, data, and computing infrastructure rather than being entirely dependent on foreign entities. In 2026, Vietnam emerged as a focal point in NVIDIA’s sovereign AI strategy, with FPT and Viettel participating. Viettel AI is confirmed to be developing a national legal AI application on open model infrastructure — a prime example of an application requiring absolute data sovereignty.
AI sovereignty carries multi-layered significance: protecting citizen data, preserving Vietnamese cultural and historical values within machine knowledge, and ensuring security for sensitive applications (national defense, law, healthcare). This is why domestic models are not merely a technical choice but a national strategic imperative.
Conclusion
From PhoGPT and VinaLLaMA to KiLM and the models from Viettel and GreenNode, Vietnam has demonstrated its capability to build competitive Vietnamese LLMs independently. The path forward involves consolidating high-quality Vietnamese data, investing in computing infrastructure, and developing high-level research talent. Mastering the mother tongue in the AI world is equivalent to mastering a part of the nation’s digital sovereignty in the 21st century.
References
- VinaLLaMA: LLaMA-based Vietnamese Foundation Model (arXiv)
- PhoGPT: Generative Pre-training for Vietnamese (arXiv)
- Zalo AI / VNG LLM report on KiLM
- GreenMind: the first Vietnamese reasoning LLM on NVIDIA NIM (GreenNode)
- Processing high-quality Vietnamese data with NVIDIA NeMo Curator
- Vietnam launches the first Vietnamese LLM on NVIDIA NIM (VietnamNet)