Data and Model Sovereignty: The Foundation of Sovereign AI
Why data quality determines AI model quality, the role of Vietnamese language data, data governance under new laws, the open vs. closed model debate, and Vietnam's national AI sovereignty.
If computational infrastructure is the “muscles” of AI, then data is its “blood.” A language model can only be as capable as the data it learns from. As AI becomes a strategic technology, two questions arise: who owns the data, and who controls the model trained on that data. Together they define the challenge of data and model sovereignty, which is not just a technical issue but also an economic, legal, and national security one.
Data Quality Determines Model Performance
A classic principle says “garbage in, garbage out,” and it applies with even more force in the era of LLMs:
- Volume: large models require hundreds of billions to trillions of tokens to learn grammar, knowledge, and reasoning.
- Quality: clean, deduplicated, low-noise data matters more than vast amounts of raw data. DeepSeek-V3, whose developers reported approximately $6 million in compute costs for the final training run (a figure that excludes earlier research and experiments), is often cited as evidence that good data and architecture can partly compensate for limited resources.
- Diversity and Representativeness: if data is biased, the model will reproduce and amplify that bias.
Data quality doesn’t come naturally. It requires rigorous processes for collection, cleaning, labeling, and evaluation — often accounting for the majority of effort in an AI project.
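The cleaning and deduplication steps mentioned above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function names `normalize` and `clean_corpus` and the `min_chars` threshold are assumptions made for the example, and real pipelines typically add near-duplicate detection, language identification, and quality filtering on top of exact-match deduplication.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize a document so trivially different copies compare equal."""
    text = unicodedata.normalize("NFC", text)  # unify Unicode representations
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip().lower()


def clean_corpus(docs, min_chars=20):
    """Drop very short documents and exact duplicates (after normalization).

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # too short to be a useful training document
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


sample = [
    "Hà Nội là thủ đô của Việt Nam.",
    "hà nội  là thủ đô của   việt nam.",  # same text, different case and spacing
    "ok",                                  # too short
    "Đà Nẵng là một thành phố ven biển.",
]
print(clean_corpus(sample))
```

Hashing the normalized text rather than the raw text is the design choice that lets the second sample be recognized as a copy of the first.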
Vietnamese Language Data: A Strategic Asset
The majority of global AI training data is in English and Chinese. From the perspective of international models, Vietnamese is a “low-resource” language, so those models often mishandle context, diacritics, and local culture.
The Vietnamese research community has built several foundational Vietnamese language models:
- PhoGPT (VinAI Research, 2023): an open-source Vietnamese generative language model; the PhoGPT-4B version was trained from scratch on a corpus of approximately 102 billion tokens (482 GB after cleaning and deduplication).
- ViGPT (VinBigdata): introduced as a “Vietnamese ChatGPT version” for end-users.
- VinaLLaMA, URA-LLaMA, Vietcuna: foundational models based on LLaMA or BLOOMZ, specifically designed to handle Vietnamese syntax and semantics.
Possessing a high-quality Vietnamese language data corpus is a prerequisite for building AI that serves the Vietnamese people — from virtual assistants and public services to healthcare and education.
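One concrete reason diacritics cause trouble when building such a corpus is that the same Vietnamese word can be stored as different Unicode code point sequences. The sketch below, using only Python's standard `unicodedata` module, shows two encodings of “Việt” that look identical on screen but compare as different strings until they are normalized. The helper name `to_nfc` is an assumption made for the example.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize("NFC", text)


precomposed = "Vi\u1ec7t"        # "ệ" stored as one code point (U+1EC7)
decomposed = "Vie\u0323\u0302t"  # "e" + combining dot below + combining circumflex

print(precomposed == decomposed)                  # False: different code points
print(to_nfc(precomposed) == to_nfc(decomposed))  # True: equal after normalization
print(len(precomposed), len(decomposed))          # 4 6
```

Normalizing every document to one form before tokenization and deduplication keeps visually identical text from being treated as distinct.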
Data Governance: Vietnam’s New Legal Framework
Data only creates value when it is properly governed. Vietnam has enacted a substantial legal framework for data covering the 2024–2026 period:
- Data Law (Law No. 60/2024/QH15): passed on November 30, 2024, effective July 1, 2025. Extends governance to all digital data and introduces the concepts of “important data” and “core data,” whose cross-border transfer is restricted on national defense and security grounds.
- Law on Personal Data Protection (Law No. 91/2025/QH15): passed on June 26, 2025, effective January 1, 2026. Applies to both domestic and foreign organizations processing personal data within Vietnamese territory; enhances data subject rights, requires impact assessments, and protects sensitive data.
- Data localization requirements: certain types of data — account names, service usage history, payment information, IP addresses — must be stored domestically.
This framework shapes how businesses build and operate AI: Vietnamese user data must be processed responsibly and, for many types, must reside on domestic infrastructure.
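In an application, the localization requirement described above often turns into a routing decision at the field level. The following is a rough sketch under stated assumptions: the field names in `LOCALIZED_FIELDS` are hypothetical labels for the data types listed above, `split_record` is an illustrative helper, and the actual legal classification of any field has to come from legal review rather than from code.

```python
# Hypothetical field names standing in for the data types listed above:
# account names, service usage history, payment information, IP addresses.
LOCALIZED_FIELDS = {
    "account_name",
    "service_usage_history",
    "payment_info",
    "ip_address",
}


def split_record(record: dict):
    """Split a record into fields to keep on domestic storage and the rest.

    The second dict is NOT automatically safe to export; it only holds
    fields that this simple rule did not flag and that still need review.
    """
    must_stay_domestic = {k: v for k, v in record.items() if k in LOCALIZED_FIELDS}
    needs_review = {k: v for k, v in record.items() if k not in LOCALIZED_FIELDS}
    return must_stay_domestic, needs_review


record = {
    "account_name": "nguyenvana",
    "ip_address": "203.0.113.7",
    "ui_language": "vi",
}
domestic, review = split_record(record)
print(sorted(domestic))  # ['account_name', 'ip_address']
print(sorted(review))    # ['ui_language']
```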
Open vs. Closed Models
The central debate in the AI industry for 2025–2026 is between open-weight models and closed/proprietary models:
- Closed models (GPT, Claude, Gemini): accessible only via API; users cannot download the weights. Advantages: top quality, ease of use. Disadvantages: vendor dependence, data outside user control, unpredictable long-term costs.
- Open models (Llama, DeepSeek, Qwen, GLM): weights are published and can be downloaded for self-operation. Advantages: control, avoidance of vendor lock-in, better data compliance, customization on proprietary data.
A key development in 2025–2026 is that the quality gap between open and closed models has narrowed to approximately 6–12 months and continues to shrink. DeepSeek-R1 (January 2025) rivals the GPT-4 class on many benchmarks yet is released as open weights. Qwen 3.5 (February 2026) has become the strongest open model on various reasoning tests. For most enterprise tasks (programming, classification, summarization, structured data extraction), the best open models are now on par with leading closed models.
Notably, many Chinese labs release open models not primarily for commercial sale, but to enhance national AI standing and counter chip export restrictions, which is a strategic calculation rather than a purely commercial one.
National AI Sovereignty
Sovereign AI is a nation’s ability to independently build, control, and operate AI using its own infrastructure, data, and human resources. For Vietnam, AI sovereignty is demonstrated through:
- Resolution 57-NQ/TW (December 22, 2024): emphasizes Vietnam’s need for AI technological autonomy to avoid digital dependence, setting a goal to be among the top 3 in Southeast Asia for AI R&D by 2030.
- Law on Artificial Intelligence: passed in December 2025, effective March 1, 2026 — establishing the first binding legal framework for AI in Vietnam.
- Domestic infrastructure: CMC’s C-OpenAI system runs on CMC Cloud to keep Vietnamese data under domestic control, and VNPT AI aims to reduce dependence on foreign AI technology.
Open models play a crucial role in AI sovereignty: instead of relying on foreign APIs, a nation can use open models as a foundation, fine-tune them on local data, and deploy them on domestic infrastructure — saving costs while maintaining control.
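Whether self-hosting an open model actually saves money depends on usage volume. A simple way to reason about it is a break-even calculation between a per-token API price and a fixed monthly cost for domestic infrastructure. The sketch below is only an illustration: both function names are made up for the example, and the prices in the usage lines are hypothetical placeholders, not quotes from any vendor.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Cost of serving a monthly token volume through a pay-per-token API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(fixed_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the fixed hosting cost."""
    return fixed_monthly_cost / price_per_million_tokens * 1_000_000


# Hypothetical placeholder numbers, chosen only to make the arithmetic visible.
FIXED_MONTHLY_COST = 4000.0  # self-hosted servers, power, and operations
API_PRICE = 2.0              # price per one million tokens

threshold = break_even_tokens(FIXED_MONTHLY_COST, API_PRICE)
print(threshold)                               # 2000000000.0
print(monthly_api_cost(threshold, API_PRICE))  # 4000.0
```

Below the threshold the API is cheaper under these assumptions; above it, the fixed cost of self-hosting is spread over enough tokens to win. A fuller comparison would also account for staffing, utilization, and hardware depreciation.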
Conclusion
Data and models are two sides of the same strategic asset. High-quality Vietnamese language data, governed by the new laws and combined with open models and domestic infrastructure, paves the way for Vietnam to build sovereign AI. Model sovereignty is not a slogan: it is the synthesis of clean data, a robust legal framework, and the ability to operate models independently without vendor lock-in.