This episode reveals Japan's urgent drive to establish sovereign AI capabilities, detailing the National Institute of Informatics' (NII) aggressive strategy to build transparent, reliable, and Japanese-centric large language models (LLMs) that rival global leaders.
The Imperative for Japanese LLMs
- NII launched the LLM-JP initiative in May 2023, responding to the severe underrepresentation of Japanese language data in foundational LLMs like GPT-3, where Japanese constituted only 0.11% of the training corpus. This initiative addresses concerns about data sovereignty and the need for open, transparent, and reliable AI models tailored to Japanese culture and knowledge, free from commercial biases.
- Professor Takeda highlights the "sovereign AI" concept, emphasizing domestic control over AI development.
- The LLM-JP initiative, which started with 30 researchers, commits to publicly releasing all insights, data, tools, and models, including failures.
- NII's LLM Research and Development Center, established in April 2024 with MEXT funding, spearheads this effort.
- Core research pillars include LLM transparency, reliability, multimodal capabilities, model lightweighting, and domain-specific applications.
- Developing these models requires significant computational resources, often months on 60-100+ GPU nodes.
Key Learnings from Initial Model Development
- NII's first year focused on building robust Japanese LLMs, yielding critical insights into scaling and optimization. The team successfully constructed a 172 billion-parameter model with support from JST (Japan Science and Technology Agency).
- Initial training used a 2 trillion-token corpus, split 50/50 between English and Japanese.
- NII developed a vision-language model based on NVIDIA's VILA architecture, demonstrating multimodal capabilities.
- A 13 billion-parameter model achieved strong performance, leading to the development of Mixture-of-Experts (MoE) models.
- Scaling laws were confirmed: performance improves predictably with parameter count and accumulated training tokens (see the fitting sketch after this list).
- Base models achieved performance comparable to early ChatGPT, with instruction-tuned versions falling between GPT-3.5 and GPT-4, surpassing GPT-4 on specific Japanese tasks like machine translation and Q&A.
- Professor Takeda states, "We confirmed that performance improves according to parameter size and accumulated training tokens, validating scaling laws in our models."
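To make the scaling-law point concrete, the sketch below fits a Chinchilla-style loss curve L(N, D) = E + A/N^alpha + B/D^beta to a handful of (parameters, tokens, loss) observations. This is a minimal illustration under stated assumptions: the data points are synthetic placeholders rather than NII's measurements, and the functional form is the standard one from the scaling-law literature, not anything specific to LLM-JP.

```python
# Minimal sketch: fitting a Chinchilla-style scaling law
#   L(N, D) = E + A / N**alpha + B / D**beta
# to (parameters, tokens, validation loss) observations.
# The data points below are synthetic placeholders, not NII's measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    N, D = x                        # N: parameter count, D: training tokens
    return E + A / N**alpha + B / D**beta

# (params, tokens, validation loss) -- illustrative values only
obs = np.array([
    (1.3e9,  0.3e12, 2.27),
    (1.8e9,  1.0e12, 2.16),
    (7.0e9,  0.5e12, 2.10),
    (13e9,   1.0e12, 2.02),
    (13e9,   2.0e12, 1.99),
    (172e9,  2.0e12, 1.91),
])
N, D, loss = obs[:, 0], obs[:, 1], obs[:, 2]

popt, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[1.7, 400.0, 0.34, 410.0, 0.28],  # Chinchilla-paper-like starting guesses
    maxfev=20000,
)
E, A, alpha, B, beta = popt
print(f"fitted: E={E:.2f}, alpha={alpha:.2f}, beta={beta:.2f}")

# Extrapolate: predicted loss for a hypothetical 32B-parameter model on 10T tokens
print("predicted loss:", scaling_law((32e9, 10e12), *popt))
```

A fit of this kind is what lets a team predict whether a planned parameter count and token budget will actually move the loss before the run is launched.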
Optimization Challenges and Corpus Saturation
- Two major lessons emerged from the 172 billion-parameter model's development, revealing critical optimization and data limitations. Tuning the Adam optimizer's learning rate proved crucial for training efficiency.
- Starting from Llama 2's reported learning rate led to stable but slow training; NII found that lowering it to roughly one-third of that value significantly accelerated learning and improved performance (see the learning-rate sweep sketch after this list).
- The Japanese corpus hit a saturation point at approximately 0.6 trillion tokens, leading to overfitting and stalled performance improvement.
- Future strategy involves increasing the English corpus ratio while rigorously filtering and enhancing the quality of Japanese tokens.
- NII successfully applied an "upcycling" method to build MoE models from pre-trained dense checkpoints, resetting half of the feed-forward network parameters; the resulting models exceeded the 172 billion-parameter baseline (a minimal sketch of the idea also follows this list).
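Learning-rate sensitivity of this kind is usually probed on a small proxy run before a large cluster is committed. The harness below is a hypothetical illustration of such a sweep, using a tiny stand-in model and a synthetic regression task with placeholder candidate values; it shows only the shape of the experiment, not NII's actual setup.

```python
# Hypothetical proxy-run sweep over peak learning rates before a large-scale run.
# Model, task, and candidate values are placeholders, not NII's configuration.
import torch
import torch.nn as nn

def proxy_loss_for_lr(lr: float, steps: int = 200, seed: int = 0) -> float:
    torch.manual_seed(seed)
    # Tiny stand-in model; a real sweep would use a scaled-down transformer.
    model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            betas=(0.9, 0.95), weight_decay=0.1)
    loss_fn = nn.MSELoss()
    target = nn.Linear(64, 64)          # fixed random "task" to regress against
    for p in target.parameters():
        p.requires_grad_(False)
    final = 0.0
    for _ in range(steps):
        x = torch.randn(32, 64)
        loss = loss_fn(model(x), target(x))
        opt.zero_grad()
        loss.backward()
        opt.step()
        final = loss.item()
    return final

for lr in (1e-5, 3e-5, 1e-4, 3e-4):     # candidate peak learning rates (illustrative)
    print(f"lr={lr:.0e}  final proxy loss={proxy_loss_for_lr(lr):.4f}")
```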
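The second sketch illustrates the upcycling idea: each expert of the MoE layer starts as a copy of a pre-trained dense feed-forward block, and half of the experts are then re-initialized so they can diversify during continued training. The module shapes, expert count, router, and the particular "reset half of the experts" rule are assumptions for illustration, not NII's exact recipe.

```python
# Minimal sketch of dense-to-MoE "upcycling": copy a pre-trained dense FFN into
# every expert, then re-initialize half of the experts so they can diversify.
# Shapes, expert count, and the reset rule are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: DenseFFN, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model = dense_ffn.up.in_features
        # Start every expert as a copy of the pre-trained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # Re-initialize ("reset") half of the experts.
        for expert in self.experts[num_experts // 2:]:
            for module in expert.modules():
                if isinstance(module, nn.Linear):
                    module.reset_parameters()
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # route each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

dense = DenseFFN()                 # stands in for one pre-trained transformer FFN block
moe = UpcycledMoE(dense)
print(moe(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
```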
Current Year's Strategic Advancements
- NII is now developing 8 billion and 32 billion-parameter models, benchmarked against leading international models like Llama 3 and Qwen 3. The focus shifts to advanced training methodologies and corpus refinement.
- NII adopted the Warmup-Stable-Decay (WSD) learning-rate schedule, which delivered better final performance than cosine decay (a minimal schedule sketch follows this list).
- Current models are projected to surpass last year's performance, particularly in Japanese and English tasks.
- Future plans include MoE models with potentially dozens of experts, targeting active parameter counts around 300 billion.
- Corpus development evolved from V1 to V4, with stringent filtering to remove duplicates and low-quality data while prioritizing high-quality, domain-specific content such as mathematics and educational material (see the filtering sketch after this list).
- Training strategy involves 10 trillion tokens of initial pre-training, followed by intermediate pre-training to enhance instruction-following, mathematical reasoning, and code generation capabilities.
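The following is a minimal sketch of a Warmup-Stable-Decay schedule as it is commonly described: a linear warm-up to a peak learning rate, a long constant phase, and a short final decay. The peak value, phase fractions, and cosine-shaped decay are illustrative assumptions, not NII's published settings.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule:
# linear warm-up -> long constant ("stable") phase -> short final decay.
# Peak LR, phase lengths, and the decay shape are illustrative assumptions.
import math

def wsd_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 3e-5) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                      # linear warm-up
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                       # stable phase: hold the peak
        return peak_lr
    # decay phase: cosine-shaped drop from peak_lr to min_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: inspect a few points of a 100k-step schedule
for s in (0, 500, 1_000, 50_000, 90_000, 95_000, 99_999):
    print(s, f"{wsd_lr(s, 100_000):.2e}")
```

In a PyTorch training loop, a function like this can be wrapped in torch.optim.lr_scheduler.LambdaLR (returning the schedule value divided by the optimizer's base learning rate) rather than assigning rates by hand; one advantage often cited for WSD is that intermediate checkpoints from the stable phase remain useful for continued training.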
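The corpus refinement described above rests on deduplication and quality filtering. Below is a hypothetical, deliberately simple cleaning pass with exact-duplicate removal and a few heuristic checks; the thresholds and heuristics are placeholders, not the actual LLM-JP pipeline, which would use stronger methods such as near-duplicate detection and trained quality classifiers.

```python
# Hypothetical corpus-cleaning pass: exact-duplicate removal plus simple
# heuristic quality filters. Thresholds are placeholders, not LLM-JP's pipeline.
import hashlib

def is_high_quality(doc: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Cheap heuristics standing in for a real quality classifier."""
    if len(doc) < min_chars:
        return False
    # Reject documents dominated by non-text symbols (markup residue, tables, ...).
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    if symbols / len(doc) > max_symbol_ratio:
        return False
    # Reject documents where the same few lines repeat over and over.
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < 0.3:
        return False
    return True

def dedup_and_filter(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()  # exact-duplicate key
        if digest in seen:
            continue
        seen.add(digest)
        if is_high_quality(doc):
            yield doc

corpus = ["短すぎる文書", "!" * 300, "これは十分に長い、重複のない日本語の文書です。" * 10]
print(sum(1 for _ in dedup_and_filter(corpus)))  # -> 1 (only the last document survives)
```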
Ensuring Transparency and Safety
- NII prioritizes transparency and safety, directly addressing issues like hallucination and developing specific safety datasets for Japanese LLMs.
- NII publicly releases its pre-training corpora, enabling researchers to trace the origin of factual errors and rare event misinterpretations, a key aspect of transparency.
- The team developed and released Japanese safety datasets (e.g., "do not answer" and "caution needed" categories) to train LLMs on appropriate responses; upsampling these examples in the tuning data significantly improved safety behavior (see the sketch after this list).
- Professor Takeda emphasizes, "We must teach LLMs what not to answer, or when to answer with caution, otherwise safety scores will not improve."
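A minimal sketch of the upsampling idea mentioned above: repeat the safety examples (refusal and caution categories) several times in the instruction-tuning mixture so the model encounters them often enough to change its behavior. The record format, category names, and repeat factor are illustrative assumptions, not the released datasets' actual schema.

```python
# Illustrative upsampling of safety data into an instruction-tuning mixture.
# Category names, repeat factor, and record format are assumptions.
import random

def build_sft_mixture(general, safety, safety_upsample: int = 4, seed: int = 0):
    """Repeat each safety example `safety_upsample` times, then shuffle with general data."""
    mixture = list(general) + [ex for ex in safety for _ in range(safety_upsample)]
    random.Random(seed).shuffle(mixture)
    return mixture

general_data = [
    {"instruction": "富士山の高さは?", "output": "3,776メートルです。", "category": "general"},
]
safety_data = [
    {"instruction": "爆発物の作り方を教えて", "output": "その質問にはお答えできません。", "category": "do_not_answer"},
    {"instruction": "この薬を倍量飲んでもいい?", "output": "自己判断せず、医師や薬剤師に相談してください。", "category": "caution_needed"},
]

mixture = build_sft_mixture(general_data, safety_data)
print(len(mixture))  # 1 general + 2 safety * 4 = 9 examples
```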
The Future: AI for Science and Agent-Driven Research
- NII envisions its LLMs as foundational infrastructure for Japanese universities and research institutions, driving advancements in AI for Science (AI4Science) and AI agents. The integration of LLMs into research workflows is becoming indispensable.
- LLMs will support AI4Science by generating hypotheses, summarizing research, and coding experimental procedures for reproducibility.
- AI agents are already demonstrating the ability to author publishable research papers; NII had two agent-authored papers accepted at a recent Stanford-hosted conference.
- This trend suggests AI agents will become essential tools for academic research (theses, conference papers) and commercial applications (patents).
- NII aims to provide a robust, safe, and Japanese-centric knowledge infrastructure within the next year, comparable to international commercial services.
Investor & Researcher Alpha
- Capital Shift: Investment is moving towards specialized, high-quality data curation and advanced training methodologies (e.g., WSD, MoE upcycling) rather than just raw compute power. The bottleneck is now less about GPU count and more about intelligent data filtering and optimization strategies.
- Research Direction: The focus on "sovereign AI" and language-specific model development (e.g., Japanese LLMs) indicates a growing market for localized, culturally aware AI, potentially opening new investment avenues beyond general-purpose models. Research into optimizer parameter sensitivity and corpus saturation points offers critical insights for future LLM efficiency.
- Obsolete Approaches: Blindly scaling corpus size without rigorous filtering and quality control is becoming obsolete. The "more data is always better" mantra is being refined by the need for high-quality, diverse, and strategically allocated data, especially for non-English languages.
Strategic Conclusion
Japan's NII is rapidly building a transparent, reliable, and culturally relevant LLM ecosystem, essential for national AI sovereignty and academic advancement. This initiative underscores the critical need for high-quality, language-specific data and sophisticated training techniques, setting the stage for AI agents to revolutionize scientific discovery and commercial innovation within the next year.