Weights & Biases
December 15, 2025

LLM Research and Development Initiatives at the National Institute of Informatics

Japan's National Institute of Informatics (NII) is tackling a critical challenge: building open-source, high-performance, and culturally relevant Japanese LLMs. This initiative aims to establish a sovereign AI infrastructure, pushing past the English-centric bias of current models and fostering transparency in AI development for both academic and commercial use.

The Sovereign AI Imperative

  • "Originally, in GPT-3's training corpus, Japanese ranked around 15th, with over 90% being English. Japanese text accounted for only about 0.11%. This raised concerns that Japanese text, culture, and knowledge might not be reflected in language models. Recently, this is often expressed as 'sovereign AI' or 'data sovereignty'."
  • The Language Gap: Early LLMs like GPT-3 had minimal Japanese data, raising concerns about cultural and linguistic bias. Imagine learning about a complex topic from a book written almost entirely in a foreign language, with only a few translated sentences.
  • National Strategy: NII's LLM-JP initiative, backed by the Ministry of Education, is building open-source, Japanese-centric LLMs to ensure data sovereignty and prevent commercial biases from dictating AI development.
  • Transparency First: The project prioritizes open access to models, data, and tools, including insights from failures, to cultivate a transparent and trustworthy AI ecosystem.

Technical Breakthroughs & Training Innovations

  • "One major lesson learned from last year's model construction was that a single parameter value can significantly change learning. For example, adjusting the Adam optimizer's learning rate from 1E-5 to 1E-8 dramatically improved performance."
  • Scaling Laws Confirmed: Increasing parameters and training tokens directly improves performance on Japanese benchmarks. Base models approach GPT-3.5-level performance, and instruction-tuned versions sometimes exceed GPT-4 on specific Japanese tasks.
  • Optimizer Precision: A critical discovery involved the Adam optimizer's ε (epsilon) parameter. A small adjustment (from 1e-5 to 1e-8) dramatically accelerated stable learning for large models, much like finding the right gear ratio for a race car: a small tweak yields a big speed boost. A minimal configuration sketch follows this list.
  • Mixture of Experts (MoE): NII successfully used an "upcycling" technique to build MoE models, combining eight experts derived from a pre-trained dense model to surpass the performance of the single 172-billion-parameter baseline. This is a key strategy for efficient scaling.
  • Intermediate Training: Current models emphasize "intermediate training" after initial pre-training to enhance reasoning capabilities, moving beyond simple word prediction to instruction-following and complex problem-solving.
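
A minimal PyTorch sketch of the optimizer setting described above (not NII's actual training stack; the layer size, learning rate, and betas are placeholder assumptions). Because Adam scales each update by roughly lr * m / (sqrt(v) + eps), the larger ε of 1e-5 damps steps for parameters with small gradient variance, while 1e-8 restores them:

    # Minimal sketch, not NII's training code: where Adam's epsilon is configured.
    # The layer, learning rate, and betas below are placeholder values.
    import torch

    block = torch.nn.Linear(4096, 4096)  # stand-in for one transformer sub-layer

    # Llama 2-style setting: the larger epsilon damps adaptive updates.
    opt_eps_1e5 = torch.optim.AdamW(block.parameters(),
                                    lr=3e-4, betas=(0.9, 0.95), eps=1e-5)

    # Common default: the smaller epsilon allows larger effective steps and,
    # per the episode, noticeably faster stable learning at the 172B scale.
    opt_eps_1e8 = torch.optim.AdamW(block.parameters(),
                                    lr=3e-4, betas=(0.9, 0.95), eps=1e-8)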

LLMs as Research & Education Infrastructure

  • "Stanford University recently hosted a conference for papers written using agents, with over 50% of submissions created by agents. It's becoming a daily reality to produce international conference-level papers using agent combinations."
  • AI for Science: NII envisions LLMs as fundamental tools for scientific research, capable of generating hypotheses, summarizing literature, and even coding experimental procedures.
  • Agentic Academia: The rise of AI agents is already impacting academic publishing. NII had two agent-authored papers accepted at a recent Stanford conference, signaling a shift in research methodology.

Key Takeaways:

  • Sovereign AI is Real: Nations are investing in domestic AI capabilities to counter linguistic bias and ensure data control. This creates opportunities for specialized models and infrastructure.
  • Builder's Edge: Meticulous parameter tuning, high-quality data curation, and innovative architectures like MoE are crucial for achieving top-tier LLM performance.
  • The Agentic Future: AI agents are rapidly becoming indispensable tools in research and education, demanding robust, reliable, and culturally relevant LLM backbones.

Podcast Link: https://www.youtube.com/watch?v=md5l0BQ_VXA

This episode reveals Japan's urgent drive to establish sovereign AI capabilities, detailing the National Institute of Informatics' (NII) aggressive strategy to build transparent, reliable, and Japanese-centric large language models (LLMs) that rival global leaders.

The Imperative for Japanese LLMs

  • NII launched the LLM-JP initiative in May 2023, responding to the severe underrepresentation of Japanese language data in foundational LLMs like GPT-3, where Japanese constituted only 0.11% of the training corpus. This initiative addresses concerns about data sovereignty and the need for open, transparent, and reliable AI models tailored to Japanese culture and knowledge, free from commercial biases.
  • Professor Takeda highlights the "sovereign AI" concept, emphasizing domestic control over AI development.
  • The LLM-JP initiative, which began with 30 researchers, is committed to publicly releasing all insights, data, tools, and models, including failures.
  • NII's LLM Research and Development Center, established in April 2023 with MEXT funding, spearheads this effort.
  • Core research pillars include LLM transparency, reliability, multimodal capabilities, model lightweighting, and domain-specific applications.
  • Developing these models requires significant computational resources, often months on 60-100+ GPU nodes.

Key Learnings from Initial Model Development

  • NII's first year focused on building robust Japanese LLMs, yielding critical insights into scaling and optimization. The team successfully constructed a 172 billion-parameter model with support from JST (Japan Science and Technology Agency).
  • Initial training used a 2 trillion-token corpus, split 50/50 between English and Japanese.
  • NII developed a Vision-Language Model (VILA) based on NVIDIA's architecture, demonstrating multimodal capabilities.
  • A 13 billion-parameter model achieved strong performance, leading to the development of Mixture-of-Experts (MoE) models.
  • Scaling laws were confirmed: performance improves predictably with parameter count and accumulated training tokens (an illustrative power-law fit is sketched after this list).
  • Base models achieved performance comparable to early ChatGPT, with instruction-tuned versions falling between GPT-3.5 and GPT-4, surpassing GPT-4 on specific Japanese tasks like machine translation and Q&A.
  • Professor Takeda states, "We confirmed that performance improves according to parameter size and accumulated training tokens, validating scaling laws in our models."
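
As a rough illustration of how such a scaling check can be run (the model sizes and losses below are invented placeholders, not NII's measurements), one can fit a power law to evaluation loss versus parameter count and inspect the exponent:

    # Illustrative only: fit loss(N) ~ a * N**(-alpha) + c to hypothetical
    # (model size, eval loss) pairs. Numbers are placeholders, not NII's data.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n_billions, a, alpha, c):
        return a * n_billions ** (-alpha) + c

    sizes = np.array([1.3, 3.7, 13.0, 172.0])    # parameters, in billions (hypothetical)
    losses = np.array([2.95, 2.70, 2.45, 2.10])  # eval losses (hypothetical)

    (a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=[1.0, 0.3, 1.5])
    print(f"fitted scaling exponent alpha = {alpha:.3f}")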

Optimization Challenges and Corpus Saturation

  • Two major lessons emerged from the 172 billion-parameter model's development, revealing critical optimization and data limitations. Tuning the Adam optimizer's ε (epsilon) parameter proved crucial for efficient training.
  • Llama 2's reported value (ε = 1e-5) led to stable but slow training; switching to the more common 1e-8 significantly accelerated learning and improved performance.
  • The Japanese corpus hit a saturation point at approximately 0.6 trillion tokens, leading to overfitting and stalled performance improvement.
  • Future strategy involves increasing the English corpus ratio while rigorously filtering and enhancing the quality of Japanese tokens.
  • NII successfully implemented an "upcycling" method for MoE models, resetting half of the feedforward network parameters of a pre-trained dense model and achieving performance exceeding the 172 billion-parameter baseline (a minimal sketch of the idea follows this list).
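
A loose Python/PyTorch sketch of the upcycling idea: copy the dense model's feedforward block into each expert, then re-initialize a fraction of its parameters so the experts can diversify during further training. This is illustrative only; all names and shapes are assumptions, and the actual method may re-initialize structured parts of the FFN rather than random individual entries.

    # Minimal sketch of upcycling a dense FFN into MoE experts, re-initializing
    # roughly half of each expert's parameters. Not NII's code; shapes are examples.
    import copy
    import torch
    import torch.nn as nn

    def upcycle_ffn(dense_ffn: nn.Module, num_experts: int, reset_ratio: float = 0.5):
        experts = []
        for _ in range(num_experts):
            expert = copy.deepcopy(dense_ffn)                    # start from dense weights
            for p in expert.parameters():
                mask = torch.rand_like(p) < reset_ratio          # pick ~half the entries
                p.data[mask] = torch.randn_like(p)[mask] * 0.02  # re-initialize them
            experts.append(expert)
        return nn.ModuleList(experts)

    dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    experts = upcycle_ffn(dense_ffn, num_experts=8)  # eight experts, as in the episode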

Current Year's Strategic Advancements

  • NII is now developing 8 billion and 32 billion-parameter models, benchmarked against leading international models like Llama 3 and Qwen 3. The focus shifts to advanced training methodologies and corpus refinement.
  • NII adopted the Warmup-Stable-Decay (WSD) learning-rate schedule, which delivered better final performance than cosine decay (a minimal sketch follows this list).
  • Current models are projected to surpass last year's performance, particularly in Japanese and English tasks.
  • Future plans include MoE models with potentially dozens of experts, targeting active parameter counts around 300 billion.
  • Corpus development evolved from V1 to V4, emphasizing stringent filtering to remove duplicates and low-quality data, while prioritizing high-quality, domain-specific (math, educational) content.
  • Training strategy involves 10 trillion tokens of initial pre-training, followed by intermediate pre-training to enhance instruction-following, mathematical reasoning, and code generation capabilities.
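
The WSD schedule referenced above warms the learning rate up, holds it constant for most of training, and decays it only near the end, unlike cosine decay's continuous descent. A minimal sketch follows; the 10% warm-up and 10% decay fractions, peak rate, and floor are illustrative assumptions, not NII's settings.

    # Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
    # Fractions, peak, and floor are assumptions chosen for illustration.
    def wsd_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
               warmup_frac: float = 0.10, decay_frac: float = 0.10,
               min_lr: float = 3e-5) -> float:
        warmup_steps = int(total_steps * warmup_frac)
        decay_start = int(total_steps * (1.0 - decay_frac))
        if step < warmup_steps:                   # warm-up: ramp linearly to the peak
            return peak_lr * step / max(1, warmup_steps)
        if step < decay_start:                    # stable: hold the peak rate
            return peak_lr
        progress = (step - decay_start) / max(1, total_steps - decay_start)
        return peak_lr + (min_lr - peak_lr) * progress  # decay: anneal to the floor

    schedule = [wsd_lr(s, total_steps=100_000) for s in range(0, 100_000, 10_000)]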

Ensuring Transparency and Safety

  • NII prioritizes transparency and safety, directly addressing issues like hallucination and developing specific safety datasets for Japanese LLMs.
  • NII publicly releases its pre-training corpora, enabling researchers to trace the origin of factual errors and rare event misinterpretations, a key aspect of transparency.
  • The team developed and released Japanese safety datasets (e.g., "do not answer," "caution needed") to train LLMs on appropriate responses, and upsampling this data during training significantly improved safety behavior (a minimal sketch follows this list).
  • Professor Takeda emphasizes, "We must teach LLMs what not to answer, or when to answer with caution, otherwise safety scores will not improve."
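
A minimal sketch of the upsampling idea for safety data during instruction tuning; the dataset sizes and the repetition factor are illustrative assumptions, not NII's recipe.

    # Minimal sketch: repeat safety examples ("do not answer" / "answer with caution")
    # so they are seen more often during fine-tuning. Sizes and factor are assumed.
    import random

    general_sft = [{"prompt": "...", "response": "..."}] * 100_000  # placeholder records
    safety_sft = [{"prompt": "...", "response": "..."}] * 2_000     # placeholder safety records

    UPSAMPLE_FACTOR = 5  # assumed repetition factor for safety data
    mixture = general_sft + safety_sft * UPSAMPLE_FACTOR
    random.shuffle(mixture)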

The Future: AI for Science and Agent-Driven Research

  • NII envisions its LLMs as foundational infrastructure for Japanese universities and research institutions, driving advancements in AI for Science (AI4Science) and AI agents. The integration of LLMs into research workflows is becoming indispensable.
  • LLMs will support AI4Science by generating hypotheses, summarizing research, and coding experimental procedures for reproducibility.
  • AI agents are already demonstrating the ability to author publishable research papers; NII had two agent-authored papers accepted at a recent Stanford-hosted conference.
  • This trend suggests AI agents will become essential tools for academic research (theses, conference papers) and commercial applications (patents).
  • NII aims to provide a robust, safe, and Japanese-centric knowledge infrastructure within the next year, comparable to international commercial services.

Investor & Researcher Alpha

  • Capital Shift: Investment is moving towards specialized, high-quality data curation and advanced training methodologies (e.g., WSD, MoE upcycling) rather than just raw compute power. The bottleneck is now less about GPU count and more about intelligent data filtering and optimization strategies.
  • Research Direction: The focus on "sovereign AI" and language-specific model development (e.g., Japanese LLMs) indicates a growing market for localized, culturally aware AI, potentially opening new investment avenues beyond general-purpose models. Research into optimizer parameter sensitivity and corpus saturation points offers critical insights for future LLM efficiency.
  • Obsolete Approaches: Blindly scaling corpus size without rigorous filtering and quality control is becoming obsolete. The "more data is always better" mantra is being refined by the need for high-quality, diverse, and strategically allocated data, especially for non-English languages.

Strategic Conclusion

Japan's NII is rapidly building a transparent, reliable, and culturally relevant LLM ecosystem, essential for national AI sovereignty and academic advancement. This initiative underscores the critical need for high-quality, language-specific data and sophisticated training techniques, setting the stage for AI agents to revolutionize scientific discovery and commercial innovation within the next year.
