This episode dives into Dippy's unique product-first journey, its strategic pivot to decentralized media inference, and the future of multimodal AI, offering critical insights for Crypto AI investors and researchers.
Dippy's Product-First Journey and BitTensor Integration
- Akshat, co-founder of Dippy Subnet 11, details Dippy's unconventional path, starting as a consumer-facing AI friend app with a substantial user base (10,000-50,000 downloads) before integrating with BitTensor. The initial ambition was to train proprietary Large Language Models (LLMs)—AI models capable of understanding and generating human-like text—but the prohibitive costs in compute, expertise, and data collection made this unfeasible. BitTensor emerged as a "match made in heaven," offering a decentralized solution that aligned perfectly with Dippy's product needs without requiring hundreds of millions in funding.
- Strategic Implication: This highlights the potential for established consumer applications to leverage decentralized AI networks like BitTensor for cost-effective scaling and specialized compute, offering a viable alternative to traditional venture capital-intensive routes.
Pivot to Decentralized Media Inference
- Dippy strategically pivoted from core LLM training to LoRA (Low-Rank Adaptation) models—a technique for fine-tuning large models at minimal computational cost—and image inference. This shift was driven by the observation that text-based LLMs were hitting diminishing returns, while users increasingly sought richer, multimodal experiences like voice calls, image generation, and video interactions. Dippy uses Subnet 4 (Targon) for text inference because of its superior speed, cost-efficiency, and support, but identified a critical gap in decentralized platforms for media inference. Centralized providers like FAL AI offered thousands of models at unmatched speed and cost, prompting Dippy to develop its own solution.
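To make the cost argument concrete, here is a minimal sketch of the LoRA idea: instead of updating a full weight matrix, you train two small low-rank factors alongside the frozen weights. All dimensions, names, and the scaling convention below are illustrative, not Dippy's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-init)

def lora_forward(x):
    # Base path plus low-rank update: W x + (alpha/r) * B A x.
    # With B initialized to zero, the adapted layer starts out
    # identical to the pretrained one.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # no drift before training

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full, lora = d_out * d_in, r * (d_in + d_out)
print(full, lora)  # 262144 vs 8192: ~32x fewer trainable parameters
```

The parameter savings are why LoRA-style fine-tuning is tractable for a product team without the compute budget of a frontier lab.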
- Actionable Insight: The market's demand for diverse AI modalities beyond text is growing. Investors should monitor decentralized networks that can efficiently support image, video, and voice inference, as these represent significant growth areas.
Character Creation and User-Generated Content
- Akshat clarifies that Dippy's miners do not create characters; instead, they power the underlying models. All of the platform's half-million characters are user-generated, empowering individuals to craft their AI companions. Users can define characters through simple prompts or elaborate backstories, generate accompanying images, and now, customize their character's voice. This user-centric approach fosters a vibrant ecosystem of unique AI personalities.
- Strategic Implication: Platforms that successfully integrate user-generated content (UGC) with AI capabilities can create powerful network effects and proprietary data moats, making them highly defensible against competitors.
Technical Deep Dive: Inference Optimization and Subnet 11's Role
- Dippy initially invested heavily in deterministic LoRA training to ensure consistent character appearances across generated images. However, rapid advancements, such as the emergence of Kontext-style models that enable one-shot, consistent character inference, quickly rendered those LoRA training efforts "outdated." Dippy now leverages Targon (Subnet 4) for language inference and is building Subnet 11 (Dippy Studio) as its dedicated platform for media inference. Voice inference will initially rely on ElevenLabs, given the lack of high-quality open-source alternatives, but will transition to SN11 as open-source models mature.
- Akshat emphasizes SN11's competitive edge, particularly against alternatives like Chutes, which he found "incredibly limiting" in model breadth and inference speed. Dippy has developed a proprietary "Dippy inference engine" that uses techniques like TensorRT—an NVIDIA library for high-performance deep learning inference—to optimize image models for speed. Low latency is paramount for natural, real-time interactions in voice and video calls. SN11 aims to provide interconnected workflows, allowing seamless transitions from text prompts to images and then to videos, rather than fragmented individual requests. Akshat states, "We see SN11 as like the one-stop shop for your generative media needs." The core optimization factors for SN11 are speed, cost, and the interconnectedness of models, with miners scored on reliability, latency, and request-handling capacity.
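The episode names three scoring factors but not a formula, so here is a hypothetical sketch of how a validator might combine reliability, latency, and capacity into a single miner score. The weights, targets, and multiplicative form are assumptions for illustration only.

```python
def score_miner(success_rate, p95_latency_ms, max_concurrent, *,
                latency_target_ms=500.0, capacity_cap=64):
    """Toy miner score in [0, 1]; all constants are hypothetical."""
    reliability = success_rate  # fraction of requests served correctly
    # Full credit at or below the latency target, decaying beyond it.
    latency = latency_target_ms / max(p95_latency_ms, latency_target_ms)
    # Capacity credit saturates at the cap.
    capacity = min(max_concurrent, capacity_cap) / capacity_cap
    # Multiplicative, so failing badly on any one dimension tanks the score.
    return reliability * latency * capacity

fast = score_miner(0.99, 400, 64)   # meets latency target, full capacity
slow = score_miner(0.99, 2000, 64)  # 4x slower than target
assert fast > slow
```

A multiplicative rule like this reflects the point about latency being paramount: a miner cannot compensate for slow responses with raw capacity.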
- Actionable Insight: The rapid obsolescence of AI techniques underscores the need for agile development and continuous adaptation. Investors should prioritize projects demonstrating strong R&D capabilities in inference optimization and a clear strategy for integrating diverse, high-performance models.
Future of Modalities: Multimodal vs. Specialized Models
- Akshat believes that, for now, specialized individual models will continue to outperform blanket multimodal models for specific use cases like character consistency or unique styling. He asserts that "Data is the real moat at this point," suggesting that models trained on highly specific and valuable datasets will maintain an advantage unless generic multimodal models achieve significant algorithmic breakthroughs. Dippy leverages its extensive conversational data for LLM fine-tuning, using signals like conversation length and user feedback to continuously improve its "reward model"—an AI component that guides model behavior towards desired outcomes. This data-driven approach will extend to other modalities, such as collecting user preferences on generated images.
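The reward-model pipeline described above can be sketched as turning engagement signals into a scalar preference label. The signal names, weights, and the log-scaled length term below are hypothetical; the source only says that conversation length and user feedback feed the reward model.

```python
import math

def reward_label(turns, thumbs_up, thumbs_down):
    """Toy scalar reward from engagement signals (illustrative weights)."""
    # Diminishing returns on conversation length, normalized so that
    # ~100 turns approaches a length signal of 1.0.
    length_signal = math.log1p(turns) / math.log1p(100)
    # Net explicit feedback in [-1, 1].
    total = max(thumbs_up + thumbs_down, 1)
    feedback_signal = (thumbs_up - thumbs_down) / total
    return 0.5 * length_signal + 0.5 * feedback_signal

engaged = reward_label(turns=80, thumbs_up=5, thumbs_down=0)
bounced = reward_label(turns=2, thumbs_up=0, thumbs_down=1)
assert engaged > bounced
```

Labels like these would then rank candidate model responses during fine-tuning; the same pattern extends naturally to image preferences, as the episode suggests.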
- Strategic Implication: While multimodal AI is a long-term vision, specialized AI models with proprietary data moats offer immediate competitive advantages. Researchers should focus on data acquisition strategies and fine-tuning techniques for niche applications.
Content Moderation and Age Gating
- Dippy maintains a strict 18+ policy for its mobile app. Recently, the website was updated to require age confirmation before any interaction, closing a loophole that had allowed younger users to view content for a limited time. The platform employs a multi-layer moderation system, prohibiting illegal or illicit content and enforcing a "tastefulness code." This proactive approach aims to prevent Dippy from being pigeonholed as an "NSFW-first platform," instead positioning it as the "next generation of YouTube" or mainstream entertainment. Moderated categories include nudity, hate speech, underage content, and anything deemed distasteful or inappropriate for a broad audience.
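A multi-layer system like the one described can be sketched as a pipeline where the strictest verdict wins: an age gate first, then content checks. The category names and the keyword layer below are placeholders; a production system would use trained classifiers rather than string matching.

```python
BLOCKED_CATEGORIES = {"underage", "hate speech", "illegal"}

def keyword_layer(text):
    """Placeholder first-pass filter; real systems use ML classifiers."""
    lowered = text.lower()
    return {cat for cat in BLOCKED_CATEGORIES if cat in lowered}

def moderate(text, user_confirmed_adult):
    # Layer 1: age gate before any interaction.
    if not user_confirmed_adult:
        return "blocked: age gate"
    # Layer 2+: content policy checks; any flag blocks the request.
    if keyword_layer(text):
        return "blocked: policy"
    return "allowed"

assert moderate("hello friend", True) == "allowed"
assert moderate("hello friend", False) == "blocked: age gate"
```

Ordering the age gate first mirrors the website change described above: confirmation is required before any content is shown at all.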
- Actionable Insight: Responsible AI development, particularly in content moderation and age gating, is crucial for mainstream adoption and regulatory compliance. Projects prioritizing ethical AI and robust safety protocols will likely gain greater trust and market penetration.
Monetization Strategy: Embedded Ads vs. Subscriptions
- Dippy plans to monetize primarily through "embedded hyperpersonalized ads" rather than a subscription-first model. Akshat illustrates this with an example: a character, having learned a user's habits, might casually recommend a nearby competitor coffee shop at an opportune moment, making the ad feel like a friend's suggestion. Future possibilities include physical mail with coupons or branded items on character outfits. While acknowledging potential impacts on retention, Akshat notes that only about 3% of consumers subscribe to services like ChatGPT Plus, making ads essential for serving the vast majority of users. He views this as an inevitable shift for mass consumer AI products, citing Google, Facebook, and Instagram as historical precedents.
- Strategic Implication: The low conversion rates for AI subscriptions suggest that ad-based models will dominate consumer AI. Investors should evaluate AI platforms' ability to integrate personalized, non-intrusive advertising that leverages deep user understanding.
Challenges and Future Vision for AI Modalities
- Adding new modalities introduces significant development complexity and requires a "strange bet" on future model improvements. Akshat recounts how Dippy's initial image generation feature was "absolutely hated" by users, negatively impacting engagement. However, by continuously iterating through six or seven different image models, the feature eventually became a "huge hit" and the app's second-largest revenue driver.
- Akshat expresses particular excitement for:
- Voice Calling: Dippy is pioneering custom voice creation via prompting, allowing users to personalize character voices—a "first" in the AI friend app space.
- World Models: Projects like DeepMind's Genie 3 and Odyssey ML are paving the way for video game-like experiences where users can create dynamic virtual worlds and interact with characters (e.g., driving a car in Brazil with a Dippy character).
- He provides a timeline for modalities reaching LLM-level quality:
- Image: "almost already there."
- Video: 1-1.5 years away, with expectations for consistent characters across scenes and real-time inference, potentially enabling AI representations of individuals to interact.
- World Models: Approximately 2 years away.
- Actionable Insight: Investing in AI requires a long-term vision and resilience against initial user feedback. The convergence of AI with gaming and virtual worlds, driven by advancements in video and world models, presents significant future market opportunities.
Advice for BitTensor Subnets
- Akshat offers two key pieces of advice for BitTensor subnets:
- Incentive Mechanism: Designing a robust incentive mechanism is paramount for achieving desired outputs from miners. He commends Rigid's work in this area.
- Problem-Driven Development: Subnets should be built to address a clear, unmet need, either within BitTensor or the broader market. Starting a subnet without a defined problem is a "bad spot to be in."
- Strategic Implication: For researchers and developers in the decentralized AI space, focusing on well-defined problems and crafting effective incentive structures are critical for building sustainable and impactful subnets.
Conclusion
Dippy's journey underscores the strategic imperative of product-market fit and agile adaptation in decentralized AI. The pivot to media inference and the focus on embedded ads highlight emerging monetization models and the critical role of data moats. Investors and researchers should track advancements in multimodal AI, particularly video and world models, and prioritize projects with robust inference optimization and clear problem-solving approaches.